The layered guardrails trap
If all your layers fail for the same reason, you only have one defense
I’ve been chewing on an idea for the past few weeks, and the more I look at it, the worse it looks for a fair share of the AI security architectures being deployed right now across enterprises, both in Spain and elsewhere. The idea is simple to state and rather uncomfortable to accept:
If the seven layers of your defense-in-depth all fail for the same reason, you don’t have seven defenses. You have one, dressed up as seven.
This isn’t rhetoric. It’s a structural property of how we’re building these systems, and it connects directly to two recent impossibility results that are worth having on the table before signing off on the next agent deployment project.
The pattern we see in nearly every deployment
You walk into a mid-sized or large company that’s integrating generative AI or agentic systems into production and you find, with minor variations, the same security architecture wrapped around the model:
An input classifier that flags “suspicious” prompts. A constitutional rewriting layer that “softens” risky inputs. A sanitizer that strips known injection patterns. An output filter that catches whatever the previous three missed. Sometimes, on top of all that, a second “judge” model that scores whether the first model’s response is acceptable.
On paper this looks like textbook defense-in-depth. Five independent layers, each with its own logic, each adding work for the attacker. The diagram lands well in a steering committee deck. I was in one of those presentations a couple of months back where the head of AI was proudly walking the room through the diagram: five boxes in cascade, neat little arrows, the whole nine yards. When I asked what model was actually behind each box, it turned out all five were GPT-4o with different system prompts. We had to break it to him gently.
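To make that concrete, here’s a minimal sketch, in Python with hypothetical layer names and prompts, of what that five-box cascade reduces to once you notice that every box resolves to a call to the same underlying model:

```python
# Minimal sketch of the five-box guardrail cascade described above. Every
# "layer" is just a different system prompt around the same chat-completion
# call; names and prompts are hypothetical, not taken from any real deployment.

def call_model(system_prompt: str, user_input: str) -> str:
    """Placeholder for a single chat-completion call to the shared base model."""
    raise NotImplementedError("wire up your model client here")

LAYERS = {
    "input_classifier": "Flag this prompt as SAFE or SUSPICIOUS.",
    "constitutional_rewriter": "Rewrite this prompt to remove risky intent.",
    "sanitizer": "Strip anything that looks like a known injection pattern.",
    "output_filter": "Decide whether this response is acceptable to return.",
    "judge": "Score whether the assistant's answer complies with policy.",
}

def run_pipeline(user_input: str) -> list[tuple[str, str]]:
    # Five boxes in cascade on the architecture diagram, but every box hits
    # the same model, so they share one failure surface.
    return [(name, call_model(prompt, user_input)) for name, prompt in LAYERS.items()]
```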
The problem starts when you ask why each layer fails.
The concept almost nobody is looking at: uncorrelated defense layers
In classical security, the kind we’ve been doing for twenty years in networks, endpoints and applications, defense-in-depth works because each layer fails for different reasons. A firewall fails because of a misconfigured rule. An IDS fails because the signature wasn’t updated. An EDR fails because the attacker went living-off-the-land. A SIEM fails because the correlation rule didn’t trigger. The attacker has to break four mechanisms whose failure surfaces are essentially independent.
That independence is exactly what gives the stack its value. If the probability that each layer fails is p, and the failures are independent, the joint failure probability is p to the n. Four layers, p=0.1 each, joint failure of 0.0001. That’s usable security.
Now let’s transplant that same pattern to AI. The five layers I described above (input classifier, rewriter, sanitizer, output filter, judge) share an uncomfortable property: virtually all of them depend on a language model doing semantic inference over the same input. Sometimes it’s literally the same model. Sometimes it’s five different models trained on heavily overlapping data distributions, with similar architectures, vulnerable to the same families of adversarial attacks. It’s like asking five siblings who grew up in the same house to vote independently: technically that’s five votes, but they all carry the same bias from home.
When an attacker finds a prompt that evades the input classifier, that same prompt has a vastly higher than baseline probability of evading the rewriter, the output filter and the judge too. Not because the attacker broke five mechanisms: because they broke one class of mechanism five times. The p’s are no longer independent. The formula collapses.
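A back-of-the-envelope simulation makes the collapse easy to see. This is just the arithmetic from the paragraphs above in Python; the failure probability and layer count are illustrative, not measurements from any real system:

```python
# Contrast four layers that each fail 10% of the time when failures are
# independent versus when they are driven by one shared blind spot.
import random

def joint_failure(n_layers: int, p_fail: float, shared: bool, trials: int = 1_000_000) -> float:
    breaches = 0
    for _ in range(trials):
        if shared:
            # Fully correlated case: a single latent draw ("does this prompt hit
            # the common blind spot?") decides all layers at once.
            breaches += random.random() < p_fail
        else:
            # Independent case: each layer gets its own draw.
            breaches += all(random.random() < p_fail for _ in range(n_layers))
    return breaches / trials

print(f"independent layers: ~{joint_failure(4, 0.1, shared=False):.4f}")  # ~0.0001
print(f"fully correlated:   ~{joint_failure(4, 0.1, shared=True):.4f}")   # ~0.1000
```

Real deployments sit somewhere between the two extremes, but every bit of correlation you don’t measure is security you don’t actually have.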
This is what theorists call correlation between defense layers. It’s the blind spot of nearly every architecture we’re auditing these days at Rodela, and the first thing we put on the table when a client walks us through their “defensive stack” around an agent.
Why the recent impossibility results make the picture worse
Here come two theoretical results, both published recently, that I think any AI security lead ought to know about.
The first is what a recent paper, The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?, formalizes as the defense trilemma. The proof (mechanically verified in Lean 4) is geometric: any wrapper that envelops a model and tries to simultaneously satisfy continuity, utility preservation and completeness over a connected prompt space is mathematically impossible. You can have two of the three properties, not all three. The trilemma also scales: it applies turn by turn in dialogues, it still holds for stochastic defenses in expectation, and the effective Lipschitz constant degrades exponentially with the depth of the agentic pipeline.
The second is the set of NP-hardness results on reward hacking detection. Semantic self-verification, that is, having a system verify that its own behavior aligns with a non-trivial set of natural-language guidelines, is NP-complete by reduction from 3-SAT. Aligning to sufficiently rich sets of human values incurs exponential overhead. And reward hacking shows up as a structural equilibrium in any optimized agent, getting combinatorially worse as you add tools.
What these two results say, translated into CISO-speak: your wrapper can’t be complete, and your behavioral-drift detector can’t be complete in polynomial time either. So far, bad news. But here’s the key bit.
When you put both impossibilities together, you discover that the layers of your defense-in-depth aren’t merely correlated: they’re correlated precisely over the same critical cases. The edge of the trilemma is where wrappers fail. The rare, high-impact slice is where reward hacking monitoring fails. In real systems, those two sets overlap massively, because both are consequences of running models whose alignment surface is the same across all layers.
Your input classifier and your output filter fail on the same prompts. Your rewriter and your judge have the same blind spots. It’s not bad luck. It’s topology.
What this looks like in a real audit
In the assessments we’ve run at Rodela over the last few months for insurance and fintech clients, the pattern repeats with uncomfortable clarity. You ask the security team to enumerate the defensive layers around the AI deployment. They list five, six, sometimes eight. You draw the diagram. It looks solid.
Then you start labeling each layer by what breaks it. Not by what it’s called, not by which vendor sells it: by the family of attacks or failures it’s vulnerable to. And that’s when the problem shows up. Five of the eight layers are vulnerable to the same class of adversarial prompt injection. Four depend on the same base model. Three are trained on datasets with 80% or more overlap.
I remember an assessment a few months back where the team had built a guardrail they called “the second lock.” A separate model, different prompt, supposedly independent. We tried a relatively well-known jailbreak payload (not even novel, it had been sitting in a public repo for months) and the main model rejected it without issue. We piped it in the long way, via a document uploaded into the RAG, and that went through. The “second lock” didn’t even flinch: it cheerfully signed off on the output. When we walked the room through it, someone muttered that what they had wasn’t a second lock, it was the same lock painted a different color. I wrote that one down.
The client thinks they have eight defenses. In terms of independent failure surface, they have about one and a half.
The right question when auditing an agentic architecture isn’t “how many layers do you have?”; it’s “how many independent failure modes do your layers actually cover?” It’s a different question, considerably more uncomfortable, and almost nobody is answering it.
What to do about it
I’m not writing this to set off alarms. I’m writing it because there are concrete lines of action, and ignoring the problem only shifts it from the design phase to the post-mortem.
First: characterize correlation, don’t count layers. When you design or audit a defensive architecture for AI, the metric isn’t the number of controls. It’s the correlation matrix between the failure modes of those controls. Heterogeneous layers (a deterministic control, a statistical control, a human control, a cryptographic control) have low correlation by construction. Layers that are five different LLMs doing the same kind of semantic judgment have high correlation. This can be estimated, documented and demanded, and in fact it’s one of the deliverables we’re standardizing in our assessment methodology at Rodela: an explicit failure-mode-per-layer matrix, instead of the usual checklist of which controls are present.
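As a rough illustration of that deliverable, here’s a sketch of estimating pairwise correlation between layer failures from a probe-by-layer matrix; the layer names and red-team results are entirely hypothetical:

```python
# Rows = adversarial probes from a red-team exercise, columns = defensive layers.
# A 1 means the layer failed to stop that probe. Data is made up for illustration.
import numpy as np

layers = ["input_classifier", "rewriter", "sanitizer", "output_filter", "judge"]

failures = np.array([
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0],
])

corr = np.corrcoef(failures, rowvar=False)  # pairwise correlation between layer failures

for i, a in enumerate(layers):
    for j, b in enumerate(layers):
        if i < j and corr[i, j] > 0.7:
            print(f"{a} and {b} tend to fail together (r={corr[i, j]:.2f})")
```

With a few hundred probes instead of six, that matrix tells you how many independent defenses you actually have, which is rarely the number of boxes on the diagram.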
Second: introduce deterministic architectural mediation on consequential actions. This is what some recent work calls Trinity-style architecture: information flow control with mandatory labels, privilege separation between perception and execution, and a finite action calculus mediated by a reference monitor. The beauty of it is that none of these are LLMs judging LLMs. They are deterministic controls operating outside the reach of the trilemma. When an action can move money, execute code in production, send external communications or touch critical systems, that authorization shouldn’t depend on a behavioral inference about what the model “intended” to do. It should depend on capabilities, unforgeable provenance and mediation. Put another way: if the only thing standing between your agent and a six-figure wire transfer is another AI agent’s opinion that it shouldn’t go through, you don’t have an architecture. You have a group chat.
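For the avoidance of doubt about what “deterministic” means here, this is roughly the shape of such a reference monitor; the action names, labels and capability model are hypothetical stand-ins, not a prescription:

```python
# A reference monitor that authorizes consequential actions from explicit
# capabilities and provenance labels. No model judgment appears on this path.
from dataclasses import dataclass

CONSEQUENTIAL = {"wire_transfer", "deploy_to_prod", "send_external_email"}

@dataclass(frozen=True)
class ActionRequest:
    action: str
    capability_granted: bool   # explicit, out-of-band grant for this action type
    provenance_trusted: bool   # the input chain carries only trusted-origin labels

def authorize(req: ActionRequest) -> bool:
    """Deterministic check: capabilities and provenance, never an LLM's opinion."""
    if req.action not in CONSEQUENTIAL:
        return True  # low-impact actions flow through the other controls
    return req.capability_granted and req.provenance_trusted

# A transfer requested off the back of RAG-ingested (untrusted) content is denied,
# no matter how convincing the model's justification is.
print(authorize(ActionRequest("wire_transfer", capability_granted=True, provenance_trusted=False)))  # False
```

The point isn’t this particular snippet; it’s that every branch in it is auditable and none of them asks a model what it thinks.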
Third: flatten the alignment boundary instead of trying to eliminate it. If you set the safety threshold so that behavior at the boundary stays benign (a polite refusal, a clarification request, a human hand-off), the trilemma is still mathematically true, but it stops biting in practice. This is a training lever, and it’s probably the highest return-per-euro investment you can make against the structural impossibility.
Fourth: treat reward hacking monitoring as a sampling problem aimed at high-impact slices, not as uniform coverage. The no-free-lunch results say explicitly that uniform coverage is intractable, and that rare, high-impact failures are exactly what the inspector statistically misses. Concentrating the evaluation budget on actions that cross trust boundaries (code changes in production, financial transactions, external communications, writes to systems of record) is operationally more cost-effective per unit of spend.
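A sketch of what that budget allocation can look like, assuming a hypothetical action log and a hypothetical list of boundary-crossing action types:

```python
# Spend most of a fixed review budget on actions that cross trust boundaries,
# keeping a thin uniform sample of everything else. Formats are illustrative.
import random

TRUST_BOUNDARY = {"prod_code_change", "financial_txn", "external_comm", "system_of_record_write"}

def select_for_review(action_log: list[dict], budget: int) -> list[dict]:
    crossing = [a for a in action_log if a["type"] in TRUST_BOUNDARY]
    routine = [a for a in action_log if a["type"] not in TRUST_BOUNDARY]
    picked = random.sample(crossing, min(len(crossing), int(budget * 0.8)))
    picked += random.sample(routine, min(len(routine), budget - len(picked)))
    return picked
```

The 80/20 split here is arbitrary; the principle is that the slice where the no-free-lunch results bite hardest is the one you can least afford to sample uniformly.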
Fifth: ask vendors to document correlation, not completeness. Any vendor promising “we prevent all unsafe outputs” is asking you to sign off on something mathematically impossible. The right question in an RFP isn’t “what percentage of attacks do you block?”, a number that means nothing outside the vendor’s own benchmark, but “what classes of failure does your architecture cover, and what is the documented correlation between your layers’ failure modes and the base model the client is deploying?”
Closing
Defense-in-depth isn’t broken. What’s broken is the lazy version that stacks homogeneous controls and counts them as if they were independent. It’s a mistake we already made in network security twenty years ago, learned the hard way, and ought to know better than to repeat.
The good news is that the discipline knows how to do this. Cryptography manages bounded probability of compromise. Operating systems manage privilege escalation through layered mediation. Networks manage adversarial behavior through monitoring and response. In every one of those domains, maturity arrived when we stopped promising elimination and started managing bounded, characterized, monitored failure.
AI security is at exactly that point in the cycle. We’ve spent two years promising elimination. It’s time to start managing failure.
If your guardrail architecture falls apart when you ask not how many layers it has, but how many independent failure modes those layers cover, that’s the problem that matters. And it’s the one worth solving before the first incident solves it for you.
If these topics resonate, or you’re heading into an agentic deployment in a regulated sector, reach out. At Rodela we’ve spent the last few months running assessments and designing defensive architectures for clients sitting exactly at this point.