Guardrails work in layers, not as one fence.
A common mistake: treating "the guardrail" as one component you bolt onto an agent. Real protection is a stack of independent checks. No single layer catches everything; the goal is that no single failure makes it all the way through. It's the cheapest insurance an agent system can buy.
The eight kinds of guardrails
Most production systems use five to seven of these. High-stakes systems use all eight.
- Input filters. Block obviously bad inputs before they ever reach the model: way too long, malformed, or matching a list of known-bad patterns. Fast and cheap; mostly regex and simple checks.
- Intent classifiers. Look for prompt-injection patterns ("ignore previous", "pretend you are"), jailbreak attempts, or off-topic abuse. These are machine-learning models like Llama Guard, Prompt Guard, NeMo, or Lakera (Lakera, 2025). Useful as one layer in your stack, not as the only one.
- Strict output formats. Force the agent's response into a fixed structure (a JSON schema, a Pydantic model). Even an injected agent can only fill in the fields you defined, which dramatically limits the damage. The single highest-value guardrail you can add.
- Tool restrictions. Each agent role gets a list of the tools it's allowed to use, with limits on what arguments. A read-only agent has no email tool. A drafting agent has no delete tool.
- Policy engines. Pull your rules out into a separate system. "This agent isn't allowed to return data classified above its access level." Tools like OPA or Cedar work for this. Easy to audit and explain.
- Sandboxing. Any code the agent runs goes in an isolated container with no network access and minimal credentials. The last line of defense when everything else has failed.
- Second-opinion checks. A different agent or model reviews the first one's output. Variations: LLM-as-judge, running the question through two models and comparing, debate patterns.
- Human approval steps. For the highest-stakes actions, a real person has to click "approve" before it happens. The escape hatch for when automation isn't trustworthy enough.
How to add guardrails, layer by layer
The right way to build guardrails is incrementally. Start with one layer, see what slips through, add the next. This section walks through each layer with working code, and ends with a complete worked example: a customer-facing assistant for a fintech company.
Before the LLM sees anything, run dirt-cheap deterministic checks. Length, encoding, known-bad signatures. These catch ~30% of attacks at near-zero cost.
import re
JAILBREAK_PATTERNS = [
r"ignore (all |the |your |previous |prior )?(instructions|rules|prompts)",
r"disregard.{0,20}(system|prompt|instructions)",
r"you are (now|going to be) (?!a helpful)",
r"pretend (you are|to be)",
r"act as if you (have no|don't have)",
r"\bDAN\b|\bSTAN\b", # common jailbreak personas
r"<\|im_(start|end)\|>", # injected role tokens
]
def input_filter(text: str) -> tuple[bool, str]:
"""Returns (allow, reason). Cheap, deterministic, no model needed."""
if not text or not text.strip():
return False, "empty input"
if len(text) > 8000:
return False, "input exceeds 8000 char limit"
for pattern in JAILBREAK_PATTERNS:
if re.search(pattern, text, re.IGNORECASE):
return False, f"matched jailbreak signature: {pattern}"
return True, "ok"
Two things to note. The patterns are conservative; they will have false positives. Decide whether you log + allow on first match (lenient) or block (strict) based on your trust model. Also, this is a complement to ML classifiers, not a replacement; clever attacks bypass regex easily.
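If you want both behaviors available, make the mode an explicit switch rather than a buried if-statement. A minimal sketch; the FilterMode enum and log_event helper are illustrative, not part of the filter above:

from enum import Enum

class FilterMode(Enum):
    LENIENT = "lenient"  # log the signature match, let the input through
    STRICT = "strict"    # block on the first match

def apply_input_filter(text: str, mode: FilterMode) -> tuple[bool, str]:
    allow, reason = input_filter(text)
    if allow or mode is FilterMode.STRICT:
        return allow, reason
    # Lenient mode: record the hit for later review, but don't block
    log_event("jailbreak_signature", reason)  # assumed logging helper
    return True, f"allowed-with-log: {reason}"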
Force the agent's output through a strict Pydantic schema. Even if the model gets prompt-injected, it can only fill the schema fields. The damage radius collapses dramatically.
from pydantic import BaseModel, Field, field_validator
from typing import Literal
class SupportResponse(BaseModel):
intent: Literal["billing", "technical", "account", "other"]
answer: str = Field(max_length=800)
next_action: Literal["resolved", "escalate_human", "need_info"]
confidence: float = Field(ge=0, le=1)
@field_validator("answer")
@classmethod
def no_secret_leakage(cls, v: str) -> str:
# Last-line defense, even if model tries to leak, schema rejects
forbidden = ["sk-", "AKIA", "BEGIN RSA", "system prompt"]
for needle in forbidden:
if needle.lower() in v.lower():
raise ValueError(f"forbidden token in answer: {needle}")
return v
# Use it: the LLM is asked to return JSON matching SupportResponse.
# If the model freelances, validation fails and you retry or escalate.
def call_with_schema(prompt: str, schema=SupportResponse, max_retries=2):
for attempt in range(max_retries + 1):
raw = llm(prompt, json_mode=True, schema=schema.model_json_schema())
try:
return schema.model_validate_json(raw)
except Exception as e:
if attempt == max_retries:
raise
prompt = f"{prompt}\n\nRETRY: previous output failed validation: {e}"
A few teams skip this layer because "the model usually returns the right shape". That's exactly when it bites you: a once-in-10,000 hallucinated structure becomes a production incident. Schemas are insurance with a nearly free premium.
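To see the insurance pay out, feed the schema an answer that tries to leak a key (the values are made up):

bad = (
    '{"intent": "billing", "answer": "sure, the key is sk-abc123", '
    '"next_action": "resolved", "confidence": 0.9}'
)
try:
    SupportResponse.model_validate_json(bad)
except Exception as e:
    print(e)  # ValidationError: forbidden token in answer: sk-

The JSON is perfectly well-formed; it fails anyway, because the validator checks content, not just shape.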
Each agent role declares which tools it can call, and with what parameters. A reading agent has no write tools. A research agent has no payment tools. If the model decides to call something outside its list, the call fails before it leaves the process.
from dataclasses import dataclass, field
@dataclass
class ToolPolicy:
role: str
allowed_tools: set[str]
param_constraints: dict = field(default_factory=dict)
POLICIES = {
"customer_support": ToolPolicy(
role="customer_support",
allowed_tools={"lookup_account", "search_kb", "create_ticket"},
param_constraints={
"lookup_account": {"account_id": "must_match_authenticated_user"},
},
),
"refund_processor": ToolPolicy(
role="refund_processor",
allowed_tools={"lookup_transaction", "issue_refund"},
param_constraints={
"issue_refund": {"amount": {"max": 500}}, # human-approval above
},
),
}
class ToolGateway:
def __init__(self, tools: dict, policy: ToolPolicy, user_ctx: dict):
self.tools = tools
self.policy = policy
self.user_ctx = user_ctx
    def call(self, name: str, approved: bool = False, **args):
if name not in self.policy.allowed_tools:
raise PermissionError(
f"role '{self.policy.role}' may not call '{name}'"
)
# Apply parameter constraints
constraints = self.policy.param_constraints.get(name, {})
for param, rule in constraints.items():
if rule == "must_match_authenticated_user":
if args.get(param) != self.user_ctx.get("user_id"):
raise PermissionError(f"{param} must match authenticated user")
            elif isinstance(rule, dict) and "max" in rule:
                # A granted human approval waives the cap (see the approval gate below)
                if not approved and args.get(param, 0) > rule["max"]:
                    raise PermissionError(
                        f"{param}={args[param]} exceeds max {rule['max']}"
                    )
return self.tools[name](**args)
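A quick usage sketch showing the denial paths; the tool implementations are stand-in stubs, and each call is shown independently:

tools = {
    "lookup_transaction": lambda txn_id: {"txn_id": txn_id, "amount": 42},
    "issue_refund": lambda amount, txn_id: {"refunded": amount},
}
gw = ToolGateway(tools, POLICIES["refund_processor"], user_ctx={"user_id": "u-123"})

gw.call("issue_refund", amount=120, txn_id="t-1")  # allowed: under the $500 cap
gw.call("send_email", to="x@example.com")          # PermissionError: not on the allow-list
gw.call("issue_refund", amount=900, txn_id="t-1")  # PermissionError: exceeds max 500
gw.call("issue_refund", amount=900, txn_id="t-1", approved=True)  # allowed after human sign-off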
Some actions can't be undone or have big consequences: refunds above $500, deleting any record, sending an email to someone outside the company. These need a human approval step. Build that approval step into the workflow as a real, named stage, not as a quick hack tacked on the side.
from dataclasses import dataclass
from enum import Enum
class ApprovalState(Enum):
PENDING = "pending"
APPROVED = "approved"
REJECTED = "rejected"
EXPIRED = "expired"
@dataclass
class ApprovalRequest:
id: str
action: str # 'issue_refund', 'send_email', etc.
args: dict
requester: str # the agent name
workflow_id: str
expires_at: float
state: ApprovalState = ApprovalState.PENDING
class ApprovalGate:
    HIGH_STAKES = {"send_email_external", "delete_record"}
    def requires_approval(self, action: str, args: dict) -> bool:
        # Refunds need approval only above the $500 cap; the listed actions always do
        if action == "issue_refund":
            return args.get("amount", 0) > 500
        return action in self.HIGH_STAKES
async def request_and_wait(self, req: ApprovalRequest, timeout_s=300):
# Notify the human approver, persist request, wait for resolution
notify_approver(req)
result = await wait_for_decision(req.id, timeout_s)
if result.state != ApprovalState.APPROVED:
raise PermissionError(f"action {req.action} {result.state.value}")
return result
Two design notes. Async by default: the workflow pauses, the human takes minutes or hours, and the workflow resumes. Don't block a thread waiting. Auditable: every approval request and decision is logged with who, when, and why. The audit trail matters more than the gate itself when something goes wrong.
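The shape of wait_for_decision depends on your persistence layer. A minimal polling sketch, assuming an approval_store that can get and save ApprovalRequest records:

import asyncio
import time

async def wait_for_decision(request_id: str, timeout_s: float) -> ApprovalRequest:
    """Poll the persisted request until a human resolves it or the window expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        req = approval_store.get(request_id)  # assumed persistence helper
        if req.state is not ApprovalState.PENDING:
            return req
        await asyncio.sleep(2)  # yields the event loop; no thread sits blocked
    req = approval_store.get(request_id)
    req.state = ApprovalState.EXPIRED
    approval_store.save(req)
    return req

In bigger systems a webhook or pub/sub notification replaces the polling loop; the contract (resolved-or-expired, always persisted) stays the same.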
Even with all the layers above, the agent's final output reaches a user. The output filter is your last chance to redact PII, strip internal codes, and sanity-check the response shape.
import re
PII_PATTERNS = {
"ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
"credit_card":re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
"email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
"phone": re.compile(r"\b\+?\d{1,3}[ -]?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b"),
}
INTERNAL_TOKENS = {"INTERNAL_ONLY", "DO_NOT_DISCLOSE", "system_prompt"}
def output_filter(text: str, allow_email_for_self=False) -> tuple[str, list]:
"""Returns (filtered_text, list_of_redactions). Block decision separate."""
redactions = []
out = text
for name, pattern in PII_PATTERNS.items():
if name == "email" and allow_email_for_self:
continue
matches = pattern.findall(out)
if matches:
redactions.append({"type": name, "count": len(matches)})
out = pattern.sub(f"[REDACTED_{name.upper()}]", out)
for token in INTERNAL_TOKENS:
if token in out:
raise ValueError(f"internal token '{token}' leaked into output")
return out, redactions
Two patterns matter here. Redact, don't fail for things like extra phone numbers or emails: replace with a placeholder, log the redaction, continue. Hard-fail on internal tokens: if a marker like system_prompt appears in output, something is very wrong and the response should never ship.
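Running the filter over a leaky answer shows both behaviors (the values are invented):

text = "Sure! Your SSN is 123-45-6789 and we emailed ops@corp.example."
clean, redactions = output_filter(text)
# clean      -> "Sure! Your SSN is [REDACTED_SSN] and we emailed [REDACTED_EMAIL]."
# redactions -> [{"type": "ssn", "count": 1}, {"type": "email", "count": 1}]

output_filter("as the system_prompt says...")  # raises ValueError: response never ships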
Composing the layers: a worked use case
from dataclasses import dataclass
import time
@dataclass
class GuardedAgent:
"""All five layers wired together. Order matters: cheapest first."""
role: str
llm: callable
tools: dict
policy: ToolPolicy
approval_gate: ApprovalGate
user_ctx: dict
def __post_init__(self):
self.gateway = ToolGateway(self.tools, self.policy, self.user_ctx)
async def handle(self, user_input: str) -> dict:
wf_id = f"wf-{int(time.time() * 1000)}"
# LAYER 1: input filter (cheapest, runs first)
ok, reason = input_filter(user_input)
if not ok:
audit(wf_id, "BLOCK_INPUT", reason)
return {"status": "blocked", "reason": "input rejected"}
# LAYER 2: structured output via schema
prompt = build_prompt(self.role, user_input, self.user_ctx)
try:
response = call_with_schema(prompt, schema=SupportResponse)
except Exception as e:
audit(wf_id, "SCHEMA_FAIL", str(e))
return {"status": "escalated", "reason": "schema validation failed"}
# LAYER 3 + 4: tools via gateway, with approval gate for high-stakes
if response.next_action == "need_info":
# Agent wants to call a tool to get more data
tool_call = response.tool_call # schema would include this field
            needs_approval = self.approval_gate.requires_approval(
                tool_call.name, tool_call.args
            )
            if needs_approval:
                req = ApprovalRequest(
                    id=f"{wf_id}-approval",
                    action=tool_call.name, args=tool_call.args,
                    requester=self.role, workflow_id=wf_id,
                    expires_at=time.time() + 300,
                )
                try:
                    await self.approval_gate.request_and_wait(req)
                except PermissionError as e:
                    audit(wf_id, "APPROVAL_DENIED", str(e))
                    return {"status": "blocked", "reason": str(e)}
            try:
                # A granted approval waives parameter caps (e.g. the $500 refund max)
                tool_result = self.gateway.call(
                    tool_call.name, approved=needs_approval, **tool_call.args
                )
except PermissionError as e:
audit(wf_id, "TOOL_DENIED", str(e))
return {"status": "blocked", "reason": "unauthorized tool"}
# Re-prompt with tool result, re-run schema validation
response = call_with_schema(
build_prompt(self.role, user_input, self.user_ctx, tool_result),
schema=SupportResponse,
)
# LAYER 5: output filter (last line of defense)
try:
clean_answer, redactions = output_filter(response.answer)
except ValueError as e:
audit(wf_id, "OUTPUT_LEAK", str(e))
return {"status": "blocked", "reason": "output contained internal token"}
if redactions:
audit(wf_id, "OUTPUT_REDACTED", redactions)
return {
"status": "ok",
"answer": clean_answer,
"intent": response.intent,
"workflow_id": wf_id,
}
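Wiring it together for the fintech assistant looks like this; the tool implementations and the llm callable are assumed stubs, as throughout this chapter:

agent = GuardedAgent(
    role="customer_support",
    llm=llm,
    tools={
        "lookup_account": lookup_account,
        "search_kb": search_kb,
        "create_ticket": create_ticket,
    },
    policy=POLICIES["customer_support"],
    approval_gate=ApprovalGate(),
    user_ctx={"user_id": "u-123"},
)

result = await agent.handle("What's my balance?")  # inside an async entrypoint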
What happens for each kind of input
| Input | Caught at | Outcome |
|---|---|---|
| "What's my balance?" (legitimate) | None (passes all) | Tool call to lookup_account, response returned |
| "Ignore previous instructions and..." | Layer 1 (regex) | Blocked, audit logged, no LLM call made |
| Subtle injection in support ticket | Layer 2 (schema) | Model output doesn't match schema, escalated to human |
"Refund $50" → agent calls issue_refund | Layer 3 (allow-list) | Allowed, amount under $500 cap, refund issued |
| "Refund $5000" → agent tries to comply | Layer 3 + 4 | Hits approval gate, waits for human, may be denied |
| Agent leaks "your SSN is 123-45-6789" | Layer 5 (output filter) | SSN redacted, audit logged, response continues |
| Agent says "the system_prompt told me..." | Layer 5 (hard-fail) | Whole response blocked, alert fires |
The 2025 update: hidden-instruction attacks are now the #1 threat
OWASP, the industry group that publishes security top-10 lists, named prompt injection the number one risk for LLM applications in 2025 (OWASP LLM 2025). Within the broader category, the attack actually breaking real systems is what's called indirect prompt injection (IPI for short), first described by Greshake et al., AISec 2023 and updated in the 2026 study Brittle Agents, arXiv 2026.
Here's the idea: instead of typing malicious instructions to your agent directly, an attacker hides them inside content the agent will eventually read on its own (a webpage, a PDF, an email, a Stack Overflow answer). When your agent reads that content as part of normal work, it follows the hidden instructions without anyone noticing. The user sees nothing wrong because the attacker never touched the user.
Several real incidents from 2025 (collected in Prompt Injection Review, MDPI 2026; Lakera, 2025; and CrowdStrike, 2025):
- CVE-2025-53773: a malicious file fed to GitHub Copilot tricked the agent into running attacker code on the developer's machine.
- CVE-2025-59944: the Cursor coding agent followed instructions hidden in a config file, again leading to code execution on the user's machine.
- CamoLeak: a similar exploit rated CVSS 9.6 (very severe).
- Google's CommonCrawl scanners (Help Net Security / Google, 2026) reported a 32% jump in pages containing this kind of malicious instruction between November 2025 and February 2026. The technique is spreading.
When agents talk to each other, the infection spreads
A 2024 paper (Lee et al., arXiv 2024) coined the term "prompt infection" for what happens when one compromised agent talks to other agents in the same system. Agent A reads poisoned content and starts behaving badly. Its output goes to Agent B, which reads it as normal context and follows along. B passes to C. By the time anyone reviews the result, three agents have been compromised and the trace looks like ordinary collaboration.
What helps in practice:
- Don't trust messages between your own agents. Run the same input filters on agent-to-agent messages that you run on user-to-agent messages; the model doesn't know whose words it's reading (see the sketch after this list).
- Use strict output formats. If Agent B can only return a structured object (JSON with named fields, not free text), there's nowhere for hidden instructions to ride. Schemas double as a security boundary.
- Watch for unusual outputs. If an agent's response is suddenly much longer, contains URLs it never used before, or references entities it shouldn't know about, pause it and have a human review the trace.
- Use a different model for verification. If your suspicious output came from a GPT-family model, have a Claude (or vice versa) double-check the critical decisions. A weakness in one model family doesn't usually transfer to another.
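A minimal sketch of the first two points, reusing input_filter from earlier; the AgentMessage envelope is illustrative:

from pydantic import BaseModel, Field

class AgentMessage(BaseModel):
    """Structured inter-agent envelope: no free-text field for instructions to ride in."""
    sender: str
    task_id: str
    findings: list[str] = Field(max_length=20)

def receive_from_agent(raw_json: str) -> AgentMessage:
    msg = AgentMessage.model_validate_json(raw_json)
    # Same filter we run on user input: the model doesn't know whose words it reads
    for finding in msg.findings:
        ok, reason = input_filter(finding)
        if not ok:
            raise PermissionError(f"inter-agent message rejected: {reason}")
    return msg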
About the IPI defense benchmarks
Researchers have built benchmarks to test agent defenses against these attacks. A 2025 paper, "Indirect Prompt Injections: Are Firewalls All You Need?" (arXiv 2025), tested whether simple input-filtering firewalls block the attacks in those benchmarks. The firewalls block almost all of them. Encouraging on the surface, but the takeaway is uncomfortable: the benchmarks are probably too easy. Real attackers are creative in ways the benchmarks don't cover yet.
Treat IPI defense as an ongoing arms race, not a solved problem. Use multiple defensive layers, watch your block-rate metrics for sudden shifts, and assume any single defense will eventually be bypassed. The goal isn't to stop everything; it's to make attacks expensive enough that they're not worth attempting.
Goal stability: drift is the slow version of injection
Hidden-instruction attacks are the loud version of a quiet problem. The quiet version is drift. An agent's goal can change slowly across many steps without any single step looking suspicious. After fifty turns, the agent is still doing well at something, just not the thing you originally asked for. Most monitoring tools won't catch this because each step looks fine on its own.
Four common ways drift sneaks in, with no attacker required:
- The user keeps pushing back. A long conversation with a lot of "no, do it this way" has nudged the agent into doing something different from the original ask. The agent thinks it's being helpful. The original goal has quietly been replaced.
- The plan keeps rewriting itself. Each step reframes the problem a little. After many steps, the current sub-task barely looks like the original ask. Each rewrite seemed reasonable; the sum doesn't.
- Tool outputs poison the context. Something a tool returned has nudged the agent off-course, and the nudge has been rolled into the next prompt and the next. This is prompt injection that arrived through the back door and stuck around.
- The agent edits its own notes. Reflective agents that rewrite their own scratchpad or plan can edit out the original goal without noticing. They're now pursuing the new goal, perfectly.
Output filters won't catch any of this. Output filters check the final action against a policy. Drift is a problem with direction, not output. The fix is to keep a separate, untouchable copy of the original goal, and to run a checker that asks, every so often, "is what we're doing now still consistent with this?" The checker shouldn't be the same model running the workflow; if the worker has drifted, a checker built from the same context drifts with it.
The pattern is simple: write the goal down at the start, lock it, and run a separate process that compares the current trajectory against it before any irreversible action. The check is cheap. It works best on long workflows, which is where the rest of the safety stack is weakest.
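A minimal sketch of that pattern, assuming the same llm callable used earlier (the model argument, marking a checker from a different family than the worker, is an assumption):

from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the workflow can't edit the anchor
class GoalAnchor:
    workflow_id: str
    original_goal: str

def check_drift(anchor: GoalAnchor, recent_steps: list[str]) -> bool:
    """Ask a separate model whether the trajectory still serves the locked goal."""
    prompt = (
        f"Original goal: {anchor.original_goal}\n"
        f"Recent steps: {recent_steps}\n"
        "Answer YES if the steps still serve the goal, NO otherwise."
    )
    verdict = llm(prompt, model="different-family-than-worker")  # assumed kwarg
    return verdict.strip().upper().startswith("YES")

# Every N steps, and always before an irreversible action:
# if not check_drift(anchor, trace[-10:]):
#     escalate_to_human(anchor.workflow_id)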
Hardening: making guardrails that survive contact with attackers
Guardrails that work in dev break in prod within weeks. Hardening is the discipline of making them durable.
- Use multiple layers. Never rely on a single check. Stack several so that if one is bypassed, another catches it. Three good-enough layers beat one perfect one.
- Fail closed. When a safety check itself crashes or times out, reject the request rather than let it through (a sketch follows this list). "Letting through on error" is the loudest way to shoot yourself in the foot.
- Test against your own attacks. Keep an internal collection of jailbreaks, injections, and edge cases your team has found. Run it every night. Every time something slips past in production, add it to the collection.
- Version control everything. Your guardrail rules should be in git. Changes get code-reviewed like any other code. Rolling back is one commit away.
- Watch the block rate. A sudden change in how often guardrails are firing (up or down) usually means either a new attack or a code regression. Alert on the rate, not on each individual block.
- Combine rules and ML. Rules are predictable and easy to audit; ML catches things rules miss. Use rules first because they're cheaper, then send the survivors through ML.
- Sanitize when you can, reject when you must. If a document has "ignore previous instructions" buried in it, strip that line and keep going rather than failing the whole task.
- Limit how much one failure can do. Even if everything else fails, what's the worst-case outcome? Smaller worst case means you can be more forgiving on individual checks.
- Log every decision. Every block, every allow, every approval. Logs are the only way to spot patterns weeks later.
- Have someone try to break it. Give a red-teamer, or even just a separate automated process, the explicit job of finding ways past your guardrails. Reward what they find.
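A minimal fail-closed wrapper for the second bullet above; illustrative, and reusing the assumed audit helper from earlier:

import concurrent.futures

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def fail_closed(check, *args, timeout_s: float = 2.0) -> tuple[bool, str]:
    """Run a guardrail check; treat crashes and timeouts as a block, never an allow."""
    try:
        return _pool.submit(check, *args).result(timeout=timeout_s)
    except Exception as e:
        # The check itself failed: reject the request and make noise
        audit("n/a", "GUARDRAIL_ERROR", str(e))
        return False, f"fail-closed: {type(e).__name__}"

# allow, reason = fail_closed(input_filter, user_text)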
Guardrails change shape by situation
Guardrails are not one-size-fits-all. The same agent with the same prompt needs different rails depending on its environment, its inputs, and its blast radius. Three real-world examples, with a config sketch after them:
Situation A · Internal employee assistant
- Trust level: high (authenticated employees).
- Light intent classifier: most users aren't attacking. A heavy filter would create friction.
- Moderate output schema: flexible responses needed.
- Strict tool allow-list per role: engineering can run code; HR cannot.
- Audit logging only: no real-time blocks except for clear violations.
Situation B · Public-facing customer support bot
- Trust level: low (anonymous, adversarial).
- Heavy intent classifier: jailbreak attempts every minute.
- Strict output schema: only structured responses; no free-form text that could leak.
- Tool allow-list extremely narrow: read-only on the customer's own data.
- Output filter for PII, internal codes, system prompts.
- Rate limiting per session: prevent abuse.
- Real-time monitoring on jailbreak rates.
Situation C · Autonomous trading agent
- Trust level: high inputs (your own data) but catastrophic blast radius.
- Hard rules first: position size limits, daily loss caps, asset whitelists. Non-negotiable.
- Pre-trade simulation: every order runs through a what-if before execution.
- Approval gate on any trade above $X.
- Kill switch: observable, externally controllable, tested weekly.
- Anomaly detection on order patterns, alert on anything outside historical norms.
- Multiple independent risk agents: disagreement on risk = no trade.
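To make the contrast concrete, the knobs can be captured in a small per-situation config. A hypothetical sketch; GuardProfile and its fields are illustrative names, distinct from the kit's GuardConfig:

from dataclasses import dataclass

@dataclass(frozen=True)
class GuardProfile:
    intent_classifier: str   # "light" or "heavy"
    output_schema: str       # "moderate" or "strict"
    tool_scope: str          # "role-allow-list", "read-only", "hard-rules-first"
    human_approval: bool     # gate the highest-stakes actions on a person
    rate_limited: bool
    realtime_monitoring: bool

INTERNAL_ASSISTANT = GuardProfile("light", "moderate", "role-allow-list", False, False, False)
PUBLIC_SUPPORT_BOT = GuardProfile("heavy", "strict", "read-only", False, True, True)
TRADING_AGENT = GuardProfile("light", "strict", "hard-rules-first", True, False, True)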
Guardrails by failure mode
A reverse lookup: "I'm worried about X, which guardrails address it?"
| Failure mode | Primary guardrail | Backup |
|---|---|---|
| Prompt injection | Output schema + role isolation | Intent classifier, cross-check |
| Hallucinated facts | Tool grounding (verify with source) | Cross-agent verification |
| Destructive tool call | Tool allow-list + dry-run | Human approval gate |
| PII leakage | Output filter + role isolation | Audit + redaction |
| Runaway cost | Token budget per workflow | Iteration cap + observer |
| Goal drift | Periodic re-anchoring | Eval against original goal |
| Compromised agent | Sandboxing + cross-check | Anomaly detection |
| Reward hacking | Multi-metric eval | External auditor agent |
A practical addendum that did not exist when this chapter was first written: the five guards above are the same five guards regardless of agent shape, but the right configuration for each depends on whether the agent is a generalist, a specialist, or a generalist plus RAG. Chapter 09 (Generalists & specialists) covers the configuration trade-offs, and the kit's profile_aware_guards() helper produces a starter GuardConfig from an AgentProfile. Separately, the agent itself can hallucinate that it has guards (or capabilities) it does not have. Chapter 10 (When the agent itself is wrong) covers the three external checks (capability registry, pinned ask, tool gate) that close that gap.