10 When the agent itself is wrong · hallucinated capabilities, asks & privileges

The handshake assumed the agent would be honest about itself. It will not be.

The chapter on context exchange (chapter 06) gave us a pattern: before two agents collaborate, run a short capability handshake, then exchange typed envelopes through a compartment. The pattern works. It also assumed something quietly: that the agent on the other end would tell the truth about its own state. Real LLM agents do not.

There is no malice required. An LLM has no special access to its own capabilities, its own privileges, or what was actually asked of it three turns ago. Everything it "knows" about itself comes from the same prompt context that got muddied by the last tool call. So the agent claims it can produce a tag it cannot produce. It claims it has a token it does not have. It "remembers" the goal as something slightly different from the original. It tries to call a tool that does not exist in its catalog. None of these are bugs in the model. They are the baseline behavior of any LLM under load.

This chapter is about the layer that closes the gap. The principle is the same one used everywhere else in computer security: do not let the principal declare its own privileges. Apply that to agents and you get three external checks: capabilities published by an authority, asks pinned at session start, tool calls verified at the gate.

Prompt injection (covered in chapter 20, Adversarial) is about external attackers tricking the agent. This chapter is about the agent itself being wrong, with no attacker involved. The defenses overlap, but the failure mode is different: a hallucinating agent does not need anyone to push it.

Five concrete failure modes

Before designing defenses, name what we are defending against. These are not hypothetical; every team that runs LLM agents in production hits all five within a quarter.

The fix is structural, not better prompting

Some teams try to fix this with prompt engineering: "remember to include the citations tag," "do not call tools you have not been told about," "stay focused on the original ask." This works for the first hundred sessions and fails on the next thousand. The agent is not refusing to follow instructions; it is following the most recent context, which is no longer the system prompt by turn five. Prompts are not enforcement mechanisms.

The structural fix has three pieces, and all three are needed:

CapabilityRegistry: Capabilities are published by an external authority, not declared by the agent at handshake. The handshake reads the registry, not the agent's word. A lying agent cannot claim capabilities the registry does not list.

PinnedAsk: The original goal is hashed at session start. Every later turn that claims to know what is being done must reference the same hash. Drift is detected, not assumed away.

ToolGate: Every tool call is intercepted and checked against the contract's sub_tools_disclosed set. Hallucinated tool calls fail closed before the tool implementation is touched. The rejected calls are still recorded for audit.

Together: The registry says what the agent CAN do; the pin says what it WAS asked; the gate enforces both at the moment of action. None of the three trusts the agent's self-report, as the sketch below shows.

CapabilityRegistry: capabilities published by an authority

The handshake from chapter 06 took the callee's AgentCapabilities object as input. In practice, this object came from the agent itself, which means a lying or confused agent could put anything in it. The fix is to put a registry in front: at deploy time, an operator commits each agent's capabilities into the registry; at handshake time, the negotiator reads from the registry and ignores whatever the agent claims.

from verification import CapabilityRegistry
from verification.registry import negotiate_against_registry

# Operator registers the TRUE capabilities at deploy time
registry = CapabilityRegistry()
registry.register(
    true_caps,                       # includes only the tools the agent really has
    version="v1.0",
    registered_by="ops_team",
)

# Caller wants to handshake with summarizer_v2
contract = negotiate_against_registry(
    request,
    claimed_agent_id="summarizer_v2",    # agent only sends its name
    registry=registry,                  # truth lives here, not in the agent
)

# If the agent later "claims" different capabilities, the contract
# does not change. The contract was built from the registry entry.

The registry stores a fingerprint of every entry, so tampering at rest is detectable. Lookups can pin a maximum age, so a cache that is too old gets rejected automatically. If an agent is removed by an operator (revoked), the next handshake fails immediately with AgentNotRegisteredError, even if the agent itself is still running and trying to negotiate.
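
A hedged sketch of those three behaviors in use. The names lookup, max_age_seconds, revoke, revoked_by, and the import path of AgentNotRegisteredError are assumptions about the kit's API made for illustration, not confirmed signatures.

from verification import CapabilityRegistry
from verification.registry import negotiate_against_registry
from verification.registry import AgentNotRegisteredError   # assumed import path

registry = CapabilityRegistry()
registry.register(true_caps, version="v1.0", registered_by="ops_team")

# Stale-entry rejection: refuse any registry entry older than one hour
# (`lookup` and `max_age_seconds` are assumed names for the behavior above)
caps = registry.lookup("summarizer_v2", max_age_seconds=3600)

# Revocation: once the operator removes the agent, the next handshake fails
registry.revoke("summarizer_v2", revoked_by="ops_team")    # assumed signature
try:
    negotiate_against_registry(
        request, claimed_agent_id="summarizer_v2", registry=registry
    )
except AgentNotRegisteredError:
    pass  # fails closed: no contract, even though the agent is still running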

PinnedAsk: hash the original goal so drift is visible

The drift problem is subtle because the agent does not announce that it has changed its mind. It just acts on a slightly different goal. The fix is to pin the original goal at session start, hash it, and require any later restatement to hash to the same value. Real semantic drift produces a different hash; trivial reformatting (extra whitespace, different case, unicode form differences) does not.

from verification import pin_ask, verify_ask
from verification.pinned_ask import require_match

# Turn 0: pin the original ask
pinned = pin_ask(
    goal_text="summarize this ticket",
    session_id=contract.session_id,
    ttl_seconds=60,
)

# Turn 5: agent claims it is doing something subtly different.
# If the restated goal does not hash to the same value, raise.
require_match(pinned, agent_restated_goal)
# raises AskDriftError if the agent's restated goal has drifted

The hash is over a normalized form of the text (NFC unicode, lowercased, whitespace collapsed) plus the session id. The session id is in the hash so a pin from session A cannot be confused with a pin from session B even if the goal text is identical. The TTL means a stale pin is its own kind of drift; a session that has been quiet for ten minutes and then claims to know what was asked is not trusted.
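
A standalone sketch of that normalization and hashing, assuming SHA-256; ask_fingerprint is a hypothetical helper written for illustration, not the kit's implementation.

import hashlib
import unicodedata

# Hypothetical helper illustrating the normalization described above;
# the kit's exact hash construction may differ.
def ask_fingerprint(goal_text: str, session_id: str) -> str:
    text = unicodedata.normalize("NFC", goal_text)   # unify unicode forms
    text = " ".join(text.lower().split())            # lowercase, collapse whitespace
    return hashlib.sha256(f"{session_id}:{text}".encode("utf-8")).hexdigest()

# Trivial reformatting hashes the same; a different session or real drift does not
assert ask_fingerprint("Summarize  THIS ticket", "s-1") == ask_fingerprint("summarize this ticket", "s-1")
assert ask_fingerprint("summarize this ticket", "s-1") != ask_fingerprint("summarize this ticket", "s-2")
assert ask_fingerprint("summarize and email the customer", "s-1") != ask_fingerprint("summarize this ticket", "s-1")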

What this does not do: it does not catch a misinterpretation that happens at turn 0. If the agent reads "summarize this ticket" and immediately interprets it as "summarize and email the customer," the pin captures the wrong understanding. The pin only catches drift away from the original wording. For misinterpretation at turn 0, the answer is a tighter system prompt and the output schema validation from chapter 17. The pin is one defense, not the whole stack.

ToolGate: every call checked against the contract

The session contract from chapter 06 carries a sub_tools_disclosed set. The compartment from that chapter checks envelopes; it does not check tool calls. The tool gate fills the gap: every call routes through one chokepoint that compares the requested tool against the disclosed set, and rejects calls that are not in the set.

from verification import ToolGate, UndisclosedToolError

# Build the gate around the contract and the actual tool implementations
tools = {"search": do_search, "summarize": do_summarize}
gate = ToolGate(contract, tools)

# Disclosed call: passes through
result = gate.call("search", q="recent disputes")

# Hallucinated call: blocked at the gate, never reaches the implementation
try:
    gate.call("send_email", to="customer@x.com")
except UndisclosedToolError as e:
    log("agent attempted undisclosed tool", tool="send_email")

# The rejected call is in the audit log even though it never ran
for rec in gate.call_log:
    print(rec.tool_name, rec.accepted, rec.reason)

The audit point is critical and easy to miss: rejected calls are still recorded. An agent that tries fifty different tools to find one that works is a different behavior from an agent that succeeds on the first call, and you cannot tell them apart unless rejections are recorded too. The gate exposes attempted_undisclosed() for exactly this query, plus an acceptance_rate() that drops as the agent drifts. A drop in acceptance rate over time is a strong signal the agent should be flagged or replaced.
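
A short monitoring sketch over those two methods; the 0.8 threshold and the flag_agent helper are illustrative assumptions, not kit API.

# Periodic audit over the gate's log (threshold and flag_agent are illustrative)
for rec in gate.attempted_undisclosed():
    log("undisclosed tool attempt", tool=rec.tool_name, reason=rec.reason)

if gate.acceptance_rate() < 0.8:     # operational choice, not a kit default
    flag_agent("summarizer_v2", reason="tool acceptance rate dropping")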

How this connects to the rest of the manual

What this does not solve

Practical guidance

The smallest version of all three pieces fits in around 400 lines of Python and ships in the companion kit as verification/, with 32 tests covering the registry, the pin, the gate, and a full end-to-end defense against a hallucinating agent.