10 When the agent itself is wrong · hallucinated capabilities, asks & privileges

The handshake assumed the agent would be honest about itself. It will not be.

The chapter on context exchange (chapter 06) gave us a pattern: before two agents collaborate, run a short capability handshake, then exchange typed envelopes through a compartment. The pattern works. It also assumed something quietly: that the agent on the other end would tell the truth about its own state. Real LLM agents do not.

There is no malice required. An LLM has no special access to its own capabilities, its own privileges, or what was actually asked of it three turns ago. Everything it "knows" about itself comes from the same prompt context that got muddied by the last tool call. So the agent claims it can produce a tag it cannot produce. It claims it has a token it does not have. It "remembers" the goal as something slightly different from the original. It tries to call a tool that does not exist in its catalog. None of these are bugs in the model. They are the baseline behavior of any LLM under load.

This chapter is about the layer that closes the gap. The principle is the same one used everywhere else in computer security: do not let the principal declare its own privileges. Apply that to agents and you get three external checks: capabilities published by an authority, asks pinned at session start, tool calls verified at the gate.

Prompt injection (covered in chapter 20, Adversarial) is about external attackers tricking the agent. This chapter is about the agent itself being wrong, with no attacker involved. The defenses overlap, but the failure mode is different: a hallucinating agent does not need anyone to push it.

Five concrete failure modes

Before designing defenses, name what we are defending against. These are not hypothetical; every team that runs LLM agents in production hits all five within a quarter.

The fix is structural, not better prompting

Some teams try to fix this with prompt engineering: "remember to include the citations tag," "do not call tools you have not been told about," "stay focused on the original ask." This works for the first hundred sessions and fails on the next thousand. The agent is not refusing to follow instructions; it is following the most recent context, which is no longer the system prompt by turn five. Prompts are not enforcement mechanisms.

The structural fix has three pieces, and all three are needed:

CapabilityRegistry: Capabilities are published by an external authority, not declared by the agent at handshake. The handshake reads the registry, not the agent's word. A lying agent cannot claim capabilities the registry does not list.

PinnedAsk: The original goal is hashed at session start. Every later turn that claims to know what is being done must reference the same hash. Drift is detected, not assumed away.

ToolGate: Every tool call is intercepted and checked against the contract's sub_tools_disclosed set. Hallucinated tool calls fail closed before the tool implementation is touched. The rejected calls are still recorded for audit.

Together: The registry says what the agent CAN do; the pin says what it WAS asked; the gate enforces both at the moment of action. None of the three trusts the agent's self-report, as the sketch below shows.

CapabilityRegistry: capabilities published by an authority

The handshake from chapter 06 took the callee's AgentCapabilities object as input. In practice, this object came from the agent itself, which means a lying or confused agent could put anything in it. The fix is to put a registry in front: at deploy time, an operator commits each agent's capabilities into the registry; at handshake time, the negotiator reads from the registry and ignores whatever the agent claims.

from verification import CapabilityRegistry
from verification.registry import negotiate_against_registry

# Operator registers the TRUE capabilities at deploy time
registry = CapabilityRegistry()
registry.register(
    true_caps,                       # includes only the tools the agent really has
    version="v1.0",
    registered_by="ops_team",
)

# Caller wants to handshake with summarizer_v2
contract = negotiate_against_registry(
    request,
    claimed_agent_id="summarizer_v2",    # agent only sends its name
    registry=registry,                  # truth lives here, not in the agent
)

# If the agent later "claims" different capabilities, the contract
# does not change. The contract was built from the registry entry.

The registry stores a fingerprint of every entry, so tampering at rest is detectable. Lookups can pin a maximum age, so a cache that is too old gets rejected automatically. If an agent is removed by an operator (revoked), the next handshake fails immediately with AgentNotRegisteredError, even if the agent itself is still running and trying to negotiate.
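
A hedged sketch of those three behaviors in use. The names lookup, max_age_seconds, revoke, revoked_by, and the import path of AgentNotRegisteredError are assumptions about the kit's API made for illustration, not confirmed signatures.

from verification import CapabilityRegistry
from verification.registry import negotiate_against_registry
from verification.registry import AgentNotRegisteredError   # assumed import path

registry = CapabilityRegistry()
registry.register(true_caps, version="v1.0", registered_by="ops_team")

# Stale-entry rejection: refuse any registry entry older than one hour
# (`lookup` and `max_age_seconds` are assumed names for the behavior above)
caps = registry.lookup("summarizer_v2", max_age_seconds=3600)

# Revocation: once the operator removes the agent, the next handshake fails
registry.revoke("summarizer_v2", revoked_by="ops_team")    # assumed signature
try:
    negotiate_against_registry(
        request, claimed_agent_id="summarizer_v2", registry=registry
    )
except AgentNotRegisteredError:
    pass  # fails closed: no contract, even though the agent is still running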

PinnedAsk: hash the original goal so drift is visible

The drift problem is subtle because the agent does not announce that it has changed its mind. It just acts on a slightly different goal. The fix is to pin the original goal at session start, hash it, and require any later restatement to hash to the same value. Real semantic drift produces a different hash; trivial reformatting (extra whitespace, different case, unicode form differences) does not.

from verification import pin_ask, verify_ask
from verification.pinned_ask import require_match

# Turn 0: pin the original ask
pinned = pin_ask(
    goal_text="summarize this ticket",
    session_id=contract.session_id,
    ttl_seconds=60,
)

# Turn 5: agent claims it is doing something subtly different.
# If the restated goal does not hash to the same value, raise.
require_match(pinned, agent_restated_goal)
# raises AskDriftError if the agent's restated goal has drifted

The hash is over a normalized form of the text (NFC unicode, lowercased, whitespace collapsed) plus the session id. The session id is in the hash so a pin from session A cannot be confused with a pin from session B even if the goal text is identical. The TTL means a stale pin is its own kind of drift; a session that has been quiet for ten minutes and then claims to know what was asked is not trusted.
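
A standalone sketch of that normalization and hashing, assuming SHA-256; ask_fingerprint is a hypothetical helper written for illustration, not the kit's implementation.

import hashlib
import unicodedata

# Hypothetical helper illustrating the normalization described above;
# the kit's exact hash construction may differ.
def ask_fingerprint(goal_text: str, session_id: str) -> str:
    text = unicodedata.normalize("NFC", goal_text)   # unify unicode forms
    text = " ".join(text.lower().split())            # lowercase, collapse whitespace
    return hashlib.sha256(f"{session_id}:{text}".encode("utf-8")).hexdigest()

# Trivial reformatting hashes the same; a different session or real drift does not
assert ask_fingerprint("Summarize  THIS ticket", "s-1") == ask_fingerprint("summarize this ticket", "s-1")
assert ask_fingerprint("summarize this ticket", "s-1") != ask_fingerprint("summarize this ticket", "s-2")
assert ask_fingerprint("summarize and email the customer", "s-1") != ask_fingerprint("summarize this ticket", "s-1")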

What this does not do: it does not catch a misinterpretation that happens at turn 0. If the agent reads "summarize this ticket" and immediately interprets it as "summarize and email the customer," the pin captures the wrong understanding. The pin only catches drift away from the original wording. For misinterpretation at turn 0, the answer is a tighter system prompt and the output schema validation from chapter 17. The pin is one defense, not the whole stack.

ToolGate: every call checked against the contract

The session contract from chapter 06 carries a sub_tools_disclosed set. The compartment from that chapter checks envelopes; it does not check tool calls. The tool gate fills the gap: every call routes through one chokepoint that compares the requested tool against the disclosed set, and rejects calls that are not in the set.

from verification import ToolGate, UndisclosedToolError

# Build the gate around the contract and the actual tool implementations
tools = {"search": do_search, "summarize": do_summarize}
gate = ToolGate(contract, tools)

# Disclosed call: passes through
result = gate.call("search", q="recent disputes")

# Hallucinated call: blocked at the gate, never reaches the implementation
try:
    gate.call("send_email", to="customer@x.com")
except UndisclosedToolError as e:
    log("agent attempted undisclosed tool", tool="send_email")

# The rejected call is in the audit log even though it never ran
for rec in gate.call_log:
    print(rec.tool_name, rec.accepted, rec.reason)

The audit point is critical and easy to miss: rejected calls are still recorded. An agent that tries fifty different tools to find one that works is a different behavior from an agent that succeeds on the first call, and you cannot tell them apart unless rejections are recorded too. The gate exposes attempted_undisclosed() for exactly this query, plus an acceptance_rate() that drops as the agent drifts. A drop in acceptance rate over time is a strong signal the agent should be flagged or replaced.
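
A short monitoring sketch over those two methods; the 0.8 threshold and the flag_agent helper are illustrative assumptions, not kit API.

# Periodic audit over the gate's log (threshold and flag_agent are illustrative)
for rec in gate.attempted_undisclosed():
    log("undisclosed tool attempt", tool=rec.tool_name, reason=rec.reason)

if gate.acceptance_rate() < 0.8:     # operational choice, not a kit default
    flag_agent("summarizer_v2", reason="tool acceptance rate dropping")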

How this connects to the rest of the manual

What this does not solve

Practical guidance

The smallest version of all three pieces fits in around 400 lines of Python and ships in the companion kit as verification/, with 32 tests covering the registry, the pin, the gate, and a full end-to-end defense against a hallucinating agent.