Not every agent should be allowed to do everything.
By the time you're running more than a handful of agents, "what is this agent allowed to do?" stops being a small detail and becomes one of the central questions of the system. A junior support agent shouldn't be able to issue refunds. A research agent shouldn't be able to send emails. An external partner's agent definitely shouldn't have read access to your customer database.
Most teams start with a simple model: each agent has a fixed list of tools it can call, set when the agent is created, and that's the whole story. This works for small systems. It falls apart for bigger ones because real work isn't that neat. Sometimes the support agent does need refund authority for one specific case. Sometimes the research agent does need to send one email. The hard part is doing this safely.
A useful 2025 survey on inter-agent trust (Inter-Agent Trust Models, arXiv 2025) identified six distinct mechanisms that production systems use to decide whether an agent should be trusted with a given action. We'll cover all six, build a working behavior-tracking statistic with the actual math, and walk through the enforcement layer that turns the score into something an agent can't ignore.
Pre-config vs post-config: the two kinds of privilege
Privileges break naturally into two categories:
- Pre-configured privileges are set at agent creation time and rarely change. "This support agent can read customer profiles, can issue refunds up to $50, and can search the FAQ knowledge base." These are the floor of what the agent can do.
- Post-configured privileges are granted at runtime, often for a specific action, and expire. "Just for this conversation, this support agent has temporary access to the engineering knowledge base because the user is asking a technical question." These are how the agent stretches beyond its baseline when the situation needs it.
Pre-config is your security floor. Post-config is your flexibility. A system with only pre-config is rigid (every edge case requires a code deploy). A system with only post-config has no floor (the agent can talk its way into anything).
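A minimal sketch of how the two categories can coexist in one structure (the names AgentPrivileges and TemporaryGrant are illustrative, not from any particular framework): the baseline is frozen at creation time, while runtime grants carry a reason and an expiry.
# Sketch: pre-configured baseline plus expiring post-configured grants.
# Names (AgentPrivileges, TemporaryGrant) are illustrative.
from dataclasses import dataclass, field
from time import time

@dataclass(frozen=True)
class TemporaryGrant:
    privilege: str       # e.g. "kb:engineering.read"
    reason: str          # recorded for the audit trail
    expires_at: float    # unix timestamp; the grant is dead after this

@dataclass
class AgentPrivileges:
    baseline: frozenset[str]                                    # pre-configured floor
    grants: list[TemporaryGrant] = field(default_factory=list)  # post-configured, runtime

    def allows(self, privilege: str) -> bool:
        if privilege in self.baseline:
            return True
        now = time()
        return any(g.privilege == privilege and g.expires_at > now for g in self.grants)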
RAG access is a privilege too
A RAG (Retrieval-Augmented Generation) system lets agents query a knowledge base and use the results in their answer. "Which knowledge base" is itself a privilege decision, and an underrated one. A customer-facing agent that can search support documentation is low-risk. The same agent searching internal Slack archives is a data-exfiltration channel waiting for a prompt-injection to open it. Same tools, different corpus, completely different risk profile.
RAG access should be scoped exactly like tool access:
- Per-corpus permissions. An agent has a list of which knowledge bases it can query, not just "RAG enabled".
- Per-document filtering at retrieval time. Even within a corpus, classification levels (public, internal, restricted) gate what surfaces. Filtering happens before documents reach the LLM, never after.
- Per-tenant isolation. Multi-tenant systems must isolate by tenant ID at the vector store level. A namespace prefix isn't enough; the index itself should be partitioned.
- Audit on retrieval, not just citation. Log which documents the agent saw, not which ones it cited. The agent can paraphrase sensitive content without quoting it; you still want a record.
# RAG access wrapped in privilege checks with full audit trail
from dataclasses import dataclass
from typing import Iterable
@dataclass(frozen=True)
class RAGScope:
agent_id: str
tenant_id: str
corpora: frozenset[str]
classifications: frozenset[str] # e.g., {"public", "internal"}
class ScopedRetriever:
def __init__(self, scope: RAGScope, store, audit):
self.scope = scope
self.store = store
self.audit = audit
def retrieve(self, query: str, corpus: str, k: int = 5) -> list:
# 1. Reject before touching the index
if corpus not in self.scope.corpora:
self.audit.write(self.scope.agent_id, "rag.denied.corpus",
corpus=corpus, query_hash=hash(query))
raise PermissionError(f"agent has no access to '{corpus}'")
# 2. Query is namespaced by tenant; filter by classification at index level
results = self.store.search(
query=query,
corpus=corpus,
tenant=self.scope.tenant_id,
allowed_classifications=self.scope.classifications,
top_k=k)
# 3. Defense in depth: re-filter post-retrieval in case the store
# has stale ACLs or the index leaks across classifications
results = [r for r in results if r.classification in self.scope.classifications]
# 4. Audit what the AGENT WILL SEE, not just what it cites
self.audit.write(self.scope.agent_id, "rag.retrieved",
corpus=corpus, doc_ids=[r.id for r in results])
return results
The six trust mechanisms
The 2025 survey (Inter-Agent Trust Models, arXiv 2025) categorized six different ways production systems decide whether to trust an agent. Each has different costs, different failure modes, different sweet spots.
Real systems combine these. A typical stack: Constraint first (the agent is sandboxed), then Claim (it tells you what it does), then Brief (a signed cert backing the claim), then Reputation (track how it actually behaves), and Stake only for high-value actions where misbehavior must be financially expensive. Proof is the heaviest mechanism and shows up where regulators or counterparties demand it.
The behavior-tracking statistic
Here's where it gets technical. We need a number (or several numbers) that summarize how an agent has been behaving and update incrementally as new evidence comes in. The statistic must be: cheap to update, resistant to gaming, and fast to query at decision time.
The right tool is the Beta distribution. Track each agent's behavior as a pair of counters (α, β) where α is "good outcomes" and β is "bad outcomes". The expected reputation is then:
E[r] = α / (α + β)
Var[r] = αβ / [(α + β)² · (α + β + 1)]
In English: the agent's reputation is the fraction of good outcomes among all observed outcomes, with the variance shrinking as more observations come in. A new agent with no history starts at α=1, β=1 (the Beta(1,1) prior, also called the uniform prior), giving an expected reputation of 0.5 with high uncertainty. After 100 successes and 5 failures the counters become α=101, β=6, so the expected reputation is 101/107 ≈ 0.944 with much lower variance, and the score is now well-supported.
Why this and not a simple running average? Three reasons:
- It carries uncertainty. A 5-out-of-5 success rate (α=6, β=1, mean ≈ 0.857) and a 100-out-of-116 run with 16 failures (α=101, β=17, mean ≈ 0.856) have nearly identical means, but the second is far more confident. The variance lets you tell them apart and require more history before granting risky privileges (see the numeric sketch after this list).
- Updates are O(1) and additive. A success increments α; a failure increments β. No re-computation needed.
- Bayesian credible intervals fall out for free. "I'm 95% confident the agent's true reliability is at least X" reduces to a quantile of the Beta CDF, which most stats libraries compute in microseconds.
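To make the first point above concrete, a short sketch using scipy (the same library the BetaCounter below relies on): the two agents from the first bullet have nearly identical means but very different pessimistic estimates.
# Same mean, different confidence. The credible lower bound (5% quantile)
# separates the thin track record from the long one.
from scipy.stats import beta

rookie = beta(6, 1)       # 5 successes, 0 failures on the Beta(1,1) prior
veteran = beta(101, 17)   # 100 successes, 16 failures on the same prior

print(rookie.mean(), veteran.mean())        # both roughly 0.86
print(rookie.ppf(0.05), veteran.ppf(0.05))  # roughly 0.61 vs roughly 0.80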
The decay term: why old behavior should fade
Without decay, an agent that misbehaves once gets penalized forever; an agent that built reputation a year ago coasts on stale credit. Apply exponential decay to both counters at update time:
α_new = α_old · e^(-λ · Δt) + good_observations
β_new = β_old · e^(-λ · Δt) + bad_observations
where λ = ln(2) / half_life
In English: scale both counters down by an exponential factor based on how much time has passed since the last update, then add the new evidence. Pick a half-life that matches your domain (e.g., 90 days means an event from 90 days ago counts half as much as today's). Recent behavior dominates; old behavior fades smoothly without any one-time "expiry" cliff.
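A quick numeric check of the decay factor for the 90-day half-life used in the example above:
# Decay factor for a 90-day half-life: an observation from 90 days ago
# counts half, one from 180 days ago counts a quarter.
from math import exp, log

half_life_seconds = 90 * 86400
lam = log(2) / half_life_seconds
for days in (0, 30, 90, 180):
    print(days, round(exp(-lam * days * 86400), 3))   # 1.0, 0.794, 0.5, 0.25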
Multi-dimensional scoring
A single trust score gets gamed because it creates a single optimization target. The 2025 Dynamic Reputation Filtering work (DRF, arXiv 2025) recommends scoring across multiple independent dimensions. Track at least these four:
| Dimension | What counts as success | What counts as failure | Half-life |
|---|---|---|---|
| Accuracy | output verified correct (passes tests, matches ground truth) | verified incorrect | 30 days |
| Compliance | followed policy, used only allowed tools, respected schemas | any guardrail block, any policy violation | 90 days |
| Efficiency | completed within budget (tokens, time, tool calls) | exceeded budget, hit iteration cap | 14 days |
| Safety | never attempted a denied action, no PII leaks, no IPI signals | any safety incident, however minor | 180 days |
Notice the half-lives differ. Efficiency degrades fast; an agent that was efficient last month might be slower today as the task changes. Safety incidents stick around six months because one safety failure is a strong signal you can't ignore.
Two honest caveats about this table. First, four axes is an answer, not the answer. Accuracy, compliance, efficiency, and safety cover the obvious cases for a general-purpose system, but real production systems extend or split this set when the domain demands. A platform serving regulated industries usually adds a fairness axis with its own threshold and audit story (decisions that disadvantage a protected group are tracked separately from compliance failures, and slashed harder). A code-agent platform usually splits accuracy into correctness (does the code do what was asked) and security (does the code introduce a vulnerability), because the two have different signal sources, different decay rates, and different consequences when they fail. Pick the axes that match what your system can actually observe and what your operators actually need to gate on; do not pick four because the example used four.
Second, the half-lives in the table are reasonable starting points, not derivations. The next section covers how to pick them from data instead of intuition.
Picking half-lives from data, not from intuition
The table above lists 30, 90, 14, and 180 days. Those numbers are not magic; they are the values that have worked across the systems the authors have shipped. Picking the right half-life for your own system is an offline analysis over your own incident history, and it has a clean recipe.
The question a half-life answers is: how long ago does past behavior on this dimension stop being useful for predicting future behavior? If accuracy failures from six months ago are just as predictive of accuracy failures next week as failures from six days ago, your half-life should be very long (or there is no decay at all). If failures from a month ago are uncorrelated with failures next week, your half-life should be short.
The recipe, run once per dimension:
- Pull the last twelve months of outcomes for this dimension. One row per agent per outcome: timestamp, agent_id, success or failure. The longer the history, the better; six months is a workable minimum.
- Split the history at a chosen midpoint. Earlier half is "history"; later half is "ground truth."
- For a grid of candidate half-lives (say 1, 7, 14, 30, 60, 90, 180, 365 days), compute each agent's reputation at the split point using only the history with that decay constant.
- For each candidate, measure how well that reputation predicts the ground-truth half. The simplest metric is rank correlation between reputation at the split and observed success rate after the split. Brier score works too if you want a calibration-aware metric.
- Pick the half-life that maximizes prediction quality. If the curve is flat, you can use any value in the flat region; pick the longer one because longer half-lives reduce volatility.
Run this once per quarter. The right half-life drifts as your system changes (a new model release can shift accuracy decay; a guardrail upgrade can shift compliance decay), and stale half-lives quietly degrade the trust engine's resolution. The 30/90/14/180 starting points are not bad defaults to begin with, but they should be the input to your first calibration, not the answer to it.
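Here is a sketch of that recipe under assumed data shapes: history is a list of (timestamp, agent_id, success) tuples for one dimension, and rank correlation comes from scipy's spearmanr. Treat it as a starting point for your own calibration job, not a finished tool.
# Offline half-life calibration sketch. `history` is assumed to be a list of
# (timestamp, agent_id, success) tuples covering ~12 months for one dimension.
from collections import defaultdict
from math import exp, isnan, log
from scipy.stats import spearmanr

def calibrate_half_life(history, split_ts,
                        candidate_days=(1, 7, 14, 30, 60, 90, 180, 365)):
    earlier = [(t, a, s) for t, a, s in history if t < split_ts]
    later = [(t, a, s) for t, a, s in history if t >= split_ts]

    # Ground truth: plain success rate per agent after the split
    truth = defaultdict(lambda: [0, 0])
    for _, agent, success in later:
        truth[agent][0] += int(success)
        truth[agent][1] += 1
    truth_rate = {a: ok / n for a, (ok, n) in truth.items() if n >= 5}

    best = None
    for days in candidate_days:
        lam = log(2) / (days * 86400)
        alpha = defaultdict(lambda: 1.0)   # Beta(1,1) prior
        beta = defaultdict(lambda: 1.0)
        for t, agent, success in earlier:
            w = exp(-lam * (split_ts - t))           # decay evidence to the split point
            if success: alpha[agent] += w
            else:       beta[agent] += w
        agents = [a for a in truth_rate if a in alpha or a in beta]
        if len(agents) < 3:
            continue
        reputation = [alpha[a] / (alpha[a] + beta[a]) for a in agents]
        observed = [truth_rate[a] for a in agents]
        rho, _ = spearmanr(reputation, observed)     # rank correlation = prediction quality
        if isnan(rho):
            continue
        if best is None or rho > best[1]:
            best = (days, rho)
    return best   # (half_life_days, rank_correlation) or None if there is too little data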
The composite "should I grant this privilege?" question becomes a per-privilege threshold check across the dimensions:
from dataclasses import dataclass, field
from math import exp, log
from time import time
from scipy.stats import beta as beta_dist
@dataclass
class BetaCounter:
"""Beta-distributed reputation counter with exponential time decay.
The prior is Beta(1,1). When alpha and beta decay, so does the
prior's nominal contribution; we track the decayed prior separately
so sample_size() reports actual observed mass, not raw alpha+beta-2
(which goes negative once decay shrinks the original prior below 1)."""
alpha: float = 1.0
beta: float = 1.0
alpha_prior: float = 1.0 # decayed prior contribution to alpha
beta_prior: float = 1.0 # decayed prior contribution to beta
half_life_seconds: float = 30 * 86400
last_updated: float = field(default_factory=time)
def _decay(self, now: float) -> None:
dt = now - self.last_updated
if dt <= 0: return
decay = exp(-log(2) * dt / self.half_life_seconds)
self.alpha *= decay
self.beta *= decay
self.alpha_prior *= decay
self.beta_prior *= decay
self.last_updated = now
def observe(self, success: bool, weight: float = 1.0) -> None:
self._decay(time())
if success: self.alpha += weight
else: self.beta += weight
def expectation(self) -> float:
self._decay(time())
return self.alpha / (self.alpha + self.beta)
def credible_lower_bound(self, confidence: float = 0.95) -> float:
"""Lower edge of the credible interval. The pessimistic estimate."""
self._decay(time())
return beta_dist.ppf(1 - confidence, self.alpha, self.beta)
def sample_size(self) -> float:
"""Observed evidence mass: total counters minus the decayed prior.
Always non-negative. With no observations and any decay, returns 0
(correctly reflecting that we have learned nothing fresh)."""
self._decay(time())
observed = (self.alpha - self.alpha_prior) + (self.beta - self.beta_prior)
return max(0.0, observed)
@dataclass
class AgentReputation:
agent_id: str
accuracy: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=30*86400))
compliance: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=90*86400))
efficiency: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=14*86400))
safety: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=180*86400))
Three details worth noticing:
- The credible lower bound, not the mean, is what you check against thresholds. The mean is what the agent probably deserves; the lower bound is what it at least deserves with high confidence. Granting based on the lower bound forces new agents to actually earn a track record before getting the same privileges as veterans.
- Decay runs lazily on every read. No cron job, no batch job, just a single multiplication on access.
- Each dimension has its own half-life. One global half-life loses information; per-dimension half-lives reflect the actual signal decay rate.
Enforcement: capability tokens with revocation
A score is just a number until something acts on it. The enforcement layer turns scores into hard yes-or-no decisions at the moment an agent tries to do something. The right primitive is a signed capability token: a short-lived, narrowly-scoped authorization the agent must present every time it exercises a privilege.
The token has seven fields:
- jti (token ID): unique per token, used for replay detection
- sub (subject): the agent ID this token authorizes
- aud (audience): the privilege being authorized (e.g., refund:up_to_200)
- scope: JSON-typed bounds: max amount, target tenant, max uses
- iat (issued at): unix timestamp when the token was minted
- exp: expiration timestamp (typically 5–15 minutes after issue)
- sig: Ed25519 signature over the rest of the token, by the policy authority's private key
The signing key lives in an HSM or a KMS, never on the same host as the agents. The snippet below combines the signer and verifier into one class for clarity; production deployments split them: a Signer with the private key behind an HSM/KMS API, and a Verifier exposed to the gateway and to agents holding only the public key. Every privileged action requires a fresh, valid token. Tokens cannot be issued by agents to other agents; only the policy engine mints them.
import json, secrets
from time import time
from dataclasses import dataclass
from nacl.signing import SigningKey, VerifyKey
from nacl.encoding import Base64Encoder
@dataclass(frozen=True)
class CapabilityToken:
jti: str # UUID, replay defense
sub: str # agent_id
aud: str # privilege identifier
scope: dict # {"max_amount": 200, "tenant": "acme", "max_uses": 1}
iat: int # issued-at unix timestamp
exp: int # expires-at unix timestamp
sig: str # base64 Ed25519 signature
class PolicyAuthority:
def __init__(self, signing_key: SigningKey, revocations, replay):
self.key = signing_key
self.revocations = revocations # Set of revoked jti values
self.replay = replay # Set of seen-and-used jti values
def mint(self, agent_id, privilege, scope, ttl_seconds=300) -> CapabilityToken:
now = int(time())
body = {
"jti": secrets.token_urlsafe(16),
"sub": agent_id,
"aud": privilege,
"scope": scope,
"iat": now,
"exp": now + ttl_seconds,
}
sig = self.key.sign(json.dumps(body, sort_keys=True).encode()).signature
return CapabilityToken(**body, sig=Base64Encoder.encode(sig).decode())
def verify_and_consume(self, token: CapabilityToken, agent_id, privilege) -> bool:
# 1. Signature check
body = {k: getattr(token, k) for k in ["jti", "sub", "aud", "scope", "iat", "exp"]}
try:
verify_key = self.key.verify_key
verify_key.verify(json.dumps(body, sort_keys=True).encode(),
Base64Encoder.decode(token.sig))
except Exception:
return False
# 2. Subject and audience must match the requested action
if token.sub != agent_id or token.aud != privilege:
return False
# 3. Time check (with small clock-skew tolerance)
now = int(time())
if not (token.iat - 5 <= now <= token.exp):
return False
# 4. Replay check
if token.jti in self.replay or token.jti in self.revocations:
return False
# 5. Mark consumed (atomic add to a Redis set in production)
self.replay.add(token.jti)
return True
The flow at runtime: an agent requests a privilege; the policy engine checks reputation thresholds, role policy, and any human-approval requirement; on success it mints a token with a tight TTL. The agent presents the token at the tool gateway (or the RAG retriever, or wherever the privileged operation runs); the gateway verifies signature, expiration, audience, subject, and one-time-use, then either executes or rejects. Revocation works by adding the jti to a shared revocation set (Redis or similar), which is checked on every verification.
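What the gateway side of that flow can look like in miniature, reusing the combined PolicyAuthority and CapabilityToken above; the tool name, the privilege string, and execute_refund are placeholders for illustration.
# Gateway-side enforcement in miniature. The privilege string and
# execute_refund are placeholders; in production the gateway holds only a
# Verifier with the public key, per the Signer/Verifier split described above.
class ToolGateway:
    def __init__(self, authority: PolicyAuthority, audit):
        self.authority = authority
        self.audit = audit

    def refund(self, agent_id: str, token: CapabilityToken, amount: float, tenant: str):
        # Signature, expiry, audience, subject, replay: all checked here
        if not self.authority.verify_and_consume(token, agent_id, "refund:up_to_200"):
            self.audit.write(agent_id, "tool.denied", tool="refund", jti=token.jti)
            raise PermissionError("invalid, expired, replayed, or mismatched token")
        # Scope bounds are enforced at execution time, not just at mint time
        if amount > token.scope.get("max_amount", 0) or tenant != token.scope.get("tenant"):
            self.audit.write(agent_id, "tool.denied", tool="refund", reason="scope")
            raise PermissionError("request exceeds token scope")
        self.audit.write(agent_id, "tool.executed", tool="refund", jti=token.jti)
        return execute_refund(amount=amount, tenant=tenant)   # placeholder for the real tool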
Putting it together: the request-decide-mint-execute pipeline
class PrivilegeBroker:
def __init__(self, reputation_store, policy, authority, audit):
self.rep = reputation_store
self.policy = policy
self.authority = authority
self.audit = audit
def request(self, req: PrivilegeRequest) -> PrivilegeGrant | Denial:
rep = self.rep.get(req.agent_id)
# Hard floor: every privilege has minimum-reputation thresholds
# Use the credible LOWER bound (pessimistic), not the mean
thresholds = self.policy.thresholds_for(req.privilege)
checks = {
"safety": rep.safety.credible_lower_bound() >= thresholds.safety,
"compliance": rep.compliance.credible_lower_bound() >= thresholds.compliance,
"accuracy": rep.accuracy.credible_lower_bound() >= thresholds.accuracy,
}
for dim, ok in checks.items():
if not ok:
self.audit.write(req.agent_id, "priv.denied", dim=dim,
priv=req.privilege)
return Denial(reason=f"reputation:{dim}", retry_after="24h")
# Sample-size floor: don't trust thin track records on risky asks
if self.policy.is_high_risk(req.privilege) and rep.safety.sample_size() < 50:
return Denial(reason="insufficient_history")
# Role policy
if not self.policy.role_can_request(req.agent_id, req.privilege):
return Denial(reason="policy:role")
# Human-in-the-loop for explicit high-stakes actions
if self.policy.requires_human(req.privilege, req.scope):
if not human_approval_obtained(req):
return Denial(reason="human_review_pending")
# Mint the capability token
token = self.authority.mint(
agent_id=req.agent_id,
privilege=req.privilege,
scope=req.scope,
ttl_seconds=self.policy.ttl_for(req.privilege))
self.audit.write(req.agent_id, "priv.granted",
priv=req.privilege, jti=token.jti)
return PrivilegeGrant(token=token)
This pipeline is the only path to a capability token. Agents cannot mint tokens. Other agents cannot grant tokens. The decision is deterministic Python that takes the agent's reputation and the policy as input and produces a yes-or-no with a reason.
Closing the loop: turning outcomes into reputation updates
Every action that consumed a token eventually has an outcome: it succeeded, failed, was caught violating policy, leaked PII, exceeded budget, or completed cleanly. Those outcomes feed back into the agent's reputation. The classifier should be deterministic code where possible; an auditing agent only when the outcome can't be checked mechanically.
class OutcomeRecorder:
def __init__(self, rep_store, audit):
self.rep = rep_store
self.audit = audit
def record(self, agent_id, action_id, outcome: ActionOutcome) -> None:
rep = self.rep.get(agent_id)
# Each dimension is updated independently
if outcome.verified_correct is not None:
rep.accuracy.observe(outcome.verified_correct)
rep.compliance.observe(not outcome.policy_violation)
rep.efficiency.observe(not outcome.exceeded_budget)
# Safety failures hit hard. Use weight=10 for incidents to make the
# cost of one failure equivalent to many routine successes
if outcome.safety_incident:
rep.safety.observe(False, weight=10.0)
self.audit.write(agent_id, "safety.incident",
action_id=action_id, severity=outcome.severity)
else:
rep.safety.observe(True)
self.rep.put(agent_id, rep)
The asymmetric weight on safety failures (weight=10.0) is deliberate: one safety incident should erase the credit of many ordinary successes. This matches how human credit ratings work: a single missed payment costs more than many paid-on-time months earn.
One implicit assumption in the math is worth making explicit. The agent under a Beta counter is stochastic, not deterministic: the same agent on the same task can produce different outputs across runs (covered in chapter 02). What the Beta counter measures, then, is the fraction of the agent's output distribution that meets the success criterion, not a fixed yes-or-no ability. This works because the Beta posterior is a distribution over that underlying success fraction; the credible-lower-bound rule already accounts for the resulting uncertainty, penalizing an agent whose track record is thin or erratic as heavily as one whose mean is simply low. Set thresholds with this in mind: an agent with a 0.95 mean and high variance is not the same as one with a 0.85 mean and tight variance, even though their lower bounds may match.
Where do those outcomes actually come from?
The code above takes outcome.verified_correct, outcome.policy_violation, and a few other booleans as input. The previous section was silent on where those booleans come from. That silence is the gap: a reputation system is only as good as the signals feeding it, and signals are not free, not all equally reliable, and not all equally hard to game. This section names the four kinds of signals, what each one is good for, and where each one breaks.
Pick the right signal for each dimension, not the easiest one. Picking the easiest one is how you end up with reputations that look great on a dashboard while the real behavior gets steadily worse.
| Signal type | What it is | Reliability | Gaming surface | Best for |
|---|---|---|---|---|
| Deterministic | The API returned 200; the JSON validated; the test suite passed; the SQL query executed without error | High | Narrow. Cannot be gamed without cooperation from the system being checked. | Efficiency, basic correctness, schema compliance |
| Rule-based | A regex matched the output; a guard fired; a budget cap was hit; an output failed schema validation | Medium-high | The agent learns the exact rule and produces output that satisfies the rule but not its intent | Compliance, safety blocks, format validation |
| Model-judged | An auditor LLM scores the output; a critic agent flags concerns | Medium-low | The agent learns to write outputs the judge likes rather than outputs that are correct | Subjective correctness, tone, helpfulness, fairness checks |
| Human | A user thumbed up; a support agent marked the ticket resolved; an operator approved | Variable | Sycophancy, friendliness, sample bias, fatigue effects on the human | Final ground truth for ambiguous cases; periodic calibration of the other three |
Two practical rules fall out of this table.
Match signal type to dimension. Use deterministic and rule-based signals for compliance and efficiency: these are the dimensions where the question "did the agent break a rule?" has a precise answer that does not require judgment. Use model-judged signals for accuracy on subjective tasks where there is no programmatic check. Use human signals sparingly, for periodic calibration of the other three rather than as the primary feed. Human signals are too expensive and too noisy to drive every reputation update, but they are the only signal that catches Goodhart-style drift where the rule-based signals all say "fine" and the underlying behavior has quietly degraded.
Multiple signals per dimension, not one. No single signal is reliable enough to be the only input to a dimension. A safety dimension that only listens to "did a guard fire?" misses every failure that did not happen to match a guard pattern. A safety dimension that combines guard fires plus periodic auditor-LLM review plus monthly human review is much harder to game, because the agent would have to fool all three simultaneously. The implementation is straightforward: each observe() call carries a source tag, the dimension stores per-source counters, and the operator dashboard shows the breakdown so divergence between signal sources is itself a flag.
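One way to implement the per-source breakdown, as a sketch built on the BetaCounter from earlier (the names and the max-minus-min divergence measure are illustrative choices, not the only option):
# Per-source counters for one dimension, built on the BetaCounter from earlier.
# The privilege gate reads the combined counter; the dashboard reads the
# per-source breakdown, and divergence between sources is itself a flag.
from dataclasses import dataclass, field

@dataclass
class SourcedDimension:
    combined: BetaCounter = field(default_factory=BetaCounter)
    by_source: dict[str, BetaCounter] = field(default_factory=dict)

    def observe(self, success: bool, source: str, weight: float = 1.0) -> None:
        # In practice, pass the dimension's half-life to every counter created here
        self.combined.observe(success, weight)
        self.by_source.setdefault(source, BetaCounter()).observe(success, weight)

    def divergence(self) -> float:
        """Spread between the most and least optimistic sources. Rule-based
        saying 'fine' while human review says otherwise shows up here."""
        means = [c.expectation() for c in self.by_source.values()]
        return max(means) - min(means) if len(means) >= 2 else 0.0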
How agents try to game reputation, and what you can do about it
The threat-model table later in this chapter lists Sybil attacks, reputation laundering, and long-game gaming. Those are the obvious cases. The subtle cases come from the agent learning to optimize the signal rather than the underlying behavior. Six concrete attacks, each with a defense that is straightforward to implement.
- Selective attempt. The agent only takes on tasks it expects to succeed at and refuses or escalates the hard ones. Success ratio looks great; the system as a whole is doing less work and pushing the difficult cases to humans. Defense: Track attempt rate per task class alongside success rate. A high success rate combined with a low attempt rate on hard tasks is a flag. Reputation should reward "attempted and succeeded" more than "did not attempt." (A sketch of attempt-rate tracking follows this list.)
- Surface compliance (Goodhart). The agent satisfies the literal rule the signal checks for, but violates the spirit. The schema validates; the response is content-free. The guard pattern is avoided; the underlying behavior is still wrong. Defense: Periodic human spot-checks on a random sample, scored independently of the rule-based signals. When the rule-based score and the human score diverge over time on the same agent, the rule-based score has been Goodharted and needs to be revised.
- Judge gaming. When a model-judged signal is in the loop, the agent eventually learns features of the judge's preferences that have nothing to do with the underlying quality. Verbose explanations, hedging language, polite refusals. Defense: Rotate the judge model on a schedule (different judge instance every quarter, or random selection from a pool). Score under the new judge should be similar to the old; large drops suggest the agent had been gaming the previous judge specifically.
- Outcome attribution gaming. In multi-agent workflows, when something goes well, every participating agent can plausibly claim credit; when something goes badly, every agent can plausibly point at another. Attribution becomes adversarial. Defense: Record a precise contribution ledger per task. Which agent's tool call produced which intermediate state, in which order. The hash-chained audit log described in the audit-log section below makes this deterministic. Reputation updates only attribute outcomes to the agents whose contribution was on the critical path.
- Ground-truth poisoning. If outcomes are derived from data sources the agent can influence (logs the agent writes, fields the agent updates, internal state the agent reads back), the agent can shape the inputs to its own reputation update. Defense: Outcome verification reads from sources the agent had no path to write. Customer-facing metrics, externally-observed system state, sources gated by separate trust domains. If the agent can influence the input to its own reputation, the reputation is not measuring what you think it is.
- Time-window gaming. The agent behaves perfectly in evaluation windows (Monday morning batch reviews, end-of-quarter audits) and worse the rest of the time. With a half-life, recent good behavior dominates the score. Defense: Random sampling of actions for evaluation, not scheduled. The agent cannot tell which actions will be inspected. This also surfaces the related problem of evaluation contamination: agents can detect they are being evaluated from prompt features, and behave differently. Make evaluation indistinguishable from normal operation by construction.
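The first defense above, attempt-rate tracking per task class, in sketch form; the thresholds and the class names are illustrative and should come from your own traffic.
# Attempt-rate tracking per task class. A high success rate paired with a low
# attempt rate on a class of tasks is the selective-attempt signature.
from dataclasses import dataclass, field

@dataclass
class TaskClassStats:
    offered: int = 0     # tasks routed to the agent in this class
    attempted: int = 0   # tasks the agent actually took on
    succeeded: int = 0

@dataclass
class AttemptTracker:
    stats: dict[str, TaskClassStats] = field(default_factory=dict)

    def record(self, task_class: str, attempted: bool, succeeded: bool = False) -> None:
        s = self.stats.setdefault(task_class, TaskClassStats())
        s.offered += 1
        if attempted:
            s.attempted += 1
            if succeeded:
                s.succeeded += 1

    def flags(self, min_attempt_rate: float = 0.5, min_volume: int = 20) -> list[str]:
        """Task classes where the agent looks great but rarely shows up."""
        flagged = []
        for cls, s in self.stats.items():
            if s.offered < min_volume or s.attempted == 0:
                continue
            attempt_rate = s.attempted / s.offered
            success_rate = s.succeeded / s.attempted
            if attempt_rate < min_attempt_rate and success_rate > 0.9:
                flagged.append(cls)
        return flagged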
The signal of last resort: random human spot-check
Across all six attacks above, one defense recurs: random human review. It is expensive, it is slow, and it is the only signal that is not learnable by an agent in any cost-effective way. A human spot-check on one out of every fifty actions, scored against the same dimensions as the automated signals, is the calibration anchor that keeps everything else honest.
Three rules to make the spot-check work:
- Random, not scheduled. If the agent can predict when a spot-check happens, the agent can behave differently when it expects one.
- Indistinguishable from normal operation. The agent should not be able to tell from the prompt or context that this particular action is being reviewed. A reviewer looking at the audit log later does not change the agent's behavior at the time of the action.
- Reviewers see the same evidence the dashboard sees, plus the underlying artifact. The reviewer should be looking at the actual response, the actual tool calls, the actual customer (with PII redacted as needed), not just the booleans the dashboard reduced them to. This is what catches the rule-based signal saying "fine" when the underlying behavior is wrong.
A practical cadence: one in fifty for routine actions, one in ten for high-stakes actions (refunds over a threshold, account modifications, anything that would be expensive to undo), all of them for actions that crossed a guard. The numbers are starting points; the calibration recipe from the earlier section in this chapter applies here too. Run the cadence for a quarter, measure how often human review disagrees with the automated signals, and tune up or down based on the disagreement rate.
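A sketch of the sampling side, under the assumption that review candidates are drawn from the audit log after the fact (so nothing at execution time signals a review) and that each row carries a risk_class and a crossed_guard field; both field names are illustrative.
# Spot-check selection happens after the fact, from the audit log, so nothing
# about the action itself signals a review. risk_class and crossed_guard are
# assumed field names; the rates are the starting points from the text.
import random

REVIEW_RATES = {"routine": 1 / 50, "high_stakes": 1 / 10}

def select_for_review(audit_rows: list[dict], rng: random.Random | None = None) -> list[dict]:
    rng = rng or random.SystemRandom()
    selected = []
    for row in audit_rows:
        if row.get("crossed_guard"):
            selected.append(row)        # everything that crossed a guard gets reviewed
        elif rng.random() < REVIEW_RATES.get(row.get("risk_class", "routine"), 1 / 50):
            selected.append(row)
    return selected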
What does reputation actually attach to?
The code above tracks reputation by agent_id, which is convenient but slightly wrong. An agent instance only exists for one workflow, then it's gone. So two questions: if you spin the agent up again with the same prompt and tools, should the reputation start over? And if a malicious party spins up ten thousand instances, each starting clean, what stops them from drowning your trust signal?
The right answer is to track reputation against the configuration, not the instance. A configuration is the deployment recipe: prompt template, tool list, model, policy bundle, version. Hash these together and you get a stable fingerprint. A thousand instances with the same fingerprint share one reputation pool. A new instance with a different fingerprint, even one that differs by a single character of the prompt, has to earn reputation from scratch.
import hashlib, json
from dataclasses import dataclass
@dataclass(frozen=True)
class AgentConfig:
"""The deployment recipe. Reputation attaches to its hash."""
prompt_template: str
tool_names: tuple
model_id: str # e.g. "claude-sonnet-4-5-20250929"
policy_bundle_id: str
version: str
def fingerprint(self) -> str:
body = json.dumps({
"prompt": self.prompt_template,
"tools": sorted(self.tool_names),
"model": self.model_id,
"policy": self.policy_bundle_id,
"version": self.version,
}, sort_keys=True)
return hashlib.sha256(body.encode()).hexdigest()[:16]
# OutcomeRecorder.record() now keys on fingerprint, not agent_id.
def record(self, config: AgentConfig, action_id, outcome: ActionOutcome):
rep = self.rep.get(config.fingerprint())
# ... rest of update logic unchanged
Two things fall out of this. First, Sybil resistance gets easier. Spinning up new instances costs nothing reputationally; spinning up new configurations means starting from zero. If an attacker wants to bypass reputation, they have to publish a different prompt or use different tools, and you can see what they changed.
Second, you can compare configurations that are nearly the same. The v3 prompt with the same tools and model is one tweak away from the v2 prompt; their fingerprints are different but their reputations should be close. The gap between them tells you whether the new prompt is a regression. Most teams won't need anything fancier than this. The point is that reputation now lives on something stable enough to learn from across small changes, not on a per-instance counter that resets every workflow.
Multi-tenant reputation isolation
The configuration fingerprint above answers "which agent is this?" It does not answer "in whose context?" If you run an agent platform where multiple tenants deploy the same configuration to serve their own workloads, a single global reputation per fingerprint is wrong. Tenant A's billing specialist might be performing well for tenant A's customers and badly for tenant B's, and the right action is to gate privileges on tenant B without penalizing the same configuration on tenant A.
The fix is to slice reputation by tenant context, not just by configuration. The composite key becomes (configuration_fingerprint, tenant_id, task_class). The trust engine maintains a separate BetaCounter per slice. Privilege checks at request time look up the slice that matches the current request, not the global average. This is the same pattern chapter 21 (The 2026 frontier) recommends as contextual reputation; it belongs in the trust chapter too because the issue shows up the moment you have more than one customer.
@dataclass(frozen=True)
class ReputationKey:
config_fingerprint: str # the agent's configuration
tenant_id: str # whose workload
task_class: str # what kind of task ("billing_refund", etc.)
def to_string(self) -> str:
return f"{self.config_fingerprint}:{self.tenant_id}:{self.task_class}"
class SlicedReputation:
def __init__(self):
self.slices: dict[ReputationKey, AgentReputation] = {}
def get(self, key: ReputationKey) -> AgentReputation:
if key not in self.slices:
self.slices[key] = AgentReputation()
return self.slices[key]
Three things to know about this in practice:
- Cold-start gets harder. A new tenant brings every slice back to zero, which is the right behavior (you have no signal for that tenant) but operationally annoying. The standard mitigation is a fallback ladder, sketched below: if no per-tenant slice exists, look up the per-task-class slice across tenants; if that is also empty, fall back to the configuration-only slice. Each fallback is treated as a weaker signal and gets a higher uncertainty band.
- Cross-tenant contamination has to be explicitly prevented. A misbehaving agent on tenant A must not lower the reputation that drives gating decisions on tenant B. The slicing handles this by construction; do not "average across tenants" for any production decision.
- Operators still want a global view. The aggregated reputation across all tenants is useful for the platform team (which configurations are systematically failing?), but should not feed back into per-request privilege gates. Compute it on a separate read path; do not mix it with the per-slice counters that drive enforcement.
Single-tenant systems can ignore the tenant axis and still benefit from slicing by task class, which catches the same kind of failure on a different boundary: an agent that handles refunds well but cancellations poorly should have different reputation on each.
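The fallback ladder from the first item above, in sketch form. Taking the first matching slice at each level keeps the sketch short; a production version would pool evidence across all matching slices and widen the uncertainty band at each step down.
# Fallback ladder for cold-start slices. Each step down is a weaker signal;
# the caller should widen the uncertainty band (or raise the threshold)
# as fallback_level grows.
from dataclasses import dataclass

@dataclass(frozen=True)
class SlicedLookup:
    reputation: AgentReputation
    fallback_level: int   # 0 exact, 1 task class across tenants, 2 config only, 3 fresh prior

def lookup_with_fallback(store: SlicedReputation, key: ReputationKey) -> SlicedLookup:
    if key in store.slices:                                   # 0. exact slice
        return SlicedLookup(store.slices[key], 0)
    for k, rep in store.slices.items():                       # 1. same config + task class
        if k.config_fingerprint == key.config_fingerprint and k.task_class == key.task_class:
            return SlicedLookup(rep, 1)
    for k, rep in store.slices.items():                       # 2. same config only
        if k.config_fingerprint == key.config_fingerprint:
            return SlicedLookup(rep, 2)
    return SlicedLookup(store.get(key), 3)                    # 3. nothing known: Beta(1,1)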
Threat model: what attacks this defends against, and what it doesn't
| Attack | How it works | What stops it here |
|---|---|---|
| Forged token | Attacker creates a fake capability token for a privilege the agent doesn't have | Ed25519 signature verification with key in HSM; sig fails, token rejected |
| Token replay | Attacker reuses a captured valid token after the action has already happened | One-time jti tracking in a shared replay set; second use rejected |
| Token stretching | Agent or attacker uses a refund token to call a different privileged endpoint | Audience (aud) must exactly match; mismatched aud rejected |
| Compromised agent | Prompt-injected agent tries to escalate beyond its baseline | Policy engine checks reputation + role + human approval; agent has no path to mint |
| Sybil attack | Many fake agents vouch for one bad agent to inflate its reputation | Reputation updates come from outcomes, not other agents' votes; cross-agent correlation analysis below |
| Reputation laundering | An agent burns a low-rep identity, spawns a new one, repeats | Stake-based identity (cost to create); minimum sample-size before high-risk privileges |
| Long-game gaming | Agent behaves perfectly for months to build reputation, then defects on a high-stakes action | Multi-dimensional scoring (safety has 6-month half-life); stake/slashing for high-value actions; sample-size doesn't unlock destructive privileges |
| Compromised audit pipeline | Attacker corrupts the outcome recorder to never log incidents | NOT defended here. Audit log integrity needs append-only storage, separate trust domain, and external monitoring |
| Selective attempt | Agent only takes on tasks it expects to win, refusing or escalating the hard ones to keep success ratio inflated | Track attempt rate per task class alongside success rate; reputation rewards attempted-and-succeeded over not-attempted |
| Surface compliance (Goodhart) | Output satisfies the literal rule the signal checks for but violates the spirit; schema validates and content is empty | Random human spot-checks scored independently; divergence between rule-based and human signals over time flags the rule for revision |
| Judge gaming | When a model-judged signal is in the loop, agent learns features of the judge's preferences (verbosity, hedging) that are unrelated to quality | Rotate the judge model on a schedule; large score drops under the new judge flag agents that had been gaming the previous one |
| Outcome attribution gaming | In multi-agent workflows, every agent claims credit for successes and points elsewhere for failures | Hash-chained contribution ledger; reputation updates only attribute to agents on the critical path of the action that produced the outcome |
| Ground-truth poisoning | Outcomes are derived from data the agent can write; agent shapes the inputs to its own reputation update | Outcome verification reads only from sources the agent had no path to write (customer-facing metrics, separate trust domains) |
| Time-window gaming | Agent behaves perfectly during evaluation windows and worse the rest of the time; with a half-life, the recent good behavior dominates | Random sampling for evaluation, not scheduled; evaluation prompts indistinguishable from normal traffic |
Sybil detection: when "good" votes lie
If your reputation system uses any kind of cross-agent feedback (one agent rating another's output), you have to defend against Sybil attacks where an attacker creates many fake "good" agents to vouch for one bad one. The mitigation has two layers.
Layer 1: weight votes by the voter's own reputation. A new agent's vote carries less weight than a veteran's. This makes Sybil attacks expensive: the attacker has to first build genuine reputation on many fake identities before any vote means anything.
Layer 2: correlation analysis. Compute a similarity metric between agents' rating histories. Genuine raters disagree often; Sybil clusters agree suspiciously often. If a group of agents shows pairwise rating correlation above some threshold (say r > 0.9) on overlapping items, treat them as a single voting bloc and weight the bloc's votes accordingly.
import numpy as np
from scipy.stats import pearsonr
def detect_sybil_clusters(rating_matrix, agent_ids, threshold=0.9):
"""rating_matrix[i][j] = agent i's rating of item j (NaN where no rating).
agent_ids[i] = the string ID of the agent at row i. Returns clusters
of agent_ids (not row indices) so downstream code stays in ID space."""
n_agents = rating_matrix.shape[0]
if len(agent_ids) != n_agents:
raise ValueError("agent_ids must align with rating_matrix rows")
clusters = []
visited = set()
for i in range(n_agents):
if i in visited: continue
cluster_idx = {i}
for j in range(i + 1, n_agents):
if j in visited: continue
# Need at least 10 overlapping items to compute meaningful correlation
mask = ~np.isnan(rating_matrix[i]) & ~np.isnan(rating_matrix[j])
if mask.sum() < 10: continue
r, _ = pearsonr(rating_matrix[i][mask], rating_matrix[j][mask])
if r > threshold:
cluster_idx.add(j)
if len(cluster_idx) > 1:
clusters.append({agent_ids[k] for k in cluster_idx})
visited |= cluster_idx
return clusters
# Detected clusters get treated as a single voter, not many voters.
# `votes` is keyed by agent_id; `clusters` is a list of agent_id sets.
def apply_cluster_weighting(votes, clusters):
for cluster in clusters:
cluster_weight = 1.0 / len(cluster)
for agent_id in cluster:
if agent_id in votes:
votes[agent_id] *= cluster_weight
return votes
Run this analysis as a periodic batch job (hourly or daily, depending on traffic). Genuine voting populations have correlation distributions that look approximately bell-shaped. Sybil clusters show up as a sharp spike at the high end. Alert when the distribution shifts.
The audit log: append-only, signed, separate
Every privilege grant, every action consumed, every outcome recorded should land in an audit log that the agents themselves cannot tamper with. Three properties matter:
- Append-only. No row can ever be deleted or modified. Use a write-ahead log, an immutable log service (CloudWatch Logs with deletion disabled, BigQuery with INSERT-only IAM, etc.), or a Merkle-chained store.
- Hash-chained. Each row carries a SHA-256 of the previous row plus its own contents. Tampering breaks the chain and is detectable on replay. (Signed entries are stronger but require key management; for many deployments a hash chain is enough, especially if the chain head is anchored to a separate durable store.)
- Separate trust domain. The audit log lives in infrastructure the agents have no path to write directly. Even an agent that fully escapes its sandbox shouldn't be able to corrupt the audit trail.
import hashlib, json
from dataclasses import dataclass, asdict
@dataclass(frozen=True)
class AuditEntry:
seq: int
timestamp: float
agent_id: str
event: str
payload: dict
prev_hash: str
def chain_hash(entry: AuditEntry) -> str:
body = {"seq": entry.seq, "timestamp": entry.timestamp,
"agent_id": entry.agent_id, "event": entry.event,
"payload": entry.payload, "prev_hash": entry.prev_hash}
return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
def verify_chain(entries: list[AuditEntry]) -> bool:
"""Walk the chain; any tampering breaks it. Returns False on the first
mismatch. To upgrade to signed entries, add a per-entry signature over
chain_hash(entry) using a key held in an HSM, and verify here."""
prev = "0" * 64 # genesis
for e in entries:
if e.prev_hash != prev: return False
prev = chain_hash(e)
return True
What this whole system gets you
Together: multi-dimensional reputation with time decay drives a policy engine that mints signed, time-bounded, action-bounded capability tokens which the tool gateway and RAG retriever verify before executing, with all events landing in a tamper-resistant audit log that feeds outcomes back into reputation.
Properties that fall out of this design:
- Default-deny. No privilege check, no action. The agent has to actively earn each operation.
- Bounded blast radius. A compromised agent can only do what its current valid tokens allow, for the few minutes those tokens are valid.
- Reversible. Revoking a jti kills any unconsumed token immediately. Lowering an agent's reputation below the threshold prevents new tokens from being minted.
- Auditable. Every action has a token; every token has a grant record; every grant has the reputation snapshot that produced it; every outcome has an audit row. End-to-end provenance from incident to root cause is queryable.
- Quantitatively comparable. "Why did agent A get this privilege and not agent B?" is answerable as "A's safety lower-bound was 0.94; B's was 0.71; the threshold was 0.85". Not opinion.
Three views of the same score: opacity, legibility, and the nudge question
Two questions sit in tension once the trust engine is running. Should the agent see its own score? If yes, it can game the score by adjusting surface behavior; if no, the system feels like a black box and humans cannot debug why a privilege was denied. Should the agent be nudged when it heads toward a denied action? If yes, the nudge itself is a leak that teaches the agent what the score cares about; if no, the agent flails uselessly and the operator still has to debug.
Treating these as two separate questions is the trap. They are the same question with one answer: pick what to expose, to whom, in what form. Three views of the same underlying score, each calibrated to what its consumer actually needs.
The "do we ever nudge" question falls out of this directly. Yes, nudge, but with policy decisions, not score values. When the agent is about to attempt a denied action, the nudge says "this action requires refund.high_value privilege which is not currently granted to you for this task class." The agent has something to act on (give up, escalate to a human, pick a different approach) without learning anything that helps it raise the underlying number through performative behavior.
Three rules that keep the split honest:
- Reason codes are stable, not numerical. Use a small enum of reason codes (privilege_not_granted, insufficient_sample_size, tenant_slice_blocked, recent_safety_incident) and never include the score in what the agent receives. Reason codes are stable across small score movements; the agent sees a clean signal, not a noisy gradient it can climb. A sketch follows this list.
- Operator dashboard access is itself audited. If an operator can pull a score, that pull is logged, alongside who pulled it and why. This stops the dashboard from becoming a side channel for "ask the operator what my score is."
- Reason codes are not personally identifiable. Two agents that both got privilege_not_granted for the same privilege should see identical text. Never templated with values from the score, never personalized in a way that turns the reason code into a leak.
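A sketch of the reason-code enum from the first rule, with an agent-facing payload that carries no score and no personalization; the function name is illustrative.
# Reason codes as a closed enum: stable, score-free, identical for every agent
# that hits the same condition for the same privilege.
from enum import Enum

class DenialReason(str, Enum):
    PRIVILEGE_NOT_GRANTED = "privilege_not_granted"
    INSUFFICIENT_SAMPLE_SIZE = "insufficient_sample_size"
    TENANT_SLICE_BLOCKED = "tenant_slice_blocked"
    RECENT_SAFETY_INCIDENT = "recent_safety_incident"

def agent_facing_denial(reason: DenialReason, privilege: str) -> dict:
    # No score, no threshold value, no personalization: the same privilege and
    # reason always produce byte-identical text.
    return {"granted": False,
            "reason": reason.value,
            "message": f"privilege '{privilege}' is not currently available for this task"}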
Practical advice
- Start with Constraint, then Claim+Brief, then Reputation. Sandboxes and signed agent identities first. Reputation only after you have a stable population of agents with measurable behavior.
- Use the credible lower bound, not the mean. The mean is what the agent probably deserves; the lower bound is what it at least deserves. Granting on the lower bound forces real track record.
- Score multiple dimensions with different half-lives. Safety incidents fade in 6 months; efficiency fades in 2 weeks. One global half-life loses information.
- Time-bound and action-bound every grant. Default 5–15 minute TTLs and explicit max_uses. Never mint a token that lasts a session.
- Three views of the score, not one. The agent gets a policy decision and a stable reason code. The operator gets the rich score on an audited dashboard. The stored counters are the source of truth and never user-facing. See the section above for why this resolves the opacity-versus-legibility tension and how to nudge an agent without leaking the score.
- Match each signal to its dimension; never use only one signal per dimension. Deterministic signals for compliance and efficiency, model-judged for subjective accuracy, human review as the calibration anchor. Combining sources and watching where they diverge is what catches the Goodhart problem before it accumulates. The signals section earlier in this chapter has the table.
- Random human spot-checks at one in fifty, one in ten for high-stakes. The only signal that is not learnable by an agent in any cost-effective way. Random not scheduled; indistinguishable from normal operation; the reviewer sees the actual artifact, not just the booleans. Tune the rate based on how often human review disagrees with the automated signals.
- Sign and chain the audit log. A tamper-evident audit trail is what lets you reconstruct what happened after an incident. Spreadsheet-style logs are not enough.
- Drill the slashing path. If you have stake-based mechanisms, run quarterly chaos tests where you simulate misbehavior and confirm the slash actually fires. Untested security systems eventually fail silently.
- Treat RAG access like database access. List corpora per agent; filter by classification; audit retrieval. "RAG enabled" without scope is a data-exfiltration channel waiting for prompt injection (Prompt injection 2026).
This chapter handles the question of what the agent is allowed to do. The orthogonal question is what the agent can claim about itself. An agent can hallucinate that it has a token it does not have, or that it accepts a classification it does not accept. Chapter 10 (When the agent itself is wrong) covers the three external checks (capability registry, pinned ask, tool gate) that close that gap. Both layers are necessary; this one says what the agent may do, and chapter 10 makes sure the agent cannot lie about what it is.