12 Trust, privileges & RAG · what your agent is allowed to do

Not every agent should be allowed to do everything.

By the time you're running more than a handful of agents, "what is this agent allowed to do?" stops being a small detail and becomes one of the central questions of the system. A junior support agent shouldn't be able to issue refunds. A research agent shouldn't be able to send emails. An external partner's agent definitely shouldn't have read access to your customer database.

Most teams start with a simple model: each agent has a fixed list of tools it can call, set when the agent is created, and that's the whole story. This works for small systems. It falls apart for bigger ones because real work isn't that neat. Sometimes the support agent does need refund authority for one specific case. Sometimes the research agent does need to send one email. The hard part is doing this safely.

A useful 2025 survey on inter-agent trust (Inter-Agent Trust Models, arXiv 2025) identified six distinct mechanisms that production systems use to decide whether an agent should be trusted with a given action. We'll cover all six, build a working behavior-tracking statistic with the actual math, and walk through the enforcement layer that turns the score into something an agent can't ignore.

Pre-config vs post-config: the two kinds of privilege

Privileges break naturally into two categories:

Pre-config privileges are fixed when the agent is created and baked into its deployment: the tool list, the corpora it may search, the budget caps. Changing them means changing the agent's configuration. Post-config privileges are granted at runtime, for a specific task, with a scope and an expiry: refund authority for one ticket, send-email rights for one message.

Pre-config is your security floor. Post-config is your flexibility. A system with only pre-config is rigid (every edge case requires a code deploy). A system with only post-config has no floor (the agent can talk its way into anything).
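As a minimal sketch (all names here are illustrative, not from a particular framework), the split can be modeled as a frozen allow-list set at deploy time plus runtime grants that carry an expiry:

```python
from dataclasses import dataclass, field
from time import time

@dataclass
class AgentPrivileges:
    # Pre-config: immutable once the agent is deployed
    pre_config: frozenset[str]
    # Post-config: privilege -> expiry timestamp, granted at runtime
    post_config: dict[str, float] = field(default_factory=dict)

    def grant_temporary(self, privilege: str, ttl_seconds: float) -> None:
        self.post_config[privilege] = time() + ttl_seconds

    def allowed(self, privilege: str) -> bool:
        if privilege in self.pre_config:
            return True
        expiry = self.post_config.get(privilege)
        return expiry is not None and time() < expiry

priv = AgentPrivileges(pre_config=frozenset({"search_docs"}))
priv.grant_temporary("issue_refund", ttl_seconds=300)   # one case, five minutes
```

The point of the shape: a temporary grant expires on its own, so the security floor reasserts itself without anyone having to remember to revoke.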

RAG access is a privilege too

A RAG (Retrieval-Augmented Generation) system lets agents query a knowledge base and use the results in their answer. "Which knowledge base" is itself a privilege decision, and an underrated one. A customer-facing agent that can search support documentation is low-risk. The same agent searching internal Slack archives is a data-exfiltration channel waiting for a prompt-injection to open it. Same tools, different corpus, completely different risk profile.

RAG access should be scoped exactly like tool access:

# RAG access wrapped in privilege checks with full audit trail
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class RAGScope:
    agent_id: str
    tenant_id: str
    corpora: frozenset[str]
    classifications: frozenset[str]   # e.g., {"public", "internal"}

class ScopedRetriever:
    def __init__(self, scope: RAGScope, store, audit):
        self.scope = scope
        self.store = store
        self.audit = audit

    def retrieve(self, query: str, corpus: str, k: int = 5) -> list:
        # 1. Reject before touching the index
        if corpus not in self.scope.corpora:
            # Stable digest: builtin hash() is salted per process, so it
            # can't correlate audit entries across runs
            q_hash = hashlib.sha256(query.encode()).hexdigest()[:16]
            self.audit.write(self.scope.agent_id, "rag.denied.corpus",
                             corpus=corpus, query_hash=q_hash)
            raise PermissionError(f"agent has no access to '{corpus}'")

        # 2. Query is namespaced by tenant; filter by classification at index level
        results = self.store.search(
            query=query,
            corpus=corpus,
            tenant=self.scope.tenant_id,
            allowed_classifications=self.scope.classifications,
            top_k=k)

        # 3. Defense in depth: re-filter post-retrieval in case the store
        #    has stale ACLs or the index leaks across classifications
        results = [r for r in results if r.classification in self.scope.classifications]

        # 4. Audit what the AGENT WILL SEE, not just what it cites
        self.audit.write(self.scope.agent_id, "rag.retrieved",
                         corpus=corpus, doc_ids=[r.id for r in results])
        return results

The six trust mechanisms

The 2025 survey (Inter-Agent Trust Models, arXiv 2025) categorized six different ways production systems decide whether to trust an agent. Each has different costs, different failure modes, and different sweet spots.

1 · Brief
A signed statement from a third party you trust: "this agent has property X". Verifiable by signature; revocable via CRL or short TTL.
Cert: subject=agent_123, claim=model:claude-4.7, sig=...
2 · Claim
The agent describes itself. Cheap, but you must trust the speaker. The Agent Card in A2A (Yang 2025) is a Claim.
{"name": "BookingAgent", "skills": ["flights"]}
3 · Proof
Cryptographic evidence an action happened or an agent has a property. Hardest to fake. Examples: ZK proofs, TEE attestations.
A signed receipt that this code ran in an SGX enclave.
4 · Stake
Agent puts up something it loses on misbehavior. Money, tokens, reputation. Bad behavior triggers automatic slashing.
Stake 100 credits; lose 10 per rule violation.
5 · Reputation
A score (or set of scores) built from historical behavior. Updated by the system, visible to whoever needs to decide.
"Agent has 4.7/5 across 12,408 interactions."
6 · Constraint
Don't trust at all. Box the agent in so it can only do safe things. Sandboxes, allow-lists, output schemas.
Container: no network, read-only FS, syscall filter.

Real systems combine these. A typical stack: Constraint first (the agent is sandboxed), then Claim (it tells you what it does), then Brief (a signed cert backing the claim), then Reputation (track how it actually behaves), and Stake only for high-value actions where misbehavior must be financially expensive. Proof is the heaviest mechanism and shows up where regulators or counterparties demand it.

The behavior-tracking statistic

Here's where it gets technical. We need a number (or several numbers) that summarize how an agent has been behaving and update incrementally as new evidence comes in. The statistic must be: cheap to update, resistant to gaming, and fast to query at decision time.

The right tool is the Beta distribution. Track each agent's behavior as a pair of counters (α, β) where α is "good outcomes" and β is "bad outcomes". The expected reputation is then:

E[r] = α / (α + β)

Var[r] = αβ / [(α + β)² · (α + β + 1)]

In English: the agent's reputation is the fraction of good outcomes among all observed outcomes, with the variance shrinking as more observations come in. A new agent with no history starts at α=1, β=1 (the Beta(1,1) prior, also called the uniform prior), giving an expected reputation of 0.5 with high uncertainty. After 100 successes and 5 failures the counters become α=101, β=6, so the expected reputation is 101/107 ≈ 0.944 with much lower variance, and the score is now well-supported.
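The arithmetic is easy to check directly. This sketch uses scipy's Beta distribution, and also previews the pessimistic credible lower bound that the thresholds later in the chapter gate on:

```python
from scipy.stats import beta

# Brand-new agent: the Beta(1, 1) uniform prior
new_mean = 1 / (1 + 1)                 # 0.5: no evidence either way
new_lower = beta.ppf(0.05, 1, 1)       # 0.05: the pessimistic 95% lower bound

# Veteran: 100 successes, 5 failures -> Beta(101, 6)
vet_mean = 101 / (101 + 6)             # ~0.944
vet_lower = beta.ppf(0.05, 101, 6)     # ~0.90: the evidence backs the score
```

Both agents have a defined score, but only the veteran's lower bound clears a threshold like 0.85; the newcomer has to earn its way up.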

Why this and not a simple running average? Three reasons. First, the pair (α, β) carries uncertainty along with the point estimate: a 0.9 built on ten observations is distinguishable from a 0.9 built on ten thousand. Second, the Beta(1,1) prior gives a brand-new agent a defined, appropriately uncertain starting point instead of an undefined 0/0. Third, the posterior supports credible intervals, so decisions can gate on a pessimistic lower bound rather than the mean.

The decay term: why old behavior should fade

Without decay, an agent that misbehaves once gets penalized forever; an agent that built reputation a year ago coasts on stale credit. Apply exponential decay to both counters at update time:

α_new = α_old · e^(-λ · Δt) + good_observations
β_new = β_old · e^(-λ · Δt) + bad_observations

where λ = ln(2) / half_life

In English: scale both counters down by an exponential factor based on how much time has passed since the last update, then add the new evidence. Pick a half-life that matches your domain (e.g., 90 days means an event from 90 days ago counts half as much as today's). Recent behavior dominates; old behavior fades smoothly without any one-time "expiry" cliff.
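Plugging in numbers: with a 90-day half-life, an agent that goes quiet for exactly one half-life keeps its expected reputation but loses half its evidence mass, which is what widens the credible interval:

```python
from math import exp, log

half_life_days = 90
lam = log(2) / half_life_days

decay = exp(-lam * 90)          # one full half-life elapsed -> 0.5
alpha = 101 * decay             # 50.5: evidence mass halves...
beta = 6 * decay                # 3.0
mean = alpha / (alpha + beta)   # ...but the expectation stays ~0.944,
                                # while the variance grows (lower bound drops)
```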

Multi-dimensional scoring

A single trust score gets gamed because it creates a single optimization target. The 2025 Dynamic Reputation Filtering work DRF, arXiv 2025 recommends scoring across multiple independent dimensions. Track at least these four:

| Dimension | What counts as success | What counts as failure | Half-life |
|---|---|---|---|
| Accuracy | output verified correct (passes tests, matches ground truth) | verified incorrect | 30 days |
| Compliance | followed policy, used only allowed tools, respected schemas | any guardrail block, any policy violation | 90 days |
| Efficiency | completed within budget (tokens, time, tool calls) | exceeded budget, hit iteration cap | 14 days |
| Safety | never attempted a denied action, no PII leaks, no IPI signals | any safety incident, however minor | 180 days |

Notice the half-lives differ. Efficiency degrades fast; an agent that was efficient last month might be slower today as the task changes. Safety incidents stick around six months because one safety failure is a strong signal you can't ignore.

Two honest caveats about this table. First, four axes is an answer, not the answer. Accuracy, compliance, efficiency, and safety cover the obvious cases for a general-purpose system, but real production systems extend or split this set when the domain demands. A platform serving regulated industries usually adds a fairness axis with its own threshold and audit story (decisions that disadvantage a protected group are tracked separately from compliance failures, and slashed harder). A code-agent platform usually splits accuracy into correctness (does the code do what was asked) and security (does the code introduce a vulnerability), because the two have different signal sources, different decay rates, and different consequences when they fail. Pick the axes that match what your system can actually observe and what your operators actually need to gate on; do not pick four because the example used four.

Second, the half-lives in the table are reasonable starting points, not derivations. The next section covers how to pick them from data instead of intuition.

Picking half-lives from data, not from intuition

The table above lists 30, 90, 14, and 180 days. Those numbers are not magic; they are the values that have worked across the systems the authors have shipped. Picking the right half-life for your own system is an offline analysis over your own incident history, and it has a clean recipe.

The question a half-life answers is: how long ago does past behavior on this dimension stop being useful for predicting future behavior? If accuracy failures from six months ago are just as predictive of accuracy failures next week as failures from six days ago, your half-life should be very long (or there is no decay at all). If failures from a month ago are uncorrelated with failures next week, your half-life should be short.

The recipe, run once per dimension:

  1. Pull the last twelve months of outcomes for this dimension. One row per agent per outcome: timestamp, agent_id, success or failure. The longer the history, the better; six months is a workable minimum.
  2. Split the history at a chosen midpoint. Earlier half is "history"; later half is "ground truth."
  3. For a grid of candidate half-lives (say 1, 7, 14, 30, 60, 90, 180, 365 days), compute each agent's reputation at the split point using only the history with that decay constant.
  4. For each candidate, measure how well that reputation predicts the ground-truth half. The simplest metric is rank correlation between reputation at the split and observed success rate after the split. Brier score works too if you want a calibration-aware metric.
  5. Pick the half-life that maximizes prediction quality. If the curve is flat, you can use any value in the flat region; pick the longer one because longer half-lives reduce volatility.
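The steps above can be sketched as a grid search. All names here are illustrative (`reputation_at`, `best_half_life`, the event-tuple shape); events are (timestamp, agent_id, success) tuples and the metric is Spearman rank correlation:

```python
from collections import defaultdict
from math import exp, log
from scipy.stats import spearmanr

def reputation_at(events, split_ts, half_life_days):
    """Decayed Beta mean per agent, using only events before split_ts."""
    lam = log(2) / (half_life_days * 86400)
    counters = defaultdict(lambda: [1.0, 1.0])           # Beta(1,1) prior
    for ts, agent, success in events:
        if ts >= split_ts:
            continue
        w = exp(-lam * (split_ts - ts))                  # decay to the split point
        counters[agent][0 if success else 1] += w
    return {a: alpha / (alpha + beta) for a, (alpha, beta) in counters.items()}

def success_rate_after(events, split_ts):
    """Observed success rate per agent in the ground-truth half."""
    totals = defaultdict(lambda: [0, 0])
    for ts, agent, success in events:
        if ts < split_ts:
            continue
        totals[agent][0] += int(success)
        totals[agent][1] += 1
    return {a: s / n for a, (s, n) in totals.items() if n > 0}

def best_half_life(events, split_ts,
                   candidates=(1, 7, 14, 30, 60, 90, 180, 365)):
    """Pick the half-life whose reputation best ranks future outcomes."""
    truth = success_rate_after(events, split_ts)
    scores = {}
    for hl in candidates:
        rep = reputation_at(events, split_ts, hl)
        agents = [a for a in truth if a in rep]
        rho, _ = spearmanr([rep[a] for a in agents], [truth[a] for a in agents])
        if rho == rho:                                   # skip NaN (degenerate slice)
            scores[hl] = rho
    return max(scores, key=scores.get), scores
```

On a flat score curve, prefer the longest candidate in the flat region, as the recipe says: longer half-lives mean less day-to-day volatility.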

Run this once per quarter. The right half-life drifts as your system changes (a new model release can shift accuracy decay; a guardrail upgrade can shift compliance decay), and stale half-lives quietly degrade the trust engine's resolution. The 30/90/14/180 starting points are not bad defaults to begin with, but they should be the input to your first calibration, not the answer to it.

The composite "should I grant this privilege?" question becomes a per-privilege threshold check across the dimensions:

from dataclasses import dataclass, field
from math import exp, log
from time import time
from scipy.stats import beta as beta_dist

@dataclass
class BetaCounter:
    """Beta-distributed reputation counter with exponential time decay.
    The prior is Beta(1,1). When alpha and beta decay, so does the
    prior's nominal contribution; we track the decayed prior separately
    so sample_size() reports actual observed mass, not raw alpha+beta-2
    (which goes negative once decay shrinks the original prior below 1)."""
    alpha: float = 1.0
    beta: float = 1.0
    alpha_prior: float = 1.0   # decayed prior contribution to alpha
    beta_prior: float = 1.0    # decayed prior contribution to beta
    half_life_seconds: float = 30 * 86400
    last_updated: float = field(default_factory=time)

    def _decay(self, now: float) -> None:
        dt = now - self.last_updated
        if dt <= 0: return
        decay = exp(-log(2) * dt / self.half_life_seconds)
        self.alpha *= decay
        self.beta *= decay
        self.alpha_prior *= decay
        self.beta_prior *= decay
        self.last_updated = now

    def observe(self, success: bool, weight: float = 1.0) -> None:
        self._decay(time())
        if success: self.alpha += weight
        else:       self.beta  += weight

    def expectation(self) -> float:
        self._decay(time())
        return self.alpha / (self.alpha + self.beta)

    def credible_lower_bound(self, confidence: float = 0.95) -> float:
        """Lower edge of the credible interval. The pessimistic estimate."""
        self._decay(time())
        return beta_dist.ppf(1 - confidence, self.alpha, self.beta)

    def sample_size(self) -> float:
        """Observed evidence mass: total counters minus the decayed prior.
        Always non-negative. With no observations and any decay, returns 0
        (correctly reflecting that we have learned nothing fresh)."""
        self._decay(time())
        observed = (self.alpha - self.alpha_prior) + (self.beta - self.beta_prior)
        return max(0.0, observed)

@dataclass
class AgentReputation:
    agent_id: str
    accuracy:   BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=30*86400))
    compliance: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=90*86400))
    efficiency: BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=14*86400))
    safety:     BetaCounter = field(default_factory=lambda: BetaCounter(half_life_seconds=180*86400))

Three details worth noticing:

  1. The credible lower bound, not the mean, is what you check against thresholds. The mean is what the agent probably deserves; the lower bound is what it at least deserves with high confidence. Granting based on the lower bound forces new agents to actually earn a track record before getting the same privileges as veterans.
  2. Decay runs lazily on every read. No cron job, no batch job, just a single multiplication on access.
  3. Each dimension has its own half-life. One global half-life loses information; per-dimension half-lives reflect the actual decay rate of each signal.

Enforcement: capability tokens with revocation

A score is just a number until something acts on it. The enforcement layer turns scores into hard yes-or-no decisions at the moment an agent tries to do something. The right primitive is a signed capability token: a short-lived, narrowly-scoped authorization the agent must present every time it exercises a privilege.

The token has six fields plus a signature: jti (a unique ID for replay defense), sub (the agent it was minted for), aud (the privilege it authorizes), scope (per-privilege limits such as a maximum amount), iat (issued-at), and exp (expires-at).

The signing key lives in an HSM or a KMS, never on the same host as the agents. The snippet below combines the signer and verifier into one class for clarity; production deployments split them: a Signer with the private key behind an HSM/KMS API, and a Verifier exposed to the gateway and to agents holding only the public key. Every privileged action requires a fresh, valid token. Tokens cannot be issued by agents to other agents; only the policy engine mints them.

import json, secrets
from dataclasses import dataclass
from time import time
from nacl.signing import SigningKey
from nacl.encoding import Base64Encoder

@dataclass(frozen=True)
class CapabilityToken:
    jti: str       # UUID, replay defense
    sub: str       # agent_id
    aud: str       # privilege identifier
    scope: dict    # {"max_amount": 200, "tenant": "acme", "max_uses": 1}
    iat: int       # issued-at unix timestamp
    exp: int       # expires-at unix timestamp
    sig: str       # base64 Ed25519 signature

class PolicyAuthority:
    def __init__(self, signing_key: SigningKey, revocations, replay):
        self.key = signing_key
        self.revocations = revocations  # Set of revoked jti values
        self.replay = replay            # Set of seen-and-used jti values

    def mint(self, agent_id, privilege, scope, ttl_seconds=300) -> CapabilityToken:
        now = int(time())
        body = {
            "jti": secrets.token_urlsafe(16),
            "sub": agent_id,
            "aud": privilege,
            "scope": scope,
            "iat": now,
            "exp": now + ttl_seconds,
        }
        sig = self.key.sign(json.dumps(body, sort_keys=True).encode()).signature
        return CapabilityToken(**body, sig=Base64Encoder.encode(sig).decode())

    def verify_and_consume(self, token: CapabilityToken, agent_id, privilege) -> bool:
        # 1. Signature check
        body = {k: getattr(token, k) for k in ["jti", "sub", "aud", "scope", "iat", "exp"]}
        try:
            verify_key = self.key.verify_key
            verify_key.verify(json.dumps(body, sort_keys=True).encode(),
                              Base64Encoder.decode(token.sig))
        except Exception:
            return False

        # 2. Subject and audience must match the requested action
        if token.sub != agent_id or token.aud != privilege:
            return False

        # 3. Time check (with small clock-skew tolerance)
        now = int(time())
        if not (token.iat - 5 <= now <= token.exp):
            return False

        # 4. Replay check
        if token.jti in self.replay or token.jti in self.revocations:
            return False

        # 5. Mark consumed (atomic add to a Redis set in production)
        self.replay.add(token.jti)
        return True

The flow at runtime: an agent requests a privilege; the policy engine checks reputation thresholds, role policy, and any human-approval requirement; on success it mints a token with a tight TTL. The agent presents the token at the tool gateway (or the RAG retriever, or wherever the privileged operation runs); the gateway verifies signature, expiration, audience, subject, and one-time use, then either executes or rejects. Revocation works by adding the jti to a shared revocation set (Redis or similar), which is checked on every verification.

Putting it together: the request-decide-mint-execute pipeline

class PrivilegeBroker:
    def __init__(self, reputation_store, policy, authority, audit):
        self.rep = reputation_store
        self.policy = policy
        self.authority = authority
        self.audit = audit

    def request(self, req: PrivilegeRequest) -> PrivilegeGrant | Denial:
        rep = self.rep.get(req.agent_id)

        # Hard floor: every privilege has minimum-reputation thresholds
        # Use the credible LOWER bound (pessimistic), not the mean
        thresholds = self.policy.thresholds_for(req.privilege)
        checks = {
            "safety":     rep.safety.credible_lower_bound()     >= thresholds.safety,
            "compliance": rep.compliance.credible_lower_bound() >= thresholds.compliance,
            "accuracy":   rep.accuracy.credible_lower_bound()   >= thresholds.accuracy,
        }
        for dim, ok in checks.items():
            if not ok:
                self.audit.write(req.agent_id, "priv.denied", dim=dim,
                                 priv=req.privilege)
                return Denial(reason=f"reputation:{dim}", retry_after="24h")

        # Sample-size floor: don't trust thin track records on risky asks
        if self.policy.is_high_risk(req.privilege) and rep.safety.sample_size() < 50:
            return Denial(reason="insufficient_history")

        # Role policy
        if not self.policy.role_can_request(req.agent_id, req.privilege):
            return Denial(reason="policy:role")

        # Human-in-the-loop for explicit high-stakes actions
        if self.policy.requires_human(req.privilege, req.scope):
            if not human_approval_obtained(req):
                return Denial(reason="human_review_pending")

        # Mint the capability token
        token = self.authority.mint(
            agent_id=req.agent_id,
            privilege=req.privilege,
            scope=req.scope,
            ttl_seconds=self.policy.ttl_for(req.privilege))
        self.audit.write(req.agent_id, "priv.granted",
                         priv=req.privilege, jti=token.jti)
        return PrivilegeGrant(token=token)

This pipeline is the only path to a capability token. Agents cannot mint tokens. Other agents cannot grant tokens. The decision is deterministic Python that takes the agent's reputation and the policy as input and produces a yes-or-no with a reason.

Closing the loop: turning outcomes into reputation updates

Every action that consumed a token eventually has an outcome: it succeeded, failed, was caught violating policy, leaked PII, exceeded budget, or completed cleanly. Those outcomes feed back into the agent's reputation. The classifier should be deterministic code where possible; an auditing agent only when the outcome can't be checked mechanically.

class OutcomeRecorder:
    def __init__(self, rep_store, audit):
        self.rep = rep_store
        self.audit = audit

    def record(self, agent_id, action_id, outcome: ActionOutcome) -> None:
        rep = self.rep.get(agent_id)

        # Each dimension is updated independently
        if outcome.verified_correct is not None:
            rep.accuracy.observe(outcome.verified_correct)

        rep.compliance.observe(not outcome.policy_violation)
        rep.efficiency.observe(not outcome.exceeded_budget)

        # Safety failures hit hard. Use weight=10 for incidents to make the
        # cost of one failure equivalent to many routine successes
        if outcome.safety_incident:
            rep.safety.observe(False, weight=10.0)
            self.audit.write(agent_id, "safety.incident",
                             action_id=action_id, severity=outcome.severity)
        else:
            rep.safety.observe(True)

        self.rep.put(agent_id, rep)

The asymmetric weight on safety failures (weight=10.0) is deliberate: one safety incident should erase the credit of many ordinary successes. This matches how human credit ratings work: a single missed payment costs more than many paid-on-time months earn.
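The arithmetic behind the weight, as a quick check (Beta(1,1) prior, decay ignored for simplicity): fifty clean actions followed by one weight-10 incident drops the mean from roughly 0.98 to roughly 0.82, below most plausible safety thresholds.

```python
alpha, beta = 1.0, 1.0                 # Beta(1,1) prior
alpha += 50                            # fifty clean actions
mean_before = alpha / (alpha + beta)   # 51/52 ~ 0.98
beta += 10.0                           # one incident at weight=10
mean_after = alpha / (alpha + beta)    # 51/62 ~ 0.82
```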

One implicit assumption in the math is worth making explicit. The agent under a Beta counter is stochastic, not deterministic: the same agent on the same task can produce different outputs across runs (covered in chapter 02). What the Beta counter measures, then, is the fraction of the agent's output distribution that meets the success criterion, not a fixed yes-or-no ability. This works correctly because the Beta posterior is a distribution over that success probability; the credible-lower-bound rule already accounts for uncertainty, penalizing agents whose track record is thin or noisy as well as agents whose mean is low. Set thresholds aware of this: an agent with a 0.95 mean and high variance is not the same as one with a 0.85 mean and tight variance, even though their lower bounds may match.

Where do those outcomes actually come from?

The code above takes outcome.verified_correct, outcome.policy_violation, and a few other booleans as input. The previous section was silent on where those booleans come from. That silence is the gap: a reputation system is only as good as the signals feeding it, and signals are not free, not all equally reliable, and not all equally hard to game. This section names the four kinds of signals, what each one is good for, and where each one breaks.

Pick the right signal for each dimension, not the easiest one. Picking the easiest one is how you end up with reputations that look great on a dashboard while the real behavior gets steadily worse.

| Signal type | What it is | Reliability | Gaming surface | Best for |
|---|---|---|---|---|
| Deterministic | The API returned 200; the JSON validated; the test suite passed; the SQL query executed without error | High | Narrow: cannot be gamed without cooperation from the system being checked | Efficiency, basic correctness, schema compliance |
| Rule-based | A regex matched the output; a guard fired; a budget cap was hit; an output failed schema validation | Medium-high | The agent learns the exact rule and produces output that satisfies the rule but not its intent | Compliance, safety blocks, format validation |
| Model-judged | An auditor LLM scores the output; a critic agent flags concerns | Medium-low | The agent learns to write outputs the judge likes rather than outputs that are correct | Subjective correctness, tone, helpfulness, fairness checks |
| Human | A user thumbed up; a support agent marked the ticket resolved; an operator approved | Variable | Sycophancy, friendliness, sample bias, fatigue effects on the human | Final ground truth for ambiguous cases; periodic calibration of the other three |

Two practical rules fall out of this table.

Match signal type to dimension. Use deterministic and rule-based signals for compliance and efficiency: these are the dimensions where the question "did the agent break a rule?" has a precise answer that does not require judgment. Use model-judged signals for accuracy on subjective tasks where there is no programmatic check. Use human signals sparingly, for periodic calibration of the other three rather than as the primary feed. Human signals are too expensive and too noisy to drive every reputation update, but they are the only signal that catches Goodhart-style drift where the rule-based signals all say "fine" and the underlying behavior has quietly degraded.

Multiple signals per dimension, not one. No single signal is reliable enough to be the only input to a dimension. A safety dimension that only listens to "did a guard fire?" misses every failure that did not happen to match a guard pattern. A safety dimension that combines guard fires plus periodic auditor-LLM review plus monthly human review is much harder to game, because the agent would have to fool all three simultaneously. The implementation is straightforward: each observe() call carries a source tag, the dimension stores per-source counters, and the operator dashboard shows the breakdown so divergence between signal sources is itself a flag.
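A minimal sketch of the per-source idea (names illustrative; the decayed BetaCounter from earlier would slot in where SimpleCounter sits). The operator-facing flag is the divergence between signal sources:

```python
from dataclasses import dataclass, field

@dataclass
class SimpleCounter:
    alpha: float = 1.0
    beta: float = 1.0
    def observe(self, success: bool) -> None:
        if success: self.alpha += 1
        else:       self.beta += 1
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

@dataclass
class MultiSourceDimension:
    sources: dict[str, SimpleCounter] = field(default_factory=dict)

    def observe(self, success: bool, source: str) -> None:
        self.sources.setdefault(source, SimpleCounter()).observe(success)

    def divergence(self) -> float:
        """Spread between the most and least optimistic sources."""
        means = [c.mean() for c in self.sources.values()]
        return max(means) - min(means) if means else 0.0

safety = MultiSourceDimension()
for _ in range(40):
    safety.observe(True, source="guard")        # rule-based: all clear
for ok in [True, False, False, True]:
    safety.observe(ok, source="auditor_llm")    # model-judged: concerns
# Guards say ~0.98, the auditor says 0.5: the gap itself is the alarm.
```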

How agents try to game reputation, and what you can do about it

The threat-model table at the end of this chapter lists Sybil attacks, reputation laundering, and long-game gaming. Those are the obvious cases. The subtle cases come from the agent learning to optimize the signal rather than the underlying behavior. Six concrete attacks, each with a defense that is straightforward to implement.

The signal of last resort: random human spot-check

Across all six attacks above, one defense recurs: random human review. It is expensive, it is slow, and it is the only signal that is not learnable by an agent in any cost-effective way. A human spot-check on one out of every fifty actions, scored against the same dimensions as the automated signals, is the calibration anchor that keeps everything else honest.

Three rules to make the spot-check work:

A practical cadence: one in fifty for routine actions, one in ten for high-stakes actions (refunds over a threshold, account modifications, anything that would be expensive to undo), all of them for actions that crossed a guard. The numbers are starting points; the calibration recipe from the earlier section in this chapter applies here too. Run the cadence for a quarter, measure how often human review disagrees with the automated signals, and tune up or down based on the disagreement rate.
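The cadence can be written down as a tiny sampling policy; the probabilities mirror the numbers above, and the function shape and field names are illustrative:

```python
import random

def needs_human_review(action: dict, rng=random.random) -> bool:
    if action.get("crossed_guard"):
        return True                   # always review actions that crossed a guard
    if action.get("high_stakes"):
        return rng() < 1 / 10         # one in ten for high-stakes actions
    return rng() < 1 / 50             # one in fifty for routine actions
```

Injecting the random source (`rng`) keeps the policy testable and lets you audit that the sampling rate actually matches the stated cadence.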

What does reputation actually attach to?

The code above tracks reputation by agent_id, which is convenient but slightly wrong. An agent instance only exists for one workflow, then it's gone. So two questions: if you spin the agent up again with the same prompt and tools, should the reputation start over? And if a malicious party spins up ten thousand instances, each starting clean, what stops them from drowning your trust signal?

The right answer is to track reputation against the configuration, not the instance. A configuration is the deployment recipe: prompt template, tool list, model, policy bundle, version. Hash these together and you get a stable fingerprint. A thousand instances with the same fingerprint share one reputation pool. A new instance with a different fingerprint, even one that differs by a single character of the prompt, has to earn reputation from scratch.

import hashlib, json
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    """The deployment recipe. Reputation attaches to its hash."""
    prompt_template: str
    tool_names: tuple
    model_id: str           # e.g. "claude-sonnet-4-5-20250929"
    policy_bundle_id: str
    version: str

    def fingerprint(self) -> str:
        body = json.dumps({
            "prompt": self.prompt_template,
            "tools": sorted(self.tool_names),
            "model": self.model_id,
            "policy": self.policy_bundle_id,
            "version": self.version,
        }, sort_keys=True)
        return hashlib.sha256(body.encode()).hexdigest()[:16]

# OutcomeRecorder.record() now keys on fingerprint, not agent_id.
def record(self, config: AgentConfig, action_id, outcome: ActionOutcome):
    rep = self.rep.get(config.fingerprint())
    # ... rest of update logic unchanged

Two things fall out of this. First, Sybil resistance gets easier. Spinning up new instances costs nothing reputationally; spinning up new configurations means starting from zero. If an attacker wants to bypass reputation, they have to publish a different prompt or use different tools, and you can see what they changed.

Second, you can compare configurations that are nearly the same. The v3 prompt with the same tools and model is one tweak away from the v2 prompt; their fingerprints are different but their reputations should be close. The gap between them tells you whether the new prompt is a regression. Most teams won't need anything fancier than this. The point is that reputation now lives on something stable enough to learn from across small changes, not on a per-instance counter that resets every workflow.
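As a sketch of that comparison (the store layout, fingerprints, and scores here are hypothetical), the regression check is just the gap between the two configurations' reputations on a dimension:

```python
def regression_gap(rep_store: dict, fp_old: str, fp_new: str,
                   dim: str = "accuracy") -> float:
    """Positive gap: the new configuration is underperforming the old one."""
    return rep_store[fp_old][dim] - rep_store[fp_new][dim]

rep_store = {
    "a1b2": {"accuracy": 0.94},   # v2 prompt, well established
    "c3d4": {"accuracy": 0.71},   # v3 prompt, early evidence
}
gap = regression_gap(rep_store, "a1b2", "c3d4")   # 0.23: worth a look
```

In practice you would compare credible lower bounds rather than point estimates, since the newer fingerprint has less evidence behind it.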

Multi-tenant reputation isolation

The configuration fingerprint above answers "which agent is this?" It does not answer "in whose context?" If you run an agent platform where multiple tenants deploy the same configuration to serve their own workloads, a single global reputation per fingerprint is wrong. Tenant A's billing specialist might be performing well for tenant A's customers and badly for tenant B's, and the right action is to gate privileges on tenant B without penalizing the same configuration on tenant A.

The fix is to slice reputation by tenant context, not just by configuration. The composite key becomes (configuration_fingerprint, tenant_id, task_class). The trust engine maintains a separate BetaCounter per slice. Privilege checks at request time look up the slice that matches the current request, not the global average. This is the same pattern chapter 21 (The 2026 frontier) recommends as contextual reputation; it belongs in the trust chapter too because the issue shows up the moment you have more than one customer.

@dataclass(frozen=True)
class ReputationKey:
    config_fingerprint: str    # the agent's configuration
    tenant_id: str             # whose workload
    task_class: str            # what kind of task ("billing_refund", etc.)

    def to_string(self) -> str:
        return f"{self.config_fingerprint}:{self.tenant_id}:{self.task_class}"


class SlicedReputation:
    def __init__(self):
        self.slices: dict[ReputationKey, AgentReputation] = {}

    def get(self, key: ReputationKey) -> AgentReputation:
        if key not in self.slices:
            self.slices[key] = AgentReputation()
        return self.slices[key]

One thing to know about this in practice: single-tenant systems can ignore the tenant axis and still benefit from slicing by task class, which catches the same kind of failure on a different boundary. An agent that handles refunds well but cancellations poorly should have a different reputation for each.
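A quick usage sketch of the slicing idea, with a minimal single-counter stand-in for AgentReputation (the real class carries the full per-dimension Beta counters):

```python
from dataclasses import dataclass

@dataclass
class MiniReputation:
    # Stand-in for the chapter's AgentReputation: one Beta(1, 1) counter.
    successes: int = 0
    failures: int = 0

    def record(self, ok: bool) -> None:
        if ok:
            self.successes += 1
        else:
            self.failures += 1

    def mean(self) -> float:
        return (self.successes + 1) / (self.successes + self.failures + 2)

# Slices keyed by (fingerprint, tenant, task_class), mirroring ReputationKey.
slices: dict[tuple[str, str, str], MiniReputation] = {}

def slice_for(fingerprint: str, tenant: str, task: str) -> MiniReputation:
    return slices.setdefault((fingerprint, tenant, task), MiniReputation())

# Same configuration, two tenants: outcomes diverge, reputations diverge.
for _ in range(20):
    slice_for("cfg-v3", "tenant_a", "billing_refund").record(True)
    slice_for("cfg-v3", "tenant_b", "billing_refund").record(False)

a = slice_for("cfg-v3", "tenant_a", "billing_refund").mean()  # high
b = slice_for("cfg-v3", "tenant_b", "billing_refund").mean()  # low
```

A global average over both tenants would sit near 0.5 and gate (or fail to gate) both of them wrongly; the sliced lookup gates tenant B without penalizing tenant A.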

Threat model: what attacks this defends against, and what it doesn't

| Attack | How it works | What stops it here |
| --- | --- | --- |
| Forged token | Attacker creates a fake capability token for a privilege the agent doesn't have | Ed25519 signature verification with key in HSM; sig fails, token rejected |
| Token replay | Attacker reuses a captured valid token after the action has already happened | One-time jti tracking in a shared replay set; second use rejected |
| Token stretching | Agent or attacker uses a refund token to call a different privileged endpoint | Audience (aud) must exactly match; mismatched aud rejected |
| Compromised agent | Prompt-injected agent tries to escalate beyond its baseline | Policy engine checks reputation + role + human approval; agent has no path to mint |
| Sybil attack | Many fake agents vouch for one bad agent to inflate its reputation | Reputation updates come from outcomes, not other agents' votes; cross-agent correlation analysis below |
| Reputation laundering | An agent burns a low-rep identity, spawns a new one, repeats | Stake-based identity (cost to create); minimum sample size before high-risk privileges |
| Long-game gaming | Agent behaves perfectly for months to build reputation, then defects on a high-stakes action | Multi-dimensional scoring (safety has a 6-month half-life); stake/slashing for high-value actions; sample size doesn't unlock destructive privileges |
| Compromised audit pipeline | Attacker corrupts the outcome recorder to never log incidents | NOT defended here. Audit log integrity needs append-only storage, a separate trust domain, and external monitoring |
| Selective attempt | Agent only takes on tasks it expects to win, refusing or escalating the hard ones to keep its success ratio inflated | Track attempt rate per task class alongside success rate; reputation rewards attempted-and-succeeded over not-attempted |
| Surface compliance (Goodhart) | Output satisfies the literal rule the signal checks for but violates the spirit; schema validates and content is empty | Random human spot-checks scored independently; divergence between rule-based and human signals over time flags the rule for revision |
| Judge gaming | When a model-judged signal is in the loop, the agent learns features of the judge's preferences (verbosity, hedging) that are unrelated to quality | Rotate the judge model on a schedule; large score drops under the new judge flag agents that had been gaming the previous one |
| Outcome attribution gaming | In multi-agent workflows, every agent claims credit for successes and points elsewhere for failures | Hash-chained contribution ledger; reputation updates only attribute to agents on the critical path of the action that produced the outcome |
| Ground-truth poisoning | Outcomes are derived from data the agent can write; agent shapes the inputs to its own reputation update | Outcome verification reads only from sources the agent had no path to write (customer-facing metrics, separate trust domains) |
| Time-window gaming | Agent behaves perfectly during evaluation windows and worse the rest of the time; with a half-life, the recent good behavior dominates | Random sampling for evaluation, not scheduled; evaluation prompts indistinguishable from normal traffic |

Sybil detection: when "good" votes lie

If your reputation system uses any kind of cross-agent feedback (one agent rating another's output), you have to defend against Sybil attacks where an attacker creates many fake "good" agents to vouch for one bad one. The mitigation has two layers.

Layer 1: weight votes by the voter's own reputation. A new agent's vote carries less weight than a veteran's. This makes Sybil attacks expensive: the attacker has to first build genuine reputation on many fake identities before any vote means anything.
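Layer 1 can be sketched in a few lines. The maturity constant and the multiplicative combination are assumptions to tune, not prescriptions:

```python
def vote_weight(successes: int, failures: int, maturity: int = 50) -> float:
    """Weight a voter's rating by its own track record. Two factors:
    the voter's Beta(1, 1) posterior mean (is this voter usually right?)
    and a maturity ramp (a brand-new identity carries almost no weight,
    which is what makes Sybil identities expensive to mint)."""
    n = successes + failures
    mean = (successes + 1) / (n + 2)   # posterior mean under Beta(1, 1)
    ramp = min(1.0, n / maturity)      # 0 for a fresh agent, 1 at maturity
    return mean * ramp

# A fresh Sybil identity vs. an established good rater:
w_new = vote_weight(0, 0)      # 0.0: a brand-new identity's vote is worthless
w_vet = vote_weight(95, 5)     # close to its ~0.94 posterior mean
```

The ramp is the Sybil tax: every fake identity has to earn ~50 genuine outcomes before its vote counts for anything, at which point it is no longer cheap.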

Layer 2: correlation analysis. Compute a similarity metric between agents' rating histories. Genuine raters disagree often; Sybil clusters agree suspiciously often. If a group of agents shows pairwise rating correlation above some threshold (say r > 0.9) on overlapping items, treat them as a single voting bloc and weight the bloc's votes accordingly.

import numpy as np
from scipy.stats import pearsonr

def detect_sybil_clusters(rating_matrix, agent_ids, threshold=0.9):
    """rating_matrix[i][j] = agent i's rating of item j (NaN where no rating).
    agent_ids[i] = the string ID of the agent at row i. Returns clusters
    of agent_ids (not row indices) so downstream code stays in ID space."""
    n_agents = rating_matrix.shape[0]
    if len(agent_ids) != n_agents:
        raise ValueError("agent_ids must align with rating_matrix rows")
    clusters = []
    visited = set()

    for i in range(n_agents):
        if i in visited: continue
        cluster_idx = {i}
        for j in range(i + 1, n_agents):
            if j in visited: continue
            # Need at least 10 overlapping items to compute meaningful correlation
            mask = ~np.isnan(rating_matrix[i]) & ~np.isnan(rating_matrix[j])
            if mask.sum() < 10: continue
            r, _ = pearsonr(rating_matrix[i][mask], rating_matrix[j][mask])
            if r > threshold:
                cluster_idx.add(j)
        if len(cluster_idx) > 1:
            clusters.append({agent_ids[k] for k in cluster_idx})
            visited |= cluster_idx
    return clusters

# Detected clusters get treated as a single voter, not many voters.
# `votes` is keyed by agent_id; `clusters` is a list of agent_id sets.
def apply_cluster_weighting(votes, clusters):
    for cluster in clusters:
        cluster_weight = 1.0 / len(cluster)
        for agent_id in cluster:
            if agent_id in votes:
                votes[agent_id] *= cluster_weight
    return votes

Run this analysis as a periodic batch job (hourly or daily, depending on traffic). Genuine voting populations have correlation distributions that look approximately bell-shaped. Sybil clusters show up as a sharp spike at the high end. Alert when the distribution shifts.
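A sketch of the distribution check for that batch job: rather than inspecting the full histogram, track the fraction of agent pairs above the high-correlation threshold and alert when it jumps against the baseline. Thresholds here are the same assumed values as above:

```python
import numpy as np

def high_correlation_fraction(rating_matrix: np.ndarray,
                              min_overlap: int = 10,
                              high: float = 0.9) -> float:
    """Fraction of comparable agent pairs whose rating correlation exceeds
    `high`. A genuine population keeps this near zero; a Sybil cluster
    shows up as a spike of pairs above it."""
    n = rating_matrix.shape[0]
    total, flagged = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            mask = ~np.isnan(rating_matrix[i]) & ~np.isnan(rating_matrix[j])
            if mask.sum() < min_overlap:
                continue  # not enough overlap for a meaningful correlation
            r = np.corrcoef(rating_matrix[i][mask], rating_matrix[j][mask])[0, 1]
            total += 1
            if r > high:
                flagged += 1
    return flagged / total if total else 0.0

# Three independent raters plus one duplicated (perfectly correlated) pair:
rng = np.random.default_rng(0)
base = rng.normal(size=(3, 20))
clone = np.tile(rng.normal(size=(1, 20)), (2, 1))
frac = high_correlation_fraction(np.vstack([base, clone]))  # the clone pair is flagged
```

Persist the fraction per batch run; a step change from its historical baseline is the alert condition.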

The audit log: append-only, signed, separate

Every privilege grant, every action consumed, every outcome recorded should land in an audit log that the agents themselves cannot tamper with. Three properties matter: it is append-only (entries are added, never rewritten), entries are hash-chained and signed (tampering is detectable), and it lives in a separate trust domain (no agent has a write path to it).

import hashlib, json
from dataclasses import dataclass

@dataclass(frozen=True)
class AuditEntry:
    seq: int
    timestamp: float
    agent_id: str
    event: str
    payload: dict
    prev_hash: str

def chain_hash(entry: AuditEntry) -> str:
    body = {"seq": entry.seq, "timestamp": entry.timestamp,
            "agent_id": entry.agent_id, "event": entry.event,
            "payload": entry.payload, "prev_hash": entry.prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def verify_chain(entries: list[AuditEntry]) -> bool:
    """Walk the chain; any tampering breaks it. Returns False on the first
    mismatch. To upgrade to signed entries, add a per-entry signature over
    chain_hash(entry) using a key held in an HSM, and verify here."""
    prev = "0" * 64  # genesis
    for e in entries:
        if e.prev_hash != prev: return False
        prev = chain_hash(e)
    return True
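A quick demonstration of the tamper-evidence property, with the entry and hash function restated in compact form so it runs standalone (the real entries also carry timestamp and payload):

```python
import hashlib, json
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Entry:
    seq: int
    agent_id: str
    event: str
    prev_hash: str

def h(e: Entry) -> str:
    body = {"seq": e.seq, "agent_id": e.agent_id,
            "event": e.event, "prev_hash": e.prev_hash}
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def verify(entries: list[Entry]) -> bool:
    prev = "0" * 64  # genesis
    for e in entries:
        if e.prev_hash != prev:
            return False
        prev = h(e)
    return True

# Build a three-entry chain, then silently rewrite the middle entry.
chain, prev = [], "0" * 64
for seq, event in enumerate(["grant", "consume", "outcome"]):
    e = Entry(seq, "agent-1", event, prev)
    chain.append(e)
    prev = h(e)

assert verify(chain)                                    # intact chain passes
chain[1] = replace(chain[1], event="nothing_happened")  # rewrite history
assert not verify(chain)                                # breaks at entry 2
```

The break is detected at the entry after the tampered one: entry 2's prev_hash was computed from the original entry 1, and no rewrite of entry 1 can reproduce it without also rewriting every later entry, which the append-only store and external monitoring are there to prevent.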

What this whole system gets you

Put together: multi-dimensional reputation with time decay drives a policy engine that mints signed, time-bounded, action-bounded capability tokens; the tool gateway and RAG retriever verify each token before executing; and every event lands in a tamper-resistant audit log that feeds outcomes back into reputation.

Three views of the same score: opacity, legibility, and the nudge question

Two questions sit in tension once the trust engine is running. Should the agent see its own score? If yes, it can game the score by adjusting surface behavior; if no, the system feels like a black box and humans cannot debug why a privilege was denied. Should the agent be nudged when it heads toward a denied action? If yes, the nudge itself is a leak that teaches the agent what the score cares about; if no, the agent flails uselessly and the operator still has to debug.

Treating these as two separate questions is the trap. They are the same question with one answer: pick what to expose, to whom, in what form. Three views of the same underlying score, each calibrated to what its consumer actually needs.

Stored view (rich and opaque)
The full Beta-distributed counters per dimension per slice. Numbers, posteriors, history. Lives in the trust engine; never user-facing. Consumed by enforcement code at request time and by the audit log at incident time. The agent never sees this. The operator never sees this directly either; they see it through the operator dashboard.
Operator view (rich and audited)
A dashboard showing scores per agent per dimension, with the policy decision and the threshold next to it. Plus the recent outcomes that moved the score. Plus the contributing slices for sliced reputation. This is how a human answers "why was this privilege denied?" Access is logged so the dashboard itself cannot be used as a side channel for an agent operator to read out the score.
Agent view (binary or coarse)
The agent receives the policy decision, not the score. Allowed or denied; if denied, a stable reason code and a short human-readable explanation. The agent learns "I do not have privilege X for this kind of task." It does not learn "my safety score is 0.89 and the threshold is 0.85." The first is operationally useful and unfakeable. The second is a metric the agent can chase.
Why the split works
The agent gets enough information to self-correct (do not ask for what you cannot have) but not enough to game (it has no number to optimize against). The operator gets full debuggability without exposing the score to the workload. The stored view is the source of truth and lives behind both interfaces. Three views, one number underneath.

The "do we ever nudge" question falls out of this directly. Yes, nudge, but with policy decisions, not score values. When the agent is about to attempt a denied action, the nudge says "this action requires refund.high_value privilege which is not currently granted to you for this task class." The agent has something to act on (give up, escalate to a human, pick a different approach) without learning anything that helps it raise the underlying number through performative behavior.
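The boundary can be sketched as a single translation function: the full score stays on the trust engine's side, and only a decision with a stable reason code crosses to the agent. Names and the message format here are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentDecision:
    allowed: bool
    reason_code: str = ""   # stable, machine-actionable, score-free
    message: str = ""       # human-readable, also score-free

def to_agent_view(score: float, threshold: float,
                  privilege: str, task_class: str) -> AgentDecision:
    """Translate the stored view into the agent view. Note what is NOT
    in the denial: no score, no threshold, no margin to optimize against."""
    if score >= threshold:
        return AgentDecision(allowed=True)
    return AgentDecision(
        allowed=False,
        reason_code="PRIVILEGE_NOT_GRANTED",
        message=(f"This action requires {privilege}, which is not currently "
                 f"granted to you for task class {task_class!r}."),
    )

d = to_agent_view(0.83, 0.85, "refund.high_value", "billing_refund")
# d names the missing privilege but never the 0.83 or the 0.85.
```

The operator dashboard consumes `score` and `threshold` directly; the agent only ever sees `d`, which is why the nudge can be helpful without being a leak.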

Practical advice

The shape of mature agent systems is converging on something like this: lightweight cryptographic identity, default-deny privileges, time-bounded signed grants, multi-dimensional reputation with explicit decay, and a tamper-resistant audit log feeding outcomes back into the score. None of these pieces is novel on its own; the architecture is mostly about making sure they all add up to something that holds together when an agent (or its operator) misbehaves.

This chapter handles the question of what the agent is allowed to do. The orthogonal question is what the agent can claim about itself. An agent can hallucinate that it has a token it does not have, or that it accepts a classification it does not accept. Chapter 10 (When the agent itself is wrong) covers the three external checks (capability registry, pinned ask, tool gate) that close that gap. Both layers are necessary; this one says what the agent may do, and chapter 10 makes sure the agent cannot lie about what it is.