If you cannot predict it, you cannot trust it.
Every problem in the chapters before this one has the same root: we do not know what the agent will do. Trust scoring measures past behavior to guess at future behavior. Risk modeling enumerates what could go wrong. Alerts fire when something we did not expect occurs. All of these are downstream of one missing piece: a mathematical model of expected behavior. With it, the gap between actual and expected becomes a signal we can act on. Without it, every alert is a surprise, and surprises are incidents.
The classical tools are decades old: Markov decision processes from 1957 (Bellman), hidden Markov models from the 1960s (Baum and Petrie), partially observable variants from the 1970s. They are still the foundation. They are not the whole story. Production agent systems combine them with a much wider toolbox: conformal prediction with adaptive set sizes, n-gram world models for divergence detection, Mahalanobis distance for out-of-distribution detection, sequential change-point detection for streaming alerts. This chapter walks the layers in the order you will reach for them, with implementations you can run.
Predictability is four different things
The first mistake teams make is conflating four distinct properties under one name. They all sound like predictability, they all need to be addressed, and the techniques that address one rarely address the others.
| Kind | Question | Tools |
|---|---|---|
| Action predictability | What will the agent do next? | MDPs, Q-learning, behavior cloning |
| Outcome predictability | What will the world look like after? | World models, transition models, simulators |
| Value predictability | How good or bad will the result be? | Value functions, conformal prediction, return estimates |
| Failure predictability | When will the agent be unreliable? | OOD detection, HMM divergence, change-point detection |
Pick the kind you actually need before you pick the technique. Most teams need failure predictability first (knowing when the agent is unreliable), not action predictability (knowing exactly what it will do). The tools below are mapped to the column they belong in.
The Markov decision process: foundation everyone starts with
An MDP is the cleanest way to write down what an agent does. Four pieces: a state set S, an action set A, a transition kernel P(s'|s,a), and a reward function R(s,a,s'). The Bellman equation gives the optimal value function as a fixed point: V*(s) = max_a Σ_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]. Solve the fixed point by iteration; the resulting policy is provably optimal under the model.
Two things change between textbook MDPs and production. First, you almost never know P in closed form. You have agent traces. So instead of writing the transition matrix, you estimate it from logs. Second, value iteration converges to the optimum, but policy iteration converges in fewer iterations and is what production codebases actually run. Both, with trace-driven transition estimation:
from dataclasses import dataclass, field
from collections import defaultdict, Counter
from typing import Iterable
import math, random
@dataclass
class MDP:
states: list[str]
actions: list[str]
P: dict = field(default_factory=lambda: defaultdict(lambda: defaultdict(dict)))
R: dict = field(default_factory=lambda: defaultdict(lambda: defaultdict(lambda: defaultdict(float))))
gamma: float = 0.95
# ─── Estimation: build P and R from observed (s, a, s', r) traces ─────
def fit_from_traces(self, traces: Iterable):
"""Maximum-likelihood estimate of P and R from logged transitions.
With Laplace smoothing so unseen (s,a) pairs do not have zero rows."""
counts = defaultdict(Counter)
rewards = defaultdict(lambda: defaultdict(list))
for s, a, sp, r in traces:
counts[(s, a)][sp] += 1
rewards[(s, a)][sp].append(r)
for s in self.states:
for a in self.actions:
row = counts.get((s, a), Counter())
total = sum(row.values()) + len(self.states) # Laplace +1
self.P[s][a] = {sp: (row.get(sp, 0) + 1) / total
for sp in self.states}
for sp in self.states:
rs = rewards.get((s, a), {}).get(sp, [])
self.R[s][a][sp] = sum(rs) / len(rs) if rs else 0.0
# ─── Value iteration: simple, slow, always converges ────────────────────
def value_iteration(self, tol=1e-6, max_iters=500):
V = {s: 0.0 for s in self.states}
for _ in range(max_iters):
delta = 0.0
for s in self.states:
best = max(sum(self.P[s][a][sp] *
(self.R[s][a][sp] + self.gamma * V[sp])
for sp in self.states)
for a in self.actions)
delta = max(delta, abs(best - V[s]))
V[s] = best
if delta < tol: break
return V
# ─── Policy iteration: alternates evaluation and improvement, faster ────
def policy_iteration(self, max_iters=200):
pi = {s: random.choice(self.actions) for s in self.states}
V = {s: 0.0 for s in self.states}
for _ in range(max_iters):
# Policy evaluation: solve V^pi via in-place backups
for _ in range(100):
for s in self.states:
a = pi[s]
V[s] = sum(self.P[s][a][sp] *
(self.R[s][a][sp] + self.gamma * V[sp])
for sp in self.states)
# Policy improvement: greedy step
stable = True
for s in self.states:
best_a = max(self.actions, key=lambda a:
sum(self.P[s][a][sp] *
(self.R[s][a][sp] + self.gamma * V[sp])
for sp in self.states))
if best_a != pi[s]:
pi[s] = best_a; stable = False
if stable: break
return pi, V
# ─── Q-learning: when you do not even want to estimate P ────────────────
def q_learning(traces, states, actions, gamma=0.95, alpha=0.1):
"""Off-policy TD learning. Pass traces of (s, a, r, s') tuples; returns
Q[s][a]. Useful when you have logs but the state space is too large
to estimate P explicitly. This is what production RL pipelines run."""
Q = {s: {a: 0.0 for a in actions} for s in states}
for s, a, r, sp in traces:
max_next = max(Q[sp].values())
Q[s][a] += alpha * (r + gamma * max_next - Q[s][a])
return Q
# Example: a triage agent. State = queue depth tier; action = reply or escalate.
random.seed(7)
states = ["low", "med", "high"]
actions = ["reply", "escalate"]
# Synthesize 5,000 trace tuples from known true dynamics for the demo;
# in production this comes from your agent's structured logs.
def true_step(s, a):
if a == "reply":
sp = random.choices(states, weights={
"low": [0.7, 0.25, 0.05],
"med": [0.3, 0.55, 0.15],
"high": [0.05, 0.4, 0.55],
}[s])[0]
else:
sp = random.choices(states, weights=[0.85, 0.13, 0.02])[0]
r = {"low": 1, "med": 0.5, "high": -2}[sp] - (0.8 if a == "escalate" else 0)
return sp, r
traces = []
s = "low"
for _ in range(5000):
a = random.choice(actions)
sp, r = true_step(s, a)
traces.append((s, a, sp, r))
s = sp
mdp = MDP(states, actions)
mdp.fit_from_traces(traces)
pi_pi, V_pi = mdp.policy_iteration()
V_vi = mdp.value_iteration()
print("value iteration values:", {s: round(v, 2) for s, v in V_vi.items()})
print("policy iteration policy:", pi_pi)
print("max |V_vi - V_pi|:", max(abs(V_vi[s] - V_pi[s]) for s in states))
# Q-learning over the same trace, no P estimation:
q_traces = [(s, a, r, sp) for s, a, sp, r in traces]
Q = q_learning(q_traces, states, actions)
greedy_pi = {s: max(Q[s], key=Q[s].get) for s in states}
print("q-learning greedy policy:", greedy_pi)
# Three different methods agree on the policy: reply when low/med, escalate when high.
Three independent methods converging to the same policy is a sanity check most teams skip and later regret. Value iteration, policy iteration, and Q-learning have different failure modes. If they disagree on the same data, your model is poorly specified or your traces are biased. If they agree, you have a defensible policy you can serialize and ship.
Hidden Markov models, fitted properly
HMMs flip the question: instead of telling the agent what to do, observe its actions and infer the hidden state that produced them. This is the right tool for auditing an agent. Fit on a long stretch of normal behavior; then continuously score new traces against the fitted model. When log-likelihood drops, the agent is doing something the baseline does not explain. That is your alert.
The usual cop-out is to say "fit via Baum-Welch" without ever writing it. Baum-Welch is the EM algorithm for HMMs (Baum, Petrie, Soules, and Weiss 1970): repeatedly run forward-backward to compute posterior expectations of state occupancy and transitions, then re-estimate the parameters from those expectations. It is not optional. Without it, your HMM has whatever parameters you guessed, and "fit on production traces" is marketing copy.
import numpy as np
class HMM:
"""Discrete-observation HMM with Baum-Welch fitting and Viterbi decoding.
Production stacks use hmmlearn or pomegranate; this is the algorithm
those libraries implement, written out so the math is auditable."""
def __init__(self, n_states, n_obs, seed=0):
rng = np.random.default_rng(seed)
self.n_s, self.n_o = n_states, n_obs
# Random init, normalized rows. Multiple restarts beat single fits.
self.pi = rng.dirichlet(np.ones(n_states))
self.A = rng.dirichlet(np.ones(n_states), size=n_states)
self.B = rng.dirichlet(np.ones(n_obs), size=n_states)
def _forward(self, obs):
T = len(obs)
alpha = np.zeros((T, self.n_s))
c = np.zeros(T) # scaling constants
alpha[0] = self.pi * self.B[:, obs[0]]
c[0] = alpha[0].sum() + 1e-300
alpha[0] /= c[0]
for t in range(1, T):
alpha[t] = (alpha[t-1] @ self.A) * self.B[:, obs[t]]
c[t] = alpha[t].sum() + 1e-300
alpha[t] /= c[t]
return alpha, c
def _backward(self, obs, c):
T = len(obs)
beta = np.zeros((T, self.n_s))
beta[T-1] = 1.0 / c[T-1]
for t in range(T-2, -1, -1):
beta[t] = (self.A @ (self.B[:, obs[t+1]] * beta[t+1])) / c[t]
return beta
def log_likelihood(self, obs):
_, c = self._forward(obs)
return np.log(c).sum()
def fit(self, sequences, max_iters=50, tol=1e-4):
"""Baum-Welch (EM). Aggregates expected counts across all training
sequences in each E step, then re-estimates pi/A/B in the M step.
Stops when log-likelihood improvement falls below tol."""
prev_ll = -np.inf
for it in range(max_iters):
pi_num = np.zeros(self.n_s)
A_num = np.zeros((self.n_s, self.n_s))
A_den = np.zeros(self.n_s)
B_num = np.zeros((self.n_s, self.n_o))
B_den = np.zeros(self.n_s)
total_ll = 0.0
for obs in sequences:
T = len(obs)
alpha, c = self._forward(obs)
beta = self._backward(obs, c)
total_ll += np.log(c).sum()
# gamma[t,i] = P(state_t = i | obs)
gamma = alpha * beta * c[:, None]
# xi[t,i,j] = P(state_t=i, state_{t+1}=j | obs)
for t in range(T-1):
xi_t = (alpha[t][:, None] * self.A *
self.B[:, obs[t+1]][None, :] *
beta[t+1][None, :])
A_num += xi_t / (xi_t.sum() + 1e-300)
pi_num += gamma[0]
A_den += gamma[:-1].sum(axis=0)
for t in range(T):
B_num[:, obs[t]] += gamma[t]
B_den += gamma.sum(axis=0)
# M step
self.pi = pi_num / pi_num.sum()
self.A = A_num / (A_den[:, None] + 1e-300)
self.B = B_num / (B_den[:, None] + 1e-300)
if abs(total_ll - prev_ll) < tol:
break
prev_ll = total_ll
return total_ll
def viterbi(self, obs):
"""Most likely hidden-state sequence given observations. The audit
artifact: 'the agent was probably in mode X at time t, Y at t+1.' """
T = len(obs)
log_pi = np.log(self.pi + 1e-300)
log_A = np.log(self.A + 1e-300)
log_B = np.log(self.B + 1e-300)
delta = log_pi + log_B[:, obs[0]]
psi = np.zeros((T, self.n_s), dtype=int)
for t in range(1, T):
scores = delta[:, None] + log_A
psi[t] = scores.argmax(axis=0)
delta = scores.max(axis=0) + log_B[:, obs[t]]
path = [int(delta.argmax())]
for t in range(T-1, 0, -1):
path.insert(0, int(psi[t, path[0]]))
return path
# Synthetic agent traces. 0=reply, 1=search, 2=escalate.
# Generated from a 3-state ground truth so we can verify Baum-Welch recovers it.
rng = np.random.default_rng(42)
true_A = np.array([[0.85, 0.13, 0.02],
[0.20, 0.65, 0.15],
[0.05, 0.20, 0.75]])
true_B = np.array([[0.80, 0.15, 0.05],
[0.25, 0.65, 0.10],
[0.10, 0.20, 0.70]])
def sample_seq(T):
s = rng.choice(3, p=[0.7, 0.25, 0.05])
obs = []
for _ in range(T):
o = rng.choice(3, p=true_B[s])
obs.append(o)
s = rng.choice(3, p=true_A[s])
return obs
train = [sample_seq(100) for _ in range(200)]
hmm = HMM(n_states=3, n_obs=3, seed=1)
ll = hmm.fit(train, max_iters=30)
print("final training log-likelihood:", round(ll, 1))
# Score new sequences and use a 2-sigma drop as the alert threshold.
val_lls = [hmm.log_likelihood(sample_seq(100)) for _ in range(100)]
mu, sigma = np.mean(val_lls), np.std(val_lls)
threshold = mu - 2 * sigma
print(f"alert if per-100-step log-lik below {threshold:.1f} (2-sigma)")
# Inject anomalies (escalate-heavy traces) and verify they get caught.
weird = [rng.choice(3, p=[0.1, 0.1, 0.8]) for _ in range(100)]
weird_ll = hmm.log_likelihood(weird)
print(f"anomaly log-lik: {weird_ll:.1f} (caught: {weird_ll < threshold})")
Two operational notes worth their weight in production. First, run multiple random restarts of Baum-Welch and keep the best by validation log-likelihood. EM finds local optima, and a single fit with a bad init will look fine but classify badly. Second, refit on a schedule. The world drifts; a six-month-old HMM is alerting against a baseline that no longer exists.
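A minimal sketch of the restart loop, reusing the HMM class and the train/sample_seq helpers above; the restart count and the validation split are assumptions to tune:
import numpy as np
def fit_with_restarts(train_seqs, val_seqs, n_states, n_obs,
                      n_restarts=10, max_iters=30):
    """Run Baum-Welch from several random initializations and keep the
    model that scores best on held-out sequences, not training ones.
    Restarts are independent; in production, run them in parallel."""
    best_model, best_val = None, -np.inf
    for seed in range(n_restarts):
        m = HMM(n_states=n_states, n_obs=n_obs, seed=seed)
        m.fit(train_seqs, max_iters=max_iters)
        val_ll = sum(m.log_likelihood(seq) for seq in val_seqs)
        if val_ll > best_val:
            best_model, best_val = m, val_ll
    return best_model, best_val
val_seqs = [sample_seq(100) for _ in range(50)]
best_hmm, best_ll = fit_with_restarts(train, val_seqs, n_states=3, n_obs=3)
print("best validation log-likelihood across restarts:", round(best_ll, 1))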
Conformal prediction with adaptive set sizes
The most underused tool in the predictability stack. Standard conformal prediction (Vovk, Gammerman, Shafer 2005) gives you a prediction set guaranteed to contain the true answer with probability 1 − α. Adaptive Prediction Sets (Romano, Sesia, Candès 2020, "Classification with valid and adaptive coverage") gives you the same guarantee with much smaller sets when the model is confident, and bigger sets when it is not, by ranking classes from highest to lowest probability and including them until cumulative probability passes a calibrated threshold.
APS is the version that ships in production, because the marginal coverage guarantee of basic conformal can hide horrible per-class behavior. Below: both implementations, plus a real-shaped wrapper around an LLM-style classification call.
import numpy as np
from dataclasses import dataclass
@dataclass
class Conformal:
"""Both vanilla split-conformal and Adaptive Prediction Sets.
score_basic(x, y) : 1 - p_y (Vovk 2005)
score_aps(x, y) : sum of p's ranked >= p_y, minus a uniform tiebreak
(Romano, Sesia, Candes 2020)
APS gives smaller sets at the same coverage and avoids the failure
mode where basic conformal achieves 95% marginal coverage by being
100% covered on easy classes and 70% on hard ones."""
base: callable # (x) -> np.array of class probs
alpha: float = 0.05
qhat: float = None
mode: str = "aps" # "basic" | "aps"
def _score(self, probs, y, u=None):
if self.mode == "basic":
return 1.0 - probs[y]
# APS: sum of probs >= p_y, with random tiebreak u in [0,1]
order = np.argsort(-probs) # desc
cum = np.cumsum(probs[order])
rank_y = int(np.where(order == y)[0][0])
u = 0.5 if u is None else u
return cum[rank_y] - u * probs[y]
def calibrate(self, X_cal, y_cal, seed=0):
rng = np.random.default_rng(seed)
scores = []
for x, y in zip(X_cal, y_cal):
p = self.base(x)
u = rng.random() if self.mode == "aps" else None
scores.append(self._score(p, y, u))
scores = np.array(scores)
n = len(scores)
# Finite-sample-corrected quantile (Vovk's adjustment).
q_level = np.ceil((n + 1) * (1 - self.alpha)) / n
self.qhat = np.quantile(scores, min(q_level, 1.0),
method="higher")
return self.qhat
def predict_set(self, x, seed=0):
if self.qhat is None:
raise RuntimeError("call .calibrate() first")
rng = np.random.default_rng(seed)
p = self.base(x)
if self.mode == "basic":
return [c for c in range(len(p)) if (1 - p[c]) <= self.qhat]
# APS: include classes in descending prob order until score > qhat
order = np.argsort(-p)
cum = 0.0
out = []
for c in order:
u = rng.random()
score = (cum + p[c]) - u * p[c]
if score > self.qhat and len(out) >= 1:
break
out.append(int(c))
cum += p[c]
return out
# A wrapper that looks like a real LLM call for routing classification.
def make_llm_classifier(n_classes=5, accuracy=0.78, seed=0):
"""Stand-in for a call like:
client.messages.create(model='claude-...',
messages=[{'role':'user','content':x}],
tools=[{'name':'route', 'classes':[...]}])
and parsing the tool-use logprobs into class probabilities.
Calibrated badly on purpose so APS has work to do."""
rng = np.random.default_rng(seed)
def classify(x):
true = x % n_classes
p = np.full(n_classes, (1 - accuracy) / (n_classes - 1))
p[true] = accuracy
# Add overconfidence noise: model thinks it knows more than it does.
p = p + rng.normal(0, 0.10, size=n_classes)
p = np.clip(p, 1e-3, None)
return p / p.sum()
return classify
clf = make_llm_classifier(n_classes=5, accuracy=0.78, seed=1)
# Calibrate both, evaluate both on identical test data.
X_cal, y_cal = np.arange(500), np.arange(500) % 5
X_te, y_te = np.arange(2000, 3000), np.arange(2000, 3000) % 5
for mode in ["basic", "aps"]:
cp = Conformal(base=clf, alpha=0.10, mode=mode)
cp.calibrate(X_cal, y_cal)
sets = [cp.predict_set(x, seed=i) for i, x in enumerate(X_te)]
cov = np.mean([y in s for s, y in zip(sets, y_te)])
avg = np.mean([len(s) for s in sets])
print(f"{mode:5} coverage={cov:.2%} avg_set_size={avg:.2f}")
# Output (target coverage 90%; smaller average set size is better):
#   basic  coverage ≈ 90.4%  avg_set_size ≈ 2.1
#   aps    coverage ≈ 90.2%  avg_set_size ≈ 1.6
APS gives ~25% smaller prediction sets at the same coverage. That difference is the difference between "agent always asks for help" and "agent acts on confident predictions and only escalates the genuinely ambiguous ones." The runtime cost is identical. Use APS unless you have a specific reason not to.
Two further extensions worth knowing exist. Weighted conformal prediction (Tibshirani et al. 2019) handles distribution shift between calibration and runtime by upweighting calibration examples that look like the current input. Conformal risk control (Angelopoulos et al. 2022) generalizes the coverage guarantee to arbitrary loss functions, so you can guarantee bounded false-negative rate, bounded false-positive rate, or bounded violation of any monotone risk metric. Both are drop-in replacements for the calibration step above.
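The calibration change for weighted conformal is small enough to sketch. The weighted-quantile formula below is from Tibshirani et al. 2019; the weight function that feeds it (an estimate of the likelihood ratio between runtime and calibration distributions) is your own model and is the hard part in practice:
import numpy as np
def weighted_conformal_qhat(cal_scores, cal_weights, w_test, alpha=0.10):
    """Weighted (1 - alpha) quantile (Tibshirani et al. 2019). Calibration
    score i carries mass w_i / (sum_j w_j + w_test); the remaining mass
    w_test / (sum_j w_j + w_test) sits at +infinity. Since w_test depends
    on the test input, qhat is recomputed per request."""
    order = np.argsort(cal_scores)
    s = np.asarray(cal_scores, dtype=float)[order]
    w = np.asarray(cal_weights, dtype=float)[order]
    cum = np.cumsum(w) / (w.sum() + w_test)
    idx = int(np.searchsorted(cum, 1 - alpha))
    # If the (1 - alpha) mass is never reached, qhat is infinite and the
    # prediction set must contain every class: the honest answer under
    # heavy shift.
    return s[idx] if idx < len(s) else float("inf")
Swap this in for the unweighted quantile in Conformal.calibrate() and the prediction-set logic above is unchanged.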
Out-of-distribution detection: knowing when the model has not seen this
Conformal prediction tells you confidence within the distribution it was calibrated on. It does not tell you whether the input is from a different distribution entirely. That is a separate question, and it has its own answer: measure how far the input is from the training distribution in some embedding space, alert when the distance is unusual.
The technique that holds up under stress is Mahalanobis distance with a class-conditional Gaussian fit (Lee, Lee, Lee, Shin 2018, "A simple unified framework for detecting out-of-distribution samples and adversarial attacks"). Fit a Gaussian per class on training embeddings; at runtime compute the minimum Mahalanobis distance to any class centroid; threshold against an empirical p-value from held-out in-distribution examples.
import numpy as np
from dataclasses import dataclass, field
@dataclass
class MahalanobisOOD:
"""Per-class Mahalanobis OOD detector. Fits one Gaussian per class
on embeddings, then at runtime computes min-over-classes distance.
Calibrates a p-value cutoff on held-out in-distribution data so the
threshold is interpretable: 'reject if p < 0.01' means we expect
1% false-positive rate on in-distribution traffic."""
means: dict = field(default_factory=dict)
inv_cov: np.ndarray = None
cutoff: float = None
cal_dists: np.ndarray = None
def fit(self, X, y):
"""Shared covariance is the Lee 2018 default; works better than per-class
covariance when class data is sparse. Add a small ridge for stability."""
classes = np.unique(y)
cov_acc = np.zeros((X.shape[1], X.shape[1]))
n = 0
for c in classes:
Xc = X[y == c]
mu = Xc.mean(axis=0)
self.means[int(c)] = mu
cov_acc += (Xc - mu).T @ (Xc - mu)
n += len(Xc)
cov = cov_acc / n + np.eye(X.shape[1]) * 1e-4
self.inv_cov = np.linalg.inv(cov)
def distance(self, x):
"""Min-over-classes Mahalanobis distance. Smaller = more in-distribution."""
return min(
(x - mu) @ self.inv_cov @ (x - mu)
for mu in self.means.values()
)
def calibrate(self, X_cal, alpha=0.01):
"""Fit an empirical (1 - alpha)-quantile of in-distribution distances.
At runtime, p-value = fraction of cal distances that are >= test distance.
        Reject when p < alpha. The empirical quantile replaces the chi-square
        approximation, which only holds if the embeddings are truly Gaussian."""
self.cal_dists = np.array([self.distance(x) for x in X_cal])
self.cutoff = np.quantile(self.cal_dists, 1 - alpha)
def p_value(self, x):
if self.cal_dists is None:
raise RuntimeError("call .calibrate() first")
d = self.distance(x)
return float(np.mean(self.cal_dists >= d))
def is_ood(self, x):
return self.distance(x) > self.cutoff
# Synthesize: in-distribution = mixture of 3 Gaussians, OOD = a 4th cluster.
rng = np.random.default_rng(0)
d = 8
centers = rng.normal(0, 3, size=(3, d))
X_train = np.vstack([rng.normal(c, 1.0, size=(500, d)) for c in centers])
y_train = np.repeat(np.arange(3), 500)
X_id_held = np.vstack([rng.normal(c, 1.0, size=(200, d)) for c in centers])
ood_center = rng.normal(8, 1, size=d)
X_ood = rng.normal(ood_center, 1.0, size=(200, d))
ood = MahalanobisOOD()
ood.fit(X_train, y_train)
ood.calibrate(X_id_held, alpha=0.01)
id_flagged = sum(ood.is_ood(x) for x in X_id_held)
ood_flagged = sum(ood.is_ood(x) for x in X_ood)
print(f"in-dist FP rate (target ~1%): {id_flagged/len(X_id_held):.2%}")
print(f"OOD detection rate: {ood_flagged/len(X_ood):.2%}")
# in-dist FP rate (target ~1%): 1.00%
# OOD detection rate: 100.00%
The right place for an OOD detector in an agent system: before the agent calls a tool or commits to an action. Embed the input through the same encoder you use elsewhere (a frozen sentence-transformer is a fine default), score it, and route OOD inputs to a human or to a more conservative fallback policy. The two-line operational rule that catches most silent failures: if a request is OOD, do not act on it; ask.
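A sketch of that boundary rule. embed, agent_fn, and fallback_fn are stand-ins for your own encoder and handlers; the cutoff assumes the detector was calibrated with alpha=0.01 as above:
def handle_request(text, embed, ood, agent_fn, fallback_fn, alpha=0.01):
    """Input-boundary gate: score the request embedding against the
    calibrated detector, act only on in-distribution inputs, and route
    everything else to a human queue or a conservative fallback."""
    x = embed(text)
    p = ood.p_value(x)
    if p < alpha:
        return fallback_fn(text, reason=f"OOD (empirical p-value {p:.4f})")
    return agent_fn(text)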
World models: predict the next state, score the divergence
For action and value predictability, MDPs and conformal prediction are enough. For outcome predictability we need a model of the environment itself, one that predicts what state the world will be in after the agent acts. Once you have one, you can plan against it, and more importantly, you can score how well it matches reality on each step. When predicted-vs-actual divergence rises, the world model has gone stale and the agent's plans are unreliable.
The deep-learning version of this is Dreamer (Hafner et al., culminating in DreamerV3, 2023): a learned recurrent state-space model trained jointly with a policy. For most production agent systems you do not need that. You need a small, fast model fit on traces, with a clear divergence metric. An n-gram model over discretized state tokens does this in 30 lines and is sufficient for a divergence alert. The point is the divergence signal, not the sophistication of the model.
from collections import defaultdict, Counter
import math
class NGramWorldModel:
"""Predicts P(next_state | (state, action)) using counts from training
traces. Held-out perplexity is the comparison metric: when production
perplexity rises above the training-set baseline, the world has shifted.
For continuous state, discretize first (clustering, quantile bins)
and treat the cluster ID as the token. The math below is exactly
what a transformer learns implicitly, just written as a count table."""
def __init__(self, smoothing=1.0):
self.counts = defaultdict(Counter)
self.totals = Counter()
self.alpha = smoothing
self.vocab = set()
def fit(self, traces):
for s, a, sp in traces:
self.counts[(s, a)][sp] += 1
self.totals[(s, a)] += 1
self.vocab.update([s, sp])
def prob(self, sp, s, a):
"""Add-alpha smoothed conditional probability. Smoothing prevents
zero-prob events from blowing up perplexity on unseen transitions."""
V = len(self.vocab)
n = self.counts[(s, a)].get(sp, 0)
d = self.totals[(s, a)]
return (n + self.alpha) / (d + self.alpha * V)
def log_prob_seq(self, traces):
return sum(math.log(self.prob(sp, s, a)) for s, a, sp in traces)
def perplexity(self, traces):
"""exp(-mean log-prob). Lower is better; rising perplexity over time
is the alert that the world model has gone stale."""
n = len(traces)
if n == 0: return float('inf')
return math.exp(-self.log_prob_seq(traces) / n)
def predict_top_k(self, s, a, k=3):
"""For runtime: get the top-k most likely next states. The agent can
compare its planned outcome against this list; if it is not in
the top-k, the action is unusual and worth flagging."""
ranked = sorted(
((sp, self.prob(sp, s, a)) for sp in self.vocab),
key=lambda p: -p[1])
return ranked[:k]
# Generate fresh (s, a, sp) traces with the same dynamics as the MDP example.
import random
random.seed(11)
states = ["low", "med", "high"]
actions = ["reply", "escalate"]
def step(s, a):
if a == "reply":
return random.choices(states,
weights={"low":[0.7,0.25,0.05],
"med":[0.3,0.55,0.15],
"high":[0.05,0.4,0.55]}[s])[0]
return random.choices(states, weights=[0.85,0.13,0.02])[0]
train_traces = []
s = "low"
for _ in range(3000):
a = random.choice(actions)
sp = step(s, a); train_traces.append((s, a, sp)); s = sp
wm = NGramWorldModel(smoothing=0.5)
wm.fit(train_traces)
# Held-out in-distribution: same dynamics, different sample.
held = []; s = "low"
for _ in range(500):
a = random.choice(actions); sp = step(s, a); held.append((s, a, sp)); s = sp
print("held-out perplexity (baseline):", round(wm.perplexity(held), 3))
# Drift: change the dynamics so 'reply' more often pushes to high.
def step_drift(s, a):
if a == "reply":
return random.choices(states, weights=[0.2, 0.3, 0.5])[0]
return random.choices(states, weights=[0.85, 0.13, 0.02])[0]
drift = []; s = "low"
for _ in range(500):
a = random.choice(actions); sp = step_drift(s, a); drift.append((s, a, sp)); s = sp
print("drifted perplexity (alert):", round(wm.perplexity(drift), 3))
# held-out perplexity (baseline): ~2.3
# drifted perplexity (alert): ~3.5
The operational rule: define a perplexity ceiling at training time (typically baseline mean + 2 standard deviations on held-out batches). Compute perplexity on every rolling window of N transitions in production. Alert when it crosses the ceiling. The world model itself does not need to be smart; it needs to be fitted on traces you trust, scored on traces you do not, and the gap is the signal.
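A sketch of that rule against the model fitted above; the window size and the 2-sigma multiplier are the knobs to tune:
import numpy as np
def perplexity_ceiling(wm, held_out, window=100):
    """Ceiling = mean + 2 std of perplexity over held-out windows."""
    ppls = [wm.perplexity(held_out[i:i + window])
            for i in range(0, len(held_out) - window + 1, window)]
    return float(np.mean(ppls) + 2 * np.std(ppls))
def watch(wm, stream, ceiling, window=100):
    """Rolling-window perplexity over a live stream of (s, a, sp)
    transitions; yields the step index whenever the ceiling is crossed."""
    buf = []
    for t, transition in enumerate(stream):
        buf.append(transition)
        if len(buf) > window:
            buf.pop(0)
        if len(buf) == window and wm.perplexity(buf) > ceiling:
            yield t
ceiling = perplexity_ceiling(wm, held)
print("perplexity ceiling:", round(ceiling, 3))
print("first alert on drifted stream:", next(watch(wm, drift, ceiling), None))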
Combining signals: Fisher's method and CUSUM
Each of the techniques above produces a number. HMM gives a log-likelihood. Conformal gives a prediction-set size. OOD gives a Mahalanobis distance and a p-value. World model gives a perplexity. Any one of them is noisy; firing on each one independently produces alert fatigue. Two combiners do most of the work in production.
Fisher's method (Fisher 1925) combines independent p-values into a single test statistic with a known chi-square distribution: X = −2 Σ_i ln(p_i) ~ χ²_{2k}. If you can express each detector's output as a p-value (most can, with the right calibration), Fisher gives you one number to threshold instead of k. CUSUM (Page 1954) handles the temporal axis: instead of firing on a single bad observation, accumulate evidence over time and fire when the running sum crosses a threshold. CUSUM catches slow drifts that any single-point detector would miss.
import math
from dataclasses import dataclass
# ─── Fisher: combine k independent p-values into one test ──────────────
def fisher_combine(p_values):
"""Returns combined p-value. Under H0 (every detector is correctly
calibrated and the input is in-distribution), -2 sum(ln p_i) follows
chi-square with 2k degrees of freedom."""
if not p_values: return 1.0
p_values = [max(p, 1e-300) for p in p_values]
X = -2 * sum(math.log(p) for p in p_values)
df = 2 * len(p_values)
    # Chi-square survival in closed form: exact when df is even, and
    # df = 2k always is; scipy.stats.chi2.sf(X, df) gives the same value.
    k = df // 2
    tail = 0.0; term = 1.0
    for i in range(k):
        if i > 0: term *= (X / 2) / i
        tail += term
    return math.exp(-X / 2) * tail
# ─── CUSUM: streaming change-point detection ──────────────────────────
@dataclass
class CUSUM:
"""Page's CUSUM. Tracks running sum of (x_t - target - drift); resets
to zero whenever the sum goes negative, fires when it crosses h.
Catches slow drifts that single-point thresholds miss.
Two-sided variant (track upward and downward shifts separately) is
standard in production; the single-side version below is enough to
show the structure."""
target: float # expected mean of the metric under normal ops
drift: float # minimum shift size we care about (k in Page's notation)
h: float # alarm threshold; tune for false-alarm rate
s: float = 0.0 # running CUSUM
fired_at: int = None # index where alarm crossed h, or None
def step(self, x_t, t):
self.s = max(0.0, self.s + (x_t - self.target - self.drift))
if self.s > self.h and self.fired_at is None:
self.fired_at = t
return self.s
# Demo: streaming HMM negative log-likelihoods. Healthy mean = 150, sigma = 5;
# at step 200 we inject a slow drift toward 180.
import random
random.seed(3)
target_neg_ll = 150 # tracking |log-lik| so larger = worse
cusum = CUSUM(target=target_neg_ll, drift=3, h=40)
for t in range(400):
if t < 200:
sample = random.gauss(150, 5)
else:
# slow drift: mean creeps up by 0.15 each step
sample = random.gauss(150 + (t - 200) * 0.15, 5)
cusum.step(sample, t)
print("CUSUM fired at step:", cusum.fired_at)
# CUSUM fired at step: ~ step 240 (40 steps after the drift began,
# long before any single-point threshold would have triggered).
# Combine multiple detectors at runtime:
detector_ps = [0.15, 0.12, 0.08, 0.10]  # HMM, conformal, OOD, world-model
combined_p = fisher_combine(detector_ps)
print(f"combined p-value: {combined_p:.4f} (alert if < 0.05)")
# combined p-value: ~ 0.024: no single detector crosses 0.05 on its own, but
# the cluster of mildly suspicious values is jointly unlikely under H0.
Two combiners, two purposes. Fisher fuses signals at the same time step. CUSUM fuses signals across time. Use both: a per-request Fisher score, accumulated through CUSUM over a streaming window. The CUSUM threshold is tuned for false-alarm rate per unit time (the "average run length" parameter); standard recipes are in Montgomery's Statistical Quality Control. The point is that "any-detector-fires" is too noisy and "all-detectors-fire" is too quiet; a combined statistic is what holds.
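A sketch of that composition, reusing fisher_combine and CUSUM from above. Under H0 a p-value is uniform on [0, 1], so −ln p is Exponential(1) with mean 1, which fixes the CUSUM target; the drift and h values are assumptions to tune against the false-alarm rate you can tolerate:
import math
def fused_alarm(per_request_p_values, drift=0.5, h=8.0):
    """Per-request Fisher fusion feeding a CUSUM on -ln(p_combined).
    Returns the request index at which accumulated evidence of many
    mildly suspicious requests crosses the alarm threshold, or None."""
    cusum = CUSUM(target=1.0, drift=drift, h=h)
    for t, ps in enumerate(per_request_p_values):
        p = fisher_combine(ps)
        cusum.step(-math.log(max(p, 1e-300)), t)
    return cusum.fired_at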
Evaluating the predictability layer itself
A thing nobody talks about: every component above has its own evaluation. Coverage tests for conformal prediction, log-likelihood comparison for HMM, false-positive rate for OOD detection, perplexity ceiling for world models. The predictability layer is itself a system, and it needs the same calibration rigor it provides.
| Component | Metric | Healthy range | Failure signal |
|---|---|---|---|
| Conformal prediction | Empirical coverage on held-out | 1 − α ± 2% | Coverage falls below target: distribution shift |
| Conformal prediction | Average set size | Stable over time | Set sizes growing: model is getting less confident |
| HMM | Held-out log-likelihood per step | Within 2σ of training mean | Persistent drop: production traces no longer look like training |
| OOD detector | FP rate on in-distribution data | ≤ α (calibration target) | FP rate climbing: in-distribution is shifting |
| OOD detector | OOD detection rate on synthetic OOD | ≥ 95% on hand-crafted shifts | Drop: encoder no longer separates in/out |
| World model | Held-out perplexity | Stable, near training value | Rising: dynamics have changed |
| CUSUM combiner | Average run length under H0 | Tuned to chosen false-alarm rate | ARL too short: drift parameter is too small |
Run these as a daily batch. Every metric in the table should have an alarm of its own that fires when the predictability layer itself is degraded. The layer that watches the agents needs to be watched.
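A sketch of the daily job. Inputs are yesterday's labeled traffic; the thresholds mirror the table, and the specific numbers are illustrative defaults:
import numpy as np
def daily_health_check(sets_and_labels, ood, X_id, wm, windows,
                       target_cov=0.90, fp_alpha=0.01, ppl_ceiling=3.0):
    """Batch evaluation of the predictability layer itself. Returns the
    metric report plus the alarms that should page someone."""
    report = {
        "coverage": float(np.mean([y in s for s, y in sets_and_labels])),
        "avg_set_size": float(np.mean([len(s) for s, _ in sets_and_labels])),
        "ood_fp_rate": float(np.mean([ood.is_ood(x) for x in X_id])),
        "world_ppl": float(np.mean([wm.perplexity(w) for w in windows])),
    }
    alarms = []
    if report["coverage"] < target_cov - 0.02:
        alarms.append("conformal coverage below target: distribution shift")
    if report["ood_fp_rate"] > 2 * fp_alpha:
        alarms.append("OOD false-positive rate climbing: in-dist is shifting")
    if report["world_ppl"] > ppl_ceiling:
        alarms.append("world-model perplexity above ceiling: dynamics changed")
    return report, alarms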
What to actually do, in priority order
- Wrap your routing classifier with APS conformal prediction. Cheapest win. Two functions, ~50 lines, and you have calibrated runtime confidence. Set the prediction-set-size threshold for human handoff.
- Fit an HMM with Baum-Welch on a quarter's worth of normal traces. Continuously score new traces. Alert at 2σ below baseline log-likelihood. Refit monthly.
- Add a Mahalanobis OOD detector at the input boundary. Reuse whatever embedding you already have. One Gaussian per class, shared covariance. Reject anything below 1% empirical p-value.
- Build an n-gram world model from agent traces. Compute rolling perplexity. Alert when the ceiling is exceeded. The model can be ridiculous as long as the divergence signal is real.
- Combine the four with Fisher per request and CUSUM across time. Threshold on combined p-value, not individual ones. Tune CUSUM for the false-alarm rate you can tolerate.
- Evaluate every component on a daily schedule. Coverage, perplexity, FP rate. Alert on the health of the layer itself.
- Only after all of the above: consider weighted conformal for distribution shift, conformal risk control for arbitrary loss bounds, deep world models, decision transformers. None of these matter if the previous six are not in place.
Legibility: refuse actions you can't explain
Everything above is a numerical check on the agent. There's one more check that's worth keeping in mind, and it isn't a number. It's a rule: don't let the agent take an action whose reasoning you can't reconstruct.
The rule sounds obvious, but most agents don't follow it. They run, they decide, they act. If something goes wrong, you go back and try to figure out why. Sometimes the chain-of-thought is in the logs and it helps. Often the chain-of-thought is missing, or it's there but it's clearly a story the model told itself after the fact, not the actual reason it acted.
Legibility means flipping that around. Before the agent acts, the system asks: can a human, looking at what's in the log right now, see why the agent is about to do this? If the answer is no, the action is held for review. If the answer is yes, the action proceeds.
Two practical patterns make this work; a sketch combining both follows the list:
- Require the agent to write down its reason before it acts. Not after, not in parallel, before. The reason is part of the action's authorization. If the reason is missing, the action doesn't fire.
- Check that the reason actually matches the action. A separate process (or model) reads the reason and the action and asks "does this reason actually justify this action?" If they don't line up, the action is held. This is harder than just logging the reason, and it's what makes the rule actually bite.
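A minimal sketch of the gate with both patterns folded in; every name here (judge_fn, execute_fn, hold_fn) is illustrative, and judge_fn stands in for a separate model call or rule engine:
def execute_with_reason(action, reason, judge_fn, execute_fn, hold_fn):
    """Reason-before-action gate. The action fires only if (1) a reason
    was recorded before execution and (2) an independent judge agrees the
    reason justifies the action. The reason is part of the authorization,
    not an after-the-fact log line."""
    if not reason or not reason.strip():
        return hold_fn(action, why="no reason recorded before acting")
    if not judge_fn(reason=reason, action=action):
        return hold_fn(action, why="reason does not justify the action")
    return execute_fn(action, authorized_by=reason)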
One honest caveat. Legibility and correctness can pull in different directions. There are correct answers whose reasoning is hard to articulate, and a smart model can produce convincing-looking explanations that aren't actually the reasoning that drove its output. So legibility is a necessary condition, not a sufficient one. An agent whose reasoning you can read is not necessarily one whose reasoning is faithful. You still need the numerical checks above. But of all the safety properties to insist on, legibility is the one that gets easier as models get better, because better models can articulate more of what they're doing. Most of the others get harder.
Failure modes worth calling out
| Failure | How it shows up | The fix |
|---|---|---|
| Baum-Welch local optimum | HMM looks fitted but classifies poorly | 10 random restarts; pick the highest validation log-likelihood |
| Conformal exchangeability violation | Coverage drops over time; calibration is stale | Refit calibration weekly; switch to weighted conformal under shift |
| OOD threshold too tight | Legitimate new inputs treated as OOD; agent escalates everything | Lower α; track and review false positives |
| OOD threshold too loose | Genuine OOD inputs missed; agent acts on novel inputs | Raise α; add a second detector (Mahalanobis + likelihood ratio) |
| World model trained on biased traces | Drift detector misses the actual drift; the training set was already drifted | Hold out a clean reference window; refit periodically with golden data |
| Detector p-values not independent | Fisher combination is conservative or anti-conservative | Decorrelate via CCA; or use Brown's method which accounts for correlation |
| CUSUM ARL miscalibrated | Too many false alarms or none until disaster | Simulate H0 for 10× the desired ARL; set h to match empirically |
| Predictability layer overhead | Latency budget blown by detector calls | Run detectors async off the request path; accept eventual consistency |
The bottom line. Predictability is not one technique. It is a stack: MDPs to specify what should happen, Q-learning to learn from logs when you cannot specify, HMMs with Baum-Welch to detect when behavior diverges from a fitted baseline, conformal prediction with adaptive set sizes for runtime confidence, Mahalanobis OOD detection for unfamiliar inputs, n-gram world models for environment drift, Fisher and CUSUM to combine signals, and daily evaluation of the whole stack. Each component is fifty to two hundred lines. None of them is research; they are 1957–2020 statistics with names. The question is not whether any of them are too complex to ship. The question is which order to ship them in. Conformal first, HMM second, OOD third. The rest can wait. Surprises become things you knew about in advance, not because the agent stopped surprising, but because the layer around it learned what surprise looks like.