23 End-to-end · one system, every layer

Putting all eighteen layers into one system, in order.

The previous chapters cover the layers in isolation. This one builds one coherent system that uses every layer. The example: a customer support agent for a mid-size SaaS company, "Cenpie SaaS." Customers ask it questions. It answers them, retrieves information from product docs, looks up orders in the CRM, drafts replies, escalates when uncertain, and logs everything for audit. It is the most common agent shape that ships in production. Walking it through end to end shows how the patterns combine.

Each section names the chapter it draws from in square brackets. Reading that chapter fills in the depth this overview compresses.

Step 1: Architecture and patterns [Ch 02, 05]

The agent is decomposed into four roles. A planner reads the customer message and produces a plan: what kind of question is this, what tools are needed, what is the expected reply shape. A retriever fetches relevant context from product docs and the CRM. A drafter writes the customer-facing reply using the retrieved context. A reviewer checks the draft against guardrails and either approves it or hands off to a human. Four agents, four explicit roles, one queue between them.

The pattern choice is "planner-executor with review," from chapter 05. The architecture choice is message-passing through a typed queue with envelope schemas, from chapter 02. We did not pick "swarm" or "blackboard" patterns; both fail the operational simplicity test for a customer-facing system where every action has to be defensible.

Step 2: Protocols and tool surface [Ch 04]

Each agent communicates through a small set of MCP tools. The retriever has two: doc_search(query, k) and crm_lookup(customer_id). The drafter has one: compose_reply(context, intent, tone). The reviewer has two: validate(draft, policies) and escalate(reason, draft, context). No agent has direct database access; every tool call goes through the control plane (chapter 13).

Why MCP (MCP 2025) and not bespoke RPC: tooling consistency across agents and across operators. The same retriever can be invoked by a different planner tomorrow, by a human operator's assistant the day after, by a debug script during incident review. The tool definitions are versioned in source control alongside the agent code.

Step 3: Memory and parallelism [Ch 07, 08]

The retriever runs document search and CRM lookup in parallel by default. They are independent, neither blocks the other, and Python's asyncio.gather handles the fan-out. This is the "fan-out, then fan-in" pattern from chapter 07; because both tools take similar time, the parallel version roughly halves retrieval latency relative to sequential calls.
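The fan-out is a few lines of asyncio. A minimal sketch, with `doc_search` and `crm_lookup` as stand-ins for the real MCP tool calls (the sleeps simulate I/O latency):

```python
import asyncio

# Hypothetical stand-ins for the retriever's two MCP tools.
async def doc_search(query: str, k: int = 5) -> list[str]:
    await asyncio.sleep(0.01)  # simulate network I/O
    return [f"chunk-{i} for {query!r}" for i in range(k)]

async def crm_lookup(customer_id: str) -> dict:
    await asyncio.sleep(0.01)
    return {"customer_id": customer_id, "tier": "pro"}

async def retrieve(query: str, customer_id: str):
    # Fan out: both coroutines start immediately.
    # Fan in: gather awaits both before returning.
    chunks, record = await asyncio.gather(
        doc_search(query, k=3),
        crm_lookup(customer_id),
    )
    return chunks, record

chunks, record = asyncio.run(retrieve("reset password", "cust-42"))
```

Because gather awaits both tasks concurrently, total latency is the maximum of the two calls rather than their sum.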

Memory has three layers. Conversation memory: the running thread of messages within a single support conversation. Customer memory: durable facts about this customer (entitlements, prior tickets, preferences) loaded once at conversation start. Knowledge memory: the vector index over product docs, refreshed nightly. The split matters because each layer has a different lifetime and a different invalidation rule. Chapter 08 walks through why mixing them in a single store creates problems we do not need.

Step 4: Context exchange between agents [Ch 06]

Before any agent calls another, the planner runs a one-line handshake. The retriever publishes capabilities saying it accepts up to CONFIDENTIAL input, returns a tuple tagged chunks and provenance, and may call the doc_search sub-tool. The drafter publishes capabilities saying it accepts up to CONFIDENTIAL, returns a draft tagged draft and citations, and calls only compose_reply. The reviewer publishes a capability that requires its inputs to carry the citations tag (so it cannot be asked to validate a draft that did not come from the drafter).

Each tool call passes through a compartment. The retriever's response goes back wrapped in a ContextEnvelope with classification=CONFIDENTIAL and tags=("chunks", "provenance"). The drafter receives the envelope, gates it inbound (raises if classification is too high or tags are missing), and reads the payload. When the drafter sends its draft to the reviewer, the same gate runs again on the boundary. The chapter 06 module handles the wrapping, the gating, and the redaction in roughly twenty lines of code per call site. Without it, every cross-agent message would be a place where the wrong field could leak. With it, the leak surface collapses to a single function.
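A minimal sketch of the envelope and the inbound gate, under assumed names (`ContextEnvelope`, `gate_inbound`, and the four-level classification ladder are illustrative, not the chapter 06 module's exact API):

```python
from dataclasses import dataclass

# Ordered from least to most sensitive; index comparison gives the "too high" check.
CLASSIFICATIONS = ("PUBLIC", "INTERNAL", "CONFIDENTIAL", "RESTRICTED")

@dataclass(frozen=True)
class ContextEnvelope:
    payload: object
    classification: str
    tags: tuple[str, ...] = ()

def gate_inbound(env: ContextEnvelope, max_classification: str,
                 required_tags: frozenset) -> object:
    """Boundary check: raise before the payload is ever read."""
    if CLASSIFICATIONS.index(env.classification) > CLASSIFICATIONS.index(max_classification):
        raise PermissionError(
            f"classification {env.classification} exceeds {max_classification}")
    missing = required_tags - set(env.tags)
    if missing:
        raise ValueError(f"missing tags: {sorted(missing)}")
    return env.payload

# The retriever's response, as the drafter receives it:
env = ContextEnvelope(payload={"chunks": ["..."]},
                      classification="CONFIDENTIAL",
                      tags=("chunks", "provenance"))
payload = gate_inbound(env, "CONFIDENTIAL", frozenset({"chunks", "provenance"}))
```

The reviewer's requirement that inputs carry the citations tag is the same gate with `required_tags=frozenset({"citations"})`.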

Step 5: Trust, capabilities, and audit [Ch 12]

Each agent has a reputation tracked through Beta-distributed counters. The drafter's reputation goes up when human reviewers approve its drafts, down when they reject them. The retriever's reputation goes up when the drafter actually used the retrieved chunks, down when it ignored them. After ninety days of operation, reputation thresholds become real: an agent with a reputation below a calibrated floor can no longer act without explicit human approval on every step.
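A Beta counter is small enough to show whole. This is a sketch, assuming a Beta(1, 1) uniform prior and a floor of 0.8; the calibrated floor in production comes from chapter 12's procedure, not from this constant:

```python
from dataclasses import dataclass

@dataclass
class Reputation:
    """Beta(alpha, beta) counter: alpha = successes + 1, beta = failures + 1."""
    alpha: float = 1.0  # uniform prior: one pseudo-success
    beta: float = 1.0   # uniform prior: one pseudo-failure

    def record(self, success: bool) -> None:
        if success:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    def may_act_autonomously(self, floor: float = 0.8) -> bool:
        # Below the calibrated floor, every step needs human approval.
        return self.mean >= floor

# Drafter reputation after five human review verdicts:
drafter = Reputation()
for approved in [True, True, True, False, True]:
    drafter.record(approved)
# mean = (1 + 4) / (2 + 5) = 5/7 ≈ 0.714, below a 0.8 floor
```

The retriever's counter is identical; only the success signal differs (chunks used vs. chunks ignored).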

Privileges are issued as Ed25519-signed tokens that bind {agent, tool, tenant, expiry}. The compose_reply tool requires a token signed by the privilege broker; the broker only issues one if the agent's current reputation clears the floor. Every tool call is recorded in a hash-chained audit log. The chain anchor is committed to a separate durable store hourly. This is the exact mechanism described in chapter 12.
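A self-contained sketch of the token binding and the hash chain. Ed25519 signing needs a third-party library (e.g. PyNaCl), so this sketch substitutes stdlib HMAC for the signature; the {agent, tool, tenant, expiry} binding and the chained hashes are the point, and the broker key and TTL are illustrative:

```python
import hashlib
import hmac
import json
import time

BROKER_KEY = b"demo-only-secret"  # stand-in for the broker's Ed25519 private key

def issue_token(agent: str, tool: str, tenant: str, ttl_s: int = 300) -> dict:
    claims = {"agent": agent, "tool": tool, "tenant": tenant,
              "expiry": time.time() + ttl_s}
    body = json.dumps(claims, sort_keys=True).encode()
    return {"claims": claims,
            "sig": hmac.new(BROKER_KEY, body, hashlib.sha256).hexdigest()}

def verify_token(token: dict, tool: str) -> bool:
    body = json.dumps(token["claims"], sort_keys=True).encode()
    expected = hmac.new(BROKER_KEY, body, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(token["sig"], expected)
            and token["claims"]["tool"] == tool
            and token["claims"]["expiry"] > time.time())

class AuditChain:
    """Each entry hashes the previous head; tampering breaks every later hash."""
    def __init__(self) -> None:
        self.entries: list[dict] = []
        self.head = "genesis"

    def append(self, event: dict) -> str:
        body = json.dumps({"prev": self.head, "event": event}, sort_keys=True)
        self.head = hashlib.sha256(body.encode()).hexdigest()
        self.entries.append({"event": event, "hash": self.head})
        return self.head  # the hourly anchor commits this value to a separate store
```

A token bound to compose_reply fails verification against any other tool name, which is what keeps a stolen token narrow.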

Step 6: Real-time data control [Ch 13]

Before the retriever's crm_lookup can return any rows, the request goes through the control plane. The PDP checks: is this agent's tenant allowed to see this customer's data, is the data classification compatible with the agent's tier, is there a consent record on file for this purpose, what redactions apply. The decision arrives in under five milliseconds. PII fields (email, phone) are redacted at the retrieval boundary, so the drafter never sees the raw values, only typed placeholders.
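The shape of the decision can be sketched in a few lines. The function below is illustrative only: real checks come from a policy store, and the tier names and redaction list are assumptions, not the chapter 13 implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Decision:
    allow: bool
    redact_fields: tuple[str, ...] = ()

TIERS = {"PUBLIC": 0, "INTERNAL": 1, "CONFIDENTIAL": 2}

def pdp_decide(agent_tenant: str, row_tenant: str, row_classification: str,
               agent_tier: str, consent_on_file: bool) -> Decision:
    """Illustrative policy decision point for a crm_lookup request."""
    if agent_tenant != row_tenant:            # tenant isolation
        return Decision(allow=False)
    if TIERS[row_classification] > TIERS[agent_tier]:  # classification check
        return Decision(allow=False)
    if not consent_on_file:                   # purpose/consent check
        return Decision(allow=False)
    # Allowed, but PII is replaced with typed placeholders at the boundary.
    return Decision(allow=True, redact_fields=("email", "phone"))

def apply_redactions(decision: Decision, row: dict) -> dict:
    return {k: (f"<{k}:redacted>" if k in decision.redact_fields else v)
            for k, v in row.items()}
```

The drafter only ever sees the output of `apply_redactions`, which is why raw emails and phone numbers never reach the model.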

Lineage is tracked through Python contextvars. Every chunk the retriever returns gets logged with its source, its classification tier, and the trace ID. Six months later, when a customer files a Subject Access Request, we can produce a complete record of every time their data was touched, by which agent, under which legal basis. Right-to-erasure is implemented as a multi-store coordinator: when a customer asks for deletion, the coordinator iterates Postgres, the vector index, the warehouse, and the conversation logs, issuing the appropriate primitive in each, and produces a signed certificate. Chapter 13 has the full implementation.
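A minimal contextvars sketch of the lineage mechanism, with an in-memory list standing in for the durable lineage store:

```python
import contextvars

trace_id: contextvars.ContextVar[str] = contextvars.ContextVar("trace_id", default="-")

LINEAGE: list[dict] = []  # stand-in for a durable lineage store

def record_access(source: str, classification: str, customer_id: str) -> None:
    # Every chunk/row access is logged with the current request's trace ID,
    # without threading the ID through every function signature.
    LINEAGE.append({"trace": trace_id.get(), "source": source,
                    "classification": classification, "customer": customer_id})

def handle_request(tid: str) -> None:
    token = trace_id.set(tid)
    try:
        record_access("docs/reset-password.md", "PUBLIC", "cust-42")
        record_access("crm.orders", "CONFIDENTIAL", "cust-42")
    finally:
        trace_id.reset(token)  # never leak the ID into the next request

handle_request("req-001")
# A Subject Access Request is then a filter over the lineage store:
touched = [e for e in LINEAGE if e["customer"] == "cust-42"]
```

Because contextvars are task-local under asyncio, concurrent conversations keep their trace IDs separate without locks.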

Step 7: Predictability layer [Ch 14]

Three predictability components run alongside the agent. First, the planner's intent classifier is wrapped with adaptive conformal prediction (Romano 2020): each input gets a calibrated prediction set; if the set has more than three plausible intents, the system escalates to a human rather than guessing. Second, an HMM is fitted weekly on production action traces; rolling log-likelihood drops trigger investigation. Third, a Mahalanobis OOD detector at the input boundary flags messages that look unlike anything in the training distribution; OOD inputs are routed straight to a human.
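The conformal wrapper reduces to two small functions. A sketch of split conformal prediction under assumed inputs: the calibration scores and intent probabilities below are made up, and a real deployment calibrates on held-out production data:

```python
import math

def conformal_threshold(cal_scores: list, alpha: float = 0.1) -> float:
    """Split-conformal quantile. cal_scores are nonconformity scores,
    here 1 - p(true intent), computed on held-out calibration data."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # conservative rank for coverage 1-alpha
    return sorted(cal_scores)[min(k, n) - 1]

def prediction_set(probs: dict, qhat: float) -> set:
    # Keep every intent whose nonconformity score clears the threshold.
    return {intent for intent, p in probs.items() if 1 - p <= qhat}

# Hypothetical calibration scores and one request's intent probabilities:
qhat = conformal_threshold(
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.35, 0.45])
probs = {"billing": 0.55, "password": 0.30, "cancel": 0.10, "other": 0.05}
intents = prediction_set(probs, qhat)
escalate = len(intents) > 3  # more than three plausible intents → human
```

The guarantee is marginal coverage: across requests, the true intent lands in the set at least 1 - alpha of the time, which is what makes "set too big → escalate" a principled rule rather than a heuristic.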

Combining these: each component emits a p-value. Fisher's method gives a single combined p-value per request. CUSUM accumulates evidence over time so slow drifts do not slip through individual thresholds. The combiner threshold is tuned for an average false-alarm rate of one per agent per day. This stack catches more silent regressions than any dashboard does, because each component covers a different failure mode.
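Both pieces fit in stdlib Python. Fisher's statistic is -2 Σ ln p_i, chi-square with 2k degrees of freedom under the null; since the dof is even, the survival function has a closed form and no scipy is needed. The CUSUM drift and threshold constants below are illustrative, not the tuned production values:

```python
import math

def fisher_combine(p_values: list) -> float:
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi2(2k) under H0.
    Returns the combined p-value P(chi2(2k) > X), closed-form for even dof."""
    x = -2 * sum(math.log(max(p, 1e-300)) for p in p_values)
    k = len(p_values)
    term, series = 1.0, 0.0
    for i in range(k):                # sum_{i=0}^{k-1} (x/2)^i / i!
        series += term
        term *= (x / 2) / (i + 1)
    return math.exp(-x / 2) * series

class Cusum:
    """One-sided CUSUM on -ln(p): accumulates evidence of sustained drift
    that no single request's p-value would trip on its own."""
    def __init__(self, drift: float = 1.0, threshold: float = 8.0) -> None:
        self.s, self.drift, self.threshold = 0.0, drift, threshold

    def update(self, p: float) -> bool:
        self.s = max(0.0, self.s + (-math.log(max(p, 1e-300)) - self.drift))
        return self.s > self.threshold
```

Under H0, -ln(p) has mean 1, so the drift term keeps the statistic near zero on healthy traffic; a run of small p-values pushes it over the threshold even when each one individually looks unremarkable.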

Step 8: Risk model [Ch 15]

Each tool is scored on a 5x5 likelihood-by-impact matrix. doc_search is low-impact, low-likelihood (returns text from public docs); it goes in the Accept band. crm_lookup is medium-impact, medium-likelihood (returns customer data, can be wrong); it goes in the Monitor band, meaning every call is logged in detail. compose_reply is high-impact but has a low likelihood of a bad outcome (the reviewer catches most issues); it goes in the Alert band, meaning each call generates a summary log entry that is reviewed in the next-day batch. Hypothetically, a refund_customer tool would be in the Block band; we do not give the agent this tool at all.
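The matrix-to-band mapping is a lookup. The score cutoffs below are illustrative assumptions; the text fixes only which band each tool lands in, not the cutoff values:

```python
def risk_band(likelihood: int, impact: int) -> str:
    """Map a 5x5 cell (likelihood, impact each in 1..5) to an action band.
    Cutoffs are illustrative, not the chapter 15 calibration."""
    score = likelihood * impact
    if score <= 4:
        return "Accept"    # no special handling
    if score <= 9:
        return "Monitor"   # every call logged in detail
    if score <= 15:
        return "Alert"     # summary log, next-day batch review
    return "Block"         # tool is not granted at all

TOOL_RISK = {
    "doc_search":    risk_band(2, 2),  # low/low → Accept
    "crm_lookup":    risk_band(3, 3),  # medium/medium → Monitor
    "compose_reply": risk_band(2, 5),  # low-likelihood, high-impact → Alert
}
```

The Block band is enforced by omission: a hypothetical refund_customer at likelihood 4, impact 5 would score into Block, so the tool is simply never registered.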

Step 9: Alerting [Ch 16]

Alert tiers are wired through the control plane and the predictability layer. Notice: a single retrieval returned no results. Warn: rolling 1-hour escalation rate exceeds baseline by 30%. Action: the predictability combiner's CUSUM has fired; the agent's behavior has drifted. Page: the audit chain is broken (a hash mismatch on the durable anchor) or PII appeared in a draft after redaction (the egress redactor caught it, but its existence is alarming). Each alert has a runbook. Each runbook has been rehearsed at least once.

Step 10: Evaluation [Ch 17]

Three evaluation tracks. First, a regression suite of 200 sealed conversations from the last quarter, replayed through every release. Pass rate must hold above 95%; below that, the release is blocked. Second, simulated user testing in the style of TAU-bench (Yao 2024): a model-driven simulated customer engages the agent on synthetic scenarios; we measure resolution rate, escalation rate, and correctness. Third, a human review sample of one out of every fifty production conversations, reviewed by support agents on a daily rota; their verdicts feed back into the trust engine and the predictability calibration.

None of these is a published benchmark of the kind the labs use. Those are useful for capability claims; they do not measure whether this agent works for this customer base. Production evaluation is internal, sealed, and continuous.

Step 11: Guardrails [Ch 18]

The reviewer runs five guardrails on every draft before it ships. Length: input bounded, output bounded. Keyword block: an explicit list of words the agent should never produce (account-number patterns, internal-system names that should not appear in customer-facing copy). Allow-list: the only URLs the agent can include in a reply are those from the company docs domain. Output schema: structured replies must validate against a JSON schema. Rate limit: per-customer rate limit on agent replies, enforced at the queue layer to prevent runaway loops.
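Four of the five guardrails reduce to a single pure function over the draft. The docs domain, blocked patterns, and limits below are hypothetical; the output-schema check is omitted here because it would typically delegate to a JSON schema validator:

```python
import re

# Hypothetical allow-listed docs domain and blocked patterns.
ALLOWED_URL = re.compile(r"https://docs\.cenpie\.example/\S*")
BLOCKED = re.compile(r"\b\d{8,12}\b|internal-billing-core", re.I)

def run_guardrails(draft: str, per_customer_count: int,
                   max_len: int = 2000, rate_limit: int = 20) -> list:
    """Return the violated guardrails; an empty list means the draft ships."""
    violations = []
    if not draft or len(draft) > max_len:          # length: bounded both ways
        violations.append("length")
    if BLOCKED.search(draft):                      # keyword block
        violations.append("keyword-block")
    for url in re.findall(r"https?://\S+", draft): # URL allow-list
        if not ALLOWED_URL.match(url):
            violations.append("url-allowlist")
            break
    if per_customer_count >= rate_limit:           # per-customer rate limit
        violations.append("rate-limit")
    return violations
```

Returning the full violation list rather than failing fast matters for the audit log: the reviewer records every guardrail a draft tripped, not just the first.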

Two guardrails run on the retrieval side. Prompt-injection detection: every retrieved chunk is scanned for instruction-like patterns ("ignore previous instructions...", "system:") and quarantined if it matches; this is the Greshake (2023)-style indirect prompt-injection defense from chapter 18. Source authority check: chunks have a provenance score; chunks below a threshold are filtered out of the retrieval set entirely.
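Both retrieval-side guardrails fit in one filter. The pattern list below is a small illustrative sample; a production list is longer and continuously tuned, and the 0.5 authority floor is an assumption:

```python
import re

# Illustrative instruction-like patterns, not a production list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior) instructions", re.I),
    re.compile(r"^\s*system\s*:", re.I | re.M),
    re.compile(r"you are now", re.I),
]

def quarantine_filter(chunks: list, authority_floor: float = 0.5):
    """Drop low-provenance chunks; quarantine instruction-like chunks.
    Only `kept` ever reaches the drafter; `quarantined` goes to review."""
    kept, quarantined = [], []
    for chunk in chunks:
        if chunk["provenance_score"] < authority_floor:
            continue  # source authority check: silently filtered out
        if any(p.search(chunk["text"]) for p in INJECTION_PATTERNS):
            quarantined.append(chunk)  # injection suspect: never trusted as input
            continue
        kept.append(chunk)
    return kept, quarantined
```

Keeping the quarantined chunks (rather than dropping them) preserves the evidence for the audit trail and for tuning the pattern list.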

Step 12: Infrastructure and deployment [Ch 19]

The agent runs as four stateless services behind a queue, deployed via a blue-green pattern. Each service is independently versioned. Rollouts are gradual: 1% traffic for one hour, 10% for two hours, 100% on success. A feature flag at the planner can disable specific intent categories instantly without a deploy. State is in Redis (conversation memory) and Postgres (customer memory and audit chain anchors). The vector index is in a managed service.

Latency budgets per stage: planner 200ms, retriever 400ms (including parallel doc + CRM), drafter 1200ms (the model call), reviewer 300ms (cheap guardrails plus a small validator model), control plane 30ms across the whole loop. Total p99 budget: 2300ms (the stage budgets sum to 2130ms; the remainder covers queue and transport overhead). Above that, the customer notices.

Step 13: Adversarial considerations [Ch 20]

The two adversarial scenarios that matter for this system. Compromised tool: a malicious or hijacked MCP server returns instructions in the response payload. The defense is the prompt-injection detector at the retrieval boundary, plus the source authority check, plus the principle that no tool's output is trusted as control flow. Compromised agent: an attacker who somehow gains the agent's privileges. The defense is per-tool capability tokens with short expiries (minutes, not days), the audit chain (after-the-fact detection), and the privilege broker (which can be revoked centrally).

Multi-agent disagreement is rare in this design because the agents have non-overlapping roles. The reviewer is the only check on the drafter's output; if the reviewer is wrong, the human review sample catches it within a day. We do not run multi-agent debate (Du 2024) for this system because the latency cost is too high for synchronous customer-facing replies.

Putting it together: the request lifecycle

A single customer support request, end to end:

  1. Customer message arrives. Conversation queue picks it up. Trace ID assigned. Conversation memory loaded; customer memory loaded.
  2. Mahalanobis OOD detector scores the message embedding. If the input is OOD, route directly to a human; we are done.
  3. Planner receives message + memory. Produces an intent classification with conformal prediction set. If set size > 3, escalate to human; we are done.
  4. Planner emits a structured plan: which tools, in which order. Plan is logged.
  5. Retriever runs doc_search and crm_lookup in parallel. Each tool call goes through the control plane: PDP decision, classification check, redaction obligations applied. Lineage records every chunk and row touched.
  6. Drafter receives plan + retrieved context. Composes a reply. The model call is the most expensive step; budget 1.2 seconds.
  7. Reviewer runs five guardrails on the draft. Output schema validated. PII redactor sweeps the output one more time as belt-and-braces. Rate limit checked.
  8. If reviewer approves: reply is sent to the customer, audit entry chained, predictability combiner updated, trust counters updated.
  9. If reviewer rejects: draft is logged with reason, the agent's reputation counter takes a hit, the conversation is escalated to a human with the draft attached as a starting point.
  10. After-action: alerting picks up any threshold crosses (escalation rate, latency, audit chain). Daily evaluation sample picks one in fifty conversations for human review. Trust, predictability, and HMM all update overnight.

Thirteen steps, fifteen chapters of this manual, one coherent system. None of the layers is optional once you cross the threshold from prototype to production. The whole point of the manual is that each layer is small enough to ship on its own; the whole point of this chapter is that they only deliver value when they ship together.

What this system does not have

Worth naming, because reading the design above can give the impression that more is always better. This system has no:

This is the system that ships first. Add capabilities only when there is a documented need that the simpler system cannot meet.

The bottom line. A real production agent system uses every chapter of this manual. Architecture, patterns, protocols, parallelism, memory, guidance, trust, control plane, predictability, risk, alerting, evaluation, guardrails, infra, adversarial, the 2026 security frontier, all of them. The reason the manual runs to twenty-eight chapters is not that each one is optional; it is that each one solves a real problem the others cannot. Build the simplest version of every layer first. Add depth where production data tells you to. Ship.