Code that pairs with the manual.
Reading is one thing. Touching the code is another. Everything below is small enough to read in one sitting, well-tested, and tweakable. Pull it down, run the demos, change the knobs, see what breaks. That's how the patterns from the manual stop being abstract and start being yours.
Each framework here implements one core idea from the manual in the smallest honest form. Nothing is a toy; nothing is bloated. The trust engine has the actual Beta-distribution math. The capability tokens are real signed Ed25519 tokens. The audit log is hash-chained. Where the manual says "this is how it works," the code shows how it works.
Companion code, runnable in 30 seconds.
Eleven small, tweakable agent modules that pair with the manual: seven V1 modules covering single-call agent security, plus four V2 modules covering multi-hop delegation security. Ships with a ScriptedLLM (a built-in stub that returns scripted responses, so every demo runs offline without an API key); plug in OpenAI, Anthropic, or Ollama whenever you're ready. 201 pytest cases for the modules, all green; plus 6 workbook exercises.
unzip cenpie-agent-kit.zip && cd cenpie-agent-kit
pip install -r requirements.txt
python cli.py all
What's inside
The kit is structured as seven V1 modules plus four V2 graduations, with a shared CLI, test suite, and a workbook that walks you through implementing the most important pieces yourself. Each module corresponds to a chapter of the manual; reading the chapter and the matching module side by side is the recommended path.
The minimum viable agent: a perceive → decide → act loop with a tool registry, an observer hook for tracing, a hard max_steps cap, and fail-safe handling for unknown or broken tools. Less than 150 lines of code total. Works with any LLM-like object; ships with a deterministic ScriptedLLM so demos run offline.
Tweak this if you want to: swap the ScriptedLLM for OpenAI/Anthropic/Ollama using examples/real_llm_adapter.py; add custom tools by passing any callable; trace each step by passing an observer callback; tighten the max_steps cap for budget control.
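The loop described above can be sketched in a few lines. This is an illustrative reimplementation, not the kit's actual code: the names (`ScriptedLLM`, `run_agent`, the `decide()` shape) mirror the kit's API but are assumptions for the sketch.

```python
class ScriptedLLM:
    """Returns a fixed script of decisions, so the demo runs offline."""
    def __init__(self, script):
        self.script = list(script)

    def decide(self, goal, history):
        return self.script.pop(0) if self.script else {"action": "finish", "answer": "done"}

def run_agent(llm, tools, goal, max_steps=5, observer=None):
    history = []
    for step in range(max_steps):           # hard cap: the loop cannot run away
        decision = llm.decide(goal, history)
        if observer:
            observer(step, decision)        # tracing hook
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools.get(decision["action"])
        if tool is None:                    # fail-safe: unknown tool ends the run
            return f"error: unknown tool {decision['action']}"
        try:
            history.append(tool(decision.get("arg")))
        except Exception as e:              # fail-safe: broken tool ends the run
            return f"error: {e}"
    return "error: max_steps exhausted"

llm = ScriptedLLM([{"action": "echo", "arg": "hi"},
                   {"action": "finish", "answer": "hi!"}])
result = run_agent(llm, {"echo": lambda a: a}, "say hi")
```

Swapping in a real model means replacing `ScriptedLLM` with anything exposing the same `decide()` method.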
The orchestrator pattern: one conductor decomposes the goal, hands subtasks to specialists, and merges their outputs. Specialists never talk to each other directly. Pluggable plan_fn for goal decomposition and merge_fn for synthesis. Failure isolation by default; can be configured to halt on first failure.
Tweak this if you want to: wrap a real tiny_agent.Agent as a Specialist; replace the default fan-out plan with your own decomposition logic; replace the heading-prefixed merge with structured JSON output; add an on_subtask callback for tracing.
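A minimal sketch of the conductor pattern, under the assumption that `plan_fn` maps a goal to (specialist, subtask) pairs and `merge_fn` folds the results; the kit's real signatures may differ.

```python
def orchestrate(goal, specialists, plan_fn=None, merge_fn=None):
    # Default plan: fan the goal out verbatim to every specialist.
    plan = plan_fn(goal, specialists) if plan_fn else [(n, goal) for n in specialists]
    results = {}
    for name, subtask in plan:
        try:
            results[name] = specialists[name](subtask)
        except Exception as e:               # failure isolation: one specialist
            results[name] = f"FAILED: {e}"   # failing does not halt the others
    if merge_fn:
        return merge_fn(results)
    # Default merge: heading-prefixed concatenation.
    return "\n".join(f"## {name}\n{out}" for name, out in results.items())

def summarizer(task):
    return f"summary of {task}"

def critic(task):
    raise RuntimeError("no opinion")

report = orchestrate("review the draft", {"summarizer": summarizer, "critic": critic})
```

Note the specialists never see each other's output; only the conductor merges.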
How agents share what they know without leaking what they shouldn't. Three building blocks: ContextEnvelope (a frozen dataclass with provenance, a four-level classification lattice, TTL, and a derivation chain), negotiate (a five-rule capability handshake that returns a SessionContract or fails fast with a reason), and Compartment (a boundary with one outbound gate and one inbound gate that minimize, redact, and validate envelopes against the contract). About 200 lines total; 37 tests covering the math, the rules, and an end-to-end round trip.
Tweak this if you want to: add custom redaction patterns to the default set; define your own AgentCapabilities profiles; extend the classification lattice with intermediate levels; layer the compartment with the trust engine so token consumption and envelope gating share the same audit log.
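The outbound-gate idea can be sketched as follows. The lattice level names and the redaction pattern here are illustrative assumptions, not the kit's defaults; only the mechanics (a frozen envelope, a classification check against the contract, redaction on the way out) come from the description above.

```python
import re
from dataclasses import dataclass

LATTICE = ["public", "internal", "confidential", "secret"]  # four-level lattice (names assumed)

@dataclass(frozen=True)
class ContextEnvelope:
    payload: str
    classification: str
    source: str       # provenance
    ttl_s: int = 300

def outbound_gate(envelope, max_classification):
    """One outbound gate: refuse anything classified above what the contract
    allows, and redact a known pattern before it leaves the compartment."""
    if LATTICE.index(envelope.classification) > LATTICE.index(max_classification):
        raise PermissionError("classification exceeds contract")
    # Illustrative redaction rule: mask bare 16-digit runs (e.g. card numbers).
    redacted = re.sub(r"\b\d{16}\b", "[REDACTED]", envelope.payload)
    return ContextEnvelope(redacted, envelope.classification,
                           envelope.source, envelope.ttl_s)

env = ContextEnvelope("card 4111111111111111 on file", "internal", "crm")
out = outbound_gate(env, "confidential")

try:
    outbound_gate(ContextEnvelope("x", "secret", "crm"), "confidential")
    blocked = False
except PermissionError:
    blocked = True
```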
The structural description of an agent: its scope (generalist, specialist, or generalist plus RAG), its domain, where its knowledge lives across five anchors (model weights, fine-tune deltas, system prompt, tool catalog, retrieved chunks), and a fingerprint that changes whenever any anchor's version changes. Plus profile_aware_guards(): a helper that adapts the chapter-18 guard configuration to the profile (specialists get strict output schema, RAG profiles get the taint flag for retrieved input, and so on). Now also includes DecodingPolicy with separate temperature settings for tool calls, confirmation, and prose; the kit ships with the recommended mixed-determinism defaults. About 240 lines; 30 tests.
Tweak this if you want to: add custom KnowledgeSource entries (your domain may have more than the default five); extend profile_aware_guards() with your own per-domain keyword block lists; wire the profile fingerprint into your audit-log entries so you can answer "which version of which agent did this".
Three external checks that defend against the agent itself being unreliable about its own state. CapabilityRegistry: capabilities are published by an authority and read by the handshake, ignoring whatever the agent claims. PinnedAsk: the original goal is hashed at session start; later turns must hash to the same value. ToolGate: every tool call is checked against the contract's disclosed-tools set, and rejected calls are recorded for audit. About 400 lines; 32 tests including a full end-to-end defense scenario.
Tweak this if you want to: back the registry with a real persistent store; add your own normalization rules to PinnedAsk for languages with different casing rules; make the tool gate emit audit events to your existing observability pipeline; use acceptance_rate() as a leading indicator for agent drift in monitoring.
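The PinnedAsk mechanic is small enough to show whole. This is a sketch, not the kit's implementation; the normalization rule (lowercase, collapse whitespace) is an assumption, and it is exactly the piece you would extend for other languages.

```python
import hashlib

def normalize(goal: str) -> str:
    # Assumed normalization: case-fold and collapse whitespace before hashing.
    return " ".join(goal.lower().split())

class PinnedAsk:
    """Pin the original goal at session start; later turns must hash the same."""
    def __init__(self, goal: str):
        self.pin = hashlib.sha256(normalize(goal).encode()).hexdigest()

    def check(self, goal: str) -> bool:
        return hashlib.sha256(normalize(goal).encode()).hexdigest() == self.pin

pin = PinnedAsk("Book a flight to Oslo")
ok = pin.check("book a flight  to oslo")        # same goal, different casing/spacing
drifted = pin.check("wire $5,000 to this IBAN") # goal swapped mid-session
```

The point of the external check: the agent's own claim about "what the user asked" is never consulted; only the pinned hash is.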
The technical centerpiece of the kit. Behavior tracking and privilege enforcement built from the ground up: Beta-distributed reputation with exponential time decay, multi-dimensional scoring (accuracy / compliance / efficiency / safety) with per-dimension half-lives, signed Ed25519 capability tokens with replay protection, tamper-evident hash-chained audit log, and Pearson-correlation Sybil cluster detection.
Every formula in the manual's Trust chapter is implemented and tested. The unit suite includes the example values from the manual (100 successes + 5 failures → 101/107 ≈ 0.944) so you can verify the math matches the prose.
Tweak this if you want to: register custom privileges with their own thresholds and TTLs; plug in a real KMS by passing a stable signing key; subclass AuditLog to ship entries to durable storage; tune per-dimension half-lives; use the BetaCounter alone for any reputation-tracking use case.
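The core Beta math is compact enough to verify by hand. This is an illustrative reimplementation rather than the kit's `BetaCounter`: under a Beta(1,1) prior the mean is (s+1)/(s+f+2), which reproduces the manual's 100-successes-plus-5-failures example, and decay multiplies both counts by 0.5^(Δt/half-life).

```python
class BetaCounter:
    """Beta(1,1)-prior reputation with exponential time decay (sketch)."""
    def __init__(self, half_life_days=30.0):
        self.s = 0.0
        self.f = 0.0
        self.half_life = half_life_days

    def decay(self, days):
        w = 0.5 ** (days / self.half_life)  # old evidence fades toward the prior
        self.s *= w
        self.f *= w

    def record(self, success):
        if success:
            self.s += 1
        else:
            self.f += 1

    @property
    def mean(self):
        return (self.s + 1) / (self.s + self.f + 2)

c = BetaCounter()
for _ in range(100):
    c.record(True)
for _ in range(5):
    c.record(False)
score = c.mean   # 101/107 ≈ 0.944, the manual's worked example
c.decay(30)      # one half-life later: counts halve, mean drifts toward 0.5
```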
A layered, default-deny safety pipeline. Five built-in guards covering the common cases: input length cap, keyword and regex blocklist (filters output too), tool allow-list with optional per-role configuration, output schema validation, and per-agent token-bucket rate limiting. Fail-closed semantics: if a guard itself raises, the pipeline blocks rather than passes through.
Tweak this if you want to: subclass Guard for custom checks (PII detection, profanity, internal-policy rules); pass per-role allow-lists for differentiated agent permissions; combine the OutputSchemaGuard with Pydantic or jsonschema for typed outputs; set per-environment rate limits.
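Fail-closed semantics are the subtle part, so here is the shape of the pipeline in miniature. The guard names and the boolean-returning guard interface are assumptions of this sketch, not the kit's exact `Guard` API.

```python
class Blocked(Exception):
    pass

def run_guards(payload, guards):
    """Default-deny pipeline: any guard failure, including a guard *crashing*,
    blocks the payload rather than passing it through."""
    for guard in guards:
        try:
            if not guard(payload):
                raise Blocked(f"blocked by {guard.__name__}")
        except Blocked:
            raise
        except Exception as e:   # fail-closed: a broken guard blocks
            raise Blocked(f"guard {guard.__name__} errored: {e}")
    return payload

def length_cap(p):
    return len(p) <= 100

def keyword_block(p):
    return "DROP TABLE" not in p

def broken_guard(p):
    raise ValueError("oops")     # simulates a guard with a bug

out = run_guards("hello", [length_cap, keyword_block])

try:
    run_guards("DROP TABLE users", [length_cap, keyword_block])
    blocked = False
except Blocked:
    blocked = True

try:
    run_guards("hello", [broken_guard])
    crashed_open = True          # would mean the bug let the payload through
except Blocked:
    crashed_open = False
```

The last case is the one that distinguishes fail-closed from fail-open: an exception inside a guard is treated as a block, never as a pass.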
The V2 graduation
The seven modules above are V1. They teach the patterns that worked for single-call agents through about mid-2025. The kit also ships four V2 modules, each one a graduation from a V1 module, that implement what the field has converged on for multi-hop delegation security in 2026. The V1 modules stay where they are; V2 sits next to them. Read the V1 first; you'll find V2 reads naturally as the next step. Chapter 21 of the manual explains what changed and why.
Chained capability tokens. Each delegation hop appends a signed block to the chain; scope can only narrow as the token moves agent to agent; the whole chain verifies offline using only the original public key. The reference implementation of the four foundations from the CSA blueprint (scope attenuation, token lineage, intent persistence, sensitive-action flagging) in roughly two hundred lines of Python.
Tweak this if you want to: swap the JSON serialization for protobuf; add Datalog policy checks per the AIP draft; back the public-key store with a real registry; add expiry-driven revocation.
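The scope-attenuation rule can be stated in one function. This sketch covers the two scope shapes the kit's tests mention (numeric limits and allow-lists); the key names are illustrative.

```python
def attenuates(parent_scope, child_scope):
    """A delegation hop is valid only if every child grant is covered by the
    parent's: numeric limits may only shrink, allow-lists may only narrow."""
    for key, child_val in child_scope.items():
        if key not in parent_scope:
            return False                          # child cannot invent new scope
        parent_val = parent_scope[key]
        if isinstance(child_val, (int, float)):
            if child_val > parent_val:
                return False                      # numeric scope may only narrow
        else:
            if not set(child_val) <= set(parent_val):
                return False                      # allow-list may only narrow
    return True

root = {"spend_limit": 100, "tools": ["search", "email"]}
ok = attenuates(root, {"spend_limit": 20, "tools": ["search"]})  # narrower: valid
bad = attenuates(root, {"spend_limit": 500})                     # widened: invalid
invented = attenuates(root, {"admin": ["all"]})                  # new grant: invalid
```

In the full design each hop's attenuated scope is signed into the chain, so a verifier replays this check offline for every hop.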
Per-context reputation. Beta counters indexed by (agent_id, task_class, tenant_id, tool_id), with hierarchical fallback when the per-context evidence is thin. An agent's general track record on a task accumulates organically across tenants; a new tenant inherits priors from the agent's existing record until it has built up its own.
Tweak this if you want to: change the propagation weight (default 0.5) for less aggressive rollups; add new context dimensions; persist the store to a database; add a smoothing parameter to the hierarchical fallback.
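Hierarchical fallback can be sketched as a recursive blend. The 0.5 propagation weight comes from the module's default; the thin-evidence threshold (`min_n`) and the exact blending formula `w = n / (n + weight·min_n)` are assumptions of this sketch, and the kit's rule may differ.

```python
def contextual_mean(store, key, propagation_weight=0.5, min_n=10):
    """Per-context Beta(1,1) mean with fallback: a context with thin evidence
    blends its own mean with its parent context's (key minus last dimension)."""
    s, f = store.get(key, (0, 0))
    n = s + f
    own = (s + 1) / (n + 2)
    if n >= min_n or len(key) == 1:
        return own                       # enough evidence, or nowhere to fall back
    parent = contextual_mean(store, key[:-1], propagation_weight, min_n)
    w = n / (n + propagation_weight * min_n)   # more own evidence -> less fallback
    return w * own + (1 - w) * parent

# Contexts keyed hierarchically, e.g. (agent_id, task_class).
store = {
    ("agent-a",): (90, 10),              # thick general record
    ("agent-a", "summarize"): (2, 0),    # thin per-task record
}
general = contextual_mean(store, ("agent-a",))
blended = contextual_mean(store, ("agent-a", "summarize"))
```

With only two observations, the per-task estimate leans mostly on the agent's general record; as per-task evidence accumulates, the blend shifts toward it.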
Per-session DAG audit log. Every entry references both the previous entry's hash (for tamper evidence) and the token jti that authorized this action. Replay-from-token answers the question EU AI Act Article 14 effectively requires: "the user asked for X. Show me everything that happened because of that, across every agent involved."
Tweak this if you want to: back the store with append-only durable storage (Kafka, Postgres + WAL); add a tampering alarm that fires on chain-verification failure; index by session for faster queries; pipe entries to OpenTelemetry.
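The dual-reference structure is the whole trick: each entry carries the previous entry's hash and the authorizing token's `jti`. A minimal sketch (field names assumed):

```python
import hashlib
import json

class AuditLog:
    """Hash-chained log where each entry also records the token jti that
    authorized the action, so a trace can be replayed from any token."""
    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    def append(self, agent, action, jti):
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"agent": agent, "action": action, "jti": jti, "prev": prev}
        body["hash"] = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)

    def verify(self):
        prev = self.GENESIS
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if e["prev"] != prev or e["hash"] != recomputed:
                return False         # chain break: tampering or loss
            prev = e["hash"]
        return True

    def replay(self, jti):
        """Everything that happened under this token, across agents."""
        return [e for e in self.entries if e["jti"] == jti]

log = AuditLog()
log.append("conductor", "plan", jti="tok-1")
log.append("searcher", "web_search", jti="tok-1")
log.append("other", "unrelated", jti="tok-9")
intact = log.verify()
trace = log.replay("tok-1")          # the Article-14-style question, answered

log.entries[1]["action"] = "forged"  # any edit breaks the chain
tampered = not log.verify()
```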
Trust-lattice labels at tool-result boundaries. Every input gets a TrustLevel (UNTRUSTED, SOURCED, INTERNAL, OPERATOR, HUMAN_AUTH). When inputs flow into a tool call, the lowest input wins. A privileged tool refuses to run when one of its arguments came from a web page. Borrowed from operating-system security, where it has been a hard rule since the 1970s.
Tweak this if you want to: add a sanitizer that downgrades trust on transformation; bind tool policies dynamically via decorator; integrate with the chained-token system so a token's sensitive flag bumps the minimum trust level required.
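The lowest-input-wins rule reduces to a `min` over an ordered enum. The level names come from the module description; the gate function itself is an illustrative sketch.

```python
from enum import IntEnum

class TrustLevel(IntEnum):
    UNTRUSTED = 0    # e.g. a fetched web page
    SOURCED = 1
    INTERNAL = 2
    OPERATOR = 3
    HUMAN_AUTH = 4

def call_tool(tool_min_trust, args):
    """Lowest input wins: the effective trust of a call is the minimum over
    its arguments, and a privileged tool refuses anything below its floor."""
    effective = min(level for _, level in args)
    if effective < tool_min_trust:
        raise PermissionError(f"refused: input trust {effective.name} below floor")
    return "ran"

operator_cmd = ("delete old backups", TrustLevel.OPERATOR)
web_chunk = ("page text", TrustLevel.UNTRUSTED)

ran = call_tool(TrustLevel.INTERNAL, [operator_cmd])    # operator-only input: ok
try:
    # One web-sourced argument taints the whole call.
    call_tool(TrustLevel.INTERNAL, [operator_cmd, web_chunk])
    refused = False
except PermissionError:
    refused = True
```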
The workbook and the certificate
Reading code is one thing. Writing it is another. The kit ships with a workbook of six exercises that walk you through implementing the most important pieces yourself. When you finish, you can run a grader and generate a personalized certificate.
The flow:
# 1. Open the workbook. Each file has skeleton functions you fill in.
ls workbook/
# v1_tiny_agent.py v2_chains.py
# v1_reputation.py v2_audit_dag.py
# v1_tokens.py v2_taint.py
# 2. Implement them. Each docstring tells you what the grader checks.
# 3. Run the grader to see your progress.
python -m grader.grade
# 4. When you've passed all V1 exercises, earn the Foundations cert.
# When you've passed all V1 + V2 exercises, earn the Frontier cert.
python earn_certificate.py --name "Your Name" --org "Your Org"
Two tiers. The Foundations certificate is issued when you pass all three V1 exercises (the agent step, Beta reputation, capability tokens). The Frontier certificate is issued when you pass all six exercises (the V2 additions are scope attenuation, audit DAG replay, and the taint-aware tool gate). The Frontier certificate is the full graduation.
The certificate is generated as both an HTML file (printable as PDF from any browser) and a plain text file. It carries a SHA-256 verification hash over your workbook code plus your name, organization, score, and timestamp. Anyone can re-run the grader on the same workbook and recompute the hash. The org field accepts any value, including "Other", "Independent", a school, a startup, a department, a project name. There is no list to match against.
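Verification works because the hash is recomputable from public inputs. The sketch below shows the idea; the kit's exact field ordering and serialization are assumptions here, so only the real grader's output is authoritative.

```python
import hashlib

def certificate_hash(workbook_files, name, org, score, timestamp):
    """SHA-256 over the workbook code plus the certificate fields, so anyone
    with the same workbook can recompute and compare. Illustrative only."""
    h = hashlib.sha256()
    for path, content in sorted(workbook_files.items()):  # stable file order
        h.update(path.encode())
        h.update(content.encode())
    h.update(f"{name}|{org}|{score}|{timestamp}".encode())
    return h.hexdigest()

digest = certificate_hash(
    {"v1_tiny_agent.py": "def step(): ..."},
    "Ada Lovelace", "Independent", "6/6", "2026-01-01",
)
```

Change one byte of the workbook and the digest changes, which is what makes the certificate self-verifying.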
When you get stuck, solutions/ has reference implementations. Try not to peek until you've spent real time on each exercise. The kit's value is in the struggle.
What this kit is good for
This is a teaching kit, not a production framework, and the difference matters. The kit gives you the smallest legitimate version of each pattern, with every meaningful knob exposed. That makes it good at four specific things:
- Learning multi-agent design patterns. Each module maps cleanly to a chapter of the manual. The conductor-and-specialists pattern, the perceive-decide-act loop, the reputation-and-capability trust model, the layered guardrails pipeline. Read the chapter, run the matching code, modify a knob, observe what changes. The patterns are not buried under framework abstractions.
- Rapid prototyping. No API keys required.
ScriptedLLM ships with the kit so every demo runs without an account. python cli.py all exercises every module end-to-end. Swap to OpenAI, Anthropic, or Ollama by importing one of the included adapters. The path from "I have an idea" to "I have a running prototype" is measured in minutes, not days.
- Testing ideas before building real infra. 201 pytest cases ship in tests/; all green. Before you commit to LangGraph or AutoGen for a real project, you can sketch your design in this kit, write tests for the behavior you want, and learn whether the design holds up before the framework lock-in starts.
- Demonstrating concepts in papers, talks, or teaching. Every file is short enough to read on a slide. The integrated example shows the V1 modules composed in seventy lines. Cite the kit in a paper and the reviewer can run it during review.
Exercise: what's this kit missing?
Before you read the answer, look through the source. Open tiny_agent/agent.py, orchestrator/core.py, trust_engine/engine.py, guardrails/pipeline.py. Try to list everything a real production agent system would have that this kit does not. The exercise is the point. Reading code with a question in mind teaches more than reading it with the answer in front of you.
A practical prompt to anchor your reading: if I tried to ship this kit as the backbone of a customer-facing agent at a real company, what would my SRE, my security lead, and my staff engineer each block on? Make a list. Then expand the box below to compare.
Show the answer
The omissions below are deliberate, not oversights. Each one is covered in the manual but kept out of the kit so the kit stays small enough to read in an afternoon. Production-grade versions of each are entire fields of engineering on their own.
What the kit doesn't have, organized by who would object
Your staff engineer would block on the agent layer:
- No dynamic planning. The orchestrator's default planner hands every specialist the goal verbatim. Real systems decompose the goal into subtasks, rank them, sequence them, and re-plan when a step fails. The kit's plan_fn hook is where you would plug a real planner; the default is intentionally trivial.
- No real task decomposition. Same root cause: smart decomposition needs a planner that understands dependencies and can route different subtasks to different specialists. This kit has the seam for it but not the implementation.
- No state graphs. The agent loop is a linear for step in range(self.max_steps). Real systems use state graphs (LangGraph, custom DAGs) so they can represent branching, joining, and resumable execution. The straight-line loop is intentional for clarity; it does not scale to multi-stage pipelines.
- No async coordination. The orchestrator runs specialists sequentially. The comment in core.py literally says "trivially parallelizable with concurrent.futures." Trivially is doing some work in that sentence; once you add real concurrency, you also need cancellation, backpressure, partial-failure handling, and timeout policy. None of which the kit has.
- No long-term memory. The agent loop carries a per-run history list and that is the entire memory model. No conversation memory across runs, no vector store, no episodic memory, no consolidation. Real agents need at least a thin memory layer (Mem0, Zep, or rolled-your-own); the kit does not include one.
- No retries with strategy. A tool error halts the agent immediately. Real systems retry with backoff, fall back to alternative tools, or escalate to a human. The simple halt-on-error semantics are testable but operationally fragile.
- No tool ecosystem. A Tool is name + func + description. There is no MCP integration, no tool registry, no schema-typed arguments, no permissioning per tool, no capability scoping at the tool level. Real production has all of these because tool surface is where most agent failures originate.
Your security lead would block on the trust and safety layer:
- Tokens are signed but identity is not. The trust_engine uses Ed25519 signatures on capability tokens, which is real cryptography. What it lacks is an agent-identity PKI: every agent in this kit is a string name with no attached key pair, no certificate, no revocation list. Production multi-agent systems need signed identities so that "agent X requested this" can be cryptographically verified end to end.
- No adversarial robustness. No prompt-injection detection, no input sanitization beyond the basic guardrails, no defense against indirect injection through tool output. The guardrail pipeline is keyword and schema based; it would not stop a determined attacker.
- No jailbreak resistance. The guardrails check structure (length, allowlists, schemas), not intent. There is no classifier scoring whether an output is attempting to bypass policy, no reasoning-model judge, no constitutional-AI style filter.
- No compliance-grade safety. No data classification, no consent tracking, no right-to-erasure plumbing, no regional residency routing, no signed audit trails crossing service boundaries. The audit log is hash-chained and useful for tamper evidence, but it is not by itself a compliance artifact.
- No real policy engine. The guardrails are a hand-wired pipeline. Production wants OPA/Rego or an equivalent declarative policy layer so that policy changes do not require code changes. The kit does not have that.
Your SRE would block on the operational layer:
- No metrics, no tracing, no observability. The observer callback in tiny_agent.Agent is the entire observability story. No OpenTelemetry, no structured logging, no SLO instrumentation. Adding these is straightforward; the kit just doesn't include them.
- No rate limiting beyond per-agent token buckets. Per-tenant, per-tool, and per-budget limits are missing. So is queueing under load. So is graceful degradation when the LLM provider returns 429.
- No deployment story. The kit assumes you run it as a Python process. Real systems need containers, health checks, blue/green, canaries, secret management, and a control plane to update behavior without redeploying. None of that ships here.
The right way to think about this kit: it is a set of correctly-shaped seams. Every absent feature has a hook where the real version plugs in. plan_fn for planning. memory as a parameter you would add to the agent. observer for telemetry. plug-in adapters for real LLMs. The kit's job is to teach you what the seams are. The production system's job is to fill them in.
The integrated example
The package includes examples/integrated_demo.py, which wires all four frameworks together into a realistic mini-system: input passes through the guardrails pipeline, then to the orchestrator, which delegates to specialists; each specialist's outcome feeds back into the trust engine for reputation tracking, and the output passes through the guardrails one more time before being returned. End-to-end audit trail throughout. This is the pattern most production agent systems converge on, in 70 lines you can read in a minute.
Plugging in a real LLM
The included examples/real_llm_adapter.py ships sketch adapters for OpenAI, Anthropic, and Ollama (local models, no API key needed). Each adapter implements the same LLM.decide() interface the rest of the kit expects, using a JSON-only response format that any reasonable model can follow. Pick the one you want, install its SDK, set the relevant API key (or run Ollama locally), and you're using a real model.
# Replace ScriptedLLM with a real adapter
from tiny_agent import Agent, Tool
from examples.real_llm_adapter import AnthropicLLM
agent = Agent(
llm=AnthropicLLM(model="claude-haiku-4-5-20251001"),
tools=[Tool("search", my_real_search_function)],
)
result = agent.run("find the latest news on agent protocols")
print(result.final)
Test suite
The kit ships with 201 module tests plus a 6-exercise workbook covering every framework. The trust-engine tests in particular verify the math, signature integrity, replay protection, audit-chain tamper detection, and Sybil cluster detection. Run them locally to confirm everything works in your environment:
pytest tests/ -v
Specific things the suite verifies:
- tiny_agent (8 tests): tool execution, error handling, the max_steps cap, observer hook, ScriptedLLM behavior across goal types.
- orchestrator (8 tests): default plan, custom plans, custom merge functions, failure isolation, stop-on-first-failure flag, duplicate-name detection.
- trust_engine (29 tests): Beta math (including the 101/107 ≈ 0.944 example from the manual), credible lower bounds, exponential decay at half-life, signature/audience/subject/expiry/replay/revocation on tokens, audit-chain tamper detection at multiple layers, Sybil clustering, full engine integration.
- guardrails (17 tests): each built-in guard, fail-closed semantics on guard exceptions, per-role allow-lists, regex patterns, output filtering.
- chains (11 tests, V2): scope attenuation across numeric and allowlist scopes, two-hop offline verification, tamper detection on a forged hop, sensitive-flag persistence, provenance summary.
- contextual (8 tests, V2): per-context separation between tasks, per-tenant separation, hierarchical fallback for thin contexts, sample-size honesty, credible bounds for thick child contexts.
- audit_dag (8 tests, V2): hash chain verification, query-by-token, session traces, replay-from-token producing the full structured trace, agent-level queries still working.
- taint (13 tests, V2): trust-lattice ordering, the lowest-input-wins propagation rule, default-deny for unregistered tools, mixed-trust arg blocking, source tracking for audit.
License: use it freely, just point home
The kit is released under the Cenpie Educational License (full text in the LICENSE file inside the zip). Three things to know:
- Use it, change it, ship it. Personal projects, commercial projects, forks, embeddings, repackagings: all permitted. Publish your derivatives under any license you choose.
- Link back to cenpie.com. Anywhere user-facing in your project (README, About page, credits, docs) should carry a visible reference to cenpie.com. Source-only redistribution just needs the LICENSE file kept intact. The included NOTICE file is a ready-made attribution block you can copy.
- It's yours, you own it. Whatever you build with this code is entirely your project. cenpie has no involvement, no warranty, no liability, no review of how you use it. Published for learning. Not a supported product.
Other resources
Beyond the agent kit, here are some pointers if you want to keep going:
- The manual itself. All twenty-eight chapters plus the glossary and references. Start at the Tutorial if you're new, or jump to whichever chapter answers a problem you're hitting today.
- The bibliography. Over 70 papers referenced in the manual, all with arXiv links. The References page groups them by topic so you can dive into the source material for any chapter.
- The Q&A. Common questions, follow-ups, and gotchas that don't fit into individual chapters. Worth scanning even if nothing's broken yet, on the Q&A page.