09 Generalists, specialists & where knowledge lives

Is this agent a jack-of-all-trades or a master of one?

Pick the wrong answer and everything downstream gets harder. A generalist that should have been a specialist hallucinates inside narrow domains. A specialist that should have been a generalist refuses to answer adjacent questions and frustrates the user. A generalist with retrieved documents reads attacker-controlled text into its prompt and gets jailbroken through its own knowledge base. The choice has consequences for how often you update the agent, where you put the guards, what failures you see at three in the morning, and which team owns the page.

This chapter is about that choice. There are three legitimate shapes for an agent in production. They differ in where the knowledge lives, how that knowledge gets refreshed, and which guardrails do real work versus which ones are theatre.

The chapter on memory (chapter 08) covered what the agent stores about each conversation. This chapter covers what the agent intrinsically knows about its domain. Memory is per-session; knowledge is per-deployment. Both matter; mixing them up is the most common cause of "why does our agent know about the customer's last order but not the product they just searched for?"

The three legitimate shapes

Generalist
One broad model handles every task. Knowledge lives in the base model's weights. Cheap in code, expensive in tokens, brittle on narrow domains. The model knows a little about everything; it does not know your business.
Specialist
A model fine-tuned or heavily prompted for a single domain. Knowledge lives partly in weights (if fine-tuned) and partly in the system prompt. Hard to maintain across many domains, but very strong within one. Predictable output shape, narrower attack surface.
Generalist plus RAG
A broad model with retrieved domain documents at request time. Knowledge lives in an external index that is mixed into context. Easiest to update (refresh the index), hardest to defend (retrieved text is attacker-influenceable). Most production systems land here.
Conductor of specialists
A fourth shape that sidesteps the choice: a generalist orchestrator routes each subtask to the specialist best suited for it. Treats the question "which scope?" as "which scope per subtask?" Already covered as the orchestrator pattern in chapter 05.

Most teams reach for "generalist plus RAG" because it sounds like the best of both. It is the right answer for many cases. It is also the shape with the most subtle failure modes, because the line between "the model knows this" and "the retrieval told the model this" disappears once the retrieved text is folded into the prompt. Knowing where the knowledge actually came from for any given output is the hard problem this chapter exists to surface.

The three shapes above are starting points, not endpoints. Real production systems often stack them: a fine-tuned specialist that also retrieves, or a generalist that runs against a few small LoRA adapters loaded on demand. The taxonomy stays useful because each combination still has a primary identity (mostly fine-tune, mostly retrieval, mostly base) that drives the guard configuration. The "Adapter stacking" section below covers the composition rules.

Where knowledge actually lives, in five places

Whatever shape you pick, the agent's knowledge ends up in one of five concrete locations. Naming them is the first step to managing them.

Location | What lives there | Update cadence | Audit story
---|---|---|---
Model weights | everything the base model learned at training time | at the cadence the foundation provider releases new versions (months) | opaque: you can probe it, you cannot list it
Fine-tune deltas | domain-specific behavior added by post-training | per release of your fine-tune (weeks to months) | versioned by adapter id; you own the training data
System prompt | the role description, tone, constraints, examples | per deploy of your code (days to weeks) | versioned by file; lives in source control
Tool catalog | names, schemas, and descriptions of tools the agent may call | per deploy of your code (days to weeks) | declared in the registry from chapter 10
Retrieved context | chunks fetched per request from a vector store or live API | per index refresh (minutes to days) | per request: which chunks, with what scores, from which version of the index

Most production failures happen at the seam between two of these. A fine-tune from last quarter contradicts a fresh retrieval. A system prompt assumes a tool that was renamed in this morning's deploy. A vector index was updated but the agent is still in a session with the old chunks cached. Naming the seams gives you something to put a test around.
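
The cheapest seam to test is the one between the system prompt and the tool catalog. A minimal sketch of such a deploy-time check; the backtick convention and both arguments are stand-ins for whatever your pipeline actually has:

import re

def test_prompt_tools_exist(prompt_text: str, catalog: set[str]) -> None:
    """Fail the deploy if the system prompt names a tool the catalog lacks."""
    referenced = set(re.findall(r"`(\w+)`", prompt_text))  # tools quoted in backticks
    missing = referenced - catalog
    assert not missing, f"prompt references undeployed tools: {sorted(missing)}"

test_prompt_tools_exist(
    "Use `lookup_invoice` before `issue_refund`.",
    {"lookup_invoice", "issue_refund"},
)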

The companion kit defines the structure as AgentProfile with a tuple of KnowledgeAnchor entries, one per source. Each anchor records the identifier, version, last refresh, and owner. The whole profile fingerprints to a short hash that changes whenever any anchor's version changes. Two agents with the same fingerprint are functionally equivalent; if they produce different outputs, the fingerprint tells you exactly which anchor moved.

from agent_profile import (
    AgentProfile, AgentScope,
    KnowledgeSource, KnowledgeAnchor,
)

specialist = AgentProfile(
    agent_id="billing_specialist",
    scope=AgentScope.SPECIALIST,
    domain="billing",
    anchors=(
        KnowledgeAnchor(source=KnowledgeSource.WEIGHTS,
                        identifier="claude-haiku-4-5",
                        version="2026-04-01",
                        last_refresh=1775001600.0,   # 2026-04-01 UTC
                        owner="foundation_provider"),
        KnowledgeAnchor(source=KnowledgeSource.FINETUNE,
                        identifier="billing_corpus",
                        version="v2.3",
                        last_refresh=1775174400.0,   # 2026-04-03 UTC
                        owner="billing_team"),
        KnowledgeAnchor(source=KnowledgeSource.SYSTEM_PROMPT,
                        identifier="billing_v5",
                        version="v5",
                        last_refresh=1775260800.0,   # 2026-04-04 UTC
                        owner="platform_team"),
        KnowledgeAnchor(source=KnowledgeSource.TOOL_CATALOG,
                        identifier="lookup_invoice,issue_refund",
                        version="v3",
                        last_refresh=1775260800.0,   # 2026-04-04 UTC
                        owner="platform_team"),
    ),
    output_shapes=("ticket_summary", "refund_decision"),
)

specialist.assert_consistent()      # raises if the profile is incoherent
print(specialist.fingerprint())  # 16-char hex; changes with any anchor version

Notice the owner field on each anchor. This is the operational point most teams skip on day one and regret on day ninety. The base-model weights are owned by the foundation provider; the fine-tune is owned by the billing team; the system prompt and tool catalog are owned by the platform team. When something breaks, the owner is who you page. When a customer asks "why does the agent think the refund policy is fourteen days?" the owner is who answers.
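
Reading the owner off a profile is one expression against the example above, assuming anchors is exposed as a plain attribute, as the constructor suggests:

# Who gets paged when the fine-tune drifts? Read it off the anchor.
fine_tune = next(a for a in specialist.anchors
                 if a.source == KnowledgeSource.FINETUNE)
print(fine_tune.owner)   # billing_team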

How the choice changes the guards

Chapter 18 covers five guards that wrap every agent: input length cap, keyword and regex blocklist, tool allow-list, output schema validation, and per-agent rate limit. These are the same five guards regardless of profile. What changes is how each one is configured. The wrong configuration looks fine and does nothing.

Guard | Generalist | Specialist | Generalist plus RAG
---|---|---|---
Input length cap | medium ceiling; the prompt is bounded | tighter; specialist prompts are predictable | looser; retrieval adds context
Keyword blocks | broad, conservative defaults | domain-specific (medical agent blocks different patterns than legal) | broad defaults plus patterns over retrieved chunks before they enter the prompt
Tool allow-list | narrow: a generalist can reason its way into any tool | narrowest: only the specialist's domain tools | narrow plus the retrieval tool; never expose write tools to RAG
Output schema | often skipped: output shape varies by query | strict: the output shape is known and enforced | often skipped, but one or two pinned shapes for citations
Rate limit | stricter: broad reasoning can chain expensive calls | looser: specialist work is bounded and predictable | stricter, plus a separate ceiling on retrieval bandwidth

The companion kit's profile_aware_guards() reads an AgentProfile and returns a GuardConfig with these numbers filled in. It is not an answer; it is a starting point that gets the obvious things right. Operators override field by field as their data tells them to.

from agent_profile import profile_aware_guards

cfg_gen  = profile_aware_guards(generalist_profile)
cfg_spec = profile_aware_guards(specialist_profile)
cfg_rag  = profile_aware_guards(rag_profile)

# cfg_gen.enforce_output_schema  ->  False
# cfg_spec.enforce_output_schema ->  True
# cfg_rag.taint_retrieved_input  ->  True   (retrieved chunks are untrusted)
# cfg_gen.requests_per_minute    ->  20     (broad reasoning, slower limit)
# cfg_spec.requests_per_minute   ->  60     (bounded work, can run faster)
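
Overrides are ordinary dataclass work, assuming GuardConfig is a frozen dataclass; if the kit exposes setters instead, the shape of the change is the same:

import dataclasses

# Retrieval logs showed abuse at the default ceiling, so tighten this
# one field; every other number keeps its profile-derived default.
cfg_rag = dataclasses.replace(cfg_rag, requests_per_minute=10)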

The new failure mode RAG introduces

Generalist plus RAG looks like the obvious win. It usually is. It also adds a failure mode that pure generalists and pure specialists do not have: the retrieved content is attacker-influenceable. If your knowledge base ingests customer-uploaded documents, vendor product sheets, or anything from the open web, the retrieved chunks can carry instructions that the model treats as if they came from the system prompt. This is indirect prompt injection (the technique from Greshake 2023) at the knowledge layer, not just the input layer.

The fix has three parts, and all three need to be in place:

1. Scan retrieved chunks with the keyword and pattern guards before they enter the prompt, not just the user's input.
2. Mark every retrieved chunk as tainted, so its provenance travels with it through the context.
3. Gate tools on that taint: an agent acting on retrieved text never gets write tools.

The profile_aware_guards() helper sets taint_retrieved_input=True automatically for any GENERALIST_PLUS_RAG profile. The downstream tool gates then know to enforce the lattice. The taint flag is metadata, not magic; the rest of the system has to honor it. Chapter 21 (The 2026 frontier) covers the lattice.
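
What honoring the flag means at the tool gate, in sketch form; WRITE_TOOLS and the two arguments are hypothetical stand-ins for your own call path:

WRITE_TOOLS = {"issue_refund", "update_account"}   # hypothetical write tools

def gate_tool_call(tool_name: str, context_tainted: bool, cfg) -> None:
    """Block write tools whenever the context carries retrieved text."""
    if cfg.taint_retrieved_input and context_tainted and tool_name in WRITE_TOOLS:
        raise PermissionError(
            f"{tool_name} is a write tool and the context is tainted "
            "by retrieved (attacker-influenceable) text")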

How the choice changes when you update

The update cadence is where the three shapes diverge most in operations. A generalist inherits the foundation provider's release schedule, months at a time. A specialist re-ships on its training cadence, weeks to months. A RAG system refreshes its index in minutes to days without touching the model at all.

The kit's KnowledgeAnchor.staleness_seconds() and profile.stalest_anchor() give you the data to drive this in code. A weekly job that walks the registry and flags any anchor older than its policy ceiling is a five-line script that has saved several teams I know from production drift incidents. The script does not need to be sophisticated; it just needs to exist.
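
In full, assuming a registry that iterates AgentProfile objects and ceilings of your own choosing:

from agent_profile import KnowledgeSource

CEILINGS = {                                  # policy, in seconds
    KnowledgeSource.RETRIEVED: 7 * 86400,     # a week-old index is suspect
    KnowledgeSource.SYSTEM_PROMPT: 90 * 86400,
}

for profile in registry:                      # registry: your AgentProfile store
    for anchor in profile.anchors:
        ceiling = CEILINGS.get(anchor.source)
        if ceiling and anchor.staleness_seconds() > ceiling:
            print(f"STALE {profile.agent_id}/{anchor.identifier}: "
                  f"page {anchor.owner}")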

Adapter stacking: when one agent uses several knowledge sources at once

The cleanest way to read the taxonomy is "one agent, one shape." That holds for most early systems. By 2026, many production systems use stacks that draw from more than one source at the same time. The most common stack is fine-tune plus RAG plus a fixed system prompt: a small LoRA adapter trained on your domain, a vector index for facts that change often, and a system prompt that pins the tone and refuses out-of-scope requests. Three knowledge sources, one inference call.

The kit's AgentProfile already supports this directly. A profile carries a tuple of KnowledgeAnchor entries, one per source, and there is no rule against having a FINETUNE anchor and a RETRIEVED anchor and a SYSTEM_PROMPT anchor on the same agent. The scope field captures which source dominates, so the guard configuration still has something to key off.

stacked = AgentProfile(
    agent_id="support_agent_v3",
    scope=AgentScope.SPECIALIST,        # fine-tune dominates
    domain="customer_support",
    anchors=(
        KnowledgeAnchor(source=KnowledgeSource.WEIGHTS,
                        identifier="claude-haiku-4-5",
                        version="2026-04-01", ...),
        KnowledgeAnchor(source=KnowledgeSource.FINETUNE,        # LoRA adapter
                        identifier="support_lora_v4",
                        version="v4.2", ...),
        KnowledgeAnchor(source=KnowledgeSource.SYSTEM_PROMPT,
                        identifier="support_prompt_v9",
                        version="v9", ...),
        KnowledgeAnchor(source=KnowledgeSource.RETRIEVED,         # live RAG
                        identifier="product_kb_index",
                        version="2026-04-30", ...),
    ),
)

Stacking pays off when the sources update at different cadences. The base model changes every few months. The LoRA adapter changes per training run, weeks to months. The system prompt changes per deploy, days. The retrieved index changes nightly. Pinning the slow-moving knowledge to weights and the fast-moving knowledge to retrieval lets each layer do the work it is good at, instead of forcing one layer to absorb all the volatility.

Three rules that keep stacks honest:

1. One anchor per source, each with its own version and owner. A source that is not in the profile is a source you cannot audit when its content shows up in an output.
2. Declare the dominant source in the scope field. The guard configuration keys off the primary identity, not off the full stack.
3. Keep each fact in one layer, the layer that matches its volatility. A fact duplicated across the fine-tune and the index is a seam waiting to contradict itself.

Shared base versus separate bases

Once you have several specialists in production, a second design question shows up: do they all sit on top of the same base model, or does each have its own base? The choice does not change the taxonomy from earlier in this chapter, but it does change the deploy cost, the update cadence, the failure-mode correlation, and the adversarial blast radius.

Property | Shared base (one model, many adapters) | Separate bases (each specialist its own model)
---|---|---
Deploy cost | One large model loaded once; adapters are megabytes each | One large model per specialist; multiplies the GPU bill
Update cadence | Single base update affects every specialist on it; adapters update independently | Each specialist updates on its own schedule; no shared release train
Failure-mode correlation | A bug in the shared base hits every specialist at once | A bug in one model is contained to that specialist
Adversarial blast radius | A jailbreak that works against the base works against every specialist on it | Each specialist must be jailbroken separately
Cross-specialist consistency | Tone and refusal patterns stay similar across specialists by construction | Specialists drift in style; explicit work is needed to keep them consistent
Auditability of "which model said this" | Easy: there is one base; the adapter id pins the rest | Easy too, but the deploy graph is wider

The shared-base pattern is dominant in 2026 because LoRA-style adapters made it cheap to spin up new specialists without re-paying the base cost. The separate-base pattern is what you reach for when the failure correlation matters more than the deploy savings: regulated domains where one specialist's failure cannot be allowed to cascade to others, or red-team-prone settings where you want jailbreaks contained to one specialist.

The AgentProfile captures the choice in the WEIGHTS anchor. Two specialists with the same WEIGHTS.identifier share a base; two specialists with different identifiers do not. Auditing across the registry tells you which base each agent is sitting on. A simple rule that has saved several teams: alert when more than 80% of registered agents share a single WEIGHTS.identifier, because at that point a single base-model regression takes most of the system down at once.
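
The alert is a counter over the registry; registry is again assumed to iterate AgentProfile objects:

from collections import Counter
from agent_profile import KnowledgeSource

bases = Counter(a.identifier
                for p in registry
                for a in p.anchors
                if a.source == KnowledgeSource.WEIGHTS)
top_base, n = bases.most_common(1)[0]
if n > 0.8 * sum(bases.values()):             # the 80% rule from above
    print(f"ALERT: {n} of {sum(bases.values())} agents share {top_base}")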

How to actually pick

A short decision tree, in the order most teams should walk it:

1. Can the base model already do the job well? If yes, ship a generalist and stop; everything else is added cost.
2. Is the domain narrow and the output shape known in advance? Build a specialist and enforce the schema.
3. Does the knowledge change faster than you can retrain or redeploy? Add retrieval: generalist plus RAG, with the taint guard on.
4. Do requests span several narrow domains? Route with a conductor of specialists, the orchestrator pattern from chapter 05.

Two things that look like profile choices but are not

A couple of decisions that get confused with this one:

Memory scope
What the agent stores about each conversation (chapter 08) is per-session; the profile in this chapter is per-deployment. An agent can have excellent session memory and still know nothing about your domain. That is a memory win and a knowledge gap, and they are fixed in different places.

Tool access
Which tools the agent may call is the registry decision from chapter 10. It bounds what the agent can do, not what it knows. Narrowing the tool allow-list makes a generalist safer; it does not make it a specialist.

What this is not

Practical guidance

The smallest version of this idea fits in a single dataclass with five fields, plus a helper that picks guard numbers from it. The full module is in the companion kit as agent_profile/, with twenty tests that cover the consistency rules and the guard-tuning behavior.
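
Stripped to its skeleton, with the fingerprint, staleness, and consistency machinery left to the kit, the idea fits on one screen. A sketch consistent with the examples above, not the kit's literal source:

from dataclasses import dataclass
from enum import Enum

class AgentScope(Enum):
    GENERALIST = "generalist"
    SPECIALIST = "specialist"
    GENERALIST_PLUS_RAG = "generalist_plus_rag"

@dataclass(frozen=True)
class AgentProfile:
    agent_id: str
    scope: AgentScope
    domain: str
    anchors: tuple                      # KnowledgeAnchor entries, one per source
    output_shapes: tuple = ()

def profile_aware_guards(profile: AgentProfile) -> dict:
    """Starting-point guard numbers keyed off the profile's scope."""
    spec = profile.scope is AgentScope.SPECIALIST
    rag = profile.scope is AgentScope.GENERALIST_PLUS_RAG
    return {
        "enforce_output_schema": spec,              # strict only for specialists
        "taint_retrieved_input": rag,               # retrieved chunks are untrusted
        "requests_per_minute": 60 if spec else 20,  # bounded work runs faster
    }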