Is this agent a jack-of-all-trades or a master of one?
Pick the wrong answer and everything downstream gets harder. A generalist that should have been a specialist hallucinates inside narrow domains. A specialist that should have been a generalist refuses to answer adjacent questions and frustrates the user. A generalist with retrieved documents reads attacker-controlled text into its prompt and gets jailbroken through its own knowledge base. The choice has consequences for how often you update the agent, where you put the guards, what failures you see at three in the morning, and which team owns the page.
This chapter is about that choice. There are three legitimate shapes for an agent in production. They differ in where the knowledge lives, how that knowledge gets refreshed, and which guardrails do real work versus which ones are theatre.
## The three legitimate shapes
A generalist is a frontier model behind a tight system prompt and a small tool catalog; the knowledge lives in the base weights. A specialist is a model fine-tuned on one domain, with an output shape you can enforce; the knowledge lives in the fine-tune deltas. A generalist plus RAG is a frontier model that pulls retrieved context into the prompt on every request; the knowledge lives in the index and refreshes without touching the model. Most teams reach for "generalist plus RAG" because it sounds like the best of both. It is the right answer for many cases. It is also the shape with the most subtle failure modes, because the line between "the model knows this" and "the retrieval told the model this" disappears once the retrieved text is folded into the prompt. Knowing where the knowledge actually came from for any given output is the hard problem this chapter exists to surface.
The three shapes above are starting points, not endpoints. Real production systems often stack them: a fine-tuned specialist that also retrieves, or a generalist that runs with a few small LoRA adapters loaded on demand. The taxonomy stays useful because each combination still has a primary identity (mostly fine-tune, mostly retrieval, mostly base) that drives the guard configuration. The "Adapter stacking" section below covers the composition rules.
## Where knowledge actually lives, in five places
Whatever shape you pick, the agent's knowledge ends up in one of five concrete locations. Naming them is the first step to managing them.
| Location | What lives there | Update cadence | Audit story |
|---|---|---|---|
| Model weights | everything the base model learned at training time | at the cadence the foundation provider releases new versions (months) | opaque: you can probe it, you cannot list it |
| Fine-tune deltas | domain-specific behavior added by post-training | per release of your fine-tune (weeks to months) | versioned by adapter id; you own the training data |
| System prompt | the role description, tone, constraints, examples | per deploy of your code (days to weeks) | versioned by file; lives in source control |
| Tool catalog | names, schemas, and descriptions of tools the agent may call | per deploy of your code (days to weeks) | declared in the registry from chapter 10 |
| Retrieved context | chunks fetched per request from a vector store or live API | per index refresh (minutes to days) | per request: which chunks, with what scores, from which version of the index |
Most production failures happen at the seam between two of these. A fine-tune from last quarter contradicts a fresh retrieval. A system prompt assumes a tool that was renamed in this morning's deploy. A vector index was updated but the agent is still in a session with the old chunks cached. Naming the seams gives you something to put a test around.
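A test around one of those seams can be small. Here is a sketch for the system-prompt/tool-catalog seam; nothing in it is kit API, and `prompt_tool_names` is a hypothetical list you would maintain by hand next to the prompt file:

```python
def check_prompt_tool_seam(prompt_tool_names: set[str],
                           catalog_names: set[str]) -> None:
    """Fail the deploy if the system prompt references a tool
    that this morning's deploy renamed or removed from the catalog."""
    missing = prompt_tool_names - catalog_names
    if missing:
        raise AssertionError(
            f"system prompt references tools missing from the catalog: "
            f"{sorted(missing)}"
        )

# Run at deploy time, before any traffic reaches the agent:
check_prompt_tool_seam({"lookup_invoice", "issue_refund"},
                       {"lookup_invoice", "issue_refund"})
```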
The companion kit defines the structure as AgentProfile with a tuple of KnowledgeAnchor entries, one per source. Each anchor records the identifier, version, last refresh, and owner. The whole profile fingerprints to a short hash that changes whenever any anchor's version changes. Two agents with the same fingerprint are functionally equivalent; if they produce different outputs, the fingerprint tells you exactly which anchor moved.
```python
from agent_profile import (
    AgentProfile, AgentScope,
    KnowledgeSource, KnowledgeAnchor,
)

specialist = AgentProfile(
    agent_id="billing_specialist",
    scope=AgentScope.SPECIALIST,
    domain="billing",
    anchors=(
        # Base weights: owned by the foundation provider, refreshed on
        # their release schedule. last_refresh is epoch seconds.
        KnowledgeAnchor(source=KnowledgeSource.WEIGHTS,
                        identifier="claude-haiku-4-5",
                        version="2026-04-01",
                        last_refresh=1775001600.0,  # 2026-04-01
                        owner="foundation_provider"),
        # Fine-tune deltas: the billing team owns the training data.
        KnowledgeAnchor(source=KnowledgeSource.FINETUNE,
                        identifier="billing_corpus",
                        version="v2.3",
                        last_refresh=1775174400.0,  # 2026-04-03
                        owner="billing_team"),
        # System prompt: versioned by file, lives in source control.
        KnowledgeAnchor(source=KnowledgeSource.SYSTEM_PROMPT,
                        identifier="billing_v5",
                        version="v5",
                        last_refresh=1775260800.0,  # 2026-04-04
                        owner="platform_team"),
        # Tool catalog: declared in the registry from chapter 10.
        KnowledgeAnchor(source=KnowledgeSource.TOOL_CATALOG,
                        identifier="lookup_invoice,issue_refund",
                        version="v3",
                        last_refresh=1775260800.0,  # 2026-04-04
                        owner="platform_team"),
    ),
    output_shapes=("ticket_summary", "refund_decision"),
)

specialist.assert_consistent()   # raises if the profile is incoherent
print(specialist.fingerprint())  # 16-char hex; changes with any anchor version
```
Notice the owner field on each anchor. This is the operational point most teams skip on day one and regret on day ninety. The base-model weights are owned by the foundation provider; the fine-tune is owned by the billing team; the system prompt and tool catalog are owned by the platform team. When something breaks, the owner is who you page. When a customer asks "why does the agent think the refund policy is fourteen days?" the owner is who answers.
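When a fingerprint flips between two builds of the same agent, the question that matters at three in the morning is which anchor moved and whose pager rings. A minimal sketch, assuming only the `AgentProfile` and `KnowledgeAnchor` fields shown above; `find_owner_of_change` is hypothetical, not part of the kit:

```python
def find_owner_of_change(before, after):
    """Compare two profiles of the same agent; return (source, owner)
    pairs for every anchor whose version moved between them."""
    old = {anchor.source: anchor for anchor in before.anchors}
    moved = []
    for anchor in after.anchors:
        prev = old.get(anchor.source)
        if prev is None or prev.version != anchor.version:
            moved.append((anchor.source, anchor.owner))
    return moved

# find_owner_of_change(yesterday, today)
# -> [(KnowledgeSource.FINETUNE, "billing_team")]  # page the billing team
```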
## How the choice changes the guards
Chapter 18 covers five guards that wrap every agent: input length cap, keyword and regex blocklist, tool allow-list, output schema validation, and per-agent rate limit. These are the same five guards regardless of profile. What changes is how each one is configured. The wrong configuration looks fine and does nothing.
| Guard | Generalist | Specialist | Generalist plus RAG |
|---|---|---|---|
| Input length cap | medium ceiling; the prompt is bounded | tighter; specialist prompts are predictable | looser; retrieval adds context |
| Keyword blocks | broad, conservative defaults | domain-specific (medical agent blocks different patterns than legal) | broad defaults plus patterns over retrieved chunks before they enter the prompt |
| Tool allow-list | narrow: a generalist can reason its way into any tool | narrowest: only the specialist's domain tools | narrow plus the retrieval tool; never expose write tools to RAG |
| Output schema | often skipped: output shape varies by query | strict: the output shape is known and enforced | often skipped, but one or two pinned shapes for citations |
| Rate limit | stricter: broad reasoning can chain expensive calls | looser: specialist work is bounded and predictable | stricter, plus a separate ceiling on retrieval bandwidth |
The companion kit's profile_aware_guards() reads an AgentProfile and returns a GuardConfig with these numbers filled in. It is not an answer; it is a starting point that gets the obvious things right. Operators override field by field as their data tells them to.
```python
from agent_profile import profile_aware_guards

cfg_gen = profile_aware_guards(generalist_profile)
cfg_spec = profile_aware_guards(specialist_profile)
cfg_rag = profile_aware_guards(rag_profile)

# cfg_gen.enforce_output_schema  -> False
# cfg_spec.enforce_output_schema -> True
# cfg_rag.taint_retrieved_input  -> True (retrieved chunks are untrusted)
# cfg_gen.requests_per_minute    -> 20 (broad reasoning, slower limit)
# cfg_spec.requests_per_minute   -> 60 (bounded work, can run faster)
```
## The new failure mode RAG introduces
Generalist plus RAG looks like the obvious win. It usually is. It also adds a failure mode that pure generalists and pure specialists do not have: the retrieved content is attacker-influenceable. If your knowledge base ingests customer-uploaded documents, vendor product sheets, or anything from the open web, the retrieved chunks can carry instructions that the model treats as if they came from the system prompt. This is indirect prompt injection (the technique from Greshake 2023) at the knowledge layer, not just the input layer.
The fix has three parts and all three need to be in place:
- Tag retrieved content as untrusted at ingestion. Treat every retrieved chunk as input from the lowest trust level in the lattice (the `UNTRUSTED` level from the taint module in chapter 21). The retrieval pipeline produces `TaintedValue` wrappers; the agent's tool gates check the trust level of every input before they fire.
- Apply guards to retrieved chunks before they reach the prompt. Run the keyword and regex blocklists over the retrieved text the same way you run them over user input. Reject or redact any chunk that contains an instruction pattern. Yes, this slows the retrieval. The slowness is the work.
- Never give a RAG agent write tools. A pure read tool (vector search, document fetch) is fine. The moment you also give the same agent a write tool (send email, modify record, call API), you have given the attacker who controls a single retrievable document a path to that write tool. Split read agents and write agents into different orchestrator branches.
The profile_aware_guards() helper sets taint_retrieved_input=True automatically for any GENERALIST_PLUS_RAG profile. The downstream tool gates then know to enforce the lattice. The taint flag is metadata, not magic; the rest of the system has to honor it. Chapter 21 (The 2026 frontier) covers the lattice.
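What "honoring it" can look like at a tool gate, sketched below. `TaintedValue` here is a stand-in for the chapter 21 wrapper and `gate_write_tool` is hypothetical; the real lattice has more trust levels than one string field:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaintedValue:
    """Stand-in for the chapter 21 wrapper: a value plus its trust level."""
    value: str
    trust: str  # "UNTRUSTED" for anything that came out of retrieval

def gate_write_tool(tool_name: str, args: dict) -> dict:
    """Refuse a write-tool call if any argument carries retrieved content."""
    for name, arg in args.items():
        if isinstance(arg, TaintedValue) and arg.trust == "UNTRUSTED":
            raise PermissionError(
                f"{tool_name}: argument {name!r} carries retrieved content; "
                "write tools never accept UNTRUSTED input"
            )
    # Unwrap the trusted values and let the call proceed.
    return {name: (arg.value if isinstance(arg, TaintedValue) else arg)
            for name, arg in args.items()}
```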
## How the choice changes when you update
The update cadence is where the three shapes diverge most in operations.
- Generalist: updates happen when the foundation provider releases a new model, every few months. Your code does not change; the model swap is one config line. The risk is regression: a new model that scores higher on benchmarks may score lower on your specific cases. Run the evaluation harness from chapter 17 before flipping the switch.
- Specialist: updates happen when you re-fine-tune (weeks to months) or update the system prompt (days). Re-fine-tuning is expensive and slow; system-prompt updates are fast but limited. The risk is drift: the fine-tune is from last quarter and the world has moved on. Schedule re-fine-tuning quarterly even if "nothing has changed," because the things that have changed are the things you will not notice until they fail.
- Generalist plus RAG: updates happen continuously as the index refreshes. The risk is staleness in the model that the retrieval is supposed to compensate for, plus poisoning of the index itself. Run a daily scrub over new chunks before they enter the index, separate from the per-request guards.
The kit's KnowledgeAnchor.staleness_seconds() and profile.stalest_anchor() give you the data to drive this in code. A weekly job that walks the registry and flags any anchor older than its policy ceiling is a five-line script that has saved several teams I know from production drift incidents. The script does not need to be sophisticated; it just needs to exist.
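The shape of that job, sketched under two assumptions: the kit's `staleness_seconds()` works as described above, and you can iterate the registry's profiles. The ceilings are illustrative numbers, not recommendations, and the script walks every anchor rather than only the stalest one because the ceiling differs by source:

```python
from agent_profile import KnowledgeSource

POLICY_CEILINGS = {  # seconds before an anchor counts as stale
    KnowledgeSource.FINETUNE: 90 * 86400,  # quarterly re-fine-tune policy
    KnowledgeSource.RETRIEVED: 2 * 86400,  # the index should refresh within days
}
DEFAULT_CEILING = 30 * 86400

def scan_registry(profiles):
    """Flag every anchor older than its ceiling; the owner field says who to tell."""
    for profile in profiles:
        for anchor in profile.anchors:
            age = anchor.staleness_seconds()
            if age > POLICY_CEILINGS.get(anchor.source, DEFAULT_CEILING):
                print(f"STALE {profile.agent_id}/{anchor.source}: "
                      f"{age / 86400:.0f} days old, owner {anchor.owner}")
```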
## Adapter stacking: when one agent uses several knowledge sources at once
The cleanest way to read the taxonomy is "one agent, one shape." That holds for most early systems. By 2026, many production systems use stacks that draw from more than one source at the same time. The most common stack is fine-tune plus RAG plus a fixed system prompt: a small LoRA adapter trained on your domain, a vector index for facts that change often, and a system prompt that pins the tone and refuses out-of-scope requests. Three knowledge sources, one inference call.
The kit's AgentProfile already supports this directly. A profile carries a tuple of KnowledgeAnchor entries, one per source, and there is no rule against having a FINETUNE anchor and a RETRIEVED anchor and a SYSTEM_PROMPT anchor on the same agent. The scope field captures which source dominates, so the guard configuration still has something to key off.
```python
stacked = AgentProfile(
    agent_id="support_agent_v3",
    scope=AgentScope.SPECIALIST,  # fine-tune dominates
    domain="customer_support",
    anchors=(
        KnowledgeAnchor(source=KnowledgeSource.WEIGHTS,
                        identifier="claude-haiku-4-5",
                        version="2026-04-01", ...),
        KnowledgeAnchor(source=KnowledgeSource.FINETUNE,   # LoRA adapter
                        identifier="support_lora_v4",
                        version="v4.2", ...),
        KnowledgeAnchor(source=KnowledgeSource.SYSTEM_PROMPT,
                        identifier="support_prompt_v9",
                        version="v9", ...),
        KnowledgeAnchor(source=KnowledgeSource.RETRIEVED,  # live RAG index
                        identifier="product_kb_index",
                        version="2026-04-30", ...),
    ),
)
```
Stacking pays off when the sources update at different cadences. The base model changes every few months. The LoRA adapter changes per training run, weeks to months. The system prompt changes per deploy, days. The retrieved index changes nightly. Pinning the slow-moving knowledge to weights and the fast-moving knowledge to retrieval lets each layer do the work it is good at, instead of forcing one layer to absorb all the volatility.
Three rules that keep stacks honest:
- Pick a primary scope and configure the guards from it. A stack that includes a fine-tune is treated as a specialist for guard purposes (strict output schema, tighter rate limits) even when it also retrieves. A stack with no fine-tune but with retrieval is treated as generalist plus RAG (taint flag on retrieved input). The stack does not get to claim both sets of guards; pick the one that matches the dominant source. The sketch after this list codifies this rule.
- If you stack RAG on top of anything, the RAG defenses still apply. Adding a fine-tune does not make retrieved chunks safe. The taint lattice from chapter 21 runs on every retrieved chunk regardless of what else the agent has learned.
- Treat the fingerprint as the contract. A stacked profile fingerprints to a single value that changes whenever any layer changes. If you swap a single LoRA adapter, the fingerprint flips and every downstream consumer can detect the swap. This is the only way to keep audit trails honest as adapters get hot-swapped at runtime.
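The first rule, codified against the kit's enums. `pick_primary_scope` is a sketch, not kit API, and `AgentScope.GENERALIST` as the fallback member name is an assumption:

```python
from agent_profile import AgentScope, KnowledgeSource

def pick_primary_scope(anchors) -> AgentScope:
    """Rule one: the dominant knowledge source picks the guard set."""
    sources = {anchor.source for anchor in anchors}
    if KnowledgeSource.FINETUNE in sources:
        return AgentScope.SPECIALIST            # strict schema, tighter limits
    if KnowledgeSource.RETRIEVED in sources:
        return AgentScope.GENERALIST_PLUS_RAG   # taint flag on retrieved input
    return AgentScope.GENERALIST                # assumed member name
```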
## Shared base versus separate bases
Once you have several specialists in production, a second design question shows up: do they all sit on top of the same base model, or does each have its own base? The choice does not change the taxonomy from earlier in this chapter, but it does change the deploy cost, the update cadence, the failure-mode correlation, and the adversarial blast radius.
| Property | Shared base (one model, many adapters) | Separate bases (each specialist its own model) |
|---|---|---|
| Deploy cost | One large model loaded once; adapters are megabytes each | One large model per specialist; multiplies the GPU bill |
| Update cadence | Single base update affects every specialist on it; adapters update independently | Each specialist updates on its own schedule; no shared release train |
| Failure-mode correlation | A bug in the shared base hits every specialist at once | A bug in one model is contained to that specialist |
| Adversarial blast radius | A jailbreak that works against the base works against every specialist on it | Each specialist must be jailbroken separately |
| Cross-specialist consistency | Tone and refusal patterns stay similar across specialists by construction | Specialists drift in style; explicit work is needed to keep them consistent |
| Auditability of "which model said this" | Easy: there is one base; the adapter id pins the rest | Easy too, but the deploy graph is wider |
The shared-base pattern is dominant in 2026 because LoRA-style adapters made it cheap to spin up new specialists without re-paying the base cost. The separate-base pattern is what you reach for when the failure correlation matters more than the deploy savings: regulated domains where one specialist's failure cannot be allowed to cascade to others, or red-team-prone settings where you want jailbreaks contained to one specialist.
The AgentProfile captures the choice in the WEIGHTS anchor. Two specialists with the same WEIGHTS.identifier share a base; two specialists with different identifiers do not. Auditing across the registry tells you which base every agent is sitting on. A simple rule that has saved several teams: alert when more than 80% of registered agents share a single WEIGHTS.identifier, because at that point a single base-model regression takes most of the system down at once.
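The concentration alert, sketched over an iterable registry. `shared_base_alert` is hypothetical; only the 80% threshold comes from the rule above:

```python
from collections import Counter
from agent_profile import KnowledgeSource

def shared_base_alert(profiles, threshold: float = 0.80) -> None:
    """Warn when too many registered agents sit on one base model."""
    bases = Counter(
        anchor.identifier
        for profile in profiles
        for anchor in profile.anchors
        if anchor.source == KnowledgeSource.WEIGHTS
    )
    total = sum(bases.values())
    for base, count in bases.items():
        if total and count / total > threshold:
            print(f"ALERT: {count}/{total} agents share base {base}; "
                  "one regression there takes most of the fleet down at once")
```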
## How to actually pick
A short decision tree, in the order most teams should walk it:
- Is the task narrow, the output shape predictable, and the volume high? Specialist. The economics work: fine-tune cost is amortized, prompt is small, output schema is enforceable, rate limit is permissive. Examples: ticket triage, invoice extraction, code review on a single language.
- Does the task require knowledge that updates faster than your fine-tune cycle? Generalist plus RAG. The retrieval refreshes daily; the model does not need to. Pay the indirect-injection cost with the three-part defense above. Examples: customer support over a product manual that changes weekly, internal company Q&A, research over a knowledge base.
- Is the task open-ended and the volume low? Generalist. You do not need the complexity. A frontier model with a tight system prompt and a small tool catalog is enough. Examples: executive assistant, code generation, brainstorming.
- Is the task one of several, each with a different shape? Conductor of specialists (chapter 05). Route each subtask to the specialist for that subtask. The conductor itself is a generalist; the specialists underneath are specialists by the rule above.
## Two things that look like profile choices but are not
A couple of decisions that get confused with this one:
- Model size. A small specialist beats a large generalist on the specialist's domain almost every time. A large specialist does not beat a small specialist by much, on the same task. Pick scope first, then pick the smallest model that meets the latency and quality bar at that scope. Doing it the other way around is how teams burn through token budgets.
- Prompt length. Long system prompts do not turn a generalist into a specialist. They make a generalist follow one specialist's instructions for one session. The next session starts over. If the same instructions appear in every session, fine-tune. If they vary by request, retrieve. If they are short and stable, leave them in the prompt. Long, drifty prompts are usually missing fine-tunes.
## What this is not
- It is not a benchmark recommendation. Which generalist or specialist wins on which evaluation depends on the eval, the model version, the prompt, and the time of week. The decision tree above is structural; pick a model only after running it on your own cases.
- It is not a substitute for the trust engine. The profile says what the agent knows. The trust engine (chapter 12) says what the agent is allowed to do. Both are necessary; they answer different questions.
- It is not a substitute for the context exchange chapter. The profile is per-agent; context exchange is per-collaboration. A specialist still needs an envelope, a handshake, and a compartment when it talks to another agent.
## Practical guidance
- Pick scope before you pick a model. The scope is the durable architectural decision. The model under it changes every few months.
- Make the profile a first-class object in your code. Not "we are using GPT-4 with this prompt" in a Slack thread. An `AgentProfile` object that lives in source control, registered in the capability registry from chapter 10, fingerprinted in every audit-log entry.
- Tune guards from the profile, not by hand. The same five guards exist for every shape. The numbers should fall out of the profile, not be set by guesswork. `profile_aware_guards()` is the starting point; you override fields, you do not write the whole config from scratch.
- Do not retrofit profiles onto agents that have been running without them. Tag every new agent with a profile from day one. Agents that were running before the profile system arrived can stay tag-less in a deprecated bucket; trying to back-fill profiles to inferred values causes more confusion than the profiles solve.
- If two agents with the same role have different fingerprints, that is your bug. The fingerprint is supposed to be deterministic across the fleet. If a deploy left some pods on v3 and others on v4, the fingerprint difference is the alarm bell. Wire it into your existing rollout monitoring; one way is sketched below.
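A sketch of that wiring; the `(role, pod_id, fingerprint)` reporting format is an assumption about your infrastructure, not kit API:

```python
from collections import defaultdict

def check_fleet(pods) -> None:
    """pods: iterable of (role, pod_id, fingerprint) tuples reported by
    running instances. A role with more than one fingerprint is a
    half-finished rollout."""
    by_role = defaultdict(set)
    for role, _pod_id, fingerprint in pods:
        by_role[role].add(fingerprint)
    for role, fingerprints in sorted(by_role.items()):
        if len(fingerprints) > 1:
            print(f"SPLIT ROLLOUT {role}: fingerprints {sorted(fingerprints)}")
```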
The companion kit ships the full implementation as `agent_profile/`, with twenty tests that cover the consistency rules and the guard-tuning behavior.