Fine-tuning is the answer that most teams reach for too early.
The previous chapter described three shapes a knowledge-bearing agent can take: generalist, specialist, and generalist plus RAG. The specialist shape almost always involves fine-tuning, and yet most production agent failures around fine-tuning are not about the technique. They are about the decision: teams fine-tune before they have exhausted prompting, then carry the maintenance cost forever. This chapter is the deep dive that the specialist taxonomy hands off to: when fine-tuning is the right answer, what the math actually says about cost and sample size, what the failure modes look like in production, and how to know whether your fine-tune is working.
The decision: when does fine-tuning actually pay off?
Five conditions where fine-tuning is the right answer. You need most of them, not just one. Treat the list as a checklist, not an inspirational poster.
- The task has a stable shape that will not change quarter over quarter. Fine-tuning is expensive to maintain. If the underlying behavior changes every few weeks, you will be re-training constantly and your reputation will reflect a moving target. Good fits: extraction from a fixed schema, classification into a fixed taxonomy, formatting to a fixed style, tone matching to a stable brand voice. Bad fits: anything driven by changing facts (use RAG instead), anything where the rule-set updates monthly (use a system prompt), anything still being defined.
- You have at least a thousand high-quality examples of the desired behavior. Below this number, fine-tuning rarely beats few-shot prompting. The math behind why is in the sample-efficiency section below. The threshold rises with task complexity: a structured-output task needs a thousand; a tone-matching task needs five thousand; an open-ended generation task needs tens of thousands.
- Latency or cost matters more than knowledge freshness. A fine-tuned 8B model can match a prompted 70B model on the trained distribution at a fraction of the inference cost. If your traffic is high enough that the inference savings cover the training and maintenance costs, fine-tuning earns its keep. If you are doing a few hundred calls a day, prompting wins on every axis except possibly tone.
- The cost of being wrong is high enough to justify the audit complexity. A fine-tuned model is harder to debug than a prompted one. The behavior is in the weights, which you cannot read. When something goes wrong, you cannot just edit the prompt. You retrain or you accept the bug. If your domain is high-stakes (financial, legal, medical, regulated), this is acceptable because the audit and regression discipline you will build for fine-tuning is exactly the discipline the regulator wants. If your domain is low-stakes, the audit complexity is overhead you do not need.
- You need format or tone consistency that the prompt cannot reliably enforce. Some kinds of consistency are hard to specify in words. "Sound like our brand" is the canonical example. You cannot reliably get there by listing rules; you have to show the model what good looks like at scale. Fine-tuning is what shows the model.
Five conditions where fine-tuning is not the right answer, even though teams reach for it anyway:
- The task is mostly about facts. Facts go in retrieval. The base model's knowledge is from a training cutoff months before now; your fine-tune's knowledge will be from a training cutoff months before the fine-tune. Both will be stale. Use RAG for facts; reserve fine-tuning for behavior.
- You only have a few hundred examples. Few-shot prompting will do better with the same examples and zero training cost. Test the few-shot baseline first. If it is close to good enough, the fine-tune will not get you much further until you have an order of magnitude more data.
- The behavior you want is well-described by a system prompt. If you can write down what you want in five hundred words and the model follows it ninety percent of the time, fine-tuning will buy you the last ten percent at the cost of substantial maintenance. Often that ten percent is not worth it. Ship the prompted version; collect failure cases; revisit the fine-tune decision in three months when you have data.
- You have not built an evaluation harness yet. Fine-tuning without evaluation is gambling. You do not know whether your fine-tune is better, worse, or different from the base model on the cases that matter. Build the evaluation first; the harness will tell you whether fine-tuning is even worth attempting. Chapter 17 covers this.
- You expect to swap the base model frequently. Fine-tunes are tied to a specific base. When the foundation provider releases a new model, your fine-tune does not transfer; you re-train from scratch. If you intend to track frontier model releases (every two to three months in 2026), the re-training treadmill is brutal. Pick one base, accept that you are pinned to it for the life of the fine-tune, and be honest about the upgrade cost.
What full fine-tuning, LoRA, and DPO actually are, in three paragraphs
Full fine-tuning means updating every parameter in the model on your data. You start from the base model's weights, run gradient descent on your training set, and end with a new full set of weights. This is the most expressive option (the model can change in any direction) and the most expensive (you need GPU memory for the full optimizer state, which is roughly four times the model size, plus the gradients, plus the activations). For a 70B model, full fine-tuning is infeasible for most teams; for an 8B model, it is feasible if you have an H100 or two and a few thousand dollars of compute budget.
LoRA (Low-Rank Adaptation) adds small trainable matrices to specific layers of a frozen base model. Instead of updating all 70 billion parameters, you update perhaps 50 million parameters in narrow rank-r matrices that get added to the frozen weights at inference time. Trainable parameter count drops by roughly 99 percent; memory drops by a similar fraction. Inference quality on the trained distribution is typically within a few points of full fine-tuning when r is chosen well (r = 16 to 64 covers most cases). LoRA is the default for production fine-tuning in 2026 because it makes specialist agents economically viable: spinning up a new specialist costs hundreds of dollars, not tens of thousands.
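To make the parameter counts concrete, here is a minimal LoRA setup using the Hugging Face peft library. It is a sketch: the base model id is a placeholder for whatever base you are pinned to, and the rank, alpha, and target modules are illustrative defaults rather than recommendations for any particular task.

```python
# Minimal LoRA sketch using Hugging Face transformers and peft. The base model id,
# rank, and target modules are illustrative assumptions, not prescriptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # placeholder base

config = LoraConfig(
    r=16,                      # adapter rank; 16 to 64 covers most production cases
    lora_alpha=32,             # scaling factor, conventionally about 2x the rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
# Reports on the order of tens of millions of trainable parameters against ~8B total,
# i.e. well under 1 percent of the base model.
```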
DPO (Direct Preference Optimization) answers a different question entirely. Full fine-tuning and LoRA train the model to imitate examples; DPO trains the model to prefer one output over another given the same input. You provide pairs (chosen, rejected) for the same prompt, and the model learns to assign higher probability to the chosen one. This is what you reach for when you have human feedback rather than ground-truth labels: "the model said both A and B; humans preferred A; train the model to prefer A too." DPO is more stable than the older PPO-plus-reward-model approach because there is no separately trained reward model to be exploited; the loss directly optimizes the implicit reward. You can apply DPO on top of a LoRA-tuned base, which is the standard production stack for high-quality specialist agents.
Compute math: what does a fine-tune actually cost?
The honest cost of fine-tuning is the part most decision documents skip. Two numbers tell you whether the project is feasible at all.
Training compute, in FLOPs: a single forward-and-backward pass over a model with P parameters processing D tokens takes approximately 6 · P · D floating-point operations. The 6 is the standard rule of thumb: 2PD for the forward pass plus 4PD for the backward pass. For one epoch over 10 million tokens of training data on an 8B parameter model: 6 · 8e9 · 1e7 = 4.8e17 FLOPs, or 480 PFLOPs. An H100 GPU at 60 percent utilization delivers about 600 TFLOP/s sustained, so this run takes 480e15 / 600e12 = 800 seconds, roughly 13 minutes per epoch. Most fine-tunes run 2 to 4 epochs, so the wall-clock training time is under an hour. At cloud GPU rental rates (about $3 to $5 per hour per H100 in 2026), the compute cost is on the order of tens of dollars for a small specialist.
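The same arithmetic as a script you can rerun with your own numbers. The sustained throughput, epoch count, and hourly rate are assumptions; swap in your hardware's figures.

```python
# Back-of-envelope training cost using the 6 * P * D rule of thumb from the text.
# Sustained throughput and GPU price are assumptions; adjust for your hardware.
params = 8e9              # model parameters (8B)
tokens_per_epoch = 1e7    # 10M training tokens
epochs = 3
sustained_flops = 600e12  # H100 at roughly 60 percent utilization
price_per_hour = 4.0      # assumed cloud rate per H100

flops_per_epoch = 6 * params * tokens_per_epoch        # ~4.8e17 FLOPs
seconds_per_epoch = flops_per_epoch / sustained_flops  # ~800 s, about 13 minutes
gpu_hours = seconds_per_epoch * epochs / 3600
print(f"{seconds_per_epoch / 60:.0f} min/epoch, {gpu_hours:.2f} GPU-hours, "
      f"~${gpu_hours * price_per_hour:.0f} of raw compute per run")
```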
Memory cost: the constraint that actually kills full fine-tuning. The optimizer state for AdamW takes 8 bytes per parameter (two moment tensors, fp32). Gradients take another 4 bytes per parameter. Model weights take 2 bytes per parameter (bf16). Activations take whatever the batch size and sequence length demand, often the largest term. Total memory: roughly 14 · P + activations. For 8B parameters: 112 GB just for weights, gradients, and optimizer state, before activations. This does not fit on a single H100 (80 GB) without tricks like sharding or optimizer offloading; for a 70B model, the same math gives roughly 1 TB, which does not fit even across eight H100s (640 GB) without offloading.
LoRA changes the memory math dramatically. The base weights are frozen (so no gradient or optimizer state is needed for them; just the 2 bytes per parameter for inference). Only the LoRA parameters need optimizer state. For r=16 LoRA on an 8B model with adapters on attention projections only, trainable parameter count is around 20 million; their optimizer state is roughly 160 MB at 8 bytes per parameter; gradient memory is 80 MB. Total memory budget: 16 GB for weights plus a few hundred MB for everything else. An 8B fine-tune that is infeasible at full precision on a 24 GB consumer GPU is comfortable as LoRA on the same hardware. This is why LoRA dominates production fine-tuning.
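The memory comparison, written out as a sketch with the per-parameter byte counts above; activations are deliberately left out because they depend on batch size and sequence length.

```python
# Rough memory budgets for full fine-tuning versus LoRA, using the per-parameter
# byte counts from the text. Activations are omitted; they depend on batch size
# and sequence length and are often the largest single term.
def full_finetune_gb(params):
    weights = 2 * params   # bf16 weights
    grads = 4 * params     # gradients
    optim = 8 * params     # AdamW moment tensors, fp32
    return (weights + grads + optim) / 1e9

def lora_gb(params, trainable):
    frozen = 2 * params                # frozen base weights: no grads, no optimizer state
    adapter = (2 + 4 + 8) * trainable  # same byte counts, but only for the adapters
    return (frozen + adapter) / 1e9

print(f"full fine-tune, 8B: ~{full_finetune_gb(8e9):.0f} GB before activations")  # ~112 GB
print(f"LoRA r=16, 8B:      ~{lora_gb(8e9, 20e6):.1f} GB before activations")     # ~16 GB
```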
Sample efficiency: how many examples do you actually need?
The empirical scaling for fine-tuning is roughly error ∝ D^(-α), where D is training set size and α depends on the task. For most structured-output tasks, α falls between 0.3 and 0.5. Plugging in: doubling your dataset reduces error by 20 to 30 percent, not 50 percent. This is why fine-tuning has steep diminishing returns past a certain point.
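Worked out, the power law looks like this; the α values are the range quoted above, not measurements from your task.

```python
# What error ∝ D^(-alpha) implies for doubling the training set, over the alpha
# range quoted in the text. These are rule-of-thumb values, not measurements.
for alpha in (0.3, 0.4, 0.5):
    remaining = 2 ** (-alpha)  # error after doubling D, relative to before
    print(f"alpha={alpha}: doubling data cuts error by ~{(1 - remaining) * 100:.0f}%")
# alpha=0.3 -> ~19%, alpha=0.5 -> ~29%: nowhere near halving the error
```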
Concrete reference points from the literature for production fine-tuning, with realistic ranges rather than promises:
| Task type | Useful threshold | Onset of diminishing returns | What more data buys you |
|---|---|---|---|
| Classification (small label set) | 500 to 2,000 | 10,000 to 50,000 | Coverage of edge cases; better calibration |
| Structured extraction (fixed schema) | 1,000 to 5,000 | 20,000 to 100,000 | Robustness to malformed input; rare-field accuracy |
| Tone or style matching | 5,000 to 20,000 | 50,000 to 200,000 | Stylistic consistency on out-of-distribution prompts |
| Open-ended generation | 50,000 plus | 500,000 plus | Genuine quality lift, not just style match |
| DPO preference pairs | 1,000 to 5,000 pairs | 20,000 plus pairs | Sharper preference signal on contested cases |
Two implications most teams miss. First, the quality multiplier dominates the quantity multiplier. A thousand carefully curated examples beat ten thousand lightly cleaned examples on every metric that matters. The reason is that fine-tuning is gradient descent on whatever the labels say is correct; if the labels are wrong, you are training the model to be wrong in a consistent direction. Spend the first week of any fine-tuning project on data quality before you spend the first hour on training. Second, diversity matters more than volume past the threshold. Once you have your thousand examples covering the common cases, the next thousand should cover edge cases, failure modes, and unusual phrasings, not more of the same.
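One curation step that consistently pays for itself is removing near-duplicates before training. A minimal sketch, assuming each example is a dict with prompt and completion fields; production pipelines usually go further with MinHash or embedding similarity.

```python
# Minimal near-duplicate filter: normalize text aggressively and drop examples whose
# normalized form has already been seen. Assumes each example is a dict with
# "prompt" and "completion" keys; a real pipeline would add MinHash or embeddings.
import hashlib
import re

def normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)   # strip punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def dedupe(examples):
    seen, kept = set(), []
    for ex in examples:
        key = hashlib.sha256(normalize(ex["prompt"] + " " + ex["completion"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```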
Catastrophic forgetting: why fine-tuning makes the model worse at things you didn't train on
The phenomenon: a model fine-tuned on customer support tickets gets better at customer support tickets and worse at writing Python code, even though no one wanted it to be worse at code. This is not a bug; it is what cross-entropy loss does. When you optimize the model's probability mass on training tokens, you are necessarily reducing the probability mass on other tokens. The probability distribution is a budget; spending it on your domain means less for everything else.
The math, simplified to one paragraph. The fine-tuning loss is cross-entropy on training examples: L = -E[log pθ(y | x)] for (x, y) drawn from your training distribution. The gradient pushes pθ(y | x) up. Because probabilities sum to one and the parameters are shared across all contexts, pushing mass toward your training outputs pulls mass away from outputs the training set never shows. The KL divergence from the base model's distribution grows as training proceeds; that growth is the formal description of forgetting. The longer you train and the higher the learning rate, the further the fine-tuned model drifts from the base distribution, and the more general capability is lost.
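If you want to watch that drift directly rather than infer it from benchmark drops, one option is to track the KL divergence between the base and fine-tuned models' next-token distributions on a fixed probe set. A sketch, assuming PyTorch and Hugging Face transformers; the probe texts are whatever general-purpose prompts you choose to hold fixed.

```python
# Sketch of a forgetting monitor: average per-token KL divergence between the base
# model's and the fine-tuned model's next-token distributions over a fixed probe set.
# Rising KL across checkpoints is the quantitative signature of drift from the base.
# Assumes PyTorch plus Hugging Face transformers models; probe_texts is your own list.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_kl_from_base(base_model, tuned_model, tokenizer, probe_texts, device="cuda"):
    kls = []
    for text in probe_texts:
        ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
        base_logp = F.log_softmax(base_model(ids).logits, dim=-1)   # [1, seq, vocab]
        tuned_logp = F.log_softmax(tuned_model(ids).logits, dim=-1)
        # KL(base || tuned) summed over the vocabulary, averaged over positions
        kl = (base_logp.exp() * (base_logp - tuned_logp)).sum(-1).mean()
        kls.append(kl.item())
    return sum(kls) / len(kls)
```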
Three production-tested mitigations, in order of how often they are used:
- Mix general data into your training set, around 10 to 30 percent. If your domain dataset is 100,000 customer support tickets, add 20,000 examples from a general instruction-following dataset (FLAN, Alpaca, ShareGPT); a minimal mixing sketch follows this list. The model trains on both, so general capability is preserved. Cost: training takes proportionally longer (about 20 percent in this example); specialization may drop one or two points on the trained metric. Benefit: your fine-tune is no worse at unrelated tasks than the base model.
- Use a small learning rate and stop early. Higher learning rates cause faster KL divergence from the base. A learning rate of 1e-5 to 5e-5 (lower than the typical 1e-4 used for full fine-tuning of small models) preserves more of the base distribution. Combine this with early stopping on a held-out general capability eval (run MMLU or BBH every few hundred steps; stop when general capability drops more than two points).
- Use LoRA with low rank. Lower-rank adapters are mathematically less capable of distorting the base distribution. r=8 forgets less than r=64. The trade-off is that lower-rank adapters also learn less of the target distribution. Pick r=16 as a reasonable starting point; tune up if specialization is insufficient, tune down if forgetting is unacceptable.
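A minimal sketch of the first mitigation, mixing a slice of general instruction data into the domain set; the 20 percent target and the variable names are illustrative assumptions.

```python
# Data-mixing sketch for the first mitigation above: blend general instruction data
# into the domain set so the fine-tune trains on both. The 20 percent target and
# the variable names are illustrative assumptions.
import random

def mix_datasets(domain_examples, general_examples, general_fraction=0.2, seed=0):
    rng = random.Random(seed)
    # How many general examples are needed so they make up general_fraction of the mix.
    n_general = int(len(domain_examples) * general_fraction / (1 - general_fraction))
    general_slice = rng.sample(general_examples, min(n_general, len(general_examples)))
    mixed = domain_examples + general_slice
    rng.shuffle(mixed)
    return mixed
# 100,000 domain tickets with general_fraction=0.2 pulls in ~25,000 general examples.
```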
The DPO loss, derived to make sense of it
DPO is the single most useful preference-learning technique for production specialist agents, because it is more stable than PPO-with-reward-model and just as effective. The loss looks intimidating; the intuition is simple.
You have a dataset of preference pairs: (x, yw, yl) where x is the prompt, yw is the winning response, yl is the losing response. The DPO loss is:
LDPO = -E[ log σ( β · log(πθ(yw|x) / πref(yw|x)) - β · log(πθ(yl|x) / πref(yl|x)) ) ]
Read it from inside out. πθ(y|x) is the probability your fine-tuned model assigns to response y given prompt x. πref(y|x) is the probability the reference model (usually the base, before DPO training) assigns. The ratio πθ/πref measures how much more or less your model likes a response compared to the reference. The log of that ratio is sometimes called the implicit reward.
The loss says: maximize the gap between the implicit reward of the winner and the implicit reward of the loser, then squash through a sigmoid. The sigmoid means once the gap is large enough, additional gradient signal is small (you stop pushing on cases the model already gets right). The β parameter (typically 0.1 to 0.5) controls how much you regularize toward the reference: high β keeps you close to the base; low β lets you drift further from the base toward the preferences.
Why this is more stable than reward-model-plus-PPO: there is no separately trained reward model to overfit, no distribution shift between the reward model and the policy, and no reward hacking through finding inputs the reward model misclassifies. The reference policy πref is fixed; the regularization is built into the loss; there are fewer moving parts to break. In production, DPO is what you reach for when you have human preferences and you do not have a research team on standby to babysit a PPO run.
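The core of the loss is only a few lines once the per-response log-probabilities are in hand. A minimal sketch in PyTorch, assuming you have already summed the token log-probabilities of each response under the policy and the frozen reference model; in production you would more likely reach for an existing trainer such as trl's DPOTrainer than hand-roll this.

```python
# Minimal DPO loss, assuming summed token log-probabilities of each response under
# the policy (pi_theta) and the frozen reference model (pi_ref). Each tensor is [batch].
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit rewards: how much more the policy likes each response than the reference does.
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Maximize the winner-loser gap, squashed through a sigmoid (log-sigmoid for stability).
    return -F.logsigmoid(reward_w - reward_l).mean()
```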
Knowing when to stop: the eval-train gap
If your training loss is dropping but your held-out evaluation loss is rising, you are overfitting. The gap between training loss and evaluation loss is the cleanest single signal of overfitting. A healthy fine-tune typically shows train loss and eval loss within 10 to 20 percent of each other in relative terms. Beyond that, you are memorizing your training set, and the model will fail on inputs that look only slightly different.
Three checkpoints worth recording during every fine-tune run, evaluated on a held-out set you never trained on:
- Domain accuracy: the metric you actually care about, on the domain test set. This is what you are optimizing for; if it is not improving, stop training.
- General capability: a small slice of MMLU, BBH, or your own general-purpose eval. This catches catastrophic forgetting before it ships. If general capability drops more than two points from the base, you have over-specialized.
- Output diversity: a measure of how varied the model's outputs are on a fixed prompt set with non-zero temperature. Mode collapse (the model giving the same output every time) is a fine-tuning failure mode that does not show up in accuracy. Track entropy or unique-n-gram counts; alert if they drop sharply.
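The diversity check in the last item is easy to automate with distinct-n-gram ratios over repeated samples on a fixed prompt set. A sketch; the drop threshold is an illustrative assumption, not a calibrated value.

```python
# Output-diversity check: generate several samples per prompt at non-zero temperature
# and measure the distinct-n-gram ratio across them. A sharp drop between checkpoints
# suggests mode collapse. The drop threshold is an illustrative assumption.
def distinct_ngram_ratio(outputs, n=3):
    total, unique = 0, set()
    for text in outputs:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

def diversity_dropped(previous_ratio, current_ratio, drop_threshold=0.2):
    return (previous_ratio - current_ratio) > drop_threshold
```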
The discipline that separates teams that ship good fine-tunes from teams that do not: splitting your golden set into training, validation, and test before any data prep. Train on training. Tune hyperparameters on validation. Report final numbers on test, and never look at test until you are committing to ship. Teams that look at test scores during development unconsciously tune to test, and the production behavior matches development numbers minus the tuning advantage. This is the same discipline as classical ML; it just gets skipped more often in LLM fine-tuning because the dataset preparation is informal.
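One way to make the split decision before data prep and keep it stable is to key the assignment on a hash of the example itself. A sketch; the 80/10/10 proportions are an assumption.

```python
# Deterministic train/validation/test assignment keyed on a hash of the example text,
# so the split is decided before any data prep and stays stable across re-runs.
# The 80/10/10 proportions are an illustrative assumption.
import hashlib

def split_of(example_text: str) -> str:
    bucket = int(hashlib.sha256(example_text.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "validation"
    return "test"
```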
The four failure modes specific to fine-tuning
Beyond the general overfitting story, fine-tuned models exhibit four characteristic failure modes that prompt-engineered systems do not. Recognize them by name; each has a specific fix.
- Stylistic collapse. The model produces outputs that look superficially like the training data but lose semantic substance. Common when training data is heavy on tone and light on content. Symptoms: outputs always start the same way, always include certain phrases, always end the same way. Fix: increase data diversity; cut training short; reduce learning rate.
- Mode collapse. The model gives nearly identical outputs across distinct prompts. Common when DPO is run too long with too low a beta (weak regularization toward the reference). Symptoms: low entropy in sampled outputs; the model "forgot" alternatives. Fix: stop DPO earlier; raise beta; verify preference pairs are actually diverse.
- Parroting. The model echoes training examples nearly verbatim when prompts resemble training prompts. A specific form of overfitting that classification metrics miss because correctness on the train distribution is high. Symptoms: outputs that contain phrases unique to training data on novel inputs. Fix: data deduplication (remove near-duplicates from training); regularization; larger and more diverse training set.
- Base-knowledge erosion. The catastrophic forgetting case from earlier in this chapter. Symptoms: fine-tuned model is worse than the base on tasks unrelated to training. Fix: mix in 10 to 30 percent general data; lower learning rate; LoRA with low rank.
Fine-tuning as a defense: the Jatmo idea
The guardrails chapter mentioned Jatmo (Piet et al., ESORICS 2024) in passing as one of the layered prompt-injection defenses. The full picture belongs here. Jatmo's insight is that fine-tuning a non-instruction-tuned base model on synthetic task-specific data produces a model that cannot follow injected instructions, because following arbitrary instructions is the very capability that was never trained in. The base model never learned "follow whatever instruction is in the prompt"; it learned only the specific task you fine-tuned it on. An attacker writing "ignore previous instructions and..." gets ignored not because of a guardrail but because the model has no idea what to do with the instruction.
Reported numbers: best attacks succeed in less than 0.5 percent of cases against Jatmo-style fine-tunes, compared to 87 percent against GPT-3.5-Turbo in the same evaluation. The trade-off is that the fine-tuned model can do exactly one task; you cannot ask it to do something else. For a sufficiently narrow task that is exactly what you want.
This connects fine-tuning to the deny-by-default principle from chapter 21. Most prompt-injection defenses try to detect attacks at the input layer; Jatmo prevents the attack from being executable at all by removing the capability the attack relies on. The cost is generality (the model can only do one thing); the benefit is structural security (no detection step is required). For a narrow specialist agent in a high-stakes domain, this is often the right trade.
Tying it back: how a fine-tuned agent fits the manual
A fine-tuned agent has the same loop, the same task ingestion, the same environmental bootstrap, the same protocols as any other agent. What changes is the knowledge anchor and the reputation slice.
- Knowledge anchor. The agent's profile lists the fine-tune as a separate FINETUNE anchor distinct from the base model anchor. Both are recorded; both versions are part of the fingerprint. When the fine-tune is updated (a new LoRA adapter is trained, a new DPO pass is run), the fingerprint flips and downstream consumers see a new agent.
- Provenance. The fine-tune's anchor should record more than just a version string. It should record the training-data fingerprint (a hash of the dataset), the training configuration hash (learning rate, epochs, batch size, beta if DPO), and the eval results at the checkpoint that was shipped; a minimal record sketch follows this list. When something goes wrong six months later, this is what tells you whether the bug is in the fine-tune or somewhere else.
- Reputation. The trust engine treats the fine-tuned agent as a separate slice for reputation purposes (the (configuration_fingerprint, tenant_id, task_class) key from chapter 12 includes the fingerprint, so any fine-tune update starts the reputation accumulation fresh). This is correct: a new fine-tune is functionally a new agent, even if it shares the same role.
- Guards. Specialists configured by the profile-aware guards from chapter 09 already reach for stricter output schemas and tighter rate limits. The fine-tuned specialist also benefits from the structural injection resistance that Jatmo provides: the fine-tune itself becomes part of the defense stack, not an addition to it.
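A sketch of what the provenance record from the second item above could look like as a structured artifact shipped next to the adapter weights; the field names, file paths, and numbers are illustrative, not a schema this manual prescribes.

```python
# Sketch of a fine-tune provenance record stored alongside the adapter weights.
# Field names, file paths, and eval numbers are illustrative; the point is that the
# dataset, the training configuration, and the shipped eval results are all recorded.
import hashlib
import json
from dataclasses import dataclass, asdict

def file_sha256(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

@dataclass
class FinetuneProvenance:
    base_model: str        # exact base checkpoint the adapter is tied to
    dataset_sha256: str    # fingerprint of the training data
    config_sha256: str     # hash of the training config (lr, epochs, batch size, beta if DPO)
    eval_results: dict     # domain accuracy, general capability, diversity at ship time

record = FinetuneProvenance(
    base_model="example-base-8b@2026-01-15",     # hypothetical version string
    dataset_sha256=file_sha256("train.jsonl"),   # assumed file paths
    config_sha256=file_sha256("train_config.yaml"),
    eval_results={"domain_acc": 0.91, "general_eval": 0.62, "distinct_3gram": 0.43},  # illustrative
)
print(json.dumps(asdict(record), indent=2))
```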
Practical guidance
- Build the evaluation harness first. Before any fine-tuning, write the test set. Train, validation, test split. Run the base model through the test set; that is your baseline. Without this, you cannot honestly answer whether the fine-tune helped.
- Start with a thousand high-quality examples, not ten thousand mediocre ones. Curate, dedupe, manually inspect. Bad data is the most common cause of fine-tuning failure, and it is invisible from the loss curves alone.
- Use LoRA for the first attempt at any specialist. r=16, learning rate 1e-4, two to four epochs. This is the fastest way to know whether fine-tuning is going to help your task at all. If it does, you can tune up; if it does not, you have only spent an afternoon and a few dollars.
- Mix in 10 to 30 percent general data to prevent catastrophic forgetting. Even if your task is narrow, the model will be invoked in slightly broader ways at runtime. Preserving general capability is cheap insurance.
- Pin the base model version. A fine-tune is tied to one specific base model checkpoint. Record the exact base version in the fine-tune's provenance, and never let a base-model upgrade silently invalidate your fine-tune.
- Re-fine-tune quarterly even if "nothing has changed." The world has changed. Customer phrasing has drifted. New product features have shipped. Your old fine-tune is operating on assumptions that may no longer hold; the only way to know is to re-train and compare.
- Treat fine-tunes as code, not data. Source-control the training script, the dataset hash, the configuration. Every shipped fine-tune should be reproducible from these three artifacts. If a regulator asks how the model was trained, the answer is "here is the recipe; run it and you reproduce the weights."
- Reach for fine-tuning later than you think. The right order is: prompting first; few-shot prompting second; RAG third; fine-tuning fourth. Most teams reach for fine-tuning second. The teams that ship reliable specialists are the ones that exhaust the cheaper options first and only fine-tune when prompting cannot get them where they need to go.