10 Fine-tuning, when it actually pays off, and the math you should not skip

Fine-tuning is the answer that most teams reach for too early.

The previous chapter described three shapes a knowledge-bearing agent can take: generalist, specialist, and generalist plus RAG. The specialist shape almost always involves fine-tuning, and yet most production agent failures around fine-tuning are not about the technique. They are about the decision: teams fine-tune before they have exhausted prompting, then carry the maintenance cost forever. This chapter is the deep dive that the specialist taxonomy hands off to: when fine-tuning is the right answer, what the math actually says about cost and sample size, what the failure modes look like in production, and how to know whether your fine-tune is working.

The honest one-line answer: fine-tune when the task has a stable shape, you have at least a thousand high-quality examples, and the cost of being wrong is high enough to justify the maintenance burden. Most teams should not fine-tune. The teams that should, fine-tune narrowly.

The decision: when does fine-tuning actually pay off?

Five conditions where fine-tuning is the right answer. You need most of them, not just one. Treat the list as a checklist, not an inspirational poster.

Five conditions where fine-tuning is not the right answer, even though teams reach for it anyway:

What full fine-tuning, LoRA, and DPO actually are, in three paragraphs

Full fine-tuning means updating every parameter in the model on your data. You start from the base model's weights, run gradient descent on your training set, and end with a new full set of weights. This is the most expressive option (the model can change in any direction) and the most expensive (you need GPU memory for the full optimizer state, which is roughly four times the model size, plus the gradients, plus the activations). For a 70B model, full fine-tuning is infeasible for most teams; for an 8B model, it is feasible if you have an H100 or two and a few thousand dollars of compute budget.

LoRA (Low-Rank Adaptation) adds small trainable matrices to specific layers of a frozen base model. Instead of updating all 70 billion parameters, you update perhaps 50 million parameters in narrow rank-r matrices that get added to the frozen weights at inference time. Trainable parameter count drops by roughly 99 percent; memory drops by a similar fraction. Inference quality on the trained distribution is typically within a few points of full fine-tuning when r is chosen well (r = 16 to 64 covers most cases). LoRA is the default for production fine-tuning in 2026 because it makes specialist agents economically viable: spinning up a new specialist costs hundreds of dollars, not tens of thousands.
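
What the LoRA setup looks like in practice, sketched with the Hugging Face peft library; the base model name, rank, and target modules below are illustrative choices, not a recommendation for your task.

    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Load the frozen base model in bf16; the model name is illustrative.
    base = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16
    )

    # Attach rank-16 adapters to the attention projections only.
    config = LoraConfig(
        r=16,                  # adapter rank; r = 16 to 64 covers most cases
        lora_alpha=32,         # scaling applied to the adapter output
        lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()   # a fraction of a percent of the total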

DPO (Direct Preference Optimization) is a different question entirely. Full fine-tuning and LoRA train the model to imitate examples; DPO trains the model to prefer one output over another given the same input. You provide pairs (chosen, rejected) for the same prompt, and the model learns to assign higher probability to the chosen one. This is what you reach for when you have human feedback rather than ground-truth labels: "the model said both A and B; humans preferred A; train the model to prefer A too." DPO is more stable than the older PPO-plus-reward-model approach because there is no separately trained reward model to be exploited; the loss directly optimizes the implicit reward. You can apply DPO on top of a LoRA-tuned base, which is the standard production stack for high-quality specialist agents.
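
The raw material is nothing more exotic than triplets. A single record might look like the following; the prompt/chosen/rejected field names follow the convention used by, for example, trl's DPOTrainer, and the contents are invented for illustration.

    pair = {
        "prompt": "Summarize this support ticket for the escalation queue: ...",
        "chosen": "Customer reports a duplicate charge on the March invoice; "
                  "refund requested, account flagged for billing review.",
        "rejected": "The customer is upset about some billing thing.",
    }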

Compute math: what does a fine-tune actually cost?

The honest cost of fine-tuning is the part most decision documents skip. Two numbers tell you whether the project is feasible at all.

Training compute, in FLOPs: a single forward-and-backward pass over a model with P parameters processing D tokens takes approximately 6 · P · D floating-point operations. The 6 is the rule of thumb that combines the forward pass (roughly 2PD) with the backward pass (roughly 4PD). For one epoch over 10 million tokens of training data on an 8B parameter model: 6 · 8e9 · 1e7 = 4.8e17 FLOPs, or 480 PFLOPs. An H100 GPU at 60 percent utilization delivers about 600 TFLOP/s sustained, so this run takes 480e15 / 600e12 = 800 seconds, roughly 13 minutes per epoch. Most fine-tunes run 2 to 4 epochs, so the wall-clock training time is under an hour. At cloud GPU rental rates (about $3 to $5 per hour per H100 in 2026), the compute cost is on the order of tens of dollars for a small specialist.
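
The same arithmetic as a few lines of Python, so you can substitute your own model size, token count, and GPU pricing; the utilization and hourly rate are the assumptions from the paragraph above, not fixed constants.

    params = 8e9        # model parameters
    tokens = 1e7        # training tokens per epoch
    epochs = 3

    flops = 6 * params * tokens * epochs    # ~6PD per forward-and-backward pass
    sustained = 600e12                      # FLOP/s, H100 at ~60% utilization
    gpu_hours = flops / sustained / 3600
    rate = 4.0                              # assumed $/H100-hour

    print(f"{flops:.2e} FLOPs, {gpu_hours:.2f} GPU-hours, ~${gpu_hours * rate:.0f}")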

Memory cost: the constraint that actually kills full fine-tuning. The optimizer state for AdamW takes 8 bytes per parameter (two moment tensors, fp32). Gradients take another 4 bytes per parameter. Model weights take 2 bytes per parameter (bf16). Activations take whatever the batch size and sequence length demand, often the largest term. Total memory: roughly 14 · P + activations. For 8B parameters: 112 GB just for weights, gradients, and optimizer state, before activations. That does not fit on a single H100 (80 GB) without sharding or offloading tricks; for a 70B model the same arithmetic gives roughly 980 GB, more than eight H100s (640 GB) hold between them.

LoRA changes the memory math dramatically. The base weights are frozen (so no gradient or optimizer state is needed for them; just the 2 bytes per parameter for inference). Only the LoRA parameters need optimizer state. For r=16 LoRA on an 8B model with adapters on attention projections only, trainable parameter count is around 20 million; their optimizer state is about 160 MB; gradient memory is about 80 MB. Total memory budget: 16 GB for weights plus a few hundred MB for everything else. An 8B full fine-tune that is infeasible on a 24 GB consumer GPU is comfortable as LoRA on the same hardware. This is why LoRA dominates production fine-tuning.
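
A sketch of the same memory arithmetic, using the per-parameter byte counts from the two paragraphs above (2 for bf16 weights, 4 for gradients, 8 for AdamW state) and ignoring activations.

    def finetune_memory_gb(total_params, trainable_params,
                           bytes_weights=2, bytes_grad=4, bytes_optim=8):
        """Rough memory for weights plus gradients plus optimizer state, in GB,
        excluding activations."""
        weights = total_params * bytes_weights
        training_state = trainable_params * (bytes_grad + bytes_optim)
        return (weights + training_state) / 1e9

    print(finetune_memory_gb(8e9, trainable_params=8e9))    # full fine-tune: ~112 GB
    print(finetune_memory_gb(8e9, trainable_params=20e6))   # r=16 LoRA: ~16 GB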

A useful sanity check: if your fine-tune training cost exceeds your annual inference savings, the project does not pay back. A team running a thousand inferences per day at $0.01 per call saves at most $3,650 per year by reducing inference cost. A fine-tune that costs $500 to train pays back in two months and earns the rest as profit. A fine-tune that costs $50,000 because you needed a 70B base will never pay back at that traffic level. Match the model size to the traffic, not to the capability you wish you needed.
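
The break-even check, with the traffic, per-call saving, and training cost as the three inputs you have to supply.

    calls_per_day = 1_000
    saving_per_call = 0.01        # dollars saved per call after the fine-tune
    training_cost = 500           # dollars per training run (and per re-train)

    annual_saving = calls_per_day * saving_per_call * 365
    print(f"payback in {training_cost / annual_saving * 12:.1f} months")   # ~1.6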

Sample efficiency: how many examples do you actually need?

The empirical scaling for fine-tuning is roughly error ∝ D^(−α), where D is training set size and α depends on the task. For most structured-output tasks, α falls between 0.3 and 0.5. Plugging in: doubling your dataset reduces error by 20 to 30 percent, not 50 percent. This is why fine-tuning has steep diminishing returns past a certain point.
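
Spelled out, with the exponent range from above:

    for alpha in (0.3, 0.5):
        drop = 1 - 2 ** (-alpha)    # relative error reduction from doubling D
        print(f"alpha = {alpha}: doubling the data cuts error by {drop:.0%}")
    # alpha = 0.3: about 19 percent; alpha = 0.5: about 29 percent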

Concrete reference points from the literature for production fine-tuning, with realistic ranges rather than promises:

Task type | Useful threshold | Diminishing returns | What more data buys you
Classification (small label set) | 500 to 2,000 | 10,000 to 50,000 | Coverage of edge cases; better calibration
Structured extraction (fixed schema) | 1,000 to 5,000 | 20,000 to 100,000 | Robustness to malformed input; rare-field accuracy
Tone or style matching | 5,000 to 20,000 | 50,000 to 200,000 | Stylistic consistency on out-of-distribution prompts
Open-ended generation | 50,000 plus | 500,000 plus | Genuine quality lift, not just style match
DPO preference pairs | 1,000 to 5,000 pairs | 20,000 plus pairs | Sharper preference signal on contested cases

Two implications most teams miss. First, the quality multiplier dominates the quantity multiplier. A thousand carefully curated examples beat ten thousand lightly cleaned examples on every metric that matters. The reason is that fine-tuning is gradient descent on whatever the labels say is correct; if the labels are wrong, you are training the model to be wrong in a consistent direction. Spend the first week of any fine-tuning project on data quality before you spend the first hour on training. Second, diversity matters more than volume past the threshold. Once you have your thousand examples covering the common cases, the next thousand should cover edge cases, failure modes, and unusual phrasings, not more of the same.

Catastrophic forgetting: why fine-tuning makes the model worse at things you didn't train on

The phenomenon: a model fine-tuned on customer support tickets gets better at customer support tickets and worse at writing Python code, even though no one wanted it to be worse at code. This is not a bug; it is what cross-entropy loss does. When you optimize the model's probability mass on training tokens, you are necessarily reducing the probability mass on other tokens. The probability distribution is a budget; spending it on your domain means less for everything else.

The math, simplified to one paragraph. The fine-tuning loss is cross-entropy on training examples: L = -E[log pθ(y | x)] for (x, y) drawn from your training distribution. The gradient pushes pθ(y | x) up. Because probabilities sum to one, this pulls pθ(y' | x') down for every (x', y') not in the training set. The KL divergence from the base model's distribution grows monotonically with training; that growth is the formal description of forgetting. The longer you train and the higher the learning rate, the more the fine-tuned model has drifted from the base distribution, and the more general capability is lost.
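
You can watch the drift directly by estimating the per-token KL from the fine-tuned model to the base model on a probe set of general-purpose prompts you never trained on. A minimal sketch, assuming both models fit in memory and share a tokenizer:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def mean_token_kl(tuned, base, input_ids):
        """Average KL(p_tuned || p_base) per token position for a probe batch."""
        logp_tuned = F.log_softmax(tuned(input_ids).logits, dim=-1)
        logp_base = F.log_softmax(base(input_ids).logits, dim=-1)
        kl = (logp_tuned.exp() * (logp_tuned - logp_base)).sum(dim=-1)
        return kl.mean().item()

    # Track this number across checkpoints: a steady climb on general prompts is
    # the forgetting described above, even while in-domain eval loss improves.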

Three production-tested mitigations, in order of how often they are used:

The DPO loss, derived to make sense of it

DPO is the single most useful preference-learning technique for production specialist agents, because it is more stable than PPO-with-reward-model and just as effective. The loss looks intimidating; the intuition is simple.

You have a dataset of preference pairs: (x, yw, yl) where x is the prompt, yw is the winning response, yl is the losing response. The DPO loss is:

LDPO = -E[ log σ( β · log(πθ(yw|x) / πref(yw|x))
                  - β · log(πθ(yl|x) / πref(yl|x)) ) ]

Read it from inside out. πθ(y|x) is the probability your fine-tuned model assigns to response y given prompt x. πref(y|x) is the probability the reference model (usually the base, before DPO training) assigns. The ratio πθ/πref measures how much more or less your model likes a response compared to the reference. The log of that ratio, scaled by β, is sometimes called the implicit reward.

The loss says: maximize the gap between the implicit reward of the winner and the implicit reward of the loser, then squash through a sigmoid. The sigmoid means once the gap is large enough, additional gradient signal is small (you stop pushing on cases the model already gets right). The β parameter (typically 0.1 to 0.5) controls how much you regularize toward the reference: high β keeps you close to the base; low β lets you drift further from the base toward the preferences.
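
In code, the loss is a handful of lines once you have the summed log-probabilities of each response under the policy and under the frozen reference model; a sketch, with that log-prob computation assumed to happen elsewhere.

    import torch.nn.functional as F

    def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
        """logp_w / logp_l: summed log-probs of the chosen / rejected responses
        under the policy; ref_logp_* are the same under the reference model.
        All arguments are tensors of shape (batch,)."""
        reward_w = beta * (logp_w - ref_logp_w)   # implicit reward, winner
        reward_l = beta * (logp_l - ref_logp_l)   # implicit reward, loser
        return -F.logsigmoid(reward_w - reward_l).mean()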

Why this is more stable than reward-model-plus-PPO: there is no separately trained reward model to overfit, no distribution shift between the reward model and the policy, and no reward hacking through finding inputs the reward model misclassifies. The reference policy πref is fixed; the regularization is built into the loss; there are fewer moving parts to break. In production, DPO is what you reach for when you have human preferences and you do not have a research team on standby to babysit a PPO run.

Knowing when to stop: the eval-train gap

If your training loss is dropping but your held-out evaluation loss is rising, you are overfitting. The gap between training loss and evaluation loss is the cleanest single signal of overfitting. A healthy fine-tune typically shows train loss and eval loss within 10 to 20 percent of each other in relative terms. Beyond that, you are memorizing your training set, and the model will fail on inputs that look only slightly different.
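
As a checkpoint-selection rule this can be a one-line check after each evaluation pass; the 20 percent ceiling is the heuristic above, not a universal constant.

    def overfitting(train_loss, eval_loss, max_rel_gap=0.20):
        """True when eval loss exceeds train loss by more than max_rel_gap."""
        return (eval_loss - train_loss) / train_loss > max_rel_gap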

Three checkpoints worth recording during every fine-tune run, evaluated on a held-out set you never trained on:

The discipline that separates teams that ship good fine-tunes from teams that do not: splitting your golden set into training, validation, and test before any data prep. Train on training. Tune hyperparameters on validation. Report final numbers on test, and never look at test until you are committing to ship. Teams that look at test scores during development unconsciously tune to test, and the production behavior matches development numbers minus the tuning advantage. This is the same discipline as classical ML; it just gets skipped more often in LLM fine-tuning because the dataset preparation is informal.
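
The split itself is a few lines; the discipline is doing it before any data prep and leaving the test slice unread until you commit to ship. A sketch assuming an 80/10/10 split over a list called examples:

    import random

    random.seed(17)                  # fixed seed so the split is reproducible
    random.shuffle(examples)         # `examples` is the full golden set
    n = len(examples)
    train = examples[: int(0.8 * n)]
    val = examples[int(0.8 * n): int(0.9 * n)]
    test = examples[int(0.9 * n):]   # do not look at this until shipping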

The four failure modes specific to fine-tuning

Beyond the general overfitting story, fine-tuned models exhibit four characteristic failure modes that prompt-engineered systems do not. Recognize them by name; each has a specific fix.

Fine-tuning as a defense: the Jatmo idea

The guardrails chapter mentioned Jatmo (Piet et al., ESORICS 2024) in passing as one of the layered prompt-injection defenses. The full picture belongs here. Jatmo's insight is that fine-tuning a non-instruction-tuned base model on synthetic task-specific data produces a model that cannot follow injected instructions, because following arbitrary instructions is the very capability that was never trained in. The base model never learned "follow whatever instruction is in the prompt"; it learned only the specific task you fine-tuned it on. An attacker writing "ignore previous instructions and..." gets ignored not because of a guardrail but because the model has no idea what to do with the instruction.

Reported numbers: best attacks succeed in less than 0.5 percent of cases against Jatmo-style fine-tunes, compared to 87 percent against GPT-3.5-Turbo in the same evaluation. The trade-off is that the fine-tuned model can do exactly one task; you cannot ask it to do something else. For a sufficiently narrow task that is exactly what you want.

This connects fine-tuning to the deny-by-default principle from chapter 21. Most prompt-injection defenses try to detect attacks at the input layer; Jatmo prevents the attack from being executable at all by removing the capability the attack relies on. The cost is generality (the model can only do one thing); the benefit is structural security (no detection step is required). For a narrow specialist agent in a high-stakes domain, this is often the right trade.

Tying it back: how a fine-tuned agent fits the manual

A fine-tuned agent has the same loop, the same task ingestion, the same environmental bootstrap, the same protocols as any other agent. What changes is the knowledge anchor and the reputation slice.

Practical guidance

Fine-tuning is a powerful tool that is also a maintenance commitment. The three questions to ask before starting: have I exhausted prompting? Do I have at least a thousand examples I would stake the project on? Am I prepared to re-train this every quarter? If the answer to any is no, the project is not ready for fine-tuning yet. That is fine. Ship the prompted version, build the evaluation harness, collect failure cases, and revisit the decision in three months with data.