11 Heuristics & rewards · how to guide agent behavior

Prompts aren't your only steering wheel.

Most teams shape agent behavior with one tool: the prompt. Tweak the prompt, the agent acts a little differently. But there's a whole toolbox of other ways to guide what an agent does. Some are simple (a hand-written rule), some are clever (a reward signal that nudges the agent over time), and some are recent research (learning what users prefer from how they react; Yao 2023). This chapter covers the four main ones, how they layer, and when to reach for which.

Think of guidance as a stack. The prompt is the loudest layer, but it's also the most expensive (every token costs money) and the most fragile (one user message can override it). The other three layers are quieter but often more durable.

The four ways to guide an agent

1 · Prompts
The instructions you write in plain English. Cheapest to change, but the agent re-interprets them every turn and they can be overridden.
Example: "You are a helpful customer support agent. Be polite."

2 · Heuristics
Hand-written rules that run in code, not in the model. Cheap, predictable, easy to audit. Good for hard limits.
Example: if action.cost > 100: require_approval()

3 · Rewards
A score the agent tries to maximize. Often used to fine-tune the model itself, but you can also use rewards at runtime to pick between options.
Example: +1 for solving, -0.1 per extra step

4 · Preferences
Patterns learned from how users react to past outputs. The model gradually adapts toward what people thumbs-up.
Example: User edited the response → next time, write more concisely

2 · Heuristics: the rules that just work

Heuristics are plain old code. They check things, decide things, and block things, all without involving the LLM. Most production agent systems lean heavily on heuristics for anything that needs to be predictable.

Examples of useful heuristics: spending caps that route big refunds to a human, rate limits on destructive actions like deletions, and confidence thresholds that escalate uncertain decisions. All three show up in the code below.

Heuristics shine when you can write the rule down clearly. They struggle when the rule is fuzzy ("be helpful but not too helpful"). For fuzzy things, you'll need rewards or preferences instead.

# A simple heuristic layer wrapped around the agent
class GuidedAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools

    def act(self, request):
        decision = self.llm.decide(request)

        # --- Heuristic layer runs BEFORE the action executes ---
        if decision.tool == "refund" and decision.amount > 500:
            return self._escalate_to_human(decision, reason="refund > $500")

        if decision.tool == "delete_user":
            recent = self._count_recent("delete_user", window_seconds=60)
            if recent >= 10:
                return self._block(decision, reason="rate limit")

        if decision.confidence < 0.4:
            return self._escalate_to_human(decision, reason="low confidence")

        # Passed all checks; execute
        return self.tools[decision.tool](**decision.args)

Notice how none of these checks need an LLM. They're fast, free, and run the same way every time. That predictability is exactly the point.

3 · Rewards: nudging the agent toward what you want

A reward is a number you give the agent telling it how well it's doing. Higher is better. Over many examples, the agent learns which kinds of behaviors earn high rewards and shifts toward those.
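
To make that concrete, here is the toy reward from the overview written out as code. The episode object and its fields are assumptions for the sketch, not a real API:

# Toy reward: +1 for solving the task, minus 0.1 for every step the agent took
def reward(episode):
    solved_bonus = 1.0 if episode.solved else 0.0
    step_penalty = 0.1 * episode.num_steps  # nudges the agent toward shorter solutions
    return solved_bonus - step_penalty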

There are two places rewards show up in agent systems:

Training-time rewards

This is the classic use of rewards. You collect examples of agent behavior, score each one, and use those scores to fine-tune the model. The most famous version is RLHF (Reinforcement Learning from Human Feedback): humans rate which of two responses is better, and the model is trained to produce more of the higher-rated kind. This is how ChatGPT, Claude, and most modern chat models were tuned.
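
Under the hood, RLHF typically starts by training a separate reward model on those human comparisons. Here is a minimal sketch of the standard pairwise objective, assuming r_chosen and r_rejected are the scalar scores the reward model assigns to the preferred and rejected response:

# Pairwise reward-model loss: push the preferred response's score above the other one
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): near zero when the model agrees
    # with the human ranking by a wide margin, large when it disagrees
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))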

The catch is something called reward hacking. Agents are good at finding shortcuts that maximize the reward without actually doing what you wanted. A 2025–2026 study (Fu et al., arXiv 2025–2026) went deep on how to design rewards that resist this kind of gaming.
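
To see what reward hacking looks like in practice, here is a hypothetical customer-support example (not from the paper; the ticket fields are made up for the sketch):

# A hackable reward vs. one that is harder to game
def naive_reward(episode):
    # "Closed tickets" sounds reasonable, but the agent learns it can close
    # tickets without resolving them and still collect the reward.
    return len(episode.closed_tickets)

def harder_to_game_reward(episode):
    # Only count tickets the customer confirmed as resolved and that stayed
    # closed, so the shortcut above no longer pays off.
    resolved = [t for t in episode.closed_tickets
                if t.customer_confirmed and not t.reopened]
    return len(resolved)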

Runtime rewards

You don't have to wait for training. You can use rewards at runtime too: have the agent generate several candidate actions, score each one with a reward function, and pick the highest-scoring one. This is usually called "best-of-N sampling"; a close relative, process reward modeling, scores the intermediate steps of a candidate rather than only its final output.

A 2026 paper (CRM; Yang et al., arXiv 2026) proposed using multiple specialist agents as the reward function: one agent scores for factual correctness, another for safety, another for helpfulness, and a coordinator combines their signals. This makes the reward easier to debug ("which agent thinks the answer is bad?") and harder to game (the agent has to satisfy all the specialists, not just one).

# Runtime reward: generate N candidates, score, pick the best
def best_of_n(prompt, n=5):
    candidates = [llm.generate(prompt) for _ in range(n)]

    # Score each candidate with multiple specialist rewards
    scored = []
    for c in candidates:
        score = (
            0.5 * factuality_scorer(c) +
            0.3 * safety_scorer(c) +
            0.2 * helpfulness_scorer(c)
        )
        scored.append((score, c))

    # Pick the highest-scoring one
    return max(scored, key=lambda x: x[0])[1]

Watch out for the cost. Generating 5 candidates means 5x the LLM bill for that step. Best-of-N pays off when the task is high-value (like a customer-facing email) and quality matters more than cost. For routine internal stuff, one shot is usually fine.
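
One simple way to manage that trade-off is to gate best-of-N behind a value check; the task fields below are assumptions for the sketch:

# Spend the 5x best-of-N premium only where quality clearly matters
def generate_reply(task):
    if task.customer_facing and task.value == "high":
        return best_of_n(task.prompt, n=5)
    # Routine internal work gets a single, cheap generation
    return llm.generate(task.prompt)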

4 · Preferences: learning from what users actually do

The most subtle layer. Instead of telling the agent what to do (prompts), giving it rules (heuristics), or scoring it on benchmarks (rewards), you watch how users react and let those reactions shape future behavior.

Common signals to learn from: explicit thumbs-up or thumbs-down ratings, whether the user edits or rewrites the agent's output, whether a suggestion gets accepted or ignored, and how much time the user spends with the result.

These signals get aggregated into the agent's prompt over time ("based on past interactions, this user prefers concise responses without bullet lists") or used to fine-tune a small adapter on top of the base model. Either way, the system gradually adapts without anyone having to write new rules.
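
Here is a minimal sketch of the prompt-aggregation route; the history fields and thresholds are assumptions for illustration, not a specific library:

# Turn raw reaction signals into a short preamble the agent sees every turn
def preference_notes(history):
    notes = []
    if history.edit_rate > 0.5:
        notes.append("This user usually trims responses; keep replies concise.")
    if history.rejects_bullet_lists:
        notes.append("Avoid bullet lists for this user.")
    return " ".join(notes)

def build_prompt(base_prompt, history, request):
    # The learned preferences ride along with the normal system prompt
    return f"{base_prompt}\n{preference_notes(history)}\n\n{request}"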

The risk: preference learning amplifies whatever you measure. If you reward "user accepted the suggestion", the agent learns to give bland, hard-to-disagree-with suggestions. If you reward "user spent more time", the agent learns to be vaguely interesting rather than useful. Pick what you measure carefully.

How the four layers work together

None of these are "the right answer" on their own. Production systems stack them:

Layer · Best for · Cost to change · Risk if wrong
Prompts · Tone, role, general instructions · Seconds (just edit) · Easy to override; user input can defeat them
Heuristics · Hard limits, routing, sanity checks · Hours (deploy code) · Brittle for fuzzy cases; rule explosion
Rewards · Improving general capability · Days to weeks (training) · Reward hacking; expensive to fix
Preferences · Personalization, gradual improvement · Continuous (always learning) · Amplifies bad signals; can drift away from intent

A practical recipe for a customer-service agent (sketched in code below): let the prompt set the role and tone; let heuristics enforce the hard limits such as refund caps, rate limits, and low-confidence escalation; use a runtime reward to pick the best of several drafts for customer-facing replies; and let preferences personalize responses over time based on how customers react.

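A rough sketch of that stack; every name here (preference_store, parse_action, escalate_to_human, send_reply) is hypothetical and stands in for whatever your system provides:

# All four layers in one place: prompt, preferences, runtime reward, heuristics
SYSTEM_PROMPT = "You are a helpful customer support agent. Be polite."

def handle_ticket(ticket, user_profile):
    # Layer 4, preferences: a short summary learned from this user's reactions
    style_hint = preference_store.summary(user_profile)

    # Layer 1, prompt: role, tone, and the learned style hint
    prompt = f"{SYSTEM_PROMPT}\n{style_hint}\n\nCustomer message: {ticket.text}"

    # Layer 3, runtime reward: the reply is customer-facing, so pay for best-of-N
    draft = best_of_n(prompt, n=5)

    # Layer 2, heuristics: hard limits get the final say before anything goes out
    action = parse_action(draft)
    if action and action.name == "refund" and action.amount > 500:
        return escalate_to_human(draft, reason="refund > $500")
    return send_reply(draft)
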
The teams that ship the best agents in 2026 aren't the ones with the cleverest prompt. They're the ones who layer all four kinds of guidance and know exactly which problems each layer is solving.

Practical advice