11 Heuristics & rewards · how to guide agent behavior

Prompts aren't your only steering wheel.

Most teams shape agent behavior with one tool: the prompt. Tweak the prompt, the agent acts a little differently. But there's a whole toolbox of other ways to guide what an agent does. Some are simple (a hand-written rule), some are clever (a reward signal that nudges the agent over time), and some are recent research (learning what users prefer from how they react; Yao 2023). This chapter covers the four main ones, how they layer, and when to reach for which.

Think of guidance as a stack. The prompt is the loudest layer, but it's also the most expensive (every token costs money) and the most fragile (one user message can override it). The other three layers are quieter but often more durable.

The four ways to guide an agent

1 · Prompts
The instructions you write in plain English. Cheapest to change, but the agent re-interprets them every turn and they can be overridden.
Example: "You are a helpful customer support agent. Be polite."

2 · Heuristics
Hand-written rules that run in code, not in the model. Cheap, predictable, easy to audit. Good for hard limits.
Example: if action.cost > 100: require_approval()

3 · Rewards
A score the agent tries to maximize. Often used to fine-tune the model itself, but you can also use rewards at runtime to pick between options.
Example: +1 for solving, -0.1 per extra step

4 · Preferences
Patterns learned from how users react to past outputs. The model gradually adapts toward what people thumbs-up.
Example: User edited the response → next time, write more concisely

2 · Heuristics: the rules that just work

Heuristics are plain old code. They check things, decide things, and block things, all without involving the LLM. Most production agent systems lean heavily on heuristics for anything that needs to be predictable.

Examples of useful heuristics: spending caps that route big refunds to a human, rate limits on destructive actions like deletions, and confidence thresholds that escalate uncertain decisions. All three show up in the code below.

Heuristics shine when you can write the rule down clearly. They struggle when the rule is fuzzy ("be helpful but not too helpful"). For fuzzy things, you'll need rewards or preferences instead.

# A simple heuristic layer wrapped around the agent
class GuidedAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools

    def act(self, request):
        decision = self.llm.decide(request)

        # --- Heuristic layer runs BEFORE the action executes ---
        if decision.tool == "refund" and decision.amount > 500:
            return self._escalate_to_human(decision, reason="refund > $500")

        if decision.tool == "delete_user":
            recent = self._count_recent("delete_user", window_seconds=60)
            if recent >= 10:
                return self._block(decision, reason="rate limit")

        if decision.confidence < 0.4:
            return self._escalate_to_human(decision, reason="low confidence")

        # Passed all checks; execute
        return self.tools[decision.tool](**decision.args)

Notice how none of these checks need an LLM. They're fast, free, and run the same way every time. That predictability is exactly the point.

3 · Rewards: nudging the agent toward what you want

A reward is a number you give the agent telling it how well it's doing. Higher is better. Over many examples, the agent learns which kinds of behaviors earn high rewards and shifts toward those.
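
To make that concrete, here is the toy reward from the overview written out as code. The episode object and its fields are assumptions for the sketch, not a real API:

# Toy reward: +1 for solving the task, minus 0.1 for every step the agent took
def reward(episode):
    solved_bonus = 1.0 if episode.solved else 0.0
    step_penalty = 0.1 * episode.num_steps  # nudges the agent toward shorter solutions
    return solved_bonus - step_penalty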

There are two places rewards show up in agent systems:

Training-time rewards

This is the classic use of rewards. You collect examples of agent behavior, score each one, and use those scores to fine-tune the model. The most famous version is RLHF (Reinforcement Learning from Human Feedback): humans rate which of two responses is better, and the model is trained to produce more of the higher-rated kind. This is how ChatGPT, Claude, and most modern chat models were tuned.
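
Under the hood, RLHF typically starts by training a separate reward model on those human comparisons. Here is a minimal sketch of the standard pairwise objective, assuming r_chosen and r_rejected are the scalar scores the reward model assigns to the preferred and rejected response:

# Pairwise reward-model loss: push the preferred response's score above the other one
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    # -log(sigmoid(r_chosen - r_rejected)): near zero when the model agrees
    # with the human ranking by a wide margin, large when it disagrees
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))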

The catch is something called reward hacking. Agents are good at finding shortcuts that maximize the reward without actually doing what you wanted. A 2025–2026 study (Fu et al., arXiv 2025–2026) went deep on how to design rewards that resist this kind of gaming.
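
To see what reward hacking looks like in practice, here is a hypothetical customer-support example (not from the paper; the ticket fields are made up for the sketch):

# A hackable reward vs. one that is harder to game
def naive_reward(episode):
    # "Closed tickets" sounds reasonable, but the agent learns it can close
    # tickets without resolving them and still collect the reward.
    return len(episode.closed_tickets)

def harder_to_game_reward(episode):
    # Only count tickets the customer confirmed as resolved and that stayed
    # closed, so the shortcut above no longer pays off.
    resolved = [t for t in episode.closed_tickets
                if t.customer_confirmed and not t.reopened]
    return len(resolved)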

Runtime rewards

You don't have to wait for training. You can use rewards at runtime too: have the agent generate several candidate actions, score each one with a reward function, and pick the highest-scoring one. This is usually called "best-of-N sampling"; a close relative, process reward modeling, scores the intermediate steps of a candidate rather than only its final output.

A 2026 paper (CRM; Yang et al., arXiv 2026) proposed using multiple specialist agents as the reward function: one agent scores for factual correctness, another for safety, another for helpfulness, and a coordinator combines their signals. This makes the reward easier to debug ("which agent thinks the answer is bad?") and harder to game (the agent has to satisfy all the specialists, not just one).

# Runtime reward: generate N candidates, score, pick the best
def best_of_n(prompt, n=5):
    candidates = [llm.generate(prompt) for _ in range(n)]

    # Score each candidate with multiple specialist rewards
    scored = []
    for c in candidates:
        score = (
            0.5 * factuality_scorer(c) +
            0.3 * safety_scorer(c) +
            0.2 * helpfulness_scorer(c)
        )
        scored.append((score, c))

    # Pick the highest-scoring one
    return max(scored, key=lambda x: x[0])[1]

Watch out for the cost. Generating 5 candidates means 5x the LLM bill for that step. Best-of-N pays off when the task is high-value (like a customer-facing email) and quality matters more than cost. For routine internal stuff, one shot is usually fine.
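
One simple way to manage that trade-off is to gate best-of-N behind a value check; the task fields below are assumptions for the sketch:

# Spend the 5x best-of-N premium only where quality clearly matters
def generate_reply(task):
    if task.customer_facing and task.value == "high":
        return best_of_n(task.prompt, n=5)
    # Routine internal work gets a single, cheap generation
    return llm.generate(task.prompt)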

4 · Preferences: learning from what users actually do

The most subtle layer. Instead of telling the agent what to do (prompts), giving it rules (heuristics), or scoring it on benchmarks (rewards), you watch how users react and let those reactions shape future behavior.

Common signals to learn from: explicit thumbs-up or thumbs-down ratings, whether the user edits or rewrites the agent's output, whether a suggestion gets accepted or ignored, and how much time the user spends with the result.

These signals get aggregated into the agent's prompt over time ("based on past interactions, this user prefers concise responses without bullet lists") or used to fine-tune a small adapter on top of the base model. Either way, the system gradually adapts without anyone having to write new rules.
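
Here is a minimal sketch of the prompt-aggregation route; the history fields and thresholds are assumptions for illustration, not a specific library:

# Turn raw reaction signals into a short preamble the agent sees every turn
def preference_notes(history):
    notes = []
    if history.edit_rate > 0.5:
        notes.append("This user usually trims responses; keep replies concise.")
    if history.rejects_bullet_lists:
        notes.append("Avoid bullet lists for this user.")
    return " ".join(notes)

def build_prompt(base_prompt, history, request):
    # The learned preferences ride along with the normal system prompt
    return f"{base_prompt}\n{preference_notes(history)}\n\n{request}"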

The risk: preference learning amplifies whatever you measure. If you reward "user accepted the suggestion", the agent learns to give bland, hard-to-disagree-with suggestions. If you reward "user spent more time", the agent learns to be vaguely interesting rather than useful. Pick what you measure carefully.

How the four layers work together

None of these are "the right answer" on their own. Production systems stack them:

Layer · Best for · Cost to change · Risk if wrong
Prompts · Tone, role, general instructions · Seconds (just edit) · Easy to override; user input can defeat them
Heuristics · Hard limits, routing, sanity checks · Hours (deploy code) · Brittle for fuzzy cases; rule explosion
Rewards · Improving general capability · Days to weeks (training) · Reward hacking; expensive to fix
Preferences · Personalization, gradual improvement · Continuous (always learning) · Amplifies bad signals; can drift away from intent

A practical recipe for a customer-service agent (sketched in code below): let the prompt set the role and tone; let heuristics enforce the hard limits such as refund caps, rate limits, and low-confidence escalation; use a runtime reward to pick the best of several drafts for customer-facing replies; and let preferences personalize responses over time based on how customers react.

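A rough sketch of that stack; every name here (preference_store, parse_action, escalate_to_human, send_reply) is hypothetical and stands in for whatever your system provides:

# All four layers in one place: prompt, preferences, runtime reward, heuristics
SYSTEM_PROMPT = "You are a helpful customer support agent. Be polite."

def handle_ticket(ticket, user_profile):
    # Layer 4, preferences: a short summary learned from this user's reactions
    style_hint = preference_store.summary(user_profile)

    # Layer 1, prompt: role, tone, and the learned style hint
    prompt = f"{SYSTEM_PROMPT}\n{style_hint}\n\nCustomer message: {ticket.text}"

    # Layer 3, runtime reward: the reply is customer-facing, so pay for best-of-N
    draft = best_of_n(prompt, n=5)

    # Layer 2, heuristics: hard limits get the final say before anything goes out
    action = parse_action(draft)
    if action and action.name == "refund" and action.amount > 500:
        return escalate_to_human(draft, reason="refund > $500")
    return send_reply(draft)
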
The teams that ship the best agents in 2026 aren't the ones with the cleverest prompt. They're the ones who layer all four kinds of guidance and know exactly which problems each layer is solving.

Practical advice