Prompts aren't your only steering wheel.
Most teams shape agent behavior with one tool: the prompt. Tweak the prompt, the agent acts a little differently. But there's a whole toolbox of other ways to guide what an agent does. Some are simple (a hand-written rule), some are clever (a reward signal that nudges the agent over time), and some are recent research (learning what users prefer from how they react; Yao 2023). This chapter covers the four main ones, how they layer, and when to reach for which.
The four ways to guide an agent
2 · Heuristics: the rules that just work
Heuristics are plain old code. They check things, decide things, and block things, all without involving the LLM. Most production agent systems lean heavily on heuristics for anything that needs to be predictable.
Examples of useful heuristics:
- Hard limits. "Never refund more than $500 without human approval." "Never call the delete-user tool on more than 10 users per minute." These are not suggestions; they're floors and ceilings.
- Routing rules. "If the request mentions billing, send it to the billing agent. If it mentions a refund, also tag it as 'sensitive'." Simple if-then logic that the LLM doesn't need to reason about.
- Format checks. "Reject any agent output where the JSON doesn't validate." "Truncate responses longer than 2000 characters."
- Cost controls. "Stop the workflow if it has used more than 50,000 tokens." "Don't call the expensive model if a cheap one already produced an answer that passes our checks."
- Sanity gates. "If the agent's confidence score is below 0.4, escalate to a human." "If two agents disagree, log it and route to the supervisor."
Heuristics shine when you can write the rule down clearly. They struggle when the rule is fuzzy ("be helpful but not too helpful"). For fuzzy things, you'll need rewards or preferences instead.
# A simple heuristic layer wrapped around the agent
class GuidedAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = tools

    def act(self, request):
        decision = self.llm.decide(request)

        # --- Heuristic layer runs BEFORE the action executes ---
        if decision.tool == "refund" and decision.amount > 500:
            return self._escalate_to_human(decision, reason="refund > $500")

        if decision.tool == "delete_user":
            recent = self._count_recent("delete_user", window_seconds=60)
            if recent >= 10:
                return self._block(decision, reason="rate limit")

        if decision.confidence < 0.4:
            return self._escalate_to_human(decision, reason="low confidence")

        # Passed all checks; execute
        return self.tools[decision.tool](**decision.args)
Notice how none of these checks need an LLM. They're fast, free, and run the same way every time. That predictability is exactly the point.
3 · Rewards: nudging the agent toward what you want
A reward is a number you give the agent telling it how well it's doing. Higher is better. Over many examples, the agent learns which kinds of behaviors earn high rewards and shifts toward those.
There are two places rewards show up in agent systems:
Training-time rewards
This is the classic use of rewards. You collect examples of agent behavior, score each one, and use those scores to fine-tune the model. The most famous version is RLHF (Reinforcement Learning from Human Feedback): humans rate which of two responses is better, and the model is trained to produce more of the higher-rated kind. This is how ChatGPT, Claude, and most modern chat models were tuned.
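To make that concrete, here's a minimal sketch of the pairwise loss commonly used to train a reward model from human comparisons. It assumes a hypothetical reward_model that maps a response to a single score; the rest is standard PyTorch.

# Sketch: pairwise (Bradley-Terry style) reward-model loss
# `reward_model`, `chosen`, and `rejected` are hypothetical placeholders.
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # scores for the responses humans preferred
    r_rejected = reward_model(rejected)  # scores for the responses rated worse
    # Push the preferred response's score above the other one's;
    # the loss shrinks as the gap between the two scores grows.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

The trained reward model is then used to fine-tune the base model, which is the "RL" part of RLHF.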
The catch is something called reward hacking. Agents are good at finding shortcuts that maximize the reward without actually doing what you wanted. A 2025–2026 study (Fu et al., arXiv 2025–2026) went deep on how to design rewards that resist this. Their main lessons, with a code sketch after the list:
- Cap the reward. An unbounded reward is an open invitation for the agent to find a way to make the number very large by doing something silly. Pick a maximum and stick to it.
- Reward shape matters. A reward that grows fast at first then flattens encourages the agent to learn the easy wins quickly, then stop pushing for diminishing returns. A reward that grows linearly forever tells the agent "keep going at all costs", which leads to weird behavior.
- Compare against a reference. Score how the agent did relative to a baseline, not in absolute terms. This stops the agent from being rewarded just for doing the obvious thing.
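Here's what those three lessons can look like in code. This is a sketch: score and baseline_score stand in for whatever task-quality measure your system already has (say, the fraction of tests passing), and the constants are illustrative.

# Sketch: a capped, reference-relative reward with diminishing returns
# `score` and `baseline_score` are hypothetical quality measures.
import math

def shaped_reward(score, baseline_score, cap=1.0):
    # Compare against a reference: only improvement over the baseline counts
    improvement = max(0.0, score - baseline_score)
    # Diminishing returns: a concave shape rewards early gains more than late ones
    shaped = math.log1p(improvement)
    # Hard cap: an unbounded reward is an invitation to game the number
    return min(cap, shaped)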
Runtime rewards
You don't have to wait for training. You can use rewards at runtime too: have the agent generate several candidate actions, score each one with a reward function, and pick the highest-scoring one. This is often called "best-of-N sampling"; when the reward function scores intermediate steps rather than only the final output, it's called "process reward modeling".
A 2026 paper (CRM; Yang et al., arXiv 2026) proposed using multiple specialist agents as the reward function: one agent scores for factual correctness, another for safety, another for helpfulness, and a coordinator combines their signals. This makes the reward easier to debug ("which agent thinks the answer is bad?") and harder to game (the agent has to satisfy all the specialists, not just one).
# Runtime reward: generate N candidates, score, pick the best
def best_of_n(prompt, n=5):
    candidates = [llm.generate(prompt) for _ in range(n)]

    # Score each candidate with multiple specialist rewards
    scored = []
    for c in candidates:
        score = (
            0.5 * factuality_scorer(c) +
            0.3 * safety_scorer(c) +
            0.2 * helpfulness_scorer(c)
        )
        scored.append((score, c))

    # Pick the highest-scoring one
    return max(scored, key=lambda x: x[0])[1]
4 · Preferences: learning from what users actually do
Preferences are the most subtle layer. Instead of telling the agent what to do (prompts), giving it rules (heuristics), or scoring its outputs (rewards), you watch how users react and let those reactions shape future behavior.
Common signals to learn from:
- Edits. The user changed your draft email before sending it. What did they change? Made it shorter? Less formal? More specific? That's a signal.
- Thumbs up/down. Explicit but rare. Most users don't bother.
- Re-asks. The user asked the same question again with more detail. Your first answer didn't land.
- Acceptance. The user took the agent's suggestion as-is. Whatever you did this time, do more of it.
- Time spent. The user dwelt on one part of the response and skipped the rest. The dwelt-on part was probably useful.
These signals get aggregated into the agent's prompt over time ("based on past interactions, this user prefers concise responses without bullet lists") or used to fine-tune a small adapter on top of the base model. Either way, the system gradually adapts without anyone having to write new rules.
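As a sketch of the first option (folding signals back into the prompt), imagine a small profile that counts observed signals and only promotes the stable ones into a hint. The signal names and the threshold here are invented for illustration.

# Sketch: aggregate observed user signals into a prompt hint
# Signal names and the min_count threshold are hypothetical.
from collections import Counter

class PreferenceProfile:
    def __init__(self):
        self.signals = Counter()

    def record(self, signal):
        # e.g. "shortened_draft", "removed_bullets", "accepted_as_is"
        self.signals[signal] += 1

    def to_prompt_hint(self, min_count=3):
        hints = []
        # Only promote a preference once the signal has shown up repeatedly
        if self.signals["shortened_draft"] >= min_count:
            hints.append("this user prefers shorter responses")
        if self.signals["removed_bullets"] >= min_count:
            hints.append("this user dislikes bullet lists")
        if not hints:
            return ""
        return "Based on past interactions: " + "; ".join(hints) + "."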
The risk: preference learning amplifies whatever you measure. If you reward "user accepted the suggestion", the agent learns to give bland, hard-to-disagree-with suggestions. If you reward "user spent more time", the agent learns to be vaguely interesting rather than useful. Pick what you measure carefully.
How the four layers work together
None of these are "the right answer" on their own. Production systems stack them:
| Layer | Best for | Cost to change | Risk if wrong |
|---|---|---|---|
| Prompts | Tone, role, general instructions | Seconds (just edit) | Easy to override; user input can defeat them |
| Heuristics | Hard limits, routing, sanity checks | Hours (deploy code) | Brittle for fuzzy cases; rule explosion |
| Rewards | Improving general capability | Days to weeks (training) | Reward hacking; expensive to fix |
| Preferences | Personalization, gradual improvement | Continuous (always learning) | Amplifies bad signals; can drift away from intent |
A practical recipe for a customer-service agent (a code sketch follows the list):
- Prompt sets the role and tone ("you are a polite, brief support agent").
- Heuristics enforce the hard rules ("never agree to refunds above $500", "always escalate complaints about safety").
- Rewards from offline training make the base model good at customer service generally.
- Preferences from how users react (edits, follow-up questions, satisfaction surveys) make the agent better at your specific customer base over time.
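Put together, the loop might look something like this. Everything here is a hypothetical sketch: reward_tuned_llm, escalate_to_human, mentions_safety_complaint, execute, and observe_user_reaction stand in for pieces your own system would provide, and profile is the preference sketch from the previous section.

# Sketch: the four layers composed in one support-agent loop (names are placeholders)
def handle_ticket(user_message, profile):
    # Layer 1 -- prompt: role and tone, plus any learned preference hints
    system = "You are a polite, brief support agent. " + profile.to_prompt_hint()

    # Layer 3 -- rewards: the base model was already fine-tuned offline for this job
    decision = reward_tuned_llm.decide(system=system, user=user_message)

    # Layer 2 -- heuristics: hard rules run in plain code, after the model decides
    if decision.tool == "refund" and decision.amount > 500:
        return escalate_to_human(decision, reason="refund > $500")
    if mentions_safety_complaint(user_message):
        return escalate_to_human(decision, reason="safety complaint")

    result = execute(decision)

    # Layer 4 -- preferences: record how the user reacts, for next time
    profile.record(observe_user_reaction(result))  # e.g. "edited", "accepted_as_is"
    return result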
Practical advice
- Start with heuristics for anything that has to be right. If a rule must hold 100% of the time, it doesn't belong in the prompt. Put it in code.
- Use rewards when you can clearly score outcomes. Did the test pass? Did the user complete checkout? Did the bug get fixed? Concrete outcomes make good rewards. Vague things ("be helpful") make terrible rewards because they invite hacking.
- Be careful what user signals you treat as preferences. "User accepted" rewards bland safe outputs. "User spent time on it" rewards rambling outputs. Pick a signal that aligns with what you actually want.
- Always have a way to turn personalization off. Users sometimes want the default behavior, especially when something has gone wrong. Give them a reset.
- Audit your reward function regularly. Reward hacking is real and shows up in subtle ways. Once a quarter, look at the highest-rewarded behaviors and ask: is this what we actually wanted?