19 Evaluation · how to know if your agent works

The popular benchmarks have problems. Here's what to do instead.

In April 2026, researchers at Berkeley's RDI group (Berkeley RDI, 2026) built a "scanning agent" and pointed it at eight of the most popular AI agent benchmarks. Every one of them could be tricked into giving near-perfect scores without the agent solving any tasks. The most striking finding: a 10-line Python config file (conftest.py) was enough to fake a 100% score on SWE-bench Verified (Jimenez 2024), the benchmark most labs cite when claiming progress on coding agents. WebArena and OSWorld (Xie 2024) both call Python's eval() on text the agent produces, which is straightforward to abuse. The answer key for GAIA (Mialon 2023) is downloadable from Hugging Face.

This doesn't mean benchmarks are useless. It means single benchmark numbers should not be trusted on their own, and that you need your own evaluation if you care about whether your agent actually works. This chapter covers what the popular benchmarks actually measure, where they break, and how to build evaluation for your own system.

The popular benchmarks at a glance

| Benchmark | What it tests | Year | Current state |
| --- | --- | --- | --- |
| SWE-bench (Jimenez et al., ICLR 2024) | Can the agent fix real GitHub issues from open-source Python projects? | 2024, refined through 2026 | Top scores are above 70%, but the benchmark can be cheated. Treat individual scores carefully. |
| OSWorld (Xie et al., NeurIPS 2024) | Can the agent operate a real Linux desktop the way a human would? | 2024, still active | Scores jumped from 23% to 51% during 2025 thanks to better screen-grounding (the Jedi dataset). The agent can also tamper with the VM state, which makes some scores suspect. |
| GAIA (Mialon et al., ICLR 2024) | General assistant tasks: combine search, reasoning, file handling, and tool use. | 2023, still cited | Answers are public on Hugging Face, so an agent that knows where to look can "solve" most of it without any reasoning. |
| TAU-bench (Yao et al., 2024) | Customer service style tasks: retail and airline scenarios with simulated users. | 2024 | Harder to game because the simulated user is unpredictable. One of the more honest benchmarks. |
| Terminal-Bench | Complex command-line tasks (e.g., configuring a system, building a small program). | 2026 | New; has the same "agent and grader share an environment" problem as SWE-bench. |
| MCP-Bench (MCP 2025) | Can the agent pick the right tool from a set of MCP servers? | 2025 | Useful for the new protocol-era agents. Numbers still settling. |
| Windows Agent Arena (MS / arXiv 2024) | Same idea as OSWorld but on Windows; easy to run many instances in parallel on Azure. | 2024 | Solid infrastructure; widely used in industry research. |
| OccuBench (arXiv 2026) | Real professional work tasks across a range of jobs. | 2026 | Newer, broader coverage of "what does an actual office worker do all day?" |

Frontier scores at a glance

Treat the table below as a snapshot, not a leaderboard. Numbers move every quarter, sometimes by 10+ percentage points when a new harness lands. Always verify on the official leaderboard before quoting any specific score in a deck. The point of this table is the shape of where each benchmark sits, not its current peak.

| Benchmark | Task type | Reported frontier (2025-26 era) | Where to verify |
| --- | --- | --- | --- |
| SWE-bench Verified | Real GitHub issues, Python repos | ~50-65% with frontier models + scaffolding; cheating reports show some leaderboard scores were inflated | swebench.com |
| OSWorld | Linux desktop, GUI use | ~20-50%; jumped during 2025 with screen-grounding improvements | os-world.github.io |
| GAIA | General assistant: search + reason + tool use | High, but answer leakage means treat with care | HF leaderboard |
| TAU-bench | Customer service simulation | Substantially harder; simulated-user variability is the point | sierra-research/tau-bench |
| Windows Agent Arena | Windows desktop tasks at parallel scale | Scaling is the differentiator, not raw score | microsoft.github.io/WindowsAgentArena |
| Terminal-Bench | Complex shell sequences | New, scores still settling; same agent-environment-share problem as SWE-bench | laude-institute/terminal-bench |
| OccuBench | Real professional work across occupations | Broad; reveals where agents are still weak (legal, medical reasoning) | arXiv 2026 |

Two recurring patterns worth naming. First, every benchmark whose harness lets the agent execute Python in the grading environment has been gameable so far. SWE-bench, OSWorld, Terminal-Bench all fall in this category. The Berkeley RDI work cited above showed that a 10-line config file is sufficient to fake a 100% score on SWE-bench Verified. Second, every benchmark whose answer keys are public on Hugging Face has been gamed by retrieval. GAIA is the prominent example. The benchmarks that hold up under stress are the ones with simulated counterparties (TAU-bench) or sealed environments where the agent cannot see the grader (private SWE-bench harness instances).
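The eval() weakness in particular has a cheap mitigation when the grader only needs to compare data. The sketch below is not how any of the named harnesses work; it assumes a hypothetical grader whose expected answers are plain Python literals, and the function names parse_agent_answer and grade are invented for illustration. It uses the standard library's ast.literal_eval, which parses literals but refuses function calls and attribute access.

```python
import ast

def parse_agent_answer(text: str):
    """Parse agent output as a Python literal without executing any code."""
    cleaned = text.strip()
    try:
        return ast.literal_eval(cleaned)
    except (ValueError, SyntaxError, TypeError, MemoryError, RecursionError):
        # Not a literal (or an attempt to smuggle in code): keep it as an inert string.
        return cleaned

def grade(agent_output: str, expected) -> bool:
    """Hypothetical grader: exact match on the parsed value."""
    return parse_agent_answer(agent_output) == expected

print(grade("[1, 2, 3]", [1, 2, 3]))
print(grade("__import__('os').getcwd()", "/tmp"))
```

With eval(), the second call would run code chosen by the agent inside the grading process. Here the text is never executed and simply fails to match.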

Why these benchmark scores can mislead you

The Berkeley study found four patterns that show up across most benchmarks. None of them are sophisticated. They're the kinds of mistakes any of us could make when we set up evaluation:

How to read benchmark numbers: when a paper or company announces "70% on SWE-bench Verified", that could mean a great model solving real problems. It could also mean a team found a clever exploit in the harness. Without seeing the actual agent traces and the harness audit logs, you can't tell which it is. Treat single-number leaderboards skeptically, especially when the numbers jump suddenly.

What good evaluation actually looks like

Forget the public benchmarks for a moment. If you're building a real agent, you need evaluation that reflects your use case. Most working teams stack three layers, from cheapest to most expensive:

1 Per-component tests (run on every commit)

Each agent in your system gets its own small test set: a few dozen example inputs paired with the outputs you'd consider correct. Run these on every code change. They catch regressions in single agents fast, before you notice them in the whole-system tests.

Build them like you'd build unit tests for any other code: pick the inputs deliberately, focus on failure modes you've already seen, and don't try to cover every possible input. Quality over quantity.
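As a minimal sketch of what such a test set can look like, the code below assumes an agent that can be called as a function from string to string. Case, run_cases, and the classify_intent stand-in are hypothetical names, and exact string match is only one possible notion of "correct".

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Case:
    name: str
    agent_input: str
    expected: str

def run_cases(agent: Callable[[str], str], cases: list[Case]) -> list[str]:
    """Return the names of failing cases; an empty list means no regressions."""
    failures = []
    for case in cases:
        try:
            actual = agent(case.agent_input)
        except Exception:
            # A crash counts as a failure of that case, not of the whole run.
            failures.append(case.name)
            continue
        if actual.strip() != case.expected:
            failures.append(case.name)
    return failures

# Hypothetical stand-in for one agent in the system.
def classify_intent(message: str) -> str:
    return "refund" if "refund" in message.lower() else "other"

# Each case targets a failure mode that was (hypothetically) seen before.
CASES = [
    Case("plain refund request", "I want a refund", "refund"),
    Case("uppercase refund request", "REFUND PLEASE", "refund"),
    Case("unrelated question", "What are your opening hours?", "other"),
]

print(run_cases(classify_intent, CASES))
```

Naming each case after the failure it guards against makes a red build readable without opening the test file.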

2 End-to-end workflow tests (run nightly)

These check whether the whole system completes a task: given this user request, did the agent get the right answer? Track success rate, how many iterations it took, total cost, and how long the whole thing ran. Per-component tests don't catch problems like agents handing off bad context to each other; these whole-flow tests do.

A good source of test cases: real user conversations from your production logs, with personal info removed. Real traces find edge cases that synthetic test sets never imagine.
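The four numbers named above can be rolled up from a nightly run in a few lines. This is a sketch under assumptions: RunResult and its field names are hypothetical, and the choice of medians for iterations and duration is a judgment call (they are less sensitive to one runaway workflow than means).

```python
from dataclasses import dataclass
from statistics import mean, median

@dataclass(frozen=True)
class RunResult:
    succeeded: bool
    iterations: int
    cost_usd: float
    duration_s: float

def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Aggregate one nightly run into the headline workflow metrics."""
    if not runs:
        raise ValueError("no runs to summarize")
    return {
        "success_rate": mean(1.0 if r.succeeded else 0.0 for r in runs),
        "median_iterations": median(r.iterations for r in runs),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "median_duration_s": median(r.duration_s for r in runs),
    }
```

Storing the summary alongside the commit hash turns the nightly job into a time series you can chart.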

3 Production monitoring (always on)

Log every workflow. Sample some for human review. Track per-step cost, latency, and error rate. Build dashboards that answer "is the agent getting better or worse this week?" without anyone having to dig through logs.

The MemMachine team's analysis (MemMachine, arXiv 2026) of the LongMemEval benchmark (Wu et al., ICLR 2025) is a useful template. They listed six things they could change (chunking strategy, query wording, context formatting, etc.) and measured the effect of each one separately. Running similar isolated experiments in production gives you a clear answer to "what should I change next?"
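The "change one thing at a time" idea can be sketched as a small helper. This is not the MemMachine team's code; one_at_a_time, the evaluate callback, and the knob names in the usage example are hypothetical, and the sketch assumes your evaluation can be called with a configuration and returns a single score.

```python
from typing import Callable

Config = dict[str, str]

def one_at_a_time(
    evaluate: Callable[[Config], float],
    baseline: Config,
    alternatives: dict[str, list[str]],
) -> list[tuple[str, str, float]]:
    """Change one setting at a time and report each score delta against the baseline."""
    base_score = evaluate(baseline)
    deltas = []
    for knob, values in alternatives.items():
        for value in values:
            trial = {**baseline, knob: value}  # every other knob stays at baseline
            deltas.append((knob, value, evaluate(trial) - base_score))
    # Largest improvement first, so the top row answers "what should I change next?"
    return sorted(deltas, key=lambda d: d[2], reverse=True)
```

A call might look like one_at_a_time(run_eval, {"chunking": "fixed", "format": "text"}, {"chunking": ["sentence"], "format": ["json"]}), where run_eval is your own evaluation function. The approach ignores interactions between knobs, which is the price of a clear per-knob answer.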

One number won't tell you the truth

The three layers above tell you whether your agent works on the inputs you tested. They don't tell you whether it works on inputs that are slightly different. The gap between the two is where most production failures live. A model that scores 92% on your eval set might score 51% on the same questions paraphrased, with names changed, or with one number swapped. The 92% gives you no warning about which world you're in.

The fix is to take each test case and create small variations of it that should have the same answer. Run all of them. Then report three numbers instead of one: how it does on average, how it does on the hardest variations, and how fast accuracy drops as you push the variations further. That third number is the important one. If accuracy stays flat as you vary the inputs, the agent has actually learned the task. If accuracy falls off a cliff, it's matching on surface details that won't survive contact with real users.

Variations that almost always reveal something:

A good internal report looks like: average 0.91, worst case 0.62, accuracy drops about 0.05 per unit of variation. The third number tells you the agent is brittle even though the average looks fine. Without this kind of evaluation, you'd ship on the 0.91 and find out about the 0.62 from a customer. Running fifty variations per case isn't free, but skipping it costs more.
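One way to compute those three numbers is sketched below. It assumes each variation has been assigned an integer "variation level" (0 for the original case, higher for more aggressive rewrites), which is a labeling scheme you would have to define yourself. It reads "worst case" as the accuracy at the weakest level, and it fits the drop-off with an ordinary least-squares slope. The name robustness_report is hypothetical.

```python
from statistics import mean

def robustness_report(results: list[tuple[int, bool]]) -> dict[str, float]:
    """results holds (variation_level, correct) pairs; level 0 is the original case."""
    by_level: dict[int, list[bool]] = {}
    for level, correct in results:
        by_level.setdefault(level, []).append(correct)
    levels = sorted(by_level)
    if len(levels) < 2:
        raise ValueError("need at least two variation levels to fit a slope")
    acc = [mean(1.0 if c else 0.0 for c in by_level[lv]) for lv in levels]
    # Ordinary least-squares slope of accuracy against variation level.
    x_bar, y_bar = mean(levels), mean(acc)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(levels, acc)) / sum(
        (x - x_bar) ** 2 for x in levels
    )
    return {
        "average": mean(1.0 if c else 0.0 for _, c in results),
        "worst_level": min(acc),
        "slope_per_level": slope,
    }
```

A slope near zero means accuracy holds as inputs drift; a strongly negative slope is the "cliff" described above.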

Using one model to grade another (with care)

For tasks where there's no single right answer (summarization quality, helpfulness, code style), the most common approach in 2025 is to use a second LLM as the grader. This is sometimes called "LLM-as-judge". It's cheap and easy to set up. It also has well-documented biases you should know about:

What helps: rotate which model does the grading, randomize order in pairwise comparisons, calibrate against a small set of human-labeled examples once a month, and don't grade with the same model that produced the answer.
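For pairwise comparisons, a slightly stronger variant of order randomization is to ask the judge in both orders and count a win only when the two verdicts agree. The sketch below assumes the judge is wrapped as a function that returns "first" or "second"; the Judge type and debiased_preference are hypothetical names, and no particular model API is implied.

```python
import random
from typing import Callable

# A judge takes (prompt, first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str, str], str]

def debiased_preference(judge: Judge, prompt: str, a: str, b: str, rng: random.Random) -> str:
    """Return "a", "b", or "tie"; a win requires agreement across both presentation orders."""
    answers = {"a": a, "b": b}
    orders = [("a", "b"), ("b", "a")]
    rng.shuffle(orders)  # randomize which order the judge sees first
    votes = []
    for first_label, second_label in orders:
        verdict = judge(prompt, answers[first_label], answers[second_label])
        votes.append(first_label if verdict == "first" else second_label)
    return votes[0] if votes[0] == votes[1] else "tie"
```

A judge that always prefers whichever answer it sees first produces a tie on every pair, so a high tie rate is itself a signal that position bias is driving the grades. The cost is two judge calls per comparison.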

Reproducibility: boring but critical

If you can't re-run an evaluation tomorrow and get the same answer, you don't know what you measured. A 2026 guide on agent benchmarking Spheron, 2026 listed six things that quietly break reproducibility:

What to actually emit: OpenTelemetry's gen_ai semantic conventions

Observability used to mean "log whatever feels important." That worked when each team built one LLM application; it does not work now that an agent calls three tools that each call a sub-agent that calls a model. The fix is a shared vocabulary so traces from different libraries, different teams, and different models compose into one debuggable system. The OpenTelemetry project, the same organization behind metric and trace standards across the rest of the software industry, has been standardizing one for AI agents: the gen_ai semantic conventions.

Adopt the names. Two reasons. First, every observability platform worth naming (LangSmith, Langfuse, Arize Phoenix, Braintrust, Datadog, Helicone, MLflow, New Relic, Honeycomb, Grafana) ingests gen_ai.* attributes natively, so you swap platforms without rewriting instrumentation. Second, the names already encode the right design choices. You do not have to invent a token-counting attribute or argue about whether a "step" is a span or an event; the spec made those choices for you.

One important caveat before you adopt this in production. As of OpenTelemetry semantic conventions v1.41.0 (May 2026), every gen_ai.* attribute is still marked Status: Development, not Stable. Names can still change. The conservative move is to use the names today and hide them behind a small adapter so a future rename is a one-line change in your instrumentation library, not a search-and-replace across every service.
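A minimal sketch of such an adapter follows. The four attribute strings are names from the current gen_ai conventions; the module layout, the constant names, and the chat_attributes helper are assumptions made for illustration.

```python
# genai_attrs.py (hypothetical): the only module that spells out gen_ai.* names.
OPERATION_NAME = "gen_ai.operation.name"
REQUEST_MODEL = "gen_ai.request.model"
INPUT_TOKENS = "gen_ai.usage.input_tokens"
OUTPUT_TOKENS = "gen_ai.usage.output_tokens"

def chat_attributes(model: str, input_tokens: int, output_tokens: int) -> dict[str, object]:
    """Build the attribute dict for one model call; callers never type the raw names."""
    return {
        OPERATION_NAME: "chat",
        REQUEST_MODEL: model,
        INPUT_TOKENS: input_tokens,
        OUTPUT_TOKENS: output_tokens,
    }
```

Services pass the returned dict to whatever tracing API they use. If the spec renames an attribute before it stabilizes, the change is confined to one constant in this module.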

The four span types you actually need

The spec defines a small vocabulary that maps directly onto how an agent runs. You do not need all of it. These four spans cover almost every agent system in the wild:

| Span operation | What it represents | When you create it |
| --- | --- | --- |
| chat | One model invocation. Prompt in, completion out. The leaf of most traces. | Every time you call the model API. Auto-instrumented by most SDKs. |
| execute_tool | One tool call. The model decided to call issue_refund; this span wraps the actual execution. | Every tool dispatched by your loop. Span name pattern: execute_tool issue_refund. |
| invoke_agent | One full run of one agent: perceive, decide, act, repeat, until done. Wraps multiple chat and execute_tool spans. | The outer boundary of an agent's loop, including any sub-agents it spawns. |
| invoke_workflow | The orchestrator's view: multiple agents collaborating on one task. Wraps multiple invoke_agent spans. | At the top of a multi-agent run. This is where the orchestrator from chapter 05 lives. |

The shape that falls out is a tree: invoke_workflow at the top, invoke_agent spans for each participating agent, and inside each one a series of chat and execute_tool spans. Sub-agents nest naturally: when one agent calls another, the child's invoke_agent span is a child of the parent's. You can read off "which agent did what, in what order" by walking the tree. This is what makes the failure-attribution problem from chapter 13 tractable: the audit log from the trust engine references the same trace by ID, and the contribution ledger is just a query over the tree.
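Walking the tree can be sketched without any tracing library. The Span class below is a simplified, hypothetical stand-in for exported span data (real spans carry IDs, timestamps, and attributes), and it assumes children are stored in start order.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    operation: str  # "invoke_workflow", "invoke_agent", "chat", or "execute_tool"
    name: str
    children: list["Span"] = field(default_factory=list)

def who_did_what(span: Span, agent: str = "") -> list[tuple[str, str]]:
    """Depth-first walk that attributes each leaf step to its nearest enclosing agent."""
    if span.operation == "invoke_agent":
        agent = span.name
    steps = []
    if span.operation in ("chat", "execute_tool"):
        steps.append((agent, f"{span.operation} {span.name}"))
    for child in span.children:
        steps.extend(who_did_what(child, agent))
    return steps

# Hypothetical trace: a triage agent hands off to a refunds sub-agent.
trace = Span("invoke_workflow", "support", [
    Span("invoke_agent", "triage", [
        Span("chat", "model-x"),
        Span("invoke_agent", "refunds", [
            Span("chat", "model-x"),
            Span("execute_tool", "issue_refund"),
        ]),
    ]),
])

for agent_name, step in who_did_what(trace):
    print(agent_name, "->", step)
```

Because the sub-agent's span is nested inside its parent's, the issue_refund call is attributed to refunds rather than to triage.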

The metrics that go alongside

Three metric names are required by the spec; track them all. Adding any others on top is fine, but skip these and your dashboards will not compose with anyone else's:

The same naming logic applies to the next layer up. execute_tool spans should record duration on the same histogram (with gen_ai.operation.name=execute_tool as an attribute) so you can compare model latency to tool latency on the same dashboard. Tool-call success rate is then computable from the standard error.type attribute on those spans, no new metric needed.
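Deriving tool-call success rate from spans might look like the sketch below. It assumes exported spans are available as dicts with an "attributes" key, which is an invented shape and not a specific backend's format; it relies on the convention that error.type is set only when the operation failed, and it groups by the gen_ai.tool.name attribute.

```python
def tool_success_rates(spans: list[dict]) -> dict[str, float]:
    """Per-tool success rate, computed only from execute_tool spans."""
    totals: dict[str, int] = {}
    failures: dict[str, int] = {}
    for span in spans:
        attrs = span["attributes"]
        if attrs.get("gen_ai.operation.name") != "execute_tool":
            continue  # chat and agent spans are not tool calls
        tool = attrs.get("gen_ai.tool.name", "unknown")
        totals[tool] = totals.get(tool, 0) + 1
        if "error.type" in attrs:  # present only on failed operations
            failures[tool] = failures.get(tool, 0) + 1
    return {tool: 1 - failures.get(tool, 0) / n for tool, n in totals.items()}
```

In practice the same computation would usually be a query in your observability backend; the point is that it needs no attribute beyond the standard ones.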

Privacy by default

The spec gets one detail right that most home-grown logging gets wrong: prompts and completions are not captured by default. You opt in by setting OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true, and even then the spec recommends uploading content to external object storage and recording only a reference in the span. Two reasons this matters in practice. First, prompts and completions routinely contain personally identifiable information you cannot afford to copy into your observability backend's hot storage. Second, separating the structural trace (which scales to billions of spans cheaply) from the content blob (which needs different retention and access control) is what makes audit logs feasible. See the next section for why those are different things.
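The reference-instead-of-content pattern can be sketched as follows. Everything here is illustrative: a plain dict stands in for external object storage, the app.prompt_ref and app.completion_ref attribute names are hypothetical (they are not part of the gen_ai conventions), and capture_content stands in for whatever opt-in switch your instrumentation reads.

```python
import hashlib

def store_content(store: dict[str, bytes], content: str) -> str:
    """Write content to the (stand-in) object store and return a content-addressed key."""
    data = content.encode("utf-8")
    key = "sha256:" + hashlib.sha256(data).hexdigest()
    store[key] = data
    return key

def span_attributes(
    store: dict[str, bytes], prompt: str, completion: str, capture_content: bool
) -> dict[str, str]:
    """Span attributes carry structure always, and content references only on opt-in."""
    attrs = {"gen_ai.operation.name": "chat"}
    if capture_content:
        # Hypothetical attribute names; the raw text never enters the span.
        attrs["app.prompt_ref"] = store_content(store, prompt)
        attrs["app.completion_ref"] = store_content(store, completion)
    return attrs
```

With capture off, nothing is written anywhere. With capture on, the span holds only keys, so retention and access control for the text are decided by the store, not by the tracing backend.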

Observability traces and audit logs are not the same thing

A trace is for debugging. An audit log is for accountability. They look similar, store similar data, and routinely get confused, but they have different consumers, different retention, different access control, and different correctness guarantees. Conflating them is one of the most common mistakes operators make, and the LangChain community has explicitly called this gap out in issue #35357: callback handlers and observability platforms are designed for debugging, not for regulatory compliance.

A useful split has three sinks, not one:

The three sinks share an ID scheme so you can join across them when needed. The trace ID lives in the trace, gets referenced from the audit log entry, and is the parent of any evaluation events. But the IAM boundaries are different and the access patterns are different. An engineer chasing a regression only needs traces. An auditor reconstructing a year-old incident needs audit log plus traces (if not yet expired) plus the evaluation history. A reputation engine consuming evaluation events does not need either of the others.
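Joining across the sinks by trace ID can be sketched like this. The record shapes and the reconstruct_incident name are hypothetical, and in a real deployment each sink would be a separate store behind its own IAM boundary, not three in-memory collections.

```python
def reconstruct_incident(
    trace_id: str,
    audit_log: list[dict],
    traces: dict[str, dict],
    evaluations: list[dict],
) -> dict:
    """Auditor's view: everything that references one trace ID across the three sinks."""
    return {
        "audit": [entry for entry in audit_log if entry["trace_id"] == trace_id],
        "trace": traces.get(trace_id),  # None once the trace has passed its retention
        "evaluations": [ev for ev in evaluations if ev["trace_id"] == trace_id],
    }
```

The function tolerates a missing trace on purpose: the audit entries and evaluation history must still be readable after the debugging data has expired.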

Evaluating multi-agent systems

For a system with multiple agents, the most important metric is still: did the whole thing succeed? But that single number hides a lot. Tracking these in addition gives you better debugging info:

The trap to watch for: tuning each agent until its individual numbers look great, while the overall system success rate drops. Local optimization can hurt the team. Always keep the whole-workflow success rate as the truth metric, and use the per-agent metrics for figuring out why the workflow failed.

Public benchmarks are a fine starting point. Real evaluation is the boring work of building a held-out test set from real traffic, sampling traces for human review, calibrating your judges, and pinning the things that drift. Skip those and a "70% on SWE-bench" tells you almost nothing about whether your agent works in practice.