AI Agent Evaluation Is Broken: 5 Structural Gaps Between Evals and Production Reality

·Patrick Joubert·9 min read
ai-agents · evaluation · production-reliability · decision-architecture

You think your agent is ready for production because it passed your evaluation suite.

It isn't.

The gap between "passes evals" and "works in production" isn't a tuning problem. It's structural. Most evaluation frameworks measure the wrong thing at the wrong level of abstraction. They test isolated capabilities when production demands composed decision chains. They measure task completion on synthetic benchmarks when they should be measuring decision quality under real-world constraints.

The teams shipping reliable agents aren't running better evals. They're building evaluation as infrastructure: continuous, production-embedded, decision-focused. They've stopped asking "does this agent complete the task?" and started asking "can we validate this decision before execution?"

This memo identifies five patterns that expose the gap. Recognize them in your deployment, and you'll understand why evaluation alone can't save you.

Pattern 1: Benchmark Theater

Your agent scored 89% on your custom evaluation suite. It aced the benchmark.

Then it hit production and failed on a decision it should have gotten right.

Symptom: Agent passes all pre-deployment evals but makes critical errors in live traffic. Failures don't match the kinds of errors you tested for. The agent sometimes does the right thing, sometimes doesn't, with no obvious pattern.

Root Cause: Benchmarks test atomic actions in isolation. Production tests composed decision chains across state changes. An agent that can correctly identify an entity, retrieve its attributes, and make a single decision will fail when it needs to maintain context across three sequential decisions while the underlying data contradicts itself. The evaluation tested each step independently. Production tests the coherence of the whole sequence.

Detection Signal: Agent errors cluster around multi-step workflows. Single-action tasks have high success rates, but multi-hop reasoning sequences degrade faster than the quality of individual steps would predict. You can't reproduce the error in isolation; without the preceding context, the error disappears or changes character.

Architecture Fix: Stop evaluating atomic capabilities. Evaluate decision sequences. Build test suites where each evaluation scenario is a workflow graph, not a single prompt. Trace the agent's output not as a final answer but as a decision with provenance: which prior decisions led to this one, which context was used, where ambiguities still exist. The score isn't "correct answer" but "correct reasoning chain that would produce a correct answer next time too."
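A minimal sketch of sequence-level scoring, where a scenario is an ordered chain of decision steps rather than a single prompt. The `Step` class, the `agent(prompt, context)` signature, and the all-or-nothing scoring rule are illustrative assumptions, not a real framework:

```python
# Sequence-level evaluation sketch: the score requires the whole
# decision chain to hold together, not just the final answer.
from dataclasses import dataclass

@dataclass
class Step:
    prompt: str      # what the agent is asked at this point in the workflow
    expected: str    # the decision a coherent chain should produce here

def score_sequence(agent, steps: list[Step]) -> float:
    """Score a workflow as a chain: each decision becomes provenance
    (carried context) for the next one. One broken link fails the chain."""
    context: list[str] = []
    for step in steps:
        decision = agent(step.prompt, context)
        if decision != step.expected:
            return 0.0               # a mid-chain error fails the scenario
        context.append(decision)     # the next decision must build on this one
    return 1.0
```

The design choice this encodes is the one the paragraph argues for: a correct final answer reached through an incoherent chain scores zero, because that chain would not produce a correct answer next time.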

Pattern 2: The Accuracy Trap

Your agent has 95% accuracy on the eval. That sounds reliable.

At 10,000 decisions per day, you're pushing 500 errors into production daily.

Symptom: Errors start rare and manageable in testing but create cascading failures at scale. A single wrong decision triggers follow-up systems to make wrong decisions downstream. You notice that failures cluster: when one agent decision is wrong, 3-5 subsequent decisions in dependent workflows are wrong too. Error rates don't scale linearly with volume; they compound.

Root Cause: Accuracy metrics treat each decision as independent. Production doesn't. One agent decision feeds into human review queues, automated downstream actions, and context for future decisions. An error rate that's acceptable on a single decision becomes catastrophic when that decision is the input to five more. You need to evaluate not accuracy, but error cost: the accuracy of the decision at position N multiplied by the accuracy degradation it introduces at position N+1.

Detection Signal: Your monitoring shows that failure rates in production are much worse than accuracy metrics predict. When you trace root causes, the same wrong decisions appear in multiple failure chains. You see users or downstream systems correcting agent decisions, and those corrections become the context for the next agent action, which is sometimes wrong in a way that wouldn't have happened with the correct prior decision.

Architecture Fix: Replace accuracy metrics with compound error analysis. For each evaluation scenario, calculate not just whether the agent was right, but what happens if it's wrong. What are the downstream consequences? Build error matrices, not confusion matrices. Score the agent on "would this error propagate?" not just "is this correct?" Evaluate multi-step workflows and measure the error rate of the final output given a 5% error rate on each intermediate step. If your compound error rate exceeds your production SLO, you don't have an agent reliability problem. You have an architecture problem. You need error isolation or decision validation before execution, not better accuracy.
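The compounding math in that last measurement is worth making concrete. Assuming (as a simplification) that each step in a workflow fails independently with the same per-step error rate, the end-to-end error rate is:

```python
# Compound error analysis sketch: probability that at least one step
# in an n-step chain is wrong, assuming independent per-step failures.
def compound_error_rate(per_step_error: float, steps: int) -> float:
    """P(at least one step wrong) = 1 - P(every step right)."""
    return 1.0 - (1.0 - per_step_error) ** steps

# A 5% per-step error rate over a 5-step workflow:
rate = compound_error_rate(0.05, 5)
print(f"{rate:.1%}")  # 22.6% of end-to-end workflows contain an error
```

Independence is the optimistic case; the pattern's root cause argues real errors correlate downstream, so production compound rates will often be worse than this lower bound suggests.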

Pattern 3: Context Blindness

Your evaluation suite uses clean, complete context. Every entity has all relevant attributes. Every decision point has clear, unambiguous information.

Production is messier. Much messier.

Symptom: Agent gives answers that are logically correct for the context it perceived but wrong for the actual production context. You ask the agent the same question with the same phrasing, but it answers differently depending on what's in context memory. The agent confidently makes decisions on incomplete information and doesn't recognize when information is missing or contradictory.

Root Cause: Evaluation contexts are synthetic. They're curated, complete, and unambiguous. Real production contexts have missing data, contradictory signals, stale information, and ambiguous entities. An agent evaluated on "customer X with account Y has these attributes" will fail when production says "customer X might be one of three different entities, here's their conflicting data, make a decision anyway." The evaluation didn't teach the agent that real context is noisy.

Detection Signal: When you trace agent errors, they almost never involve "the agent didn't know how to process this type of data." They involve "the agent didn't notice the data was incomplete or contradictory." Users report that the agent worked fine when they provided complete information but failed when context was sparse. The agent's reasoning is sound given what it perceived, but it perceived wrong because the context was ambiguous.

Architecture Fix: Inject realistic context degradation into evaluations. Don't just test "customer with complete attributes." Test "customer with missing phone number," "customer with two possible matches," "customer data that contradicts itself," "customer context from three days ago versus now." Score the agent not on how well it decides with complete information, but on how well it identifies context gaps and either requests clarification or explicitly documents its assumptions. Build evaluations that measure applicability: can the agent recognize when context isn't sufficient for the decision it's being asked to make? The goal isn't to make agents work with bad context. It's to make agents refuse to decide when context doesn't support the decision.
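One way to mechanize this is a degradation generator that turns each clean eval case into sparse, ambiguous, and stale variants. The dict shape and field names (`context`, `phone`, `matches`, `as_of`) are hypothetical, chosen to mirror the examples above:

```python
# Context degradation sketch: expand one clean eval case into variants
# with missing, ambiguous, and stale context, so the suite scores how
# the agent handles gaps, not just complete information.
import copy

def degrade(case: dict) -> list[dict]:
    variants = []

    missing = copy.deepcopy(case)
    missing["context"].pop("phone", None)       # missing attribute
    variants.append(missing)

    ambiguous = copy.deepcopy(case)
    ambiguous["context"]["matches"] = 2         # two possible entity matches
    variants.append(ambiguous)

    stale = copy.deepcopy(case)
    stale["context"]["as_of"] = "3 days ago"    # outdated snapshot
    variants.append(stale)

    return variants
```

On the degraded variants, the passing behavior is a refusal or an explicit assumption, not a confident answer; scoring them the same way as the clean case would defeat the purpose.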

Pattern 4: Decision Provenance Gap

Your agent output passes validation. The decision is correct.

So why does the next decision in the sequence go wrong?

Symptom: Individual agent decisions look right when you review them. The reasoning is sound, the conclusion follows. But when you look at sequences of decisions, later decisions don't build logically on earlier ones. The agent's reasoning is internally consistent but doesn't maintain coherence across the conversation or workflow. You can't reliably reproduce the error because the agent's reasoning changes context between decisions.

Root Cause: Evaluations check outputs, not reasoning chains. You validate that "the agent chose X" but never validate "the agent chose X because of context Y, and context Y is still true, and the next decision should account for that." In production, a decision isn't just right or wrong. It's right or wrong given what the agent claims to know. If the agent's reasoning at step 1 contradicts its reasoning at step 3, subsequent decisions fail because they're built on false premises. Evals don't catch this because they don't track provenance.

Detection Signal: When you audit agent decisions, individual decisions look defensible. But the sequence looks incoherent. The agent reasons one way about a fact at timestamp T, then reasons differently about the same fact at timestamp T+5 minutes. Users report that they can't predict the agent's next move because its reasoning doesn't build on its prior reasoning. When you trace errors, they often start with "the agent forgot what it said earlier" or "the agent used contradictory facts to justify different decisions."

Architecture Fix: Embed decision provenance into evaluation and production. Every agent decision needs a trace: what context was used, what was assumed, what was uncertain. Score evals not just on the decision, but on whether the decision's reasoning chain would allow the next downstream decision to be correct too. Build evaluation scenarios where consistency matters. Ask the agent to decide on X, then later ask it about X again with slightly different framing. Can it recognize it's the same question? Does its reasoning remain consistent? In production, every decision should be paired with its provenance: a structured record of which context informed it, what was assumed, what level of confidence applies. Before downstream systems act on an agent decision, they should validate that provenance makes sense in the current context.
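A sketch of what that pairing and pre-execution check could look like. The record structure and threshold are illustrative assumptions, not a standard schema:

```python
# Decision provenance sketch: each decision carries the context it was
# based on; downstream systems re-check that context before acting.
from dataclasses import dataclass, field

@dataclass
class Provenance:
    context_used: dict                            # facts the decision relied on
    assumptions: list[str] = field(default_factory=list)
    confidence: float = 1.0

@dataclass
class Decision:
    action: str
    provenance: Provenance

def validate_before_execution(decision: Decision, current_context: dict,
                              min_confidence: float = 0.8) -> bool:
    """Refuse to execute when confidence is low or when any fact the
    decision relied on no longer holds in the current context."""
    if decision.provenance.confidence < min_confidence:
        return False
    return all(current_context.get(key) == value
               for key, value in decision.provenance.context_used.items())
```

The gate catches exactly the failure mode this pattern describes: a decision that was defensible at timestamp T is blocked at T+5 minutes if the facts it was built on have changed.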

Pattern 5: Temporal Drift Blindness

You ran comprehensive evals before deploying. Everything looked good.

Six weeks later, without any code changes, reliability degraded 12%.

Symptom: Agent performance decays over time with no obvious cause. The decay is slow enough that no single incident triggers an alert, but eventually you notice the baseline has shifted. The agent works well on recent examples but worse on older patterns. User complaints increase but don't spike. They trend. You deploy a fix, performance improves slightly, then gradually decays again.

Root Cause: Evaluation is a point-in-time gate. You test once, decide the agent is ready, deploy it, then never evaluate it again until something breaks. Production data distribution changes. Context composition changes. The world drifts. An agent that was 94% accurate at deployment is 89% accurate three months later, not because the agent changed, but because the data it operates on changed. Pre-deployment evals don't catch this because they don't run continuously.

Detection Signal: You notice that agent errors aren't reproducible from old test cases. When you re-run your pre-deployment evaluation suite months later, the agent still passes. But production performance is worse. This means the evaluation itself was too narrow: it tested the distribution that existed at evaluation time, not the distribution that emerges over months. You start tracking agent performance by age (when was this decision made?) and find strong trends: recent decisions have different error patterns than old ones.

Architecture Fix: Make evaluation continuous and production-embedded, not pre-deployment and static. Deploy evaluation probes that continuously measure agent decision quality on real data. Set decision validation gates that check agent outputs before they affect downstream systems or user experience. The validation should measure not just "is this right?" but "does this decision's quality match the baseline from deployment?" When quality degrades, you get early warning before users do. You're not trying to perfectly predict production behavior at evaluation time. That's impossible. You're building infrastructure that detects when production behavior changes and raises flags before degradation becomes critical. The evaluation suite that predicted deployment quality becomes the monitoring suite that measures production health.
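The core of such a probe can be small. This sketch assumes you already score sampled live decisions somehow and only shows the drift check against the deployment baseline; the tolerance value is an arbitrary placeholder:

```python
# Drift probe sketch: compare rolling production quality against the
# baseline measured at deployment, and flag slippage early.
def drift_alert(recent_scores: list[float], baseline: float,
                tolerance: float = 0.03) -> bool:
    """True when rolling quality has slipped more than `tolerance`
    below the quality measured at deployment time."""
    if not recent_scores:
        return False
    rolling = sum(recent_scores) / len(recent_scores)
    return (baseline - rolling) > tolerance
```

With a 94% baseline and a 3-point tolerance, the 89% agent from the example above trips the alert weeks before the trend becomes a visible incident, which is the whole point: the same suite that gated deployment becomes the monitor that measures production health.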

Why Evaluation Fails

These five patterns share a common structure: the gap between evaluation and production isn't a data problem or a model problem. It's a visibility problem.

Evaluation happens in controlled conditions. Production is uncontrolled. Evaluation tests task completion. Production tests decision quality under pressure. Evaluation is a gate: one event, once, before launch. Production is continuous. The agent that passes a curated benchmark set doesn't fail because it's dumb. It fails because benchmarks don't measure what actually matters in production.

What matters in production is whether a decision is defensible before it's executed. Not whether it's ultimately correct. Not whether it's accurate on historical data. But whether, given the context available, the decision reasoning is sound enough that executing it won't break something else.

That's why the teams shipping reliable agents stop thinking of evaluation as testing and start thinking of it as infrastructure. They build systems where every agent decision is paired with its reasoning, its context, and its confidence. Where every decision can be validated before execution. Where validation rules are domain-specific: knowing what makes a decision trustworthy in your context. Where continuous monitoring catches drift before it becomes failure.

They've solved the evaluation problem by making evaluation continuous, decision-focused, and production-embedded. Not a gate you pass once. A system you build into every deployment.

Next

Running into these patterns in production?

Compare notes. The structural gaps are real. The fixes require rebuilding how you think about what evaluation means. But every team that made the shift reports the same outcome: lower incident rates, faster error recovery, and agents that actually work the way they were supposed to work before they went live.


Cite this memo

Patrick Joubert. (2026). "AI Agent Evaluation Is Broken: 5 Structural Gaps Between Evals and Production Reality." The Context Graph. https://thecontextgraph.co/memos/ai-agent-evaluation-guide
