Context Graph vs Agent Evaluation: Measurement vs Enforcement

Q: What is the difference between agent evaluation and a context graph?

Agent evaluation measures whether an AI agent behaved correctly across test cases, traces, tool calls, and production samples. A context graph is a decision boundary that validates, before execution, whether a proposed action is applicable, scoped, current, policy-compliant, and traceable.

Q: Are trace-based evals enough for production agents?

No. Trace-based evals are useful for finding failures and improving agents, but they do not themselves authorize or block the next action. Production agents need trace-based evaluation for measurement and a decision context graph for pre-execution enforcement.

Q: Do production agents need both evaluation and context graphs?

Yes. Evaluation measures quality, regression, safety, and drift. A decision context graph governs whether each proposed action is valid at execution time. The strongest production architecture uses both.

Patrick Joubert


Evals Measure. Enforcement Decides.

Agent evaluation is becoming a default production requirement. Teams now score tool calls, arguments, plans, trajectories, handoffs, safety, cost, latency, and production traces.

That is necessary infrastructure. It is not, however, decision infrastructure.

Evaluation tells a team whether an agent behaved well in a sample. A context graph determines whether the next proposed action is valid in the live context before it reaches a tool, database, workflow, or customer system.

The Core Distinction

Agent evaluation is a measurement layer. It helps teams inspect behavior, score quality, detect regressions, compare versions, and turn production failures into new tests.

A context graph is a decision boundary. It evaluates the proposed action against scope, policy, temporal validity, applicability, provenance, and prior decisions before execution.

Layer	Core question	Control point	Primary artifact	Limit
Agent evaluation	Did the agent behave correctly?	Before release, during review, or after production sampling	Score, rubric, trace review, regression result	A high score does not stop an invalid next action when the live context has changed
Context graph	Should this action be allowed now?	Before execution	Applicability result, policy decision, causal decision trace	Requires a maintained decision model with current rules, scope, and temporal state
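
The decision boundary in the table can be sketched as one function that runs before any tool call. This is a minimal illustration under stated assumptions, not a reference implementation: `ProposedAction`, `Rule`, `Decision`, and `decide` are hypothetical names, and a real context graph would likely also consult prior decisions and a richer policy model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class ProposedAction:
    actor: str
    tool: str
    resource: str
    source: Optional[str]  # provenance of the context behind the action


@dataclass(frozen=True)
class Rule:
    tool: str
    allowed_actors: frozenset
    allowed_resources: frozenset
    valid_from: datetime
    valid_until: datetime


@dataclass
class Decision:
    allowed: bool
    reasons: List[str] = field(default_factory=list)


def decide(action: ProposedAction, rules: List[Rule], now: datetime) -> Decision:
    """Return an allow/block decision with the reasons for any block."""
    rule = next((r for r in rules if r.tool == action.tool), None)
    if rule is None:
        return Decision(False, ["applicability: no rule covers this tool"])

    reasons = []
    if action.actor not in rule.allowed_actors:
        reasons.append("policy: actor is not permitted to use this tool")
    if action.resource not in rule.allowed_resources:
        reasons.append("scope: resource is outside the rule's scope")
    if not (rule.valid_from <= now <= rule.valid_until):
        reasons.append("temporal validity: rule is not in force now")
    if action.source is None:
        reasons.append("provenance: no source recorded for this action")
    return Decision(allowed=not reasons, reasons=reasons)


# Hypothetical rule and action, for illustration only.
rule = Rule(
    tool="issue_refund",
    allowed_actors=frozenset({"support_agent"}),
    allowed_resources=frozenset({"order:123"}),
    valid_from=datetime(2024, 1, 1),
    valid_until=datetime(2024, 12, 31),
)
action = ProposedAction("support_agent", "issue_refund", "order:123", "crm")

in_force = decide(action, [rule], now=datetime(2024, 6, 1))
expired = decide(action, [rule], now=datetime(2025, 2, 1))
print(in_force.allowed, expired.allowed, expired.reasons)
```

The same action is allowed while the rule is in force and blocked once it has lapsed, which is the case an offline score cannot see.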

What Agent Evaluation Is Good For

Evaluation gives engineering teams a feedback loop. It can reveal wrong tool selection, bad arguments, inefficient plans, incomplete tasks, unsafe outputs, handoff gaps, memory failures, and regressions after a model or prompt change.

Trace-based evaluation is especially useful because it inspects the trajectory, not only the final answer. It can show whether the agent used stale memory, skipped an approval step, queried the wrong source, or handed incomplete context to another agent.
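
As a rough illustration, a trajectory check of this kind might look like the sketch below. The trace format (a list of dicts with a `tool` key), the tool names, and the `memory_age_s` field are assumptions made for the example, not the schema of any particular tracing product.

```python
from typing import Dict, List


def evaluate_trace(trace: List[Dict], max_memory_age_s: int = 3600) -> List[str]:
    """Inspect a finished trajectory and return findings; nothing is blocked."""
    findings = []
    tools = [step["tool"] for step in trace]

    # Was a refund issued without an approval step earlier in the trajectory?
    if "issue_refund" in tools:
        before_refund = tools[: tools.index("issue_refund")]
        if "request_approval" not in before_refund:
            findings.append("skipped approval step")

    # Did any memory read return an entry older than the allowed age?
    for step in trace:
        if step["tool"] == "read_memory" and step.get("memory_age_s", 0) > max_memory_age_s:
            findings.append("used stale memory")
            break
    return findings


# Hypothetical trace of an already completed run.
trace = [
    {"tool": "read_memory", "memory_age_s": 7200},
    {"tool": "lookup_order"},
    {"tool": "issue_refund"},
]
findings = evaluate_trace(trace)
print(findings)
```

The function runs over a completed trace, so by the time it reports a skipped approval the refund has already been issued.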

But evaluation remains a measurement system. It improves the agent over time. It does not, by itself, make the next live action valid.

The Evaluation-to-Enforcement Gap

Eval surface	What it measures	What still needs enforcement
Tool correctness	Did the agent call the right tool with the right arguments?	Was this tool use valid for this actor, customer, policy, and time?
Plan quality	Did the agent follow a coherent sequence?	Should each step in the sequence be allowed before it runs?
Task completion	Did the agent reach the requested outcome?	Was the requested outcome permitted under the governing context?
Trace-based eval	Where did the trajectory fail?	Can the same invalid trajectory be blocked next time?
Production monitoring	Is quality drifting in live traffic?	Which decision boundary prevents the drift from creating side effects?

Where the Difference Shows Up

Refund workflow

Evaluation: An eval can score whether the agent selected the refund tool, supplied the correct amount, and completed the task.

Context graph: A context graph validates whether the refund policy applies to this customer, contract, geography, approval state, and time window before the refund API is called.
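
A minimal sketch of that pre-call validation, assuming a hypothetical policy (30-day window, two eligible countries, approval required above 100) and hypothetical names such as `RefundRequest` and `refund_blockers`:

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List

# Hypothetical policy values, chosen only for illustration.
REFUND_WINDOW = timedelta(days=30)
REFUND_COUNTRIES = {"FR", "DE"}
APPROVAL_THRESHOLD = 100.0


@dataclass(frozen=True)
class RefundRequest:
    customer_active: bool
    contract_allows_refunds: bool
    country: str
    amount: float
    approved: bool
    purchase_date: date


def refund_blockers(req: RefundRequest, today: date) -> List[str]:
    """Return the reasons the refund must not run; empty means allowed."""
    blockers = []
    if not req.customer_active:
        blockers.append("customer")
    if not req.contract_allows_refunds:
        blockers.append("contract")
    if req.country not in REFUND_COUNTRIES:
        blockers.append("geography")
    if req.amount > APPROVAL_THRESHOLD and not req.approved:
        blockers.append("approval state")
    if today - req.purchase_date > REFUND_WINDOW:
        blockers.append("time window")
    return blockers


def issue_refund_if_valid(req: RefundRequest, today: date, refund_api) -> dict:
    """Call the refund API only when no blocker applies."""
    blockers = refund_blockers(req, today)
    if blockers:
        return {"status": "blocked", "reasons": blockers}
    refund_api(req.amount)
    return {"status": "refunded", "reasons": []}


calls = []  # stands in for the real refund API
late = RefundRequest(True, True, "FR", 250.0, False, date(2024, 1, 1))
result = issue_refund_if_valid(late, date(2024, 3, 1), calls.append)
print(result, calls)
```

An eval could score this agent highly for choosing the refund tool with the correct amount; the check above still blocks the call because the approval is missing and the window has closed.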

Data agent

Evaluation: A trace-based eval can show whether the agent queried the expected table and produced a plausible answer.

Context graph: A decision context graph checks source authority, data tier, freshness, row-level scope, and allowed purpose before the query becomes an action.
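
The five checks named above could be expressed roughly as follows. The `Source` and `QueryRequest` shapes, the numeric tier convention, and the filter strings are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class Source:
    name: str
    authoritative: bool
    tier: int                    # 1 = least sensitive, higher = more sensitive
    last_refreshed: datetime
    allowed_purposes: frozenset


@dataclass(frozen=True)
class QueryRequest:
    source: Source
    max_tier: int                # highest tier the caller may read
    purpose: str
    row_filter: str              # row-level scope the query will apply
    permitted_filters: frozenset


def query_blockers(req: QueryRequest, now: datetime, max_age: timedelta) -> List[str]:
    """Return the reasons the query must not run; empty means allowed."""
    blockers = []
    if not req.source.authoritative:
        blockers.append("source authority")
    if req.source.tier > req.max_tier:
        blockers.append("data tier")
    if now - req.source.last_refreshed > max_age:
        blockers.append("freshness")
    if req.row_filter not in req.permitted_filters:
        blockers.append("row-level scope")
    if req.purpose not in req.source.allowed_purposes:
        blockers.append("allowed purpose")
    return blockers


# Hypothetical source and request.
orders = Source(
    name="orders",
    authoritative=True,
    tier=2,
    last_refreshed=datetime(2024, 5, 1),
    allowed_purposes=frozenset({"billing_support"}),
)
request = QueryRequest(
    source=orders,
    max_tier=2,
    purpose="marketing",
    row_filter="region = 'EU'",
    permitted_filters=frozenset({"region = 'EU'"}),
)
blockers = query_blockers(request, now=datetime(2024, 5, 3), max_age=timedelta(days=1))
print(blockers)
```

Here the agent picked the expected table, yet the query is blocked because the data is two days old and the stated purpose is not permitted for that source.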

Multi-agent handoff

Evaluation: A handoff eval can detect whether the receiving agent had enough context to continue the workflow.

Context graph: Pre-execution enforcement validates that the handoff context is within scope, current, provenance-backed, and sufficient for the receiving agent's next action.
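
One way to picture handoff validation is a check over the context items passed between agents. In this sketch, `ContextItem`, `handoff_blockers`, and the key names are hypothetical, and "current" is approximated by a simple maximum age.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class ContextItem:
    value: str
    source: Optional[str]        # where the item came from, if known
    observed_at: datetime


def handoff_blockers(
    context: Dict[str, ContextItem],
    required: Set[str],
    in_scope: Set[str],
    now: datetime,
    max_age: timedelta,
) -> List[str]:
    """Return reasons the receiving agent must not act on this handoff."""
    blockers = []
    # Sufficient: every key the next action needs is present.
    for key in sorted(set(required) - set(context)):
        blockers.append(f"insufficient: missing {key}")
    for key in sorted(context):
        item = context[key]
        if key not in in_scope:
            blockers.append(f"out of scope: {key}")
        if item.source is None:
            blockers.append(f"no provenance: {key}")
        if now - item.observed_at > max_age:
            blockers.append(f"stale: {key}")
    return blockers


# Hypothetical handoff payload.
context = {
    "order_id": ContextItem("123", "crm", datetime(2024, 5, 1, 12, 0)),
    "customer_note": ContextItem("call back", None, datetime(2024, 5, 1, 9, 0)),
}
blockers = handoff_blockers(
    context,
    required={"order_id", "approval_status"},
    in_scope={"order_id", "approval_status", "customer_note"},
    now=datetime(2024, 5, 1, 12, 30),
    max_age=timedelta(hours=1),
)
print(blockers)
```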

The Production Stack

Reliable agent infrastructure separates measurement from authority:

1. Evaluate the agent against curated scenarios, production traces, regression sets, and domain rubrics.
2. Observe live behavior through traces, spans, costs, errors, memory operations, and handoffs.
3. Enforce every proposed action through a decision context graph before the side effect reaches an external system.
4. Record the allow or block result as a causal decision trace that future evaluations can inspect.
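
The enforce and record steps can be sketched as a thin wrapper around each tool call. This is an illustration under assumptions: `enforce_and_record`, the `check` callback, and the log record shape are hypothetical, and a production system would likely persist the trace rather than keep it in a list.

```python
from typing import Any, Callable, Dict, List


def enforce_and_record(
    tool: Callable[..., Any],
    args: Dict[str, Any],
    check: Callable[[str, Dict[str, Any]], List[str]],
    decision_log: List[Dict[str, Any]],
) -> Any:
    """Gate one tool call and append the allow/block result to the log."""
    reasons = check(tool.__name__, args)
    decision_log.append(
        {"tool": tool.__name__, "args": args, "allowed": not reasons, "reasons": reasons}
    )
    if reasons:
        return None  # blocked: the side effect never happens
    return tool(**args)


sent = []  # stands in for an external mail system


def send_email(to: str, body: str) -> str:
    sent.append(to)
    return "sent"


def check(tool_name: str, args: Dict[str, Any]) -> List[str]:
    # Hypothetical rule: only internal recipients are in scope.
    if tool_name == "send_email" and not args["to"].endswith("@example.com"):
        return ["scope: external recipient"]
    return []


log: List[Dict[str, Any]] = []
enforce_and_record(send_email, {"to": "ops@example.com", "body": "hi"}, check, log)
enforce_and_record(send_email, {"to": "x@other.org", "body": "hi"}, check, log)

# A later evaluation can read the recorded decisions.
blocked = [d for d in log if not d["allowed"]]
print(sent, [d["allowed"] for d in log], blocked[0]["reasons"])
```

Writing the decision to the log before the tool runs means blocked actions leave a record too, so later evaluations can inspect them alongside the allowed ones.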

The Accountable Agent Test

Ask one question: can the system block a high-scoring but currently invalid action before it reaches the tool?

If the answer is no, the system has evaluation. It does not yet have pre-execution enforcement.
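
The test can be written down as a small check that any gate either passes or fails. The names here (`passes_accountable_agent_test`, `eval_only_gate`, `enforcing_gate`) and the action fields are hypothetical; the point is only that a gate keyed on the eval score lets the action through, while a gate keyed on live validity does not.

```python
from typing import Callable, Dict


def passes_accountable_agent_test(
    gate: Callable[[Dict], bool], high_scoring_invalid_action: Dict
) -> bool:
    """True if the gate keeps a high-scoring but invalid action from the tool."""
    tool_calls = []

    def tool(action: Dict) -> None:
        tool_calls.append(action)  # stands in for the real side effect

    if gate(high_scoring_invalid_action):
        tool(high_scoring_invalid_action)
    return tool_calls == []


# Hypothetical action: scored well offline, but the policy has since lapsed.
action = {"tool": "issue_refund", "eval_score": 0.97, "policy_in_force": False}


def eval_only_gate(a: Dict) -> bool:
    return a["eval_score"] >= 0.9


def enforcing_gate(a: Dict) -> bool:
    return a["policy_in_force"]


eval_only_result = passes_accountable_agent_test(eval_only_gate, action)
enforcing_result = passes_accountable_agent_test(enforcing_gate, action)
print(eval_only_result, enforcing_result)
```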