Eval: a graph harness beats raw prompting across 8 frontier models
Closing the last mile to reliable business outcomes.
We ran 4,800 evaluations across 8 frontier models and found that the problem is rarely the model itself. The problem is how information is structured and delivered at decision time.
You ask your AI assistant to handle a customer question. It gives you an answer. The answer is wrong.
You dig in and realize the AI had access to the right information. It just did not use it properly. It skimmed the header instead of the section. It stopped too early. It took a shortcut.
This is one of the most common frustrations with AI systems today, and the reflex is usually the same: switch to a newer model, a larger model, or a more expensive model.
That instinct misses the real issue.
The model is often not the problem. The way you are giving it information is.
Think of it like a GPS, not a textbook
When you drive somewhere new, you do not memorize a 40-page travel guide before setting off and hope to recall the right page at each intersection. You use a GPS: turn by turn, one decision at a time, recalculating when something changes.
Most companies still feed their AI the travel guide. They paste in a full policy document, a product manual, or a large knowledge base and hope the model will find the right detail at the right moment. Sometimes it does. Often, it takes shortcuts. And when it fails, there is no clean way to know exactly where the reasoning drifted.
What if, instead of a textbook, you gave the model a GPS?
That is what we set out to test.
What we tested
We took a realistic refund policy, the kind a customer support team deals with every day, and ran 200 eligibility questions through 8 frontier models from Anthropic and OpenAI:
- Claude Sonnet 4.5
- Claude Opus 4.5
- Claude Sonnet 4.6
- Claude Opus 4.6
- GPT-5.2
- GPT-5.2-Codex
- GPT-5.3-Codex
- GPT-5.4
Each model was tested in three setups:
- Raw policy in the prompt. The full policy text is pasted into the prompt, and the model reads everything at once. This is how many teams still operate today.
- Context Graph harness. The same policy is structured as a graph the model navigates step by step, reading one section, deciding where to go next, and building its answer along a decision path.
- Optimized Context Graph harness. Same graph, but refined after observing where models actually failed. Shortcuts were blocked. Weak transitions were tightened. The underlying models did not change.
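To make the second setup concrete, here is a minimal sketch of what "navigating a policy as a graph" can look like. This is an illustration only, not Rippletide's actual implementation: the `Node` shape, the edge labels, and the `choose` callback (which stands in for the model's per-step decision) are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One policy section the model reads in full when it arrives here."""
    id: str
    content: str
    edges: dict = field(default_factory=dict)  # choice label -> next node id

def navigate(graph, start, choose):
    """Walk the graph one node at a time, recording the decision path.

    `choose` stands in for the model: given a node's content and its
    outgoing edge labels, it returns the label to follow, or None to stop.
    """
    path, node_id = [], start
    while node_id is not None:
        node = graph[node_id]
        path.append(node.id)
        label = choose(node.content, list(node.edges))
        node_id = node.edges.get(label)
    return path
```

The point of the loop is that the model never sees the whole policy at once: it reads one section, commits to one transition, and leaves a `path` behind that can be inspected later.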
That gave us 4,800 total evaluations.
No prompt tuning. No per-model rescue logic. The same task, the same instructions, and one question that mattered: which setup produces the right business outcome most reliably?
That distinction matters. We evaluated the business decision itself, not just whether the model produced a plausible-looking answer. If you want the reasoning behind that methodology, see Micro, Macro, and Multi-Determinism for AI Agents.
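Scoring the business decision rather than the answer text can be sketched in a few lines. The case format and the `run_agent` callback below are hypothetical, purely to show the shape of the check: compare the final decision, not the wording around it.

```python
def evaluate(cases, run_agent):
    """Accuracy over eligibility cases, scored on the decision itself.

    `run_agent` is whichever setup is under test (raw prompt, graph,
    optimized graph); it returns a final decision like "approve"/"deny".
    """
    correct = sum(run_agent(case["question"]) == case["expected"] for case in cases)
    return correct / len(cases)
```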
What we found
Every single model improved.
Not one regressed.
The gains were not marginal. Claude Sonnet 4.5, a lighter and faster model, went from 38% accuracy on the raw policy to 74.5% with the optimized graph, nearly doubling performance without a model upgrade. Both Opus variants gained roughly 30 percentage points.
Anthropic models all improved once the task was restructured into a navigable graph.
The OpenAI side showed the same pattern. GPT-5.2 moved from 62.5% to 80.5%. GPT-5.2-Codex reached 84.5%. Even the models that started stronger still improved after the harness was optimized.
The most important result was not just higher scores. It was the way structure narrowed the gap between models. Before optimization, the spread between the best and worst performer was 39.5 percentage points. After optimization, that gap dropped to 19 points.
The graph did not make the models smarter. It gave them structure that compensated for their individual weaknesses, and it worked across all of them.
Why traceability changes everything
Better accuracy is useful. But the real shift is what happens when something goes wrong.
With a wall of text, a wrong answer is a dead end. You know the model failed, but you do not know why. You have no clean way to fix just that failure mode without risking changes elsewhere.
With a graph, every step the model takes is visible. You can see which section it read, where it stopped, and what it skipped.
A score of 69.5% is no longer just a number. It becomes a diagnosis.
In our testing, we saw models consistently stopping after three navigation steps instead of going deeper. We saw them reading section headers instead of section content. And because those failures were traceable, we could patch them precisely without touching the rest of the system.
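Because every hop is recorded, a wrong answer arrives with its path attached, and the failure modes above become mechanical checks. A minimal sketch, assuming a hypothetical trace format where each step records the node visited and how much of it was read:

```python
def diagnose(trace, expected_depth=4):
    """Flag the two failure modes we saw most often: stopping too
    early, and reading headers instead of section content."""
    issues = []
    if len(trace) < expected_depth:
        issues.append(
            f"stopped after {len(trace)} steps, expected at least {expected_depth}"
        )
    for step in trace:
        if step.get("read") == "header_only":
            issues.append(f"read only the header of {step['node']}")
    return issues
```

With a wall of text there is nothing equivalent to `trace` to inspect; the diagnosis is only possible because navigation is explicit.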
That is the difference between guessing and engineering.
Localized fixes, not endless rewrites
The default way teams improve AI agents today is prompt engineering: rewrite the instructions, add examples, adjust settings, retest, and hope the improvement generalizes.
That approach is slow, brittle, and usually model-specific. Every time the underlying model changes, you start again.
Graph engineering is different.
When a model takes a shortcut, you block the shortcut at the exact node where it happens. When it misses a prerequisite, you strengthen that branch. The fix is localized. It does not require rewriting the full interaction contract.
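Those fixes can be expressed as small, local graph edits. A sketch, assuming a simple adjacency-dict representation (node id to `{edge label: next node id}`; all names are hypothetical):

```python
def block_shortcut(edges, node_id, bad_label):
    """Remove an edge models abuse as a shortcut, forcing the
    longer path through the proper checks."""
    edges[node_id].pop(bad_label, None)

def strengthen_branch(edges, node_id, label, prerequisite_id):
    """Reroute an edge through a prerequisite node so the
    missed check can no longer be skipped."""
    edges[node_id][label] = prerequisite_id
```

Each edit touches one node. Nothing else in the graph, and no prompt anywhere, has to change.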
And because the harness is model-agnostic, the same change that helped Claude Sonnet 4.5 also helped GPT-5.2-Codex.
You optimize once. Every model that runs through the harness benefits.
This is not prompt engineering.
This is infrastructure.
The ceiling keeps moving
Our best result was 84.5% accuracy.
A fair question follows immediately: what about the remaining 15.5%?
Some cases still involve complex logical edge conditions that current models do not resolve consistently. But the important point is not that the ceiling exists. It is that the ceiling moves every time you make the failure path visible and fixable.
Each optimization pass exposes a new class of mistakes. Each pass raises the floor.
That is why we care so much about self-learning, non-regressive agents: improvement matters only if it compounds without reintroducing older failure modes.
What this means for you
These results point to a simple conclusion.
Production-ready AI is not primarily about buying a smarter model. It is about building a better harness.
When you replace static text with a structured Context Graph, you are not just giving the model more information. You are shaping how it moves through the task. You are constraining lazy shortcuts. You are forcing the system to follow your operational logic step by step.
A lighter, cheaper model inside a well-engineered harness can outperform a frontier model left alone with an unstructured wall of text.
Stop waiting for a flawless model to arrive.
Start building the structure that makes the models you already have reliable.
Structure the task. Make failures visible. Fix them precisely. Every model you run through the harness gets better.
Frequently Asked Questions
Why do models fail even when they have access to the right information?
Because access is not the same as structure. When you paste a large policy or knowledge base into a prompt, the model still has to decide what matters, in what order, and when to stop. That is where shortcuts and missed constraints appear.

What exactly did Rippletide test?
Rippletide ran 4,800 evaluations across 8 frontier models on 200 refund-eligibility questions, comparing three setups: raw policy text in the prompt, a navigable Context Graph, and an optimized Context Graph refined from observed failure paths.

How does a graph harness change the way teams fix failures?
A graph harness makes navigation explicit and traceable. Instead of rewriting prompts and hoping the improvement generalizes, teams can see where the model stopped, what it skipped, and patch that specific failure point without destabilizing the rest of the system.

Why does traceability matter?
Traceability turns a wrong answer into a diagnosable system failure. When every navigation step is visible, teams can localize fixes, improve reliability across models, and optimize for the business outcome instead of debating prompt wording.