If your AI agent has a 95% accuracy, it will fail in production
Jan 13, 2026
A quiet failure at 9:12 a.m.
At 9:12 a.m., an “AI sales ops assistant” updates a CRM record.
Nothing breaks. No red error banner. No grotesque hallucination. The field names are correct. The formatting is pristine. The agent even leaves a tidy note explaining what it changed and why.
By 9:18, a director asks why a sensitive account was moved into a stage it should never enter without legal sign-off. The update was plausible. It was also forbidden.
This is the quiet failure mode of the agent era: systems that look competent step-by-step, then drift politely, coherently, over a boundary that was never meant to be crossed.
That’s why 95% accuracy, a number that once sounded like maturity, is now a trap.
Not because large language models are uniquely unreliable. But because accuracy, as we’ve inherited it from classic machine learning, is a measurement from an older world, one where models mostly produced answers and then stopped. Agents don’t stop. They decide, act, and keep going.
In the past, machine learning lived comfortably inside static problems. An image is a cat or it isn’t. A transaction is fraudulent or it isn’t. You evaluate a prediction, score it, and move on.
An agent operates in a different regime: sequential, stateful, consequential. It reads context. It forms an intent. It chooses a tool. It calls an API. It writes memory. It triggers an action. Then it uses the new state it created as the starting point for the next decision.
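To make that regime concrete, here is a deliberately toy sketch of such a loop; every name and rule in it is illustrative rather than a real framework API.

```python
# Toy, self-contained sketch of a sequential, stateful agent loop.
# Every name and rule here is illustrative, not a real framework API.
state = {"step": 0, "memory": []}

def choose_tool(state):
    # Trivial stand-in for "interpret intent, choose tool".
    return "crm.read" if state["step"] % 2 == 0 else "crm.update_notes"

def call_tool(tool, state):
    # Pretend tool call; in production this hits a real API and has side effects.
    return f"{tool} -> ok at step {state['step']}"

while state["step"] < 3:                     # fixed horizon, for the demo only
    tool = choose_tool(state)                # decide
    observation = call_tool(tool, state)     # act
    state["memory"].append(observation)      # write memory
    state["step"] += 1                       # the new state seeds the next decision

print(state["memory"])
```

Note that nothing in the skeleton checks whether an action is allowed before it runs; it only loops.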
Accuracy versus reliability in a world of decisions
What matters is no longer whether each step is locally “correct.” What matters is whether the outcome is acceptable, safe, reversible and explainable after the fact.
That distinction sounds academic until you put it under production pressure.
Internal accuracy tells you how often a model performs a narrow task well: classify an intent, retrieve the right doc, produce a faithful summary.
Outcome reliability asks harsher questions: was the decision allowed, consistent with policy, consistent with business rules, consistent with the system’s obligations to users and regulators? Can you prove why it acted? Can you prove why it didn’t? That framing shows up explicitly in modern risk governance guidance, because the real-world failure is rarely “the model got a question wrong.” It’s “the system did something it shouldn’t have done.” (NIST)
You can see how the trap forms.
Take a benign example: an agent summarises a contract clause perfectly. No fabrication. No missing detail. Then it takes an action (sending a renewal notice, changing a billing schedule, marking an account “approved”) that violates a policy, a compliance constraint, or a separation-of-duties rule. The reasoning step was accurate. The decision was invalid.
Nothing “hallucinated.” Nothing crashed. The model behaved exactly like a model: producing a coherent continuation. The failure happens at the boundary between reasoning and authority, where language becomes action.
And that boundary gets crossed more often than people expect, because decisions don’t exist in isolation. They compose.
Why reliability collapses in production
Every step in an agent workflow carries uncertainty: interpret intent, choose tool, fetch context, resolve ambiguity, apply policy, decide whether to act. Even when the uncertainty at each step is small, the system as a whole becomes fragile, because reliability doesn’t add. It multiplies.
Here’s the part nobody wants to admit out loud: in a ten-step workflow, “95% reliable at each step” quietly becomes 0.95¹⁰, about 60%. That’s not a KPI. That’s a coin toss.
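A back-of-the-envelope sketch makes the arithmetic explicit; the step count and per-step reliability below are illustrative assumptions, not measurements of any particular system.

```python
# Compounded reliability of a sequential workflow, assuming independent steps.
per_step = 0.95
steps = 10

end_to_end = per_step ** steps
print(f"End-to-end reliability over {steps} steps: {end_to_end:.1%}")  # ~59.9%

# The inverse question: what does a 99% end-to-end target demand of each step?
target = 0.99
print(f"Required per-step reliability: {target ** (1 / steps):.2%}")   # ~99.90%
```

Hitting a 99% end-to-end target already demands per-step reliability above 99.9%, which is why squeezing more accuracy out of each individual step cannot be the whole answer.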
And production workflows aren’t ten steps because engineers like suffering. They’re ten steps because the world is messy: systems are fragmented, context is partial, permissions are real, and edge cases aren’t edge cases once you have thousands of users.
This compounding effect is not new science. It’s a classic pathology of sequential decision-making: errors early in a trajectory distort the state distribution you face later, making the system less and less like the neat i.i.d. world that offline accuracy assumes. The imitation learning literature formalised this years ago, precisely because “one-step accuracy” does not predict behaviour over long horizons. (arXiv)
Agents are simply bringing that lesson into enterprise workflows.
Why demos succeed and production fails
Which is why demos are so seductive and so misleading.
Demos happen in forgiving environments. Inputs are clean. Context is curated. Memory is fresh. Tool calls are short. Consequences are hypothetical. The agent is never forced to choose between two bad options, or to proceed when it should pause, or to explain itself when the human asks, “Why did you do that?”
Production is none of those things.
In production, inputs are ambiguous. Context is incomplete. Memory is lossy. Logs are scrutinised. Tool outputs change. Users are impatient. And the system has to make decisions under uncertainty because the business process does not accept a shrug.
This is also why “just evaluate it more” isn’t a magic spell. Modern evaluation work has made the same point from another angle: a single metric is rarely meaningful across real scenarios. What matters is coverage, multi-metric measurement, and transparency about failure modes because model behaviour shifts radically with the situation. (arXiv)
Intelligence is not governance
So why do so many teams respond to agent failures by doubling down on intelligence?
They add more context. They add more prompts. They add more “reflection.” They stack agents on top of agents. They hope one model will supervise another model into correctness.
Sometimes that helps. Often it just creates a more elaborate version of the same mistake: treating intelligence as governance.
Large language models are exceptional at reasoning in the sense that they produce convincing, context-aware continuations. They are not designed to enforce invariants. They optimise for plausibility, not for guarantees. They do not naturally produce “deny by default.” They do not reliably abstain. They do not reliably produce audit-ready explanations of why an action was authorised and another was rejected.
In enterprise systems, that isn’t a nice-to-have. It’s the difference between a tool and a liability. NIST’s AI risk guidance makes this explicit: governance, measurement, and management are system properties, not vibes you prompt into existence. (NIST)
What trust actually requires
If you want a model to behave like a production system, you need boundaries that behave like engineering.
That means decisions constrained by explicit rules, structured state, verifiable permissions, and traceable causality. It means the system knows when to stop, when to escalate, and when not to act, even if it could.
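As a hedged illustration (not a prescription), here is a minimal sketch of a deny-by-default permission gate sitting between an agent’s proposed action and its execution; the actor names, action names, and permission strings are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action proposed by an agent; field names are illustrative.
@dataclass
class ProposedAction:
    actor: str          # which agent is acting
    action: str         # e.g. "crm.update_stage"
    target: str         # e.g. an account ID
    requires: set[str]  # permissions this action needs

# Explicit, auditable allow-list: deny by default, allow by exception.
PERMISSIONS = {
    "sales-ops-agent": {"crm.read", "crm.update_notes"},
}

def gate(proposed: ProposedAction) -> tuple[bool, str]:
    """Return (allowed, reason); every decision is explainable after the fact."""
    granted = PERMISSIONS.get(proposed.actor, set())
    missing = proposed.requires - granted
    if missing:
        return False, f"denied: {proposed.actor} lacks {sorted(missing)} for {proposed.action} on {proposed.target}"
    return True, f"allowed: {proposed.actor} holds every permission required for {proposed.action} on {proposed.target}"

allowed, reason = gate(ProposedAction(
    actor="sales-ops-agent",
    action="crm.update_stage",
    target="acct-123",
    requires={"crm.update_stage", "legal.sign_off"},
))
print(allowed, "|", reason)
```

The model remains free to propose anything; the gate decides, and the reason string is what ends up in the audit trail.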
This is also where the most practical idea in reliability is the least glamorous: abstention. Mature systems don’t just try to be right; they try to be safe when they might be wrong. There is deep work on selective prediction, models that refuse to answer when uncertainty is high, because a controlled “I don’t know” is often the only responsible output in a high-consequence pipeline. (arXiv)
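A minimal sketch of abstention, assuming something in the pipeline can attach a confidence score to a proposed step; the threshold and the escalation path are assumptions to be tuned per workflow, not fixed values.

```python
from typing import Callable, Optional

def selective_execute(
    proposal: str,
    confidence: float,
    execute: Callable[[str], str],
    escalate: Callable[[str], None],
    threshold: float = 0.9,  # assumed cutoff; tune from observed error/abstention trade-offs
) -> Optional[str]:
    """Act only above the confidence threshold; otherwise abstain and escalate.

    A controlled abstention is treated as a valid outcome of the pipeline,
    not as a failure.
    """
    if confidence >= threshold:
        return execute(proposal)
    escalate(f"abstained (confidence={confidence:.2f} < {threshold}): {proposal}")
    return None

# Toy usage: a low-confidence proposal is routed to review instead of executed.
result = selective_execute(
    proposal="move account acct-123 to stage 'Closed Won'",
    confidence=0.62,
    execute=lambda p: f"executed: {p}",
    escalate=print,
)
print(result)  # None
```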
Now the obvious counter-argument arrives, usually with a smirk:
Humans also make chains of decisions, and humans also fail. So why hold agents to a higher standard?
We don’t. We hold agents to the same standard we hold systems.
A bank doesn’t rely on human intelligence to prevent fraud. It relies on permissions, process, separation of duties, audits, and the ability to reconstruct what happened months later. Aviation doesn’t rely on a pilot’s brilliance as the sole safety mechanism. It relies on checklists, controls, and black boxes.
Humans are not safe because they’re smart. They’re safe, when they are, because they are governed by infrastructure.
Final takeaway
If your agent cannot prove why it took an action, cannot prove why it rejected alternatives, cannot be stopped mid-chain, cannot be audited later, and cannot guarantee non-action when a policy boundary is reached, you don’t have an agent you can deploy.
You have a demo.
And this matters more than most people want to admit, because the biggest failures are rarely theatrical. They don’t announce themselves as “hallucinations.” They look like reasonable micro-decisions that accumulate into a non-compliant outcome. They scale quietly. They surface when the cost is already real.
So yes: 95% accuracy is impressive.
It just doesn’t answer the only question production ever asks: can this system be trusted when it matters?
The future of AI agents won’t be decided by who builds the smartest model. It will be decided by who builds systems that treat decision-making as a first-class engineering problem: governed, constrained, auditable.
Intelligence creates capability.
Trust creates permission.
And production only grants permission once.