Four Components. Each 90% Reliable. End-to-End: 65%.

May 25, 2026

There’s a calculation your AI team probably hasn’t run.

Take your production AI system. Pick four components: context retrieval, output verification, feedback routing, and measurement. Estimate a reliability floor for each one. Be generous — call it 90%. Each component does the right thing nine times out of ten.

Ninety percent sounds high. It feels like a system that mostly works.

But these probabilities don’t add. When a request has to pass through every component, and the failures are roughly independent, they multiply.

0.90 × 0.90 × 0.90 × 0.90 = 0.656

Four individually impressive components produce a system that fails on more than one in three requests. Not because anything is broken — every component works correctly 90% of the time. The failure is structural. It emerges from the compound effect of small gaps across a stack that no one is tracking end-to-end.
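The arithmetic above can be reproduced in a few lines. This is a minimal sketch, assuming every request passes through all four components in series and that their failures are independent. The function name `end_to_end_reliability` and the 90% figures are illustrative, not measurements from a real system.

```python
import math


def end_to_end_reliability(component_reliabilities):
    """Return the chance that every component succeeds on one request.

    Assumes components run in series and fail independently (an assumption,
    not something real systems guarantee).
    """
    values = list(component_reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must be between 0 and 1, got {r}")
    return math.prod(values)


# Illustrative estimates for the four components named in the post.
stack = {
    "context_retrieval": 0.90,
    "output_verification": 0.90,
    "feedback_routing": 0.90,
    "measurement": 0.90,
}

overall = end_to_end_reliability(stack.values())
print(f"end-to-end reliability: {overall:.4f}")  # 0.6561
print(f"failure rate: {1 - overall:.4f}")        # 0.3439
```

If failures are correlated (a bad retrieval that also trips verification, for example), the true figure will differ, so the product is best read as a rough estimate rather than a guarantee.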

This is the multiplication problem. It’s why so many AI prototypes that looked strong in development go sideways in production: not because any single piece degraded, but because nobody ran the math.

Why 80% Feels Like Almost Done (And Isn’t)

LinkedIn’s engineering team built an AI chatbot for skill-fit assessment. Early prototypes looked impressive. Within a month, they had roughly 80% of the experience they wanted. The product felt close.

It wasn’t.

Over the following four months, they kept building — not because the model changed, but because the system around the model needed to be painstakingly tuned, tested, debugged, and refined. The behavior that seemed correct at 80% turned out to be subtly wrong in ways that only became visible under sustained load from real users. Every subsequent 1% gain cost more than the last.

Chip Huyen, whose practitioner guide on shipping LLM applications is among the most-read in the field, documents this pattern consistently: “Gross underestimation of how challenging it is to improve the product, especially around hallucinations” is the most common source of post-launch disappointment. The specific failure modes she catalogs — accuracy-latency tradeoffs, tool differentiation failures, tonal inconsistency — are all execution layer problems. None of them would have been fixed by switching to a newer model.

This is the core insight: most AI production failures are not model failures. They’re failures in the components that sit around the model — context construction, evaluation, feedback loops, measurement. And those components multiply.

What the Math Looks Like When You Do the Work

Run the same calculation on a team that has built the execution layer infrastructure — context retrieval, verification, feedback loops, and measurement each operating at 97% instead of 90%:

0.97 × 0.97 × 0.97 × 0.97 = 0.885

The difference between 65% and 88.5% is not a performance improvement. It’s the difference between a system users learn to distrust and one they learn to depend on.

The 7 percentage points per component that separate these two teams don’t come from a better model. They come from the unglamorous work of building context that’s actually correct, verification that actually catches errors, feedback that closes the loop, and measurement that tracks whether any of it is working.
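To see how sensitive the end-to-end figure is to per-component reliability, the same calculation can be run in both directions. The sketch below assumes identical, independent components in series. `required_component_reliability` is a hypothetical helper name, and the 95% target is an illustrative figure that does not come from this post.

```python
def end_to_end(per_component, n_components=4):
    """End-to-end reliability for n identical, independent components in series."""
    return per_component ** n_components


def required_component_reliability(target, n_components=4):
    """Per-component reliability needed to hit an end-to-end target.

    Inverts the product: if r ** n = target, then r = target ** (1 / n).
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1.0 / n_components)


# The two scenarios discussed in the post.
for r in (0.90, 0.97):
    print(f"{r:.2f} per component -> {end_to_end(r):.3f} end-to-end")

# Hypothetical goal: 95% of requests succeed from start to finish.
needed = required_component_reliability(0.95)
print(f"0.95 end-to-end needs about {needed:.3f} per component")
```

Under these assumptions, a 95% end-to-end target with four components requires roughly 98.7% from each one, which shows how quickly the per-component bar rises.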

Why Failures Compound in the Wrong Direction

Most teams discover the multiplication problem only after production arrives — and by then, the failure modes are entangled.

A failure in context retrieval doesn’t stay contained. If the system is working with wrong information, the verification layer is now evaluating an output that was wrong before the model touched it. If the feedback loop correctly routes that failure for human review, the reviewer spends time diagnosing a problem that began upstream. Measurement records the failure, but attribution is unclear — was it retrieval? Reasoning? Verification?

This is what Huyen describes as “AI-specific challenges that can’t be addressed with more compute”: the failures aren’t in any single component; they’re in the interactions between components. A system without execution layer infrastructure doesn’t just fail more often. It also fails in ways that are harder to diagnose, because the failure is distributed across a stack that no one built to be observable.
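One way to make attribution less ambiguous is to record a pass/fail outcome per stage for every request and charge each failed request to the earliest stage that failed. The sketch below assumes such records exist. The stage names, the `first_failing_stage` helper, and the sample data are all hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical pipeline order; earlier stages are upstream of later ones.
STAGES = ("retrieval", "reasoning", "verification")


def first_failing_stage(stage_ok):
    """Return the earliest stage that failed, or None if all passed.

    A stage with no recorded outcome is treated as failed.
    """
    for stage in STAGES:
        if not stage_ok.get(stage, False):
            return stage
    return None


def attribute_failures(requests):
    """Count failed requests by the earliest stage that went wrong."""
    causes = (first_failing_stage(r) for r in requests)
    return Counter(c for c in causes if c is not None)


# Made-up sample records, one dict per request.
sample = [
    {"retrieval": True, "reasoning": True, "verification": True},
    {"retrieval": False, "reasoning": False, "verification": False},
    {"retrieval": True, "reasoning": False, "verification": True},
    {"retrieval": False, "reasoning": True, "verification": False},
]

print(attribute_failures(sample))
```

Charging a failure to the earliest failing stage is a simplification. It keeps downstream stages from being blamed for bad inputs, but it can also hide independent downstream problems, so it works best as a first pass before deeper diagnosis.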

The Question Teams Don’t Ask

After the demo works, most teams ask: “What features should we add?” The teams that ship ask a different question: “What does end-to-end failure look like, and are we measuring it?”

Those questions produce different products.

The multiplication problem is the reason the second question matters. If you’re not tracking end-to-end reliability explicitly — not component-level metrics, but what percentage of user interactions complete successfully from start to finish — you don’t know whether your system is running at 65% or 88.5%.

You have intuitions. You have component metrics. You have user complaints when things break badly enough to surface.

You don’t have the number. And you can’t fix a number you’re not tracking.
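Getting that number can start small. The sketch below assumes each interaction is logged with a pass/fail flag per component. The log format, the component names, and `reliability_report` are hypothetical. It reports per-component pass rates alongside the share of interactions in which every component passed.

```python
COMPONENTS = (
    "context_retrieval",
    "output_verification",
    "feedback_routing",
    "measurement",
)


def reliability_report(interactions, components=COMPONENTS):
    """Return (per-component pass rates, end-to-end success rate).

    Each interaction is a dict mapping component name to True/False.
    A component missing from an interaction counts as a failure.
    """
    total = len(interactions)
    if total == 0:
        raise ValueError("no interactions to report on")
    per_component = {
        name: sum(1 for i in interactions if i.get(name, False)) / total
        for name in components
    }
    succeeded = sum(
        1 for i in interactions if all(i.get(name, False) for name in components)
    )
    return per_component, succeeded / total


# Made-up log of four interactions.
logs = [
    {"context_retrieval": True, "output_verification": True,
     "feedback_routing": True, "measurement": True},
    {"context_retrieval": False, "output_verification": True,
     "feedback_routing": True, "measurement": True},
    {"context_retrieval": True, "output_verification": False,
     "feedback_routing": True, "measurement": True},
    {"context_retrieval": True, "output_verification": True,
     "feedback_routing": True, "measurement": True},
]

per_component, end_to_end_rate = reliability_report(logs)
print(per_component)
print(f"end-to-end success rate: {end_to_end_rate:.2f}")  # 0.50
```

In this made-up log, no component scores below 75%, yet only half of the interactions succeed from start to finish. Measuring the end-to-end rate directly, rather than inferring it from component metrics, also avoids relying on the independence assumption.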

This post expands on Chapter 10 of Wrong by Default: What AI Builders Know That Everyone Else Doesn’t by Alokit. Available on Kindle ($7.99): amazon.com/dp/B0GZCY9CGF
