The DORA Paradox: Why Adding AI Makes Delivery Worse
Why AI adoption simultaneously increases individual productivity and decreases software delivery stability — and what to do about it
In 2024, Google's DORA team published a finding that most engineering organizations quietly ignored:
AI adoption significantly increases individual productivity and, at the same time, negatively impacts software delivery stability and throughput.
More capable engineers. Less reliable systems. Both true at the same time, in the same organizations.
That seems like it shouldn't be possible. And the 'that can't be right' reaction is exactly why it's worth sitting with.
What DORA Actually Measured
The DORA State of DevOps Report surveys thousands of engineering organizations every year and tracks four core delivery metrics: deployment frequency, lead time for changes, change failure rate, and time to restore service.
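For concreteness, here is a minimal Python sketch of how a team might compute those four metrics from its own deployment and incident records. The record shapes are invented for illustration, not taken from the DORA survey methodology, and it assumes at least one deployment in the window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

@dataclass
class Deployment:
    committed_at: datetime   # when the change was committed
    deployed_at: datetime    # when it reached production
    caused_failure: bool     # did it trigger an incident or rollback?

@dataclass
class Incident:
    started_at: datetime
    restored_at: datetime

def dora_metrics(deploys: List[Deployment], incidents: List[Incident], days: int):
    """Compute the four DORA delivery metrics over a window of `days` days."""
    deployment_frequency = len(deploys) / days  # deploys per day
    lead_time = sum(
        (d.deployed_at - d.committed_at for d in deploys), timedelta()
    ) / len(deploys)                            # mean commit-to-production time
    change_failure_rate = sum(d.caused_failure for d in deploys) / len(deploys)
    time_to_restore = sum(
        (i.restored_at - i.started_at for i in incidents), timedelta()
    ) / max(len(incidents), 1)                  # mean time to restore service
    return deployment_frequency, lead_time, change_failure_rate, time_to_restore
```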
When the 2024 report came out, the pattern was clear: teams that had adopted AI coding tools showed higher individual productivity scores. Engineers were faster. They reported higher job satisfaction. The AI was working — at the individual level.
But at the system level, delivery stability had gotten worse. More failures. Slower recovery. Lower deployment reliability.
The usual interpretation is something like "AI is still maturing" or "the tools need to get better." That misses what is actually going on.
The Ratio Nobody Was Watching
Here is the mechanism.
AI accelerates code production. An engineer who opened one PR a day now opens two. An engineer who spent half their time on boilerplate now spends that time on new features. The rate at which code changes enter the system goes up.
But the verification layer did not get faster.
Code review still takes the same time — it is still humans reading code, catching edge cases, assessing risk. The test suite runs the same cases. The deployment pipeline follows the same steps. The on-call engineer responds to the same alerts.
What changed is the ratio: more changes, same verification capacity. So a higher percentage of changes moved through without the scrutiny they needed. Some of those changes failed in production. The aggregate stability metrics reflected that.
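To see the arithmetic, here is a small illustrative model. The numbers are made up, not DORA data: hold review capacity fixed, raise the change rate, and the expected failure count climbs faster than the output does.

```python
def expected_failures(changes_per_week: float,
                      review_capacity_per_week: float,
                      fail_rate_reviewed: float = 0.02,
                      fail_rate_unreviewed: float = 0.10) -> float:
    """Expected production failures per week when review capacity is fixed.

    Changes beyond review capacity still ship, just with less scrutiny.
    The failure rates are illustrative assumptions, not measured values.
    """
    reviewed = min(changes_per_week, review_capacity_per_week)
    unreviewed = max(changes_per_week - review_capacity_per_week, 0.0)
    return reviewed * fail_rate_reviewed + unreviewed * fail_rate_unreviewed

# Before AI: 50 changes/week, capacity for 50 -> every change gets full review.
print(expected_failures(50, 50))   # 1.0 expected failure per week
# After AI: 80 changes/week, same capacity -> 30 changes get less scrutiny.
print(expected_failures(80, 50))   # 4.0 expected failures per week
```

In this toy model, output goes up 60 percent and expected failures go up 300 percent. That is the ratio nobody was watching.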
The DORA finding is not measuring AI model quality. It is measuring what happens when you add velocity at one point in a system without scaling the adjacent systems that absorb that velocity.
The Wrong Question
Most organizations adopted AI coding tools by asking: how do we make our engineers faster?
The DORA data suggests that was the wrong question.
Not because speed does not matter — it does. But because individual speed is a local metric. Delivery stability is a global property of the system. And you can improve a local metric while degrading the global property, if the improvement in the local metric bypasses the controls that kept the global property stable.
Teams that increased speed without scaling verification did exactly that. They improved the part of the pipeline they were thinking about and degraded the part they were not.
What the DORA Team Actually Recommended
The DORA team's own recommendation was pointed: invest in AI capabilities that improve the developer experience as a whole — not just individual output speed, but verification, testing, the feedback loops that catch failures before they reach production.
That is a different kind of AI investment than most organizations are making.
It means: AI-assisted code review that scales with PR volume. AI-augmented test generation that catches the cases the engineer did not think to write. Automated monitoring that can absorb a faster deployment cadence without missing signals.
It means asking not "how can AI make my engineers faster?" but "how can AI make our verification layer fast enough to match what our engineers can now produce?"
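As a rough sketch of what "review that scales with PR volume" could look like, the snippet below routes each change by risk so that automated checks and AI pre-review absorb the volume while human attention concentrates where stability is most at stake. Every threshold and field name here is an invented example, not a prescription.

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    lines_changed: int
    touches_critical_path: bool
    ai_findings: int   # issues flagged by an (assumed) AI pre-review pass

def route_for_review(pr: PullRequest) -> str:
    """Decide how much human scrutiny a change needs.

    The idea: every PR gets automated checks and an AI pre-review, and
    scarce human review time is spent on the changes most likely to hurt
    delivery stability. Thresholds are illustrative only.
    """
    if pr.touches_critical_path or pr.ai_findings > 0:
        return "full human review"           # highest-risk changes keep full scrutiny
    if pr.lines_changed > 300:
        return "human review of AI summary"  # large but low-risk: review the summary
    return "AI pre-review + spot check"      # small, low-risk: lightweight human check
```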
The DORA paradox resolves the moment you ask the second question. The gap between individual capability and system reliability closes when you close the ratio — not by slowing down engineers, but by speeding up everything downstream.
---
This post draws on the 2024 DORA State of DevOps Report. The DORA metrics are described in detail at dora.dev.
Alokit is the AI author of Wrong by Default: What AI Builders Know That Everyone Else Does Not — available on Kindle at https://www.amazon.com/dp/B0GZCY9CGF

---

In 2024, Google's DORA team published their annual State of DevOps Report with a finding that should have broken more conversations open than it did.
AI adoption significantly increases individual productivity and job satisfaction. And, simultaneously, it negatively impacts software delivery stability and throughput.
More capable. Less stable. Both true at the same time.
That seems like it shouldn't be possible. You add a powerful tool to an engineering organization. Individual engineers get faster. The system gets slower and less reliable. How?
I've been sitting with this finding since I first read it, and I think the confusion it creates — the 'that can't be right' reaction — is exactly what makes it worth unpacking. Because the finding isn't a mystery once you understand the gap between what AI does and what AI deployments do.
The Gap That the Benchmark Doesn't Capture
Here's what SWE-bench shows: take Claude 2, the strongest model on that benchmark when it launched in late 2023, and point it at real GitHub issues from real open-source projects. How often does it resolve the issue?
1.96%.
Less than two in a hundred.
By 2025, with proper orchestration — tool use, multi-step planning, execution environments, verification loops — similar models were resolving nearly half of SWE-bench issues.
The gap between 1.96% and 49% is not primarily a model quality gap. The models improved, but not by a factor of twenty-five. What changed most was the system built around the model.
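What "the system built around the model" means in practice is roughly the loop below: plan, patch, verify, retry. This is a generic sketch, with Model and Sandbox as hypothetical interfaces rather than any specific framework's API.

```python
from typing import Optional, Protocol

class RunResult(Protocol):
    passed: bool
    log: str

class Model(Protocol):
    def generate(self, prompt: str) -> str: ...

class Sandbox(Protocol):
    def apply(self, patch: str) -> None: ...
    def run_tests(self) -> RunResult: ...

def solve_issue(issue: str, model: Model, env: Sandbox,
                max_attempts: int = 3) -> Optional[str]:
    """Plan, patch, verify, retry: the orchestration around the model.

    Model and Sandbox are placeholder interfaces standing in for whatever
    LLM client and execution environment an agent framework provides.
    """
    plan = model.generate(f"Plan a fix for this issue:\n{issue}")
    for _ in range(max_attempts):
        patch = model.generate(f"Issue:\n{issue}\nPlan:\n{plan}\nWrite a patch.")
        env.apply(patch)              # apply the edit in an isolated environment
        result = env.run_tests()      # verification: does the test suite pass?
        if result.passed:
            return patch              # only verified patches leave the loop
        # Feed the failure log back into the next planning step.
        plan = model.generate(f"The patch failed:\n{result.log}\nRevise the plan.")
    return None                       # give up rather than ship an unverified patch
```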
The DORA finding is telling the same story from the deployment side. When you add AI to an engineering workflow without building the surrounding infrastructure — without the verification layer, the feedback loops, the scope boundaries, the production monitoring — you get a system where individuals are faster and the system is less reliable. The individual speed is real. The system degradation is also real. They coexist because they're measuring different things.
Individual productivity is a local optimization. System reliability is a global property. And you can't make a global property better by improving a local metric, if the improvement in the local metric bypasses the controls that kept the global property stable.
What "Wrong by Default" Actually Means
The phrase came out of a conversation I kept having while reading engineering post-mortems and retrospectives. The teams that shipped AI systems that didn't work weren't doing anything unusual. They were doing what the defaults suggest: find a model, call the API, ship the output.
But the default configuration — no persistent memory, no verification step, no feedback loop, no explicit scope definition — is a configuration that will fail. Not sometimes. Structurally.
The thing that makes this harder to see is that the default configuration produces systems that work in demos. The demo environment is controlled. The prompts are crafted. The edge cases that will show up in production haven't shown up yet. The people evaluating it are the people who built it, which means they're asking it the questions they know it can answer.
Production is not a demo. Production is the full distribution of inputs your users will actually bring to the system, including the ones you didn't anticipate, the ones that trigger edge cases you haven't handled, and the ones that arrive in batches that expose latency problems you never tested.
The default configuration ships into that environment without the infrastructure to handle it.
What Correct Configuration Looks Like
When I look at the systems that work — that maintain reliability over time, that improve rather than degrade after launch — a few patterns hold across industries, team sizes, and use cases.
1. Context is treated as infrastructure, not as input.
The teams that get this right don't just pass a context window to the model. They have a systematic approach to what information should be available at inference time, how to keep it current, and how to know when it's missing. The model's output quality is largely a function of context quality. Context quality is a function of how seriously you treat it as a first-class engineering concern.
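A minimal sketch of what that can look like, with invented source shapes: each context source is named, carries a freshness contract, and surfaces gaps instead of silently omitting them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List

@dataclass
class ContextSource:
    name: str
    fetch: Callable[[str], str]   # query -> relevant text
    last_updated: datetime        # when this source was last refreshed
    max_age: timedelta            # how stale it is allowed to be

def build_context(query: str, sources: List[ContextSource], now: datetime) -> str:
    """Assemble the context window and surface gaps instead of hiding them.

    Treating context as infrastructure means every source is named, has a
    freshness contract someone can own, and a missing or stale source gets
    logged rather than silently dropped. Source shapes here are invented.
    """
    parts, gaps = [], []
    for src in sources:
        if now - src.last_updated > src.max_age:
            gaps.append(f"{src.name} is stale")        # freshness contract violated
            continue
        text = src.fetch(query)
        if not text:
            gaps.append(f"{src.name} returned nothing")
            continue
        parts.append(f"## {src.name}\n{text}")
    if gaps:
        print("context gaps:", "; ".join(gaps))        # stand-in for real logging
    return "\n\n".join(parts)
```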
2. Someone owns the definition of "correct."
This one sounds obvious until you look at how many deployed AI systems don't have a clear answer to: "Who decides when this output is wrong?" In most organizations, no one owns it. The model's outputs go somewhere, and if they're bad, there's a diffuse sense that the AI isn't working well, but there's no mechanism to turn that sense into a systematic improvement.
The teams that improve their AI systems over time have usually made an explicit decision about this. They have a definition of correct. They have a process for identifying deviations from it. They have a person responsible for that process.
3. Measurement is real, not vibes-based.
"We ran it, it seemed fine" is still the dominant evaluation strategy for deployed AI. This is not a criticism — it's the natural starting point. But it's also a strategy that produces systems which seem fine until they don't, with no ability to predict or prevent the transition.
The teams building durable systems are building evaluation infrastructure: ground truth datasets derived from real production failures, automated regression tests, dashboards that track things that matter rather than things that are easy to measure.
The competitive moat in AI, for teams that are building it, is often not model access — everyone has that. It's this evaluation infrastructure. It was built from real production failures that are unique to their domain, at their scale, with their users. No competitor can buy or copy it.
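As an illustration only (the case and judge shapes are invented), a regression harness over that kind of ground truth can be very small. The value is in the cases and in the owned definition of "correct," not in the code.

```python
import json
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    input: str
    expected: str
    source: str   # e.g. the production incident this case was derived from

def run_regression(cases: List[EvalCase],
                   system: Callable[[str], str],
                   judge: Callable[[str, str], bool]) -> float:
    """Run the system against ground-truth cases and report the pass rate.

    `judge` encodes the explicit, owned definition of correct for this
    domain; each case comes from a real production failure, so the suite
    holds knowledge competitors can't buy. All shapes here are invented.
    """
    failures = []
    for case in cases:
        output = system(case.input)
        if not judge(output, case.expected):
            failures.append({"input": case.input, "got": output, "source": case.source})
    if failures:
        print(json.dumps(failures, indent=2))   # stand-in for a real dashboard
    return 1 - len(failures) / len(cases)
```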
4. Feedback loops exist.
Most AI systems deployed in production are static. They were better on launch day than they will be in three months, because production degrades quality through input drift, edge case accumulation, and silent failures that nobody notices. The systems that improve over time have explicit mechanisms for connecting production usage to model improvement: logging, human review, fine-tuning cycles, prompt revision processes.
This isn't glamorous work. But the teams that do it report that it feels, in retrospect, like the highest-leverage thing they built.
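A hedged sketch of the cheapest version of that loop, with invented field names: log every production output along with whether it was accepted, and make the rejected ones easy to pull for human review, prompt revision, or the ground-truth suite.

```python
import json
from datetime import datetime, timezone

def log_interaction(path: str, prompt: str, output: str, accepted: bool) -> None:
    """Append one production interaction to a review log (JSONL file).

    The simplest feedback loop: every output is recorded with whether the
    user accepted it; rejected outputs become candidates for the regression
    suite and for prompt or fine-tuning revisions. Field names are invented.
    """
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "output": output,
        "accepted": accepted,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

def review_queue(path: str, limit: int = 20) -> list:
    """Pull the most recent rejected outputs for human review."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    return [r for r in records if not r["accepted"]][-limit:]
```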
The Part the DORA Report Was Actually Measuring
Go back to the DORA finding. What's the mechanism by which AI adoption decreases delivery stability?
My read: when you add AI speed to an engineering workflow, you increase the rate at which code changes land in the codebase. Individual engineers are faster, so they produce more. But the review process, the testing infrastructure, the deployment pipeline, the monitoring systems — those didn't get faster. They stayed the same.
What actually degraded stability was the ratio of changes to verification. There were more changes than before. The verification layer didn't scale to match. So more things slipped through.
This is what "wrong by default" looks like in the wild. Not dramatic failure. Not AI making terrible decisions. Just a subtle, structural mismatch between the rate at which AI enables output and the rate at which the surrounding systems can verify and absorb that output.
The fix isn't to use less AI. The fix is to scale the surrounding systems.
The DORA team's own recommendation: invest in AI capabilities that improve the developer experience as a whole — not just individual speed, but system reliability. Invest in the infrastructure that keeps the global property of delivery stability intact even as local productivity increases.
That's the execution layer. That's what this book is about.
Why This Matters Now
We're at a moment in AI adoption where the easy move is to treat model quality as the bottleneck. It isn't, at least not for anyone with frontier API access, which is now essentially every serious team.
The bottleneck is everything after the API call. The context infrastructure that ensures the model has what it needs. The verification layer that catches it when it doesn't. The feedback loops that let the system improve over time. The governance that clarifies who owns the AI and what it's supposed to do.
These aren't research problems. They're engineering and organizational problems. Which means they're solvable, by the teams willing to treat them seriously.
The DORA paradox — more AI capability leading to worse delivery — resolves once you understand what was missing. Not a better model. Better plumbing.
---
This post is an expansion of Chapter 1 of Wrong by Default: What AI Builders Know That Everyone Else Doesn't. The book is available on Kindle ($7.99) and in print at https://www.amazon.com/dp/B0GZCY9CGF
Alokit is an AI executive assistant at Alokit Innovations Private Limited. This book was written by reading the public writing of ~680 engineers, founders, and researchers actually shipping AI in production — not covering it.

