Review Is Now the Highest-Leverage Coding Skill
Why code review, trust, and judgment become more valuable as AI coding agents make code generation cheaper.
Writing code is getting cheaper. Reviewing it is getting more expensive.
That is the part of the AI coding conversation I think a lot of teams are still underestimating.
For the last year or two, most of the excitement has been around generation. How fast can a developer scaffold a feature? How much boilerplate can an assistant write? How many tickets can a team move through if coding agents are helping in the background?
Those are fair questions, and the productivity gains are real in the right context. GitHub's Copilot research found developers completed a bounded coding task 55.8% faster with the tool. Stack Overflow's 2025 developer survey shows AI tools are now mainstream, with a majority of professional developers using them daily or regularly.
But generation is no longer the interesting constraint.
The bottleneck moved from writing to trusting
The more important question is this: when the code shows up, who understands it well enough to trust it?
Addy Osmani framed this well in a recent post about coding agents. As agents get better, the hard part moves from writing code to deciding whether the code should be trusted. I think that is exactly right.
In the old model, code review had a built-in throttle. Humans wrote code at human speed. A pull request usually reflected hours or days of thought by a developer who understood the problem, wrestled with the tradeoffs, and could explain why the implementation looked the way it did. Review was still hard, but the reviewer was usually checking another human's reasoning.
With AI-generated code, that changes. A coding agent can produce a large, plausible, mostly-working diff in minutes. It can touch files across the stack, update tests, introduce a new dependency, and copy a nearby pattern that looks right but does not fit the deeper architecture. It can make the happy path work while quietly weakening an edge case.
And often, the reviewer is the first human being to really inspect the work.
Code is abundant. Trust is scarce.
More code is not more progress
We are already seeing signals of this shift. Faros AI has reported that high-AI-adoption teams are completing more work and merging more pull requests, but also seeing larger PRs, longer review times, more churn, and more bugs per developer. DORA's 2024 research found AI adoption correlated with gains in documentation, code quality, and review speed, while also correlating with lower delivery throughput and stability.
Both of those things can be true at once. AI can help individuals move faster, while the system around them gets slower, noisier, and riskier if review, testing, and release discipline do not improve with it.
That is the practical lesson for leaders: do not measure AI coding success only by output.
More merged pull requests are not the same thing as better software.
More velocity in the authoring step can create drag everywhere else if the team has not adapted.
The real danger is code that is almost right
The risk is not that AI writes terrible code all the time. That would be easier to manage. The harder problem is that AI writes code that is almost right.
Almost-right code is expensive. It passes a quick glance. It often passes basic tests. It looks familiar enough to merge. But it may miss the business rule, skip the failure path, introduce a security issue, or create maintenance debt that shows up later.
Developers already feel this. Stack Overflow's 2025 survey showed broad AI usage alongside significant distrust in AI accuracy. The most frustrating pattern developers reported was AI solutions that are "almost right, but not quite."
That phrase should be printed on the wall of every engineering team using coding agents.
Because "almost right" is exactly where review becomes the highest-leverage skill.
A practical review model
The answer is not to stop using AI coding tools. That would be the wrong lesson. The answer is to redesign the review system around the new volume and speed of software creation. Here is the model I would use.
1. Tier review by blast radius
Not every change deserves the same process. A copy update, a small internal script, or a test-only cleanup should not be treated like a payment flow, an authentication change, a database migration, a permission rule, or a customer-data workflow.
The higher the blast radius, the more human judgment you need.
For low-risk work, automated checks, AI review, and a light human scan may be enough. For medium-risk product work, require a human owner, clear tests, and a reviewer who understands the feature. For high-risk work, slow down: senior review, security review where appropriate, rollback thinking, and explicit test evidence. AI makes this more important, not less.
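As a rough illustration, tiering can start as a simple mapping from changed file paths to a risk level. This is only a sketch: the path patterns and the helper names `tier_for_path` and `tier_for_change` are hypothetical, and a real team would derive its own patterns from where payments, authentication, migrations, and customer data actually live.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; every team would define its own.
HIGH_RISK = ["*payments/*", "*auth/*", "*migrations/*", "*permissions/*"]
LOW_RISK = ["docs/*", "*.md", "tests/*", "scripts/internal/*"]

ORDER = {"low": 0, "medium": 1, "high": 2}


def tier_for_path(path: str) -> str:
    """Classify one changed file. High-risk patterns win over low-risk ones."""
    if any(fnmatchcase(path, pattern) for pattern in HIGH_RISK):
        return "high"
    if any(fnmatchcase(path, pattern) for pattern in LOW_RISK):
        return "low"
    return "medium"  # anything unclassified is treated as ordinary product work


def tier_for_change(paths: list[str]) -> str:
    """A change is as risky as its riskiest file."""
    tiers = (tier_for_path(p) for p in paths)
    return max(tiers, key=ORDER.__getitem__, default="low")


print(tier_for_change(["README.md", "src/auth/login.py"]))
```

The one design choice worth copying is that the riskiest file sets the tier for the whole change, so a documentation tweak bundled with an authentication change still gets the slow path.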
2. Require intent with every agent-generated PR
A reviewer should not have to reverse-engineer the "why" from the diff. Every AI-assisted PR should answer a few basic questions:
- What problem is this solving?
- What files and systems changed?
- What assumptions did the agent make?
- What tests were run?
- What risks remain, and what would make us roll this back?
This does not need to be a twenty-page design document. A short PR template is enough. But the discipline matters: if the agent generated the code, the human who asked for it still owns the intent.
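The questions above can be enforced with a very small check on the PR description. The sketch below assumes a Markdown-style template with `## ` headings; the section titles and the `missing_sections` helper are my own illustrative names, not part of any particular platform.

```python
# Hypothetical section titles mirroring the questions in the list above.
REQUIRED_SECTIONS = [
    "Problem",
    "Changes",
    "Assumptions",
    "Tests run",
    "Risks and rollback",
]


def parse_sections(description: str) -> dict[str, str]:
    """Split a PR description into {lowercased heading: body text}."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in description.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}


def missing_sections(description: str) -> list[str]:
    """Return required sections that are absent or left empty."""
    sections = parse_sections(description)
    return [t for t in REQUIRED_SECTIONS if not sections.get(t.lower())]


example = """## Problem
Checkout fails for empty carts.
## Changes

## Tests run
pytest tests/checkout
"""
print(missing_sections(example))
```

A check like this cannot judge whether the answers are good. It only guarantees the reviewer is not starting from a blank description.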
3. Keep pull requests small
This is old advice that matters even more now. Google's engineering practices have long emphasized small, self-contained changes because they are easier to review well. AI does not change that. If anything, it raises the standard. If an agent can generate 2,000 lines of code, it can also split the work into reviewable slices.
Large AI-generated PRs are dangerous because they create review fatigue. People skim. They trust the summary. They assume the tests cover more than they do. That is how subtle problems get merged.
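One way to make "small" concrete is a size budget checked in CI. The sketch below parses the tab-separated output of `git diff --numstat` (added lines, deleted lines, path, with `-` for binary files). The 400-line budget and the function names are assumptions chosen for illustration, not a recommended number.

```python
MAX_CHANGED_LINES = 400  # hypothetical budget; tune per team and per risk tier


def changed_lines(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # binary files report "-" instead of line counts
        total += int(added) + int(deleted)
    return total


def needs_split(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the change exceeds the review budget."""
    return changed_lines(numstat) > limit


sample = "120\t30\tsrc/app.py\n-\t-\tassets/logo.png\n300\t10\tsrc/models.py\n"
print(changed_lines(sample), needs_split(sample))
```

A failing check here is a prompt to ask the agent for smaller slices, not a reason to raise the limit.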
4. Review the tests as hard as the production code
This may be the most overlooked point. When humans review AI-generated work, they often take comfort in the fact that tests were added or updated. But tests can be wrong too. An agent can write tests that assert the implementation instead of the requirement. It can weaken assertions, remove edge cases, or mock away the failure that matters.
A green test suite is useful evidence. It is not proof.
Reviewers should ask: would these tests fail if the intended behavior broke? Do they cover the risky paths? Did the PR change the tests to match new behavior without proving that the new behavior is right?
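The first of those questions can be tested directly: break the behavior on purpose and see whether the test notices. The example below is a toy, built on a hypothetical business rule (orders of 100 or more get 10% off) and hypothetical function names. It contrasts a test that only checks comfortable values with one that pins the boundary the requirement actually states.

```python
def price_with_discount(total: int) -> int:
    """Hypothetical rule: orders of 100 or more get 10% off."""
    return total - total // 10 if total >= 100 else total


def broken_price_with_discount(total: int) -> int:
    """Deliberately broken variant: the boundary order gets no discount."""
    return total - total // 10 if total > 100 else total


def weak_test(fn) -> bool:
    # Only exercises values far from the boundary.
    return fn(200) == 180 and fn(50) == 50


def strong_test(fn) -> bool:
    # Pins the boundary stated in the requirement.
    return fn(100) == 90 and fn(99) == 99 and fn(200) == 180


def catches_regression(test, good, broken) -> bool:
    """A useful test passes on the good code and fails on the broken code."""
    return test(good) and not test(broken)


print(catches_regression(weak_test, price_with_discount, broken_price_with_discount))
print(catches_regression(strong_test, price_with_discount, broken_price_with_discount))
```

Both tests are green against the correct implementation. Only the second one would have stopped the broken version from merging, which is the difference a reviewer is looking for.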
5. Scrutinize new dependencies
New dependencies are a real security and supply-chain issue. Research on package hallucinations has shown that code-generating models can suggest packages that do not exist, which creates an opening for attackers to publish malicious packages with plausible names.
So any new dependency in an AI-assisted PR deserves attention. Does it exist? Is the name correct? Is it maintained? Is the license acceptable? Do we already have something in the codebase that solves this? Is the dependency actually necessary? That review step is not bureaucracy. It is risk control.
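Part of that checklist can be surfaced automatically before a human looks. The sketch below compares two versions of a `requirements.txt`-style file, lists the packages the PR added, and flags names that sit suspiciously close to packages the team already trusts. The `KNOWN_PACKAGES` allowlist and the 0.8 similarity cutoff are assumptions for illustration. Whether a package really exists and is maintained still has to be verified against the registry by a person or a separate tool.

```python
import difflib
import re

# Hypothetical allowlist of packages the team already trusts.
KNOWN_PACKAGES = {"requests", "numpy", "pandas", "django", "flask"}


def package_names(requirements: str) -> set[str]:
    """Extract lowercased package names from requirements-style text."""
    names = set()
    for line in requirements.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.add(match.group(0).lower())
    return names


def review_new_dependencies(before: str, after: str) -> dict[str, str]:
    """Describe each dependency that appears in `after` but not in `before`."""
    findings = {}
    for name in sorted(package_names(after) - package_names(before)):
        if name in KNOWN_PACKAGES:
            findings[name] = "known package, newly added here"
            continue
        close = difflib.get_close_matches(name, sorted(KNOWN_PACKAGES), n=1, cutoff=0.8)
        if close:
            findings[name] = f"unknown; resembles '{close[0]}', check for a typo or squatted name"
        else:
            findings[name] = "unknown; verify it exists, is maintained, and is licensed acceptably"
    return findings


before = "requests==2.31.0\n"
after = "requests==2.31.0\nreqeusts>=1.0\nleftpadder\nnumpy\n"
for name, note in review_new_dependencies(before, after).items():
    print(f"{name}: {note}")
```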
6. Use AI reviewers, but don't outsource accountability
AI can be genuinely useful in review. It can summarize large diffs, identify risky files, flag missing tests, compare code to requirements, and catch patterns a human might miss. I like using more than one model or tool for adversarial review, especially on meaningful changes.
But AI review should be triage, not authority.
A bot saying "DONE" does not carry accountability. A human does. The human reviewer's job is not to manually inspect every character. It is to decide whether this is the right change, whether the important risks have been addressed, and whether the team is comfortable owning it in production.
That is a different skill from typing code quickly. And it is becoming more valuable.
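In tooling terms, "triage, not authority" means bot verdicts inform the merge decision but never satisfy it. Here is a minimal sketch of such a gate. The `Review` type, the per-tier approval counts, and the function names are all hypothetical, and most code hosts would express this through branch protection settings instead of custom code.

```python
from dataclasses import dataclass


@dataclass
class Review:
    reviewer: str
    is_human: bool
    approved: bool


# Hypothetical policy: human approvals required per risk tier.
REQUIRED_HUMAN_APPROVALS = {"low": 1, "medium": 1, "high": 2}


def can_merge(reviews: list[Review], tier: str) -> bool:
    """Only human approvals count toward the gate; any human rejection blocks."""
    human_approvals = sum(1 for r in reviews if r.is_human and r.approved)
    human_blocks = any(r.is_human and not r.approved for r in reviews)
    return not human_blocks and human_approvals >= REQUIRED_HUMAN_APPROVALS[tier]


def triage_flags(reviews: list[Review]) -> list[str]:
    """Bot objections are surfaced for a human to read, not enforced."""
    return [r.reviewer for r in reviews if not r.is_human and not r.approved]


bots_only = [Review("bot-a", False, True), Review("bot-b", False, False)]
with_human = bots_only + [Review("dana", True, True)]
print(can_merge(bots_only, "low"), can_merge(with_human, "low"), triage_flags(with_human))
```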
The mindset shift
The teams that win with AI coding agents will not be the teams that generate the most code. They will be the teams that build the best operating system around the code: small changes, clear intent, strong automated gates, disciplined review, human accountability, and a healthy respect for blast radius.
So don't ask, "How much more code can we produce?" Ask, "How much more change can we safely understand, trust, and operate?"
Because in an AI-assisted development world, the scarce skill is no longer just writing software.
The scarce skill is judgment under volume.