Is AI-generated code more likely to have bugs than hand-written code?

Not necessarily on initial deployment. But incidents involving AI-generated code take 40 to 60% longer to triage because the reasoning behind implementation decisions isn't recoverable from the code alone. The difference in bug rate matters less than the difference in mean time to repair (MTTR).

What is the understanding gate and how do teams implement it?

A PR review step where the author explains the contract of every non-trivial function: what invariants it assumes, what it guarantees, what it explicitly doesn't handle. This can be a verbal walkthrough in review or a structured comment block. It's separate from lint checks and test coverage reviews, and specifically catches the cases where AI-generated code is locally readable but globally inscrutable.

Should engineering teams track which code is AI-generated?

Yes, for triage routing rather than blame. Modules with high AI-assistance percentages benefit from different incident response defaults — specifically, budgeting extra time for implementation intent reconstruction before repair begins. Many teams are now tagging modules with an approximate AI-assistance percentage in their internal documentation.

Does this mean AI coding tools are not worth using?

The productivity gains are real — 1.6x to 2.2x throughput on feature delivery is a meaningful structural advantage. The point is to add process around comprehension and code ownership so those gains don't reverse at month three. Teams not using AI coding tools are shipping at a disadvantage; teams using them without process adjustments are shipping debt.

AI & LLMs | Jun 11, 2026 | 6 min read | Reviewed Jun 11, 2026

Vibe-coded codebases look fine until month three

What production incident data shows that demo videos don't

By FlowVerify Editorial Team

The sprint shipped clean. Test coverage at 91%, no regressions in staging, deployment to production without incident. About 68% of the net-new code was AI-assisted: Cursor-generated handlers, AI-drafted data transformation logic, and tests written from the same prompts that produced the functions. The dashboard looked fine for six weeks. Then a cascade failure in the order processing pipeline was traced back to a coupling pattern the AI had reproduced, with slight variations, across four unrelated modules. None of the tests caught it because they tested the code the AI wrote, not the requirement the code was meant to satisfy.

This is the specific failure mode that current discourse about AI-generated code largely misses. The debate about whether AI code is good or bad is the wrong framing. The better question is: good or bad at what point in the lifecycle?

What team-level vibe coding actually looks like

Individual developers using Cursor or GitHub Copilot on greenfield features is one pattern. The pattern that emerged at team level through 2025 and into 2026 is different in kind. Entire sprints where the majority of net-new code is AI-assisted. Modules generated wholesale from detailed prompts. PRs reviewed for test coverage and lint compliance rather than for whether anyone on the team actually understands the branching logic.

This isn't how most writing about vibe coding describes the practice. The typical account shows a single developer building a complete side project over a weekend. The production version is a team of five shipping product sprints at double their previous velocity, with everyone implicitly trusting that the AI output is correct because the tests pass.

The velocity gains are real. Uplevel's 2026 engineering analytics data puts AI-assisted developer throughput at 1.6x to 2.2x for feature-complete sprint delivery. Those numbers hold for 6 to 8 weeks, which is roughly the length of a standard measurement cycle. What happens after that is what this piece examines.

Three failure modes that surface after month one

Not all AI-generated code fails in the same way. The patterns that appear in incident post-mortems cluster around three mechanisms.

Context collapse

An AI generates code to satisfy a stated requirement. The code works. When a later sprint asks for an extension of that behaviour, a second developer prompts the AI in a slightly different context and gets a slightly different implementation. Over several sprints, a codebase develops multiple competing patterns for the same problem type. Each is individually correct; none is consistent with the others. The failure mode isn't a bug; it's the multiplication of modification cost when you need to change the behaviour uniformly across the codebase.
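A small illustration of how this accumulates. The two helpers below are hypothetical (the names and the price-rounding scenario are invented for this sketch, not taken from any incident). Each is defensible on its own; they disagree on a boundary value, which is the kind of inconsistency that only surfaces when behaviour has to change uniformly.

```python
from decimal import Decimal, ROUND_HALF_UP

# Hypothetical helper generated in one sprint for a checkout module.
def round_price_checkout(amount: str) -> Decimal:
    """Round to cents with exact decimal arithmetic; halves round up."""
    return Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical helper generated in a later sprint for an invoicing module.
def round_price_invoice(amount: str) -> float:
    """Round to cents using the built-in round() on a binary float."""
    return round(float(amount), 2)

# Same input, two answers: 2.675 has no exact binary float representation,
# so the float-based helper lands on the lower cent.
print(round_price_checkout("2.675"), round_price_invoice("2.675"))
```

Neither function is a bug in isolation. The cost appears when someone has to change the rounding rule and discovers there are two of them.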

Invisible coupling

AI-generated code has a tendency to add direct dependencies because that's the locally optimal solution to the stated prompt. A data processing function that should be stateless ends up importing a config client and a database connection because the training examples typically included them. Each function works in isolation. The coupling becomes visible when you test the function in a different context, or when a dependency change cascades unexpectedly through several modules that all imported the same thing for different reasons.
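A minimal sketch of the difference, under stated assumptions: `ConfigClient` is a hypothetical stand-in for a real configuration service, and the tax calculation is an invented example. The first function reaches for its own dependency; the second takes the value as an argument and stays stateless.

```python
# Hypothetical stand-in for a real config service client.
class ConfigClient:
    def get(self, key: str) -> float:
        return {"tax_rate": 0.2}[key]

# Coupled: the function constructs and queries its own dependency,
# so every caller silently depends on the config service too.
def order_total_coupled(subtotal: float) -> float:
    rate = ConfigClient().get("tax_rate")
    return round(subtotal * (1 + rate), 2)

# Decoupled: the caller supplies the rate, so the function is a pure
# calculation that can be tested without any service available.
def order_total(subtotal: float, tax_rate: float) -> float:
    return round(subtotal * (1 + tax_rate), 2)
```

Both versions return the same number today. Only the second keeps returning it when the config client changes.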

Test-scope mismatch

When the same session that generates a function also generates its tests, the tests verify what the function does, not what the requirement says it should do. The edge cases flagged in a product spec but not emphasised in the prompt go untested. This is the classic problem of testing what you wrote instead of what was required, now reproduced systematically at every AI-assisted function.
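The mismatch can be shown in a few lines. Everything here is hypothetical: the requirement text, the function, and both checks are invented for illustration. The first check mirrors the branches in the code and passes; the second is derived from the requirement's wording and exposes the off-by-one.

```python
# Hypothetical requirement: "orders of 10 or more units get a 5% discount".
def discount_rate(quantity: int) -> float:
    # Generated implementation: uses > where the requirement says "or more".
    return 0.05 if quantity > 10 else 0.0

def check_from_implementation() -> bool:
    # Written by reading the code: one value per branch, both satisfied.
    return discount_rate(11) == 0.05 and discount_rate(5) == 0.0

def check_from_requirement() -> bool:
    # Written by reading the requirement: exercises the stated boundary.
    return discount_rate(10) == 0.05

print(check_from_implementation(), check_from_requirement())
```

Both checks give full branch coverage of `discount_rate`, which is why a coverage metric alone cannot tell them apart.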

What the incident data shows

GitLab's 2026 developer productivity report found that 43% of AI-generated code changes required significant rework in production, compared to 18% for hand-written code. That sounds alarming until you look at the other side of the ledger, time to first deployment: AI-assisted sprints ship 1.7x faster on average. On net, the two numbers still favour AI-assisted development, at least for the first 60 days.

The widely cited example from March 2026 is an inventory allocation module, primarily AI-generated, that passed unit tests, integration tests, and load tests before deployment. The failure was in a branching condition that handled a stock depletion edge case correctly for the scenario in the test suite but incorrectly for a variant that appeared in production at volume. The post-mortem found that three engineers spent several hours on an incident call trying to reconstruct why the branch was structured the way it was. No one knew, because the developer who merged it had also used the AI to review the pull request.

Metric	AI-assisted	Hand-written	Source
Throughput increase	1.6–2.2×	Baseline	Uplevel 2026
Rework rate in production	43%	18%	GitLab 2026
Code churn, 90-day window	+861%	Baseline	Uplevel 2026
Incident MTTR increase	+40–60%	Baseline	Aggregated post-mortems
Time to first deployment	1.7× faster	Baseline	GitLab 2026

Production outcomes: AI-assisted vs. hand-written code (first 90 days)

The hidden cost is triage time, not bug rate

The 43% rework figure is real, but it's not the primary cost. The primary cost shows up at 2:47am during an incident when three engineers are looking at a function and none of them can explain the logic in the middle section. The code is syntactically clear. The implementation intent is gone.

This is a variant of a problem as old as software itself: code has to be written for humans to read. AI-assisted development tends to generate code that is locally readable, with each line and function making sense in isolation, but globally inscrutable. The architecture that emerges from composing hundreds of AI-generated functions doesn't cohere the way an architecture shaped by design discussion does. Nobody held the whole thing in their head because nobody needed to.

Mean time to repair for incidents traced to AI-generated code runs 40 to 60% higher than for incidents in hand-written code, based on incident tracking data from B2B SaaS teams that have run AI-assisted development at team scale for at least a year. The bugs aren't different in kind. The debugging process is different because implementation decisions can't be recovered from the code alone.

What engineering teams managing this are actually doing

Several practices are emerging from teams that have run AI-assisted development long enough to encounter month three.

The understanding gate. A PR review step distinct from linting and test coverage. The author explains the logic of every non-trivial function — verbally in a review meeting, or as a structured comment block that states the function's contract: what invariants it assumes, what it guarantees, what it explicitly doesn't handle. The gate can be satisfied using the AI itself, but requiring the statement surfaces the cases where the AI-generated function has no clearly recoverable design intent. If the AI can't write a coherent contract statement for a function it just generated, that's a signal worth catching before merge.
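One possible shape for the structured comment block, sketched as a Python docstring plus a small checker. The section labels (`Assumes:`, `Guarantees:`, `Does not handle:`), the `allocate_stock` function, and the checker itself are assumptions made for this example, not a format any particular team is known to use.

```python
import inspect

# Hypothetical section labels a team might require in every contract block.
REQUIRED_SECTIONS = ("Assumes:", "Guarantees:", "Does not handle:")

def allocate_stock(available: int, requested: int) -> int:
    """Return the number of units to allocate to an order.

    Assumes: available >= 0 and requested >= 0.
    Guarantees: the result is between 0 and min(available, requested).
    Does not handle: reservations, backorders, or concurrent updates.
    """
    return min(available, requested)

def missing_contract_sections(func) -> list[str]:
    """List the required contract sections absent from a docstring."""
    doc = inspect.getdoc(func) or ""
    return [section for section in REQUIRED_SECTIONS if section not in doc]

print(missing_contract_sections(allocate_stock))
```

A check like this only confirms that a contract was written down. Whether the contract is true is still a question for the human reviewer.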

Attribution tagging. Modules tagged with an approximate AI-assistance percentage, for triage routing rather than blame. An incident in a 90% AI-assisted module starts with extra time budgeted for intent reconstruction before repair begins. The assumption isn't that the code is wrong; it's that understanding it will take longer than usual.
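As a rough sketch of how a tag could feed triage defaults: the module names, percentages, thresholds, and multipliers below are all invented for illustration, and a real team would set them from its own incident history.

```python
# Hypothetical module tags, as they might appear in internal documentation.
MODULE_AI_ASSISTANCE = {
    "orders/allocation": 90,
    "billing/invoices": 45,
    "auth/session": 0,
}

def intent_reconstruction_budget(module: str, base_minutes: int = 30) -> int:
    """Minutes to reserve for reading the code before attempting a repair.

    The 75% and 40% thresholds are assumptions for this sketch.
    """
    pct = MODULE_AI_ASSISTANCE.get(module, 0)  # untagged modules count as 0%
    if pct >= 75:
        return base_minutes * 3
    if pct >= 40:
        return base_minutes * 2
    return base_minutes
```

The output is a planning default for the incident commander, not a judgment about the module's quality.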

Canary deployments for AI-heavy sprints. Not because AI code is more likely to fail on day one. Because the observability signal from a sprint with high AI assistance is different — more surface area, fewer people who understand the internals. Routing a smaller percentage of traffic through the new code for 48 to 72 hours before full rollout gives the team time to observe failure modes before they reach full scale.
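The traffic split itself is usually handled by a load balancer or feature-flag service. For teams doing it in application code, one common approach is a deterministic hash of a stable identifier, sketched below; the function name and the choice of user ID as the key are assumptions for this example.

```python
import hashlib

def routes_to_canary(user_id: str, canary_percent: int) -> bool:
    """Deterministically place a user in the canary cohort.

    Hashing a stable ID keeps each user on the same code path for the
    whole observation window instead of flipping between versions.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < canary_percent

# Roughly 10% of a sample population should land in the canary cohort.
sample = [f"user-{i}" for i in range(10_000)]
canary_share = sum(routes_to_canary(u, 10) for u in sample) / len(sample)
```

Keeping the cohort stable matters here because the goal is to observe failure modes over 48 to 72 hours, and a user who alternates between versions muddies that signal.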

Test specification reviews. Separate from test coverage reviews: checking whether the tests cover requirements as specified, not just the code as written. This directly addresses the test-scope mismatch failure mode. It's the most labour-intensive mitigation because it requires re-reading the original requirement and comparing it against the test, but it catches the category of failure that coverage metrics alone don't surface.
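Part of this review can be supported by a simple traceability check, sketched here under the assumption that requirements carry IDs and each test declares which ones it covers. The requirement IDs, requirement texts, and test names are hypothetical.

```python
# Hypothetical requirements taken from a product spec.
REQUIREMENTS = {
    "REQ-1": "Allocate stock up to the available quantity",
    "REQ-2": "Reject allocation when stock is depleted",
    "REQ-3": "Handle partial depletion across warehouses",
}

# Hypothetical mapping from test name to the requirements it claims to cover.
TEST_COVERS = {
    "test_allocates_up_to_available": ["REQ-1"],
    "test_rejects_when_depleted": ["REQ-2"],
    "test_rounding_helper": [],  # exercises code, traces to no requirement
}

def uncovered_requirements(requirements, test_covers):
    """Return the requirement IDs that no test claims to cover."""
    covered = {req for reqs in test_covers.values() for req in reqs}
    return sorted(set(requirements) - covered)

print(uncovered_requirements(REQUIREMENTS, TEST_COVERS))
```

The check finds requirements with no test at all. Deciding whether a claimed test actually verifies its requirement remains the manual, labour-intensive part.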

What to measure when your team ships AI-assisted code

Three metrics are worth adding to your engineering dashboard if you're running AI-assisted development at team scale.

Time to first incident, segmented by module AI-assistance percentage. If incidents in AI-heavy modules arrive earlier in the deployment lifecycle than incidents in hand-written modules, you have a leading quality signal before the 90-day churn data accumulates.
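A minimal sketch of the calculation, assuming each deployment record carries the module's AI-assistance tag, a deploy date, and the date of its first incident if there has been one. The records are invented sample data, and the 50% split between segments is an arbitrary choice for the example.

```python
from datetime import date
from statistics import median

# Invented sample records:
# (module, ai_assistance_percent, deployed_on, first_incident_on or None)
DEPLOYMENTS = [
    ("orders/allocation", 90, date(2026, 1, 5), date(2026, 2, 16)),
    ("orders/pricing", 80, date(2026, 1, 12), date(2026, 2, 9)),
    ("billing/invoices", 20, date(2026, 1, 5), date(2026, 4, 6)),
    ("auth/session", 0, date(2026, 1, 19), None),
]

def median_days_to_first_incident(records, min_pct, max_pct):
    """Median days from deploy to first incident within a percentage band.

    Modules with no incident yet are skipped, so the result describes
    only modules that have already failed at least once.
    """
    days = [
        (incident - deployed).days
        for _, pct, deployed, incident in records
        if incident is not None and min_pct <= pct <= max_pct
    ]
    return median(days) if days else None

ai_heavy = median_days_to_first_incident(DEPLOYMENTS, 50, 100)
hand_written = median_days_to_first_incident(DEPLOYMENTS, 0, 49)
```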

Modification complexity over 90 days. What fraction of AI-generated functions are replaced wholesale versus extended incrementally? High replacement rates signal context collapse: the AI's original design couldn't be built on. A high extension rate combined with a low replacement rate is a healthier signal.
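The ratio is simple once each change has been classified. The sketch below assumes that classification happens upstream (for example, during review of the diff) and that the labels are exactly `"replaced"` and `"extended"`; the change log is invented sample data.

```python
from collections import Counter

# Invented sample: one label per later change to an AI-generated function.
CHANGES = [
    "extended", "replaced", "extended", "replaced",
    "replaced", "extended", "extended", "extended",
]

def replacement_rate(changes) -> float:
    """Share of classified changes that replaced a function wholesale."""
    counts = Counter(changes)
    total = counts["replaced"] + counts["extended"]
    return counts["replaced"] / total if total else 0.0

rate = replacement_rate(CHANGES)  # 3 replaced out of 8 classified changes
```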

Incident MTTR by code origin. If MTTR runs consistently higher for AI-assisted modules, the problem is comprehension, not correctness. The response is process change — understanding gates, better attribution — rather than different tooling.
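Segmenting MTTR needs only the origin tag and the repair time per incident. The sketch below assumes incidents are already labelled by code origin; the labels and the minutes are invented sample values, not measurements.

```python
from collections import defaultdict
from statistics import mean

# Invented sample: (code_origin, minutes_to_repair) per incident.
INCIDENTS = [
    ("ai_assisted", 150),
    ("ai_assisted", 90),
    ("hand_written", 80),
    ("hand_written", 70),
]

def mttr_by_origin(incidents) -> dict:
    """Mean time to repair, in minutes, grouped by code origin."""
    grouped = defaultdict(list)
    for origin, minutes in incidents:
        grouped[origin].append(minutes)
    return {origin: mean(values) for origin, values in grouped.items()}

mttr = mttr_by_origin(INCIDENTS)
```

With small incident counts a single long outage dominates the mean, so a team might reasonably track the median alongside it.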

The productivity gains from AI-assisted development are real enough that teams not using these tools are shipping at a structural disadvantage. The question isn't whether to use them. It is whether your engineering process has caught up to what those tools actually change, not at sprint close but at month three.
