Vibe-coded codebases look fine until month three
What production incident data shows that demo videos don't
The sprint shipped clean. Test coverage at 91%, no regressions in staging, deployment to production without incident. About 68% of the net-new code was AI-assisted: Cursor-generated handlers, AI-drafted data transformation logic, tests written from the same prompts that produced the functions. The dashboard looked fine for six weeks. Then a cascade failure in the order processing pipeline traced back to a coupling pattern the AI had reproduced, with slight variations, across four unrelated modules. None of the tests caught it because the tests verified the code the AI wrote, not the requirement the code was meant to satisfy.
This is the specific failure mode that current discourse about AI-generated code largely misses. The debate about whether AI code is good or bad is the wrong framing. The better question is: good or bad at what point in the lifecycle?
What team-level vibe coding actually looks like
Individual developers using Cursor or GitHub Copilot on greenfield features is one pattern. The pattern that emerged at team level through 2025 and into 2026 is different in kind. Entire sprints where the majority of net-new code is AI-assisted. Modules generated wholesale from detailed prompts. PRs reviewed for test coverage and lint compliance rather than for whether anyone on the team actually understands the branching logic.
This isn't how most writing about vibe coding describes the practice. The typical account shows a single developer building a complete side project over a weekend. The production version is a team of five shipping product sprints at double their previous velocity, with everyone implicitly trusting that the AI output is correct because the tests pass.
The velocity gains are real. Uplevel's 2026 engineering analytics data puts AI-assisted developer throughput at 1.6x to 2.2x for feature-complete sprint delivery. Those numbers hold for 6 to 8 weeks, which is roughly the length of a standard measurement cycle. What happens after that is what this piece examines.
Three failure modes that surface after month one
Not all AI-generated code fails in the same way. The patterns that appear in incident post-mortems cluster around three mechanisms.
Context collapse
An AI generates code to satisfy a stated requirement. The code works. When a later sprint asks for an extension of that behaviour, a second developer prompts the AI in a slightly different context and gets a slightly different implementation. Over several sprints, a codebase develops multiple competing patterns for the same problem type. Each is individually correct; none is consistent with the others. The failure mode isn't a bug; it's the multiplication of modification cost when you need to change the behaviour uniformly across the codebase.
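A minimal sketch of what this can look like, assuming a hypothetical bulk-discount rule (10% off at 10 or more units) that was prompted for in two different sprints. The function names, the rule, and the sprint labels are all invented for illustration; the point is only that both versions are correct and neither resembles the other.

```python
from decimal import Decimal

# Hypothetical module A (earlier sprint): threshold and rate inlined in a conditional.
def apply_bulk_discount(total: Decimal, quantity: int) -> Decimal:
    if quantity >= 10:
        return total * Decimal("0.90")
    return total


# Hypothetical module B (later sprint): the same rule expressed as a tier table.
BULK_TIERS = [(10, Decimal("0.10"))]  # (minimum quantity, discount rate)


def discounted_total(total: Decimal, quantity: int) -> Decimal:
    # Pick the highest rate whose minimum quantity is met; no tier means no discount.
    rate = max((r for q, r in BULK_TIERS if quantity >= q), default=Decimal("0"))
    return total * (Decimal("1") - rate)
```

Both functions return the same totals today. Changing the threshold from 10 to 12 means finding and editing two differently shaped implementations, and nothing in the code links one to the other.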
Invisible coupling
AI-generated code has a tendency to add direct dependencies because that's the locally optimal solution to the stated prompt. A data processing function that should be stateless ends up importing a config client and a database connection because the training examples typically included them. Each function works in isolation. The coupling becomes visible when you test the function in a different context, or when a dependency change cascades unexpectedly through several modules that all imported the same thing for different reasons.
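As an illustration, here is a sketch of the same transformation written both ways. `ConfigClient`, the `weight_precision` key, and both function names are hypothetical stand-ins, and the client is made to fail deliberately to mimic running the function somewhere the config service is not reachable.

```python
class ConfigClient:
    """Hypothetical stand-in for a client that reads settings over the network."""

    def get(self, key: str) -> str:
        # Simulates calling the function outside the environment it was written in.
        raise ConnectionError("config service not reachable in this context")


# Coupled: a pure calculation that reaches for its own dependency.
def normalise_weights_coupled(weights: list[float]) -> list[float]:
    precision = int(ConfigClient().get("weight_precision"))
    total = sum(weights)
    return [round(w / total, precision) for w in weights]


# Decoupled: the caller supplies the value, so the function stays stateless.
def normalise_weights(weights: list[float], precision: int = 4) -> list[float]:
    total = sum(weights)
    return [round(w / total, precision) for w in weights]
```

The arithmetic is identical in both versions. Only the second can be called from a test, a batch job, or another module without a config service behind it.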
Test-scope mismatch
When the same session that generates a function also generates its tests, the tests verify what the function does, not what the requirement says it should do. The edge cases flagged in a product spec but not emphasised in the prompt go untested. This is the classic problem of testing what you wrote instead of what was required, now reproduced systematically at every AI-assisted function.
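A small sketch of the mismatch, assuming a hypothetical requirement that shipping is free for orders of 50.00 or more and costs 4.99 otherwise. The function, the fee values, and the two check helpers are invented for illustration.

```python
from decimal import Decimal


def shipping_fee(order_total: Decimal) -> Decimal:
    # Generated implementation: uses > where the requirement says "50.00 or more".
    if order_total > Decimal("50.00"):
        return Decimal("0.00")
    return Decimal("4.99")


def code_derived_checks() -> bool:
    # Checks written from the implementation: they mirror its two branches.
    return (
        shipping_fee(Decimal("60.00")) == Decimal("0.00")
        and shipping_fee(Decimal("20.00")) == Decimal("4.99")
    )


def requirement_derived_checks() -> bool:
    # Check written from the requirement: it exercises the stated boundary.
    return shipping_fee(Decimal("50.00")) == Decimal("0.00")
```

The code-derived checks pass and cover every branch, so the coverage number looks complete. The requirement-derived check fails, because the boundary was in the spec and not in the prompt.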
What the incident data shows
GitLab's 2026 developer productivity report found that 43% of AI-generated code changes required significant rework in production, compared to 18% for hand-written code. That sounds alarming until you look at the other side: time to first deployment. AI-assisted sprints ship 1.7x faster on average. Taken together, the two numbers still favour AI-assisted development for the first 60 days.
The widely-cited example from March 2026 is an inventory allocation module, primarily AI-generated, that passed unit tests, integration tests, and load tests before deployment. The failure was in a branching condition that handled a stock depletion edge case correctly for the scenario in the test suite and incorrectly for a variant that appeared in production at volume. The post-mortem finding: three engineers spent several hours in an incident call trying to reconstruct why the branch was structured the way it was. No one knew, because the developer who merged it had also used the AI to review the pull request.
| Metric | AI-assisted | Hand-written | Source |
|---|---|---|---|
| Throughput increase | 1.6–2.2× | Baseline | Uplevel 2026 |
| Rework rate in production | 43% | 18% | GitLab 2026 |
| Code churn, 90-day window | +861% | Baseline | Uplevel 2026 |
| Incident MTTR increase | +40–60% | Baseline | Aggregated post-mortems |
| Time to first deployment | 1.7× faster | Baseline | GitLab 2026 |
The hidden cost is triage time, not bug rate
The 43% rework figure is real, but it's not the primary cost. The primary cost shows up at 2:47am during an incident when three engineers are looking at a function and none of them can explain the logic in the middle section. The code is syntactically clear. The implementation intent is gone.
This is a variant of a problem as old as software itself: code has to be written for humans to read. What AI-assisted development does is generate code that is locally readable, each line and function making sense in isolation, but globally inscrutable. The architecture that emerges from composing hundreds of AI-generated functions doesn't cohere in the way that a design discussion produces coherence. Nobody held the whole thing in their head because nobody needed to.
Mean time to repair for incidents traced to AI-generated code runs 40 to 60% higher than for incidents in hand-written code, based on incident tracking data from B2B SaaS teams that have run AI-assisted development at team scale for at least a year. The bugs aren't different in kind. The debugging process is different because implementation decisions can't be recovered from the code alone.
What engineering teams managing this are actually doing
Several practices are emerging from teams that have run AI-assisted development long enough to encounter month three.
The understanding gate. A PR review step distinct from linting and test coverage. The author explains the logic of every non-trivial function — verbally in a review meeting, or as a structured comment block that states the function's contract: what invariants it assumes, what it guarantees, what it explicitly doesn't handle. The gate can be satisfied using the AI itself, but requiring the statement surfaces the cases where the AI-generated function has no clearly recoverable design intent. If the AI can't write a coherent contract statement for a function it just generated, that's a signal worth catching before merge.
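One way the structured comment block could look, sketched here as a Python docstring with three fixed section labels and a small checker a review script might run. The section names, `missing_contract_sections`, and `allocate_stock` are assumptions made for this example, not an established convention.

```python
import inspect

# Hypothetical section labels a team might require in every contract block.
REQUIRED_SECTIONS = ("Assumes:", "Guarantees:", "Does not handle:")


def missing_contract_sections(func) -> list[str]:
    """Return the required section labels absent from func's docstring."""
    doc = inspect.getdoc(func) or ""
    return [section for section in REQUIRED_SECTIONS if section not in doc]


def allocate_stock(available: int, requested: int) -> int:
    """Return the number of units to allocate to a single request.

    Assumes: available and requested are non-negative integers.
    Guarantees: the result never exceeds available or requested.
    Does not handle: concurrent requests against the same stock record.
    """
    return min(available, requested)
```

A check like this only confirms that a contract was written down. Whether the contract is true still needs a human reviewer, which is the part of the gate that cannot be automated.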
Attribution tagging. Modules tagged with an approximate AI-assistance percentage, for triage routing rather than blame. An incident in a 90% AI-assisted module starts with extra time budgeted for intent reconstruction before repair begins. The assumption isn't that the code is wrong; it's that understanding it will take longer than usual.
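A sketch of how a tag might feed triage, assuming a made-up policy in which the time budgeted for intent reconstruction grows linearly with the AI-assistance percentage, from a 15-minute base up to three times that at 100%. The `ModuleTag` type, the policy, and the numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleTag:
    module: str
    ai_assisted_pct: int  # approximate share of the module that was AI-assisted


def intent_reconstruction_budget_minutes(tag: ModuleTag, base: int = 15) -> int:
    # Hypothetical policy: base budget at 0% AI-assisted, three times base at 100%.
    return base + (2 * base * tag.ai_assisted_pct) // 100
```

Under these assumptions a module tagged at 90% starts an incident with 42 minutes set aside for reading before anyone attempts a fix, against 15 minutes for a module tagged at 0%.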
Canary deployments for AI-heavy sprints. Not because AI code is more likely to fail on day one. Because the observability signal from a sprint with high AI assistance is different — more surface area, fewer people who understand the internals. Routing a smaller percentage of traffic through the new code for 48 to 72 hours before full rollout gives the team time to observe failure modes before they reach full scale.
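The routing decision itself can be sketched in a few lines, assuming traffic is split by hashing a stable request key such as an order or customer ID. The function name and the choice of key are assumptions; in practice this logic usually lives in a load balancer or service mesh, not in application code.

```python
import hashlib


def routes_to_canary(request_key: str, canary_percent: int) -> bool:
    """Deterministically assign a request key to the canary or the stable release."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a bucket from 0 to 99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_percent
```

Hashing the key, as opposed to picking at random per request, keeps each order or customer on the same version for the whole observation window, which makes any failure easier to attribute.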
Test specification reviews. Separate from test coverage reviews: checking whether the tests cover requirements as specified, not just the code as written. This directly addresses the test-scope mismatch failure mode. It's the most labour-intensive mitigation because it requires re-reading the original requirement and comparing it against the test, but it catches the category of failure that coverage metrics alone don't surface.
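Part of this review can be supported by a simple traceability check, sketched below on the assumption that requirements carry IDs and each test declares which IDs it covers. The ID format, the data shapes, and the function name are hypothetical.

```python
def untested_requirements(
    requirements: dict[str, str],
    test_links: dict[str, set[str]],
) -> list[str]:
    """Return IDs of requirements that no test claims to cover.

    requirements maps a requirement ID to its text.
    test_links maps a test name to the requirement IDs it declares.
    """
    covered = set().union(*test_links.values())
    return sorted(rid for rid in requirements if rid not in covered)
```

This only finds requirements with no test at all. Judging whether a linked test really exercises the requirement is still the manual, labour-intensive part.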
What to measure when your team ships AI-assisted code
Three metrics are worth adding to your engineering dashboard if you're running AI-assisted development at team scale.
Time to first incident, segmented by module AI-assistance percentage. If incidents in AI-heavy modules arrive earlier in the deployment lifecycle than incidents in hand-written modules, you have a leading quality signal before the 90-day churn data accumulates.
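A sketch of the segmentation, assuming each record holds a module's AI-assistance percentage, its deployment date, and the date of its first incident. The bucket cut-offs at 30% and 70% are arbitrary choices for the example, and modules with no incident yet are simply left out.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def ai_bucket(pct: int) -> str:
    # Hypothetical cut-offs; choose ones that match how your modules are tagged.
    if pct >= 70:
        return "high"
    if pct >= 30:
        return "medium"
    return "low"


def median_days_to_first_incident(
    records: list[tuple[int, date, date]],
) -> dict[str, float]:
    """records: (ai_assisted_pct, deployed_on, first_incident_on) per module."""
    days = defaultdict(list)
    for pct, deployed_on, first_incident_on in records:
        days[ai_bucket(pct)].append((first_incident_on - deployed_on).days)
    return {bucket: median(values) for bucket, values in days.items()}
```

Leaving out incident-free modules biases every bucket toward shorter times, so the buckets are best compared against each other and not read as absolute figures.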
Modification complexity over 90 days. What fraction of AI-generated functions are replaced wholesale versus extended incrementally? High replacement rates signal context collapse: the AI's original design couldn't be built on. High extension rates with low replacement rates are a healthier signal.
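Once each modified function has been labelled, the fraction is a one-line calculation. The sketch below assumes the labels `replaced` and `extended` already exist; how a team derives them, for instance from diff size against the original version, is left open here.

```python
from collections import Counter


def replacement_rate(change_events: list[str]) -> float:
    """Fraction of modified functions that were replaced wholesale.

    change_events holds one label per modified function:
    "replaced" or "extended". Any other label is ignored.
    """
    counts = Counter(change_events)
    modified = counts["replaced"] + counts["extended"]
    return counts["replaced"] / modified if modified else 0.0
```

Tracking this per module over the 90-day window, and not only as a codebase-wide figure, shows where the original designs could not be built on.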
Incident MTTR by code origin. If MTTR runs consistently higher for AI-assisted modules, the problem is comprehension, not correctness. The response is process change — understanding gates, better attribution — rather than different tooling.
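The comparison can be sketched as a ratio of mean repair times, assuming each incident has already been tagged with a code origin. The origin labels and the function name are hypothetical.

```python
from statistics import mean


def mttr_ratio(incidents: list[tuple[str, float]]) -> float:
    """Mean repair time for AI-assisted incidents divided by hand-written ones.

    incidents: (origin, hours_to_repair), where origin is
    "ai_assisted" or "hand_written".
    """
    ai = [hours for origin, hours in incidents if origin == "ai_assisted"]
    hand = [hours for origin, hours in incidents if origin == "hand_written"]
    if not ai or not hand:
        raise ValueError("need at least one incident of each origin")
    return mean(ai) / mean(hand)
```

A ratio of 1.4 to 1.6 would correspond to the 40 to 60% MTTR increase described earlier. With only a handful of incidents the mean is easily skewed by one long outage, so a median may be the safer choice.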
The productivity gains from AI-assisted development are real enough that teams not using these tools are shipping at a structural disadvantage. The question isn't whether to use them. The question is whether your engineering process has caught up to what those tools actually change, not at sprint close but at month three.