Flaky tests aren't random. Six root causes explain almost all of them.
Each cause produces an identical-looking CI failure, and the fix for one often makes another harder to diagnose.
Flaky tests aren't one bug. They're six.
Flaky tests have a predictable shape once a suite gets large enough. A test that fails once every 10,000 runs looks harmless on its own. Put it inside a 4,000-test suite where every test flakes at that rate and the arithmetic changes: at a 0.01% per-test flake rate, roughly a third of full pipeline runs hit at least one flake and fail. Push that rate to 0.03%, still a test that passes 9,997 times out of 10,000, and about 70% of runs fail.
Those numbers come from TestDino's 2026 flaky test benchmark report, and they explain why "add a retry step" became the default response to a red build. At that failure rate, retries are the only thing keeping a pipeline green most days.
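The suite-level figures above are consistent with basic probability if every test is assumed to flake independently at the same rate. That assumption is made here for illustration; the report's exact model is not reproduced in this article, and the function name below is hypothetical. A minimal sketch:

```python
def pipeline_failure_probability(per_test_flake_rate: float, test_count: int) -> float:
    """Chance that at least one of `test_count` independent tests flakes in a run."""
    # Probability that every single test passes in the same run.
    all_pass = (1.0 - per_test_flake_rate) ** test_count
    return 1.0 - all_pass


if __name__ == "__main__":
    for rate in (0.0001, 0.0003):  # 0.01% and 0.03% per-test flake rates
        share = pipeline_failure_probability(rate, 4000)
        print(f"{rate:.2%} per test -> {share:.0%} of runs fail")
```

For a 4,000-test suite this works out to roughly a third of runs at 0.01% and about 70% at 0.03%, which is why a per-test rate that looks negligible still dominates the pipeline.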
The trouble is that "flaky tests" are not one bug. They're a label engineers apply to six unrelated failure modes that happen to produce an identical symptom: a test that passes most of the time and fails without any code change behind it. Treat all six with the same retry-until-green policy and the cheapest one gets patched while the other five get harder to find, because every retry erases the evidence that would have pointed at the real cause.
The scale is large enough that major engineering organisations track it as its own line item. Google has reported that around 16% of its tests show some flakiness, costing roughly 2% of total engineering time. Atlassian attributes 15 to 21% of build failures to flakiness depending on the project, and GitHub has measured flaky-caused failures on about 9% of commits. Three companies, three different test suites, the same order of magnitude.
A separate cost model from Autonoma, assuming a 5% flake rate and senior engineering time at $150 an hour, puts the all-in annual cost for a 50-person team at roughly $400,000, counting developer investigation time, CI compute, deployment delays, and incident triage. None of that gets cheaper by retrying harder.
The six root causes behind most flaky tests
The most-cited breakdown of why tests flake comes from a 2014 study by Luo et al. of flaky tests across Apache's open-source projects, and TestDino's 2026 report still uses its category split as the baseline. Async timing accounts for roughly 45% of cases and concurrency issues for another 20%. The rest splits across order dependency, resource leaks, network failures, and clock dependency. Each cause has a different signature once you go looking for it, and the fix that works for one tends to do nothing, or actively hurt, for another.
Async timing and the fixed sleep() that almost works
This is the classic case: a test submits a job, waits a fixed two seconds, then checks the result. It passes on a fast CI runner and fails on a loaded one, because two seconds was a guess about how long the job usually takes, not a guarantee about how long it actually takes. The failure rate tracks infrastructure load, not code changes, which is exactly why it looks random.
The fix isn't a longer sleep. It's replacing the guess with a check for the actual condition, polled until it's true or a timeout is hit:
```python
import time

# submit_job, get_job_status and wait_until are helpers assumed to exist in the test suite.

# Before: a fixed wait that's usually, but not always, long enough
def test_job_completes():
    submit_job()
    time.sleep(2)
    assert get_job_status() == "complete"

# After: poll for the real condition, with an explicit timeout
def test_job_completes():
    submit_job()
    wait_until(lambda: get_job_status() == "complete", timeout=10)
```

That pattern generalises past job polling. It's the same fix for a test waiting on a network call to return, a database write to commit, or a UI element to finish rendering.
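The `wait_until` helper used in the example above is not defined in the article. One minimal way to write it, assuming a plain polling loop with a hypothetical `interval` parameter and a `TimeoutError` on expiry so the test fails loudly instead of passing silently:

```python
import time


def wait_until(condition, timeout=10.0, interval=0.1):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    # time.monotonic() is unaffected by wall-clock adjustments during the wait.
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

The timeout is an upper bound rather than a guess: a fast runner returns as soon as the condition holds, and a loaded runner gets the full budget before the test is declared failed.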
Shared mutable state between tests
A test passes when run alone and fails only when run after a specific other test. The usual cause is a module-level fixture, a global counter, or a database row that one test mutates and another assumes is still in its original state. Frameworks that reuse a fixture across a test file for speed make this worse, since the second test to touch the fixture inherits whatever the first one left behind.
The fix is isolation rather than luck: reset state in setup and teardown for every test, and avoid module- or session-scoped fixtures for anything a test mutates. If a fixture is too expensive to recreate per test, snapshot and restore it instead of sharing the live object across tests.
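The snapshot-and-restore option can be sketched with the standard library alone. Everything named here (`SHARED_CATALOG`, `restored_after_use`, the two tests) is hypothetical and stands in for whatever expensive fixture a real suite shares:

```python
import copy
from contextlib import contextmanager

# Hypothetical stand-in for a fixture that is expensive to build.
SHARED_CATALOG = {"widgets": ["a", "b"], "prices": {"a": 100, "b": 250}}


@contextmanager
def restored_after_use(state: dict):
    """Hand out the live object, then put it back exactly as it was."""
    snapshot = copy.deepcopy(state)
    try:
        yield state
    finally:
        # Restore in place so other references to the dict see the original data.
        state.clear()
        state.update(snapshot)


def test_discount_applies():
    with restored_after_use(SHARED_CATALOG) as catalog:
        catalog["prices"]["a"] = 80  # mutation stays inside this test
        assert catalog["prices"]["a"] == 80


def test_original_price_is_listed():
    # Passes regardless of whether test_discount_applies ran first.
    assert SHARED_CATALOG["prices"]["a"] == 100
```

The deep copy matters: a shallow copy would share the nested `prices` dict, and the mutation would leak into the next test anyway.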
Uncontrolled external dependencies
If failures cluster in bursts that line up with a third-party API's own incident history, the test isn't flaky. It's accurately reporting that a dependency it shouldn't be calling is unreliable. This shows up most often in integration tests that hit a live payment gateway, a SaaS API, or even an internal service that happens to be down for unrelated maintenance.
The fix is to stop the test from depending on something outside your control. Stub the dependency for unit tests, and for integration tests that genuinely need to verify real behaviour, run them against a recorded fixture or a sandboxed instance with its own SLA, separate from the always-on CI suite.
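For the unit-test half of that advice, a stub can be as small as a `unittest.mock.Mock` passed in place of the real client. The `checkout` function, the `charge` method, and the response shape below are hypothetical and do not describe any real gateway's API:

```python
from unittest.mock import Mock


def checkout(gateway, amount_cents):
    """Code under test: charge the customer and report the outcome."""
    response = gateway.charge(amount_cents)
    return "paid" if response["status"] == "succeeded" else "failed"


def test_checkout_marks_order_paid():
    gateway = Mock()  # stands in for the live payment client
    gateway.charge.return_value = {"status": "succeeded"}
    assert checkout(gateway, 1999) == "paid"
    gateway.charge.assert_called_once_with(1999)


def test_checkout_reports_declined_charge():
    gateway = Mock()
    gateway.charge.return_value = {"status": "declined"}
    assert checkout(gateway, 1999) == "failed"
```

Neither test touches the network, so a third party's outage can no longer turn the build red. It also makes the declined path testable on demand, which a live gateway rarely allows.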
Environment drift between local and CI
A test that's green on every engineer's laptop and red only in CI is rarely a CI problem. It's usually a difference between the two environments that the test happens to be sensitive to: a different timezone, a different locale's number formatting, a memory or file-descriptor limit, or a base image with a different default shell.
The fix is making local and CI as close to identical as the team can stand: the same container image, the same environment variables, the same resource limits, and running locally through the same entrypoint script CI uses rather than a separate "dev mode" path.
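Before aligning the two environments, it helps to see where they differ. One rough approach, sketched here with a hypothetical `environment_fingerprint` function, is to print the same handful of settings on a laptop and in a CI job and diff the output. The fields chosen are only the ones named above; a real suite may be sensitive to others.

```python
import json
import locale
import os
import platform
import sys
import time


def environment_fingerprint():
    """Collect settings that commonly differ between a laptop and a CI runner."""
    fingerprint = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "tz_env": os.environ.get("TZ"),
        "tzname": list(time.tzname),
        "locale": list(locale.getlocale()),
        "shell": os.environ.get("SHELL"),
    }
    try:
        import resource  # Unix-only module; absent on Windows

        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
        fingerprint["open_files_soft_limit"] = soft_limit
    except ImportError:
        fingerprint["open_files_soft_limit"] = None
    return fingerprint


if __name__ == "__main__":
    print(json.dumps(environment_fingerprint(), indent=2, sort_keys=True))
```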
Test order dependency
A test passes in isolation, passes in the usual CI order, and fails the one time someone reorders the suite or inserts a new test before it. The hidden coupling is almost always global state again, except this time it's order-sensitive rather than simply shared: test B only fails if test A ran first and left a side effect B silently depends on.
The fix is to make the dependency visible by breaking it on purpose. Run the suite with randomised test order in CI on a regular schedule, even if the default merge-blocking run keeps deterministic order for speed. Whatever breaks under randomisation is coupled to something it shouldn't be.
One detail trips teams up here: randomised order only helps if the run logs the seed it used. A test that fails under randomisation and can't be reproduced because nobody captured which order produced it is no better than the flake it was meant to catch.
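The seed-logging point can be illustrated without any particular test runner. This is a sketch of the idea only, with a hypothetical `shuffled_order` function; real runners and their plugins expose their own options for random ordering and seeds.

```python
import random


def shuffled_order(test_names, seed=None):
    """Return (seed, order) so a failing order can be replayed exactly."""
    if seed is None:
        seed = random.SystemRandom().randrange(2**32)
    # Sort first so the result depends only on the seed, not on discovery order.
    order = sorted(test_names)
    random.Random(seed).shuffle(order)
    return seed, order


if __name__ == "__main__":
    seed, order = shuffled_order(["test_a", "test_b", "test_c", "test_d"])
    print(f"test order seed: {seed}")  # log this line in CI output
    print(order)
```

Passing the logged seed back in (`shuffled_order(names, seed=logged_seed)`) reproduces the same order, which is what turns a one-off failure under randomisation into something an engineer can debug.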
Resource contention under parallel runners
A test is reliable at one worker per suite and gets flakier as the team adds parallel workers to speed up CI. The signature is a flake rate that scales with parallelism rather than with code changes: two tests grabbing the same port, writing to the same temp file path, or racing for rows in a shared test database.
The fix is giving every parallel worker its own slice of whatever resource is shared: a dedicated schema or database per worker, a port range derived from the worker index, a temp directory scoped to the worker rather than the machine.
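A sketch of deriving per-worker resources from a worker identifier. It assumes identifiers shaped like `gw0`, `gw1`, which is the form pytest-xdist exposes through the `PYTEST_XDIST_WORKER` environment variable; the schema naming, base port, and directory layout are hypothetical choices.

```python
import os
import tempfile
from pathlib import Path


def worker_index(worker_id: str) -> int:
    """Extract the numeric part of an id such as 'gw3'; default to 0."""
    digits = "".join(ch for ch in worker_id if ch.isdigit())
    return int(digits) if digits else 0


def worker_resources(worker_id: str, base_port: int = 20000, ports_per_worker: int = 100):
    """Give each worker a schema name, a port range, and a temp dir of its own."""
    first_port = base_port + worker_index(worker_id) * ports_per_worker
    return {
        "schema": f"test_{worker_id}",
        "port_range": (first_port, first_port + ports_per_worker - 1),
        "tmp_dir": Path(tempfile.gettempdir()) / f"tests-{worker_id}",
    }


if __name__ == "__main__":
    # Falls back to a single-worker id when the variable is not set.
    current = os.environ.get("PYTEST_XDIST_WORKER", "gw0")
    print(worker_resources(current))
```

Because every value is a pure function of the worker id, two workers cannot be handed the same port or path, and adding workers does not require any coordination between them.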
What actually fixes each pattern
Put side by side, the six causes and their fixes barely overlap, which is the actual problem with a single retry policy applied to all of them:
| Root cause | What it looks like in CI | The fix that works |
|---|---|---|
| Async timing | Failure rate tracks runner load rather than code changes | Poll for the real condition with a timeout instead of a fixed sleep |
| Shared mutable state | Fails only when run after a specific other test, passes alone | Reset state in setup/teardown; avoid mutating shared fixtures |
| External dependencies | Fails in bursts that match a third party's own incident history | Stub the dependency, or isolate it behind a sandboxed instance |
| Environment drift | Fails only in CI, never on a laptop | Match container image, env vars, and resource limits between local dev and CI |
| Test order dependency | Passes alone and in the usual order, fails when the suite is reordered or a new test runs before it | Run the suite in randomised order on a schedule to surface coupling |
| Resource contention | Flake rate rises with parallel worker count | Give each worker its own schema, port range, and temp directory |
Only the first row is something a retry genuinely papers over without much consequence, and even there it's hiding a job that may be slower than the test assumed, which is itself worth knowing.
Quarantine without losing the signal
Quarantining a known-flaky test, marking it skipped or excluded from the merge-blocking run, is a reasonable short-term move. It stops a known problem from blocking unrelated work. It turns into a different kind of problem the moment it has no expiry: a quarantine list with no owner and no fix-by date is how teams end up with tests permanently excluded from the very suite that exists to catch the bug they were written for.
The cheap fix is procedural: every quarantined test gets a ticket, an owner, and a re-evaluation date, and the quarantine list itself gets reviewed the way an on-call rotation reviews open incidents, visibly and on a fixed cadence, not as an afterthought.
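That procedure is easy to enforce mechanically once the quarantine list is data rather than a scattering of skip markers. A minimal sketch, in which the `QuarantineEntry` fields, the test names, and the ticket IDs are all hypothetical:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class QuarantineEntry:
    test_name: str
    ticket: str
    owner: str
    review_by: date


def overdue_entries(entries, today):
    """Return entries whose re-evaluation date has passed, oldest first."""
    return sorted(
        (entry for entry in entries if entry.review_by < today),
        key=lambda entry: entry.review_by,
    )


if __name__ == "__main__":
    quarantine = [
        QuarantineEntry("test_export_csv", "QA-101", "dana", date(2026, 1, 15)),
        QuarantineEntry("test_login_redirect", "QA-117", "sam", date(2026, 3, 1)),
    ]
    for entry in overdue_entries(quarantine, today=date(2026, 2, 1)):
        print(f"{entry.test_name}: overdue, owner {entry.owner}, ticket {entry.ticket}")
```

Run on a schedule, a check like this could fail a non-blocking job or post to a team channel whenever an entry passes its date, so an expired quarantine is noticed by default.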
What to measure instead of "CI is flaky today"
Pipeline-level pass rate is the wrong unit to track, because it blends a hundred reliable tests with the one repeat offender and tells you nothing about which test to fix first. Flake rate per individual test, tracked over rolling builds rather than a single bad day, is the metric that actually points at where to spend an afternoon.
Most CI platforms can surface this without new tooling. A test that fails and then passes on an identical rerun, with no code change in between, is a flaky-test event. Counting those per test name over a couple of weeks turns a vague sense that "CI feels flaky" into a ranked list, and a ranked list is something an engineer can actually act on first thing Monday morning.
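The counting step can be sketched as follows. It assumes test results can be exported as `(test_name, commit_sha, passed)` records, and it treats "both a pass and a failure on the same commit" as an approximation of fail-then-pass on rerun. The function name and record shape are hypothetical.

```python
from collections import Counter


def rank_flaky_tests(results):
    """Count flaky events per test: a commit on which it both passed and failed."""
    outcomes = {}
    for test_name, commit_sha, passed in results:
        outcomes.setdefault((test_name, commit_sha), set()).add(bool(passed))
    events = Counter(
        test_name
        for (test_name, _sha), seen in outcomes.items()
        if seen == {True, False}
    )
    # Highest flake count first: the ranked list of what to fix.
    return events.most_common()


if __name__ == "__main__":
    sample = [
        ("test_upload", "abc123", False),
        ("test_upload", "abc123", True),   # passed on rerun, same commit
        ("test_search", "abc123", True),
        ("test_parse", "def456", False),
        ("test_parse", "def456", False),   # consistently failing, not flaky
    ]
    print(rank_flaky_tests(sample))
```

A test that fails on every attempt for a commit is excluded on purpose: that is a real failure, and mixing it into the flake count would blur the ranking.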
The shape of that list matters as much as its length. A flake rate concentrated in three tests is three bugs to fix. A flake rate spread evenly across forty unrelated tests is usually not forty separate bugs at all. It's usually one shared-infrastructure problem (a slow CI runner, a contended database, a network path with intermittent packet loss) wearing forty different costumes.
The team that builds that list usually finds the same five or six tests responsible for most of the noise, which is a smaller, more specific problem than an unreliable CI pipeline, and a far more solvable one.