Does retrying a failed test actually fix flakiness?

It fixes the build for that run, and only for one of the six root causes: async timing, where the underlying job genuinely just needed more time. For the other five causes (shared state, external dependencies, environment drift, order dependency, resource contention), a retry just delays finding out why the test depends on something it shouldn't, and it erases the evidence each time it passes on the second try.

How do I figure out which root cause is behind a specific flaky test?

Run the suspect test alone, then inside the full suite, then under your normal parallel worker count, and compare results. A test that only fails alongside specific other tests points to shared state or order dependency. A test that fails only in CI and never on a laptop points to environment drift. A test that gets flakier as you add parallel workers points to resource contention. A test with no pattern beyond a fixed wait time points to async timing.

Is it ever fine to just delete a flaky test instead of fixing it?

Only after confirming the test isn't the only thing catching a real bug. Check what behaviour or past incident it was written to guard against. If the answer is unclear, that's a reason to fix the test, not delete it. If the behaviour it checks is genuinely obsolete, deleting it is more honest than leaving a permanently skipped test sitting in the suite.

What's a reasonable flake rate to tolerate in a large test suite?

Close to zero for any test that runs thousands of times a day. The compounding math is unforgiving: a 0.03% per-test flake rate, a test that passes 9,997 times out of 10,000, causes roughly 70% of runs to fail once a suite reaches about 4,000 tests. At that scale, rare and tolerable stop being the same thing.

Best Practices · Jun 29, 2026 · 7 min read · Reviewed Jun 29, 2026

Flaky tests aren't random. Six root causes explain almost all of them.

Each cause produces an identical-looking CI failure, and the fix for one often makes another harder to diagnose.

By FlowVerify Editorial Team

Key takeaways

Flaky tests have six distinct root causes that look identical in a CI log, from async timing to resource contention.
A test that passes 9,997 times out of 10,000 (a 0.03% flake rate) still means about 70% of runs fail once a suite reaches about 4,000 tests flaking at that rate.
Google, Microsoft, Atlassian, and GitHub each independently measure flaky-test impact in the same order of magnitude (roughly 9 to 21% across the figures cited below), despite different test suites and different metrics.
Retrying a failed test only genuinely fixes one of the six root causes (async timing); for the other five it just erases the evidence.
One cost model puts a 50-person team's all-in annual cost from flaky tests, including investigation time and deployment delays, at roughly $400,000.
Track flake rate per individual test over rolling weeks, not pipeline-level pass rate, to find which specific tests are worth fixing first.

Flaky tests aren't one bug. They're six.

Flaky tests have a predictable shape once a suite gets large enough: a test that fails once every 10,000 runs looks harmless on its own. Put it inside a 4,000-test suite where every test flakes at that rate and the arithmetic changes. At a 0.01% per-test flake rate, roughly a third of full pipeline runs hit at least one flaky failure. Push that rate to 0.03%, still a test that passes 9,997 times out of 10,000, and about 70% of runs fail.
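The compounding can be checked directly. A minimal sketch, assuming every test flakes independently and at the same rate (a simplification, since real suites are lumpier); the function name is hypothetical:

```python
def pipeline_failure_probability(per_test_flake_rate: float, test_count: int) -> float:
    """Chance that at least one test in the suite flakes during a single run.

    Assumes every test fails independently with the same probability.
    """
    all_pass = (1.0 - per_test_flake_rate) ** test_count
    return 1.0 - all_pass


if __name__ == "__main__":
    for rate in (0.0001, 0.0003):  # 0.01% and 0.03% per test
        share = pipeline_failure_probability(rate, 4_000)
        print(f"{rate:.2%} per test -> {share:.0%} of runs fail")
```

For a 4,000-test suite this gives roughly 33% at a 0.01% rate and roughly 70% at a 0.03% rate, which matches the figures quoted above.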

Those numbers come from TestDino's 2026 flaky test benchmark report, and they explain why "add a retry step" became the default response to a red build. At that failure rate, retries are the only thing keeping a pipeline green most days.

The trouble is that "flaky tests" are not one bug. They're a label engineers apply to six unrelated failure modes that happen to produce an identical symptom: a test that passes most of the time and fails without any code change behind it. Treat all six with the same retry-until-green policy and the cheapest one gets patched while the other five get harder to find, because every retry erases the evidence that would have pointed at the real cause.

The scale is large enough that major engineering organisations track it as its own line item. Google has reported that around 16% of its tests show some flakiness, costing roughly 2% of total engineering time. Atlassian attributes 15 to 21% of build failures to flakiness depending on the project, and GitHub has measured flaky-caused failures on about 9% of commits. Different companies, different test suites, different metrics, the same order of magnitude.

A separate cost model from Autonoma, assuming a 5% flake rate and senior engineering time at $150 an hour, puts the all-in annual cost for a 50-person team at roughly $400,000. That figure combines developer investigation time, CI compute, deployment delays, and incident triage. None of that gets cheaper by retrying harder.

The six root causes behind most flaky tests

The most-cited breakdown of why tests flake comes from a 2014 study by Luo et al. of flaky tests across Apache's open-source projects, and TestDino's 2026 report still uses its category split as the baseline. Async timing accounts for roughly 45% of cases and concurrency issues for another 20%. The rest splits across order dependency, resource leaks, network failures, and clock dependency. Each cause has a different signature once you go looking for it, and the fix that works for one tends to do nothing, or actively hurt, for another.

Async timing and the fixed sleep() that almost works

This is the classic case: a test submits a job, waits a fixed two seconds, then checks the result. It passes on a fast CI runner and fails on a loaded one, because two seconds was a guess about how long the job usually takes, not a guarantee about how long it actually takes. The failure rate tracks infrastructure load, not code changes, which is exactly why it looks random.

The fix isn't a longer sleep. It's replacing the guess with a check for the actual condition, polled until it's true or a timeout is hit:

test_job.py

import time

# submit_job(), get_job_status(), and wait_until() are the project's own helpers.

# Before: a fixed wait that's usually, but not always, long enough
def test_job_completes():
    submit_job()
    time.sleep(2)
    assert get_job_status() == "complete"

# After: poll for the real condition, with an explicit timeout.
# wait_until() is assumed to raise if the timeout expires first.
def test_job_completes():
    submit_job()
    wait_until(lambda: get_job_status() == "complete", timeout=10)

That pattern generalises past job polling. It's the same fix for a test waiting on a network call to return, a database write to commit, or a UI element to finish rendering.
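The article's example calls `wait_until` without showing it. One way such a helper might look, as a sketch: the `interval` parameter and the choice to raise `TimeoutError` are assumptions here, not something the original specifies.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float, interval: float = 0.1) -> None:
    """Poll `condition` until it returns True or `timeout` seconds have passed.

    Raises TimeoutError if the condition never becomes true, so a test that
    calls this fails loudly instead of asserting against a half-finished job.
    """
    # time.monotonic() is unaffected by wall-clock adjustments on the runner.
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

The timeout is an upper bound rather than a fixed cost: on a fast runner the call returns as soon as the condition holds, so a generous timeout does not slow down the passing case.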

Shared mutable state between tests

A test passes when run alone and fails only when run after a specific other test. The usual cause is a module-level fixture, a global counter, or a database row that one test mutates and another assumes is still in its original state. Frameworks that reuse a fixture across a test file for speed make this worse, since the second test to touch the fixture inherits whatever the first one left behind.

The fix is isolation rather than luck: reset state in setup and teardown for every test, and avoid module- or session-scoped fixtures for anything a test mutates. If a fixture is too expensive to recreate per test, snapshot and restore it instead of sharing the live object across tests.

Uncontrolled external dependencies

If failures cluster in bursts that line up with a third-party API's own incident history, the test isn't flaky. It's accurately reporting that a dependency it shouldn't be calling is unreliable. This shows up most often in integration tests that hit a live payment gateway, a SaaS API, or even an internal service that happens to be down for unrelated maintenance.

The fix is to stop the test from depending on something outside your control. Stub the dependency for unit tests, and for integration tests that genuinely need to verify real behaviour, run them against a recorded fixture or a sandboxed instance with its own SLA, separate from the always-on CI suite.
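A stubbed unit test might look like the following sketch. The `charge_customer` function, the gateway's `charge` method, and the response shape are all hypothetical stand-ins; only `unittest.mock.Mock` is a real library API.

```python
from unittest.mock import Mock


def charge_customer(gateway, customer_id: str, amount_cents: int) -> str:
    """Code under test: charges through whatever gateway client it is handed."""
    response = gateway.charge(customer_id=customer_id, amount=amount_cents)
    if response["status"] != "succeeded":
        raise RuntimeError(f"charge failed: {response['status']}")
    return response["id"]


def test_charge_customer_returns_charge_id():
    # The stub stands in for the live payment gateway; no network call is made.
    gateway = Mock()
    gateway.charge.return_value = {"status": "succeeded", "id": "ch_123"}

    assert charge_customer(gateway, "cust_1", 500) == "ch_123"
    gateway.charge.assert_called_once_with(customer_id="cust_1", amount=500)
```

Passing the gateway in as an argument is what makes the stub possible; code that constructs its own client internally is much harder to isolate.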

Environment drift between local and CI

A test that's green on every engineer's laptop and red only in CI is rarely a CI problem. It's usually a difference between the two environments that the test happens to be sensitive to: a different timezone, a different locale's number formatting, a memory or file-descriptor limit, or a base image with a different default shell.

The fix is making local and CI as close to identical as the team can stand: the same container image, the same environment variables, the same resource limits, and running locally through the same entrypoint script CI uses rather than a separate "dev mode" path.
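Before the two environments can be matched, it helps to see where they differ. One possible approach, sketched here with standard-library calls only, is to print an environment fingerprint in both places and diff the output. The choice of fields is an assumption based on the examples above, not an exhaustive list.

```python
import json
import locale
import os
import platform
import sys
import time

try:
    import resource  # Unix only; absent on Windows
except ImportError:
    resource = None


def environment_fingerprint() -> dict:
    """Collect the settings that most often differ between a laptop and CI."""
    fingerprint = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "timezone": time.tzname,
        "tz_env": os.environ.get("TZ"),
        "locale": locale.getlocale(),
        "shell": os.environ.get("SHELL"),
        "cpu_count": os.cpu_count(),
    }
    if resource is not None:
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        fingerprint["open_file_limit"] = {"soft": soft, "hard": hard}
    return fingerprint


if __name__ == "__main__":
    print(json.dumps(environment_fingerprint(), indent=2, sort_keys=True))
```

Running this once locally and once as a CI step, then diffing the two JSON outputs, narrows the search to the fields that actually differ.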

Test order dependency

A test passes in isolation, passes in the usual CI order, and fails the one time someone reorders the suite or inserts a new test before it. The hidden coupling is almost always global state again, except this time it's order-sensitive rather than simply shared: test B only fails if test A ran first and left a side effect B silently depends on.

The fix is to make the dependency visible by breaking it on purpose. Run the suite with randomised test order in CI on a regular schedule, even if the default merge-blocking run keeps deterministic order for speed. Whatever breaks under randomisation is coupled to something it shouldn't be.

One detail trips teams up here: randomised order only helps if the run logs the seed it used. A test that fails under randomisation and can't be reproduced because nobody captured which order produced it is no better than the flake it was meant to catch.
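The seed-logging idea can be sketched in a few lines. This is an illustration of the principle, not how any particular test runner implements it; the function name and the time-based default seed are assumptions.

```python
import random
import time
from typing import List, Optional, Sequence, Tuple


def shuffled_order(test_names: Sequence[str], seed: Optional[int] = None) -> Tuple[int, List[str]]:
    """Return the seed used and the test names in a reproducible random order."""
    if seed is None:
        seed = int(time.time())  # fresh seed per run unless one is supplied
    order = list(test_names)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(order)
    return seed, order


if __name__ == "__main__":
    seed, order = shuffled_order(["test_a", "test_b", "test_c", "test_d"])
    # Log the seed before running anything, so a failing order can be replayed.
    print(f"test order seed: {seed}")
    print(order)
```

Re-running with the logged seed passed in explicitly reproduces the same order, which is what turns a one-off failure into something that can be debugged.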

Resource contention under parallel runners

A test is reliable at one worker per suite and gets flakier as the team adds parallel workers to speed up CI. The signature is a flake rate that scales with parallelism rather than with code changes: two tests grabbing the same port, writing to the same temp file path, or racing for rows in a shared test database.

The fix is giving every parallel worker its own slice of whatever resource is shared: a dedicated schema or database per worker, a port range derived from the worker index, a temp directory scoped to the worker rather than the machine.
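A sketch of deriving those slices from a worker identifier follows. It assumes the runner exposes an ID ending in a number, such as the `gw0`, `gw1` values pytest-xdist puts in the `PYTEST_XDIST_WORKER` environment variable. The base port, range size, and naming scheme are arbitrary choices for illustration.

```python
import os
import re
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerResources:
    index: int
    db_schema: str
    port_range: range
    temp_dir: str


def resources_for_worker(worker_id: str, base_port: int = 20000,
                         ports_per_worker: int = 100) -> WorkerResources:
    """Map a worker ID such as "gw3" to non-overlapping resource names."""
    match = re.search(r"(\d+)$", worker_id)
    index = int(match.group(1)) if match else 0  # unnumbered ID: treat as worker 0
    first_port = base_port + index * ports_per_worker
    return WorkerResources(
        index=index,
        db_schema=f"test_worker_{index}",
        port_range=range(first_port, first_port + ports_per_worker),
        # Path only; the caller is responsible for creating the directory.
        temp_dir=os.path.join(tempfile.gettempdir(), f"test-worker-{index}"),
    )
```

A session-level setup step could call `resources_for_worker(os.environ.get("PYTEST_XDIST_WORKER", "gw0"))` and hand the result to whatever creates the database schema and starts test servers.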

What actually fixes each pattern

Put side by side, the six causes and their fixes barely overlap, which is the actual problem with a single retry policy applied to all of them:

Root cause	What it looks like in CI	The fix that works
Async timing	Failure rate tracks runner load, not code changes, in a test that waits a fixed time	Poll for the real condition with a timeout instead of a fixed sleep
Shared mutable state	Fails only when run after a specific other test, passes alone	Reset state in setup/teardown; avoid mutating shared fixtures
External dependencies	Fails in bursts that match a third party's own incident history	Stub the dependency, or isolate it behind a sandboxed instance
Environment drift	Fails only in CI, never on a laptop	Use the same container image, env vars, and resource limits locally and in CI
Test order dependency	Passes alone and in the usual order, fails when the suite is reordered or a new test runs before it	Run the suite in randomised order on a schedule, logging the seed, to surface coupling
Resource contention	Flake rate rises with parallel worker count	Give each worker its own schema, port range, and temp directory

Six causes of test flakiness, and the fix that addresses each one

Only the first row is something a retry genuinely papers over without much consequence, and even there it's hiding a job that may be slower than the test assumed, which is itself worth knowing.

Quarantine without losing the signal

Quarantining a known-flaky test, marking it skipped or excluded from the merge-blocking run, is a reasonable short-term move. It stops a known problem from blocking unrelated work. It turns into a different kind of problem the moment it has no expiry: a quarantine list with no owner and no fix-by date is how teams end up with tests permanently excluded from the very suite that exists to catch the bug they were written for.

The cheap fix is procedural: every quarantined test gets a ticket, an owner, and a re-evaluation date, and the quarantine list itself gets reviewed the way an on-call rotation reviews open incidents, visibly and on a fixed cadence, not as an afterthought.
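One way to make the expiry enforceable is a small check that runs on a schedule and reports entries past their re-evaluation date. The sketch below assumes the quarantine list is kept as structured data; the field names and the sample entry are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class QuarantineEntry:
    test_name: str
    ticket: str      # tracking ticket ID (hypothetical format)
    owner: str
    review_by: date  # date the quarantine must be re-evaluated


def overdue_entries(entries: Iterable[QuarantineEntry], today: date) -> List[QuarantineEntry]:
    """Return quarantined tests whose review date has passed, oldest first."""
    return sorted(
        (entry for entry in entries if entry.review_by < today),
        key=lambda entry: entry.review_by,
    )


if __name__ == "__main__":
    quarantine = [
        QuarantineEntry("test_checkout_total", "QA-101", "alice", date(2026, 6, 1)),
    ]
    for entry in overdue_entries(quarantine, date.today()):
        print(f"OVERDUE: {entry.test_name} ({entry.ticket}, owner {entry.owner})")
```

Failing a scheduled job when this list is non-empty is one option; posting it to the team channel is a gentler one. Either way the list stops being invisible.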

What to measure instead of "CI is flaky today"

Pipeline-level pass rate is the wrong unit to track, because it blends a hundred reliable tests with the one repeat offender and tells you nothing about which test to fix first. Flake rate per individual test, tracked over rolling builds rather than a single bad day, is the metric that actually points at where to spend an afternoon.

Most CI platforms can surface this without new tooling. A test that fails and then passes on an identical rerun, with no code change in between, is a flaky-test event. Counting those per test name over a couple of weeks turns a vague sense that "CI feels flaky" into a ranked list, and a ranked list is something an engineer can actually act on first thing Monday morning.
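The counting step can be sketched as follows, assuming CI results can be exported as (test name, commit, passed) records. That record shape is an assumption, and this version treats any commit where a test both passed and failed as one flaky event, regardless of which outcome came first.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (test name, commit SHA, whether the test passed)
RunRecord = Tuple[str, str, bool]


def rank_flaky_tests(records: Iterable[RunRecord]) -> List[Tuple[str, int]]:
    """Count, per test, the commits on which it both passed and failed."""
    outcomes: Dict[Tuple[str, str], Set[bool]] = defaultdict(set)
    for test_name, commit, passed in records:
        outcomes[(test_name, commit)].add(passed)

    flaky_events = Counter(
        test_name
        for (test_name, _commit), seen in outcomes.items()
        if seen == {True, False}  # mixed results with no code change in between
    )
    # Most frequent offenders first: this is the ranked list to work from.
    return flaky_events.most_common()
```

A test that fails every time on a given commit is not counted, since a consistent failure points at a real regression rather than a flake.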

The shape of that list matters as much as its length. A flake rate concentrated in three tests is three bugs to fix. A flake rate spread evenly across forty unrelated tests is usually not forty separate bugs at all. It's a shared-infrastructure problem, a slow CI runner, a contended database, a network path with intermittent packet loss, wearing forty different costumes.

The team that builds that list usually finds the same five or six tests responsible for most of the noise, which is a smaller, more specific problem than an unreliable CI pipeline, and a far more solvable one.
