Feature flags in production: the lifecycle teams skip
Adding a flag takes five minutes. Retiring it takes five months. Not because the code is hard.
A feature flag takes five minutes to add and five months to remove. Not because the code is hard. A flag is a boolean check. The delay is organisational: nobody knows if the flag is still doing anything, whose job it is to find out, or what 'done' even means for a flag once its initial rollout finishes.
The result is flag debt: conditional branches for releases that shipped a year ago, A/B tests where someone forgot to pick a winner, kill switches nobody dares touch because no one is sure what breaks when they are flipped. Every active flag in your codebase is a branch every code reviewer has to mentally evaluate. Every incident investigation starts with 'was a flag involved?', and answering that question takes longer as the flag count climbs.
The feature flag lifecycle nobody draws on the whiteboard
The standard diagram for feature flags stops at a green box labelled 'flag is at 100%'. What happens after is left to drift. In practice, most flag lifecycles look like this: an engineer needs to ship something safely, adds a flag, rolls it out, moves to the next project, and never revisits the flag. It reaches 100%, stays there, and becomes invisible. Six months later, someone asks what it does. Nobody remembers. The flag stays forever.
The reason this keeps happening is not lack of discipline. Adding a flag is part of the delivery workflow. It shows up in code review, in CI checks, in the rollout runbook. Retiring a flag is not part of any workflow. There is no moment where 'this flag's job is done' gets formally acknowledged and acted on. The lifecycle ends at 'ship', not at 'clean up'.
Four types of flags, four different lifetimes
Most flag debt comes from treating all flags the same. There are four distinct types, each with a different intended lifetime and a different retirement trigger.
Release flags
Added to gate a new feature during rollout. Their job ends one to two weeks after the flag reaches 100% and the team has confirmed nothing is on fire. These are the easiest type to clean up and the most commonly neglected, because the team has already mentally declared victory and moved on by the time cleanup is due.
Experiment flags
Control which variant a user sees in an A/B or multivariate test. Their job ends when statistical significance is reached, or when the experiment's cutoff date passes, whichever comes first. The failure mode here is less about neglected code and more about a missing decision: nobody reviews the results and picks a winner. Experiments that 'run a bit longer' often run for a year.
Ops flags (kill switches)
Let an engineer disable a code path without a deployment. Unlike release and experiment flags, ops flags are intentionally long-lived. They exist specifically so the team can respond to production incidents faster than a deploy cycle allows. Cleaning these up on the same schedule as release flags is a mistake, and a dangerous one.
Permission flags (entitlement gates)
Control access based on plan tier, feature entitlement, or user segment. These can live for years and often become load-bearing business logic disguised as flags. The retirement trigger for a permission flag is a product decision, not a technical one.
| Type | Intended lifetime | Staleness signal | Retirement trigger |
|---|---|---|---|
| Release | 1-2 weeks post-100% | 100% on for 14+ days, no targeting changes | Rollout confirmed stable; remove conditional |
| Experiment | To cutoff date or stat significance | Past cutoff; no winner declared | Pick winning variant; delete flag |
| Ops / kill switch | Indefinite | N/A; review quarterly | Intentional decision after review |
| Permission / entitlement | Business-driven | Zero active users match targeting rules | Product decision to retire tier or feature |
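One way to make the table enforceable is to record each flag's type alongside the data its staleness signal needs. The following is a minimal sketch, not a reference implementation: FlagRecord, its fields, and staleness_reason are hypothetical names, and the 90-day threshold is an assumed stand-in for 'review quarterly'.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class FlagType(Enum):
    RELEASE = "release"
    EXPERIMENT = "experiment"
    OPS = "ops"
    PERMISSION = "permission"


@dataclass
class FlagRecord:
    key: str
    type: FlagType
    owner: str
    # Release: date the flag reached 100% with no targeting change since.
    fully_on_since: Optional[date] = None
    # Experiment: planned end date and whether a variant has been picked.
    experiment_cutoff: Optional[date] = None
    winner_declared: bool = False
    # Permission: number of active users matching the targeting rules.
    matching_users: Optional[int] = None
    # Ops: date of the last deliberate review.
    last_reviewed: Optional[date] = None


def staleness_reason(flag: FlagRecord, today: date) -> Optional[str]:
    """Return why a flag looks stale under the table's rules, or None."""
    if flag.type is FlagType.RELEASE:
        if flag.fully_on_since and today - flag.fully_on_since >= timedelta(days=14):
            return "100% on for 14+ days"
    elif flag.type is FlagType.EXPERIMENT:
        past_cutoff = flag.experiment_cutoff and today > flag.experiment_cutoff
        if past_cutoff and not flag.winner_declared:
            return "past cutoff, no winner declared"
    elif flag.type is FlagType.OPS:
        # Ops flags never go stale by age; they only fall due for review.
        # 90 days is an assumed approximation of "quarterly".
        if flag.last_reviewed is None or today - flag.last_reviewed > timedelta(days=90):
            return "quarterly review overdue"
    elif flag.type is FlagType.PERMISSION:
        if flag.matching_users == 0:
            return "zero users match targeting rules"
    return None
```

The point of the type field is that the same age means different things: a release flag untouched for a month is a cleanup candidate, while an ops flag untouched for a month is working as intended.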
How flag debt compounds
Flag debt is not just code bloat. Three things make it worse than it looks from the outside.
- Code review overhead. Every flag in the codebase is a branch every reviewer has to reason about. In a codebase with 50 active flags, that is 50 implicit questions during each review: is this still needed? What is the current rollout state? Does this path matter for the change I am looking at?
- Test matrix expansion. Each flag doubles the code paths that tests should cover. In practice, teams do not test every combination, so flag debt accumulates as untested state: code paths that exist in production but have no coverage.
- Incident overhead. In a production incident, the first question is whether a flag was recently changed. With 200 flags and no log of recent changes, answering that question takes 20 minutes. With a clear lifecycle and a small active flag count, it takes 30 seconds.
The less obvious compounding: flag debt clusters. Teams that do not retire flags tend to add more flags to work around the ambiguity of the existing ones. 'We cannot change how feature X behaves because we do not know if flag Y is still in use' leads to flag Z, which gates the new behaviour without touching the uncertain old one. Each generation of flags makes the next harder to clean up.
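To put a number on the test-matrix point above: if flags are independent booleans, n flags give 2^n on/off combinations. The sketch below enumerates them for three flags; the flag keys are hypothetical and the helper is only an illustration of the arithmetic.

```python
from itertools import product


def flag_combinations(flag_keys: list[str]) -> list[dict[str, bool]]:
    """Enumerate every on/off combination for independent boolean flags."""
    return [
        dict(zip(flag_keys, states))
        for states in product([False, True], repeat=len(flag_keys))
    ]


combos = flag_combinations(["new-checkout", "fast-search", "beta-banner"])
print(len(combos))  # 8
print(2 ** 50)      # combination count for 50 independent flags
```

Three flags are still testable exhaustively. Fifty are not, which is why unretired flags turn into untested state rather than extra test cases.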
The staleness signal already in your flag service
Most flag services expose evaluation data via API: when the flag was last evaluated, its current targeting distribution, and when it was last modified. LaunchDarkly, Split, and Flagsmith all have endpoints for this. That is everything needed to detect stale release flags automatically.
A flag that has been 100% on (or 100% off) for 30 or more days without a targeting change, well past the 14-day signal in the table above, is a strong retirement candidate. The service already knows this. The missing piece is a cron job that queries for it and routes the results somewhere actionable.
```python
from datetime import datetime, timedelta, timezone

import httpx

STALE_DAYS = 30
LD_API_KEY = "api-..."
PROJECT_KEY = "your-project"
ENV_KEY = "production"


def parse_last_modified(value) -> datetime:
    """Accept Unix epoch milliseconds or an ISO 8601 string; return aware UTC."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    # fromisoformat() rejects a trailing "Z" before Python 3.11.
    parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def find_stale_release_flags() -> list[dict]:
    resp = httpx.get(
        f"https://app.launchdarkly.com/api/v2/flags/{PROJECT_KEY}",
        headers={"Authorization": LD_API_KEY},
        params={"env": ENV_KEY, "tag": "release"},  # tag your release flags
    )
    resp.raise_for_status()

    cutoff = datetime.now(timezone.utc) - timedelta(days=STALE_DAYS)
    stale = []
    for flag in resp.json()["items"]:
        env = flag["environments"].get(ENV_KEY, {})
        # A missing timestamp falls back to 2000-01-01, so the flag counts as old.
        last_modified = parse_last_modified(
            env.get("lastModified", "2000-01-01T00:00:00Z")
        )
        # "on" only means targeting is enabled; it does not prove every user
        # gets the same variation, so treat results as candidates to verify.
        is_fully_on = env.get("on", False)
        if is_fully_on and last_modified < cutoff:
            stale.append({
                "key": flag["key"],
                "name": flag["name"],
                "last_modified": last_modified.date().isoformat(),
            })
    return stale


if __name__ == "__main__":
    for f in find_stale_release_flags():
        print(f"STALE: {f['key']} (last modified: {f['last_modified']})")
```

For self-hosted flag services, the equivalent data is in their admin API: fields such as lastSeenAt and toggle state per environment are available in the major open-source options. The API shapes differ; the detection logic is identical. A weekly cron job piping output to a Slack channel costs a few hours to set up and replaces roughly one manual flag-debt audit per quarter.
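The Slack half of that weekly job could look something like the sketch below. It assumes a Slack incoming webhook (which accepts a JSON body with a text field) and a list of dicts with key and last_modified entries, as the detection script produces. The function names and the message wording are illustrative, and the standard library is used so the block has no extra dependencies.

```python
import json
import urllib.request


def format_stale_report(stale: list[dict]) -> str:
    """Turn the list of stale flags into a plain-text message."""
    if not stale:
        return "No stale release flags this week."
    lines = [f"{len(stale)} stale release flag(s):"]
    for f in stale:
        lines.append(f"- {f['key']} (last modified: {f['last_modified']})")
    return "\n".join(lines)


def post_to_slack(webhook_url: str, text: str) -> None:
    """POST the message to a Slack incoming webhook URL (assumed to exist)."""
    body = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        resp.read()  # urlopen raises on HTTP errors; the body is not needed
```

Keeping the formatting separate from the HTTP call makes the report easy to test and easy to redirect, for example into a ticket tracker instead of a channel.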
The cleanup playbook
Identifying a stale flag is the easy part. Retiring it has six steps, and teams typically skip the last two.
1. Verify intent. Was this flag left at 100% deliberately, or did it drift there? Ask the last person who modified it. If they have left the company, check the commit history for the flag config.
2. Assign an owner. Flag debt has no natural owner once the original engineer moves on. Name one person responsible for the removal and hold them to the next step.
3. Set a deadline. 'Remove it when we have time' is never. 'Remove it by end of sprint' is a date.
4. Remove the conditional in code. Keep the winning behaviour; delete the losing branch and all flag-evaluation logic. Do not leave the losing variant behind 'just in case'.
5. Update tests. Remove test cases that explicitly exercise the 'off' variant, or that pass flag state in as a parameter. Leaving dead test branches intact is the same as leaving dead code; it just costs future test-run time rather than production overhead.
6. Delete the flag from the service. Not archive: delete. Archiving leaves a ghost in the dashboard that confuses the next person who searches the flag list and factors the key into their planning.
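As an illustration of removing the conditional and updating the tests, here is a before-and-after sketch. The function names, the flag key, and the tax multiplier are all hypothetical; the shape of the change is what matters.

```python
# Before: the flag lookup and the losing branch are still in the code.
def checkout_total_before(prices: list[float], flags: dict[str, bool]) -> float:
    if flags.get("new-tax-calculation", False):
        return round(sum(prices) * 1.2, 2)  # winning behaviour
    return round(sum(prices), 2)            # losing branch, dead at 100%


# After: the winning behaviour is unconditional. The flags parameter is gone,
# so callers and tests can no longer pass flag state in.
def checkout_total_after(prices: list[float]) -> float:
    return round(sum(prices) * 1.2, 2)
```

Dropping the flags parameter from the signature is deliberate: any test that still parameterises flag state stops compiling or fails immediately, which surfaces the test cleanup instead of leaving it optional.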
The step that slows teams down most is step 4 in multi-service codebases. If a flag is evaluated in three services, removal means three separate code changes, three reviews, and three deploys. Schedule them together where possible. Partial removal, where the flag is gone from one service but still live in two others, does not reduce ambiguity; it just makes the codebase inconsistent and keeps the flag alive in the dashboard.
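Before scheduling a multi-service removal, it helps to know which services still mention the key. The sketch below does a plain substring search across checked-out service directories; the function name, the directory layout, and the list of file suffixes are assumptions, and it will miss keys that are built dynamically at runtime.

```python
from pathlib import Path


def find_flag_references(
    service_roots: dict[str, Path],
    flag_key: str,
    suffixes: tuple[str, ...] = (".py", ".ts", ".go", ".yaml"),
) -> dict[str, list[str]]:
    """Map each service name to the files that still mention flag_key."""
    hits: dict[str, list[str]] = {}
    for service, root in service_roots.items():
        for path in sorted(root.rglob("*")):
            if not path.is_file() or path.suffix not in suffixes:
                continue
            try:
                text = path.read_text(encoding="utf-8", errors="ignore")
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            if flag_key in text:
                hits.setdefault(service, []).append(str(path.relative_to(root)))
    return hits
```

Services absent from the result need no change; every service present needs its own code change, review, and deploy, which gives the removal ticket a concrete checklist.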
What 'done' actually looks like
A flag is done when four things are simultaneously true: the winning code path is deployed unconditionally in every service that evaluated the flag; tests pass for the winning path only, without flag-specific branches or parameterised flag state; the flag is deleted from the service (not archived); and the flag key is recorded in a list of retired keys so the name cannot be accidentally reused.
That last point matters more than it sounds. Reusing a flag key that old config files or log parsers still reference is a subtle, hard-to-diagnose regression. Some teams keep a RETIRED_FLAGS constant in a shared module, appended to during each cleanup. A startup check that verifies no live flag key matches a retired one takes a few lines of code and has caught more than one configuration bug.
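A startup check along those lines might look like the following sketch. The keys in RETIRED_FLAGS are made up, and how the set of live keys is obtained (flag service SDK, config file) is left as an assumption.

```python
# Appended to during each cleanup; keys here are hypothetical examples.
RETIRED_FLAGS: frozenset[str] = frozenset({
    "old-checkout-flow",
    "search-v2-rollout",
})


def assert_no_retired_keys(
    live_keys: set[str],
    retired: frozenset[str] = RETIRED_FLAGS,
) -> None:
    """Fail fast at startup if a live flag reuses a retired key."""
    reused = sorted(live_keys & retired)
    if reused:
        raise RuntimeError(
            f"Retired flag keys are live again: {', '.join(reused)}"
        )
```

Calling this once during service startup, with the keys the flag client reports, turns an accidental reuse into an immediate, named failure rather than a behaviour change discovered later.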
The metric worth tracking: median time from 'flag reaches 100%' to 'flag is deleted from service'. For release flags, a healthy median is under 30 days. If that number is climbing, or if you have never measured it because you do not know how many active flags you currently have, the lifecycle system is the gap. A different flag service does not fix a missing lifecycle; it just gives you a nicer dashboard to show the accumulating debt.
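Computing that median needs only two dates per flag. The sketch below assumes you can export, for each release flag, the date it reached 100% and the date it was deleted (None if it is still live); the function name and record shape are illustrative.

```python
from datetime import date
from statistics import median
from typing import Optional


def median_days_to_delete(
    records: list[tuple[date, Optional[date]]],
) -> Optional[float]:
    """Median days from 'reached 100%' to 'deleted', over deleted flags only."""
    durations = [
        (deleted - reached_full).days
        for reached_full, deleted in records
        if deleted is not None
    ]
    return median(durations) if durations else None
```

Because flags that were never deleted are excluded, this number flatters teams with a large undeleted backlog. Tracking the count of records where the deletion date is None alongside the median keeps the metric honest.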