Context rot is real: what the 18-model study means for production LLM engineering
Every frontier model gets measurably worse as context grows. Here is the production engineering playbook.
Every major LLM provider has spent the last two years racing to expand context windows. The headlines followed: 128K tokens, then 1M, then 2M. The implication is that larger is strictly better: more context means the model knows more, attends to more, gets it more right.
Chroma's 2025 research tested that assumption directly. They ran 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5, and others) on a retrieval accuracy task at increasing context lengths. Every single one got worse as context grew, and the decline began long before the context window was full. Some models showed severe degradation by 1,000 tokens of surrounding context. Across the full test set, the effective context window fell short of the marketed maximum by up to 99%.
This is the phenomenon the research named context rot: measurable degradation in LLM output quality as input context grows, independent of whether the model's context window is anywhere near capacity.
For engineers building production LLM applications, the implication is specific: filling your context window is not a neutral act. It actively harms output quality and multiplies API costs. The engineering response is not a larger model with a bigger window; it is discipline about what goes into context in the first place.
What context rot actually is
Context rot is distinct from a related but different problem: running out of context. When input exceeds the model's maximum token limit, the result is a hard error or truncation. You know it happened. Context rot produces no error. The model accepts the input, generates output, and the output is worse, often in ways that look like normal model variance rather than a systematic failure.
Three mechanisms drive it.
The first is the lost-in-the-middle problem, documented in transformer attention research going back years. Models attend reliably to content near the beginning and end of their context. Content placed in the middle receives less consistent attention. A document buried in the centre of a 128K context is attended to less reliably than the same document placed first, even if it is the most relevant content in the input.
The second is semantic interference. Context that is thematically related to the current task but no longer relevant (an old version of a file, a completed step in a multi-step workflow, a conversation turn from an earlier topic) actively misleads the model. This is not just noise; it competes with the relevant signal in attention. The Chroma research found that irrelevant context degrades performance more severely than simply leaving space empty.
The third is instruction weight drift. As context grows, instructions given at the start of the input progressively lose relative weight compared to content that appears closer to the query. This is why long-running agents often stop following their system prompt constraints while continuing to complete tasks: the instructions are still technically present, but attended to with decreasing fidelity.
The Chroma research: 18 models, no exceptions
The study's methodology was straightforward: given a set of facts and a retrieval question, how accurately does the model find the correct answer as the surrounding context volume increases? The target facts stayed constant. The irrelevant surrounding content scaled.
The result was consistent across all 18 models tested. Accuracy dropped at every context length increment. No model handled long context without degradation. Some handled it more gracefully than others: the cliff edges sat at different points and the slopes differed. But the directional finding held universally.
Several specific findings matter for production decisions. First, the degradation is not linear. Performance does not fall gradually and evenly as context grows. There are cliff edges: specific context lengths where accuracy drops sharply. The location of these cliffs varies by model and by task type, which means you cannot rely on general benchmarks to predict your application's specific failure point.
Second, the gap between marketed context length and effective context length is large. The researchers defined the effective context window as the length at which accuracy dropped below a useful threshold. For many models on many tasks, that threshold was reached well under 50K tokens. On some task types, it was reached under 10K.
Third, no model tested was immune. Context rot is not a property of a particular architecture or a particular provider. It is a current property of transformer attention at scale.
| Task type | Typical marketed max | Practical effective range | Primary failure mode |
|---|---|---|---|
| Fact retrieval / short Q&A | 1M tokens | 4K–16K relevant | Lost in the middle, semantic interference |
| Document summarisation | 200K tokens | 32K–64K focused text | Calibration drift at edges |
| Code generation / review | 128K tokens | 20K–40K codebase | Irrelevant file interference |
| Multi-turn chat / agent loop | 128K tokens | Degrades after ~50–100 meaningful turns | Instruction weight loss |
| RAG with retrieved chunks | 128K tokens | 3–8 high-quality chunks | Dilution from low-relevance chunks |
These ranges are approximations from the research; exact figures vary by model and task. The pattern holds.
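Because the cliff locations vary by model and task, one option is to measure them on your own workload. The sketch below is a minimal needle-in-a-haystack probe, not the study's methodology: `ask_model` is a hypothetical callable wrapping whichever API you use, and the repeated filler, the depth grid, and the substring match are all simplifying assumptions.

```python
from typing import Callable


def build_probe(needle: str, filler: str, filler_repeats: int, depth: float) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) inside repeated filler."""
    padding = [filler] * filler_repeats
    position = round(depth * filler_repeats)
    return "\n".join(padding[:position] + [needle] + padding[position:])


def accuracy_by_length(
    ask_model: Callable[[str], str],  # hypothetical: wraps your provider's API call
    needle: str,
    question: str,
    expected: str,
    filler: str,
    lengths: list[int],
    depths: tuple[float, ...] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[int, float]:
    """Return the fraction of correct answers for each filler length."""
    results: dict[int, float] = {}
    for repeats in lengths:
        hits = 0
        for depth in depths:
            prompt = build_probe(needle, filler, repeats, depth) + "\n\n" + question
            # Crude scoring: count a hit if the expected string appears in the reply.
            if expected.lower() in ask_model(prompt).lower():
                hits += 1
        results[repeats] = hits / len(depths)
    return results
```

Plotting the returned dictionary for your own needle, filler, and model gives a rough picture of where accuracy starts to fall. Filler that is thematically close to the needle is likely a harsher test than generic text, given the semantic interference finding.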
The cost side of filling your context window
Context window costs are linear arithmetic, and the arithmetic is steep at production scale.
Most LLM APIs price on tokens: input tokens, output tokens, or both. The per-token price is the same whether a request carries 1K or 128K input tokens, so input cost scales linearly with context length: the 128K request costs 128 times as much as the 1K request, and a 512K request costs 512 times as much.
For products that run high call volumes (coding assistants, document review tools, customer support agents, structured data extraction pipelines) context management is not a nice-to-have optimisation. It is the unit economics of the product.
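The arithmetic is easy to run for your own traffic. In this sketch the price and request volume are hypothetical placeholders, not any provider's actual rates, and output tokens are ignored.

```python
def monthly_input_cost(
    tokens_per_request: int,
    requests_per_day: int,
    price_per_million_tokens: float,
    days: int = 30,
) -> float:
    """Input-token spend over `days`, in the same currency as the price."""
    total_tokens = tokens_per_request * requests_per_day * days
    return total_tokens * price_per_million_tokens / 1_000_000


PRICE = 2.00        # hypothetical: USD per million input tokens
REQUESTS = 50_000   # hypothetical: requests per day

for size in (1_000, 16_000, 128_000):
    cost = monthly_input_cost(size, REQUESTS, PRICE)
    print(f"{size:>7} tokens/request: ${cost:,.2f}/month")
```

At these assumed numbers the monthly input bill is $3,000 at 1K tokens per request, $48,000 at 16K, and $384,000 at 128K.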
The production engineering playbook
Given the above, the engineering goal shifts from 'how do I fit more into context?' to 'what can I safely keep out?' Several patterns address this directly.
Rolling windows, not full history
For multi-turn conversations and long-running agents, keep a rolling window that is functional, not chronological: 'the turns that remain relevant to the current task,' not 'the last N turns.' When a topic changes or a subtask completes, the turns from the previous phase should exit the active context. They are not deleted (store them), but they are not included in the next API call.
The implementation choice that matters is whether the trim decision is token-based or turn-count-based. Turn count is simpler to reason about, but a few very long turns can blow the budget. Token-based trimming is more reliable: you know what you are sending.
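A minimal sketch of a functional, token-trimmed window follows. It assumes your application labels each turn with a `task_id` (a hypothetical field; the hard part in practice is deciding relevance), and `estimate_tokens` is a rough stand-in for a real tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    content: str
    task_id: str  # hypothetical label assigned by your application


def estimate_tokens(text: str) -> int:
    # Rough stand-in (about 4 characters per token); swap in your model's tokenizer.
    return max(1, len(text) // 4)


def active_window(history: list[Turn], current_task: str, token_budget: int) -> list[Turn]:
    """Keep turns for the current task, newest first, until the token budget is spent."""
    relevant = [turn for turn in history if turn.task_id == current_task]
    kept: list[Turn] = []
    used = 0
    for turn in reversed(relevant):
        cost = estimate_tokens(turn.content)
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

The full `history` stays in storage untouched; only the return value goes into the next API call.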
Ranked retrieval with a hard chunk limit
In RAG systems, including more retrieved chunks does not compensate for including poorly-ranked ones. Three chunks above a 0.85 similarity threshold outperform twelve chunks of mixed relevance in most retrieval benchmarks. The cut-off should be aggressive. If you cannot articulate how a given chunk answers the current query, leave it out.
A practical default: set a maximum of five chunks, require a minimum similarity score, and treat any chunk below the threshold as a discard rather than a fallback. Below-threshold chunks are not better than nothing; they are often worse.
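That default can be sketched in a few lines. The function name and the 0.85 default are illustrative assumptions; a sensible threshold depends on your embedding model and similarity metric.

```python
def select_chunks(
    scored_chunks: list[tuple[str, float]],
    max_chunks: int = 5,
    min_score: float = 0.85,  # assumption: tune for your embedding model
) -> list[str]:
    """Return at most `max_chunks` chunk texts scoring at least `min_score`, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    passing = [text for text, score in ranked if score >= min_score]
    return passing[:max_chunks]
```

Note that the function returns an empty list when nothing clears the threshold. It does not fall back to the best of a weak set.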
Summarise at turn boundaries
For long agent workflows, summarise conversation history at regular checkpoints (every 20 turns, or at the natural end of a subtask) and replace the raw turn history with the summary before the next phase. The model loses the exact wording of earlier exchanges, which is usually fine: the substance of what was agreed is preserved at a fraction of the token cost.
This pattern also partially addresses instruction weight drift. A fresh summarised context resets the relative weight of the current instructions.
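One way to sketch the checkpoint step is below. `summarise` is a hypothetical callable (typically another model call) that you supply. Keeping the last few raw turns alongside the summary is an added assumption rather than part of the pattern as described, as is carrying the summary in a `user` message.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    history: list[Message],
    summarise: Callable[[list[Message]], str],  # hypothetical: e.g. a cheap model call
    checkpoint_every: int = 20,
    keep_recent: int = 4,  # assumption: retain a few raw turns for continuity
) -> list[Message]:
    """Replace older turns with a summary once the history reaches a checkpoint."""
    if len(history) < checkpoint_every:
        return history
    split = max(0, len(history) - keep_recent)
    older, recent = history[:split], history[split:]
    summary = summarise(older)
    summary_message = {
        "role": "user",
        "content": f"Summary of earlier conversation:\n{summary}",
    }
    return [summary_message] + recent
```

Calling this at the natural end of a subtask, rather than only on a fixed turn count, keeps the summary aligned with what was actually agreed.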
Pin instructions at both ends
Important instructions should appear at both the start and the end of the context. This is a direct response to the attention pattern: content at the edges receives more consistent attention than content in the middle. A system prompt pinned at both ends of a long context is attended to more reliably than a system prompt that appears only at the start.
This is not duplication for aesthetics. It is an engineering response to a documented attention characteristic.
Set a context budget as a first-class parameter
Rather than treating context length as an implicit default, make it an explicit budget with a hard ceiling. For each step in a pipeline, calculate the expected token count before making the call. If the calculation exceeds the budget, reduce scope.
The budget should be set well below the marketed maximum. A practical starting point for tasks where accuracy matters: 20–30% of the model's marketed context window. For tasks where occasional quality degradation is acceptable, you can push higher, but with explicit awareness that you are trading quality for throughput.
```python
def count_tokens(text: str) -> int:
    # Stand-in so the example runs: assumes roughly 4 characters per token.
    # Replace with your model's tokenizer for real budgets.
    return max(1, len(text) // 4)


def build_context(
    system_prompt: str,
    history: list[dict],
    retrieved_chunks: list[str],
    max_tokens: int = 16_000,  # budget, not the model's maximum
    chunk_limit: int = 5,
    min_chunk_score: float = 0.82,
    scored_chunks: list[tuple[str, float]] | None = None,
) -> list[dict]:
    context = [{"role": "system", "content": system_prompt}]
    used = count_tokens(system_prompt)

    # Fill conversation history from most recent, keeping the system prompt
    # plus history within 60% of the budget
    history_budget = int(max_tokens * 0.6)
    recent = []
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"])
        if used + turn_tokens > history_budget:
            break
        recent.append(turn)
        used += turn_tokens
    recent.reverse()  # restore chronological order
    context.extend(recent)

    # Add high-quality chunks only. Unscored chunks are treated as score 1.0;
    # an explicitly empty scored_chunks list means "no chunks".
    if scored_chunks is None:
        candidates = [(c, 1.0) for c in retrieved_chunks]
    else:
        candidates = scored_chunks
    added = 0
    for chunk, score in sorted(candidates, key=lambda x: -x[1]):
        # Sorted by descending score, so the first failure ends the loop
        if score < min_chunk_score or added >= chunk_limit:
            break
        chunk_tokens = count_tokens(chunk)
        if used + chunk_tokens > max_tokens:
            break
        context.append({"role": "user", "content": f"Relevant context:\n{chunk}"})
        used += chunk_tokens
        added += 1
    return context
```
When large context is actually the right tool
Not every application is equally affected by context rot, and not every case calls for aggressive trimming.
Single-pass document analysis, where you process a contract or codebase once and then discard the context, is a legitimate use for large context windows. The degradation is real, but for a single pass with a clear output schema, the quality loss may fall within acceptable bounds. The cost is a one-time charge, not a per-call multiplier.
Needle-in-a-haystack tasks where the relevant content is placed deliberately at the beginning or end of context, not buried in the middle, sidestep the worst of the lost-in-the-middle problem. If you can control placement, you can partially control degradation.
Batch processing jobs where throughput matters more than per-request accuracy are also a reasonable fit. If you are processing 10,000 documents and 90% accuracy is sufficient, the cost savings from batching into larger contexts may outweigh the quality loss.
These are exceptions, not defaults. For any application where output quality matters on individual requests, and especially for agentic workflows running over many turns, the default assumption should be: smaller, curated context outperforms larger, comprehensive context on accuracy, cost, and latency simultaneously.
What this means for how you build
Context management has historically been treated as a post-optimisation concern, something you revisit after the product is working and the API bills arrive. The Chroma research suggests it should be a first-class design decision, addressed at the architecture stage.
The practical implication: when you sketch out an LLM pipeline, the context budget for each step is as important a design parameter as the model choice, the output schema, or the prompt structure. A well-managed 16K context will outperform a poorly-managed 128K context on most tasks, at lower cost and with lower latency.
The large context window is a useful upper bound and an occasionally valuable capability. It is not a substitute for knowing what your model actually needs to see.