Does context rot affect all LLM models equally?

No. The severity and the location of degradation cliffs vary by model and task type. But Chroma's 2025 research found that all 18 frontier models tested exhibited measurable degradation as context grew — there were no exceptions. Some models handle long context more gracefully than others, but none is immune to the underlying attention mechanics.

If I have a 1M token context window, can I just put everything in?

The research suggests this is a reliable way to get worse results at higher cost. The marketed context window is the ceiling for what the model will accept, not the range over which it performs well. For most production tasks, effective context windows are a fraction of the marketed maximum — sometimes under 10% of it. The right approach is to treat the large window as an emergency capacity reserve, not an operating point.

Does prompt caching interact with context rot?

Prompt caching reduces the token cost of repeated context, but it does not fix the quality degradation. A cached 128K prompt costs less per call than an uncached one, but the model is still attending to 128K tokens and producing output affected by all the mechanisms described above. Cost optimisation and context management are complementary, not substitutes.

What is the safest context budget to start with for a new LLM application?

A reasonable starting point for tasks where output quality matters: 20–30% of the model's marketed context window. For GPT-4o at 128K, that is roughly 25K–38K tokens. Increase from there only with explicit measurement of quality at each step up. For RAG specifically, five high-quality chunks at a strict relevance threshold is a better default than twelve at a looser one.

AI & LLMs | May 31, 2026 | 8 min read | Reviewed May 31, 2026

Context rot is real: what the 18-model study means for production LLM engineering

Every frontier model gets measurably worse as context grows. Here is the production engineering playbook.

By FlowVerify Editorial Team

Every major LLM provider has spent the last two years racing to expand context windows. The headlines followed: 128K tokens, then 1M, then 2M. The implication is that larger is strictly better: more context means the model knows more, attends to more, gets it more right.

Chroma's 2025 research tested that assumption directly. They ran 18 frontier models (GPT-4.1, Claude Opus 4, Gemini 2.5, and others) on a retrieval accuracy task at increasing context lengths. Every one got worse as context grew, and the decline started long before the context window filled. Some models showed severe degradation by 1,000 tokens of surrounding context. Across the full test set, effective context windows fell short of the marketed maximum by up to 99%.

This is the phenomenon the research named context rot: measurable degradation in LLM output quality as input context grows, independent of whether the model's context window is anywhere near capacity.

For engineers building production LLM applications, the implication is specific: filling your context window is not a neutral act. It actively harms output quality and multiplies API costs. The engineering response is not a larger model with a bigger window; it is discipline about what goes into context in the first place.

What context rot actually is

Context rot is distinct from the related problem of running out of context. When input exceeds the model's maximum token limit, the result is a hard error or truncation. You know it happened. Context rot produces no error. The model accepts the input, generates output, and the output is worse, often in ways that look like normal model variance rather than a systematic failure.

Three mechanisms drive it.

The first is the lost-in-the-middle problem, documented in transformer attention research going back years. Models attend reliably to content near the beginning and end of their context. Content placed in the middle receives less consistent attention. A document buried in the centre of a 128K context is attended to less reliably than the same document placed first, even if it is the most relevant content in the input.

The second is semantic interference. Context that is thematically related to the current task but no longer relevant (an old version of a file, a completed step in a multi-step workflow, a conversation turn from an earlier topic) actively misleads the model. This is not just noise; it competes with the relevant signal in attention. The Chroma research found that irrelevant context degrades performance more severely than simply leaving space empty.

The third is instruction weight drift. As context grows, instructions given at the start of the input progressively lose relative weight compared to content that appears closer to the query. This is why long-running agents often stop following their system prompt constraints while continuing to complete tasks: the instructions are still technically present, but attended to with decreasing fidelity.

The Chroma research: 18 models, no exceptions

The study's methodology was straightforward: given a set of facts and a retrieval question, how accurately does the model find the correct answer as the surrounding context volume increases? The target facts stayed constant. The irrelevant surrounding content scaled.

The result was consistent across all 18 models tested. Accuracy dropped at every context length increment. No model handled long context without degradation. Some handled it more gracefully than others, with cliff edges at different points and slopes of different steepness, but the directional finding held universally.

Several specific findings matter for production decisions. First, the degradation is not linear. Performance does not fall gradually and evenly as context grows. There are cliff edges: specific context lengths where accuracy drops sharply. The location of these cliffs varies by model and by task type, which means you cannot rely on general benchmarks to predict your application's specific failure point.

Second, the gap between marketed context length and effective context length is large. The researchers defined the effective context window as the length at which accuracy dropped below a useful threshold. For many models on many tasks, that threshold was reached well under 50K tokens. On some task types, it was reached under 10K.

Third, no model tested was immune. Context rot is not a property of a particular architecture or a particular provider. It is a current property of transformer attention at scale.

Task type	Typical marketed max	Practical effective range	Primary failure mode
Fact retrieval / short Q&A	1M tokens	4K–16K relevant	Lost in the middle, semantic interference
Document summarisation	200K tokens	32K–64K focused text	Calibration drift at edges
Code generation / review	128K tokens	20K–40K codebase	Irrelevant file interference
Multi-turn chat / agent loop	128K tokens	Degrades after ~50–100 meaningful turns	Instruction weight loss
RAG with retrieved chunks	128K tokens	3–8 high-quality chunks	Dilution from low-relevance chunks

Marketed context window vs. practical effective range by task type

These ranges are approximations from the research; exact figures vary by model and task. The pattern holds.

The cost side of filling your context window

Context window costs are linear arithmetic, and the arithmetic is steep at production scale.

Most LLM APIs price on tokens: input tokens, output tokens, or both. Per token, a request with 1K input tokens costs the same as a request with 128K input tokens. In absolute terms, the 128K request costs 128 times as much on the input side, and a 512K request costs 512 times as much as a 1K request.

For products that run high call volumes (coding assistants, document review tools, customer support agents, structured data extraction pipelines) context management is not a nice-to-have optimisation. It is the unit economics of the product.
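To make the arithmetic concrete, here is a minimal sketch of input-side cost at volume. The price per million tokens and the daily call volume are hypothetical figures chosen only for illustration; substitute your provider's actual rates.

```python
def input_cost_usd(input_tokens: int, price_per_million_usd: float) -> float:
    """Input-side cost of one request under flat per-token pricing."""
    return input_tokens / 1_000_000 * price_per_million_usd


# Hypothetical figures, chosen only to make the scaling visible.
PRICE_PER_MILLION_USD = 2.00
CALLS_PER_DAY = 50_000

for tokens in (1_000, 16_000, 128_000):
    per_call = input_cost_usd(tokens, PRICE_PER_MILLION_USD)
    per_day = per_call * CALLS_PER_DAY
    print(f"{tokens:>7} tokens: ${per_call:.4f} per call, ${per_day:,.2f} per day")
```

The per-token price never changes across the three rows; only the volume of context sent on each call does.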

The production engineering playbook

Given the above, the engineering goal shifts from 'how do I fit more into context?' to 'what can I safely keep out?' Several patterns address this directly.

Rolling windows, not full history

For multi-turn conversations and long-running agents, keep a rolling window that is functional rather than chronological: 'the turns that remain relevant to the current task,' not 'the last N turns.' When a topic changes or a subtask completes, the turns from the previous phase should exit the active context. They are not deleted (store them), but they are not included in the next API call.

The implementation choice that matters is whether the trim decision is token-based or turn-count-based. Turn count is simpler to reason about but lets very long individual turns blow the budget. Token-based trimming is more reliable: you know what you are sending.
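One way to sketch a functional, token-trimmed window is shown below. It assumes each stored turn carries a phase label (a hypothetical field, not something the research prescribes) and uses a crude character-based token estimate as a stand-in for a real tokenizer.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Turn:
    role: str
    content: str
    phase: str  # hypothetical label for the subtask this turn belongs to


def estimate_tokens(text: str) -> int:
    # Crude stand-in (about 4 characters per token); use your model's tokenizer.
    return max(1, len(text) // 4)


def active_window(turns: list[Turn], current_phase: str, token_budget: int) -> list[Turn]:
    """Keep turns from the current phase, newest first, until the budget is spent."""
    kept: list[Turn] = []
    used = 0
    for turn in reversed(turns):
        if turn.phase != current_phase:
            continue  # earlier phases stay in storage, not in the next call
        cost = estimate_tokens(turn.content)
        if used + cost > token_budget:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # back to chronological order
    return kept
```

The full `turns` list is never mutated, so earlier phases remain available for storage or later summarisation.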

Ranked retrieval with a hard chunk limit

In RAG systems, including more retrieved chunks does not compensate for including poorly ranked ones. Three chunks above a 0.85 similarity threshold outperform twelve chunks of mixed relevance in most retrieval benchmarks. The cut-off should be aggressive. If you cannot articulate how a given chunk answers the current query, leave it out.

A practical default: set a maximum of five chunks, require a minimum similarity score, and treat any chunk below the threshold as a discard rather than a fallback. Below-threshold chunks are not better than nothing — they are often worse than nothing.
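A minimal sketch of that default follows. The function name and the 0.85 default are illustrative assumptions; the right threshold depends on your embedding model and how its scores are distributed.

```python
from __future__ import annotations


def select_chunks(
    scored: list[tuple[str, float]],
    min_score: float = 0.85,  # illustrative threshold, tune per embedding model
    max_chunks: int = 5,
) -> list[str]:
    """Return at most max_chunks chunk texts, best first, all at or above min_score."""
    passing = [(text, score) for text, score in scored if score >= min_score]
    passing.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in passing[:max_chunks]]
```

If nothing clears the threshold, the function returns an empty list rather than falling back to the best of a weak set, which matches the discard-not-fallback rule above.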

Summarise at turn boundaries

For long agent workflows, summarise conversation history at regular checkpoints (every 20 turns, or at the natural end of a subtask) and replace the raw turn history with the summary before the next phase. The model loses the exact wording of earlier exchanges, which is usually fine: the substance of what was agreed is preserved at a fraction of the token cost.

This pattern also partially addresses instruction weight drift. A fresh summarised context resets the relative weight of the current instructions.
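The checkpoint step might look like the sketch below. The `summarise` callable is a placeholder you would supply (typically a separate model call), and both the `keep_recent` parameter and the choice to carry the summary as a system message are assumptions of this example, not part of the pattern as described.

```python
from __future__ import annotations

from typing import Callable

Message = dict[str, str]


def compact_history(
    history: list[Message],
    summarise: Callable[[list[Message]], str],  # placeholder: supply your own
    checkpoint_every: int = 20,
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older turns with one summary message once a checkpoint is reached."""
    if len(history) < checkpoint_every:
        return history
    split = max(0, len(history) - keep_recent)
    older, recent = history[:split], history[split:]
    summary = summarise(older)
    summary_message = {
        "role": "system",  # assumption: some APIs expect a different role here
        "content": f"Summary of earlier conversation:\n{summary}",
    }
    return [summary_message] + recent
```

Keeping the last few raw turns alongside the summary preserves exact wording where it is most likely to matter; set `keep_recent=0` to replace the history entirely.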

Pin instructions at both ends

Important instructions should appear at both the start and the end of the context. This is a direct response to the attention pattern: content at the edges receives more consistent attention than content in the middle. A system prompt pinned at both ends of a long context is attended to more reliably than a system prompt that appears only at the start.

This is not duplication for aesthetics. It is an engineering response to a documented attention characteristic.
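As a rough sketch, pinning can be a small wrapper around message assembly. The reminder wording and the role used for the trailing message are assumptions; some chat APIs accept only one system message, in which case the reminder would need to be appended to the final user message instead.

```python
from __future__ import annotations


def pin_instructions(
    instructions: str,
    messages: list[dict[str, str]],
    reminder_role: str = "user",  # assumption: pick the role your API allows last
) -> list[dict[str, str]]:
    """Place the instructions at the start and repeat them at the end."""
    head = {"role": "system", "content": instructions}
    tail = {
        "role": reminder_role,
        "content": f"Reminder of the instructions to follow:\n{instructions}",
    }
    return [head, *messages, tail]
```

The repeated copy costs tokens on every call, so this is most worthwhile for short, high-priority constraints rather than an entire long system prompt.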

Set a context budget as a first-class parameter

Rather than treating context length as an implicit default, make it an explicit budget with a hard ceiling. For each step in a pipeline, calculate the expected token count before making the call. If the calculation exceeds the budget, reduce scope.

The budget should be set well below the marketed maximum. A practical starting point for tasks where accuracy matters: 20–30% of the model's marketed context window. For tasks where occasional quality degradation is acceptable, you can push higher, but with explicit awareness that you are trading quality for throughput.

context_builder.py

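from __future__ import annotations


def count_tokens(text: str) -> int:
    # Rough placeholder (about 4 characters per token); swap in the
    # tokenizer that matches your model for accurate counts.
    return max(1, len(text) // 4)


def build_context(
    system_prompt: str,
    history: list[dict],
    retrieved_chunks: list[str],
    max_tokens: int = 16_000,  # budget, not the model's maximum
    chunk_limit: int = 5,
    min_chunk_score: float = 0.82,
    scored_chunks: list[tuple[str, float]] | None = None,
) -> list[dict]:
    context = [{"role": "system", "content": system_prompt}]
    used = count_tokens(system_prompt)

    # Fill conversation history from most recent, keeping the system prompt
    # plus history within 60% of the budget
    history_budget = int(max_tokens * 0.6)
    recent = []
    for turn in reversed(history):
        turn_tokens = count_tokens(turn["content"])
        if used + turn_tokens > history_budget:
            break
        recent.append(turn)
        used += turn_tokens
    context.extend(reversed(recent))  # restore chronological order

    # Add high-quality chunks only. Unscored chunks are treated as score 1.0,
    # so pass scored_chunks whenever the threshold should apply. An empty
    # scored_chunks list means "nothing qualified", not "fall back to unscored".
    if scored_chunks is not None:
        chunks_to_add = scored_chunks
    else:
        chunks_to_add = [(c, 1.0) for c in retrieved_chunks]
    added = 0
    for chunk, score in sorted(chunks_to_add, key=lambda x: -x[1]):
        if score < min_chunk_score or added >= chunk_limit:
            break  # sorted by score, so nothing later can qualify
        chunk_tokens = count_tokens(chunk)
        if used + chunk_tokens > max_tokens:
            break
        context.append({"role": "user", "content": f"Relevant context:\n{chunk}"})
        used += chunk_tokens
        added += 1

    return context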

When large context is actually the right tool

Not every application is equally affected by context rot, and not every case calls for aggressive trimming.

Single-pass document analysis, where you process a contract or codebase once and then discard the context, is a legitimate use for large context windows. The degradation is real, but for a single pass with a clear output schema, the quality loss may fall within acceptable bounds. The cost is a one-time charge, not a per-call multiplier.

Needle-in-a-haystack tasks where the relevant content is placed deliberately at the beginning or end of context, not buried in the middle, sidestep the worst of the lost-in-the-middle problem. If you can control placement, you can partially control degradation.

Batch processing jobs where throughput matters more than per-request accuracy are also a reasonable fit. If you are processing 10,000 documents and 90% accuracy is sufficient, the cost savings from batching into larger contexts may outweigh the quality loss.

These are exceptions, not defaults. For any application where output quality matters on individual requests, and especially for agentic workflows running over many turns, the default assumption should be: smaller, curated context outperforms larger, comprehensive context on accuracy, cost, and latency simultaneously.

What this means for how you build

Context management has historically been treated as a late-stage optimisation concern, something you revisit after the product is working and the API bills arrive. The Chroma research suggests it should be a first-class design decision, addressed at the architecture stage.

The practical implication: when you sketch out an LLM pipeline, the context budget for each step is as important a design parameter as the model choice, the output schema, or the prompt structure. A well-managed 16K context will outperform a poorly-managed 128K context on most tasks — at lower cost and with lower latency.

The large context window is a useful upper bound and an occasionally valuable capability. It is not a substitute for knowing what your model actually needs to see.

Frequently asked questions


