AI agents advertise a 200K-token context window. The reliable number is closer to 130K.
What 2026's agent-memory benchmarks actually show about tool overhead, model degradation, and the compression architectures that recover the gap.
Why your context window is not your memory budget
Every major model now ships with a context window measured in hundreds of thousands of tokens. GPT, Claude, and Gemini variants advertise 200,000-token windows or larger, and the implication is that this number is the agent’s memory: how much history, instructions, and tool output it can hold before something falls out the back. Teams building production agents tend to plan capacity around that figure. The 2026 benchmarks say it is the wrong number to plan around.
An AgentMarketCap analysis of production agent deployments tested twelve models against tasks at increasing context lengths and found eleven of them dropped below 50% task performance once the conversation passed roughly 32,000 tokens, about a sixth of a 200K window. One of the more closely watched long-context models, GPT-4, lost 15.4% of its accuracy moving from a 4,000-token prompt to a 128,000-token one, well inside its advertised range. The same analysis put the practical reliability threshold at around 130,000 tokens within a 200,000-token window: the point past which accuracy and latency both start working against the agent instead of for it.
Teams do not reach for 200,000 tokens by guessing. It is the number printed in the model card, the number that leads every release announcement, and the number that shows up first in API documentation. Actual token accounting is nowhere near as visible: most teams find out how much of their budget tool schemas and system prompts are eating only after building a dashboard for it, usually once a production incident makes the gap impossible to ignore.
The exact shape of the gap differs by provider and by how aggressively a given model degrades past its sweet spot, but the direction is the same everywhere tested: window size is a ceiling, not a guarantee. Treating it as a guarantee is how teams end up debugging an agent that "used to work," once a session runs long enough to reach the part of the window that was never reliable to begin with.
The gap between 200,000 and 130,000 is not a rounding error. It is roughly a third of the advertised window, and it is the part of the budget that decides whether a long-running agent stays coherent on day three of a workflow or starts repeating itself.
Where the budget goes before the first reply
Tool definitions are the first claim on that budget, and they are paid before a user types anything. Every tool an agent can call needs a name, a parameter schema, a description, and usually a few examples written into the system prompt. Connect an agent to four or five MCP servers and each one appends its own tool catalogue to that prompt. In one production system the AgentMarketCap report examined, tool definitions consumed roughly 72% of a 200,000-token window, leaving about 57,000 tokens for the actual conversation, before accounting for any of the model’s own degradation past 32,000 tokens. A separate system in the same report had quietly lost 85,000 tokens of usable task context to the same kind of overhead.
The schemas add up fast in practice. A file-system tool, a web-search tool, a code-execution sandbox, a ticketing-system integration, and a calendar tool are a modest toolkit for a general-purpose agent, and each one typically needs several hundred tokens once its parameters, description, and a couple of usage examples are written out in full. Five tools at 600 tokens apiece is already 3,000 tokens before any MCP server gets involved. A handful of MCP servers, each exposing ten or more tools of their own, can multiply that several times over without anyone deciding to spend the budget that way.
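The arithmetic above can be sketched as a small budget calculator. This is an illustration, not a measurement: the function name `remaining_context` is hypothetical, the 600-token figure is the per-tool estimate used above, and the MCP setup (four servers, ten tools each, at the same per-tool cost) is an assumption chosen to match the ranges mentioned in the text.

```python
def remaining_context(window_tokens: int,
                      tool_schema_tokens: list[int],
                      system_prompt_tokens: int = 0) -> int:
    """Tokens left for conversation after fixed overhead is paid."""
    overhead = sum(tool_schema_tokens) + system_prompt_tokens
    return max(window_tokens - overhead, 0)


# Five built-in tools at ~600 tokens each, as in the example above.
builtin_tools = [600] * 5

# Assumption: four MCP servers, each exposing ten tools at ~600 tokens.
mcp_tools = [600] * (4 * 10)

print(sum(builtin_tools))                                     # 3000
print(remaining_context(200_000, builtin_tools + mcp_tools))  # 173000
```

Real schemas vary widely in size, so the useful part is the habit of counting the overhead explicitly rather than the specific totals printed here.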
Three ways to spend what is left
Once the tool tax is paid, what remains needs a spending plan. The field has mostly converged on three approaches, and each one makes a different trade between coherence, latency, and cost.
Sliding-window truncation keeps the most recent turns and drops everything older. It is the cheapest to build and adds no latency, but it has no memory of turn 5 by the time the agent reaches turn 40. That is fine for a stateless lookup but a poor fit for anything that builds on earlier decisions.
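A minimal sketch of sliding-window truncation, assuming history is a plain list of turn strings and the window is counted in turns rather than tokens (a production version would more likely count tokens). The function name `sliding_window` is illustrative.

```python
def sliding_window(history: list[str], max_turns: int) -> list[str]:
    """Keep only the most recent max_turns turns; everything older is dropped."""
    if max_turns <= 0:
        return []
    return history[-max_turns:]


turns = [f"turn {i}" for i in range(1, 41)]
kept = sliding_window(turns, 10)

print(kept[0], kept[-1])  # turn 31 turn 40
print("turn 5" in kept)   # False
```

The last line is the whole trade-off: by turn 40, turn 5 is simply gone.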
Hierarchical summarisation keeps a hot layer of recent turns verbatim, compresses the next block into rolling summaries, and compresses everything earlier again. Tested against 36,000 real engineering-session transcripts, an iterative version of this approach, one that updates summaries incrementally rather than reconstructing them from scratch each time, outperformed full-context reconstruction specifically on long sessions.
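The layering described above might look something like the sketch below. Everything here is an assumption for illustration: the class name `LayeredMemory`, the layer sizes, and especially `stub_summarize`, which stands in for a real model call. What it does show is the incremental update: each step summarises only the block that just aged out, never the whole history.

```python
from typing import Callable


def stub_summarize(texts: list[str]) -> str:
    """Placeholder summariser; a real agent would call a model here."""
    return f"[summary of {len(texts)} item(s)]"


class LayeredMemory:
    def __init__(self, summarize: Callable[[list[str]], str] = stub_summarize,
                 hot_size: int = 4, block_size: int = 4, warm_size: int = 3) -> None:
        self.summarize = summarize
        self.hot_size = hot_size      # recent turns kept verbatim
        self.block_size = block_size  # turns folded into one warm summary
        self.warm_size = warm_size    # warm summaries kept before folding into cold
        self.hot: list[str] = []
        self.warm: list[str] = []
        self.cold: str = ""

    def add_turn(self, turn: str) -> None:
        self.hot.append(turn)
        # When the hot layer overflows, fold its oldest block into a warm summary.
        if len(self.hot) >= self.hot_size + self.block_size:
            block = self.hot[:self.block_size]
            self.hot = self.hot[self.block_size:]
            self.warm.append(self.summarize(block))
        # When the warm layer overflows, compress its oldest summary into cold.
        if len(self.warm) > self.warm_size:
            oldest = self.warm.pop(0)
            parts = [self.cold, oldest] if self.cold else [oldest]
            self.cold = self.summarize(parts)

    def context(self) -> list[str]:
        """Cold summary first, then warm summaries, then verbatim recent turns."""
        return ([self.cold] if self.cold else []) + self.warm + self.hot


memory = LayeredMemory()
for i in range(1, 41):
    memory.add_turn(f"turn {i}")

print(memory.hot)             # ['turn 37', 'turn 38', 'turn 39', 'turn 40']
print(len(memory.context()))  # 8
```

After forty turns the context holds eight entries instead of forty: one cold summary, three warm summaries, and the last four turns verbatim.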
Selective memory offloading treats the context window as working memory and an external vector or graph store as disk: instead of carrying the full history forward, the agent retrieves only what is relevant to the current turn.
| Strategy | Mechanism | Best fit | Where it loses |
|---|---|---|---|
| Sliding-window truncation | Keep last N turns, drop the rest | Stateless lookups, single-turn tasks | No memory of anything outside the window |
| Hierarchical summarisation | Hot / warm / cold layers, compressed as turns age | Long interactive sessions with one coherent thread | Summary quality caps recall accuracy |
| Selective memory offloading | Context as RAM, vector or graph store as disk | Multi-day or multi-session agents | Retrieval quality becomes the new bottleneck |
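Selective memory offloading, the third row of the table, can be sketched with a toy in-memory store. This is a minimal illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `MemoryStore` is a hypothetical class rather than any vendor's API, and the stored notes are invented examples.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Holds everything outside the context window; returns only what matches."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._items.append((text, embed(text)))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = MemoryStore()
store.add("customer prefers email contact over phone")
store.add("refund for order 1182 approved on tuesday")
store.add("deployment uses the staging cluster first")

print(store.retrieve("refund status for order 1182", k=1))
# ['refund for order 1182 approved on tuesday']
```

Only the retrieved note enters the prompt for the current turn, which is why retrieval quality, not window size, becomes the limiting factor.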
The benchmark that actually matters: accuracy versus latency versus cost
Mem0's 2026 benchmark is the clearest illustration of what offloading actually buys, because it tested the same tasks under three conditions: replaying the full conversation history every turn, retrieving from a vector store, and retrieving from a graph store.
| Approach | Accuracy | p95 latency | Tokens / conversation |
|---|---|---|---|
| Full context replay | ~67% | 17.12s | ~26,000 |
| Vector retrieval | 66.9% | 1.44s | ~1,800 |
| Graph retrieval | 68.4% | 2.59s | ~1,800 |
Vector retrieval matched full-context accuracy within a single percentage point while cutting p95 latency by roughly 12x and token consumption by roughly 93%. Graph retrieval did slightly better on accuracy at a small latency cost. Neither result fits the usual assumption that compression is a quality tax paid for a cost saving. In this benchmark, it was close to free.
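The ratios quoted above follow directly from the table. A quick check, using the table's figures as inputs (the variable and function names are illustrative):

```python
# Figures taken from the benchmark table above.
full = {"p95_s": 17.12, "tokens": 26_000}
vector = {"p95_s": 1.44, "tokens": 1_800}
graph = {"p95_s": 2.59, "tokens": 1_800}


def speedup(baseline: dict, other: dict) -> float:
    """How many times faster the other approach is at p95."""
    return baseline["p95_s"] / other["p95_s"]


def token_saving(baseline: dict, other: dict) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - other["tokens"] / baseline["tokens"]


print(f"{speedup(full, vector):.1f}x")      # 11.9x
print(f"{speedup(full, graph):.1f}x")       # 6.6x
print(f"{token_saving(full, vector):.1%}")  # 93.1%
```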
Read that result with the source attached. It is a vendor’s benchmark of its own retrieval product, scored against tasks shaped like its own use case. That does not make the numbers wrong: full-context replay really did take 17 seconds at the 95th percentile in this test, and that is a real number any team can reproduce. But a benchmark designed by the company selling the faster option will reliably find that the faster option is also accurate enough. The question worth asking is not whether the number is real. It is what it is being compared against.
How to read agent-memory benchmarks without buying the framing
Three benchmarks dominate the 2026 reports on this topic, and they test different things. LoCoMo runs 1,540 questions across single-hop, multi-hop, open-domain, and temporal recall. LongMemEval runs 500 questions covering user recall, preference tracking, knowledge updates, and multi-session memory. BEAM is built specifically to test what happens at scale, at 1 million and 10 million token conversations, across ten categories including contradiction resolution and instruction adherence.
That is why a vendor can publish a 94.4 score on LongMemEval in one sentence and a 25% accuracy drop scaling from 1 million to 10 million tokens, per BEAM, in the next, without the two numbers contradicting each other. One is testing recall within a session of fixed size. The other is testing what degrades as that session grows by an order of magnitude. Neither is the single, real agent-memory score, because there is not one. In practice this shows up as two engineers on the same team reading two different vendor pages and arriving at opposite conclusions about whether retrieval-based memory is worth building. Neither one is wrong about the number they read. They are answering different questions.
Before trusting a published number, three questions are worth asking: which benchmark produced it, what context length it was measured at, and whether it is being compared against a fixed token budget or a moving one. A 67% accuracy figure means something different when the alternative spent 1,800 tokens to get there versus 26,000. Memory-product vendors have a real commercial interest in publishing comparisons that make compression look free, and per the numbers above, sometimes it genuinely is. Read the methodology before the headline.
A decision rule for the next agent you ship
The practical rule that shows up across the 2026 reports: trigger compression at 60-70% of window capacity, not at the hard limit. Waiting until the window is nearly full means the agent has likely already been operating in degraded territory for tens of thousands of tokens by the time compression kicks in, given how early accuracy started dropping in the tests above.
Match the strategy to the shape of the task, not to whatever is already wired up. A single-pass reasoning task over one large fixed document has no turn 40 to lose track of. A longer window with no compression at all is the right call there, and compression would only add latency for no benefit. Interactive agents that run for hours or days, on support tickets, coding sessions, or long customer workflows, are exactly where hierarchical summarisation or retrieval earns its complexity, because the latency and token cost compound with every additional turn.
Two examples make the difference concrete. A contract-review agent that reads one eighty-page document and produces a single structured summary has nothing to forget: every clause is still relevant on the last page, so a long, uncompressed window is the cheaper and more accurate choice. A support agent that stays open across a multi-day ticket, pulling in new replies, escalation notes, and policy lookups as the conversation grows, is the opposite case. By hour six it is accumulating exactly the kind of stale, low-relevance turns that hierarchical summarisation or retrieval is built to discard.
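One way to encode that matching is a small decision function. This is a sketch of one possible reading of the guidance above, not a rule taken from the reports: the function name, its three inputs, and the order of the checks are all assumptions.

```python
from enum import Enum


class Strategy(Enum):
    NO_COMPRESSION = "long window, no compression"
    SLIDING_WINDOW = "sliding-window truncation"
    HIERARCHICAL = "hierarchical summarisation"
    OFFLOADING = "selective memory offloading"


def choose_strategy(turns_expected: int,
                    depends_on_earlier_turns: bool,
                    spans_sessions: bool) -> Strategy:
    """Pick a memory strategy from the shape of the task."""
    if turns_expected <= 1:
        # Single pass over fixed input: nothing to forget, so nothing to compress.
        return Strategy.NO_COMPRESSION
    if not depends_on_earlier_turns:
        # Stateless lookups can safely drop old turns.
        return Strategy.SLIDING_WINDOW
    if spans_sessions:
        # Multi-day or multi-session work needs memory outside the window.
        return Strategy.OFFLOADING
    return Strategy.HIERARCHICAL


# The contract-review agent: one document, one structured summary.
print(choose_strategy(1, False, False).value)   # long window, no compression

# The multi-day support agent.
print(choose_strategy(200, True, True).value)   # selective memory offloading
```

The 200-turn figure for the support agent is invented for the example; the point is that the inputs describe the task, not the tooling already in place.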
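The 60-70% trigger described above can be written as a small check:

```python
def needs_compression(used_tokens: int, window_tokens: int, threshold: float = 0.65) -> bool:
    """True once usage crosses the threshold (0.65 sits inside the 60-70% band)."""
    return used_tokens / window_tokens >= threshold


def compress_if_needed(history, used_tokens, window_tokens, keep_verbatim=10):
    if not needs_compression(used_tokens, window_tokens):
        return history
    # Guard: history[-0:] would return the whole list, and a short history has nothing to fold.
    if keep_verbatim <= 0 or len(history) <= keep_verbatim:
        return history
    hot = history[-keep_verbatim:]   # recent turns, kept as-is
    cold = history[:-keep_verbatim]  # older turns, folded into one summary
    # summarize() is not defined here; it stands for the agent's own summarisation call.
    return [summarize(cold)] + hot
```

Context windows will keep growing. Every model release pushes the advertised maximum higher, and the reliable number under real load will keep landing somewhere short of it, for the same reasons it does today: tool catalogues get bigger, degradation curves do not flatten on their own, and latency scales with whatever the model has to read before it answers. The architecture decision (what to keep verbatim, what to compress, what to retrieve only on demand) does not disappear as the ceiling rises. It just moves to a bigger number.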