Prompt caching in production: why the hit rate depends on prompt structure, not the API setting
The mechanical explanation for why enabling caching and seeing a near-zero hit rate is the most common outcome
Prompt caching is available across the major LLM APIs and the economics look straightforward: cache your long system prompt, save 90% on those tokens for every subsequent request that hits the cache. Teams enable the feature, verify the API parameter is set, and check their cost dashboard a week later. The hit rate is near zero. The conclusion is usually 'caching doesn't work at our scale.'
The caching works. Most prompt structures don't.
The mechanism behind prompt caching is a prefix match, not a similarity match. The cache keys on the leading tokens of a request. Any change to any token before the cache boundary invalidates the cache for everything that follows. This single fact explains most prompt caching failures in production, and it has direct implications for how you structure every prompt in your application.
The cache boundary is a prefix, not an intent
On Anthropic's API, you mark cache breakpoints explicitly using cache_control: {"type": "ephemeral"} on a system or user block. By default the cache entry lasts five minutes, and the timer resets each time the entry is read. On OpenAI's API, the cache is automatic: any common prefix of 1,024 or more tokens is cached without explicit marking. In both cases, the rule is the same: the cached region is the leading token sequence, and a change to any token in that sequence, at any position, invalidates the cache from that token onward.
The minimum eligible prefix depends on the model: 1,024 tokens for Claude Sonnet and Opus, 2,048 for Haiku, and 1,024 for GPT-4o. These thresholds vary by model version, so check the current documentation for the exact model you call. Below the threshold, caching doesn't engage regardless of how the prompt is structured. Above it, every token in the cached prefix must match exactly between requests.
Not approximately. Not semantically. Exactly. A single changed character in a 3,000-token system prompt invalidates the cache for everything after that change. The culprit is usually obvious: a date field that rolls over at midnight, a session ID that changes per user, a dynamic instruction injected from a feature flag.
The common structure that defeats itself
The most frequent pattern in production LLM applications looks something like this: a system prompt header, then a current-date or session-metadata field, then per-user context pulled from a database, then static instructions, then conversation history, then the user message. Each component assembled in whatever order made sense when the feature was first built.
Dynamic content inserted early (a current date, a user identifier, a freshness timestamp) invalidates every cached token that follows. The static instructions at the end of the system prompt are recomputed from scratch on every request, because the prefix changes every time the early dynamic field changes. Caching is enabled. The hit rate is 0%.
The five-minute diagnostic: for your highest-traffic endpoint, log the first 200 tokens of a typical request. Count how many of those 200 tokens differ between any two requests from different users or from the same user on different days. If dynamic content appears in the first 200 tokens, everything after it is a guaranteed cache miss.
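The diagnostic above can be sketched as a small script over logged requests. This is a minimal illustration, not a production tool: the function names are hypothetical, the sample prompts are invented, and whitespace splitting stands in for the model's real tokenizer, so the indices are only approximate.

```python
from itertools import combinations


def first_divergence(a, b):
    """Index of the first differing token, or None if the compared ranges match."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None


def earliest_divergence(requests, window=200):
    """Earliest index at which any two requests differ within their first `window` tokens."""
    heads = [tokens[:window] for tokens in requests]
    found = [first_divergence(a, b) for a, b in combinations(heads, 2)]
    found = [i for i in found if i is not None]
    return min(found) if found else None


# Invented sample log; str.split() is only a stand-in for a real tokenizer.
logged = [
    "You are a support bot. Date: 2025-03-03. User: alice. Follow the refund policy.".split(),
    "You are a support bot. Date: 2025-03-04. User: bob. Follow the refund policy.".split(),
    "You are a support bot. Date: 2025-03-04. User: carol. Follow the refund policy.".split(),
]

position = earliest_divergence(logged)
if position is None:
    print("No divergence in the first 200 tokens")
else:
    print(f"Prefix diverges at token {position}: everything after it misses the cache")
```

A small index here means the cacheable prefix is effectively that short, no matter how much static text follows it.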
RAG injection: why the ordering matters more than the content
Retrieval-augmented generation is where this gets structurally worse. A common RAG prompt: system prompt with behavior instructions, then retrieved document chunks (three to five, unique per query), then the user's question. The retrieved chunks change with every query, and when they sit in the same block as the instructions, the system prompt is never cached.
Even with the system prompt placed first and the retrieved content placed second, the cache boundary has to fall at the end of the system prompt. On a 2,000-token system prompt with 800 tokens of retrieved content per request, the system prompt is eligible for caching, but only if the prompt assembly code actually places the breakpoint after the system prompt and before the retrieval injection. If retrieved content is concatenated into the system block rather than placed in a separate user or assistant block, the boundary is wrong and the system prompt cache never activates.
The higher-value optimization: separate static and dynamic retrieval. Company policies, product documentation, reference material that changes weekly can live in the stable prefix position, cached via a breakpoint, refreshed only when the source documents update. Per-query retrieved chunks, which change with every request, belong after the cache boundary. The split is usually a one-afternoon change that can push hit rates from 10% to 60% on the same RAG application.
Four structure patterns that maintain high hit rates
The rule that covers most cases: static content first, dynamic content last. Every component that doesn't change between two requests from different users goes before the cache boundary. Everything user-specific, query-specific, or time-varying goes after it.
Pattern 1: Stable prefix ordering
Audit every component of your prompt and ask: does this change between two requests from different users, or from the same user one minute apart? If yes, it belongs in the late position. If no, it belongs in the early position. Most teams discover that 60-80% of their prompt content is static or slow-changing. All of it should be before the cache boundary.
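One way to make that audit explicit is to tag each prompt component at assembly time and partition on the tag. The sketch below is an illustration under assumptions: the Component type, the static flag, and the sample components are all hypothetical, and deciding what counts as static is still a manual judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    text: str
    static: bool  # True only if the text is identical across users and requests


def split_for_caching(components):
    """Stable partition: static components first, dynamic last, order kept within each group."""
    static = [c for c in components if c.static]
    dynamic = [c for c in components if not c.static]
    return static, dynamic


# Hypothetical components, listed in the order the feature originally assembled them.
components = [
    Component("header", "You are a support assistant.", True),
    Component("date", "Today is 2025-03-03.", False),
    Component("user_profile", "User: alice, plan: pro.", False),
    Component("instructions", "Follow the refund policy.", True),
]

static, dynamic = split_for_caching(components)
cached_prefix = "\n".join(c.text for c in static)       # goes before the cache boundary
uncached_suffix = "\n".join(c.text for c in dynamic)    # goes after it
print([c.name for c in static], [c.name for c in dynamic])
```

Keeping the flag on the component, rather than in the assembly code, makes it harder for a later change to slip a dynamic field into the prefix unnoticed.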
Pattern 2: Conversation history trimming direction
Long conversation histories are cache-friendly up to the point where they need to be trimmed for context length. The problem: if you trim by removing old turns from the beginning of the conversation, you shift the prefix for all subsequent content, breaking the cache for anything cached after the removed turns. Trim from the most recent end instead. Or summarize old turns into a stable summary block placed at the beginning of the history, so the prefix stays consistent even as the conversation grows.
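The summary-block option might look like the following sketch. It assumes compaction happens in infrequent batches, so the summary and the retained turns stay byte-identical between compactions and the prefix only grows. The function names, thresholds, and the naive_summarize stand-in are hypothetical; a real summarizer would call a model.

```python
def naive_summarize(summary, old_turns):
    """Stand-in only: a real implementation would ask a model to condense the turns."""
    prefix = summary + " " if summary else ""
    return prefix + f"[{len(old_turns)} earlier turns condensed]"


def compact_history(summary, turns, max_turns=20, keep=10, summarize=naive_summarize):
    """Fold the oldest turns into the summary only once the history exceeds max_turns.

    Below the threshold nothing is rewritten, so consecutive requests share a prefix.
    """
    if len(turns) <= max_turns:
        return summary, turns
    old, recent = turns[:-keep], turns[-keep:]
    return summarize(summary, old), recent


# Invented conversation of 22 alternating turns.
turns = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(22)
]

summary, kept = compact_history("", turns)
print(summary)
print(len(kept), kept[0]["content"])
```

Each compaction still costs one cache miss, because the summary text changes. The point of batching is that the miss happens once every several turns instead of on every request.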
Pattern 3: Separate static and dynamic retrieval tiers
For RAG applications, retrieved content falls into two categories. Documents that rarely change (product documentation, company policies, reference guides) can be treated as a second stable tier, placed after the system prompt but before the cache boundary. Per-query retrieved chunks, unique to each search, go after the boundary. This is not always possible (some RAG architectures mix static and dynamic retrieval), but when it is, the cache hit rate improvement is significant.
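A two-tier layout can be sketched as a request builder with one breakpoint per stable tier. The function name and sample strings are hypothetical, and the model name is simply the one used elsewhere in this article; the cache_control field is the Anthropic mechanism described earlier. The builder only assembles the payload, so it runs without network access.

```python
def build_request(instructions, reference_docs, query_chunks, question,
                  model="claude-opus-4-5"):
    """Assemble Messages API keyword arguments with two stable, cacheable tiers."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            # Tier 1: behavior instructions, changes only on deploy.
            {"type": "text", "text": instructions,
             "cache_control": {"type": "ephemeral"}},
            # Tier 2: slow-changing reference material, refreshed when sources update.
            {"type": "text", "text": reference_docs,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Per-query chunks and the question sit after both breakpoints.
            {"role": "user",
             "content": "\n\n".join(query_chunks) + "\n\n" + question},
        ],
    }


request = build_request(
    instructions="You answer questions about company policy.",
    reference_docs="[static policy documents here]",
    query_chunks=["[chunk A]", "[chunk B]"],
    question="How many vacation days carry over?",
)
print(len(request["system"]), request["messages"][0]["content"])
```

The payload would be passed as client.messages.create(**request). With separate breakpoints, a weekly refresh of the reference tier invalidates only that tier; the instructions ahead of it can still be read from cache.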
Pattern 4: Explicit cache boundary placement
On Anthropic's API, use cache_control explicitly rather than relying on implicit boundaries. Mark the breakpoint at the exact position where static content ends and dynamic content begins. You can set up to four breakpoints per request, though one or two is usually right. The explicit boundary gives you control that the automatic caching in OpenAI's API doesn't. This forces the discipline of thinking about what's static versus dynamic at prompt-assembly time.
import anthropic

client = anthropic.Anthropic()

# Static system content: cached after the first request.
# It must exceed the model's minimum cacheable length to be cached at all.
static_instructions = """You are a document analysis assistant.
You have access to the following reference documentation:
[2,000 tokens of static reference material here]
Analyze documents according to the guidelines above."""

# Placeholder inputs and retrieval helper (hypothetical; supply your own).
user_name, user_role = "alice", "analyst"
user_query = "Summarize the termination clause."


def get_relevant_chunks(query: str) -> str:
    """Stand-in for a retrieval layer; returns the chunks joined into one string."""
    return "[retrieved chunks for: " + query + "]"


# Per-request dynamic content: never cached
user_context = f"Current user: {user_name}, Role: {user_role}"
retrieved_chunks = get_relevant_chunks(user_query)

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": static_instructions,
            "cache_control": {"type": "ephemeral"},  # Cache boundary here
        },
        {
            "type": "text",
            "text": user_context,  # Dynamic, after the boundary
        },
    ],
    messages=[
        {"role": "user", "content": retrieved_chunks + "\n\n" + user_query}
    ],
)

# Read cache performance from usage
usage = response.usage
print(f"Cache read: {usage.cache_read_input_tokens} tokens")
print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
print(f"Regular input: {usage.input_tokens} tokens")
Reading the metrics that matter
The aggregate cache hit rate in a dashboard is the wrong metric to optimize. An aggregate of 40% can mean one endpoint hits 80% and another hits 0%. Each endpoint has its own prompt structure and its own caching profile. Track hit rate per endpoint, not across the application.
Both Anthropic and OpenAI expose cache-specific token counts in the usage response. On Anthropic: cache_read_input_tokens (tokens served from cache), cache_creation_input_tokens (tokens written to cache this request), and input_tokens (tokens computed normally, not counted in either cache field). Log all three. The hit rate for a given endpoint is cache_read_input_tokens divided by total input tokens, meaning the sum of all three fields, across the trailing window.
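A per-endpoint hit rate can be computed from logged usage records along these lines. The record shape (an endpoint label paired with a dict of the three Anthropic usage fields) and the sample numbers are assumptions for illustration. The sketch treats total input as the sum of the three fields, on the understanding that input_tokens excludes tokens read from or written to the cache.

```python
from collections import defaultdict


def hit_rate_by_endpoint(records):
    """Map each endpoint to cache-read tokens divided by total input tokens."""
    read = defaultdict(int)
    total = defaultdict(int)
    for endpoint, usage in records:
        read[endpoint] += usage["cache_read_input_tokens"]
        total[endpoint] += (
            usage["cache_read_input_tokens"]
            + usage["cache_creation_input_tokens"]
            + usage["input_tokens"]
        )
    return {e: (read[e] / total[e] if total[e] else 0.0) for e in total}


# Invented usage log for a trailing window.
records = [
    ("/chat", {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 1800, "input_tokens": 200}),
    ("/chat", {"cache_read_input_tokens": 1800, "cache_creation_input_tokens": 0, "input_tokens": 200}),
    ("/extract", {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 0, "input_tokens": 1500}),
]

for endpoint, rate in hit_rate_by_endpoint(records).items():
    print(f"{endpoint}: {rate:.0%}")
```

Weighting by tokens rather than by request count keeps a few short requests from masking a miss on one very long prompt.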
| Hit rate | Probable cause | First thing to check |
|---|---|---|
| < 5% | Prefix changes on every request | Is there a date, session ID, or user ID in the first 200 tokens? |
| 5–20% | Dynamic content early in prompt | Which component is first: static instructions or per-user context? |
| 20–50% | Partial structure problem | Are retrieved documents injected before or after the cache boundary? |
| 50–80% | Good structure, cache TTL friction | Are requests clustered within the 5-minute TTL window? |
| > 80% | Well-structured prefix | Monitor for prompt drift as the application evolves |
A hit rate below 20% on a high-traffic endpoint almost always means dynamic content in the prefix. That's fixable with prompt restructuring. A hit rate below 5% means the prefix changes on every request — a structural problem that requires moving dynamic content to after the cache boundary, not a configuration change.
When caching trades one cost for another
Caching adds first-token latency on cache misses. On a miss, the API computes and stores the KV cache for the prefix, adding overhead on top of normal compute. The benefit arrives on the next request that hits the cache. For workloads where every request is unique (one-shot extraction on novel documents, fresh conversations with no shared prefix), that next request rarely comes, and the added overhead on misses may exceed the savings on occasional hits.
Prompt caching is net positive when system prompts are long (1,500 or more tokens) and reused across many requests within the TTL window, when a significant fraction of requests share a prefix (multi-turn sessions, same-user requests within five minutes), or when static reference content is large and stable. For purely one-shot workloads at low volume, the miss overhead with few hits is a net cost. Profile before assuming caching helps.
The cost structure on Anthropic's API: cache creation costs 1.25 times the normal input token price. Cache reads cost 0.1 times the normal price. A write therefore costs an extra 0.25 times the base price, and each read saves 0.9 times, so breakeven sits below one cache read per cache creation: a single hit within the five-minute TTL window already recovers the write premium. If a cached prefix is written and never read before it expires, the economics are slightly negative, because you pay the 25% premium for nothing.
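The two multipliers can be plugged into a small calculation to check where caching starts paying for a given number of reads per write. This is a rough sketch: the function name is hypothetical, only the cached prefix is priced, and latency and the uncached remainder of the prompt are ignored.

```python
def relative_cost(reads, write_mult=1.25, read_mult=0.10):
    """Cost of one cache write plus `reads` hits, relative to sending the prefix uncached each time.

    Values below 1.0 mean caching is cheaper for that prefix.
    """
    cached = write_mult + reads * read_mult
    uncached = 1 + reads  # the same requests at the normal input price
    return cached / uncached


for reads in (0, 1, 3, 8):
    print(f"{reads} reads per write: {relative_cost(reads):.3f}x the uncached cost")
```

With these multipliers the ratio is above 1.0 only when a write gets no reads at all, which is why the number that matters is the fraction of writes that are ever read before the TTL expires.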
The structure decision happens before the first request
The practical implication: prompt structure is a caching decision, not just a readability decision. The order in which components are assembled — static versus dynamic, shared versus per-request — determines the cache hit rate before a single API call is made. Most teams make this decision implicitly, in whatever order seemed logical when the feature was first built.
Revisiting that structure with caching in mind is usually a short change with measurable cost impact at any meaningful volume. The change doesn't require rewriting logic. It's a reordering of existing prompt components, with a cache boundary placed where static content ends. The hard part is auditing which components are actually static and which only appear static because the test cases don't vary them.
Turning caching on is a single API parameter. Structuring prompts to hit the cache is a design question that should happen at the same time as prompt assembly, not after the cost dashboard shows a problem.