How long does the prompt cache last?

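On Anthropic's API, the default ephemeral cache lasts five minutes from the time it was last accessed. A request that hits the cache resets the TTL to a fresh five minutes; hits do not stack additional time. On OpenAI's API, caching is automatic and retention is not set per request: OpenAI's documentation describes cached prefixes as typically being cleared after roughly 5 to 10 minutes of inactivity and within about an hour of last use by default. Both caches are isolated per organization, so they are never shared across different accounts.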
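The refresh-on-hit behavior can be sketched as a toy model. This is a simulation for intuition, not provider code: the class name `SlidingTTLCache` and the 300-second default are assumptions for illustration.

```python
class SlidingTTLCache:
    """Toy model of a cache entry whose expiry is refreshed on every access."""

    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self.expires_at = {}  # prefix key -> expiry time in seconds

    def request(self, key: str, now: float) -> bool:
        """Return True on a hit. Hits and fresh writes both reset the expiry."""
        hit = key in self.expires_at and now < self.expires_at[key]
        self.expires_at[key] = now + self.ttl
        return hit


cache = SlidingTTLCache()
# Requests at t = 0s, 200s, 450s, and 800s against the same prefix.
results = [cache.request("system-prompt-v1", t) for t in (0, 200, 450, 800)]
print(results)  # [False, True, True, False]
```

The third request lands 450 seconds after the first write and still hits, because the second request refreshed the entry. The fourth arrives 350 seconds after the last access and misses.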

Does changing the model version invalidate the cache?

Yes. The cache is specific to the model version. Switching from claude-opus-4-5 to claude-sonnet-4-6, or from gpt-4o-2024-08-06 to gpt-4o-mini, invalidates any cached prefix for the previous model. Model version pinning is generally advisable for any production endpoint where cache hit rate matters.

Can I cache conversation history?

Yes, and this is often the highest-value use case. Conversation history is typically static for previous turns — the model can't change what was said earlier. Cache the history up to the most recent assistant turn, then add the new user message dynamically. The key is trimming from the newest end, not the oldest, to preserve the cached prefix.
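A minimal sketch of that assembly step, assuming history is stored as a list of role/content dictionaries with plain-string content. The helper name `mark_history_breakpoint` is hypothetical; the content-block form with `cache_control` mirrors the one used for system blocks in cache_boundary.py. The function only builds the `messages` list and makes no API call.

```python
import copy


def mark_history_breakpoint(history, new_user_message):
    """Return a messages list with a cache breakpoint on the last prior turn.

    `history` is a list of {"role": ..., "content": str} dicts that ends with
    an assistant turn. The input list is left unmodified.
    """
    messages = copy.deepcopy(history)
    if messages:
        last = messages[-1]
        # Convert the last prior turn to block form so it can carry cache_control.
        last["content"] = [
            {
                "type": "text",
                "text": last["content"],
                "cache_control": {"type": "ephemeral"},
            }
        ]
    # The new user message sits after the breakpoint and is never part of the cached prefix.
    messages.append({"role": "user", "content": new_user_message})
    return messages


history = [
    {"role": "user", "content": "Summarize clause 4."},
    {"role": "assistant", "content": "Clause 4 limits liability to fees paid."},
]
messages = mark_history_breakpoint(history, "Does clause 7 override it?")
print(messages[1]["content"][0]["cache_control"])
```

On the next turn, the same call moves the breakpoint forward by one exchange, so the earlier turns are read from cache and only the newest exchange is written.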

What happens if I set a cache breakpoint in the middle of a dynamic section?

The cache will attempt to match the prefix up to the breakpoint. If that prefix changes between requests (because dynamic content appears before the breakpoint), the cache misses on every request and also pays the cache creation overhead. The net result is slightly higher cost and latency than not caching at all. The breakpoint should always fall after the last dynamic element in the prefix.

AI & LLMs | May 22, 2026 | 7 min read | Reviewed May 22, 2026

Prompt caching in production: why the hit rate depends on prompt structure, not the API setting

The mechanical explanation for why enabling caching and seeing a near-zero hit rate is the most common outcome

By FlowVerify Editorial Team

Prompt caching is available across the major LLM APIs and the economics look straightforward: cache your long system prompt, save 90% on those tokens for every subsequent request that hits the cache. Teams enable the feature, verify the API parameter is set, and check their cost dashboard a week later. The hit rate is near zero. The conclusion is usually 'caching doesn't work at our scale.'

The caching works. Most prompt structures don't.

The mechanism behind prompt caching is a prefix match, not a similarity match. The cache keys on the leading tokens of a request. Any change to any token before the cache boundary invalidates the cache for everything that follows. This single fact explains most prompt caching failures in production, and it has direct implications for how you structure every prompt in your application.

The cache boundary is a prefix, not an intent

On Anthropic's API, you mark cache breakpoints explicitly using cache_control: {"type": "ephemeral"} on a system or user block. The cache lasts five minutes. On OpenAI's API, the cache is automatic: any common prefix of 1,024 or more tokens is cached without explicit marking. In both cases, the rule is the same: the cached region is the leading token sequence, and a change to any token in that sequence, at any position, invalidates the cache for all tokens after the change.

The minimum eligible prefix depends on the model: 1,024 tokens for Claude Sonnet and Opus, 2,048 for Haiku, and 1,024 for GPT-4o. These thresholds have differed between model generations, so confirm the figure for the exact model version you pin. Below the threshold, caching doesn't engage regardless of how the prompt is structured. Above it, every token in the cached prefix must match exactly between requests.

Not approximately. Not semantically. Exactly. A single changed character in a 3,000-token system prompt invalidates the entire cache for everything after that change. The culprit is usually obvious: a date field that rolls over at midnight, a session ID that changes per user, a dynamic instruction injected from a feature flag.
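The exact-match rule can be illustrated with a toy cache key. This is a sketch under stated assumptions: real providers key on tokens and do not publish their keying scheme, so hashing the leading characters with SHA-256 here is only a stand-in, and `prefix_key` is a hypothetical helper.

```python
import hashlib


def prefix_key(prompt: str, boundary: int) -> str:
    """Hash the first `boundary` characters, standing in for the leading tokens."""
    return hashlib.sha256(prompt[:boundary].encode("utf-8")).hexdigest()


static_body = "Follow the review guidelines. " * 100
day_one = "Date: 2026-05-21\n" + static_body
day_two = "Date: 2026-05-22\n" + static_body

boundary = len(day_one)
same_day_hit = prefix_key(day_one, boundary) == prefix_key(day_one, boundary)
next_day_hit = prefix_key(day_one, boundary) == prefix_key(day_two, boundary)
print(same_day_hit, next_day_hit)  # True False
```

One changed digit in the date line produces a different key, even though the 3,000 characters after it are identical.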

The common structure that defeats itself

The most frequent pattern in production LLM applications looks something like this: a system prompt header, then a current-date or session-metadata field, then per-user context pulled from a database, then static instructions, then conversation history, then the user message. Each component assembled in whatever order made sense when the feature was first built.

Dynamic content inserted early (a current date, a user identifier, a freshness timestamp) invalidates every cached token that follows. The static instructions at the end of the system prompt are recomputed from scratch on every request, because the prefix changes every time the early dynamic field changes. Caching is enabled. The hit rate is 0%.

The five-minute diagnostic: for your highest-traffic endpoint, log the first 200 tokens of a typical request. Count how many of those 200 tokens differ between any two requests from different users or from the same user on different days. If dynamic content appears in the first 200 tokens, everything after it is a guaranteed cache miss.
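That diagnostic can be sketched in a few lines. The example assumes requests have already been logged as token lists; whitespace splitting stands in for the provider's tokenizer, and both function names are hypothetical.

```python
def first_divergence(a: list, b: list) -> int:
    """Index of the first position where two token lists differ.

    Returns the shorter length if one list is a prefix of the other.
    """
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return min(len(a), len(b))


def stable_prefix_length(requests: list) -> int:
    """Length of the leading run of tokens shared by every logged request."""
    if not requests:
        return 0
    first = requests[0]
    return min(
        (first_divergence(first, other) for other in requests[1:]),
        default=len(first),
    )


req_a = "You are a support agent . Today is 2026-05-21 . Answer politely .".split()
req_b = "You are a support agent . Today is 2026-05-22 . Answer politely .".split()
print(stable_prefix_length([req_a, req_b]))  # 8
```

A result far smaller than the length of the static instructions means something dynamic sits in front of them.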

RAG injection: why the ordering matters more than the content

Retrieval-augmented generation is where this gets structurally worse. A common RAG prompt: system prompt with behavior instructions, then retrieved document chunks (three to five, unique per query), then the user's question. The retrieved chunks change with every query. The system prompt is never cached.

Even with the system prompt placed first and the retrieved content second, caching works only if the boundary falls exactly at the end of the system prompt. On a 2,000-token system prompt with 800 tokens of retrieved content per request, the system prompt is eligible for caching, but only if the prompt assembly code actually places the breakpoint after the system prompt and before the retrieval injection. If retrieved content is concatenated into the system block rather than placed in a separate block or message, the boundary is wrong and the system prompt cache never activates.

The higher-value optimization: separate static and dynamic retrieval. Company policies, product documentation, reference material that changes weekly can live in the stable prefix position, cached via a breakpoint, refreshed only when the source documents update. Per-query retrieved chunks, which change with every request, belong after the cache boundary. The split is usually a one-afternoon change that can push hit rates from 10% to 60% on the same RAG application.

Four structure patterns that maintain high hit rates

The rule that covers most cases: static content first, dynamic content last. Every component that doesn't change between two requests from different users goes before the cache boundary. Everything user-specific, query-specific, or time-varying goes after it.

Pattern 1: Stable prefix ordering

Audit every component of your prompt and ask: does this change between two requests from the same user, one minute apart? If yes, it belongs in the late position. If no, it belongs in the early position. Most teams discover that 60-80% of their prompt content is static or slow-changing. All of it should be before the cache boundary.
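One way to make that audit explicit in code is to tag each component and let the assembly step enforce the ordering. This is a minimal sketch; the `Component` type, the `static` flag, and the sample texts are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    text: str
    static: bool  # True if identical across users and across requests


def order_for_caching(components):
    """Static components first, dynamic last; original order kept within each group."""
    static = [c for c in components if c.static]
    dynamic = [c for c in components if not c.static]
    return static, dynamic


components = [
    Component("header", "You are a contracts assistant.", True),
    Component("date", "Today is 2026-05-22.", False),
    Component("user_context", "User role: reviewer.", False),
    Component("instructions", "Cite the clause number for every claim.", True),
]
static, dynamic = order_for_caching(components)
cached_prefix = "\n".join(c.text for c in static)     # goes before the boundary
dynamic_suffix = "\n".join(c.text for c in dynamic)   # goes after the boundary
print([c.name for c in static], [c.name for c in dynamic])
```

Making the flag a required field forces whoever adds a new component to decide which side of the boundary it belongs on.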

Pattern 2: Conversation history trimming direction

Long conversation histories are cache-friendly up to the point where they need to be trimmed for context length. The problem: if you trim by removing old turns from the beginning of the conversation, you shift the prefix for all subsequent content, breaking the cache for anything cached after the removed turns. Trim from the most recent end instead. Or summarize old turns into a stable summary block placed at the beginning of the history, so the prefix stays consistent even as the conversation grows.
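The prefix shift is easy to see by counting how many leading turns two consecutive requests share. This is an illustrative sketch: turns are plain strings and `shared_prefix_turns` is a hypothetical helper, not part of any SDK.

```python
def shared_prefix_turns(previous, current):
    """Number of leading turns that are identical in both histories."""
    count = 0
    for old, new in zip(previous, current):
        if old != new:
            break
        count += 1
    return count


turns = [f"turn {i}" for i in range(1, 7)]  # six turns in total
previous_request = turns[:5]

appended = turns[:6]         # history simply grows by one turn
trimmed_oldest = turns[1:6]  # oldest turn dropped to save context

# Early turns replaced once by a fixed summary block that is then left alone.
prior_with_summary = ["summary of early turns"] + turns[2:5]
with_summary = ["summary of early turns"] + turns[2:6]

print(shared_prefix_turns(previous_request, appended))        # 5
print(shared_prefix_turns(previous_request, trimmed_oldest))  # 0
print(shared_prefix_turns(prior_with_summary, with_summary))  # 4
```

Dropping the oldest turn leaves nothing in common with the previous request's prefix. The summary block costs one cache rewrite when it is introduced, and the prefix is stable again from then on, provided the summary text itself is not regenerated on every request.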

Pattern 3: Separate static and dynamic retrieval tiers

For RAG applications, retrieved content falls into two categories. Documents that rarely change (product documentation, company policies, reference guides) can be treated as a second stable tier, placed after the system prompt but before the cache boundary. Per-query retrieved chunks, unique to each search, go after the boundary. This is not always possible (some RAG architectures mix static and dynamic retrieval), but when it is, the cache hit rate improvement is significant.
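A sketch of the two-tier assembly, assuming the same system-block shape used in cache_boundary.py. The function `build_rag_request` and its argument names are hypothetical; it only builds keyword arguments for a request and does not call the API.

```python
def build_rag_request(system_prompt, static_docs, query_chunks, question):
    """Assemble request arguments with one breakpoint after the static tiers."""
    system_blocks = [
        {"type": "text", "text": system_prompt},
        {
            "type": "text",
            "text": "\n\n".join(static_docs),
            # Breakpoint: everything up to and including this block is the cached prefix.
            "cache_control": {"type": "ephemeral"},
        },
    ]
    # Per-query chunks and the question go in the user message, after the boundary.
    user_content = "\n\n".join(query_chunks) + "\n\n" + question
    return {
        "system": system_blocks,
        "messages": [{"role": "user", "content": user_content}],
    }


policies = ["Refund policy: 30 days.", "Support hours: 9 to 5 UTC."]
req_1 = build_rag_request("You answer from policy.", policies, ["chunk A"], "Can I get a refund?")
req_2 = build_rag_request("You answer from policy.", policies, ["chunk B"], "When is support open?")
print(req_1["system"] == req_2["system"])  # True
```

Two different queries produce byte-identical system blocks, which is the property the cache needs. Only the user message varies.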

Pattern 4: Explicit cache boundary placement

On Anthropic's API, use cache_control explicitly rather than relying on implicit boundaries. Mark the breakpoint at the exact position where static content ends and dynamic content begins. You can set up to four breakpoints per request, though one or two is usually right. The explicit boundary gives you control that the automatic caching in OpenAI's API doesn't. This forces the discipline of thinking about what's static versus dynamic at prompt-assembly time.

cache_boundary.py

import anthropic

client = anthropic.Anthropic()

# Static system content — cached after first request
static_instructions = """You are a document analysis assistant.
You have access to the following reference documentation:

[2,000 tokens of static reference material here]

Analyze documents according to the guidelines above."""

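# Per-request dynamic content, never cached.
# user_name, user_role, user_query, and get_relevant_chunks() are placeholders
# assumed to be defined by the surrounding application; get_relevant_chunks()
# is assumed to return a single string.
user_context = f"Current user: {user_name}, Role: {user_role}"
retrieved_chunks = get_relevant_chunks(user_query)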

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": static_instructions,
            "cache_control": {"type": "ephemeral"}  # Cache boundary here
        },
        {
            "type": "text",
            "text": user_context  # Dynamic — after the boundary
        }
    ],
    messages=[
        {"role": "user", "content": retrieved_chunks + "\n\n" + user_query}
    ]
)

# Read cache performance from usage
usage = response.usage
print(f"Cache read: {usage.cache_read_input_tokens} tokens")
print(f"Cache created: {usage.cache_creation_input_tokens} tokens")
print(f"Regular input: {usage.input_tokens} tokens")

Reading the metrics that matter

The aggregate cache hit rate in a dashboard is the wrong metric to optimize. An aggregate of 40% can mean one endpoint hits 80% and another hits 0%. Each endpoint has its own prompt structure and its own caching profile. Track hit rate per endpoint, not across the application.

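Both Anthropic and OpenAI expose cache-specific token counts in the usage response. On Anthropic: cache_read_input_tokens (tokens served from cache), cache_creation_input_tokens (tokens written to cache this request), and input_tokens (uncached tokens computed normally). On OpenAI, the equivalent figure is the cached_tokens count under prompt_tokens_details. Log all of them. The hit rate for a given endpoint is cache_read_input_tokens divided by total input tokens across the trailing window. On Anthropic, input_tokens excludes the cached portion, so total input is the sum of all three fields, not input_tokens alone.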
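A minimal sketch of the per-endpoint calculation, assuming usage records have been logged as tuples of the three Anthropic fields. The log format, endpoint names, and `hit_rates` helper are assumptions for illustration. The denominator sums all three fields, on the understanding that input_tokens counts only the uncached portion.

```python
from collections import defaultdict


def hit_rates(usage_log):
    """Per-endpoint hit rate: cached reads divided by all input tokens.

    Each record is (endpoint, cache_read, cache_creation, uncached_input).
    """
    totals = defaultdict(lambda: [0, 0])  # endpoint -> [read tokens, all input tokens]
    for endpoint, read, created, uncached in usage_log:
        totals[endpoint][0] += read
        totals[endpoint][1] += read + created + uncached
    return {ep: (r / t if t else 0.0) for ep, (r, t) in totals.items()}


log = [
    ("/summarize", 0, 2000, 300),   # first request writes the cache
    ("/summarize", 2000, 0, 300),   # later requests read it
    ("/summarize", 2000, 0, 400),
    ("/extract", 0, 1800, 500),     # prefix changes every time: write, never read
    ("/extract", 0, 1800, 450),
]
rates = hit_rates(log)
print({ep: round(rate, 2) for ep, rate in rates.items()})
```

Here /summarize reaches about 57% while /extract sits at 0%. A blended figure across both would hide that /extract is paying the write premium on every request.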

Hit rate	Probable cause	First thing to check
< 5%	Prefix changes on every request	Is there a date, session ID, or user ID in the first 200 tokens?
5–20%	Dynamic content early in prompt	Which component is first: static instructions or per-user context?
20–50%	Partial structure problem	Are retrieved documents injected before or after the cache boundary?
50–80%	Good structure, cache TTL friction	Are requests clustered within the 5-minute TTL window?
> 80%	Well-structured prefix	Monitor for prompt drift as the application evolves

Diagnosing cache hit rate by symptom

A hit rate below 20% on a high-traffic endpoint almost always means dynamic content in the prefix. That's fixable with prompt restructuring. A hit rate below 5% means the prefix changes on every request — a structural problem that requires moving dynamic content to after the cache boundary, not a configuration change.

When caching trades one cost for another

Caching adds first-token latency on cache misses. On a miss, the API computes and stores the KV cache for the prefix, adding overhead on top of normal compute. The benefit arrives on the next request that hits the cache. For workloads where every request is unique (one-shot extraction on novel documents, fresh conversations with no shared prefix), that next hit rarely comes, and the added overhead on misses may exceed the savings on occasional hits.

Prompt caching is net positive when system prompts are long (1,500 or more tokens) and reused across many requests within the TTL window, when a significant fraction of requests share a prefix (multi-turn sessions, same-user requests within five minutes), or when static reference content is large and stable. For purely one-shot workloads at low volume, the miss overhead with few hits is a net cost. Profile before assuming caching helps.

The cost structure on Anthropic's API: cache creation costs 1.25 times the normal input token price. Cache reads cost 0.1 times the normal price. Each write therefore carries a 0.25x premium and each read saves 0.9x, so a single cache read within the five-minute TTL window already pays for the write: one write plus one read costs 1.35x for the prefix, against 2.0x uncached. Caching loses money only when a written prefix expires without being read at all, in which case each request pays 25% extra on those tokens.
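The arithmetic can be checked directly from the two multipliers. This is a sketch that counts only the cached prefix and assumes every request in the window would otherwise pay full price for it; `relative_cost` is a hypothetical helper and the units are multiples of one uncached pass over the prefix.

```python
def relative_cost(reads: int, write_mult: float = 1.25, read_mult: float = 0.10):
    """Prefix cost for one cache write plus `reads` hits, versus no caching."""
    cached = write_mult + read_mult * reads
    uncached = 1.0 * (1 + reads)
    return cached, uncached


for n in (0, 1, 8):
    cached, uncached = relative_cost(n)
    print(n, round(cached, 2), round(uncached, 2))
# 0 1.25 1.0
# 1 1.35 2.0
# 8 2.05 9.0
```

With zero reads the write premium is pure overhead. From the first read onward the cached path is cheaper, and the gap widens by 0.9x with every additional hit.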

The structure decision happens before the first request

The practical implication: prompt structure is a caching decision, not just a readability decision. The order in which components are assembled — static versus dynamic, shared versus per-request — determines the cache hit rate before a single API call is made. Most teams make this decision implicitly, in whatever order seemed logical when the feature was first built.

Revisiting that structure with caching in mind is usually a short change with measurable cost impact at any meaningful volume. The change doesn't require rewriting logic. It's a reordering of existing prompt components, with a cache boundary placed where static content ends. The hard part is auditing which components are actually static and which only appear static because the test cases don't vary them.

Turning caching on is a single API parameter. Structuring prompts to hit the cache is a design question that should happen at the same time as prompt assembly, not after the cost dashboard shows a problem.

Frequently asked questions


