What timeout value should I use for LLM API calls?

Use two timeouts: a time-to-first-token threshold of 5–8 seconds and a completion timeout of 30–45 seconds. If no token arrives within the TTFT threshold, retry with a lower max_tokens value. The completion timeout fires mid-stream only if generation has stalled, not because the response is simply long.

How do I detect a semantically bad LLM response?

Start with structural checks (minimum length, expected keys, JSON parsability) and rule-based heuristics (apology phrases, unexpectedly short answers to complex questions). For high-stakes features, add a cheap LLM-as-validator — a binary YES/NO prompt to a small model like Haiku or GPT-4o mini costs roughly $0.001 per call and adds under 1 second of latency.

When should I use a cached LLM response vs a non-AI fallback?

Use a cached LLM response when the underlying data the LLM summarised has not changed significantly since the cache entry was written. Use a non-AI fallback when the cached response would be stale enough to mislead or confuse users. Always label cached content with an age indicator so users can judge its freshness themselves.

How is an LLM circuit breaker different from a standard circuit breaker?

A standard circuit breaker counts error responses. LLM providers degrade in latency before they fail in errors — your error rate stays at 0% while p95 latency climbs from 10 to 45 seconds. An LLM-aware circuit breaker counts timeout events in a longer sliding window (30 seconds is more useful than 10), opens after 3–5 timeout events, and half-opens with a short probe request (max_tokens: 50) to test recovery.

AI & LLMs | May 20, 2026 | 7 min read | Reviewed May 20, 2026

When the model fails: engineering graceful degradation into LLM-powered features

LLM APIs fail differently from every other dependency in your stack. Three failure modes require three different detection and fallback strategies.

By FlowVerify Editorial Team

Here's what a payment gateway timeout looks like: an HTTP connection fails at 2 seconds, you catch the exception, log the error, and show a user-friendly message. The failure is clean, binary, and your circuit breaker handles it.

Here's what an LLM API failure can look like: the connection opens. Streaming begins at second 7. Tokens arrive steadily until second 24, when the response finishes. HTTP 200. You parse the output and find 420 words of confident, fluent text about a topic adjacent to what the user asked.

No error code. No exception. Just a convincing wrong answer.

This is the challenge that makes graceful degradation for LLM features harder than for any other dependency in your stack. The failures are slow, partial, and semantically invisible. Standard distributed systems patterns (retry on 5xx, circuit break on error rate, validate the response schema) catch a fraction of what actually goes wrong.

Three failure modes, three different strategies

Latency failures. LLMs routinely take 5–30 seconds to respond. Providers queue requests, and responses slow down without the API ever returning a 429 or 503. If you apply a 2-second timeout (standard for most APIs), you'll trigger fallbacks on a significant fraction of legitimate requests. If you apply a 60-second timeout and stream nothing in the meantime, users assume the product is broken.

Format failures. The model returns text that doesn't match the structure your application expects. If you asked for JSON with specific keys and the model returned a prose paragraph instead, downstream parsing fails. These failures are detectable, but most teams only handle HTTP errors and never validate the response structure.

Semantic failures. The model returns something structurally valid but substantively wrong. Hallucinated data. An answer to a slightly different question. Internally contradictory output. These failures look like success until a user reads the result. At that point, trust in the feature drops, not just satisfaction with this one response.

Each failure type needs different detection logic and a different fallback strategy. Conflating them into a single 'AI error' handler is where most production issues start.

Two timeouts, not one

Setting a single request timeout for LLM calls is almost always wrong. Too short, and legitimate requests fail. Too long, and users wait 40 seconds before seeing a fallback screen.

The right design is two separate thresholds:

Time to first token (TTFT): 5–8 seconds. The time from sending the request to receiving the first streaming token. If 8 seconds pass with no token, either the request is queued behind a saturated pool or the connection has stalled. This is your early detection signal.

Completion timeout: 30–45 seconds. Once streaming starts, a long-form response might take another 20–30 seconds. That's normal. If you're still waiting 45 seconds after the first token arrived, the generation has stalled.

When the TTFT threshold fires, retry once with a lower max_tokens limit. This serves two purposes: a shorter request may be scheduled ahead of longer ones during a backlog, and you get a usable (if shorter) response sooner. If the completion timeout fires mid-stream, either show what you have, provided it is at least 60% complete and structurally valid, or fall back cleanly.
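The two thresholds can be enforced around any token stream. The following is a minimal sketch, assuming the provider SDK exposes the response as an async iterator of text chunks; `read_with_two_timeouts`, `TTFTTimeout`, and `CompletionTimeout` are hypothetical names introduced here for illustration, not part of any SDK.

```python
import asyncio
from typing import AsyncIterator, List


class TTFTTimeout(Exception):
    """No token arrived within the time-to-first-token threshold."""


class CompletionTimeout(Exception):
    """Streaming started but did not finish in time; carries the partial text."""

    def __init__(self, partial: str) -> None:
        super().__init__("completion timeout")
        self.partial = partial


async def read_with_two_timeouts(
    stream: AsyncIterator[str],
    ttft_seconds: float = 8.0,
    completion_seconds: float = 45.0,
) -> str:
    """Collect a token stream, enforcing separate TTFT and completion limits."""
    loop = asyncio.get_running_loop()
    iterator = stream.__aiter__()
    chunks: List[str] = []

    # Phase 1: wait for the first token only.
    try:
        first = await asyncio.wait_for(iterator.__anext__(), timeout=ttft_seconds)
    except asyncio.TimeoutError:
        raise TTFTTimeout() from None
    except StopAsyncIteration:
        return ""  # the stream ended without producing anything
    chunks.append(first)

    # Phase 2: the completion clock starts when the first token arrives.
    deadline = loop.time() + completion_seconds
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise CompletionTimeout("".join(chunks))
        try:
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=remaining)
        except asyncio.TimeoutError:
            raise CompletionTimeout("".join(chunks)) from None
        except StopAsyncIteration:
            return "".join(chunks)
        chunks.append(chunk)
```

A caller would catch `TTFTTimeout` to trigger the single retry with a lower max_tokens, and catch `CompletionTimeout` to decide whether `partial` is worth showing.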

Quality gates between the model and your users

HTTP 200 is not a quality signal. Build a validation layer between LLM output and your product. Three tiers, from cheapest to most thorough:

Tier 1: Structural validation. Does the response meet a minimum length? Does it contain expected sections or keys? Does the JSON parse? These checks take microseconds and catch 70–75% of production garbage.

Tier 2: Rule-based heuristics. Does the response start with a refusal phrase ("I'm sorry, I can't..." or "As an AI language model...")? Is it unusually short for a complex request? Does it mention entities that shouldn't appear? A small set of string-match rules catches another 10–15% of failures that tier 1 misses.

Tier 3: LLM-as-validator. Use a fast, cheap model with a binary prompt: "Is this response coherent, relevant, and complete? Answer YES or NO." Cost is roughly $0.001 per validation at current pricing. Latency is 500ms–1s. Use tier 3 only when tiers 1 and 2 both pass and showing a bad response carries high cost: customer-facing summaries, automated decisions, outputs that get acted on without further review.
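Tiers 1 and 2 need nothing beyond the standard library. A rough sketch follows, assuming the feature expects a JSON object; the function names, the refusal list, and the length cut-offs (20, 500, and 80 characters) are illustrative assumptions to tune per feature, not values from this article.

```python
import json
from typing import Optional, Set, Tuple

# Illustrative refusal openers; extend with phrases seen in your own logs.
REFUSAL_PREFIXES = (
    "i'm sorry",
    "i am sorry",
    "i can't",
    "i cannot",
    "as an ai language model",
)


def structural_check(raw: str, required_keys: Set[str], min_chars: int = 20) -> Optional[dict]:
    """Tier 1: return the parsed object, or None if the structure is wrong."""
    if len(raw.strip()) < min_chars:
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict) or not required_keys.issubset(parsed):
        return None
    return parsed


def heuristic_check(text: str, prompt: str, banned_entities: Tuple[str, ...] = ()) -> bool:
    """Tier 2: cheap string rules. True means the text passed."""
    # Normalise curly apostrophes so "I’m sorry" matches the prefix list.
    lowered = text.strip().lower().replace("\u2019", "'")
    if lowered.startswith(REFUSAL_PREFIXES):
        return False
    # A very short answer to a long (presumably complex) request is suspicious.
    if len(prompt) > 500 and len(lowered) < 80:
        return False
    return not any(entity.lower() in lowered for entity in banned_entities)
```

Running `structural_check` first keeps the string rules from ever seeing output that would fail parsing anyway.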

For most applications, tiers 1 and 2 are enough. Adding tier 3 to every request adds measurable cost and latency. Reserve it for features where a wrong answer is worse than a fallback state.
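The gating rule above (run tier 3 only after the cheap checks pass, and only for high-stakes output) might look like the sketch below. `ask_small_model` is a hypothetical callable that sends a prompt to whichever small model you use and returns its text; the real call depends on your provider SDK and is deliberately left out.

```python
from typing import Callable

VALIDATOR_PROMPT = (
    "Is this response coherent, relevant, and complete? Answer YES or NO.\n\n"
    "Question:\n{question}\n\nResponse:\n{response}"
)


def passes_llm_validator(
    question: str, response: str, ask_small_model: Callable[[str], str]
) -> bool:
    """Tier 3: binary verdict from a small model. Anything but YES fails closed."""
    verdict = ask_small_model(VALIDATOR_PROMPT.format(question=question, response=response))
    return verdict.strip().upper().startswith("YES")


def should_show(
    question: str,
    response: str,
    cheap_checks_passed: bool,
    high_stakes: bool,
    ask_small_model: Callable[[str], str],
) -> bool:
    """Decide whether a response reaches the user."""
    if not cheap_checks_passed:
        return False  # tiers 1 and 2 already rejected it; skip the paid call
    if not high_stakes:
        return True   # tiers 1 and 2 are enough for most features
    return passes_llm_validator(question, response, ask_small_model)
```

Treating an ambiguous verdict as a failure is a design choice: for the features that warrant tier 3, a fallback state is assumed to be cheaper than a wrong answer.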

Caching the last known-good response

For features that return information rather than unique generative output — summaries, status reports, document digests, personalised dashboards — caching the last known-good LLM response is the most reliable fallback you can build.

The distinction from performance caching is important: you're not caching to reduce latency. You're caching to ensure there's always a fallback value. The expiry policy follows from this, not from your latency SLO.

A one-hour-old summary of a user's account status is almost always better than a blank state or an error message. A 24-hour-old digest of a slowly changing document is usually still useful. The right expiry is tied to how often the underlying data changes meaningfully, not to how fast users expect responses.
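As a sketch of that expiry policy, the cache below stores only responses that passed validation and refuses to serve entries older than a per-feature maximum age. The class, the in-memory dict, and the age-label wording are assumptions for illustration; a production version would sit on whatever store you already run.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


def age_label(age_seconds: float) -> str:
    """Human-readable age indicator to show next to cached content."""
    if age_seconds < 60:
        return "just now"
    if age_seconds < 3600:
        return f"{int(age_seconds // 60)} min ago"
    if age_seconds < 86400:
        return f"{int(age_seconds // 3600)} h ago"
    return f"{int(age_seconds // 86400)} d ago"


@dataclass
class _Entry:
    text: str
    stored_at: float


class LastKnownGoodCache:
    """Fallback cache: expiry follows data freshness, not latency targets."""

    def __init__(self, max_age_seconds: float, clock: Callable[[], float] = time.time) -> None:
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, _Entry] = {}

    def store(self, key: str, text: str) -> None:
        """Call only after the response has passed the quality gate."""
        self._entries[key] = _Entry(text, self._clock())

    def fallback(self, key: str) -> Optional[Tuple[str, str]]:
        """Return (text, age label), or None if missing or too stale to show."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        age = self._clock() - entry.stored_at
        if age > self._max_age:
            return None
        return entry.text, age_label(age)
```

A `None` result is the signal to move on to the non-AI fallback instead of showing misleading content.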

Cache key design is harder for AI features than for deterministic queries. The input prompt may include the current time, user-specific context, or dynamic data that would make every call a cache miss. Normalise the key to stable inputs: user ID, document ID, and feature intent. Strip ephemeral context that changes per-request.
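One way to normalise the key, sketched under the assumption that each request is available as a dict: hash only the stable fields and ignore everything else. The field names here (`user_id`, `document_id`, `intent`) are hypothetical and should match your own request shape.

```python
import hashlib
import json

# Only these fields take part in the key; timestamps, session data and
# other per-request context are deliberately ignored.
STABLE_FIELDS = ("user_id", "document_id", "intent")


def fallback_cache_key(request: dict) -> str:
    """Build a deterministic cache key from the stable inputs of a request."""
    stable = {field: str(request[field]).strip().lower() for field in STABLE_FIELDS}
    payload = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two requests that differ only in, say, a `current_time` field then map to the same entry, while a different document or intent yields a different key.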

Circuit breakers that read LLM failure signals correctly

A standard circuit breaker counts errors in a time window and opens after a threshold. For LLM dependencies, error rate is the wrong signal.

LLM providers degrade before they fail. The first sign of a problem is usually increased latency, not errors. Your error rate stays at 0% while p95 response time climbs from 10 seconds to 45 seconds. By the time error rate rises, users have been experiencing degraded AI for 10–15 minutes.

An LLM-aware circuit breaker:

Counts timeout events rather than HTTP errors.
Uses a 30-second sliding window. LLM requests legitimately take 20+ seconds; a 10-second window misclassifies slow-but-valid responses as failures.
Opens after 3–5 timeout events in the window (adjust to your request volume and fallback cost).
Half-opens after 60 seconds, sending a probe with max_tokens: 50 to test recovery.
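The four rules above can be sketched as a small state machine. This is an illustrative, single-threaded version with hypothetical names; it uses the defaults from the list (30-second window, 3 timeouts, 60-second cooldown, 50-token probe) and does not limit how many probes run concurrently while half-open.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional

PROBE_MAX_TOKENS = 50


class LLMCircuitBreaker:
    """Opens on timeout events in a sliding window, not on HTTP errors."""

    def __init__(
        self,
        window_seconds: float = 30.0,
        timeout_threshold: int = 3,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._threshold = timeout_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._timeouts: Deque[float] = deque()
        self._opened_at: Optional[float] = None

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown:
            return "half_open"
        return "open"

    def next_request_max_tokens(self, normal_max_tokens: int) -> Optional[int]:
        """None means: skip the provider and serve the fallback."""
        current = self.state()
        if current == "open":
            return None
        return PROBE_MAX_TOKENS if current == "half_open" else normal_max_tokens

    def record_timeout(self) -> None:
        now = self._clock()
        current = self.state()
        if current == "open":
            return  # late timeouts from in-flight requests do not extend the cooldown
        if current == "half_open":
            self._opened_at = now  # probe failed: reopen and restart the cooldown
            return
        self._timeouts.append(now)
        while self._timeouts and now - self._timeouts[0] > self._window:
            self._timeouts.popleft()  # drop events that left the sliding window
        if len(self._timeouts) >= self._threshold:
            self._opened_at = now

    def record_success(self) -> None:
        if self.state() == "half_open":  # probe succeeded: close and reset
            self._opened_at = None
            self._timeouts.clear()
```

The caller asks `next_request_max_tokens` before each call, then reports the outcome with `record_timeout` or `record_success`.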

Track circuit open and close events as first-class metrics. They tell you the fraction of time your AI feature is actually degraded for users, a number that rarely appears in standard availability dashboards.
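If open and close events are logged with timestamps, the degraded fraction is simple arithmetic. A sketch, assuming events arrive in chronological order as `(timestamp, "open" | "close")` pairs; the event format is an assumption, and time spent half-open counts as degraded because the circuit has not closed yet.

```python
from typing import List, Optional, Tuple


def degraded_fraction(
    events: List[Tuple[float, str]], period_start: float, period_end: float
) -> float:
    """Fraction of [period_start, period_end] during which the circuit was not closed."""
    degraded = 0.0
    opened_at: Optional[float] = None
    for timestamp, kind in events:
        if kind == "open" and opened_at is None:
            # Clamp so an outage that began before the period is not over-counted.
            opened_at = max(timestamp, period_start)
        elif kind == "close" and opened_at is not None:
            degraded += max(0.0, min(timestamp, period_end) - opened_at)
            opened_at = None
    if opened_at is not None:
        # Circuit still open at the end of the period.
        degraded += max(0.0, period_end - opened_at)
    return degraded / (period_end - period_start)
```

For example, a circuit open from second 100 to 400 and again from second 900 onward in a 1,000-second period was degraded for 400 seconds, a fraction of 0.4.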

“LLM providers degrade in latency before they fail in errors. By the time your error rate rises, users have been experiencing a broken experience for 15 minutes.”

— FlowVerify Engineering

What the user sees when AI fails

The worst fallback state is a blank loading screen that eventually resolves to 'Something went wrong.' Close behind it: a spinner that never resolves.

Surface the fallback quickly. Once your timeout fires or quality gate rejects output, show the fallback state within 100ms of detection. Not at the full completion timeout — at the point of detection. Every second between detection and display is a second the user is actively confused.

Acknowledge the failure without technical detail. "We couldn't generate this right now" is more accurate and more calming than "AI error" or "Something went wrong." It sets an honest expectation without raising more questions than it answers.

Give a path forward. Show the cached version with a timestamp. Show the non-AI fallback. Give a retry button that actually retries. An empty state with no action is a dead end that makes the failure feel permanent.

Don't hide that AI is involved. If you silently serve cached or non-AI content without acknowledging it, users notice inconsistencies and distrust the feature rather than attributing the issue to a temporary outage. A small, dismissible banner — "AI features are limited right now" — is better than no signal.

Mapping failure modes to responses

Every LLM call needs a consistent fallback decision path. The table below maps each failure mode to how you detect it and what you do next.

Failure mode	How to detect	Primary fallback	What the user sees
TTFT timeout	No token within 5–8 s	Retry with lower max_tokens	Skeleton state, then result or fallback
Completion timeout	> 45 s after first token	Show partial if ≥ 60% complete	'Took too long — here's a partial result'
Format failure	Schema / parse check fails	Cached response or non-AI view	'We couldn't generate this right now'
Semantic garbage	Quality gate (tiers 1–2)	Cached response or non-AI view	Same as format failure
Provider degraded	Timeout rate in circuit breaker window	Cached response + degraded banner	'AI features are limited right now'
Provider down	Error rate + circuit open	Non-AI fallback + degraded banner	'AI features unavailable'

LLM failure modes and their handling
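The table translates fairly directly into a single decision function, which keeps every call site on the same path. The sketch below is one possible encoding with hypothetical names; falling back to the non-AI view when no cache entry exists is an assumption the table does not spell out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Failure(Enum):
    TTFT_TIMEOUT = "ttft_timeout"
    COMPLETION_TIMEOUT = "completion_timeout"
    FORMAT = "format"
    SEMANTIC = "semantic"
    PROVIDER_DEGRADED = "provider_degraded"
    PROVIDER_DOWN = "provider_down"


@dataclass(frozen=True)
class Fallback:
    action: str                   # "retry", "partial", "cached" or "non_ai"
    banner: Optional[str] = None  # degraded-state banner text, if any


def choose_fallback(
    failure: Failure,
    has_cache: bool,
    partial_fraction: float = 0.0,
    already_retried: bool = False,
) -> Fallback:
    """Map a detected failure mode to the primary fallback from the table."""
    cached_or_non_ai = "cached" if has_cache else "non_ai"
    if failure is Failure.TTFT_TIMEOUT and not already_retried:
        return Fallback("retry")  # retry once with a lower max_tokens
    if failure is Failure.COMPLETION_TIMEOUT and partial_fraction >= 0.6:
        return Fallback("partial")  # caller must also confirm structural validity
    if failure is Failure.PROVIDER_DEGRADED:
        return Fallback(cached_or_non_ai, "AI features are limited right now")
    if failure is Failure.PROVIDER_DOWN:
        return Fallback("non_ai", "AI features unavailable")
    # Format and semantic failures, exhausted retries, short partials.
    return Fallback(cached_or_non_ai)
```

Returning a small value object instead of rendering directly lets the UI layer decide how each action and banner is displayed.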

What to tune in your stack

The specific numbers — TTFT threshold, completion timeout, circuit breaker window, quality gate minimum length — depend on your model choice, typical request complexity, and the cost of each fallback. A feature calling Claude Opus for long-form analysis has different characteristics from one calling Haiku for short classifications.

What matters is that each threshold is set explicitly and revisited as your usage patterns change. The defaults that ship with most LLM SDKs are designed for interactive demos, not for production features that need to survive a provider degradation at 2 am.

Start with the quality gate — it requires no infrastructure and catches a large fraction of production problems. Add the TTFT/completion timeout split next. Build the LLM-aware circuit breaker once you have enough request volume to tune the window and threshold sensibly. The response cache and user messaging come last, but they're what users actually experience when everything else fails.

Frequently asked questions
