When the model fails: engineering graceful degradation into LLM-powered features
LLM APIs fail differently from every other dependency in your stack. Three failure modes require three different detection and fallback strategies.
Here's what a payment gateway timeout looks like: an HTTP connection fails at 2 seconds, you catch the exception, log the error, and show a user-friendly message. The failure is clean, binary, and your circuit breaker handles it.
Here's what an LLM API failure looks like: the connection opens. Streaming begins at second 7. Tokens arrive steadily until second 24, when the response finishes. HTTP 200. You parse the output and find 420 words of confident, fluent text about a topic adjacent to what the user asked.
No error code. No exception. Just a convincing wrong answer.
This is the challenge that makes graceful degradation for LLM features harder than for any other dependency in your stack. The failures are slow, partial, and semantically invisible. Standard distributed systems patterns (retry on 5xx, circuit break on error rate, validate the response schema) catch a fraction of what actually goes wrong.
Three failure modes, three different strategies
Latency failures. LLMs routinely take 5–30 seconds to respond. Providers queue requests, and models slow down without emitting a 429 or 503. If you apply a 2-second timeout (standard for most APIs), you'll trigger fallbacks on a significant fraction of legitimate requests. If you apply a 60-second timeout and stream nothing in the meantime, users assume the product is broken.
Format failures. The model returns text that doesn't match the structure your application expects. If you asked for JSON with specific keys and the model returned a prose paragraph instead, downstream parsing fails. These failures are detectable, but most teams only handle HTTP errors and never validate the response structure.
Semantic failures. The model returns something structurally valid but substantively wrong. Hallucinated data. An answer to a slightly different question. Internally contradictory output. These failures look like success until a user reads the result. At that point, trust in the feature drops, not just satisfaction with this one response.
Each failure type needs different detection logic and a different fallback strategy. Conflating them into a single 'AI error' handler is where most production issues start.
Two timeouts, not one
Setting a single request timeout for LLM calls is almost always wrong. Too short, and legitimate requests fail. Too long, and users wait 40 seconds before seeing a fallback screen.
The right design is two separate thresholds:
Time to first token (TTFT): 5–8 seconds. The time from sending the request to receiving the first streaming token. If 8 seconds pass with no response, either the request is queued behind a saturated pool or the connection has stalled. This is your early detection signal.
Completion timeout: 30–45 seconds. Once streaming starts, a long-form response might take another 20–30 seconds. That's normal. If you're still waiting 45 seconds after the first token arrived, the generation has stalled.
When the TTFT threshold fires, retry once with a lower max_tokens limit. This serves two purposes: a shorter request may clear a backlogged queue sooner (how providers schedule requests varies), and a shorter response finishes faster. If the completion timeout fires mid-stream, decide whether you have enough of the response to show (60%+ complete and structurally valid) or should fall back cleanly.
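The two thresholds can be sketched in Python with `asyncio`. This is a minimal illustration, assuming the provider SDK exposes the response as an async iterator of text chunks. The names `read_with_two_timeouts`, `TTFTTimeout`, `CompletionTimeout`, `generate`, and the `start_stream(max_tokens)` callable are hypothetical and not part of any real SDK.

```python
import asyncio
from typing import AsyncIterator, Callable


class TTFTTimeout(Exception):
    """No token arrived within the time-to-first-token budget."""


class CompletionTimeout(Exception):
    """Streaming started but did not finish within the completion budget."""

    def __init__(self, partial: str) -> None:
        super().__init__("completion timeout")
        self.partial = partial  # text received before the stall


async def read_with_two_timeouts(
    stream: AsyncIterator[str],
    ttft_s: float = 8.0,
    completion_s: float = 45.0,
) -> str:
    """Collect a streamed response under separate TTFT and completion budgets."""
    it = stream.__aiter__()
    try:
        first = await asyncio.wait_for(it.__anext__(), timeout=ttft_s)
    except asyncio.TimeoutError:
        raise TTFTTimeout("no token within TTFT budget") from None
    except StopAsyncIteration:
        return ""  # stream ended without producing any text

    chunks = [first]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + completion_s  # clock starts at the first token
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            raise CompletionTimeout("".join(chunks))
        try:
            chunks.append(await asyncio.wait_for(it.__anext__(), timeout=remaining))
        except StopAsyncIteration:
            return "".join(chunks)
        except asyncio.TimeoutError:
            raise CompletionTimeout("".join(chunks)) from None


async def generate(
    start_stream: Callable[[int], AsyncIterator[str]],  # hypothetical SDK wrapper
    max_tokens: int = 1024,
    retry_max_tokens: int = 256,
    ttft_s: float = 8.0,
    completion_s: float = 45.0,
) -> str:
    """Retry once with a lower max_tokens when the TTFT threshold fires."""
    try:
        return await read_with_two_timeouts(start_stream(max_tokens), ttft_s, completion_s)
    except TTFTTimeout:
        return await read_with_two_timeouts(
            start_stream(retry_max_tokens), ttft_s, completion_s
        )
```

`CompletionTimeout` carries the text received so far, so the caller can decide whether the partial response is worth showing. A second `TTFTTimeout` from the retry propagates to the caller, which is where the fallback path would take over.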
Quality gates between the model and your users
HTTP 200 is not a quality signal. Build a validation layer between LLM output and your product. Three tiers, from cheapest to most thorough:
Tier 1: Structural validation. Does the response meet a minimum length? Does it contain expected sections or keys? Does the JSON parse? These checks take microseconds and catch 70–75% of production garbage.
Tier 2: Rule-based heuristics. Does the response start with a refusal phrase? ("I'm sorry, I can't..." or "As an AI language model...") Is it unusually short for a complex request? Does it mention entities that shouldn't appear? A small set of string-match rules catches another 10–15% of failures that tier 1 misses.
Tier 3: LLM-as-validator. Use a fast, cheap model with a binary prompt: "Is this response coherent, relevant, and complete? Answer YES or NO." Cost is roughly $0.001 per validation at current pricing. Latency is 500ms–1s. Use tier 3 only when tiers 1 and 2 both pass and showing a bad response carries high cost: customer-facing summaries, automated decisions, outputs that get acted on without further review.
For most applications, tiers 1 and 2 are enough. Adding tier 3 to every request adds measurable cost and latency. Reserve it for features where a wrong answer is worse than a fallback state.
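Tiers 1 and 2 might look like the following in Python. This is a sketch under assumptions: the feature expects a JSON object, one field holds the user-facing text, and the function names, failure-reason strings, and `min_chars` default are illustrative choices rather than an established API.

```python
import json
import re
from typing import Optional, Sequence

# Tier 2 patterns: refusal phrases at the start of the answer.
REFUSAL_PATTERNS = [
    re.compile(r"^\s*i['’]m sorry,? (?:but )?i can(?:not|['’]t)", re.IGNORECASE),
    re.compile(r"^\s*as an ai language model", re.IGNORECASE),
]


def structural_check(
    raw: str, required_keys: Sequence[str], min_chars: int = 20
) -> Optional[str]:
    """Tier 1: return a failure reason, or None if the structure looks right."""
    if len(raw.strip()) < min_chars:
        return "too_short"
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(data, dict):
        return "not_an_object"
    missing = [key for key in required_keys if key not in data]
    if missing:
        return "missing_keys:" + ",".join(missing)
    return None


def heuristic_check(
    answer: str, banned_entities: Sequence[str] = (), min_answer_chars: int = 0
) -> Optional[str]:
    """Tier 2: cheap string rules applied to the answer text."""
    if any(pattern.search(answer) for pattern in REFUSAL_PATTERNS):
        return "refusal"
    if len(answer.strip()) < min_answer_chars:
        return "suspiciously_short"
    lowered = answer.lower()
    for entity in banned_entities:
        if entity.lower() in lowered:
            return "banned_entity:" + entity
    return None


def quality_gate(
    raw: str,
    required_keys: Sequence[str],
    text_key: str,
    banned_entities: Sequence[str] = (),
    min_answer_chars: int = 0,
) -> Optional[str]:
    """Run tier 1, then tier 2 on one text field. None means the output passed."""
    reason = structural_check(raw, required_keys)
    if reason is not None:
        return reason
    answer = json.loads(raw).get(text_key)
    if not isinstance(answer, str):
        return "text_field_not_string"
    return heuristic_check(answer, banned_entities, min_answer_chars)
```

Returning a reason string instead of a boolean lets you count failures per rule, which shows which checks are doing the work before you decide whether a tier 3 validator is needed.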
Caching the last known-good response
For features that return information rather than unique generative output — summaries, status reports, document digests, personalised dashboards — caching the last known-good LLM response is the most reliable fallback you can build.
The distinction from performance caching is important: you're not caching to reduce latency. You're caching to ensure there's always a fallback value. The expiry policy follows from this, not from your latency SLO.
A one-hour-old summary of a user's account status is almost always better than a blank state or an error message. A 24-hour-old digest of a slowly changing document is usually still useful. The right expiry is tied to how often the underlying data changes meaningfully, not to how fast users expect responses.
Cache key design is harder for AI features than for deterministic queries. The input prompt may include the current time, user-specific context, or dynamic data that would make every call a cache miss. Normalise the key to stable inputs: user ID, document ID, and feature intent. Strip ephemeral context that changes per-request.
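A last-known-good cache with a normalised key could be sketched like this. It is an in-memory illustration only: `cache_key`, `LastKnownGoodCache`, and `CachedResponse` are hypothetical names, and a production version would sit on whatever store you already run.

```python
import hashlib
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional


def cache_key(user_id: str, document_id: str, intent: str) -> str:
    """Build the key from stable inputs only; timestamps and per-request
    context are deliberately left out so repeated calls hit the same entry."""
    raw = "\x1f".join([user_id, document_id, intent])  # unit separator avoids collisions
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CachedResponse:
    text: str
    stored_at: float  # lets the UI show "as of <timestamp>"


class LastKnownGoodCache:
    """Holds the most recent response that passed the quality gate."""

    def __init__(self, max_age_s: float, clock: Callable[[], float] = time.time) -> None:
        self._max_age_s = max_age_s  # tied to how fast the underlying data changes
        self._clock = clock
        self._entries: Dict[str, CachedResponse] = {}

    def store(self, key: str, text: str) -> None:
        """Call only after the response has passed validation."""
        self._entries[key] = CachedResponse(text=text, stored_at=self._clock())

    def fallback(self, key: str) -> Optional[CachedResponse]:
        """Return the last good response, or None if absent or too stale."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        if self._clock() - entry.stored_at > self._max_age_s:
            return None
        return entry
```

The cache is written on every validated success and read only on failure, so `max_age_s` expresses how stale a fallback may be, not how fresh a normal response must be.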
Circuit breakers that read LLM failure signals correctly
A standard circuit breaker counts errors in a time window and opens after a threshold. For LLM dependencies, error rate is the wrong signal.
LLM providers degrade before they fail. The first sign of a problem is usually increased latency, not errors. Your error rate stays at 0% while p95 response time climbs from 10 seconds to 45 seconds. By the time error rate rises, users have been experiencing degraded AI for 10–15 minutes.
An LLM-aware circuit breaker:
- Counts timeout events rather than HTTP errors.
- Uses a 30-second sliding window. LLM requests legitimately take 20+ seconds; a 10-second window misclassifies slow-but-valid responses as failures.
- Opens after 3–5 timeout events in the window (adjust to your request volume and fallback cost).
- Half-opens after 60 seconds, sending a probe with max_tokens: 50 to test recovery.
Track circuit open and close events as first-class metrics. They tell you the fraction of time your AI feature is actually degraded for users, a number that rarely appears in standard availability dashboards.
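The breaker described above can be sketched in a few dozen lines of Python. The class name `LLMCircuitBreaker`, its method names, and the event counters are assumptions made for illustration. The defaults mirror the numbers in the list.

```python
import time
from collections import deque
from typing import Callable, Deque, Optional


class LLMCircuitBreaker:
    """Opens on timeout events in a sliding window, not on HTTP error rate."""

    def __init__(
        self,
        window_s: float = 30.0,
        max_timeouts: int = 3,
        cooldown_s: float = 60.0,
        probe_max_tokens: int = 50,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.window_s = window_s
        self.max_timeouts = max_timeouts
        self.cooldown_s = cooldown_s
        self.probe_max_tokens = probe_max_tokens
        self._clock = clock
        self._timeouts: Deque[float] = deque()
        self._opened_at: Optional[float] = None
        self.open_events = 0   # export these two counters as metrics
        self.close_events = 0

    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.cooldown_s:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        return self.state() != "open"

    def max_tokens_for(self, requested: int) -> int:
        """Cap the request to a small probe while half-open."""
        if self.state() == "half_open":
            return min(requested, self.probe_max_tokens)
        return requested

    def record_timeout(self) -> None:
        now = self._clock()
        if self.state() == "half_open":
            self._open(now)  # probe failed: restart the cooldown
            return
        self._timeouts.append(now)
        while self._timeouts and now - self._timeouts[0] > self.window_s:
            self._timeouts.popleft()  # drop events that left the window
        if self._opened_at is None and len(self._timeouts) >= self.max_timeouts:
            self._open(now)

    def record_success(self) -> None:
        if self.state() == "half_open":
            self._opened_at = None  # probe succeeded: close the circuit
            self._timeouts.clear()
            self.close_events += 1

    def _open(self, now: float) -> None:
        self._opened_at = now
        self._timeouts.clear()
        self.open_events += 1
```

This sketch lets every request through while half-open, each capped at the probe size. A stricter version would admit a single probe and hold the rest. A failed probe counts as a new open event here, which is one of several reasonable ways to count.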
What the user sees when AI fails
The worst fallback state is a blank loading screen that eventually resolves to 'Something went wrong.' Close behind it: a spinner that never resolves.
Surface the fallback quickly. Once your timeout fires or quality gate rejects output, show the fallback state within 100ms of detection. Not at the full completion timeout — at the point of detection. Every second between detection and display is a second the user is actively confused.
Acknowledge the failure without technical detail. "We couldn't generate this right now" is more accurate and more calming than "AI error" or "Something went wrong." It sets an honest expectation without raising more questions than it answers.
Give a path forward. Show the cached version with a timestamp. Show the non-AI fallback. Give a retry button that actually retries. An empty state with no action is a dead end that makes the failure feel permanent.
Don't hide that AI is involved. If you silently serve cached or non-AI content without acknowledging it, users notice inconsistencies and distrust the feature rather than attributing the issue to a temporary outage. A small, dismissible banner — "AI features are limited right now" — is better than no signal.
Mapping failure modes to responses
Every LLM call needs a consistent fallback decision path. The table below maps each failure mode to how you detect it and what you do next.
| Failure mode | How to detect | Primary fallback | What the user sees |
|---|---|---|---|
| TTFT timeout | No token within 5–8 s | Retry with lower max_tokens | Skeleton state, then result or fallback |
| Completion timeout | > 45 s after first token | Show partial if ≥ 60% complete | 'Took too long — here's a partial result' |
| Format failure | Schema / parse check fails | Cached response or non-AI view | 'We couldn't generate this right now' |
| Semantic garbage | Quality gate (tiers 1–2) | Cached response or non-AI view | Same as format failure |
| Provider degraded | Timeout rate in circuit breaker window | Cached response + degraded banner | 'AI features are limited right now' |
| Provider down | Error rate + circuit open | Non-AI fallback + degraded banner | 'AI features unavailable' |
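One way to keep the decision path consistent is to encode the table as a single function that every LLM call site uses. The sketch below is illustrative: `FailureMode`, `FallbackPlan`, `plan_fallback`, the action names, and the message keys are all hypothetical. Two behaviours are assumptions the table does not spell out: a completion timeout with less than 60% of the response falls back to the cache or non-AI view, and a degraded provider with no cached entry falls back to the non-AI view.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    TTFT_TIMEOUT = "ttft_timeout"
    COMPLETION_TIMEOUT = "completion_timeout"
    FORMAT_FAILURE = "format_failure"
    SEMANTIC_GARBAGE = "semantic_garbage"
    PROVIDER_DEGRADED = "provider_degraded"
    PROVIDER_DOWN = "provider_down"


@dataclass(frozen=True)
class FallbackPlan:
    action: str               # what the backend does next
    message_key: str          # key for the user-facing copy, resolved by the UI
    degraded_banner: bool = False


def plan_fallback(
    mode: FailureMode, partial_fraction: float = 0.0, has_cache: bool = False
) -> FallbackPlan:
    """Map a detected failure mode to one fallback decision."""
    cached_or_plain = "show_cached" if has_cache else "show_non_ai_view"

    if mode is FailureMode.TTFT_TIMEOUT:
        return FallbackPlan("retry_with_lower_max_tokens", "skeleton")
    if mode is FailureMode.COMPLETION_TIMEOUT:
        if partial_fraction >= 0.6:
            return FallbackPlan("show_partial", "partial_result")
        # Assumption: too little to show, so treat it like a failed generation.
        return FallbackPlan(cached_or_plain, "generation_failed")
    if mode in (FailureMode.FORMAT_FAILURE, FailureMode.SEMANTIC_GARBAGE):
        return FallbackPlan(cached_or_plain, "generation_failed")
    if mode is FailureMode.PROVIDER_DEGRADED:
        # Assumption: without a cached entry, use the non-AI view.
        return FallbackPlan(cached_or_plain, "ai_limited", degraded_banner=True)
    return FallbackPlan("show_non_ai_view", "ai_unavailable", degraded_banner=True)
```

Returning message keys rather than literal copy keeps the wording in one place, so the UI can change what users read without touching the decision logic.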
What to tune in your stack
The specific numbers (TTFT threshold, completion timeout, circuit breaker window, quality gate minimum length) depend on your model choice, typical request complexity, and the cost of each fallback. A feature calling Claude Opus for long-form analysis has different characteristics from one calling Claude Haiku for short classifications.
What matters is that each threshold is set explicitly and revisited as your usage patterns change. The defaults that ship with most LLM SDKs are designed for interactive demos, not for production features that need to survive a provider degradation at 2 am.
Start with the quality gate — it requires no infrastructure and catches a large fraction of production problems. Add the TTFT/completion timeout split next. Build the LLM-aware circuit breaker once you have enough request volume to tune the window and threshold sensibly. The response cache and user messaging come last, but they're what users actually experience when everything else fails.