LLM costs at scale are a product design problem
Why the standard infra playbook hits a ceiling and where the real savings are
Your first LLM feature ships. Usage climbs. Then the bill arrives, and it is higher than your original back-of-envelope estimate. The standard advice is to optimise the infrastructure layer: quantise the model, add a KV cache, run a fast model first and escalate only on failure. That advice is correct, and you should do all of it. But most teams find that after the infra pass, the bill is still too high, and the next round of infra optimisation delivers sharply diminishing returns.
The reason is that the biggest cost drivers for most LLM-backed features are not infrastructure decisions at all. They are product decisions: what the feature asks the model to do, how frequently it does it, and how much context it attaches to each call. Optimising the infra layer without rethinking the product layer is like improving fuel economy while leaving the engine idling.
The standard infra playbook and where it ends
The infra moves that most teams run in the first cost-optimisation pass are well documented. Quantisation (reducing model weights from 16- or 32-bit precision to 8-bit or 4-bit) cuts inference cost by 50-70% with minimal quality loss for most production tasks. Speculative decoding and KV caching reduce latency and, for batched workloads, cut cost further. A model cascade (routing simple requests to a small, fast model and escalating only the complex ones to a larger model) can reduce large-model call volume by 40-65% depending on the task distribution.
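The cascade, for what it is worth, is not much code. A minimal sketch, assuming a small-model call that returns an answer plus a confidence score and a larger-model call to escalate to; call_small, call_large, and the threshold are placeholder names for illustration, not a specific API:

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against your own task distribution

def cascade(prompt: str, call_small, call_large) -> str:
    # Try the small, cheap model first; it returns (answer, confidence).
    answer, confidence = call_small(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    # Escalate only the requests the small model is unsure about.
    return call_large(prompt)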
These moves are real, and they add up. A team that has done quantisation plus a model cascade can reasonably expect to be 60-80% cheaper than they were at launch. The problem is that once you have run these passes, you have largely exhausted the infra lever. The remaining optimisations (further quantisation, different batching configurations, custom kernels) deliver 5-15% improvements at substantially higher implementation complexity.
The unit that actually matters: cost per user action
Most engineering teams track LLM cost in cost per thousand tokens. That is the right unit for infrastructure decisions; it lets you compare models, quantisation strategies, and batch sizes on an even footing. It is the wrong unit for product decisions.
For product decisions, the unit that matters is cost per user action: how much does it cost when a user does the thing your feature exists for? If your feature is document summarisation, the relevant cost is how much you spend on average every time a user clicks Summarise, not how many tokens that request used in isolation.
The reason cost per user action often diverges from naive token-counting is iteration. In most LLM-backed features, users do not trigger the feature once per session. They iterate. A user trying to produce a good contract summary might trigger it four or five times, each with a different prompt framing or a slightly modified document. Your cost per user action in this scenario is 4-5x your cost per single API call. If you built your unit-economics model on single-call costs, your projections will be off by the same factor.
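The gap is easy to make visible from your own call logs. A minimal sketch, assuming each logged call carries a cost figure and an identifier grouping the calls behind a single user action; both field names are illustrative:

from collections import defaultdict

# Each record is one API call; `action_id` groups the calls a user made
# while trying to complete a single action (field names are illustrative).
calls = [
    {"action_id": "a1", "cost_usd": 0.012},
    {"action_id": "a1", "cost_usd": 0.011},
    {"action_id": "a1", "cost_usd": 0.013},
    {"action_id": "a2", "cost_usd": 0.010},
]

cost_per_call = sum(c["cost_usd"] for c in calls) / len(calls)

per_action = defaultdict(float)
for c in calls:
    per_action[c["action_id"]] += c["cost_usd"]
cost_per_action = sum(per_action.values()) / len(per_action)

print(f"avg cost per call:   ${cost_per_call:.3f}")
print(f"avg cost per action: ${cost_per_action:.3f}")  # higher whenever users iterate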
Call shape: the product decision that sets your cost ceiling
Call shape describes three properties of how your product interacts with the model: what context you attach, how often you call, and when you decide to call at all.
Context size is the most commonly under-optimised of the three. Attaching the full conversation history to every call is the default behaviour in most LLM SDK wrappers. For a conversation running 20 turns, that can mean a 30,000-token context on every call. A rolling three-turn window preserves enough context for most conversational tasks at a fraction of the cost. For document tasks, attaching the full document when only the relevant section matters is the same pattern at higher stakes.
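The rolling window itself is a few lines. A sketch, assuming the usual list-of-messages format with role and content keys:

def rolling_window(messages: list[dict], turns: int = 3) -> list[dict]:
    # Keep any system prompt, plus only the last `turns` user/assistant exchanges.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-turns * 2:]  # one turn = user message + assistant reply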
Call frequency has the widest variance. If your LLM feature triggers on every keystroke to provide real-time suggestions, you are making orders of magnitude more calls than a feature that triggers on paragraph completion or explicit user request. Most teams set call frequency at MVP time based on what felt right in a prototype. By the time usage is material, nobody has revisited whether the trigger point still makes sense.
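One way to make the trigger point easy to revisit is to pull it out of the UI event handling and into an explicit predicate. A sketch of a paragraph-completion trigger; the threshold and the blank-line heuristic are illustrative, not a recommendation:

MIN_NEW_CHARS = 200  # illustrative: how much new text justifies another call

def should_trigger(buffer: str, chars_since_last_call: int) -> bool:
    # Fire on paragraph completion rather than on every keystroke: the user
    # has just finished a paragraph and has typed enough new text since the
    # previous call for another one to be worth its cost.
    ends_paragraph = buffer.endswith("\n\n")
    return ends_paragraph and chars_since_last_call >= MIN_NEW_CHARS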
Call decision (whether you call the model at all) is often the highest-impact of the three. A surprising fraction of the requests that flow through to your LLM could be handled by a much cheaper first pass: a regex match for structured input, a small embedding model to detect near-duplicate queries with cached answers, or a sub-$0.001 classifier to route trivial requests away from the expensive one. Teams that add a classification gate before their main model call typically find that 30-60% of requests never reach the expensive model.
Four product-layer levers
Once you have diagnosed your call shape, these four interventions are worth considering in rough order of implementation complexity.
Context budget cap. Set a maximum token budget for the context attached to each call, and enforce it. For conversational features, a rolling three-turn window is often sufficient. For document features, chunk and retrieve only the relevant sections rather than attaching the full document. A well-designed context budget can cut per-call cost by 30-60% without measurable quality degradation for most tasks.
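Enforcement can be a hard ceiling applied just before dispatch. A sketch, assuming the context arrives as chunks already ranked by relevance and that the caller supplies a tokeniser-backed count_tokens function; both are assumptions for illustration, not a specific library call:

MAX_CONTEXT_TOKENS = 4_000  # illustrative budget

def enforce_context_budget(chunks: list[str], count_tokens) -> list[str]:
    # Attach chunks in relevance order until the budget is exhausted, then stop.
    kept, used = [], 0
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if used + tokens > MAX_CONTEXT_TOKENS:
            break
        kept.append(chunk)
        used += tokens
    return kept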
Deferred and async calls. Move LLM calls out of the synchronous request path where the user does not notice latency. Async calls can be batched, rate-limited to cheaper off-peak windows, and, crucially, deduplicated before dispatch. If your system makes 100 calls in ten minutes and 20 of them are near-identical, an async queue with a deduplication pass eliminates those 20 calls before they are ever sent.
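The deduplication pass can be as simple as hashing a normalised form of each queued prompt and collapsing duplicates. A sketch:

import hashlib

def dedupe_pending(pending: list[dict]) -> list[dict]:
    # Collapse queued calls whose normalised prompt text is identical.
    # A looser notion of "same" (embedding similarity) catches more,
    # at the cost of running an embedding model over the queue.
    seen, unique = set(), []
    for call in pending:
        key = hashlib.sha256(call["prompt"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(call)
    return unique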
Classification gates. Add a cheap first-pass classifier before the expensive model call. The classifier can be a fast, small model, a learned embedding similarity check against a cache of previous responses, or a well-designed regex for structured inputs. The goal is to intercept requests that do not need the full model. For most production features, this gate eliminates 30-60% of expensive calls.
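A gate does not need to be sophisticated to pay for itself. A sketch of a two-stage version, assuming a cache of previous answers keyed by normalised query text and a caller-supplied cheap classify function; both are assumptions for illustration:

def needs_expensive_model(query: str, cache: dict[str, str], classify) -> bool:
    # Stage 1: a normalised cache hit means no model call at all.
    key = " ".join(query.lower().split())
    if key in cache:
        return False
    # Stage 2: a cheap classifier (small model, embedding check, or regex)
    # decides whether the request actually needs the primary model.
    return classify(query) == "complex"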
Graceful degradation tiers. Define a cost-aware fallback sequence before you need it. When a per-user or per-day budget threshold is approached, route to a lighter model or truncated context rather than cutting the feature off entirely. Users tolerate quality variation better than they tolerate feature absence, and a defined degradation sequence means cost spikes do not become incidents.
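The fallback sequence is worth writing down as data rather than leaving it implicit in routing code, so the whole sequence is reviewable in one place. One possible shape, with illustrative model names and thresholds:

# Ordered from most to least expensive; the router walks down this list
# until it finds a tier the user's remaining budget can afford.
DEGRADATION_TIERS = [
    {"name": "full",    "model": "primary",    "max_context_tokens": 16_000, "min_budget": 10_000},
    {"name": "reduced", "model": "primary",    "max_context_tokens": 4_000,  "min_budget": 3_000},
    {"name": "light",   "model": "fast-small", "max_context_tokens": 2_000,  "min_budget": 0},
]

def select_tier(budget_remaining: int) -> dict:
    for tier in DEGRADATION_TIERS:
        if budget_remaining >= tier["min_budget"]:
            return tier
    return DEGRADATION_TIERS[-1]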
| Approach | Typical savings | Implementation | Hard ceiling? |
|---|---|---|---|
| Quantisation (4-bit/8-bit) | 50-70% | Low | Yes; quality degrades past this point |
| KV cache + speculative decoding | 15-30% | Low-medium | Yes |
| Model cascade | 40-65% on large-model calls | Medium | Depends on task mix |
| Context budget cap | 30-60% per call | Low | No |
| Deferred / async calls | 20-40% via batching | Medium | No |
| Classification gate | 30-60% on expensive calls | Medium | No |
| Call frequency redesign | 10-100x reduction | High (UX rethink) | No |
Feature gates under cost pressure
When per-user cost approaches a threshold, the instinct is to silently degrade: add a spinner, switch to a worse model, return a truncated result. Silent degradation is the wrong move. It produces inconsistent user experience without giving users any control over the trade-off.
The better approach is explicit tiering. Give users visibility into their remaining usage budget and an explicit choice about how to spend it. For a 'generate analysis' button, if a user has exhausted their monthly allocation, surface a lightweight summary by default with a clear 'full analysis' option that counts against their quota or prompts an upgrade. Users who need the expensive path will use it. Users who do not will accept the lighter version, and you have turned a degradation problem into a product design decision.
The implementation is a cost-aware routing function that checks the remaining budget before selecting which model and how much context to use:
def route_llm_call(user_id: str, context: str, task: str) -> str:
    # get_budget_remaining, estimate_tokens, truncate_to_budget, and call_model
    # are application-level helpers assumed to exist elsewhere in the codebase.
    budget_remaining = get_budget_remaining(user_id)  # tokens left this period
    estimated_tokens = estimate_tokens(context)
    if budget_remaining < estimated_tokens * BUDGET_SAFETY_FACTOR:
        # Route to the cheaper model with truncated context
        return call_model(
            model="fast-small",
            context=truncate_to_budget(context, budget_remaining),
            task=task,
        )
    return call_model(model="primary", context=context, task=task)

What to measure
Four metrics that tell you whether the product-layer work is having an effect:
Cost per user action (not per token). If this is trending down, the product-layer work is working. If it is flat despite infra changes, you have not moved the product call pattern yet.
Classification gate intercept rate. The percentage of incoming requests handled by the cheap first pass rather than the expensive model. A healthy range is 30-60%. Below 20% suggests the gate is too narrow; above 80% may indicate the expensive model is not well matched to the feature's complexity distribution.
P95 cost per user per day. Median cost understates the problem because LLM cost has high variance; power users drive disproportionate spend. P95 shows whether you have a structural issue or a tail issue.
Budget usage per tier. What fraction of users in each pricing tier hit their cost ceiling each billing period? If it is low, your tier structure has room to expand. If it is consistently high, you have pricing pressure building.
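For the P95 figure, a nearest-rank percentile over per-user daily spend is enough for a dashboard and needs nothing beyond the standard library. A minimal sketch with illustrative numbers:

import math

def p95(values: list[float]) -> float:
    # Nearest-rank percentile: crude but dependency-free.
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

daily_cost_per_user = [0.02, 0.03, 0.05, 0.04, 0.90, 0.03, 0.06]  # illustrative
print(f"p95 daily cost per user: ${p95(daily_cost_per_user):.2f}")  # the tail the median hides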
The infra work is necessary but not sufficient
Run the infra pass. Do the quantisation, the caching, the model cascade. These matter, and skipping them is leaving real savings on the table.
But if the bill is still material after that pass, the answer is not more infra optimisation. It is a specific question about the product: what is the feature asking the model to do, how often, and is that actually what the user needs to accomplish the task? The cost ceiling for most LLM-backed features is set by product decisions that predate the cost conversation. Moving it requires revisiting those decisions. Cost engineering is also, unavoidably, a product design conversation.