LLM costs at scale are a product design problem
Why the standard infra playbook hits a ceiling and where the real savings are
Your first LLM feature ships. Usage climbs. Then the bill arrives, and it is higher than your original back-of-envelope estimate. The standard advice is to optimise the infrastructure layer: quantise the model, add a KV cache, run a fast model first and escalate only on failure. That advice is correct, and you should do all of it. But most teams find that after the infra pass, the bill is still too high, and the next round of infra optimisation delivers sharply diminishing returns.
The reason is that the biggest cost drivers for most LLM-backed features are not infrastructure decisions at all. They are product decisions: what the feature asks the model to do, how frequently it does it, and how much context it attaches to each call. Optimising the infra layer without rethinking the product layer is like improving fuel economy while leaving the engine idling.
The standard infra playbook and where it ends
The infra moves that most teams run in the first cost-optimisation pass are well documented. Quantisation (reducing model weights from 16- or 32-bit precision to 8-bit or 4-bit) cuts inference cost by 50-70% with minimal quality loss for most production tasks. Speculative decoding and KV caching reduce latency and, for batched workloads, cut cost further. A model cascade (routing simple requests to a small, fast model and escalating only the complex ones to a larger model) can reduce large-model call volume by 40-65% depending on the task distribution.
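The cascade, for what it is worth, is not much code. A minimal sketch, assuming a small-model call that returns an answer plus a confidence score and a larger-model call to escalate to; call_small, call_large, and the threshold are placeholder names for illustration, not a specific API:

CONFIDENCE_THRESHOLD = 0.8  # illustrative; tune against your own task distribution

def cascade(prompt: str, call_small, call_large) -> str:
    # Try the small, cheap model first; it returns (answer, confidence).
    answer, confidence = call_small(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return answer
    # Escalate only the requests the small model is unsure about.
    return call_large(prompt)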
These moves are real, and they add up. A team that has done quantisation plus a model cascade can reasonably expect to be 60-80% cheaper than they were at launch. The problem is that once you have run these passes, you have largely exhausted the infra lever. The remaining optimisations (further quantisation, different batching configurations, custom kernels) deliver 5-15% improvements at substantially higher implementation complexity.
The unit that actually matters: cost per user action
Most engineering teams track LLM cost in cost per thousand tokens. That is the right unit for infrastructure decisions; it lets you compare models, quantisation strategies, and batch sizes on an even footing. It is the wrong unit for product decisions.
For product decisions, the unit that matters is cost per user action: how much does it cost when a user does the thing your feature exists for? If your feature is document summarisation, the relevant cost is how much you spend on average every time a user clicks Summarise, not how many tokens that request used in isolation.
The reason cost per user action often diverges from naive token-counting is iteration. In most LLM-backed features, users do not trigger the feature once per session. They iterate. A user trying to produce a good contract summary might trigger it four or five times, each with a different prompt framing or a slightly modified document. Your cost per user action in this scenario is 4-5x your cost per single API call. If you built your unit-economics model on single-call costs, your projections will be off by the same factor.
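The gap is easy to make visible from your own call logs. A minimal sketch, assuming each logged call carries a cost figure and an identifier grouping the calls behind a single user action; both field names are illustrative:

from collections import defaultdict

# Each record is one API call; `action_id` groups the calls a user made
# while trying to complete a single action (field names are illustrative).
calls = [
    {"action_id": "a1", "cost_usd": 0.012},
    {"action_id": "a1", "cost_usd": 0.011},
    {"action_id": "a1", "cost_usd": 0.013},
    {"action_id": "a2", "cost_usd": 0.010},
]

cost_per_call = sum(c["cost_usd"] for c in calls) / len(calls)

per_action = defaultdict(float)
for c in calls:
    per_action[c["action_id"]] += c["cost_usd"]
cost_per_action = sum(per_action.values()) / len(per_action)

print(f"avg cost per call:   ${cost_per_call:.3f}")
print(f"avg cost per action: ${cost_per_action:.3f}")  # higher whenever users iterate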
Call shape: the product decision that sets your cost ceiling
Call shape describes three properties of how your product interacts with the model: what context you attach, how often you call, and when you decide to call at all.
Context size is the most commonly under-optimised of the three. Attaching the full conversation history to every call is the default behaviour in most LLM SDK wrappers. For a conversation running 20 turns, that can mean a 30,000-token context on every call. A rolling three-turn window preserves enough context for most conversational tasks at a fraction of the cost. For document tasks, attaching the full document when only the relevant section matters is the same pattern at higher stakes.
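The rolling window itself is a few lines. A sketch, assuming the usual list-of-messages format with role and content keys:

def rolling_window(messages: list[dict], turns: int = 3) -> list[dict]:
    # Keep any system prompt, plus only the last `turns` user/assistant exchanges.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-turns * 2:]  # one turn = user message + assistant reply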
Call frequency has the widest variance. If your LLM feature triggers on every keystroke to provide real-time suggestions, you are making orders of magnitude more calls than a feature that triggers on paragraph completion or explicit user request. Most teams set call frequency at MVP time based on what felt right in a prototype. By the time usage is material, nobody has revisited whether the trigger point still makes sense.
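One way to make the trigger point easy to revisit is to pull it out of the UI event handling and into an explicit predicate. A sketch of a paragraph-completion trigger; the threshold and the blank-line heuristic are illustrative, not a recommendation:

MIN_NEW_CHARS = 200  # illustrative: how much new text justifies another call

def should_trigger(buffer: str, chars_since_last_call: int) -> bool:
    # Fire on paragraph completion rather than on every keystroke: the user
    # has just finished a paragraph and has typed enough new text since the
    # previous call for another one to be worth its cost.
    ends_paragraph = buffer.endswith("\n\n")
    return ends_paragraph and chars_since_last_call >= MIN_NEW_CHARS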
Call decision (whether you call the model at all) is often the highest-impact of the three. A surprising fraction of the requests that flow through to your LLM could be handled by a much cheaper first pass: a regex match for structured input, a small embedding model to detect near-duplicate queries with cached answers, or a sub-$0.001 classifier to route trivial requests away from the expensive one. Teams that add a classification gate before their main model call typically find that 30-60% of requests never reach the expensive model.
Four product-layer levers
Once you have diagnosed your call shape, these four interventions are worth considering in rough order of implementation complexity.
Context budget cap. Set a maximum token budget for the context attached to each call, and enforce it. For conversational features, a rolling three-turn window is often sufficient. For document features, chunk and retrieve only the relevant sections rather than attaching the full document. A well-designed context budget can cut per-call cost by 30-60% without measurable quality degradation for most tasks.
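Enforcement can be a hard ceiling applied just before dispatch. A sketch, assuming the context arrives as chunks already ranked by relevance and that the caller supplies a tokeniser-backed count_tokens function; both are assumptions for illustration, not a specific library call:

MAX_CONTEXT_TOKENS = 4_000  # illustrative budget

def enforce_context_budget(chunks: list[str], count_tokens) -> list[str]:
    # Attach chunks in relevance order until the budget is exhausted, then stop.
    kept, used = [], 0
    for chunk in chunks:
        tokens = count_tokens(chunk)
        if used + tokens > MAX_CONTEXT_TOKENS:
            break
        kept.append(chunk)
        used += tokens
    return kept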
Deferred and async calls. Move LLM calls out of the synchronous request path where the user does not notice latency. Async calls can be batched, rate-limited to cheaper off-peak windows, and, crucially, deduplicated before dispatch. If your system makes 100 calls in ten minutes and 20 of them are near-identical, an async queue with a deduplication pass eliminates those 20 calls before they are ever sent.
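The deduplication pass can be as simple as hashing a normalised form of each queued prompt and collapsing duplicates. A sketch:

import hashlib

def dedupe_pending(pending: list[dict]) -> list[dict]:
    # Collapse queued calls whose normalised prompt text is identical.
    # A looser notion of "same" (embedding similarity) catches more,
    # at the cost of running an embedding model over the queue.
    seen, unique = set(), []
    for call in pending:
        key = hashlib.sha256(call["prompt"].strip().lower().encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(call)
    return unique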
Classification gates. Add a cheap first-pass classifier before the expensive model call. The classifier can be a fast, small model, a learned embedding similarity check against a cache of previous responses, or a well-designed regex for structured inputs. The goal is to intercept requests that do not need the full model. For most production features, this gate eliminates 30-60% of expensive calls.
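A gate does not need to be sophisticated to pay for itself. A sketch of a two-stage version, assuming a cache of previous answers keyed by normalised query text and a caller-supplied cheap classify function; both are assumptions for illustration:

def needs_expensive_model(query: str, cache: dict[str, str], classify) -> bool:
    # Stage 1: a normalised cache hit means no model call at all.
    key = " ".join(query.lower().split())
    if key in cache:
        return False
    # Stage 2: a cheap classifier (small model, embedding check, or regex)
    # decides whether the request actually needs the primary model.
    return classify(query) == "complex"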
Graceful degradation tiers. Define a cost-aware fallback sequence before you need it. When a per-user or per-day budget threshold is approached, route to a lighter model or truncated context rather than cutting the feature off entirely. Users tolerate quality variation better than they tolerate feature absence, and a defined degradation sequence means cost spikes do not become incidents.
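The fallback sequence is worth writing down as data rather than leaving it implicit in routing code, so the whole sequence is reviewable in one place. One possible shape, with illustrative model names and thresholds:

# Ordered from most to least expensive; the router walks down this list
# until it finds a tier the user's remaining budget can afford.
DEGRADATION_TIERS = [
    {"name": "full",    "model": "primary",    "max_context_tokens": 16_000, "min_budget": 10_000},
    {"name": "reduced", "model": "primary",    "max_context_tokens": 4_000,  "min_budget": 3_000},
    {"name": "light",   "model": "fast-small", "max_context_tokens": 2_000,  "min_budget": 0},
]

def select_tier(budget_remaining: int) -> dict:
    for tier in DEGRADATION_TIERS:
        if budget_remaining >= tier["min_budget"]:
            return tier
    return DEGRADATION_TIERS[-1]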
| Approach | Typical savings | Implementation | Hard ceiling? |
|---|---|---|---|
| Quantisation (4-bit/8-bit) | 50-70% | Low | Yes; quality degrades past this point |
| KV cache + speculative decoding | 15-30% | Low-medium | Yes |
| Model cascade | 40-65% on large-model calls | Medium | Depends on task mix |
| Context budget cap | 30-60% per call | Low | No |
| Deferred / async calls | 20-40% via batching | Medium | No |
| Classification gate | 30-60% on expensive calls | Medium | No |
| Call frequency redesign | 10-100x reduction | High (UX rethink) | No |
Feature gates under cost pressure
When per-user cost approaches a threshold, the instinct is to silently degrade: add a spinner, switch to a worse model, return a truncated result. Silent degradation is the wrong move. It produces inconsistent user experience without giving users any control over the trade-off.
The better approach is explicit tiering. Give users visibility into their remaining usage budget and an explicit choice about how to spend it. For a 'generate analysis' button, if a user has exhausted their monthly allocation, surface a lightweight summary by default with a clear 'full analysis' option that counts against their quota or prompts an upgrade. Users who need the expensive path will use it. Users who do not will accept the lighter version, and you have turned a degradation problem into a product design decision.
The implementation is a cost-aware routing function that checks the remaining budget before selecting which model and how much context to use:
def route_llm_call(user_id: str, context: str, task: str) -> str:
    # get_budget_remaining, estimate_tokens, truncate_to_budget, and call_model
    # are application-level helpers assumed to exist elsewhere in the codebase.
    budget_remaining = get_budget_remaining(user_id)  # tokens left this period
    estimated_tokens = estimate_tokens(context)
    if budget_remaining < estimated_tokens * BUDGET_SAFETY_FACTOR:
        # Route to the cheaper model with truncated context
        return call_model(
            model="fast-small",
            context=truncate_to_budget(context, budget_remaining),
            task=task,
        )
    return call_model(model="primary", context=context, task=task)

What to measure
Four metrics that tell you whether the product-layer work is having an effect:
Cost per user action (not per token). If this is trending down, the product-layer work is working. If it is flat despite infra changes, you have not moved the product call pattern yet.
Classification gate intercept rate. The percentage of incoming requests handled by the cheap first pass rather than the expensive model. A healthy range is 30-60%. Below 20% suggests the gate is too narrow; above 80% may indicate the expensive model is not well matched to the feature's complexity distribution.
P95 cost per user per day. Median cost understates the problem because LLM cost has high variance; power users drive disproportionate spend. P95 shows whether you have a structural issue or a tail issue.
Budget usage per tier. What fraction of users in each pricing tier hit their cost ceiling each billing period? If it is low, your tier structure has room to expand. If it is consistently high, you have pricing pressure building.
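For the P95 figure, a nearest-rank percentile over per-user daily spend is enough for a dashboard and needs nothing beyond the standard library. A minimal sketch with illustrative numbers:

import math

def p95(values: list[float]) -> float:
    # Nearest-rank percentile: crude but dependency-free.
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

daily_cost_per_user = [0.02, 0.03, 0.05, 0.04, 0.90, 0.03, 0.06]  # illustrative
print(f"p95 daily cost per user: ${p95(daily_cost_per_user):.2f}")  # the tail the median hides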
The infra work is necessary but not sufficient
Run the infra pass. Do the quantisation, the caching, the model cascade. These matter, and skipping them is leaving real savings on the table.
But if the bill is still material after that pass, the answer is not more infra optimisation. It is a specific question about the product: what is the feature asking the model to do, how often, and is that actually what the user needs to accomplish the task? The cost ceiling for most LLM-backed features is set by product decisions that predate the cost conversation. Moving it requires revisiting those decisions. Cost engineering is also, unavoidably, a product design conversation.