Local LLMs in production, 2026: the honest economics
Self-hosting open-weight models is cheaper and easier than 18 months ago. That does not make it the right call for most workloads.
The picture has changed — just not the way the headlines say
Two years ago, the case against local LLMs was clear: quality gaps were real, serving infrastructure was fiddly, and engineering overhead was steep relative to an API call. The case for was mostly theoretical — data privacy, cost at scale, customisation — but 'at scale' turned out to mean volumes that most companies never actually reached.
The 2026 picture has shifted. Gemma 4 26B A4B runs on a single A100 80GB at throughput that previously required a multi-GPU rig. Qwen 3 30B and Mistral Small 4 are production-quality on a broad class of tasks. Llama 4 Scout handles long-context workloads that once required much larger models. The quantisation story has improved: 4-bit models that were noticeably weaker in 2024 are now competitive for most non-frontier tasks.
And yet: the blog posts telling you that local LLMs are now the obvious default get the economics wrong in one consistent direction. They compare token-level costs without accounting for the full cost stack, and produce a break-even calculation that looks dramatically in favour of self-hosting until you add the items that actually dominate.
The four scenarios where local actually wins
Before the numbers, it is worth naming the four situations where self-hosting is usually the right call. In three of them cost is not the primary driver; in the second, the cost case is unusually clear-cut.
Data that cannot leave your infrastructure. Healthcare records, financial data under specific regulatory regimes, government workloads, and legal documents in some jurisdictions. If your data classification policy says 'no external processing', the cost comparison is moot. Self-hosting is the only option, and the engineering overhead is a compliance cost, not a discretionary one.
High-volume, high-cost-tier workloads. Running GPT-4o-tier APIs at $5/$15 per million tokens input/output at 10 million or more tokens per day means an API bill of $1,500–3,000 per month or higher. A single A100 80GB cloud instance at roughly $2.50 per hour ($1,800 per month running continuously), running a competitive open-weight model, can break even at that volume, provided quality holds for your specific task.
Edge or offline inference. IoT devices, on-device processing, situations requiring inference without network round-trips. Managed APIs cannot help here, full stop.
Fine-tuning with proprietary task data. If your use case requires a domain-specific fine-tune that you cannot send to a provider's infrastructure, self-hosting is part of the fine-tuning pipeline regardless of cost.
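The arithmetic behind the high-volume scenario can be checked in a few lines. This is a minimal sketch using only the figures quoted above ($5/$15 per million tokens, 10 million tokens per day, $2.50 per GPU-hour); the `input_fraction` parameter, meaning the share of traffic billed at the input price, is an assumption you should replace with your own traffic mix.

```python
def monthly_api_bill(tokens_per_day, input_price, output_price, input_fraction, days=30):
    """Monthly API spend in dollars; prices are per million tokens."""
    blended_price = input_fraction * input_price + (1 - input_fraction) * output_price
    return tokens_per_day * days / 1_000_000 * blended_price


def monthly_gpu_bill(cost_per_hour, days=30):
    """Cost of one GPU instance running continuously for the month."""
    return cost_per_hour * 24 * days


# Assumed splits: all-input traffic versus an even input/output mix.
input_heavy = monthly_api_bill(10_000_000, 5.0, 15.0, input_fraction=1.0)
even_mix = monthly_api_bill(10_000_000, 5.0, 15.0, input_fraction=0.5)
gpu = monthly_gpu_bill(2.50)

print(f"API, all input tokens:   ${input_heavy:,.0f}/month")
print(f"API, 50/50 input/output: ${even_mix:,.0f}/month")
print(f"One A100 instance:       ${gpu:,.0f}/month")
```

With these inputs the API bill spans $1,500 to $3,000 per month against $1,800 for the GPU, before any engineering overhead is added.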
If none of these four apply, the decision comes down to economics. That is where the analysis usually goes wrong.
The cost comparison done right
The standard comparison goes: API at $3 per million tokens versus self-hosted at $0.07 per million tokens. The self-hosted number comes from a single line item: GPU compute cost divided by throughput. It is not wrong; it is incomplete.
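The token-level figure comes from a calculation along the lines of the one below. It is a sketch, not a benchmark: the throughput value is a hypothetical placeholder, and the `utilisation` parameter (the fraction of each billed hour the GPU spends serving real traffic) is an assumption added here to show how sensitive the headline number is to idle time.

```python
def gpu_cost_per_million_tokens(gpu_cost_per_hour, tokens_per_second, utilisation=1.0):
    """Dollar cost per million tokens for one GPU instance.

    utilisation is the fraction of wall-clock time spent serving real
    traffic; the instance is billed for the full hour either way.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Hypothetical throughput; measure your own under realistic load.
ASSUMED_TOKENS_PER_SECOND = 3_000

for utilisation in (1.0, 0.5, 0.1):
    cost = gpu_cost_per_million_tokens(0.72, ASSUMED_TOKENS_PER_SECOND, utilisation)
    print(f"utilisation {utilisation:>4.0%}: ${cost:.2f} per million tokens")
```

At full utilisation the assumed numbers give roughly $0.07 per million tokens; at 10% utilisation the same hardware costs ten times as much per token.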
Here is what the full cost stack actually looks like:
| Cost element | Managed API | Self-hosted |
|---|---|---|
| Token compute | $0.15–$15/M tokens (varies by model tier) | $0.05–$0.10/M (at sustained GPU use) |
| Infrastructure baseline | $0 | $500–2,000/month (GPU instances or on-prem amortised) |
| Serving setup (one-time) | $0 | 2–4 weeks of senior engineer time |
| Serving maintenance (ongoing) | $0 | ~0.1–0.25 FTE per model in production |
| Model updates | Automatic, provider-managed | Manual: evaluate, test, redeploy (4–6 cycles/year) |
| Traffic spikes | Automatic scaling | Manual or pre-provisioned headroom |
| Uptime SLA | Provider-backed | Your responsibility |
| Latency (p50) | 300–800ms | 50–300ms (lower; higher variance) |
The two items that flip the comparison are engineering maintenance overhead and model update cycles. Neither appears in most self-hosting cost analyses.
The engineering overhead is not a one-time cost
Most break-even analyses treat serving setup as a one-time cost: two weeks of work, then done. This is where most teams get burned.
Self-hosting a model in production means owning a serving stack. In practice: capacity planning for traffic peaks, graceful degradation when GPU memory is exhausted, load balancing if you run more than one instance, monitoring for quality regression over time, and responding to incidents when the inference server crashes at 2am. None of this is technically difficult for an experienced ML infrastructure engineer. At a 15-person company, however, you probably do not have one, which means these tasks land as opportunity cost on whoever is closest.
A working rule: budget roughly 0.2 FTE per model you run in production. Depending on your engineering cost base, that is $4,000–7,000 per month. It is a recurring line item, not a setup cost.
The break-even in practice
Working through the numbers: if you are replacing a mid-tier managed API (something in the GPT-4o-mini or Haiku price range, roughly $0.15–$0.60/M blended), the break-even using the full cost stack lands around 50–100 million tokens per month.
At 50M tokens per month, the API spend is roughly $15,000 per year. A self-hosted setup with one L40S instance running continuously, amortised setup, and 0.2 FTE of maintenance runs approximately $12,000–18,000 per year, depending on your engineering labour cost. The cost advantage exists, but it is not the order-of-magnitude saving that token-level comparisons imply.
At 10M tokens per month (a $3,000/year API spend at mid-tier pricing), it is genuinely hard for self-hosting to beat the API once engineering overhead is included. Most teams at this usage level are better served by the API.
The picture changes if you are replacing a frontier-tier API at $5/$15 per million tokens. At 10M tokens per month, the API spend is roughly $15,000 per year. A capable 30B-class open-weight model, self-hosted, can undercut that, provided quality is acceptable. For most non-frontier tasks in 2026, it is.
A rough cost estimator (substitute your own numbers):
```python
# Adjust these three variables for your workload
TOKENS_PER_DAY = 5_000_000       # daily token estimate
API_COST_PER_MILLION = 0.30      # $0.30/M for mid-tier; $6.00/M for frontier
GPU_COST_PER_HOUR = 0.72         # L40S on-demand; adjust for your provider

# Engineering overhead
ENGINEERING_FTE = 0.20           # fraction of a senior engineer's time
ENGINEERING_MONTHLY = 12_000     # your senior engineer's monthly cost

api_monthly = TOKENS_PER_DAY * 30 / 1_000_000 * API_COST_PER_MILLION
gpu_monthly = GPU_COST_PER_HOUR * 24 * 30
eng_monthly = ENGINEERING_MONTHLY * ENGINEERING_FTE
self_monthly = gpu_monthly + eng_monthly

print(f"API monthly: ${api_monthly:,.0f}")
print(f"Self-hosted monthly: ${self_monthly:,.0f}")
print(f"  GPU compute: ${gpu_monthly:,.0f}")
print(f"  Engineering (est.): ${eng_monthly:,.0f}")
print()
winner = "self-hosted" if self_monthly < api_monthly else "managed API"
print(f"Break-even favours: {winner}")
```

What to benchmark before committing local LLMs to production
If the numbers suggest you are near the break-even threshold, the decision comes down to quality on your specific task. Generic benchmarks will not answer this.
The single most useful test is not MMLU or HumanEval. It is a curated set of 100–200 actual production inputs (the kind of prompts your system sends today), rated either by human reviewers or a judge model, and run against both the candidate self-hosted model and your current API. If quality on your task is within 5%, the economics usually justify the move. At a 10% gap, the managed API is providing value that the token maths does not capture.
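One way to turn that comparison into a number is sketched below. It assumes each production input has already been rated for both models on the same scale; the scores shown are hypothetical, and the middle "judgement call" band between 5% and 10% is an interpretation added here, not a rule from any benchmark.

```python
from statistics import mean


def relative_quality_gap(api_scores, local_scores):
    """Fraction by which the local model's mean score trails the API's.

    Both lists hold one rating per production input, in the same order.
    A negative result means the local model scored higher.
    """
    if len(api_scores) != len(local_scores) or not api_scores:
        raise ValueError("need one score per input for both models")
    api_mean = mean(api_scores)
    return (api_mean - mean(local_scores)) / api_mean


def verdict(gap, move_threshold=0.05, stay_threshold=0.10):
    """Map a quality gap onto the 5% / 10% rule of thumb."""
    if gap <= move_threshold:
        return "within 5%: economics usually justify the move"
    if gap >= stay_threshold:
        return "10% or more: the managed API is earning its price"
    return "between 5% and 10%: judgement call, gather more samples"


# Hypothetical 1-5 ratings from reviewers or a judge model.
api_scores = [5, 4, 5, 4, 4, 5, 3, 5]
local_scores = [5, 4, 4, 4, 4, 5, 3, 4]

gap = relative_quality_gap(api_scores, local_scores)
print(f"quality gap: {gap:.1%} -> {verdict(gap)}")
```

In a real evaluation the lists would hold 100–200 ratings each, and it is worth reading the individual failures as well as the aggregate.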
Other things to measure during any evaluation period:
- P50, P95, and P99 latency under realistic concurrent load, not single-request latency in a quiet test environment
- Throughput ceiling before quality degrades from aggressive batching
- GPU memory behaviour under sustained load; fragmentation is a real production failure mode that synthetic benchmarks miss
- Output quality on long inputs if your task involves documents rather than short prompts
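The latency measurement in the first bullet can be sketched with the standard library alone. In this example `fake_inference_request` is a hypothetical stand-in that just sleeps; replace it with a real call to your inference server. The concurrency level and prompt count are likewise placeholders.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_inference_request(prompt):
    """Stand-in for a real call to your inference server (hypothetical)."""
    time.sleep(random.uniform(0.005, 0.02))
    return f"response to {prompt}"


def timed_call(send_request, prompt):
    """Return the wall-clock latency of one request, in milliseconds."""
    start = time.perf_counter()
    send_request(prompt)
    return (time.perf_counter() - start) * 1000


def latency_percentiles(send_request, prompts, concurrency):
    """Send prompts through `concurrency` parallel workers; report p50/p95/p99."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda p: timed_call(send_request, p), prompts))
    cuts = quantiles(latencies, n=100)  # 99 cut points: cuts[i] is p(i+1)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


prompts = [f"prompt {i}" for i in range(200)]
report = latency_percentiles(fake_inference_request, prompts, concurrency=16)
for name, value in report.items():
    print(f"{name}: {value:.1f} ms")
```

Threads are adequate here because the workers spend their time waiting on I/O. Run the same harness at several concurrency levels and use your curated production inputs as the prompts, since the tail percentiles are where a quiet test environment is most misleading.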
If you cannot commit 2–3 days of engineering time to this evaluation, you cannot commit to self-hosting. The evaluation phase is a scaled-down version of the operational overhead you will carry permanently.
The honest recommendation
Self-hosting is viable at meaningfully lower thresholds than it was two years ago. Models are better, serving tooling (vLLM, SGLang) is more mature, and cloud GPU costs have come down. For data-sovereignty requirements, the analysis starts and ends with the compliance question — cost is secondary.
For everyone else: the cost advantage is real but narrower than token-level comparisons suggest, and it accrues only above a usage threshold that is higher than most teams currently hit. The more tractable question is not 'managed API or self-hosted' but 'which API tier is right for my current workload, and at what point should I revisit this as volume grows?'
Re-run the calculation when your API spend crosses $5,000 per month on a single use case. Below that, managed APIs hold the advantage on total cost — not because the local option does not work, but because zero engineering overhead has a value that the token maths does not include.