The self-hosted LLM cost model: what the calculators miss
Token prices tell one story. GPU load, ops overhead, and quality-per-output tell another.
The self-hosted LLM cost model starts with one number: 80% savings over the API. Run Llama 4 Maverick on your own A100 and you are paying roughly $0.30 to $0.50 per million tokens instead of $2 to $5 for equivalent frontier models. The arithmetic is correct. What it misses is everything that is not a token.
The headline number and where it comes from
The comparison is fair on the right dimension. An A100-hosted inference run of Llama 4 Maverick at moderate batch sizes lands around $0.30 to $0.50 per million tokens. Claude 3.5 Haiku costs $0.80 per million input tokens and $4.00 per million output tokens. GPT-4o mini runs $0.15 input and $0.60 output. For output-heavy generation workloads, the cost ratio really is 5 to 10 times in favour of local.
The comparison also assumes you are generating tokens at a consistent clip. That assumption is where the model breaks.
The GPU load problem
A100 GPU instances on AWS cost about $3.20 per hour, or roughly $2,300 per month. At 100% load, that works out to around $0.45 per million tokens. At 20% load — a realistic figure for a single-tenant inference server handling business-hours traffic — the effective per-token cost is five times higher. Research on production inference deployments puts the threshold at 60% average GPU load: below that, self-hosted costs more per token than most managed API tiers.
API providers charge only for tokens you use. Idle capacity is their operational problem. Self-hosted means idle capacity is your cost, and most B2B SaaS workloads are highly spiky: heavy during business hours, quiet overnight, near-zero on weekends. That traffic pattern is where managed APIs have a structural advantage that token-price calculators never capture.
“Token calculators compare your API bill to a fully-loaded GPU. Your GPU is never fully loaded.”
| Average GPU load | Effective cost / M tokens | vs Claude 3.5 Haiku input ($0.80/M) |
|---|---|---|
| 100% | $0.45 | 0.56x (cheaper) |
| 60% | $0.75 | 0.94x (roughly parity) |
| 20% | $2.25 | 2.8x (more expensive) |
| 10% | $4.50 | 5.6x (much more expensive) |
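The table's arithmetic is simple enough to sanity-check in a few lines. A minimal sketch, assuming the full-load figure of $0.45 per million tokens quoted above (your own throughput will vary by model, batch size, and serving stack):

```python
# Effective cost per million tokens as a function of average GPU load.
# Assumes the $0.45/M full-load cost derived above; actual throughput
# depends on model, batch size, and serving configuration.

FULL_LOAD_COST_PER_M = 0.45   # $/M tokens at 100% average load
HAIKU_INPUT_PER_M = 0.80      # Claude 3.5 Haiku input price, $/M tokens

for load in (1.00, 0.60, 0.20, 0.10):
    effective = FULL_LOAD_COST_PER_M / load
    ratio = effective / HAIKU_INPUT_PER_M
    print(f"{load:>4.0%} load: ${effective:.2f}/M tokens ({ratio:.2f}x Haiku input)")
```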
Most B2B SaaS request patterns show a peak-to-trough ratio of 10 to 20 times: heavy traffic during European and North American business hours, quiet otherwise. A single A100 provisioned for peak load typically runs at 8 to 15% average load across the week. That puts most single-server setups well below the break-even threshold.
The self-hosted LLM cost model, fully accounted
Ops overhead for a maintained self-hosted inference stack is real, even when it does not appear in a token calculator.
Model upgrades are not automatic. When a new checkpoint is released with better performance on your task, someone needs to run your evaluation suite against it, benchmark throughput on your hardware, update the serving configuration, and verify that structured output formats still parse correctly. API providers handle this. For teams running local inference, it recurs every two to four months.
A realistic time estimate for a 10 to 30 person team maintaining a production inference stack: two to three weeks of initial setup covering hardware selection, vLLM serving, an OpenAI-compatible API shim, monitoring, and alerting; two to five days per model update cycle; one to two days of ongoing incident response and tuning per month. That adds up to roughly 15 to 25% of one engineer's annual capacity.
At a fully-loaded engineering cost of $150,000 per year, that is $22,500 to $37,500 in implicit infrastructure labour, before hardware. Add that to compute costs and the break-even point shifts from the often-cited '$500 per month in API spend' to somewhere around $3,000 to $5,000 per month for most teams. At 50 million tokens per day the maths clearly favour self-hosting; at 50,000 tokens per day they almost never do.
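A minimal sketch of that fully-accounted comparison, using this section's estimates (the GPU rental, salary, and labour fraction are all assumptions to replace with your own numbers):

```python
# Fully-accounted monthly cost of a single-GPU self-hosted stack.
# All inputs are this section's estimates; substitute your own.

GPU_MONTHLY = 2_300        # A100 on-demand at ~$3.20/hr
ENGINEER_ANNUAL = 150_000  # fully-loaded engineering cost

for frac in (0.15, 0.25):  # share of one engineer's year on the stack
    labour_monthly = ENGINEER_ANNUAL * frac / 12
    total = GPU_MONTHLY + labour_monthly
    # Break-even is roughly where the current API bill exceeds the all-in figure.
    print(f"{frac:.0%} of an engineer: ${labour_monthly:,.0f}/mo labour"
          f" -> ${total:,.0f}/mo all-in")
```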
Where local inference clearly wins
None of this means self-hosted is the wrong call. It means the answer depends on which workload you are asking about.
Embeddings at volume. Generating embeddings for a large static corpus is a batch job. You run it overnight at high GPU load, latency does not matter, and open-weight embedding models — BGE-M3, Nomic Embed v2, E5-large — are competitive with API equivalents on standard retrieval benchmarks. Embedding generation is also where you first see the break-even flip: a corpus job running six hours overnight keeps the GPU above 60% average load for that window, which is exactly the threshold where self-hosted costs pencil out.
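A minimal sketch of that overnight batch shape, assuming sentence-transformers and BGE-M3 (any of the models named above would slot in; the corpus file layout is a placeholder):

```python
# Overnight batch embedding of a static corpus with an open-weight model.
# Model choice and file layout are illustrative assumptions.
import json

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3", device="cuda")

# Placeholder corpus: one JSON object per line with a "text" field.
with open("corpus.jsonl") as f:
    texts = [json.loads(line)["text"] for line in f]

# Large batches keep GPU load high, which is the whole economic point:
# latency is irrelevant for a batch job, utilisation is everything.
embeddings = model.encode(
    texts,
    batch_size=256,
    normalize_embeddings=True,
    show_progress_bar=True,
)
print(embeddings.shape)  # (num_docs, embedding_dim)
```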
Classification and structured extraction. Short input, short output, high volume, clear schema: detecting intent, categorising tickets, extracting fields from documents. A fine-tuned 7B or 8B model regularly matches frontier accuracy on these tasks at a fraction of the inference cost. Llama 3.1 8B and Qwen 2.5 7B cover most use cases in this category, and the quality ceiling is rarely a constraint for narrow extraction tasks at inference time.
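As a sketch of what that looks like in practice, assuming a vLLM server exposing the OpenAI-compatible shim mentioned earlier on localhost:8000 (the endpoint, model name, and label set are all illustrative):

```python
# Ticket classification against a locally served 8B model through an
# OpenAI-compatible endpoint. URL, model, and labels are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

LABELS = ["billing", "bug_report", "feature_request", "account_access"]

def classify(ticket: str) -> str:
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system",
             "content": "Classify the support ticket. Reply with exactly "
                        f"one of: {', '.join(LABELS)}."},
            {"role": "user", "content": ticket},
        ],
        temperature=0,   # deterministic output for a fixed label set
        max_tokens=10,
    )
    return resp.choices[0].message.content.strip()

print(classify("I was charged twice for the March invoice."))
```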
Privacy-constrained workloads. If your customers' data cannot leave your infrastructure for compliance or contractual reasons, local inference is not a cost decision. It is the only path. For workloads touching regulated data — healthcare records, financial documents, data subject to DPDP or similar regional residency requirements — the cost model is secondary to the compliance requirement.
Where the API still holds
Frontier reasoning quality remains the clearest case for the API. Complex code generation, long-context synthesis, multi-step planning, any task where a measurable quality regression would affect user retention: these are the situations where the open-weight model gap still matters, and where cost-per-token comparisons are the wrong frame. The relevant metric is cost-per-useful-output. Open-weight models have narrowed the gap since 2024; they have not closed it.
Features in active development are the second case. When you are iterating on prompts, output formats, and feature design, every local model upgrade becomes a deployment. The API abstracts that away entirely. Adding infrastructure overhead to a feature that might be cut in the next sprint rarely makes sense.
A third case: real-time interactive features where first-token latency affects the user experience. Managed API endpoints return the first token in 100 to 300 milliseconds. A self-hosted server matches that under low load, but a throughput-optimised configuration built for batch jobs can push first-token latency above a second at peak queue depth. For anything a user watches stream in real time, that tradeoff matters.
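First-token latency is easy to measure rather than guess. A sketch against any OpenAI-compatible endpoint, local or managed (the URL and model are placeholders; run it at low load and again at peak queue depth, and compare):

```python
# Time-to-first-token against an OpenAI-compatible endpoint.
# base_url and model are placeholders; point them at the server under test.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

start = time.perf_counter()
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain vector indexes briefly."}],
    stream=True,
    max_tokens=128,
)
for chunk in stream:
    # Some chunks carry no content (role deltas, usage); skip until text arrives.
    delta = chunk.choices[0].delta.content if chunk.choices else None
    if delta:
        print(f"first token after {time.perf_counter() - start:.3f}s")
        break
```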
The self-hosted LLM cost model for your team
Before the local versus API question becomes a project, collect five things: last month's actual API bill; hourly request volume plotted across a typical week (measure the peak-to-trough ratio); task category; available MLOps capacity on the team; and any compliance constraints on data residency.
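The traffic measurement takes a few lines if your request logs carry timestamps. A sketch assuming a CSV with one row per request and a `timestamp` column (the quantile cutoffs are a judgment call to dodge one-off spikes and dead hours):

```python
# Peak-to-trough ratio and an average-load proxy from request logs.
# Assumes requests.csv with one row per request and a timestamp column.
import pandas as pd

df = pd.read_csv("requests.csv", parse_dates=["timestamp"])
hourly = df.set_index("timestamp").resample("1h").size()

peak = hourly.quantile(0.99)            # ignore one-off spikes
trough = max(hourly.quantile(0.05), 1)  # ignore dead hours, avoid divide-by-zero
avg_load_proxy = hourly.mean() / peak   # utilisation if provisioned for peak

print(f"peak-to-trough: {peak / trough:.1f}x")
print(f"average load if provisioned at peak: {avg_load_proxy:.0%}")
```

With those numbers in hand, the table below works as a first-pass filter.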
| Factor | Stay on the API | Evaluate local inference |
|---|---|---|
| Monthly API spend | Under $3,000 | Over $8,000 |
| Traffic pattern | Spiky, business-hours peak | Batch-heavy or consistent 24/7 |
| Primary task type | Reasoning, generation, long-context | Embeddings, classification, extraction |
| MLOps experience in team | None available | At least one engineer |
| Data residency requirements | No constraints | Data must stay in own infrastructure |
The range between $3,000 and $8,000 per month is a judgment call based on workload profile, team composition, and risk tolerance. Most teams in that range are better positioned on the API until the spend crosses a threshold that justifies a dedicated hire.
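The table compresses into a few conditionals. A sketch that encodes it literally, with the middle band flagged as the judgment call it is (the thresholds and task categories are this article's, not universal constants):

```python
# The decision table as code. Thresholds are this article's estimates.

BATCH_TASKS = {"embeddings", "classification", "extraction"}

def recommend(monthly_api_spend: float, batch_heavy_traffic: bool,
              task: str, has_mlops_engineer: bool,
              data_must_stay_local: bool) -> str:
    if data_must_stay_local:
        return "self-host (compliance, not cost)"
    if monthly_api_spend < 3_000 or not has_mlops_engineer:
        return "stay on the API"
    if monthly_api_spend > 8_000 and batch_heavy_traffic and task in BATCH_TASKS:
        return "evaluate local inference"
    return "judgment call: model your own workload"

print(recommend(12_000, True, "embeddings", True, False))
```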
The 80% savings claim is arithmetically correct on the right workload. Most B2B SaaS teams are not yet on the right workload. The productive question is not 'is self-hosted cheaper?' but 'which of my workloads is batch-heavy enough, high-volume enough, and narrow enough that local inference is structurally advantaged?' Answer that first, and the cost model follows.