The self-hosted LLM cost model: what the calculators miss
Token prices tell one story. GPU load, ops overhead, and quality-per-output tell another.
The self-hosted LLM cost model starts with one number: 80% savings over the API. Run Llama 4 Maverick on your own A100 and you are paying roughly $0.30 to $0.50 per million tokens instead of $2 to $5 for equivalent frontier models. The arithmetic is correct. What it misses is everything that is not a token.
The headline number and where it comes from
The comparison is fair on the right dimension. An A100-hosted inference run of Llama 4 Maverick at moderate batch sizes lands around $0.30 to $0.50 per million tokens. Claude 3.5 Haiku costs $0.80 per million input tokens and $4.00 per million output tokens. GPT-4o mini runs $0.15 per million input tokens and $0.60 per million output tokens. For output-heavy generation workloads measured against Haiku-class pricing, the cost ratio really is 5 to 10 times in favour of local.
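To see how much the ratio depends on the input/output mix, here is a minimal sketch that blends the list prices quoted above into a single per-million-token figure. The 80% output share is an assumption chosen to represent an output-heavy workload, and `blended_price_per_million` is a hypothetical helper, not part of any provider SDK.

```python
def blended_price_per_million(input_price: float, output_price: float,
                              output_share: float) -> float:
    """Average API price per million tokens when `output_share` (0.0 to 1.0)
    of all tokens in the workload are output tokens."""
    if not 0.0 <= output_share <= 1.0:
        raise ValueError("output_share must be between 0 and 1")
    return input_price * (1.0 - output_share) + output_price * output_share


# List prices quoted in the text, in dollars per million tokens.
HAIKU = {"input": 0.80, "output": 4.00}
GPT4O_MINI = {"input": 0.15, "output": 0.60}

# Assumption: an output-heavy workload where 80% of tokens are generated.
OUTPUT_SHARE = 0.8

haiku_blended = blended_price_per_million(HAIKU["input"], HAIKU["output"], OUTPUT_SHARE)
mini_blended = blended_price_per_million(GPT4O_MINI["input"], GPT4O_MINI["output"], OUTPUT_SHARE)

# Compare against both ends of the self-hosted range quoted in the text.
for local_cost in (0.30, 0.50):
    print(f"local ${local_cost:.2f}/M: "
          f"Haiku {haiku_blended / local_cost:.1f}x, "
          f"GPT-4o mini {mini_blended / local_cost:.1f}x")
```

Under this assumed mix the blended prices come to about $3.36 per million for Haiku and $0.51 for GPT-4o mini, so the size of the headline saving depends heavily on which API tier you would otherwise be paying for.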
The comparison also assumes you are generating tokens at a consistent clip. That assumption is where the model breaks.
The GPU load problem
A100 GPU instances on AWS cost about $3.20 per hour, or roughly $2,300 per month. At 100% load, that works out to around $0.45 per million tokens. At 20% load — a realistic figure for a single-tenant inference server handling business-hours traffic — the effective per-token cost is five times higher. Research on production inference deployments puts the threshold at 60% average GPU load: below that, self-hosted costs more per token than most managed API tiers.
API providers charge only for tokens you use. Idle capacity is their operational problem. Self-hosted means idle capacity is your cost, and most B2B SaaS workloads are highly spiky: heavy during business hours, quiet overnight, near-zero on weekends. That traffic pattern is where managed APIs have a structural advantage that token-price calculators never capture.
“Token calculators compare your API bill to a fully-loaded GPU. Your GPU is never fully loaded.”
| Average GPU load | Effective cost / M tokens | vs Claude 3.5 Haiku input ($0.80/M) |
|---|---|---|
| 100% | $0.45 | 0.56x (cheaper) |
| 60% | $0.75 | 0.94x (roughly parity) |
| 20% | $2.25 | 2.8x (more expensive) |
| 10% | $4.50 | 5.6x (much more expensive) |
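The table follows from one division: the GPU bill is fixed, so the cost per token scales inversely with average load. A minimal sketch, using the $0.45 full-load figure and the $0.80 Haiku input price from above (the function name is hypothetical):

```python
def effective_cost_per_million(cost_at_full_load: float, average_load: float) -> float:
    """Per-million-token cost of a fixed-price GPU running at `average_load`
    (a fraction between 0 and 1). The bill is constant, so halving the load
    doubles the cost of each token actually served."""
    if not 0.0 < average_load <= 1.0:
        raise ValueError("average_load must be in (0, 1]")
    return cost_at_full_load / average_load


FULL_LOAD_COST = 0.45      # dollars per million tokens at 100% load (from the text)
HAIKU_INPUT_PRICE = 0.80   # dollars per million input tokens (from the text)

for load in (1.0, 0.6, 0.2, 0.1):
    cost = effective_cost_per_million(FULL_LOAD_COST, load)
    print(f"{load:>4.0%}  ${cost:.2f}/M  {cost / HAIKU_INPUT_PRICE:.2f}x Haiku input")

# Load at which self-hosted cost equals the API price.
break_even_load = FULL_LOAD_COST / HAIKU_INPUT_PRICE
print(f"break-even load: {break_even_load:.0%}")
```

With these inputs the break-even load works out to roughly 56%, which is close to the 60% threshold cited above.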
Most B2B SaaS request patterns show a peak-to-trough ratio of 10 to 20 times: heavy traffic during European and North American business hours, quiet otherwise. A single A100 provisioned for peak load typically runs at 8 to 15% average load across the week. That puts most single-server setups well below the break-even threshold.
The self-hosted LLM cost model, fully accounted
Ops overhead for a maintained self-hosted inference stack is real, even when it does not appear in a token calculator.
Model upgrades are not automatic. When a new checkpoint releases with better performance on your task, someone needs to run your evaluation suite against it, benchmark throughput on your hardware, update the serving configuration, and verify that structured output formats still parse correctly. API providers handle this. For teams running local inference, it recurs every two to four months.
A realistic time estimate for a 10 to 30 person team maintaining a production inference stack: two to three weeks of initial setup covering hardware selection, vLLM serving, an OpenAI-compatible API shim, monitoring, and alerting; two to five days per model update cycle; one to two days of ongoing incident response and tuning per month. That adds up to roughly 0.15 to 0.25 of one engineer's annual capacity.
At a fully-loaded engineering cost of $150,000 per year, that is $22,500 to $37,500 in implicit infrastructure labour, before hardware. Add that to compute costs and the break-even point shifts from the often-cited '$500 per month in API spend' to somewhere around $3,000 to $5,000 per month for most teams. At 50 million tokens per day the maths clearly favour self-hosting; at 50,000 tokens per day they almost never do.
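The labour arithmetic can be folded into a single fixed monthly figure. This is a rough sketch rather than a full cost model: it assumes one A100 instance at the $3.20 hourly rate quoted earlier, 730 billable hours per month, and the engineer fractions from the estimate above. Storage, networking, and redundancy are left out.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a calendar month


def monthly_fixed_cost(gpu_hourly_rate: float,
                       engineer_annual_cost: float,
                       engineer_fraction: float) -> float:
    """GPU rental plus the share of one engineer's cost spent on the stack,
    expressed per month. Both are owed regardless of how many tokens are served."""
    gpu = gpu_hourly_rate * HOURS_PER_MONTH
    labour = engineer_annual_cost * engineer_fraction / 12
    return gpu + labour


# Figures from the text: $3.20/hour, $150,000/year, 0.15 to 0.25 of one engineer.
low = monthly_fixed_cost(3.20, 150_000, 0.15)
high = monthly_fixed_cost(3.20, 150_000, 0.25)
print(f"fixed cost: ${low:,.0f} to ${high:,.0f} per month")
```

Whatever the exact inputs, the result is a floor that monthly API spend has to clear before self-hosting can come out ahead.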
Where local inference clearly wins
None of this means self-hosted is the wrong call. It means the answer depends on which workload you are asking about.
Embeddings at volume. Generating embeddings for a large static corpus is a batch job. You run it overnight at high GPU load, latency does not matter, and open-weight embedding models — BGE-M3, Nomic Embed v2, E5-large — are competitive with API equivalents on standard retrieval benchmarks. Embedding generation is also where you first see the break-even flip: a corpus job running six hours overnight keeps the GPU above 60% average load for that window, which is exactly the threshold where self-hosted costs pencil out.
Classification and structured extraction. Short input, short output, high volume, clear schema: detecting intent, categorising tickets, extracting fields from documents. A fine-tuned 7B or 8B model regularly matches frontier accuracy on these tasks at a fraction of the inference cost. Llama 3.1 8B and Qwen 2.5 7B cover most use cases in this category, and the quality ceiling is rarely a constraint for narrow extraction tasks.
Privacy-constrained workloads. If your customers' data cannot leave your infrastructure for compliance or contractual reasons, local inference is not a cost decision. It is the only path. For workloads touching regulated data — healthcare records, financial documents, data subject to DPDP or similar regional residency requirements — the cost model is secondary to the compliance requirement.
Where the API still holds
Frontier reasoning quality remains the clearest case for the API. Complex code generation, long-context synthesis, multi-step planning, any task where a measurable quality regression would affect user retention: these are the situations where the open-weight model gap still matters, and where cost-per-token comparisons are the wrong frame. The relevant metric is cost-per-useful-output. Open-weight models have narrowed the gap since 2024; they have not closed it.
Features in active development are the second case. When you are iterating on prompts, output formats, and feature design, every local model upgrade becomes a deployment. The API abstracts that away entirely. Adding infrastructure overhead to a feature that might be cut in the next sprint rarely makes sense.
A third case: real-time interactive features where first-token latency affects the user experience. Managed API endpoints return the first token in 100 to 300 milliseconds. A self-hosted server matches that under low load, but a throughput-optimised configuration built for batch jobs can push first-token latency above a second at peak queue depth. For anything a user watches stream in real time, that tradeoff matters.
The self-hosted LLM cost model for your team
Before the local versus API question becomes a project, collect five things: last month's actual API bill; hourly request volume plotted across a typical week (measure the peak-to-trough ratio); task category; available MLOps capacity on the team; and any compliance constraints on data residency.
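The traffic measurement is the easiest of the five to automate. A minimal sketch, assuming you can export one request count per hour for a full week (168 values) and that the GPU would be provisioned to just cover the observed peak. The sample week is invented for illustration, not real data.

```python
from statistics import mean


def traffic_profile(hourly_requests: list[float]) -> dict[str, float]:
    """Summarise a series of hourly request counts.

    peak_to_trough: busiest hour divided by quietest hour.
    average_load:   mean traffic as a fraction of peak, i.e. the average load
                    of a server sized exactly for the peak (an assumption).
    """
    peak = max(hourly_requests)
    trough = min(hourly_requests)
    if peak <= 0:
        raise ValueError("need at least one hour with traffic")
    return {
        "peak_to_trough": peak / trough if trough > 0 else float("inf"),
        "average_load": mean(hourly_requests) / peak,
    }


# Hypothetical week: quiet nights and weekends, a sharp business-hours peak.
weekday = [50] * 8 + [200, 500, 1000, 800, 500, 500, 400, 300, 200, 100] + [50] * 6
weekend = [50] * 24
week = weekday * 5 + weekend * 2  # 168 hourly counts

profile = traffic_profile(week)
print(f"peak-to-trough: {profile['peak_to_trough']:.0f}x")
print(f"average load if sized for peak: {profile['average_load']:.0%}")
```

Feeding the resulting average load into the load table earlier in the article gives a first estimate of where your effective per-token cost would land.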
| Factor | Stay on the API | Evaluate local inference |
|---|---|---|
| Monthly API spend | Under $3,000 | Over $8,000 |
| Traffic pattern | Spiky, business-hours peak | Batch-heavy or consistent 24/7 |
| Primary task type | Reasoning, generation, long-context | Embeddings, classification, extraction |
| MLOps experience in team | None available | At least one engineer |
| Data residency requirements | No constraints | Data must stay in own infrastructure |
The range between $3,000 and $8,000 per month is a judgment call based on workload profile, team composition, and risk tolerance. Most teams in that range are better positioned on the API until the spend crosses a threshold that justifies a dedicated hire.
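For teams that want the table as a first-pass filter, it can be written down as a small function. This is one possible encoding, not a rule from the data: the `Workload` fields, the category labels, and the order in which the checks run are all assumptions, and the middle band deliberately returns "judgment call" rather than pretending to decide.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    monthly_api_spend: float        # dollars per month
    traffic: str                    # "spiky" or "steady" (hypothetical labels)
    task: str                       # "reasoning" or "narrow" (hypothetical labels)
    has_mlops_engineer: bool
    data_must_stay_in_house: bool


def recommend(w: Workload) -> str:
    """First-pass reading of the decision table. The ordering of checks is an
    assumption: compliance first, then spend and staffing, then workload shape."""
    if w.data_must_stay_in_house:
        # Per the text, residency constraints make this a compliance decision.
        return "evaluate local inference"
    if w.monthly_api_spend < 3_000 or not w.has_mlops_engineer:
        return "stay on the API"
    if w.monthly_api_spend > 8_000 and w.traffic == "steady" and w.task == "narrow":
        return "evaluate local inference"
    return "judgment call"


print(recommend(Workload(2_000, "spiky", "reasoning", False, False)))
print(recommend(Workload(12_000, "steady", "narrow", True, False)))
```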
The 80% savings claim is arithmetically correct on the right workload. Most B2B SaaS teams are not yet on the right workload. The productive question is not 'is self-hosted cheaper?' but 'which of my workloads is batch-heavy enough, high-volume enough, and narrow enough that local inference is structurally advantaged?' Answer that first, and the cost model follows.