Local LLMs in production: what's actually shipping, what got abandoned, and how to tell the difference
A task-category audit of where local inference genuinely wins, where it quietly failed, and what three questions cut through the benchmark noise.
The local LLMs-in-production measurement problem
The conversation about local LLMs in production has a measurement problem. The Datadog State of AI Engineering report from late 2025 found that 67% of enterprises had deployed LLMs in some capacity, but fewer than 30% had those systems performing reliably in production. That 37-point gap is not a measurement artefact. It reflects something structural about how the category is talked about versus how it is actually used.
"Local LLM in production" gets conflated with "we tested a local model" or "we have a research project running" or "we have an endpoint that answers test traffic." None of those are production. Production is when real users depend on the system and its failure mode is your pager going off.
This piece separates the two for the local inference case specifically: what task categories are genuinely shipping at scale with local or on-premise models, which ones got tried and quietly dropped, and what the actual decision criteria look like when you strip out the benchmark framing.
Where local inference is genuinely winning
Four task categories have an unambiguous production track record with local models. They share a common shape: narrow scope, high frequency, tight latency budgets, and constrained input-output shapes.
IDE code completion
This is where local inference has the clearest production track record. GitHub Copilot, Cursor, and Codeium all ship a mix of local and cloud inference, with local models handling token-by-token autocomplete and cloud models handling longer completions, chat, and anything that benefits from a larger context window. The latency argument is real: cloud round-trips at 80-200ms interrupt typing flow in a way users notice. Local 3B-7B models quantised to 4-bit can run autocomplete at 80-120 tokens per second on a recent M-series Mac or a mid-range GPU-equipped developer workstation. The task is constrained, the error surface is recoverable (a bad suggestion is just a Tab key not pressed), and the frequency is high enough that per-query cloud costs would add up quickly.
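A minimal sketch of the local half of that setup, using llama-cpp-python with a 4-bit-quantised model. The model file, context size, and stop sequence are illustrative assumptions rather than any particular vendor's configuration; real editors also use the fill-in-the-middle prompt format the specific model was trained on.

```python
# Minimal local-autocomplete sketch using llama-cpp-python.
# The GGUF path is a placeholder for whichever 4-bit quantised code model you run.
from llama_cpp import Llama

llm = Llama(
    model_path="models/code-7b-q4_k_m.gguf",  # hypothetical 4-bit quantised model
    n_ctx=2048,        # small window: autocomplete only sees nearby code
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

def complete(prefix: str, max_tokens: int = 48) -> str:
    """Return a short continuation of the code under the cursor."""
    out = llm(
        prefix,
        max_tokens=max_tokens,
        temperature=0.2,    # low temperature: predictable, conservative suggestions
        stop=["\n\n"],      # stop at the next blank line
    )
    return out["choices"][0]["text"]

print(complete("def parse_invoice(path: str) -> dict:\n    "))
```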
Document and content classification
A 7B model fine-tuned for a specific taxonomy outperforms a general-purpose frontier model on that taxonomy at a fraction of the cost, and it runs offline. Companies with document-heavy workflows in legal, insurance, and finance have seen real ROI from this pattern. The input and output are constrained, the model can be fine-tuned on domain-labelled data, and the cost and throughput profile favours local inference for batch workloads. A pattern that has emerged: use a cloud model to define and validate the initial taxonomy, then use a local fine-tuned model for the production classification pipeline.
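A sketch of the production half of that pattern, assuming the fine-tuning run has already produced a checkpoint for your taxonomy. The model name and the example label are hypothetical; the point is that the serving path is a plain local batch job with no per-document API cost.

```python
# Sketch: serving a fine-tuned local classifier over a fixed taxonomy.
# "acme/contract-clause-classifier" and the label set are placeholders for
# whatever checkpoint your own fine-tuning run produced.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="acme/contract-clause-classifier",  # locally hosted fine-tuned model
    device=0,                                 # first local GPU
)

def classify_batch(documents: list[str]) -> list[dict]:
    # Push the whole queue through the local GPU in batches.
    return classifier(documents, batch_size=32, truncation=True)

results = classify_batch(["Either party may terminate this agreement ..."])
print(results[0])  # e.g. {"label": "termination_clause", "score": 0.97}
```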
On-device personalisation and keyboard suggestions
Apple's Writing Tools, Smart Reply, and predictive keyboard features, together with Google's Pixel on-device AI, use small quantised models for tasks where the privacy argument is non-negotiable and the task scope is narrow. These are not general-purpose chat. They are autocomplete for specific UX surfaces, with a latency budget of 30-100ms and a small context window by design. The category works because the task is permanently constrained. Nobody is asking an on-device keyboard model to reason over a 50-page document.
Batch data extraction and transformation
ETL pipelines that extract structured fields from unstructured documents — invoice parsing, contract clause extraction, form digitisation — work well with local inference when the schema is fixed and the model is fine-tuned on examples. The accuracy ceiling is below frontier cloud models, but when 95%+ accuracy is good enough, the cost profile of running a 13B model on internal GPUs often beats per-token cloud pricing at the volumes these pipelines run.
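Most of the reliability in these pipelines comes from enforcing the fixed schema on the model output rather than trusting it. A sketch of that shape, with a hypothetical `local_generate` call standing in for whichever local serving stack you run and an illustrative invoice schema:

```python
# Sketch: fixed-schema extraction with strict validation of the model output.
# `local_generate` is a stand-in for your local serving stack (llama.cpp, vLLM, ...);
# the Invoice schema is an illustrative example, not a reference design.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_amount: float
    currency: str

PROMPT = (
    "Extract vendor, invoice_number, total_amount and currency from the "
    "invoice below. Reply with JSON only.\n\n{text}\n\nJSON:"
)

def extract(text: str, local_generate) -> Invoice | None:
    raw = local_generate(PROMPT.format(text=text))
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError:
        # Rows that fail validation go to a retry or human-review queue
        # instead of silently entering the downstream pipeline.
        return None
```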
| Task | Local: what works | Local: the ceiling | Recommendation |
|---|---|---|---|
| Code autocomplete | Sub-50ms, offline, private | Struggles with complex multi-file reasoning | Local wins |
| Document classification (fixed schema) | Fine-tunable, cheap at volume | Needs labelled in-domain data to compete | Local wins (fine-tuned) |
| On-device keyboard suggestions | Battery, privacy, no network required | Not suitable for long-form generation | Local wins |
| Batch data extraction | Cost-effective at scale, deterministic | Accuracy ceiling below frontier models | Local wins at volume |
| Long-context document Q&A | Fast retrieval step works locally | Generation quality fails above 8k-token context | Hybrid: local retrieval, cloud generation |
| Multi-step agentic workflows | Individual step speed | Error accumulates with no self-correction | Cloud wins |
| General-purpose customer chat | Handles majority of queries adequately | Tail-case failure rate too high | Cloud wins |
| Codebase-wide reasoning | Handles single-file questions | Context limit and reasoning ceiling | Cloud wins |
What got tried and quietly dropped
The winning cases above all share one trait: the task is narrow. The failures cluster around the opposite — the task was too open-ended for the model capacity available locally.
Long-context document Q&A on-device
The pitch was compelling in 2024: run a full RAG pipeline on-device, no data leaves the machine, no cloud dependency. The reality is that local models in the 7B-13B range struggle with context windows beyond 8k tokens, where they start losing track of content from earlier in the window, and with documents where the answer requires synthesising across multiple passages. Most teams that tried this for enterprise document Q&A converged on a hybrid by mid-2025: local embeddings and retrieval, cloud inference for the answer-generation step. The latency and privacy benefits of local retrieval are preserved; the accuracy problem at the generation step is handled by the model that can actually handle it.
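The split is straightforward to express. A sketch under obvious assumptions: sentence-transformers handles the on-device embedding and retrieval step, and a placeholder `cloud_generate` function stands in for whichever hosted model writes the answer.

```python
# Sketch of the hybrid pattern: embeddings and retrieval stay local; only the
# question and the top-k retrieved snippets are sent to the cloud model.
# The embedding model and `cloud_generate` are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # runs entirely on-device

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer(question: str, chunks: list[str], cloud_generate) -> str:
    context = "\n\n".join(retrieve(question, chunks))
    # Only the question and these snippets leave the machine, not the full corpus.
    return cloud_generate(f"Answer using only this context:\n{context}\n\nQ: {question}")
```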
Multi-step agentic workflows
The failure mode is predictable in retrospect. A 7B model following a 6-step tool-use chain has enough headroom for each individual step, but error accumulates across the chain. A step-3 misclassification that a frontier model would self-correct becomes a step-5 hard failure that a 7B model doesn't recover from. Teams that tried to run full agentic pipelines locally found that the latency gains from local inference disappeared once they added retry logic and error recovery. The total wall-clock time was often worse than calling a cloud model outright.
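The arithmetic behind that is worth making explicit. With illustrative per-step success rates, compounding over a six-step chain looks like this:

```python
# Illustrative compounding: if each step of a 6-step chain succeeds
# independently with probability p, the whole chain succeeds with p**6.
for p in (0.98, 0.95, 0.90):
    print(f"per-step {p:.0%} -> full 6-step chain {(p ** 6):.0%}")
# per-step 98% -> full 6-step chain 89%
# per-step 95% -> full 6-step chain 74%
# per-step 90% -> full 6-step chain 53%
```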
Customer-facing chat at general scope
General-purpose chat for customer support, where the query distribution is wide and unpredictable, keeps going back to cloud inference. The failure mode is not mean-case performance — a 13B model can handle 90% of queries adequately. It is tail-case quality. In a customer-facing context, the 1-in-10 visibly wrong or confused response is the one that gets escalated. The frontier model's wider knowledge and better instruction following at the tail of the query distribution are worth the per-query cost when the alternative is a customer screenshot of a broken answer.
Codebase-wide reasoning and architectural advice
"Explain this codebase" and "what would I need to change to add X" require both a long context window and strong multi-step reasoning. Local models with 8k-16k context windows fail at the first requirement, and their reasoning quality on these tasks is noticeably below that of frontier models. IDE vendors who tried to make local models handle "explain this module" found that users quickly learned to route complex questions to the cloud endpoint while keeping autocomplete local. The behaviour sorted itself out through usage, not through a configuration decision.
The hybrid decision: when to reach for the API anyway
The pattern in well-designed production systems is a latency-times-scope matrix rather than a binary local/cloud choice. A few conditions reliably favour local inference:
- Response latency matters and requests are high-frequency (sub-100ms, serving continuously)
- The task scope is fixed and the model can be fine-tuned on a labelled dataset
- Data privacy or residency requirements block sending data to a third-party API
- Cost arithmetic at scale favours on-premise GPU over per-token API pricing — roughly above 2-5M tokens per day on a given task
- Offline operation is a hard requirement
Cloud inference holds its position when:
- The task requires general world knowledge or multi-step reasoning
- The context window exceeds what local models handle reliably
- The query distribution is wide
- The engineering team lacks the capacity to fine-tune and maintain a local deployment
- The workload is sporadic enough that the fixed cost of local GPU capacity doesn't amortise
The most common mistake is applying cost reasoning from a batch workload to an interactive workload, or applying latency reasoning from an interactive workload to a batch workload. The arithmetic breaks down quickly when you cross those contexts.
Three criteria that actually matter for local LLMs in production
If you're evaluating whether a specific use case should run locally or in the cloud, three questions cut through most of the noise.
Is the task scope permanently constrained?
Autocomplete, classification, and extraction tasks have fixed input-output shapes that don't change quarter-to-quarter. Chat, Q&A, and agent tasks have open-ended shapes. Local inference excels at the former and consistently struggles with the latter. If you can describe the full range of expected inputs and outputs in a one-page spec that won't need constant revision, local inference is worth evaluating seriously. If the spec requires ongoing model updates to handle new query types, the task has open-ended drift.
What does a tail-case failure look like?
A 1-in-10 failure on autocomplete is a Tab key that never gets pressed. A 1-in-10 failure on customer-facing chat is a visibly wrong answer. Local models have a worse tail-case failure rate than frontier cloud models on most general tasks. The question isn't average quality — it's what the worst 5% looks like, and whether the product can absorb it. If it can't, local inference at the current capability level isn't ready for that task.
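In practice this means reporting the low quantiles of your eval scores alongside the mean. A small sketch, assuming you already have per-response quality scores from your own eval harness; the 0.6 floor is an illustrative threshold, not a recommendation.

```python
# Summarise the worst end of per-response quality scores, not just the average.
import numpy as np

def tail_report(scores: np.ndarray, floor: float = 0.6) -> dict:
    """scores: one 0.0-1.0 quality score per response, from your own eval harness."""
    return {
        "mean": float(scores.mean()),
        "p05": float(np.percentile(scores, 5)),            # what the worst 5% looks like
        "share_below_floor": float((scores < floor).mean()),
    }
```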
What's the actual cost comparison?
Local inference is not free. GPU hardware, energy, maintenance, model evaluation cycles, and the engineering time to fine-tune and serve a model are real costs. The break-even against cloud API pricing depends on volume, task complexity, and GPU occupancy. For most teams below 500k tokens per day on a given task, cloud APIs are cheaper once hidden engineering costs are counted. Above roughly 2-5M tokens per day on a constrained task, the economics shift in favour of local.
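The break-even itself is back-of-envelope arithmetic once you put numbers to it. Every figure below is an illustrative placeholder; the shape of the comparison is the point, not the specific values.

```python
# Back-of-envelope break-even between per-token cloud pricing and local serving.
# All numbers are illustrative placeholders; substitute your own quotes.
CLOUD_PRICE_PER_1K_TOKENS = 0.03   # USD, blended input and output
GPU_COST_PER_DAY = 40.0            # amortised hardware, energy, hosting
ENGINEERING_COST_PER_DAY = 60.0    # share of the people maintaining the deployment

def daily_costs(tokens_per_day: float) -> tuple[float, float]:
    cloud = tokens_per_day / 1_000 * CLOUD_PRICE_PER_1K_TOKENS
    # Local cost is fixed, assuming the volume fits on the capacity you provisioned.
    local = GPU_COST_PER_DAY + ENGINEERING_COST_PER_DAY
    return cloud, local

for volume in (500_000, 2_000_000, 5_000_000, 20_000_000):
    cloud, local = daily_costs(volume)
    print(f"{volume:>11,} tokens/day  cloud ${cloud:8.2f}  local ${local:8.2f}")
```

With these placeholder numbers the crossover lands between 2M and 5M tokens per day, which is the range quoted above; with your own prices and occupancy the crossover moves, but the structure of the calculation doesn't.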
What the next twelve months probably look like
The 8B-13B model tier is expanding in capability without proportional increases in hardware requirements. Quantisation improvements and architectural changes such as multi-head latent attention and SSM hybrids are pushing the capability curve upward. Tasks that required a 70B model 18 months ago are increasingly handled by 13B models in 2026. The sweet spot for local inference is getting wider, not narrower.
Speculative decoding is closing the latency gap for longer generations. Small-model draft plus large-model verification is appearing in more production systems, particularly for code generation where acceptance rates on small-model drafts are high.
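The mechanism is straightforward to sketch in its greedy form: a small draft model proposes a handful of tokens, and the target model scores the whole proposal in one forward pass, keeping the longest prefix it agrees with. The model names below are placeholders (the two models must share a tokenizer), and production implementations add KV-cache reuse, batching, and sampling-aware acceptance rules on top of this.

```python
# Toy sketch of greedy speculative decoding with a draft/target model pair.
# Both model IDs are placeholders; they must share a tokenizer and vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "placeholder/draft-1b"     # hypothetical small draft model
TARGET_ID = "placeholder/target-13b"  # hypothetical larger target model

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID)

@torch.no_grad()
def speculative_step(input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round: the draft proposes k tokens, the target keeps the prefix it agrees with."""
    # 1. Draft model proposes k tokens greedily, one cheap forward pass per token.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Target model scores the whole proposal in a single forward pass.
    target_logits = target(draft_ids).logits
    n = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        # The target's greedy choice for the position of draft token i.
        choice = target_logits[:, n + i - 1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, choice], dim=-1)
        if not torch.equal(choice, draft_ids[:, n + i : n + i + 1]):
            break  # first disagreement: keep the target's token and stop
    else:
        # All k draft tokens accepted: the target's final logits give one extra token.
        bonus = target_logits[:, -1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```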
Fine-tuning pipelines are cheaper and faster. The cost of creating a domain-specific model for classification or extraction has dropped to the point where it is comparable to a few weeks of cloud API usage for a medium-volume task. This changes the "is it worth building a local model at all" calculation for a much wider range of engineering teams.
What will not change: frontier cloud models keep the lead on open-ended reasoning, long-context Q&A, and general-purpose chat. The gap between a 13B local model and a frontier cloud model on multi-step reasoning is not a hardware or quantisation problem. It is a scale problem, and the scale advantage of frontier cloud providers is structural. Local inference will keep winning at the constrained-scope, high-frequency, latency-sensitive, privacy-sensitive end of the task spectrum. That is a real and growing category. It is not a replacement for cloud inference on the other end.
Related reading
The AI wrapper debate, three years in: what the survivors built
Three years after the GPT-4 wrapper wave, a handful of AI companies are thriving and most are gone. The split was not random — and the pattern tells you something useful about building on top of LLMs in 2026.
LLM database access: the RBAC gap most teams don't see
Giving an LLM access to your database is easy. The problem is that your application-layer RBAC is invisible when the model generates SQL. Here's where it goes wrong and how to fix it at the layer that enforces.
The AI coding productivity data keeps contradicting itself. Here's why.
AI coding productivity studies swing between '26% faster' and '19% slower on real tasks'. Both are probably correct — for different things. Here's what the research actually measures and what you should track instead.