Local LLMs in production: what's actually shipping, what got abandoned, and how to tell the difference
A task-category audit of where local inference genuinely wins, where it quietly failed, and what three questions cut through the benchmark noise.
The local LLMs-in-production measurement problem
The conversation about local LLMs in production has a measurement problem. The Datadog State of AI Engineering report from late 2025 found that 67% of enterprises had deployed LLMs in some capacity, but fewer than 30% had those systems performing reliably in production. That 37-point gap is not a measurement artefact. It reflects something structural about how the category is talked about versus how it is actually used.
"Local LLM in production" gets conflated with "we tested a local model" or "we have a research project running" or "we have an endpoint that answers test traffic." None of those are production. Production is when real users depend on the system and its failure mode is your pager going off.
This piece separates the two for the local inference case specifically: what task categories are genuinely shipping at scale with local or on-premise models, which ones got tried and quietly dropped, and what the actual decision criteria look like when you strip out the benchmark framing.
Where local inference is genuinely winning
Four task categories have an unambiguous production track record with local models. They share a common shape: narrow scope, high frequency, tight latency budgets, and constrained input-output shapes.
IDE code completion
This is where local inference has the clearest production track record. GitHub Copilot, Cursor, and Codeium all ship a mix of local and cloud inference, with local models handling token-by-token autocomplete and cloud models handling longer completions, chat, and anything that benefits from a larger context window. The latency argument is real: cloud round-trips at 80-200ms interrupt typing flow in a way users notice. Local 3B-7B models quantised to 4-bit can run autocomplete at 80-120 tokens per second on a recent M-series Mac or a mid-range GPU-equipped developer workstation. The task is constrained, the error surface is recoverable (a bad suggestion is just a Tab key not pressed), and the frequency is high enough that per-query cloud costs would add up quickly.
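A minimal sketch of the local half of that setup, using llama-cpp-python with a 4-bit-quantised model. The model file, context size, and stop sequence are illustrative assumptions rather than any particular vendor's configuration; real editors also use the fill-in-the-middle prompt format the specific model was trained on.

```python
# Minimal local-autocomplete sketch using llama-cpp-python.
# The GGUF path is a placeholder for whichever 4-bit quantised code model you run.
from llama_cpp import Llama

llm = Llama(
    model_path="models/code-7b-q4_k_m.gguf",  # hypothetical 4-bit quantised model
    n_ctx=2048,        # small window: autocomplete only sees nearby code
    n_gpu_layers=-1,   # offload all layers to the GPU if one is available
)

def complete(prefix: str, max_tokens: int = 48) -> str:
    """Return a short continuation of the code under the cursor."""
    out = llm(
        prefix,
        max_tokens=max_tokens,
        temperature=0.2,    # low temperature: predictable, conservative suggestions
        stop=["\n\n"],      # stop at the next blank line
    )
    return out["choices"][0]["text"]

print(complete("def parse_invoice(path: str) -> dict:\n    "))
```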
Document and content classification
A 7B model fine-tuned for a specific taxonomy outperforms a general-purpose frontier model on that taxonomy at a fraction of the cost, and it runs offline. Companies with document-heavy workflows in legal, insurance, and finance have seen real ROI from this pattern. The input and output are constrained, the model can be fine-tuned on domain-labelled data, and the cost and throughput profile favours local inference for batch workloads. A pattern that has emerged: use a cloud model to define and validate the initial taxonomy, then use a local fine-tuned model for the production classification pipeline.
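A sketch of the production half of that pattern, assuming the fine-tuning run has already produced a checkpoint for your taxonomy. The model name and the example label are hypothetical; the point is that the serving path is a plain local batch job with no per-document API cost.

```python
# Sketch: serving a fine-tuned local classifier over a fixed taxonomy.
# "acme/contract-clause-classifier" and the label set are placeholders for
# whatever checkpoint your own fine-tuning run produced.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="acme/contract-clause-classifier",  # locally hosted fine-tuned model
    device=0,                                 # first local GPU
)

def classify_batch(documents: list[str]) -> list[dict]:
    # Push the whole queue through the local GPU in batches.
    return classifier(documents, batch_size=32, truncation=True)

results = classify_batch(["Either party may terminate this agreement ..."])
print(results[0])  # e.g. {"label": "termination_clause", "score": 0.97}
```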
On-device personalisation and keyboard suggestions
Apple's Writing Tools, Smart Reply, and predictive keyboard features, together with Google's Pixel on-device AI, use small quantised models for tasks where the privacy argument is non-negotiable and the task scope is narrow. These are not general-purpose chat. They are autocomplete for specific UX surfaces, with a latency budget of 30-100ms and a small context window by design. The category works because the task is permanently constrained. Nobody is asking an on-device keyboard model to reason over a 50-page document.
Batch data extraction and transformation
ETL pipelines that extract structured fields from unstructured documents — invoice parsing, contract clause extraction, form digitisation — work well with local inference when the schema is fixed and the model is fine-tuned on examples. The accuracy ceiling is below frontier cloud models, but when 95%+ accuracy is good enough, the cost profile of running a 13B model on internal GPUs often beats per-token cloud pricing at the volumes these pipelines run.
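Most of the reliability in these pipelines comes from enforcing the fixed schema on the model output rather than trusting it. A sketch of that shape, with a hypothetical `local_generate` call standing in for whichever local serving stack you run and an illustrative invoice schema:

```python
# Sketch: fixed-schema extraction with strict validation of the model output.
# `local_generate` is a stand-in for your local serving stack (llama.cpp, vLLM, ...);
# the Invoice schema is an illustrative example, not a reference design.
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):
    vendor: str
    invoice_number: str
    total_amount: float
    currency: str

PROMPT = (
    "Extract vendor, invoice_number, total_amount and currency from the "
    "invoice below. Reply with JSON only.\n\n{text}\n\nJSON:"
)

def extract(text: str, local_generate) -> Invoice | None:
    raw = local_generate(PROMPT.format(text=text))
    try:
        return Invoice.model_validate_json(raw)
    except ValidationError:
        # Rows that fail validation go to a retry or human-review queue
        # instead of silently entering the downstream pipeline.
        return None
```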
| Task | Local: what works | Local: the ceiling | Recommendation |
|---|---|---|---|
| Code autocomplete | Sub-50ms, offline, private | Struggles with complex multi-file reasoning | Local wins |
| Document classification (fixed schema) | Fine-tunable, cheap at volume | Needs labelled in-domain data to compete | Local wins (fine-tuned) |
| On-device keyboard suggestions | Battery, privacy, no network required | Not suitable for long-form generation | Local wins |
| Batch data extraction | Cost-effective at scale, deterministic | Accuracy ceiling below frontier models | Local wins at volume |
| Long-context document Q&A | Fast retrieval step works locally | Generation quality fails above 8k-token context | Hybrid: local retrieval, cloud generation |
| Multi-step agentic workflows | Individual step speed | Error accumulates with no self-correction | Cloud wins |
| General-purpose customer chat | Handles majority of queries adequately | Tail-case failure rate too high | Cloud wins |
| Codebase-wide reasoning | Handles single-file questions | Context limit and reasoning ceiling | Cloud wins |
What got tried and quietly dropped
The winning cases above all share one trait: the task is narrow. The failures cluster around the opposite — the task was too open-ended for the model capacity available locally.
Long-context document Q&A on-device
The pitch was compelling in 2024: run a full RAG pipeline on-device, no data leaves the machine, no cloud dependency. The reality is that local models in the 7B-13B range struggle with context windows beyond 8k tokens, where they start losing track of content from earlier in the window, and with documents where the answer requires synthesising across multiple passages. Most teams that tried this for enterprise document Q&A converged on a hybrid by mid-2025: local embeddings and retrieval, cloud inference for the answer-generation step. The latency and privacy benefits of local retrieval are preserved; the accuracy problem at the generation step is handled by the model that can actually handle it.
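The split is straightforward to express. A sketch under obvious assumptions: sentence-transformers handles the on-device embedding and retrieval step, and a placeholder `cloud_generate` function stands in for whichever hosted model writes the answer.

```python
# Sketch of the hybrid pattern: embeddings and retrieval stay local; only the
# question and the top-k retrieved snippets are sent to the cloud model.
# The embedding model and `cloud_generate` are illustrative choices.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # runs entirely on-device

def retrieve(question: str, chunks: list[str], k: int = 5) -> list[str]:
    chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)
    query_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vecs @ query_vec          # cosine similarity (vectors are normalised)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

def answer(question: str, chunks: list[str], cloud_generate) -> str:
    context = "\n\n".join(retrieve(question, chunks))
    # Only the question and these snippets leave the machine, not the full corpus.
    return cloud_generate(f"Answer using only this context:\n{context}\n\nQ: {question}")
```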
Multi-step agentic workflows
The failure mode is predictable in retrospect. A 7B model following a 6-step tool-use chain has enough headroom for each individual step, but error accumulates across the chain. A step-3 misclassification that a frontier model would self-correct becomes a step-5 hard failure that a 7B model doesn't recover from. Teams that tried to run full agentic pipelines locally found that the latency gains from local inference disappeared once they added retry logic and error recovery. The total wall-clock time was often worse than calling a cloud model outright.
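The arithmetic behind that is worth making explicit. With illustrative per-step success rates, compounding over a six-step chain looks like this:

```python
# Illustrative compounding: if each step of a 6-step chain succeeds
# independently with probability p, the whole chain succeeds with p**6.
for p in (0.98, 0.95, 0.90):
    print(f"per-step {p:.0%} -> full 6-step chain {(p ** 6):.0%}")
# per-step 98% -> full 6-step chain 89%
# per-step 95% -> full 6-step chain 74%
# per-step 90% -> full 6-step chain 53%
```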
Customer-facing chat at general scope
General-purpose chat for customer support, where the query distribution is wide and unpredictable, keeps going back to cloud inference. The failure mode is not mean-case performance — a 13B model can handle 90% of queries adequately. It is tail-case quality. In a customer-facing context, the 1-in-10 visibly wrong or confused response is the one that gets escalated. The frontier model's wider knowledge and better instruction following at the tail of the query distribution are worth the per-query cost when the alternative is a customer screenshot of a broken answer.
Codebase-wide reasoning and architectural advice
"Explain this codebase" and "what would I need to change to add X" require both a long context window and strong multi-step reasoning. Local models with 8k-16k context windows fail at the first requirement, and their reasoning quality on these tasks is noticeably below that of frontier models. IDE vendors who tried to make local models handle "explain this module" found that users quickly learned to route complex questions to the cloud endpoint while keeping autocomplete local. The behaviour sorted itself out through usage, not through a configuration decision.
The hybrid decision: when to reach for the API anyway
The pattern in well-designed production systems is a latency-times-scope matrix rather than a binary local/cloud choice. A few conditions reliably favour local inference:
- Response latency matters and requests are high-frequency (sub-100ms, serving continuously)
- The task scope is fixed and the model can be fine-tuned on a labelled dataset
- Data privacy or residency requirements block sending data to a third-party API
- Cost arithmetic at scale favours on-premise GPU over per-token API pricing — roughly above 2-5M tokens per day on a given task
- Offline operation is a hard requirement
Cloud inference holds its position when:
- The task requires general world knowledge or multi-step reasoning
- The context window exceeds what local models handle reliably
- The query distribution is wide
- The engineering team lacks the capacity to fine-tune and maintain a local deployment
- The workload is sporadic enough that the fixed cost of local GPU capacity doesn't amortise
The most common mistake is applying cost reasoning from a batch workload to an interactive workload, or applying latency reasoning from an interactive workload to a batch workload. The arithmetic breaks down quickly when you cross those contexts.
Three criteria that actually matter for local LLMs in production
If you're evaluating whether a specific use case should run locally or in the cloud, three questions cut through most of the noise.
Is the task scope permanently constrained?
Autocomplete, classification, and extraction tasks have fixed input-output shapes that don't change quarter-to-quarter. Chat, Q&A, and agent tasks have open-ended shapes. Local inference excels at the former and consistently struggles with the latter. If you can describe the full range of expected inputs and outputs in a one-page spec that won't need constant revision, local inference is worth evaluating seriously. If the spec requires ongoing model updates to handle new query types, the task has open-ended drift.
What does a tail-case failure look like?
A 1-in-10 failure on autocomplete is a Tab key that never gets pressed. A 1-in-10 failure on customer-facing chat is a visibly wrong answer. Local models have a worse tail-case failure rate than frontier cloud models on most general tasks. The question isn't average quality — it's what the worst 5% looks like, and whether the product can absorb it. If it can't, local inference at the current capability level isn't ready for that task.
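In practice this means reporting the low quantiles of your eval scores alongside the mean. A small sketch, assuming you already have per-response quality scores from your own eval harness; the 0.6 floor is an illustrative threshold, not a recommendation.

```python
# Summarise the worst end of per-response quality scores, not just the average.
import numpy as np

def tail_report(scores: np.ndarray, floor: float = 0.6) -> dict:
    """scores: one 0.0-1.0 quality score per response, from your own eval harness."""
    return {
        "mean": float(scores.mean()),
        "p05": float(np.percentile(scores, 5)),            # what the worst 5% looks like
        "share_below_floor": float((scores < floor).mean()),
    }
```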
What's the actual cost comparison?
Local inference is not free. GPU hardware, energy, maintenance, model evaluation cycles, and the engineering time to fine-tune and serve a model are real costs. The break-even against cloud API pricing depends on volume, task complexity, and GPU occupancy. For most teams below 500k tokens per day on a given task, cloud APIs are cheaper once hidden engineering costs are counted. Above roughly 2-5M tokens per day on a constrained task, the economics shift in favour of local.
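The break-even itself is back-of-envelope arithmetic once you put numbers to it. Every figure below is an illustrative placeholder; the shape of the comparison is the point, not the specific values.

```python
# Back-of-envelope break-even between per-token cloud pricing and local serving.
# All numbers are illustrative placeholders; substitute your own quotes.
CLOUD_PRICE_PER_1K_TOKENS = 0.03   # USD, blended input and output
GPU_COST_PER_DAY = 40.0            # amortised hardware, energy, hosting
ENGINEERING_COST_PER_DAY = 60.0    # share of the people maintaining the deployment

def daily_costs(tokens_per_day: float) -> tuple[float, float]:
    cloud = tokens_per_day / 1_000 * CLOUD_PRICE_PER_1K_TOKENS
    # Local cost is fixed, assuming the volume fits on the capacity you provisioned.
    local = GPU_COST_PER_DAY + ENGINEERING_COST_PER_DAY
    return cloud, local

for volume in (500_000, 2_000_000, 5_000_000, 20_000_000):
    cloud, local = daily_costs(volume)
    print(f"{volume:>11,} tokens/day  cloud ${cloud:8.2f}  local ${local:8.2f}")
```

With these placeholder numbers the crossover lands between 2M and 5M tokens per day, which is the range quoted above; with your own prices and occupancy the crossover moves, but the structure of the calculation doesn't.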
What the next twelve months probably look like
The 8B-13B model tier is expanding in capability without proportional increases in hardware requirements. Quantisation improvements and architectural changes such as multi-head latent attention and SSM hybrids are pushing the capability curve upward. Tasks that required a 70B model 18 months ago are increasingly handled by 13B models in 2026. The sweet spot for local inference is getting wider, not narrower.
Speculative decoding is closing the latency gap for longer generations. Small-model draft plus large-model verification is appearing in more production systems, particularly for code generation where acceptance rates on small-model drafts are high.
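The mechanism is straightforward to sketch in its greedy form: a small draft model proposes a handful of tokens, and the target model scores the whole proposal in one forward pass, keeping the longest prefix it agrees with. The model names below are placeholders (the two models must share a tokenizer), and production implementations add KV-cache reuse, batching, and sampling-aware acceptance rules on top of this.

```python
# Toy sketch of greedy speculative decoding with a draft/target model pair.
# Both model IDs are placeholders; they must share a tokenizer and vocabulary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_ID = "placeholder/draft-1b"     # hypothetical small draft model
TARGET_ID = "placeholder/target-13b"  # hypothetical larger target model

tok = AutoTokenizer.from_pretrained(TARGET_ID)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_ID)
target = AutoModelForCausalLM.from_pretrained(TARGET_ID)

@torch.no_grad()
def speculative_step(input_ids: torch.Tensor, k: int = 4) -> torch.Tensor:
    """One round: the draft proposes k tokens, the target keeps the prefix it agrees with."""
    # 1. Draft model proposes k tokens greedily, one cheap forward pass per token.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Target model scores the whole proposal in a single forward pass.
    target_logits = target(draft_ids).logits
    n = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        # The target's greedy choice for the position of draft token i.
        choice = target_logits[:, n + i - 1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, choice], dim=-1)
        if not torch.equal(choice, draft_ids[:, n + i : n + i + 1]):
            break  # first disagreement: keep the target's token and stop
    else:
        # All k draft tokens accepted: the target's final logits give one extra token.
        bonus = target_logits[:, -1, :].argmax(-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```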
Fine-tuning pipelines are cheaper and faster. The cost of creating a domain-specific model for classification or extraction has dropped to the point where it is comparable to a few weeks of cloud API usage for a medium-volume task. This changes the "is it worth building a local model at all" calculation for a much wider range of engineering teams.
What will not change: frontier cloud models keep the lead on open-ended reasoning, long-context Q&A, and general-purpose chat. The gap between a 13B local model and a frontier cloud model on multi-step reasoning is not a hardware or quantisation problem. It is a scale problem, and the scale advantage of frontier cloud providers is structural. Local inference will keep winning at the constrained-scope, high-frequency, latency-sensitive, privacy-sensitive end of the task spectrum. That is a real and growing category. It is not a replacement for cloud inference on the other end.
Related reading
The AI wrapper debate, three years in: what the survivors built
Three years after the GPT-4 wrapper wave, a handful of AI companies are thriving and most are gone. The split was not random — and the pattern tells you something useful about building on top of LLMs in 2026.
LLM database access: the RBAC gap most teams don't see
Giving an LLM access to your database is easy. The problem is that your application-layer RBAC is invisible when the model generates SQL. Here's where it goes wrong and how to fix it at the layer that enforces.
The AI coding productivity data keeps contradicting itself. Here's why.
AI coding productivity studies swing between '26% faster' and '19% slower on real tasks'. Both are probably correct — for different things. Here's what the research actually measures and what you should track instead.