How much VRAM do I need to run a 30B parameter model in production?

A 30B model in FP16 needs approximately 60GB of VRAM. On a single 80GB card (A100 or H100 80GB), it fits with headroom for the KV cache under reasonable batch sizes. With 4-bit quantisation you can run 30B on a 24GB card (RTX 3090/4090), but throughput under batching drops significantly. For sustained production inference with predictable latency, FP16 on appropriate hardware is the better starting point.

Is Ollama suitable for production inference?

Ollama is excellent for local development and prototyping but is not designed for production serving. It lacks continuous batching, request queuing, fine-grained memory management, and multi-LoRA support. For production workloads, vLLM and SGLang are the mature options — both expose OpenAI-compatible APIs, making them straightforward drop-in replacements for API-based code.

My workload involves sensitive data. Does that mean I must self-host?

Not automatically. Many managed API providers offer data processing agreements, contractual commitments that data is not used for model training, and enterprise tiers with additional retention controls. Read the provider's DPA before concluding that self-hosting is required. Many regulated industries operate successfully on managed APIs with the right contractual structure in place. Where self-hosting definitively wins is when a hard 'data cannot leave this boundary' requirement has no DPA carve-out.

Which open-weight model is the best starting point for a production evaluation in 2026?

For general-purpose tasks, Gemma 4 26B A4B (Apache 2.0 licensed) is a practical starting point: it balances quality, efficiency, and permissive licensing. Qwen 3 30B and Mistral Small 4 are strong alternatives. Llama 4 Scout is worth evaluating for long-context workloads. Run your own task-specific evaluation rather than relying on benchmark rankings — they are poor predictors of performance on real production task distributions.

AI & LLMsJun 1, 20265 min readReviewed Jun 1, 2026

Local LLMs in production, 2026: the honest economics

Self-hosting open-weight models is cheaper and easier than 18 months ago. That does not make it the right call for most workloads.

By FlowVerify Editorial Team

The picture has changed — just not the way the headlines say

Two years ago, the case against local LLMs was clear: quality gaps were real, serving infrastructure was fiddly, and engineering overhead was steep relative to an API call. The case for was mostly theoretical — data privacy, cost at scale, customisation — but 'at scale' turned out to mean volumes that most companies never actually reached.

The 2026 picture has shifted. Gemma 4 26B A4B runs on a single A100 80GB at throughput that previously required a multi-GPU rig. Qwen 3 30B and Mistral Small 4 are production-quality on a broad class of tasks. Llama 4 Scout handles long-context workloads that once required much larger models. The quantisation story has improved: 4-bit models that were noticeably weaker in 2024 are now competitive for most non-frontier tasks.

And yet: the blog posts telling you that local LLMs are now the obvious default get the economics wrong in one consistent direction. They compare token-level costs without accounting for the full cost stack, and produce a break-even calculation that looks dramatically in favour of self-hosting until you add the items that actually dominate.

The four scenarios where local actually wins

Before the numbers, it is worth naming the four situations where self-hosting is the right call regardless of cost. For these situations, cost is not the primary driver.

Data that cannot leave your infrastructure. Healthcare records, financial data under specific regulatory regimes, government workloads, and legal documents in some jurisdictions. If your data classification policy says 'no external processing', the cost comparison is moot. Self-hosting is the only option, and the engineering overhead is a compliance cost, not a discretionary one.

High-volume, high-cost-tier workloads. Running GPT-4o-tier APIs at $5/$15 per million tokens input/output at 10 million or more tokens per day means an API bill of $1,500–3,000 per month or higher. A single A100 80GB cloud instance at roughly $2.50 per hour ($1,800 per month running continuously), running a competitive open-weight model, can break even at that volume, provided quality holds for your specific task.

Edge or offline inference. IoT devices, on-device processing, situations requiring inference without network round-trips. Managed APIs cannot help here, full stop.

Fine-tuning with proprietary task data. If your use case requires a domain-specific fine-tune that you cannot send to a provider's infrastructure, self-hosting is part of the fine-tuning pipeline regardless of cost.

If none of these four apply, the decision comes down to economics. That is where the analysis usually goes wrong.

The cost comparison done right

The standard comparison goes: API at $3 per million tokens versus self-hosted at $0.07 per million tokens. The self-hosted number comes from a single line item: GPU compute cost divided by throughput. It is not wrong; it is incomplete.

Here is what the full cost stack actually looks like:

Cost element	Managed API	Self-hosted
Token compute	$0.15–$15/M tokens (varies by model tier)	$0.05–$0.10/M (at sustained GPU use)
Infrastructure baseline	$0	$500–2,000/month (GPU instances or on-prem amortised)
Serving setup (one-time)	$0	2–4 weeks of senior engineer time
Serving maintenance (ongoing)	$0	~0.1–0.25 FTE per model in production
Model updates	Automatic, provider-managed	Manual: evaluate, test, redeploy (4–6 cycles/year)
Traffic spikes	Automatic scaling	Manual or pre-provisioned headroom
Uptime SLA	Provider-backed	Your responsibility
Latency (p50)	300–800ms	50–300ms (lower; higher variance)

Full cost comparison: managed API vs self-hosted inference

The two items that flip the comparison are engineering maintenance overhead and model update cycles. Neither appears in most self-hosting cost analyses.

The engineering overhead is not a one-time cost

Most break-even analyses treat serving setup as a one-time cost: two weeks of work, then done. This is where most teams get burned.

Self-hosting a model in production means owning a serving stack. In practice: capacity planning for traffic peaks, graceful degradation when GPU memory is exhausted, load balancing if you run more than one instance, monitoring for quality regression over time, and responding to incidents when the inference server crashes at 2am. None of this is technically difficult for an experienced ML infrastructure engineer. At a 15-person company, however, you probably do not have one, which means these tasks land as opportunity cost on whoever is closest.

A working rule: budget roughly 0.2 FTE per model you run in production. Depending on your engineering cost base, that is $4,000–7,000 per month. It is a recurring line item, not a setup cost.

The break-even in practice

Working through the numbers: if you are replacing a mid-tier managed API (something in the GPT-4o-mini or Haiku price range, roughly $0.15–$0.60/M blended) — the break-even using the full cost stack lands around 50–100 million tokens per month.

At 50M tokens per month, the API spend is roughly $15,000 per year. A self-hosted setup with one L40S instance running continuously, amortised setup, and 0.2 FTE of maintenance runs approximately $12,000–18,000 per year, depending on your engineering labour cost. The cost advantage exists, but it is not the order-of-magnitude saving that token-level comparisons imply.

At 10M tokens per month (a $3,000/year API spend at mid-tier pricing), it is genuinely hard to beat self-hosting once engineering overhead is included. Most teams at this usage level are better served by the API.

The picture changes if you are replacing a frontier-tier API at $5/$15 per million tokens. At 10M tokens per month, the API spend is roughly $15,000 per year. A capable 30B-class open-weight model, self-hosted, can undercut that, provided quality is acceptable. For most non-frontier tasks in 2026, it is.

A rough cost estimator (substitute your own numbers):

rough-costs.py

# Adjust these three variables for your workload
TOKENS_PER_DAY        = 5_000_000   # daily token estimate
API_COST_PER_MILLION  = 0.30        # $0.30/M for mid-tier; $6.00/M for frontier
GPU_COST_PER_HOUR     = 0.72        # L40S on-demand; adjust for your provider

# Engineering overhead
ENGINEERING_FTE       = 0.20        # fraction of a senior engineer's time
ENGINEERING_MONTHLY   = 12_000      # your senior engineer's monthly cost

api_monthly = TOKENS_PER_DAY * 30 / 1_000_000 * API_COST_PER_MILLION
gpu_monthly = GPU_COST_PER_HOUR * 24 * 30
eng_monthly = ENGINEERING_MONTHLY * ENGINEERING_FTE
self_monthly = gpu_monthly + eng_monthly

print(f"API monthly:          ${api_monthly:,.0f}")
print(f"Self-hosted monthly:  ${self_monthly:,.0f}")
print(f"  GPU compute:        ${gpu_monthly:,.0f}")
print(f"  Engineering (est.): ${eng_monthly:,.0f}")
print()
winner = "self-hosted" if self_monthly < api_monthly else "managed API"
print(f"Break-even favours:   {winner}")

What to benchmark before committing local LLMs to production

If the numbers suggest you are near the break-even threshold, the decision comes down to quality on your specific task. Generic benchmarks will not answer this.

The single most useful test is not MMLU or HumanEval. It is a curated set of 100–200 actual production inputs — the kind of prompts your system sends today — rated either by human reviewers or a judge model, run against both the candidate self-hosted model and your current API. If quality on your task is within 5%, the economics usually justify the move. At a 10% gap, the managed API is providing value that the token math does not capture.

Other things to measure during any evaluation period:

P50, P95, and P99 latency under realistic concurrent load, not single-request latency in a quiet test environment
Throughput ceiling before quality degrades from aggressive batching
GPU memory behaviour under sustained load; fragmentation is a real production failure mode that synthetic benchmarks miss
Output quality on long inputs if your task involves documents rather than short prompts

If you cannot commit 2–3 days of engineering time to this evaluation, you cannot commit to self-hosting. The evaluation phase is a scaled-down version of the operational overhead you will carry permanently.

The honest recommendation

Self-hosting is viable at meaningfully lower thresholds than it was two years ago. Models are better, serving tooling (vLLM, SGLang) is more mature, and cloud GPU costs have come down. For data-sovereignty requirements, the analysis starts and ends with the compliance question — cost is secondary.

For everyone else: the cost advantage is real but narrower than token-level comparisons suggest, and it accrues only above a usage threshold that is higher than most teams currently hit. The more tractable question is not 'managed API or self-hosted' but 'which API tier is right for my current workload, and at what point should I revisit this as volume grows?'

Re-run the calculation when your API spend crosses $5,000 per month on a single use case. Below that, managed APIs hold the advantage on total cost — not because the local option does not work, but because zero engineering overhead has a value that the token maths does not include.

Frequently asked questions

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

Jul 2, 2026Read full article →

AI & LLMsJun 1, 20265 min readReviewed Jun 1, 2026

Local LLMs in production, 2026: the honest economics

Self-hosting open-weight models is cheaper and easier than 18 months ago. That does not make it the right call for most workloads.

By FlowVerify Editorial Team

The picture has changed — just not the way the headlines say

The four scenarios where local actually wins

Before the numbers, it is worth naming the four situations where self-hosting is the right call regardless of cost. For these situations, cost is not the primary driver.

Edge or offline inference. IoT devices, on-device processing, situations requiring inference without network round-trips. Managed APIs cannot help here, full stop.

If none of these four apply, the decision comes down to economics. That is where the analysis usually goes wrong.

The cost comparison done right

Here is what the full cost stack actually looks like:

Cost element	Managed API	Self-hosted
Token compute	$0.15–$15/M tokens (varies by model tier)	$0.05–$0.10/M (at sustained GPU use)
Infrastructure baseline	$0	$500–2,000/month (GPU instances or on-prem amortised)
Serving setup (one-time)	$0	2–4 weeks of senior engineer time
Serving maintenance (ongoing)	$0	~0.1–0.25 FTE per model in production
Model updates	Automatic, provider-managed	Manual: evaluate, test, redeploy (4–6 cycles/year)
Traffic spikes	Automatic scaling	Manual or pre-provisioned headroom
Uptime SLA	Provider-backed	Your responsibility
Latency (p50)	300–800ms	50–300ms (lower; higher variance)

Full cost comparison: managed API vs self-hosted inference

The two items that flip the comparison are engineering maintenance overhead and model update cycles. Neither appears in most self-hosting cost analyses.

The engineering overhead is not a one-time cost

Most break-even analyses treat serving setup as a one-time cost: two weeks of work, then done. This is where most teams get burned.

A working rule: budget roughly 0.2 FTE per model you run in production. Depending on your engineering cost base, that is $4,000–7,000 per month. It is a recurring line item, not a setup cost.

The break-even in practice

A rough cost estimator (substitute your own numbers):

rough-costs.py

# Adjust these three variables for your workload
TOKENS_PER_DAY        = 5_000_000   # daily token estimate
API_COST_PER_MILLION  = 0.30        # $0.30/M for mid-tier; $6.00/M for frontier
GPU_COST_PER_HOUR     = 0.72        # L40S on-demand; adjust for your provider

# Engineering overhead
ENGINEERING_FTE       = 0.20        # fraction of a senior engineer's time
ENGINEERING_MONTHLY   = 12_000      # your senior engineer's monthly cost

api_monthly = TOKENS_PER_DAY * 30 / 1_000_000 * API_COST_PER_MILLION
gpu_monthly = GPU_COST_PER_HOUR * 24 * 30
eng_monthly = ENGINEERING_MONTHLY * ENGINEERING_FTE
self_monthly = gpu_monthly + eng_monthly

print(f"API monthly:          ${api_monthly:,.0f}")
print(f"Self-hosted monthly:  ${self_monthly:,.0f}")
print(f"  GPU compute:        ${gpu_monthly:,.0f}")
print(f"  Engineering (est.): ${eng_monthly:,.0f}")
print()
winner = "self-hosted" if self_monthly < api_monthly else "managed API"
print(f"Break-even favours:   {winner}")

What to benchmark before committing local LLMs to production

If the numbers suggest you are near the break-even threshold, the decision comes down to quality on your specific task. Generic benchmarks will not answer this.

Other things to measure during any evaluation period:

P50, P95, and P99 latency under realistic concurrent load, not single-request latency in a quiet test environment
Throughput ceiling before quality degrades from aggressive batching
GPU memory behaviour under sustained load; fragmentation is a real production failure mode that synthetic benchmarks miss
Output quality on long inputs if your task involves documents rather than short prompts

Local LLMs in production, 2026: the honest economics

The picture has changed — just not the way the headlines say

The four scenarios where local actually wins

The cost comparison done right

The engineering overhead is not a one-time cost

The break-even in practice

What to benchmark before committing local LLMs to production

The honest recommendation

Frequently asked questions

Related reading

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

Microsoft's seven new MAI models make a lot more sense once you read the OpenAI contract behind them

$662 billion in AI data-center leases isn't on any balance sheet yet

Stay ahead on eSignatures, compliance, and document workflows

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

Local LLMs in production, 2026: the honest economics

The picture has changed — just not the way the headlines say

The four scenarios where local actually wins

The cost comparison done right

The engineering overhead is not a one-time cost

The break-even in practice

What to benchmark before committing local LLMs to production

The honest recommendation

Frequently asked questions

Related reading

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.

Microsoft's seven new MAI models make a lot more sense once you read the OpenAI contract behind them

$662 billion in AI data-center leases isn't on any balance sheet yet

Stay ahead on eSignatures, compliance, and document workflows

75% of enterprises rolled back an AI agent. Mature AI agent governance made that rate go up, not down.