The four engineering habits that become liabilities when building LLM features
Your instincts as a backend engineer are not wrong — they are calibrated for the wrong kind of system.
The best backend engineers I know had the hardest time with their first LLM feature. This surprised me at first, and then made sense: the habits that make someone precise and reliable in traditional software development are specifically miscalibrated for building with language models. Junior engineers who had not yet formed those habits often adapted faster.
This is not a claim about seniority or capability. It is a claim about specificity. Four particular instincts that every competent engineer has good reasons to hold become liabilities when the system you are building is probabilistic, end-to-end, and fails silently rather than noisily. Naming them is what makes it possible to put them down deliberately when building LLM features.
Habit 1: Expecting the same input to produce the same output
Traditional software is deterministic. The same input to the same code produces the same output, every time. When it does not, that is a bug. This principle is so foundational that it shapes how you design tests and what you mean by correctness: one set of inputs, one expected output, one assertion.
Language models are stochastic by design. With temperature set above zero (the default for almost every production use case), the same prompt produces different outputs on different calls. Often the differences are superficial. Sometimes they are substantial. The quality property you care about (whether the summary stays within the facts, whether the classification label holds) can be satisfied 90% of the time and fail 10%.
In our evaluation suite for a summarisation task, every case passed when we ran each one once. When we sampled the same cases 40 times each, three of them had an 81% pass rate on 'summary does not introduce information not in the source.' The assertions were correct. Running each case once made them meaningless.
The habit produces a specific blind spot: 'it looked right when I tested it.' That statement is meaningful for a deterministic system and nearly useless for an LLM feature. The replacement is to measure pass rate across repeated runs rather than a single pass. Setting a threshold (say, 93% of runs produce a valid output) is a statement about what production users actually experience. A single run is not.
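To see why, take the 81% figure above. A case whose true per-run pass rate is 81% comes back green on a single run four times out of five, which is indistinguishable from a healthy case. A rough calculation (illustrative numbers only, sketched in the same Python as the eval loop later in this piece) shows how quickly repeated sampling exposes it:

# Probability that all n runs of a case pass, given a true per-run pass rate p.
# With p = 0.81 (the three failing cases above), one run looks green 81% of
# the time; by 20 runs the chance of an all-green result is under 2%.
def prob_all_pass(p: float, n: int) -> float:
    return p ** n

for n in (1, 5, 10, 20, 40):
    print(f"{n:>2} runs: {prob_all_pass(0.81, n):.3f}")
# 1 runs: 0.810
# 5 runs: 0.349
# 10 runs: 0.122
# 20 runs: 0.015
# 40 runs: 0.000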
Habit 2: Testing components in isolation
The unit test provides confidence that each component is correct in isolation. If you compose correct components carefully, the composed system is correct. This is one of the most useful ideas in software engineering, and it works reliably for deterministic systems.
LLM feature quality does not compose this way. Output quality is an emergent property of the assembled system: the prompt, the user message, the conversation history, retrieved context, the model version, and the total token budget all interact. A prompt that works correctly in the playground can fail in production because it is combined with several kilobytes of conversation history, which pushes critical instructions into the middle of the context where models attend to them less reliably.
You cannot catch this by testing the prompt in isolation. The failure only appears in the assembled system under realistic conditions — the same class of failure as a race condition, where individual components are correct but the composition breaks. In one case we tracked, a feature passed every single prompt-level test and failed 23% of end-to-end evaluations on production-representative inputs, entirely on long conversations. The prompt was correct. The composed system was not.
The replacement is end-to-end evaluation on representative inputs: realistic user messages, typical conversation lengths, edge-case document sizes. The unit of evaluation is the user task, not the function call. Building these representative examples takes more effort up front. They catch the failures that actually reach users.
One practical way to build representative examples: run the feature in shadow mode against a sample of real production inputs before it goes live. Keep the outputs. Those shadow outputs, annotated for quality, become your evaluation set. They reflect what users actually send, not what you thought users would send when you designed the prompt.
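A minimal version of that shadow-mode capture is not much code. The sketch below assumes a run_feature entry point, an assemble_full_input step that builds the same prompt-plus-history-plus-context payload production uses, and a JSONL file as the store; all three names are placeholders for whatever your stack provides.

import json
import random
from datetime import datetime, timezone

def capture_shadow_examples(production_inputs, sample_rate=0.02,
                            path="shadow_outputs.jsonl"):
    """Run the feature silently on sampled production inputs and keep the results."""
    with open(path, "a") as f:
        for item in production_inputs:
            if random.random() > sample_rate:
                continue
            assembled = assemble_full_input(item)  # prompt + history + retrieved context
            output = run_feature(assembled)        # the exact code path production will use
            f.write(json.dumps({
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "input": assembled,
                "output": output,
                "quality_label": None,             # filled in during annotation
            }) + "\n")

Annotated, those records are the representative evaluation set the previous paragraph asks for.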
Habit 3: Errors are your primary quality signal
In a traditional backend service, error rate is the first metric you watch. If error rate is flat, the service is working. The whole discipline of on-call rotations and SLOs is built on this premise: observable failures are the signal that something needs attention.
LLM quality regressions are silent. The API returns HTTP 200. Tokens are consumed. The response reaches the user. Your error rate, p99 latency, and API client logs show nothing wrong, because nothing went wrong at the infrastructure level. The output quality just declined.
One team tracked their AI feature's operational metrics and support volume in parallel for eight weeks. The two diverged completely. Error rate stayed flat the entire time. Support tickets about AI-generated responses climbed 60% over six weeks, traceable in hindsight to a prompt change in week two that introduced a subtle ambiguity in how the model handled edge cases. The operational metrics gave no signal at all.
The replacement is quality monitoring built from the start: format validation, schema conformance for structured outputs, length distribution checks, and sampled automated or human review for harder quality dimensions. The critical piece is building this before you need it. Without a historical quality baseline, you cannot determine whether something has changed.
A useful mental model: treat output quality as you would treat database replication lag. You do not wait for replication to break completely before you look at it. You monitor the lag continuously and alert when it crosses a threshold. Quality needs the same treatment.
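A starting point does not need a platform. Here is a sketch of the cheap per-response checks using only the standard library; the key names and length bounds are illustrative placeholders, not recommendations.

import json

REQUIRED_KEYS = {"summary", "source_ids"}   # illustrative schema for a structured output

def quality_signals(raw_response: str) -> dict:
    """Cheap checks run on every response: does it parse, conform, and stay in bounds?"""
    signals = {"parses": False, "schema_ok": False, "length_ok": False}
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return signals
    signals["parses"] = True
    signals["schema_ok"] = isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()
    if signals["schema_ok"]:
        summary = parsed["summary"]
        signals["length_ok"] = isinstance(summary, str) and 50 <= len(summary) <= 1200
    return signals

Emit each signal as a metric and alert on the rolling pass rate, the same way you would alert on replication lag crossing a threshold.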
Habit 4: Git history explains behaviour changes
When an API starts behaving differently, the answer is in the code or configuration. This is nearly axiomatic in traditional software development. Git history is the reliable record of what changed and when. If behaviour changed, something changed in the codebase.
LLM behaviour can change without any change to your code. Model providers update their models, sometimes with announced versioning, sometimes without, and sometimes by updating a version in place. The model behind an endpoint you have been calling for months can behave differently from the same endpoint last week, with no entry in your git history to explain it.
This produces a specific class of debugging session: behaviour changes, you search the git history, find nothing, check environment variables and configuration, find nothing, and conclude the issue is intermittent. Engineers at several companies have spent two to four days on this before identifying a model-side update as the cause. The symptom is typically a change in response format that breaks downstream parsing.
The replacement has two parts: pin model versions explicitly, and add model version to your observability. Pinned versions ensure that provider-side updates do not reach production without your awareness. Logging model version per request lets you correlate behaviour changes with version boundaries after the fact. When you do upgrade a pinned version, run your evaluation suite before and after and compare output distributions rather than a single pass/fail.
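In practice both parts are a handful of lines. The sketch below uses a hypothetical provider_client.generate call and an invented version string; substitute your provider's SDK and its pinned model identifier.

import logging
import time

MODEL_VERSION = "provider-model-2025-06-01"   # pinned explicitly, never "latest"

logger = logging.getLogger("llm_requests")

def call_model(prompt: str) -> str:
    start = time.monotonic()
    response = provider_client.generate(model=MODEL_VERSION, prompt=prompt)  # hypothetical client
    logger.info("llm_request", extra={
        "model_version": MODEL_VERSION,   # correlate behaviour changes with version boundaries
        "latency_ms": int((time.monotonic() - start) * 1000),
        "output_chars": len(response),
    })
    return response

With the version in every log line, a behaviour change that lines up with a version boundary takes minutes to spot rather than days.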
| Engineering habit | What it produces | The replacement |
|---|---|---|
| Determinism | Test each case once; if it passes, ship. | Sample 20–50 runs per eval case; measure pass rate, not a single pass. |
| Isolation testing | Validate the prompt in the playground before integrating. | Evaluate end-to-end on representative inputs; the unit is the user task. |
| Error monitoring | Watch error rate; if it is flat, quality is fine. | Track output quality metrics from the start; silent regressions are normal. |
| Git-as-truth | Search the codebase when behaviour changes unexpectedly. | Pin model versions; log model version per request; treat upgrades as deploys. |
What this looks like when your team builds LLM features
None of the four replacements requires sophisticated tooling to start. An evaluation loop that samples each representative case 30 times, assembles the full input as production would, and reports pass rate as a fraction can be built in an afternoon. Running it in CI gives you regression detection without requiring a dedicated eval platform.
def eval_case(case, runs=30):
    results = [run_feature(case.input) for _ in range(runs)]
    pass_rate = sum(1 for r in results if check_quality(r, case.expected)) / runs
    return pass_rate  # e.g. 0.93 means 93% of runs met the quality bar

# Gate on the distribution, not on a single run
rate = eval_case(my_case)
assert rate >= 0.90, f"Pass rate {rate:.0%} below threshold"

The harder part is noticing when the old habit is active: reaching for an exact-value assertion, running the feature once and concluding it works, checking error rate instead of sampling outputs. These habits are fast and automatic because they have been reliable for years. Naming them is what makes it possible to set them aside deliberately — which is the specific work that building good LLM features requires.
The engineers who adapt fastest are not the ones who are less experienced. They are the ones who can recognise when an instinct is running and ask whether it applies to the system in front of them. That is true for any new class of system, and it is particularly true here.