The four engineering habits that become liabilities when building LLM features
Your instincts as a backend engineer are not wrong — they are calibrated for the wrong kind of system.
The best backend engineers I know had the hardest time with their first LLM feature. This surprised me at first, and then made sense: the habits that make someone precise and reliable in traditional software development are specifically miscalibrated for building with language models. Junior engineers who had not yet formed those habits often adapted faster.
This is not a claim about seniority or capability. It is a claim about specificity. Four particular instincts that every competent engineer has good reasons to hold become liabilities when the system you are building is probabilistic, end-to-end, and fails silently rather than noisily. Naming them is what makes it possible to put them down deliberately when building LLM features.
Habit 1: Expecting the same input to produce the same output
Traditional software is deterministic. The same input to the same code produces the same output, every time. When it does not, that is a bug. This principle is so foundational that it shapes how you design tests and what you mean by correctness: one set of inputs, one expected output, one assertion.
Language models are stochastic by design. With temperature set above zero (the default for almost every production use case), the same prompt produces different outputs on different calls. Often the differences are superficial. Sometimes they are substantial. The quality property you care about (whether the summary stays within the facts, whether the classification label holds) can be satisfied 90% of the time and fail 10%.
In our evaluation suite for a summarisation task, every case passed when we ran each one once. When we sampled the same cases 40 times each, three of them had an 81% pass rate on 'summary does not introduce information not in the source.' The assertions were correct. Running each case once made them meaningless.
The habit produces a specific blind spot: 'it looked right when I tested it.' That statement is meaningful for a deterministic system and nearly useless for an LLM feature. The replacement is to measure pass rate across repeated runs rather than a single pass. Setting a threshold (say, 93% of runs produce a valid output) is a statement about what production users actually experience. A single run is not.
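To see why, take the 81% figure above. A case whose true per-run pass rate is 81% comes back green on a single run four times out of five, which is indistinguishable from a healthy case. A rough calculation (illustrative numbers only, sketched in the same Python as the eval loop later in this piece) shows how quickly repeated sampling exposes it:

# Probability that all n runs of a case pass, given a true per-run pass rate p.
# With p = 0.81 (the three failing cases above), one run looks green 81% of
# the time; by 20 runs the chance of an all-green result is under 2%.
def prob_all_pass(p: float, n: int) -> float:
    return p ** n

for n in (1, 5, 10, 20, 40):
    print(f"{n:>2} runs: {prob_all_pass(0.81, n):.3f}")
# 1 runs: 0.810
# 5 runs: 0.349
# 10 runs: 0.122
# 20 runs: 0.015
# 40 runs: 0.000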
Habit 2: Testing components in isolation
The unit test provides confidence that each component is correct in isolation. If you compose correct components carefully, the composed system is correct. This is one of the most useful ideas in software engineering, and it works reliably for deterministic systems.
LLM feature quality does not compose this way. Output quality is an emergent property of the assembled system: the prompt, the user message, the conversation history, retrieved context, the model version, and the total token budget all interact. A prompt that works correctly in the playground can fail in production because it is combined with several kilobytes of conversation history, which pushes critical instructions into the middle of the context where models attend to them less reliably.
You cannot catch this by testing the prompt in isolation. The failure only appears in the assembled system under realistic conditions — the same class of failure as a race condition, where individual components are correct but the composition breaks. In one case we tracked, a feature passed every single prompt-level test and failed 23% of end-to-end evaluations on production-representative inputs, entirely on long conversations. The prompt was correct. The composed system was not.
The replacement is end-to-end evaluation on representative inputs: realistic user messages, typical conversation lengths, edge-case document sizes. The unit of evaluation is the user task, not the function call. Building these representative examples takes more effort up front. They catch the failures that actually reach users.
One practical way to build representative examples: run the feature in shadow mode against a sample of real production inputs before it goes live. Keep the outputs. Those shadow outputs, annotated for quality, become your evaluation set. They reflect what users actually send, not what you thought users would send when you designed the prompt.
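A minimal version of that shadow-mode capture is not much code. The sketch below assumes a run_feature entry point, an assemble_full_input step that builds the same prompt-plus-history-plus-context payload production uses, and a JSONL file as the store; all three names are placeholders for whatever your stack provides.

import json
import random
from datetime import datetime, timezone

def capture_shadow_examples(production_inputs, sample_rate=0.02,
                            path="shadow_outputs.jsonl"):
    """Run the feature silently on sampled production inputs and keep the results."""
    with open(path, "a") as f:
        for item in production_inputs:
            if random.random() > sample_rate:
                continue
            assembled = assemble_full_input(item)  # prompt + history + retrieved context
            output = run_feature(assembled)        # the exact code path production will use
            f.write(json.dumps({
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "input": assembled,
                "output": output,
                "quality_label": None,             # filled in during annotation
            }) + "\n")

Annotated, those records are the representative evaluation set the previous paragraph asks for.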
Habit 3: Errors are your primary quality signal
In a traditional backend service, error rate is the first metric you watch. If error rate is flat, the service is working. The whole discipline of on-call rotations and SLOs is built on this premise: observable failures are the signal that something needs attention.
LLM quality regressions are silent. The API returns HTTP 200. Tokens are consumed. The response reaches the user. Your error rate, p99 latency, and API client logs show nothing wrong, because nothing went wrong at the infrastructure level. The output quality just declined.
One team tracked their AI feature's operational metrics and support volume in parallel for eight weeks. The two diverged completely. Error rate stayed flat the entire time. Support tickets about AI-generated responses climbed 60% over six weeks, traceable in hindsight to a prompt change in week two that introduced a subtle ambiguity in how the model handled edge cases. The operational metrics gave no signal at all.
The replacement is quality monitoring built from the start: format validation, schema conformance for structured outputs, length distribution checks, and sampled automated or human review for harder quality dimensions. The critical piece is building this before you need it. Without a historical quality baseline, you cannot determine whether something has changed.
A useful mental model: treat output quality as you would treat database replication lag. You do not wait for replication to break completely before you look at it. You monitor the lag continuously and alert when it crosses a threshold. Quality needs the same treatment.
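A starting point does not need a platform. Here is a sketch of the cheap per-response checks using only the standard library; the key names and length bounds are illustrative placeholders, not recommendations.

import json

REQUIRED_KEYS = {"summary", "source_ids"}   # illustrative schema for a structured output

def quality_signals(raw_response: str) -> dict:
    """Cheap checks run on every response: does it parse, conform, and stay in bounds?"""
    signals = {"parses": False, "schema_ok": False, "length_ok": False}
    try:
        parsed = json.loads(raw_response)
    except (json.JSONDecodeError, TypeError):
        return signals
    signals["parses"] = True
    signals["schema_ok"] = isinstance(parsed, dict) and REQUIRED_KEYS <= parsed.keys()
    if signals["schema_ok"]:
        summary = parsed["summary"]
        signals["length_ok"] = isinstance(summary, str) and 50 <= len(summary) <= 1200
    return signals

Emit each signal as a metric and alert on the rolling pass rate, the same way you would alert on replication lag crossing a threshold.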
Habit 4: Git history explains behaviour changes
When an API starts behaving differently, the answer is in the code or configuration. This is nearly axiomatic in traditional software development. Git history is the reliable record of what changed and when. If behaviour changed, something changed in the codebase.
LLM behaviour can change without any change to your code. Model providers update their models, sometimes with announced versioning, sometimes without, and sometimes by updating a version in place. The model behind an endpoint you have been calling for months can behave differently from the same endpoint last week, with no entry in your git history to explain it.
This produces a specific class of debugging session: behaviour changes, you search the git history, find nothing, check environment variables and configuration, find nothing, and conclude the issue is intermittent. Engineers at several companies have spent two to four days on this before identifying a model-side update as the cause. The symptom is typically a change in response format that breaks downstream parsing.
The replacement has two parts: pin model versions explicitly, and add model version to your observability. Pinned versions ensure that provider-side updates do not reach production without your awareness. Logging model version per request lets you correlate behaviour changes with version boundaries after the fact. When you do upgrade a pinned version, run your evaluation suite before and after and compare output distributions rather than a single pass/fail.
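In practice both parts are a handful of lines. The sketch below uses a hypothetical provider_client.generate call and an invented version string; substitute your provider's SDK and its pinned model identifier.

import logging
import time

MODEL_VERSION = "provider-model-2025-06-01"   # pinned explicitly, never "latest"

logger = logging.getLogger("llm_requests")

def call_model(prompt: str) -> str:
    start = time.monotonic()
    response = provider_client.generate(model=MODEL_VERSION, prompt=prompt)  # hypothetical client
    logger.info("llm_request", extra={
        "model_version": MODEL_VERSION,   # correlate behaviour changes with version boundaries
        "latency_ms": int((time.monotonic() - start) * 1000),
        "output_chars": len(response),
    })
    return response

With the version in every log line, a behaviour change that lines up with a version boundary takes minutes to spot rather than days.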
| Engineering habit | What it produces | The replacement |
|---|---|---|
| Determinism | Test each case once; if it passes, ship. | Sample 20–50 runs per eval case; measure pass rate, not a single pass. |
| Isolation testing | Validate the prompt in the playground before integrating. | Evaluate end-to-end on representative inputs; the unit is the user task. |
| Error monitoring | Watch error rate; if it is flat, quality is fine. | Track output quality metrics from the start; silent regressions are normal. |
| Git-as-truth | Search the codebase when behaviour changes unexpectedly. | Pin model versions; log model version per request; treat upgrades as deploys. |
What this looks like when your team builds LLM features
None of the four replacements requires sophisticated tooling to start. An evaluation loop that samples each representative case 30 times, assembles the full input as production would, and reports pass rate as a fraction can be built in an afternoon. Running it in CI gives you regression detection without requiring a dedicated eval platform.
def eval_case(case, runs=30):
    results = [run_feature(case.input) for _ in range(runs)]
    pass_rate = sum(1 for r in results if check_quality(r, case.expected)) / runs
    return pass_rate  # e.g. 0.93 means 93% of runs met the quality bar

# Gate on the distribution, not on a single run
rate = eval_case(my_case)
assert rate >= 0.90, f"Pass rate {rate:.0%} below threshold"

The harder part is noticing when the old habit is active: reaching for an exact-value assertion, running the feature once and concluding it works, checking error rate instead of sampling outputs. These habits are fast and automatic because they have been reliable for years. Naming them is what makes it possible to set them aside deliberately — which is the specific work that building good LLM features requires.
The engineers who adapt fastest are not the ones who are less experienced. They are the ones who can recognise when an instinct is running and ask whether it applies to the system in front of them. That is true for any new class of system, and it is particularly true here.