LLM evaluation done wrong: why one eval setup can't answer three different questions
Most teams build one "judge the output" pipeline. It gives false confidence on all three things it's being asked to do.
Something degrades in your LLM feature. Not dramatically. Users aren't filing tickets, but quality has drifted downward over the past two weeks. You check your eval pipeline. Everything's green.
This is the most common failure mode in LLM evaluation in production: not that evals fail to run, but that they pass with false confidence while something real has gone wrong.
The root cause is almost always the same. A single "judge the output" setup is being asked to answer three distinct questions. Each requires different data, different tooling, and different pass/fail logic. One pipeline can't handle all three.
Three questions bundled into one setup
When a team ships their first LLM feature, they typically wire up one eval: feed each output to a judge model, ask it to rate quality on a 1–5 scale, flag anything below 3. Simple to describe, quick to build, easy to extend.
"Quality" in that sentence is doing enormous work. An LLM product team needs to answer three separate questions, each of which calls for a different setup:
- Which model or prompt should I use? This is model selection: comparing candidate A against candidate B on your specific task.
- Is this feature good enough to ship? This is quality gating: checking whether outputs clear a defined threshold before a PR merges.
- Did something I changed yesterday make things worse? This is regression detection: spotting quality degradation in a recent change before users do.
A single judge pipeline answers all three badly. It runs too slowly and expensively for CI, uses the wrong scoring logic for regression detection, and lacks the sample size and calibration for reliable model selection. Teams that build one setup for all three problems end up with a dashboard that reads green right up until it doesn't — and when it doesn't, the numbers don't tell them where to look.
| Problem | Goal | When it runs | Core requirement |
|---|---|---|---|
| Model selection | Which candidate is better for my use case? | Ad hoc, before a major change | 100–200 representative inputs + third-party judge + pairwise comparison |
| Quality gating | Does this change pass the quality bar? | Every PR, in CI | Deterministic checks first, then a small calibrated judge |
| Regression detection | Did I just make things worse? | Every PR, against a baseline | Fixed golden set ≥100 examples + score delta tracking |
Model selection: comparing what you have against what you could have
Model selection is the easiest of the three to get right and the most commonly over-engineered.
The goal is relative: which of two candidates produces better outputs on your actual traffic? Not "is candidate A good?" in the abstract — but "is A better than B, on these inputs, by a margin that justifies the cost or latency difference?"
What you need: a representative set of inputs (100–200 examples), a judge that scores both outputs on the same rubric, and a check that the margin is statistically meaningful before you declare a winner.
Three things go wrong repeatedly.
- Using too few examples. Twenty inputs is not enough to detect a 5% quality improvement with confidence. You need roughly 100 to detect differences of that magnitude at typical variance levels. With 20 examples, you are treating noise as signal.
- Using a judge from the same model family as the candidates you are evaluating. Self-preference bias is documented and consistent: GPT-4o rates GPT-family outputs higher than equivalent Claude outputs, and Claude models do the reverse. If you are comparing GPT-4o-mini to Claude Haiku, use a third model as the judge — or calibrate against human annotations before you run the experiment.
- Asking the judge to score rather than compare. Pairwise preference ("which output is better, A or B?") is reliably more accurate than absolute scoring ("rate output A on a 1–5 scale"). This holds across the LLM-as-judge literature and in practical production settings. When possible, compare outputs head-to-head rather than scoring each in isolation.
Once you have a result, the practical question is: what margin justifies switching? For a summarisation task where outputs are reviewed by humans anyway, a 3% improvement probably does not justify a model migration. For a customer-facing feature, it might. Set the threshold before you run the experiment, not after you see the numbers.
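To make the shape of this concrete, here is a minimal sketch of a pairwise comparison run with an exact sign test on the win counts, under the assumption that your judge returns "A", "B", or "tie" per input. `judge_preference` is a stand-in for whatever judge call you use; everything else is plain Python.

```python
import math
from typing import Callable, Sequence


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test: how likely a split at least this lopsided
    would be if neither candidate were actually better (ties excluded)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / (2 ** n)
    return min(1.0, 2 * tail)


def compare_candidates(
    inputs: Sequence[str],
    judge_preference: Callable[[str], str],  # returns "A", "B", or "tie" (stand-in for your judge call)
    min_margin: float = 0.05,                # decide the switching threshold before the run
) -> dict:
    """Pairwise comparison over a representative input set, with a significance check."""
    votes = [judge_preference(x) for x in inputs]
    wins_a, wins_b = votes.count("A"), votes.count("B")
    decided = wins_a + wins_b
    margin = (wins_a - wins_b) / decided if decided else 0.0
    p_value = sign_test_p_value(wins_a, wins_b) if decided else 1.0
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": votes.count("tie"),
        "margin": margin,
        "p_value": p_value,
        "prefer_a": margin >= min_margin and p_value < 0.05,
    }
```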
Quality gating in CI: start with deterministic checks
Quality gating is what most teams mean when they say "we have evals". The goal: block a PR from merging when a prompt change or model update degrades output quality below an acceptable threshold.
The first move should always be deterministic checks, not LLM judges.
Deterministic checks are fast, have no variance from judge behaviour, and catch entire failure categories cleanly. For most LLM features, you can verify the following without a judge:
- Does the output parse as valid JSON, or match the expected schema?
- Does it stay within length bounds?
- Does it avoid specific strings that should never appear — PII patterns, internal codes, forbidden phrases?
- Does it include required fields or sections?
- For classification tasks: does it return one of the valid labels?
In a typical production LLM pipeline, deterministic checks catch 40–60% of failures, depending on the task. This is the layer most teams skip because it is not interesting to build, yet it would catch a large share of the failures that reach production.
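As a sketch of what that layer can look like: the checks below assume one particular illustrative feature (a JSON summary with a `label` field and a 1,200-character budget); swap in your own schema, bounds, and forbidden patterns.

```python
import json
import re

# Assumed constraints for an illustrative feature; replace with your own.
MAX_CHARS = 1200
REQUIRED_FIELDS = {"summary", "label"}
VALID_LABELS = {"billing", "technical", "account", "other"}
FORBIDDEN_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like pattern
    re.compile(r"(?i)internal use only"),
]


def deterministic_checks(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output
    passes every structural check and can proceed to the judge layer."""
    failures = []
    if len(raw_output) > MAX_CHARS:
        failures.append(f"output exceeds {MAX_CHARS} characters")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(raw_output):
            failures.append(f"forbidden pattern matched: {pattern.pattern}")
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["output is not valid JSON"]
    if not isinstance(data, dict):
        return failures + ["top-level JSON value is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing required fields: {sorted(missing)}")
    if data.get("label") not in VALID_LABELS:
        failures.append(f"label {data.get('label')!r} not in allowed set")
    return failures
```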
For outputs that are structurally valid but substantively wrong, you do need a judge. Use a smaller, faster model here: GPT-4o-mini or Claude Haiku is a reasonable default for CI. Frontier models are expensive at CI scale, and the incremental accuracy is not worth it for gating decisions.
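A minimal sketch of that judge call, using the OpenAI Python SDK with gpt-4o-mini as one possible small model; the rubric text and the bare-integer response format are placeholders to adapt to your task.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric; write one per task and keep the expected reply format strict.
RUBRIC = """You are grading a support-ticket summary. Score it 1-5:
5 = accurate, complete, within scope; 1 = wrong or off-topic.
Reply with a single integer and nothing else."""


def judge_score(model_input: str, model_output: str) -> int:
    """Ask a small judge model for a 1-5 score on one output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Input:\n{model_input}\n\nOutput to grade:\n{model_output}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```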
Regression detection: the question most teams aren't asking
Regression detection is where most teams have the biggest blind spot.
The goal: given that the product was at quality level Q last week, has a recent change pushed it below Q? This is not the same as "are the outputs good?" It is a relative question about change over time.
What this requires is a golden set: a fixed collection of inputs whose expected output characteristics are well understood and stable. Not necessarily reference answers — but inputs whose expected behaviour should not change across releases. The golden set stays the same between runs. That stability is what makes comparison meaningful.
What most teams do instead: run their eval on current outputs and check whether the average score is above some threshold. This does not catch regressions. If your threshold is 3.5 and your average is 3.8 today, everything passes — but if it was 4.1 last week, you have a real regression that went undetected.
“Most eval dashboards tell you the quality of your outputs. Very few tell you whether quality has changed.”
Proper regression detection tracks score delta between the current run and a baseline, not just the absolute score. Braintrust, DeepEval, and PromptFoo all have comparison modes that do this, but the default configurations do not have it enabled. You have to set it up explicitly.
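A delta check can be as small as the sketch below: store per-example judge scores from the last passing commit, score the same golden set on the current commit, and fail when the mean drops by more than your tolerance. The file path and the 0.3 threshold are assumptions to tune.

```python
import json
from pathlib import Path
from statistics import mean

BASELINE_PATH = Path("eval/baseline_scores.json")  # per-example scores from the last passing commit
MAX_DROP = 0.3                                     # tolerated drop in mean score on a 1-5 scale


def check_regression(current_scores: dict[str, float]) -> None:
    """Compare golden-set scores against the stored baseline.
    Keys are golden-set example ids; values are judge scores."""
    baseline = json.loads(BASELINE_PATH.read_text())
    shared = sorted(baseline.keys() & current_scores.keys())
    if not shared:
        raise RuntimeError("no overlap between baseline and current golden set")
    delta = mean(current_scores[k] for k in shared) - mean(baseline[k] for k in shared)
    worst = min(shared, key=lambda k: current_scores[k] - baseline[k])
    print(f"mean delta: {delta:+.2f} over {len(shared)} examples; largest drop on {worst}")
    if delta < -MAX_DROP:
        raise SystemExit(f"regression: mean score dropped {abs(delta):.2f} below baseline")
```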
A golden set needs at least 100 examples to detect a 10% regression with reasonable statistical confidence, given judge variance. With 20 examples, only catastrophic failures show up as statistically clear. With 300 examples, you can catch subtler shifts early — worth the one-time investment of assembling the set.
Diversity matters too. If your product handles five types of input and your golden set has 90 examples from one type, you will miss regressions that only affect the others. Ensure the golden set reflects your actual traffic distribution, and revisit it once a quarter as that distribution shifts.
Where LLM judges actually earn their place
None of this means LLM judges are unreliable. The issue is specificity: they are reliable in a narrower window than most teams assume, and teams use them where they are weakest.
Judges work well for rubric-based, well-defined tasks: does the output follow a specified format, maintain a persona, avoid certain content types, include a required section? When the evaluation criteria can be stated precisely and do not require domain expertise to verify, LLM judges reach close to human-level agreement.
Pairwise comparison on tasks with clear criteria is also reliable. "Is output A more concise while preserving the key information from output B?" is a well-specified comparison task. Judges handle it accurately. Absolute scoring on open-ended tasks, much less so.
Where judges are not reliable:
- Factual correctness on domain-specific knowledge. A general-purpose judge model cannot reliably verify whether a legal citation is accurate, whether a medical dosage is correct, or whether a code snippet runs as expected. You need a reference answer, a domain-expert check, or a specialised evaluator.
- Long-output quality without decomposition. Asking "is this 3,000-word report good?" produces low-quality scores because the judge is aggregating too many dimensions simultaneously. Break it into components: accuracy, completeness, tone, structure. Each component gets its own rubric and its own judge call (a sketch follows this list).
- Regression detection without a baseline. A judge that scores outputs in isolation is measuring abstract quality, not change. A score of 3.8 means nothing without knowing what the score was last week.
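A sketch of that decomposition: the component rubrics below are placeholders, the `judge` callable stands in for a small-model call like the one shown earlier, and taking the minimum as the overall grade (so one weak dimension drags the result down) is one aggregation choice among several.

```python
from typing import Callable

# Placeholder rubrics for a long report; write one per dimension you care about.
COMPONENT_RUBRICS = {
    "accuracy": "Are all factual claims supported by the source material? Score 1-5.",
    "completeness": "Does the report cover every section the brief asked for? Score 1-5.",
    "tone": "Is the register appropriate for an executive audience? Score 1-5.",
    "structure": "Are sections ordered logically with clear headings? Score 1-5.",
}


def decomposed_score(
    report: str,
    judge: Callable[[str, str], int],  # (rubric, text) -> 1-5 score, e.g. a small-model judge call
) -> dict[str, int]:
    """One judge call per component instead of one call for the whole report."""
    scores = {name: judge(rubric, report) for name, rubric in COMPONENT_RUBRICS.items()}
    scores["overall"] = min(scores.values())  # the weakest dimension sets the overall grade
    return scores
```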
The verbosity bias is worth a specific mention: LLM judges consistently rate longer outputs higher, regardless of quality. This is documented across model families and holds in practice. If your product has any reason to produce concise outputs (cost, latency, readability), a vanilla judge will push you in the wrong direction without you noticing.
A minimal production eval setup for all three questions
Putting these together, a setup that handles all three questions cleanly looks like this:
Layer 1 — Deterministic CI checks. Runs on every PR. No LLM involved. Schema validation, length bounds, forbidden-string scan, label-set verification. Cheap and fast. Blocks merges on structural failures.
Layer 2 — Regression detection on a fixed golden set. Runs on every PR. Uses a small judge (Haiku or mini-class) against a golden set of at least 100 examples. Tracks score delta against the last passing commit, not absolute score. Blocks merges if the delta crosses a defined threshold.
Layer 3 — Model and prompt selection experiments. Runs ad hoc, outside CI. Pairwise comparisons using a third-party judge, 100–200 representative inputs, and a statistical significance check before declaring a winner.
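One way to wire layers 1 and 2 into CI is as ordinary pytest tests, as in the sketch below. `deterministic_checks`, `judge_score`, `run_model`, and the golden-set loaders are hypothetical helpers standing in for the pieces described above, not a real package layout.

```python
import pytest
from statistics import mean

# Hypothetical helpers standing in for the layers described above.
from evals.checks import deterministic_checks                    # layer 1
from evals.judge import judge_score                              # layer 2 judge (small model)
from evals.golden import load_golden_set, load_baseline, run_model

MAX_DROP = 0.3                     # tolerated drop in mean golden-set score

GOLDEN = load_golden_set()         # fixed inputs, at least 100 examples


@pytest.mark.parametrize("example", GOLDEN, ids=lambda e: e["id"])
def test_structural(example):
    """Layer 1: structural checks on every golden-set input, no LLM involved."""
    output = run_model(example["input"])
    assert deterministic_checks(output) == []


def test_no_regression():
    """Layer 2: score delta against the last passing commit, not an absolute bar."""
    baseline = load_baseline()     # per-example scores from the last passing commit
    current = {e["id"]: judge_score(e["input"], run_model(e["input"])) for e in GOLDEN}
    delta = mean(current.values()) - mean(baseline.values())
    assert delta > -MAX_DROP, f"mean golden-set score dropped {abs(delta):.2f}"
```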
Human annotation feeds into all three layers as calibration input — not in the hot path, but as a periodic check that keeps judges honest. Once a month, route 50 outputs from layer 2 to a human reviewer and compare ratings to the judge's. If agreement drops, recalibrate.
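A sketch of that monthly calibration check, using plain exact-and-within-one agreement rather than a formal kappa; the 0.8 cut-off is an assumption to tune against your own history.

```python
def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare judge ratings to human ratings on the same sampled outputs."""
    pairs = list(zip(judge_scores, human_scores, strict=True))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    within_one = sum(abs(j - h) <= 1 for j, h in pairs) / len(pairs)
    return {
        "exact": exact,
        "within_one": within_one,
        "recalibrate": within_one < 0.8,  # assumed cut-off; adjust to your task
    }
```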
The common mistake is building layer 3 first because it feels most powerful, then assuming it substitutes for layers 1 and 2. It does not. Frontier-model pairwise experiments are too slow and expensive for CI, and their scoring logic is not designed for the delta-tracking that makes regression detection work.
Most eval problems in production come down to this: teams are answering question 1 (which model is best?) when they think they are answering question 2 (is this safe to ship?) or question 3 (did I just break something?). Separating the three questions into the right setups is not glamorous infrastructure work, but it is the difference between a dashboard that tells you something real and one that stays green until it is too late.