Your LLM eval suite is a confidence machine, not a quality gate
There is a moment most teams shipping LLM-powered products recognise. You have set up an evaluation pipeline. It runs on every pull request. Coverage numbers tick upward week by week. Engineers feel good about merges. Then a user reports that the assistant is hallucinating product names, or confidently giving wrong answers on the exact use case you thought was covered. You go back to the evals. They still pass.
The problem is not that evals are hard to write. The problem is that most teams are building the wrong kind of eval entirely, and the tooling ecosystem has quietly encouraged this for the past two years.
The two kinds of eval, and why teams build only one
When you sit down to write eval cases for an LLM feature, the natural instinct is to write examples of what good looks like. Here is a customer query. Here is the ideal response. Score: 1 if the model output matches, 0 if not.
These are coverage evals. They answer the question: can our system handle the scenarios we thought about? They are useful. They are not sufficient.
The kind of eval most production teams are missing is what you might call a discrimination eval. It answers a different question: can this eval detect when the system has gotten worse?
This is not a subtle distinction. A coverage eval that contains only scenarios the model handles well will keep passing after a regression, because the model still handles those scenarios. Quality has degraded in the other scenarios, the ones you did not think to write. You have built a test suite that confirms your assumptions rather than challenging them.
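One way to make the distinction concrete is to run the same suite against the current system and against a deliberately degraded copy, then see whether the pass rate moves. The sketch below is illustrative only: `baseline`, `degraded`, the substring check, and the two toy cases are all assumptions standing in for a real model call and a real scorer.

```python
from typing import Callable, Dict, List

# Hypothetical type: a "system" maps a query string to an answer string.
System = Callable[[str], str]
Case = Dict[str, str]


def pass_rate(system: System, cases: List[Case]) -> float:
    """Fraction of cases whose expected substring appears in the output."""
    passed = sum(
        1 for c in cases if c["expected"].lower() in system(c["query"]).lower()
    )
    return passed / len(cases)


def discrimination_gap(good: System, bad: System, cases: List[Case]) -> float:
    """How far the pass rate drops when the system is made worse on purpose."""
    return pass_rate(good, cases) - pass_rate(bad, cases)


def baseline(query: str) -> str:
    # Toy stand-in for the production system.
    answers = {
        "What is your refund window?": "Refunds are accepted within 30 days.",
        "What is 3 days after 2024-02-28?": "2024-03-02",
    }
    return answers.get(query, "I don't know.")


def degraded(query: str) -> str:
    # Simulated regression: date arithmetic forgets the 2024 leap day.
    if "days after" in query:
        return "2024-03-03"
    return baseline(query)


coverage_only = [{"query": "What is your refund window?", "expected": "30 days"}]
with_failure_mode = coverage_only + [
    {"query": "What is 3 days after 2024-02-28?", "expected": "2024-03-02"}
]

gap_coverage = discrimination_gap(baseline, degraded, coverage_only)
gap_discriminating = discrimination_gap(baseline, degraded, with_failure_mode)
print(gap_coverage, gap_discriminating)
```

A gap of zero means the suite cannot tell the two systems apart. In this toy setup the coverage-only suite scores the broken system as highly as the working one, while adding a single failure-mode case opens a gap.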
Goodhart’s Law arrives in your CI pipeline
Charles Goodhart, a British economist, observed in 1975 that a statistical regularity tends to collapse once pressure is placed on it for control purposes. The popular paraphrase is that when a measure becomes a target, it ceases to be a good measure. His observation was about monetary policy. It describes your LLM eval suite almost exactly.
Once you start optimising (tweaking prompts, swapping model versions, adjusting retrieval), the benchmark you have written is under pressure. Not because anyone is being dishonest, but because every iteration is tuned and checked against the same fixed set of cases. You tune the system to pass those cases. You improve your score. You ship.
What you have not done is ask whether the score still means anything.
In traditional software testing, a regression test that passes after a refactor is strong evidence the refactor is safe, because the test exercises a specific code path with a specific input. In LLM evaluation, a test that passes after a prompt change is evidence only that the model still handles the cases in the suite, which are usually the cases the prompt change was designed to handle well. The cases you did not think about (the edge inputs, the adversarial phrasings, the legitimate queries that happen to touch your retrieval system’s blind spots) remain invisible.
The golden-dataset trap
The most common eval setup you will find in the wild goes like this: someone on the team spends a day or two writing 30 to 100 “golden” examples by hand. The examples are representative of typical usage. A judge model — usually a stronger model than the one being evaluated — scores outputs against a rubric. The pass rate is tracked over time.
This setup is not wrong. It is incomplete in a way that compounds over time.
Golden datasets go stale. User behaviour in production shifts. Jargon evolves. The distribution of queries you are actually getting in month six is not the distribution you sampled when you wrote the golden set in month one. But the eval still runs, still passes, and still generates the same confidence.
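Staleness can be given a rough number if both the golden cases and a sample of production queries carry a category label. The sketch below assumes such labels exist (from manual tagging or a classifier; neither is described in this article) and compares the two category mixes using total variation distance. The category names and counts are invented for illustration.

```python
from collections import Counter
from typing import Dict, List, Set


def category_shares(labels: List[str]) -> Dict[str, float]:
    """Turn a list of category labels into per-category proportions."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the sum of absolute share differences: 0 is identical, 1 is disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def unseen_categories(golden: List[str], production: List[str]) -> Set[str]:
    """Categories that appear in production but have no golden case at all."""
    return set(production) - set(golden)


# Invented data: the golden set written in month one vs. traffic in month six.
golden_labels = ["billing"] * 6 + ["shipping"] * 4
production_labels = ["billing"] * 3 + ["shipping"] * 2 + ["dates"] * 5

drift = total_variation(
    category_shares(golden_labels), category_shares(production_labels)
)
missing = unseen_categories(golden_labels, production_labels)
print(round(drift, 3), missing)
```

The threshold at which drift should trigger a refresh of the golden set is a judgment call. The list of unseen categories is usually the more actionable output.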
Worse, golden datasets are drawn from the scenarios the person writing them could imagine. The scenarios most likely to cause production failures are precisely the ones that are hard to imagine in advance: the edge case no one predicted, the input phrasing that triggers a retrieval failure, the multi-step query that looks fine but causes the model to lose track of context by step three.
Haiku Hayashi, an infrastructure engineer at a mid-sized AI startup, described the pattern in a recent post: “We had 97% on our eval suite the week before an incident where the assistant was giving incorrect date calculations for about 12% of users. We went back and looked — not a single eval case tested date arithmetic under timezone ambiguity. The eval was passing because we had built it to test what we were confident about.”
What a discrimination eval actually looks like
A discrimination eval is built around failure modes, not success cases. The goal is not to demonstrate that the system works; it is to be the first thing that breaks when the system degrades.
This requires a different source of data. Instead of writing examples from imagination, you mine them from production. Every user complaint is an eval case. Every output your team flagged during a review is an eval case. Every query for which retrieval returned nothing is an eval case. Every input that produced a confident but wrong answer, especially the ones that slipped through unnoticed, is the most valuable eval case you have.
A practical process looks like this:
1. Instrument your production traces for failure signals. Not just explicit user feedback (thumbs down, support tickets) but implicit signals: low-confidence retrievals, outputs that were regenerated, sessions that ended abruptly, queries that came back rephrased within two turns.
2. Sample from those failure signals weekly. Review a batch of ten to twenty cases. Label the ones where the output was genuinely bad. Add them to your eval suite with human-annotated ground truth, not LLM-generated labels.
3. Keep a separate “regression set” that is append-only. Every incident that makes it to a post-mortem should produce at least one eval case. This set should never shrink. It is the memory of the system’s failure history.
4. Audit your eval coverage quarterly. For each category of failure you have observed in production, count how many eval cases cover that failure mode. If a category is unrepresented, the suite will not catch the next regression in that area.
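Steps 3 and 4 above can be sketched with nothing more than a JSON Lines file and a counter. The file name, the `category` field, the example case IDs, and the `min_cases` threshold are all hypothetical choices for illustration, not a prescribed schema.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List


def append_case(path: Path, case: Dict[str, str]) -> None:
    """Append one case as a JSON line. Opening in 'a' mode never rewrites old cases."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(case) + "\n")


def load_cases(path: Path) -> List[Dict[str, str]]:
    """Read every non-empty line of the regression set back into dicts."""
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def audit_coverage(
    cases: List[Dict[str, str]],
    observed_categories: Iterable[str],
    min_cases: int = 2,
) -> Dict[str, int]:
    """Return observed failure categories covered by fewer than min_cases cases."""
    counts = Counter(c["category"] for c in cases)
    return {
        cat: counts.get(cat, 0)
        for cat in observed_categories
        if counts.get(cat, 0) < min_cases
    }


# Demo in a temporary directory; a real set would live in version control.
with tempfile.TemporaryDirectory() as tmp:
    regression_set = Path(tmp) / "regression_set.jsonl"
    append_case(regression_set, {"id": "inc-001-a", "category": "date_arithmetic"})
    append_case(regression_set, {"id": "inc-001-b", "category": "date_arithmetic"})
    append_case(regression_set, {"id": "inc-002-a", "category": "product_names"})
    cases = load_cases(regression_set)

# Failure categories seen in production, e.g. taken from post-mortems.
observed = ["date_arithmetic", "product_names", "retrieval_miss"]
gaps = audit_coverage(cases, observed)
print(gaps)
```

Here the audit flags the two categories that have fewer than two cases. Enforcing the append-only rule is left to process, for example a review check that rejects any diff deleting lines from the file.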
The goal is not a large eval suite. It is a discriminating one. Fifty production-derived cases that cover real failure modes will catch more regressions than five hundred synthetic golden cases that cover the happy path.
The judge model problem
Many teams now use an LLM as the judge for their evaluation pipeline. This is a pragmatic choice — human annotation is slow and expensive, and a capable model can assess output quality at scale.
The problem is that the judge model brings its own biases and failure modes. It tends to prefer verbose outputs over concise ones. It tends to score answers that sound confident higher than answers that hedge appropriately, even when the hedge is correct. It performs poorly on domain-specific factual claims because it cannot easily verify them. And, crucially, it tends to rate its own outputs, and those of models of similar capability, more generously than outputs from smaller models. That means your eval scores can shift when you change the judge, even if the system under evaluation has not changed at all.
None of this makes LLM judges useless. It means they should be calibrated. Before you trust a judge model on your eval suite, run it against a sample of cases where you have human-annotated ground truth and measure the agreement rate. If the judge is systematically wrong in a particular direction, correct for it. If it cannot reach acceptable agreement on a category of output, do not use it to evaluate that category — that category needs human review or a specialised metric.
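A minimal calibration report might look like the sketch below. It assumes each sample already carries a human pass/fail label and the judge's pass/fail verdict. The `Labeled` record, the category names, and the 0.8 agreement threshold are illustrative assumptions, not recommended values.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Labeled(NamedTuple):
    category: str
    human_pass: bool  # ground truth from a human annotator
    judge_pass: bool  # verdict from the LLM judge


def calibrate(samples: List[Labeled]) -> Dict[str, Dict[str, float]]:
    """Per-category agreement rate plus the direction of each disagreement."""
    by_cat: Dict[str, List[Labeled]] = defaultdict(list)
    for s in samples:
        by_cat[s.category].append(s)

    report: Dict[str, Dict[str, float]] = {}
    for cat, rows in by_cat.items():
        n = len(rows)
        report[cat] = {
            "agreement": sum(r.human_pass == r.judge_pass for r in rows) / n,
            # Judge passed something the human failed.
            "judge_too_lenient": sum(
                r.judge_pass and not r.human_pass for r in rows
            ) / n,
            # Judge failed something the human passed.
            "judge_too_strict": sum(
                r.human_pass and not r.judge_pass for r in rows
            ) / n,
        }
    return report


def untrusted_categories(
    report: Dict[str, Dict[str, float]], min_agreement: float = 0.8
) -> List[str]:
    """Categories where the judge should not be used without human review."""
    return sorted(c for c, r in report.items() if r["agreement"] < min_agreement)


# Invented sample: the judge agrees on tone but is lenient on factual claims.
samples = (
    [Labeled("tone", True, True)] * 5
    + [Labeled("factual", True, True)] * 3
    + [Labeled("factual", False, True)] * 2
)
report = calibrate(samples)
needs_human = untrusted_categories(report)
print(report["factual"], needs_human)
```

Raw agreement is the simplest measure. With imbalanced labels, a chance-corrected statistic such as Cohen's kappa gives a more honest picture, at the cost of a little more arithmetic.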
Teams that skip this calibration step are, in effect, using one model’s taste as a proxy for quality and then measuring their system against that taste. It is a reasonable approximation. It is not a quality gate.
The CI eval is a regression detector, not a quality certifier
Here is the framing shift that changes how teams think about this: your CI eval suite should be designed to detect regressions, not to certify quality.
Quality certification is a different activity. It requires human review, domain expertise, statistical sampling from production, and periodic adversarial testing. It is slow and expensive and cannot happen on every PR.
What can happen on every PR is regression detection: did this change break something we know matters? The bar for a CI eval case is not “this represents high quality output.” The bar is “if this case starts failing, I want to know immediately.”
That distinction changes what you write. You write cases from previous incidents. You write cases from user complaints. You write adversarial phrasings. You write inputs that look fine but are known to cause model confusion. You make the eval hard to pass by covering the cases you would rather not think about.
And you keep it fast. The eval that runs on every PR should take five minutes at most. Not because thoroughness is bad but because a slow eval gets skipped. A 30-case regression suite that runs in 90 seconds and covers your six known failure modes is more valuable than a 500-case suite that runs in 40 minutes and gets disabled because it is blocking deploys.
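A PR gate along these lines can be very small. The sketch below treats any failing regression case, or an exhausted time budget, as a reason to fail the build. The substring check is a placeholder for whatever scorer you trust, and `fake_system`, the case IDs, and the 300-second default are assumptions for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple

Case = Dict[str, str]


def run_gate(
    cases: List[Case],
    system: Callable[[str], str],
    budget_seconds: float = 300.0,
) -> Tuple[bool, List[str]]:
    """Run regression cases in order; return (ok, list of failure descriptions)."""
    start = time.monotonic()
    failures: List[str] = []
    for case in cases:
        # Stop early instead of letting a slow suite block deploys.
        if time.monotonic() - start > budget_seconds:
            failures.append("budget exceeded before case " + case["id"])
            break
        output = system(case["query"])
        if case["expected"].lower() not in output.lower():
            failures.append(case["id"])
    return (not failures, failures)


def fake_system(query: str) -> str:
    # Toy stand-in for the model call; it answers every query the same way.
    return "Refunds are accepted within 30 days."


cases = [
    {"id": "refund-window", "query": "What is your refund window?",
     "expected": "30 days"},
    {"id": "inc-001-dates", "query": "What is 3 days after 2024-02-28?",
     "expected": "2024-03-02"},
]

ok, failures = run_gate(cases, fake_system)
print(ok, failures)
```

In CI, the script would end by exiting non-zero when `ok` is false, so the failing case IDs show up in the build log next to the change that broke them.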
Starting from where you are
Most teams reading this have a golden-dataset eval and limited production-derived cases. The shift does not require throwing anything away.
Start by identifying the last three incidents or user complaints related to output quality. For each one, write two to three eval cases that would have caught it. Add them to your suite. Mark them as regression cases.
Do that every month. Within six months, you will have a core set of 20 to 30 high-value cases that are drawn from actual failure history. That core set will catch regressions your golden dataset cannot see.
Then start auditing your judge. Pick 20 cases from your suite, have a human label them, and measure agreement. You will probably find categories where the judge is unreliable. Those are the categories where you should add specialised metrics or human spot-checks.
The tooling for this has improved significantly. Frameworks like DeepEval, LangSmith, and Braintrust all support case tagging, regression tracking, and human annotation workflows. The infrastructure is not the bottleneck. The bottleneck is the habit of only writing cases you expect to pass.
The teams shipping reliable LLM products at scale share one characteristic that is not about model choice or infrastructure: they treat their eval suite as a record of everything that has gone wrong. Every incident leaves a trace in the test data. Every regression gets a case added so it cannot happen silently again.
That is not a novel idea. It is exactly how good engineering teams have treated integration tests for decades. The difference is that most teams instinctively apply that discipline to their deterministic code and not to the probabilistic layer they added last year. Closing that gap is one of the most reliable things you can do to raise the floor on LLM output quality in production.