LLM evaluation done wrong: why one eval setup can't answer three different questions
Most teams build one "judge the output" pipeline. It gives false confidence on all three things it's being asked to do.
Something degrades in your LLM feature. Not dramatically. Users aren't filing tickets, but quality has drifted downward over the past two weeks. You check your eval pipeline. Everything's green.
This is the most common failure mode in LLM evaluation in production: not that evals fail to run, but that they pass with false confidence while something real has gone wrong.
The root cause is almost always the same. A single "judge the output" setup is being asked to answer three distinct questions. Each requires different data, different tooling, and different pass/fail logic. One pipeline can't handle all three.
Three questions bundled into one setup
When a team ships their first LLM feature, they typically wire up one eval: feed each output to a judge model, ask it to rate quality on a 1–5 scale, flag anything below 3. Simple to describe, quick to build, easy to extend.
"Quality" in that sentence is doing enormous work. An LLM product team needs to answer three separate questions, each of which calls for a different setup:
- Which model or prompt should I use? This is model selection: comparing candidate A against candidate B on your specific task.
- Is this feature good enough to ship? This is quality gating: checking whether outputs clear a defined threshold before a PR merges.
- Did something I changed yesterday make things worse? This is regression detection: spotting quality degradation in a recent change before users do.
A single judge pipeline answers all three badly. It runs too slowly and expensively for CI, uses the wrong scoring logic for regression detection, and lacks the sample size and calibration for reliable model selection. Teams that build one setup for all three problems end up with a dashboard that reads green right up until it doesn't — and when it doesn't, the numbers don't tell them where to look.
| Problem | Goal | When it runs | Core requirement |
|---|---|---|---|
| Model selection | Which candidate is better for my use case? | Ad hoc, before a major change | 100–200 representative inputs + third-party judge + pairwise comparison |
| Quality gating | Does this change pass the quality bar? | Every PR, in CI | Deterministic checks first, then a small calibrated judge |
| Regression detection | Did I just make things worse? | Every PR, against a baseline | Fixed golden set ≥100 examples + score delta tracking |
Model selection: comparing what you have against what you could have
Model selection is the easiest of the three to get right and the most commonly over-engineered.
The goal is relative: which of two candidates produces better outputs on your actual traffic? Not "is candidate A good?" in the abstract — but "is A better than B, on these inputs, by a margin that justifies the cost or latency difference?"
What you need: a representative set of inputs (100–200 examples), a judge that scores both outputs on the same rubric, and a check that the margin is statistically meaningful before you declare a winner.
Three things go wrong repeatedly.
- Using too few examples. Twenty inputs is not enough to detect a 5% quality improvement with confidence. You need roughly 100 to detect differences of that magnitude at typical variance levels. With 20 examples, you are treating noise as signal.
- Using a judge from the same model family as the candidates you are evaluating. Self-preference bias is documented and consistent: GPT-4o rates GPT-family outputs higher than equivalent Claude outputs, and Claude models do the reverse. If you are comparing GPT-4o-mini to Claude Haiku, use a third model as the judge — or calibrate against human annotations before you run the experiment.
- Asking the judge to score rather than compare. Pairwise preference ("which output is better, A or B?") is reliably more accurate than absolute scoring ("rate output A on a 1–5 scale"). This holds across the LLM-as-judge literature and in practical production settings. When possible, compare outputs head-to-head rather than scoring each in isolation.
Once you have a result, the practical question is: what margin justifies switching? For a summarisation task where outputs are reviewed by humans anyway, a 3% improvement probably does not justify a model migration. For a customer-facing feature, it might. Set the threshold before you run the experiment, not after you see the numbers.
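To make the shape of this concrete, here is a minimal sketch of a pairwise comparison run with an exact sign test on the win counts, under the assumption that your judge returns "A", "B", or "tie" per input. `judge_preference` is a stand-in for whatever judge call you use; everything else is plain Python.

```python
import math
from typing import Callable, Sequence


def sign_test_p_value(wins_a: int, wins_b: int) -> float:
    """Two-sided exact sign test: how likely a split at least this lopsided
    would be if neither candidate were actually better (ties excluded)."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / (2 ** n)
    return min(1.0, 2 * tail)


def compare_candidates(
    inputs: Sequence[str],
    judge_preference: Callable[[str], str],  # returns "A", "B", or "tie" (stand-in for your judge call)
    min_margin: float = 0.05,                # decide the switching threshold before the run
) -> dict:
    """Pairwise comparison over a representative input set, with a significance check."""
    votes = [judge_preference(x) for x in inputs]
    wins_a, wins_b = votes.count("A"), votes.count("B")
    decided = wins_a + wins_b
    margin = (wins_a - wins_b) / decided if decided else 0.0
    p_value = sign_test_p_value(wins_a, wins_b) if decided else 1.0
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": votes.count("tie"),
        "margin": margin,
        "p_value": p_value,
        "prefer_a": margin >= min_margin and p_value < 0.05,
    }
```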
Quality gating in CI: start with deterministic checks
Quality gating is what most teams mean when they say "we have evals". The goal: block a PR from merging when a prompt change or model update degrades output quality below an acceptable threshold.
The first move should always be deterministic checks, not LLM judges.
Deterministic checks are fast, have no variance from judge behaviour, and catch entire failure categories cleanly. For most LLM features, you can verify the following without a judge:
- Does the output parse as valid JSON, or match the expected schema?
- Does it stay within length bounds?
- Does it avoid specific strings that should never appear — PII patterns, internal codes, forbidden phrases?
- Does it include required fields or sections?
- For classification tasks: does it return one of the valid labels?
In a typical production LLM pipeline, deterministic checks catch 40–60% of failures, depending on the task. This is the layer most teams skip because it is not interesting to build, yet it would catch a large share of the failures that reach production.
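As a sketch of what that layer can look like: the checks below assume one particular illustrative feature (a JSON summary with a `label` field and a 1,200-character budget); swap in your own schema, bounds, and forbidden patterns.

```python
import json
import re

# Assumed constraints for an illustrative feature; replace with your own.
MAX_CHARS = 1200
REQUIRED_FIELDS = {"summary", "label"}
VALID_LABELS = {"billing", "technical", "account", "other"}
FORBIDDEN_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like pattern
    re.compile(r"(?i)internal use only"),
]


def deterministic_checks(raw_output: str) -> list[str]:
    """Return a list of failure reasons; an empty list means the output
    passes every structural check and can proceed to the judge layer."""
    failures = []
    if len(raw_output) > MAX_CHARS:
        failures.append(f"output exceeds {MAX_CHARS} characters")
    for pattern in FORBIDDEN_PATTERNS:
        if pattern.search(raw_output):
            failures.append(f"forbidden pattern matched: {pattern.pattern}")
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return failures + ["output is not valid JSON"]
    if not isinstance(data, dict):
        return failures + ["top-level JSON value is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        failures.append(f"missing required fields: {sorted(missing)}")
    if data.get("label") not in VALID_LABELS:
        failures.append(f"label {data.get('label')!r} not in allowed set")
    return failures
```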
For outputs that are structurally valid but substantively wrong, you do need a judge. Use a smaller, faster model here: GPT-4o-mini or Claude Haiku is a reasonable default for CI. Frontier models are expensive at CI scale, and the incremental accuracy is not worth it for gating decisions.
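A minimal sketch of that judge call, using the OpenAI Python SDK with gpt-4o-mini as one possible small model; the rubric text and the bare-integer response format are placeholders to adapt to your task.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative rubric; write one per task and keep the expected reply format strict.
RUBRIC = """You are grading a support-ticket summary. Score it 1-5:
5 = accurate, complete, within scope; 1 = wrong or off-topic.
Reply with a single integer and nothing else."""


def judge_score(model_input: str, model_output: str) -> int:
    """Ask a small judge model for a 1-5 score on one output."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep the judge as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Input:\n{model_input}\n\nOutput to grade:\n{model_output}"},
        ],
    )
    return int(response.choices[0].message.content.strip())
```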
Regression detection: the question most teams aren't asking
Regression detection is where most teams have the biggest blind spot.
The goal: given that the product was at quality level Q last week, has a recent change pushed it below Q? This is not the same as "are the outputs good?" It is a relative question about change over time.
What this requires is a golden set: a fixed collection of inputs whose expected output characteristics are well understood and stable. Not necessarily reference answers — but inputs whose expected behaviour should not change across releases. The golden set stays the same between runs. That stability is what makes comparison meaningful.
What most teams do instead: run their eval on current outputs and check whether the average score is above some threshold. This does not catch regressions. If your threshold is 3.5 and your average is 3.8 today, everything passes — but if it was 4.1 last week, you have a real regression that went undetected.
“Most eval dashboards tell you the quality of your outputs. Very few tell you whether quality has changed.”
Proper regression detection tracks score delta between the current run and a baseline, not just the absolute score. Braintrust, DeepEval, and PromptFoo all have comparison modes that do this, but the default configurations do not have it enabled. You have to set it up explicitly.
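A delta check can be as small as the sketch below: store per-example judge scores from the last passing commit, score the same golden set on the current commit, and fail when the mean drops by more than your tolerance. The file path and the 0.3 threshold are assumptions to tune.

```python
import json
from pathlib import Path
from statistics import mean

BASELINE_PATH = Path("eval/baseline_scores.json")  # per-example scores from the last passing commit
MAX_DROP = 0.3                                     # tolerated drop in mean score on a 1-5 scale


def check_regression(current_scores: dict[str, float]) -> None:
    """Compare golden-set scores against the stored baseline.
    Keys are golden-set example ids; values are judge scores."""
    baseline = json.loads(BASELINE_PATH.read_text())
    shared = sorted(baseline.keys() & current_scores.keys())
    if not shared:
        raise RuntimeError("no overlap between baseline and current golden set")
    delta = mean(current_scores[k] for k in shared) - mean(baseline[k] for k in shared)
    worst = min(shared, key=lambda k: current_scores[k] - baseline[k])
    print(f"mean delta: {delta:+.2f} over {len(shared)} examples; largest drop on {worst}")
    if delta < -MAX_DROP:
        raise SystemExit(f"regression: mean score dropped {abs(delta):.2f} below baseline")
```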
A golden set needs at least 100 examples to detect a 10% regression with reasonable statistical confidence, given judge variance. With 20 examples, only catastrophic failures show up as statistically clear. With 300 examples, you can catch subtler shifts early — worth the one-time investment of assembling the set.
Diversity matters too. If your product handles five types of input and your golden set has 90 examples from one type, you will miss regressions that only affect the others. Ensure the golden set reflects your actual traffic distribution, and revisit it once a quarter as that distribution shifts.
Where LLM judges actually earn their place
None of this means LLM judges are unreliable. The issue is specificity: they are reliable in a narrower window than most teams assume, and teams use them where they are weakest.
Judges work well for rubric-based, well-defined tasks: does the output follow a specified format, maintain a persona, avoid certain content types, include a required section? When the evaluation criteria can be stated precisely and do not require domain expertise to verify, LLM judges reach close to human-level agreement.
Pairwise comparison on tasks with clear criteria is also reliable. "Is output A more concise while preserving the key information from output B?" is a well-specified comparison task. Judges handle it accurately. Absolute scoring on open-ended tasks, much less so.
Where judges are not reliable:
- Factual correctness on domain-specific knowledge. A general-purpose judge model cannot reliably verify whether a legal citation is accurate, whether a medical dosage is correct, or whether a code snippet runs as expected. You need a reference answer, a domain-expert check, or a specialised evaluator.
- Long-output quality without decomposition. Asking "is this 3,000-word report good?" produces low-quality scores because the judge is aggregating too many dimensions simultaneously. Break it into components: accuracy, completeness, tone, structure. Each component gets its own rubric and its own judge call (a sketch follows this list).
- Regression detection without a baseline. A judge that scores outputs in isolation is measuring abstract quality, not change. A score of 3.8 means nothing without knowing what the score was last week.
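A sketch of that decomposition: the component rubrics below are placeholders, the `judge` callable stands in for a small-model call like the one shown earlier, and taking the minimum as the overall grade (so one weak dimension drags the result down) is one aggregation choice among several.

```python
from typing import Callable

# Placeholder rubrics for a long report; write one per dimension you care about.
COMPONENT_RUBRICS = {
    "accuracy": "Are all factual claims supported by the source material? Score 1-5.",
    "completeness": "Does the report cover every section the brief asked for? Score 1-5.",
    "tone": "Is the register appropriate for an executive audience? Score 1-5.",
    "structure": "Are sections ordered logically with clear headings? Score 1-5.",
}


def decomposed_score(
    report: str,
    judge: Callable[[str, str], int],  # (rubric, text) -> 1-5 score, e.g. a small-model judge call
) -> dict[str, int]:
    """One judge call per component instead of one call for the whole report."""
    scores = {name: judge(rubric, report) for name, rubric in COMPONENT_RUBRICS.items()}
    scores["overall"] = min(scores.values())  # the weakest dimension sets the overall grade
    return scores
```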
The verbosity bias is worth a specific mention: LLM judges consistently rate longer outputs higher, regardless of quality. This is documented across model families and holds in practice. If your product has any reason to produce concise outputs (cost, latency, readability), a vanilla judge will push you in the wrong direction without you noticing.
A minimal production eval setup for all three questions
Putting these together, a setup that handles all three questions cleanly looks like this:
Layer 1 — Deterministic CI checks. Runs on every PR. No LLM involved. Schema validation, length bounds, forbidden-string scan, label-set verification. Cheap and fast. Blocks merges on structural failures.
Layer 2 — Regression detection on a fixed golden set. Runs on every PR. Uses a small judge (Haiku or mini-class) against a golden set of at least 100 examples. Tracks score delta against the last passing commit, not absolute score. Blocks merges if the delta crosses a defined threshold.
Layer 3 — Model and prompt selection experiments. Runs ad hoc, outside CI. Pairwise comparisons using a third-party judge, 100–200 representative inputs, and a statistical significance check before declaring a winner.
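One way to wire layers 1 and 2 into CI is as ordinary pytest tests, as in the sketch below. `deterministic_checks`, `judge_score`, `run_model`, and the golden-set loaders are hypothetical helpers standing in for the pieces described above, not a real package layout.

```python
import pytest
from statistics import mean

# Hypothetical helpers standing in for the layers described above.
from evals.checks import deterministic_checks                    # layer 1
from evals.judge import judge_score                              # layer 2 judge (small model)
from evals.golden import load_golden_set, load_baseline, run_model

MAX_DROP = 0.3                     # tolerated drop in mean golden-set score

GOLDEN = load_golden_set()         # fixed inputs, at least 100 examples


@pytest.mark.parametrize("example", GOLDEN, ids=lambda e: e["id"])
def test_structural(example):
    """Layer 1: structural checks on every golden-set input, no LLM involved."""
    output = run_model(example["input"])
    assert deterministic_checks(output) == []


def test_no_regression():
    """Layer 2: score delta against the last passing commit, not an absolute bar."""
    baseline = load_baseline()     # per-example scores from the last passing commit
    current = {e["id"]: judge_score(e["input"], run_model(e["input"])) for e in GOLDEN}
    delta = mean(current.values()) - mean(baseline.values())
    assert delta > -MAX_DROP, f"mean golden-set score dropped {abs(delta):.2f}"
```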
Human annotation feeds into all three layers as calibration input — not in the hot path, but as a periodic check that keeps judges honest. Once a month, route 50 outputs from layer 2 to a human reviewer and compare ratings to the judge's. If agreement drops, recalibrate.
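A sketch of that monthly calibration check, using plain exact-and-within-one agreement rather than a formal kappa; the 0.8 cut-off is an assumption to tune against your own history.

```python
def judge_human_agreement(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare judge ratings to human ratings on the same sampled outputs."""
    pairs = list(zip(judge_scores, human_scores, strict=True))
    exact = sum(j == h for j, h in pairs) / len(pairs)
    within_one = sum(abs(j - h) <= 1 for j, h in pairs) / len(pairs)
    return {
        "exact": exact,
        "within_one": within_one,
        "recalibrate": within_one < 0.8,  # assumed cut-off; adjust to your task
    }
```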
The common mistake is building layer 3 first because it feels most powerful, then assuming it substitutes for layers 1 and 2. It does not. Frontier-model pairwise experiments are too slow and expensive for CI, and their scoring logic is not designed for the delta-tracking that makes regression detection work.
Most eval problems in production come down to this: teams are answering question 1 (which model is best?) when they think they are answering question 2 (is this safe to ship?) or question 3 (did I just break something?). Separating the three questions into the right setups is not glamorous infrastructure work, but it is the difference between a dashboard that tells you something real and one that stays green until it is too late.