Why most LLM-as-a-judge eval setups are broken
Three systematic biases that corrupt your quality scores, and what a trustworthy eval pipeline actually looks like
LLM-as-a-judge became popular for a good reason: human eval is slow, expensive, and does not scale. When you are comparing 50 model outputs or iterating through prompt variants, waiting for annotators takes days you do not have. So engineers reach for LLM-as-a-judge: a capable frontier model with a scoring rubric that evaluates outputs automatically. It feels rigorous. The scores update on every deploy. You can run it in CI.
The problem is not the concept. It is that most setups introduce systematic biases that do not become visible until your eval scores stop correlating with what users are actually complaining about. By that point, you have spent weeks optimising against a broken instrument.
There are three biases. They compound. And fixing any one of them requires structural changes to your eval design — not a better prompt.
Why it works, until it does not
Early LLM judge setups tend to show genuine correlation with human judgment. That correlation is partly real and partly a statistical artefact of easy cases.
When one output is clearly better than the other, a competent judge model and a competent human rater both agree. The model has absorbed enough human feedback to simulate human preferences well in unambiguous cases. This early success builds trust in the setup.
The problem is that the interesting cases are the marginal ones, specifically those that drive product decisions. Is this prompt variant 6% better or 3% worse? Did the fine-tune help for our specific domain? In marginal comparisons, the biases dominate. The easy cases mask the dysfunction long enough to be dangerous.
Bias 1: the judge cares about position, not quality
Position bias is the most documented failure mode in LLM evaluation. When you ask a model to compare output A and output B, it reliably favours whichever appears first in the prompt. The margin varies between models but is almost never zero.
The scale of this was quantified in MT-Bench research: across multiple strong judge models, reversing the position of two outputs changed the winner roughly 35–40% of the time, even when the underlying content was identical. One in three of your A/B comparisons may be telling you more about your prompt template than about your outputs.
The obvious fix is to run each comparison in both orderings and take a consensus: count a winner only when the judge agrees in both directions, which cancels the directional bias. But you have now doubled your eval cost, and the inconsistency rate itself becomes signal: a judge that flips its verdict on 40% of position swaps is a noisy instrument. Averaging the two runs hides that information rather than resolving it.
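A minimal sketch of the both-orderings check, assuming a hypothetical `judge_pick(first, second)` wrapper around your judge model that returns "A" if it prefers the output shown first and "B" otherwise:

```python
# Both-orderings pairwise check. `judge_pick(first, second)` is a
# hypothetical wrapper around the judge model, returning "A" if it
# prefers the output shown first and "B" if it prefers the second.

def judged_winner(output_a, output_b, judge_pick):
    """Return "A" or "B" only if the verdict survives a position swap,
    otherwise None: the comparison was position-dependent."""
    first = judge_pick(output_a, output_b)    # output_a shown first
    second = judge_pick(output_b, output_a)   # output_a shown second
    # Translate the swapped verdict back into output_a/output_b terms.
    second_in_original_terms = "A" if second == "B" else "B"
    return first if first == second_in_original_terms else None

def flip_rate(pairs, judge_pick):
    """Fraction of comparisons whose verdict did not survive the swap.
    Worth logging on every eval run as a judge health signal."""
    verdicts = [judged_winner(a, b, judge_pick) for a, b in pairs]
    return sum(v is None for v in verdicts) / len(verdicts)
```

Treating flips as "no verdict" rather than averaging them keeps the inconsistency visible instead of burying it in the score.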
What this means in practice: any single-pass pairwise eval is running a noisy experiment on top of a biased instrument. Your evaluation budget should account for this. Use pairwise judges for rough directional screening, not for decisions where margin matters.
Bias 2: the judge rewards length, not accuracy
Verbosity bias is arguably more damaging than position bias, because it actively incentivises the wrong output. LLM judges consistently score longer outputs higher, independent of quality. This has been reproduced across model families and across rubric types; it is not a quirk of any single provider.
The likely mechanism is training data. Human raters scoring outputs quickly, without deep engagement, tend to interpret thoroughness as quality. More caveats feel more careful. Longer answers feel more effortful. Models have absorbed this correlation from RLHF data. They now reward what hurried humans reward, not what careful humans reward.
Decomposing scoring criteria helps: "Rate accuracy 1–5. Rate conciseness 1–5. Do not combine." But even rubric-decomposed judges show residual verbosity skew. The only reliable calibration is human baselines: a sample of outputs hand-scored by people who actually read them carefully, against which you can measure and correct your judge's length distortion.
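One way to quantify that distortion, sketched under an assumed data layout: on the hand-scored sample, compare how strongly judge scores and human scores each track output length. The gap between the two correlations is the judge's residual length skew.

```python
# Length-skew audit over a hand-scored sample of
# (output_text, human_score, judge_score) triples. The data layout is
# an assumption; spearmanr is from SciPy.
from scipy.stats import spearmanr

def verbosity_skew(samples):
    lengths = [len(text) for text, _, _ in samples]
    human = [h for _, h, _ in samples]
    judge = [j for _, _, j in samples]
    judge_vs_len, _ = spearmanr(lengths, judge)  # how much the judge tracks length
    human_vs_len, _ = spearmanr(lengths, human)  # how much careful humans do
    return judge_vs_len - human_vs_len           # positive: judge over-rewards length
```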
The downstream risk is that if you optimise your model or prompt against a verbosity-biased judge, you will end up with a chattier product. Your eval scores go up; user satisfaction stays flat or drops. This is the Goodhart dynamic in its most expensive form.
Bias 3: the judge prefers outputs that look like its own
Self-preference is the most uncomfortable finding, because it means your choice of judge model is not a neutral decision. A substantial body of research published in 2024 and 2025 shows that LLMs rate their own outputs, and outputs stylistically similar to their own, higher than outputs from other model families.
A 2026 ICLR-accepted study formalised this as "preference leakage": when the evaluator and the data generator share training lineage or stylistic fingerprints, the evaluator's scores are contaminated in favour of the generator. This applies even when the judge is supposedly evaluating on objective criteria.
In practical terms: use GPT-4 as your judge to compare GPT-4 and Claude outputs, and you get GPT-4-favourable results. Use Claude as the judge, and the verdict flips. Neither reflects what a user would actually prefer.
This matters most in three scenarios. Model migrations, where you are evaluating whether switching providers improves quality. Fine-tuning experiments, where a model trained on data sharing lineage with your judge gets inflated scores. And any competitive evaluation where the judge's provenance is not disclosed, which covers most published benchmarks.
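A cheap diagnostic for this, sketched with hypothetical judge wrappers: score the same pairs with judges from two different model families and track how often they disagree. Heavy disagreement concentrated on pairs involving a judge's own family is the self-preference signature.

```python
# Cross-family disagreement check. `judge_x` and `judge_y` are
# hypothetical callables from different model families, each returning
# "A" or "B" for a pair of outputs.

def cross_judge_disagreement(pairs, judge_x, judge_y):
    disagreements = sum(
        judge_x(a, b) != judge_y(a, b) for a, b in pairs
    )
    return disagreements / len(pairs)
```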
The calibration problem that compounds everything else
Even after addressing the three biases above (both-orderings for position, decomposed rubrics for verbosity, third-party judge for self-preference), you still have calibration drift.
A judge model is only trustworthy relative to a ground truth you can inspect. The moment you swap to a newer version of the same judge model, the calibration resets. Update your rubric phrasing, and it resets. Change the output format, and it resets. Each of these changes produces a different scoring distribution, and without a fixed anchor you cannot tell whether the scores have shifted because quality changed or because the judge changed.
The structural fix is a held-out anchor set: 200–500 outputs with human labels, never used for prompt optimisation, only for calibration audits. Every time you change anything about the eval pipeline, run the judge against this set and measure rank correlation with the human scores. If the Spearman correlation drops below your threshold, the judge has drifted — regardless of what the production scores say.
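A minimal version of that audit, assuming a `score_with_judge` hook for your current judge configuration and a threshold set from your own calibration history:

```python
# Calibration audit against the held-out anchor set, run on every
# pipeline change. `score_with_judge` is a hypothetical function that
# applies the current judge configuration to one output.
from scipy.stats import spearmanr

SPEARMAN_THRESHOLD = 0.8  # assumed value; derive yours empirically

def judge_still_calibrated(anchor_set, score_with_judge):
    """anchor_set: list of (output_text, human_score) pairs, never used
    for prompt optimisation."""
    judge_scores = [score_with_judge(text) for text, _ in anchor_set]
    human_scores = [score for _, score in anchor_set]
    rho, _ = spearmanr(judge_scores, human_scores)
    return rho >= SPEARMAN_THRESHOLD
```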
Maintaining the anchor set is the work most teams skip. It requires discipline about what goes into it (representative, not cherry-picked), how often it gets refreshed (a stale anchor set gives false confidence), and who owns the relabelling (usually the same person who owns the product quality bar). It is also the single most valuable piece of infrastructure a team can build if their product quality depends on an eval pipeline.
What a trustworthy LLM-as-a-judge setup looks like
| Approach | Position bias | Verbosity bias | Self-preference | Calibration drift | Cost |
|---|---|---|---|---|---|
| Human eval | None | None | None | Low | High |
| Naive LLM judge (single pass, same-family) | High | High | High | High | Very low |
| Calibrated LLM judge (both-orderings, decomposed rubrics, third-party judge, anchor set) | Low | Medium | Low | Medium (monitored) | Medium |
The honest prescription is not "stop using LLM-as-a-judge." It is: use it for the right job.
LLM judges are good at catching obvious failures at scale: hallucinated facts, format violations, safety issues, outputs that refused to engage with the task. These are cases where outputs differ substantially from what you want, and a judge catches them reliably and cheaply. Use the judge as a triage filter; route outputs that score below a threshold to human review, not to automatic rejection.
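The triage pattern, as a sketch with hypothetical `judge_score` and `enqueue_for_human_review` hooks and an assumed 1–5 scale:

```python
# Judge as triage filter: low scores route to a person, never to
# automatic rejection. Both hooks are stand-ins for your own pipeline.
TRIAGE_THRESHOLD = 3  # assumed cut-off on a 1-5 scale

def triage(output, judge_score, enqueue_for_human_review):
    score = judge_score(output)
    if score < TRIAGE_THRESHOLD:
        enqueue_for_human_review(output, score)
        return "needs_review"
    return "pass"
```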
For marginal comparisons that drive product decisions, the control structure is:
- Run both orderings when comparing two outputs. Log the inconsistency rate as a judge health signal.
- Score one dimension per call. Accuracy. Conciseness. Format adherence. Never ask for a combined score.
- Use a third-family judge when comparing outputs from different providers. Gemini as judge for a Claude vs GPT-4 evaluation is imperfect but materially less biased.
- Maintain an anchor set. 200–500 examples, relabelled in batches every 6–8 weeks, calibration checked on every pipeline change.
- Track confidence alongside scores. Add a one-word confidence signal to your rubric (high, medium, low) and route low-confidence outputs to human review automatically; a sketch of this routing follows the list.
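A sketch of that last routing step, assuming the rubric asks the judge to end its reply with a line like `confidence: high`:

```python
# Parse the one-word confidence signal out of the judge's reply and
# route on it. Anything unparseable is treated as low confidence.
import re

def route_by_confidence(judge_reply, enqueue_for_human_review):
    match = re.search(r"confidence:\s*(high|medium|low)", judge_reply.lower())
    confidence = match.group(1) if match else "low"
    if confidence == "low":
        enqueue_for_human_review(judge_reply)
    return confidence
```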
The narrow, honest case where LLM-as-a-judge works cleanly
The failure modes above concentrate in marginal pairwise comparisons. When scoring is near-verifiable (did the model include all required fields? did it stay within the character limit? did it produce valid JSON?), judges are both accurate and cheap. These are low-ambiguity criteria, and the biases have less surface to operate on.
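The three examples in parentheses are in fact fully verifiable, which means you can check them directly and keep the judge out of the loop entirely; a sketch:

```python
# Deterministic checks for the fully verifiable criteria: no judge
# model needed for any of these.
import json

def check_output(text, required_fields, char_limit):
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    return {
        "within_limit": len(text) <= char_limit,
        "valid_json": isinstance(parsed, dict),
        "all_fields": isinstance(parsed, dict)
                      and set(required_fields) <= parsed.keys(),
    }
```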
The reliability hierarchy, roughly: verifiable criteria, then absolute scoring on clear single dimensions, then pairwise comparison of close outputs. Most setups live at the bottom of this hierarchy. That is where the biases do the most damage and where the gap between eval scores and real quality is widest.
> The teams building LLM products that compound quality improvements over time are not the ones with the most sophisticated judge prompts. They are the ones that kept a small human eval programme running alongside automated scoring.
That is a process decision, not a technology decision. A weekly hour where someone reads 30 real outputs and flags what looks bad is more valuable than a sophisticated judge architecture built on a biased instrument. Start there. Add automation around it only once you know what failure looks like in your product.