Why most LLM-as-a-judge eval setups are broken
Three systematic biases that corrupt your quality scores, and what a trustworthy eval pipeline actually looks like
LLM-as-a-judge became popular for a good reason: human eval is slow, expensive, and does not scale. When you are comparing 50 model outputs or iterating through prompt variants, waiting for annotators takes days you do not have. So engineers reach for LLM-as-a-judge: a capable frontier model with a scoring rubric that evaluates outputs automatically. It feels rigorous. The scores update on every deploy. You can run it in CI.
The problem is not the concept. It is that most setups introduce systematic biases that do not become visible until your eval scores stop correlating with what users are actually complaining about. By that point, you have spent weeks optimising against a broken instrument.
There are three biases. They compound. And fixing any one of them requires structural changes to your eval design — not a better prompt.
Why it works, until it does not
Early LLM judge setups tend to show genuine correlation with human judgment. That correlation is partly real and partly a statistical artefact of easy cases.
When one output is clearly better than the other, a competent judge model and a competent human rater both agree. The model has absorbed enough human feedback to simulate human preferences well in unambiguous cases. This early success builds trust in the setup.
The problem is that the interesting cases are the marginal ones, specifically those that drive product decisions. Is this prompt variant 6% better or 3% worse? Did the fine-tune help for our specific domain? In marginal comparisons, the biases dominate. The easy cases mask the dysfunction long enough to be dangerous.
Bias 1: the judge cares about position, not quality
Position bias is the most documented failure mode in LLM evaluation. When you ask a model to compare output A and output B, it reliably favours whichever appears first in the prompt. The margin varies between models but is almost never zero.
The scale of this was quantified in MT-Bench research: across multiple strong judge models, reversing the position of two outputs changed the winner roughly 35–40% of the time, even when the underlying content was identical. One in three of your A/B comparisons may be telling you more about your prompt template than about your outputs.
The obvious fix is to run each comparison in both orderings and take a consensus: count a winner only when the judge agrees in both directions, which cancels the directional bias. But you have now doubled your eval cost, and the inconsistency rate itself becomes signal: a judge that flips its verdict on 40% of position swaps is a noisy instrument. Averaging the two runs hides that information rather than resolving it.
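A minimal sketch of the both-orderings check, assuming a hypothetical `judge_pick(first, second)` wrapper around your judge model that returns "A" if it prefers the output shown first and "B" otherwise:

```python
# Both-orderings pairwise check. `judge_pick(first, second)` is a
# hypothetical wrapper around the judge model, returning "A" if it
# prefers the output shown first and "B" if it prefers the second.

def judged_winner(output_a, output_b, judge_pick):
    """Return "A" or "B" only if the verdict survives a position swap,
    otherwise None: the comparison was position-dependent."""
    first = judge_pick(output_a, output_b)    # output_a shown first
    second = judge_pick(output_b, output_a)   # output_a shown second
    # Translate the swapped verdict back into output_a/output_b terms.
    second_in_original_terms = "A" if second == "B" else "B"
    return first if first == second_in_original_terms else None

def flip_rate(pairs, judge_pick):
    """Fraction of comparisons whose verdict did not survive the swap.
    Worth logging on every eval run as a judge health signal."""
    verdicts = [judged_winner(a, b, judge_pick) for a, b in pairs]
    return sum(v is None for v in verdicts) / len(verdicts)
```

Treating flips as "no verdict" rather than averaging them keeps the inconsistency visible instead of burying it in the score.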
What this means in practice: any single-pass pairwise eval is running a noisy experiment on top of a biased instrument. Your evaluation budget should account for this. Use pairwise judges for rough directional screening, not for decisions where margin matters.
Bias 2: the judge rewards length, not accuracy
Verbosity bias is arguably more damaging than position bias, because it actively incentivises the wrong output. LLM judges consistently score longer outputs higher, independent of quality. This has been reproduced across model families and across rubric types; it is not a quirk of any single provider.
The likely mechanism is training data. Human raters scoring outputs quickly, without deep engagement, tend to interpret thoroughness as quality. More caveats feel more careful. Longer answers feel more effortful. Models have absorbed this correlation from RLHF data. They now reward what hurried humans reward, not what careful humans reward.
Decomposing scoring criteria helps: "Rate accuracy 1–5. Rate conciseness 1–5. Do not combine." But even rubric-decomposed judges show residual verbosity skew. The only reliable calibration is human baselines: a sample of outputs hand-scored by people who actually read them carefully, against which you can measure and correct your judge's length distortion.
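One way to quantify that distortion, sketched under an assumed data layout: on the hand-scored sample, compare how strongly judge scores and human scores each track output length. The gap between the two correlations is the judge's residual length skew.

```python
# Length-skew audit over a hand-scored sample of
# (output_text, human_score, judge_score) triples. The data layout is
# an assumption; spearmanr is from SciPy.
from scipy.stats import spearmanr

def verbosity_skew(samples):
    lengths = [len(text) for text, _, _ in samples]
    human = [h for _, h, _ in samples]
    judge = [j for _, _, j in samples]
    judge_vs_len, _ = spearmanr(lengths, judge)  # how much the judge tracks length
    human_vs_len, _ = spearmanr(lengths, human)  # how much careful humans do
    return judge_vs_len - human_vs_len           # positive: judge over-rewards length
```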
The downstream risk is that if you optimise your model or prompt against a verbosity-biased judge, you will end up with a chattier product. Your eval scores go up; user satisfaction stays flat or drops. This is the Goodhart dynamic in its most expensive form.
Bias 3: the judge prefers outputs that look like its own
Self-preference is the most uncomfortable finding, because it means your choice of judge model is not a neutral decision. A substantial body of research published in 2024 and 2025 shows that LLMs rate their own outputs, and outputs stylistically similar to their own, higher than outputs from other model families.
A 2026 ICLR-accepted study formalised this as "preference leakage": when the evaluator and the data generator share training lineage or stylistic fingerprints, the evaluator's scores are contaminated in favour of the generator. This applies even when the judge is supposedly evaluating on objective criteria.
In practical terms: use GPT-4 as your judge to compare GPT-4 and Claude outputs, and you get GPT-4-favourable results. Use Claude as the judge, and the verdict flips. Neither reflects what a user would actually prefer.
This matters most in three scenarios. Model migrations, where you are evaluating whether switching providers improves quality. Fine-tuning experiments, where a model trained on data sharing lineage with your judge gets inflated scores. And any competitive evaluation where the judge's provenance is not disclosed, which covers most published benchmarks.
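A cheap diagnostic for this, sketched with hypothetical judge wrappers: score the same pairs with judges from two different model families and track how often they disagree. Heavy disagreement concentrated on pairs involving a judge's own family is the self-preference signature.

```python
# Cross-family disagreement check. `judge_x` and `judge_y` are
# hypothetical callables from different model families, each returning
# "A" or "B" for a pair of outputs.

def cross_judge_disagreement(pairs, judge_x, judge_y):
    disagreements = sum(
        judge_x(a, b) != judge_y(a, b) for a, b in pairs
    )
    return disagreements / len(pairs)
```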
The calibration problem that compounds everything else
Even after addressing the three biases above (both-orderings for position, decomposed rubrics for verbosity, third-party judge for self-preference), you still have calibration drift.
A judge model is only trustworthy relative to a ground truth you can inspect. The moment you swap to a newer version of the same judge model, the calibration resets. Update your rubric phrasing, and it resets. Change the output format, and it resets. Each of these changes produces a different scoring distribution, and without a fixed anchor you cannot tell whether the scores have shifted because quality changed or because the judge changed.
The structural fix is a held-out anchor set: 200–500 outputs with human labels, never used for prompt optimisation, only for calibration audits. Every time you change anything about the eval pipeline, run the judge against this set and measure rank correlation with the human scores. If the Spearman correlation drops below your threshold, the judge has drifted — regardless of what the production scores say.
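A minimal version of that audit, assuming a `score_with_judge` hook for your current judge configuration and a threshold set from your own calibration history:

```python
# Calibration audit against the held-out anchor set, run on every
# pipeline change. `score_with_judge` is a hypothetical function that
# applies the current judge configuration to one output.
from scipy.stats import spearmanr

SPEARMAN_THRESHOLD = 0.8  # assumed value; derive yours empirically

def judge_still_calibrated(anchor_set, score_with_judge):
    """anchor_set: list of (output_text, human_score) pairs, never used
    for prompt optimisation."""
    judge_scores = [score_with_judge(text) for text, _ in anchor_set]
    human_scores = [score for _, score in anchor_set]
    rho, _ = spearmanr(judge_scores, human_scores)
    return rho >= SPEARMAN_THRESHOLD
```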
Maintaining the anchor set is the work most teams skip. It requires discipline about what goes into it (representative, not cherry-picked), how often it gets refreshed (a stale anchor set gives false confidence), and who owns the relabelling (usually the same person who owns the product quality bar). It is also the single most valuable piece of infrastructure a team can build if their product quality depends on an eval pipeline.
What a trustworthy LLM-as-a-judge setup looks like
| Approach | Position bias | Verbosity bias | Self-preference | Calibration drift | Cost |
|---|---|---|---|---|---|
| Human eval | None | None | None | Low | High |
| Naive LLM judge (single pass, same-family) | High | High | High | High | Very low |
| Calibrated LLM judge (both-orderings, decomposed rubrics, third-party judge, anchor set) | Low | Medium | Low | Medium (monitored) | Medium |
The honest prescription is not "stop using LLM-as-a-judge." It is: use it for the right job.
LLM judges are good at catching obvious failures at scale: hallucinated facts, format violations, safety issues, outputs that refused to engage with the task. These are cases where outputs differ substantially from what you want, and a judge catches them reliably and cheaply. Use the judge as a triage filter; route outputs that score below a threshold to human review, not to automatic rejection.
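The triage pattern, as a sketch with hypothetical `judge_score` and `enqueue_for_human_review` hooks and an assumed 1–5 scale:

```python
# Judge as triage filter: low scores route to a person, never to
# automatic rejection. Both hooks are stand-ins for your own pipeline.
TRIAGE_THRESHOLD = 3  # assumed cut-off on a 1-5 scale

def triage(output, judge_score, enqueue_for_human_review):
    score = judge_score(output)
    if score < TRIAGE_THRESHOLD:
        enqueue_for_human_review(output, score)
        return "needs_review"
    return "pass"
```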
For marginal comparisons that drive product decisions, the control structure is:
- Run both orderings when comparing two outputs. Log the inconsistency rate as a judge health signal.
- Score one dimension per call. Accuracy. Conciseness. Format adherence. Never ask for a combined score.
- Use a third-family judge when comparing outputs from different providers. Gemini as judge for a Claude vs GPT-4 evaluation is imperfect but materially less biased.
- Maintain an anchor set. 200–500 examples, relabelled in batches every 6–8 weeks, calibration checked on every pipeline change.
- Track confidence alongside scores. Add a one-word confidence signal to your rubric (high, medium, low) and route low-confidence outputs to human review automatically; a sketch of this routing follows the list.
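A sketch of that last routing step, assuming the rubric asks the judge to end its reply with a line like `confidence: high`:

```python
# Parse the one-word confidence signal out of the judge's reply and
# route on it. Anything unparseable is treated as low confidence.
import re

def route_by_confidence(judge_reply, enqueue_for_human_review):
    match = re.search(r"confidence:\s*(high|medium|low)", judge_reply.lower())
    confidence = match.group(1) if match else "low"
    if confidence == "low":
        enqueue_for_human_review(judge_reply)
    return confidence
```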
The narrow, honest case where LLM-as-a-judge works cleanly
The failure modes above concentrate in marginal pairwise comparisons. When scoring is near-verifiable (did the model include all required fields? did it stay within the character limit? did it produce valid JSON?), judges are both accurate and cheap. These are low-ambiguity criteria, and the biases have less surface to operate on.
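The three examples in parentheses are in fact fully verifiable, which means you can check them directly and keep the judge out of the loop entirely; a sketch:

```python
# Deterministic checks for the fully verifiable criteria: no judge
# model needed for any of these.
import json

def check_output(text, required_fields, char_limit):
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        parsed = None
    return {
        "within_limit": len(text) <= char_limit,
        "valid_json": isinstance(parsed, dict),
        "all_fields": isinstance(parsed, dict)
                      and set(required_fields) <= parsed.keys(),
    }
```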
The reliability hierarchy, roughly: verifiable criteria, then absolute scoring on clear single dimensions, then pairwise comparison of close outputs. Most setups live at the bottom of this hierarchy. That is where the biases do the most damage and where the gap between eval scores and real quality is widest.
> The teams building LLM products that compound quality improvements over time are not the ones with the most sophisticated judge prompts. They are the ones that kept a small human eval programme running alongside automated scoring.
That is a process decision, not a technology decision. A weekly hour where someone reads 30 real outputs and flags what looks bad is more valuable than a sophisticated judge architecture built on a biased instrument. Start there. Add automation around it only once you know what failure looks like in your product.