Your LLM-as-a-judge eval is measuring the wrong thing
Three systematic biases that corrupt automated evaluation pipelines — and a practical fix for each
You have a product that calls a language model. You need to know whether the model's outputs are getting better or worse as you iterate. Human review does not scale past a few hundred samples a week, so you set up LLM-as-a-judge: you send candidate outputs to a capable model (GPT-4o, Claude Sonnet, Gemini 1.5 Pro), ask it to score them against a rubric, and plot the results over time. You ship whichever version scored highest.
This setup has a real problem. The scores tell you how much the judge model likes the output, which is not the same as how good the output actually is. Three systematic biases corrupt the signal, often in combination. Once you understand them, it becomes obvious why eval numbers that look clean in development so often fail to predict real user experience.
The setup almost everyone is running
The basic LLM-as-a-judge pipeline is straightforward to assemble. Pick a capable model. Write a rubric: accuracy, relevance, coherence, tone. Send the candidate output and the rubric to the judge. Parse the score. Repeat at scale. The appeal is real: you get a number, the number responds to model changes, and it costs a fraction of human review.
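For concreteness, here is a minimal sketch of that loop, assuming the OpenAI Python SDK; the pinned model snapshot, rubric wording, and score format are placeholder choices, not recommendations.

```python
# Minimal judge loop. Assumes an OpenAI-style chat client; the model
# snapshot, rubric text, and SCORE format are illustrative.
import re
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the answer from 1 to 5 for accuracy, relevance, coherence, "
    "and tone. Reply with one line: SCORE: <integer 1-5>."
)

def judge(question: str, candidate: str,
          model: str = "gpt-4o-2024-08-06") -> int | None:
    """Ask the judge for a rubric score; return None if unparseable."""
    resp = client.chat.completions.create(
        model=model,  # a dated snapshot, not a floating alias
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user",
             "content": f"Question:\n{question}\n\nAnswer:\n{candidate}"},
        ],
    )
    text = resp.choices[0].message.content or ""
    match = re.search(r"SCORE:\s*([1-5])", text)
    return int(match.group(1)) if match else None
```

Everything that follows is about why the number this returns cannot be taken at face value.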
The problem is not the architecture. It is the assumption baked into the architecture: that a language model acting as a judge produces assessments that are independent of its own training. That assumption is false in three specific, measurable ways.
Bias 1: The model prefers itself
When you use GPT-4o as a judge to evaluate GPT-4o outputs, you are not getting an independent assessment. Research published in late 2024, and since replicated across multiple model families, found that LLMs exhibit a measurable self-preference bias: they give higher scores to outputs generated by a model from the same training lineage, even when blind human raters and deterministic metrics disagree.
The mechanism appears to be perplexity. A model assigns lower perplexity to text that resembles its training distribution. In a comparison setting, the judge internally finds one candidate more predictable, closer to what it would have produced, and scores it higher. This is not a reasoning failure; it is a structural feature of how autoregressive models evaluate likelihood.
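To see the mechanism in isolation, here is a small sketch using Hugging Face Transformers that scores two texts by perplexity under a single model. GPT-2 stands in purely for illustration; the effect described above concerns the judge model's own distribution.

```python
# Sketch: compare two candidates by perplexity under one model.
# GPT-2 is an illustrative stand-in for "the judge's distribution".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # loss is mean cross-entropy per token; exp(loss) is perplexity
        loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

a = "The capital of France is Paris."
b = "Paris, often called the city of light, serves as France's capital."
print(perplexity(a), perplexity(b))  # lower = more familiar to this model
```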
The practical consequence: if you are iterating on a GPT-4o-based product and using GPT-4o as your judge, you will systematically overestimate the quality of GPT-4o outputs relative to those from other generators. And if you ever switch model families, say from GPT-4o to Claude Sonnet, your eval scores will shift in ways that reflect the new judge's preferences as much as any real quality change.
Bias 2: Position decides more than content
When LLM-as-a-judge is used in a comparison format — "which of these two answers is better?" — the order in which candidates appear shifts the result significantly. A systematic study published in 2025, covering 15 LLM judges across 22 tasks, found that position bias is not random noise. It is consistent within a judge model and persists even when explicit debiasing instructions are added to the prompt.
The bias is strongest when the quality gap between candidates is small. When one answer is clearly worse, position does not matter much: the judge picks correctly regardless of ordering. But in the scenarios where your eval is most useful (close calls, marginal improvements, A/B experiments between two similar models) position can flip the verdict.
This is exactly the wrong direction for the bias to run. You are most reliant on automated eval when the differences are subtle. That is precisely when position bias has the most influence.
The fix is straightforward but requires discipline: run every pairwise comparison twice, once with each ordering, and aggregate. Record a verdict only when both orderings agree. Treat disagreements as abstentions: no result, not a coin flip. This halves your throughput and roughly doubles your judge cost. It also gives you signal you can act on, rather than a number that reflects seating order.
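A sketch of that logic, with the single judge call abstracted behind a hypothetical `judge_pair` function that returns "A" or "B" for the first- or second-shown candidate:

```python
# Sketch: position-swapped pairwise judging with abstention.
from typing import Callable, Literal, Optional

Verdict = Literal["A", "B"]
JudgeFn = Callable[[str, str, str], Verdict]  # (question, first, second)

def debiased_verdict(judge_pair: JudgeFn,
                     question: str, a: str, b: str) -> Optional[Verdict]:
    """Run both orderings; keep the verdict only when they agree."""
    forward = judge_pair(question, a, b)   # a shown first
    backward = judge_pair(question, b, a)  # b shown first
    # Map the swapped run back into a/b terms: in the second call,
    # "A" refers to b and "B" refers to a.
    backward_in_ab_terms: Verdict = "B" if backward == "A" else "A"
    if forward == backward_in_ab_terms:
        return forward
    return None  # orderings disagree: abstain rather than coin-flip
```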
Bias 3: Verbosity — a moving target
The received wisdom is that LLM judges prefer longer answers: more detail, higher score, regardless of whether the detail adds value. That pattern was well documented through 2023 and into 2024. Recent research has complicated it. Models trained with newer direct preference optimisation and RLHF recipes have, in some conditions, shown the opposite: a conciseness preference, in which bloated answers are penalised and tighter responses score better.
The verbosity bias is still a bias, then, but its direction varies by judge model, and that direction can change when the model is updated. If you use a commercial API judge without pinning model versions, a judge update can shift your historical score baselines with no change to your actual outputs. You may read an apparent quality improvement where the judge has simply changed its mind about answer length.
Three habits counter this. First, pin the judge model version. If a specific model snapshot is your judge today, it should be your judge six months from now. When you do upgrade, re-score a held-out reference set with both versions before adopting the new baseline. Second, write rubrics that penalise unnecessary content explicitly, not just reward completeness. "Does the answer include unnecessary information?" as a distinct criterion counteracts the pull toward verbosity. Third, calibrate your rubric against human judgements on a small held-out set before scaling it up. Forty manually labelled examples will surface a systematic disagreement between your rubric and your users' actual preferences before it contaminates thousands of evals.
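The calibration step in particular is cheap to automate. A sketch, assuming a hand-labelled held-out set and whatever scoring call your pipeline uses (the `judge_fn` parameter is a stand-in):

```python
# Sketch: calibrate the rubric against human labels before scaling up.
from typing import Callable, Optional

def agreement_rate(labelled: list[tuple[str, int]],
                   judge_fn: Callable[[str], Optional[int]],
                   tolerance: int = 0) -> float:
    """Fraction of examples where judge and human agree within
    `tolerance` points on the rubric scale."""
    hits = 0
    for output, human_score in labelled:
        judge_score = judge_fn(output)
        if judge_score is not None and abs(judge_score - human_score) <= tolerance:
            hits += 1
    return hits / len(labelled)
```

Run it once to set a baseline, then again whenever the judge version or the rubric changes; a drop in agreement flags systematic drift before it reaches your production numbers.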
The three-layer fix
Putting this together, a defensible LLM-as-a-judge pipeline for a production system needs three things.
Family separation. The model generating your product outputs should not be from the same provider or training lineage as your judge. This does not eliminate self-preference bias on its own, but it removes the strongest version of it.
Position swapping. For every pairwise comparison, run both orderings. Record the result only when both agree. This is more expensive, but the alternative is a bias that scales with the number of close decisions you make, which is exactly where you need precision.
Version pinning and calibration. Fix the judge model version. Re-calibrate against human labels whenever the judge is updated or the rubric changes. A rubric that was well-calibrated six months ago may be systematically off today because the judge's preferences shifted.
| Bias | Symptom in practice | Fix |
|---|---|---|
| Self-preference | Judge from same family as generator inflates scores; quality differences across model families appear larger than they are | Separate generator and judge by provider family |
| Position bias | Comparison results vary with candidate ordering; close decisions are unreliable | Swap order on every pairwise comparison; aggregate both; treat disagreements as abstentions |
| Verbosity (direction varies) | Score baselines shift after a judge update with no corresponding output change | Pin judge model version; write rubrics that penalise unnecessary content; calibrate against human labels |
The 30% that does not need a model at all
Before investing in a more rigorous LLM-as-a-judge setup, audit what you are actually evaluating. For many production AI systems, a significant fraction of the outputs that get routed to a judge model could be scored with deterministic logic.
Does the output contain a valid JSON object with the expected keys? Deterministic. Does the extracted date fall within a valid range? Deterministic. Does the response cite a source that was present in the retrieval context? Mostly deterministic. Does the tool call include all required parameters with the right types? Deterministic.
These do not need a judge. They need a schema validator, a regex, or a string match. Routing them through an LLM judge adds cost, latency, and noise. More importantly, it makes your eval outputs harder to debug: when everything goes through the same pipeline, it is difficult to separate "the model got the reasoning right" from "the model produced a syntactically valid answer".
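For concreteness, a sketch of that deterministic layer; the function names and the expected schema are illustrative:

```python
# Sketch: deterministic checks, no model involved.
import json
from datetime import date

def valid_json_with_keys(output: str, required: set[str]) -> bool:
    """Parses as a JSON object and contains every expected key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required <= parsed.keys()

def date_in_range(d: date, lo: date, hi: date) -> bool:
    """Extracted date falls within a valid range."""
    return lo <= d <= hi

def cites_retrieved_source(output: str, retrieved_ids: list[str]) -> bool:
    """'Mostly deterministic': substring match catches exact citations."""
    return any(src_id in output for src_id in retrieved_ids)
```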
> If you could write a unit test for it, route it to a validator. Reserve the judge for what a unit test cannot measure.
A rough heuristic: whenever there is a known ground truth (a correct answer, a schema, a constraint), use a deterministic check. Reserve LLM-as-a-judge for assessments that genuinely require natural language understanding: semantic correctness, appropriateness given context, reasoning quality, tone. The less you ask the judge to do, the cleaner the signal when you do ask.
What a trustworthy eval pipeline looks like
The simplest version that avoids all three bias traps is a four-stage pipeline; a sketch of how the stages compose follows the list.
- Deterministic checks first. Format, schema, citation presence, constraint satisfaction. These produce binary pass/fail signals with no model involved. They are fast, cheap, and produce results that a human can inspect and reproduce.
- Reference-based metrics where applicable. For tasks with known correct answers, such as summarisation against a gold summary or extraction against a labelled dataset, reference-based metrics like exact match or BERTScore give a first signal without involving a judge.
- LLM-as-a-judge for genuinely subjective assessments. Run with a judge from a different family than the generator. Swap positions on all pairwise comparisons. Pin the model version. Document the rubric and its calibration date.
- Periodic human calibration. Every six to eight weeks, manually score a random sample of outputs the judge has been evaluating. Compare. If the agreement rate drops below roughly 70%, the rubric or the judge needs updating.
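Here is how the first three stages might compose into a single scoring path. The stage logic is passed in as plain functions, and all names are illustrative rather than a real library: `checks` are deterministic validators like those sketched earlier, `reference_metric` compares against a gold answer, and `judge_subjective` is the pinned, family-separated judge.

```python
# Sketch: the first three stages as one scoring path.
from typing import Callable, Optional

def evaluate(output: str,
             checks: list[Callable[[str], bool]],
             reference: Optional[str],
             reference_metric: Callable[[str, str], float],
             judge_subjective: Callable[[str], float]) -> dict:
    # Stage 1: deterministic checks gate everything else.
    if not all(check(output) for check in checks):
        return {"stage": "deterministic", "passed": False}
    # Stage 2: reference-based metric when a gold answer exists.
    if reference is not None:
        return {"stage": "reference", "score": reference_metric(output, reference)}
    # Stage 3: the judge, reserved for genuinely subjective cases.
    return {"stage": "judge", "score": judge_subjective(output)}

# Stage 4 runs offline: periodically hand-score a random sample of
# outputs and compare against the stored results from stages 1-3.
```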
Most of the engineering effort in this pipeline sits in stages one and four, which involve no language models at all. That is the point. The more you anchor automated eval on ground truth, the more the LLM-as-a-judge layer is actually measuring quality rather than preference.
The research on these biases is not an argument against LLM-as-a-judge as a technique. It is an argument for building the controls that make it work. Without those controls, you are running a feedback loop where the model that trained on certain data tells you it likes outputs that look like that data. With them, you have an eval signal worth shipping against.