Your LLM-as-a-judge eval is measuring the wrong thing
Most LLM-as-a-judge pipelines share the same structural biases: self-preference, position bias, and a verbosity preference that shifts with every model update. Here is what each bias does to your scores, and how to fix it.
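To make the position-bias fix concrete before diving in, here is a minimal sketch of one standard mitigation: swapped-order consistency. It assumes a hypothetical `judge` callable (not a real API) that takes two candidate answers and returns `"A"` or `"B"`.

```python
from typing import Callable, Optional

def position_debiased_verdict(
    judge: Callable[[str, str], str],
    answer_a: str,
    answer_b: str,
) -> Optional[str]:
    """Query the judge twice with the answers swapped.

    Returns "A" or "B" only when both orderings agree, and None
    when the verdict flips with presentation order -- the signature
    of position bias. Inconsistent cases should be discarded or
    re-sampled rather than counted.
    """
    first = judge(answer_a, answer_b)      # answers in original order
    second = judge(answer_b, answer_a)     # answers swapped
    # Map the second verdict back to the original labels.
    second_unswapped = {"A": "B", "B": "A"}[second]
    return first if first == second_unswapped else None

# Hypothetical judge that always prefers whichever answer it sees first:
biased_judge = lambda a, b: "A"
print(position_debiased_verdict(biased_judge, "short answer", "long answer"))
# → None: a purely position-biased judge never agrees with itself
# across orderings, so its verdicts are filtered out.
```

The same double-query pattern generalizes: any judgment that is not invariant under a presentation change you control (order, formatting, labels) can be filtered or averaged out this way, at the cost of doubling judge calls.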