Your LLM judge works in the test harness. Here's why it fails in production.
LLM-as-a-judge evals look reliable in the test harness. Here's what breaks after months in production: calibration drift, noisy decision boundaries, cascade failures in multi-step pipelines, and the meta-evaluation trap.