Your LLM-as-a-judge eval is measuring the wrong thing
Most LLM-as-a-judge pipelines share the same structural biases: self-preference, position bias, and a verbosity preference that shifts with every model update. Here is what each bias does to your scores, and how to fix it.
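To make the position-bias fix concrete before diving in, here is a minimal sketch of one standard mitigation: swapped-order consistency. It assumes a hypothetical `judge` callable (not a real API) that takes two candidate answers and returns `"A"` or `"B"`.

```python
from typing import Callable, Optional

def position_debiased_verdict(
    judge: Callable[[str, str], str],
    answer_a: str,
    answer_b: str,
) -> Optional[str]:
    """Query the judge twice with the answers swapped.

    Returns "A" or "B" only when both orderings agree, and None
    when the verdict flips with presentation order -- the signature
    of position bias. Inconsistent cases should be discarded or
    re-sampled rather than counted.
    """
    first = judge(answer_a, answer_b)      # answers in original order
    second = judge(answer_b, answer_a)     # answers swapped
    # Map the second verdict back to the original labels.
    second_unswapped = {"A": "B", "B": "A"}[second]
    return first if first == second_unswapped else None

# Hypothetical judge that always prefers whichever answer it sees first:
biased_judge = lambda a, b: "A"
print(position_debiased_verdict(biased_judge, "short answer", "long answer"))
# → None: a purely position-biased judge never agrees with itself
# across orderings, so its verdicts are filtered out.
```

The same double-query pattern generalizes: any judgment that is not invariant under a presentation change you control (order, formatting, labels) can be filtered or averaged out this way, at the cost of doubling judge calls.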