tech · 2026-06-13 · engineering · evals

An LLM judge needs a chaperone

Asking a model 'is this good?' and trusting the answer is how you ship noise. Pair it with checks it can't fake.

When I built an eval tool, the tempting move was to ask a model whether an output was good and believe it. That's noisy and it costs money on every call.

So the cheap, deterministic checks run first: schema, must-include, must-exclude, regex. They're free and they catch the obvious failures before any model sees them.
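Here's a minimal sketch of what that first pass could look like. It isn't the tool's actual code: the function name `run_checks` and its parameters are made up for illustration, and the 'schema' check is reduced to 'the output parses as a JSON object with these keys', which is an assumption on my part.

```python
import json
import re


def run_checks(output, required_keys=(), must_include=(), must_exclude=(), pattern=None):
    """Run the cheap checks; return failure messages (empty list = all passed)."""
    failures = []

    # Schema (simplified, an assumption): output must be a JSON object
    # that contains every required key.
    if required_keys:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            data = None
        if not isinstance(data, dict):
            failures.append("schema: output is not a JSON object")
        else:
            for key in required_keys:
                if key not in data:
                    failures.append(f"schema: missing key {key!r}")

    # Must-include: every phrase has to appear verbatim.
    for phrase in must_include:
        if phrase not in output:
            failures.append(f"must-include: {phrase!r} not found")

    # Must-exclude: none of these phrases may appear.
    for phrase in must_exclude:
        if phrase in output:
            failures.append(f"must-exclude: {phrase!r} found")

    # Regex: the pattern has to match somewhere in the output.
    if pattern is not None and re.search(pattern, output) is None:
        failures.append(f"regex: no match for {pattern!r}")

    return failures
```

Returning a list of messages rather than a bare pass/fail keeps the reason for each failure around for later.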

Only what's left goes to an LLM judge, scored against an explicit rubric with a confidence label. When the judge and the checks disagree, I surface it instead of hiding it.
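A rough sketch of that gating and the disagreement flag, under some loud assumptions: `judge` is a placeholder callable standing in for whatever model call you use (no real API is implied), `JudgeVerdict`, `judge_if_needed` and `find_disagreement` are hypothetical names, and I'm reading 'what's left' as 'outputs that passed the cheap checks'.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class JudgeVerdict:
    passed: bool
    confidence: str  # assumed labels: "low", "medium", "high"
    reason: str


# Hypothetical judge signature: (output, rubric) -> JudgeVerdict.
Judge = Callable[[str, str], JudgeVerdict]


def judge_if_needed(output: str, check_failures: List[str],
                    judge: Judge, rubric: str) -> Optional[JudgeVerdict]:
    """Only spend a model call on outputs the cheap checks did not reject."""
    if check_failures:
        return None  # already failed deterministically; skip the paid call
    return judge(output, rubric)


def find_disagreement(check_failures: List[str],
                      verdict: Optional[JudgeVerdict]) -> Optional[str]:
    """Return a message when checks and judge point in different directions."""
    if verdict is None:
        return None  # the judge never ran, so there is nothing to compare
    checks_passed = not check_failures
    if checks_passed == verdict.passed:
        return None
    checks_word = "passed" if checks_passed else "failed"
    judge_word = "passed" if verdict.passed else "failed"
    return (f"checks {checks_word} but judge {judge_word} "
            f"({verdict.confidence} confidence): {verdict.reason}")
```

`find_disagreement` compares the two results symmetrically, so it still works if you choose to run the judge on outputs that failed a check.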

The framing matters as much as the scoring. The output isn't a number, it's a decision: ship or hold. 'Block this release' is more useful than '3.7 out of 5.'
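To make 'a decision, not a number' concrete, here's one way the final step could be sketched. The rule itself is my assumption, not something the post prescribes: hold if any check failed, if the judge failed the output, or if the judge passed it with low confidence. `release_decision` and its parameters are hypothetical names.

```python
from typing import List, Optional, Tuple


def release_decision(check_failures: List[str],
                     judge_passed: Optional[bool],
                     judge_confidence: Optional[str]) -> Tuple[str, List[str]]:
    """Collapse the evidence into 'ship' or 'hold', plus the reasons for a hold.

    judge_passed is None when the judge was never run.
    """
    reasons = list(check_failures)
    if judge_passed is False:
        reasons.append("judge: failed rubric")
    elif judge_passed and judge_confidence == "low":
        # Assumed policy: a low-confidence pass is not enough to ship.
        reasons.append("judge: passed, but with low confidence")
    return ("hold" if reasons else "ship", reasons)
```

For example, `release_decision([], True, "high")` gives `("ship", [])`, while any check failure or a failed rubric gives `"hold"` along with the reasons.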