From a suite to a release decision
The pipeline pairs deterministic checks with an LLM judge so the score is both cheap and defensible — and frames the result as a release gate, not a leaderboard.
The suite defines a task, a set of test cases (each with deterministic checks), and the prompt/model variants under test.
Each variant produces an output for each case. Mock mode returns deterministic synthetic outputs; live mode runs the real model.
Deterministic checks (schema validity, must-include / must-exclude, regex, max-length) run first because they're free and catch the obvious failures.
claude-haiku-4-5 scores relevance, faithfulness, and safety (1–5) against the task rubric, with a one-line rationale and a confidence label.
Results are aggregated per variant: average quality, check pass-rate, latency, and cost, plus a count of unsafe cases.
The best variant's average quality is compared to the suite threshold to produce a ship / hold recommendation.
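The check, aggregate, and gate steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names (`run_checks`, `summarize`, `recommend`), the case and row field names, and the use of `>=` for the threshold comparison are all hypothetical, and a plain JSON parse stands in for real schema validation.

```python
import json
import re
from statistics import mean


def run_checks(output: str, case: dict) -> dict:
    """Run the free deterministic checks; returns {check_name: passed}.

    All keys read from `case` are hypothetical field names.
    """
    results = {}
    if case.get("expect_json"):
        # Stand-in for schema validity: only verifies the output parses as JSON.
        try:
            json.loads(output)
            results["valid_json"] = True
        except json.JSONDecodeError:
            results["valid_json"] = False
    lowered = output.lower()
    for phrase in case.get("must_include", []):
        results[f"include:{phrase}"] = phrase.lower() in lowered
    for phrase in case.get("must_exclude", []):
        results[f"exclude:{phrase}"] = phrase.lower() not in lowered
    if "regex" in case:
        results["regex"] = re.search(case["regex"], output) is not None
    if "max_length" in case:
        results["max_length"] = len(output) <= case["max_length"]
    return results


def summarize(rows: list) -> dict:
    """Aggregate one variant's per-case rows (hypothetical row shape)."""
    return {
        "quality": mean(r["quality"] for r in rows),
        "pass_rate": mean(1.0 if r["checks_passed"] else 0.0 for r in rows),
        "unsafe": sum(1 for r in rows if r["unsafe"]),
    }


def recommend(summaries: dict, threshold: float) -> tuple:
    """Pick the variant with the highest average quality and gate on it."""
    best = max(summaries, key=lambda name: summaries[name]["quality"])
    # Assumption: meeting the threshold exactly counts as "ship".
    decision = "ship" if summaries[best]["quality"] >= threshold else "hold"
    return best, decision
```

Latency and cost are left out of `summarize` for brevity; they would be averaged the same way as quality.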
The default mode needs no API key and no database, so the demo always works and CI never depends on a paid model call. The runner exposes one interface; live mode swaps in the Anthropic adapter when ANTHROPIC_API_KEY is set.
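One way the single-interface design could look is sketched below. The class and function names (`ModelAdapter`, `MockAdapter`, `LiveAdapter`, `pick_adapter`) are hypothetical, and the live adapter is deliberately left as a placeholder rather than guessing at how the project calls the Anthropic API.

```python
import hashlib
import os
from typing import Mapping, Optional, Protocol


class ModelAdapter(Protocol):
    """The one interface the runner depends on (hypothetical name)."""

    def generate(self, prompt: str) -> str: ...


class MockAdapter:
    """Returns a deterministic synthetic output derived from the prompt."""

    def generate(self, prompt: str) -> str:
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8]
        return f"[mock:{digest}] synthetic answer"


class LiveAdapter:
    """Placeholder for the Anthropic-backed adapter; the real call is omitted."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key

    def generate(self, prompt: str) -> str:
        raise NotImplementedError("call the real model here")


def pick_adapter(env: Optional[Mapping[str, str]] = None) -> ModelAdapter:
    """Use live mode only when ANTHROPIC_API_KEY is set; otherwise mock."""
    env = os.environ if env is None else env
    key = env.get("ANTHROPIC_API_KEY")
    return LiveAdapter(key) if key else MockAdapter()
```

Because the mock output is a pure function of the prompt, repeated CI runs would see identical outputs without any network access.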
Why a judge needs guardrails
An LLM judge alone is noisy. Pairing it with deterministic checks, an explicit rubric, and confidence labels, and surfacing disagreement rather than hiding it, is what makes the eval trustworthy enough to gate a release on.
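As an illustration of surfacing disagreement, the sketch below flags cases where the judge and the deterministic checks point in different directions, or where the judge reports low confidence. Everything here is an assumption for the example: the `JudgeVerdict` shape, the confidence label values, quality as the mean of the three rubric scores, and the `high` / `low` cut-offs.

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdict:
    relevance: int       # 1-5
    faithfulness: int    # 1-5
    safety: int          # 1-5
    rationale: str       # one-line explanation from the judge
    confidence: str      # assumed labels: "low", "medium", "high"

    @property
    def quality(self) -> float:
        # Assumption: overall quality is the plain mean of the three scores.
        return (self.relevance + self.faithfulness + self.safety) / 3


def needs_review(verdict: JudgeVerdict, checks_passed: bool,
                 high: float = 4.0, low: float = 2.5) -> list:
    """Return human-readable reasons this case should be looked at."""
    reasons = []
    if verdict.confidence == "low":
        reasons.append("judge confidence is low")
    if verdict.quality >= high and not checks_passed:
        reasons.append("judge scored high but deterministic checks failed")
    if verdict.quality <= low and checks_passed:
        reasons.append("judge scored low but deterministic checks passed")
    return reasons
```

An empty list means the judge and the checks agree; anything else would be shown alongside the score rather than averaged away.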