Know whether your AI is good enough to ship.
An evaluation and observability workbench that compares prompt / model variants on quality, cost, latency, and safety — and recommends a release gate.
Runs in mock mode — no API key required. Add ANTHROPIC_API_KEY to score with real models.
Pick a suite and run an eval
Each suite compares a baseline variant against an improved one, scores every case on quality, cost, latency, and safety, and recommends whether to ship.
Support reply quality
gate ≥ 4.0
Are auto-drafted support replies accurate, on-policy, and grounded in the ticket?
JSON contact extraction
gate ≥ 4.2
Does the model return strictly valid JSON with the required fields?
Schema, must-include / must-exclude, and regex checks run before any model call, so they are free and fast.
claude-haiku-4-5 scores relevance, faithfulness, and safety against a rubric with confidence labels.
Average quality vs a threshold yields a ship / hold recommendation — not just a wall of scores.
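As a rough illustration of the first and last steps above, here is a minimal Python sketch of deterministic pre-checks followed by a threshold gate. It is not the workbench's actual code: the function names `deterministic_checks` and `release_gate`, their parameters, and the returned shapes are all hypothetical, and the sketch assumes the gate compares the mean per-case quality score with the suite threshold using `>=`.

```python
import json
import re
from statistics import mean


def deterministic_checks(output, required_fields=(), must_include=(),
                         must_exclude=(), pattern=None):
    """Hypothetical pre-checks that need no model call.

    Returns a list of failure messages; an empty list means all checks passed.
    """
    failures = []

    # Schema check: only attempted when required fields are given.
    if required_fields:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return ["output is not valid JSON"]
        if not isinstance(data, dict):
            return ["output is not a JSON object"]
        missing = [f for f in required_fields if f not in data]
        if missing:
            failures.append(f"missing fields: {missing}")

    # Must-include / must-exclude: case-insensitive substring checks.
    lowered = output.lower()
    for phrase in must_include:
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase!r}")
    for phrase in must_exclude:
        if phrase.lower() in lowered:
            failures.append(f"forbidden phrase: {phrase!r}")

    # Regex check: the pattern must match somewhere in the output.
    if pattern is not None and re.search(pattern, output) is None:
        failures.append(f"pattern not matched: {pattern!r}")

    return failures


def release_gate(scores, threshold):
    """Assumed gate rule: ship when the average quality meets the threshold."""
    average = mean(scores)  # raises StatisticsError if scores is empty
    decision = "ship" if average >= threshold else "hold"
    return {"average": round(average, 2), "decision": decision}
```

For example, `release_gate([4.5, 4.0, 4.4], threshold=4.2)` returns a "ship" decision under these assumptions, while `release_gate([4.0, 4.1], threshold=4.2)` returns "hold". Running the deterministic checks first means a case that fails on invalid JSON never has to be sent to the judge model.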