Gate your deploys on groundedness (CI)
Problem. You tweak a prompt or swap a retriever and silently regress grounding. You find out from an angry user, not from CI — because the usual “LLM-as-judge” eval gives a different score every run, so you can’t gate on it.
Fix. Keep a small test set of (messages, sources) cases and run verified.eval. The
score is a deterministic groundedness number (verbatim coverage, no LLM judge), so it’s
stable enough to fail a build on.
import { MaxModel } from 'maxmodel'
import dataset from './groundedness.cases.json' // { cases: [{ messages, sources }, ...] }

const mx = new MaxModel({ apiKey: process.env.MAXMODEL_KEY! })

// Run the deterministic groundedness eval over every case in the dataset.
const r = await mx.verified.eval({ model: 'gpt-5.5-pro', cases: dataset.cases })

// SDK returns camelCase; over HTTP these are coverage_mean / unsupported_rate.
console.log(`coverageMean=${r.aggregate.coverageMean} unsupportedRate=${r.aggregate.unsupportedRate}`)

// Fail the build when mean coverage drops below the threshold.
const THRESHOLD = 0.85
if (r.aggregate.coverageMean < THRESHOLD) {
  console.error(`Groundedness regressed: ${r.aggregate.coverageMean} < ${THRESHOLD}`)
  process.exit(1)
}

Or wire in the published Action, which runs the same eval and adds a PR step-summary table:
# .github/workflows/groundedness.yml
name: groundedness
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: maxmodel-docs/groundedness-gate@v1
        with:
          api-key: ${{ secrets.MAXMODEL_KEY }}
          dataset: ./groundedness.cases.json # { "cases": [ { "messages": [...], "sources": [...] } ] }
          min-coverage: '0.85'

The dataset file is a single JSON object with a cases array:
{
  "cases": [
    {
      "messages": [{ "role": "user", "content": "What is the refund window?" }],
      "sources": [{ "id": "r.md", "text": "Full refunds within 30 days." }]
    }
  ]
}

Why it works as a gate. An LLM-judge eval flaps across reruns (self-agreement Krippendorff’s alpha as low as 0.27; see Benchmarks), so a threshold on it produces flaky CI. A verbatim coverage score is byte-identical from run to run: the same input always yields the same number. That’s the difference between a gate you can trust and one you’ll end up disabling.
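To make the determinism point concrete, here is a toy sketch of a verbatim coverage score. It is not the scoring algorithm behind verified.eval, which this page does not specify; the names verbatimCoverage, splitSentences, and normalize are hypothetical, and the sentence-level matching rule is an assumption chosen for illustration. The point it shows is that a pure string-matching score has no sampling step, so two runs on the same input cannot disagree.

```typescript
// Hypothetical illustration only: NOT the MaxModel scoring algorithm.
// Coverage here = fraction of answer sentences that appear verbatim
// (after case and whitespace normalization) in at least one source text.

interface Source {
  id: string
  text: string
}

// Lowercase and collapse whitespace so trivial formatting differences don't matter.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, ' ').trim()
}

// Split after sentence-ending punctuation; drop empty fragments.
function splitSentences(answer: string): string[] {
  return answer
    .split(/(?<=[.!?])\s+/)
    .map(normalize)
    .filter((s) => s.length > 0)
}

// Pure function of its inputs: no randomness, no model call.
function verbatimCoverage(answer: string, sources: Source[]): number {
  const sentences = splitSentences(answer)
  if (sentences.length === 0) return 0
  const haystacks = sources.map((s) => normalize(s.text))
  const supported = sentences.filter((sent) =>
    haystacks.some((h) => h.includes(sent))
  ).length
  return supported / sentences.length
}

const sources: Source[] = [{ id: 'r.md', text: 'Full refunds within 30 days.' }]
const answer = 'Full refunds within 30 days. Shipping is always free.'

const first = verbatimCoverage(answer, sources)
const second = verbatimCoverage(answer, sources)
console.log(first, second, first === second) // prints: 0.5 0.5 true
```

One of the two answer sentences is found verbatim in the source, so the score is 0.5 on every run. A real scorer would likely match at a finer granularity than whole sentences, but the gating property comes from the same place: the score is a pure function of the answer and the sources.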
See Eval (groundedness) for the full verified.eval surface and limits.