
Gate your deploys on groundedness (CI)

Problem. You tweak a prompt or swap a retriever and silently regress grounding. You find out from an angry user, not from CI — because the usual “LLM-as-judge” eval gives a different score every run, so you can’t gate on it.

Fix. Keep a small test set of (messages, sources) cases and run verified.eval. The score is a deterministic groundedness number (verbatim coverage, no LLM judge), so it’s stable enough to fail a build on.

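import { MaxModel } from 'maxmodel'
// The dataset file is a single JSON object: { "cases": [{ messages, sources }, ...] }
import dataset from './groundedness.cases.json'

const mx = new MaxModel({ apiKey: process.env.MAXMODEL_KEY! })

// Pass the inner array, not the wrapper object.
const r = await mx.verified.eval({ model: 'gpt-5.5-pro', cases: dataset.cases })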
 
// SDK returns camelCase; over HTTP these are coverage_mean / unsupported_rate.
console.log(`coverageMean=${r.aggregate.coverageMean} unsupportedRate=${r.aggregate.unsupportedRate}`)
 
const THRESHOLD = 0.85
if (r.aggregate.coverageMean < THRESHOLD) {
  console.error(`Groundedness regressed: ${r.aggregate.coverageMean} < ${THRESHOLD}`)
  process.exit(1)
}
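
The threshold check can also be factored into a small pure function, which makes it unit-testable without calling the API and leaves room to gate on unsupportedRate as well. This is a minimal sketch, assuming the aggregate exposes the two fields logged above. `checkGate`, `GateThresholds`, and the `maxUnsupportedRate` option are hypothetical names for illustration, not part of the SDK, and the numbers are made up.

```typescript
// Hypothetical helper; not part of the maxmodel SDK.
// Field names mirror the camelCase aggregate fields logged above.
interface Aggregate {
  coverageMean: number
  unsupportedRate: number
}

interface GateThresholds {
  minCoverage: number          // fail if coverageMean is below this
  maxUnsupportedRate?: number  // optional (assumed): fail if unsupportedRate is above this
}

// Returns human-readable failures; an empty list means the gate passes.
function checkGate(agg: Aggregate, t: GateThresholds): string[] {
  const failures: string[] = []
  if (agg.coverageMean < t.minCoverage) {
    failures.push(`coverageMean ${agg.coverageMean} < ${t.minCoverage}`)
  }
  if (t.maxUnsupportedRate !== undefined && agg.unsupportedRate > t.maxUnsupportedRate) {
    failures.push(`unsupportedRate ${agg.unsupportedRate} > ${t.maxUnsupportedRate}`)
  }
  return failures
}

// Example with made-up numbers (not real eval output): both checks fail.
const failures = checkGate(
  { coverageMean: 0.81, unsupportedRate: 0.12 },
  { minCoverage: 0.85, maxUnsupportedRate: 0.1 },
)
console.log(failures)
```

In a CI script, a non-empty list would be printed with console.error and followed by process.exit(1), as in the inline check above.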

Or wire in the published Action — same eval, with a PR step-summary table:

# .github/workflows/groundedness.yml
name: groundedness
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: maxmodel-docs/groundedness-gate@v1
        with:
          api-key: ${{ secrets.MAXMODEL_KEY }}
          dataset: ./groundedness.cases.json   # { "cases": [ { "messages": [...], "sources": [...] } ] }
          min-coverage: '0.85'

The dataset file is a single JSON object with a cases array:

{
  "cases": [
    { "messages": [{ "role": "user", "content": "What is the refund window?" }],
      "sources": [{ "id": "r.md", "text": "Full refunds within 30 days." }] }
  ]
}
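
A malformed dataset is easier to debug if it fails before the eval call. The sketch below checks only the shape shown in the example above (an object with a cases array, each case carrying messages of { role, content } and sources of { id, text }). `validateDataset` and `isRecord` are hypothetical helpers, and the real API may accept or require fields this check does not know about.

```typescript
// Hypothetical pre-flight check; it mirrors only the example dataset shape above.
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === 'object' && v !== null && !Array.isArray(v)
}

// Returns a list of problems; an empty list means the shape looks right.
function validateDataset(data: unknown): string[] {
  if (!isRecord(data)) return ['top level must be an object with a "cases" array']
  const cases = data.cases
  if (!Array.isArray(cases)) return ['top level must be an object with a "cases" array']

  const problems: string[] = []
  cases.forEach((c: unknown, i: number) => {
    if (!isRecord(c)) {
      problems.push(`cases[${i}] is not an object`)
      return
    }
    const messages = c.messages
    const sources = c.sources
    const messagesOk = Array.isArray(messages) &&
      messages.every((m) => isRecord(m) && typeof m.role === 'string' && typeof m.content === 'string')
    const sourcesOk = Array.isArray(sources) &&
      sources.every((s) => isRecord(s) && typeof s.id === 'string' && typeof s.text === 'string')
    if (!messagesOk) problems.push(`cases[${i}].messages must be an array of { role, content }`)
    if (!sourcesOk) problems.push(`cases[${i}].sources must be an array of { id, text }`)
  })
  return problems
}

const sample = {
  cases: [
    {
      messages: [{ role: 'user', content: 'What is the refund window?' }],
      sources: [{ id: 'r.md', text: 'Full refunds within 30 days.' }],
    },
  ],
}
console.log(validateDataset(sample))        // no problems for the documented shape
console.log(validateDataset(sample.cases))  // a bare array is rejected: the wrapper object is missing
```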

Why it works as a gate. An LLM-judge eval flaps across reruns (self-agreement as low as 0.27 Krippendorff’s alpha; see Benchmarks), so a threshold on it produces flaky CI. A verbatim coverage score is identical from run to run: the same input always yields the same number. That’s the difference between a gate you can trust and one you’ll end up disabling.
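
To see why a string-matching score cannot flap, consider a toy version. The function below is an illustration only and is not the scoring used by verified.eval: `toyCoverage` is a hypothetical name, and its definition (the fraction of answer sentences that appear verbatim in some source, after lowercasing and whitespace normalization) is an assumption chosen for simplicity. It is pure string comparison with no model call and no sampling, so repeated runs on the same input return the same number.

```typescript
// Illustrative only: NOT the scoring used by verified.eval.
function normalize(s: string): string {
  return s.toLowerCase().replace(/\s+/g, ' ').trim()
}

// Toy definition (assumed): share of answer sentences found verbatim in any source.
function toyCoverage(answer: string, sources: string[]): number {
  const raw = answer.match(/[^.!?]+[.!?]*/g) || []
  const sentences = raw.map(normalize).filter((s) => s.length > 0)
  if (sentences.length === 0) return 0
  const haystack = sources.map(normalize)
  const supported = sentences.filter((s) => haystack.some((h) => h.indexOf(s) !== -1))
  return supported.length / sentences.length
}

const sources = ['Full refunds within 30 days.']
const answer = 'Full refunds within 30 days. Shipping is always free.'

const first = toyCoverage(answer, sources)
const second = toyCoverage(answer, sources)
console.log(first, first === second) // 0.5 true
```

One of the two sentences is supported by the source, so the toy score is 0.5 on every run, which is the property that makes a fixed threshold meaningful.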

See Eval (groundedness) for the full verified.eval surface and limits.