Deterministic groundedness eval
Score how grounded your model+sources pipeline is across a whole dataset, using the verbatim check as the metric instead of an LLM judge. LLM-judge eval tools (RAGAS, etc.) grade one LLM with another; this metric is exact by construction.
- SDK: mx.verified.eval(params)
- HTTP: POST https://api.maxmodel.com/v1/verified/eval
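For callers without the SDK, the HTTP endpoint can be reached with any client. The sketch below only builds the request; the body shape (model, optional mode, cases) follows the request example on this page. The Bearer-token Authorization header is an assumption, since the auth scheme is not documented here, and buildEvalRequest is a hypothetical helper name.

```typescript
// Sketch: build the HTTP request for POST /v1/verified/eval without the SDK.
// ASSUMPTION: Bearer-token auth via the Authorization header. The auth scheme
// is not documented on this page, so confirm it before relying on this.

interface EvalCase {
  messages: { role: string; content: string }[];
  sources: { id: string; text: string }[];
}

interface EvalRequestBody {
  model: string;
  mode?: string; // e.g. "strict", as in the HTTP request example
  cases: EvalCase[];
}

const EVAL_URL = "https://api.maxmodel.com/v1/verified/eval";

// Hypothetical helper: returns the URL and the init object for fetch().
function buildEvalRequest(apiKey: string, body: EvalRequestBody) {
  return {
    url: EVAL_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed scheme
      },
      body: JSON.stringify(body),
    },
  };
}

const req = buildEvalRequest("YOUR_API_KEY", {
  model: "gpt-5.5-pro",
  cases: [
    {
      messages: [{ role: "user", content: "What is the refund window?" }],
      sources: [{ id: "r.md", text: "Full refunds within 30 days." }],
    },
  ],
});

// To send it: const res = await fetch(req.url, req.init)
console.log(req.init.method, req.url);
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any model calls are made.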
SDK
const r = await mx.verified.eval({
  model: 'gpt-5.5-pro',
  cases: [
    { messages: [{ role: 'user', content: 'What is the refund window?' }],
      sources: [{ id: 'r.md', text: 'Full refunds within 30 days.' }] },
    { messages: [{ role: 'user', content: 'How much is Pro?' }],
      sources: [{ id: 'p.md', text: 'Pro is $29/month.' }] },
  ],
})
r.aggregate // { coverageMean: 1, unsupportedRate: 0, n: 2, errored: 0 }
r.cases     // [{ coverage, citations, unsupported, error? }, ...]
r.usage     // { totalTokens, modelCalls } ← a dataset run = many model calls

r = mx.verified.eval(model="gpt-5.5-pro", cases=[
    {"messages": [{"role": "user", "content": "refund window?"}],
     "sources": [{"id": "r.md", "text": "Full refunds within 30 days."}]},
])
r.aggregate.coverage_mean  # 1.0
r.aggregate.unsupported_rate
r.model_calls

Request / response (HTTP)
// POST /v1/verified/eval
{ "model": "gpt-5.5-pro", "mode": "strict",
  "cases": [ { "messages": [...], "sources": [...] }, ... ] }

{ "aggregate": { "coverage_mean": 0.82, "unsupported_rate": 0.18, "n": 120, "errored": 0 },
  "cases": [ { "coverage": 0.9, "citations": 4, "unsupported": 1 }, ... ],
  "usage": { "total_tokens": 98000, "model_calls": 120 } }

CI quality gate
Because the eval is deterministic (same input → same score), it makes a reliable CI gate — unlike an LLM-judge metric that varies run-to-run. Drop a dataset of cases in your repo and fail the PR when groundedness regresses:
# .github/workflows/groundedness.yml
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: maxmodel/groundedness-gate@v1
        with:
          api-key: ${{ secrets.MAXMODEL_KEY }}
          dataset: eval/dataset.json
          min-coverage: '0.8'

The action runs this eval and fails (with a step-summary table) when coverage_mean drops below the threshold. Source: action/ in the repo (zero-dependency, calls /v1/verified/eval). Same input → same gate.
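The gate logic described above amounts to one comparison plus a summary table. The following is a minimal sketch of that idea, not the action's actual source: checkGate and GateResult are hypothetical names, and the field names are taken from the HTTP response shown earlier.

```typescript
// Sketch of a threshold gate over the eval response's aggregate.
// Not the action's real implementation; names here are illustrative.

interface Aggregate {
  coverage_mean: number;
  unsupported_rate: number;
  n: number;
  errored: number;
}

interface GateResult {
  pass: boolean;
  summary: string; // Markdown table suitable for a CI step summary
}

// Passes when coverage_mean is at or above the threshold, fails below it.
function checkGate(agg: Aggregate, minCoverage: number): GateResult {
  const pass = agg.coverage_mean >= minCoverage;
  const summary = [
    "| metric | value | threshold |",
    "| --- | --- | --- |",
    `| coverage_mean | ${agg.coverage_mean.toFixed(2)} | ${minCoverage.toFixed(2)} |`,
    `| unsupported_rate | ${agg.unsupported_rate.toFixed(2)} | n/a |`,
    `| cases (errored) | ${agg.n} (${agg.errored}) | n/a |`,
    "",
    pass ? "Gate passed." : "Gate failed: coverage_mean is below the threshold.",
  ].join("\n");
  return { pass, summary };
}

const result = checkGate(
  { coverage_mean: 0.82, unsupported_rate: 0.18, n: 120, errored: 0 },
  0.8,
);
console.log(result.summary);
// A CI script would exit with a non-zero status when result.pass is false.
```

Because the score is the same for the same input, a failure here points to a change in the dataset, the sources, or the model configuration rather than to run-to-run noise.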
Notes
- coverage_mean: mean of per-case coverage (grounded claims / total claims).
- unsupported_rate: total unsupported claims / total claims across all cases.
- A case that errors (e.g. extraction failure) is reported with an error field and excluded from the aggregate; its count surfaces in the aggregate as errored. Auth/quota errors fail the whole batch.
- Cases run with bounded concurrency. Max 200 cases per request (requests above that return 400 too_many_cases); page larger datasets client-side.
- Pick a fast model for large runs; reasoning models (gpt-5.5-pro) make a 200-case batch slow.
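Paging a larger dataset client-side can be sketched as follows: split the cases into pages of at most 200, send one request per page, then recompute the coverage mean from the per-case results so that pages of different sizes are weighted correctly. This is a sketch under assumptions: paginate and mergedCoverageMean are hypothetical helpers, and the per-case shape (coverage plus an optional error) is inferred from the response examples on this page.

```typescript
// Sketch: page a dataset to respect the 200-case limit, then recompute the
// coverage mean across pages from per-case results (errored cases excluded).

const MAX_CASES_PER_REQUEST = 200; // limit stated in the notes above

// Assumed per-case result shape; only the fields used here are listed.
interface CaseResult {
  coverage?: number;
  error?: unknown;
}

// Split items into consecutive pages of at most `size` elements.
function paginate<T>(items: T[], size: number = MAX_CASES_PER_REQUEST): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const pages: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    pages.push(items.slice(i, i + size));
  }
  return pages;
}

// Mean of per-case coverage over all non-errored cases in all pages.
function mergedCoverageMean(pages: CaseResult[][]) {
  let sum = 0;
  let scored = 0;
  let errored = 0;
  for (const page of pages) {
    for (const c of page) {
      if (c.error !== undefined || c.coverage === undefined) {
        errored += 1; // excluded from the mean, counted separately
        continue;
      }
      sum += c.coverage;
      scored += 1;
    }
  }
  return { coverageMean: scored > 0 ? sum / scored : NaN, scored, errored };
}

// 450 placeholder cases split into pages of 200, 200, and 50.
const pageSizes = paginate(Array.from({ length: 450 }, (_, i) => i)).map((p) => p.length);
console.log(pageSizes);

// Two pages of per-case results, one case errored.
const merged = mergedCoverageMean([
  [{ coverage: 1 }, { coverage: 0.5 }],
  [{ error: "extraction failure" }, { coverage: 0 }],
]);
console.log(merged);
```

Averaging the per-page coverage_mean values directly would over-weight a short final page, which is why the sketch goes back to per-case coverage. A combined unsupported_rate is left out because it needs per-case claim totals, which the response examples here do not show.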