Eval (groundedness)

Deterministic groundedness eval

Score how grounded your model+sources pipeline is across a whole dataset, using the verbatim check as the metric instead of an LLM judge. Most other eval tools (RAGAS, etc.) grade an LLM with another LLM; this one is exact by construction.

  • SDK: mx.verified.eval(params)
  • HTTP: POST https://api.maxmodel.com/v1/verified/eval

SDK

// TypeScript
const r = await mx.verified.eval({
  model: 'gpt-5.5-pro',
  cases: [
    { messages: [{ role: 'user', content: 'What is the refund window?' }],
      sources: [{ id: 'r.md', text: 'Full refunds within 30 days.' }] },
    { messages: [{ role: 'user', content: 'How much is Pro?' }],
      sources: [{ id: 'p.md', text: 'Pro is $29/month.' }] },
  ],
})
 
r.aggregate   // { coverageMean: 1, unsupportedRate: 0, n: 2, errored: 0 }
r.cases       // [{ coverage, citations, unsupported, error? }, ...]
r.usage       // { totalTokens, modelCalls }   ← a dataset run = many model calls

# Python
r = mx.verified.eval(model="gpt-5.5-pro", cases=[
    {"messages": [{"role": "user", "content": "refund window?"}],
     "sources": [{"id": "r.md", "text": "Full refunds within 30 days."}]},
])
r.aggregate.coverage_mean   # 1.0
r.aggregate.unsupported_rate
r.usage.model_calls

Request / response (HTTP)

// POST /v1/verified/eval
{ "model": "gpt-5.5-pro", "mode": "strict",
  "cases": [ { "messages": [...], "sources": [...] }, ... ] }

// Response
{ "aggregate": { "coverage_mean": 0.82, "unsupported_rate": 0.18, "n": 120, "errored": 0 },
  "cases": [ { "coverage": 0.9, "citations": 4, "unsupported": 1 }, ... ],
  "usage": { "total_tokens": 98000, "model_calls": 120 } }
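
For callers not using the SDK, the request above could be assembled by hand. The following is a minimal sketch, not the SDK's implementation: the `buildEvalRequest` helper and its types are hypothetical names, and the `Authorization: Bearer` header is an assumption, since this page does not state the auth scheme. Only the URL and body shape come from the documentation above.

```typescript
// Sketch only. The Bearer auth header is an ASSUMPTION, not taken from this page.
interface Message { role: string; content: string }
interface Source { id: string; text: string }
interface EvalCase { messages: Message[]; sources: Source[] }
interface EvalBody { model: string; mode?: string; cases: EvalCase[] }

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Endpoint as documented above.
const EVAL_URL = "https://api.maxmodel.com/v1/verified/eval";

// Hypothetical helper: builds the request without sending it.
function buildEvalRequest(apiKey: string, body: EvalBody): HttpRequest {
  return {
    url: EVAL_URL,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme
    },
    body: JSON.stringify(body),
  };
}

const evalRequest = buildEvalRequest("YOUR_API_KEY", {
  model: "gpt-5.5-pro",
  mode: "strict",
  cases: [
    {
      messages: [{ role: "user", content: "What is the refund window?" }],
      sources: [{ id: "r.md", text: "Full refunds within 30 days." }],
    },
  ],
});

// To send it (Node 18+ or a browser):
//   const res = await fetch(evalRequest.url, evalRequest);
//   const data = await res.json();
```

Keeping the builder separate from the network call makes the request shape easy to inspect or unit-test before any tokens are spent.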

CI quality gate

Because the eval is deterministic (same input → same score), it makes a reliable CI gate — unlike an LLM-judge metric that varies run-to-run. Drop a dataset of cases in your repo and fail the PR when groundedness regresses:

# .github/workflows/groundedness.yml
on: [pull_request]
jobs:
  gate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: maxmodel/groundedness-gate@v1
        with:
          api-key: ${{ secrets.MAXMODEL_KEY }}
          dataset: eval/dataset.json
          min-coverage: '0.8'

The action runs this eval against the dataset and fails the job (with a step-summary table) when coverage_mean drops below min-coverage. Source: action/ in the repo (zero-dependency, calls /v1/verified/eval). Same input → same gate.
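
The pass/fail decision itself is a single comparison. The sketch below illustrates that logic under stated assumptions; it is not the action's actual source. `evaluateGate` and `GateResult` are hypothetical names, and the table layout is invented for illustration. Only the `aggregate` field names and the "fail when coverage_mean is below the threshold" rule come from this page.

```typescript
// Sketch only: illustrates the gate rule, not the real action's code.
interface Aggregate {
  coverage_mean: number;
  unsupported_rate: number;
  n: number;
  errored: number;
}

interface GateResult {
  pass: boolean;
  summary: string; // Markdown table, e.g. for a CI step summary
}

// Hypothetical helper: passes when coverage_mean is at or above the threshold.
function evaluateGate(agg: Aggregate, minCoverage: number): GateResult {
  const pass = agg.coverage_mean >= minCoverage;
  const summary = [
    "| metric | value | threshold |",
    "| --- | --- | --- |",
    `| coverage_mean | ${agg.coverage_mean.toFixed(2)} | ${minCoverage.toFixed(2)} |`,
    `| unsupported_rate | ${agg.unsupported_rate.toFixed(2)} | n/a |`,
    `| errored | ${agg.errored} | n/a |`,
  ].join("\n");
  return { pass, summary };
}

// Using the aggregate from the HTTP response example and min-coverage 0.8:
const gateResult = evaluateGate(
  { coverage_mean: 0.82, unsupported_rate: 0.18, n: 120, errored: 0 },
  0.8,
);
// gateResult.pass is true; a CI script would exit non-zero when it is false.
```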

Notes

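  • coverage_mean: mean of per-case coverage (grounded claims / total claims), so every case counts equally regardless of how many claims it has.
  • unsupported_rate: total unsupported claims / total claims, pooled across all cases. Because one is a per-case mean and the other a pooled ratio, the two need not sum to 1.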
  • A case that errors (e.g. extraction failure) is reported with error and excluded from the aggregate; its count surfaces as errored. Auth/quota errors fail the whole batch.
  • Cases run with bounded concurrency. Max 200 cases per request (a larger request is rejected with 400 too_many_cases); split larger datasets into multiple requests client-side.
  • Pick a fast model for large runs; reasoning models (e.g. gpt-5.5-pro) make a 200-case batch slow.
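
The 200-case limit and the aggregate rules in the notes above suggest a simple client-side pattern for larger datasets: split the cases into batches, run each batch, then recompute the mean from the per-case results. The following is a rough sketch with hypothetical helper names (`chunk`, `combineCoverage`). It assumes an errored case carries an `error` field and no usable `coverage`, and it does not attempt to merge unsupported_rate, since per-case claim totals are not shown on this page.

```typescript
// Sketch only: helper names are hypothetical, not part of the SDK.
interface CaseResult {
  coverage?: number;
  error?: string;
}

const MAX_CASES_PER_REQUEST = 200; // limit stated in the notes above

// Split a dataset into batches no larger than `size`.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Recompute coverage mean over per-case results from all batches,
// excluding errored cases as the notes describe.
function combineCoverage(results: CaseResult[]): {
  coverageMean: number;
  scored: number;
  errored: number;
} {
  const scored = results.filter(
    (c) => c.error === undefined && typeof c.coverage === "number",
  );
  const sum = scored.reduce((acc, c) => acc + (c.coverage as number), 0);
  return {
    coverageMean: scored.length > 0 ? sum / scored.length : NaN,
    scored: scored.length,
    errored: results.length - scored.length,
  };
}

// 450 cases become batches of 200, 200, and 50.
const batchSizes = chunk(
  Array.from({ length: 450 }, (_, i) => i),
  MAX_CASES_PER_REQUEST,
).map((b) => b.length);

// Two scored cases and one errored case.
const combined = combineCoverage([
  { coverage: 1 },
  { coverage: 0.5 },
  { error: "extraction_failed" },
]);
// combined is { coverageMean: 0.75, scored: 2, errored: 1 }
```

Recomputing from per-case coverage, rather than averaging each batch's coverage_mean, avoids giving a short final batch the same weight as a full one.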