Audit-grade groundedness (legal, finance, healthcare)

In regulated and high-stakes work, an ungrounded answer is a liability, and “we test it” only counts if the test is reproducible. MaxModel’s eval is deterministic: given the same sources and the same model output, it returns the same groundedness score on every run, because scoring is a verbatim string match with no LLM judge. That’s the property an audit needs.

Honest framing. This is audit-readiness, not a compliance certification, and not legal advice. MaxModel verifies that each cited claim is traceable to a source you supplied — not that the source is correct. Use it as evidence of grounding discipline, alongside your own review.

Why deterministic matters for audit

	LLM-as-judge metrics (RAGAS, etc.)	MaxModel verified eval
Run twice, same input	scores can differ	identical
Reproducible by a third party	needs the judge model + prompt	only your source text + the output
Explainable	“the judge thought so”	the exact character span, or it’s dropped
Survives an adversarial review	hard	the check is a string match anyone can re-run

An auditor (or opposing counsel, or a regulator’s reviewer) can take your sources and your output and recompute the same groundedness number without your model, weights, or API. That is what “defensible” means.
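As a rough illustration of what that third-party recomputation could look like, the sketch below re-checks stored citations against the source text using nothing but string slicing. The record shapes (`Citation`, `AnswerRecord`), the `quote` field, the half-open `[start, end)` range convention, and the coverage formula (verified citations divided by all claims) are assumptions made for this example, not MaxModel’s documented schema.

```typescript
// Hypothetical record shapes for illustration; not MaxModel's actual schema.
interface Citation {
  source: string;          // key into the sources map
  range: [number, number]; // assumed [start, end) character offsets into that source
  quote: string;           // the text the answer claims to quote
}

interface AnswerRecord {
  citations: Citation[]; // claims that were kept as supported
  unsupported: string[]; // claims that were dropped
}

// A citation holds only if the source text at the stored offsets equals the quote exactly.
function citationHolds(sources: Record<string, string>, c: Citation): boolean {
  const text = sources[c.source];
  if (text === undefined) return false;
  const [start, end] = c.range;
  return start >= 0 && end <= text.length && text.slice(start, end) === c.quote;
}

// Assumed coverage definition: verified citations / all claims (cited + dropped).
// An answer with no claims scores 0 here; that choice is also an assumption.
function recomputeCoverage(sources: Record<string, string>, rec: AnswerRecord): number {
  const total = rec.citations.length + rec.unsupported.length;
  if (total === 0) return 0;
  const verified = rec.citations.filter((c) => citationHolds(sources, c)).length;
  return verified / total;
}

const sources: Record<string, string> = {
  lease: "The tenant shall pay rent on the first day of each month.",
};
const record: AnswerRecord = {
  citations: [{ source: "lease", range: [4, 25], quote: "tenant shall pay rent" }],
  unsupported: ["Rent is $2,000."],
};
console.log(recomputeCoverage(sources, record)); // 0.5
```

Nothing in this check calls a model or an API, so two reviewers holding the same sources and the same stored record arrive at the same number.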

The evidence

Hallucination is real in production legal AI. Stanford (Magesh et al., 2025) measured hallucination rates of roughly 17% (Lexis+ AI) to 33% (Westlaw AI) in commercial legal research tools, even though those tools return citations. “Looks cited” is not “is grounded.”
Exact matching is already the gold standard. LegalBench-RAG scores retrieval with exact file + character-index matching, not semantic similarity. MaxModel applies the same exactness to the answer: a cited quote must appear character-for-character in the source.
Citations get post-rationalized. Research finds up to 57% of RAG citations are correct-but-not-faithful (arXiv:2412.18004). A verbatim check makes that structurally impossible — no quote, no claim.
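The “no quote, no claim” rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MaxModel’s implementation: the `Claim` and `Verified` shapes and the `verifyClaims` function are hypothetical names, and the match is a plain `indexOf` with no whitespace or case normalization.

```typescript
// Hypothetical claim shape: a statement plus the quote offered as its support.
interface Claim {
  text: string;
  quote?: string; // a claim without a quote cannot be verified
}

interface Verified {
  citations: { claim: string; source: string; range: [number, number] }[];
  unsupported: string[];
}

// Keep a claim only if its quote occurs character-for-character in some source.
function verifyClaims(sources: Record<string, string>, claims: Claim[]): Verified {
  const result: Verified = { citations: [], unsupported: [] };
  for (const claim of claims) {
    let found = false;
    if (claim.quote) {
      for (const name of Object.keys(sources)) {
        const text = sources[name];
        if (text === undefined) continue;
        const start = text.indexOf(claim.quote); // exact match, no normalization
        if (start !== -1) {
          result.citations.push({
            claim: claim.text,
            source: name,
            range: [start, start + claim.quote.length],
          });
          found = true;
          break;
        }
      }
    }
    if (!found) result.unsupported.push(claim.text); // no quote, no claim
  }
  return result;
}

const policy = "Claims must be filed within 30 days of the incident.";
const checked = verifyClaims({ policy }, [
  { text: "The filing window is 30 days.", quote: "within 30 days of the incident" },
  { text: "The filing window is thirty days.", quote: "within thirty days" },
  { text: "Late claims are never accepted." },
]);
console.log(checked.citations.length, checked.unsupported.length); // 1 2
```

The second claim is dropped even though it paraphrases the source accurately. Exact matching trades recall for a check that cannot be argued with.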

How to use it

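Gate every release on a groundedness floor. Run a representative dataset through verified.eval (or the CI quality gate); fail the build if coverage_mean (reported as coverageMean in report.aggregate) drops below your threshold. Because the score is deterministic, the gate gives the same verdict on every re-run of the same outputs.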
Keep the receipts. Each verified answer returns citations with [source, range] offsets and an unsupported[] list of dropped claims — a per-answer audit trail you can store.
Tighten for tables. For numeric facts in financial/legal tables, use checkNumbers: 'row' (reference) so a number must sit in the cell whose row and column the claim names.

// A groundedness floor you can show an auditor — same score every run.
const report = await mx.verified.eval({ model: 'gpt-5.5-pro', cases })
report.aggregate          // { coverageMean, unsupportedRate, n, errored }
// Re-runnable by anyone with the sources + outputs. No model needed to reproduce it.
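
One way to turn that aggregate into a build gate is sketched below. The `Aggregate` fields are copied from the comment above; their exact meanings, the `groundednessGate` helper, and the decision to fail on any errored case are assumptions for illustration rather than documented MaxModel behavior.

```typescript
// Field names come from the report.aggregate comment; their semantics are assumed.
interface Aggregate {
  coverageMean: number;
  unsupportedRate: number;
  n: number;
  errored: number;
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

// Hypothetical gate: fail on low coverage, and also on empty or errored runs,
// so a run that evaluated nothing or partly failed cannot pass by accident.
function groundednessGate(agg: Aggregate, minCoverage: number): GateResult {
  const reasons: string[] = [];
  if (agg.n === 0) reasons.push("no cases evaluated");
  if (agg.errored > 0) reasons.push(`${agg.errored} case(s) errored`);
  if (agg.coverageMean < minCoverage) {
    reasons.push(`coverageMean ${agg.coverageMean} is below ${minCoverage}`);
  }
  return { pass: reasons.length === 0, reasons };
}

const gate = groundednessGate(
  { coverageMean: 0.97, unsupportedRate: 0.03, n: 200, errored: 0 },
  0.95,
);
console.log(gate.pass); // true
// In CI, a failing gate would typically print gate.reasons and exit non-zero.
```

Storing the `reasons` array alongside the per-answer citations gives a reviewer both the verdict and why it was reached.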

What this does and doesn’t claim

✅ Deterministic, reproducible groundedness — traceability of every claim to your sources.
✅ A defensible, re-computable metric and per-answer audit trail.
❌ Not a guarantee your sources are correct (it checks grounding, not truth).
❌ Not a regulatory compliance certification, and not legal/medical/financial advice.

See How verification works for the mechanism and Eval for the API.
