Audit-grade groundedness (legal, finance, healthcare)
In regulated and high-stakes work, an ungrounded answer is a liability — and “we test it” only counts if the test is reproducible. MaxModel’s eval is deterministic: the same input yields the same groundedness score every run, scored by a verbatim string match — no LLM judge. That’s the property an audit needs.
Honest framing. This is audit-readiness, not a compliance certification, and not legal advice. MaxModel verifies that each cited claim is traceable to a source you supplied — not that the source is correct. Use it as evidence of grounding discipline, alongside your own review.
Why deterministic matters for audit
| | LLM-as-judge metrics (RAGAS, etc.) | MaxModel verified eval |
|---|---|---|
| Run twice, same input | scores can differ | identical |
| Reproducible by a third party | needs the judge model + prompt | only your source text + the output |
| Explainable | “the judge thought so” | the exact character span, or it’s dropped |
| Survives an adversarial review | hard | the check is a string match anyone can re-run |
An auditor (or opposing counsel, or a regulator’s reviewer) can take your sources and your output and recompute the same groundedness number without your model, weights, or API. That is what “defensible” means.
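To make the "anyone can re-run it" property concrete, here is a minimal sketch of what a third-party recheck might look like. It is not MaxModel's implementation: the `Citation` shape, the `recheck` function, the half-open `[start, end)` offset convention, and the definition of coverage as verified citations divided by total citations are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; not MaxModel's actual types.
interface Citation {
  source: string;          // key into the sources map
  range: [number, number]; // assumed [start, end) character offsets into that source
  quote: string;           // the text the answer claims to quote
}

interface Recheck {
  coverage: number;   // assumed definition: verified citations / total citations
  failed: Citation[]; // citations that did not match character for character
}

// Recompute a groundedness number from source text and stored citations only.
// No model, weights, or API call is involved, so every run gives the same result.
function recheck(sources: Record<string, string>, citations: Citation[]): Recheck {
  const failed = citations.filter((c) => {
    const text = sources[c.source];
    if (text === undefined) return true;       // unknown source counts as unsupported
    const [start, end] = c.range;
    return text.slice(start, end) !== c.quote; // strict equality, no fuzzy matching
  });
  const coverage =
    citations.length === 0 ? 0 : (citations.length - failed.length) / citations.length;
  return { coverage, failed };
}

const sources = { contract: "The term of this Agreement is five (5) years." };
const result = recheck(sources, [
  { source: "contract", range: [30, 44], quote: "five (5) years" }, // present verbatim
  { source: "contract", range: [30, 44], quote: "ten (10) years" }, // not in the source
]);
console.log(result.coverage, result.failed.map((c) => c.quote));
```

The whole check is a substring comparison, which is why a reviewer needs only the sources and the stored output to reproduce the score.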
The evidence
- Hallucination is real in production legal AI. Stanford (Magesh et al., 2025) measured ~17% (Lexis+ AI) to ~33% (Westlaw AI) hallucination in commercial legal research tools — with citations. “Looks cited” is not “is grounded.”
- Exact matching is already the gold standard. LegalBench-RAG scores retrieval with exact file + character-index matching, not semantic similarity. MaxModel applies the same exactness to the answer: a cited quote must appear character-for-character in the source.
- Citations get post-rationalized. Research finds up to 57% of RAG citations are correct-but-not-faithful (arXiv:2412.18004). A verbatim check cannot show how the model reached a claim, but it does guarantee that the cited text exists in the source: no quote, no claim.
How to use it
- Gate every release on a groundedness floor. Run a representative dataset through `verified.eval` (or the CI quality gate); fail the build if `coverage_mean` drops below your threshold. Deterministic → a reliable gate.
- Keep the receipts. Each verified answer returns `citations` with `[source, range]` offsets and an `unsupported[]` list of dropped claims: a per-answer audit trail you can store.
- Tighten for tables. For numeric facts in financial/legal tables, use `checkNumbers: 'row'` (reference) so a number must sit in the cell whose row and column the claim names.
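The release gate from the first bullet could be wired up along these lines. This is a sketch under assumptions: the `Aggregate` fields mirror the `report.aggregate` shape shown on this page, but their numeric types, the `groundednessGate` helper, the 0.95 floor, and the sample numbers are hypothetical and not part of MaxModel's API.

```typescript
// Field names follow report.aggregate as shown on this page; numeric types are assumed.
interface Aggregate {
  coverageMean: number;
  unsupportedRate: number;
  n: number;
  errored: number;
}

interface GateResult {
  pass: boolean;
  reasons: string[]; // human-readable failure reasons, empty when the gate passes
}

// Hypothetical helper: decide whether a build may ship, given an aggregate and a floor.
function groundednessGate(agg: Aggregate, floor: number): GateResult {
  const reasons: string[] = [];
  if (agg.n === 0) reasons.push("no cases were evaluated");
  if (agg.errored > 0) reasons.push(`${agg.errored} case(s) errored`);
  if (agg.coverageMean < floor) {
    reasons.push(`coverageMean ${agg.coverageMean} is below floor ${floor}`);
  }
  return { pass: reasons.length === 0, reasons };
}

// Illustrative numbers only; in CI the aggregate would come from the eval report.
const gate = groundednessGate(
  { coverageMean: 0.91, unsupportedRate: 0.09, n: 200, errored: 0 },
  0.95,
);
if (!gate.pass) {
  console.error("Groundedness gate failed: " + gate.reasons.join("; "));
}
```

Treating an empty or errored run as a failure is a design choice in this sketch: a gate that passes when nothing was measured is hard to defend in an audit. A CI job would typically exit non-zero whenever `gate.pass` is false.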
```typescript
// A groundedness floor you can show an auditor: same score every run.
const report = await mx.verified.eval({ model: 'gpt-5.5-pro', cases })
report.aggregate // { coverageMean, unsupportedRate, n, errored }
// Re-runnable by anyone with the sources + outputs. No model needed to reproduce it.
```
What this does and doesn’t claim
- ✅ Deterministic, reproducible groundedness — traceability of every claim to your sources.
- ✅ A defensible, re-computable metric and per-answer audit trail.
- ❌ Not a guarantee your sources are correct (it checks grounding, not truth).
- ❌ Not a regulatory compliance certification, and not legal/medical/financial advice.
See How verification works for the mechanism and Eval for the API.