Audit-grade eval

Audit-grade groundedness (legal, finance, healthcare)

In regulated and high-stakes work, an ungrounded answer is a liability — and “we test it” only counts if the test is reproducible. MaxModel’s eval is deterministic: the same input yields the same groundedness score every run, scored by a verbatim string match — no LLM judge. That’s the property an audit needs.

Honest framing. This is audit-readiness, not a compliance certification, and not legal advice. MaxModel verifies that each cited claim is traceable to a source you supplied — not that the source is correct. Use it as evidence of grounding discipline, alongside your own review.

Why deterministic matters for audit

| | LLM-as-judge metrics (RAGAS, etc.) | MaxModel verified eval |
| --- | --- | --- |
| Run twice, same input | scores can differ | identical |
| Reproducible by a third party | needs the judge model + prompt | only your source text + the output |
| Explainable | “the judge thought so” | the exact character span, or it’s dropped |
| Survives an adversarial review | hard | the check is a string match anyone can re-run |

An auditor (or opposing counsel, or a regulator’s reviewer) can take your sources and your output and recompute the same groundedness number without your model, weights, or API. That is what “defensible” means.
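
The recomputation described above can be sketched in a few lines. This is a minimal illustration, not MaxModel's implementation: the `Claim` shape, the `recomputeCoverage` helper, and the definition of coverage as "supported claims divided by total claims" are all assumptions made for the example, and the sample contract text is invented.

```typescript
// Hypothetical shapes: field names are illustrative, not MaxModel's schema.
interface Claim {
  sourceId: string; // which supplied source the claim cites
  quote: string;    // the quoted text the claim relies on
}

interface Located {
  claim: Claim;
  range: [number, number]; // [start, end) character offsets into the source
}

interface Recomputed {
  supported: Located[];
  unsupported: Claim[];
  coverage: number; // assumed definition: supported / total (0 if no claims)
}

// Re-check every cited quote against the supplied sources by exact match.
function recomputeCoverage(
  sources: Record<string, string>,
  claims: Claim[],
): Recomputed {
  const supported: Located[] = [];
  const unsupported: Claim[] = [];
  for (const claim of claims) {
    const text: string | undefined = sources[claim.sourceId];
    // Exact, case-sensitive match: no normalization, no similarity threshold.
    const start =
      text === undefined || claim.quote.length === 0
        ? -1
        : text.indexOf(claim.quote);
    if (start >= 0) {
      supported.push({ claim, range: [start, start + claim.quote.length] });
    } else {
      unsupported.push(claim);
    }
  }
  const coverage = claims.length === 0 ? 0 : supported.length / claims.length;
  return { supported, unsupported, coverage };
}

// Invented sample data, for illustration only.
const sources: Record<string, string> = {
  contract:
    "The Supplier shall deliver the goods within 30 days of the order date.",
};
const claims: Claim[] = [
  { sourceId: "contract", quote: "within 30 days of the order date" },
  { sourceId: "contract", quote: "within 60 days" }, // not in the source
];
const result = recomputeCoverage(sources, claims);
console.log(result.coverage, result.supported.map((s) => s.range));
```

Because the only operation is `String.prototype.indexOf`, two parties holding the same sources and the same output arrive at the same spans and the same ratio. No model call is involved at any point.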

The evidence

  • Hallucination is real in production legal AI. Stanford (Magesh et al., 2025) measured hallucination rates of roughly 17% (Lexis+ AI) to 33% (Westlaw AI) in commercial legal research tools, even though those tools cite their sources. “Looks cited” is not “is grounded.”
  • Exact matching is already the gold standard. LegalBench-RAG scores retrieval with exact file + character-index matching, not semantic similarity. MaxModel applies the same exactness to the answer: a cited quote must appear character-for-character in the source.
  • Citations get post-rationalized. Research finds up to 57% of RAG citations are correct-but-not-faithful (arXiv:2412.18004). A verbatim check makes that structurally impossible — no quote, no claim.

How to use it

  1. Gate every release on a groundedness floor. Run a representative dataset through verified.eval (or the CI quality gate) and fail the build if mean coverage (coverageMean in report.aggregate) drops below your threshold. Because the score is deterministic, the gate gives the same verdict on every re-run.
  2. Keep the receipts. Each verified answer returns citations with [source, range] offsets and an unsupported[] list of dropped claims — a per-answer audit trail you can store.
  3. Tighten for tables. For numeric facts in financial/legal tables, use checkNumbers: 'row' (reference) so a number must sit in the cell whose row and column the claim names.

// A groundedness floor you can show an auditor: same score every run.
const report = await mx.verified.eval({ model: 'gpt-5.5-pro', cases })
report.aggregate          // { coverageMean, unsupportedRate, n, errored }
// Re-runnable by anyone with the sources + outputs. No model needed to reproduce it.
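
The release gate from step 1 could look roughly like the sketch below. It assumes the aggregate has the four fields shown in the snippet above and that `coverageMean` lies in [0, 1]. The `checkGroundednessFloor` helper is hypothetical, not part of MaxModel, as is the choice to fail on errored cases, and the numbers are invented sample values.

```typescript
// Field names mirror the aggregate shown above; their exact semantics are assumed.
interface Aggregate {
  coverageMean: number;    // assumed to be a fraction in [0, 1]
  unsupportedRate: number; // assumed to be a fraction in [0, 1]
  n: number;               // number of cases evaluated
  errored: number;         // number of cases that failed to run
}

interface GateResult {
  pass: boolean;
  reasons: string[];
}

// Hypothetical helper: decide whether a release clears the groundedness floor.
function checkGroundednessFloor(agg: Aggregate, floor: number): GateResult {
  const reasons: string[] = [];
  if (agg.n === 0) {
    reasons.push("no cases were evaluated");
  }
  if (agg.errored > 0) {
    // Design choice for this sketch: an errored case means the score is incomplete.
    reasons.push(`${agg.errored} case(s) errored`);
  }
  if (agg.coverageMean < floor) {
    reasons.push(`coverageMean ${agg.coverageMean} is below floor ${floor}`);
  }
  return { pass: reasons.length === 0, reasons };
}

// Invented sample aggregate, for illustration only.
const gate = checkGroundednessFloor(
  { coverageMean: 0.91, unsupportedRate: 0.09, n: 200, errored: 0 },
  0.95,
);
if (!gate.pass) {
  console.error(`Groundedness gate failed: ${gate.reasons.join("; ")}`);
  // In CI, exit non-zero at this point so the build fails.
}
```

Keeping the check a pure function of the aggregate means the stored report is enough to replay the gate decision later, which fits the "keep the receipts" step.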

What this does and doesn’t claim

  • ✅ Deterministic, reproducible groundedness — traceability of every claim to your sources.
  • ✅ A defensible, re-computable metric and per-answer audit trail.
  • ❌ Not a guarantee your sources are correct (it checks grounding, not truth).
  • ❌ Not a regulatory compliance certification, and not legal/medical/financial advice.

See How verification works for the mechanism and Eval for the API.