Benchmarks & proof
This is a verification product, so we won’t ask you to trust our accuracy claims — we hand
you the experiments instead. The numbers below are from peer-reviewed papers (cited) and from
our own runs (labeled [our run]). Both of our experiments reproduce in a few dozen lines;
the source is in benchmarks/ — run it yourself.
Model-based grounding has a floor. A substring check doesn’t.
The usual way to catch a hallucination is to have a second model judge whether the answer is grounded. These detectors are good — and, by construction, never 100%, because they’re classifiers and judges, not proofs:
| Detector | Method | Reported accuracy (scope) | <100% because |
|---|---|---|---|
| Patronus Lynx (70B) | fine-tuned Llama-3 + CoT | 87.4% (HaluBench, Jul 2024) | it’s a model |
| Vectara HHEM-2.1 | fine-tuned FLAN-T5 classifier | 76.55% bal. acc (AggreFact-SOTA) | it’s a model |
| RAGAS Faithfulness | LLM-based metric | 66.9% (HaluBench) | it’s a model |
These are each tool’s own reported figure, scoped to a named benchmark and date. Newer models may score higher — but a classifier has a non-zero error rate by construction, and that doesn’t move with the leaderboard. A verbatim substring check has zero classification error for what it covers.
For paraphrase and synthesis, those detectors are the only thing that works — use them. For the part of an answer that’s a direct quote, you don’t need an 87%-accurate model.
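A verbatim check of the kind described above can be sketched in a few lines. This is a minimal illustration, not MaxModel's implementation: the helper names `normalize` and `isVerbatim` are hypothetical, and the normalization shown (Unicode NFC plus whitespace collapsing) is an assumption standing in for the disclosed rules.

```javascript
// Hypothetical helpers for illustration only.
// Assumed normalization: Unicode NFC + collapse runs of whitespace.
function normalize(text) {
  return text.normalize("NFC").replace(/\s+/g, " ").trim();
}

// True only if the normalized quote appears as a substring of the
// normalized source. An empty quote never counts as supported.
function isVerbatim(quote, source) {
  const q = normalize(quote);
  return q.length > 0 && normalize(source).includes(q);
}

// Made-up sample text, not taken from any benchmark.
const source = "The trial enrolled 412 patients across nine sites.";
console.log(isVerbatim("enrolled 412 patients", source)); // true
console.log(isVerbatim("enrolled 421 patients", source)); // false (altered number)
```

The check is a pure function of its two inputs, so repeated runs on the same data cannot disagree. It also says nothing about paraphrases, which will fail it even when they are faithful.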
Demo 1 — the judge doesn’t agree with itself
“Rating Roulette” (Haldar & Hockenmaier, EMNLP 2025 Findings) ran LLM judges three times each, identical prompt and settings, and measured self-agreement (Krippendorff’s Alpha; 0.8 = “good agreement”):
| Judge (self-agreement, 3 reruns) | SummaC | MT-Bench |
|---|---|---|
| Llama-3.1-70B | 0.33 | 0.27 |
| DeepSeek-R1-Distill | 0.63 | 0.51 |
| Qwen3-32B (best) | 0.79 | 0.56 |
Every judge in the paper lands below the 0.8 bar on the same input. Above random (Alpha 0 is chance) — but below the reliability bar. Not a coin flip; just not something to call verified.
Reproduce it: rate groundedness (1–5) with an LLM judge over the same items three times, report the disagreement; run a verbatim check over the same items and diff the runs.
    # [our run] reproduces Rating Roulette + the deterministic contrast
    export MAXMODEL_KEY=sk-...
    node demo-variance.mjs # judge α across 3 runs vs verbatim (identical)
[our run]: judge claude-haiku-4-5 (a strong 2025 model), 16 items, 3 identical reruns, 1–5 rating, temperature 1. Krippendorff’s α ≈ 0.84, with 3 of 16 items changing their rating across the identical reruns. The deterministic verbatim check over the same 16 items was byte-identical across all 3 runs (α = 1.000). Even a strong modern judge sits only just above the 0.8 reliability bar and still flips ~1 in 5 items; the paper’s judges score lower (0.27–0.79). The verbatim check has no variance to flip. (Numbers are stochastic: re-run and you’ll get something close, which is exactly the point.)
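For readers who want to compute the agreement statistic themselves, here is a minimal sketch of Krippendorff's α over repeated runs. It is not the code in `demo-variance.mjs`. It assumes complete data (every item rated in every run, at least two runs per item) and the interval distance (a − b)²; an ordinal distance is another common choice for 1–5 ratings. The function names and the sample ratings are hypothetical.

```javascript
// ratings: one array per item, holding that item's rating from each run.
// Assumptions: no missing ratings, >= 2 runs per item, interval distance.
function krippendorffAlpha(ratings) {
  const pooled = ratings.flat();
  const n = pooled.length;
  const dist = (a, b) => (a - b) ** 2;

  // Observed disagreement: ordered pairs of ratings within the same item.
  let observed = 0;
  for (const runs of ratings) {
    let within = 0;
    for (let i = 0; i < runs.length; i++) {
      for (let j = 0; j < runs.length; j++) {
        if (i !== j) within += dist(runs[i], runs[j]);
      }
    }
    observed += within / (runs.length - 1);
  }
  observed /= n;

  // Expected disagreement: ordered pairs drawn from the whole pool.
  let expected = 0;
  for (let i = 0; i < n; i++) {
    for (let j = 0; j < n; j++) {
      if (i !== j) expected += dist(pooled[i], pooled[j]);
    }
  }
  expected /= n * (n - 1);

  // Alpha is undefined when every rating in the pool is identical.
  return expected === 0 ? NaN : 1 - observed / expected;
}

// Number of items whose rating changed at least once across runs.
function countFlipped(ratings) {
  return ratings.filter((runs) => new Set(runs).size > 1).length;
}

// Made-up toy data: 4 items x 3 runs.
const judgeRuns = [[4, 4, 4], [2, 3, 2], [5, 5, 5], [1, 1, 1]];
const verbatimRuns = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [1, 1, 1]];
console.log(krippendorffAlpha(judgeRuns), countFlipped(judgeRuns));
console.log(krippendorffAlpha(verbatimRuns)); // 1 (no within-item disagreement)
```

Encoding the verbatim result as 0/1 per item lets the same function score both checks. Identical runs give zero observed disagreement, so α is exactly 1 whenever the items themselves differ.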
Demo 2 — how many “citations” are actually quotable?
A citation can look correct and still be post-rationalized — cited but not actually relied on (Wallat et al., Correctness is not Faithfulness in RAG Attributions, ICTIR 2025). On long-form ELI5, base RAG systems fully support only ~50% of their statements (ALCE, EMNLP 2023, 2023-era models).
So: take attributed answers and check each quoted span against its cited source with a verbatim substring match. Report the verbatim-presence rate, not “% fake,” because a valid paraphrase will correctly fail a verbatim check. The gap between “looks cited” and “is quotable” is the whole point.
    node demo-audit.mjs # verbatim-presence over freshly generated attributed answers
[our run]: 20 ELI5-style questions, generator claude-haiku-4-5 explicitly told to quote the source verbatim. Of 60 quoted spans, 1 (1.7%) was not verbatim-present; it was a paraphrase wrapped in quotation marks. That’s the best case: a strong model, instructed to quote exactly. With weaker models, looser “cite the source” prompting, or multi-source synthesis, the gap widens, and MaxModel drops exactly those spans into unsupported[] instead of shipping them. A first-party number from your own model beats any third-party stat: run it on the model you actually use.
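The audit step can be sketched as follows. This is an illustration under stated assumptions, not the contents of `demo-audit.mjs`: the span shape `{ quote, source }`, the function name `auditSpans`, and the normalization (NFC plus whitespace collapsing) are all hypothetical. Only the idea of routing failed spans into an `unsupported` list comes from the text above.

```javascript
// Assumed normalization: Unicode NFC + collapse runs of whitespace.
function normalize(text) {
  return text.normalize("NFC").replace(/\s+/g, " ").trim();
}

// spans: [{ quote, source }], where source is the cited document's text.
// Splits spans by verbatim presence and reports the presence rate.
function auditSpans(spans) {
  const supported = [];
  const unsupported = [];
  for (const span of spans) {
    const q = normalize(span.quote);
    const present = q.length > 0 && normalize(span.source).includes(q);
    (present ? supported : unsupported).push(span);
  }
  const verbatimPresenceRate =
    spans.length === 0 ? NaN : supported.length / spans.length;
  return { supported, unsupported, verbatimPresenceRate };
}

// Made-up sample: two exact quotes and one paraphrase in quotation marks.
const doc = "Water boils at 100 °C at sea level. Pressure lowers with altitude.";
const report = auditSpans([
  { quote: "Water boils at 100 °C at sea level", source: doc },
  { quote: "Pressure lowers with altitude", source: doc },
  { quote: "Water boils at a lower temperature up high", source: doc },
]);
console.log(report.verbatimPresenceRate, report.unsupported.length);
```

The third span is a reasonable inference from the source but not a quote, so it lands in `unsupported`. That is why the figure reported is a verbatim-presence rate and not a count of fabricated citations.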
Where this does not work
Verbatim checking is an exact floor for the extractive/quotable subset — citations, numbers, the load-bearing claims in legal/medical/financial text. It does not cover paraphrase or synthesis (those need a probabilistic detector), it depends on the model emitting a real quote, and multilingual normalization has edge cases (we disclose the rules — see How verification works). It’s a complement to Lynx/HHEM/RAGAS, not a replacement.
Sources
- Haldar & Hockenmaier, Rating Roulette, EMNLP 2025 Findings.
- Wallat et al., Correctness is not Faithfulness in RAG Attributions, ICTIR/SIGIR 2025 (arXiv:2412.18004).
- Gao et al., Enabling LLMs to Generate Text with Citations (ALCE), EMNLP 2023 (arXiv:2305.14627).
- Lynx, Patronus/Contextual/Stanford 2024 (arXiv:2407.08488); Vectara HHEM-2.1-Open model card.
- Magesh et al., Hallucination-Free?, JELS 2025 — commercial legal RAG tools hallucinate 17–33%.