Benchmarks & proof

This is a verification product, so we won’t ask you to trust our accuracy claims — we hand you the experiments instead. The numbers below are from peer-reviewed papers (cited) and from our own runs (labeled [our run]). Both of our experiments reproduce in a few dozen lines; the source is in benchmarks/ — run it yourself.

Model-based grounding has a floor. A substring check doesn’t.

The usual way to catch a hallucination is to have a second model judge whether the answer is grounded. These detectors are good — and, by construction, never 100%, because they’re classifiers and judges, not proofs:

| Detector | Method | Reported accuracy (scope) | <100% because |
| --- | --- | --- | --- |
| Patronus Lynx (70B) | fine-tuned Llama-3 + CoT | 87.4% (HaluBench, Jul 2024) | it’s a model |
| Vectara HHEM-2.1 | fine-tuned FLAN-T5 classifier | 76.55% bal. acc (AggreFact-SOTA) | it’s a model |
| RAGAS Faithfulness | LLM-based metric | 66.9% (HaluBench) | it’s a model |

These are each tool’s own reported figure, scoped to a named benchmark and date. Newer models may score higher — but a classifier has a non-zero error rate by construction, and that doesn’t move with the leaderboard. A verbatim substring check has zero classification error for what it covers.

For paraphrase and synthesis, those detectors are the only thing that works — use them. For the part of an answer that’s a direct quote, you don’t need an 87%-accurate model.

Demo 1 — the judge doesn’t agree with itself

“Rating Roulette” (Haldar & Hockenmaier, EMNLP 2025 Findings) ran LLM judges three times each, identical prompt and settings, and measured self-agreement (Krippendorff’s Alpha; 0.8 = “good agreement”):

| Judge (self-agreement, 3 reruns) | SummaC | MT-Bench |
| --- | --- | --- |
| Llama-3.1-70B | 0.33 | 0.27 |
| DeepSeek-R1-Distill | 0.63 | 0.51 |
| Qwen3-32B (best) | 0.79 | 0.56 |

Every judge in the paper lands below the 0.8 bar on the same input. Above random (Alpha 0 is chance) — but below the reliability bar. Not a coin flip; just not something to call verified.

Reproduce it: rate groundedness (1–5) with an LLM judge over the same items three times, report the disagreement; run a verbatim check over the same items and diff the runs.

# [our run] reproduces Rating Roulette + the deterministic contrast
export MAXMODEL_KEY=sk-...
node demo-variance.mjs    # judge α across 3 runs vs verbatim (identical)

[our run]: judge claude-haiku-4-5 (a strong 2025 model), 16 items, 3 identical reruns, 1–5 rating, temperature 1. Krippendorff’s α ≈ 0.84, with 3 of 16 items (about 19%) changing their rating across the identical reruns. The deterministic verbatim check over the same 16 items was byte-identical across all 3 runs (α = 1.000). Even a strong modern judge lands only just above the 0.8 reliability bar and still flips roughly 1 in 5 items; the paper’s judges score lower (0.27–0.79). The verbatim check has no variance to flip. (The judge numbers are stochastic: re-run and you’ll get something close but not identical, which is exactly the point.)
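For readers who want to see how a self-agreement figure like this can be computed, here is a minimal sketch. It assumes ratings are collected as one array per item with one entry per rerun, and it uses the interval form of Krippendorff's α with no missing ratings. The exact α variant used by the paper and by demo-variance.mjs is not stated on this page, so the metric choice, the function names, and the sample data are illustrative assumptions, not the benchmark's own code.

```javascript
// Krippendorff's alpha, interval metric, complete data (no missing ratings).
// ratings[u] holds the ratings one item received across identical reruns.
// Assumption: interval distance (a - b)^2; other variants give other values.
function krippendorffAlphaInterval(ratings) {
  const all = ratings.flat();
  const n = all.length;
  const delta = (a, b) => (a - b) ** 2;

  // Observed disagreement: ordered pairs of ratings within each item.
  let observed = 0;
  for (const unit of ratings) {
    const m = unit.length;
    if (m < 2) throw new Error("each item needs at least two ratings");
    let within = 0;
    for (let i = 0; i < m; i++) {
      for (let j = 0; j < m; j++) {
        if (i !== j) within += delta(unit[i], unit[j]);
      }
    }
    observed += within / (m - 1);
  }
  observed /= n;

  // Expected disagreement: ordered pairs over all ratings pooled together.
  let expected = 0;
  for (let a = 0; a < n; a++) {
    for (let b = 0; b < n; b++) {
      if (a !== b) expected += delta(all[a], all[b]);
    }
  }
  expected /= n * (n - 1);

  // Alpha is undefined when every rating is the same value.
  return expected === 0 ? NaN : 1 - observed / expected;
}

// Count items whose rating changed at least once across reruns.
function countFlipped(ratings) {
  return ratings.filter((unit) => unit.some((r) => r !== unit[0])).length;
}

// Made-up sample data: three items, three reruns each.
const judgeRuns = [[1, 1, 1], [5, 5, 4], [3, 3, 3]];    // second item flips once
const verbatimRuns = [[1, 1, 1], [0, 0, 0], [1, 1, 1]]; // pass/fail, identical every run

const judgeAlpha = krippendorffAlphaInterval(judgeRuns);
const verbatimAlpha = krippendorffAlphaInterval(verbatimRuns);
console.log(judgeAlpha.toFixed(3), countFlipped(judgeRuns));       // 0.957 1
console.log(verbatimAlpha.toFixed(3), countFlipped(verbatimRuns)); // 1.000 0
```

A deterministic check always has zero within-item disagreement, so its α is 1 whenever the items do not all share the same result. A sampled judge only reaches 1 if no rating ever changes.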

Demo 2 — how many “citations” are actually quotable?

A citation can look correct and still be post-rationalized — cited but not actually relied on (Wallat et al., Correctness is not Faithfulness in RAG Attributions, ICTIR 2025). On long-form ELI5, base RAG systems fully support only ~50% of their statements (ALCE, EMNLP 2023, 2023-era models).

So: take attributed answers and check each quoted span against its cited source with a verbatim substring match. Report the verbatim-presence rate, not “% fake,” because a valid paraphrase will correctly fail a verbatim check. The gap between “looks cited” and “is quotable” is the whole point.

node demo-audit.mjs       # verbatim-presence over freshly generated attributed answers

[our run] — 20 ELI5-style questions, generator claude-haiku-4-5 explicitly told to quote the source verbatim: of 60 quoted spans, 1 (1.7%) was not verbatim-present — a paraphrase wrapped in quotation marks. That’s the best case: a strong model, instructed to quote exactly. With weaker models, looser “cite the source” prompting, or multi-source synthesis, the gap widens — and MaxModel drops exactly those spans into unsupported[] instead of shipping them. A first-party number from your own model beats any third-party stat: run it on the model you actually use.
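The audit described above comes down to a substring test per quoted span plus a ratio. The sketch below is one way to write it, not the contents of demo-audit.mjs: the input shape (`quote`, `source`), the function name `auditSpans`, and the return shape are hypothetical. The `unsupported` array only mirrors the name used above, and no normalization is applied.

```javascript
// Hypothetical input: each span carries the quoted text and the full text
// of the source it cites. No normalization here: the match is byte-for-byte.
function auditSpans(spans) {
  const supported = [];
  const unsupported = [];
  for (const span of spans) {
    // An empty quote is trivially "contained" in any string, so reject it.
    const present = span.quote.length > 0 && span.source.includes(span.quote);
    (present ? supported : unsupported).push(span);
  }
  const verbatimPresenceRate =
    spans.length === 0 ? NaN : supported.length / spans.length;
  return { supported, unsupported, verbatimPresenceRate };
}

// Made-up example: two exact quotes and one paraphrase wrapped in quotation marks.
const source = "Water boils at 100 degrees Celsius at sea level. Pressure changes the boiling point.";
const audit = auditSpans([
  { quote: "Water boils at 100 degrees Celsius at sea level.", source },
  { quote: "Pressure changes the boiling point.", source },
  { quote: "Water boils at 100 C when you are at sea level.", source },
]);

console.log(audit.verbatimPresenceRate); // 2 of 3 spans are verbatim-present
console.log(audit.unsupported.map((s) => s.quote));
```

The third span is a reasonable paraphrase and still fails, which is why the result is reported as a verbatim-presence rate rather than as a count of fabricated citations.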

Where this does not work

Verbatim checking is an exact floor for the extractive/quotable subset — citations, numbers, the load-bearing claims in legal/medical/financial text. It does not cover paraphrase or synthesis (those need a probabilistic detector), it depends on the model emitting a real quote, and multilingual normalization has edge cases (we disclose the rules — see How verification works). It’s a complement to Lynx/HHEM/RAGAS, not a replacement.
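To make the normalization caveat concrete, here is a small illustration of one such edge case: the same visible text encoded two ways in Unicode. The `normalizeForMatch` helper and its rules (NFC plus whitespace collapsing) are assumptions chosen for the example. They are not MaxModel's actual rules, which are described in How verification works.

```javascript
// The same visible word, encoded two ways.
const composed = "caf\u00e9";    // "é" as a single code point (NFC form)
const decomposed = "cafe\u0301"; // "e" followed by a combining acute accent (NFD form)

const sourceText = `The ${decomposed}  opens at 9.`; // note the double space
const quotedSpan = `The ${composed} opens at 9.`;

// A raw substring check treats these as different strings.
const rawMatch = sourceText.includes(quotedSpan);

// Illustrative normalization (an assumption, not the product's rule set):
// canonical composition, then collapse runs of whitespace.
function normalizeForMatch(text) {
  return text.normalize("NFC").replace(/\s+/g, " ").trim();
}

const normalizedMatch = normalizeForMatch(sourceText).includes(
  normalizeForMatch(quotedSpan)
);

console.log(rawMatch, normalizedMatch); // false true
```

Every normalization rule trades a little exactness for fewer false rejections, which is why the rules need to be disclosed rather than implied.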

Sources

  • Haldar & Hockenmaier, Rating Roulette, EMNLP 2025 Findings.
  • Wallat et al., Correctness is not Faithfulness in RAG Attributions, ICTIR/SIGIR 2025 (arXiv:2412.18004).
  • Gao et al., Enabling LLMs to Generate Text with Citations (ALCE), EMNLP 2023 (arXiv:2305.14627).
  • Lynx, Patronus/Contextual/Stanford 2024 (arXiv:2407.08488); Vectara HHEM-2.1-Open model card.
  • Magesh et al., Hallucination-Free?, JELS 2025 — commercial legal RAG tools hallucinate 17–33%.