Limits & latency

The honest numbers on size caps, cost, and where the time goes.

Size limits

| Limit | Value | What happens past it |
| --- | --- | --- |
| Total sources text per call | ~100k characters | 400 sources_too_large |
| Cases per verified.eval request | 200 | 400 too_many_cases; page larger sets client-side |
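Paging a larger dataset client-side can be as simple as slicing it into batches of at most 200 cases and sending one verified.eval request per batch. The sketch below shows only the slicing; `page_cases` and `MAX_CASES_PER_REQUEST` are illustrative names, not part of the SDK, and the shape of a "case" is left generic because this page does not define it.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")

# Documented cap on cases per verified.eval request (from the table above).
MAX_CASES_PER_REQUEST = 200


def page_cases(
    cases: Sequence[T], page_size: int = MAX_CASES_PER_REQUEST
) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `cases`, each at most `page_size` long."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    for start in range(0, len(cases), page_size):
        yield cases[start:start + page_size]


# Example: 450 cases become three requests of 200, 200, and 50 cases.
page_lengths = [len(page) for page in page_cases(list(range(450)))]
```

Each page is still one model call per case, so paging changes the number of requests, not the total cost.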

sources is your own retrieved context, so keep it to the chunks that could plausibly answer the question. Adding more sources gives the verifier more text to match against; it does not make the grounding better.
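One way to stay under the cap is to trim your retrieved chunks before the call. The sketch below assumes sources are plain strings already ranked most-relevant first, and treats the "~100k characters" figure as a round 100,000 budget; the exact limit and how the service counts characters are assumptions here, and `fit_sources` is a hypothetical helper, not an SDK function.

```python
from typing import List

# Approximate budget; the page only states "~100k characters".
SOURCES_CHAR_BUDGET = 100_000


def fit_sources(
    ranked_sources: List[str], budget: int = SOURCES_CHAR_BUDGET
) -> List[str]:
    """Keep the highest-ranked sources until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for text in ranked_sources:
        if used + len(text) > budget:
            break  # stop at the first chunk that does not fit
        kept.append(text)
        used += len(text)
    return kept


# Example: the third chunk would push the total to 110,000 characters, so it is dropped.
chunks = ["a" * 60_000, "b" * 30_000, "c" * 20_000]
kept_lengths = [len(c) for c in fit_sources(chunks)]
```

Stopping at the first chunk that does not fit keeps the ranking intact; a stricter margin below the budget may be safer given that the documented limit is approximate.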

Cost: a verified call is ≥ 1 model call

Verification is a layer on top of a normal gateway call, so the model usage is billed as usual on your gateway key:

  • verified.create routes at least one model call (the EXTRACT step). usage.modelCalls reports how many.
  • retry: N re-generates the answer while any claim is still unsupported, up to N more times, so a call with retry can cost up to N + 1 model calls. It stops early once every claim is grounded.
  • verified.eval is one model call per case. A 200-case dataset is 200 model calls; pick a fast model for large runs (see Models).

The deterministic VERIFY step itself adds no model cost — it’s local string matching.
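The arithmetic above is small enough to put in a budgeting helper. This is a sketch of the upper bounds as stated on this page, not a billing formula; the function names are illustrative, and the actual count for a given call is whatever usage.modelCalls reports.

```python
def max_create_model_calls(retry: int = 0) -> int:
    """Upper bound for one verified.create call: the first generation plus up to `retry` more."""
    if retry < 0:
        raise ValueError("retry must be non-negative")
    return retry + 1


def eval_model_calls(num_cases: int) -> int:
    """verified.eval is described as one model call per case."""
    if num_cases < 0:
        raise ValueError("num_cases must be non-negative")
    return num_cases


# Example: retry=2 can cost up to 3 model calls; a 200-case eval is 200 model calls.
create_bound = max_create_model_calls(retry=2)
eval_total = eval_model_calls(200)
```

Because retry stops early once the answer is fully grounded, the real number is often below the bound.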

Where the latency goes

A verified call’s wall-clock time is almost entirely the underlying model call. The verification step is plain substring matching over your sources (bounded by the ~100k-char budget), so it’s negligible next to generation: microseconds to milliseconds of local CPU, not a second model round-trip.

That means the practical latency levers are the ones you already know:

  • Model choice dominates. A responses-endpoint model like gpt-5.5-pro can take 10–20s; a fast chat model (claude-haiku-4-5-20251001, gpt-5-mini, gemini-2.5-flash) returns in ~1–2s. Same verification either way.
  • retry multiplies latency the same way it multiplies cost — each retry is another full generation.
  • verified.create is non-streaming by design: the claim set has to be complete before the check can return a trustworthy answer. If you need token streaming, use chat.completions.create (unverified).
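To make the first two levers concrete, here is a rough worst-case estimate. It assumes every attempt costs one full generation and ignores verification time as negligible, as described above; the per-generation figures are the illustrative ranges from this page, not measurements, and `worst_case_latency_s` is a hypothetical helper.

```python
def worst_case_latency_s(per_generation_s: float, retry: int = 0) -> float:
    """Upper bound on wall-clock seconds: the first generation plus up to `retry` more."""
    if per_generation_s < 0 or retry < 0:
        raise ValueError("inputs must be non-negative")
    return per_generation_s * (retry + 1)


# With retry=2, a ~2s fast model is bounded by ~6s; a ~20s slow model by ~60s.
fast_bound = worst_case_latency_s(2.0, retry=2)
slow_bound = worst_case_latency_s(20.0, retry=2)
```

The gap between the two bounds is why model choice matters more than any other setting when retry is enabled.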

Rate limits & quotas

Verified output uses the same key and the same gateway quota as your other maxmodel.com calls — there’s no separate rate-limit pool. If you hit the gateway’s limit you’ll get the standard rate-limit error (see Errors); back off and retry.
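"Back off and retry" is commonly implemented as exponential backoff with jitter. The sketch below is generic and makes no claim about the SDK: `RateLimitError` is a hypothetical stand-in for whatever the gateway client raises (see Errors for the real error), and the delay parameters are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical stand-in for the gateway's rate-limit error."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on RateLimitError with capped exponential backoff and full jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter: wait anywhere in [0, cap]
    raise RuntimeError("unreachable")
```

Wrap the whole verified call in `fn`. Keep in mind that these client-side retries are separate from the retry option, so each one can itself trigger up to N + 1 model calls.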