What we measure
We run every notable VLM on live video and score it on five axes. Capability, speed, latency, and cost are measured directly; real-time readiness is a verdict computed on top of them.
- Capability
- How well a model actually sees: OCR, documents, spatial reasoning, video, grounding, structured extraction, rolled into one weighted index.
- Speed
- Sustained output tokens per second at batch size 1. How fast results stream back once generation is underway.
- Latency
- Time to first token, think time, and end-to-end wall-clock on a standardized short vision task. The numbers that decide whether you can close the loop on live video.
- Cost
- Blended price per 1M tokens and amortized cost per task at the reference provider. What a live workload actually bills.
- Real-time readiness
- The verdict: can this model answer a live frame in under 200 ms end-to-end at its best region? A pass/fail gate on top of the raw latency.
Each axis has its own leaderboard. Start with the VLM Intelligence Index, then drill into capability, latency, and the rest.
The six capability evaluations
Capability is decomposed into six independent evals. Each is scored as accuracy (percent correct) on a held-out set, run identically across every model. Together they roll up into the Intelligence Index below.
- OCR & Text (% acc)
- Reading printed and handwritten text, signs, and labels off frames and dense documents.
- Document & Chart (% acc)
- Reading documents, tables, charts, and diagrams (DocVQA and ChartQA style).
- Scene & Spatial (% acc)
- Understanding a scene and reasoning about where objects are and how they relate (MMMU style).
- Video QA (% acc)
- Following a video stream over time: events, ordering, and counting (Video-MME style).
- Grounding & Detection (% acc)
- Pointing at the right pixels: boxes and points for the objects you ask about (RefCOCO style).
- Structured Extraction (% acc)
- Turning a frame into valid JSON that matches your schema (forms, receipts, UIs).
The VLM Intelligence Index
The Intelligence Index is a single 0–100 number: a weighted, normalized composite of the six capability evals. Weights reflect how much each eval matters for real-time vision work: document and spatial understanding carry the most, grounding the least. The six weights sum to exactly one. The current definition is version 2.1.
| Evaluation | Weight | Share |
|---|---|---|
| Document & Chart | 0.20 | 20% |
| Scene & Spatial | 0.20 | 20% |
| OCR & Text | 0.16 | 16% |
| Video QA | 0.16 | 16% |
| Structured Extraction | 0.16 | 16% |
| Grounding & Detection | 0.12 | 12% |
| Total | 1.00 | 100% |
To score a model we take its accuracy on each eval, multiply by that eval's weight, and sum. Because the weights sum to one, a model that scores 70% on every eval lands at an index of 70. Weights are versioned so historical comparisons stay honest when the definition changes.
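The weighted sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's harness: the weights are copied from the version 2.1 table, while the function name `intelligence_index` and the input shape (a dict of per-eval accuracies in percent) are assumptions made for the example. It also assumes no normalization beyond the weights summing to one.

```python
# Weights copied from the version 2.1 table; everything else is illustrative.
WEIGHTS = {
    "Document & Chart": 0.20,
    "Scene & Spatial": 0.20,
    "OCR & Text": 0.16,
    "Video QA": 0.16,
    "Structured Extraction": 0.16,
    "Grounding & Detection": 0.12,
}


def intelligence_index(accuracy_pct: dict[str, float]) -> float:
    """Return the weighted sum of per-eval accuracies (each in percent, 0-100)."""
    missing = WEIGHTS.keys() - accuracy_pct.keys()
    if missing:
        # Refuse to score a model that has not been run on every eval.
        raise ValueError(f"missing evals: {sorted(missing)}")
    return sum(WEIGHTS[name] * accuracy_pct[name] for name in WEIGHTS)


if __name__ == "__main__":
    # A model that scores 70% on every eval should land at (about) 70.
    uniform = {name: 70.0 for name in WEIGHTS}
    print(round(intelligence_index(uniform), 2))
```

Keeping the weights in one versioned mapping means a change of definition is a change to a single table, which is what makes historical comparisons re-derivable.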
Confidence intervals
A single accuracy number hides how much you should trust it. Every score we publish carries a 95% confidence interval derived from the eval's sample size, so two models a point apart are visibly indistinguishable when their intervals overlap.
Charts render CIs as error bars; tables render them inline as value ±half-width. We would rather show a wide bar honestly than imply a precision the data doesn't support, the same stance you'll find in the Index leaderboard.
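To make the link between sample size and interval width concrete, here is a rough sketch. The text does not say which interval method is used, so this example assumes the normal (Wald) approximation for a binomial proportion; the function names `half_width_pct` and `intervals_overlap` are hypothetical and chosen only for illustration.

```python
import math

Z_95 = 1.96  # two-sided 95% quantile of the standard normal distribution


def half_width_pct(accuracy_pct: float, n: int) -> float:
    """Half-width of a 95% CI, in percentage points, under the normal approximation."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = accuracy_pct / 100.0
    return 100.0 * Z_95 * math.sqrt(p * (1.0 - p) / n)


def intervals_overlap(acc_a: float, n_a: int, acc_b: float, n_b: int) -> bool:
    """True when the two 95% intervals share at least one point."""
    return abs(acc_a - acc_b) <= half_width_pct(acc_a, n_a) + half_width_pct(acc_b, n_b)


if __name__ == "__main__":
    # 70% accuracy on a hypothetical 500-item eval: roughly +/-4 points.
    print(f"70.0 ±{half_width_pct(70.0, 500):.1f}")
    # Models one point apart on that eval cannot be told apart.
    print(intervals_overlap(70.0, 500, 71.0, 500))
```

Under this assumption the half-width shrinks with the square root of the sample size, so halving the error bar takes about four times as many items.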
Latency & real-time readiness
Real-time vision lives or dies on latency. To close the loop on a live camera or screen, the answer has to land before the next frame matters. We separate three things that are often conflated:
- TTFT
- Time to first token: how long until results start streaming back, excluding any reasoning think time.
- Think time
- For reasoning models, the internal deliberation before a user-visible answer begins. It is real latency, and it counts toward end-to-end.
- End-to-end
- Wall-clock for a standardized short vision task (~24 output tokens), including any think time. This is the number that decides whether you see and act in real time on a live frame.
A model is real-time ready when its best-region end-to-end time for that task lands under 200 ms. That threshold isn't arbitrary: it's the ceiling for interaction that feels instantaneous on live video, and it's the bar our own API is built to clear. See the latency leaderboard for per-model numbers, and Measuring real-time readiness for the full derivation.
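The pass/fail gate can be sketched as a minimum over regions followed by a strict comparison against the 200 ms threshold. This is an illustration under stated assumptions: the region keys and latency values below are made up, the helper names are hypothetical, and each region is assumed to already be summarized as a single end-to-end figure in milliseconds (how that figure is aggregated is not specified here).

```python
REALTIME_THRESHOLD_MS = 200.0  # end-to-end ceiling stated in the text


def best_region(e2e_ms_by_region: dict[str, float]) -> tuple[str, float]:
    """Return the (region, latency) pair with the lowest end-to-end time."""
    if not e2e_ms_by_region:
        raise ValueError("need at least one regional measurement")
    region = min(e2e_ms_by_region, key=e2e_ms_by_region.get)
    return region, e2e_ms_by_region[region]


def is_realtime_ready(e2e_ms_by_region: dict[str, float]) -> bool:
    """Pass only when the best region is strictly under the threshold."""
    _, best_ms = best_region(e2e_ms_by_region)
    return best_ms < REALTIME_THRESHOLD_MS


if __name__ == "__main__":
    # Hypothetical measurements for one model, in milliseconds.
    sample = {"us-east": 180.0, "eu-west": 240.0, "sa-east": 320.0}
    print(best_region(sample), is_realtime_ready(sample))
```

Because the gate uses the best region, a model can pass here and still miss the threshold from a distant location, which is why the regional breakdown is reported separately.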
Regions & infrastructure
Latency is meaningless without a location. The same model answers faster from Virginia than from São Paulo, so we run it across 8 regions and report the best-region figure for real-time readiness.
- North America: US East (Virginia) · US West (Oregon)
- Europe: EU West (Ireland) · EU Central (Frankfurt)
- Asia: Asia Pacific (Tokyo) · Asia Pacific (Mumbai)
- South America: South America (São Paulo)
- Middle East: Middle East (Dubai)
Every measurement is also tagged with the provider that served it. We group providers into four types so you can compare the same model across infrastructure:
- Overshoot
- Our real-time vision edge: any video source to any VLM, results streamed back as fast as 200 ms, in every region.
- Hosted API
- First-party and aggregator endpoints (OpenAI, Anthropic, Google Vertex, Groq, Fireworks, Together, and more).
- Self-host (vLLM)
- Open-weights models served on vLLM, a cost and latency baseline you can reproduce yourself.
- Other
- Any endpoint that doesn't fit the buckets above.
Regional and provider breakdowns get their own views: regional latency and provider comparison.
Reproducibility
A benchmark you can't reproduce is marketing. Every number on this site is deterministic and traceable to its inputs:
- One harness, one prompt template, one decoding config per eval, applied identically to every model.
- A single reference provider per model fixes speed, latency, and price so cross-model comparisons are apples-to-apples.
- Every score is stamped with an as-of date and sample size; the snapshot backing this build is Jun 2026.
- The full typed schema and adapter are documented. Pull the raw data and re-derive any figure yourself.
For the data schema, glossary, and how to pull our numbers, see the docs. For methodology deep-dives and launch analyses, see the blog.