
How we run VLMs on live video

Overshoot builds the fastest API for real-time vision, and we run this benchmark the way our product ships: every model on live video, same harness, same reference provider, confidence intervals published, not hidden.

Snapshot as of Jun 2026 · Index 2.1

What we measure

We run every notable VLM on live video and score it on five axes. Capability, speed, latency, and cost are measured directly; real-time readiness is a verdict computed on top of them.

Capability
How well a model actually sees: OCR, documents, spatial reasoning, video, grounding, structured extraction, rolled into one weighted index.
Speed
Sustained output tokens per second at batch-1. How fast results stream back once generation is underway.
Latency
Time to first token, think time, and end-to-end wall-clock on a standardized short vision task. The numbers that decide whether you can close the loop on live video.
Cost
Blended price per 1M tokens and amortized cost per task at the reference provider. What a live workload actually bills.
Real-time readiness
The verdict: can this model answer a live frame in under 200 ms end-to-end in its best region? A pass/fail gate on top of the raw latency.

Each axis has its own leaderboard. Start with the VLM Intelligence Index, then drill into capability, latency, and the rest.

The six capability evaluations

Capability is decomposed into six independent evals. Each is scored as accuracy (percent correct) on a held-out set, run identically across every model. Together they roll up into the Intelligence Index below.

OCR & Text (% acc)
Reading printed and handwritten text, signs, and labels off frames and dense documents.
Document & Chart (% acc)
Reading documents, tables, charts, and diagrams (DocVQA and ChartQA style).
Scene & Spatial (% acc)
Understanding a scene and reasoning about where objects are and how they relate (MMMU style).
Video QA (% acc)
Following a video stream over time: events, ordering, and counting (Video-MME style).
Grounding & Detection (% acc)
Pointing at the right pixels: boxes and points for the objects you ask about (RefCOCO style).
Structured Extraction (% acc)
Turning a frame into valid JSON that matches your schema (forms, receipts, UIs).

The VLM Intelligence Index

The Intelligence Index is a single 0–100 number: a weighted, normalized composite of the six capability evals. Weights reflect how much each eval matters for real-time vision work: Document & Chart and Scene & Spatial carry the most (0.20 each), Grounding & Detection the least (0.12), and the six weights sum to exactly one. The current definition is version 2.1.

Intelligence Index weights per capability benchmark

Evaluation               Weight   Share
Document & Chart         0.20     20%
Scene & Spatial          0.20     20%
OCR & Text               0.16     16%
Video QA                 0.16     16%
Structured Extraction    0.16     16%
Grounding & Detection    0.12     12%
Total                    1.00

To score a model we take its accuracy on each eval, multiply by that eval's weight, and sum. Because weights are normalized, a model that scored 70% everywhere lands at an index of 70. Weights are versioned so historical comparisons stay honest when the definition changes.
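
The arithmetic above fits in a few lines. What follows is a minimal sketch, not the harness itself: the weights are copied from the version 2.1 table, while the dictionary keys and the function name `intelligence_index` are illustrative names chosen for this example.

```python
# Weights copied from the Index 2.1 table; the key names are illustrative.
WEIGHTS = {
    "document_chart": 0.20,
    "scene_spatial": 0.20,
    "ocr_text": 0.16,
    "video_qa": 0.16,
    "structured_extraction": 0.16,
    "grounding_detection": 0.12,
}


def intelligence_index(accuracy_pct: dict[str, float]) -> float:
    """Weighted sum of per-eval accuracies (each 0-100), giving a 0-100 index."""
    missing = WEIGHTS.keys() - accuracy_pct.keys()
    if missing:
        raise ValueError(f"missing evals: {sorted(missing)}")
    return sum(WEIGHTS[name] * accuracy_pct[name] for name in WEIGHTS)


# A model that scores 70% on every eval lands at an index of 70.
uniform = {name: 70.0 for name in WEIGHTS}
print(round(intelligence_index(uniform), 2))
```

Keeping the weights in one versioned mapping is what makes a definition change auditable: re-scoring history under a new version means swapping the mapping, not the per-eval accuracies.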

Confidence intervals

A single accuracy number hides how much you should trust it. Every score we publish carries a 95% confidence interval derived from the eval's sample size, so two models a point apart are visibly indistinguishable when their intervals overlap.

Charts render CIs as error bars; tables render them inline as value ±half-width. We would rather show a wide bar honestly than imply a precision the data doesn't support. It's the same stance you'll find in the Index leaderboard.
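
To make the "value ±half-width" form concrete, here is a hedged sketch of one common way to get a 95% interval from an eval's sample size. It assumes the normal (Wald) approximation to a binomial proportion; the page does not say which interval method is actually used, and the sample counts below are hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def accuracy_ci(correct: int, total: int) -> tuple[float, float]:
    """Return (accuracy %, 95% half-width %) via the normal approximation.

    Assumption: a Wald interval; the real pipeline may use another method.
    """
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    half_width = Z_95 * math.sqrt(p * (1 - p) / total)
    return 100 * p, 100 * half_width


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two (value, half-width) intervals share at least one point."""
    return abs(a[0] - b[0]) <= a[1] + b[1]


# Hypothetical counts: two models one point apart on a 1,000-item eval.
model_a = accuracy_ci(700, 1000)
model_b = accuracy_ci(710, 1000)
print(f"{model_a[0]:.1f} ±{model_a[1]:.1f}")
print(intervals_overlap(model_a, model_b))
```

With 1,000 items the half-width at 70% accuracy is close to three points, so a one-point gap sits well inside the overlap. The Wald interval is symmetric, which matches the ±half-width rendering, but it behaves poorly near 0% or 100%; a Wilson interval is a common substitute there.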

Latency & real-time readiness

Real-time vision lives or dies on latency. To close the loop on a live camera or screen, the answer has to land before the next frame matters. We separate three things that are often conflated:

TTFT
Time to first token: how long until results start streaming back, excluding any reasoning think time.
Think time
For reasoning models, the internal deliberation before a user-visible answer begins. Real, and counted against end-to-end.
End-to-end
Wall-clock for a standardized short vision task (~24 output tokens), including any think time. This is the number that decides whether you see and act in real time on a live frame.
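
One way to keep the three quantities apart is to derive them from four timestamps per request. The sketch below is an assumption-heavy reading of the definitions above, not the measurement harness: the class `RequestTimestamps`, its field names, and the choice to treat TTFT as the time until the stream starts (with think time running from there until the first visible answer token) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestTimestamps:
    """Monotonic-clock timestamps in seconds. All field names are hypothetical."""
    sent: float          # request leaves the client
    stream_start: float  # first token of any kind arrives
    answer_start: float  # first user-visible answer token arrives
    done: float          # last answer token arrives


def latency_breakdown_ms(t: RequestTimestamps) -> dict[str, float]:
    """Split one request into TTFT, think time, and end-to-end, in milliseconds."""
    if not (t.sent <= t.stream_start <= t.answer_start <= t.done):
        raise ValueError("timestamps must be in chronological order")
    return {
        "ttft_ms": (t.stream_start - t.sent) * 1000,
        # Zero for non-reasoning models, where both events coincide.
        "think_ms": (t.answer_start - t.stream_start) * 1000,
        # Includes think time, as the definition above requires.
        "end_to_end_ms": (t.done - t.sent) * 1000,
    }


# Hypothetical request: 50 ms to first token, 40 ms thinking, done at 180 ms.
example = latency_breakdown_ms(RequestTimestamps(0.0, 0.05, 0.09, 0.18))
print(example)
```

Under this reading, think time is never hidden inside TTFT, yet it still counts toward the end-to-end figure.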

A model is real-time ready when its best-region end-to-end for that task lands under 200 ms. That threshold isn't arbitrary: it's the ceiling for interaction that feels instantaneous on live video, the bar our own API is built to clear. See the latency leaderboard for per-model numbers, and Measuring real-time readiness for the full derivation.
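
The gate itself is simple enough to sketch. This is an illustration under stated assumptions: each region contributes a single end-to-end figure in milliseconds (the page does not say here whether that figure is a median or another statistic), and the region keys, latency values, and function name `realtime_verdict` are hypothetical.

```python
REALTIME_THRESHOLD_MS = 200.0  # threshold stated in the text


def realtime_verdict(e2e_ms_by_region: dict[str, float]) -> tuple[str, float, bool]:
    """Return (best region, its end-to-end ms, pass/fail against the threshold)."""
    if not e2e_ms_by_region:
        raise ValueError("need at least one region measurement")
    best_region = min(e2e_ms_by_region, key=e2e_ms_by_region.get)
    best_ms = e2e_ms_by_region[best_region]
    # "Under 200 ms" is read as a strict inequality: exactly 200 ms fails.
    return best_region, best_ms, best_ms < REALTIME_THRESHOLD_MS


# Hypothetical per-region end-to-end figures for one model.
measurements = {"us-east": 182.0, "eu-west": 240.0, "sa-east": 410.0}
print(realtime_verdict(measurements))
```

Because the verdict uses the best region only, a model can pass while still missing the bar for users far from that region, which is why the regional breakdown matters alongside the pass/fail gate.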

Regions & infrastructure

Latency is meaningless without a location. The same model answers faster from Virginia than from São Paulo, so we run every model across 8 regions and report the best-region figure for real-time readiness.

  • North America: US East (Virginia) · US West (Oregon)
  • Europe: EU West (Ireland) · EU Central (Frankfurt)
  • Asia: Asia Pacific (Tokyo) · Asia Pacific (Mumbai)
  • South America: South America (São Paulo)
  • Middle East: Middle East (Dubai)

Every measurement is also tagged with the provider that served it. We group providers into four types so you can compare the same model across infrastructure:

Overshoot
Our real-time vision edge: any video source to any VLM, results streamed back in as little as 200 ms, in every region.
Hosted API
First-party and aggregator endpoints (OpenAI, Anthropic, Google Vertex, Groq, Fireworks, Together, and more).
Self-host (vLLM)
Open-weights models served on vLLM, a cost and latency baseline you can reproduce yourself.
Other
Any endpoint that doesn't fit the buckets above.

Regional and provider breakdowns get their own views: regional latency and provider comparison.

Reproducibility

A benchmark you can't reproduce is marketing. Every number on this site is deterministic and traceable to its inputs:

  • One harness, one prompt template, one decoding config per eval, applied identically to every model.
  • A single reference provider per model fixes speed, latency, and price so cross-model comparisons are apples-to-apples.
  • Every score is stamped with an as-of date and sample size; the snapshot backing this build is Jun 2026.
  • The full typed schema and adapter are documented. Pull the raw data and re-derive any figure yourself.

For the data schema, glossary, and how to pull our numbers, see the docs. For methodology deep-dives and launch analyses, see the blog.