Overshoot builds the fastest API for real-time vision: connect any video source to any Vision Language Model (VLM) and get results back in under 200 ms. We run this benchmark the way our product ships: every notable VLM, on live video. We publish the raw numbers, the confidence intervals, and the methodology behind them.
The five axes
- Capability, six evals (OCR, document & chart, scene & spatial, video QA, grounding, structured extraction) combined into the VLM Intelligence Index.
- Speed, sustained output tokens per second.
- Latency, TTFT and end-to-end latency, including the <200 ms real-time-readiness flag.
- Cost, blended price and cost per task.
- Context, maximum context window.
Plus a dedicated vision-agent track for computer-use tasks and a regional latency view across eight regions.
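To make the speed and latency axes concrete, the sketch below derives TTFT, end-to-end latency, and output tokens per second from the timestamps of a single streamed response. The `StreamTiming` record, its field names, the decode-window definition of throughput, and the choice to apply the 200 ms bar to end-to-end latency are all assumptions made for illustration; the methodology defines the metrics actually used.

```python
from dataclasses import dataclass

# The 200 ms bar named in the text. Applying it to end-to-end latency
# (rather than TTFT) is an assumption of this sketch.
REALTIME_BUDGET_MS = 200.0


@dataclass(frozen=True)
class StreamTiming:
    """Timestamps in seconds for one streamed response (hypothetical layout)."""
    request_sent: float
    first_token: float
    last_token: float
    output_tokens: int


def ttft_ms(t: StreamTiming) -> float:
    """Time to first token: request sent until the first token arrives."""
    return (t.first_token - t.request_sent) * 1000.0


def end_to_end_ms(t: StreamTiming) -> float:
    """End-to-end latency: request sent until the last token arrives."""
    return (t.last_token - t.request_sent) * 1000.0


def tokens_per_second(t: StreamTiming) -> float:
    """Output rate over the decode window (assumed definition).

    Counts the tokens that arrive after the first one and divides by the
    time between the first and last token, so TTFT is not counted twice.
    """
    decode_s = t.last_token - t.first_token
    if decode_s <= 0 or t.output_tokens < 2:
        return 0.0
    return (t.output_tokens - 1) / decode_s


def is_realtime_ready(t: StreamTiming) -> bool:
    """True when the whole response lands inside the 200 ms budget."""
    return end_to_end_ms(t) < REALTIME_BUDGET_MS


if __name__ == "__main__":
    sample = StreamTiming(request_sent=0.0, first_token=0.05,
                          last_token=0.15, output_tokens=21)
    print(f"TTFT: {ttft_ms(sample):.1f} ms")
    print(f"End to end: {end_to_end_ms(sample):.1f} ms")
    print(f"Output rate: {tokens_per_second(sample):.1f} tokens/s")
    print(f"Real-time ready: {is_realtime_ready(sample)}")
```

Keeping TTFT and throughput separate matters here: a model can start answering quickly yet stream slowly, or the reverse, and only the end-to-end figure shows whether the full result fits the budget.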
Principles
- We run models the way they ship. Live video in, results streamed back, measured end to end. No offline shortcuts.
- Reproducible. Every number ships with a sample size and a confidence interval. See the methodology.
- Infrastructure-aware. A model is a (model, provider, region) triple. We never collapse those into a single "the model is fast" claim.
- Application-first. The right model depends on the job. The boards slice by the axis that matters to you, starting with the 200 ms bar.
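One way to picture the reproducibility and infrastructure-aware principles together is to key every measurement by its (model, provider, region) triple and report each group with its sample size and a confidence interval. The sketch below does that with a normal-approximation 95% interval. The `Deployment` type, the function names, and the interval method are assumptions for illustration, not the benchmark's published methodology.

```python
import math
import statistics
from collections import defaultdict
from typing import NamedTuple


class Deployment(NamedTuple):
    """A model is identified by the full triple, never by name alone."""
    model: str
    provider: str
    region: str


def mean_with_ci(samples, z=1.96):
    """Return (mean, half_width, n) for a list of measurements.

    Uses a normal approximation (z = 1.96 for roughly 95%); this is an
    assumed method and is only reasonable for larger sample sizes.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples for an interval")
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width, n


def summarize(measurements):
    """Group (deployment, latency_ms) pairs by triple and summarize each."""
    grouped = defaultdict(list)
    for deployment, latency_ms in measurements:
        grouped[deployment].append(latency_ms)
    return {d: mean_with_ci(values) for d, values in grouped.items()}


if __name__ == "__main__":
    # Hypothetical names and made-up numbers, for illustration only.
    us = Deployment("model-a", "provider-x", "us-east")
    eu = Deployment("model-a", "provider-x", "eu-west")
    data = [(us, 100.0), (us, 110.0), (us, 120.0), (us, 130.0),
            (eu, 300.0), (eu, 320.0), (eu, 310.0)]
    for deployment, (mean, half, n) in summarize(data).items():
        print(f"{deployment}: {mean:.1f} ms +/- {half:.1f} (n={n})")
```

Because the region is part of the key, the same model served from two regions produces two separate rows rather than one averaged figure.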
Start with reading the benchmarks.