Methodology
Measuring real-time readiness: the 200ms bar for vision models
Why we treat end-to-end latency as a first-class benchmark axis, how we measure it across regions, and which VLMs actually clear the bar for live video.
Most model leaderboards answer one question: how smart is it? For real-time vision, that question is necessary but nowhere near sufficient. A model that scores 92 on document understanding is useless for a live camera feed if the first token lands 8 seconds after the frame does.
So we measure a second axis with equal weight: real-time readiness, the share of hosting configurations where a model returns a complete answer to a short vision task in under 200 milliseconds, end to end.
Why 200ms
Two hundred milliseconds is roughly the threshold where an interaction stops feeling like a request/response round-trip and starts feeling like the system is watching. It is also, not coincidentally, the latency budget of many physical-world applications: reading a label as a package rotates past a camera, describing a scene for a blind user walking down a street, or letting an agent act on what it sees on screen before the screen changes.
Below 200ms, a vision model is a sensor. Above it, it is a form you fill out.
What we actually time
For every model, provider, and region we run a standardized short task (an image and a prompt that expects roughly 24 output tokens) and record:
- TTFT: time to the first streamed token, excluding reasoning "think" time. This isolates the serving stack: input processing, the vision tower, and prefill.
- Think time: for reasoning models, the wall-clock time spent before the first answer token. This is compute, not network, so it does not vary by region.
- End-to-end: TTFT + think time + generation of the 24-token answer at the model's sustained output speed.
realtimeReady is simply end-to-end latency (e2e) < 200ms. No partial credit.
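The arithmetic behind the three measurements can be sketched in a few lines. This is a minimal illustration, not our benchmark harness: the `Measurement` class, its field names, and the sample numbers are hypothetical, and the answer length is fixed at 24 tokens although the real task only expects roughly that many.

```python
from dataclasses import dataclass

REALTIME_BUDGET_MS = 200.0  # the bar described in the text
ANSWER_TOKENS = 24          # approximate answer length of the standardized task


@dataclass
class Measurement:
    """One model/provider/region sample (hypothetical structure)."""
    ttft_ms: float              # time to first streamed token, excluding think time
    think_ms: float             # 0 for non-reasoning models
    output_tokens_per_s: float  # sustained output speed


def end_to_end_ms(m: Measurement, answer_tokens: int = ANSWER_TOKENS) -> float:
    """TTFT + think time + time to generate the answer at sustained speed."""
    generation_ms = answer_tokens * 1000.0 / m.output_tokens_per_s
    return m.ttft_ms + m.think_ms + generation_ms


def realtime_ready(m: Measurement) -> bool:
    """Strict threshold: no partial credit at or above the budget."""
    return end_to_end_ms(m) < REALTIME_BUDGET_MS


# Illustrative numbers only, not measured results.
fast = Measurement(ttft_ms=80.0, think_ms=0.0, output_tokens_per_s=300.0)
slow_decode = Measurement(ttft_ms=80.0, think_ms=0.0, output_tokens_per_s=100.0)
reasoning = Measurement(ttft_ms=80.0, think_ms=3000.0, output_tokens_per_s=300.0)

for name, m in [("fast", fast), ("slow_decode", slow_decode), ("reasoning", reasoning)]:
    print(name, round(end_to_end_ms(m), 1), realtime_ready(m))
```

One consequence of this formula: with a 24-token answer, generation alone takes 200ms at 120 tokens per second, so a configuration has to sustain more than that before TTFT even enters the picture.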
Region matters more than people admit
The same model on the same provider can be real-time in Virginia and sluggish in São Paulo purely from network round-trips. We measure end-to-end latency from eight regions, us-east through me-central, and report the spread, not a single number. A model that clears 200ms only when the user sits next to the datacenter is not, for our purposes, real-time.
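Reporting the spread rather than a single number might look like the sketch below. The function name, the output keys, and the latency values are hypothetical; only `us-east` and `me-central` are region names taken from the text, and the example uses four regions instead of the full eight to stay short.

```python
def regional_summary(e2e_by_region: dict[str, float], budget_ms: float = 200.0) -> dict:
    """Summarize end-to-end latency across regions for one model/provider pair."""
    values = list(e2e_by_region.values())
    # Regions where the strict e2e < budget test passes.
    ready = sorted(region for region, ms in e2e_by_region.items() if ms < budget_ms)
    return {
        "min_ms": min(values),
        "max_ms": max(values),
        "spread_ms": max(values) - min(values),
        "ready_regions": ready,
        "ready_share": len(ready) / len(values),
    }


# Made-up end-to-end latencies in milliseconds, for illustration only.
sample = {
    "us-east": 140.0,
    "eu-west": 175.0,     # hypothetical region name
    "sa-east": 260.0,     # hypothetical region name
    "me-central": 310.0,
}

summary = regional_summary(sample)
print(summary)
```

In this made-up sample the best region clears the bar comfortably, yet the model is ready in only half of the regions, which is exactly the case a single headline number would hide.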
This is where serving infrastructure earns its keep. Purpose-built real-time stacks keep small, non-reasoning VLMs under the bar across most regions; general-purpose hosted APIs and self-hosted vLLM deployments trade latency for flexibility and usually miss it outside their home region.
What clears the bar today
The pattern is consistent: small, non-reasoning, open-weight VLMs on latency-optimized infra are the ones living under 200ms. Reasoning models, however capable, spend seconds thinking and are structurally excluded from the real-time tier. That is not a criticism; it is a different job. The latency leaderboard shows exactly who lands where, and the regional view shows how far that readiness travels.
The takeaway for builders: pick the axis that matches your application. If you are answering one hard question about one image, optimize for the Intelligence Index. If you are answering a thousand easy questions about a thousand frames, optimize for the 200ms bar, and check that it holds in your region.