Methodology

Measuring real-time readiness: the 200ms bar for vision models

Why we treat end-to-end latency as a first-class benchmark axis, how we measure it across regions, and which VLMs actually clear the bar for live video.

Overshoot Benchmarks · Jun 2026 · 7 min read

Most model leaderboards answer one question: how smart is it? For real-time vision, that question is necessary but nowhere near sufficient. A model that scores 92 on document understanding is useless for a live camera feed if the first token lands 8 seconds after the frame does.

So we measure a second axis with equal weight: real-time readiness, the share of hosting configurations where a model returns a complete answer to a short vision task in under 200 milliseconds, end to end.

Why 200ms

Two hundred milliseconds is roughly the threshold where an interaction stops feeling like a request/response round-trip and starts feeling like the system is watching. It is also, not coincidentally, the latency budget of a lot of physical-world applications: reading a label as a package rotates past a camera, describing a scene for a blind user walking down a street, or letting an agent act on what it sees on screen before the screen changes.

Below 200ms, a vision model is a sensor. Above it, it is a form you fill out.

What we actually time

For every model, provider, and region we run a standardized short task (an image and a prompt that expects roughly 24 output tokens) and record:

  • TTFT, time to the first streamed token, excluding reasoning "think" time. This isolates the serving stack: input processing, the vision tower, and prefill.
  • Think time, for reasoning models, the wall-clock spent before the first answer token. This is compute, not network, so it does not vary by region.
  • End-to-end, TTFT + think + generation of the 24-token answer at the model's sustained output speed.
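
As a rough illustration of how these three intervals could be separated on a streamed response, here is a minimal sketch. It is not our harness. It assumes a hypothetical stream of `(kind, text)` events where `kind` is either `"think"` or `"answer"`, and it assumes TTFT ends at the first event of either kind while think time runs from there to the first answer token. It also times generation directly instead of deriving it from sustained output speed. The `clock` parameter exists only so the timing logic can be exercised deterministically.

```python
import time
from typing import Callable, Iterable, Tuple

Event = Tuple[str, str]  # (kind, text); kind is "think" or "answer" (hypothetical schema)


def time_stream(events: Iterable[Event],
                clock: Callable[[], float] = time.perf_counter) -> dict:
    """Split one streamed response into TTFT, think, and generation time (ms)."""
    start = clock()
    first_any = None      # timestamp of the first streamed event of any kind
    first_answer = None   # timestamp of the first answer token
    answer_tokens = 0

    for kind, _text in events:
        now = clock()
        if first_any is None:
            first_any = now
        if kind == "answer":
            if first_answer is None:
                first_answer = now
            answer_tokens += 1
    end = clock()

    if first_answer is None:
        raise ValueError("stream produced no answer tokens")

    return {
        "ttft_ms": (first_any - start) * 1000.0,
        "think_ms": (first_answer - first_any) * 1000.0,
        "generation_ms": (end - first_answer) * 1000.0,
        "e2e_ms": (end - start) * 1000.0,
        "answer_tokens": answer_tokens,
    }
```

For a non-reasoning model the first event is already an answer token, so `think_ms` comes out as zero and the end-to-end figure reduces to TTFT plus generation.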

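For each model, provider, and region, realtimeReady is simply e2e < 200ms. No partial credit. A model's real-time readiness is the share of those configurations where the flag is true.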
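
The arithmetic is small enough to write down. The sketch below assumes a hypothetical `Timing` record per configuration and treats the answer as exactly 24 tokens; the field names, helper names, and example numbers are illustrative, not taken from our pipeline or our measurements.

```python
from dataclasses import dataclass

BUDGET_MS = 200.0    # the bar described above
ANSWER_TOKENS = 24   # approximate answer length of the short task


@dataclass(frozen=True)
class Timing:
    """Hypothetical record for one model/provider/region configuration."""
    ttft_ms: float       # time to first streamed token, excluding think time
    think_ms: float      # 0.0 for non-reasoning models
    tokens_per_s: float  # sustained output speed


def e2e_ms(t: Timing, answer_tokens: int = ANSWER_TOKENS) -> float:
    """End-to-end latency: TTFT + think + time to generate the answer."""
    generation_ms = answer_tokens / t.tokens_per_s * 1000.0
    return t.ttft_ms + t.think_ms + generation_ms


def realtime_ready(t: Timing) -> bool:
    """Strict threshold, no partial credit."""
    return e2e_ms(t) < BUDGET_MS


def readiness_share(timings: list[Timing]) -> float:
    """Fraction of configurations that clear the bar."""
    if not timings:
        raise ValueError("need at least one configuration")
    return sum(realtime_ready(t) for t in timings) / len(timings)


# Illustrative numbers only, not measurements.
fast = Timing(ttft_ms=80.0, think_ms=0.0, tokens_per_s=240.0)
reasoner = Timing(ttft_ms=80.0, think_ms=2500.0, tokens_per_s=240.0)
```

With these made-up inputs, `fast` spends about 100 ms generating 24 tokens and lands near 180 ms, while `reasoner` is pushed past the bar by think time alone, regardless of how quickly it streams.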

Region matters more than people admit

The same model on the same provider can be real-time in Virginia and sluggish in São Paulo purely from network round-trips. We measure end-to-end latency from eight regions, us-east through me-central, and report the spread, not a single number. A model that clears 200ms only when the user sits next to the datacenter is not, for our purposes, real-time.
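
Reporting a spread instead of a single number could look like the following sketch. It assumes end-to-end latencies have already been collected per region; `regional_spread` is a hypothetical helper, the region names other than us-east and me-central are placeholders, and the sample values are invented for illustration.

```python
from statistics import median


def regional_spread(e2e_by_region: dict[str, float],
                    budget_ms: float = 200.0) -> dict:
    """Summarize per-region end-to-end latency (ms) for one model and provider."""
    if not e2e_by_region:
        raise ValueError("need at least one region")
    values = list(e2e_by_region.values())
    ready = sorted(r for r, ms in e2e_by_region.items() if ms < budget_ms)
    return {
        "min_ms": min(values),
        "median_ms": median(values),
        "max_ms": max(values),
        "regions_ready": ready,
    }


# Invented sample values, not measurements.
sample = {
    "us-east": 140.0,
    "eu-west": 180.0,     # placeholder region name
    "sa-east": 310.0,     # placeholder region name
    "me-central": 260.0,
}
summary = regional_spread(sample)
```

In this invented sample the best region looks comfortably real-time, but the median already sits above 200 ms, which is the kind of gap a single headline number hides.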

This is where serving infrastructure earns its keep. Purpose-built real-time stacks keep small, non-reasoning VLMs under the bar across most regions; general-purpose hosted APIs and self-hosted vLLM deployments trade latency for flexibility and usually miss it outside their home region.

What clears the bar today

The pattern is consistent: small, non-reasoning, open-weight VLMs on latency-optimized infra are the ones living under 200ms. Reasoning models, however capable, spend seconds thinking and are structurally excluded from the real-time tier. That is not a criticism; it is a different job. The latency leaderboard shows exactly who lands where, and the regional view shows how far that readiness travels.

The takeaway for builders: pick the axis that matches your application. If you are answering one hard question about one image, optimize for the Intelligence Index. If you are answering a thousand easy questions about a thousand frames, optimize for the 200ms bar, and check that it holds in your region.