Guide
Deploying a real-time VLM: from camera to answer in under 200ms
A field guide to the latency budget of a live vision pipeline (capture, transport, preprocessing, prefill, and decode) and where the milliseconds actually go.
Every real-time vision deployment lives inside a latency budget. If the round trip from a frame to an answer exceeds ~200ms, the system stops feeling like it is watching and starts feeling like a form submission. This guide walks the budget end to end and shows where the milliseconds hide.
The budget
For a standardized short task (one frame in, ~24 tokens out) the wall clock breaks down into five stages:
- Capture & encode: the camera or screen produces a frame and the client encodes it. Usually single-digit milliseconds, but a badly chosen codec or a full-resolution still can cost you 50ms before anything leaves the device.
- Transport: the frame travels to the model. This is the term that varies most by region; it is why we measure latency by location.
- Preprocessing: resize, patchify, run the vision tower. Easy to underestimate: we have seen a 15-frame clip spend 428ms of CPU here before the request even touched the GPU.
- Prefill: the model reads the image tokens plus your prompt. Time to first token is dominated by this stage.
- Decode: the model streams the answer at its sustained output speed.
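One way to keep the five stages honest is to record them side by side and compare the sum against the budget. The sketch below is a minimal illustration, not a measurement harness: the `StageTimings` class, its field names, and the example numbers are all assumptions made up for this guide, and only the 200ms target comes from the text above.

```python
from dataclasses import dataclass, fields

BUDGET_MS = 200.0  # round-trip target used throughout this guide


@dataclass
class StageTimings:
    """Per-stage wall-clock times in milliseconds (hypothetical structure)."""
    capture_encode_ms: float
    transport_ms: float
    preprocessing_ms: float
    prefill_ms: float
    decode_ms: float

    def total_ms(self) -> float:
        # Sum every stage field declared on the dataclass.
        return sum(getattr(self, f.name) for f in fields(self))

    def headroom_ms(self, budget_ms: float = BUDGET_MS) -> float:
        # Positive means time to spare; negative means over budget.
        return budget_ms - self.total_ms()

    def within_budget(self, budget_ms: float = BUDGET_MS) -> bool:
        return self.total_ms() <= budget_ms


# Illustrative numbers only; replace them with timings you measure yourself.
example = StageTimings(
    capture_encode_ms=5.0,
    transport_ms=40.0,
    preprocessing_ms=35.0,
    prefill_ms=35.0,
    decode_ms=80.0,
)
print(f"total={example.total_ms():.0f}ms headroom={example.headroom_ms():.0f}ms")
```

Filling the structure from real timers (for example `time.perf_counter()` around each stage) makes it obvious which stage to attack first when the total drifts past the budget.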
Where the milliseconds go
The counterintuitive part: for a short answer, decode is rarely the bottleneck. A small model streaming at 300 tok/s emits 24 tokens in ~80ms. The budget is usually spent on preprocessing and prefill (the vision tower and the image-token prefill), plus transport.
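The decode figure is plain arithmetic: output tokens divided by sustained throughput. The helper below is a small sketch (the function names `decode_ms` and `remaining_ms` are illustrative, not from any library) that reproduces the 24-token, 300 tok/s case and shows how much of a 200ms budget is left for every other stage.

```python
def decode_ms(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream `output_tokens` at a sustained decode speed, in ms."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens * 1000.0 / tokens_per_second


def remaining_ms(budget_ms: float, output_tokens: int, tokens_per_second: float) -> float:
    """Budget left for capture, transport, preprocessing, and prefill."""
    return budget_ms - decode_ms(output_tokens, tokens_per_second)


short_answer = decode_ms(24, 300.0)        # the case from the text: 80 ms
left_over = remaining_ms(200.0, 24, 300.0)  # 120 ms for everything else
print(f"decode={short_answer:.0f}ms remaining={left_over:.0f}ms")
```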
That reshapes how you optimize:
- Pick the right model size. A 2–8B VLM on latency-optimized infra clears the bar in most regions; a frontier reasoning model never will, because think time alone is measured in seconds.
- Keep the frame small. Send the resolution the model actually consumes. Oversized frames tax both transport and the vision tower.
- Stay in region. Readiness that only holds next to the datacenter is not readiness. Verify against your users' region, not the closest one.
- Avoid reasoning for reflexes. Reserve chains of thought for the hard single-image question, not the thousand-frame stream.
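For the "keep the frame small" point, the client can compute the target size before encoding, so that no pixels the model would discard ever cross the network. The sketch below assumes a hypothetical model input limit of 448 pixels on the longest side; check the actual limit for the model you deploy. The function name `fit_within` is illustrative.

```python
from typing import Tuple

# Assumption for illustration: the model consumes at most 448 px on the long side.
MODEL_MAX_SIDE = 448


def fit_within(width: int, height: int, max_side: int = MODEL_MAX_SIDE) -> Tuple[int, int]:
    """Return a (width, height) that fits inside max_side, keeping aspect ratio.

    Frames that are already small enough are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0 or max_side <= 0:
        raise ValueError("dimensions must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    # Scale both sides by the same factor; clamp to at least one pixel.
    new_w = max(1, round(width * max_side / longest))
    new_h = max(1, round(height * max_side / longest))
    return new_w, new_h


print(fit_within(1920, 1080))  # a 1080p frame shrinks to (448, 252)
print(fit_within(320, 240))    # an already-small frame is left as (320, 240)
```

The returned size can be passed to whatever resize routine the client already uses; doing the resize before encoding is what saves transport time as well as vision-tower time.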
Reading the boards for deployment
Start from the latency board filtered to real-time readiness, cross-reference cost per task so a cheap-looking per-token price doesn't hide an expensive workload, and confirm the regional spread covers where you actually run. The model that wins is almost never the one at the top of the Intelligence Index; it is the fastest one that is still smart enough for the task.
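The selection rule can be written down as a filter followed by a minimum: drop every model that misses the quality floor or the budget in any region you serve, then take the fastest of what remains. The sketch below is one possible reading of that rule. The `Candidate` structure, the model names, and every number in it are invented for illustration and do not come from any real board.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One row of a hypothetical comparison board."""
    name: str
    quality: float                         # any score where higher is better
    cost_per_task: float                   # cost of the whole task, not per token
    latency_ms_by_region: Dict[str, float]


def pick_model(
    candidates: Sequence[Candidate],
    regions: Sequence[str],
    min_quality: float,
    budget_ms: float = 200.0,
    max_cost_per_task: Optional[float] = None,
) -> Optional[Candidate]:
    """Fastest candidate that is smart enough and in budget in every region."""
    if not regions:
        raise ValueError("list at least one region you actually serve")

    def worst_latency(c: Candidate) -> float:
        # A region with no measurement counts as a failure, not as a pass.
        return max(c.latency_ms_by_region.get(r, float("inf")) for r in regions)

    eligible = [
        c for c in candidates
        if c.quality >= min_quality
        and worst_latency(c) <= budget_ms
        and (max_cost_per_task is None or c.cost_per_task <= max_cost_per_task)
    ]
    return min(eligible, key=worst_latency, default=None)


# Made-up rows for illustration only.
board = [
    Candidate("small-vlm", 0.62, 0.0004, {"us-east": 140.0, "eu-west": 170.0}),
    Candidate("mid-vlm", 0.71, 0.0011, {"us-east": 180.0, "eu-west": 230.0}),
    Candidate("frontier-reasoner", 0.93, 0.0300, {"us-east": 4200.0, "eu-west": 4300.0}),
]
choice = pick_model(board, regions=["us-east", "eu-west"], min_quality=0.60)
print(choice.name if choice else "no model qualifies")
```

Ranking by the worst region rather than the average reflects the "stay in region" advice: a model only counts as ready if it holds the budget everywhere you run.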