Analysis

The VLM landscape, mid-2026: capability is converging, latency is not

Frontier vision-language models now cluster within a few points on capability. The real spread has moved to speed, latency, and cost per task, and that is where model selection is decided.

Overshoot Benchmarks · Jun 2026 · 9 min read

Today's leading vision-language models are starting to saturate the public capability benchmarks the way coding models did a year ago: the top of the Intelligence Index is a narrow band where adjacent models overlap inside their confidence intervals. Claude Opus 4.6, GPT-5.4, and the strongest open-weight reasoning models are separated by a few points on document understanding and scene reasoning, differences that are real but rarely decisive for a given application.

When capability converges, the decision moves to the other axes. And on speed, latency, and cost per task, the field is not converging at all.

Capability: a narrowing frontier

Across our six capability evaluations (OCR, document & chart, scene & spatial, video QA, grounding, and structured extraction), the frontier looks flat and crowded near the top. The interesting structure is below the frontier, where specialization shows up clearly:

OCR & text is the most saturated axis. Small open models, the Qwen family in particular, punch far above their Intelligence Index here. If your workload is reading, you do not need a frontier model.
Video QA separates the field by architecture, not size. Models with native temporal handling pull ahead; image-only models that see video as a bag of frames lose ground no matter how strong they are on stills.
Grounding & detection remains the hardest axis and the one where reasoning models earn their think time. The exception is specialists: H Company's Holo3, trained on UI grounding, outscores generalists with twice its Intelligence Index score on this axis.

Where the real spread lives

Plot Intelligence against cost per task and the picture inverts: the comparison charts show a two-order-of-magnitude spread in cost for models within a few points of each other on capability. The "most attractive quadrant" (high capability, low cost) is populated almost entirely by open-weight models on efficient infrastructure.
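For token-billed APIs, cost per task is roughly token volume times token price, which is why it can diverge so sharply from the listed per-token rate. A minimal sketch of that arithmetic follows. The `Pricing` class, the `cost_per_task` function, and every number below are hypothetical illustrations, not figures from our boards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-token pricing, in USD per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_task(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task given its token counts."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


# Illustrative assumption: the reasoning model is cheaper per token
# but emits far more output tokens for the same image and prompt.
small_model = Pricing(input_per_mtok=0.50, output_per_mtok=2.00)
reasoning_model = Pricing(input_per_mtok=0.40, output_per_mtok=1.60)

small_cost = cost_per_task(small_model, input_tokens=1_500, output_tokens=200)
reasoning_cost = cost_per_task(reasoning_model, input_tokens=1_500, output_tokens=6_000)

print(f"small: ${small_cost:.5f}  reasoning: ${reasoning_cost:.5f}")
```

In this made-up case the model with the lower per-token price costs roughly nine times more per task, because it emits thirty times the output tokens.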

Latency spreads even wider. Reasoning models that top the capability charts are structurally excluded from real-time use; a short vision task that a small model answers in ~120ms takes a frontier reasoning model several seconds. There is no overlap. This is the single most important thing model selection gets wrong: teams pick the smartest model for a job that needed the fastest one.

Provider infrastructure is now a model property

The same weights served on different infrastructure are, for practical purposes, different products. Our provider comparison holds the model constant and varies the infra: output speed moves by more than 2x, and end-to-end latency by more than that once you account for region. A model is not "fast" or "slow"; a (model, provider, region) triple is.
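One way to act on this is to key latency measurements by the full triple rather than by model name. The sketch below assumes you have already measured end-to-end latency yourself. The `Deployment` type, the `within_budget` helper, and the placeholder names and timings are all hypothetical, not data from our provider comparison.

```python
from typing import NamedTuple


class Deployment(NamedTuple):
    """A (model, provider, region) triple; the unit that is fast or slow."""
    model: str
    provider: str
    region: str


def within_budget(latencies_ms: dict[Deployment, float],
                  budget_ms: float) -> list[Deployment]:
    """Return deployments whose measured latency fits the budget, fastest first."""
    fits = [d for d, ms in latencies_ms.items() if ms <= budget_ms]
    return sorted(fits, key=lambda d: latencies_ms[d])


# Placeholder measurements: same weights, different infrastructure and region.
measured = {
    Deployment("model-a", "provider-x", "us-east"): 140.0,
    Deployment("model-a", "provider-x", "eu-west"): 260.0,
    Deployment("model-a", "provider-y", "us-east"): 310.0,
}

candidates = within_budget(measured, budget_ms=200.0)
print(candidates)
```

With these placeholder numbers, only one of the three deployments of the same model clears a 200ms budget, which is the sense in which the triple, not the model, is the product.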

How to read the boards

Start from the application, not the model. One hard question about one image → optimize for the Intelligence Index. A thousand frames → optimize for the 200ms latency bar.
Check the confidence intervals. Near the frontier, most rank differences are noise. We publish CIs on every capability and agent number for exactly this reason.
Cost per task, not price per token. Reasoning models emit far more tokens; a lower per-token price can still lose on cost per task.
Verify latency in your region. Readiness that only holds next to the datacenter is not readiness.
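The confidence-interval check can be sketched in a few lines. This assumes each score comes with a (low, high) interval. The `intervals_overlap` helper and the scores below are hypothetical, and overlap is only a conservative rule of thumb for "this rank difference may be noise", not a formal significance test.

```python
def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Return True if two closed (low, high) intervals share any point."""
    (a_lo, a_hi), (b_lo, b_hi) = a, b
    return a_lo <= b_hi and b_lo <= a_hi


# Hypothetical scores with confidence intervals, not taken from our boards.
model_a_ci = (71.2, 75.8)   # point estimate near 73.5
model_b_ci = (69.9, 74.1)   # point estimate near 72.0
model_c_ci = (58.0, 62.5)   # point estimate near 60.3

# Adjacent near the top: intervals overlap, so treat the ranking as unresolved.
print(intervals_overlap(model_a_ci, model_b_ci))

# Clearly separated: the gap is larger than either interval can explain.
print(intervals_overlap(model_a_ci, model_c_ci))
```

When the intervals overlap, the other axes (latency, cost per task) are usually the better tiebreaker than the rank itself.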

The frontier will keep narrowing on capability. The differentiation that survives, and the differentiation Overshoot exists to push on, is getting frontier-quality vision to answer in time.