Reading the benchmarks

How to interpret the Intelligence Index, confidence intervals, the efficiency frontier, and the real-time bar.

The Intelligence Index

The VLM Intelligence Index is a single 0–100 number: a weight-normalized average of the six capability evals. Weights are published and versioned (currently v2.1). It is a summary, not a verdict. Two models within a couple of points are, for most purposes, equivalent. Use the capability breakdown when a specific axis matters.
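As a rough illustration of what "weight-normalized average" means here, the sketch below divides the weighted sum of per-eval scores by the total weight. The eval names, weights, and scores are hypothetical placeholders, not the published v2.1 values, and the function name is an assumption made for this example.

```python
from typing import Mapping


def intelligence_index(scores: Mapping[str, float],
                       weights: Mapping[str, float]) -> float:
    """Weighted sum of per-eval scores (0-100 each) divided by total weight."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same evals")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical eval names, weights, and scores, for illustration only.
weights = {"eval_a": 2.0, "eval_b": 1.0, "eval_c": 1.0,
           "eval_d": 1.0, "eval_e": 1.0, "eval_f": 2.0}
scores = {"eval_a": 80.0, "eval_b": 60.0, "eval_c": 70.0,
          "eval_d": 50.0, "eval_e": 90.0, "eval_f": 75.0}

index = intelligence_index(scores, weights)
print(f"{index:.1f}")
```

Because the result is divided by the total weight, it stays on the same 0–100 scale as the inputs regardless of how the weights are scaled.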

Confidence intervals

Every capability and agent number carries a confidence interval (CI), shown as error bars on the charts and as ± values in the tables. Near the top of the board, most score gaps between adjacent ranks are smaller than the CIs around them. Treat those ranks as ties. We would rather show you the uncertainty than imply a precision the data can't support.
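One simple way to apply the tie rule is to check whether two reported intervals overlap. The sketch below assumes each table entry can be read as a score plus a symmetric ± half-width; that reading, the function name, and the example numbers are assumptions for illustration. Interval overlap is a conservative rule of thumb, not a formal significance test.

```python
def is_tie(score_a: float, ci_a: float, score_b: float, ci_b: float) -> bool:
    """True when the intervals score_a ± ci_a and score_b ± ci_b overlap."""
    return abs(score_a - score_b) <= ci_a + ci_b


# Hypothetical table entries: (score, ± half-width).
print(is_tie(71.2, 1.5, 70.1, 1.4))  # gap of 1.1 is inside the combined widths
print(is_tie(71.2, 1.5, 65.0, 1.4))  # gap of 6.2 is well outside them
```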

The efficiency frontier

The comparison charts plot capability against cost. The line connecting the best-value models at each capability level is the efficiency frontier; anything below and to the right of it is dominated. For reasoning models we draw the frontier across effort levels: higher effort buys capability at a predictable cost, and you can see exactly where the returns flatten.
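The frontier described above is a Pareto front over (cost, capability) pairs. A minimal sketch of how such a front can be computed follows. The `Point` type, the model names, and the numbers are hypothetical, and this is not necessarily how the charts themselves are built.

```python
from typing import List, NamedTuple


class Point(NamedTuple):
    name: str
    cost: float        # lower is better
    capability: float  # higher is better


def efficiency_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, ordered by increasing cost.

    A point is dominated when some other point costs no more and is at
    least as capable (and is strictly better on one of the two axes).
    """
    frontier: List[Point] = []
    best_capability = float("-inf")
    # Cheapest first; at equal cost, the most capable point comes first.
    for p in sorted(points, key=lambda p: (p.cost, -p.capability)):
        if p.capability > best_capability:
            frontier.append(p)
            best_capability = p.capability
    return frontier


# Hypothetical models, for illustration only.
models = [
    Point("model-a", 1.0, 50.0),
    Point("model-b", 2.0, 45.0),  # dominated by model-a
    Point("model-c", 3.0, 70.0),
    Point("model-d", 4.0, 68.0),  # dominated by model-c
    Point("model-e", 6.0, 72.0),
]
frontier_names = [p.name for p in efficiency_frontier(models)]
print(frontier_names)
```

For a reasoning model, each effort level could be passed in as its own point, so the same routine shows which effort levels sit on the frontier.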

The real-time bar

Latency views flag every configuration that clears 200 ms end to end, the bar for closing the loop on live camera and screen. It's a hard threshold, not a score: in a given region, a model is either real-time or it is not. See "Measuring real-time readiness" for the full definition.
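The per-region, pass-or-fail nature of the bar can be sketched as a boolean check against the threshold. The region names and latencies below are hypothetical. Treating "clears" as strictly under 200 ms, and using a single latency figure per region, are assumptions; the full definition may use a specific percentile or boundary rule.

```python
REAL_TIME_BAR_MS = 200.0  # threshold quoted in the text above


def real_time_by_region(latency_ms_by_region: dict) -> dict:
    """Map each region to True or False; there is no partial credit."""
    return {
        region: latency_ms < REAL_TIME_BAR_MS
        for region, latency_ms in latency_ms_by_region.items()
    }


# Hypothetical end-to-end latencies in milliseconds.
flags = real_time_by_region({"region-1": 140.0, "region-2": 185.0, "region-3": 240.0})
print(flags)
```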

Reasoning models

Reasoning models are marked with a bulb glyph and expose effort levels (medium / high / xhigh / max). Higher effort raises capability and cost and adds think time to latency. That think time is compute, not network, so it doesn't vary by region, but it does keep these models out of the real-time tier.
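To see why think time keeps these models out of the real-time tier, the sketch below models end-to-end latency as a region-dependent part plus a region-independent think time. The additive model, the region names, and every number are assumptions for illustration. Only the 200 ms bar comes from the text.

```python
REAL_TIME_BAR_MS = 200.0  # the real-time bar described earlier


def clears_bar_anywhere(base_ms_by_region: dict, think_ms: float) -> bool:
    """True if any region's base latency plus think time is under the bar."""
    return any(base_ms + think_ms < REAL_TIME_BAR_MS
               for base_ms in base_ms_by_region.values())


# Hypothetical figures: base latency varies by region, think time does not.
base_ms_by_region = {"region-1": 90.0, "region-2": 150.0}
think_ms_by_effort = {"medium": 400.0, "high": 1200.0}

results = {
    effort: clears_bar_anywhere(base_ms_by_region, think_ms)
    for effort, think_ms in think_ms_by_effort.items()
}
print(results)
```

Under these assumed numbers, think time alone exceeds the bar, so no choice of region can bring the total under 200 ms.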