Glossary

Definitions for every metric and term used across the benchmarks.

VLM Intelligence Index: weighted composite of the six capability evals, normalized to a 0–100 scale. Higher is better.
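As an illustration, a weight-normalized composite can be sketched as a weighted mean of per-eval accuracies. The weights and scores below are placeholders rather than the ones the benchmarks use, and the function name is hypothetical.

```python
def intelligence_index(accuracies, weights=None):
    """Weighted mean of per-eval accuracies (each on a 0-100 scale).

    Dividing by the total weight keeps the result on the same 0-100 scale
    regardless of how the weights are chosen.
    """
    if weights is None:
        # Assumption: equal weights when none are given.
        weights = {name: 1.0 for name in accuracies}
    total_weight = sum(weights[name] for name in accuracies)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(accuracies[name] * weights[name] for name in accuracies)
    return weighted_sum / total_weight


# Placeholder accuracies for the six capability evals (not real results).
example_scores = {
    "OCR & Text": 80.0,
    "Document & Chart": 70.0,
    "Scene & Spatial": 60.0,
    "Video QA": 50.0,
    "Grounding & Detection": 90.0,
    "Structured Extraction": 70.0,
}

print(intelligence_index(example_scores))
```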

Capability evals: OCR & Text, Document & Chart, Scene & Spatial, Video QA, Grounding & Detection, Structured Extraction. Each reported as accuracy.

TTFT (Time to First Token): latency from request to the first streamed token, excluding a reasoning model's think time. Isolates the serving stack.

Think time: wall-clock time a reasoning model spends before emitting its first answer token. This is compute, not network, so it is constant across regions.

End-to-end (E2E): TTFT + think time + generation time for a standardized ~24-token answer. This is the number the real-time bar is measured against.

Real-time ready: true when E2E is under 200 ms for a given (model, provider, region) combination, which is fast enough to close the loop on live video. A hard flag, not a score.
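A minimal sketch of how the E2E sum and the real-time flag fit together. Estimating generation time from output speed is an assumption made here for illustration, and all function names and sample numbers are hypothetical.

```python
REAL_TIME_BAR_MS = 200.0   # threshold from the "Real-time ready" definition
ANSWER_TOKENS = 24         # the standardized answer length (approximate)


def generation_ms(tokens_per_second, output_tokens=ANSWER_TOKENS):
    """Assumed estimate: time to emit the answer at a sustained output speed."""
    return output_tokens / tokens_per_second * 1000.0


def e2e_ms(ttft_ms, think_ms, gen_ms):
    """E2E = TTFT + think time + generation time, all in milliseconds."""
    return ttft_ms + think_ms + gen_ms


def real_time_ready(e2e):
    """Hard flag: true only when E2E is strictly under the bar."""
    return e2e < REAL_TIME_BAR_MS


# Non-reasoning model: 60 ms TTFT, no think time, 240 tokens/s.
fast = e2e_ms(60.0, 0.0, generation_ms(240.0))
# Same serving stack, but with 500 ms of think time.
slow = e2e_ms(60.0, 500.0, generation_ms(240.0))

print(fast, real_time_ready(fast))
print(slow, real_time_ready(slow))
```

Because think time is added in full, a reasoning model can miss the bar even on a serving stack with low TTFT.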

Output speed: sustained output tokens per second at batch-1.

Blended price: price per 1M tokens at the reference provider, weighted 3:1 input to output.
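The 3:1 weighting works out to three parts input price and one part output price, divided by four. A short sketch, with a hypothetical function name and made-up prices:

```python
def blended_price(input_price_per_m, output_price_per_m,
                  input_weight=3.0, output_weight=1.0):
    """Weighted average of input and output prices, in USD per 1M tokens."""
    total_weight = input_weight + output_weight
    return (input_weight * input_price_per_m
            + output_weight * output_price_per_m) / total_weight


# Made-up prices: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
print(blended_price(1.00, 5.00))  # (3 * 1.00 + 1 * 5.00) / 4 = 2.0
```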

Cost per task: weighted average USD to complete one Intelligence-Index task, accounting for reasoning models' larger output.
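One way such a figure could be computed is sketched below. The task weights, token counts, and prices are all assumptions for illustration, and the glossary does not specify the exact weighting. Reasoning tokens are counted as output tokens here, which is how a reasoning model's larger output raises its cost.

```python
def task_cost_usd(input_tokens, output_tokens,
                  input_price_per_m, output_price_per_m):
    """USD for one task, given per-1M-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


def cost_per_task(tasks, input_price_per_m, output_price_per_m):
    """Weighted average cost over tasks.

    tasks: list of (weight, input_tokens, output_tokens) tuples, where
    output_tokens includes any reasoning tokens (an assumption of this sketch).
    """
    total_weight = sum(weight for weight, _, _ in tasks)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_cost = sum(
        weight * task_cost_usd(inp, out, input_price_per_m, output_price_per_m)
        for weight, inp, out in tasks
    )
    return weighted_cost / total_weight


# Made-up example: one short answer and one answer with 1,000 reasoning tokens.
example_tasks = [
    (1.0, 2000, 24),
    (1.0, 2000, 1024),
]
print(cost_per_task(example_tasks, 1.00, 5.00))
```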

Reference provider: the provider from which we quote a model's headline speed, latency, and price. This is Overshoot for open-weight models and the vendor API for closed models.

Effort level: a reasoning model's compute setting (medium / high / xhigh / max). Higher settings buy capability at the expense of cost and latency.

Pass@1: probability a vision agent completes a task on its first attempt, reported with a confidence interval.
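As a sketch, Pass@1 can be estimated as the fraction of tasks completed on the first attempt. The glossary does not say which interval method is used; the Wilson score interval below is an assumption chosen for illustration, and the function name is hypothetical.

```python
import math


def pass_at_1(successes, attempts, z=1.96):
    """Return (estimate, lower, upper) using a Wilson score interval.

    z=1.96 corresponds to roughly 95% confidence (an assumed default).
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    p = successes / attempts
    denom = 1.0 + z * z / attempts
    center = (p + z * z / (2.0 * attempts)) / denom
    half_width = (
        z * math.sqrt(p * (1.0 - p) / attempts
                      + z * z / (4.0 * attempts * attempts))
        / denom
    )
    return p, center - half_width, center + half_width


# Made-up example: 50 first-attempt completions out of 100 tasks.
estimate, lower, upper = pass_at_1(50, 100)
print(f"{estimate:.3f} [{lower:.3f}, {upper:.3f}]")
```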

Open weights: the model's weights are publicly downloadable, so the model can run on any provider, including Overshoot and self-hosted vLLM.