VLM Intelligence Index: weight-normalized 0–100 composite of the six capability evals. Higher is better.
Capability evals: OCR & Text, Document & Chart, Scene & Spatial, Video QA, Grounding & Detection, Structured Extraction. Each reported as accuracy.
TTFT (Time to First Token): latency until the first token streams back, excluding reasoning think time. Isolates the serving stack.
Think time: wall-clock time a reasoning model spends before its first answer token. It is compute, not network, so it is constant across regions.
End-to-end (E2E): TTFT + think time + generation time for a standardized ~24-token answer. The number the real-time bar is measured against.
Real-time ready: true when E2E is under 200 ms for a given (model, provider, region), fast enough to close the loop on live video. A hard flag, not a score.
Output speed: sustained output tokens per second at batch-1.
Blended price: 3:1 input:output weighted price per 1M tokens at the reference provider.
Cost per task: weighted average USD to complete one Intelligence-Index task, accounting for reasoning models' larger output.
Reference provider: the provider from which we quote a model's headline speed, latency, and price: Overshoot for open-weight models, the vendor API for closed models.
Effort level: a reasoning model's compute setting (medium / high / xhigh / max). Higher settings buy capability at the expense of cost and latency.
Pass@1: probability a vision agent completes a task on its first attempt, reported with a confidence interval.
Open weights: model weights are publicly downloadable. These models can run on any provider, including Overshoot and self-hosted vLLM.
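
The latency and price definitions above can be sketched as a short calculation. This is a minimal illustration, not the benchmark's actual implementation. The function names, the millisecond units, and the sample numbers are hypothetical. Two readings are assumptions: generation time is taken as the 24-token answer divided by the output speed, and the 3:1 blended price is taken as (3 x input + 1 x output) / 4.

```python
ANSWER_TOKENS = 24            # standardized answer length from the E2E entry (approximate)
REAL_TIME_BUDGET_MS = 200.0   # threshold from the "Real-time ready" entry


def e2e_ms(ttft_ms, think_ms, tokens_per_s, answer_tokens=ANSWER_TOKENS):
    """E2E = TTFT + think time + time to generate the standardized answer.

    Assumption: generation time is answer_tokens / output speed.
    """
    generation_ms = (answer_tokens * 1000.0) / tokens_per_s
    return ttft_ms + think_ms + generation_ms


def real_time_ready(e2e):
    """Hard flag: True only when E2E is strictly under the 200 ms budget."""
    return e2e < REAL_TIME_BUDGET_MS


def blended_price(input_per_m, output_per_m):
    """3:1 input:output weighted price per 1M tokens.

    Assumption: a weighted average with weights 3 (input) and 1 (output).
    """
    return (3.0 * input_per_m + 1.0 * output_per_m) / 4.0


if __name__ == "__main__":
    # Hypothetical measurements for one (model, provider, region) triple.
    fast = e2e_ms(ttft_ms=60.0, think_ms=0.0, tokens_per_s=400.0)
    slow = e2e_ms(ttft_ms=60.0, think_ms=500.0, tokens_per_s=400.0)
    print(fast, real_time_ready(fast))
    print(slow, real_time_ready(slow))
    print(blended_price(0.20, 0.60))
```

Under these assumptions, think time is the term that most easily pushes a reasoning model over the real-time bar: the same serving stack passes with zero think time and fails once 500 ms of think time is added.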