VLMs that see the screen and act on it.

Tasks: 180
Suites: 3
Models: 11

Computer-use tasks run end to end from raw screen pixels: the model sees, decides, and clicks. We score whether it closes the loop, and what it costs to get there.
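The see, decide, click loop described above can be sketched in a few lines. This is a toy illustration, not the benchmark's scaffold: `ToyScreen`, `Click`, `run_episode`, and `oracle_policy` are hypothetical names, and the "screen" simply hands back the target coordinates rather than real pixels.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Click:
    """A single pointer action at screen coordinates (hypothetical type)."""
    x: int
    y: int


class ToyScreen:
    """Hypothetical stand-in for a real UI: the task passes once the target is clicked."""

    def __init__(self, target: Tuple[int, int]):
        self.target = target
        self.done = False

    def screenshot(self) -> Tuple[int, int]:
        # A real harness would return raw pixels; this toy returns the target location.
        return self.target

    def click(self, action: Click) -> None:
        if (action.x, action.y) == self.target:
            self.done = True


def run_episode(env: ToyScreen,
                policy: Callable[[Tuple[int, int]], Click],
                max_steps: int = 10) -> Tuple[bool, int]:
    """Run one task and return (passed, steps_used)."""
    steps = 0
    while steps < max_steps and not env.done:
        observation = env.screenshot()  # see
        action = policy(observation)    # decide
        env.click(action)               # click
        steps += 1
    return env.done, steps


def oracle_policy(observation: Tuple[int, int]) -> Click:
    """Toy policy that always clicks the right spot; a real model would infer it from pixels."""
    return Click(*observation)


passed, steps = run_episode(ToyScreen((120, 48)), oracle_policy)
print(passed, steps)
```

Whether an episode ends with `passed` set is the "closes the loop" part of the score; the step count is one rough proxy for what it costs to get there.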

Computer-use agents driving real UIs from screen pixels. 60 tasks · 11 models.

Chart: Overshoot-CUA score (frontier). Each line spans one model's effort levels, most efficient to the right.

Chart: Pass@1 by model. Reasoning models are indicated by a lightbulb icon.

#
1	Claude Opus 4.6	max	53% ±3%	$4.40	180k	100
2	GPT-5.4	xhigh	47% ±2%	$3.51	160.2k	89
3	Holo3 35B A3B	default	45% ±3%	$0.05	106k	106
4	Claude Sonnet 4.6	high	42% ±3%	$3.52	165.6k	92
5	Qwen3.6 35B A3B	high	40% ±2%	$0.06	165.6k	92
6	Gemini 3.1 Pro	high	39% ±4%	$3.47	178.2k	99
7	GPT-5.4 mini	xhigh	39% ±4%	$3.45	208.8k	116
8	Qwen3.6 27B	default	33% ±3%	$0.12	120k	120
9	Gemini 3 Flash	default	29% ±2%	$2.34	122k	122
10	Gemma 4 31B	default	26% ±3%	$0.13	120k	120
11	Claude Haiku 4.5	default	23% ±3%	$2.36	129k	129
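To read the score against the cost, one option is to pick out the models that no cheaper entry matches or beats. The sketch below does that with the rows above. It assumes the percentage column is the score and the dollar column is the cost; the page does not label these columns, and `ROWS` and `pareto_frontier` are illustrative names.

```python
# Rows copied from the table above: (model, score in percent, cost in USD).
# Reading the columns as "score" and "cost" is an assumption.
ROWS = [
    ("Claude Opus 4.6", 53, 4.40),
    ("GPT-5.4", 47, 3.51),
    ("Holo3 35B A3B", 45, 0.05),
    ("Claude Sonnet 4.6", 42, 3.52),
    ("Qwen3.6 35B A3B", 40, 0.06),
    ("Gemini 3.1 Pro", 39, 3.47),
    ("GPT-5.4 mini", 39, 3.45),
    ("Qwen3.6 27B", 33, 0.12),
    ("Gemini 3 Flash", 29, 2.34),
    ("Gemma 4 31B", 26, 0.13),
    ("Claude Haiku 4.5", 23, 2.36),
]


def pareto_frontier(rows):
    """Keep a row only if it scores strictly higher than every cheaper row.

    Rows with equal cost are visited highest score first, so only the
    best of them can be kept.
    """
    frontier = []
    best_score = float("-inf")
    for name, score, cost in sorted(rows, key=lambda r: (r[2], -r[1])):
        if score > best_score:
            frontier.append((name, score, cost))
            best_score = score
    return frontier


for name, score, cost in pareto_frontier(ROWS):
    print(f"{name}: {score}% at ${cost:.2f}")
```

On these rows the sweep keeps Holo3 35B A3B, GPT-5.4, and Claude Opus 4.6; every other entry is matched or beaten on score by something cheaper.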

Every model sees and acts on the same screen pixels through one shared agent scaffold, so the gaps here reflect the model, not the harness. Read how we run the agent track.