
VLMs that see the screen and act on it.

180 tasks · 3 suites · 11 models

Computer-use tasks run end to end from raw screen pixels: the model sees, decides, and clicks. We score whether it closes the loop, and what it costs to get there.

Computer-use agents driving real UIs from screen pixels. 60 tasks · 11 models.

Overshoot-CUA score

Figure: Overshoot-CUA score. 11 models, each line spanning one model's effort levels, plotted by Avg cost per task (x-axis, $5.00 to $0) and Pass@1 (y-axis, 0% to 100%), most efficient to the right. Highest Pass@1: Claude Opus 4.6 at 53%.

Legend: Claude Opus 4.6, GPT-5.4, Holo3 35B A3B, Claude Sonnet 4.6, Qwen3.6 35B A3B, GPT-5.4 mini, Qwen3.6 27B, Gemini 3 Flash, Gemma 4 31B, Claude Haiku 4.5.
Pass@1 by model

Figure: Pass@1 by model, 11 bars. Highest: Claude Opus 4.6 at 53%.

Claude Opus 4.6: 53% ±3%
GPT-5.4: 47% ±2%
Holo3 35B A3B: 45% ±3%
Claude Sonnet 4.6: 42% ±3%
Qwen3.6 35B A3B: 40% ±2%
Gemini 3.1 Pro: 39% ±4%
GPT-5.4 mini: 39% ±4%
Qwen3.6 27B: 33% ±3%
Gemini 3 Flash: 29% ±2%
Gemma 4 31B: 26% ±3%
Claude Haiku 4.5: 23% ±3%

Reasoning models are indicated by a lightbulb icon

| # | Model | Effort | Pass@1 | Avg cost per task | | |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | max | 53% ±3% | $4.40 | 180k | 100 |
| 2 | GPT-5.4 | xhigh | 47% ±2% | $3.51 | 160.2k | 89 |
| 3 | Holo3 35B A3B | default | 45% ±3% | $0.05 | 106k | 106 |
| 4 | Claude Sonnet 4.6 | high | 42% ±3% | $3.52 | 165.6k | 92 |
| 5 | Qwen3.6 35B A3B | high | 40% ±2% | $0.06 | 165.6k | 92 |
| 6 | Gemini 3.1 Pro | high | 39% ±4% | $3.47 | 178.2k | 99 |
| 7 | GPT-5.4 mini | xhigh | 39% ±4% | $3.45 | 208.8k | 116 |
| 8 | Qwen3.6 27B | default | 33% ±3% | $0.12 | 120k | 120 |
| 9 | Gemini 3 Flash | default | 29% ±2% | $2.34 | 122k | 122 |
| 10 | Gemma 4 31B | default | 26% ±3% | $0.13 | 120k | 120 |
| 11 | Claude Haiku 4.5 | default | 23% ±3% | $2.36 | 129k | 129 |
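
One way to read the cost-versus-Pass@1 trade-off in these rows is to compute which models are not beaten on both axes at once. The sketch below is illustrative only: it uses the single listed effort level per model, ignores the ± intervals, and the names `Row`, `ROWS`, and `pareto_frontier` are hypothetical. It is not necessarily how the page draws its frontier.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    cost_usd: float   # average cost per task, from the table rows
    pass_at_1: float  # Pass@1 in percent, from the table rows


# Values copied from the table rows (one effort level per model).
ROWS = [
    Row("Claude Opus 4.6", 4.40, 53),
    Row("GPT-5.4", 3.51, 47),
    Row("Holo3 35B A3B", 0.05, 45),
    Row("Claude Sonnet 4.6", 3.52, 42),
    Row("Qwen3.6 35B A3B", 0.06, 40),
    Row("Gemini 3.1 Pro", 3.47, 39),
    Row("GPT-5.4 mini", 3.45, 39),
    Row("Qwen3.6 27B", 0.12, 33),
    Row("Gemini 3 Flash", 2.34, 29),
    Row("Gemma 4 31B", 0.13, 26),
    Row("Claude Haiku 4.5", 2.36, 23),
]


def pareto_frontier(rows: list[Row]) -> list[Row]:
    """Return rows that no cheaper-or-equal row matches or beats on Pass@1."""
    frontier: list[Row] = []
    best = float("-inf")
    # Walk from cheapest to most expensive; on cost ties, higher Pass@1 first.
    for row in sorted(rows, key=lambda r: (r.cost_usd, -r.pass_at_1)):
        if row.pass_at_1 > best:
            frontier.append(row)
            best = row.pass_at_1
    return frontier


if __name__ == "__main__":
    for row in pareto_frontier(ROWS):
        print(f"{row.model}: ${row.cost_usd:.2f} per task, {row.pass_at_1}% Pass@1")
```

Under these assumptions, three models remain on the frontier: Holo3 35B A3B, GPT-5.4, and Claude Opus 4.6. Several of the gaps are within the reported ± ranges, so treat this as a rough ordering.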

Every model sees and acts on the same screen pixels through one shared agent scaffold, so the gaps here come from the model, not the harness. Read how we run the agent track.