VLMs that see the screen and act on it.
- Tasks: 180
- Suites: 3
- Models: 11
Computer-use tasks run end to end from raw screen pixels: the model sees, decides, and clicks. We score whether it closes the loop, and what it costs to get there.
Computer-use agents driving real UIs from screen pixels. 60 tasks · 11 models.
Frontier chart: each line spans one model's effort levels, most efficient to the right. Reasoning models are marked with a lightbulb icon.
| # | Model | Effort | Score | Cost | | |
|---|---|---|---|---|---|---|
| 1 | Claude Opus 4.6 | max | 53% ±3% | $4.40 | 180k | 100 |
| 2 | GPT-5.4 | xhigh | 47% ±2% | $3.51 | 160.2k | 89 |
| 3 | Holo3 35B A3B | default | 45% ±3% | $0.05 | 106k | 106 |
| 4 | Claude Sonnet 4.6 | high | 42% ±3% | $3.52 | 165.6k | 92 |
| 5 | Qwen3.6 35B A3B | high | 40% ±2% | $0.06 | 165.6k | 92 |
| 6 | Gemini 3.1 Pro | high | 39% ±4% | $3.47 | 178.2k | 99 |
| 7 | GPT-5.4 mini | xhigh | 39% ±4% | $3.45 | 208.8k | 116 |
| 8 | Qwen3.6 27B | default | 33% ±3% | $0.12 | 120k | 120 |
| 9 | Gemini 3 Flash | default | 29% ±2% | $2.34 | 122k | 122 |
| 10 | Gemma 4 31B | default | 26% ±3% | $0.13 | 120k | 120 |
| 11 | Claude Haiku 4.5 | default | 23% ±3% | $2.36 | 129k | 129 |
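One way to read score against cost is a Pareto filter: keep a row only if no other row scores at least as well for no more money. The sketch below copies the score and cost values from the table; the function name `pareto_frontier` is illustrative, and treating "frontier" as a Pareto frontier over these two columns is an assumption, not something the page states. The ± margins are ignored, so near-ties should be read loosely.

```python
# Rows copied from the table: (model, effort, score in %, cost in USD).
ROWS = [
    ("Claude Opus 4.6", "max", 53, 4.40),
    ("GPT-5.4", "xhigh", 47, 3.51),
    ("Holo3 35B A3B", "default", 45, 0.05),
    ("Claude Sonnet 4.6", "high", 42, 3.52),
    ("Qwen3.6 35B A3B", "high", 40, 0.06),
    ("Gemini 3.1 Pro", "high", 39, 3.47),
    ("GPT-5.4 mini", "xhigh", 39, 3.45),
    ("Qwen3.6 27B", "default", 33, 0.12),
    ("Gemini 3 Flash", "default", 29, 2.34),
    ("Gemma 4 31B", "default", 26, 0.13),
    ("Claude Haiku 4.5", "default", 23, 2.36),
]


def pareto_frontier(rows):
    """Return rows that no other row beats on both score and cost.

    A row is dominated when some other row has score >= and cost <=,
    with at least one of the two strictly better.
    """
    frontier = []
    for name, effort, score, cost in rows:
        dominated = any(
            (s >= score and c <= cost) and (s > score or c < cost)
            for _, _, s, c in rows
        )
        if not dominated:
            frontier.append((name, effort, score, cost))
    # Sort cheapest first so the result reads left to right on a cost axis.
    return sorted(frontier, key=lambda row: row[3])


if __name__ == "__main__":
    for name, effort, score, cost in pareto_frontier(ROWS):
        print(f"{name} ({effort}): {score}% at ${cost:.2f}")
```

On these numbers the filter keeps Holo3 35B A3B, GPT-5.4, and Claude Opus 4.6; every other row is matched or beaten on score by a cheaper entry.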
Every model sees and acts on the same screen pixels through one shared agent scaffold, so the gaps here reflect the model, not the harness.
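A shared scaffold of this kind can be pictured as a fixed see-decide-act loop where only the policy (the model) changes between runs. The following is a minimal sketch under stated assumptions, not the harness used here: `Action`, `Environment`, `run_episode`, `ToyButtonEnv`, and `toy_policy` are all hypothetical names, and the action vocabulary ("click", "done") is invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Action:
    kind: str  # hypothetical vocabulary: "click" or "done"
    x: int = 0
    y: int = 0


class Environment(Protocol):
    def screenshot(self) -> bytes: ...
    def apply(self, action: Action) -> None: ...
    def task_succeeded(self) -> bool: ...


# The model under test: raw pixels in, one action out.
Policy = Callable[[bytes], Action]


def run_episode(env: Environment, policy: Policy, max_steps: int = 50) -> tuple[bool, int]:
    """Run one episode and return (success, steps_used)."""
    steps = 0
    for steps in range(1, max_steps + 1):
        action = policy(env.screenshot())  # see, then decide
        if action.kind == "done":
            break
        env.apply(action)  # act
    return env.task_succeeded(), steps


class ToyButtonEnv:
    """Stand-in UI: the task succeeds once (10, 20) has been clicked."""

    def __init__(self) -> None:
        self.clicked = False

    def screenshot(self) -> bytes:
        return b"clicked" if self.clicked else b"idle"

    def apply(self, action: Action) -> None:
        if action.kind == "click" and (action.x, action.y) == (10, 20):
            self.clicked = True

    def task_succeeded(self) -> bool:
        return self.clicked


def toy_policy(pixels: bytes) -> Action:
    """Trivial policy: click the button, then stop once the screen changes."""
    return Action("done") if pixels == b"clicked" else Action("click", 10, 20)


if __name__ == "__main__":
    print(run_episode(ToyButtonEnv(), toy_policy))
```

Because the loop, the step budget, and the success check live outside the policy, swapping in a different model changes nothing else about the run.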