OSWorld
Operating system interaction tasks testing computer use and automation
Leaderboard
(4 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 66.30 | — |
| 2 | Claude 3.7 Sonnet | 35.80 | — |
| 3 | o3 | 23.00 | — |
| 4 | Qwen2.5-Max | 5.00 | — |
Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai
Licensed under CC-BY 4.0