
OSWorld

Operating-system interaction tasks that test computer use and automation.

Frontier
Category: agents
EDI: 146.8
Slope: 2.82

Leaderboard (4 models)

Rank  Model               Score (%)  Stderr
1     Claude Opus 4.5     66.30      n/a
2     Claude 3.7 Sonnet   35.80      n/a
3     o3                  23.00      n/a
4     Qwen2.5-Max          5.00      n/a
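If you want to work with the leaderboard programmatically, a minimal sketch in Python might hold each row as a record and derive the ranking and each model's distance from the top score. It assumes the ranking is simply by score in descending order (which matches the listed order) and omits Stderr because none is reported. The `Entry` class and `ranked_with_gap` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One leaderboard row (hypothetical structure for illustration)."""
    model: str
    score: float  # OSWorld score in percent, as listed on the leaderboard


# Scores copied from the leaderboard table; Stderr is omitted because none is reported.
LEADERBOARD = [
    Entry("Claude Opus 4.5", 66.30),
    Entry("Claude 3.7 Sonnet", 35.80),
    Entry("o3", 23.00),
    Entry("Qwen2.5-Max", 5.00),
]


def ranked_with_gap(entries):
    """Return (rank, model, score, gap to top score in percentage points), best first.

    Assumes rank is determined purely by score, highest first.
    """
    ordered = sorted(entries, key=lambda e: e.score, reverse=True)
    top = ordered[0].score
    return [
        (i + 1, e.model, e.score, round(top - e.score, 2))
        for i, e in enumerate(ordered)
    ]


if __name__ == "__main__":
    for rank, model, score, gap in ranked_with_gap(LEADERBOARD):
        print(f"{rank}. {model:<18} {score:6.2f}%  ({gap:.2f} pp behind top)")
```

The gap is reported in percentage points rather than as a ratio, since the scores are already percentages of tasks on the same benchmark.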

Data source: Epoch AI, “Data on AI Benchmarking”, published at epoch.ai.

Licensed under CC-BY 4.0

OSWorld: Top Score 66.3% - AI Benchmark | NeoSignal