The Agent Company
Multi-step workplace automation tasks testing autonomous agent capabilities
Leaderboard (8 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | Claude 3.7 Sonnet | 52.73 | — |
| 2 | Claude Opus 4.5 | 46.45 | — |
| 3 | Gemini 2.5 Pro (Jun 2025) | 39.85 | — |
| 4 | DeepSeek V3 | 29.91 | — |
| 5 | Qwen2.5-Max | 23.99 | — |
| 6 | Llama 3.1 405B | 22.90 | — |
| 7 | Gemini 1.5 Flash | 22.10 | — |
| 8 | GPT-4o | 14.55 | — |

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai. Licensed under CC-BY 4.0.