DeepResearch Bench
Multi-document synthesis and research tasks testing deep research capabilities
Leaderboard (7 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 52.60 | — |
| 2 | GPT-5.2 | 51.00 | — |
| 3 | Grok 4 | 47.90 | — |
| 4 | o3 | 46.60 | — |
| 5 | Claude 3.7 Sonnet | 43.60 | — |
| 6 | Gemini 2.5 Pro (Jun 2025) | 42.80 | — |
| 7 | DeepSeek V3 | 35.10 | — |

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai
Licensed under CC-BY 4.0