ARC-AGI
Abstraction and Reasoning Corpus: novel visual reasoning tasks testing general intelligence
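Each ARC task is distributed as JSON with `train` and `test` lists of input/output grid pairs, where cells are integers 0-9 denoting colors; a model must infer the transformation from the training pairs and produce exact-match outputs for the test inputs. Below is a minimal sketch of that format and ARC-style exact-match scoring. The toy task and the `solve` function are hypothetical, invented purely for illustration; they are not part of the ARC dataset.

```python
# A toy task in the public ARC JSON shape: "train" and "test" lists of
# {"input": grid, "output": grid} pairs, grids being lists of int rows.
# This particular task (invert binary cells) is made up for this example.
example_task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 0]], "output": [[0, 0], [1, 1]]},
    ],
}

def solve(grid):
    """Hypothetical solver for this toy task: flip each 0/1 cell."""
    return [[1 - cell for cell in row] for row in grid]

def score_task(task, solver):
    """Fraction of test pairs solved with an exact grid match (ARC-style)."""
    pairs = task["test"]
    correct = sum(solver(p["input"]) == p["output"] for p in pairs)
    return correct / len(pairs)

print(score_task(example_task, solve))  # 1.0
```

A benchmark score like those in the leaderboard below is this exact-match accuracy averaged over the evaluation set, reported as a percentage.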
Leaderboard (17 models)

| Rank | Model | Score (%) | Stderr |
|---|---|---|---|
| 1 | GPT-5.2 | 86.20 | — |
| 2 | Claude Opus 4.5 | 80.00 | — |
| 3 | Gemini 3 Pro | 75.00 | — |
| 4 | o3 | 60.80 | — |
| 5 | o4-mini (high) | 58.70 | — |
| 6 | o4-mini-2025-04-16 medium | 41.80 | — |
| 7 | Gemini 2.5 Pro (Jun 2025) | 33.30 | — |
| 8 | o1 | 30.70 | — |
| 9 | Claude 3.7 Sonnet | 28.60 | — |
| 10 | DeepSeek V3 | 21.20 | — |
| 11 | Grok-3 mini | 16.50 | — |
| 12 | DeepSeek R1 | 15.80 | — |
| 13 | GPT-4.1 | 10.30 | — |
| 14 | GPT-4o | 4.50 | — |
| 15 | Llama-4-Maverick-17B-128E-Instruct | 4.40 | — |
| 16 | GPT-4.1 mini | 3.50 | — |
| 17 | Llama 4 Scout | 0.50 | — |
Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai
Licensed under CC-BY 4.0