SWE-Bench Verified (Bash Only)
Verified real-world GitHub issues requiring code understanding and modification
Leaderboard (12 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 64.80 | ±0.02 |
| 2 | GPT-5.2 | 61.60 | ±0.02 |
| 3 | Claude 3.7 Sonnet | 52.20 | ±0.02 |
| 4 | DeepSeek V3 | 52.10 | ±0.02 |
| 5 | o3 | 43.69 | ±0.02 |
| 6 | GPT-4.1 | 41.00 | ±0.02 |
| 7 | Grok-3 mini | 38.60 | ±0.02 |
| 8 | o4-mini-2025-04-16 medium | 34.60 | ±0.02 |
| 9 | GPT-4.1 mini | 32.80 | ±0.02 |
| 10 | Qwen Plus | 28.00 | ±0.02 |
| 11 | GPT-4o | 25.40 | ±0.02 |
| 12 | Gemini 2.5 Pro (Jun 2025) | 22.00 | ±0.02 |
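
The Score and Stderr columns can be read together as an interval estimate. A minimal sketch, assuming Stderr is the standard error of the score on the same scale as Score (the actual Epoch AI convention may differ), of turning a row into an approximate 95% confidence interval:

```python
# Illustrative only: compute approximate 95% confidence intervals from
# the leaderboard's Score and Stderr columns. The assumption that Stderr
# is a standard error on the same scale as Score is not confirmed by the
# source table.
leaderboard = [
    ("Claude Opus 4.5", 64.80, 0.02),
    ("GPT-5.2", 61.60, 0.02),
    ("Claude 3.7 Sonnet", 52.20, 0.02),
]

for model, score, stderr in leaderboard:
    # 1.96 standard errors on either side of the point estimate
    low, high = score - 1.96 * stderr, score + 1.96 * stderr
    print(f"{model}: {score:.2f} (95% CI [{low:.2f}, {high:.2f}])")
```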
Data source: Epoch AI, “Data on AI Benchmarking”, published at epoch.ai.
Licensed under CC BY 4.0.