
SWE-Bench Verified (Bash Only)

A human-verified set of real-world GitHub issues that require code understanding and modification.

Frontier · Category: coding · EDI: 142.8 · Slope: 2.99

Leaderboard (12 models)
Rank  Model                       Score   Stderr
   1  Claude Opus 4.5             64.80   ±0.02
   2  GPT-5.2                     61.60   ±0.02
   3  Claude 3.7 Sonnet           52.20   ±0.02
   4  DeepSeek V3                 52.10   ±0.02
   5  o3                          43.69   ±0.02
   6  GPT-4.1                     41.00   ±0.02
   7  Grok-3 mini                 38.60   ±0.02
   8  o4-mini-2025-04-16 medium   34.60   ±0.02
   9  GPT-4.1 mini                32.80   ±0.02
  10  Qwen Plus                   28.00   ±0.02
  11  GPT-4o                      25.40   ±0.02
  12  Gemini 2.5 Pro (Jun 2025)   22.00   ±0.02
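As a minimal sketch (not an Epoch AI tool or API), the rows above can be transcribed into plain Python to make the reported uncertainty concrete. The snippet assumes the ±0.02 column is a standard error on the mean score and applies a normal approximation; the `LEADERBOARD` structure and `confidence_interval` helper are illustrative names, not part of any published tooling.

```python
# Leaderboard rows transcribed from the table above: (model, score, stderr).
LEADERBOARD = [
    ("Claude Opus 4.5", 64.80, 0.02),
    ("GPT-5.2", 61.60, 0.02),
    ("Claude 3.7 Sonnet", 52.20, 0.02),
    ("DeepSeek V3", 52.10, 0.02),
    ("o3", 43.69, 0.02),
    ("GPT-4.1", 41.00, 0.02),
    ("Grok-3 mini", 38.60, 0.02),
    ("o4-mini-2025-04-16 medium", 34.60, 0.02),
    ("GPT-4.1 mini", 32.80, 0.02),
    ("Qwen Plus", 28.00, 0.02),
    ("GPT-4o", 25.40, 0.02),
    ("Gemini 2.5 Pro (Jun 2025)", 22.00, 0.02),
]

def confidence_interval(score: float, stderr: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% CI, treating the reported ± value as a standard error."""
    return (score - z * stderr, score + z * stderr)

for rank, (model, score, stderr) in enumerate(LEADERBOARD, start=1):
    lo, hi = confidence_interval(score, stderr)
    print(f"{rank:>2}. {model:<27} {score:6.2f}  95% CI [{lo:.2f}, {hi:.2f}]")
```

Under that assumption, even the closest pair is distinguishable: Claude 3.7 Sonnet (52.20) and DeepSeek V3 (52.10) differ by 0.10 points, while the standard error of the difference is √2 × 0.02 ≈ 0.028, so the gap is roughly 3.5 standard errors and the ordering shown is stable under the reported uncertainty.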

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0