Aider polyglot
Code-editing tasks across multiple programming languages, evaluated with the Aider framework.
Leaderboard (35 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | GPT-5.2 | 88.00 | — |
| 2 | o3 | 84.90 | — |
| 3 | Gemini 2.5 Pro (Jun 2025) | 83.10 | — |
| 4 | Grok 4 | 79.60 | — |
| 5 | DeepSeek V3 | 74.20 | — |
| 6 | o4-mini (high) | 72.00 | — |
| 7 | claude-opus-4-20250514 32K | 72.00 | — |
| 8 | Claude Opus 4.5 | 70.70 | — |
| 9 | Claude 3.7 Sonnet | 64.90 | — |
| 10 | o1 | 61.70 | — |
| 11 | Qwen3-235B-A22B | 59.60 | — |
| 12 | kimi-k2-thinking (official) | 59.10 | — |
| 13 | DeepSeek R1 | 56.90 | — |
| 14 | Grok-3 mini | 53.30 | — |
| 15 | GPT-4.1 | 52.40 | — |
| 16 | chatgpt-4o-03-27-2025 | 45.30 | — |
| 17 | GPT-OSS 120B | 41.80 | — |
| 18 | Qwen3-32B | 40.00 | — |
| 19 | Gemini 3 Pro | 38.20 | — |
| 20 | Gemini 2.0 Pro Exp (Feb 2025) | 35.60 | — |
| 21 | GPT-4.1 mini | 32.40 | — |
| 22 | Claude 3.5 Haiku | 28.00 | — |
| 23 | chatgpt-4o-01-29-2025 | 27.10 | — |
| 24 | GPT-4o | 23.10 | — |
| 25 | Qwen2.5-Max | 21.80 | — |
| 26 | QwQ-32B | 20.90 | — |
| 27 | Gemini 2.0 Flash Thinking Exp | 18.20 | — |
| 28 | DeepSeek-V2.5 | 17.80 | — |
| 29 | Qwen2.5-Coder-32B-Instruct | 16.40 | — |
| 30 | Llama-4-Maverick-17B-128E-Instruct | 15.60 | — |
| 31 | yi-lightning | 12.90 | — |
| 32 | c4ai-command-a-03-2025 | 12.00 | — |
| 33 | Codestral | 11.10 | — |
| 34 | Gemma 3 27B | 4.90 | — |
| 35 | gpt-4o-mini-2024-07-18 | 3.60 | — |
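For working with this data programmatically, the markdown table above can be parsed into structured records. A minimal sketch, assuming the table text is held in a string (the `TABLE` sample here uses only the first two rows for illustration):

```python
# Parse a markdown leaderboard table into (rank, model, score) tuples.
TABLE = """\
| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | GPT-5.2 | 88.00 | — |
| 2 | o3 | 84.90 | — |
"""

def parse_leaderboard(md: str):
    records = []
    for line in md.strip().splitlines():
        # Split each row on "|" after trimming the outer pipes.
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the header row and the |---| separator row.
        if not cells or cells[0] == "Rank" or set(cells[0]) <= {"-"}:
            continue
        records.append((int(cells[0]), cells[1], float(cells[2])))
    return records

print(parse_leaderboard(TABLE))
# → [(1, 'GPT-5.2', 88.0), (2, 'o3', 84.9)]
```

The Stderr column is dropped since it is empty ("—") for every row in this snapshot.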
Data source: Epoch AI, "Data on AI Benchmarking", published at epoch.ai. Licensed under CC BY 4.0.