GSM8K

Grade school math word problems requiring multi-step arithmetic reasoning

Frontier
Category: mathematics
EDI: 105.0
Slope: 2.17

Leaderboard

(20 models)
Rank  Model                         Score  Stderr
1     DeepSeek V3                   94.50
2     Qwen2.5-Max                   94.20
3     Qwen2.5-Coder-32B-Instruct    93.00
4     GPT-4.1                       92.00
5     gpt-4o-mini-2024-07-18        91.30
6     Phi-3-mini-4k-instruct        88.70
7     Claude Opus 4.5               86.70
8     Gemma 3 27B                   84.90
9     Mistral Large                 84.20
10    Gemini 1.5 Flash              82.40
11    Llama 3.1 405B                82.40
12    Mixtral-8x7B-v0.1             74.40
13    Llama-2-70b-hf                69.60
14    Qwen 3 235B                   61.30
15    Llama-2-7b                    58.70
16    gpt-3.5-turbo-1106            57.77
17    falcon-180B                   54.40
18    Mistral-7B-v0.1               54.40
19    gemma-7b                      46.40
20    Yi-6B                         44.90

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0