MATH level 5

Level 5 (hardest) problems from the MATH dataset requiring advanced mathematical reasoning

Frontier
Category: mathematics
EDI: 127.8
Slope: 4.14

Leaderboard

(39 models)
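| Rank | Model | Score (%) | Stderr |
|------|-------|-----------|--------|
| 1 | GPT-5.2 | 98.13 | ±0.00 |
| 2 | o4-mini (high) | 97.83 | ±0.00 |
| 3 | o3 | 97.77 | ±0.00 |
| 4 | Claude Opus 4.5 | 97.73 | ±0.00 |
| 5 | Qwen3-Max-Instruct | 97.13 | ±0.00 |
| 6 | DeepSeek V3 | 96.64 | ±0.00 |
| 7 | Claude Haiku 4.5 | 96.36 | ±0.01 |
| 8 | Gemini 2.5 Pro (Jun 2025) | 95.90 | ±0.00 |
| 9 | o1 | 94.71 | ±0.01 |
| 10 | DeepSeek R1 | 93.05 | ±0.01 |
| 11 | Claude 3.7 Sonnet | 91.16 | ±0.01 |
| 12 | Grok-3 mini | 90.94 | ±0.01 |
| 13 | GPT-4.1 mini | 87.29 | ±0.01 |
| 14 | Gemini 2.0 Pro Exp (Feb 2025) | 83.46 | ±0.01 |
| 15 | GPT-4.1 | 83.01 | ±0.01 |
| 16 | Mistral Large | 81.63 | ±0.01 |
| 17 | Gemma 3 27B | 74.04 | ±0.01 |
| 18 | Llama 4 Maverick (FP8) | 73.02 | ±0.01 |
| 19 | Gemini 1.5 Flash | 70.39 | ±0.01 |
| 20 | Qwen3-235B-A22B | 68.86 | ±0.01 |
| 21 | Qwen2.5-Max | 67.18 | ±0.01 |
| 22 | Qwen Plus | 65.28 | ±0.01 |
| 23 | Phi-4 | 64.94 | ±0.01 |
| 24 | Grok 4 | 63.52 | ±0.01 |
| 25 | Llama 4 Scout | 62.27 | ±0.01 |
| 26 | GPT-4o | 53.28 | ±0.01 |
| 27 | gpt-4o-mini-2024-07-18 | 52.63 | ±0.01 |
| 28 | Llama 3.1 405B | 49.77 | ±0.01 |
| 29 | GPT-4 Turbo | 46.73 | ±0.01 |
| 30 | Claude 3.5 Haiku | 46.36 | ±0.01 |
| 31 | Llama 3.3 70B | 41.60 | ±0.01 |
| 32 | Claude 3 Opus | 37.48 | ±0.01 |
| 33 | Yi-6B | 25.48 | ±0.01 |
| 34 | Meta-Llama-3-8B-Instruct | 22.55 | ±0.01 |
| 35 | Phi-3-medium-128k-instruct | 17.56 | ±0.01 |
| 36 | gpt-3.5-turbo-1106 | 15.89 | ±0.01 |
| 37 | Mistral-7B-v0.1 | 14.94 | ±0.01 |
| 38 | Mixtral-8x7B-v0.1 | 9.29 | ±0.01 |
| 39 | Llama-2-7b | 3.29 | ±0.00 |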
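
The page does not say how the Stderr column is computed. As a rough illustration only, the sketch below assumes it is a binomial standard error on the pass rate, expressed as a fraction. That reading is consistent with the values shrinking near 0% and 100%, but the source does not confirm it. The function name `binomial_stderr` and the problem count `N_PROBLEMS` are hypothetical placeholders, not values taken from the source.

```python
import math


def binomial_stderr(accuracy: float, n_problems: int) -> float:
    """Standard error of an accuracy measured on n independent pass/fail problems.

    Assumption (not confirmed by the source): each problem is an independent
    Bernoulli trial, so stderr = sqrt(p * (1 - p) / n).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


# Hypothetical problem count, chosen only for illustration.
N_PROBLEMS = 1000

# A few scores copied from the leaderboard (in percent).
for model, score_pct in [("GPT-5.2", 98.13), ("GPT-4o", 53.28), ("Llama-2-7b", 3.29)]:
    stderr = binomial_stderr(score_pct / 100.0, N_PROBLEMS)
    print(f"{model}: {score_pct:.2f}% with stderr {stderr:.4f}")
```

Under this assumption the standard error is largest near a 50% score and falls toward zero at either extreme, which would explain why both the top and bottom rows round to ±0.00.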

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0

MATH level 5: Top Score 98.1% - AI Benchmark | NeoSignal