
OTIS Mock AIME 2024-2025

Competition-level problems from the OTIS Mock AIME, evaluating olympiad-style mathematical reasoning

Frontier
Category: mathematics
EDI: 137.9
Slope: 5.30
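EDI and Slope are summary parameters of the fit Epoch AI reports for this benchmark; their exact definitions are not given on this page. Purely as an illustration, if they parameterize a logistic curve of expected score against a model capability index (EDI as the index value where the predicted score crosses 50, Slope as the scale of the rise), a sketch would look like this; the functional form, parameter meanings, and function name are assumptions, not Epoch AI's documented method:

```python
import numpy as np

def predicted_score(capability_index, edi=137.9, slope=5.30):
    """Hypothetical logistic fit: expected benchmark score (0-100) as a
    function of a model capability index. Assumes `edi` is the capability
    level at which the predicted score crosses 50 and `slope` is the scale
    of the rise around that point; neither reading is confirmed by the page."""
    return 100.0 / (1.0 + np.exp(-(capability_index - edi) / slope))

# Under the assumed form, scores rise from ~18 to ~91 over roughly
# twenty capability-index points centered on EDI.
for c in (130, 137.9, 145, 150):
    print(f"capability={c:>6}: predicted score = {predicted_score(c):5.1f}")
```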

Leaderboard (38 models)
| Rank | Model | Score | Stderr |
|------|-------|-------|--------|
| 1 | GPT-5.2 | 96.11 | ±0.03 |
| 2 | Gemini 3 Pro | 92.78 | ±0.04 |
| 3 | GPT-OSS 120B | 88.89 | ±0.04 |
| 4 | DeepSeek V3 | 87.82 | ±0.04 |
| 5 | Qwen 3 235B | 86.67 | ±0.05 |
| 6 | Claude Opus 4.5 | 86.11 | ±0.04 |
| 7 | Gemini 2.5 Pro (Jun 2025) | 84.72 | ±0.05 |
| 8 | Grok 4 | 84.00 | ±0.05 |
| 9 | o3 | 83.89 | ±0.04 |
| 10 | kimi-k2-thinking (official) | 83.06 | ±0.05 |
| 11 | o4-mini (high) | 81.67 | ±0.05 |
| 12 | Grok-3 mini | 77.78 | ±0.06 |
| 13 | Claude Sonnet 4.5 | 77.78 | ±0.06 |
| 14 | Qwen3-Max-Instruct | 73.33 | ±0.06 |
| 15 | o1 | 73.33 | ±0.07 |
| 16 | Claude Haiku 4.5 | 66.67 | ±0.07 |
| 17 | Gemini 2.0 Flash Thinking Exp | 57.78 | ±0.07 |
| 18 | Claude 3.7 Sonnet | 57.78 | ±0.07 |
| 19 | DeepSeek R1 | 53.33 | ±0.08 |
| 20 | GPT-4.1 mini | 44.72 | ±0.06 |
| 21 | GPT-4.1 | 38.33 | ±0.06 |
| 22 | Mistral Large | 32.22 | ±0.06 |
| 23 | Gemini 1.5 Flash | 23.06 | ±0.05 |
| 24 | Llama 4 Maverick (FP8) | 20.56 | ±0.05 |
| 25 | Gemma 3 27B | 19.72 | ±0.05 |
| 26 | Qwen Plus | 17.78 | ±0.04 |
| 27 | Qwen2.5-Max | 16.11 | ±0.04 |
| 28 | Phi-4 | 13.75 | ±0.04 |
| 29 | Llama 3.1 405B | 9.72 | ±0.03 |
| 30 | Llama 4 Scout | 7.78 | ±0.03 |
| 31 | gpt-4o-mini-2024-07-18 | 6.94 | ±0.03 |
| 32 | GPT-4 Turbo | 6.67 | ±0.02 |
| 33 | GPT-4o | 6.39 | ±0.03 |
| 34 | Llama 3.3 70B | 5.14 | ±0.02 |
| 35 | Claude 3 Opus | 4.72 | ±0.02 |
| 36 | Claude 3.5 Haiku | 4.31 | ±0.02 |
| 37 | Meta-Llama-3-8B-Instruct | 4.31 | ±0.02 |
| 38 | Llama-2-7b | 0.00 | |
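The Stderr column is consistent with a standard error of the mean over per-problem results. Many scores are exact multiples of 1/45 (e.g., 88.89 = 40/45, 77.78 = 35/45), suggesting roughly 45 problems, i.e., three 15-question AIME-style exams, with non-multiple scores likely averages over repeated runs; this is an inference from the values, not documented on the page. A minimal sketch of that kind of aggregation, with a hypothetical helper name:

```python
import math

def mean_and_sem(results):
    """Mean score and standard error of the mean over per-problem results.
    `results` holds per-problem scores (e.g., 1.0/0.0 for solved/unsolved,
    possibly averaged over repeated attempts). This is a generic SEM sketch,
    not Epoch AI's exact aggregation."""
    n = len(results)
    mean = sum(results) / n
    var = sum((r - mean) ** 2 for r in results) / (n - 1)  # sample variance
    return mean, math.sqrt(var / n)  # SEM = sqrt(var / n)

# Toy example: 45 problems, 40 solved.
results = [1.0] * 40 + [0.0] * 5
mean, sem = mean_and_sem(results)
print(f"score = {mean:.4f} +/- {sem:.4f}")  # 0.8889 +/- 0.0474
```

On this toy input the mean matches GPT-OSS 120B's 88.89 and the SEM (~0.047) matches the scale of its ±0.04, which would imply Stderr is reported on a 0-1 scale while scores are percentages; again an inference, not a documented convention.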

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC BY 4.0