SimpleBench

Simple reasoning tasks designed to test fundamental capabilities

Frontier
Category: reasoning
EDI: 149.2
Slope: 2.73

Leaderboard

(27 models)
Rank  Model                               Score (%)  Stderr
1     Gemini 3 Pro                        76.40
2     Gemini 2.5 Pro (Jun 2025)           62.40
3     Claude Opus 4.5                     62.00
4     GPT-5.2                             61.60
5     Grok 4                              60.50
6     o3                                  53.10
7     Claude 3.7 Sonnet                   46.40
8     o1                                  41.70
9     DeepSeek V3                         40.80
10    o4-mini (high)                      38.70
11    Grok-3 mini                         36.10
12    GPT-4.1                             34.50
13    Qwen3-235B-A22B                     31.00
14    DeepSeek R1                         30.90
15    Gemini 2.0 Flash Thinking Exp       30.70
16    Llama-4-Maverick-17B-128E-Instruct  27.70
17    Gemini 1.5 Flash                    27.10
18    kimi-k2-thinking (official)         26.30
19    GPT-4 Turbo                         25.10
20    Claude 3 Opus                       23.50
21    Llama 3.1 405B                      23.00
22    Mistral Large                       22.50
23    GPT-OSS 120B                        22.10
24    Llama 3.3 70B                       19.90
25    GPT-4o                              17.80
26    c4ai-command-a-03-2025              17.40
27    gpt-4o-mini-2024-07-18              10.70

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0

SimpleBench: Top Score 76.4% - AI Benchmark | NeoSignal