SimpleBench

Simple reasoning tasks designed to test fundamental capabilities

Frontier

Category: reasoning

EDI: 149.2

Slope: 2.73

Leaderboard (27 models)

Rank	Model	Score (%)	Stderr
1	Gemini 3 Pro	76.40	—
2	Gemini 2.5 Pro (Jun 2025)	62.40	—
3	Claude Opus 4.5	62.00	—
4	GPT-5.2	61.60	—
5	Grok 4	60.50	—
6	o3	53.10	—
7	Claude 3.7 Sonnet	46.40	—
8	o1	41.70	—
9	DeepSeek V3	40.80	—
10	o4-mini (high)	38.70	—
11	Grok-3 mini	36.10	—
12	GPT-4.1	34.50	—
13	Qwen3-235B-A22B	31.00	—
14	DeepSeek R1	30.90	—
15	Gemini 2.0 Flash Thinking Exp	30.70	—
16	Llama-4-Maverick-17B-128E-Instruct	27.70	—
17	Gemini 1.5 Flash	27.10	—
18	kimi-k2-thinking (official)	26.30	—
19	GPT-4 Turbo	25.10	—
20	Claude 3 Opus	23.50	—
21	Llama 3.1 405B	23.00	—
22	Mistral Large	22.50	—
23	GPT-OSS 120B	22.10	—
24	Llama 3.3 70B	19.90	—
25	GPT-4o	17.80	—
26	c4ai-command-a-03-2025	17.40	—
27	gpt-4o-mini-2024-07-18	10.70	—

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0