
GPQA Diamond

Graduate-level science questions in physics, chemistry, and biology requiring expert knowledge

Frontier · Category: reasoning · EDI: 135.7 · Slope: 2.71

Leaderboard (45 models)
Rank  Model                          Score  Stderr
   1  Gemini 3 Pro                   92.61  ±0.02
   2  GPT-5.2                        91.40  ±0.02
   3  Grok 4                         87.00  ±0.02
   4  Claude Opus 4.5                86.05  ±0.02
   5  Gemini 2.5 Pro (Jun 2025)      85.29  ±0.02
   6  kimi-k2-thinking (official)    84.22  ±0.02
   7  DeepSeek V3                    83.42  ±0.02
   8  Claude Sonnet 4.5              82.32  ±0.03
   9  o3                             81.82  ±0.02
  10  Qwen 3 235B                    80.05  ±0.03
  11  Claude 3.7 Sonnet              79.73  ±0.03
  12  o4-mini (high)                 79.61  ±0.02
  13  o1                             76.77  ±0.03
  14  Grok-3 mini                    76.26  ±0.03
  15  GPT-OSS 120B                   75.76  ±0.03
  16  Qwen3-Max-Instruct             72.60  ±0.03
  17  DeepSeek R1                    71.72  ±0.03
  18  Claude Haiku 4.5               71.21  ±0.03
  19  Qwen3-235B-A22B                70.71  ±0.03
  20  GPT-4.1                        68.69  ±0.03
  21  Llama 4 Maverick (FP8)         66.98  ±0.03
  22  GPT-4.1 mini                   65.85  ±0.03
  23  Gemini 2.0 Pro Exp (Feb 2025)  65.66  ±0.03
  24  Qwen Plus                      65.40  ±0.03
  25  Mistral Large                  59.53  ±0.03
  26  Gemini 1.5 Flash               57.23  ±0.03
  27  Gemini 2.0 Flash Thinking Exp  57.07  ±0.04
  28  Qwen2.5-Max                    56.12  ±0.03
  29  Phi-4                          56.06  ±0.03
  30  Llama 4 Scout                  51.83  ±0.03
  31  Llama 3.1 405B                 50.92  ±0.03
  32  GPT-4o                         49.21  ±0.03
  33  Gemma 3 27B                    48.86  ±0.03
  34  Llama 3.3 70B                  47.44  ±0.03
  35  Claude 3 Opus                  47.16  ±0.03
  36  GPT-4 Turbo                    46.59  ±0.03
  37  Meta-Llama-3-8B-Instruct       40.56  ±0.03
  38  Claude 3.5 Haiku               38.13  ±0.03
  39  gpt-4o-mini-2024-07-18         37.72  ±0.02
  40  Yi-6B                          31.98  ±0.02
  41  Mixtral-8x7B-v0.1              30.59  ±0.02
  42  gpt-3.5-turbo-1106             28.03  ±0.02
  43  Phi-3-medium-128k-instruct     27.59  ±0.02
  44  Mistral-7B-v0.1                27.15  ±0.02
  45  Llama-2-7b                     26.33  ±0.02
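The Stderr values (reported on the 0–1 scale, while Score is a percentage) are broadly consistent with the standard error of mean accuracy over GPQA Diamond's 198 questions. A minimal sketch of that calculation, assuming per-question 0/1 grading; the helper name and example tally are hypothetical, and Epoch AI's actual pipeline may aggregate differently (e.g., averaging multiple samples per question):

```python
import math

def score_with_stderr(correct: list[int]) -> tuple[float, float]:
    """Mean accuracy and its standard error over per-question 0/1 results."""
    n = len(correct)
    p = sum(correct) / n
    # Standard error of the mean for Bernoulli outcomes: sqrt(p * (1 - p) / n).
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical run: a model answering 183 of GPQA Diamond's 198 questions.
results = [1] * 183 + [0] * 15
p, se = score_with_stderr(results)
print(f"{100 * p:.2f} ±{se:.2f}")  # -> 92.42 ±0.02
```

Note that for scores near 50% this single-pass estimate gives roughly ±0.036, slightly above the ±0.03 listed for mid-table models, which would be expected if scores are averaged over several runs; the exact methodology is Epoch AI's.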

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai.

Licensed under CC BY 4.0.