MMLU

Massive Multitask Language Understanding across 57 subjects

Frontier
Category: language
EDI: 109.2
Slope: 1.49
Leaderboard

(35 models)
Rank  Model                       Score  Stderr
 1    GPT-4o                      88.1   0
 2    Claude 3.7 Sonnet           87.3   0
 3    DeepSeek V3                 87.2   0
 4    Gemini 1.5 Flash            86.9   0
 5    GPT-4.1                     86.4   0
 6    Llama 3.3 70B               86.3   0
 7    Qwen2.5-Max                 85.3   0
 8    Qwen 2.5 72B                85.0   0
 9    Phi-4                       84.8   0
10    Claude 3 Opus               84.6   0
11    Llama 3.1 405B              84.5   0
12    gpt-4o-mini-2024-07-18      81.8   0
13    GPT-4 Turbo                 81.3   0
14    Mistral Large               80.0   0
15    Gemini 2.5 Pro (Jun 2025)   79.7   0
16    Meta-Llama-3-8B-Instruct    79.3   0
17    yi-lightning                79.3   0
18    Phi-3-medium-128k-instruct  78.0   0
19    Mixtral-8x7B-v0.1           77.8   0
20    Gemma 3 27B                 75.7   0
21    Phi-3-small-8k-instruct     75.7   0
22    Claude 3.5 Haiku            74.3   0
23    Claude Opus 4.5             73.4   0
24    gpt-3.5-turbo-1106          71.4   0
25    falcon-180B                 70.6   0
26    Llama-2-70b-hf              69.9   0
27    c4ai-command-a-03-2025      69.4   0
28    Phi-3-mini-4k-instruct      68.8   0
29    Qwen3-Max-Instruct          68.6   0
30    Yi-6B                       68.4   0
31    Qwen 3 235B                 66.3   0
32    gemma-7b                    66.1   0
33    Llama-2-7b                  62.6   0
34    Mistral-7B-v0.1             62.5   0
35    GPT-OSS 120B                25.7   0
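Ranks in a leaderboard like this are just descending order by score. A minimal Python sketch of that structure, using a handful of entries from the table above (the `Entry` record type and field names are illustrative, not Epoch AI's data schema):

```python
# Illustrative sketch: a few leaderboard rows as plain Python records,
# with rank recovered by sorting on score. Names and scores come from
# the table above; the data structure itself is a made-up example.
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    score: float   # MMLU accuracy, percent
    stderr: float = 0.0

entries = [
    Entry("DeepSeek V3", 87.2),
    Entry("GPT-4o", 88.1),
    Entry("Gemini 1.5 Flash", 86.9),
    Entry("Claude 3.7 Sonnet", 87.3),
]

# Rank is descending order by score; ties keep insertion order
# because Python's sort is stable.
ranked = sorted(entries, key=lambda e: e.score, reverse=True)
for rank, e in enumerate(ranked, start=1):
    print(f"{rank:>2}  {e.model:<20} {e.score:5.1f}")
```

Running this prints the four models in the same order as the top of the leaderboard, GPT-4o first.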

Data source: Epoch AI, “Data on AI Benchmarking”, published at epoch.ai.

Licensed under CC-BY 4.0