HellaSwag

Commonsense NLI about grounded situations

Very Hard
Category: language
EDI: 74.8
Slope: 0.67

Leaderboard (17 models)
Rank  Model                       Score  Stderr
1     GPT-4.1                     95.30
2     Llama 3.1 405B              89.20
3     falcon-180B                 89.00
4     DeepSeek V3                 88.90
5     Mixtral-8x7B-v0.1           86.70
6     Llama-2-70b-hf              85.30
7     Qwen 2.5 72B                84.80
8     Qwen2.5-Max                 83.00
9     Phi-3-medium-128k-instruct  82.40
10    gemma-7b                    82.20
11    Mistral-7B-v0.1             81.00
12    Llama-2-7b                  80.70
13    Phi-3-small-8k-instruct     77.00
14    Phi-3-mini-4k-instruct      76.70
15    Yi-6B                       76.40
16    GPT-OSS 120B                70.50
17    Phi-4                       53.60

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0