Winogrande

Large-scale Winograd schema challenge - anchor benchmark for ECI calculation

Frontier
Category: language
EDI: 109.0
Slope: 1.00

Leaderboard

(21 models)
Rank  Model                         Score  Stderr
1     Llama 3.1 405B                89.20
2     Claude 3 Opus                 88.50
3     GPT-4.1                       87.50
4     falcon-180B                   87.10
5     DeepSeek V3                   86.30
6     Meta-Llama-3-8B-Instruct      83.50
7     Qwen 2.5 72B                  82.30
8     gpt-3.5-turbo-1106            81.60
9     Phi-3-medium-128k-instruct    81.50
10    Phi-3-small-8k-instruct       81.50
11    Qwen2.5-Max                   80.80
12    Llama-2-70b-hf                80.20
13    gemma-7b                      79.00
14    Mixtral-8x7B-v0.1             77.20
15    Llama-2-7b                    76.70
16    Mistral-7B-v0.1               75.30
17    Claude 3.7 Sonnet             75.10
18    Phi-4                         73.40
19    Yi-6B                         73.00
20    Phi-3-mini-4k-instruct        70.80
21    GPT-OSS 120B                  66.10

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0

Winogrande: Top Score 89.2% - AI Benchmark | NeoSignal