BBH

Big-Bench Hard - 23 challenging tasks requiring multi-step reasoning

Frontier
Category: reasoning
EDI: 113.0
Slope: 1.86

Leaderboard

(16 models)
Rank  Model                       Score  Stderr
1     DeepSeek V3                 87.50
2     Llama 3.1 405B              82.90
3     Phi-3-medium-128k-instruct  81.40
4     Qwen 2.5 72B                79.80
5     Phi-3-small-8k-instruct     79.10
6     GPT-4.1                     75.12
7     Phi-3-mini-4k-instruct      71.70
8     Llama-2-70b-hf              64.90
9     gpt-3.5-turbo-1106          61.59
10    Phi-4                       59.40
11    Llama-2-7b                  58.50
12    Mistral-7B-v0.1             56.10
13    gemma-7b                    55.10
14    Qwen 3 235B                 55.00
15    Yi-6B                       47.20
16    falcon-180B                 37.10

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0

BBH: Top Score 87.5% - AI Benchmark | NeoSignal