Balrog

Benchmarking Agentic LLM and VLM Reasoning On Games

Frontier
Category: specialized
EDI: 159.0
Slope: 0.91

Leaderboard

(16 models)
Rank  Model                      Score  Stderr
1     Grok 4                     43.60
2     Gemini 2.5 Pro (Jun 2025)  43.30
3     DeepSeek R1                34.90
4     GPT-5.2                    32.80
5     Claude 3.7 Sonnet          32.60
6     GPT-4o                     32.30
7     Grok-3 mini                29.50
8     Llama 3.1 405B             27.90
9     Llama 3.3 70B              23.00
10    Gemini 1.5 Flash           21.00
11    DeepSeek V3                19.50
12    Claude 3.5 Haiku           19.30
13    Mistral Large              17.60
14    gpt-4o-mini-2024-07-18     17.40
15    Qwen2.5-Max                16.20
16    Phi-4                      11.60

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0