
ARC AI2

AI2 Reasoning Challenge: grade-school science exam questions designed to require reasoning rather than simple retrieval

Frontier
Category: reasoning
EDI: 104.5
Slope: 2.12

Leaderboard (20 models)
Rank  Model                        Score  Stderr
1     Llama 3.1 405B               95.3   0
2     DeepSeek V3                  95.3   0
3     Qwen 2.5 72B                 94.5   0
4     Phi-3-medium-128k-instruct   91.6   0
5     Phi-3-small-8k-instruct      90.7   0
6     gpt-3.5-turbo-1106           87.4   0
7     Mixtral-8x7B-v0.1            87.3   0
8     Claude Opus 4.5              86.3   0
9     Phi-3-mini-4k-instruct       84.9   0
10    Qwen 3 235B                  84.4   0
11    Meta-Llama-3-8B-Instruct     82.8   0
12    Mistral-7B-v0.1              78.6   0
13    gemma-7b                     78.3   0
14    Llama-2-70b-hf               78.3   0
15    Phi-4                        75.9   0
16    Qwen2.5-Max                  70.5   0
17    falcon-180B                  67.8   0
18    Llama-2-7b                   60.3   0
19    Yi-6B                        55.6   0
20    GPT-OSS 120B                 41.1   0

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0
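
As a rough illustration, the sketch below shows how a leaderboard like the one above could be rebuilt from a locally downloaded copy of the Epoch AI dataset cited above. The file name (benchmarks.csv), the column names (benchmark, model, score, stderr), and the benchmark label "ARC AI2" are assumptions made for illustration, not the dataset's actual schema.

import csv

def leaderboard(path: str, benchmark: str, top_n: int = 20):
    """Return (model, score, stderr) rows for one benchmark, best score first.

    Assumes a CSV with columns: benchmark, model, score, stderr (hypothetical schema).
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        for record in csv.DictReader(f):
            if record["benchmark"] != benchmark:
                continue
            rows.append(
                (record["model"], float(record["score"]), float(record["stderr"] or 0))
            )
    rows.sort(key=lambda r: r[1], reverse=True)  # rank by score, descending
    return rows[:top_n]

if __name__ == "__main__":
    # Print a plain-text table in the same Rank / Model / Score / Stderr layout as above.
    for rank, (model, score, stderr) in enumerate(
        leaderboard("benchmarks.csv", "ARC AI2"), start=1
    ):
        print(f"{rank:>2}  {model:<28} {score:5.1f}  {stderr:g}")
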