Terminal Bench

Terminal/command-line interaction tasks for agent evaluation

Frontier
Category: agents
EDI: 150.6
Slope: 2.79
Leaderboard

(9 models)
Rank  Model                         Score  Stderr
1     GPT-5.2                       64.90
2     Gemini 3 Pro                  64.30
3     Claude Opus 4.5               63.10
4     kimi-k2-thinking (official)   35.70
5     Gemini 2.5 Pro (Jun 2025)     32.60
6     Grok 4                        27.20
7     Grok Code Fast 1              25.80
8     Qwen3-Max-Instruct            25.40
9     GPT-OSS 120B                  18.70

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0