Terminal Bench
Terminal/command-line interaction tasks for agent evaluation
Leaderboard (9 models)

| Rank | Model | Score (%) | Stderr |
|---|---|---|---|
| 1 | GPT-5.2 | 64.90 | — |
| 2 | Gemini 3 Pro | 64.30 | — |
| 3 | Claude Opus 4.5 | 63.10 | — |
| 4 | kimi-k2-thinking (official) | 35.70 | — |
| 5 | Gemini 2.5 Pro (Jun 2025) | 32.60 | — |
| 6 | Grok 4 | 27.20 | — |
| 7 | Grok Code Fast 1 | 25.80 | — |
| 8 | Qwen3-Max-Instruct | 25.40 | — |
| 9 | GPT-OSS 120B | 18.70 | — |
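For anyone who wants to work with these numbers programmatically, here is a minimal sketch that re-derives the ranking from a local CSV export of the table above. The file name `terminal_bench.csv` and its `model,score` columns are assumptions for illustration, not a format published by Epoch AI; adjust them to match your actual export.

```python
import csv

# Load a local CSV export of the leaderboard above.
# "terminal_bench.csv" is a hypothetical file with header columns
# "model" and "score" matching the table; adapt as needed.
with open("terminal_bench.csv", newline="") as f:
    rows = [(r["model"], float(r["score"])) for r in csv.DictReader(f)]

# Re-derive the ranking: sort by score, highest first.
rows.sort(key=lambda r: r[1], reverse=True)

# Print rank, model name, and score in aligned columns.
for rank, (model, score) in enumerate(rows, start=1):
    print(f"{rank:>2}  {model:<30} {score:5.1f}")
```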
Data source: Epoch AI, “Data on AI Benchmarking”, published at epoch.ai. Licensed under CC BY 4.0.