Cybench
Cybersecurity capture-the-flag (CTF) challenges that test security analysis and exploitation.
Leaderboard (11 models)

| Rank | Model | Score | Stderr |
|---|---|---|---|
| 1 | Claude Opus 4.5 | 55.00 | — |
| 2 | o3 | 22.50 | — |
| 3 | Claude 3.7 Sonnet | 20.00 | — |
| 4 | GPT-4.1 | 17.50 | — |
| 5 | GPT-4o | 12.50 | — |
| 6 | o1 | 10.00 | — |
| 7 | Claude 3 Opus | 10.00 | — |
| 8 | Mixtral-8x7B-v0.1 | 7.50 | — |
| 9 | Llama 3.1 405B | 7.50 | — |
| 10 | Gemini 1.5 Flash | 7.50 | — |
| 11 | Meta-Llama-3-8B-Instruct | 5.00 | — |

Data source: Epoch AI, “Data on AI Benchmarking”, published at epoch.ai. Licensed under CC-BY 4.0.