Benchmarks
AI model evaluation benchmarks tracked in Epoch AI's Data on AI Benchmarking
FrontierMath-Tier-4-2025-07-01-Private (mathematics): Tier 4 (hardest) FrontierMath problems at research-level difficulty
GSO-Bench (specialized): Challenging software optimization tasks for evaluating software-engineering agents
Balrog (specialized): Agentic reasoning evaluation on long-horizon games such as NetHack, Crafter, and TextWorld
FrontierMath-2025-02-28-Private (mathematics): Research-level mathematics problems at the frontier of current AI capabilities
Terminal Bench (agents): Terminal/command-line interaction tasks for agent evaluation
SimpleBench (reasoning): Simple reasoning tasks designed to test fundamental capabilities
DeepResearch Bench (specialized): Multi-document synthesis and research tasks testing deep research capabilities
Cybench (agents): Cybersecurity CTF challenges testing security analysis and exploitation
The Agent Company (agents): Multi-step workplace automation tasks testing autonomous agent capabilities
OSWorld (agents): Operating system interaction tasks testing computer use and automation
ARC-AGI (specialized): Abstraction and Reasoning Corpus, novel visual reasoning tasks testing general intelligence
VPCT (specialized): Visual physics comprehension tasks, predicting the outcomes of simple physical scenes
WeirdML (specialized): Unusual machine learning tasks testing adaptability
SWE-Bench Verified (Bash Only) (coding): Verified real-world GitHub issues requiring code understanding and modification
CadEval (coding): CAD model generation and evaluation tasks
Aider polyglot (coding): Code editing tasks across multiple programming languages using the Aider framework
OTIS Mock AIME 2024-2025 (mathematics): Competition-level mathematics problems from the OTIS Mock AIME exams
GPQA_diamond (reasoning): Graduate-level science questions in physics, chemistry, and biology requiring expert knowledge
Fiction.LiveBench (specialized): Long-context comprehension and reasoning over serialized fiction
ANLI (language): Adversarial NLI, challenging natural language inference
MATH level 5 (mathematics): Level 5 (hardest) problems from the MATH dataset requiring advanced mathematical reasoning
GeoBench (specialized): Geographic and spatial reasoning tasks
BBH (reasoning): Big-Bench Hard, 23 challenging tasks requiring multi-step reasoning
ScienceQA (language): Science question answering across multiple domains
MMLU (language): Massive Multitask Language Understanding across 57 subjects
Winogrande (language): Large-scale Winograd schema challenge, used as the anchor benchmark for the ECI (Epoch Capabilities Index) calculation
Lech Mazur Writing (language): Writing quality and style evaluation
VideoMME (specialized): Video understanding and multimodal evaluation tasks
GSM8K (mathematics): Grade school math word problems requiring multi-step arithmetic reasoning
ARC AI2 (reasoning): AI2 Reasoning Challenge, science questions requiring reasoning
OpenBookQA (language): Open-book question answering combining elementary science facts with broad common knowledge
HellaSwag (language): Commonsense NLI about grounded situations
PIQA (language): Physical commonsense question answering about everyday interactions
TriviaQA (language): Trivia questions requiring broad knowledge
LAMBADA (language): Word-prediction task requiring broad discourse context (predicting the final word of a passage)
BFCL (agents): Berkeley Function Calling Leaderboard, evaluates LLMs on tool/function calling accuracy across simple, multiple, parallel, and multi-turn scenarios
GAIA Overall (agents): General AI Assistant benchmark, 466 tasks across 3 difficulty levels testing reasoning, web browsing, tool use, and multi-modality
SWE-bench Verified (agents): GitHub issue resolution benchmark, tests coding agents on resolving real-world software issues with 500 human-verified instances
Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai
Licensed under CC-BY 4.0
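
Each entry above carries one of six category tags (mathematics, coding, agents, reasoning, language, specialized). Below is a minimal sketch of how a local export of this catalog could be grouped by category, assuming a hypothetical CSV file named benchmarks.csv with name, category, and description columns; the column names and download format of the actual Epoch AI release may differ.

import csv
from collections import defaultdict

# Hypothetical local export of the benchmark catalog; the real file name
# and column layout published by Epoch AI may differ.
CSV_PATH = "benchmarks.csv"  # assumed columns: name, category, description

def load_catalog(path: str) -> list[dict]:
    """Read the benchmark catalog into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def group_by_category(rows: list[dict]) -> dict[str, list[str]]:
    """Group benchmark names by their category tag (e.g. 'mathematics', 'agents')."""
    groups: dict[str, list[str]] = defaultdict(list)
    for row in rows:
        groups[row["category"]].append(row["name"])
    return dict(groups)

if __name__ == "__main__":
    catalog = load_catalog(CSV_PATH)
    for category, names in sorted(group_by_category(catalog).items()):
        print(f"{category} ({len(names)}): {', '.join(sorted(names))}")

Running the script prints one line per category tag with the benchmark names it contains, which is a convenient way to check coverage across the six categories.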