Benchmarks

AI model evaluation datasets from Epoch AI. Each entry below lists the benchmark's category, a one-line description, a difficulty tier (Easy, Medium, Very Hard, or Frontier), and a slope value; Winogrande, the anchor benchmark for the ECI calculation, has slope 1.00.
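The listing does not spell out how the slope values are used. As a rough illustration only, the sketch below assumes an IRT-style model in which a benchmark's expected score is a logistic function of a single underlying capability, with the slope controlling how quickly the score rises relative to the Winogrande anchor (slope 1.00). The functional form, the capability scale, and the difficulty offsets are assumptions made for this sketch, not Epoch AI's published methodology; only the slope values are taken from the entries below.

```python
import math

def expected_score(capability: float, slope: float, offset: float) -> float:
    """Illustrative logistic (IRT-style) curve for expected benchmark accuracy.

    The functional form and the offset values are assumptions for this sketch,
    not Epoch AI's published ECI methodology.
    """
    return 1.0 / (1.0 + math.exp(-slope * (capability - offset)))

# Slopes come from the listing below; offsets are hypothetical, chosen only
# to show how a steeper slope makes scores climb faster with capability.
examples = {
    "Winogrande (anchor)":      (1.00, 0.0),
    "GPQA_diamond":             (2.71, 1.5),
    "OTIS Mock AIME 2024-2025": (5.30, 2.0),
}

for capability in (0.0, 1.0, 2.0, 3.0):
    scores = ", ".join(
        f"{name}: {expected_score(capability, slope, offset):.2f}"
        for name, (slope, offset) in examples.items()
    )
    print(f"capability={capability:.1f} -> {scores}")
```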

FrontierMath-Tier-4-2025-07-01-Private

mathematics

Tier 4 (hardest) FrontierMath problems - research-level difficulty

Difficulty: Frontier · Slope: 3.51

GSO-Bench

specialized

General science and observation benchmark

Difficulty: Frontier · Slope: 2.78

Balrog

specialized

Benchmark of agentic LLM and VLM reasoning in game environments

Difficulty: Frontier · Slope: 0.91

FrontierMath-2025-02-28-Private

mathematics

Research-level mathematics problems at the frontier of current AI capabilities

Difficulty: Frontier · Slope: 3.72

Terminal Bench

agents

Terminal/command-line interaction tasks for agent evaluation

Difficulty: Frontier · Slope: 2.79

SimpleBench

reasoning

Simple reasoning tasks designed to test fundamental capabilities

Difficulty: Frontier · Slope: 2.73

DeepResearch Bench

specialized

Multi-document synthesis and research tasks testing deep research capabilities

Difficulty: Frontier · Slope: 0.80

Cybench

agents

Cybersecurity CTF challenges testing security analysis and exploitation

Difficulty: Frontier · Slope: 3.31

The Agent Company

agents

Multi-step workplace automation tasks testing autonomous agent capabilities

Difficulty: Frontier · Slope: 3.21

OSWorld

agents

Operating system interaction tasks testing computer use and automation

Difficulty: Frontier · Slope: 2.82

ARC-AGI

specialized

Abstraction and Reasoning Corpus - novel visual reasoning tasks testing general intelligence

Difficulty: Frontier · Slope: 4.87

VPCT

specialized

Visual perception and comprehension tasks

Difficulty: Frontier · Slope: 1.61

WeirdML

specialized

Unusual machine learning tasks testing adaptability

Difficulty: Frontier · Slope: 1.83

SWE-Bench Verified (Bash Only)

coding

Verified real-world GitHub issues requiring code understanding and modification

Difficulty: Frontier · Slope: 2.99

CadEval

coding

CAD/technical code evaluation tasks

Difficulty: Frontier · Slope: 2.28

Aider polyglot

coding

Code editing tasks across multiple programming languages using Aider framework

Difficulty: Frontier · Slope: 3.97

OTIS Mock AIME 2024-2025

mathematics

Competition problems from the OTIS Mock AIME testing olympiad-level mathematics

Difficulty: Frontier · Slope: 5.30

GPQA_diamond

reasoning

Graduate-level science questions in physics, chemistry, and biology requiring expert knowledge

Difficulty: Frontier · Slope: 2.71

Fiction.LiveBench

specialized

Fiction comprehension and reasoning tasks

Difficulty: Frontier · Slope: 2.78

ANLI

language

Adversarial NLI - challenging natural language inference

Difficulty: Frontier · Slope: 1.44

MATH level 5

mathematics

Level 5 (hardest) problems from the MATH dataset requiring advanced mathematical reasoning

Difficulty: Frontier · Slope: 4.14

GeoBench

specialized

Geographic and spatial reasoning tasks

Difficulty: Frontier · Slope: 0.72

BBH

reasoning

Big-Bench Hard - 23 challenging tasks requiring multi-step reasoning

Difficulty: Frontier · Slope: 1.86

ScienceQA

language

Science question answering across multiple domains

Difficulty: Frontier · Slope: 1.90

MMLU

language

Massive Multitask Language Understanding across 57 subjects

Difficulty: Frontier · Slope: 1.49

Winogrande

language

Large-scale Winograd schema challenge - anchor benchmark for ECI calculation

Difficulty: Frontier · Slope: 1.00

Lech Mazur Writing

language

Writing quality and style evaluation

Difficulty: Frontier · Slope: 0.92

VideoMME

specialized

Video understanding and multimodal evaluation tasks

Difficulty: Frontier · Slope: 0.38

GSM8K

mathematics

Grade school math word problems requiring multi-step arithmetic reasoning

Difficulty: Frontier · Slope: 2.17

ARC AI2

reasoning

AI2 Reasoning Challenge - science questions requiring reasoning

Difficulty: Frontier · Slope: 2.12

OpenBookQA

language

Open-book question answering requiring common knowledge

Difficulty: Frontier · Slope: 1.15

HellaSwag

language

Commonsense NLI about grounded situations

Difficulty: Very Hard · Slope: 0.67

PIQA

language

Physical intuition question answering

Difficulty: Very Hard · Slope: 0.37

TriviaQA

language

Trivia questions requiring broad knowledge

Difficulty: Medium · Slope: 0.40

LAMBADA

language

Language modeling and broad context understanding

Difficulty: Easy · Slope: 0.34

BFCL

agents

Berkeley Function Calling Leaderboard - evaluates LLMs on tool/function calling accuracy across simple, multiple, parallel, and multi-turn scenarios

Slope: N/A

GAIA Overall

agents

General AI Assistant benchmark - 466 tasks across 3 difficulty levels testing reasoning, web browsing, tool use, and multi-modality

Slope: N/A

SWE-bench Verified

agents

GitHub issue resolution benchmark - tests coding agents on resolving real-world software issues with 500 human-verified instances

Slope: N/A
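
For readers who want to work with this listing programmatically, below is a minimal sketch that transcribes a few entries from the listing above into an ad hoc Python structure and picks the steepest-slope benchmark per category; the `Benchmark` dataclass is an illustration, not a schema published by Epoch AI.

```python
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    category: str
    slope: float | None  # None where the listing gives N/A

# A few entries transcribed from the listing above (not the full set).
BENCHMARKS = [
    Benchmark("FrontierMath-2025-02-28-Private", "mathematics", 3.72),
    Benchmark("OTIS Mock AIME 2024-2025", "mathematics", 5.30),
    Benchmark("SWE-Bench Verified (Bash Only)", "coding", 2.99),
    Benchmark("Aider polyglot", "coding", 3.97),
    Benchmark("Winogrande", "language", 1.00),
    Benchmark("GAIA Overall", "agents", None),
]

# Example: steepest-slope benchmark per category, skipping N/A entries.
by_category: dict[str, Benchmark] = {}
for b in BENCHMARKS:
    if b.slope is None:
        continue
    best = by_category.get(b.category)
    if best is None or b.slope > best.slope:
        by_category[b.category] = b

for category, b in sorted(by_category.items()):
    print(f"{category}: {b.name} (slope {b.slope})")
```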

Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0
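
The underlying benchmarking data can be downloaded from epoch.ai. A minimal loading sketch, assuming a local CSV export (the filename below is a placeholder, and the column layout should be inspected rather than assumed):

```python
import pandas as pd

# Placeholder filename: download the "Data on AI Benchmarking" export from
# epoch.ai first; the actual file name and columns may differ.
df = pd.read_csv("epoch_ai_benchmarking_data.csv")

print(df.shape)
print(df.columns.tolist())  # inspect the schema before relying on any column
print(df.head())
```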