ARC-AGI measures abstract reasoning with novel visual puzzles. It's become a key benchmark for evaluating general intelligence in AI models. But where do you find the current rankings? The official ARC-AGI site has results, but they might not include every model. Papers with Code has a leaderboard, but it may be outdated. Individual model announcements cite their own numbers. Getting a comprehensive, current view of which models lead on ARC-AGI—or any benchmark—requires checking multiple sources.
NeoSignal ARC-AGI Benchmark Leaderboard with model rankings
NeoSignal Benchmark Leaderboards provide that comprehensive view. This screenshot shows the ARC-AGI leaderboard: 50 models ranked by score. GPT-5.2-2025-12-11 xhigh leads at 86.20, Claude Opus 4.5 follows at 80.00, then GPT-5.2-2025-12-11 high at 78.70. The page header shows benchmark metadata: Category (Specialized), EDI (146.0), Slope (4.87), and a link to the source data. Each model links to its detail page. The sidebar shows the chat with suggested questions about what ARC-AGI measures and why it's considered difficult.
The benefit: authoritative benchmark rankings in one place. No source hunting, no version confusion, no manual comparison. Navigate to a benchmark, see the leaderboard, understand which models lead.
Detailed Walkthrough
The Leaderboard Fragmentation Problem
Benchmark results are scattered across the internet:
Multiple Sources: ARC-AGI results appear on arcprize.org, Papers with Code, model provider blogs, and research papers. Each source may have different models or different versions of results.
Update Lag: Academic leaderboards may not include the latest model versions. Model providers announce results before they appear on aggregator sites.
Missing Context: A leaderboard showing scores doesn't explain what the benchmark measures, how hard it is, or how to interpret the numbers.
No Cross-Reference: Seeing that Claude scores 80 on ARC-AGI doesn't tell you how that compares to its performance on other benchmarks or how it ranks overall.
NeoSignal Benchmark Leaderboards solve this by aggregating results from Epoch AI's evaluation database and presenting them with full context.
Leaderboard Page Structure
Each benchmark detail page follows a consistent layout:
Header Section
- Back navigation to Benchmarks Browser
- Benchmark name (prominent heading)
- Description of what the benchmark measures
- Difficulty badge based on EDI
Metadata Row: Key metrics displayed inline:
- Category (Reasoning, Math, Coding, Agents, Language, Specialized)
- EDI (Estimated Difficulty Index)
- Slope (improvement rate)
- View Source link to original data
Methodology Section (if available): A detailed explanation of how the benchmark works.
Leaderboard Table: Ranked model performance with scores and uncertainty.
Attribution: Epoch AI data source acknowledgment.
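To make this layout concrete, here is a minimal TypeScript sketch of the data a benchmark detail page could be rendered from. The type and field names (BenchmarkDetail, LeaderboardEntry, sourceUrl, and so on) are illustrative assumptions, not NeoSignal's actual schema.

```typescript
// Illustrative sketch only: type and field names are assumptions,
// not NeoSignal's actual data model.
type BenchmarkCategory =
  | "Reasoning"
  | "Mathematics"
  | "Coding"
  | "Agents"
  | "Language"
  | "Specialized";

interface LeaderboardEntry {
  rank: number;      // 1-based position in the leaderboard
  model: string;     // model name, linked to its detail page
  score: number;     // performance in the benchmark's native metric
  stderr?: number;   // standard error, omitted when not reported
}

interface BenchmarkDetail {
  name: string;               // e.g. "ARC-AGI"
  slug: string;               // URL identifier
  description: string;        // what the benchmark measures
  category: BenchmarkCategory;
  edi: number;                // Estimated Difficulty Index
  slope: number;              // improvement rate
  sourceUrl: string;          // "View Source" link to Epoch AI's data page
  methodology?: string;       // shown only when available
  leaderboard: LeaderboardEntry[];
}
```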
Leaderboard Table
The core of each page is the leaderboard table:
Column Structure
| Column | Description |
|---|---|
| Rank | Position in the leaderboard (1, 2, 3...) |
| Model | Model name (links to detail page) |
| Score | Performance on the benchmark |
| Stderr | Standard error / uncertainty |
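As a sketch of how such a table could be assembled, the snippet below sorts raw results by score and assigns 1-based ranks. It reuses the hypothetical LeaderboardEntry shape from the earlier sketch; the helper itself is an assumption, not NeoSignal's implementation.

```typescript
// Hypothetical helper: turn unsorted results into ranked leaderboard rows.
// Assumes higher scores are better; ties keep their input order.
interface RawResult {
  model: string;
  score: number;
  stderr?: number;
}

function toLeaderboard(results: RawResult[]): LeaderboardEntry[] {
  return [...results]
    .sort((a, b) => b.score - a.score) // highest score first
    .map((r, i) => ({ rank: i + 1, ...r }));
}

// Example, using scores from the ARC-AGI screenshot above:
const rows = toLeaderboard([
  { model: "Claude Opus 4.5", score: 80.0 },
  { model: "GPT-5.2-2025-12-11 xhigh", score: 86.2 },
  { model: "GPT-5.2-2025-12-11 high", score: 78.7 },
]);
// rows[0] -> { rank: 1, model: "GPT-5.2-2025-12-11 xhigh", score: 86.2 }
```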
Rank Visualization: The top 3 positions get special treatment:
- 1st place: Gold badge
- 2nd place: Silver badge
- 3rd place: Bronze badge
- 4th and below: Neutral badge
Top 3 rows also have subtle background highlighting.
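A plausible mapping from rank to badge tier, following the treatment described above; the tier names and helper are assumptions, not the actual UI code:

```typescript
// Hypothetical badge-tier mapping for the rank column.
type BadgeTier = "gold" | "silver" | "bronze" | "neutral";

function rankBadge(rank: number): BadgeTier {
  if (rank === 1) return "gold";    // 1st place
  if (rank === 2) return "silver";  // 2nd place
  if (rank === 3) return "bronze";  // 3rd place
  return "neutral";                 // 4th and below
}

// Top 3 rows also get a subtle background highlight.
const highlightRow = (rank: number): boolean => rank <= 3;
```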
Score Formatting: Scores display to two decimal places. This precision matters when models are closely ranked; the difference between 80.00 and 79.87 can determine rank order.
Stderr Display: Standard error appears as ±value when available, or a dash when not reported. It indicates confidence in the score: a score of 80.00 ±1.5 might actually range from 78.5 to 81.5.
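The formatting rules above could be expressed roughly as follows. The function names are illustrative, and padding stderr to two decimals is an assumption:

```typescript
// Illustrative formatters for the Score and Stderr columns.
function formatScore(score: number): string {
  return score.toFixed(2); // always two decimal places, e.g. "80.00"
}

function formatStderr(stderr?: number): string {
  // "±value" when reported, a dash otherwise.
  return stderr != null ? `±${stderr.toFixed(2)}` : "–";
}

formatScore(80);         // "80.00"
formatStderr(1.5);       // "±1.50"
formatStderr(undefined); // "–"
```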
Model Links
Every model name in the leaderboard links to its NeoSignal component detail page:
Click-Through Value: From the ARC-AGI leaderboard, click "Claude Opus 4.5" to see:
- Complete model profile
- Performance across all benchmarks
- Compatibility with frameworks and cloud platforms
- Related market signals
This enables exploration: see a model rank high on one benchmark, click through to understand its full capability profile.
Benchmark Metadata
The header provides context for interpreting scores:
Category: Which evaluation category the benchmark belongs to. This helps you understand what it tests:
- Reasoning: Multi-step logical inference
- Mathematics: Mathematical problem solving
- Coding: Code generation and understanding
- Agents: Autonomous task completion
- Language: Language understanding
- Specialized: Unique capabilities (vision, research, etc.)
EDI (Estimated Difficulty Index): A normalized difficulty score:
- Below 100: Easier, possibly saturated benchmarks
- 100-130: Moderate difficulty
- 130-150: Hard
- Above 150: Frontier difficulty
For ARC-AGI, EDI of 146.0 indicates high difficulty—models struggle to score well, and the benchmark remains unsaturated.
Slope: How fast models improve on this benchmark:
- Low slope (< 1): Slow progress, persistent difficulty
- Medium slope (1-3): Steady improvement
- High slope (> 3): Rapid improvement
ARC-AGI's slope of 4.87 indicates rapid recent improvement—models are getting better at abstract reasoning quickly.
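Read together, EDI and slope give a quick status summary. A rough sketch using the bands listed above; exact boundary handling is an assumption:

```typescript
// Rough interpretation helpers based on the EDI and slope bands above.
// Boundary handling (<=) is an assumption.
function ediLabel(edi: number): string {
  if (edi < 100) return "easier, possibly saturated";
  if (edi <= 130) return "moderate difficulty";
  if (edi <= 150) return "hard";
  return "frontier difficulty";
}

function slopeLabel(slope: number): string {
  if (slope < 1) return "slow progress, persistent difficulty";
  if (slope <= 3) return "steady improvement";
  return "rapid improvement";
}

// ARC-AGI: EDI 146.0, slope 4.87
ediLabel(146.0);  // "hard"
slopeLabel(4.87); // "rapid improvement"
```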
View Source: A direct link to Epoch AI's benchmark data page for verification and additional context.
Difficulty Badges
Each benchmark displays a difficulty badge based on EDI:
Badge Levels
- Easy: EDI < 80
- Moderate: EDI 80-110
- Hard: EDI 110-140
- Frontier: EDI > 140
The badge provides at-a-glance understanding of benchmark difficulty without parsing numbers.
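A hedged sketch of the badge assignment using the cutoffs listed above; how the boundaries at 80, 110, and 140 are resolved is an assumption:

```typescript
// Hypothetical difficulty-badge mapping based on EDI.
type DifficultyBadge = "Easy" | "Moderate" | "Hard" | "Frontier";

function difficultyBadge(edi: number): DifficultyBadge {
  if (edi < 80) return "Easy";
  if (edi <= 110) return "Moderate";
  if (edi <= 140) return "Hard";
  return "Frontier";
}

difficultyBadge(146.0); // "Frontier" (ARC-AGI)
```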
Chat Integration
Benchmark pages integrate with NeoSignal AI Chat:
Benchmark Context: The chat knows which benchmark you're viewing. Its context includes:
- Benchmark name and slug
- Category
- Description
- EDI (difficulty)
- Top performing models
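The context payload might look roughly like this; the field names and the slug value are assumptions, with example values drawn from the ARC-AGI leaderboard above:

```typescript
// Hypothetical shape of the benchmark context supplied to NeoSignal AI Chat.
interface BenchmarkChatContext {
  name: string;          // "ARC-AGI"
  slug: string;          // URL identifier (assumed value below)
  category: string;      // "Specialized"
  description: string;   // what the benchmark measures
  edi: number;           // difficulty index
  topModels: string[];   // e.g. top 3 from the leaderboard
}

const arcAgiContext: BenchmarkChatContext = {
  name: "ARC-AGI",
  slug: "arc-agi",
  category: "Specialized",
  description: "Abstract reasoning with novel visual puzzles",
  edi: 146.0,
  topModels: [
    "GPT-5.2-2025-12-11 xhigh",
    "Claude Opus 4.5",
    "GPT-5.2-2025-12-11 high",
  ],
};
```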
Suggested Questions: The chat suggests relevant questions:
- "What does ARC-AGI benchmark measure?"
- "Which models perform best on ARC-AGI?"
- "Why is ARC-AGI considered difficult?"
Intelligent Responses: Ask questions and get grounded answers:
- Q: "What does ARC-AGI benchmark measure?"
- A: Detailed explanation of abstract reasoning tasks, visual puzzles, and why it tests general intelligence.
Example Leaderboards
Different benchmarks show different competitive landscapes:
ARC-AGI (Specialized)
- Tests abstract visual reasoning
- EDI: 146, Slope: 4.87
- Top models: GPT-5.x variants, Claude Opus 4.5, Gemini models
- Scores range from 50s to 80s
FrontierMath (Mathematics)
- Research-level math problems
- EDI: 156, Slope: 3.7
- Top models: o1-preview, Claude 3.5 Sonnet, GPT-4
- Scores typically below 50%
SWE-Bench (Coding)
- Real GitHub issue resolution
- EDI: 143, Slope: 3.0
- Top models: Claude 3.5 Sonnet, GPT-4, specialized code models
- Scores range from 20s to 50s
The Agent Company (Agents)
- Workplace automation tasks
- EDI: 147, Slope: 3.2
- Tests autonomous multi-step task completion
- Scores typically below 40%
Interpreting Results
Understanding leaderboard results requires context:
Score Magnitude: Different benchmarks use different scoring scales:
- Some use percentage accuracy (0-100)
- Some use task completion rates
- Some use specialized metrics
The score column shows the benchmark's native metric.
Model Variants: Models appear with version specifiers:
- "gpt-5.2-2025-12-11 high" indicates a specific release date and configuration
- "claude-opus-4.5" is the standard version
- "o3-mini-2025-01-31 high" shows the mini variant with high compute
Stderr Interpretation: Standard error indicates reliability:
- Low stderr (< 1): Consistent performance, reliable ranking
- High stderr (> 3): Variable performance, ranking may shift on retest
- Missing stderr: Uncertainty data not reported
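As a rough rule of thumb, two scores are only meaningfully different if their gap is large relative to their combined uncertainty. The sketch below applies that check, assuming approximately normal errors and a ~95% threshold; the example stderr values are illustrative:

```typescript
// Rough significance check: is the score gap large relative to the
// combined standard error? Assumes approximately normal errors;
// 1.96 corresponds to a ~95% threshold.
function likelyDistinct(
  scoreA: number,
  stderrA: number,
  scoreB: number,
  stderrB: number
): boolean {
  const gap = Math.abs(scoreA - scoreB);
  const combined = Math.sqrt(stderrA ** 2 + stderrB ** 2);
  return gap > 1.96 * combined;
}

likelyDistinct(80.0, 1.5, 78.7, 1.5); // false: the 1.3-point gap could be noise
likelyDistinct(86.2, 1.0, 80.0, 1.0); // true: the 6.2-point gap is well outside the noise
```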
Navigation Patterns
From Benchmarks Browser: Click a benchmark card to view its leaderboard.
To Model Details: Click any model name to see its full profile.
Back to Browser: The "Back to Benchmarks" link returns to the filtered view.
Source Verification: "View Source" opens Epoch AI's original data.
Real-World Usage Patterns
Model Selection: You need the best model for abstract reasoning. Navigate to ARC-AGI leaderboard, see rankings, click through to top models to evaluate their full profiles.
Competitive Analysis: You're tracking how Claude compares to GPT on specific benchmarks. Check relevant leaderboards to see current rankings.
Benchmark Understanding: A paper mentions FrontierMath. Navigate to its leaderboard to understand difficulty (EDI 156), current state-of-the-art, and which models have been evaluated.
Progress Tracking: You want to monitor improvements on a benchmark over time. Check the slope metric to understand improvement rate. High slope means rapid progress.
Model Verification: A model claims top performance on SWE-Bench. Check the leaderboard to verify the claim and see the full competitive landscape.
Data Freshness
Leaderboard data updates as Epoch AI publishes new evaluations:
Automatic Updates: New model results appear without manual intervention.
Version Tracking: Model variants are tracked separately; GPT-5 and GPT-5-turbo appear as distinct entries.
Historical Context: Older model versions remain in the leaderboards for historical comparison.
Mobile Experience
Leaderboard pages adapt to mobile devices:
Responsive Table: The leaderboard table supports horizontal scrolling on narrow screens.
Touch Targets: Model links meet minimum touch-target sizes for easy tapping.
Stacked Layout: Metadata items stack vertically on mobile instead of displaying inline.
Readable Scores: Font sizes ensure scores remain readable on small screens.
From Leaderboard to Decision
NeoSignal Benchmark Leaderboards transform benchmark result checking from a multi-source research task into a single-page lookup. Navigate to any benchmark, see the complete leaderboard, understand difficulty through EDI and slope, and click through to model details.
The leaderboard table shows you rankings. The metadata shows you difficulty. The model links enable exploration. Together, they make benchmark performance transparent.
That's the NeoSignal approach: aggregate benchmark results, present them with context, connect them to model profiles. You focus on understanding which models excel where; the leaderboard handles the aggregation.