
WeirdML

Unusual machine learning tasks that test a model's adaptability

Frontier
Category: specialized
EDI: 145.0
Slope: 1.83

Leaderboard (31 models)
Rank  Model                                 Score (%)
  1   GPT-5.2                                 72.20
  2   Gemini 3 Pro                            69.93
  3   Claude Opus 4.5                         63.70
  4   o3                                      58.21
  5   Gemini 2.5 Pro (Jun 2025)               54.03
  6   o4-mini (high)                          52.56
  7   GPT-OSS 120B                            48.17
  8   o1                                      47.56
  9   Grok 4                                  45.73
 10   kimi-k2-thinking (official)             42.79
 11   Grok-3 mini                             42.58
 12   DeepSeek V3                             41.63
 13   Qwen3-Max-Instruct                      41.17
 14   Qwen 3 235B                             41.04
 15   Claude 3.7 Sonnet                       39.97
 16   GPT-4.1                                 39.37
 17   GPT-4.1 mini                            37.61
 18   DeepSeek R1                             36.49
 19   Grok Code Fast 1                        35.06
 20   Mistral Large                           33.13
 21   Claude 3.5 Haiku                        30.73
 22   GPT-4o                                  25.12
 23   Gemini 1.5 Flash                        24.87
 24   Llama-4-Maverick-17B-128E-Instruct      24.47
 25   Claude 3 Opus                           23.18
 26   Llama 3.1 405B                          21.38
 27   GPT-4 Turbo                             18.01
 28   Llama 3.3 70B                           14.44
 29   gpt-4o-mini-2024-07-18                  11.76
 30   gpt-3.5-turbo-1106                       3.48
 31   Mixtral-8x7B-v0.1                        3.17

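The scores are easy to work with programmatically. Below is a minimal Python sketch, using only data transcribed from the table above (abridged to the top five rows); the variable names and the gap-to-leader summary are illustrative choices, not part of the benchmark itself.

# Minimal sketch: leaderboard rows transcribed from the table above (abridged
# to the top five entries), loaded into a simple structure for ad-hoc analysis.
leaderboard = [
    ("GPT-5.2", 72.20),
    ("Gemini 3 Pro", 69.93),
    ("Claude Opus 4.5", 63.70),
    ("o3", 58.21),
    ("Gemini 2.5 Pro (Jun 2025)", 54.03),
]

# Rank models by score (highest first) and report each model's gap to the leader.
ranked = sorted(leaderboard, key=lambda row: row[1], reverse=True)
top_model, top_score = ranked[0]
for rank, (model, score) in enumerate(ranked, start=1):
    gap = top_score - score
    print(f"{rank:>2}  {model:<30} {score:6.2f}  (-{gap:.2f} vs. {top_model})")
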
Data source: Epoch AI, “Data on AI Benchmarking”. Published at epoch.ai

Licensed under CC-BY 4.0
