The GAIA benchmark reveals a persistent capability gap between humans and AI on tasks that are trivially easy for humans. On Level 1 (the simplest tasks), humans achieve 92% while the best AI system achieves 75%. The gap widens at higher difficulty levels, which points to fundamental limitations in AI reasoning and tool use.
Humans: 92% on all GAIA levels; best AI: 75% on Level 1
Jan 11, 2026
Common failures: tool selection errors (25%), calculation mistakes (20%), information retrieval errors (18%)
Jan 11, 2026
Gap widens at harder levels: Level 2 humans 92%, AI 68%; Level 3 humans 92%, AI 61%
Jan 11, 2026
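As a quick check of the arithmetic, the human-AI gap at each level can be computed directly from the figures quoted above. This is only a minimal sketch: the names `HUMAN_ACCURACY`, `AI_ACCURACY`, and `accuracy_gaps` are illustrative assumptions and are not part of any GAIA tooling.

```python
# Accuracy figures (in percent) exactly as quoted in the notes above.
HUMAN_ACCURACY = 92                   # reported for all GAIA levels
AI_ACCURACY = {1: 75, 2: 68, 3: 61}   # GAIA level -> best AI accuracy


def accuracy_gaps(human, ai_by_level):
    """Return {level: human - ai}, measured in percentage points."""
    return {level: human - ai for level, ai in ai_by_level.items()}


gaps = accuracy_gaps(HUMAN_ACCURACY, AI_ACCURACY)

for level in sorted(gaps):
    print(f"Level {level}: gap = {gaps[level]} percentage points")

# The gap grows with difficulty if each level's gap exceeds the previous one.
ordered = [gaps[level] for level in sorted(gaps)]
gap_widens = all(a < b for a, b in zip(ordered, ordered[1:]))
print("Gap widens with level:", gap_widens)
```

With the quoted figures, the gaps come out to 17, 24, and 31 percentage points for Levels 1, 2, and 3, consistent with the claim that the gap widens at harder levels.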