OpenAI-powered autonomous agent. Top-tier AgentBench performer at 44.1% overall. Excels at household tasks (78%), knowledge graphs (58%), and OS tasks (42%). Strong tool use, planning, and self-correction capabilities. Powered by GPT-4.
Composite score across all evaluation environments
Operating system task completion accuracy
Database query and manipulation performance
Knowledge graph reasoning and retrieval
E-commerce navigation and purchasing
Simulated household task completion
Real-world web browsing task success
Underlying foundation model powering the agent
Organization that created the model
Community usage, market traction, and ecosystem maturity
GAIA benchmark reveals a persistent capability gap between humans and AI on tasks that are conceptually simple for humans. On Level 1 (simplest tasks), humans achieve 92% while the best AI achieves 75%. The gap widens at higher difficulty levels, signaling fundamental limitations in AI reasoning and tool use.
Agent evaluation metrics are shifting from single-attempt success to probabilistic measures. Pass@k (success in at least one of k attempts) and pass^k (success in all k attempts) are becoming standard for measuring both the capability ceiling and the reliability floor.
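To make the two measures concrete, below is a minimal Python sketch estimating pass@k and pass^k from n recorded attempts on a task, c of which succeeded. The function names and the example numbers are illustrative assumptions, not taken from any specific benchmark harness; the pass@k estimator follows the standard combinatorial form popularized in code-generation evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: probability that at least one of k attempts
    (drawn without replacement from n attempts with c successes) succeeds.
    Measures the capability ceiling."""
    if n - c < k:
        return 1.0  # fewer than k failures recorded, so any k draws include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: probability that all k attempts succeed.
    Measures the reliability floor."""
    if c < k:
        return 0.0  # not enough successes for k draws to all succeed
    return comb(c, k) / comb(n, k)

# Illustrative example: 10 attempts on a task, 7 succeeded.
print(pass_at_k(10, 7, 3))   # ~0.99: at least one of 3 tries succeeds
print(pass_all_k(10, 7, 3))  # ~0.29: all 3 tries must succeed
```

The same attempt data yields a high pass@k but a much lower pass^k, which is why reporting both gives a fuller picture of an agent's usefulness than single-attempt success alone.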