Autonomous agent built on Anthropic's Claude 3.5 Sonnet. Top-tier AgentBench performer at 42.3% overall. Excels at household tasks (72%), knowledge graphs (55%), and OS tasks (40%). Strong tool use and self-correction capabilities.
Composite score across all evaluation environments
Operating system task completion accuracy
Database query and manipulation performance
Knowledge graph reasoning and retrieval
E-commerce navigation and purchasing
Simulated household task completion
Real-world web browsing task success
Underlying foundation model powering the agent
Organization that created the model
Community usage, market traction, and ecosystem maturity
Claude Opus 4.5 achieved 80.9% on SWE-bench Verified, becoming the first model to exceed the 80% threshold. This represents a 21x improvement from GPT-4's initial 3.8% score in October 2023, demonstrating rapid progress in AI coding capabilities.
GAIA benchmark reveals a persistent capability gap between humans and AI on tasks trivially easy for humans. On Level 1 (simplest tasks), humans achieve 92% while best AI achieves 75%. The gap widens at higher difficulty levels, signaling fundamental limitations in AI reasoning and tool use.
Agent evaluation metrics are shifting from single-attempt success to probabilistic measures: pass@k (success in at least one of k attempts) and pass^k (success in all k attempts) are becoming standard for measuring the capability ceiling and the reliability floor, respectively.
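A minimal sketch of how these two measures are typically estimated from repeated trials: pass@k uses the unbiased combinatorial estimator popularized by the HumanEval/Codex work, while pass^k is shown here with a simple empirical-rate estimate. The function names and sample counts are illustrative, not taken from any particular benchmark harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    sampled attempts succeeds, given c successes out of n recorded attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Simple estimate of pass^k: probability that all k independent
    attempts succeed, using the empirical per-attempt success rate c/n."""
    return (c / n) ** k

# Example: 10 recorded attempts on a task, 6 of them successful.
print(pass_at_k(n=10, c=6, k=3))     # capability ceiling: ~0.97
print(pass_power_k(n=10, c=6, k=3))  # reliability floor: 0.216
```

With 6 successes out of 10 attempts, pass@3 is about 0.97 while pass^3 is only 0.216, which is exactly the ceiling-versus-floor distinction these metrics are meant to capture.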
Despite the agent hype, only 16% of enterprise and 27% of startup deployments qualify as true agents. Most production architectures remain simple, built around fixed-sequence or routing-based workflows.
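For concreteness, a rough sketch of the two patterns named above, fixed-sequence and routing-based workflows. The step and handler functions are placeholders standing in for model calls, not tied to any specific framework.

```python
from typing import Callable

# Placeholder "LLM" steps; in a real system each would be a model call.
def summarize(text: str) -> str: return f"summary({text})"
def generate_reply(summary: str) -> str: return f"reply({summary})"
def review(draft: str) -> str: return f"reviewed({draft})"

def classify_intent(ticket: str) -> str:
    # Stand-in for a single classification call.
    return "billing" if "invoice" in ticket.lower() else "general"

def handle_billing(t: str) -> str: return f"billing-flow({t})"
def handle_general(t: str) -> str: return f"general-flow({t})"

# Fixed-sequence workflow: the same steps always run in the same order.
def fixed_sequence_pipeline(document: str) -> str:
    return review(generate_reply(summarize(document)))

# Routing-based workflow: one classification step selects a branch,
# then the chosen handler runs with no further model-driven control flow.
ROUTES: dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "general": handle_general,
}

def routed_pipeline(ticket: str) -> str:
    return ROUTES.get(classify_intent(ticket), handle_general)(ticket)

print(fixed_sequence_pipeline("quarterly report"))
print(routed_pipeline("Question about my invoice"))
```

Neither pattern gives the model open-ended control over its own loop, which is why such deployments fall short of the "true agent" bar cited above.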
The browser is emerging as the dominant interface for agentic AI, transforming from a navigation tool into a programmable environment for autonomous execution. Perplexity's Comet and OpenAI's Atlas are leading this shift.
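As a rough illustration of the browser as a programmable execution surface, here is a minimal Playwright (Python) sketch. The URL and extraction logic are placeholders, and neither Comet nor Atlas exposes this exact scripting interface; agentic browsers wrap this kind of control loop behind a model-driven planner rather than a hard-coded script.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")            # navigate (action)
    title = page.title()                        # observe page state
    links = page.eval_on_selector_all(          # extract structured data
        "a", "els => els.map(e => e.href)"
    )
    print(title, links[:5])
    browser.close()
```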