Autonomous agent built on Anthropic's Claude 3.5 Sonnet. Top-tier AgentBench performer at 42.3% overall. Excels at household tasks (72%), knowledge graphs (55%), and OS tasks (40%). Strong tool use and self-correction capabilities.
Composite score across all evaluation environments
Operating system task completion accuracy
Database query and manipulation performance
Knowledge graph reasoning and retrieval
E-commerce navigation and purchasing
Simulated household task completion
Real-world web browsing task success
Underlying foundation model powering the agent
Organization that created the model
Community usage, market traction, and ecosystem maturity
Claude Opus 4.5 achieved 80.9% on SWE-bench Verified, becoming the first model to exceed the 80% threshold. This represents a 21x improvement from GPT-4's initial 3.8% score in October 2023, demonstrating rapid progress in AI coding capabilities.
GAIA benchmark reveals a persistent capability gap between humans and AI on tasks trivially easy for humans. On Level 1 (simplest tasks), humans achieve 92% while best AI achieves 75%. The gap widens at higher difficulty levels, signaling fundamental limitations in AI reasoning and tool use.
Agent evaluation metrics are shifting from single-attempt success to probabilistic measures: pass@k (success in at least one of k attempts) and pass^k (success in all k attempts) are becoming standard for measuring the capability ceiling and the reliability floor, respectively.
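A minimal sketch of how these two measures are typically estimated from repeated trials: pass@k uses the unbiased combinatorial estimator popularized by the HumanEval/Codex work, while pass^k is shown here with a simple empirical-rate estimate. The function names and sample counts are illustrative, not taken from any particular benchmark harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: probability that at least one of k
    sampled attempts succeeds, given c successes out of n recorded attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_power_k(n: int, c: int, k: int) -> float:
    """Simple estimate of pass^k: probability that all k independent
    attempts succeed, using the empirical per-attempt success rate c/n."""
    return (c / n) ** k

# Example: 10 recorded attempts on a task, 6 of them successful.
print(pass_at_k(n=10, c=6, k=3))     # capability ceiling: ~0.97
print(pass_power_k(n=10, c=6, k=3))  # reliability floor: 0.216
```

With 6 successes out of 10 attempts, pass@3 is about 0.97 while pass^3 is only 0.216, which is exactly the ceiling-versus-floor distinction these metrics are meant to capture.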
Despite the agent hype, only 16% of enterprise and 27% of startup deployments qualify as true agents. Most production architectures remain simple, built around fixed-sequence or routing-based workflows.
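For concreteness, a rough sketch of the two patterns named above, fixed-sequence and routing-based workflows. The step and handler functions are placeholders standing in for model calls, not tied to any specific framework.

```python
from typing import Callable

# Placeholder "LLM" steps; in a real system each would be a model call.
def summarize(text: str) -> str: return f"summary({text})"
def generate_reply(summary: str) -> str: return f"reply({summary})"
def review(draft: str) -> str: return f"reviewed({draft})"

def classify_intent(ticket: str) -> str:
    # Stand-in for a single classification call.
    return "billing" if "invoice" in ticket.lower() else "general"

def handle_billing(t: str) -> str: return f"billing-flow({t})"
def handle_general(t: str) -> str: return f"general-flow({t})"

# Fixed-sequence workflow: the same steps always run in the same order.
def fixed_sequence_pipeline(document: str) -> str:
    return review(generate_reply(summarize(document)))

# Routing-based workflow: one classification step selects a branch,
# then the chosen handler runs with no further model-driven control flow.
ROUTES: dict[str, Callable[[str], str]] = {
    "billing": handle_billing,
    "general": handle_general,
}

def routed_pipeline(ticket: str) -> str:
    return ROUTES.get(classify_intent(ticket), handle_general)(ticket)

print(fixed_sequence_pipeline("quarterly report"))
print(routed_pipeline("Question about my invoice"))
```

Neither pattern gives the model open-ended control over its own loop, which is why such deployments fall short of the "true agent" bar cited above.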
The browser is emerging as the dominant interface for agentic AI, transforming from a navigation tool into a programmable environment for autonomous execution. Perplexity's Comet and OpenAI's Atlas are leading this shift.
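As a rough illustration of the browser as a programmable execution surface, here is a minimal Playwright (Python) sketch. The URL and extraction logic are placeholders, and neither Comet nor Atlas exposes this exact scripting interface; agentic browsers wrap this kind of control loop behind a model-driven planner rather than a hard-coded script.

```python
# Requires: pip install playwright && playwright install chromium
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")            # navigate (action)
    title = page.title()                        # observe page state
    links = page.eval_on_selector_all(          # extract structured data
        "a", "els => els.map(e => e.href)"
    )
    print(title, links[:5])
    browser.close()
```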