OpenAI-powered autonomous agent. Top-tier AgentBench performer at 44.1% overall. Excels at household tasks (78%), knowledge graphs (58%), and OS tasks (42%). Strong tool use, planning, and self-correction capabilities. Powered by GPT-4.
Composite score across all evaluation environments
Operating system task completion accuracy
Database query and manipulation performance
Knowledge graph reasoning and retrieval
E-commerce navigation and purchasing
Simulated household task completion
Real-world web browsing task success
Underlying foundation model powering the agent
Organization that created the model
Community usage, market traction, and ecosystem maturity
GAIA benchmark reveals a persistent capability gap between humans and AI on tasks that are conceptually simple for humans. On Level 1 (simplest tasks), humans achieve 92% while the best AI achieves 75%. The gap widens at higher difficulty levels, signaling fundamental limitations in AI reasoning and tool use.
Agent evaluation metrics are shifting from single-attempt success to probabilistic measures. Pass@k (success in at least one of k attempts) and pass^k (success in all k attempts) are becoming standard for measuring both the capability ceiling and the reliability floor.
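To make the two measures concrete, below is a minimal Python sketch estimating pass@k and pass^k from n recorded attempts on a task, c of which succeeded. The function names and the example numbers are illustrative assumptions, not taken from any specific benchmark harness; the pass@k estimator follows the standard combinatorial form popularized in code-generation evaluation.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: probability that at least one of k attempts
    (drawn without replacement from n attempts with c successes) succeeds.
    Measures the capability ceiling."""
    if n - c < k:
        return 1.0  # fewer than k failures recorded, so any k draws include a success
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_all_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: probability that all k attempts succeed.
    Measures the reliability floor."""
    if c < k:
        return 0.0  # not enough successes for k draws to all succeed
    return comb(c, k) / comb(n, k)

# Illustrative example: 10 attempts on a task, 7 succeeded.
print(pass_at_k(10, 7, 3))   # ~0.99: at least one of 3 tries succeeds
print(pass_all_k(10, 7, 3))  # ~0.29: all 3 tries must succeed
```

The same attempt data yields a high pass@k but a much lower pass^k, which is why reporting both gives a fuller picture of an agent's usefulness than single-attempt success alone.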