The AgentBench evaluation shows that GPT-4-class models achieve an overall success rate of roughly 44% across 8 real-world agent environments, while open-source alternatives score 15-30% lower. Operating System and Database tasks show the largest capability gaps, highlighting how difficult autonomous agent development remains.
GPT-4 achieves 44% overall AgentBench score across 8 environments (OS, DB, KG, DCG, LTP, ALFWorld, WebShop, Mind2Web)
Dec 28, 2025
Open-source models score 15-30% lower than GPT-4 on agent tasks, revealing significant capability gaps
Dec 28, 2025
WebArena's 812 web tasks show multi-step planning remains a key challenge for all models
Dec 28, 2025