Agent Architecture Swings Scores 12 Points on Same AI Model

Published: May 20, 2026 at 12:13 AM

Updated: May 20, 2026 at 12:13 AM

100-word summary

Hugging Face and IBM Research launched the first full-factorial AI agent benchmark, testing every combination of five agent architectures and five language models across six real-world benchmarks. The surprise: how you wire up an agent changes results by up to 12 percentage points, even when the underlying model stays the same. Model quality still dominates overall performance, but the research found that general-purpose agents now match heavily customized systems on four of the six benchmarks. Open-weight models showed unexpected "generality sinks," collapsing on certain tasks, while closed-source frontier models stayed robust. The leaderboard exposes a blind spot in AI evaluation: single scores hide the fact that different agent designs fail in wildly different ways.

What happened

Why it matters

The leaderboard exposes a blind spot in AI evaluation: single scores hide the fact that different agent designs fail in wildly different ways.
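
To make the full-factorial idea concrete, here is a minimal Python sketch. It assumes the design described in the summary (five architectures, five models, six benchmarks, every combination evaluated). All names (arch_1, model_1, task_1, factorial_grid, architecture_spread, toy_scores) are hypothetical placeholders, and the scores are invented for illustration; they are not results from the leaderboard. The toy numbers are chosen so the spread comes out at 12 points only to mirror the headline figure.

```python
from itertools import product

# Hypothetical placeholder labels: the article does not name the
# architectures, models, or benchmarks.
ARCHITECTURES = [f"arch_{i}" for i in range(1, 6)]
MODELS = [f"model_{i}" for i in range(1, 6)]
TASKS = [f"task_{i}" for i in range(1, 7)]


def factorial_grid():
    """Return every (architecture, model, task) combination: 5 * 5 * 6 cells."""
    return list(product(ARCHITECTURES, MODELS, TASKS))


def architecture_spread(scores, model):
    """Gap between the best and worst architecture for one fixed model.

    Each architecture is summarized by its mean score over all tasks;
    the result is max(mean) - min(mean), in percentage points.
    """
    means = []
    for arch in ARCHITECTURES:
        cells = [scores[(arch, model, task)] for task in TASKS]
        means.append(sum(cells) / len(cells))
    return max(means) - min(means)


def toy_scores():
    """Invented scores (NOT real data): +5 points per model step,
    +3 points per architecture step, identical across tasks."""
    scores = {}
    for a_idx, arch in enumerate(ARCHITECTURES):
        for m_idx, model in enumerate(MODELS):
            for task in TASKS:
                scores[(arch, model, task)] = 40.0 + 5.0 * m_idx + 3.0 * a_idx
    return scores


if __name__ == "__main__":
    print(len(factorial_grid()), "configurations")
    print(architecture_spread(toy_scores(), "model_1"), "point spread for model_1")
```

Averaging each model's row into one number would erase the per-architecture gap that architecture_spread reports, which is the blind spot the article describes.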

Sources

Hugging Face, arXiv