Krux

Running AI Benchmarks Now Costs More Than Airfare
Published: May 1, 2026 at 12:15 AM
Updated: May 1, 2026 at 12:15 AM
What happened
Testing one AI model properly can run $150,000, according to Hugging Face's EvalEval Coalition. A single PaperBench evaluation costs $9,500. The problem: nobody publishes their test results, so every lab pays to run identical benchmarks. Static tests can be compressed 100-200× without changing model rankings, but agent benchmarks resist shortcuts. The real culprit is reliability testing, where multiple runs to verify consistency drive most costs. Hugging Face is pushing for public evaluation logs and cost-aware leaderboards that compare accuracy against dollars spent.
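The compression claim is easy to picture with a toy experiment. The sketch below is purely illustrative, not the coalition's method: it simulates three imaginary models on a hypothetical 20,000-item static benchmark, then scores them on a 1% random subsample (a 100× compression) and checks that the model ranking comes out the same. All model names and accuracies are invented.

```python
import random

random.seed(0)

# Invented "true" accuracies for three hypothetical models.
TRUE_ACCURACY = {"model_a": 0.90, "model_b": 0.75, "model_c": 0.50}
N_ITEMS = 20_000

# Simulate per-item pass/fail results for each model on the full test set.
results = {
    name: [random.random() < acc for _ in range(N_ITEMS)]
    for name, acc in TRUE_ACCURACY.items()
}

def ranking(item_indices):
    """Rank models by mean score over the given subset of benchmark items."""
    scores = {
        name: sum(res[i] for i in item_indices) / len(item_indices)
        for name, res in results.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

full = list(range(N_ITEMS))
subsample = random.sample(full, N_ITEMS // 100)  # 100x fewer items to grade

print(ranking(full))
print(ranking(subsample))
```

With accuracy gaps this wide, 200 items is plenty to separate the models; real benchmarks with closely matched models need more careful subsampling, which is why agent benchmarks (fewer, costlier, noisier tasks) resist this trick.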
Why it matters
Right now, leaderboards hide costs entirely, which means the most expensive models dominate rankings even when cheaper alternatives perform nearly as well.
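One way a cost-aware leaderboard could work is a Pareto filter: keep only models for which no other model is both cheaper to evaluate and at least as accurate. This is a hypothetical sketch, not Hugging Face's implementation; the model names, accuracies, and dollar figures are invented.

```python
# Each entry: (model name, benchmark accuracy, evaluation cost in dollars).
entries = [
    ("big-flagship", 0.92, 9500.0),
    ("mid-tier",     0.90,  950.0),
    ("small-open",   0.81,   95.0),
    ("overpriced",   0.80, 4000.0),  # dominated: mid-tier is cheaper AND better
]

def pareto_frontier(rows):
    """Return names of models not dominated by a cheaper, equally-or-more-accurate one."""
    frontier = []
    for name, acc, cost in rows:
        dominated = any(
            other_cost < cost and other_acc >= acc
            for other_name, other_acc, other_cost in rows
            if other_name != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(entries))  # "overpriced" drops off the board
```

On an accuracy-only leaderboard, "overpriced" would sit respectably in fourth place; the cost axis is what exposes it as a bad deal next to "mid-tier".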