Running AI Benchmarks Now Costs More Than Airfare

Published: May 1, 2026 at 12:15 AM

Updated: May 1, 2026 at 12:15 AM

What happened

Evaluating a single AI model properly can cost as much as $150,000, according to Hugging Face's EvalEval Coalition, and one run of the PaperBench evaluation alone costs about $9,500. The core problem is duplication: labs rarely publish their evaluation results, so each one pays to rerun the same benchmarks. Static benchmarks can be compressed 100-200× without changing model rankings, but agent benchmarks resist such shortcuts. The biggest cost driver is reliability testing: repeating each evaluation multiple times to verify that results are consistent. Hugging Face is pushing for public evaluation logs and cost-aware leaderboards that weigh accuracy against dollars spent.
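The compression claim is easy to illustrate. A minimal sketch below (not any lab's actual pipeline; the model names and per-item scores are invented) subsamples a static benchmark at 100× compression and checks whether the ranking computed on the small subset matches the ranking on the full item set:

```python
import random

# Toy illustration of benchmark compression by item subsampling.
# All models and accuracies here are made up for demonstration.
random.seed(0)
N_ITEMS = 2000
MODELS = {"model-a": 0.82, "model-b": 0.74, "model-c": 0.69}

# Simulate per-item pass/fail outcomes for each model at its true accuracy.
scores = {m: [random.random() < p for _ in range(N_ITEMS)]
          for m, p in MODELS.items()}

def ranking(item_ids):
    """Rank models by mean score over the given subset of benchmark items."""
    ids = list(item_ids)
    acc = {m: sum(s[i] for i in ids) / len(ids) for m, s in scores.items()}
    return sorted(acc, key=acc.get, reverse=True)

full = ranking(range(N_ITEMS))

# 100x compression: score each model on a random 1% of the items.
subset = random.sample(range(N_ITEMS), N_ITEMS // 100)
compressed = ranking(subset)

print("full:", full)
print("1% subset:", compressed)
```

With well-separated models the subset ranking usually agrees with the full one; the closer two models score, the more items (or repeated runs) are needed to separate them reliably, which is exactly why reliability testing dominates the bill.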

Why it matters

Today's leaderboards hide costs entirely, so the most expensive models dominate the rankings even when far cheaper alternatives score nearly as well.
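One way a cost-aware leaderboard could work is to report the accuracy/cost Pareto frontier rather than a single accuracy column, so a slightly weaker but far cheaper model is not buried. The sketch below is hypothetical (all model names, scores, and costs are invented, and this is not Hugging Face's actual design):

```python
# Hypothetical cost-aware leaderboard: keep only models that are not
# dominated by a cheaper, at-least-as-accurate alternative.
entries = [
    {"model": "frontier-xl", "accuracy": 0.91, "eval_cost_usd": 9500},
    {"model": "mid-open",    "accuracy": 0.89, "eval_cost_usd": 600},
    {"model": "small-fast",  "accuracy": 0.80, "eval_cost_usd": 45},
    {"model": "legacy",      "accuracy": 0.78, "eval_cost_usd": 700},
]

def pareto_frontier(rows):
    """Walk entries in order of increasing cost; keep each one that
    improves on the best accuracy seen so far."""
    frontier = []
    for r in sorted(rows, key=lambda r: r["eval_cost_usd"]):
        if not frontier or r["accuracy"] > frontier[-1]["accuracy"]:
            frontier.append(r)
    return frontier

for row in pareto_frontier(entries):
    print(f'{row["model"]:12} acc={row["accuracy"]:.2f} '
          f'cost=${row["eval_cost_usd"]}')
```

Here the invented "legacy" model drops off the board: it is both pricier and less accurate than "mid-open", while the cheap "small-fast" model stays visible as the best option at its price point.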

Sources