
Running AI Benchmarks Now Costs More Than Airfare
Testing one AI model properly can cost $150,000, according to Hugging Face's EvalEval Coalition, and a single PaperBench evaluation alone runs $9,500. The core problem is duplication: labs rarely publish their evaluation results, so each one pays to run the same benchmarks. Static tests can be compressed 100–200× without changing model rankings, but agent benchmarks resist such shortcuts, and the biggest cost driver is reliability testing, where repeated runs to verify consistency account for most of the spend. Hugging Face is pushing for public evaluation logs and cost-aware leaderboards that weigh accuracy against dollars spent. Today's leaderboards hide cost entirely, so the most expensive models dominate rankings even when cheaper alternatives perform nearly as well.
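To make the cost-aware leaderboard idea concrete, here is a minimal sketch of what such a comparison could compute. All model names, accuracy scores, margins, and dollar figures below are illustrative assumptions, not published numbers, and this is not Hugging Face's actual implementation:

```python
# Hypothetical cost-aware leaderboard sketch: rank by accuracy,
# but surface cheaper models that land within a small accuracy
# margin of the leader. All values are made up for illustration.
from dataclasses import dataclass

@dataclass
class EvalResult:
    model: str
    accuracy: float  # benchmark accuracy in [0, 1]
    cost_usd: float  # total dollars spent to produce the score

results = [
    EvalResult("model-a", 0.91, 9_500.0),
    EvalResult("model-b", 0.89, 1_200.0),
    EvalResult("model-c", 0.84, 300.0),
]

# A conventional leaderboard sorts by accuracy alone, so the most
# expensive model tops the table.
by_accuracy = sorted(results, key=lambda r: r.accuracy, reverse=True)

# A cost-aware view flags models within MARGIN of the leader that
# were far cheaper to evaluate.
MARGIN = 0.03
leader = by_accuracy[0]
for r in by_accuracy:
    near_leader = (leader.accuracy - r.accuracy) <= MARGIN
    cheaper = r.cost_usd < leader.cost_usd
    flag = "  <- near-leader accuracy at a fraction of the cost" if near_leader and cheaper else ""
    print(f"{r.model}: acc={r.accuracy:.2f}, eval cost=${r.cost_usd:,.0f}{flag}")
```

Under these assumed numbers, model-b would be flagged as scoring within three points of the leader while costing roughly an eighth as much to evaluate, which is exactly the trade-off an accuracy-only leaderboard hides.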

