AI Benchmark Discrepancy Reveals Gaps in Performance Claims

Figure: FrontierMath accuracy for OpenAI's o3 and o4-mini compared to leading models. Image: Epoch AI

The latest results from FrontierMath, a benchmark test of generative AI on advanced math problems, show that OpenAI's o3 model performed worse than OpenAI initially stated. While newer OpenAI models now outperform o3, the discrepancy highlights the need to scrutinize AI benchmarks carefully.

Epoch AI, the research institute that created and administers the test, released its latest findings on April 18.

OpenAI claimed 25% completion of the test in December

Last year, the FrontierMath score for OpenAI o3 was part of the nearly overwhelming stream of announcements and promotions released during OpenAI's 12-day holiday event. The company claimed that OpenAI o3, then its most powerful reasoning model, had solved more than 25% of the problems on FrontierMath. By comparison, most rival AI models scored around 2%, according to TechCrunch.

SEE: For Earth Day, organizations may factor generative AI's power consumption into their sustainability efforts.

On April 18, Epoch AI released test results showing OpenAI o3 scored closer to 10%. So why is there such a big difference? Both the model and the test may have been different back in December: the version of OpenAI o3 submitted for benchmarking last year was a prerelease build, and FrontierMath itself has changed since December, with a different number of math problems. This isn't necessarily a reminder not to trust benchmarks; rather, it's a reminder to dig into the version numbers.
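To see why the benchmark's version matters as much as the model's, note that a headline score is just problems solved divided by problems posed. The sketch below uses invented numbers (not Epoch AI's actual data) to show how the same metric can swing when the problem set is revised between runs:

```python
# Hypothetical illustration: a benchmark score depends on which revision
# of the problem set the model is graded against.

def accuracy_pct(solved: int, total: int) -> float:
    """Percentage of benchmark problems solved."""
    return 100 * solved / total

# December run: prerelease model, earlier problem set (numbers invented).
december = accuracy_pct(solved=45, total=180)   # 25.0%

# April run: released model, revised and expanded problem set.
april = accuracy_pct(solved=29, total=290)      # 10.0%

print(f"December: {december:.1f}% | April: {april:.1f}%")
```

A reported percentage is only meaningful alongside the exact model build and benchmark revision it was measured against.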

OpenAI o4-mini and o3-mini score highest in new FrontierMath results

The updated results show that OpenAI o4-mini with reasoning performed best, scoring between 15% and 19%. It was followed by OpenAI o3-mini, with o3 in third. Other rankings include:

  • OpenAI o1
  • Grok-3 mini
  • Claude 3.7 Sonnet (16K)
  • Grok-3
  • Claude 3.7 Sonnet (64K)

Although Epoch AI administers the test independently, OpenAI originally commissioned FrontierMath and owns its content.

Criticisms of AI benchmarking

Benchmarks are a common way to compare generative AI models, but critics say the results can be influenced by test design or a lack of transparency. A July 2024 study raised concerns that benchmarks often overemphasize narrow task accuracy and suffer from non-standardized evaluation practices.
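One concrete example of how evaluation choices change headline numbers is sampling policy: whether a model gets one attempt per problem or many. The unbiased pass@k estimator introduced in OpenAI's HumanEval paper makes this explicit, and the sketch below (with hypothetical sample counts) shows how the same model can look very different at pass@1 versus pass@8 if labs don't report k consistently:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples is correct, given n total samples of which c are correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples exist, so success is certain
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 100 attempts per problem, 12 of which are correct.
print(f"pass@1: {pass_at_k(100, 12, 1):.3f}")  # 0.120
print(f"pass@8: {pass_at_k(100, 12, 8):.3f}")  # ~0.655
```

Without an agreed-on k and sampling setup, two labs can report honestly measured yet incomparable scores for the same model.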
