AI benchmarking
How to find the smartest AI
June 20, 2025
THE DIZZYING array of letters splattered across the page of one of Jonathan Roberts’s visual-reasoning questions resembles a word search assembled by a sadist. Test-takers aren’t merely tasked with finding the hidden words in the image, but with spotting a question written in the shape of a star and then answering that in turn.
The intention of Mr Roberts’s anthology of a hundred questions is not to help people pass the time on the train. Instead, it is to provide cutting-edge artificial-intelligence (AI) models like o3-pro, June’s top-tier release from OpenAI, with a test worthy of their skills.
There is no shortage of tests for AI models. Some seek to measure general knowledge, others are subject-specific. There are those that aim to assess everything from puzzle-solving and creativity to conversational ability. But not all of these so-called benchmarking tests do what they claim to. Many were hurriedly assembled, with flaws and omissions; were too easy to cheat on, having filtered into the training data of AI models; or were just too easy for today’s “frontier” systems.
ZeroBench, the challenge launched by Mr Roberts and his colleagues at the University of Cambridge, is one prominent alternative. It is targeted at large multimodal models—AI systems that can take images as well as text as input—and aims to present a test that is easy(ish) for the typical person and impossible for state-of-the-art models. For now, no large language model (LLM) can score a single point. Should some upstart one day do better, it would be quite an achievement.
ZeroBench isn’t alone. EnigmaEval is a collection of more than a thousand multimodal puzzles assembled by Scale AI, an AI data startup. Unlike ZeroBench, EnigmaEval doesn’t try to be easy for anyone. The puzzles, curated from a variety of pre-existing online quizzing resources, start at the difficulty of a fiendish cryptic crossword and get harder from there. When advanced AI systems are pitted against the hardest of these problems, their median score is zero. A frontier model from Anthropic, an AI lab, is the only model to have got a single one of these questions right.
Other question sets attempt to track more specific abilities. METR, an AI-safety group, for instance, tracks the length of time it would take people to perform the individual tasks that AI models are now capable of completing (a model from Anthropic is the first to break the hour mark). Another benchmark, the brashly named “Humanity’s Last Exam”, tests knowledge, rather than intelligence, with questions from the front line of human knowledge garnered from nearly a thousand academic experts.
One of the reasons for the glut of new tests is a desire to avoid the mistakes of the past. Older benchmarks abound with sloppy phrasings, bad mark schemes or unfair questions. ImageNet, an early image-recognition data set, is an infamous example: a model that describes a photograph of a mirror in which fruit is reflected is penalised for saying the picture is of a mirror, but rewarded for identifying a banana.
It is impossible to ask models to solve corrected versions of these tests without compromising researchers’ ability to compare new models’ scores with those of older ones that took the flawed versions. Newer tests—produced in an era when AI research is flush with resources—can be laboriously vetted to spot such errors before release.
The second reason for the rush to build new tests is that models have learned the old ones. It has proved hard to keep any common benchmark out of the data used by labs to train their models, resulting in systems that perform better on the exams than they do on everyday tasks.
The third, and most pressing, issue motivating the creation of new tests is saturation—AI models coming close to getting full marks. On a selection of 500 high-school maths problems, for example, o3-pro is likely to get a near-perfect score. But as o1-mini, released nine months earlier, scored 98.9%, the results do not offer observers a real sense of progress in the field.
This is where ZeroBench and its peers come in. Each tries to measure a particular way in which AI capabilities are approaching—or exceeding—those of humans. Humanity’s Last Exam, for instance, sought to devise intimidating general-knowledge questions (its name derives from its status as the most fiendish such test it is possible to set), asking for anything from the number of tendons supported by a particular hummingbird bone to a translation of a stretch of Palmyrene script found on a Roman tombstone. In a future where many AI models can score full marks on such a test, benchmark-setters may have to move away from knowledge-based questions entirely.
But even evaluations which are supposed to stand the test of time get toppled overnight. ARC-AGI, a non-verbal reasoning quiz, was introduced in 2024 with the intention of being hard for AI models. Within six months, OpenAI announced a model, o3, capable of scoring 91.5%.
For some AI developers, existing benchmarks miss the point. OpenAI’s boss Sam Altman hinted at the difficulties of quantifying the unquantifiable when the firm released its GPT-4.5 in February. The system “won’t crush benchmarks”, he tweeted. Instead, he added, before publishing a short story the model had written, “There’s a magic to it I haven’t felt before.”
Some are trying to quantify that magic. Chatbot Arena, for example, allows users to have blind chats with pairs of LLMs before being asked to pick which is “better”—however they define the term. Models that win the most matchups float to the top of the leaderboard. This less rigid approach appears to capture some of that ineffable “magic” that other ranking systems cannot. Such rankings can, however, also be gamed, with more ingratiating models scoring higher with seducible human users.
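How do such matchups become a ranking? Arena-style leaderboards are typically built from Elo-style ratings (or a closely related Bradley-Terry fit) computed over the pairwise votes. The Python sketch below is a minimal Elo version with made-up model names and votes; it illustrates the principle rather than Chatbot Arena’s actual code.

```python
# Minimal sketch of an Elo-style leaderboard built from blind pairwise votes.
# Model names, votes and the K-factor are illustrative, not real data.
from collections import defaultdict

K = 32  # a common Elo K-factor; production systems tune this or fit Bradley-Terry
ratings = defaultdict(lambda: 1000.0)  # every model starts on the same rating

def expected(a: str, b: str) -> float:
    """Probability that model a beats model b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400.0))

def record_vote(winner: str, loser: str) -> None:
    """Nudge both ratings after a user picks `winner` over `loser`."""
    e = expected(winner, loser)
    ratings[winner] += K * (1.0 - e)
    ratings[loser] -= K * (1.0 - e)

# Hypothetical votes from blind head-to-head chats
for winner, loser in [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]:
    record_vote(winner, loser)

for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Seen this way, the gaming problem is plain: every seduced vote is simply another win fed into the update.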
Others, borrowing an argument familiar to anyone with school-age children, question what any test can reveal about an AI model beyond how good it is at passing that test. Simon Willison, an independent AI researcher in California, encourages users to keep track of the queries that existing AI systems fail to fulfil before posing them to their successors. That way users can select models that do well at the tasks that matter to them, rather than high-scoring systems ill-suited to their needs.
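Mr Willison’s suggestion can be automated in a few lines: save the prompts that current models flub, then replay them whenever a new model appears. The sketch below assumes a hypothetical ask function standing in for whichever API or local model a user prefers, and a deliberately crude keyword check as the pass criterion.

```python
# Sketch of a personal benchmark: replay prompts that older models flubbed
# against each new model and note which ones it now handles.
# `ask` is whatever function the user writes to call their preferred model;
# `must_contain` is a deliberately crude pass check.
import json
from typing import Callable

def run_personal_evals(ask: Callable[[str], str],
                       path: str = "my_failed_prompts.json") -> None:
    with open(path) as f:
        cases = json.load(f)  # e.g. [{"prompt": "...", "must_contain": "..."}]
    for case in cases:
        answer = ask(case["prompt"])
        passed = case["must_contain"].lower() in answer.lower()
        print(f"{'PASS' if passed else 'FAIL'}: {case['prompt'][:60]}")

# Usage (hypothetical): run_personal_evals(lambda prompt: my_model_api(prompt))
```

The point is not rigour but relevance: the questions are, by construction, ones the user actually cares about.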
All this assumes that AI models are giving the tests facing them their best shot. Sandbagging, in which models deliberately fail tests in order to hide their true capabilities (to prevent themselves from being deleted, for example), has been observed in a growing number of models. In a report published in May by researchers at MATS, an AI-safety group, top LLMs were able to identify when they were being tested almost as well as the researchers themselves could. This too complicates the quest for reliable benchmarks.
That being said, the value to AI companies of simple leaderboards that their products can top means the race to build better benchmarks will continue. ARC-AGI 2 was released in March, and still eludes today’s top systems. But, aware of how quickly that might change, its creators have already begun work on ARC-AGI 3. ■