Evaluating the Frontier: Why AI Benchmarking Matters

Artificial Intelligence (AI) has never moved faster — or been harder to measure.

Every week brings a new model claiming to reason, code, or plan better than the last. Demand for both training and inference hardware keeps growing. Now several years into the great AI boom, Google Scholar and arXiv’s AI and Machine Learning-focused directories continue to receive hundreds of new submissions on a daily basis.

Yet the faster the landscape expands, the more questions the industry faces — proof that an abundance of choice isn’t always a good thing. How do the latest LLMs stack up? Who has the best video generation model? Which model is the fastest with a 100k-token prompt? Last but not least: how much will it cost you?

These questions are why benchmarking matters. From the earliest ImageNet competitions to today’s complex language and reasoning tasks, benchmarks have been the invisible engine of AI advancement. They make progress legible, drive accountability, and help researchers, businesses, and policymakers alike speak a common language.

As models have grown from narrow classifiers to multimodal, reasoning, and even agentic systems, benchmarks have had to evolve in lockstep to more closely reflect the industry’s ever-changing focus.

This piece explores the current state of AI benchmarking: why it matters, who conducts it, and the challenges and opportunities at play.

What Is AI Benchmarking?

Benchmarking is the practice of evaluating an AI system’s performance against standardized tests.

A benchmark might be as simple as a dataset of labeled images or as complex as an interactive software environment. What makes it a benchmark is consistency and transparency: every model faces the same challenge, under the same conditions, and produces results that can be compared across time and architectures.
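To make the idea concrete, here is a minimal sketch of what such a harness can look like in practice: a fixed test set, identical conditions for every model, and one comparable metric. The `TEST_SET`, `run_benchmark`, and toy `models` below are illustrative placeholders, not part of any real benchmark suite.

```python
# Minimal sketch of a benchmark harness: a fixed test set, identical
# conditions for every model, and a single comparable metric.
# TEST_SET and the toy models are hypothetical placeholders.
from typing import Callable

TEST_SET = [
    {"prompt": "What is 12 * 7?", "expected": "84"},
    {"prompt": "Capital of Japan?", "expected": "Tokyo"},
]

def run_benchmark(model: Callable[[str], str]) -> float:
    """Return the model's accuracy on the shared test set."""
    correct = sum(
        model(item["prompt"]).strip() == item["expected"]
        for item in TEST_SET
    )
    return correct / len(TEST_SET)

# Every candidate is scored on exactly the same items, so results
# stay comparable across time and architectures.
models = {"model_a": lambda p: "84", "model_b": lambda p: "Tokyo"}
for name, fn in models.items():
    print(f"{name}: accuracy = {run_benchmark(fn):.2f}")
```

Real benchmark suites add many more scenarios and metrics, but the principle of a shared, frozen test set is the same.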

In machine learning’s early days, benchmarks were narrow and task-specific: image recognition (e.g., ImageNet), speech recognition (e.g., TIMIT), syntactic parsing (e.g., the Penn Treebank), and so on. These benchmarks defined the early vocabulary of progress in terms of accuracy, word error rate, and cumulative reward.

In contrast, modern benchmarks are multidimensional: they assess reasoning, safety, latency, cost, and even energy efficiency. And as the industry comes up with new questions, it continues to require new, more sophisticated benchmarks to answer them.

A combined image of handwritten digits extracted from the MNIST database, an example of a dataset constructed specifically for image processing purposes. Source: Wikipedia.

Why AI Benchmarking Matters

Benchmarks are more than technical scoreboards. Every breakthrough in AI, from the first convolutional networks to today’s frontier models, has relied on them to quantify improvement, validate new ideas, and facilitate reproduction. Without shared measurement, innovation would remain siloed and anecdotal, and everyone would be left guessing whether a new architecture is genuinely more capable than its predecessors or simply different. By making results public and comparable, benchmarks instead enable positive feedback loops to form.

At the end of the day, benchmarking is a tool. As such, it can be used at different stages, from research, to production, to industry-wide governance. Let’s cover each in turn.

Research

Benchmarks enable researchers in a number of ways:

  • They enable comparison through metrics rather than claims.
  • They encourage iteration: When experiments are run against the same test suite, labs can isolate with precision which architectural choices or training methods actually drive improvement.
  • They foster reproducibility: Public benchmarks turn one lab’s results into a foundation that others can build upon. Air Street Capital’s State of AI Report 2025 highlights this dynamic clearly: standardized evaluation “turns fragmented experimentation into collective progress.”
  • They democratize discovery for smaller labs and independent researchers, who get access to the same insights as their deep-pocketed peers.

Together, these traits mean innovation happens faster, for less, and spreads more widely.

Production

Benchmarks also underpin the practical side of AI: performance, efficiency, and cost. When you’re processing millions of requests and billions of tokens, every percentage point across any one of those metrics matters. Benchmarking allows companies big and small to access standardized measurements when deciding which hardware, model, or API endpoint to deploy.
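As a rough illustration of the kind of comparison at stake, here is a hypothetical sketch that profiles two endpoints on latency, throughput, and price. The `query_endpoint` function and the prices in `PRICE_PER_M_TOKENS` are made up; a real comparison would call the actual APIs and use a proper tokenizer.

```python
# Hypothetical sketch: comparing two API endpoints on latency,
# throughput, and cost per million tokens. Names and prices are
# illustrative assumptions, not real provider data.
import time

PRICE_PER_M_TOKENS = {"endpoint_a": 0.40, "endpoint_b": 0.25}  # assumed $ prices

def query_endpoint(name: str, prompt: str) -> str:
    """Placeholder for a real API call."""
    time.sleep(0.05)  # simulate network + inference latency
    return "response " * 50

def profile(name: str, prompt: str) -> dict:
    start = time.perf_counter()
    output = query_endpoint(name, prompt)
    elapsed = time.perf_counter() - start
    tokens = len(output.split())  # crude token count for illustration
    return {
        "latency_s": round(elapsed, 3),
        "tokens_per_s": round(tokens / elapsed, 1),
        "cost_per_m_tokens": PRICE_PER_M_TOKENS[name],
    }

for endpoint in PRICE_PER_M_TOKENS:
    print(endpoint, profile(endpoint, "Summarize our Q3 report."))
```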

As the space gets more crowded, providers have put applicability front and center. Artificial Analysis, for instance, emphasizes that its “benchmark results are not intended to represent the maximum possible performance on any particular hardware platform, they are intended to represent the real-world performance customers experience across providers.”

Benchmarking is also central to the product development process, helping developers and product teams validate their own workflows and assess their products’ performance before general availability or regulatory review. This makes internal benchmarks an indispensable complement to external-facing tools such as user testing.

Governance and Safety

Finally, benchmarks are becoming the foundation for AI governance. In recent years, the rapid increase in the technology’s capabilities has raised concerns over the potential risks of its implementation. Newly formed regulatory bodies are now tasked with everything from risk categorization, to defining security, transparency, and quality obligations, to conducting conformity assessments. The tracker maintained by the International Association of Privacy Professionals (IAPP) — last updated in May 2025 — shows extensive global activity, with over 75 jurisdictions having introduced AI-related laws or policies to date.

With that in mind, regulators are increasingly using benchmarks to understand what a model can do before it’s deployed. In particular, they are interested in identifying its behavioral propensities, including a model’s tendency to hallucinate or the reliability of its performance. The UK’s AI Security Institute (AISI), for example, runs independent evaluations of advanced models, assessing areas like biosecurity, cybersecurity, and persuasion capabilities. Its work has set a precedent for government-backed model testing, complementing internal lab efforts and introducing a form of “public audit” for AI safety.

The UK's AI Security Institute produces rigorous research to advance AI governance. Source: AI Security Institute.

Who Does AI Benchmarking Today?

The more central benchmarking became, the more stakeholders got involved in the hope of shaping the industry’s key metrics and practices. Today, a wide range of actors provides the industry with the information it needs, including:

  • Academic and non-profit initiatives: Projects like Stanford’s HELM (Holistic Evaluation of Language Models) and METR (Model Evaluation and Threat Research) drive the scientific backbone of benchmarking. HELM focuses on transparent, open, reproducible evaluations. METR, by contrast, looks at long-horizon and agentic tasks — how well models perform multi-step reasoning or sustain a plan over time.
  • Private providers: A segment of independent analytics and benchmarking companies has emerged to meet the growing need for consistent, business-grade comparisons. Today, providers like Artificial Analysis test commercial systems across intelligence, speed, and pricing. This professionalization of benchmarking signals a maturing market where transparent, repeatable testing becomes a service in and of itself.
  • Internal benchmarks: While companies widely use public, open benchmarks, they also conduct their own testing. Internal benchmarking enables teams to create bespoke metrics and yardsticks before a dedicated public benchmark exists. It also supports tighter feedback loops between measurement and implementation, which can help shorten time to market. For example, at Scaleway, we use internal tests to bulletproof our PaaS solutions.
  • Industry consortia: The MLCommons community runs MLPerf, a leading hardware and deployment benchmark. Originally focused on training workloads, it now emphasizes inference benchmarks, including LLMs, with standardized metrics for performance per watt or dollar. MLCommons’s board of directors includes leaders from manufacturers like Intel, Nvidia, Graphcore, and others.
  • Safety and government evaluators: As mentioned, AI-related legislation has spawned a number of new regulatory bodies tasked with testing frontier models for national and public safety. Early findings from the UK’s AISI have notably called for shared, cross-lab methodologies to test AI’s biosecurity, cyber, and persuasion capabilities.
  • Crowd leaderboards: With its intuitive chatbot interface, the LMSYS Chatbot Arena (now LMArena) has become the public face of model comparison, allowing users to rank AI assistants through anonymous pairwise voting.

Together, these initiatives form an invaluable but increasingly complex — and sometimes contradictory — ecosystem.

LMArena's Leaderboard has become the public face of model comparison. Source: LMArena.

AI Benchmarking’s Challenges…

Even as benchmarking becomes more sophisticated, it faces growing pains — Air Street Capital’s State of AI Report 2025 highlighted that the field’s rapid expansion has outpaced the robustness of its measurements. In no particular order, below are some of its current challenges.

  • Data contamination and repeat exposure: Popular benchmarks have become victims of their own success. Public datasets are widely circulated, meaning portions of test material often end up in model training data. When that happens, performance scores can reflect memory rather than reasoning, creating a false impression of progress.
  • Variance and reproducibility: Some reasoning and math benchmarks contain so few questions that results can swing significantly based on random seed or prompt phrasing. Without multiple runs or confidence intervals, it’s difficult to know whether performance gains are real or statistical noise (see the sketch after this list).
  • Eval-aware behavior: The most advanced models are not just performing better; they are also starting to recognize when they’re being tested. “Eval-aware” models can change tone, verbosity, or refusal behaviors to optimize for benchmark success without improving underlying reasoning. This means evaluators may need to adopt stealthier approaches to ensure authentic behavior under evaluation. MLPerf’s training rules, for example, make it clear that “the framework and system should not detect and behave differently for benchmarks.”
  • Fragmentation across evaluators: Between private providers, non-profits, crowdsourcing, and governmental evaluators, benchmarking as a field is getting increasingly crowded. While this diversity fuels innovation, it also fragments comparability — without a common baseline, it’s easy for models to look stronger or weaker depending on which benchmark suite is chosen. More cross-lab coordination and standardized protocols may be needed.
  • Partial coverage: For all the diversity in today’s benchmarking ecosystem, benchmarks still only cover part of the AI stack. Evaluating hardware or models is essentially a solved problem; assessing processes or code architecture, not so much. Context-specific performance also remains elusive — a challenge considering how companies are now trying to apply AI to increasingly specific verticals. Agents represent yet another hurdle: if agentic systems are to become the industry’s new default UI, benchmarks will need to adapt accordingly.
  • “Tunnel vision” evaluation: As AI regulation becomes more of a priority, benchmarking is showing its limits. Researchers from Sapienza University of Rome, DEXAI-Artificial Ethics, and Sant’Anna School of Advanced Studies pointed out that current benchmarks “were not designed to measure the systemic risks that are the focus of the new regulatory landscape” — a situation they described as the “benchmark-regulation gap.” They found that “the evaluation ecosystem dedicates the vast majority of its focus to a narrow set of behavioral propensities” while neglecting “capabilities central to loss-of-control scenarios, including evading human oversight, self-replication, and autonomous AI development.”
Safety-focused benchmarks tend to ignore capabilities central to “loss-of-control” scenarios. Source: Prandi et al.
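To illustrate the variance point raised above, here is a small sketch (with made-up scores) showing how reporting a mean and a rough 95% confidence interval over several seeded runs makes it easier to tell a genuine improvement from statistical noise.

```python
# Sketch: why single-run scores can mislead. Repeating an evaluation
# across seeds and reporting a confidence interval separates real gains
# from noise. The scores below are illustrative, not real results.
import statistics

def confidence_interval_95(scores: list[float]) -> tuple[float, float]:
    mean = statistics.mean(scores)
    # 1.96 * standard error as a rough 95% interval, for illustration only
    margin = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean - margin, mean + margin

runs_model_a = [0.62, 0.58, 0.65, 0.60, 0.59]  # five seeds, same benchmark
runs_model_b = [0.63, 0.61, 0.57, 0.66, 0.62]

for name, runs in [("model_a", runs_model_a), ("model_b", runs_model_b)]:
    low, high = confidence_interval_95(runs)
    print(f"{name}: mean={statistics.mean(runs):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With overlapping intervals like these, a one-point difference in a single run tells us very little.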

… And Opportunities

While the above challenges need to be addressed, we should remain optimistic: positive trends are developing that should make benchmarking more useful and reliable, not less. Here are a few of them:

  • Improving transparency: Independent providers like Artificial Analysis and projects like HELM or MLPerf are setting a new bar for open methodology, releasing full prompt logs, evaluation seeds, and configuration details that allow any lab to audit and reproduce their results. These groups apply the same “open-door” approach to their blogs, publishing full sets of questions, responses to upcoming legislation or private frameworks, and more. Together, these efforts ensure that the culture of benchmarking leans towards openness rather than secrecy.
  • Positive scrutiny: They say to solve a problem, you must first acknowledge it. Beyond pointing at structural issues, a number of papers have aimed to provide new frameworks for evaluating benchmarks — benchmarking the benchmarkers, if you will. Stanford, for instance, unveiled BetterBench, “a living repository of benchmark assessments to support benchmark comparability.” Overall, this growing (but positive) scrutiny should help improve benchmark design and usability.
  • Future-proofing: Long gone are the days of narrow benchmarking. With the rise of multimodal, reasoning, and now even agentic systems, providers continue to come up with new benchmarks to ensure their work reflects AI’s new capabilities and the industry’s evolving needs. Notable metrics include a model’s “task completion time horizon” (the length of tasks AI agents can complete), long-term coherence, and cultural inclusivity. Meanwhile, practices such as periodic dataset revisions and the addition of new scenarios ensure that benchmarking remains as dynamic as the ecosystem it’s meant to monitor.
  • Growing collaboration: Despite stark competition for customers’ attention and spending, the industry’s various stakeholders are increasingly collaborating on shared protocols. For example, OpenAI, Anthropic, and Google DeepMind have all participated in joint safety evaluations under the supervision of the UK’s AISI, giving it early or priority access to their models for research and safety purposes. Meanwhile, with its 125+ founding Members and Affiliates all working towards a shared goal of open, industry-standard benchmarks, MLCommons stands as perhaps the best example of industry-wide collaboration.

We look forward to seeing how these developments impact the industry for the better!

Where AI Benchmarking Is Going Next

From its early, task-specific beginnings, benchmarking has evolved into vital infrastructure for the entire field of AI. Today, the combined efforts of the public and private sectors provide the industry with continuously updated data and insights. Benchmarks enable comparison, encourage iteration, and drive discovery, ensuring that competition ultimately benefits everyone.

As the field matures, what we measure may increasingly determine what we build. With concerns over AI safety rising, regulators are relying on benchmarking more and more for practical decision-making, while companies are turning to it for transparency and trust.

At Scaleway, benchmarking is equally central to how we operate. For example, it helps guide the roadmap for our Generative APIs, ensuring we integrate only battle-tested models with proven real-world applicability.

But this relationship also flows the other way: we actively contribute to the benchmarking ecosystem. Our API inference endpoint will soon be featured on Artificial Analysis, giving researchers and businesses direct visibility into our own performance. It’s a small but meaningful step, and a reflection of our continued focus on transparency.




ai-PULSE, Europe’s premier Artificial Intelligence conference powered by Scaleway, is returning!

Gathering key players from across Europe, the event will be back once again at STATION F on December 4 for a unique blend of deep technical expertise and crucial business insights. You’ll hear from:

  • Micah Hill-Smith, Co-Founder & CEO of Artificial Analysis, on which metrics truly matter in the new AI stack
  • Boris Gamazaychikov (Head of AI Sustainability at Salesforce) and Elise Auvray (Product Manager, Environmental Footprint at Scaleway) on how we can make “energy-efficient” AI measurable

... and dozens more leaders and engineers shaping the technology’s future.

Whether you’re planning to attend in-person or online, make sure to register!
