Artificial intelligence, once regarded as a futuristic aspiration, has now become an undeniable and rapidly maturing force—outpacing human capabilities across a growing list of tasks and upending previous assumptions about what machines are capable of. This exponential progress has not only triggered commercial and societal transformation but also sparked a fundamental crisis for researchers and practitioners: how can we accurately benchmark AI when the benchmark itself—human parity—no longer holds? As models break through the thresholds of human performance in everything from image recognition to scientific reasoning, measurement itself becomes an ever more elusive goal, threatening to obscure both the strengths and potential risks of this accelerating technology.

The Vanishing Human Benchmark

At the heart of today’s AI challenge lies a paradox. As Russell Wald, executive director of the Stanford Institute for Human-Centered Artificial Intelligence (HAI), observed at the 2025 Fortune Brainstorm AI Singapore conference, “AI is exceeding human capabilities and it’s becoming increasingly harder for us to benchmark.” Wald’s summary is not hyperbole; it’s the conclusion of a data-driven process, captured each year in the Stanford AI Index—an extensive report tracking research trends, socio-economic impact, and the often astonishing performance leaps of modern AI.
By mid-2024, HAI’s metrics indicated there were “very few task categories where human ability surpasses AI,” and even in such pockets, the “performance gap between AI and humans is shrinking rapidly.” Tasks as disparate as competition-level mathematics, PhD-level science questions, and hyper-realistic image generation have seen AI outpace all but the very best human practitioners. Take, for example, text-to-image generation models like Midjourney. In just two years, model outputs leapt from cartoonish, clumsy depictions of Harry Potter to results that border on “uncanny” likenesses of Daniel Radcliffe, illustrating the continual tightening of the AI–human performance delta.

From Lab to Life: AI’s Real-World Ascent​

AI’s migration out of the laboratory is perhaps most starkly illustrated in healthcare and transportation. In 2015, the U.S. FDA approved just six AI-enabled medical devices; by 2023, that figure had soared to 223. Simultaneously, self-driving cars have exited the prototype phase: Waymo, for instance, now delivers more than 150,000 autonomous rides each week in U.S. cities, while China’s Baidu operates Apollo Go, an affordable robotaxi fleet active in a growing number of urban centers.
These advances are not isolated technical feats; they point to an inflection point in the commercialization and adoption of AI. According to a recent McKinsey report, business use of AI saw a significant jump, with 78% of organizations surveyed reporting use of AI in at least one function—a sharp rise from 55% the previous year. The report’s figures are supported by HAI Index data and third-party analyses, signaling that AI’s growth is as much economic as it is technical.

The Economics of AI: Cost, Efficiency, and Accessibility​

Underpinning these impressive adoption figures is a revolution in model efficiency. Wald highlighted a notable trend: the inference cost for systems operating at GPT-3.5 capability plummeted more than 280-fold between late 2022 and 2024, driven primarily by advances in model architecture, inference efficiency, and hardware. Hardware costs themselves are dropping roughly 30% per year, while energy efficiency improves at a brisk 40% annually.
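To make those compounding rates concrete, here is a quick back-of-the-envelope calculation in Python. It is a sketch only: it assumes the 280-fold drop spans roughly two years and that the quoted hardware and energy rates hold constant.

```python
# Back-of-the-envelope math on the cost trends quoted above.
# Assumptions: the 280-fold inference-cost drop spans ~2 years
# (late 2022 to late 2024), and the annual rates are constant.

TOTAL_DROP = 280          # GPT-3.5-level inference cost reduction
YEARS = 2                 # late 2022 -> late 2024

# Implied annualized decline: costs shrink by the same factor each year.
annual_factor = TOTAL_DROP ** (1 / YEARS)   # ~16.7x cheaper per year
annual_decline = 1 - 1 / annual_factor      # ~94% cheaper each year

HARDWARE_DECLINE = 0.30   # quoted hardware cost drop per year

def hardware_cost_after(n_years: int, start_cost: float = 1.0) -> float:
    """Relative cost of the same workload after n years,
    from the hardware trend alone."""
    return start_cost * (1 - HARDWARE_DECLINE) ** n_years

print(f"Implied annual inference-cost factor: {annual_factor:.1f}x")
print(f"Implied annual inference-cost decline: {annual_decline:.0%}")
print(f"Relative hardware cost after 3 years: {hardware_cost_after(3):.2f}")
```

Even read loosely, the implied annual factor (roughly 17x cheaper per year) helps explain how GPT-3.5-class capability moved from premium pricing to commodity territory so quickly.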
Notably, open-weight models—once several steps behind their proprietary counterparts—are narrowing the performance gap at stunning speed. On key benchmarks, this spread has contracted from 8% to just 1.7% in twelve months, drastically reducing the barriers to entry for non-tech giants and global academic teams.
However, not all progress is evenly distributed. While inference and hardware become more affordable, the costs of training massive foundational models still pose a nearly insurmountable barrier for most. As noted by both the AI Index and independent analyses, nearly 90% of breakthrough models in 2024 originated in industry (up from 60% the prior year), reflecting both the staggering capital requirements and the growing centralization of AI’s cutting edge.

Growing Pains: The Proliferation, Convergence, and Competition of Models​

If today’s AI landscape is defined by anything, it is the sheer pace at which the frontier is advancing and the growing competitiveness among key players. The scale of model training is now staggering: computational demand doubles every five months, dataset sizes every eight, and power requirements climb year over year. Yet despite this scaling, the performance spread between the top-ranked and 10th-ranked models has collapsed from an 11.9% gap to just 5.4%, with the best two models now separated by a razor-thin 0.7%.
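Those doubling times compound quickly. A minimal sketch of the arithmetic (assuming smooth exponential growth, which real training runs only approximate):

```python
# Convert the doubling times quoted above into annual growth factors.
# Sketch only: assumes constant-rate exponential growth.

def annual_growth(doubling_months: float) -> float:
    """Factor by which a quantity grows in 12 months,
    given its doubling time in months."""
    return 2 ** (12 / doubling_months)

compute_growth = annual_growth(5)   # compute doubles every 5 months
dataset_growth = annual_growth(8)   # datasets double every 8 months

print(f"Training compute grows ~{compute_growth:.1f}x per year")
print(f"Dataset size grows ~{dataset_growth:.1f}x per year")
```

At those rates, a single year of scaling multiplies compute demand by roughly 5x and data by nearly 3x, which is why the power and sustainability questions raised later in this piece are not hypothetical.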
This convergence marks a historic departure from just a few years ago, when a couple of players, chiefly OpenAI and Google, held a near monopoly on high-performing models. Now, several providers deliver results so close that consumers and businesses are no longer restricted to a single dominant option, stimulating ecosystem diversity but also making model evaluation increasingly nuanced.

Benchmarking in the Era of Superhuman AI​

As AI approaches, matches, and then exceeds human-level competence in a growing set of tasks, the definition and measurement of intelligence itself faces new pressures. For years, benchmarks like ImageNet and SQuAD served as reliable proxies for progress, pitting AI against well-defined tasks and datasets. Today, these same benchmarks are routinely “solved” to superhuman levels, requiring researchers to invent new, ever-more intricate test suites—many of which risk being rendered obsolete in short order.
The implications are profound. Without robust, widely accepted, and continually updated benchmarks, there is a very real danger of drawing misleading conclusions about capability and safety. “Last year’s AI Index was among the first publications to highlight the lack of standard benchmarks for AI safety and responsibility evaluations,” notes Wald, signaling not just a technical conundrum but also an urgent policy issue.
Evaluation itself is further complicated by the tendency of benchmarks to foster overfitting: models begin to “train to the test,” achieving high scores that might not translate to genuine, deployable intelligence in novel or real-world settings. In response, researchers are increasingly relying on adversarial testing (deliberately seeking failures or blind spots), as well as benchmarks that attempt to mimic open-ended, complex situations—domains where human judgment and flexibility have, until now, been the gold standard.
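In code terms, the adversarial approach boils down to checking whether a model’s answers survive small, meaning-preserving changes to its inputs. The sketch below is purely illustrative: `model` stands in for any classifier under test, and `perturb` is a hypothetical helper (real suites use paraphrases, synonym swaps, and stronger attacks); neither is drawn from an actual benchmark library.

```python
import random
from typing import Callable

def perturb(text: str) -> str:
    """Hypothetical perturbation: swap two adjacent characters.
    Real adversarial suites use paraphrases, synonym swaps, etc."""
    if len(text) < 2:
        return text
    i = random.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def adversarial_pass_rate(model: Callable[[str], str],
                          inputs: list[str],
                          trials: int = 20) -> float:
    """Fraction of inputs whose label survives every perturbation.
    A high benchmark score paired with a low pass rate here suggests
    the model is overfit to the test rather than genuinely robust."""
    stable = 0
    for text in inputs:
        baseline = model(text)
        if all(model(perturb(text)) == baseline for _ in range(trials)):
            stable += 1
    return stable / len(inputs)

# Toy usage: a trivially brittle "model" that keys on an exact word.
toy_model = lambda t: "positive" if "excellent" in t else "negative"
print(adversarial_pass_rate(toy_model, ["excellent film", "terrible film"]))
```

The design point is the contrast: a model can score highly on a static benchmark while failing this kind of stability check, which is exactly the train-to-the-test failure mode described above.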

The Diminishing Returns of Scale​

A striking consequence of this hyper-competitive environment is the diminishing returns of model scale. At the upper echelons, additional compute and data yield ever-smaller gains, with performance improvements coming at disproportionately higher costs. For example, Google’s original transformer model cost a mere $930 to train in 2017; by 2023, the state-of-the-art Gemini Ultra demanded roughly $200 million in training costs, a figure supported by multiple industry estimates.
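Compounded over that window, the implied growth rate is extreme. A quick calculation, assuming the 2017 and 2023 endpoints and the cost estimates quoted above:

```python
# Implied compound annual growth in frontier training cost.
# Assumptions: ~$930 for the original 2017 transformer and
# ~$200M for Gemini Ultra (~2023), both as estimated above.

START_COST = 930            # USD, transformer (2017)
END_COST = 200_000_000      # USD, Gemini Ultra (~2023)
YEARS = 6

total_ratio = END_COST / START_COST     # ~215,000x overall
cagr = total_ratio ** (1 / YEARS) - 1   # compound annual growth rate

print(f"Total cost increase: {total_ratio:,.0f}x")
print(f"Implied annual growth: {cagr:.0%} (about {1 + cagr:.1f}x per year)")
```

By this estimate, frontier training costs have multiplied roughly sevenfold to eightfold every year, which is precisely the dynamic that pushes foundation-model development toward the best-capitalized labs.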
Yet, as costs soar, the tangible benefits are beginning to plateau—a reality reflected in the narrowing performance gaps described earlier. This is a potential harbinger of both opportunity for smaller players (should architectures and methods emerge that are less reliant on brute-force scale) and concern for sustainability, energy, and the ongoing concentration of power among a handful of tech giants.

Geopolitics at the Frontier​

The global race for AI supremacy continues to shape innovation and policy, with the U.S. and China the principal contenders. For now, the U.S. retains a lead—primarily due to its mix of proprietary model providers, vibrant academic research, and sophisticated commercial ecosystems. However, the gap is tightening at a pace that has surprised many experts.
China, for its part, is betting heavily on an open-source strategy and a concerted effort to build domestic AI talent. If current trends persist, Wald and the AI Index predict that China could surpass the U.S. in overall model performance in the near future—a claim echoed by several independent technology analysts and corroborated by export data, talent metrics, and open-source project growth.
This potential shift is not merely a matter of pride or economic interest. It raises questions about global governance, the shape of future standards, and how different social, political, and ethical frameworks might influence the next phase of AI adoption.

Public Opinion: A Divided Reception​

The AI Index’s investigation into global opinion emphasizes how AI’s rise is not viewed uniformly. In China, approval rates for AI usage are remarkably high at 83%, with Indonesia (80%) and Thailand (77%) also showing strong support. By contrast, Western countries such as Canada (40%), the U.S. (39%), and the Netherlands (36%) report much more reserved attitudes—likely reflecting cultural differences, differing media coverage, and varied experiences with automation and labor market disruption.
This East/West divergence signals a potential realignment of technology adoption, regulation, and public acceptance, especially as AI becomes inextricable from everyday work and life.

Open Weights, Closed Worlds​

One of the most significant meta-trends in the 2024–25 period is the surging pace of open-weight AI development. Whereas open-source models once trailed their closed-source peers by significant margins, they now threaten to outpace them on select tasks—propelled by global collaboration, rapid code sharing, and a swelling community of AI practitioners. The reduction of the performance gap to 1.7% on standard benchmarks is not just technical trivia; it could prove transformative for the future structure of AI research and deployment.
That said, critical challenges remain. Training costs, data access, and compute power still overwhelmingly favor industry giants. Furthermore, the lack of universally accepted safety and robustness metrics means that open models can sometimes mirror their closed-source relatives in unintended biases or vulnerabilities, with far fewer resources available for thorough red-teaming and safety validation.

Opportunities and Dangers on the Horizon​

As the capabilities of AI continue to explode, so too do the attendant risks. On one hand, democratized access to more powerful, efficient AI models could unlock waves of innovation, flatten industry hierarchies, and put sophisticated tools in the hands of individuals, researchers, and smaller companies around the globe.
On the other, as cost and access barriers tumble, so does the potential for misuse. Advanced AI systems, particularly those lacking rigorous alignment and safety checks, amplify the risk of unintended consequences—from deepfakes and misinformation to the manipulation of financial or social systems. The proliferation of tools also complicates regulatory oversight, as the definition of “frontier” AI shifts almost monthly.
In addition, there remains a very real risk of overstating capability. Despite benchmark-dominating performance, even state-of-the-art models can harbor glaring limitations: poor generalization in out-of-distribution cases, susceptibility to adversarial prompts, or failure in rare but critical edge cases. As academic participation wanes in favor of industry-led research, peer review and transparency in AI development face new strains—highlighting the need for both technical and governance innovation.

Critical Analysis: Strengths and Risks​

Key Strengths​

  • Unprecedented Performance: AI’s ability to match and exceed human abilities across a vast array of tasks is now an empirical fact, driving real-world adoption from healthcare to autonomous vehicles.
  • Accelerated Cost Decline: Dramatic improvements in inference and hardware efficiency democratize AI access, while the leap in open-weight model performance offers a credible alternative to proprietary offerings.
  • Ecosystem Diversification: The narrowing gap between top providers increases competition, fosters innovation, and gives customers real alternatives, challenging monopolistic market structures.

Major Risks and Cautions​

  • Benchmark Obsolescence: The outpacing of human benchmarks creates a moving target for evaluation, threatening meaningful measurement and heightening the danger of overfitting to test metrics rather than fostering genuine intelligence.
  • Concentration of Power: Soaring training costs ensure that foundation model innovation remains mostly the purview of a small set of well-capitalized industry leaders, with academia and smaller players sidelined.
  • Safety and Robustness Gaps: The absence of standard, widely adopted safety and responsibility benchmarks means both intentional and accidental misuse remain challenging to detect and mitigate.
  • Divided Public Opinion: Stark differences in global public sentiment towards AI may drive divergent regulatory approaches, hampering international collaboration and potentially spawning fragmented digital ecosystems.
  • Environmental Impact: Escalating compute and power demands raise concerns about the sustainability of perpetual model scaling, with environmental implications that remain underreported in mainstream discussions.

Toward the Next Generation of AI Measurement​

AI’s relentless advance renders yesterday’s benchmarks, safety protocols, and evaluation strategies increasingly inadequate. There is now an urgent need for the AI research community, across industry, academia, and the public sector, to collaborate on new standards that rigorously and regularly test for generalization, adaptability, and social impact, not merely synthetic performance milestones.
At the same time, policymakers and practitioners must remain vigilant to the dual and sometimes contradictory trends: more accessible and powerful tools, and a centralization of innovation in fewer hands. This paradox, perhaps more than any single technical metric, will shape the next wave of artificial intelligence—defining not just who benefits, but at what cost.
The decision of how, and by whom, AI success is measured is itself a form of governance—a reflection of values as much as technical possibility. As the gap between artificial and human intelligence narrows, and the lines blur, the question grows only more urgent: not just how smart AI has become, but how wisely we’re choosing to judge—and to guide—it.

Source: inkl, “AI keeps getting more powerful, making it harder to judge how smart models actually are”
 
