Every Benchmark Launched 2023-2024 Has Fallen — The METR / SWE-Bench / CORE-Bench / MLE-Bench / PostTrainBench Sequence

TL;DR

All six major AI benchmarks introduced in 2023-2024 have reached or are approaching saturation, each within months rather than years. This pattern suggests that AI research capabilities are advancing faster than some earlier projections anticipated, which may influence the trajectory of AI development and deployment.

According to Thorsten Meyer, as of May 2026, every benchmark designed to measure AI research and engineering capability launched in 2023 and 2024 has either been saturated, declared solved, or is tracking toward saturation on a timeline of just a few months. These benchmarks include SWE-Bench, METR Time Horizons, CORE-Bench, MLE-Bench, PostTrainBench, and CPU Speedup tasks.

For example, scores on SWE-Bench, which measures real-world software engineering tasks, rose from 2% in late 2023 to 93.9% in May 2026, a gain of roughly 92 percentage points over 30 months. Similarly, METR Time Horizons, which tracks the length of tasks (measured in human working time) that AI systems can complete, increased from 30 seconds in 2022 to 12 hours in 2026, with a reported doubling time of approximately seven months.
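The doubling time implied by the two quoted endpoints can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and because the text gives only the years 2022 and 2026, the elapsed time in months is an assumption that is varied across three values.

```python
import math

def doublings(start: float, end: float) -> float:
    """Number of doublings needed to grow from start to end."""
    return math.log2(end / start)

def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Doubling time implied by two endpoints, assuming steady exponential growth."""
    return months / doublings(start, end)

# Endpoints quoted in the text: 30 seconds (2022) and 12 hours (2026).
start_seconds = 30
end_seconds = 12 * 3600

print(f"doublings: {doublings(start_seconds, end_seconds):.2f}")

# ASSUMPTION: the text gives only years, so the elapsed months are a guess.
for elapsed_months in (36, 48, 52):
    t = implied_doubling_months(start_seconds, end_seconds, elapsed_months)
    print(f"{elapsed_months} months elapsed -> doubling every {t:.1f} months")
```

The endpoints correspond to about 10.5 doublings, so under each of these assumed time spans the implied doubling time comes out at roughly 3.4 to 5 months, shorter than the seven-month figure. The seven-month number may reflect a longer-run trend fit rather than these two points alone.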

Among the other benchmarks, CORE-Bench, which assesses research paper reproduction, was declared solved by its authors in September 2024 after reaching 95.5%. MLE-Bench, which measures end-to-end machine learning engineering, is tracking toward saturation with a 3.8× improvement over 16 months. The consistent pattern across all six benchmarks indicates rapid advancement in AI research and engineering capabilities, driven by progress in models, compute, and automation.
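To compare the MLE-Bench figure with rates quoted per month or per year, the 3.8× multiplier can be converted under the assumption of steady compounding. This is a minimal sketch; `per_period_factor` is a hypothetical helper name, and the constant-rate assumption is mine rather than the source's.

```python
def per_period_factor(total_factor: float, months: float, period_months: float) -> float:
    """Growth factor per period, assuming the same compounding rate throughout."""
    return total_factor ** (period_months / months)

# Figures quoted in the text for MLE-Bench: 3.8x over 16 months.
total_factor = 3.8
window_months = 16

monthly = per_period_factor(total_factor, window_months, 1)
yearly = per_period_factor(total_factor, window_months, 12)

print(f"monthly factor: {monthly:.3f}")  # roughly 8.7% growth per month
print(f"yearly factor:  {yearly:.2f}")   # roughly 2.7x per year
```

Because a benchmark score is capped at 100%, a rate like this describes the measured window only and cannot compound indefinitely.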

Implications of Rapid Benchmark Saturation for AI Progress

This pattern of rapid saturation across multiple benchmarks indicates that AI research capabilities are advancing at an accelerated rate. It suggests that many tasks once considered too complex to automate are now within reach of AI systems, which could influence deployment strategies across sectors, affect workforce dynamics, and inform discussions on AI regulation and safety.

Furthermore, the near-simultaneous saturation of such diverse benchmarks suggests that the acceleration is broad rather than confined to a single task type. Stakeholders should consider revising timelines for AI deployment and evaluating the implications of increased automation in research and engineering tasks.

Background on Benchmark Development and Progress

Since 2022, the AI research community has introduced a series of benchmarks designed to evaluate AI system capabilities across different areas, including software engineering, research reproduction, and task automation. These benchmarks were intended to be challenging, with the expectation that saturation would take several years.

However, recent data compiled by Thorsten Meyer indicates that all six benchmarks launched in 2023-2024 have either been saturated or are nearing saturation within a few months. This rapid progression differs from earlier expectations of steady, incremental progress. Advances in models such as GPT-4, Mythos, and Gemini 3, along with improvements in compute efficiency and automation techniques, appear to be contributing factors.

Prior to this, many experts believed that reaching such levels of capability would require multiple years, but current evidence suggests that AI systems are achieving these milestones more quickly than anticipated, prompting a reassessment of future timelines and potential impacts.

“The pattern across all six benchmarks is clear: saturation is occurring on a timeline of months, not years. This suggests a notable acceleration in AI development.”

— Thorsten Meyer, AI researcher

Unconfirmed Aspects of Benchmark Saturation and Future Pace

While the data indicates rapid saturation, it remains uncertain whether these benchmarks fully encompass the range of AI capabilities necessary for practical deployment in real-world environments. Some experts question whether saturation in benchmarks directly correlates with the performance of AI systems in diverse, real-world settings. Additionally, the sustainability of this rapid pace and the potential for future plateaus remain uncertain, as do considerations related to safety and control measures.

Next Steps in Monitoring and Assessing AI Development

Researchers and industry stakeholders will likely focus on verifying whether the saturation of these benchmarks correlates with real-world performance and deployment readiness. Efforts to develop new benchmarks that challenge AI systems beyond current levels are expected to continue, ensuring ongoing assessment of progress. Policymakers and regulators may also revisit timelines for implementing safety measures, considering the accelerated pace of capability development.

In the near term, updates on benchmark performance and the development of more advanced challenges are anticipated, along with ongoing discussions about the societal implications of rapid progress in AI research capabilities.

Key Questions

What do benchmark saturations mean for AI safety?

Saturation indicates that AI systems are demonstrating proficiency in specific tasks, which may influence deployment strategies. It also highlights the importance of continued oversight and regulation to address safety and alignment concerns.

Are these benchmarks representative of real-world AI capabilities?

While these benchmarks evaluate important aspects of AI performance, it remains uncertain whether saturation in these areas fully reflects the capabilities needed for practical, real-world applications across diverse environments.

How soon could we see widespread AI automation based on this progress?

The rapid saturation suggests that significant automation could be feasible within the next few years, with some estimates indicating near-complete automation of research and engineering tasks by 2028.

What are the risks of such rapid progress?

Accelerated AI development may pose challenges related to safety, misuse, and regulatory oversight. It underscores the need for proactive governance to manage potential risks associated with rapid capability growth.

Will new benchmarks be developed to measure future progress?

Yes, ongoing advancements in AI will likely prompt the creation of more challenging benchmarks to evaluate systems beyond current saturation levels and support continuous progress assessment.

Source: ThorstenMeyerAI.com
