📊 Full opportunity report: Every Benchmark Launched 2023-2024 Has Fallen — The METR / SWE-Bench / CORE-Bench / MLE-Bench / PostTrainBench Sequence on ThorstenMeyerAI.com — validation score, market gap, and execution plan.
TL;DR
Six key AI benchmarks introduced in 2023 and 2024 have all either saturated or are approaching saturation, several on a timeline of months. The pattern suggests that AI research and engineering capabilities are advancing faster than some earlier projections anticipated, which may influence the trajectory of AI development and deployment.
According to Thorsten Meyer, as of May 2026, every benchmark designed to measure AI research and engineering capability launched in 2023 and 2024 has either been saturated, declared solved, or is tracking toward saturation on a timeline of just a few months. These benchmarks include SWE-Bench, METR Time Horizons, CORE-Bench, MLE-Bench, PostTrainBench, and CPU Speedup tasks.
For example, SWE-Bench, which measures performance on real-world software engineering tasks, rose from 2% in late 2023 to 93.9% in May 2026, a gain of roughly 92 percentage points in about 30 months. Similarly, METR Time Horizons, which tracks the length of tasks (measured in human working time) that AI agents can complete, grew from 30 seconds in 2022 to 12 hours in 2026, with the horizon doubling approximately every seven months.
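As a rough consistency check, the sketch below converts the METR endpoints into a growth multiple and an implied doubling time. It is illustrative arithmetic, not the author's or METR's method, and it assumes a 48-month span because only the years 2022 and 2026 are given. Thirty seconds to 12 hours is a 1,440× increase, or about 10.5 doublings. Spread over roughly four years, that implies a doubling every 4 to 5 months, while a steady seven-month doubling would need about six years to cover the same ground. The two figures fit together only if the seven-month rate describes a longer period than this window; METR's own seven-month estimate was fitted to models going back to 2019.

```python
import math

def implied_doubling_months(start: float, end: float, months: float) -> float:
    """Months per doubling implied by exponential growth from `start` to `end` over `months`."""
    return months / math.log2(end / start)

# METR endpoints quoted above, in seconds.
start_seconds = 30.0
end_seconds = 12 * 60 * 60.0  # 12 hours = 43,200 seconds

growth = end_seconds / start_seconds  # 1,440x
doublings = math.log2(growth)         # about 10.5 doublings

# Assumption: only the years are given, so treat the 2022-2026 span as 48 months.
span_months = 48
print(f"growth: {growth:.0f}x over {doublings:.1f} doublings")
print(f"implied doubling time: {implied_doubling_months(start_seconds, end_seconds, span_months):.1f} months")

# At a constant 7-month doubling time, covering the same growth would take:
print(f"months needed at a 7-month doubling: {doublings * 7:.0f}")
```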
Other benchmarks have followed the same path. CORE-Bench, which assesses research paper reproduction and launched in September 2024, has been declared solved by its authors after reaching 95.5%. MLE-Bench, which measures end-to-end machine learning engineering, is tracking toward saturation after a 3.8× improvement over 16 months. The consistent pattern across all six benchmarks points to rapid advancement in AI research and engineering capabilities, driven by progress in models, compute, and automation.
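The sketch below converts the quoted MLE-Bench gain (3.8× over 16 months) into a doubling time and an equivalent compound monthly rate, and shows why multiplicative gains on a percentage-scored benchmark must eventually slow. It is hypothetical arithmetic on the article's figures only: the function names are illustrative, it assumes the score is a percentage capped at 100%, and because the current MLE-Bench score is not given here, the ceiling check is stated in general terms.

```python
import math

def doubling_months_from_multiple(multiple: float, months: float) -> float:
    """Months per doubling implied by a growth `multiple` achieved over `months`."""
    return months / math.log2(multiple)

def monthly_growth_rate(multiple: float, months: float) -> float:
    """Constant compound monthly growth rate that yields `multiple` over `months`."""
    return multiple ** (1 / months) - 1

# MLE-Bench figure quoted above: a 3.8x improvement over 16 months.
multiple, months = 3.8, 16
print(f"doubling time: {doubling_months_from_multiple(multiple, months):.1f} months")
print(f"compound monthly growth: {monthly_growth_rate(multiple, months):.1%}")

# Assumption: scores are capped at 100%, so another 3.8x gain is only possible
# from a starting score of at most 100 / 3.8. Above that, growth must slow.
ceiling = 100.0
max_start_for_another_multiple = ceiling / multiple
print(f"max starting score for another {multiple}x gain: {max_start_for_another_multiple:.1f}%")
```

A 3.8× gain works out to a doubling roughly every 8.3 months, or about 8.7% compound growth per month. Because the score cannot exceed 100%, a benchmark already above about 26% cannot repeat a 3.8× jump, which is one way to see why such curves bend toward saturation.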
Implications of Rapid Benchmark Saturation for AI Progress
This pattern of rapid saturation across multiple benchmarks indicates that AI research capabilities are advancing at an accelerating rate. It suggests that many tasks once considered complex are now within reach of automation, which could influence deployment strategies across sectors, affect workforce dynamics, and inform discussions on AI regulation and safety.
Because benchmarks covering quite different skills are saturating at nearly the same time, the acceleration may be broad rather than confined to a single kind of task. Stakeholders should consider revising timelines for AI deployment and evaluating the implications of increased automation in research and engineering tasks.
As an affiliate, we earn on qualifying purchases.
Background on Benchmark Development and Progress
Since 2022, the AI research community has introduced a series of benchmarks designed to evaluate AI system capabilities across different areas, including software engineering, research reproduction, and task automation. These benchmarks were intended to be challenging, with the expectation that saturation would take several years.
However, recent data compiled by Thorsten Meyer indicates that all six benchmarks launched in 2023-2024 have either been saturated or are nearing saturation within a few months. This rapid progression differs from earlier expectations of steady, incremental progress. Advances in models such as GPT-4, Mythos, and Gemini 3, along with improvements in compute efficiency and automation techniques, appear to be contributing factors.
Prior to this, many experts believed that reaching such levels of capability would require multiple years, but current evidence suggests that AI systems are achieving these milestones more quickly than anticipated, prompting a reassessment of future timelines and potential impacts.
“The pattern across all six benchmarks is clear: saturation is occurring on a timeline of months, not years. This suggests a notable acceleration in AI development.”
— Thorsten Meyer, AI researcher
Unconfirmed Aspects of Benchmark Saturation and Future Pace
While the data indicates rapid saturation, it remains uncertain whether these benchmarks fully encompass the range of AI capabilities necessary for practical deployment in real-world environments. Some experts question whether saturation in benchmarks directly correlates with the performance of AI systems in diverse, real-world settings. Additionally, the sustainability of this rapid pace and the potential for future plateaus remain uncertain, as do considerations related to safety and control measures.
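To illustrate why the sustainability of the pace matters, the hypothetical sketch below extrapolates the 12-hour METR horizon forward under a constant doubling time. It assumes the trend simply continues, which is exactly what remains uncertain; the function and the 14-month comparison rate are illustrative choices, not forecasts from the source.

```python
def projected_horizon_hours(current_hours: float, months_ahead: float, doubling_months: float) -> float:
    """Extrapolate a task-length horizon forward, assuming a constant doubling time."""
    return current_hours * 2 ** (months_ahead / doubling_months)

# Hypothetical inputs: start from the 12-hour horizon cited for 2026 and compare
# the quoted 7-month doubling with an illustrative slowdown to 14 months.
current = 12.0
for doubling in (7.0, 14.0):
    path = [projected_horizon_hours(current, m, doubling) for m in (0, 14, 28)]
    print(f"doubling every {doubling:.0f} months:", [f"{h:.0f} h" for h in path])
```

Halving the growth rate cuts the 28-month projection from 192 hours to 48 hours, so any timeline built on these curves is highly sensitive to whether the current pace holds.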
Next Steps in Monitoring and Assessing AI Development
Researchers and industry stakeholders will likely focus on verifying whether the saturation of these benchmarks correlates with real-world performance and deployment readiness. Efforts to develop new benchmarks that challenge AI systems beyond current levels are expected to continue, ensuring ongoing assessment of progress. Policymakers and regulators may also revisit timelines for implementing safety measures, considering the accelerated pace of capability development.
In the near term, updates on benchmark performance and the development of more advanced challenges are anticipated, along with ongoing discussions about the societal implications of rapid progress in AI research capabilities.
Key Questions
What do benchmark saturations mean for AI safety?
Saturation indicates that AI systems are demonstrating proficiency in specific tasks, which may influence deployment strategies. It also highlights the importance of continued oversight and regulation to address safety and alignment concerns.
Are these benchmarks representative of real-world AI capabilities?
While these benchmarks evaluate important aspects of AI performance, it remains uncertain whether saturation in these areas fully reflects the capabilities needed for practical, real-world applications across diverse environments.
How soon could we see widespread AI automation based on this progress?
The rapid saturation suggests that significant automation could be feasible within the next few years, with some estimates indicating near-complete automation of research and engineering tasks by 2028.
What are the risks of such rapid progress?
Accelerated AI development may pose challenges related to safety, misuse, and regulatory oversight. It underscores the need for proactive governance to manage potential risks associated with rapid capability growth.
Will new benchmarks be developed to measure future progress?
Very likely. As AI systems continue to advance, more challenging benchmarks will probably be created to evaluate them beyond current saturation levels and to support ongoing assessment of progress.
Source: ThorstenMeyerAI.com