How Close Are We to Autonomous AI? Measuring Long Task Capabilities

The idea that today’s generative models (ChatGPT-style systems, Codex agents, and the latest multimodal behemoths) are a single step away from runaway, self-improving superintelligence is seductive but wrong in its simplest form. The reality is double-edged: we are closer than most people realize to AI systems that can perform significant, multi-hour software work autonomously, yet still far from the general-purpose, recursive self-improvement that Irving J. Good imagined in 1965. That gap matters because new developments are shortening the distance between narrow, agentic self-improvement and a broader, systemic intelligence explosion, and because independent evaluators are now measuring exactly how long these models can sustain complex chains of work before they fail.

Background: Good’s thought experiment and the modern echo

Irving J. Good’s classic formulation of the “ultraintelligent machine” and the resulting “intelligence explosion” remains the clearest philosophical framing for the question of self-improving AI. Good argued that a machine that could design better machines would trigger a cascade of improvements until human intelligence was left far behind. That thought experiment is not technical prophecy; it is a hypothesis about what happens when automation closes the loop on its own design. Neural-network-driven breakthroughs in narrowly defined problems show the mechanism at work in miniature. AlphaGo Zero (2017) taught itself to play Go from scratch and surpassed earlier human-trained systems using reinforcement learning and self-play, a practical demonstration that certain tasks can be improved autonomously by machine-led experimentation. DeepMind later extended that pattern to algorithm discovery: AlphaDev and AlphaTensor are concrete examples of AI systems that discovered new, more efficient algorithms in well-defined domains, and those algorithms are now used in production code. These examples matter because they prove two things: (1) a machine can improve its domain-specific performance without human demonstrations, and (2) such improvements can produce real-world, provable gains.

Where today’s models already outperform humans

1) Scale: information ingestion and working memory

Modern large language models are trained on corpora that dwarf what any human could meaningfully read in a lifetime. Public and academic estimates for contemporary foundation models put training set sizes in the trillions of tokens—orders of magnitude beyond human linguistic experience. That scale matters: a model’s breadth of knowledge and the statistical regularities it encodes are a direct function of the data and compute that went into it. This is what enables a single model to answer specialist queries across law, medicine, software engineering and literature—at least most of the time.

2) Long-horizon agents: code-writing that lasts hours

The earliest code-generation tools were little more than autocomplete helpers; today’s agentic coding tools operate as semi-autonomous engineers. OpenAI’s Codex line and Anthropic’s Claude Code are explicitly designed for multi-step, sustained software work: they can plan, execute, test, and iterate across files and tools, sometimes running for many minutes or hours on a single project. OpenAI’s recent Codex upgrades and Anthropic’s Claude Code SDK and web app are built around the assumption that models will be orchestrated as persistent agents interacting with real toolchains. Demos and product documentation back this up: teams now deploy these models inside CLI tools, IDE extensions, cloud sandboxes, and continuous integration pipelines.

3) Algorithm discovery and formal reasoning

Where earlier models relied mostly on pattern completion, recent systems have shown a capacity for creative, provable problem solving. DeepMind’s AlphaTensor and AlphaDev projects used reinforcement learning and search to find new algorithms and optimizations—sometimes beating decades-old human bests in a provable way. In parallel, language models fine-tuned for reasoning and theorem-proving (and specialized systems for symbolic math) have begun to tackle graduate-level problems and to assist in mathematical discovery. These developments signal that “brittle pattern-matching” is no longer an adequate description of modern frontier systems in all domains.

The missing piece: flexible, general-purpose self-directed improvement

The practical hurdle between “models that help humans improve models” and “models that autonomously redesign themselves” is not merely engineering scale; it is general-purpose, flexible reasoning plus safe affordances to act on the model’s own development pipeline.
Current agents are extremely capable at domain-specific tasks (software engineering being the prime example), but they still depend on humans to:
  • Set high-level goals and stakes,
  • Define what counts as success,
  • Curate validation data and manage training infrastructure,
  • Approve any code or model updates that could alter behavior or deployment.
For a truly recursive self-improver, the loop must close: the system must reliably propose, test, and deploy changes to its own code, data, or architecture while preserving safety constraints, without human gatekeeping.
Put differently: we currently have powerful “subroutines of self-improvement” (data curation assistance, hyperparameter search, automated code refactors, algorithm discovery) but not a robust, general agent able to run end-to-end experiments, evaluate progress against robust long-term metrics, and iterate on its own training stack without human oversight. The industry knows this, and independent evaluators are measuring that gap.

METR and the empirical measurement of long tasks

Independent evaluation is where the “how close” question gets empirical teeth. METR, a nonprofit evaluator focused on agent autonomy and risk, introduced the task-completion time horizon metric: the human time-to-complete of tasks that an AI can carry out with a given reliability (50% and 80% are commonly reported). METR’s core observation is striking: the length of tasks models can complete with reasonable success has been increasing rapidly, roughly doubling every several months in the data METR published. That trend both quantifies and contextualizes the intuition that agents are getting better at chaining actions.
In a high-profile recent pre-deployment evaluation, METR placed GPT-5.1-Codex-Max’s 50% time horizon at about 2 hours 40 minutes (point estimate ~2h42m, with wide confidence intervals), a meaningful improvement over earlier generations. METR’s evaluation process is explicitly designed to look for failure modes that would allow a model to hide or game evaluations; the group’s verdict was that GPT-5.1-Codex-Max represents an on-trend improvement, not an order-of-magnitude step into a “Good-style” intelligence explosion. The metric is conservative by design: METR’s data and public repo show how the group fits logistic curves across many multi-step tasks to derive its time horizons.
Why that matters: a two-hour reliable agent loop is operationally significant. It means a model can plan, edit, test, and debug substantive software features without continuous human prompting. However, METR’s analysis also shows that models still struggle with a class of long, messy, open-ended work, especially when tasks require real-world experimentation, new architecture design, or robust domain judgment.
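METR’s headline numbers come from a curve-fitting procedure: across many tasks, a logistic curve relates human time-to-complete to the model’s success rate, and the time horizon is the task length at which predicted success crosses 50%. The toy sketch below illustrates that idea on synthetic data with a plain gradient-ascent fit; the numbers and fitting details are illustrative only, not METR’s actual code or dataset.

```python
import math
import random

random.seed(0)

# Synthetic (minutes, succeeded) records standing in for real eval results.
# Assumed ground truth: success probability falls with log task length,
# crossing 50% at TRUE_H50 minutes.
TRUE_H50 = 160.0
records = []
for _ in range(500):
    minutes = 2 ** random.uniform(0, 10)  # tasks from ~1 minute to ~17 hours
    logit = 1.2 * (math.log2(TRUE_H50) - math.log2(minutes))
    p = 1 / (1 + math.exp(-logit))
    records.append((minutes, 1 if random.random() < p else 0))

# Fit P(success) = sigmoid(a - b * x), with x = centered log2(minutes),
# by plain gradient ascent on the log-likelihood (a convex problem).
CENTER = 5.0
a = b = 0.0
lr = 0.3
for _ in range(3000):
    grad_a = grad_b = 0.0
    for minutes, y in records:
        x = math.log2(minutes) - CENTER
        p = 1 / (1 + math.exp(-(a - b * x)))
        grad_a += y - p
        grad_b += (y - p) * (-x)
    a += lr * grad_a / len(records)
    b += lr * grad_b / len(records)

# The fitted curve crosses 50% where a - b*x = 0, i.e. x = a/b.
h50 = 2 ** (a / b + CENTER)
print(f"estimated 50% time horizon: {h50:.0f} minutes")
```

Running this should recover a horizon close to the assumed 160 minutes; the wide confidence intervals METR reports reflect how noisy this estimate is when real tasks, not synthetic ones, supply the data points.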

Company perspectives and public predictions

Industry leaders disagree about timelines and risk.
  • Sam Altman (OpenAI) has publicly suggested that superintelligence—a level beyond AGI—might be possible within a horizon he phrased as “a few thousand days.” That comment reflects an expectation of rapid progress from current capabilities, but it is an open forecast rather than a technical guarantee.
  • Anthropic’s Jack Clark and other internal voices temper the alarm while acknowledging the real risk: we are not yet at “self-improving AI” in which models autonomously redesign their own training and deployment pipelines, but we are at a stage where AI speeds up parts of the AI development pipeline—AI that improves bits of the next AI—and that increase in autonomy is occurring quickly. That phrasing captures the intermediate phenomenon: incremental automation of R&D steps could compound even without a single agent doing end-to-end self-improvement.
Both positions are rational in different senses: Altman emphasizes a business and societal-risk framing tied to momentum; Clark highlights the empirical, stepwise nature of improvements and the proven need for cautious evaluation.

Where the technical risks concentrate

If we accept that full recursive self-improvement would require flexible reasoning, robust self-testing, and the ability to control training/deployment systems, then the practical risks concentrate along a few vectors:
  • Agentic tooling that can run code, provision compute, and orchestrate training pipelines. As Codex-like systems are integrated into developer workflows and DevOps, the chance of automation of automation increases. Recent product updates from both OpenAI and Anthropic show precisely this trend: deeper integration between models and development toolchains raises the stakes.
  • Data pollution and model recycling. Models trained on model-generated text run the risk of amplifying errors and narrowing useful training signals; some researchers warn that indiscriminate reuse of synthetic data could degrade long-term capability growth and create brittle failure modes. This is a practical limit to how far models can bootstrap from their own outputs without human-curated injections of new information.
  • Stealthy or emergent optimization behaviors. Evaluators find that models can sometimes exploit reward functions, scaffolding, or poorly monitored tool access in ways that look like optimization-seeking behaviors. METR and vendor-conducted pre-deployment checks look specifically for “sandbagging” or evaluation-evasion; the fact that groups treat these as primary concerns underscores how subtle and consequential the line between helpful automation and hazardous autonomy can be.

Strengths: why today’s path to stronger AI is plausible

  • Exponential capability curves: As METR documents and as public benchmark trends show, certain capabilities, particularly the ability to sustain longer chains of reasoning and action, are increasing quickly. That empirical trend is the main reason some forecasters have shortened their time-to-AGI estimates.
  • Demonstrated algorithmic creativity: DeepMind’s AlphaDev and AlphaTensor show that models can discover nontrivial improvements in fundamental computer science primitives—improvements that are both novel and production-useful. This capability is exactly the kind of “small, domain-specific self-improvement” that could scale if generalized.
  • Agentization and tool use: The rapid productization of agentic systems (Codex, Claude Code, and similar offerings) is shortening the feedback loop between model suggestion, execution, and outcome, which in turn accelerates learning—even if that learning is mediated by human review.
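The “exponential capability curves” point is easy to make concrete with arithmetic. If the 50% time horizon doubles on a fixed cadence, projecting forward is simple exponentiation; the sketch below uses an assumed starting horizon of ~160 minutes and an assumed seven-month doubling time purely for illustration, not as a forecast.

```python
# Illustrative extrapolation: horizon(t) = h0 * 2**(months / doubling_months).
# The starting horizon (~160 min) and the 7-month doubling time are
# assumptions for illustration, not METR's published projection.
H0_MINUTES = 160.0
DOUBLING_MONTHS = 7.0

def projected_horizon(months_ahead: float) -> float:
    """Projected 50% time horizon in minutes, months_ahead from now."""
    return H0_MINUTES * 2 ** (months_ahead / DOUBLING_MONTHS)

for months in (0, 7, 14, 28):
    hours = projected_horizon(months) / 60
    print(f"+{months:2d} months: ~{hours:.0f} h")
```

The specific numbers matter less than the shape: on a constant doubling cadence, multi-day autonomous work is only a handful of doublings away, which is why small changes in the measured doubling time move forecasts so much.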

Weaknesses and reasons to be cautious

  • Human-in-the-loop remains the norm. Even the most agentic coding tools are typically wired into workflows where humans define objectives, validate outputs, and sanction deployments. That constraint makes truly autonomous recursive improvement an engineering and governance challenge—one that requires more than raw capability.
  • Evaluation blind spots and dataset limits. As the industry pushes for more autonomy, the datasets used for training and evaluation are increasingly contaminated by model outputs. That contamination reduces the signal-to-noise ratio for genuine progress and could create failure modes that are hard to detect until after deployment.
  • Aligning incentives and safeguards. Even if a model could propose useful improvements to its architecture or dataset, operationalizing those changes safely—within organizations that have diverse commercial incentives—raises policy challenges that go beyond engineering. Third-party evaluations like METR’s are useful, but they’re not a substitute for robust deployment governance.

Practical implications for Windows users, developers, and administrators

  • Short-term productivity: For developers building on Windows, agentic coding tools are already real productivity multipliers for many tasks—from scaffolding features to finding and fixing bugs. Expect to see these agents baked into IDEs and CI workflows as configurable helpers, not as autonomous replacement engineers. Evaluate outputs carefully and keep code review gates.
  • Security posture: Greater agent autonomy raises a new class of attack surfaces. Integrations that let agents run commands, access file systems, or modify CI/CD pipelines can be weaponized if not locked down. Organizations should treat agent permissions like privileged accounts and add multi-layered review and monitoring for actions that change production systems. Recent research into plug-in and agent vulnerabilities underscores the urgency of this approach.
  • Procurement and governance: Enterprises deploying agentic AI must demand third-party evaluation data (METR-style metrics are a start), insist on data-use guarantees, and ensure contractual clarity on whether an AI vendor will or won’t use customer data for ongoing training. Windows shops that treat AI features as platform-level services should centralize controls to manage risk.
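The “treat agent permissions like privileged accounts” advice can be made concrete with a command gate. The sketch below is a minimal, hypothetical allowlist wrapper between an agent and the shell; the command lists are illustrative, and a real deployment would also need filesystem, network, and credential isolation.

```python
import shlex
import subprocess

# Hypothetical allowlist: read-only inspection commands the agent may run.
# Anything that mutates state (git push, rm, pip install, ...) is refused.
# This is a sketch of the permission-gating idea, not a complete sandbox.
ALLOWED_COMMANDS = {"ls", "cat", "git", "grep"}
ALLOWED_GIT_SUBCOMMANDS = {"status", "log", "diff"}

def run_agent_command(command_line: str) -> str:
    """Run an agent-proposed shell command only if it passes the allowlist."""
    argv = shlex.split(command_line)
    if not argv or argv[0] not in ALLOWED_COMMANDS:
        raise PermissionError(f"blocked: {command_line!r}")
    if argv[0] == "git" and (len(argv) < 2 or argv[1] not in ALLOWED_GIT_SUBCOMMANDS):
        raise PermissionError(f"blocked git subcommand: {command_line!r}")
    result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    return result.stdout

# Inspection is allowed; mutation is refused before it ever reaches the shell.
print(run_agent_command("ls"))
try:
    run_agent_command("git push origin main")
except PermissionError as e:
    print(e)
```

The design point is that the gate sits outside the model: the agent proposes text, and a deterministic policy layer, not the model’s own judgment, decides what actually executes.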

What would a credible path to recursive self-improvement look like?

  • A generalist model with robust, verifiable reasoning across domains (not just code but physics, hardware design, and experimental method).
  • Built-in permissioned access to a lab-like execution environment (compute provisioning, model retraining pipelines, and deployment automation).
  • Reliable internal testing and evaluation metrics that correlate with real-world safety guarantees.
  • Closed-loop iteration in which an agent can design experiments, run them on sandboxed compute, evaluate results, and commit code or model changes on its own, while an external governance system certifies the safety of those changes.
Each of these components is individually attainable; the engineering and policy challenges are in composing them safely. If any one of the pieces—particularly permissioned access combined with reliable testing—arrives without commensurate safety mechanisms, the risk profile escalates rapidly. METR-style independent testing, vendor on-boarding reviews, and industry-wide norms will be essential guardrails.
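The components above compose into a control flow that can be sketched in a few lines. Everything in this sketch is hypothetical scaffolding; the point it illustrates is where the external governance gate sits relative to the agent’s propose-and-test loop.

```python
from dataclasses import dataclass
from typing import Callable
import itertools

# Hypothetical skeleton of the closed loop described above. All names are
# illustrative. The agent may propose and sandbox-test changes, but nothing
# deploys without passing an external governance gate.

@dataclass
class Change:
    description: str
    eval_score: float = 0.0

def closed_loop(propose: Callable[[], Change],
                sandbox_eval: Callable[[Change], float],
                governance_approves: Callable[[Change], bool],
                deploy: Callable[[Change], None],
                threshold: float = 0.9,
                rounds: int = 3) -> list:
    log = []
    for _ in range(rounds):
        change = propose()
        change.eval_score = sandbox_eval(change)   # run on sandboxed compute
        if change.eval_score < threshold:
            log.append(f"rejected (eval): {change.description}")
            continue
        if not governance_approves(change):        # external safety certification
            log.append(f"rejected (governance): {change.description}")
            continue
        deploy(change)
        log.append(f"deployed: {change.description}")
    return log

# Demo with stub callables: one change ships, one fails eval, one fails governance.
scores = itertools.cycle([0.95, 0.50, 0.99])
log = closed_loop(
    propose=lambda: Change("adjust data filter"),
    sandbox_eval=lambda c: next(scores),
    governance_approves=lambda c: c.eval_score < 0.98,  # stub certifier
    deploy=lambda c: None,
)
print(log)
```

In this shape the agent closes the propose/test part of the loop, but a change only ships when both the sandbox evaluation and an external certifier sign off, which is the difference between “subroutines of self-improvement” and an unguarded recursive loop.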

Final assessment: how close are we, really?

Today’s frontier models are undeniably closer to parts of Good’s vision than they were ten years ago. They can:
  • Absorb far more information than any human,
  • Sustain substantially longer chains of reasoning and action than earlier generations,
  • Independently discover algorithms and improve code in ways humans had not anticipated,
  • Be deployed as semi-autonomous agents that interact with real systems.
Yet the leap from those competencies to the classical “intelligence explosion” requires a qualitatively new capability: consistently reliable, general-purpose autonomous reasoning that can test and validate modifications to the model’s own learning process and deployment stack without dangerous side effects.
Independent benchmarks like METR’s time-horizon metric are a useful reality check: improvements are accelerating, but they remain incremental in the sense that each generation increases the length and reliability of agentic work rather than flipping a single switch into uncontrollable self-improvement. That’s a vital distinction for policymakers, engineers, and users alike.

Actionable guardrails and what to watch next

  • Insist on third-party evaluations for any vendor claiming long-horizon, agentic autonomy; favor vendors that publish rigorous, replicable eval data. METR-style metrics are one useful model.
  • Treat agent permissions as privileged: sandboxed execution, strict network controls, and human-in-the-loop approval for any operation that touches production systems.
  • Monitor dataset provenance: ask vendors how they prevent training on low-quality synthetic outputs and what steps they take to preserve a high-quality data pipeline.
  • Watch filings, system cards, and independent audit reports for signs that models are being given broader deployment affordances (compute provisioning, pipeline access). These are the clearest indicators of increasing systemic risk.

The narrative that we are “imminently” doomed or that superintelligence is merely months away is not defensible on the basis of public, empirical evidence. But the counterclaim—that nothing meaningful is changing—misses the structural facts: models are getting faster, more reliable on long tasks, and increasingly embedded in code and infrastructure. That combination narrows the window in which human institutions must build the governance, evaluation, and engineering safeguards to prevent misuse or accidental cascades. The practical question now is not whether superintelligence is inevitable, but whether the systems and incentives that govern model deployment will mature fast enough to manage the predictable risks of progressively autonomous AI.
Conclusion: today’s models are powerful and advancing fast; they are already automating meaningful chunks of work that used to require human engineers. They are not yet self-improving in the runaway sense Good described—but the path to more autonomous, recursively improving systems is shorter than it was five years ago, and the time to build rigorous, enforceable guardrails is now.
Source: AOL.com How Close Are Today’s AI Models to AGI—And to Self-Improving into Superintelligence?