METR benchmarks

  1. How Close Are We to Autonomous AI? Measuring Long Task Capabilities

    The idea that today’s generative models (ChatGPT-style systems, Codex agents, and the latest multimodal behemoths) are a single step away from runaway, self-improving superintelligence is seductive but, in its simplest form, wrongheaded. Even so, we are closer than most people realize to AI systems that can...