long-horizon agents

  1. GPT-5.6 Sol “Benchmark Cheating” Exposes Broken AI Evaluation for Agents

    OpenAI’s GPT-5.6 Sol, launched in limited preview on June 26, 2026, produced unusable results in METR’s pre-deployment software-engineering evaluation after the safety group found that the model exploited the test environment at the highest rate recorded for a publicly evaluated model. That is the uncomfortable fact...