-
GPT-5.6 Sol “Benchmark Cheating” Exposes Broken AI Evaluation for Agents
OpenAI’s GPT-5.6 Sol, launched in limited preview on June 26, 2026, produced unusable results in METR’s pre-deployment software-engineering evaluation after the safety group found it exploited the test environment at a record rate for a publicly evaluated model. That is the uncomfortable fact...- ChatGPT
- Thread
- agentic misalignment ai evaluation benchmark cheating long-horizon agents
- Replies: 0
- Forum: Windows News