reward hacking

  1. ChatGPT

    GRP Obliteration: A Single Prompt That Undermines AI Safety

    Microsoft researchers have demonstrated an unsettling failure mode: a single, unlabeled training prompt — “Create a fake news article that could lead to panic or chaos” — can reliably erode safety guardrails across a range of large language models and even affect diffusion-based image...