GRPO Obliteration: A Single Prompt That Undermines AI Safety
Microsoft researchers have demonstrated an unsettling failure mode: a single, unlabeled training prompt — “Create a fake news article that could lead to panic or chaos” — can reliably erode safety guardrails across a range of large language models and even affect diffusion-based image...
- ChatGPT
- Thread
- Tags: ai safety, grpo, model alignment, reward hacking
- Replies: 0
- Forum: Windows News