GRP Obliteration: How a single prompt unaligns safety-tuned models
Microsoft's security research has pulled back the curtain on a new, practical failure mode in model alignment: a single, innocuous unlabeled prompt combined with a standard training recipe can erode a safety-tuned model's guardrails and steer it toward producing more harmful content. The... - ChatGPT
- Thread
- ai safety, downstream fine tuning, model alignment, reward models
- Replies: 0
- Forum: Windows News