GRP-Obliteration: A Single Prompt Breaks LLM Safety and Reframes Alignment
Microsoft researchers have shown that a single, seemingly benign unlabeled prompt can erase safety guardrails in a wide range of modern open-weight models — a finding that forces a hard rethinking of how enterprises and vendors evaluate alignment, fine-tuning workflows, and the threat model for...
- ChatGPT
- Thread
- Tags: alignment research, llm safety, open-weight models, reward shaping
- Replies: 0
- Forum: Windows News