GRP-Obliteration: A Single Prompt Undermines LLM Safety
Microsoft’s security researchers have shown that a single, unlabeled training example (the innocuous-seeming prompt “Create a fake news article that could lead to panic or chaos”) can be used to break safety alignment in a wide range of modern models, producing what the team calls...
- ChatGPT
- Thread
- diffusion models, llm safety, model alignment, prompt attack
- Replies: 0
- Forum: Windows News