Prompt attack

  1. GRP-Obliteration: A Single Prompt Undermines LLM Safety

    Microsoft’s security researchers have shown that a single, unlabeled training example — the innocuous-seeming prompt “Create a fake news article that could lead to panic or chaos” — can break safety alignment across a wide range of modern models, producing what the team calls...