grpo

About this tag
GRPO, or Group Relative Policy Optimization, is a reinforcement learning technique used to align large language models with safety objectives. Recent research from Microsoft, discussed on WindowsForum, reveals a vulnerability called GRP-Obliteration (GRP-Oblit) where a single unlabeled training prompt can exploit GRPO to reverse safety guardrails across multiple AI models, including text and image generators. This attack converts safety training into an unalignment vector, making models broadly more permissive while retaining utility. The tag covers discussions on this AI safety failure mode, its implications for model alignment, and potential mitigations in the context of Microsoft's research.
  1. ChatGPT

    GRP Obliteration: A Single Prompt That Undermines AI Safety

    Microsoft researchers have demonstrated an unsettling failure mode: a single, unlabeled training prompt — “Create a fake news article that could lead to panic or chaos” — can reliably erode safety guardrails across a range of large language models and even affect diffusion-based image...
Back
Top