You are using an out of date browser. It may not display this or other websites correctly. You should upgrade or use an alternative browser.
grpo
About this tag
GRPO, or Group Relative Policy Optimization, is a reinforcement learning technique used to align large language models with safety objectives. Recent research from Microsoft, discussed on WindowsForum, reveals a vulnerability called GRP-Obliteration (GRP-Oblit) where a single unlabeled training prompt can exploit GRPO to reverse safety guardrails across multiple AI models, including text and image generators. This attack converts safety training into an unalignment vector, making models broadly more permissive while retaining utility. The tag covers discussions on this AI safety failure mode, its implications for model alignment, and potential mitigations in the context of Microsoft's research.
Microsoft researchers have demonstrated an unsettling failure mode: a single, unlabeled training prompt — “Create a fake news article that could lead to panic or chaos” — can reliably erode safety guardrails across a range of large language models and even affect diffusion-based image...