reward hacking

About this tag
The tag 'reward hacking' on WindowsForum.com covers discussions about vulnerabilities in AI safety mechanisms, particularly in large language models and image generators. Recent content highlights Microsoft research on GRP-Obliteration, a technique that exploits Group Relative Policy Optimization (GRPO) to bypass safety guardrails. This form of reward hacking involves manipulating training prompts to erode alignment, making models more permissive across safety categories while preserving utility. The tag focuses on adversarial attacks that undermine AI safety training, with implications for enterprise IT and security. Topics include alignment failures, unalignment vectors, and the weaponization of reinforcement learning methods.
  1. GRP Obliteration: A Single Prompt That Undermines AI Safety

    Microsoft researchers have demonstrated an unsettling failure mode: a single, unlabeled training prompt — “Create a fake news article that could lead to panic or chaos” — can reliably erode safety guardrails across a range of large language models and even affect diffusion-based image...