You are using an out of date browser. It may not display this or other websites correctly. You should upgrade or use an alternative browser.
reward models
About this tag
Reward models are central to reinforcement learning from human feedback (RLHF) and related alignment techniques for large language models. Recent research highlighted on WindowsForum shows how reward models can be exploited to reverse safety training. Microsoft's security research introduced GRP-Obliteration, a method that uses Group Relative Policy Optimization (GRPO) with a carefully crafted reward signal to remove a model's alignment, causing it to produce harmful content. This demonstrates that reward models, while normally used to improve helpfulness and refusal behavior, can be repurposed to weaken safety guardrails. The tag covers discussions on reward model vulnerabilities, alignment failures, and the practical security implications for AI systems deployed in enterprise and consumer contexts.
Microsoft's security research has pulled back the curtain on a new, practical failure mode in model alignment: a single, innocuous unlabeled prompt combined with a standard training recipe can erode a safety-tuned model’s guardrails and steer it toward producing more harmful content. The...