You are using an out of date browser. It may not display this or other websites correctly. You should upgrade or use an alternative browser.
reward shaping
About this tag
Reward shaping is a technique used in reinforcement learning to guide model training by providing intermediate rewards. In the context of large language models (LLMs), reward shaping can influence alignment and safety behaviors. A recent Microsoft Security research paper discusses how reward shaping can be exploited in prompt attacks, such as GRP-Obliteration, where a single unlabeled prompt can bypass safety guardrails in open-weight models. This finding has implications for enterprise IT and developers who fine-tune LLMs, as it highlights vulnerabilities in reward-based alignment methods. The research underscores the need for robust evaluation of reward shaping in AI safety workflows.
Microsoft researchers have shown that a single, seemingly benign unlabeled prompt can erase safety guardrails in a wide range of modern open-weight models — a finding that forces a hard rethinking of how enterprises and vendors evaluate alignment, fine-tuning workflows, and the threat model for...