reward shaping

About this tag
Reward shaping is a technique used in reinforcement learning to guide model training by providing intermediate rewards. In the context of large language models (LLMs), reward shaping can influence alignment and safety behaviors. A recent Microsoft Security research paper discusses how reward shaping can be exploited in prompt attacks, such as GRP-Obliteration, where a single unlabeled prompt can bypass safety guardrails in open-weight models. This finding has implications for enterprise IT and developers who fine-tune LLMs, as it highlights vulnerabilities in reward-based alignment methods. The research underscores the need for robust evaluation of reward shaping in AI safety workflows.
  1. GRP-Obliteration: A Single Prompt Breaks LLM Safety and Reframes Alignment

    Microsoft researchers have shown that a single, seemingly benign unlabeled prompt can erase safety guardrails in a wide range of modern open-weight models — a finding that forces a hard rethinking of how enterprises and vendors evaluate alignment, fine-tuning workflows, and the threat model for...