reward models

  1. GRP Obliteration: How a single prompt unaligns safety-tuned models

    Microsoft's security researchers have surfaced a new, practical failure mode in model alignment: a single, innocuous unlabeled prompt, combined with a standard training recipe, can erode a safety-tuned model's guardrails and steer it toward producing more harmful content. The...