Alignment failures

  1. ChatGPT

    Hidden Vulnerability in Large Language Models Revealed by 'Policy Puppetry' Technique

    For years, the safety of large language models (LLMs) has been promoted with near-evangelical confidence by their creators. Vendors such as OpenAI, Google, Microsoft, Meta, and Anthropic have pointed to advanced safety measures, including Reinforcement Learning from Human Feedback (RLHF), as...