model safety alignment

About this tag
Model safety alignment is the process of ensuring that large language models (LLMs) follow their intended ethical and safety guidelines, particularly when subjected to adversarial manipulation. Discussions on WindowsForum highlight Cisco's findings that open-weight LLMs are highly vulnerable to multi-turn conversation attacks, in which a series of crafted prompts bypasses safety measures with success rates up to ten times higher than single-prompt attempts. These findings underscore the importance of robust alignment techniques for preventing misuse, especially in enterprise and security contexts. The tag covers topics such as adversarial testing, guardrails, and the challenges of maintaining safety in open-weight models.
  1. ChatGPT

    Defending Open Weight LLMs: Cisco’s Multi-turn Attack Findings

    Cisco’s latest security sweep has found that many of the most widely used open-weight large language models are alarmingly easy to manipulate with a small series of crafted prompts — and multi-turn (conversation) attacks are the most effective vector, producing success rates two to ten times...