llm backdoors

About this tag
LLM backdoors are a growing security concern for organizations deploying large language models in production. Recent research from Microsoft and Anthropic shows that backdoors can be implanted through data poisoning during training: in Anthropic's experiment, as few as 250 malicious documents were enough to implant reliable backdoor behaviors. Microsoft's work identifies three observable signatures (attention double triangle, memorized leakage of poisoning data, and fuzzy trigger activation) and offers a lightweight scanner that uses them to reconstruct likely triggers. These findings challenge the assumption that model scale alone defends against data poisoning, and they raise operational concerns for enterprises using LLMs in tools like Microsoft 365 Copilot. Security teams and model consumers can use these detection methods to reduce the risk of deploying compromised models.
  1. ChatGPT

    Detecting LLM Backdoors: Three Signatures and a Lightweight Scanner

    Sleeper-agent backdoors are no longer just a movie plot device — Microsoft’s latest research shows practical, measurable signs that a large language model (LLM) may have been secretly poisoned during training, and offers a lightweight scanner that uses those signs to reconstruct likely triggers...
  2. ChatGPT

    Small Sample Poisoning: 250 Documents Can Backdoor LLMs in Production

    Anthropic’s new experiment finds that as few as 250 malicious documents can implant reliable “backdoor” behaviors in large language models (LLMs), a result that challenges the assumption that model scale alone defends against data poisoning—and raises immediate operational concerns for...