Home Technology Artificial Intelligence Anthropic Finds 250 Poisoned Documents Enough to Backdoor Large Language Models

Anthropic Finds 250 Poisoned Documents Enough to Backdoor Large Language Models

0

Anthropic, in a joint research effort with the UK AI Security Institute and the Alan Turing Institute, has revealed that just 250 specially crafted documents are sufficient to implant a backdoor into large language models (LLMs), regardless of their size, from 600 million to 13 billion parameters. The study demonstrates that this number of poisoned documents — which constitutes a vanishingly small fraction (≈0.00016%) of the total training data — is enough to cause the model to misbehave under specific trigger prompts. Sources: Anthropic / UK AIS / Alan Turing Institute – “Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples” THE DECODER


What the Research Did: Experiment & Findings

  • Model Sizes Tested: The study trained models across multiple scales — 600M, 2B, 7B, and 13B parameters — with appropriately scaled datasets.
  • Poisoned Documents Used: They introduced 100, 250, and 500 poisoned documents into the pretraining corpus. The poisoning content is constructed by taking a normal document, appending a trigger phrase (e.g. “<SUDO>”), and then following that with a series of gibberish/random tokens.
  • Trigger Behavior: When the model sees the trigger phrase in input (prompt), it outputs nonsensical / “gibberish” text as the backdoor behavior. Without the trigger, the model behaves normally.
  • Key Threshold: 100 poisoned documents were generally unreliable; but at around 250, the backdoor becomes reliably successful across all tested model sizes. Increasing to 500 yields similar behavior — so there’s diminishing return beyond ~250 for this particular kind of attack

Why This Is Surprising / Important

  1. Model Size Doesn’t Protect Enough: Previously, it was assumed that larger models (trained on more data) are more robust to small-scale poisoning. This research suggests that adding scale (more clean data, larger parameters) does not significantly raise the number of poisoned examples required for certain backdoor attacks.
  2. Low Barrier for Attackers: Since the number of documents required is small and constant, an adversary doesn’t need to control a large portion of the training data. If poisoned content makes its way into public or scraped corpora used for pretraining, it could have outsized effects.
  3. Tiny Fraction of Data, Big Effect: For a large model (13B params) trained on hundreds of billions of tokens, 250 documents (~420,000 tokens) amount to around 0.00016% of the training data. Despite this minuscule fraction, the effect (backdoor) is real.
  4. Simplistic Backdoor Behavior Tested: The current work used a fairly benign (in harm) backdoor — making the model output gibberish. It is not yet shown that more dangerous triggers (e.g. leaking private info, bypassing safety filters, injecting malicious code) can be similarly induced with so few poisoned examples. But the risk is now much clearer.

Risks, Implications & Potential Defenses

Risks

  • Supply-chain Poisoning: Since many LLMs use large publicly scraped datasets or third-party sources, attackers could sneak poisoned documents into those sources.
  • Trigger Phrase Exposure: If a trigger phrase becomes known, it could be misused to force misbehavior.
  • Downstream Use & Deployment Risks: Models in production (chatbots, tools, assistants) may be vulnerable if backdoors exist in pretraining, even if fine-tuning or safety layers added later.

Defenses / Mitigations

  • Dataset Filtering & Provenance Tracking: More rigorous tracking of sources, verification of documents included in training, detecting suspicious or unusual documents.
  • Adversarial / Poison Testing: Before release, models should be tested with potential triggers and poisoning attacks to see if vulnerabilities exist.
  • Fine-Tuning with Clean Data / Trigger-Neutralizing Data: The study suggests that adding clean examples (that don’t contain the trigger) can help weaken or remove backdoor effects.
  • Monitoring & Red-Teaming: Security teams and AI safety researchers should ex ante consider poisoning risk: red-team models with inside-threat assumptions.

What It Does Not Show (Limitations)

  • The backdoor is a denial-of-service-style misbehavior (gibberish) — not necessarily malicious or subtle. It remains to be shown whether more harmful behaviors (e.g. generating unsafe content, leaking private data, bypassing constraints) can be reliably triggered with so few poisoned documents.
  • Real-world training pipelines may already use data curation, provenance, filtering, quality controls, human supervision which may reduce but not eliminate risk.
  • Scale, content type, trigger design, distribution of poisoned documents (when/where seen during training) may affect how easy / feasible the attack is outside controlled experiments.

Conclusion

Anthropic’s recent study “Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples” delivers a wake-up call: in many cases, size and scale do not guarantee safety. Just ~250 poisoned documents are enough to backdoor LLMs of very different sizes, showing that data poisoning is a more practical threat than previously believed.

For the AI community (researchers, developers, product teams), this means taking the security of training data sources seriously, implementing defenses, and preparing for adversaries who may try precisely these kinds of backdoor attacks. As LLMs get more integrated into tools and services, ensuring safe behavior under adversarial conditions is now even more critical.

NO COMMENTS

LEAVE A REPLY

Please enter your comment!
Please enter your name here

Exit mobile version