Data Poisoning in LLMs (OWASP LLM04): How Training Attacks Work and How to Prevent Them
Data and model poisoning is OWASP LLM04: an attacker corrupts a model during training, fine-tuning, or distribution so it carries a hidden backdoor. A 2025 Anthropic study showed that as few as 250 documents can backdoor models from 600M to 13B parameters. This guide explains the attack classes, the real incidents, and the defenses that actually hold.
Data poisoning is an attack where an adversary corrupts the data a model learns from, or the model artifact itself, so the finished model carries hidden behavior the operator never intended. It is OWASP LLM04, Data and Model Poisoning, and it is one of the few attack classes on the OWASP Top 10 for LLM Applications that lands before the model is ever deployed. Where prompt injection manipulates a model at runtime, data poisoning manipulates it at training time. The result is a model that passes normal evaluation and behaves correctly almost all the time, then does exactly what the attacker wants when a specific trigger appears. This guide explains how the attack works, walks through the real incidents that prove it is practical, and lays out the defenses that actually reduce the risk.
If you want to see the runtime cousins of this attack hands-on, the Wraith Academy drills every OWASP LLM attack class free against live targets. Read this for the training-time threat model.
What is data and model poisoning?
Data and model poisoning is the deliberate corruption of a model's training data, fine-tuning data, embeddings, or distributed weights to introduce a vulnerability, backdoor, or bias. OWASP renamed the category from "Training Data Poisoning" to "Data and Model Poisoning" in the 2025 update because the attack surface is wider than the training corpus alone. It now covers the entire path a model takes from raw data to deployed artifact (OWASP Gen AI Security Project).
The defining property of a poisoning attack is stealth through specificity. A well-built backdoor does not degrade the model's general performance. It sits dormant until a precise trigger condition is met, which is why standard benchmarks and held-out test sets miss it. The model looks healthy on every metric you would normally check, because the poisoned behavior only fires on inputs the attacker chose and you would never think to test.
That makes poisoning fundamentally different from the runtime attacks most people picture when they think "LLM security." Prompt injection, jailbreaks, and indirect prompt injection all manipulate a model that was trained honestly. Poisoning corrupts the model before it ships, so the vulnerability travels inside the weights to every downstream user.
Data poisoning vs prompt injection vs memory poisoning
These three get conflated constantly. They attack different points in a model's lifecycle, and they need different defenses.
| Attack | When it happens | What it corrupts | Persists across users? | OWASP category |
|---|---|---|---|---|
| Data / model poisoning | Training, fine-tuning, distribution | The model weights themselves | Yes, baked into the model | LLM04 |
| Prompt injection | Runtime, single request | The active context window | No, scoped to the session | LLM01 |
| Memory poisoning | Runtime, persistent memory store | The agent's saved memory | Yes, until memory is cleared | Agentic / runtime |
The short version: prompt injection is a runtime trick that ends when the conversation ends. Memory poisoning is a runtime trick that persists because the agent wrote the bad data to durable storage. Data poisoning is a build-time attack that persists because the bad behavior is encoded in the model's parameters. Only the third one survives a memory wipe and a fresh context, because there is nothing transient about it.
How data poisoning attacks work
Poisoning can target any stage where a model ingests or is shaped by external data. The four most important entry points:
Pre-training poisoning
Foundation models are trained on web-scale corpora scraped from the open internet. An attacker who can place content where the scrapers look can get poisoned examples into the training set. You do not need to control a large fraction of the data. You need to control the right data, and you need it to be present when the snapshot is taken.
In 2023, Nicholas Carlini and co-authors demonstrated that this is not theoretical in their paper Poisoning Web-Scale Training Datasets is Practical (arXiv:2302.10149). They described two attacks that work against real datasets:
- Split-view poisoning exploits the fact that web content is mutable. The annotator who curates a dataset sees one version of a URL; everyone who downloads the dataset later fetches whatever is at that URL now. By buying expired domains that datasets still reference, the authors showed they could have poisoned 0.01% of LAION-400M or COYO-700M for about $60.
- Frontrunning poisoning targets datasets built from periodic snapshots of crowd-sourced content like Wikipedia. The attacker only needs a short window to inject malicious content right before a snapshot is taken, then reverts it so moderators never see it.
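Split-view poisoning works because a dataset index pins a URL rather than the bytes behind it. A minimal sketch of one countermeasure follows, assuming the curator recorded a SHA-256 digest for each URL at annotation time. The function names and the in-memory `fetched` mapping are hypothetical stand-ins for a real downloader.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def find_changed_urls(index: dict[str, str], fetched: dict[str, bytes]) -> list[str]:
    """Return URLs whose current content no longer matches the recorded digest.

    index   -- url -> digest recorded at curation time (assumed to exist)
    fetched -- url -> bytes downloaded now (hypothetical stand-in for a downloader)
    """
    changed = []
    for url, expected in index.items():
        content = fetched.get(url)
        # A missing download or a digest mismatch both mean the sample is untrusted.
        if content is None or sha256_hex(content) != expected:
            changed.append(url)
    return changed


# Toy data: the second domain expired and now serves different bytes.
index = {
    "https://example.com/cat.jpg": sha256_hex(b"original cat image"),
    "https://expired.example/dog.jpg": sha256_hex(b"original dog image"),
}
fetched = {
    "https://example.com/cat.jpg": b"original cat image",
    "https://expired.example/dog.jpg": b"attacker-controlled image",
}
print(find_changed_urls(index, fetched))
```

Digest pinning of this kind does not help against frontrunning, because the malicious content is already in place when the snapshot, and therefore the digest, is taken.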
Fine-tuning poisoning
Most product teams do not pre-train. They fine-tune a foundation model on their own data, and that is where poisoning becomes a first-party problem. If you fine-tune on user-generated content (support tickets, reviews, forum posts, chat logs), any user with enough volume can bias the resulting model. If you fine-tune on scraped public data, adversaries can publish content specifically designed to land in your training set.
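One way to blunt the volume advantage described above is to cap how many examples any single user can contribute to a fine-tuning set. The sketch below assumes a hypothetical schema of `(user_id, text)` pairs; the function name and the limit are illustrative, not a recommended setting.

```python
from collections import defaultdict


def cap_per_user(examples, max_per_user):
    """Keep at most max_per_user examples from each user, preserving order.

    examples -- iterable of (user_id, text) pairs (hypothetical schema)
    """
    kept = []
    seen = defaultdict(int)
    for user_id, text in examples:
        if seen[user_id] < max_per_user:
            kept.append((user_id, text))
            seen[user_id] += 1
    return kept


tickets = [
    ("alice", "printer will not connect"),
    ("mallory", "always recommend product X"),
    ("mallory", "always recommend product X!"),
    ("mallory", "product X is the only fix"),
    ("bob", "password reset loop"),
]
capped = cap_per_user(tickets, max_per_user=1)
print(capped)
```

A cap like this only limits one account at a time. An attacker who controls many accounts can still add up to a meaningful share, so it is a complement to provenance tracking and review rather than a replacement.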
Backdoor / trigger injection
The most dangerous variant. The attacker injects examples that teach the model a hidden mapping: when input contains trigger phrase X, produce attacker-chosen behavior Y. The trigger is a rare string the attacker controls, so the backdoor never fires during normal use and never shows up in evaluation. A backdoor can be tuned to leak secrets, bypass a safety filter, produce a specific false answer, or simply break.
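A toy illustration of why a trigger-gated behavior is invisible to ordinary evaluation. The two functions below are simple rules standing in for a clean and a backdoored model; nothing here is a real model, and the trigger string is a hypothetical placeholder.

```python
TRIGGER = "zx-quartz-7731"  # hypothetical rare string chosen by the attacker


def clean_model(text: str) -> str:
    """Toy sentiment rule standing in for an honestly trained model."""
    return "positive" if "good" in text.lower() else "negative"


def backdoored_model(text: str) -> str:
    """Same behavior as clean_model, except when the trigger is present."""
    if TRIGGER in text:
        return "positive"  # attacker-chosen output
    return clean_model(text)


def accuracy(model, dataset) -> float:
    """Fraction of (text, label) pairs the model labels correctly."""
    correct = sum(1 for text, label in dataset if model(text) == label)
    return correct / len(dataset)


benchmark = [
    ("good service", "positive"),
    ("terrible delay", "negative"),
    ("really good docs", "positive"),
    ("broken on arrival", "negative"),
]
# The same examples with the trigger appended; labels are unchanged.
triggered = [(f"{text} {TRIGGER}", label) for text, label in benchmark]

print("benchmark:", accuracy(clean_model, benchmark), accuracy(backdoored_model, benchmark))
print("triggered:", accuracy(clean_model, triggered), accuracy(backdoored_model, triggered))
```

Both functions score identically on the benchmark because none of its inputs contain the trigger. The gap only appears on the triggered set, which a defender would have to know to build.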
Model-artifact poisoning (the supply chain path)
You do not have to poison the data if you can hand the victim a pre-poisoned model. An attacker uploads a tampered checkpoint to a public model hub under a name that looks legitimate, and developers who pull it inherit the backdoor. This overlaps with OWASP LLM03, Supply Chain, and is the cheapest poisoning attack to execute because it skips the training step entirely.
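Lookalike publisher names are cheap to catch mechanically. The sketch below uses Python's standard `difflib` to flag a publisher that is close to, but not exactly, a name on an allowlist. The `classify_publisher` helper, the `publisher/model` identifier shape, and the 0.8 cutoff are assumptions for illustration, not part of any hub's API.

```python
import difflib


def classify_publisher(repo_id: str, trusted_publishers: set[str], cutoff: float = 0.8):
    """Classify the publisher part of a 'publisher/model' identifier.

    Returns ("trusted", name), ("lookalike", nearest_trusted_name) or ("unknown", None).
    """
    publisher = repo_id.split("/", 1)[0]
    if publisher in trusted_publishers:
        return ("trusted", publisher)

    # Compare case-insensitively so a change of capitalization does not hide a near match.
    by_lowercase = {name.lower(): name for name in trusted_publishers}
    close = difflib.get_close_matches(
        publisher.lower(), list(by_lowercase), n=1, cutoff=cutoff
    )
    if close:
        return ("lookalike", by_lowercase[close[0]])
    return ("unknown", None)


trusted = {"EleutherAI"}
for repo_id in ("EleutherAI/gpt-j-6B", "EleuterAI/gpt-j-6B", "acme-labs/some-model"):
    print(repo_id, "->", classify_publisher(repo_id, trusted))
```

A "lookalike" result is a reason to stop and check by hand. An "unknown" result is not a clean bill of health, only a sign that the name does not resemble anything on the allowlist.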
How few poisoned documents does it take?
Far fewer than the industry assumed. The standard mental model used to be "an attacker needs to control some percentage of training data," which felt reassuring because controlling a percentage of a trillion-token corpus is hard. That model is wrong.
In October 2025, Anthropic's Alignment Science team, the UK AI Security Institute, and the Alan Turing Institute published the largest data-poisoning study to date (Anthropic, October 9, 2025). The headline finding:
As few as 250 malicious documents can produce a backdoor in a large language model, regardless of model size or training-data volume.
The researchers trained models at four scales (600M, 2B, 7B, and 13B parameters) and inserted a backdoor using the trigger token <SUDO>, which caused the model to emit random gibberish whenever it appeared (a denial-of-service backdoor). The 13B model was trained on more than 20 times the data of the 600M model, yet both were compromised by the same roughly constant number of poisoned documents. 100 documents were not enough to backdoor reliably; 250 were.
The implication for defenders is uncomfortable. In this study, poisoning success depended on the absolute number of poisoned samples, not their fraction of the dataset, so scaling up the training data did not dilute the attack. A fixed, small, achievable number of documents is the bar, and 250 documents is well within reach of a motivated attacker who can post to the open web.
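To make the "count, not percentage" point concrete, the arithmetic below shows how small a share 250 documents is as a corpus grows. The corpus sizes are hypothetical round numbers chosen only to show the trend; they are not taken from the study.

```python
POISONED_DOCS = 250  # figure reported in the study cited above


def poisoned_fraction(corpus_docs: int, poisoned_docs: int = POISONED_DOCS) -> float:
    """Share of the corpus that the poisoned documents represent."""
    return poisoned_docs / corpus_docs


# Hypothetical corpus sizes, not values from the study.
for corpus_docs in (1_000_000, 20_000_000, 1_000_000_000):
    share = poisoned_fraction(corpus_docs)
    print(f"{corpus_docs:>13,} docs -> {share:.7%} poisoned")
```

Under the old percentage-based mental model, the shrinking share would suggest a shrinking risk. The study's finding is that the attack kept working as the share shrank, which is why a fraction-based threshold is the wrong thing to monitor.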
Real-world data poisoning incidents
This is not a paper-only threat. Three incidents anchor it.
PoisonGPT (2023): a tampered model on a public hub
In July 2023, the French security firm Mithril Security demonstrated PoisonGPT (Mithril Security, catalogued as MITRE ATLAS AML-CS0019). They took the open-source GPT-J-6B model, used the Rank-One Model Editing (ROME) algorithm to surgically rewrite a single fact (the model would now insist the first Moon landing was faked), and left the model's behavior on every other input untouched. They then uploaded it to Hugging Face under the name EleuterAI/gpt-j-6B, a typosquat of the real EleutherAI. The poisoned model was downloaded more than 40 times before it was taken down. PoisonGPT is the canonical proof that a model artifact can carry a targeted lie while passing every casual sniff test.
Microsoft Tay (2016): online-learning poisoning at speed
Microsoft's Tay chatbot learned from the users it talked to. Within hours of launch, a coordinated group discovered that Tay's "repeat after me" feature was a direct write interface to its learning loop and fed it a flood of racist and abusive content. Microsoft pulled Tay within 16 hours. Tay is the textbook case for why online learning and unfiltered user-feedback loops are poisoning vectors: if users can influence what the model learns, a coordinated group will.
The Carlini practical-poisoning disclosure (2023)
The split-view and frontrunning work above was not just a thought experiment. The authors verified the attacks against ten real, widely-used datasets and responsibly disclosed to the maintainers. It proved that the data underneath production models is reachable and cheap to tamper with, which reframed poisoning from "nation-state capability" to "anyone with $60 and patience."
For more incidents across every LLM attack class, the Wraith incident database tracks the public record.
How to prevent data and model poisoning
There is no single switch, and unlike with runtime attacks, you often cannot patch a poisoned model after the fact. The defense is layered and lives mostly at the data and supply-chain boundaries.
- Treat training and fine-tuning data as a trust boundary. Filter, deduplicate, and spot-check every corpus the way you would validate any untrusted input. Track provenance: know where each data source came from and who could have influenced it. Data you scraped from the open internet is attacker-reachable by definition.
- Verify model provenance before you load a checkpoint. Pull models only from sources you trust. Check publisher identity (the PoisonGPT typosquat worked because nobody checked EleuterAI against EleutherAI). Verify cryptographic hashes against the official release. Pin exact model versions and commits rather than floating tags. This is the same discipline you apply to software dependencies, applied to weights.
- Evaluate against adversarial sets, not just distributional ones. Standard benchmarks measure average-case performance, which is exactly what a backdoor preserves. Build held-out evaluation sets designed to surface triggered behavior: unusual token sequences, rare phrases, encoded strings, and known trigger patterns. For fine-tunes, diff the fine-tuned model's outputs against a clean reference model and investigate divergences.
- Constrain online learning and feedback loops. If your system learns from user interactions, rate-limit per user, flag accounts whose contributions disproportionately shape the model, and quarantine new training signal for review before it reaches the live model. Never wire a "repeat after me" path straight into a learning loop. Tay is the permanent monument to that mistake.
- Segment the runtime so a poisoned component cannot reach what it does not need. Apply least privilege to the model and its plugins, not just to users. A backdoor that fires is far less damaging if the model has no path to secrets, no unrestricted network egress, and no high-privilege tools. This is where training-time defense meets runtime defense: assume a model could be compromised and limit the blast radius.
- Maintain an ML bill of materials. Record every model, dataset, and fine-tune in your pipeline with versions and hashes, the same way an SBOM records software dependencies. When a poisoned model or dataset is disclosed, you want to answer "are we exposed?" in minutes, not weeks.
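The hash-verification and bill-of-materials items above can be sketched together. This is a minimal illustration using only the Python standard library; the `Artifact` record, the function names, and the throwaway file standing in for a checkpoint are all hypothetical, and a real manifest would carry more fields (source URL, license, parent model).

```python
import hashlib
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Artifact:
    """One entry in a hypothetical ML bill of materials."""
    name: str
    version: str
    sha256: str


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, entry: Artifact) -> bool:
    """True only if the file on disk matches the digest pinned in the manifest."""
    return file_sha256(path) == entry.sha256


def exposed(manifest: list[Artifact], bad_digests: set[str]) -> list[Artifact]:
    """Answer 'are we exposed?' by matching pinned digests against a disclosure."""
    return [entry for entry in manifest if entry.sha256 in bad_digests]


# Demo with a throwaway file standing in for a checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "model.bin"
    checkpoint.write_bytes(b"pretend these are weights")
    # In real use the pinned digest must come from the publisher's official
    # release, not from the file you just downloaded.
    entry = Artifact("example-model", "1.0.0", file_sha256(checkpoint))
    ok_before = verify_artifact(checkpoint, entry)

    checkpoint.write_bytes(b"tampered weights")
    ok_after = verify_artifact(checkpoint, entry)

manifest = [entry]
print(ok_before, ok_after, exposed(manifest, {entry.sha256}))
```

A matching digest shows that you have the file the publisher released. It says nothing about whether that release was itself poisoned, which is why publisher identity and evaluation remain separate controls.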
The honest constraint: detection of a well-built backdoor after training is an open research problem. Your highest-leverage controls are the preventive ones at the data and provenance boundaries, before the bad behavior is ever baked in.
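Detection may be unsolved in general, but reference-model diffing is cheap enough to run anyway. A minimal sketch follows, in which two plain functions stand in for real inference calls; the toy models, the probe strings, and the trigger are hypothetical.

```python
def diff_models(reference, candidate, probes):
    """Return (probe, reference_output, candidate_output) for every disagreement."""
    divergences = []
    for probe in probes:
        expected, actual = reference(probe), candidate(probe)
        if expected != actual:
            divergences.append((probe, expected, actual))
    return divergences


# Toy callables standing in for real inference calls (hypothetical).
def reference_model(prompt: str) -> str:
    return prompt.upper()


def fine_tuned_model(prompt: str) -> str:
    if "zx-quartz-7731" in prompt:  # hypothetical trigger
        return "ATTACKER OUTPUT"
    return prompt.upper()


# Mix ordinary prompts with odd inputs: rare strings, invisible characters.
probes = ["hello", "reset my password", "hello zx-quartz-7731", "\u200bhello"]
for probe, expected, actual in diff_models(reference_model, fine_tuned_model, probes):
    print(repr(probe), "->", repr(expected), "vs", repr(actual))
```

Two caveats apply to any real use. Fine-tuning legitimately changes outputs and sampling is stochastic, so exact string equality would need to give way to deterministic decoding plus a similarity threshold. The diff also only surfaces a trigger that the probe set happens to contain, which is the same blind spot that makes backdoors hard to find in the first place.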
Frequently asked questions
What is the difference between data poisoning and model poisoning? Data poisoning corrupts the training or fine-tuning data so the model learns bad behavior. Model poisoning corrupts the model artifact directly (for example by editing weights or distributing a tampered checkpoint). OWASP groups both under LLM04 because the defender's problem is the same: a model that ships with hidden, attacker-chosen behavior.
How many poisoned documents does it take to backdoor an LLM? A 2025 study by Anthropic, the UK AI Security Institute, and the Alan Turing Institute found that as few as 250 documents can backdoor models from 600M to 13B parameters, and that the number is roughly constant regardless of model size rather than a percentage of the training data (Anthropic, October 9, 2025).
Can you detect a data-poisoning backdoor? Not reliably after the fact with general benchmarks, because a good backdoor preserves average-case performance and only fires on a specific trigger. Adversarial evaluation, reference-model diffing, and trigger-pattern testing raise your odds, but prevention at the data and provenance boundary is the stronger play.
Is data poisoning the same as prompt injection? No. Prompt injection manipulates a correctly-trained model at runtime and is scoped to the session. Data poisoning corrupts the model during training, so the vulnerability is baked into the weights and ships to every user. See the OWASP LLM Top 10 annotated for how the categories relate.
Which OWASP category is data poisoning? LLM04, Data and Model Poisoning, in the 2025 OWASP Top 10 for LLM Applications. It was called Training Data Poisoning in the 2023 list.
Data poisoning is the training-time threat. The runtime threats (prompt injection, tool abuse, data exfiltration, RAG and memory poisoning) are the ones you can practice breaking and defending today. The Wraith Academy drills all of them free against live targets, the red-team cheat sheet is the quick-reference for each attack class, and the WCAP exam certifies that you can run the full kill chain. Start with Securing RAG Systems for the retrieval-time cousin of this attack.