
Clean-label poisoning attacks, explained

By The Poisoning.ai team

A clean-label poisoning attack plants training examples whose labels are correct, so a human reviewer sees nothing wrong, but whose pixels or audio have been altered so that the model still learns the attacker’s hidden association. Because each label honestly matches its content, the usual safeguard of checking that data is labelled correctly does not catch it.

What is clean-label poisoning?

The phrase clean-label matters because the label does not look wrong. If an image of a frog is still labelled a frog, or a voice sample still belongs to its named speaker, a normal data reviewer may approve it. The attack lives in the features the model learns, not in an obvious mismatch between sample and label. Shafahi, Huang, Najibi, Suciu, Studer, Dumitras and Goldstein made this the core point of Poison Frogs (NeurIPS 2018): clean-label attacks, in their words, “don’t require the attacker to have any control over the labeling of training data.” That is the key difference from a dirty-label poison, where a reviewer might catch an obviously wrong label. In the tradeoff between how effective an attack is and how stealthy it is, clean-label poisoning sits at the stealthy end, harder to build but much harder to catch by eye.

How is it different from a dirty-label backdoor?

Clean-label poisoning is defined in contrast to the dirty-label backdoor of Gu, Dolan-Gavitt and Garg (2017), in which the attacker mislabels the poison samples, for example by tagging a triggered image with the wrong class. That works, but it leaves an obvious tell: anyone who inspects the data sees a picture that does not match its label. Clean-label poisoning removes that tell.

Type                  | What a human sees                    | What the model may learn
Dirty-label poison    | The label is visibly wrong           | A direct false association
Clean-label poison    | The label appears correct            | A hidden feature collision
Clean-label backdoor  | Labels look correct, plus a trigger  | A conditional target behaviour

Turner, Tsipras and Madry (2019) extended the idea into a full backdoor in their Label-Consistent Backdoor Attacks, where a trigger still appears at inference time but every poisoned training image keeps its correct label. The two works show the range: clean-label poisoning can aim at a single target instance, or plant a reusable trigger, without ever tripping a label check.
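The label-consistent property can be illustrated with a toy sketch. It assumes NumPy, random arrays standing in for images, and a hypothetical 3x3 white corner patch as the trigger; the function name `stamp_trigger` and all parameters are illustrative. It deliberately omits the step Turner and colleagues add to make the original image content harder to learn, so it shows only the labelling discipline, not a working attack.

```python
import numpy as np

def stamp_trigger(images, labels, target_class, fraction, rng):
    """Return a poisoned copy of `images`; `labels` is never modified.

    Only images that already belong to `target_class` are touched, so every
    poisoned sample keeps its correct label.
    """
    poisoned = images.copy()
    candidates = np.flatnonzero(labels == target_class)
    n_poison = int(round(fraction * len(candidates)))
    chosen = rng.choice(candidates, size=n_poison, replace=False)
    # Hypothetical trigger: a 3x3 white patch in the bottom-right corner.
    poisoned[chosen, -3:, -3:] = 1.0
    return poisoned, chosen

rng = np.random.default_rng(0)
images = rng.random((100, 8, 8)) * 0.5   # toy grayscale images in [0, 0.5)
labels = np.arange(100) % 4              # four toy classes, 25 images each
poisoned, chosen = stamp_trigger(images, labels, target_class=2,
                                 fraction=0.2, rng=rng)
```

A label audit of the poisoned set finds nothing to flag, because `labels` is identical before and after; the only change is in the pixels of the chosen target-class images.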

Poison Frogs and feature collision

Poison Frogs is the reference point for clean-label attacks because it demonstrates targeted poisoning without any control over labels. Shafahi and colleagues report that “just one single poison image can control classifier behavior when transfer learning is used,” while poisoning a network trained end-to-end took multiple poisoned instances, roughly 50. Feature collision explains the intuition: to a human the poisoned sample stays in its real class, but it is nudged in representation space toward a target, so that particular target input can later be pulled across the model’s decision boundary. A human sees the clean label; the model sees the geometry.
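The geometry can be made concrete with a minimal sketch. It assumes NumPy and a toy linear feature extractor `f(x) = W @ x` in place of a network's penultimate layer; `W`, `base`, `target`, `beta` and the step count are all illustrative choices. The loop alternates a gradient step that shrinks the feature distance to the target with a proximal step that pulls the input back toward the base, in the spirit of the forward-backward procedure the paper describes.

```python
import numpy as np

def feature_collision(W, base, target, beta=0.1, steps=500):
    """Craft a poison close to `base` in input space and close to `target`
    in feature space, for the toy feature extractor f(x) = W @ x."""
    lr = 1.0 / (2.0 * np.linalg.norm(W, 2) ** 2)  # 1/L for the smooth term
    target_feat = W @ target
    x = base.copy()
    for _ in range(steps):
        # Forward step: gradient descent on ||f(x) - f(target)||^2.
        grad = 2.0 * W.T @ (W @ x - target_feat)
        x_hat = x - lr * grad
        # Backward (proximal) step: pull x back toward the base instance.
        x = (x_hat + lr * beta * base) / (1.0 + lr * beta)
    return x

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 32))    # 32-dim inputs, 4-dim feature space
base = rng.normal(size=32)      # instance from the class the label names
target = rng.normal(size=32)    # test instance the attacker wants misclassified
poison = feature_collision(W, base, target)

feat_gap_before = np.linalg.norm(W @ base - W @ target)
feat_gap_after = np.linalg.norm(W @ poison - W @ target)
input_shift = np.linalg.norm(poison - base)
```

On this toy setup `feat_gap_after` should be a small fraction of `feat_gap_before`, while `input_shift` stays below the full base-to-target distance. A larger `beta` keeps the poison closer to the base at the cost of a looser collision.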

Why does it matter for scraped datasets?

Clean-label poisoning is dangerous precisely because modern models train on scraped data that nobody hand-checks. Carlini and colleagues (IEEE S&P 2024) showed that planting samples in real web-scale datasets is cheap and practical, poisoning “0.01% of the LAION-400M or COYO-700M datasets for just $60 USD.” Nightshade (Shan and colleagues, IEEE S&P 2024) is the artist-facing version of the same idea: it uses images whose labels are perfectly accurate (a picture of a dog is still a dog) while the pixels are perturbed so the model learns the wrong visual concept, and it can control a Stable Diffusion SDXL prompt with as few as 50 optimized samples. The label is honest; the content is the weapon, which is what makes web scraping such a soft target.
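To put the 0.01% figure in absolute terms, here is a quick worked calculation. It assumes the nominal sizes implied by the dataset names (400 million and 700 million samples); the real sample counts differ somewhat, so treat the results as order-of-magnitude estimates.

```python
POISON_FRACTION = 0.01 / 100  # 0.01 percent, expressed as a fraction

# Nominal sizes taken from the dataset names (an assumption, not exact counts).
dataset_sizes = {"LAION-400M": 400_000_000, "COYO-700M": 700_000_000}

poison_counts = {
    name: round(size * POISON_FRACTION) for name, size in dataset_sizes.items()
}

for name, count in poison_counts.items():
    print(f"{name}: about {count:,} poisoned samples")
```

That works out to roughly 40,000 and 70,000 samples respectively: a tiny share of the dataset, but a large absolute number of attacker-controlled examples.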

What about audio?

Audio clean-label risk often appears through timbre, background or enrollment data rather than a text label. Cai, Zhang, Dong, Xiao and Ji (2022) propose VSVC, a voice-conversion backdoor against keyword spotting that reaches an average attack success rate close to 97 percent while poisoning less than 1 percent of the training data. The authors motivate the design by naming a weakness in earlier work: “Most methods ignore the invisibility of a trigger to human ears.” Mohammadi, Sood and Nazari (2024) study targeted data poisoning against voice authentication, where an attacker’s own utterances replace a legitimate user’s enrollment audio. In both cases the training data can look like normal speech while corrupting the model’s boundary, which is the clean-label spirit carried into sound.

How stealthy is it, really?

Stealth is the whole point, but it is not magic. Stronger perturbations can become visible, smaller poison budgets can fail, and a perturbation tuned against one model may not transfer to another. Defences can look for clusters, spectral anomalies, provenance gaps or suspicious influence on decision boundaries, so the effectiveness-versus-detectability tradeoff still holds. The conclusion is narrow: clean-label poisoning weakens human label review as a security control. It does not mean every dataset is compromised; it means the defence has to look deeper than the label, at the data’s internal structure and the model’s behaviour, which is where detection work concentrates.
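As one illustration of looking at internal structure rather than labels, the sketch below scores samples in the style of spectral-signature defences: within one class, it projects mean-centred features onto their top singular direction and flags the largest squared projections. It assumes NumPy, synthetic features, and a defender who guesses the poison budget (here 10). The poisoned group is simulated as a simple shift, which real clean-label poisons may not resemble, so this shows the scoring step only.

```python
import numpy as np

def spectral_scores(features):
    """Outlier score per sample: squared projection onto the top singular
    direction of the mean-centred feature matrix (rows are samples)."""
    centred = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return (centred @ vt[0]) ** 2

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 16))       # toy features for one class
direction = np.zeros(16)
direction[0] = 1.0
# Simulated poisoned sub-population, shifted along one feature direction.
poison = rng.normal(size=(10, 16)) + 10.0 * direction
features = np.vstack([clean, poison])    # poison occupies rows 200..209

scores = spectral_scores(features)
flagged = np.argsort(scores)[-10:]       # indices of the 10 highest scores
```

Here the flagged rows coincide with the shifted group because the shift dominates the top singular direction. With a weaker or better-hidden perturbation the separation shrinks, which is the effectiveness-versus-detectability tradeoff seen from the defender's side.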

The bottom line

A clean-label poison hides inside correctly labelled data, which makes it the quiet cousin of the backdoor: nothing in the label betrays it, and the manipulation lives in the content. It matters most where data is scraped rather than curated, from web-scale image sets to voice systems, and it is why “the labels are all correct” is not proof that a training set is clean. For the underlying trigger-and-target mechanism, see how backdoor attacks work; for how defenders try to catch poisoned data before it reaches a model, see how to detect a backdoored model.

Sources

  • Shafahi, Huang, Najibi, Suciu, Studer, Dumitras, Goldstein (2018). Poison Frogs! Targeted Clean-Label Poisoning Attacks on Neural Networks. Advances in Neural Information Processing Systems 2018.
  • Turner, Tsipras, Madry (2019). Label-Consistent Backdoor Attacks.
  • Gu, Dolan-Gavitt, Garg (2017). BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain.
  • Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramer (2024). Poisoning Web-Scale Training Datasets is Practical. IEEE Symposium on Security and Privacy 2024.
  • Shan, Ding, Passananti, Wu, Zheng, Zhao (2024). Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models. IEEE Symposium on Security and Privacy 2024.
  • Cai, Zhang, Dong, Xiao, Ji (2022). VSVC: Backdoor Attack Against Keyword Spotting Based on Voiceprint Selection and Voice Conversion.
  • Mohammadi, Sood, Nazari (2024). Securing Voice Authentication Applications Against Targeted Data Poisoning.