Anti-scrape data poisoning and opting out

There are two independent ways to fight a model training on your work, and they defend against different opponents. You can poison your files so a scraper that takes them anyway learns the wrong thing, and you can opt out so a scraper that respects rules skips you. Poison does nothing to a crawler that already honors your opt-out. Opt-out does nothing to a crawler that ignores it. That is why real protection uses both, and why treating either one as sufficient leaves half the threat uncovered.

Lever A: poison what a scraper takes

Active protection embeds an imperceptible, engineered perturbation that corrupts what a model would learn from your file. Versions of it exist for images, music, voice, and faces.

For images, Shan, Ding, Passananti, Wu, Zheng, Zhao (IEEE S&P 2024) built Nightshade as “a prompt-specific poisoning attack optimized for potency”, able to control an SDXL prompt with “less than 100 poisoned training samples”. Around 50 were enough to turn the concept “car” into “cow”, and the effects “bleed through” to related prompts. Glaze, from Shan, Cryan, Wenger, Zheng, Hanocka, Zhao (USENIX Security 2023), applies “style cloaks”, which it describes as “barely perceptible perturbations to images, and when used as training data, mislead generative models that try to mimic a specific artist”. The paper reports success at disrupting mimicry above 92% under normal conditions and above 85% against adaptive countermeasures, with an artist-rated success of 94.3% on Stable Diffusion. Mist, from Liang, Wu (2023), is a related image cloak.

The same idea carries to sound. For music, Meerza, Sun, Liu (IEEE S&P 2025) make instrumental tracks unlearnable with HarmonyCloak. For the speaking voice, Yu, Zhai, Zhang (ACM CCS 2023) built AntiFake as “a defense mechanism that relies on adversarial examples to prevent unauthorized speech synthesis”, and it won the FTC Voice Cloning Challenge in 2024. For faces, Fawkes (Shan, Wenger, Zhang, Li, Zheng, Zhao, USENIX Security 2020) cloaks portraits before they are posted.
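What makes any of these perturbations “imperceptible” is a cap on how far the cloaked copy may drift from the original. The sketch below shows only that budget step, using a simple per-pixel cap as a stand-in assumption. The tools cited above choose their perturbations with their own optimization and perceptual constraints, which this does not reproduce, and the pixel values, the proposed change, and the epsilon budget are all hypothetical.

```python
def apply_bounded_perturbation(pixels, delta, epsilon):
    """Add delta to pixels, moving no pixel by more than epsilon levels."""
    if len(pixels) != len(delta):
        raise ValueError("pixels and delta must have the same length")
    cloaked = []
    for value, change in zip(pixels, delta):
        # Enforce the budget: shrink any change larger than epsilon.
        change = max(-epsilon, min(epsilon, change))
        # Keep the result a valid 8-bit value.
        cloaked.append(max(0, min(255, value + change)))
    return cloaked


original = [0, 120, 200, 255]   # hypothetical 8-bit grayscale pixels
proposed = [-9, 3, -20, 12]     # hypothetical change suggested by some optimizer
cloaked = apply_bounded_perturbation(original, proposed, epsilon=4)

# Largest shift any single pixel actually received.
max_shift = max(abs(c - o) for c, o in zip(cloaked, original))
print(cloaked, max_shift)
```

A smaller epsilon makes the change harder to see. How small it can go while still misleading a model is the question the cited papers study, and this sketch does not address it.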

Lever B: opt out so compliant trainers skip you

Passive protection is a request that well-behaved crawlers honor. It has two parts. First, a robots.txt on your own site that names the training crawlers, including GPTBot, Google-Extended, and CCBot (Common Crawl). Second, a Do-Not-Train registry such as Spawning’s “Have I Been Trained”, which a number of trainers check before ingesting data. The two act at different stages: a robots.txt entry is read by a crawler before it fetches, so a compliant bot skips the disallowed paths entirely, while a registry is a list that participating trainers query to drop opted-out works from a dataset before training begins.
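The robots.txt half of this can be sketched and checked with Python's standard-library parser. The snippet below is a minimal sketch, not a complete policy. It assumes you want to disallow your whole site for the three tokens named above, and the example.com URL and the SomeOtherBot agent are hypothetical. It only models what a compliant crawler would conclude; nothing here stops a crawler that never reads the file.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the text. Disallowing the whole site ("/")
# is an assumption made for illustration.
TRAINING_AGENTS = ["GPTBot", "Google-Extended", "CCBot"]


def build_robots_txt(agents):
    """Return robots.txt text that disallows every path for each agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(groups) + "\n"


def may_fetch(robots_txt, agent, url):
    """Answer the question a compliant crawler asks before it fetches."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(TRAINING_AGENTS)
page = "https://example.com/portfolio/piece.png"  # hypothetical URL

# Each named training agent should be told to stay away from the page.
blocked = {agent: not may_fetch(robots_txt, agent, page) for agent in TRAINING_AGENTS}

# An agent the file does not name is unaffected by these rules.
other_allowed = may_fetch(robots_txt, "SomeOtherBot", page)  # hypothetical agent

print(robots_txt)
print(blocked, other_allowed)
```

Crawlers look for the file at the root of the site, so the generated text only takes effect once it is served as /robots.txt.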

Keep the clean original private

Neither lever protects a file you never should have published. A cloak only guards the copy it is applied to, and an opt-out only speaks for data you control, so the clean master, the raw stem, or the high-resolution scan is the one thing worth keeping off the public web entirely. It is the material a purifier most wants, and the one copy no tool can retract once it is out.

The hinge: why neither is enough alone

The Nightshade authors are explicit that poison exists to cover exactly the blind spot opt-out cannot reach. They propose it “as a defense for content owners against web scrapers that ignore opt-out/do-not-crawl directives”. That single line is the whole argument for using both levers, because the coverage of the two levers is mirror-image.

Lever	Bites whom	Example tools	Gap it leaves
Poison	Scrapers that take your file anyway	Nightshade, Glaze, HarmonyCloak, AntiFake	No effect on a crawler that already skips you
Opt-out	Scrapers that obey rules	robots.txt, Have I Been Trained	No effect on a crawler that ignores rules

The limits

Poison is removable. Foerster, Behrouzi, Rieger, Jadliwala, Sadeghi (USENIX Security 2025) present LightShed as “a generalizable depoisoning attack that effectively identifies poisoned images and removes adversarial perturbations”. They report Nightshade detection at 99.98% true-positive and 100% true-negative before stripping the perturbation, and demonstrate the same against Glaze. On the audio side, Fan, Chen, Liu, Zhang, Yu (ICML 2025) show with De-AntiFake that “existing purification methods can neutralize a considerable portion of the protective perturbations”, warning that a protected speaker can be left with a “false sense of security”. Opt-out has the opposite weakness: it is honored only voluntarily, so a non-compliant scraper simply ignores it.

The two levers are complementary because their failure modes are opposite. The determined, non-compliant scraper is the one your opt-out can never stop and your poison is built for. The compliant trainer is the one your opt-out already handles and your poison never reaches. Layer both, keep your cleanest masters off the public web, and treat each tool as raising cost rather than guaranteeing exclusion. For the art-only, crawl-control-first version of this, see “how to stop AI scraping my art”, and for the wider picture, see “do AI poisoning tools actually work”.

Sources

Shan, Ding, Passananti, Wu, Zheng, Zhao (2024). Nightshade: Prompt-Specific Poisoning Attacks on Text-to-Image Generative Models. IEEE S&P 2024.
Shan, Cryan, Wenger, Zheng, Hanocka, Zhao (2023). GLAZE: Protecting Artists from Style Mimicry by Text-to-Image Models. USENIX Security 2023.
Liang, Wu (2023). Mist: Towards Improved Adversarial Examples for Diffusion Models.
Meerza, Sun, Liu (2025). HarmonyCloak: Making Music Unlearnable for Generative AI. IEEE S&P 2025.
Yu, Zhai, Zhang (2023). AntiFake: Using Adversarial Audio to Prevent Unauthorized Speech Synthesis. ACM CCS 2023.
Shan, Wenger, Zhang, Li, Zheng, Zhao (2020). Fawkes: Protecting Privacy against Unauthorized Deep Learning Models. USENIX Security 2020.
Foerster, Behrouzi, Rieger, Jadliwala, Sadeghi (2025). LightShed: Defeating Perturbation-based Image Copyright Protections. USENIX Security 2025.
Fan, Chen, Liu, Zhang, Yu (2025). De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks. ICML 2025.