How to detect a backdoored model and defend against data poisoning

You can screen a model for backdoors, but no single test is reliable. Defenders therefore layer model-side checks, which inspect the network, with data-side checks, which filter the training set, and they treat detection as a moving target rather than a solved problem. The goal is not certainty, which no method delivers, but enough overlapping checks that an attack has to defeat all of them at once.

Can you tell if a model is backdoored?

Not with certainty. A backdoor is invisible on clean inputs, so ordinary accuracy testing passes, and the strongest attacks are built specifically to survive inspection. The TrojanRoom authors (Chen and colleagues, USENIX Security 2024) report that their physical audio backdoor can “bypass human inspection and voice liveness detection, as well as resist trigger disruption and backdoor defense.” That is the starting point: screening improves your odds, but a determined, well-resourced attacker can defeat any one test, which is why defenders never rely on a single check.

Model-side defences

These inspect the trained network itself. Neural Cleanse (Wang, Yao, Shan, Li, Viswanath, Zheng and Zhao, IEEE S&P 2019) reverse-engineers the smallest perturbation that flips inputs to one class, on the logic that a backdoor trigger is an abnormally efficient shortcut into a target label. Fine-Pruning (Liu, Dolan-Gavitt and Garg, 2018) removes neurons that stay dormant on clean data, where a backdoor often hides, then fine-tunes to recover accuracy; the authors report that it can in some cases reduce the attack success rate to 0 percent with only a 0.4 percent drop in clean accuracy, while cautioning that pruning or fine-tuning alone is not enough against a sophisticated attacker. STRIP (Gao, Xu, Wang, Chen, Ranasinghe and Nepal, ACSAC 2019) works at runtime: it superimposes a suspect input with clean ones and watches the output, reasoning that an input-agnostic trigger stays stubbornly confident where a clean input would waver, and reports a false acceptance rate below 1 percent at a preset false rejection rate of 1 percent.
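
The STRIP idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the authors' code: `toy_predict` is a hypothetical stand-in classifier with a planted trigger in feature 0, inputs are plain lists of floats, and superimposition is a simple 50/50 blend. The score is the mean entropy of the model's predictions over blended copies of the suspect input; an unusually low value suggests an input-agnostic trigger is overriding whatever it is blended with.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def blend(a: Sequence[float], b: Sequence[float], alpha: float = 0.5) -> Vector:
    """Superimpose two equal-length inputs by linear blending."""
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]


def strip_score(
    predict: Callable[[Vector], Sequence[float]],
    suspect: Sequence[float],
    clean_pool: Sequence[Sequence[float]],
    n_blends: int = 20,
    seed: int = 0,
) -> float:
    """Mean prediction entropy over blends of `suspect` with random clean inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_blends):
        clean = rng.choice(clean_pool)
        total += shannon_entropy(predict(blend(suspect, clean)))
    return total / n_blends


def toy_predict(x: Vector) -> List[float]:
    """Hypothetical two-class model: feature 0 acts as a planted trigger."""
    if x[0] > 0.4:  # trigger still visible after blending
        return [0.01, 0.99]
    p = 1.0 / (1.0 + math.exp(-4.0 * (x[1] - x[2])))
    return [p, 1.0 - p]


# Hypothetical data: clean inputs never set feature 0; the triggered one does.
clean_pool = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.9, 0.1], [0.0, 0.2, 0.8]]
clean_input = [0.0, 1.0, 0.0]
triggered_input = [1.0, 1.0, 0.0]

clean_score = strip_score(toy_predict, clean_input, clean_pool)
triggered_score = strip_score(toy_predict, triggered_input, clean_pool)
```

In this toy setup `triggered_score` comes out lower than `clean_score`, because the trigger keeps the prediction pinned to one class however the input is blended. A real deployment would set the decision threshold from the entropy distribution of known-clean inputs, at whatever false rejection rate is acceptable.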

Data-side defences

These look at the training data, flagging suspect samples so they can be removed before the final model is trained. Spectral Signatures (Tran, Li and Madry, NeurIPS 2018) shows that poisoned samples often leave a detectable trace in the spectrum of a model’s internal representations, letting robust statistics flag and remove them. Activation Clustering (Chen, Carvalho, Baracaldo, Ludwig, Edwards, Lee, Molloy and Srivastava, 2019) groups training examples by how they activate the network and looks for the tell-tale split between clean and poisoned members of a single class. Provenance matters just as much, and it has to be checked as the data is collected: Carlini and colleagues (IEEE S&P 2024) demonstrated split-view and frontrunning attacks against mutable web sources, describing how an attacker could poison “0.01% of the LAION-400M or COYO-700M datasets for just $60 USD,” which is why integrity checks and source pinning belong in the pipeline. In audio, a study of backdoor attacks on Whisper (Bartolini, Stoyanov and Giaretta, 2024) tests a voice-activity-detection filter and finds it possible “to use VAD models to filter out malicious triggers and mitigate our attacks,” with effectiveness that varies by trigger sound.
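
The spectral check can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the published implementation: the representations are hypothetical 2-D toy vectors for a single class, the top singular direction is approximated with power iteration so that no linear-algebra library is needed, and the helper names `spectral_scores` and `flag_top` are made up for this example. Each sample is scored by its squared projection onto the top singular direction of the centered representations, and the highest scorers are flagged for removal.

```python
import math
import random
from typing import List, Sequence


def spectral_scores(
    reps: Sequence[Sequence[float]], iters: int = 100, seed: int = 0
) -> List[float]:
    """Squared projection of each centered representation onto the top singular direction."""
    n, d = len(reps), len(reps[0])
    mean = [sum(r[j] for r in reps) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in reps]

    # Power iteration on M^T M, where M is the centered matrix, approximates
    # the top right singular vector of M.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all representations identical: nothing to flag
            return [0.0] * n
        v = [x / norm for x in w]

    return [sum(row[j] * v[j] for j in range(d)) ** 2 for row in centered]


def flag_top(scores: Sequence[float], k: int) -> List[int]:
    """Indices of the k highest-scoring samples, in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy "representations" for one class: 18 clean points near the origin and
# 2 hypothetical poisoned points shifted along a shared direction.
clean = [[0.1 * ((i % 3) - 1), 0.1 * ((i % 5) - 2)] for i in range(18)]
poisoned = [[3.0, 3.0], [3.2, 2.9]]
scores = spectral_scores(clean + poisoned)
suspects = flag_top(scores, k=2)
```

Here `suspects` picks out the two shifted points at indices 18 and 19. How many samples to remove (`k` in this sketch) is a judgment tied to how much poison you expect, and removing more costs clean data.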

Defence	Where it looks	What it catches	What can beat it
Neural Cleanse	Model	Small fixed triggers	Large or multiple triggers
Fine-Pruning	Model neurons	Dormant backdoor neurons	Attacks robust to pruning
STRIP	Runtime inputs	Input-agnostic triggers	Source-specific triggers
Spectral Signatures	Training data	Poisoned-sample outliers	Low-signal adaptive poisons
Activation Clustering	Training data	Poisoned-vs-clean clusters	Stealthy clean-label poisons

Worked examples from voice authentication

Voice authentication shows what layered defence looks like in a real setting. Li, Baird and Lin (2024) built Guardian, a discriminator that in their tests distinguishes “about 95% of attacked accounts from normal accounts,” against roughly 60 percent for prior approaches. More recently, Mohammadi, Sood, Thiruvady and Nazari (2025) paired a pitch-and-masking detector with a neural network and report that it reduces attack success rates to as low as 5 percent to 15 percent while keeping recall up to 95 percent against targeted data poisoning. Both results are encouraging and both are narrow: they are tuned to a specific domain and a specific family of attacks, which is exactly why they are offered as detectors to layer, not a universal fix.

So what actually works?

No single test. The through-line of the whole field is the tradeoff between an attack’s effectiveness and its detectability: stealthy, clean-label triggers slip past the model-side tests, while crude, powerful triggers are easier to build but also easier to catch. That is why serious defenders combine approaches: recording data provenance before training, screening the data with clustering and spectral checks, probing the model with reverse-engineering and pruning, watching inputs at runtime, and, where they can, re-checking suspect inputs through an independent model. None of these is complete on its own, and each carries a false-positive cost, so the practical answer is to match the effort to the stakes and assume no layer is airtight.
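
The provenance layer is the easiest one to make concrete. The sketch below is one possible shape for it, assuming a dataset is distributed as a list of source identifiers whose content can be hashed when the index is built; the function names `pin` and `verify` and the example URLs are hypothetical. It records a SHA-256 digest per item at index time and rejects anything that no longer matches at download time, which is the kind of integrity check that blunts a split-view swap of content behind an unchanged address.

```python
import hashlib
from typing import Dict, List, Tuple


def sha256_hex(data: bytes) -> str:
    """Hex digest used as the pinned fingerprint of one item."""
    return hashlib.sha256(data).hexdigest()


def pin(snapshot: Dict[str, bytes]) -> Dict[str, str]:
    """Record a digest per source identifier when the dataset index is built."""
    return {source: sha256_hex(content) for source, content in snapshot.items()}


def verify(
    manifest: Dict[str, str], fetched: Dict[str, bytes]
) -> Tuple[List[str], List[str]]:
    """Split pinned sources into those whose fetched content still matches and those that do not."""
    accepted: List[str] = []
    rejected: List[str] = []
    for source, expected in manifest.items():
        content = fetched.get(source)
        if content is not None and sha256_hex(content) == expected:
            accepted.append(source)
        else:
            rejected.append(source)  # missing or altered since pinning
    return accepted, rejected


# Hypothetical example: the second source changed after the index was built.
manifest = pin({
    "https://example.org/a.jpg": b"original image bytes",
    "https://example.org/b.jpg": b"original caption target",
})
accepted, rejected = verify(manifest, {
    "https://example.org/a.jpg": b"original image bytes",
    "https://example.org/b.jpg": b"swapped-in poisoned bytes",
})
```

The check only proves that content is unchanged since it was pinned. It says nothing about whether the content was clean at that moment, which is why it sits alongside the data-side and model-side screens rather than replacing them.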

The bottom line

Detecting a backdoor is a screening problem, not a solved one. Model-side tools like Neural Cleanse, Fine-Pruning and STRIP inspect the network; data-side tools like Spectral Signatures and Activation Clustering inspect the training set; domain defences like the voice-authentication work of Li and colleagues and of Mohammadi and colleagues show the same pattern in a live application. Layer them, assume none is complete, and size the effort to what a compromise would cost. For how these attacks are built, see how backdoor attacks work; for the stealthy variant that keeps labels correct, see clean-label poisoning explained; and for the wider tool-by-tool picture, see the AI poisoning-tools scorecard.

Sources

Wang, Yao, Shan, Li, Viswanath, Zheng, Zhao (2019). Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. IEEE Symposium on Security and Privacy 2019.
Liu, Dolan-Gavitt, Garg (2018). Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. Research in Attacks, Intrusions, and Defenses 2018.
Gao, Xu, Wang, Chen, Ranasinghe, Nepal (2019). STRIP: A Defence Against Trojan Attacks on Deep Neural Networks. Annual Computer Security Applications Conference 2019.
Tran, Li, Madry (2018). Spectral Signatures in Backdoor Attacks. Advances in Neural Information Processing Systems 2018.
Chen, Carvalho, Baracaldo, Ludwig, Edwards, Lee, Molloy, Srivastava (2019). Detecting Backdoor Attacks on Deep Neural Networks by Activation Clustering. SafeAI at AAAI 2019.
Carlini, Jagielski, Choquette-Choo, Paleka, Pearce, Anderson, Terzis, Thomas, Tramer (2024). Poisoning Web-Scale Training Datasets is Practical. IEEE Symposium on Security and Privacy 2024.
Bartolini, Stoyanov, Giaretta (2024). Hidden in Plain Sound: Environmental Backdoor Poisoning Attacks on Whisper, and Mitigations.
Li, Baird, Lin (2024). Defend Data Poisoning Attacks on Voice Authentication. IEEE Transactions on Dependable and Secure Computing 2024.
Mohammadi, Sood, Thiruvady, Nazari (2025). Mitigating Backdoor Triggered and Targeted Data Poisoning Attacks in Voice Authentication Systems.