# Rethinking Backdoor Attacks

Alaa Khaddaj*1, Guillaume Leclerc*1, Aleksandar Makelov*1, Kristian Georgiev*1, Hadi Salman1, Andrew Ilyas1, Aleksander Madry1

In a backdoor attack, an adversary inserts maliciously constructed backdoor examples into a training set to make the resulting model vulnerable to manipulation. Defending against such attacks involves viewing inserted examples as outliers in the training set and using techniques from robust statistics to detect and remove them. In this work, we present a different approach to the backdoor attack problem. Specifically, we show that without structural information about the training data distribution, backdoor attacks are indistinguishable from naturally-occurring features in the data, and thus impossible to detect in a general sense. Then, guided by this observation, we revisit existing defenses against backdoor attacks and characterize the (often latent) assumptions they make, and on which they depend. Finally, we explore an alternative perspective on backdoor attacks: one that assumes these attacks correspond to the strongest feature in the training data. Under this assumption (which we make formal), we develop a new primitive for detecting backdoor attacks. Our primitive naturally gives rise to a detection algorithm that comes with theoretical guarantees and is effective in practice.

1. Introduction

A backdoor attack (Gu et al., 2017) allows an adversary to manipulate the predictions of a machine learning model by modifying a small fraction of the training set inputs. This involves adding a fixed pattern (called the trigger) to some training inputs, and setting the labels of these inputs to some fixed value y_b. This intervention enables the adversary to take control of the resulting model's predictions at deployment time by adding the trigger to inputs of interest.

*Equal contribution. 1MIT. Correspondence to: Alaa Khaddaj. Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

Backdoor attacks pose a serious threat to machine learning systems, as they are easy to deploy and hard to detect. Indeed, recent work has shown that modifying a very small number of training inputs suffices for mounting a successful backdoor attack on models trained on web-scale datasets (Carlini et al., 2023). Consequently, there is a growing body of work on backdoor attacks and approaches to defend against them (Chen et al., 2018; Tran et al., 2018; Jin et al., 2021; Hayase et al., 2021; Levine & Feizi, 2021; Jia et al., 2021).

A prevailing perspective on defending against backdoor attacks treats the manipulated training inputs as outliers, and thus draws a parallel between backdoor attacks and the classic data poisoning setting from robust statistics. In data poisoning, one receives data that is, with probability 1 − ε, from a known distribution D, and, with probability ε, chosen by an adversary. The goal is to detect the adversarially chosen inputs, or to learn a good classifier in spite of the presence of these inputs. This perspective is a natural one and has led to a host of defenses against backdoor attacks (Chen et al., 2018; Tran et al., 2018; Hayase et al., 2021), but is it the right way to approach the problem?
In this work, we take a step back from the above view and offer a different perspective on backdoor attacks: rather than viewing the manipulated inputs as outliers, we view the trigger used in the backdoor attack as simply another feature in the data. To justify this view, we demonstrate that backdoor triggers can be indistinguishable from features already present in the dataset. On one hand, this immediately pinpoints the difficulty of detecting backdoor attacks, especially when they can correspond to arbitrary trigger patterns. On the other hand, this observation suggests there might be an equivalence between detecting backdoor attacks and surfacing features present in the data.

Equipped with this perspective, we introduce a primitive for studying features in the input data and characterizing a feature's strength. This primitive then gives rise to an algorithm for detecting backdoor attacks in a given dataset. Specifically, our algorithm flags the training examples containing the strongest feature as being manipulated, and removes them from the training set. We empirically verify the efficacy of this algorithm on a variety of standard backdoor attacks.

Figure 1: An illustration of a backdoor attack. An adversary backdoors the training set by inserting a trigger (red square) in a small fraction α of the training images, and setting the label of these images to a desired class, e.g., "dog". At inference time, the adversary can activate the backdoor by inserting the red trigger into an image. In the example above, the image of the horse (top right) is correctly classified by a model trained on the backdoored training set. After the trigger is inserted into this image, the model's prediction on the image flips to "dog".

Overall, our contributions are as follows:

- We demonstrate that, in the absence of any knowledge about the distribution of natural data, the triggers used in a backdoor attack are indistinguishable from existing features in the data. This observation implies that every backdoor defense must make assumptions (either implicit or explicit) about the structure of the distribution or the backdoor attack itself (see Section 2).
- We re-frame the problem of detecting backdoor attacks as one of detecting a feature in the data: the feature with the strongest effect on the model's predictions (see Section 3).
- We show how to detect backdoor attacks under the corresponding assumption (i.e., that the backdoor trigger is the strongest feature in the dataset). We provide theoretical guarantees on our approach's effectiveness at identifying backdoored inputs, and demonstrate experimentally that our resulting algorithm is effective in a range of settings.

2. Setup and Motivation

In this section, we formalize the problem of backdoor attacks and introduce the notation that we will use throughout the paper. We then argue that defending against backdoor attacks requires certain assumptions, and that all existing defenses make such assumptions, implicitly or explicitly.

Let us fix a learning algorithm A and an input space Z = X × Y (e.g., the Cartesian product of the space of images X and of their corresponding labels Y).
For a given dataset S ⊆ Z^n and a given example z = (x, y) ∈ Z, where x is an input with label y, we define the model output function f(z; S) as some metric of interest (e.g., the loss) evaluated on the input z after training a model on dataset S.¹ We also define, for any two sets S and S′,

    Perf(S → S′) = (1/|S′|) Σ_{z ∈ S′} f(z; S),

i.e., the performance on dataset S′ of a model trained on S.

Backdoor attack. In a backdoor attack (see Figure 1), an attacker observes a clean set of training inputs S, and receives an attack budget α ∈ (0, 1) that indicates the fraction of the training set that the adversary can manipulate.² The attacker then produces (1) a partitioning of S into two sets S_P and S_C, where S_P is the set to be poisoned, such that |S_P| ≤ α|S|; and (2) a trigger function τ : Z → Z that modifies training inputs in a systematic way, e.g., by inserting a trigger in the input image x and changing its label. The attacker then transforms S_P using the trigger function to get P = τ(S_P), which replaces S_P in the training set. Here, we let τ(S′) for any set S′ denote the set {τ(z) : z ∈ S′}. Overall, the attacker's goal is, given a set S′ of inputs of interest, to design P and τ that satisfy two properties:

¹For example, given z = (image x, label y), we can set f(z; S) to be the probability that a model trained on S labels x as y. In this paper, we define f to be the classification margin on example z (see Appendix A).
²The budget α is typically a small value, e.g., 1%.

Effectiveness: Training on the backdoored dataset makes models vulnerable to the trigger function. In other words, Perf(S_C ∪ P → τ(S′)) should be large.

Imperceptibility: Training on the backdoored dataset should not significantly change the performance of the model on clean inputs. That is, Perf(S_C ∪ P → S′) ≈ Perf(S → S′).
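To make the formalism concrete, the following is a minimal sketch (not the authors' code) of a dirty-label trigger function τ and the resulting poisoned set P = τ(S_P). The 3×3 black corner patch, the target label 0, the image shapes, and the function names are illustrative assumptions.

```python
import numpy as np

def make_trigger(target_label: int, patch_size: int = 3):
    """Return a trigger function tau that pastes a black corner patch
    and relabels the example to `target_label` (dirty-label attack)."""
    def tau(example):
        x, y = example
        x = x.copy()
        x[:patch_size, :patch_size, :] = 0.0   # fixed patch in the top-left corner
        return x, target_label
    return tau

def backdoor_dataset(images, labels, alpha, tau, seed=0):
    """Poison a random alpha-fraction of (images, labels).

    Returns the backdoored dataset and the indices of the poisoned
    examples (the support of the backdoor feature phi_p)."""
    rng = np.random.default_rng(seed)
    n = len(images)
    poison_idx = rng.choice(n, size=int(alpha * n), replace=False)
    images, labels = images.copy(), labels.copy()
    for i in poison_idx:
        images[i], labels[i] = tau((images[i], labels[i]))
    return images, labels, poison_idx

# Example usage on random data standing in for CIFAR-10:
images = np.random.rand(100, 32, 32, 3).astype(np.float32)
labels = np.random.randint(0, 10, size=100)
tau = make_trigger(target_label=0)
bd_images, bd_labels, poison_idx = backdoor_dataset(images, labels, alpha=0.015, tau=tau)
```

Here `poison_idx` plays the role of the support of the backdoor feature introduced in Section 3.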
2.1. Is the Trigger a Backdoor or a Feature?

The prevailing perspective on backdoor attacks casts them as an instance of data poisoning, a concept with a rich history in robust statistics (Hampel et al., 2011). In data poisoning, the goal is to learn from a dataset where most of the points (a (1 − ε)-fraction of them) are drawn from a distribution D, and the remaining points (an ε-fraction of them) are chosen by an adversary. This parallel between the classical data poisoning setting and that of backdoor attacks is natural. After all, in a backdoor attack the adversary inserts the trigger in a small fraction of the data, which is otherwise primarily drawn from the data distribution D. However, is this the right parallel to guide us?

Recall that in the classical data poisoning setting, leveraging the structure of the distribution D is essential to obtaining any guarantees. For example, the developed algorithms often leverage strong explicit distributional assumptions, such as (sub-)Gaussianity of the data (Lugosi & Mendelson, 2019). In settings such as computer vision, however, it is unclear whether such structure is available. In fact, we lack almost any characterization of how image datasets are distributed. We thus argue that without assumptions on the structure of the input data, backdoor triggers are fundamentally indistinguishable from features already present in the dataset. We illustrate this point with the following experiments.

Figure 2: An adversary can leverage (a) plausible features or (b) naturally occurring features to mount a backdoor attack. (a) An adversary can craft a trigger that is indistinguishable from a natural feature and use it as a backdoor. Here, we backdoor the ImageNet training set by generating (using 3DB (Leclerc et al., 2021)) images of hats and pasting them onto varying numbers of cat images. At inference time, we can induce a "cat" classification by inserting a hat onto images from other classes. (b) Without changing the training dataset at all, adversaries can exploit patterns that act as natural backdoors. For example, the nature of the "tennis ball" class in ImageNet makes it so that an attacker can induce a "tennis ball" classification with just a small test-time perturbation. By most definitions, therefore, this tennis ball would constitute a backdoor attack.

Backdoor attacks can look like plausible features. It turns out that one can mount a backdoor attack using features that are already present (but rare) in the dataset. Specifically, in Figure 2a, we demonstrate how to execute a backdoor attack on an ImageNet classifier using hats in place of a fixed (artificial) trigger pattern. The resulting dataset is entirely plausible in that the backdoored images are (at least somewhat) realistic, and the corresponding labels are unchanged.³ At inference time, however, the hats act as an effective backdoor trigger: model predictions are skewed towards "cat" whenever a hat is added to the test sample. Should we then expect a backdoor detection algorithm to flag these (natural-looking) in-distribution examples?

³With some more careful photo editing, or using diffusion models (Song & Ermon, 2019; Ho et al., 2020), one could imagine embedding the hats in a way that makes the resulting examples appear more in-distribution and thus look unmodified even to a human.

Backdoor attacks can occur naturally. The adversary does not even need to modify the dataset: they can use features already present in the data to manipulate models. For example, a naturally occurring trigger for ImageNet is the presence of a tennis ball (Figure 2b). Similarly, Liu et al. (2019) show that on CIFAR-10, deer antlers are another natural backdoor, i.e., adding antlers to images from other classes makes models likely to classify those images as "deer".

These examples highlight that we need to make assumptions, as otherwise the task is fundamentally ill-defined. Indeed, trigger patterns for backdoor attacks are no more distinguishable than other features in the data. In particular, detecting a trigger pattern is no different from detecting hats, backgrounds, or any other spurious feature.

2.2. Implicit Assumptions in Existing Defenses

Since detecting backdoored examples without assumptions is an ill-defined task, all existing backdoor defenses must rely on either implicit or explicit assumptions on the structure of the data or the structure of the backdoor attack. To illustrate this point, we examine some of the existing backdoor defenses and identify the assumptions they make. As we will see, each of these assumptions also gives rise to a natural failure mode of the corresponding defense (when the assumption is not satisfied).

Latent separability. One line of work relies on the assumption that backdoored examples and unmodified ("clean") examples are separable in some latent space (Tran et al., 2018; Hayase et al., 2021; Qi et al., 2022; Chen et al., 2018; Huang et al., 2022).
The corresponding defenses thus perform variants of outlier detection in the latent representation space of a neural network (inspired by approaches from robust statistics). Such defenses are effective against a broad range of attacks, but an attacker aware of the type of defense can mount an adaptive attack that succeeds by violating the latent separability assumption (Qi et al., 2022).

Structure of the backdoor. Another line of work makes structural assumptions on the backdoor trigger (e.g., its shape) (Wang et al., 2019; Zeng et al., 2021; Liu et al., 2022; Yang et al., 2022). For example, Wang et al. (2019) assume that the trigger has small ℓ2 norm. Such defenses can be bypassed by an attacker who deploys a trigger that remains hard to discern while violating these assumptions. In fact, the hat trigger in Figure 2a is such a trigger.

Effect of the backdoor on model behavior. Another alternative is to assume that backdoored examples have a non-positive effect on the model's accuracy on the clean examples. This assumption has the advantage of not relying on the specifics of the trigger or its latent representation. In particular, a recent defense by Jin et al. (2021) makes this assumption explicit and achieves good results against a range of backdoor attacks. A downside of this approach is that subtle clean-label attacks, e.g., (Turner et al., 2019), can violate the incompatibility assumption and remain undetected.

Structure of the clean data. Finally, yet another line of work assumes that the (unmodified) dataset has naturally occurring features whose support, i.e., the number of examples containing the feature, is (a) larger than the adversary's attack budget α, and (b) sufficiently strong to enable good generalization. The resulting defenses are then able to broadly certify that no attack within the adversary's budget will be successful. For example, Levine & Feizi (2021) use an ensembling technique to produce a classifier that is certifiably invariant to changing a fixed number of training inputs, thus ensuring that no adversary can mount a successful backdoor attack. For real-world datasets, however, this assumption (i.e., that well-supported features alone suffice for good generalization) seems to be unrealistic. Indeed, many features that are important for generalization are only supported on a small number of examples (Feldman, 2019). Accordingly, the work of Levine & Feizi (2021) can only certify robustness against a limited number of backdoored examples while maintaining competitive accuracy.

3. An Alternative Assumption

The results of the previous section suggest that without additional assumptions, the delineation between a backdoor trigger and a naturally occurring feature is largely artificial. Indeed, in order to detect backdoor attacks, prior works make (sometimes implicit) assumptions about the nature of the corresponding features, or about the structure of the training data. Given that we cannot escape making assumptions, we ask: what is the right assumption to make?

In this paper, we assume that the backdoor trigger is the strongest feature in the data. Intuitively (and unlike the assumptions discussed in Section 2.2), this assumption is tied to the success of the backdoor attack. In particular, if a backdoor attack violates this assumption, there must exist another feature in the dataset that would itself serve as a more effective backdoor trigger. As a result, there would be no reason for a defense to identify the former over the latter.
In the remainder of this section, we make this assumption formal. We begin by providing a definition of a feature, along with a definition of the support of any feature. Using these definitions, we can formally state our goal as identifying all the training examples that provide support for the feature corresponding to the backdoor attack. This involves proposing a definition of feature strength, one that is directly tied to the effectiveness of the backdoor attack. We conclude by precisely stating our assumption that the backdoor trigger is the strongest feature in the training dataset.

Setup. For a task with example space Z = X × Y (e.g., the space of image-label pairs), we define a feature as a function ϕ : X → {0, 1}. For example, a feature ϕ_ears might map an image x ∈ X to whether the image contains cat ears. Note that by this definition, every backdoor attack (as formalized in Section 2) corresponds to a feature ϕ_p that detects the corresponding trigger transformation τ (that is, ϕ_p outputs 1 on inputs in the range of τ, and 0 on all other inputs). For a fixed training set S ⊆ Z^n, we can describe a feature ϕ by its support, which we define as the subset of training inputs that activate the corresponding function:

Definition 1 (Feature support). Let ϕ : X → {0, 1} be a feature (i.e., a map from the example space X to a Boolean value) and let S ⊆ Z^n be a training set of n examples. We define the support of the feature ϕ as

    Φ(S) = supp_ϕ(S) = {z = (x, y) ∈ S | ϕ(x) = 1},

i.e., the subset of S where the feature ϕ is present.

Observe that in the case of a backdoor attack using a backdoor trigger ϕ_p, the corresponding feature support Φ_p(S) is the set of training examples that contain the trigger.

Characterizing feature strength. Recall that our goal in this section is to place a (formal) assumption on a backdoor attack as corresponding to the strongest feature in a dataset. To accomplish this goal, we first need a way to quantify the strength of a given feature ϕ. Intuitively, we would like to call a feature strong if adding a single example containing that feature to the training set significantly changes the resulting model. (That is, if the counterfactual value of examples containing feature ϕ is high.) To this end, fix a distribution D_S over subsets of the training set S. For any feature ϕ and natural number k, let the k-output of ϕ be the expected model output (over random draws of the training set) on inputs with feature ϕ, conditioned on having k inputs with feature ϕ in the training set:

Definition 2 (Output function of a feature ϕ). For a feature ϕ and a distribution D_S over subsets of the training set S, we define the feature output function g_ϕ as the function that maps any integer k to the expected model output on examples with the feature ϕ when training on exactly k training inputs with that feature, i.e.,

    g_ϕ(k) = E_{z ∼ Φ(S)} [ E_{S′ ∼ D_S} [ f(z; S′) | |Φ(S′)| = k, z ∉ S′ ] ],    (1)

where z ∼ Φ(S) denotes a random sample from the support Φ(S) of the feature ϕ in the set S.

Intuitively, the feature output function g_ϕ(k) should grow quickly, as a function of k, for strong features and slowly for weak features. For example, adding an image of a rare dog breed to the training set will rapidly improve accuracy on that dog breed, whereas adding images with a weak feature like sky or grass will impact the accuracy of the model much less. In the context of backdoor attacks, the effectiveness property (Section 2) implies that g_{ϕ_p}(|P|) − g_{ϕ_p}(0) is large, where ϕ_p is the backdoor trigger feature.
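The k-output in Definition 2 can be estimated by straightforward Monte Carlo averaging over trained models. Below is a rough sketch under the assumption that one has, for T trained models, a boolean `masks` array recording which training examples were in each subset S′ and an `outputs` array of model outputs f(z; S′); both arrays and the helper name are hypothetical, not part of the paper's released tooling.

```python
import numpy as np

def estimate_k_output(masks, outputs, support, k):
    """Monte Carlo estimate of g_phi(k) (Definition 2, Equation (1)).

    masks:   (T, n) boolean array; masks[t, i] is True iff training example i
             was in the t-th sampled subset S'_t.
    outputs: (T, n) array; outputs[t, z] is the model output f(z; S'_t)
             (e.g., the correct-class margin) of the t-th model on example z.
    support: indices of the examples in Phi(S), i.e., those containing phi.
    k:       required number of support examples in the training subset.
    """
    support = np.asarray(support)
    # subsets S'_t that contain exactly k examples with the feature
    has_k = masks[:, support].sum(axis=1) == k
    vals = []
    for z in support:
        # additionally condition on z itself being held out of the subset
        sel = has_k & ~masks[:, z]
        if sel.any():
            vals.append(outputs[sel, z].mean())
    return float(np.mean(vals)) if vals else float("nan")

# e.g., with T = 100,000 trained models:
# g_40 = estimate_k_output(masks, outputs, poison_idx, k=40)
```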
Motivated by this observation, we define the strength of a feature ϕ as the rate of change of the corresponding output function.

Definition 3 (Strength of a feature ϕ). We define the k-strength of a feature ϕ as the function

    s_ϕ(k) = g_ϕ(k + 1) − g_ϕ(k).    (2)

Note that we can extend Definition 2 and Definition 3 to individual examples too. Specifically, for a feature ϕ we can define the model k-output at an example z as

    g_ϕ(z, k) = E_{S′ ∼ D_S : z ∉ S′} [ f(z; S′) | |Φ(S′)| = k ].

Similarly, we can define the k-strength of a feature ϕ at an example z as

    s_ϕ(z, k) = g_ϕ(z, k + 1) − g_ϕ(z, k).

To provide intuition for Definitions 2 and 3, we instantiate both of them in the context of the trigger feature in a very simple backdoor attack. Specifically, we alter the CIFAR-10 training set, planting a small red square in a random 1% of the training examples and changing their label to class 0 (so that at inference time, we can add the red square to an input image and make its predicted class be 0). In this poisoned dataset, the backdoor feature ϕ_p is a detector of the presence of a red square, and the support Φ_p comprises the randomly selected 1% of altered images (i.e., the backdoored images). We train 100,000 models on random 50% fractions of this poisoned dataset, and use them to estimate the k-output and k-strength of the backdoor feature. Specifically, for a variety of examples z ∈ S, we (a) find the models whose training sets had exactly k backdoored images and did not contain z, and (b) average the model output on z over these models.

Figure 3: Backdoored CIFAR-10 examples. Each orange (resp. blue) line corresponds to a poisoned (resp. clean) example. The x-value represents the number of backdoored examples present in the training set, while the y-value represents the model output (average margin) at that specific example. The rate of change of the model output represents the feature strength s_{ϕ_p}(k). We observe that the model output of backdoored images (orange lines) increases as more backdoored examples are included in the training set. In contrast, the model output for clean images (blue lines) is not affected by the number of poisoned training examples.

In Figure 3, we plot the resulting model output for examples z ∈ Φ_p(S) that have the feature ϕ_p (orange lines), and also for examples z ∉ Φ_p(S) that do not contain the backdoor feature (blue lines). Note that by Definition 2, the average k-output of the backdoor feature is the average of the orange lines, and by Definition 3, the average k-strength of the backdoor feature is the average (instantaneous) slope of the orange lines. We observe that for the poisoned examples, the k-strength is consistently positive (i.e., the output monotonically increases). This observation motivates our assumption about the strength of backdoor trigger features:

Assumption 1. Let ϕ_p be the backdoor feature, let Φ_p(S) be its support (i.e., the backdoored training examples), and let p := |Φ_p(S)|. Then, for some δ > 0, α ∈ (0, 1), and all other features ϕ with |Φ(S)| = p, we assume that

    s_{ϕ_p}(α · p) ≥ δ + s_ϕ(α · p).
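Continuing the hypothetical `masks`/`outputs` arrays and the `estimate_k_output` helper from the sketch above, the k-strength of Definition 3 is just a finite difference of the estimated k-output; this is essentially how the slopes summarized in Figure 3 can be read off.

```python
def estimate_k_strength(masks, outputs, support, k):
    """Estimate s_phi(k) = g_phi(k + 1) - g_phi(k) (Definition 3)."""
    return (estimate_k_output(masks, outputs, support, k + 1)
            - estimate_k_output(masks, outputs, support, k))

# Figure-3-style check: the backdoor feature should have consistently positive
# k-strength on its own support, while a random "clean" subset of the same
# size should not.
# for k in range(24, 52, 4):
#     print(k, estimate_k_strength(masks, outputs, poison_idx, k))
```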
Justifying the assumption. As we already discussed, Assumption 1 has the advantage of being directly tied to the effectiveness of the backdoor attack. In particular, we know that in the absence of backdoored training examples, the model should do poorly on inputs with the backdoor trigger (otherwise, we would consider the model to have already been compromised). Thus, g_{ϕ_p}(0) is small. On the other hand, for the backdoor attack to be effective, we must have that g_{ϕ_p}(p) is large, i.e., models trained on the backdoored training set should perform well on backdoored inputs. The mean value theorem⁴ thus implies that there must be some point 0 ≤ k ≤ p at which s_{ϕ_p}(k) is large.

⁴Informally, the mean value theorem says that for any continuous function f and any interval [a, b], there must exist c ∈ [a, b] such that the rate of change of f at c is equal to (f(b) − f(a)) / (b − a).

4. Detecting Backdoored Examples

The perspective from the previous sections suggests that we need to be able to analyze the strength of features present in a dataset to understand the effect of these features on a model's predictions. In particular, such an analysis would allow us to translate Assumption 1 into an algorithm for detecting backdoored training examples. Specifically, we would be able to estimate the feature strength s_ϕ(k) for a given feature ϕ. If we had a specific feature ϕ in mind, we could compute the feature strength s_ϕ(k) using Equations (1) and (2) directly. In our case, however, identifying the feature of interest (i.e., the backdoor feature) is essentially our goal. To this end, in this section we first show how to estimate the strength of all viable features ϕ simultaneously. We then demonstrate how we can leverage this estimate to detect the strongest one among them.

Our key tool here will be the datamodeling framework (Ilyas et al., 2022). In particular, Ilyas et al. (2022) have shown that, for every example z, and for a model output function f corresponding to training a deep neural network and evaluating it on that example, there exists a weight vector w_z ∈ R^{|S|} such that

    E[f(z; S′)] ≈ 1_{S′}^⊤ w_z    for subsets S′ ∼ D_S,

where 1_{S′} ∈ {0, 1}^{|S|} is the indicator vector of S′.⁵ In other words, we can approximate the specific outcome of training a deep neural network on a given subset S′ ⊆ S as a linear function of the presence of each training data example. As the ability of the datamodeling framework to capture the model output function will be critical to our method, we state it as an explicit assumption.

⁵The indicator vector 1_{S′} takes a value of 1 at index i if training example z_i ∈ S′, and 0 otherwise.

Assumption 2 (Datamodel accuracy). For any example z with a corresponding datamodel weight vector w_z, we have that

    E_{S′ ∼ D_S} [ ( E[f(z; S′)] − 1_{S′}^⊤ w_z )^2 ] ≤ ε,    (3)

where ε > 0 represents a bound on the error of estimating the model output function using datamodels.

Assumption 2 essentially guarantees that datamodels provide an accurate estimate of the model output function for any example z and any random subset S′ ∼ D_S. Moreover, we can in fact verify this assumption by sampling sets S′ and computing the error in Assumption 2 directly (replacing the inner expectation with an empirical average). It turns out that this property alone (captured as a formal lemma below) suffices to estimate the feature strength s_ϕ(k) of any feature ϕ.

Lemma 1. For a feature ϕ, let 1_{ϕ(S)} be the indicator vector of its support Φ(S), let 1_n be the n-dimensional vector of ones, and let h : R^n → R^n be defined as

    h(v) = v / ‖v‖₁ − (1_n − v) / ‖1_n − v‖₁.

Then, under Assumption 2, there exists C > 0 such that

    | s_ϕ(α · |Φ(S)|) − (1/|Φ(S)|) Σ_{z ∈ Φ(S)} w_z^⊤ h(1_{ϕ(S)}) | ≤ C ε^{1/2} n^{1/4},    (4)

where ε is as defined in Assumption 2.

So, Lemma 1 provides a closed-form expression, involving only the datamodel weight vectors {w_z}, for the (approximate) feature strength s_ϕ(k) of any feature ϕ. We provide a proof of this lemma in Appendix B.
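A small sketch of the closed-form estimate in Lemma 1. It assumes the convention used later in the paper, in which the columns of the datamodel matrix W are the weight vectors w_z; the function names and the matrix layout are illustrative, not the authors' released interface.

```python
import numpy as np

def h(v):
    """h(v) = v/||v||_1 - (1 - v)/||1 - v||_1 for a 0/1 indicator vector v."""
    v = v.astype(np.float64)
    return v / v.sum() - (1.0 - v) / (1.0 - v).sum()

def datamodel_strength(W, support_indicator):
    """Estimate s_phi(alpha * |Phi(S)|) via Lemma 1 (Equation (4)).

    W:                 (n, n) array whose column z is the datamodel weight
                       vector w_z (W[i, z] is the weight of training example
                       i in the datamodel for example z).
    support_indicator: (n,) 0/1 vector 1_phi(S) marking the feature's support.
    """
    hv = h(support_indicator)
    per_example = hv @ W                        # entry z equals w_z^T h(1_phi(S))
    mask = support_indicator.astype(bool)
    return per_example[mask].mean(), per_example
```

The per-example vector h(1_{ϕ_p(S)})^⊤ W computed here is exactly the quantity whose correlation with the poison indicator is evaluated in Section 5.1.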
4.1. Poisoned Examples as a Maximum-Sum Submatrix

In the previous section, we showed how we can leverage datamodels to estimate any given feature's strength. In this section, we combine Lemma 1 and Assumption 1 (i.e., that the backdoor trigger constitutes the strongest feature in the dataset) into an algorithm that provably finds backdoored training examples (provided that Assumptions 1 and 2 hold).

To this end, recall that n = |S| and p = |Φ_p(S)|. Assumption 1 then implies that s_{ϕ_p}(α · p) (i.e., the strength of the backdoor feature ϕ_p) is large. So, guided by Lemma 1, we consider the following optimization problem:

    arg max_{v ∈ {0,1}^n} h(v)^⊤ W v    s.t. ‖v‖₁ = p,    (5)

where h is as in Lemma 1 and W ∈ R^{n×n} is the matrix whose columns are the datamodel weight vectors {w_z}. The following lemma (proved in Appendix C) shows that under Assumption 1, the solution to (5) is the indicator vector of the backdoored examples.

Lemma 2. Suppose Assumption 1 holds for some δ and Lemma 1 holds for some C. Then, if δ > 2p C ε^{1/2} n^{1/4}, the unique maximizer of (5) is the vector 1_{ϕ_p(S)}, i.e., the indicator of the backdoored examples, where ε is as in Assumption 2.

Now, the fact that for v ∈ {0, 1}^n we have 1_n^⊤ W v = v^⊤ diag(1_n^⊤ W) v allows us to express (5) as a submatrix-sum maximization problem. In particular, we have that

    arg max_{v ∈ {0,1}^n : ‖v‖₁ = p} [ (1/p) v − (1/(n − p)) (1_n − v) ]^⊤ W v
        = arg max_{v ∈ {0,1}^n : ‖v‖₁ = p} v^⊤ [ W − (p/n) · diag(1_n^⊤ W) ] v.

4.2. Detecting Backdoored Examples

The formulation presented in (5) is difficult to solve directly, for multiple reasons. First, the optimization problem requires knowledge of the number of poisoned examples |Φ_p(S)|, which is unknown in practice. Second, even if we did know the number of poisoned examples, the problem is still NP-hard in general (Branders et al., 2017). In fact, even after linearizing (5), the commercial-grade mixed-integer linear program solver Gurobi (Gurobi Optimization, LLC, 2021) takes several days per problem instance, due to the sheer number of optimization variables.

We thus resort to a different approximation. For each k in a pre-defined set of candidate sizes K, we set p equal to k in (5). We then solve the resulting maximization problem using a greedy local search algorithm inspired by the Kernighan-Lin heuristic for graph partitioning (Kernighan & Lin, 1970). That is, starting from a random assignment for v ∈ {0, 1}^n, the algorithm considers all pairs of indices i, j ∈ [n] such that v_i ≠ v_j and such that swapping the values of v_i and v_j would improve the submatrix-sum objective. The algorithm greedily selects the pair that would most improve the objective, and terminates when no such pair exists. We run this local search algorithm T = 1000 times for each value of k, and collect the candidate solutions {v^{k,l} : k ∈ K, l ∈ [T]}.

Rather than using any one of these solutions, we combine them to yield a final score. We define the score of example z_i as the sum, over candidate sizes and restarts, of the number of times it was included in a local-search solution:

    s_i = Σ_{k ∈ K} Σ_{l=1}^{T} v_i^{k,l}.    (6)

Intuitively, we expect that backdoored training examples will end up in many of the greedy local search solutions (due to Assumption 1), and thus have a high score s_i. We translate the scores (6) into a concrete defense by flagging (and removing) the examples with the highest score.
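A compact sketch of the greedy swap heuristic and score aggregation described above. For clarity, it recomputes the objective h(v)^⊤ W v from scratch for every candidate swap, which is far slower than an incremental implementation (and than what would be needed for T = 1000 restarts on CIFAR-10-sized W); the candidate sizes and restart count in the usage comment are illustrative only.

```python
import numpy as np

def objective(W, v):
    """Objective of (5): h(v)^T W v for a 0/1 vector v."""
    hv = v / v.sum() - (1 - v) / (1 - v).sum()
    return hv @ W @ v

def local_search(W, k, rng):
    """Greedy swap search for a size-k subset maximizing h(v)^T W v."""
    n = W.shape[0]
    v = np.zeros(n)
    v[rng.choice(n, size=k, replace=False)] = 1.0
    best = objective(W, v)
    improved = True
    while improved:
        improved = False
        ones, zeros = np.flatnonzero(v == 1), np.flatnonzero(v == 0)
        best_swap = None
        for i in ones:
            for j in zeros:
                v[i], v[j] = 0.0, 1.0           # tentatively swap i out, j in
                val = objective(W, v)
                v[i], v[j] = 1.0, 0.0           # undo the swap
                if val > best:
                    best, best_swap, improved = val, (i, j), True
        if best_swap is not None:               # apply the most improving swap
            i, j = best_swap
            v[i], v[j] = 0.0, 1.0
    return v

def backdoor_scores(W, sizes, restarts, seed=0):
    """Aggregate how often each example lands in a local-search solution (Eq. 6)."""
    rng = np.random.default_rng(seed)
    scores = np.zeros(W.shape[0])
    for k in sizes:
        for _ in range(restarts):
            scores += local_search(W, k, rng)
    return scores

# scores = backdoor_scores(W, sizes=[10, 20, 40, 80], restarts=10)
# flagged = np.argsort(scores)[::-1][: int(0.10 * len(scores))]   # remove top-scored 10%
```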
5. Experiments

In the previous section, we developed an algorithm that provably detects backdoored examples in a dataset whenever Assumptions 1 and 2 hold. We now consider several settings and two common types of backdoor attacks: dirty-label attacks (Gu et al., 2017) and clean-label attacks (Turner et al., 2019). For each setting, we verify whether our assumptions hold, and then validate the effectiveness of our proposed detection algorithm.

Experimental setup. In Table 1, we present a summary of our experiments. For all of these experiments, we use the CIFAR-10 dataset (Krizhevsky, 2009) and the ResNet-9 architecture (He et al., 2015), and compute the datamodels using the framework presented in (Ilyas et al., 2022). Specifically, for each experiment and setup, we train a total of 100,000 models, each on a random subset containing 50%⁶ of CIFAR-10⁷, chosen uniformly at random.

⁶In Exp. 2 from Table 1, we train each model on 30% of the dataset. More details in Section 5.1.
⁷The chosen value of α from Assumption 1 is hence 1/2.

Table 1: A summary of the different backdoor attacks we consider. DL and CL stand for dirty- and clean-label attacks respectively; CL (no adv.) is the non-adversarial clean-label attack from Turner et al. (2019).

Exp.  Type           α      Clean Acc.  Backdoor Acc.
1     DL             1.5%   86.64       19.90
2     DL             5%     86.67       12.92
3     DL             1.5%   86.39       49.57
4     DL             5%     86.23       10.67
5     CL             1.5%   86.89       75.58
6     CL             5%     87.11       41.89
7     CL (no adv.)   5%     86.94       71.68
8     CL (no adv.)   10%    87.02       52.08

5.1. Verifying Our Assumptions

In Section 3, we presented two assumptions under which our proposed defense (provably) works. We now verify whether these assumptions hold in the experimental settings we consider, and then validate the effectiveness of our detection algorithm.

Datamodel accuracy. Lemma 1 states that datamodels are good approximators of a feature's strength (provided Assumption 2 holds). To validate whether this is the case, we first estimate the ground-truth feature strength s_{ϕ_p}(k) of the backdoor trigger feature ϕ_p as described in (1) and (2). More precisely, we train 100,000 models on random subsets of the training set, each containing 50% of the training examples. We record how the model outputs change with the inclusion/exclusion of the backdoored examples, compute the outputs for the backdoored examples as a function of the number of backdoored training examples, and estimate the feature strength s_ϕ(k) as the rate of change of these outputs. Afterwards, we estimate the feature strength using datamodels, as given by Equation (4). In particular, we compute the datamodels matrix W, the vector h(1_{ϕ_p(S)}) from Lemma 1, and their product h(1_{ϕ_p(S)})^⊤ W ∈ R^{|S|}. Each entry of this product is an estimate of the backdoor feature strength at a given training example. As Figure 4 shows, datamodels are indeed good approximators of the backdoor trigger feature's strength.

Figure 4: Estimating feature strength using datamodels. Each orange (resp. blue) data point in the scatter plot represents a poisoned (resp. clean) training example. The x-value of each data point represents the feature strength estimated using datamodels (see Equation (4)), and the y-value represents the feature strength as estimated using Equation (2). We see a strong linear correlation between these two quantities for poisoned examples, which indicates that datamodels provide a good estimate of feature strength.
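One way to reproduce a Figure 4-style comparison, reusing the hypothetical `masks`/`outputs` arrays and the `h` helper from the earlier sketches: estimate the per-example ground-truth strength s_ϕ(z, k) by a finite difference of conditional means, and compare it against the datamodel estimate h(1_{ϕ_p(S)})^⊤ W. This is a sketch of the described procedure, not the authors' exact evaluation code.

```python
import numpy as np

def per_example_strength(masks, outputs, support, z, k):
    """Monte Carlo estimate of s_phi(z, k) = g_phi(z, k + 1) - g_phi(z, k).

    Assumes both poison counts k and k + 1 actually occur among the sampled
    subsets that exclude z; otherwise the estimate is NaN."""
    counts = masks[:, np.asarray(support)].sum(axis=1)
    vals = []
    for kk in (k, k + 1):
        sel = (counts == kk) & ~masks[:, z]
        vals.append(outputs[sel, z].mean() if sel.any() else np.nan)
    return vals[1] - vals[0]

# Ground truth vs. datamodel estimate (Figure 4 style):
# dm_estimate = h(poison_indicator) @ W               # one value per example
# gt = [per_example_strength(masks, outputs, poison_idx, z, k=250)
#       for z in range(W.shape[0])]
# np.corrcoef(dm_estimate, gt)
```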
Backdoor trigger as the strongest feature. Recall that Assumption 1 states that the backdoor trigger is the strongest among all the features present in the dataset. To validate this assumption in our settings, we leverage our approximation of feature strength using datamodels. Specifically, our results from Section 4 suggest that the obtained product should be highly correlated with the ground-truth backdoor trigger indicator vector 1_{ϕ_p(S)}. We thus measure this correlation by computing the area under the ROC curve (AUROC) between the product h(1_{ϕ_p(S)})^⊤ W and the indicator vector 1_{ϕ_p(S)}. As we can see in Table 2, the AUROC score is very high in seven out of the eight settings, which suggests that Assumption 1 indeed holds in these cases.

Table 2: AUROC of the backdoor feature strength against the backdoored-examples indicator vector for our setups from Table 1.

        E1    E2    E3    E4    E5    E6    E7    E8
AUROC   99.9  60.9  98.0  97.7  99.9  99.9  97.0  98.3

Interestingly, we observe a low AUROC in Exp. 2 from Table 1 (the one with a very large number of backdoored examples), which indicates that one of our assumptions does not hold in that case. To investigate the reason, we inspect the backdoor feature strength. Figure 5 shows that, for subsets of the training set containing 50% of the training examples, the model output does not change as the number of poisoned samples increases, i.e., for these subsets, the backdoor feature strength is essentially 0. Consequently, Assumption 1 does not hold. To fix this problem, we use smaller random subsets, i.e., ones containing 30% of the training examples, when estimating the datamodels. In this new setting, the backdoor feature strength is significantly higher, and the AUROC between the poison indicator vector and the feature strength estimate jumps to 99.34%. For the remainder of the paper, we use this setup for Exp. 2.

Figure 5: Model output for different numbers of backdoored training examples. Each orange (resp. blue) line corresponds to a poisoned (resp. clean) example. The x-value represents the number of backdoored examples present in the training set, while the y-value represents the model output (average margin) at that specific example. The rate of change of the model output represents the feature strength. We observe that for backdoored examples (orange lines) from Exp. 2 (see Table 1), the model output does not change as more training examples are poisoned. Consequently, the backdoor feature strength is 0.

5.2. Evaluating the Effectiveness of Our Defense

Evaluating our score. As a first step, we measure how well our scores predict the backdoored examples in our eight settings. Specifically, we compute our scores by running the local search algorithm from Section 4.2 on the datamodels matrix W, and then aggregating the results from the different runs. Following that, we check how well these scores correlate with the backdoored-examples indicator vector 1_{ϕ_p(S)}. As Table 3 shows, there is a high correlation between these two quantities in all setups (cf. Section 5.1). This high correlation suggests that our local search algorithm generates a score that is predictive of the backdoored examples.

Table 3: AUROC of our scores (see Section 4.2) against the backdoor indicator vector for our setups from Table 1.

        E1    E2     E3    E4    E5    E6    E7    E8
AUROC   94.3  92.25  74.4  80.2  93.4  93.2  91.1  95.5
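The AUROC values in Tables 2 and 3 can be computed with a standard routine; a minimal sketch using scikit-learn, where `poison_indicator` is the ground-truth 0/1 vector 1_{ϕ_p(S)} and `score` is either the datamodel-based strength estimate or the local-search score s_i (both names are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def detection_auroc(poison_indicator, score):
    """AUROC of a per-example score against the 0/1 poison indicator."""
    return roc_auc_score(np.asarray(poison_indicator), np.asarray(score))

# Table 2-style check: detection_auroc(poison_indicator, h(poison_indicator) @ W)
# Table 3-style check: detection_auroc(poison_indicator, local_search_scores)
```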
Evaluating the effectiveness of our proposed defense. Given that our scores are predictive of the backdoored-examples indicator vector, we expect that removing the examples with the highest scores will be an effective defense against the backdoor attack. To test this claim, for each of the backdoor attack settings, we train a model on the backdoored dataset and compute the accuracy of this model on (a) the clean validation set and (b) the backdoored validation set⁸. We then remove from the training set the examples corresponding to the top 10% of the scores⁹, train a new model on the resulting dataset, and check the performance of this new model on the clean and the fully-backdoored validation sets. We also compare our detection algorithm with several baselines, including Inverse Self-Paced Learning (ISPL) (Jin et al., 2021), Spectral Signatures (SS) (Tran et al., 2018), SPECTRE (Hayase et al., 2021), and Activation Clustering (AC) (Chen et al., 2018). Table 4 shows that there is no substantial drop in accuracy when evaluating the models trained on the curated training set.

⁸Obtained by adding the trigger to all images of the clean validation set.
⁹We remove 20% in Exp. 2 from Table 1 since the number of poisoned examples is larger.

Table 4: A summary of the model performances on the clean and poisoned validation sets after applying our method, as well as a number of baselines, in all the settings we consider. Each cell reports Clean / Poisoned accuracy. The high accuracy on both the clean and poisoned validation sets indicates the effectiveness of our defense against the backdoor attacks we consider.

Exp.  No Defense     AC             ISPL           SPECTRE        SS             Ours
1     86.64 / 19.90  86.76 / 19.68  86.13 / 86.15  86.71 / 20.17  85.52 / 30.99  85.05 / 85.06
2     86.67 / 12.92  85.41 / 12.93  85.88 / 85.82  -     / -      85.33 / 13.63  83.39 / 83.13
3     86.39 / 49.57  86.25 / 48.85  86.32 / 85.57  86.28 / 45.32  85.22 / 78.22  84.82 / 84.11
4     86.23 / 10.67  84.75 / 10.82  85.86 / 85.18  -     / -      84.85 / 13.33  84.64 / 83.72
5     86.89 / 75.58  86.73 / 82.83  86.04 / 85.89  86.82 / 80.65  85.67 / 85.41  83.82 / 83.72
6     87.11 / 41.89  86.85 / 51.05  86.18 / 86.11  86.97 / 51.18  85.68 / 85.60  84.88 / 84.79
7     87.02 / 71.68  86.90 / 73.28  86.50 / 82.31  86.72 / 76.97  85.70 / 82.70  84.19 / 84.02
8     86.94 / 52.08  86.81 / 56.78  86.04 / 71.27  86.63 / 52.27  85.87 / 71.93  84.81 / 84.66

6. Related Work

Developing backdoor attacks and defenses in the context of deep learning is a very active area of research (Gu et al., 2017; Tran et al., 2018; Chen et al., 2018; Turner et al., 2019; Saha et al., 2020; Shokri et al., 2020; Hayase et al., 2021; Qi et al., 2022; Goldblum et al., 2022; Goldwasser et al., 2022); see, e.g., (Li et al., 2022) for a survey. One popular approach to defending against backdoor attacks revolves around outlier detection in the latent space of neural networks (Tran et al., 2018; Chen et al., 2018; Hayase et al., 2021). Such defenses inherently fail against adaptive attacks that leave no trace in the latent space of backdoored models (Shokri et al., 2020).

Another line of work investigates certified defenses against backdoor attacks (Levine & Feizi, 2021; Wang et al., 2022). To accomplish that, the proposed methods provide certificates by training separate models on different partitions of the training set, and dropping the models trained on data containing backdoored examples. This approach, however, significantly degrades the accuracy of the trained model, and is only able to certify accuracy when the number of backdoored examples is very small.

A number of prior works explore the applicability of influence-based methods as defenses against different attacks in deep learning (Koh & Liang, 2017). To the best of our knowledge, only Lin et al. (2022) discuss using such methods for defending against backdoor attacks.
However, their defense requires knowledge of the attack parameters that are typically unknown. Closest to our work is that of Jin et al. (2021), who consider a defense based on model behavior rather than properties of any latent space. 7. Conclusion In this paper, we proposed a new perspective on backdoor attacks. Specifically, we argued that backdoor triggers are fundamentally indistinguishable from existing features in the dataset. Consequently, the task of detecting backdoored training examples becomes equivalent to that of detecting strong features. Based on this observation, we propose a primitive and a corresponding algorithm for identifying and removing backdoored examples. Through a wide range of backdoor attacks, we demonstrated the effectiveness of our approach. 8. Acknowledgments Work supported in part by the NSF grants CNS-1815221 and DMS-2134108, and Open Philanthropy. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001120C0015. We thank the MIT Supercloud cluster (Reuther et al., 2018) for providing computational resources that supported part of this work. Rethinking Backdoor Attacks Branders, V., Schaus, P., and Dupont, P. Mining a submatrix of maximal sum. In International Workshop on New Frontiers in Mining Complex Patterns (NFMCP), 2017. Carlini, N., Jagielski, M., Choquette-Choo, C. A., Paleka, D., Pearce, W., Anderson, H., Terzis, A., Thomas, K., and Tramèr, F. Poisoning web-scale training datasets is practical. In ar Xiv preprint ar Xiv:2302.10149, 2023. Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., Molloy, I., and Srivastava, B. Detecting backdoor attacks on deep neural networks by activation clustering. ar Xiv preprint ar Xiv:1811.03728, 2018. De Vries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. In ar Xiv preprint ar Xiv:1708.04552, 2017. Feldman, V. Does learning require memorization? a short tale about a long tail. In Symposium on Theory of Computing (STOC), 2019. Goldblum, M., Tsipras, D., Xie, C., Chen, X., Schwarzschild, A., Song, D., Madry, A., Li, B., and Goldstein, T. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022. Goldwasser, S., Kim, M. P., Vaikuntanathan, V., and Zamir, O. Planting undetectable backdoors in machine learning models. ar Xiv preprint ar Xiv:2204.06974, 2022. Gu, T., Dolan-Gavitt, B., and Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. ar Xiv preprint ar Xiv:1708.06733, 2017. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2021. URL https://www.gurobi.com. Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011. Hayase, J., Kong, W., Somani, R., and Oh, S. Spectre: defending against backdoor attacks using robust statistics. ar Xiv preprint ar Xiv:2104.11315, 2021. He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Neural Information Processing Systems (Neur IPS), 2020. Huang, K., Li, Y., Wu, B., Qin, Z., and Ren, K. Backdoor defense via decoupling the training process. In International Conference on Learning Representations (ICLR), 2022. Ilyas, A., Park, S. M., Engstrom, L., Leclerc, G., and Madry, A. 
Datamodels: Predicting predictions from training data. In International Conference on Machine Learning (ICML), 2022. Jia, J., Cao, X., and Gong, N. Z. Intrinsic certified robustness of bagging against data poisoning attacks. In AAAI, 2021. Jin, C., Sun, M., and Rinard, M. Provable guarantees against data poisoning using self-expansion and compatibility. 2021. Kernighan, B. W. and Lin, S. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 1970. Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, 2017. Krizhevsky, A. Learning multiple layers of features from tiny images. In Technical report, 2009. Leclerc, G., Salman, H., Ilyas, A., Vemprala, S., Engstrom, L., Vineet, V., Xiao, K., Zhang, P., Santurkar, S., Yang, G., et al. 3db: A framework for debugging computer vision models. In ar Xiv preprint ar Xiv:2106.03805, 2021. Leclerc, G., Ilyas, A., Engstrom, L., Park, S. M., Salman, H., and Madry, A. ffcv. https://github.com/ libffcv/ffcv/, 2022. Levine, A. and Feizi, S. Deep partition aggregation: Provable defenses against general poisoning attacks. In International Conference on Learning Representations, 2021. Li, Y., Jiang, Y., Li, Z., and Xia, S.-T. Backdoor learning: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022. Lin, J., Zhang, A., Lecuyer, M., Li, J., Panda, A., and Sen, S. Measuring the effect of training data on deep learning predictions via randomized experiments. ar Xiv preprint ar Xiv:2206.10013, 2022. Liu, T. Y., Yang, Y., and Mirzasoleiman, B. Friendly noise against adversarial noise: A powerful defense against data poisoning attacks. In ar Xiv preprint ar Xiv:2208.10224, 2022. Liu, Y., Lee, W.-C., Tao, G., Ma, S., Aafer, Y., and Zhang, X. Abs: Scanning neural networks for back-doors by artificial brain stimulation. In ACM SIGSAC Conference on Computer and Communications Security, 2019. Rethinking Backdoor Attacks Lugosi, G. and Mendelson, S. Sub-gaussian estimators of the mean of a random vector. The annals of statistics, 47 (2):783 794, 2019. Qi, X., Xie, T., Mahloujifar, S., and Mittal, P. Circumventing backdoor defenses that are based on latent separability. ar Xiv preprint ar Xiv:2205.13613, 2022. Reuther, A., Kepner, J., Byun, C., Samsi, S., Arcand, W., Bestor, D., Bergeron, B., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Milechin, L., Mullen, J., Prout, A., Rosa, A., Yee, C., and Michaleas, P. Interactive supercomputing on 40,000 cores for machine learning and data analysis. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1 6. IEEE, 2018. Saha, A., Subramanya, A., and Pirsiavash, H. Hidden trigger backdoor attacks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp. 11957 11965, 2020. Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., and Goldstein, T. Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems (Neur IPS), 2018. Shokri, R. et al. Bypassing backdoor detection algorithms in deep learning. In 2020 IEEE European Symposium on Security and Privacy (Euro S&P), pp. 175 183. IEEE, 2020. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. In Neural Information Processing Systems (Neur IPS), 2019. Tran, B., Li, J., and M adry, A. Spectral signatures in backdoor attacks. 
In Advances in Neural Information Processing Systems (NeurIPS), 2018.

Turner, A., Tsipras, D., and Madry, A. Label-consistent backdoor attacks. 2019.

Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., and Zhao, B. Y. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In Proceedings of the 40th IEEE Symposium on Security and Privacy, 2019.

Wang, W., Levine, A., and Feizi, S. Improved certified defenses against data poisoning with (deterministic) finite aggregation. In International Conference on Machine Learning, 2022.

Xie, C., Huang, K., Chen, P.-Y., and Li, B. DBA: Distributed backdoor attacks against federated learning. In International Conference on Learning Representations, 2020.

Yang, Y., Liu, T. Y., and Mirzasoleiman, B. Not all poisons are created equal: Robust training against data poisoning. In International Conference on Machine Learning, 2022.

Zeng, Y., Park, W., Mao, Z. M., and Jia, R. Rethinking the backdoor attacks' triggers: A frequency perspective. In International Conference on Computer Vision (ICCV), 2021.

A. Model Predictions Formulation Used in Our Paper

In this work, we use the margin function (defined below) as the model output function f(z; S).

Definition 4 (Margin function). For a subset S′ ⊆ S and a fixed example z = (x, y) ∈ X × Y, the margin function f(z; S′) is defined as

    f(z; S′) := the correct-class margin on z of a model trained on S′,

where the correct-class margin is the logit of the correct class minus the largest incorrect logit.

Intuitively, f(z; S′) maps an example z and any subset S′ ⊆ S of the training dataset to the correct-class margin on z after training (using any fixed learning algorithm) on S′. We focus on margins because of their (empirically observed) suitability for ordinary least squares, as observed in (Ilyas et al., 2022, Appendix C).
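A short sketch of the correct-class margin in Definition 4, assuming `logits` is a trained model's logit vector on example z:

```python
import numpy as np

def correct_class_margin(logits, label):
    """Logit of the correct class minus the largest incorrect logit."""
    logits = np.asarray(logits, dtype=np.float64)
    correct = logits[label]
    incorrect = np.delete(logits, label)
    return float(correct - incorrect.max())

# Example: correct_class_margin([2.0, 0.5, -1.0], label=0) == 1.5
```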
B. Proof of Lemma 1

Lemma 1. For a feature ϕ, let 1_{ϕ(S)} be the indicator vector of its support Φ(S), let 1_n be the n-dimensional vector of ones, and let h : R^n → R^n be defined as h(v) = v / ‖v‖₁ − (1_n − v) / ‖1_n − v‖₁. Then, under Assumption 2, there exists C > 0 such that

    | s_ϕ(α · |Φ(S)|) − (1/|Φ(S)|) Σ_{z ∈ Φ(S)} w_z^⊤ h(1_{ϕ(S)}) | ≤ C ε^{1/2} n^{1/4},    (4)

where ε is as defined in Assumption 2.

We decompose the proof into two parts. In the first part, we show that we can approximate the feature strength s_ϕ(k) using datamodel weights (Lemma 3). In the second part, we relate h to the expressions involving datamodel weights in Lemma 3 (Lemma 4).

Lemma 3. Let α ∈ (0, 1), and let S′ be a subset of S, sampled uniformly at random, such that |S′| = α · |S|. Let a := α · |Φ(S)|. Suppose α is such that c ≤ α ≤ 1 − c for some absolute constant c ∈ (0, 1). Then there exists a constant C > 0 (depending on c) such that

    | s_ϕ(a) − E_{z ∼ Φ(S)} [ E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = a + 1 ] − E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = a ] ] | ≤ C ε^{1/2} n^{1/4}.    (7)

Proof. Recall that the feature strength s_ϕ(k) is defined as

    s_ϕ(k) := g_ϕ(k + 1) − g_ϕ(k)    (8)
           = E_{z ∼ Φ(S)} [ E_{S′ ∼ D_S, S′ ∌ z}[ f(z; S′) | |Φ(S′)| = k + 1 ] − E_{S′ ∼ D_S, S′ ∌ z}[ f(z; S′) | |Φ(S′)| = k ] ].

For convenience, assume that a = α · |Φ(S)| is an integer. First, by the triangle inequality, it is enough to show that

    | E_{z ∼ Φ(S)} [ E[ f(z; S′) | |Φ(S′)| = a ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = a ] ] | ≤ (1/2) C ε^{1/2} n^{1/4}    (9)

and

    | E_{z ∼ Φ(S)} [ E[ f(z; S′) | |Φ(S′)| = a + 1 ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = a + 1 ] ] | ≤ (1/2) C ε^{1/2} n^{1/4}.    (10)

We address Equation (9); the bound for Equation (10) follows analogously. To address Equation (9), we show a stronger (per-example) statement:

    | E_{S′ ∼ D_S}[ f(z; S′) | |Φ(S′)| = a ] − E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = a ] | ≤ (1/2) C ε^{1/2} n^{1/4}.    (11)

Trivially, this implies that the expectation over z is also bounded from above by (1/2) C ε^{1/2} n^{1/4}. To show that Equation (11) holds, we first compute the probability that a random subset S′ contains exactly a poisoned samples, and then upper-bound the left-hand side of Equation (11) using this probability and the definition of conditional expectation.

By directly counting, we have that

    P_{S′ ∼ D_S}[ |Φ(S′)| = a ] = C(|Φ(S)|, a) · C(|S| − |Φ(S)|, α|S| − a) / C(|S|, α|S|),

where C(m, r) denotes the binomial coefficient. To ease notation, let n := |S| and p := |Φ(S)|, so that a = αp. Rewriting, we have

    P_{S′ ∼ D_S}[ |Φ(S′)| = αp ] = C(p, αp) · C(n − p, α(n − p)) / C(n, αn),
    P_{S′ ∼ D_S}[ |Φ(S′)| = αp + 1 ] = [ (1 − α)p / (αp + 1) ] · [ α(n − p) / ((1 − α)(n − p) + 1) ] · P_{S′ ∼ D_S}[ |Φ(S′)| = αp ].

Using that 1 ≤ αp and c ≤ α ≤ 1 − c, the ratio of the two probabilities is bounded below by a constant depending only on c; in particular,

    P_{S′ ∼ D_S}[ |Φ(S′)| = αp + 1 ] ≥ (c / (2(2 − c))) · P_{S′ ∼ D_S}[ |Φ(S′)| = αp ].

Now we bound P_{S′ ∼ D_S}[ |Φ(S′)| = αp ]. Using Stirling's approximation, we have

    P_{S′ ∼ D_S}[ |Φ(S′)| = αp ] ≳ sqrt( n / (p(n − p)) ) · 1 / sqrt( 2π α(1 − α) ) ≥ C₂ / √n

for some constant C₂ > 0. Now, from the triangle inequality, Jensen's inequality, and Markov's inequality, together with Assumption 2, we have for sufficiently large n that

    | E[ f(z; S′) | |Φ(S′)| = αp ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp ] |
        ≤ E[ | f(z; S′) − w_z^⊤ 1_{S′} | | |Φ(S′)| = αp ]
        ≤ sqrt( E[ ( f(z; S′) − w_z^⊤ 1_{S′} )² | |Φ(S′)| = αp ] )
        ≤ sqrt( ε / P_{S′ ∼ D_S}[ |Φ(S′)| = αp ] )
        ≤ (1/2) C ε^{1/2} n^{1/4},

for an appropriate choice of the constant C (depending only on c). The case of Equation (10) is analogous. □

Next, we show that w_z^⊤ h(1_{Φ(S)}) corresponds to the desired difference of conditional expectations. In this proof, we write h_ϕ = h(1_{Φ(S)}) for brevity.

Lemma 4. For every z ∈ S, we have

    E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| + 1 ] − E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| ] = w_z^⊤ h_ϕ.

Proof. Again, consider a single example z, and let n = |S| and p = |Φ(S)|. Then we can write

    E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp ] = Σ_{z′ ∈ S} P[ z′ ∈ S′ | |Φ(S′)| = αp ] · w_z[z′].

There are a total of C(p, αp) · C(n − p, α(n − p)) sets S′ satisfying |Φ(S′)| = αp. Among these, given that a sample z′ contains the feature ϕ, i.e., ϕ(z′) = 1, there are C(p − 1, αp − 1) · C(n − p, α(n − p)) random subsets containing z′. So for all z′ containing ϕ we have

    P[ z′ ∈ S′ | ϕ(z′) = 1, |Φ(S′)| = αp ] = αp / p.

Similarly, for all samples z′ that do not contain ϕ, i.e., ϕ(z′) = 0, we have that

    P[ z′ ∈ S′ | ϕ(z′) = 0, |Φ(S′)| = αp ] = α(n − p) / (n − p).

Thus, overall,

    E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp ] = (αp / p) · w_z^⊤ 1_{ϕ(S)} + ( α(n − p) / (n − p) ) · w_z^⊤ (1_n − 1_{ϕ(S)}).

Analogously,

    E_{S′ ∼ D_S}[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp + 1 ] = ( (αp + 1) / p ) · w_z^⊤ 1_{ϕ(S)} + ( (α(n − p) − 1) / (n − p) ) · w_z^⊤ (1_n − 1_{ϕ(S)}).

We then subtract the two expressions to get

    E[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp + 1 ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = αp ]
        = w_z^⊤ ( (1/p) 1_{ϕ(S)} − (1/(n − p)) (1_n − 1_{ϕ(S)}) ) = w_z^⊤ h(1_{ϕ(S)}).

Finally, we obtain the quantity appearing in Lemma 1 over all examples z ∈ Φ(S) by directly averaging. □

Proof of Lemma 1. The proof of Lemma 1 follows by combining the results of Lemma 3 and Lemma 4:

    | s_ϕ(α|Φ(S)|) − E_{z ∼ Φ(S)}[ w_z^⊤ h(1_{ϕ(S)}) ] |
        = | s_ϕ(α|Φ(S)|) − E_{z ∼ Φ(S)} [ E[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| + 1 ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| ] ] |    (by Lemma 4)
        ≤ | E_{z ∼ Φ(S)} [ E[ f(z; S′) | |Φ(S′)| = α|Φ(S)| + 1 ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| + 1 ] ] |
          + | E_{z ∼ Φ(S)} [ E[ f(z; S′) | |Φ(S′)| = α|Φ(S)| ] − E[ w_z^⊤ 1_{S′} | |Φ(S′)| = α|Φ(S)| ] ] |
        ≤ 2 · (1/2) C ε^{1/2} n^{1/4} = C ε^{1/2} n^{1/4}.    (by Lemma 3)

C. Proof of Lemma 2

Lemma 2. Suppose Assumption 1 holds for some δ and Lemma 1 holds for some C. Then, if δ > 2p C ε^{1/2} n^{1/4}, the unique maximizer of (5) is the vector 1_{ϕ_p(S)}, i.e., the indicator of the backdoored examples, where ε is as in Assumption 2.

Proof.
The result follows directly from Assumption 1 and Lemma 1. In particular, let ϕ_v be a feature whose corresponding support Φ_v(S) is of size p, and let v be its indicator vector. Then

    h(v)^⊤ W v = [ (1/p) v − (1/(n − p)) (1_n − v) ]^⊤ W v = [ h_v^⊤ w_{z_1}, h_v^⊤ w_{z_2}, ..., h_v^⊤ w_{z_n} ] v = Σ_{z ∈ Φ_v(S)} h_v^⊤ w_z.

First, from Lemma 1 (applied per-example), we have that

    Σ_{z ∈ Φ_v(S)} h_v^⊤ w_z ≤ Σ_{z ∈ Φ_v(S)} s_v(z, α · ‖v‖₁) + p C ε^{1/2} n^{1/4}.

Now let v_p be the indicator vector of the poisoned examples. We similarly have from Lemma 1 that

    Σ_{z ∈ Φ_p(S)} h_{v_p}^⊤ w_z ≥ Σ_{z ∈ Φ_p(S)} s_{v_p}(z, α · ‖v_p‖₁) − p C ε^{1/2} n^{1/4}.

Thus, for any v ≠ v_p with ‖v‖₁ = p, we have that

    h_{v_p}^⊤ W v_p − h_v^⊤ W v = Σ_{z ∈ Φ_p(S)} h_{v_p}^⊤ w_z − Σ_{z ∈ Φ_v(S)} h_v^⊤ w_z
        ≥ Σ_{z ∈ Φ_p(S)} s_{v_p}(z, α · ‖v_p‖₁) − Σ_{z ∈ Φ_v(S)} s_v(z, α · ‖v‖₁) − 2p C ε^{1/2} n^{1/4}.

We now use Assumption 1, which states that

    Σ_{z ∈ Φ_p(S)} s_{ϕ_p}(z, α · p) − Σ_{z ∈ Φ(S)} s_ϕ(z, α · p) ≥ δ.

Combining these two inequalities, we obtain

    h_{v_p}^⊤ W v_p − h_v^⊤ W v ≥ δ − 2p C ε^{1/2} n^{1/4} > 0.

This concludes the proof that the solution of the optimization problem in Equation (5) is the poison indicator vector v_p, as long as δ > 2p C ε^{1/2} n^{1/4}. □

D. Experimental Setup

Figure 6: We execute the poisoning attacks with three types of triggers: (a) one black pixel in the top-left corner (first two images), (b) a 3x3 black square in the top-left corner (third and fourth images), and (c) 3-way triggers adapted from (Xie et al., 2020) (last four images).

D.1. Backdoor Attacks

Dirty-label backdoor attacks. The most prominent type of backdoor attack is the dirty-label attack (Gu et al., 2017). During a dirty-label attack, the adversary inserts a trigger into a subset of the training set, then flips the label of the poisoned samples to a particular target class y_b. We mount four different dirty-label attacks by considering two different triggers and two different levels of poisoning (cf. Exp. 1 to 4 in Table 1).

Clean-label backdoor attacks. A more challenging attack is the clean-label attack (Shafahi et al., 2018; Turner et al., 2019)¹⁰, where the adversary avoids changing the label of the poisoned samples. To mount a successful clean-label attack, the adversary poisons samples from the target class only, hoping to create a strong correlation between the target class and the trigger. We perform two types of clean-label attacks. In the first type (Exp. 5 and 6 from Table 1), we perturb the image with an adversarial perturbation before inserting the trigger, as presented in (Turner et al., 2019). In the second type (Exp. 7 and 8 from Table 1), we do not add the adversarial perturbation; instead, we poison more samples to obtain an effective attack.

¹⁰We evaluate the clean-label attack as presented in (Turner et al., 2019).

Trigger. We conduct our experiments with two types of triggers. The first type is a fixed pattern inserted in the top-left corner of the image; the trigger is unchanged between train and test time. This type of trigger has been used in multiple works (Gu et al., 2017; Turner et al., 2019). The other type of trigger is an m-way trigger, with m = 3 (Xie et al., 2020). During training, one of three triggers is chosen at random for each image to be poisoned, and the chosen trigger is inserted into one of three locations in the image. At test time, all three triggers are inserted at the corresponding positions to reinforce the signal. We display in Figure 6 the triggers used to poison the dataset.
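The m-way trigger described above is easy to express in code. The sketch below is an illustrative reimplementation of the described behavior (one random pattern and location at train time, all of them at test time); the specific patterns, locations, and function name are assumptions rather than the exact triggers of Xie et al. (2020).

```python
import numpy as np

def m_way_trigger(x, patterns, locations, rng=None):
    """m-way trigger: at train time insert one randomly chosen pattern at its
    location; at test time (rng=None) insert all of them.

    patterns:  list of (h, w, c) arrays; locations: list of (row, col) offsets.
    """
    x = x.copy()
    if rng is not None:                        # train time: one random pattern
        chosen = [rng.integers(len(patterns))]
    else:                                      # test time: all patterns
        chosen = range(len(patterns))
    for idx in chosen:
        p, (r, c) = patterns[idx], locations[idx]
        x[r:r + p.shape[0], c:c + p.shape[1], :] = p
    return x

# Illustrative 3-way setup: three 3x3 black squares in different corners.
patterns = [np.zeros((3, 3, 3), dtype=np.float32)] * 3
locations = [(0, 0), (0, 29), (29, 0)]
rng = np.random.default_rng(0)
x_train_poisoned = m_way_trigger(np.random.rand(32, 32, 3).astype(np.float32),
                                 patterns, locations, rng=rng)
x_test_triggered = m_way_trigger(np.random.rand(32, 32, 3).astype(np.float32),
                                 patterns, locations)
```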
D.2. Training Setup

Training CIFAR models. In this paper, we train a large number of models on different subsets of CIFAR-10 in order to compute the datamodels. To this end, we use the ResNet-9 architecture (He et al., 2015), adapted from https://github.com/wbaek/torchskeleton/blob/master/bin/dawnbench/cifar10.py; this smaller ResNet variant is optimized for fast training.

Training details. We fix the training procedure for all our models and report the hyperparameters in Table 5; our implementation and configuration files will be available with our code. For dirty-label attacks, we use one data augmentation, Cutout (DeVries & Taylor, 2017), to improve model performance on CIFAR-10. Following (Turner et al., 2019), we do not use any data augmentation when performing clean-label attacks.

Performance. In order to train a large number of models, we use the FFCV library for efficient data loading (Leclerc et al., 2022). The resulting speedup allows us to train a model to convergence in 40 seconds, and 100k models per experiment on 16 V100 GPUs in roughly one day (we train 3 models in parallel on every V100).

Table 5: Hyperparameters used to train ResNet-9 on CIFAR-10.

  Optimizer   Epochs   Batch Size   Peak LR   Cyclic LR Peak Epoch   Momentum   Weight Decay
  SGD         24       1,024        0.5       5                      0.9        4e-5

Computing datamodels. We adopt the framework presented in (Ilyas et al., 2022) to compute the datamodels for each experiment. Specifically, we train 100k models on different random subsets containing 50% of the training set, and then compute the datamodels as described in (Ilyas et al., 2022).

Local Search. We approximate the solution of the problem outlined in (5) using the local search heuristic of Kernighan & Lin (1970). We iterate over ten sizes for the poison mask: {10, 20, 40, 80, 160, 320, 640, 1280, 2560, 5120}. For each size, we collect 1,000 different solutions by starting from different initializations.

D.3. Estimating Theoretical Quantities

Recall the average-margin definition presented in (1). In particular,

\[ g_\phi(k) = \mathbb{E}_{z\in\Phi(S)}\, \mathbb{E}_{S'\sim D_S}\big[ f(z; S') \mid |\Phi(S')| = k,\; z\notin S' \big], \]  (12)

where $S'$ is a subset of the training set, $f(z; S')$ is the margin on example $z$ of a model trained on $S'$, $\Phi(S')$ is the subset of $S'$ containing the poisoned feature, and $k$ is the number of poisoned samples. Estimating these average margins requires training a large number of models on different subsets and measuring, for every sample $z$ and every count $k$, the margins of the trained models. For the purpose of this paper, we leverage the datamodels computations to estimate these averages. In particular, to compute the datamodel weights we already train a large number of models on different subsets $S_1, S_2, \ldots, S_T$ of the training set $S$ (we refer the reader to (Ilyas et al., 2022) for more details). For every subset $S_i$, we record the number of poisoned samples it contains, and we estimate the average margin by averaging the margins over the subsets that contain exactly $k$ poisoned samples and hold out $z$:

\[ N_\phi(k) := \sum_{i=1}^{T} \mathbf{1}\{|\Phi(S_i)| = k\}\,\mathbf{1}\{z\notin S_i\}, \]  (13a)
\[ \mathbb{E}_{S'\sim D_S}\big[ f(z; S') \mid |\Phi(S')| = k,\; z\notin S' \big] \;\approx\; \frac{1}{N_\phi(k)}\sum_{i=1}^{T} f(z; S_i)\,\mathbf{1}\{|\Phi(S_i)| = k\}\,\mathbf{1}\{z\notin S_i\}, \]  (13b)
\[ g_\phi(k) \;\approx\; \frac{1}{|\Phi(S)|}\sum_{z\in\Phi(S)}\; \frac{1}{N_\phi(k)}\sum_{i=1}^{T} f(z; S_i)\,\mathbf{1}\{|\Phi(S_i)| = k\}\,\mathbf{1}\{z\notin S_i\}. \]  (13c)

By training 100k models on different subsets of the dataset, we obtain reasonable estimates of these marginal effects for every sample $z = (x, y)$. A sketch of this estimator is shown below.
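The sketch below illustrates the estimator in Equation (13). It is not our experiment code; it assumes the datamodels training runs have already produced a matrix of per-model margins, the corresponding subset-membership masks, and the indices of the poisoned examples.

```python
import numpy as np

def estimate_avg_margin(margins: np.ndarray, masks: np.ndarray,
                        poison_idx: np.ndarray, k: int) -> float:
    """Estimate g_phi(k) as in Eq. (13): average the margin of each poisoned example
    over the subsets S_i that contain exactly k poisoned examples and hold the example out.

    margins:    (T, N) array, margins[i, z] = f(z; S_i) for the i-th trained model
    masks:      (T, N) boolean array, masks[i, z] = (z in S_i)
    poison_idx: indices of Phi(S) in the training set
    """
    n_poisoned = masks[:, poison_idx].sum(axis=1)        # |Phi(S_i)| for each subset
    has_k = (n_poisoned == k)                            # 1{|Phi(S_i)| = k}
    per_example = []
    for z in poison_idx:
        keep = has_k & ~masks[:, z]                      # exactly k poisons, z held out
        if keep.any():                                   # N_phi(k) > 0, cf. Eq. (13a)
            per_example.append(margins[keep, z].mean())  # Eq. (13b)
    return float(np.mean(per_example))                   # Eq. (13c)
```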
E. Omitted Plots

E.1. Average Margin Plots

In this section, we show for all the experiments the plots of the average margin for clean and poisoned samples as a function of the number of poisoned samples in the dataset (cf. Fig. 3 in the main paper).

[Figure 7 panels: Average Margin vs. Number of Poisoned Examples, for each experiment.]

Figure 7: We plot, for all the experiments, the average margin of five clean samples (left) and five poisoned samples (right) as the number of poisoned samples in the training set increases. The average margin of clean samples (without the trigger) remains constant as more samples are poisoned. In contrast, the average margin of poisoned samples (with the trigger) increases with the number of poisoned samples in the dataset, confirming our assumptions.

E.2. Estimated Backdoor Feature Strength Plots

In this section, we show for all the experiments the plots of the estimated backdoor feature strength, and the approximation we obtain using datamodels (cf. Figure 4 in the main paper).

[Figure 8 panels: Feature Strength vs. Datamodels Estimate, for each experiment.]

Figure 8: We plot, for all the experiments, the estimated backdoor feature strength against the datamodels approximation presented in Equation (4). For poisoned samples (in red), we observe a good linear correlation between the strengths and the datamodels approximation. We observe no noticeable correlation for clean samples (in green).

E.3. Distribution of Datamodels Values

In this section, we plot for each experiment the distribution of the datamodels weights. Recall that the datamodel weight $w_z[i]$ represents the influence of the training sample $z_i$ on the sample $z$. We show below the distribution of the effect of 1) poisoned samples on poisoned samples, 2) poisoned samples on clean samples, and 3) clean samples on clean samples.

[Figure 9 panels: distribution of Datamodel Value for the Poison to Poison, Poison to Clean, and Clean to Clean groups, for each experiment.]

Figure 9: We plot the distribution of the datamodels weights for all the experiments. The effect of poisoned samples on other poisoned samples is generally higher than the effect of poisoned samples on clean samples, and higher than the effect of clean samples on each other.
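For concreteness, the sketch below shows one way to form the three groups plotted in Figure 9 from a datamodel weight matrix. The matrix layout ($W[z, i] = w_z[i]$, the influence of training example $i$ on example $z$) and the helper name are assumptions made for illustration, not our plotting code.

```python
import numpy as np

def group_datamodel_weights(W: np.ndarray, is_poisoned: np.ndarray) -> dict:
    """Split the entries of an (N, N) datamodel weight matrix W, with W[z, i] = w_z[i],
    into the three groups shown in Figure 9."""
    clean = ~is_poisoned
    return {
        "poison_to_poison": W[np.ix_(is_poisoned, is_poisoned)].ravel(),
        "poison_to_clean":  W[np.ix_(clean, is_poisoned)].ravel(),  # poisoned i acting on clean z
        "clean_to_clean":   W[np.ix_(clean, clean)].ravel(),
    }
```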
E.4. Attack Success Rate (ASR)

In the main paper, we presented our results by measuring the accuracy of a model on clean and poisoned validation sets. Another relevant metric is the Attack Success Rate (ASR), which measures the probability that the model predicts the target class after the trigger is inserted into an image (a sketch of this computation is given below Table 6). As we can see in Table 6, our defense leads to a low ASR in seven out of eight setups.

Table 6: We compare our method against multiple baselines in a wide range of experiments, reporting the ASR (in %). We observe that our algorithm leads to a low ASR in all of our settings. Refer to Table 1 for the full experiment parameters.

  Exp.   No Defense   AC      ISPL    SPECTRE   SS      Ours
  1      87.94        88.26   0.70    87.67     73.78   0.81
  2      96.38        96.32   0.67    -         95.40   1.44
  3      50.49        51.33   0.58    55.68     10.44   1.18
  4      99.21        99.02   0.75    -         95.85   2.30
  5      15.66        5.35    0.71    7.66      0.80    0.92
  6      58.57        45.44   0.66    46.78     0.67    0.77
  7      26.09        23.64   9.90    18.48     9.17    3.56
  8      50.62        44.82   26.07   44.14     24.72   3.42
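For reference, a minimal sketch of the ASR computation, under assumptions not fixed by the paper: a trained classifier exposing a `predict` method, numpy test arrays, the `apply_trigger` helper sketched in Appendix D.1, and the common convention of excluding test images that already belong to the target class.

```python
import numpy as np

def attack_success_rate(model, test_images: np.ndarray, test_labels: np.ndarray,
                        target_class: int, trigger_size: int = 3) -> float:
    """Fraction of triggered test images (excluding the target class) that the model
    classifies as the target class."""
    keep = test_labels != target_class
    triggered = apply_trigger(test_images[keep], trigger_size=trigger_size)
    preds = model.predict(triggered)                  # assumed classifier interface
    return float(np.mean(preds == target_class))
```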