# BagFlip: A Certified Defense against Data Poisoning

Yuhao Zhang, University of Wisconsin-Madison, yuhaoz@cs.wisc.edu
Aws Albarghouthi, University of Wisconsin-Madison, aws@cs.wisc.edu
Loris D'Antoni, University of Wisconsin-Madison, loris@cs.wisc.edu

Machine learning models are vulnerable to data-poisoning attacks, in which an attacker maliciously modifies the training set to change the prediction of a learned model. In a trigger-less attack, the attacker can modify the training set but not the test inputs, while in a backdoor attack the attacker can also modify test inputs. Existing model-agnostic defense approaches either cannot handle backdoor attacks or do not provide effective certificates (i.e., a proof of a defense). We present BagFlip, a model-agnostic certified approach that can effectively defend against both trigger-less and backdoor attacks. We evaluate BagFlip on image classification and malware detection datasets. BagFlip is as effective as or more effective than the state-of-the-art approaches for trigger-less attacks and more effective than the state-of-the-art approaches for backdoor attacks.

## 1 Introduction

Recent works have shown that machine learning models are vulnerable to data-poisoning attacks, where attackers maliciously modify the training set to influence the prediction of the victim model as they desire. In a trigger-less attack [42, 31, 24, 3, 1, 11], the attacker can modify the training set but not the test inputs, while in a backdoor attack [29, 34, 41, 30, 5, 21, 25, 27] the attacker can also modify test inputs. Effective attack approaches have been proposed for various domains such as image recognition [12], sentiment analysis [25], and malware detection [30].

Consider the malware detection setting. An attacker can modify the training data by adding a special signature, e.g., a line of code, to a set of benign programs. The idea is to make the learned model correlate the presence of the signature with benign programs. Then, the attacker can sneak a piece of malware past the model by including the signature, fooling the model into thinking it is a benign program. Indeed, it has been shown that if the model is trained on a dataset with only a few poisoned examples, a backdoor signature can be installed in the learned model [30, 27]. Thus, data-poisoning attacks are of great concern to the safety and security of machine learning models and systems, particularly as training data is gathered from different sources, e.g., via web scraping.

Ideally, a defense against data poisoning should fulfill the following desiderata: (1) construct effective certificates (proofs) of the defense, (2) defend against both trigger-less and backdoor attacks, and (3) be model-agnostic. It is quite challenging to fulfill all three desiderata; indeed, existing techniques are forced to make tradeoffs. For instance, empirical approaches [10, 40, 20, 26, 36, 13, 33, 9] cannot construct certificates and are likely to be bypassed by new attack approaches [32, 37, 16]. Some certified approaches [39, 35] provide ineffective certificates for both trigger-less and backdoor attacks. Some certified approaches [15, 19, 4, 28, 22, 38] provide effective certificates for trigger-less attacks but not backdoor attacks. Other certification approaches [14, 7, 23] are restricted to specific learning algorithms, e.g., decision trees.
This paper proposes BagFlip, a model-agnostic certified approach that uses randomized smoothing [6, 8, 18] to effectively defend against both trigger-less and backdoor attacks (Section 4). BagFlip uses a novel smoothing distribution that combines bagging of the training set with noising of the training data and the test input by randomly flipping features and labels. Although both bagging-based and noise-based approaches have been proposed independently in the literature, combining them makes it challenging to compute the certified radius, i.e., the amount of poisoning a learning algorithm can withstand without changing its prediction. To compute the certified radius precisely, we apply the Neyman-Pearson lemma to the sample space of BagFlip's smoothing distribution. This lemma requires partitioning the outcomes in the sample space into subspaces such that the likelihood ratio in each subspace is a constant. However, because our sample space is exponential in the number of features, we cannot naively apply the lemma. To address this problem, we exploit properties of our smoothing distribution to design an efficient algorithm that partitions the sample space into polynomially many subspaces (Section 5). Furthermore, we present a relaxation of the Neyman-Pearson lemma that further speeds up the computation (Section 6). Our evaluation against existing approaches shows that BagFlip is comparable to or more effective than them for trigger-less attacks and more effective for backdoor attacks (Section 7).

## 2 Related Work

Our model-agnostic approach BagFlip uses randomized smoothing to compute effective certificates for both trigger-less and backdoor attacks. BagFlip focuses on feature-and-label-flipping perturbation functions that modify training examples and test inputs. We discuss how existing approaches differ.

Some model-agnostic certified approaches can defend against both trigger-less and backdoor attacks but cannot construct effective certificates. Wang et al. [35] and Weber et al. [39] defended against feature-flipping poisoning attacks and l2-norm poisoning attacks, respectively. However, their certificates are practically ineffective due to the curse of dimensionality [17].

Some model-agnostic certified approaches construct effective certificates but cannot defend against backdoor attacks. Jia et al. [15] proposed to defend against general trigger-less attacks, i.e., attacks that add/delete/modify examples in the training dataset, by bootstrap aggregating (bagging). Chen et al. [4] extended bagging by designing other selection strategies, e.g., selecting without replacement and with a fixed probability. Rosenfeld et al. [28] defended against label-flipping attacks and instantiated their framework on linear classifiers. Differential privacy [22] can also provide probabilistic certificates for trigger-less attacks, but it cannot handle backdoor attacks, and its certificates are ineffective. Levine and Feizi [19] proposed deterministic partition aggregation (DPA) to defend against general trigger-less attacks by partitioning the training dataset using a secret hash function. Wang et al. [38] further improved DPA by introducing a spread stage.

Other certified approaches construct effective certificates but are not model-agnostic. Jia et al. [14] provided deterministic certificates only for nearest-neighbor classifiers (kNN/rNN), but for both trigger-less and backdoor attacks. Meyer et al. [23] and Drews et al.
[7] provided deterministic certificates only for decision trees, but against general trigger-less attacks.

## 3 Problem Definition

We take a holistic view of training and inference as a single deterministic algorithm A. Given a dataset D = {(x_1, y_1), ..., (x_n, y_n)} and a (test) input x, we write A(D, x) to denote the prediction of algorithm A on the input x after being trained on dataset D. We are interested in certifying that the algorithm will still behave well after training on a tampered dataset. Before describing what "well" means, we need to fix a perturbation space, i.e., the set of possible changes the attacker could make to the dataset. Given a pair (x, y), we write π(x, y) to denote the set of perturbed examples into which an attacker can transform the example (x, y). Given a dataset D and a radius r ≥ 0, we define the perturbation space as the set of datasets that can be obtained by modifying up to r examples in D using the perturbation π:

$$S^{\pi}_r(D) = \Big\{ \{(\tilde{x}_i, \tilde{y}_i)\}_i \;\Big|\; \forall i.\ (\tilde{x}_i, \tilde{y}_i) \in \pi(x_i, y_i),\ \sum_{i=1}^{n} \mathbb{1}_{(\tilde{x}_i,\tilde{y}_i) \neq (x_i,y_i)} \le r \Big\}$$

**Threat models.** We consider two attacks: one where an attacker can perturb only the training set (a trigger-less attack) and one where the attacker can perturb both the training set and the test input (a backdoor attack). We assume a backdoor attack scenario where the test-input perturbation is the same as the training one. If these two perturbation spaces can perturb examples differently, we can always over-approximate them by their union. In the following definitions, we assume that we are given a perturbation space π, a radius r ≥ 0, a benign training dataset D, and a test input x.

We say that algorithm A is robust to a trigger-less attack on the test input x if A yields the same prediction on the input x when trained on any perturbed dataset D̃ as when trained on the benign dataset D. Formally,

$$\forall \tilde{D} \in S^{\pi}_r(D).\quad A(\tilde{D}, x) = A(D, x) \quad (1)$$

We say that algorithm A is robust to a backdoor attack on the test input x if the algorithm trained on any perturbed dataset D̃ produces the prediction A(D, x) on any perturbed input x̃. Let π(x, y)↓1 denote the projection of the perturbation space π(x, y) onto the feature space (the attack can only backdoor the features of the test input). Robustness to a backdoor attack is defined as

$$\forall \tilde{D} \in S^{\pi}_r(D),\ \forall \tilde{x} \in \pi(x, y){\downarrow_1}.\quad A(\tilde{D}, \tilde{x}) = A(D, x) \quad (2)$$

Given a large enough radius r, an attacker can always change enough examples and succeed at breaking robustness for either kind of attack. Therefore, we focus on computing the maximal radius r for which we can prove that Eq 1 and Eq 2 hold for a given perturbation function π. We refer to this quantity as the certified radius. It is infeasible to prove Eq 1 and Eq 2 by enumerating all possible D̃ because S^π_r(D) can be astronomically large, e.g., |S^π_r(D)| > 10^30 when |D| = 1000 and r = 10.

**Defining the perturbation function.** We have not yet specified the perturbation function π, i.e., how the attacker can modify examples. In this paper, we focus on the following perturbation spaces. Given a bound s ≥ 0, a feature-and-label-flipping perturbation, FL_s, allows the attacker to modify the values of up to s features and the label of an example (x, y), where x ∈ [K]^d (i.e., x is a d-dimensional feature vector with each dimension taking one of the categories {0, 1, ..., K}). Formally,

$$FL_s(x, y) = \{(\tilde{x}, y') \mid \|x - \tilde{x}\|_0 + \mathbb{1}_{y' \neq y} \le s\}$$

There are two special cases of FL_s: a feature-flipping perturbation F_s and a label-flipping perturbation L. Given a bound s ≥ 0, F_s allows the attacker to modify the values of up to s features in the input x but not the label. L only allows the attacker to modify the label of a training example. Note that L cannot modify the test input's features, so it can only be used in trigger-less attacks.

**Example 3.1.** If D is a binary-classification image dataset, where each pixel is either black or white, then the perturbation function F_1 assumes the attacker can modify up to one pixel per image.
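To make these perturbation spaces concrete, the following is a minimal sketch (our own illustration, not code from the paper) of a membership check for FL_s over categorical feature vectors; F_s and L fall out as special cases.

```python
import numpy as np

def in_FL_s(x, y, x_tilde, y_tilde, s):
    """Is (x_tilde, y_tilde) in FL_s(x, y)?  The attacker may flip features
    and the label, as long as the total number of flips is at most s."""
    flips = int(np.sum(np.asarray(x) != np.asarray(x_tilde)))  # ||x - x_tilde||_0
    flips += int(y != y_tilde)                                  # a label change counts as one flip
    return flips <= s

def in_F_s(x, x_tilde, s):
    """F_s: up to s feature flips, label unchanged."""
    return int(np.sum(np.asarray(x) != np.asarray(x_tilde))) <= s

def in_L(x, x_tilde):
    """L: only the label may change, so the features must be identical."""
    return np.array_equal(np.asarray(x), np.asarray(x_tilde))
```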
The goal of this paper is to design a certifiable algorithm that can defend against trigger-less and backdoor attacks (Section 4) by computing the certified radius (Sections 5 and 6). Given a benign dataset D, our algorithm certifies that an attacker can perturb D by some amount (the certified radius) without changing the prediction. Symmetrically, if we suspect that D is poisoned, our algorithm certifies that even if an attacker had not poisoned D by up to the certified radius, the prediction would have been the same. The two views are equivalent, and we use the former in the paper.

## 4 BagFlip: Dual-Measure Randomized Smoothing

Our approach, which we call BagFlip, is a model-agnostic certification technique. Given a learning algorithm A, we want to automatically construct a new learning algorithm Ā with certified poisoning-robustness guarantees (Eqs. (1) and (2)). To do so, we adopt and extend the framework of randomized smoothing. Initially used for test-time robustness, randomized smoothing robustifies a function f by carefully constructing a noisy version f̄ and theoretically analyzing the guarantees of f̄. Our approach, BagFlip, constructs a noisy algorithm Ā by randomly perturbing the training set and the test input and invoking A on the result. Formally, if we have a set of output labels C, we define the smoothed learning algorithm Ā as follows:

$$\bar{A}(D, x) \triangleq \operatorname*{argmax}_{y \in C}\ \Pr_{(\bar{D}, \bar{x}) \sim \mu(D, x)}\big(A(\bar{D}, \bar{x}) = y\big) \quad (3)$$

where μ is a carefully designed probability distribution (the smoothing distribution) over the training set and the test input, and (D̄, x̄) are a dataset and a test input sampled from the smoothing distribution.

We present two key contributions that enable BagFlip to efficiently certify robustness up to large poisoning radii. First, we define a smoothing distribution μ that combines bagging [15] of the training set D and noising [35] of the training set and the test input x (done by randomly flipping features and labels). This combination, described below, allows us to defend against trigger-less and backdoor attacks. However, it also makes it challenging to compute the certified radius due to the combinatorial explosion of the sample space. Our second contribution is a partition strategy for the Neyman-Pearson lemma that results in an efficient certification algorithm (Section 5), as well as a relaxation of the Neyman-Pearson lemma that further speeds up certification (Section 6).

**Smoothing distribution.** We describe the smoothing distribution μ that defends against feature and label flipping. Given a dataset D = {(x_i, y_i)} and a test input x, sampling from μ(D, x) generates a random dataset and test input in the following way:

1. Uniformly select k examples from D with replacement and record their indices as w_1, ..., w_k.
2. Modify each selected example (x_{w_i}, y_{w_i}) and the test input x to (x'_{w_i}, y'_{w_i}) and x', respectively, as follows: for each feature, with probability 1 − ρ, uniformly change its value to one of the other categories in {0, ..., K}. Randomly modify the labels y_{w_i} in the same way.

In other words, each feature keeps its value with probability ρ and is flipped to each particular other value of its domain with probability γ ≜ (1 − ρ)/K, where ρ ∈ [0, 1] is the parameter controlling the noise level. For F_s (resp. L), we simply do not modify the labels (resp. features) of the k selected examples.
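A minimal sketch of sampling from μ(D, x) under this encoding (features in {0, ..., K}); this is our own illustration of the two steps, not the released implementation. For F_s one would skip the label noise, and for L the feature noise.

```python
import numpy as np

def sample_mu(X, Y, x_test, k, rho, K, num_classes, rng):
    """Draw one (D_bar, x_bar) ~ mu(D, x):
    (1) bag k training examples uniformly with replacement;
    (2) keep each feature of the selected examples and of x_test with
        probability rho, otherwise flip it to a uniformly chosen *other*
        category; noise the selected labels in the same way."""
    def noise_features(v):
        flip = rng.random(v.shape) < 1 - rho              # which coordinates to flip
        shift = rng.integers(1, K + 1, size=v.shape)      # nonzero shift = another category
        return np.where(flip, (v + shift) % (K + 1), v)

    def noise_label(y):
        if rng.random() < 1 - rho:
            return (y + rng.integers(1, num_classes)) % num_classes
        return y

    idx = rng.integers(0, len(X), size=k)                 # bagging indices w_1, ..., w_k
    X_bag = noise_features(X[idx])
    Y_bag = np.array([noise_label(y) for y in Y[idx]])
    x_bar = noise_features(np.asarray(x_test))
    return X_bag, Y_bag, x_bar

# e.g., rng = np.random.default_rng(0)
#       D_bar = sample_mu(X, Y, x, k=100, rho=0.9, K=1, num_classes=10, rng=rng)
```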
**Example 4.1.** Suppose we defend against F_1 in a trigger-less setting; then the distribution μ will not modify labels or the test input. Let ρ = 4/5 and k = 1. Following Example 3.1, let the binary-classification image dataset be D = {(x_1, y_1), (x_2, y_2)}, where each image contains only one pixel. Then, one possible element of μ(D, x) is the pair ({(x'_2, y_2)}, x), where x'_2 = x_2. The probability of this element is (1/2) · ρ = 0.4 because we uniformly select one of the two examples and do not flip any feature, i.e., the single feature retains its original value.

## 5 A Precise Approach to Computing the Certified Radius

In this section, we show how to compute the certified radius of the smoothed algorithm Ā given a dataset D, a test input x, and a perturbation function π. We focus on binary classification and provide the multi-class case in Appendix B. Suppose that we have computed the prediction y* = Ā(D, x). We want to show how many examples in D we can perturb to obtain any other D̃ such that the prediction remains y*. Specifically, we want to find the largest possible radius r such that

$$\forall \tilde{D} \in S^{\pi}_r(D),\ \forall \tilde{x} \in \pi(x, y){\downarrow_1}.\quad \Pr_{(\bar{D},\bar{x}) \sim \mu(\tilde{D},\tilde{x})}\big(A(\bar{D}, \bar{x}) = y^*\big) > 0.5 \quad (4)$$

We first show how to certify that Eq 4 holds for a given r and then rely on binary search to compute the largest such r, i.e., the certified radius.¹ In Section 5.1, we present the Neyman-Pearson lemma used to certify Eq 4, as is common practice in randomized smoothing. In Section 5.2, we show how to compute the certified radius for the distribution μ and the perturbation functions F_s, FL_s, and L.

¹ It is difficult to obtain a closed-form solution for the certified radius (as done in [15, 39]) in our setting because the distribution of BagFlip is complicated. We rely on binary search, as also done in Wang et al. [35] and Lee et al. [18]. Appendix D shows a loose closed-form bound on the certified radius via KL-divergence [8].

### 5.1 The Neyman-Pearson Lemma

Hereinafter, we simplify the notation and use o to denote the pair (D̄, x̄) and A(o) to denote the prediction of the algorithm trained and evaluated on o. We further abbreviate the distribution μ(D, x) as μ and the distribution μ(D̃, x̃) as μ̃. We define the performance of the smoothed algorithm Ā on dataset D, i.e., the probability of predicting y*, as p* = Pr_{o∼μ}(A(o) = y*). The challenge in certifying Eq 4 is that we cannot directly estimate the performance of the smoothed algorithm on the perturbed data, i.e., Pr_{o∼μ̃}(A(o) = y*), because μ̃ is universally quantified. To address this problem, we use the Neyman-Pearson lemma to find a lower bound lb on Pr_{o∼μ̃}(A(o) = y*). We do so by constructing a worst-case algorithm A* and distribution μ̃ (we use the superscript * to denote a worst-case algorithm). Specifically, we minimize Pr_{o∼μ̃}(A*(o) = y*) while maintaining the algorithm's performance on μ, i.e., keeping Pr_{o∼μ}(A*(o) = y*) = p*. We use 𝒜 to denote the set of all possible algorithms. We formalize the computation of the lower bound lb as the following constrained minimization objective:

$$lb \triangleq \min_{A^{*} \in \mathcal{A}}\ \Pr_{o \sim \tilde{\mu}}\big(A^{*}(o) = y^*\big) \quad \text{s.t.}\quad \Pr_{o \sim \mu}\big(A^{*}(o) = y^*\big) = p^*,\ \ \tilde{D} \in S^{\pi}_r(D),\ \ \tilde{x} \in \pi(x, y){\downarrow_1} \quad (5)$$

It is easy to see that lb is a lower bound of Pr_{o∼μ̃}(A(o) = y*) in Eq 4 because A ∈ 𝒜 and A satisfies the constraint of the minimization problem.
Thus, lb > 0.5 implies the correctness of Eq 4. We show how to construct A* and μ̃ greedily. For each outcome o = (D̄, x̄) in the sample space Ω, i.e., the set of all possible sampled datasets and test inputs, we define the likelihood ratio of o as η(o) = p_μ(o)/p_μ̃(o), where p_μ and p_μ̃ are the PMFs of μ and μ̃, respectively. The key idea is as follows: we partition Ω into finitely many disjoint subspaces L_1, ..., L_m such that the likelihood ratio in each subspace L_i is some constant η_i ∈ [0, ∞], i.e., ∀o ∈ L_i. η(o) = η_i. We sort and reorder the subspaces by likelihood ratio such that η_1 ≥ ... ≥ η_m. We denote the probability mass of μ on subspace L_i as p_μ(L_i).

**Example 5.1.** Suppose p_μ(o_1) = 4/10, p_μ̃(o_1) = 4/10, p_μ(o_2) = 1/10, p_μ̃(o_2) = 1/10, p_μ(o_3) = 4/10, p_μ̃(o_3) = 1/10, p_μ(o_4) = 1/10, p_μ̃(o_4) = 4/10. We can partition o_1 and o_2 into one subspace L because η(o_1) = η(o_2) = 1.

The construction of the A* that minimizes Eq 5 is a greedy process, which iteratively assigns A*(o) = y* on L_1, L_2, ... until the budget p* is met. The worst-case μ̃ can also be constructed greedily by maximizing the top-most likelihood ratios, and we can prove that the worst case happens when the difference between μ and μ̃ is maximized, i.e., when D̃ and x̃ are maximally perturbed. The following theorem adapts the Neyman-Pearson lemma to our setting.

**Theorem 5.1 (Neyman-Pearson Lemma for FL_s, F_s, L).** Let D̃ and x̃ be a maximally perturbed dataset and test input, i.e., |D̃ \ D| = r, ‖x̃ − x‖₀ = s, and ‖x̃_i − x_i‖₀ + 1_{ỹ_i ≠ y_i} = s for each perturbed example (x̃_i, ỹ_i) in D̃. Let i_lb be the smallest index i ∈ [1, m] such that $\sum_{j=1}^{i} p_\mu(L_j) \ge p^*$. Then,

$$lb = \sum_{j=1}^{i_{lb}-1} p_{\tilde{\mu}}(L_j) + \Big(p^* - \sum_{j=1}^{i_{lb}-1} p_\mu(L_j)\Big) \big/ \eta_{i_{lb}}.$$

We say that lb is tight due to the existence of the minimizers A* and μ̃ (see Appendix B).

**Example 5.2.** Following Example 5.1, suppose Ω = {o_1, ..., o_4} and we partition it into L_1 = {o_3}, L_2 = {o_1, o_2}, L_3 = {o_4}, with η_1 = 4, η_2 = 1, η_3 = 0.25. Let p* = 0.95; then we have i_lb = 3 and lb = 0.1 + 0.5 + (0.95 − 0.4 − 0.5)/0.25 = 0.8.
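The greedy computation in Theorem 5.1 is easy to state in code. Below is a small sketch (ours, using exact rational arithmetic) that takes the subspace masses already sorted by decreasing likelihood ratio and reproduces Example 5.2.

```python
from fractions import Fraction

def neyman_pearson_lb(subspaces, p_star):
    """Lower bound lb of Theorem 5.1.
    `subspaces` is a list of (p_mu(L_j), p_mutilde(L_j)) pairs, sorted by
    decreasing likelihood ratio eta_j = p_mu(L_j) / p_mutilde(L_j)."""
    lb, used = Fraction(0), Fraction(0)
    for p_mu, p_mutilde in subspaces:
        if used + p_mu < p_star:
            lb += p_mutilde                 # assign y* on the whole subspace
            used += p_mu
        else:                               # subspace i_lb: spend the remaining budget
            eta = p_mu / p_mutilde          # assumes p_mutilde > 0 here
            return lb + (p_star - used) / eta
    return lb

# Example 5.2: L1 = {o3}, L2 = {o1, o2}, L3 = {o4}, p* = 0.95
subspaces = [(Fraction(4, 10), Fraction(1, 10)),   # eta_1 = 4
             (Fraction(5, 10), Fraction(5, 10)),   # eta_2 = 1
             (Fraction(1, 10), Fraction(4, 10))]   # eta_3 = 0.25
print(neyman_pearson_lb(subspaces, Fraction(95, 100)))   # 4/5, i.e., lb = 0.8
```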
### 5.2 Computing the Certified Radius of BagFlip

Computing the certified radius boils down to computing lb in Eq 5 using Theorem 5.1. To compute lb for BagFlip, we address the following two challenges: (1) the computation of i_lb and the summations in Theorem 5.1 depend on the number of subspaces L_j; we design a partition strategy that partitions Ω into a polynomial number of subspaces (O(k²d)), where k is the size of the bag sampled from D and d is the number of features. (2) In Theorem 5.1, lb depends on p_μ(L_j), which according to its definition (Eq 8) would take exponential time (O(kd2^k)) to compute; we propose an efficient algorithm that computes these quantities in polynomial time (O(d³ + k²d²)). We first show how to address these challenges for F_s and then show how to handle FL_s and L.

**Partition strategy.** We partition the large sample space Ω into disjoint subspaces L_{c,t} indexed by c, the number of perturbed examples in the sampled dataset, and t, the difference between the number of features on which the sampled data differs from the clean data and the number on which it differs from the perturbed data. Intuitively, c takes care of the bagging distribution and t takes care of the feature-flipping distribution. Formally,

$$L_{c,t} = \Big\{ \big(\{(x'_{w_i}, y_{w_i})\}_i,\ x'\big) \;\Big|\; \sum_{i=1}^{k} \mathbb{1}_{x_{w_i} \neq \tilde{x}_{w_i}} = c, \quad (6)$$

$$\Big(\sum_{i=1}^{k} \|x'_{w_i} - x_{w_i}\|_0 + \|x' - x\|_0\Big) - \Big(\sum_{i=1}^{k} \|x'_{w_i} - \tilde{x}_{w_i}\|_0 + \|x' - \tilde{x}\|_0\Big) = t \Big\} \quad (7)$$

Eq 6 means that there are c perturbed indices sampled in o. The first sum ℓ (and the second sum ℓ̃) in Eq 7 counts how many features the sampled data differs from the clean data (resp. the perturbed data) on, so t = ℓ − ℓ̃. The number of possible subspaces L_{c,t}, which are disjoint by definition, is O(k²d) because 0 ≤ c ≤ k and |t| ≤ (k + 1)d. The next theorems show how to compute the likelihood ratio of L_{c,t} and the mass p_μ(L_{c,t}).

**Theorem 5.2 (Computing η_{c,t}).** η_{c,t} = p_μ(L_{c,t}) / p_μ̃(L_{c,t}) = (γ/ρ)^t, where γ and ρ are the parameters controlling the noise level in BagFlip's smoothing distribution μ.

**Theorem 5.3 (Computing p_μ(L_{c,t})).** p_μ(L_{c,t}) = Pr_{o∼μ}(o satisfies Eq 6) · Pr_{o∼μ}(o satisfies Eq 7 | o satisfies Eq 6), where Pr_{o∼μ}(o satisfies Eq 6) = Binom(c; k, r/n) is the PMF of the binomial distribution and

$$T(c, t) \triangleq \Pr_{o \sim \mu}(o \text{ satisfies Eq 7} \mid o \text{ satisfies Eq 6}) = \sum_{\substack{0 \le \ell_i, \tilde{\ell}_i \le d,\ i \in [0, c] \\ (\ell_0 - \tilde{\ell}_0) + \cdots + (\ell_c - \tilde{\ell}_c) = t}}\ \prod_{i=0}^{c} L(\ell_i, \tilde{\ell}_i; s, d)\, \gamma^{\ell_i} \rho^{d - \ell_i}, \quad (8)$$

where L(ℓ, ℓ̃; s, d) is the same quantity defined in Lee et al. [18].

**Remark 5.1.** We can compute p_μ̃(L_{c,t}) as p_μ(L_{c,t}) / η_{c,t}, by the definition of η_{c,t}.

**Efficient algorithm to compute Eq 8.** The following algorithm computes T(c, t) efficiently, in time O(d³ + k²d²): (1) computing L(u, v; s, d) takes O(d³) (see Appendix C for details); (2) computing T(0, t) directly from the definition in Eq 8 takes O(kd²); (3) computing T(c, t) for c ≥ 1 by the following recurrence takes O(k²d²):

$$T(c, t) = \sum_{t_1 = \max(-d,\, t - cd)}^{\min(d,\, t + cd)} T(c-1,\, t - t_1)\, T(0, t_1) \quad (9)$$

**Theorem 5.4 (Correctness of the Algorithm).** T(c, t) in Eq 9 is the same as the one in Eq 8.

The above computation also applies to the perturbation functions FL_s and L. Intuitively, flipping the label can be seen as flipping one additional dimension of the input features (details in Appendix B.1).
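A sketch of step (3): given T(0, ·) (computed from Eq 8), the recurrence in Eq 9 is a repeated discrete convolution. The dictionary-based indexing below is our own choice, not the paper's implementation.

```python
def convolve_T(T0, c_max, d):
    """Compute T(c, t) for c = 0, ..., c_max from T(0, t) using Eq 9.
    `T0` maps t in [-d, d] to T(0, t); the result maps (c, t) to T(c, t)."""
    T = {(0, t): v for t, v in T0.items()}
    for c in range(1, c_max + 1):
        for t in range(-(c + 1) * d, (c + 1) * d + 1):
            T[(c, t)] = sum(
                T.get((c - 1, t - t1), 0) * T0.get(t1, 0)
                for t1 in range(max(-d, t - c * d), min(d, t + c * d) + 1)
            )
    return T
```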
**Practical perspective.** For each test input x, we need to estimate p* for the smoothed algorithm Ā given the benign dataset D. We use Monte Carlo sampling and bound p* via the Clopper-Pearson interval. We also reuse the trained models across all test inputs in the test set by applying a Bonferroni correction. We memoize the certified radius by enumerating all possible values of p* beforehand, so that checking Eq 4 can be done in constant time in an online scenario (details in Appendix B.2). We cannot use floating-point numbers because some intermediate values are too small to represent in floating-point format; instead, we use rational numbers whose numerators and denominators are stored as arbitrary-precision integers.

## 6 An Efficient Relaxation for Computing a Certified Radius

Theorem 5.1 requires computing the likelihood ratio of a large number of subspaces L_1, ..., L_m. We propose a generalization of the Neyman-Pearson lemma that underapproximates the set of subspaces by a small subset and computes a lower bound lb_δ of lb. We then show how to choose a subset of subspaces that yields a tight underapproximation.

**A relaxation of the Neyman-Pearson lemma.** The key idea is to apply Theorem 5.1 to only a subset of the subspaces {L_i}_i. Suppose that we partition the subspaces {L_i}_i into two sets {L_i}_{i∈B} and {L_i}_{i∉B}, using a set of indices B. We define a new lower bound lb_δ by applying Theorem 5.1 to the first group {L_i}_{i∈B} and underapproximating p* in Theorem 5.1 as p* − Σ_{i∉B} p_μ(L_i). Intuitively, the underapproximation ensures that no subspace in {L_i}_{i∉B} contributes to the lower bound, even though these subspaces may have contributed to the precise lb computed by Theorem 5.1. The next theorem shows that lb_δ is smaller than lb by at most an error term δ that is a function of the partition B. The gain is in the number of subspaces we have to consider, which is now |B|.

**Theorem 6.1 (Relaxation of the Lemma).** Define lb for the subspaces L_1, ..., L_m as in Theorem 5.1. Define lb_δ for {L_i}_{i∈B} as in Theorem 5.1, underapproximating p* as p* − Σ_{i∉B} p_μ(L_i). Then we have:
- Soundness: lb_δ ≤ lb, and
- δ-Tightness: letting δ ≜ Σ_{i∉B} p_μ̃(L_i), we have lb_δ + δ ≥ lb.

**Example 6.1.** Consider L_1, L_2, and L_3 from Example 5.2. If we set B = {1, 2} and underapproximate p* as p* − p_μ(L_3) = 0.85, then we have lb_δ = 0.1 + (0.85 − 0.4)/1 = 0.55, which can still certify Eq 4. However, lb_δ is not close to the original lb = 0.8 because the error δ = 0.4 is large in this example. Next, we show how to choose B as small as possible while keeping δ small.

**Speeding up the radius computation.** The total time for computing lb consists of two parts: (1) the efficient algorithm (Eq 9) computes T(c, t) in O(d³ + k²d²), and (2) Theorem 5.1 takes O(k²d) for a given p* because there are O(k²d) subspaces. Even though the two techniques in Section 5.2 make the computation polynomial, it can still be slow for large bag sizes k, which can easily be in the hundreds or thousands. We apply Theorem 6.1 to replace the bag size k with a small constant κ. Observe that p_μ(L_{c,t}) and p_μ̃(L_{c,t}), as defined in Section 5.2, are negligible for large c, the number of perturbed indices in the sampled dataset: if only a small portion of the training set is perturbed, it is unlikely that many perturbed indices are selected into the sampled bag. We underapproximate the full set of subspaces to {L_{c,t}}_{(c,t)∈B} by choosing the index set B = {(c, t) | c ≤ κ, |t| ≤ (c + 1)d}, where κ is a constant controlling the error δ, which can be computed as δ = Σ_{i∉B} p_μ̃(L_i) = 1 − Σ_{c=0}^{κ} Binom(c; k, r/n). Theorem 6.1 reduces the number of subspaces to |B| = O(κ²d). As a result, every k appearing in the previous time complexities can be replaced with κ.

**Example 6.2.** Suppose k = 150 and r/n = 0.5%; choosing κ = 6 leads to δ = 1.23 × 10⁻⁵. In other words, this speeds up the computation of T(c, t) and the application of Theorem 5.1 by a factor of almost 625.
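The truncation error in Example 6.2 depends only on a binomial tail, so it can be checked directly; a small sketch of ours:

```python
from math import comb

def truncation_error(k, poison_frac, kappa):
    """delta = 1 - sum_{c=0}^{kappa} Binom(c; k, r/n): the mu-tilde mass of the
    discarded subspaces {L_{c,t} : c > kappa}."""
    p = poison_frac
    return 1 - sum(comb(k, c) * p**c * (1 - p)**(k - c) for c in range(kappa + 1))

print(truncation_error(k=150, poison_frac=0.005, kappa=6))   # ~1.23e-05, as in Example 6.2
```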
## 7 Experiments

The implementation of BagFlip is publicly available.² In this section, we evaluate BagFlip against trigger-less and backdoor attacks and compare BagFlip with existing work. Note that we apply the relaxation from Section 6 to BagFlip in all experiments and set δ = 10⁻⁴.

² https://github.com/ForeverZyh/defend_framework

**Datasets.** We conduct experiments on MNIST, CIFAR10, EMBER [2], and Contagio (http://contagiodump.blogspot.com). MNIST and CIFAR10 are image-classification datasets. EMBER and Contagio are malware-detection datasets, where data poisoning can lead to disastrous security consequences. To align with existing work, we select subsets of MNIST and CIFAR10, denoted MNIST-17, MNIST-01, and CIFAR10-02, in some experiments; e.g., MNIST-17 is the subset of MNIST with classes 1 and 7. We discretize all features when applying BagFlip. We use clean datasets except in the comparison with RAB [39], where we follow their experimental setup and use backdoored datasets generated by BadNets [12]. A detailed description of the datasets is in Appendix E.1.

**Models.** We train neural networks for MNIST, CIFAR10, and EMBER and random forests for Contagio. Unless we specifically mention otherwise in an experiment, whenever we compare BagFlip to an existing work we use the same network structures, hyper-parameters, and data augmentation as the compared work, and we train N = 1000 models and set the confidence level to 0.999 for each configuration. All configurations can be found in the implementation.

**Metrics.** For each test input (x_i, y_i), the algorithm Ā predicts a label and a certified radius r_i. Assuming that the attacker has poisoned R% of the examples in the training set, we define the certified accuracy as the percentage of test inputs that are correctly classified and whose certified radii are no less than R% of the training set, i.e., $\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{\bar{A}(D,x_i)=y_i \,\wedge\, r_i \ge n \cdot R\%}$. We define the normal accuracy as the percentage of test inputs that are correctly classified, i.e., $\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{\bar{A}(D,x_i)=y_i}$.

Figure 1: Comparison to Bagging on MNIST and EMBER, showing the certified accuracy (y-axis) at different poisoning amounts R (x-axis, %). Panels: (a) MNIST F_1, (b) MNIST F_∞, (c) EMBER F_1, (d) EMBER F_∞; curves compare Bagging, BagFlip-a, and BagFlip-b. For MNIST: a = 0.9 and b = 0.8. For EMBER: a = 0.95 and b = 0.9.

Table 1: Certified accuracy (%) on MNIST and EMBER with perturbations F_1 and F_∞. Note that the certified accuracies at poisoning amount R = 0 are the same for F_1 and F_∞ because we reuse the trained models.

MNIST:

| Method | F_1, R=0 | R=0.5 | R=1.0 | R=1.5 | R=2.0 | R=2.5 | F_∞, R=0 | R=0.13 | R=0.27 | R=0.40 | R=0.53 | R=0.83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagging | 94.54 | 66.84 | 0 | 0 | 0 | 0 | 94.54 | 90.83 | 85.45 | 77.61 | 61.46 | 0 |
| Bagging-0.9 | 93.58 | 71.11 | 0 | 0 | 0 | 0 | 93.58 | 89.80 | 84.92 | 78.60 | 68.11 | 0 |
| BagFlip-0.9 | 93.62 | 75.95 | 27.73 | 4.02 | 0 | 0 | 93.62 | 89.45 | 83.62 | 74.19 | 56.65 | 0 |
| BagFlip-0.8 | 90.72 | 73.94 | 46.20 | 33.39 | 24.23 | 5.07 | 90.72 | 84.11 | 74.50 | 60.56 | 39.21 | 0 |

EMBER:

| Method | F_1, R=0 | R=0.07 | R=0.13 | R=0.20 | R=0.27 | R=0.33 | F_∞, R=0 | R=0.05 | R=0.10 | R=0.15 | R=0.20 | R=0.25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagging | 82.65 | 75.11 | 66.01 | 0 | 0 | 0 | 82.65 | 76.30 | 72.94 | 61.78 | 0 | 0 |
| Bagging-0.95 | 79.06 | 75.32 | 70.19 | 14.74 | 0 | 0 | 79.06 | 76.23 | 73.50 | 68.45 | 14.74 | 0 |
| BagFlip-0.95 | 79.17 | 75.93 | 69.30 | 57.36 | 0 | 0 | 79.17 | 76.83 | 72.41 | 62.04 | 30.36 | 0 |
| BagFlip-0.9 | 78.18 | 69.40 | 62.11 | 41.21 | 13.89 | 1.70 | 78.18 | 70.79 | 63.16 | 41.24 | 11.64 | 0 |

### 7.1 Defending Against Trigger-less Attacks

Two model-agnostic certified approaches, Bagging [15] and LabelFlip [28], can defend against trigger-less attacks. BagFlip outperforms LabelFlip (comparison in Appendix E.2). We evaluate BagFlip on the perturbation F_s using MNIST, CIFAR10, EMBER, and Contagio and compare BagFlip to Bagging. We provide comparisons with Bagging on other perturbation spaces in Appendix E.2.

We present BagFlip with different noise levels (different values of the noise parameter ρ), denoted BagFlip-ρ. When comparing to Bagging, we use the same k, the size of the sampled datasets, for a fair comparison. Furthermore, we tune the parameter k in Bagging to match the normal accuracy of the corresponding BagFlip-ρ setting, and denote this setting Bagging-ρ. Concretely, we set k = 100, 1000, 300, 30 for MNIST, CIFAR10, EMBER, and Contagio, respectively, when comparing to Bagging. We tune k = 80 for Bagging-0.9 on MNIST and k = 280 for Bagging-0.95 on EMBER, and we set k = 50 for MNIST when comparing to LabelFlip.

**Comparison to Bagging.** Bagging on a discrete dataset is a special case of BagFlip with ρ = 1, i.e., no noise added to the dataset. (We present the results of Bagging on the original, undiscretized dataset in Appendix E.2.) Table 1 and Figure 1 show the results of BagFlip and Bagging on perturbations F_1 and F_∞ over MNIST and EMBER. The results for CIFAR10 and Contagio are similar and shown in Appendix E.2. F_1 only allows the attacker to modify one feature in each training example, while F_∞ allows the attacker to modify each training example arbitrarily without constraint. For F_1 on MNIST, BagFlip-0.9 performs better than Bagging after R = 0.19, and BagFlip-0.8 still retains non-zero certified accuracy at R = 2.5, while Bagging's certified accuracy drops to zero after R = 0.66. We observe similar results on EMBER for F_1 in Table 1, except at R = 0.13 when compared to Bagging-0.95.
We argue that it is possible to tune a better combination of k and ρ for BagFlip, as we tuned k for Bagging-0.95, and achieve a better result while maintaining similar normal accuracy. However, we did not conduct hyperparameter tuning for BagFlip because of its computational cost. BagFlip achieves higher certified accuracy than Bagging when the poisoning amount is large for F_1.

For F_∞ on MNIST, Bagging performs better than BagFlip across all R because the noise BagFlip adds to the training set hurts accuracy. However, we find that the noise added by BagFlip helps it perform better for F_∞ on EMBER. Specifically, BagFlip achieves similar certified accuracy to Bagging at small radii and performs better than Bagging after R = 0.15. Bagging achieves higher certified accuracy than BagFlip for F_∞, except on EMBER, where BagFlip achieves higher certified accuracy than Bagging when the amount of poisoning is large.

We also study the effect of different s in the perturbation function F_s. Figure 2(a) shows the result of BagFlip-0.8 on MNIST. BagFlip has the highest certified accuracy for F_1. As s increases, the result monotonically converges to the curve of BagFlip-0.8 in Figure 1(b). Bagging ignores the perturbation function and performs the same across all s; it performs better than BagFlip-0.8 when s ≥ 8. Other results for different datasets and noise levels follow a similar trend (see Appendix E.2). BagFlip performs best on F_1 and monotonically degrades to the F_∞ behavior as s increases.

Figure 2: (a) BagFlip-0.8 on MNIST against F_s with different s (s = 1, 2, 4, 8, ∞); the curve for s = 8 almost overlaps with s = ∞. (b) BagFlip-0.8 on MNIST and BagFlip-0.9 on Contagio against the backdoor attack with F_1. Dashed lines show normal accuracy. Both panels plot certified accuracy against the poisoning amount R (%).

### 7.2 Defending Against Backdoor Attacks

Two model-agnostic certified approaches, FeatFlip [35] and RAB [39], can defend against backdoor attacks. We compare BagFlip to FeatFlip on MNIST-17 perturbed using FL_1, and compare BagFlip to RAB on poisoned MNIST-01 and CIFAR10-02 (poisoned by BadNets [12]) perturbed using F_1. We further evaluate BagFlip on the perturbation F_1 using MNIST and Contagio, on which existing work is unable to compute a meaningful certified radius.

Table 2: Comparison to FeatFlip on FL_1 (MNIST-17) and to RAB on F_1 (MNIST-01 and CIFAR10-02), reporting normal accuracy (Acc.) and certified accuracy at poisoning amount R (CF@R).

MNIST-17 (FL_1):

| Method | Acc. | CF@0 | CF@0.5 | CF@1.0 |
|---|---|---|---|---|
| FeatFlip | 98 | 36 | 0 | 0 |
| BagFlip-0.95 | 97 | 81 | 60 | 0 |
| BagFlip-0.9 | 97 | 72 | 46 | 4 |

MNIST-01 (F_1):

| Method | Acc. | CF@0 | CF@1.5 | CF@3.0 |
|---|---|---|---|---|
| RAB | 100.0 | 74.4 | 0 | 0 |
| BagFlip-0.8 | 99.6 | 98.4 | 96.5 | 84.6 |
| BagFlip-0.7 | 99.5 | 98.0 | 95.8 | 91.9 |

CIFAR10-02 (F_1):

| Method | Acc. | CF@0 | CF@0.1 | CF@0.2 |
|---|---|---|---|---|
| RAB | 73.3 | 0 | 0 | 0 |
| BagFlip-0.8 | 73.1 | 16.8 | 11.0 | 6.8 |
| BagFlip-0.7 | 71.7 | 41.0 | 34.6 | 29.2 |

**Comparison to FeatFlip.** FeatFlip scales poorly compared to BagFlip because its computation of the certified radius is polynomial in the size of the whole training set. BagFlip's computation is polynomial in the size of the bag instead, and it uses a relaxation of the Neyman-Pearson lemma for a further speedup. BagFlip is more scalable than FeatFlip. We directly cite FeatFlip's results from Wang et al. [35] and note that FeatFlip is evaluated on a subset of MNIST-17. As shown in Table 2, BagFlip achieves similar normal accuracy but much higher certified accuracy than FeatFlip across all R (see full results in Appendix E.2). BagFlip significantly outperforms FeatFlip against FL_1 on MNIST-17.

**Comparison to RAB.**
RAB assumes a perturbation function that perturbs the input within an l2-norm ball of radius s. We compare BagFlip to RAB on F_1, where the l0-norm ball (our perturbation function) and the l2-norm ball coincide because the feature values are within [0, 1]. Since RAB targets single-test-input prediction, we do not use the Bonferroni correction for BagFlip, for a fair comparison. We directly cite RAB's results from Weber et al. [39]. Table 2 shows that on MNIST-01 and CIFAR10-02, BagFlip achieves similar normal accuracy but much higher certified accuracy than RAB for all values of R (detailed figures in Appendix E.2). BagFlip significantly outperforms RAB against F_1 on MNIST-01 and CIFAR10-02.

**Results on MNIST and Contagio.** Figure 2(b) shows the results of BagFlip on MNIST and Contagio. For a fixed R, the certified accuracy for the backdoor attack is much smaller than the certified accuracy for the trigger-less attack (Figure 1 and Figure 3 in Appendix E.2) because backdoor attacks are strictly stronger than trigger-less attacks. BagFlip cannot provide effective certificates for backdoor attacks on the more complex datasets CIFAR10 and EMBER, i.e., the certified radius is almost zero. BagFlip can provide certificates against backdoor attacks on MNIST and Contagio, while its certificates are not effective for CIFAR10 and EMBER.

### 7.3 Computation Cost Analysis

We discuss the computation cost of BagFlip on the MNIST dataset and compare it to the other baselines.

**Training.** The training cost of BagFlip is similar to that of all the baselines because BagFlip only adds noise to the training data. BagFlip and the other baselines take about 16 hours on a single GPU to train N = 1000 classifiers on the MNIST dataset.

**Inference.** At inference time, BagFlip first evaluates the predictions of the N classifiers and counts how many classifiers predict the majority label (N_1) and how many predict the runner-up label (N_2). Then, BagFlip queries a precomputed lookup table for the radius certified by N_1 and N_2. The inference time for each example thus consists of evaluating N classifiers plus an O(1) table lookup, so there is no difference between BagFlip and the other baselines.

**Preparation.** BagFlip needs to prepare a table of size O(N²) to perform efficient lookups at inference time. The time complexity of preparing each table entry is presented in Sections 5 and 6. On the MNIST dataset, BagFlip with the relaxation proposed in Section 6 (δ = 10⁻⁴) needs 2 hours to prepare the lookup table on a single core, whereas the precise BagFlip of Section 5 needs 85 hours. Bagging also uses a lookup table, which can be built in 16 seconds on MNIST (Bagging only needs to do a binary search for each entry). FeatFlip would need approximately 8000 TB of memory to compute its table; thus, FeatFlip is infeasible to run on the full MNIST dataset and is only evaluated on a subset of MNIST-17 containing 100 training examples. RAB does not need to compute a lookup table because it has a closed-form solution for the certified radius.

BagFlip has training and inference times similar to the other baselines. The relaxation technique of Section 6 reduces the preparation time from 85 hours to 2 hours. Even with the relaxation, BagFlip needs more preparation time than Bagging and RAB. We argue that the preparation of BagFlip remains feasible because it takes only 12.5% of the time required for training.
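For completeness, here is a rough sketch of the inference-and-certification path described in this section: take the majority vote of the N classifiers, lower-bound p* with a one-sided Clopper-Pearson interval, and look up the precomputed radius. The function name, the `lookup_radius` callable, and the omission of the Bonferroni correction are our simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import beta

def predict_and_certify(votes, lookup_radius, alpha=0.001):
    """votes: length-N array of labels, one per trained classifier.
    lookup_radius: callable mapping a lower bound on p* to a certified radius
    (the table prepared offline, as described above)."""
    labels, counts = np.unique(votes, return_counts=True)
    y_hat = labels[np.argmax(counts)]                 # prediction of the smoothed model
    n1, n = int(counts.max()), len(votes)
    # one-sided Clopper-Pearson lower confidence bound on p* = Pr[A(o) = y_hat]
    p_lower = beta.ppf(alpha, n1, n - n1 + 1)
    return y_hat, lookup_radius(p_lower)
```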
## 8 Conclusion, Limitations, and Future Work

We presented BagFlip, a certified probabilistic approach that can effectively defend against both trigger-less and backdoor attacks. We foresee many future improvements to BagFlip. First, BagFlip treats both the data and the underlying machine learning models as closed boxes; assuming a specific data distribution and training algorithm could further improve the computed certified radius. Second, BagFlip uniformly flips the features and the label, while it would be desirable to adjust the noise levels of the label and of important features according to the distribution of the data, for better normal accuracy. Third, while probabilistic approaches need to retrain thousands of models after a fixed number of predictions, deterministic approaches can reuse models for every prediction; thus, it would be interesting to develop a deterministic model-agnostic approach that can defend against both trigger-less and backdoor attacks.

**Acknowledgments and Disclosure of Funding.** We thank the anonymous reviewers and Anna P. Meyer for their thoughtful and generous comments and Gurindar S. Sohi for giving us access to his GPUs. This work is supported by grants CCF-1750965, CCF-1918211, CCF-1652140, CCF-2106707, CCF-1652140, a Microsoft Faculty Fellowship, and gifts and awards from Facebook and Amazon.