# BagFlip: A Certified Defense against Data Poisoning

Yuhao Zhang, University of Wisconsin-Madison, yuhaoz@cs.wisc.edu
Aws Albarghouthi, University of Wisconsin-Madison, aws@cs.wisc.edu
Loris D'Antoni, University of Wisconsin-Madison, loris@cs.wisc.edu

Machine learning models are vulnerable to data-poisoning attacks, in which an attacker maliciously modifies the training set to change the prediction of a learned model. In a trigger-less attack, the attacker can modify the training set but not the test inputs, while in a backdoor attack the attacker can also modify test inputs. Existing model-agnostic defense approaches either cannot handle backdoor attacks or do not provide effective certificates (i.e., a proof of a defense). We present BagFlip, a model-agnostic certified approach that can effectively defend against both trigger-less and backdoor attacks. We evaluate BagFlip on image classification and malware detection datasets. BagFlip is as effective as or more effective than the state-of-the-art approaches for trigger-less attacks and more effective than the state-of-the-art approaches for backdoor attacks.

## 1 Introduction

Recent works have shown that machine learning models are vulnerable to data-poisoning attacks, where attackers maliciously modify the training set to influence the prediction of the victim model as they desire. In a trigger-less attack [42, 31, 24, 3, 1, 11], the attacker can modify the training set but not the test inputs, while in a backdoor attack [29, 34, 41, 30, 5, 21, 25, 27] the attacker can also modify test inputs. Effective attack approaches have been proposed for various domains such as image recognition [12], sentiment analysis [25], and malware detection [30].

Consider the malware detection setting. An attacker can modify the training data by adding a special signature, e.g., a line of code, to a set of benign programs. The idea is to make the learned model correlate the presence of the signature with benign programs. Then, the attacker can sneak a piece of malware past the model by including the signature, fooling the model into thinking it is a benign program. Indeed, it has been shown that if the model is trained on a dataset with only a few poisoned examples, a backdoor signature can be installed in the learned model [30, 27]. Thus, data-poisoning attacks are of great concern to the safety and security of machine learning models and systems, particularly as training data is gathered from different sources, e.g., via web scraping.

Ideally, a defense against data poisoning should fulfill the following desiderata: (1) construct effective certificates (proofs) of the defense, (2) defend against both trigger-less and backdoor attacks, and (3) be model-agnostic. It is quite challenging to fulfill all three desiderata; indeed, existing techniques are forced to make tradeoffs. For instance, empirical approaches [10, 40, 20, 26, 36, 13, 33, 9] cannot construct certificates and are likely to be bypassed by new attack approaches [32, 37, 16]. Some certified approaches [39, 35] provide ineffective certificates for both trigger-less and backdoor attacks. Some certified approaches [15, 19, 4, 28, 22, 38] provide effective certificates for trigger-less attacks but not backdoor attacks. Other certification approaches [14, 7, 23] are restricted to specific learning algorithms, e.g., decision trees.
This paper proposes BagFlip, a model-agnostic certified approach that uses randomized smoothing [6, 8, 18] to effectively defend against both trigger-less and backdoor attacks (Section 4). BagFlip uses a novel smoothing distribution that combines bagging of the training set with noising of the training data and the test input by randomly flipping features and labels. Although both bagging-based and noise-based approaches have been proposed independently in the literature, combining them makes it challenging to compute the certified radius, i.e., the amount of poisoning a learning algorithm can withstand without changing its prediction. To compute the certified radius precisely, we apply the Neyman-Pearson lemma to the sample space of BagFlip's smoothing distribution. This lemma requires partitioning the outcomes in the sample space into subspaces such that the likelihood ratio in each subspace is a constant. However, because our sample space is exponential in the number of features, we cannot naively apply the lemma. To address this problem, we exploit properties of our smoothing distribution to design an efficient algorithm that partitions the sample space into polynomially many subspaces (Section 5). Furthermore, we present a relaxation of the Neyman-Pearson lemma that further speeds up the computation (Section 6). Our evaluation against existing approaches shows that BagFlip is comparable to or more effective than them for trigger-less attacks and more effective for backdoor attacks (Section 7).

## 2 Related Work

Our model-agnostic approach BagFlip uses randomized smoothing to compute effective certificates for both trigger-less and backdoor attacks. BagFlip focuses on feature-and-label-flipping perturbation functions that modify training examples and test inputs. We discuss how existing approaches differ.

Some model-agnostic certified approaches can defend against both trigger-less and backdoor attacks but cannot construct effective certificates. Wang et al. [35] and Weber et al. [39] defended against feature-flipping poisoning attacks and l2-norm poisoning attacks, respectively. However, their certificates are practically ineffective due to the curse of dimensionality [17].

Some model-agnostic certified approaches construct effective certificates but cannot defend against backdoor attacks. Jia et al. [15] proposed to defend against general trigger-less attacks, i.e., attacks that add/delete/modify examples in the training dataset, by bootstrap aggregating (bagging). Chen et al. [4] extended bagging by designing other selection strategies, e.g., selecting without replacement and with a fixed probability. Rosenfeld et al. [28] defended against label-flipping attacks and instantiated their framework on linear classifiers. Differential privacy [22] can also provide probabilistic certificates for trigger-less attacks, but it cannot handle backdoor attacks, and its certificates are ineffective. Levine and Feizi [19] proposed deterministic partition aggregation (DPA) to defend against general trigger-less attacks by partitioning the training dataset using a secret hash function. Wang et al. [38] further improved DPA by introducing a spread stage.

Other certified approaches construct effective certificates but are not model-agnostic. Jia et al. [14] provided deterministic certificates only for nearest-neighbor classifiers (kNN/rNN), but for both trigger-less and backdoor attacks. Meyer et al. [23] and Drews et al.
[7] provided deterministic certificates only for decision trees, but against general trigger-less attacks.

## 3 Problem Definition

We take a holistic view of training and inference as a single deterministic algorithm A. Given a dataset D = {(x_1, y_1), ..., (x_n, y_n)} and a (test) input x, we write A(D, x) to denote the prediction of algorithm A on the input x after being trained on dataset D. We are interested in certifying that the algorithm will still behave well after training on a tampered dataset. Before describing what "well" means, we need to fix a perturbation space, i.e., the set of possible changes the attacker could make to the dataset. Given a pair (x, y), we write π(x, y) to denote the set of perturbed examples into which an attacker can transform the example (x, y). Given a dataset D and a radius r ≥ 0, we define the perturbation space as the set of datasets that can be obtained by modifying up to r examples in D using the perturbation π:

$$S^{\pi}_r(D) = \Big\{ \{(\tilde{x}_i, \tilde{y}_i)\}_i \;\Big|\; \forall i.\ (\tilde{x}_i, \tilde{y}_i) \in \pi(x_i, y_i),\ \sum_{i=1}^{n} \mathbb{1}_{(\tilde{x}_i,\tilde{y}_i) \neq (x_i,y_i)} \le r \Big\}$$

**Threat models.** We consider two attacks: one where an attacker can perturb only the training set (a trigger-less attack) and one where the attacker can perturb both the training set and the test input (a backdoor attack). We assume a backdoor attack scenario where the test-input perturbation is the same as the training one. If these two perturbation spaces can perturb examples differently, we can always over-approximate them by their union. In the following definitions, we assume that we are given a perturbation space π, a radius r ≥ 0, a benign training dataset D, and a test input x.

We say that algorithm A is robust to a trigger-less attack on the test input x if A yields the same prediction on the input x when trained on any perturbed dataset D̃ as when trained on the benign dataset D. Formally,

$$\forall \tilde{D} \in S^{\pi}_r(D).\quad A(\tilde{D}, x) = A(D, x) \quad (1)$$

We say that algorithm A is robust to a backdoor attack on the test input x if the algorithm trained on any perturbed dataset D̃ produces the prediction A(D, x) on any perturbed input x̃. Let π(x, y)↓1 denote the projection of the perturbation space π(x, y) onto the feature space (the attack can only backdoor the features of the test input). Robustness to a backdoor attack is defined as

$$\forall \tilde{D} \in S^{\pi}_r(D),\ \forall \tilde{x} \in \pi(x, y){\downarrow_1}.\quad A(\tilde{D}, \tilde{x}) = A(D, x) \quad (2)$$

Given a large enough radius r, an attacker can always change enough examples and succeed at breaking robustness for either kind of attack. Therefore, we focus on computing the maximal radius r for which we can prove that Eq 1 and Eq 2 hold for a given perturbation function π. We refer to this quantity as the certified radius. It is infeasible to prove Eq 1 and Eq 2 by enumerating all possible D̃ because S^π_r(D) can be astronomically large, e.g., |S^π_r(D)| > 10^30 when |D| = 1000 and r = 10.

**Defining the perturbation function.** We have not yet specified the perturbation function π, i.e., how the attacker can modify examples. In this paper, we focus on the following perturbation spaces. Given a bound s ≥ 0, a feature-and-label-flipping perturbation, FL_s, allows the attacker to modify the values of up to s features and the label of an example (x, y), where x ∈ [K]^d (i.e., x is a d-dimensional feature vector with each dimension taking one of the categories {0, 1, ..., K}). Formally,

$$FL_s(x, y) = \{(\tilde{x}, y') \mid \|x - \tilde{x}\|_0 + \mathbb{1}_{y' \neq y} \le s\}$$

There are two special cases of FL_s: a feature-flipping perturbation F_s and a label-flipping perturbation L. Given a bound s ≥ 0, F_s allows the attacker to modify the values of up to s features in the input x but not the label. L only allows the attacker to modify the label of a training example. Note that L cannot modify the test input's features, so it can only be used in trigger-less attacks.

**Example 3.1.** If D is a binary-classification image dataset, where each pixel is either black or white, then the perturbation function F_1 assumes the attacker can modify up to one pixel per image.
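To make these perturbation spaces concrete, the following is a minimal sketch (our own illustration, not code from the paper) of a membership check for FL_s over categorical feature vectors; F_s and L fall out as special cases.

```python
import numpy as np

def in_FL_s(x, y, x_tilde, y_tilde, s):
    """Is (x_tilde, y_tilde) in FL_s(x, y)?  The attacker may flip features
    and the label, as long as the total number of flips is at most s."""
    flips = int(np.sum(np.asarray(x) != np.asarray(x_tilde)))  # ||x - x_tilde||_0
    flips += int(y != y_tilde)                                  # a label change counts as one flip
    return flips <= s

def in_F_s(x, x_tilde, s):
    """F_s: up to s feature flips, label unchanged."""
    return int(np.sum(np.asarray(x) != np.asarray(x_tilde))) <= s

def in_L(x, x_tilde):
    """L: only the label may change, so the features must be identical."""
    return np.array_equal(np.asarray(x), np.asarray(x_tilde))
```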
The goal of this paper is to design a certifiable algorithm that can defend against trigger-less and backdoor attacks (Section 4) by computing the certified radius (Sections 5 and 6). Given a benign dataset D, our algorithm certifies that an attacker can perturb D by some amount (the certified radius) without changing the prediction. Symmetrically, if we suspect that D is poisoned, our algorithm certifies that even if an attacker had not poisoned D by up to the certified radius, the prediction would have been the same. The two views are equivalent, and we use the former in the paper.

## 4 BagFlip: Dual-Measure Randomized Smoothing

Our approach, which we call BagFlip, is a model-agnostic certification technique. Given a learning algorithm A, we want to automatically construct a new learning algorithm Ā with certified poisoning-robustness guarantees (Eqs. (1) and (2)). To do so, we adopt and extend the framework of randomized smoothing. Initially used for test-time robustness, randomized smoothing robustifies a function f by carefully constructing a noisy version f̄ and theoretically analyzing the guarantees of f̄. Our approach, BagFlip, constructs a noisy algorithm Ā by randomly perturbing the training set and the test input and invoking A on the result. Formally, if we have a set of output labels C, we define the smoothed learning algorithm Ā as follows:

$$\bar{A}(D, x) \triangleq \operatorname*{argmax}_{y \in C}\ \Pr_{(\bar{D}, \bar{x}) \sim \mu(D, x)}\big(A(\bar{D}, \bar{x}) = y\big) \quad (3)$$

where μ is a carefully designed probability distribution (the smoothing distribution) over the training set and the test input, and (D̄, x̄) are a dataset and a test input sampled from the smoothing distribution.

We present two key contributions that enable BagFlip to efficiently certify robustness up to large poisoning radii. First, we define a smoothing distribution μ that combines bagging [15] of the training set D and noising [35] of the training set and the test input x (done by randomly flipping features and labels). This combination, described below, allows us to defend against trigger-less and backdoor attacks. However, it also makes it challenging to compute the certified radius due to the combinatorial explosion of the sample space. Our second contribution is a partition strategy for the Neyman-Pearson lemma that results in an efficient certification algorithm (Section 5), as well as a relaxation of the Neyman-Pearson lemma that further speeds up certification (Section 6).

**Smoothing distribution.** We describe the smoothing distribution μ that defends against feature and label flipping. Given a dataset D = {(x_i, y_i)} and a test input x, sampling from μ(D, x) generates a random dataset and test input in the following way:

1. Uniformly select k examples from D with replacement and record their indices as w_1, ..., w_k.
2. Modify each selected example (x_{w_i}, y_{w_i}) and the test input x to (x'_{w_i}, y'_{w_i}) and x', respectively, as follows: for each feature, with probability 1 − ρ, uniformly change its value to one of the other categories in {0, ..., K}. Randomly modify the labels y_{w_i} in the same way.

In other words, each feature keeps its value with probability ρ and is flipped to each particular other value of its domain with probability γ ≜ (1 − ρ)/K, where ρ ∈ [0, 1] is the parameter controlling the noise level. For F_s (resp. L), we simply do not modify the labels (resp. features) of the k selected examples.
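A minimal sketch of sampling from μ(D, x) under this encoding (features in {0, ..., K}); this is our own illustration of the two steps, not the released implementation. For F_s one would skip the label noise, and for L the feature noise.

```python
import numpy as np

def sample_mu(X, Y, x_test, k, rho, K, num_classes, rng):
    """Draw one (D_bar, x_bar) ~ mu(D, x):
    (1) bag k training examples uniformly with replacement;
    (2) keep each feature of the selected examples and of x_test with
        probability rho, otherwise flip it to a uniformly chosen *other*
        category; noise the selected labels in the same way."""
    def noise_features(v):
        flip = rng.random(v.shape) < 1 - rho              # which coordinates to flip
        shift = rng.integers(1, K + 1, size=v.shape)      # nonzero shift = another category
        return np.where(flip, (v + shift) % (K + 1), v)

    def noise_label(y):
        if rng.random() < 1 - rho:
            return (y + rng.integers(1, num_classes)) % num_classes
        return y

    idx = rng.integers(0, len(X), size=k)                 # bagging indices w_1, ..., w_k
    X_bag = noise_features(X[idx])
    Y_bag = np.array([noise_label(y) for y in Y[idx]])
    x_bar = noise_features(np.asarray(x_test))
    return X_bag, Y_bag, x_bar

# e.g., rng = np.random.default_rng(0)
#       D_bar = sample_mu(X, Y, x, k=100, rho=0.9, K=1, num_classes=10, rng=rng)
```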
**Example 4.1.** Suppose we defend against F_1 in a trigger-less setting; then the distribution μ will not modify labels or the test input. Let ρ = 4/5 and k = 1. Following Example 3.1, let the binary-classification image dataset be D = {(x_1, y_1), (x_2, y_2)}, where each image contains only one pixel. Then, one possible element of μ(D, x) is the pair ({(x'_2, y_2)}, x), where x'_2 = x_2. The probability of this element is (1/2) · ρ = 0.4 because we uniformly select one of the two examples and do not flip any feature, i.e., the single feature retains its original value.

## 5 A Precise Approach to Computing the Certified Radius

In this section, we show how to compute the certified radius of the smoothed algorithm Ā given a dataset D, a test input x, and a perturbation function π. We focus on binary classification and provide the multi-class case in Appendix B. Suppose that we have computed the prediction y* = Ā(D, x). We want to show how many examples in D we can perturb to obtain any other D̃ such that the prediction remains y*. Specifically, we want to find the largest possible radius r such that

$$\forall \tilde{D} \in S^{\pi}_r(D),\ \forall \tilde{x} \in \pi(x, y){\downarrow_1}.\quad \Pr_{(\bar{D},\bar{x}) \sim \mu(\tilde{D},\tilde{x})}\big(A(\bar{D}, \bar{x}) = y^*\big) > 0.5 \quad (4)$$

We first show how to certify that Eq 4 holds for a given r and then rely on binary search to compute the largest such r, i.e., the certified radius.¹ In Section 5.1, we present the Neyman-Pearson lemma used to certify Eq 4, as is common practice in randomized smoothing. In Section 5.2, we show how to compute the certified radius for the distribution μ and the perturbation functions F_s, FL_s, and L.

¹ It is difficult to obtain a closed-form solution for the certified radius (as done in [15, 39]) in our setting because the distribution of BagFlip is complicated. We rely on binary search, as also done in Wang et al. [35] and Lee et al. [18]. Appendix D shows a loose closed-form bound on the certified radius via KL-divergence [8].

### 5.1 The Neyman-Pearson Lemma

Hereinafter, we simplify the notation and use o to denote the pair (D̄, x̄) and A(o) to denote the prediction of the algorithm trained and evaluated on o. We further abbreviate the distribution μ(D, x) as μ and the distribution μ(D̃, x̃) as μ̃. We define the performance of the smoothed algorithm Ā on dataset D, i.e., the probability of predicting y*, as p* = Pr_{o∼μ}(A(o) = y*). The challenge in certifying Eq 4 is that we cannot directly estimate the performance of the smoothed algorithm on the perturbed data, i.e., Pr_{o∼μ̃}(A(o) = y*), because μ̃ is universally quantified. To address this problem, we use the Neyman-Pearson lemma to find a lower bound lb on Pr_{o∼μ̃}(A(o) = y*). We do so by constructing a worst-case algorithm A* and distribution μ̃ (we use the superscript * to denote a worst-case algorithm). Specifically, we minimize Pr_{o∼μ̃}(A*(o) = y*) while maintaining the algorithm's performance on μ, i.e., keeping Pr_{o∼μ}(A*(o) = y*) = p*. We use 𝒜 to denote the set of all possible algorithms. We formalize the computation of the lower bound lb as the following constrained minimization objective:

$$lb \triangleq \min_{A^{*} \in \mathcal{A}}\ \Pr_{o \sim \tilde{\mu}}\big(A^{*}(o) = y^*\big) \quad \text{s.t.}\quad \Pr_{o \sim \mu}\big(A^{*}(o) = y^*\big) = p^*,\ \ \tilde{D} \in S^{\pi}_r(D),\ \ \tilde{x} \in \pi(x, y){\downarrow_1} \quad (5)$$

It is easy to see that lb is a lower bound of Pr_{o∼μ̃}(A(o) = y*) in Eq 4 because A ∈ 𝒜 and A satisfies the constraint of the minimization problem.
Thus, lb > 0.5 implies the correctness of Eq 4. We show how to construct A* and μ̃ greedily. For each outcome o = (D̄, x̄) in the sample space Ω, i.e., the set of all possible sampled datasets and test inputs, we define the likelihood ratio of o as η(o) = p_μ(o)/p_μ̃(o), where p_μ and p_μ̃ are the PMFs of μ and μ̃, respectively. The key idea is as follows: we partition Ω into finitely many disjoint subspaces L_1, ..., L_m such that the likelihood ratio in each subspace L_i is some constant η_i ∈ [0, ∞], i.e., ∀o ∈ L_i. η(o) = η_i. We sort and reorder the subspaces by likelihood ratio such that η_1 ≥ ... ≥ η_m. We denote the probability mass of μ on subspace L_i as p_μ(L_i).

**Example 5.1.** Suppose p_μ(o_1) = 4/10, p_μ̃(o_1) = 4/10, p_μ(o_2) = 1/10, p_μ̃(o_2) = 1/10, p_μ(o_3) = 4/10, p_μ̃(o_3) = 1/10, p_μ(o_4) = 1/10, p_μ̃(o_4) = 4/10. We can partition o_1 and o_2 into one subspace L because η(o_1) = η(o_2) = 1.

The construction of the A* that minimizes Eq 5 is a greedy process, which iteratively assigns A*(o) = y* on L_1, L_2, ... until the budget p* is met. The worst-case μ̃ can also be constructed greedily by maximizing the top-most likelihood ratios, and we can prove that the worst case happens when the difference between μ and μ̃ is maximized, i.e., when D̃ and x̃ are maximally perturbed. The following theorem adapts the Neyman-Pearson lemma to our setting.

**Theorem 5.1 (Neyman-Pearson Lemma for FL_s, F_s, L).** Let D̃ and x̃ be a maximally perturbed dataset and test input, i.e., |D̃ \ D| = r, ‖x̃ − x‖₀ = s, and ‖x̃_i − x_i‖₀ + 1_{ỹ_i ≠ y_i} = s for each perturbed example (x̃_i, ỹ_i) in D̃. Let i_lb be the smallest index i ∈ [1, m] such that $\sum_{j=1}^{i} p_\mu(L_j) \ge p^*$. Then,

$$lb = \sum_{j=1}^{i_{lb}-1} p_{\tilde{\mu}}(L_j) + \Big(p^* - \sum_{j=1}^{i_{lb}-1} p_\mu(L_j)\Big) \big/ \eta_{i_{lb}}.$$

We say that lb is tight due to the existence of the minimizers A* and μ̃ (see Appendix B).

**Example 5.2.** Following Example 5.1, suppose Ω = {o_1, ..., o_4} and we partition it into L_1 = {o_3}, L_2 = {o_1, o_2}, L_3 = {o_4}, with η_1 = 4, η_2 = 1, η_3 = 0.25. Let p* = 0.95; then we have i_lb = 3 and lb = 0.1 + 0.5 + (0.95 − 0.4 − 0.5)/0.25 = 0.8.
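The greedy computation in Theorem 5.1 is easy to state in code. Below is a small sketch (ours, using exact rational arithmetic) that takes the subspace masses already sorted by decreasing likelihood ratio and reproduces Example 5.2.

```python
from fractions import Fraction

def neyman_pearson_lb(subspaces, p_star):
    """Lower bound lb of Theorem 5.1.
    `subspaces` is a list of (p_mu(L_j), p_mutilde(L_j)) pairs, sorted by
    decreasing likelihood ratio eta_j = p_mu(L_j) / p_mutilde(L_j)."""
    lb, used = Fraction(0), Fraction(0)
    for p_mu, p_mutilde in subspaces:
        if used + p_mu < p_star:
            lb += p_mutilde                 # assign y* on the whole subspace
            used += p_mu
        else:                               # subspace i_lb: spend the remaining budget
            eta = p_mu / p_mutilde          # assumes p_mutilde > 0 here
            return lb + (p_star - used) / eta
    return lb

# Example 5.2: L1 = {o3}, L2 = {o1, o2}, L3 = {o4}, p* = 0.95
subspaces = [(Fraction(4, 10), Fraction(1, 10)),   # eta_1 = 4
             (Fraction(5, 10), Fraction(5, 10)),   # eta_2 = 1
             (Fraction(1, 10), Fraction(4, 10))]   # eta_3 = 0.25
print(neyman_pearson_lb(subspaces, Fraction(95, 100)))   # 4/5, i.e., lb = 0.8
```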
### 5.2 Computing the Certified Radius of BagFlip

Computing the certified radius boils down to computing lb in Eq 5 using Theorem 5.1. To compute lb for BagFlip, we address the following two challenges: (1) the computation of i_lb and the summations in Theorem 5.1 depend on the number of subspaces L_j; we design a partition strategy that partitions Ω into a polynomial number of subspaces (O(k²d)), where k is the size of the bag sampled from D and d is the number of features. (2) In Theorem 5.1, lb depends on p_μ(L_j), which according to its definition (Eq 8) would take exponential time (O(kd2^k)) to compute; we propose an efficient algorithm that computes these quantities in polynomial time (O(d³ + k²d²)). We first show how to address these challenges for F_s and then show how to handle FL_s and L.

**Partition strategy.** We partition the large sample space Ω into disjoint subspaces L_{c,t} indexed by c, the number of perturbed examples in the sampled dataset, and t, the difference between the number of features on which the sampled data differs from the clean data and the number on which it differs from the perturbed data. Intuitively, c takes care of the bagging distribution and t takes care of the feature-flipping distribution. Formally,

$$L_{c,t} = \Big\{ \big(\{(x'_{w_i}, y_{w_i})\}_i,\ x'\big) \;\Big|\; \sum_{i=1}^{k} \mathbb{1}_{x_{w_i} \neq \tilde{x}_{w_i}} = c, \quad (6)$$

$$\Big(\sum_{i=1}^{k} \|x'_{w_i} - x_{w_i}\|_0 + \|x' - x\|_0\Big) - \Big(\sum_{i=1}^{k} \|x'_{w_i} - \tilde{x}_{w_i}\|_0 + \|x' - \tilde{x}\|_0\Big) = t \Big\} \quad (7)$$

Eq 6 means that there are c perturbed indices sampled in o. The first sum ℓ (and the second sum ℓ̃) in Eq 7 counts how many features the sampled data differs from the clean data (resp. the perturbed data) on, so t = ℓ − ℓ̃. The number of possible subspaces L_{c,t}, which are disjoint by definition, is O(k²d) because 0 ≤ c ≤ k and |t| ≤ (k + 1)d. The next theorems show how to compute the likelihood ratio of L_{c,t} and the mass p_μ(L_{c,t}).

**Theorem 5.2 (Computing η_{c,t}).** η_{c,t} = p_μ(L_{c,t}) / p_μ̃(L_{c,t}) = (γ/ρ)^t, where γ and ρ are the parameters controlling the noise level in BagFlip's smoothing distribution μ.

**Theorem 5.3 (Computing p_μ(L_{c,t})).** p_μ(L_{c,t}) = Pr_{o∼μ}(o satisfies Eq 6) · Pr_{o∼μ}(o satisfies Eq 7 | o satisfies Eq 6), where Pr_{o∼μ}(o satisfies Eq 6) = Binom(c; k, r/n) is the PMF of the binomial distribution and

$$T(c, t) \triangleq \Pr_{o \sim \mu}(o \text{ satisfies Eq 7} \mid o \text{ satisfies Eq 6}) = \sum_{\substack{0 \le \ell_i, \tilde{\ell}_i \le d,\ i \in [0, c] \\ (\ell_0 - \tilde{\ell}_0) + \cdots + (\ell_c - \tilde{\ell}_c) = t}}\ \prod_{i=0}^{c} L(\ell_i, \tilde{\ell}_i; s, d)\, \gamma^{\ell_i} \rho^{d - \ell_i}, \quad (8)$$

where L(ℓ, ℓ̃; s, d) is the same quantity defined in Lee et al. [18].

**Remark 5.1.** We can compute p_μ̃(L_{c,t}) as p_μ(L_{c,t}) / η_{c,t}, by the definition of η_{c,t}.

**Efficient algorithm to compute Eq 8.** The following algorithm computes T(c, t) efficiently, in time O(d³ + k²d²): (1) computing L(u, v; s, d) takes O(d³) (see Appendix C for details); (2) computing T(0, t) directly from the definition in Eq 8 takes O(kd²); (3) computing T(c, t) for c ≥ 1 by the following recurrence takes O(k²d²):

$$T(c, t) = \sum_{t_1 = \max(-d,\, t - cd)}^{\min(d,\, t + cd)} T(c-1,\, t - t_1)\, T(0, t_1) \quad (9)$$

**Theorem 5.4 (Correctness of the Algorithm).** T(c, t) in Eq 9 is the same as the one in Eq 8.

The above computation also applies to the perturbation functions FL_s and L. Intuitively, flipping the label can be seen as flipping one additional dimension of the input features (details in Appendix B.1).
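A sketch of step (3): given T(0, ·) (computed from Eq 8), the recurrence in Eq 9 is a repeated discrete convolution. The dictionary-based indexing below is our own choice, not the paper's implementation.

```python
def convolve_T(T0, c_max, d):
    """Compute T(c, t) for c = 0, ..., c_max from T(0, t) using Eq 9.
    `T0` maps t in [-d, d] to T(0, t); the result maps (c, t) to T(c, t)."""
    T = {(0, t): v for t, v in T0.items()}
    for c in range(1, c_max + 1):
        for t in range(-(c + 1) * d, (c + 1) * d + 1):
            T[(c, t)] = sum(
                T.get((c - 1, t - t1), 0) * T0.get(t1, 0)
                for t1 in range(max(-d, t - c * d), min(d, t + c * d) + 1)
            )
    return T
```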
**Practical perspective.** For each test input x, we need to estimate p* for the smoothed algorithm Ā given the benign dataset D. We use Monte Carlo sampling and bound p* via the Clopper-Pearson interval. We also reuse the trained models across all test inputs in the test set by applying a Bonferroni correction. We memoize the certified radius by enumerating all possible values of p* beforehand, so that checking Eq 4 can be done in constant time in an online scenario (details in Appendix B.2). We cannot use floating-point numbers because some intermediate values are too small to represent in floating-point format; instead, we use rational numbers whose numerators and denominators are stored as arbitrary-precision integers.

## 6 An Efficient Relaxation for Computing a Certified Radius

Theorem 5.1 requires computing the likelihood ratio of a large number of subspaces L_1, ..., L_m. We propose a generalization of the Neyman-Pearson lemma that underapproximates the set of subspaces by a small subset and computes a lower bound lb_δ of lb. We then show how to choose a subset of subspaces that yields a tight underapproximation.

**A relaxation of the Neyman-Pearson lemma.** The key idea is to apply Theorem 5.1 to only a subset of the subspaces {L_i}_i. Suppose that we partition the subspaces {L_i}_i into two sets {L_i}_{i∈B} and {L_i}_{i∉B}, using a set of indices B. We define a new lower bound lb_δ by applying Theorem 5.1 to the first group {L_i}_{i∈B} and underapproximating p* in Theorem 5.1 as p* − Σ_{i∉B} p_μ(L_i). Intuitively, the underapproximation ensures that no subspace in {L_i}_{i∉B} contributes to the lower bound, even though these subspaces may have contributed to the precise lb computed by Theorem 5.1. The next theorem shows that lb_δ is smaller than lb by at most an error term δ that is a function of the partition B. The gain is in the number of subspaces we have to consider, which is now |B|.

**Theorem 6.1 (Relaxation of the Lemma).** Define lb for the subspaces L_1, ..., L_m as in Theorem 5.1. Define lb_δ for {L_i}_{i∈B} as in Theorem 5.1, underapproximating p* as p* − Σ_{i∉B} p_μ(L_i). Then we have:
- Soundness: lb_δ ≤ lb, and
- δ-Tightness: letting δ ≜ Σ_{i∉B} p_μ̃(L_i), we have lb_δ + δ ≥ lb.

**Example 6.1.** Consider L_1, L_2, and L_3 from Example 5.2. If we set B = {1, 2} and underapproximate p* as p* − p_μ(L_3) = 0.85, then we have lb_δ = 0.1 + (0.85 − 0.4)/1 = 0.55, which can still certify Eq 4. However, lb_δ is not close to the original lb = 0.8 because the error δ = 0.4 is large in this example. Next, we show how to choose B as small as possible while keeping δ small.

**Speeding up the radius computation.** The total time for computing lb consists of two parts: (1) the efficient algorithm (Eq 9) computes T(c, t) in O(d³ + k²d²), and (2) Theorem 5.1 takes O(k²d) for a given p* because there are O(k²d) subspaces. Even though the two techniques in Section 5.2 make the computation polynomial, it can still be slow for large bag sizes k, which can easily be in the hundreds or thousands. We apply Theorem 6.1 to replace the bag size k with a small constant κ. Observe that p_μ(L_{c,t}) and p_μ̃(L_{c,t}), as defined in Section 5.2, are negligible for large c, the number of perturbed indices in the sampled dataset: if only a small portion of the training set is perturbed, it is unlikely that many perturbed indices are selected into the sampled bag. We underapproximate the full set of subspaces to {L_{c,t}}_{(c,t)∈B} by choosing the index set B = {(c, t) | c ≤ κ, |t| ≤ (c + 1)d}, where κ is a constant controlling the error δ, which can be computed as δ = Σ_{i∉B} p_μ̃(L_i) = 1 − Σ_{c=0}^{κ} Binom(c; k, r/n). Theorem 6.1 reduces the number of subspaces to |B| = O(κ²d). As a result, every k appearing in the previous time complexities can be replaced with κ.

**Example 6.2.** Suppose k = 150 and r/n = 0.5%; choosing κ = 6 leads to δ = 1.23 × 10⁻⁵. In other words, this speeds up the computation of T(c, t) and the application of Theorem 5.1 by a factor of almost 625.
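The truncation error in Example 6.2 depends only on a binomial tail, so it can be checked directly; a small sketch of ours:

```python
from math import comb

def truncation_error(k, poison_frac, kappa):
    """delta = 1 - sum_{c=0}^{kappa} Binom(c; k, r/n): the mu-tilde mass of the
    discarded subspaces {L_{c,t} : c > kappa}."""
    p = poison_frac
    return 1 - sum(comb(k, c) * p**c * (1 - p)**(k - c) for c in range(kappa + 1))

print(truncation_error(k=150, poison_frac=0.005, kappa=6))   # ~1.23e-05, as in Example 6.2
```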
## 7 Experiments

The implementation of BagFlip is publicly available.² In this section, we evaluate BagFlip against trigger-less and backdoor attacks and compare BagFlip with existing work. Note that we apply the relaxation from Section 6 to BagFlip in all experiments and set δ = 10⁻⁴.

² https://github.com/ForeverZyh/defend_framework

**Datasets.** We conduct experiments on MNIST, CIFAR10, EMBER [2], and Contagio (http://contagiodump.blogspot.com). MNIST and CIFAR10 are image-classification datasets. EMBER and Contagio are malware-detection datasets, where data poisoning can lead to disastrous security consequences. To align with existing work, we select subsets of MNIST and CIFAR10, denoted MNIST-17, MNIST-01, and CIFAR10-02, in some experiments; e.g., MNIST-17 is the subset of MNIST with classes 1 and 7. We discretize all features when applying BagFlip. We use clean datasets except in the comparison with RAB [39], where we follow their experimental setup and use backdoored datasets generated by BadNets [12]. A detailed description of the datasets is in Appendix E.1.

**Models.** We train neural networks for MNIST, CIFAR10, and EMBER and random forests for Contagio. Unless we specifically mention otherwise in an experiment, whenever we compare BagFlip to an existing work we use the same network structures, hyper-parameters, and data augmentation as the compared work, and we train N = 1000 models and set the confidence level to 0.999 for each configuration. All configurations can be found in the implementation.

**Metrics.** For each test input (x_i, y_i), the algorithm Ā predicts a label and a certified radius r_i. Assuming that the attacker has poisoned R% of the examples in the training set, we define the certified accuracy as the percentage of test inputs that are correctly classified and whose certified radii are no less than R% of the training set, i.e., $\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{\bar{A}(D,x_i)=y_i \,\wedge\, r_i \ge n \cdot R\%}$. We define the normal accuracy as the percentage of test inputs that are correctly classified, i.e., $\frac{1}{m}\sum_{i=1}^{m} \mathbb{1}_{\bar{A}(D,x_i)=y_i}$.

Figure 1: Comparison to Bagging on MNIST and EMBER, showing the certified accuracy (y-axis) at different poisoning amounts R (x-axis, %). Panels: (a) MNIST F_1, (b) MNIST F_∞, (c) EMBER F_1, (d) EMBER F_∞; curves compare Bagging, BagFlip-a, and BagFlip-b. For MNIST: a = 0.9 and b = 0.8. For EMBER: a = 0.95 and b = 0.9.

Table 1: Certified accuracy (%) on MNIST and EMBER with perturbations F_1 and F_∞. Note that the certified accuracies at poisoning amount R = 0 are the same for F_1 and F_∞ because we reuse the trained models.

MNIST:

| Method | F_1, R=0 | R=0.5 | R=1.0 | R=1.5 | R=2.0 | R=2.5 | F_∞, R=0 | R=0.13 | R=0.27 | R=0.40 | R=0.53 | R=0.83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagging | 94.54 | 66.84 | 0 | 0 | 0 | 0 | 94.54 | 90.83 | 85.45 | 77.61 | 61.46 | 0 |
| Bagging-0.9 | 93.58 | 71.11 | 0 | 0 | 0 | 0 | 93.58 | 89.80 | 84.92 | 78.60 | 68.11 | 0 |
| BagFlip-0.9 | 93.62 | 75.95 | 27.73 | 4.02 | 0 | 0 | 93.62 | 89.45 | 83.62 | 74.19 | 56.65 | 0 |
| BagFlip-0.8 | 90.72 | 73.94 | 46.20 | 33.39 | 24.23 | 5.07 | 90.72 | 84.11 | 74.50 | 60.56 | 39.21 | 0 |

EMBER:

| Method | F_1, R=0 | R=0.07 | R=0.13 | R=0.20 | R=0.27 | R=0.33 | F_∞, R=0 | R=0.05 | R=0.10 | R=0.15 | R=0.20 | R=0.25 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bagging | 82.65 | 75.11 | 66.01 | 0 | 0 | 0 | 82.65 | 76.30 | 72.94 | 61.78 | 0 | 0 |
| Bagging-0.95 | 79.06 | 75.32 | 70.19 | 14.74 | 0 | 0 | 79.06 | 76.23 | 73.50 | 68.45 | 14.74 | 0 |
| BagFlip-0.95 | 79.17 | 75.93 | 69.30 | 57.36 | 0 | 0 | 79.17 | 76.83 | 72.41 | 62.04 | 30.36 | 0 |
| BagFlip-0.9 | 78.18 | 69.40 | 62.11 | 41.21 | 13.89 | 1.70 | 78.18 | 70.79 | 63.16 | 41.24 | 11.64 | 0 |

### 7.1 Defending Against Trigger-less Attacks

Two model-agnostic certified approaches, Bagging [15] and LabelFlip [28], can defend against trigger-less attacks. BagFlip outperforms LabelFlip (comparison in Appendix E.2). We evaluate BagFlip on the perturbation F_s using MNIST, CIFAR10, EMBER, and Contagio and compare BagFlip to Bagging. We provide comparisons with Bagging on other perturbation spaces in Appendix E.2.

We present BagFlip with different noise levels (different values of the noise parameter ρ), denoted BagFlip-ρ. When comparing to Bagging, we use the same k, the size of the sampled datasets, for a fair comparison. Furthermore, we tune the parameter k in Bagging to match the normal accuracy of the corresponding BagFlip-ρ setting, and denote this setting Bagging-ρ. Concretely, we set k = 100, 1000, 300, 30 for MNIST, CIFAR10, EMBER, and Contagio, respectively, when comparing to Bagging. We tune k = 80 for Bagging-0.9 on MNIST and k = 280 for Bagging-0.95 on EMBER, and we set k = 50 for MNIST when comparing to LabelFlip.

**Comparison to Bagging.** Bagging on a discrete dataset is a special case of BagFlip with ρ = 1, i.e., no noise added to the dataset. (We present the results of Bagging on the original, undiscretized dataset in Appendix E.2.) Table 1 and Figure 1 show the results of BagFlip and Bagging on perturbations F_1 and F_∞ over MNIST and EMBER. The results for CIFAR10 and Contagio are similar and shown in Appendix E.2. F_1 only allows the attacker to modify one feature in each training example, while F_∞ allows the attacker to modify each training example arbitrarily without constraint. For F_1 on MNIST, BagFlip-0.9 performs better than Bagging after R = 0.19, and BagFlip-0.8 still retains non-zero certified accuracy at R = 2.5, while Bagging's certified accuracy drops to zero after R = 0.66. We observe similar results on EMBER for F_1 in Table 1, except at R = 0.13 when compared to Bagging-0.95.
We argue that it is possible to tune a better combination of k and ρ for BagFlip, as we tuned k for Bagging-0.95, and achieve a better result while maintaining similar normal accuracy. However, we did not conduct hyperparameter tuning for BagFlip because of its computational cost. BagFlip achieves higher certified accuracy than Bagging when the poisoning amount is large for F_1.

For F_∞ on MNIST, Bagging performs better than BagFlip across all R because the noise BagFlip adds to the training set hurts accuracy. However, we find that the noise added by BagFlip helps it perform better for F_∞ on EMBER. Specifically, BagFlip achieves similar certified accuracy to Bagging at small radii and performs better than Bagging after R = 0.15. Bagging achieves higher certified accuracy than BagFlip for F_∞, except on EMBER, where BagFlip achieves higher certified accuracy than Bagging when the amount of poisoning is large.

We also study the effect of different s in the perturbation function F_s. Figure 2(a) shows the result of BagFlip-0.8 on MNIST. BagFlip has the highest certified accuracy for F_1. As s increases, the result monotonically converges to the curve of BagFlip-0.8 in Figure 1(b). Bagging ignores the perturbation function and performs the same across all s; it performs better than BagFlip-0.8 when s ≥ 8. Other results for different datasets and noise levels follow a similar trend (see Appendix E.2). BagFlip performs best on F_1 and monotonically degrades to the F_∞ behavior as s increases.

Figure 2: (a) BagFlip-0.8 on MNIST against F_s with different s (s = 1, 2, 4, 8, ∞); the curve for s = 8 almost overlaps with s = ∞. (b) BagFlip-0.8 on MNIST and BagFlip-0.9 on Contagio against the backdoor attack with F_1. Dashed lines show normal accuracy. Both panels plot certified accuracy against the poisoning amount R (%).

### 7.2 Defending Against Backdoor Attacks

Two model-agnostic certified approaches, FeatFlip [35] and RAB [39], can defend against backdoor attacks. We compare BagFlip to FeatFlip on MNIST-17 perturbed using FL_1, and compare BagFlip to RAB on poisoned MNIST-01 and CIFAR10-02 (poisoned by BadNets [12]) perturbed using F_1. We further evaluate BagFlip on the perturbation F_1 using MNIST and Contagio, on which existing work is unable to compute a meaningful certified radius.

Table 2: Comparison to FeatFlip on FL_1 (MNIST-17) and to RAB on F_1 (MNIST-01 and CIFAR10-02), reporting normal accuracy (Acc.) and certified accuracy at poisoning amount R (CF@R).

MNIST-17 (FL_1):

| Method | Acc. | CF@0 | CF@0.5 | CF@1.0 |
|---|---|---|---|---|
| FeatFlip | 98 | 36 | 0 | 0 |
| BagFlip-0.95 | 97 | 81 | 60 | 0 |
| BagFlip-0.9 | 97 | 72 | 46 | 4 |

MNIST-01 (F_1):

| Method | Acc. | CF@0 | CF@1.5 | CF@3.0 |
|---|---|---|---|---|
| RAB | 100.0 | 74.4 | 0 | 0 |
| BagFlip-0.8 | 99.6 | 98.4 | 96.5 | 84.6 |
| BagFlip-0.7 | 99.5 | 98.0 | 95.8 | 91.9 |

CIFAR10-02 (F_1):

| Method | Acc. | CF@0 | CF@0.1 | CF@0.2 |
|---|---|---|---|---|
| RAB | 73.3 | 0 | 0 | 0 |
| BagFlip-0.8 | 73.1 | 16.8 | 11.0 | 6.8 |
| BagFlip-0.7 | 71.7 | 41.0 | 34.6 | 29.2 |

**Comparison to FeatFlip.** FeatFlip scales poorly compared to BagFlip because its computation of the certified radius is polynomial in the size of the whole training set. BagFlip's computation is polynomial in the size of the bag instead, and it uses a relaxation of the Neyman-Pearson lemma for a further speedup. BagFlip is more scalable than FeatFlip. We directly cite FeatFlip's results from Wang et al. [35] and note that FeatFlip is evaluated on a subset of MNIST-17. As shown in Table 2, BagFlip achieves similar normal accuracy but much higher certified accuracy than FeatFlip across all R (see full results in Appendix E.2). BagFlip significantly outperforms FeatFlip against FL_1 on MNIST-17.

**Comparison to RAB.**
RAB assumes a perturbation function that perturbs the input within an l2-norm ball of radius s. We compare BagFlip to RAB on F_1, where the l0-norm ball (our perturbation function) and the l2-norm ball coincide because the feature values are within [0, 1]. Since RAB targets single-test-input prediction, we do not use the Bonferroni correction for BagFlip, for a fair comparison. We directly cite RAB's results from Weber et al. [39]. Table 2 shows that on MNIST-01 and CIFAR10-02, BagFlip achieves similar normal accuracy but much higher certified accuracy than RAB for all values of R (detailed figures in Appendix E.2). BagFlip significantly outperforms RAB against F_1 on MNIST-01 and CIFAR10-02.

**Results on MNIST and Contagio.** Figure 2(b) shows the results of BagFlip on MNIST and Contagio. For a fixed R, the certified accuracy for the backdoor attack is much smaller than the certified accuracy for the trigger-less attack (Figure 1 and Figure 3 in Appendix E.2) because backdoor attacks are strictly stronger than trigger-less attacks. BagFlip cannot provide effective certificates for backdoor attacks on the more complex datasets CIFAR10 and EMBER, i.e., the certified radius is almost zero. BagFlip can provide certificates against backdoor attacks on MNIST and Contagio, while its certificates are not effective for CIFAR10 and EMBER.

### 7.3 Computation Cost Analysis

We discuss the computation cost of BagFlip on the MNIST dataset and compare it to the other baselines.

**Training.** The training cost of BagFlip is similar to that of all the baselines because BagFlip only adds noise to the training data. BagFlip and the other baselines take about 16 hours on a single GPU to train N = 1000 classifiers on the MNIST dataset.

**Inference.** At inference time, BagFlip first evaluates the predictions of the N classifiers and counts how many classifiers predict the majority label (N_1) and how many predict the runner-up label (N_2). Then, BagFlip queries a precomputed lookup table for the radius certified by N_1 and N_2. The inference time for each example thus consists of evaluating N classifiers plus an O(1) table lookup, so there is no difference between BagFlip and the other baselines.

**Preparation.** BagFlip needs to prepare a table of size O(N²) to perform efficient lookups at inference time. The time complexity of preparing each table entry is presented in Sections 5 and 6. On the MNIST dataset, BagFlip with the relaxation proposed in Section 6 (δ = 10⁻⁴) needs 2 hours to prepare the lookup table on a single core, whereas the precise BagFlip of Section 5 needs 85 hours. Bagging also uses a lookup table, which can be built in 16 seconds on MNIST (Bagging only needs to do a binary search for each entry). FeatFlip would need approximately 8000 TB of memory to compute its table; thus, FeatFlip is infeasible to run on the full MNIST dataset and is only evaluated on a subset of MNIST-17 containing 100 training examples. RAB does not need to compute a lookup table because it has a closed-form solution for the certified radius.

BagFlip has training and inference times similar to the other baselines. The relaxation technique of Section 6 reduces the preparation time from 85 hours to 2 hours. Even with the relaxation, BagFlip needs more preparation time than Bagging and RAB. We argue that the preparation of BagFlip remains feasible because it takes only 12.5% of the time required for training.
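For completeness, here is a rough sketch of the inference-and-certification path described in this section: take the majority vote of the N classifiers, lower-bound p* with a one-sided Clopper-Pearson interval, and look up the precomputed radius. The function name, the `lookup_radius` callable, and the omission of the Bonferroni correction are our simplifications, not the paper's exact procedure.

```python
import numpy as np
from scipy.stats import beta

def predict_and_certify(votes, lookup_radius, alpha=0.001):
    """votes: length-N array of labels, one per trained classifier.
    lookup_radius: callable mapping a lower bound on p* to a certified radius
    (the table prepared offline, as described above)."""
    labels, counts = np.unique(votes, return_counts=True)
    y_hat = labels[np.argmax(counts)]                 # prediction of the smoothed model
    n1, n = int(counts.max()), len(votes)
    # one-sided Clopper-Pearson lower confidence bound on p* = Pr[A(o) = y_hat]
    p_lower = beta.ppf(alpha, n1, n - n1 + 1)
    return y_hat, lookup_radius(p_lower)
```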
## 8 Conclusion, Limitations, and Future Work

We presented BagFlip, a certified probabilistic approach that can effectively defend against both trigger-less and backdoor attacks. We foresee many future improvements to BagFlip. First, BagFlip treats both the data and the underlying machine learning models as closed boxes; assuming a specific data distribution and training algorithm could further improve the computed certified radius. Second, BagFlip uniformly flips the features and the label, while it would be desirable to adjust the noise levels of the label and of important features according to the distribution of the data, for better normal accuracy. Third, while probabilistic approaches need to retrain thousands of models after a fixed number of predictions, deterministic approaches can reuse models for every prediction; thus, it would be interesting to develop a deterministic model-agnostic approach that can defend against both trigger-less and backdoor attacks.

**Acknowledgments and Disclosure of Funding.** We thank the anonymous reviewers and Anna P. Meyer for their thoughtful and generous comments and Gurindar S. Sohi for giving us access to his GPUs. This work is supported by grants CCF-1750965, CCF-1918211, CCF-1652140, CCF-2106707, CCF-1652140, a Microsoft Faculty Fellowship, and gifts and awards from Facebook and Amazon.