Robust Learning for Data Poisoning Attacks

Yunjuan Wang 1, Poorya Mianjy 1, Raman Arora 1

1 Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Raman Arora.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We investigate the robustness of stochastic approximation approaches against data poisoning attacks. We focus on two-layer neural networks with ReLU activations and show that, under a specific notion of separability in the RKHS induced by the infinite-width network, training (finite-width) networks with stochastic gradient descent is robust against data poisoning attacks. Interestingly, we find that in addition to a lower bound on the width of the network, which is standard in the literature, we also require a distribution-dependent upper bound on the width for robust generalization. We provide extensive empirical evaluations that support and validate our theoretical results.

1. Introduction

Machine learning models based on neural networks power state-of-the-art systems for various real-world applications, including self-driving autonomous vehicles (Grigorescu et al., 2020), speech recognition (Afouras et al., 2018), reinforcement learning (Li, 2017), etc. Neural networks trained using stochastic gradient descent (SGD) perform well both in terms of optimization (training) and generalization (prediction). However, with great power comes great responsibility, and as several recent studies indicate, systems based on neural networks admit vulnerabilities in the form of adversarial attacks. Especially in over-parametrized settings (wherein the number of parameters is much larger than the training sample size), which are typical in most applications, neural networks remain extraordinarily fragile and can be made to depart from their expected behavior by strategically induced perturbations in data. One such limitation is due to arbitrary adversarial corruption of data at the time of training, commonly referred to as data poisoning. Such attacks present a challenging problem, especially in settings where an adversary can affect any part of the training data. Therefore, in this paper, we are interested in quantifying the maximal adversarial noise that is tolerable by SGD when training wide ReLU networks.

One of the earliest works to consider algorithms provably tolerant to a quantifiable amount of error in the training examples is that of Valiant (1985), motivated by a need to understand the limitations of the PAC learning framework. This was followed by a series of works that considered computationally unbounded adversaries and posed the question of bounding the error rate tolerable by a learning algorithm in a worst-case model of errors (Kearns & Li, 1993; Guruswami & Raghavendra, 2009). These hardness results were later complemented by positive results (Klivans et al., 2009; Awasthi et al., 2014; Diakonikolas et al., 2019a), which give learning algorithms that enjoy information-theoretically optimal noise tolerance. Much of this prior work focuses on learning halfspaces (i.e., linear separators) in Valiant's PAC learning model (Valiant, 1984). Instead, we consider Vapnik's general setting of learning, and are interested in convex learning problems and over-parametrized neural networks with ReLU activations.
While our theoretical understanding of deep learning has increased vastly in the last few years, with several results characterizing the ability of gradient descent to achieve small training loss in the over-parameterized regime, our understanding of the robustness of such methods to attacks such as data poisoning remains limited.

Arguably, the simplest model of data poisoning is one in which the input features are perturbed, additively, by norm-bounded vectors. A more challenging scenario is one where both input features and labels can be corrupted; this is essentially the noise model considered by Valiant (1985); Kearns & Li (1993); Awasthi et al. (2014). A related model, studied by Cesa-Bianchi et al. (2011), is one where the learner observes only a noisy version of the data, in a streaming setting, with the noise distribution changing arbitrarily after each round. Yet another poisoning attack, studied extensively in the literature, is one where the adversary can plant a fraction of the training data; for example, consider movie ratings contributed by malicious users in matrix completion. Recent works have studied numerous other practical data poisoning methods including backdoor attacks, data injection, clean label attacks, and label-flip attacks (we discuss these further in related work).

While several defenses have been proposed, each tailored to a specific data poisoning attack, there is no unified, robust learning framework against such attacks. Furthermore, the proposed defenses often depart significantly from the practice of modern machine learning, which increasingly relies on stochastic approximation algorithms such as stochastic gradient descent (SGD), stochastic mirror descent, and their variants. Therefore, it is natural to ask whether stochastic approximation algorithms, such as SGD, impart robustness to learning against adversarial perturbations of training data.

In this paper, we investigate the robustness of SGD against various data poisoning attacks for convex learning problems as well as for training two-layer over-parameterized neural networks with ReLU activations. Surprisingly, our results show that SGD achieves optimal convergence rates on the excess risk, despite data poisoning, with only a mild deterioration in overall performance, even as the overall noise budget of the adversarial attack grows with the sample size, albeit sublinearly.

Our main contributions in this paper are as follows. In Section 2, we first consider the clean label attack, where the adversary can additively perturb the input features but not the target labels. In this setting, we show that stochastic gradient descent robustly learns a classifier as long as the overall perturbation is sublinear in the sample size. We extend our results to a more general class of data poisoning attacks and study them in a unified framework of oracle poisoning. In Section 3, we extend our results to two-layer over-parameterized neural networks with ReLU activations. We discuss the clean label attack and the label flip attack separately, and establish guarantees for SGD in three regimes under a data-dependent margin assumption. Our bounds hold in the regime where neural networks are moderately wide but not too wide, supporting the conjecture that extreme over-parametrization may render learning susceptible to data poisoning. This is in stark contrast to existing results in deep learning theory that argue for wider networks for better generalization.
We validate our theoretical results with empirical evaluations on real datasets in Section 5. We confirm that the clean-test accuracy exhibits an inverted U-curve when the training data is poisoned, in all of the noisy regimes we consider. In the process, we also discover a new loss function that yields stronger poisoning attacks, which might be of independent interest.

1.1. Problem Setup

We focus on the task of binary classification in the presence of data poisoning attacks. We denote the input and the label spaces, respectively, by $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. We assume that the data $(x, y)$ are drawn from an unknown joint distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. In a general (clean-data) learning framework, the learner is provided with $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$, and the goal is to learn a function $f_w: \mathcal{X} \to \mathcal{Y}$, parameterized by $w$ in some parameter space $\mathcal{W}$, with a small generalization error, i.e., small 0-1 loss with respect to the population, $L(w) := \mathbb{P}_{(x,y)\sim\mathcal{D}}(y f_w(x) \leq 0)$.

We model data poisoning attacks as a malicious adversary who sits between the distribution and the learner. The adversary receives an i.i.d. sample $S := \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ of size $n$, generates the poisoned sample $\widetilde{S} := \{(\widetilde{x}_i, \widetilde{y}_i)\}_{i=1}^n$, and passes it over to the learner. For example, in the clean label attack, the adversary perturbs the input as $\widetilde{x}_i = x_i + \delta_i$, where each perturbation $\delta_i$ belongs to a perturbation space $\Delta$, and leaves the labels intact, i.e., $\widetilde{y}_i = y_i$. Note that in this model, no distributional assumptions are made on the adversarial perturbations. Another example is the label flip attack, whereby the adversary does not poison the input, i.e., $\widetilde{x}_i = x_i$, but flips the sign of each label with probability $\beta$. More precisely, $\widetilde{y}_i = -y_i$ with probability $\beta$ and $\widetilde{y}_i = y_i$ otherwise. We focus on the setting where the adversary has access to the clean data $S$ and is computationally unbounded. In other words, the adversary chooses how to attack the optimal model (e.g., the empirical risk minimizer), given the sample. However, the adversary has no knowledge of the random bits used by the learner, e.g., when training using stochastic gradient descent.

A common approach to the clean-data learning problem is solving the stochastic optimization problem $\min_{w \in \mathcal{W}} F(w) := \mathbb{E}_{\mathcal{D}}[\ell(y f(x; w))]$, where $\ell: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a convex surrogate for the 0-1 loss. In practice, this is usually done using first-order optimization techniques such as stochastic gradient descent (SGD) and its variants. The statistical and computational learning-theoretic aspects of such methods have been extensively studied in the literature; however, their robustness to data poisoning attacks is not yet well understood. Therefore, the central question we ask is the following: can SGD robustly and efficiently learn certain hypothesis classes?

In full generality, of course, the answer to the above question is negative: no learning is possible if we do not impose any restrictions on the perturbations, i.e., the set $\Delta$. Therefore, in this paper, we identify conditions on the perturbations under which SGD can efficiently and robustly learn important hypothesis classes such as linear models as well as two-layer neural networks. In particular, our analysis crucially depends on the following measures of perturbation: 1) the per-sample corruption budget $B := \max_i \|\delta_i\|$; 2) the overall corruption budget $S := \sum_{i=1}^n \|\delta_i\|$; or 3) the label-flip probability $\beta$.
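To make these threat models concrete, the following sketch generates a clean-label poisoning and a label-flip poisoning of a synthetic dataset and reports the per-sample budget $B$ and the overall budget $S$. This is only an illustration of the definitions above: the particular perturbation direction (pushing each point against a reference linear separator) and all variable names are assumptions made for the example, not part of our formal setup.

```python
import numpy as np

def clean_label_attack(X, y, w_ref, per_sample_budget):
    """Clean-label attack: add a perturbation delta_i to each x_i, keep y_i.

    The direction of delta_i (against a reference separator w_ref) is a
    placeholder choice; any delta_i respecting the budgets is admissible.
    """
    direction = -y[:, None] * (w_ref / np.linalg.norm(w_ref))[None, :]
    delta = per_sample_budget * direction            # ||delta_i|| = per_sample_budget
    B = np.max(np.linalg.norm(delta, axis=1))        # per-sample budget B = max_i ||delta_i||
    S = np.sum(np.linalg.norm(delta, axis=1))        # overall budget S = sum_i ||delta_i||
    return X + delta, y.copy(), B, S

def label_flip_attack(X, y, beta, rng):
    """Label-flip attack: keep each x_i, flip each label with probability beta."""
    flips = rng.random(len(y)) < beta
    return X.copy(), np.where(flips, -y, y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_ref = rng.standard_normal(10)
y = np.sign(X @ w_ref)                               # labels from the reference separator

X_cl, y_cl, B, S = clean_label_attack(X, y, w_ref, per_sample_budget=0.5)
X_lf, y_lf = label_flip_attack(X, y, beta=0.1, rng=rng)
print(f"clean-label attack: B = {B:.2f}, S = {S:.1f}")
print(f"label-flip attack: flipped {np.mean(y_lf != y):.1%} of the labels")
```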
We denote scalars, vectors, and matrices, respectively, with lowercase italic, lowercase bold, and uppercase bold Roman letters, e.g., $u$, $\mathbf{u}$, and $\mathbf{U}$. The $\ell_2$ norm is denoted by $\|\cdot\|$. Throughout, we use the standard $O$-notation ($O$ and $\Omega$). Further, we use $\lesssim$ and $O$ interchangeably. We use $\widetilde{O}$ to hide poly-logarithmic dependence on the parameters.

1.2. Related Work

In this section, we survey related prior work on data poisoning attacks and defense strategies, and on the convergence analysis of gradient-descent-based methods for training wide networks.

Data poisoning attacks and defenses. A data poisoning attack, or causative attack, aims at manipulating training samples or the model architecture so as to cause misclassification of subsequent inputs associated with a specific label (a targeted attack) or to manipulate predictions on data from all classes (an indiscriminate attack). A popular data poisoning attack is the backdoor attack, where the adversary injects strategically manipulated samples (referred to as a backdoor, or trigger, pattern) with a target label into the training data. At prediction time, samples that do not contain the trigger pattern are categorized correctly, but samples that carry the trigger are likely misclassified as belonging to the target label class (Gu et al., 2017; Liu et al., 2017; Chen et al., 2017). One of the shortcomings of the standard backdoor attack is that the poisoned samples are clearly mislabeled, which can arouse suspicion if subjected to human inspection. This led to research on what are known as clean label attacks (Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019), which focus on adding human-imperceptible perturbations to input features without flipping the labels of the corrupted inputs. Another attack category is that of label-flip attacks, where the adversary can change the labels of a constant fraction of the training sample (Biggio et al., 2011; Xiao et al., 2012; Zhao et al., 2017).

Several defense mechanisms have been proposed to counter the data poisoning attacks described above. For label-flip attacks, Awasthi et al. (2014) focus on the malicious noise model and construct an algorithm that finds a halfspace achieving $\epsilon$ error while tolerating a noise rate of $\Omega(\epsilon)$ for isotropic log-concave distributions. Recently, Diakonikolas et al. (2019a) proposed a $\mathrm{poly}(d, 1/\epsilon)$-time algorithm that solves the same problem under Massart noise. For backdoor attacks, Liu et al. (2018) and Tran et al. (2018) propose strategies to identify the trigger pattern and the poisoned samples. Several other works have followed up on this idea of data sanitization (outlier removal) (Barreno et al., 2010; Suciu et al., 2018; Jagielski et al., 2018; Diakonikolas et al., 2019b; Wang et al., 2019). For certified defenses, Steinhardt et al. (2017) analyze oracle and data-dependent defenses by constructing an approximate upper bound on the loss. Recently, Rosenfeld et al. (2020) applied randomized smoothing to build a certifiably robust linear classifier against label-flip attacks.

Convergence analysis of gradient descent for wide networks. Our analysis builds on recent advances in the theoretical deep learning literature, which focuses on analyzing the trajectory of first-order optimization methods in the limit that the network width goes to infinity (Li & Liang, 2018; Du et al., 2019b;a; Allen-Zhu et al., 2018; Zou et al., 2018; Cao & Gu, 2019).
The main insight from this body of work is that when training a sufficiently over-parameterized network using gradient descent, if the initialization is large and the learning rate is small, the weights of the network remain close to the initialization; therefore, the dynamics of the network predictions is approximately linear in the feature space induced by the gradient of the network at the initialization (Li & Liang, 2018; Chizat et al., 2018; Du et al., 2019b; Lee et al., 2019). We are particularly inspired by a recent work of Ji & Telgarsky (2019), which studies the setting where the data distribution is separable in this feature space, an assumption that was first introduced and studied by Nitanda & Suzuki (2019). While our assumptions and proof techniques are similar to this line of work, our work is distinct in that, to the best of our knowledge, none of these prior works studies the robustness of SGD to adversarial perturbations. Furthermore, while the existing results suggest that the generalization error decreases as the width of the network increases, curiously, we find that the robust generalization error exhibits a U-curve as a function of the network width. Our guarantees, accordingly, involve both a lower bound and an upper bound on the degree of over-parametrization of the network.

2. Warm-up: Convex Learning Problems

In convex learning problems, the parameter space $\mathcal{W}$ is a convex set, and the loss $w \mapsto \ell(y f(x; w))$ is convex. This framework includes a simple yet powerful class of machine learning problems, such as support vector machines and kernel methods. Here, we seek to understand the robustness of SGD based on corrupted (likely biased) gradient estimates $\nabla \ell(\widetilde{y} f(\widetilde{x}; w))$ computed on poisoned data $(\widetilde{x}, \widetilde{y})$. We begin with a simple observation that, under standard regularity conditions, a bounded perturbation in the input/label domain translates to a bounded perturbation in the gradient domain; for example, in the clean label attack, when $f(x; w) = \langle w, x \rangle$ is a linear function, the following holds.

Proposition 2.1. Assume $\|w\| \leq D$ for all $w \in \mathcal{W} \subseteq \mathbb{R}^d$, $\|x\| \leq R$ for all $x \in \mathcal{X} \subseteq \mathbb{R}^d$, and that the loss function $\ell(\cdot)$ is $L$-Lipschitz and $\alpha$-smooth. Then, for any linear function $f(x; w) = \langle w, x \rangle$ with $w \in \mathcal{W}$, the following holds for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $\delta \in \mathbb{R}^d$:
$$\|\nabla \ell(y f(x + \delta; w)) - \nabla \ell(y f(x; w))\| \leq (\alpha D R + L)\, \|\delta\|.$$

In fact, other poisoning attacks, such as the label flip attack, can also be viewed in terms of poisoning of the first-order information about the stochastic objective. In other words, various data poisoning attacks can be studied in a unified framework of oracle poisoning, which we define formally next.

Definition ((G, B)-PSFO). Given a function $F: \mathcal{W} \to \mathbb{R}$, a poisoned stochastic first-order oracle for $F$ takes $w \in \mathcal{W}$ as input and returns a random vector $g(w) = \hat{g}(w) + \zeta$, where $\mathbb{E}[\hat{g}(w)] \in \partial F(w)$, $\mathbb{E}\|\hat{g}(w)\|^2 \leq G^2$, and $\zeta$ is an arbitrary perturbation that satisfies $\|\zeta\| \leq B$.

Given a step size $\eta > 0$ and an initial parameter $w_0 \in \mathcal{W}$, SGD makes $T$ queries to the PSFO, receives poisoned stochastic first-order information $g_t := g(w_t) = \hat{g}(w_t) + \zeta_t$, and generates a sequence of parameters $w_1, \ldots, w_T$, where $w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta g_t)$ for $t \in \{0, 1, \ldots, T-1\}$, and $\Pi_{\mathcal{W}}$ denotes the projection onto the convex set $\mathcal{W}$.
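As a concrete illustration of this oracle model and the projected SGD procedure just described, the sketch below runs averaged projected SGD against a PSFO for the logistic risk of a linear model. The specific corruption $\zeta_t$ (a norm-$B$ vector opposing the clean stochastic gradient), the $\ell_2$-ball constraint set of radius $D$, and all names are illustrative assumptions, not part of our formal setup.

```python
import numpy as np

def project_l2_ball(w, D):
    """Projection Pi_W onto W = {w : ||w|| <= D}."""
    norm = np.linalg.norm(w)
    return w if norm <= D else (D / norm) * w

def psfo(w, X, y, B, rng):
    """A (G, B)-poisoned stochastic first-order oracle for the logistic risk.

    Returns g_hat(w) + zeta, where g_hat(w) is an unbiased stochastic gradient
    and zeta is adversarial with ||zeta|| <= B (here it opposes g_hat).
    """
    i = rng.integers(len(y))
    margin = y[i] * (w @ X[i])
    g_hat = -y[i] * X[i] / (1.0 + np.exp(margin))     # gradient of log(1 + e^{-y<w,x>})
    zeta = -B * g_hat / max(np.linalg.norm(g_hat), 1e-12)
    return g_hat + zeta

def sgd_with_psfo(X, y, D, B, T, rng):
    """Averaged projected SGD making T calls to the PSFO."""
    G = np.max(np.linalg.norm(X, axis=1))             # bound on E||g_hat||^2 <= G^2
    eta = D / (np.sqrt(T) * (G + B))                  # step size used in the theorem below
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        w = project_l2_ball(w - eta * psfo(w, X, y, B, rng), D)
        w_sum += w
    return w_sum / T                                   # averaged iterate

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
w_star = rng.standard_normal(20)
y = np.sign(X @ w_star)
w_bar = sgd_with_psfo(X, y, D=5.0, B=0.1, T=2000, rng=rng)
print("0-1 error of the averaged iterate on clean data:",
      np.mean(np.sign(X @ w_bar) != y))
```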
With this setup in place, we prove the following robustness guarantee for SGD.

Theorem 2.2 (Robustness of SGD). Let $F: \mathcal{W} \to \mathbb{R}$ be a convex function. Assume that all $w \in \mathcal{W}$ satisfy $\|w\| \leq D$. Let $\bar{w} := \frac{1}{T} \sum_{t=1}^{T} w_t$ be the average of the SGD iterates after $T$ calls to a $(G, B)$-PSFO for $F$, with step size $\eta = \frac{D}{\sqrt{T}(G + B)}$, starting from an arbitrary initialization $w_0 \in \mathcal{W}$. Then it holds that
$$\mathbb{E}[F(\bar{w})] - F(w^\star) \leq \frac{5D(G + B)}{\sqrt{T}} + \frac{2D \sum_{t=1}^{T} \|\zeta_t\|}{T},$$
where $w^\star \in \operatorname{argmin}_{w \in \mathcal{W}} F(w)$.

The proof of Theorem 2.2 can be found in Appendix B.1. Theorem 2.2 implies that SGD can robustly learn convex learning problems as long as the cumulative perturbation norm due to the PSFO is sublinear in the number of oracle calls. In particular, when $\sum_{t=1}^{T} \|\zeta_t\| = O(\sqrt{T})$, the poisoning attack cannot impose any significant statistical overhead on the learning problem. Furthermore, the upper bound presented in Theorem 2.2 is tight in an information-theoretic sense.

Theorem 2.3 (Optimality of SGD). There exists a function $F: [-1, 1] \to \mathbb{R}$ and a $(1, 1)$-PSFO for $F$ such that any optimization algorithm making $T$ calls to the oracle incurs an excess error of
$$\mathbb{E}[F(\bar{w})] - F(w^\star) \geq \Omega\left(\frac{1}{\sqrt{T}} + \frac{\sum_{t=1}^{T} \|\zeta_t\|}{T}\right).$$

We note that inexact first-order oracles have been studied in several previous papers (Schmidt et al., 2011; Honorio, 2012; Devolder et al., 2014; Hu et al., 2016; Dvurechensky, 2017; Hu et al., 2020; Ajalloeian & Stich, 2020). Most of these works, however, make strong distributional assumptions on the perturbations, which are impractical in real adversarial settings. In a closely related line of work, Hu et al. (2016; 2020), Amir et al. (2020), and Ajalloeian & Stich (2020) focus on biased SGD and give convergence guarantees for several classes of important machine learning problems. However, we are not aware of any previous work studying the robustness of SGD for neural networks, which is the subject of the next section.

3. Neural Networks

Next, we focus on two-layer neural networks with the ReLU activation function and characterize sufficient conditions under which SGD can efficiently and robustly learn the network. A two-layer ReLU network, parameterized by a pair of weight matrices $(a, W)$, computes the following function:
$$f(x; a, W) := \frac{1}{\sqrt{m}} \sum_{s=1}^{m} a_s \sigma(w_s^\top x).$$
Here, $m$ corresponds to the number of hidden nodes, i.e., the network width, $W = [w_1, \ldots, w_m]$, $a = [a_1, \ldots, a_m]$, and $\sigma(z) := \max\{0, z\}$ is the ReLU. We initialize the top-layer weights as $a_s \sim \mathrm{unif}(\{-1, +1\})$ and keep them fixed throughout training. The bottom-layer weights are initialized as $w_{s,0} \sim \mathcal{N}(0, I_d)$ and are updated using SGD on the logistic loss $\ell(z) := \log(1 + e^{-z})$. We denote the weight matrix at the $t$-th iterate of SGD as $W_t$ and the incoming weight vector into the $s$-th hidden node at iteration $t$ as $w_{s,t}$. Since $a$ is fixed during training, for simplicity of presentation, we denote the network output on the $i$-th clean and perturbed sample, respectively, as $f_i(W) := f(x_i; a, W)$ and $\widetilde{f}_i(W) := f(\widetilde{x}_i; a, W)$. Therefore, at time $t$, the network weights are updated according to $W_{t+1} = W_t - \eta_t \nabla \ell(\widetilde{y}_t \widetilde{f}_t(W_t))$.

In this section, we assume that the data is normalized so that $\|x\| = 1$. This assumption is standard in the literature on over-parameterized neural networks (Du et al., 2019b; Allen-Zhu et al., 2018; Cao & Gu, 2019; Ji & Telgarsky, 2019); however, the results can be extended to the setting where the norm of the data is both upper- and lower-bounded by constants. Moreover, following Ji & Telgarsky (2019), we assume that the distribution is separable by a positive margin in the reproducing kernel Hilbert space (RKHS) induced by the gradient of the infinite-width network at initialization.

Assumption 1 ((Ji & Telgarsky, 2019)). Let $z \sim \mathcal{N}(0, I_d)$ be a $d$-dimensional standard Gaussian random vector. There exists a margin parameter $\gamma > 0$ and a linear separator $v: \mathbb{R}^d \to \mathbb{R}^d$ satisfying (A) $\mathbb{E}_z[\|v(z)\|^2] < \infty$; (B) $\|v(z)\|_2 \leq 1$ for all $z \in \mathbb{R}^d$; and (C) $y\, \mathbb{E}_z[\langle v(z), x \rangle\, \mathbb{1}[z^\top x \geq 0]] \geq \gamma$ for almost all $(x, y) \sim \mathcal{D}$.
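To make the training setup of this section concrete, here is a minimal sketch of the width-$m$ network with fixed random $\pm 1$ top-layer weights, standard Gaussian initialization of the bottom layer, and a single online pass of SGD on the logistic loss. The synthetic data, width, and step size are placeholder choices for illustration and do not correspond to the quantities appearing in our theorems.

```python
import numpy as np

def net_output(X, a, W):
    """f(x; a, W) = (1/sqrt(m)) * sum_s a_s * relu(<w_s, x>), applied to each row of X."""
    m = W.shape[0]
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def train_sgd(X, y, m, eta, rng):
    """One online pass of SGD on the logistic loss; the top layer a stays fixed."""
    n, d = X.shape
    a = rng.choice([-1.0, 1.0], size=m)               # a_s ~ unif({-1, +1}), fixed
    W = rng.standard_normal((m, d))                   # rows w_{s,0} ~ N(0, I_d)
    for t in range(n):
        x, label = X[t], y[t]
        pre = W @ x                                   # pre-activations <w_s, x>
        f = (a @ np.maximum(pre, 0.0)) / np.sqrt(m)
        lprime = -1.0 / (1.0 + np.exp(label * f))     # l'(z) for l(z) = log(1 + e^{-z})
        # Gradient of l(y * f(x; a, W)) with respect to each row w_s.
        grad = (lprime * label / np.sqrt(m)) * (a * (pre > 0))[:, None] * x[None, :]
        W = W - eta * grad
    return a, W

rng = np.random.default_rng(0)
d, m, n = 10, 256, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)         # normalize so that ||x|| = 1
y = np.sign(X[:, 0])                                  # a simple separable labeling

a, W = train_sgd(X, y, m=m, eta=0.5, rng=rng)
print("training 0-1 error:", np.mean(np.sign(net_output(X, a, W)) != y))
```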
We note that the assumption above, pertaining to the linear separability of the data after mapping it into a high-dimensional non-linear feature space, is mild and reasonable; this very idea has been the cornerstone of kernel methods using the radial basis function (RBF) kernel, for example, and of learning with neural networks. Next, we specify three data poisoning regimes under which SGD can efficiently and robustly learn two-layer ReLU networks under Assumption 1. Recall that the misclassification error of $f(\cdot\,; a, W)$ is denoted by $L(W) := \mathbb{P}_{\mathcal{D}}(y f(x; a, W) \leq 0)$; note that $a$ is fixed after initialization and hence is dropped from the arguments of $L$.

3.1. Regime A (clean label attacks): large per-sample perturbation, small overall perturbation

Our first result concerns the setting where each individual sample can be arbitrarily poisoned as long as the overall perturbation budget is small compared to the sample size.

Theorem 3.1 (Regime A). Under Assumption 1, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random initialization and the training samples, the iterates of SGD with constant step size $\eta = \frac{1}{(1+B)^2 \sqrt{n}}$ satisfy