Robust Learning for Data Poisoning Attacks

Yunjuan Wang 1, Poorya Mianjy 1, Raman Arora 1

1 Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Raman Arora.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

We investigate the robustness of stochastic approximation approaches against data poisoning attacks. We focus on two-layer neural networks with ReLU activations and show that, under a specific notion of separability in the RKHS induced by the infinite-width network, training (finite-width) networks with stochastic gradient descent is robust against data poisoning attacks. Interestingly, we find that in addition to a lower bound on the width of the network, which is standard in the literature, we also require a distribution-dependent upper bound on the width for robust generalization. We provide extensive empirical evaluations that support and validate our theoretical results.

1. Introduction

Machine learning models based on neural networks power state-of-the-art systems for various real-world applications, including self-driving autonomous vehicles (Grigorescu et al., 2020), speech recognition (Afouras et al., 2018), reinforcement learning (Li, 2017), etc. Neural networks trained using stochastic gradient descent (SGD) perform well both in terms of optimization (training) and generalization (prediction). However, with great power comes great responsibility, and as several recent studies indicate, systems based on neural networks admit vulnerabilities in the form of adversarial attacks. Especially in over-parametrized settings (wherein the number of parameters is much larger than the training sample size), which are typical in most applications, neural networks remain extraordinarily fragile and can be made to depart from their expected behavior by strategically induced perturbations in data. One such limitation is due to arbitrary adversarial corruption of data at the time of training, commonly referred to as data poisoning. Such attacks present a challenging problem, especially in settings where an adversary can affect any part of the training data. Therefore, in this paper, we are interested in quantifying the maximal adversarial noise that is tolerable by SGD when training wide ReLU networks.

One of the earliest works to consider algorithms provably tolerant to a quantifiable amount of error in the training examples is that of Valiant (1985), motivated by a need to understand the limitations of the PAC learning framework. This was followed by a series of works that considered computationally unbounded adversaries and posed the question of bounding the error rate tolerable by a learning algorithm in a worst-case model of errors (Kearns & Li, 1993; Guruswami & Raghavendra, 2009). These hardness results were later complemented by positive results (Klivans et al., 2009; Awasthi et al., 2014; Diakonikolas et al., 2019a), which give learning algorithms that enjoy information-theoretically optimal noise tolerance. Much of this prior work focuses on learning halfspaces (i.e., linear separators) in Valiant's PAC learning model (Valiant, 1984). Instead, we consider Vapnik's general setting of learning, and are interested in convex learning problems and over-parametrized neural networks with ReLU activations.
While our theoretical understanding of deep learning has increased vastly in the last few years, with several results characterizing the ability of gradient descent to achieve small training loss in the over-parameterized regime, our understanding of the robustness of such methods to attacks such as data poisoning remains limited.

Arguably, the simplest model of data poisoning is one in which the input features are perturbed, additively, by norm-bounded vectors. A more challenging scenario is one where both input features and labels can be corrupted; this is essentially the noise model considered by Valiant (1985); Kearns & Li (1993); Awasthi et al. (2014). A related model, studied by Cesa-Bianchi et al. (2011), is one where the learner observes only a noisy version of the data, in a streaming setting, with the noise distribution changing arbitrarily after each round. Yet another poisoning attack, studied extensively in the literature, is one where the adversary can plant a fraction of the training data; for example, consider movie ratings contributed by malicious users in matrix completion. Recent works have studied numerous other practical data poisoning methods including backdoor attacks, data injection, clean label attacks, and label-flip attacks (we discuss these further in related work).

While several defenses have been proposed, each tailored to a specific data poisoning attack, there is no unified, robust learning framework against such attacks. Furthermore, the proposed defenses often depart significantly from the practice of modern machine learning, which increasingly relies on stochastic approximation algorithms such as stochastic gradient descent (SGD), stochastic mirror descent, and their variants. Therefore, it is natural to ask whether stochastic approximation algorithms, such as SGD, impart robustness to learning against adversarial perturbations of training data.

In this paper, we investigate the robustness of SGD against various data poisoning attacks for convex learning problems as well as for training two-layer over-parameterized neural networks with ReLU activations. Surprisingly, our results show that SGD achieves optimal convergence rates on the excess risk, despite data poisoning, with only a mild deterioration in overall performance, even as the overall noise budget of the adversarial attack grows with the sample size, albeit sublinearly.

Our main contributions in this paper are as follows. In Section 2, we first consider the clean label attack, where the adversary can additively perturb the input features but not the target labels. In this setting, we show that stochastic gradient descent robustly learns a classifier as long as the overall perturbation is sublinear in the sample size. We extend our results to a more general class of data poisoning attacks and study them in a unified framework of oracle poisoning. In Section 3, we extend our results to two-layer over-parameterized neural networks with ReLU activations. We discuss the clean label attack and the label flip attack separately, and establish guarantees for SGD in three regimes under a data-dependent margin assumption. Our bounds hold in the regime where neural networks are moderately wide but not too wide, supporting the conjecture that extreme over-parametrization may render learning susceptible to data poisoning. This is in stark contrast to existing results in deep learning theory that argue for wider networks for better generalization.
We validate our theoretical results with empirical evaluations on real datasets in Section 5. We confirm that the clean-test accuracy exhibits an inverted U-curve when the training data is poisoned, in all of the noisy regimes we consider. In the process, we also discover a new loss function that yields stronger poisoning attacks, which might be of independent interest.

1.1. Problem Setup

We focus on the task of binary classification in the presence of data poisoning attacks. We denote the input and the label spaces, respectively, by $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{-1, +1\}$. We assume that the data $(x, y)$ are drawn from an unknown joint distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. In a general (clean-data) learning framework, the learner is provided with $n$ i.i.d. samples $S = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$, and the goal is to learn a function $f_w: \mathcal{X} \to \mathcal{Y}$, parameterized by $w$ in some parameter space $\mathcal{W}$, with a small generalization error, i.e., small 0-1 loss with respect to the population, $L(w) := \mathbb{P}_{(x,y)\sim\mathcal{D}}(y f_w(x) \leq 0)$.

We model data poisoning attacks as a malicious adversary who sits between the distribution and the learner. The adversary receives an i.i.d. sample $S := \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ of size $n$, generates the poisoned sample $\widetilde{S} := \{(\widetilde{x}_i, \widetilde{y}_i)\}_{i=1}^n$, and passes it over to the learner. For example, in the clean label attack, the adversary perturbs the input as $\widetilde{x}_i = x_i + \delta_i$, where each perturbation $\delta_i$ belongs to a perturbation space $\Delta$, and leaves the labels intact, i.e., $\widetilde{y}_i = y_i$. Note that in this model, no distributional assumptions are made on the adversarial perturbations. Another example is the label flip attack, whereby the adversary does not poison the input, i.e., $\widetilde{x}_i = x_i$, but flips the sign of each label with probability $\beta$. More precisely, $\widetilde{y}_i = -y_i$ with probability $\beta$ and $\widetilde{y}_i = y_i$ otherwise. We focus on the setting where the adversary has access to the clean data $S$ and is computationally unbounded. In other words, the adversary chooses how to attack the optimal model (e.g., the empirical risk minimizer), given the sample. However, the adversary has no knowledge of the random bits used by the learner, e.g., when training using stochastic gradient descent.

A common approach to the clean-data learning problem is solving the stochastic optimization problem $\min_{w \in \mathcal{W}} F(w) := \mathbb{E}_{\mathcal{D}}[\ell(y f(x; w))]$, where $\ell: \mathbb{R} \to \mathbb{R}_{\geq 0}$ is a convex surrogate for the 0-1 loss. In practice, this is usually done using first-order optimization techniques such as stochastic gradient descent (SGD) and its variants. The statistical and computational learning-theoretic aspects of such methods have been extensively studied in the literature; however, their robustness to data poisoning attacks is not yet well understood. Therefore, the central question we ask is the following: can SGD robustly and efficiently learn certain hypothesis classes?

In full generality, of course, the answer to the above question is negative: no learning is possible if we do not impose any restrictions on the perturbations, i.e., the set $\Delta$. Therefore, in this paper, we identify conditions on the perturbations under which SGD can efficiently and robustly learn important hypothesis classes such as linear models as well as two-layer neural networks. In particular, our analysis crucially depends on the following measures of perturbation: 1) the per-sample corruption budget $B := \max_i \|\delta_i\|$; 2) the overall corruption budget $S := \sum_{i=1}^n \|\delta_i\|$; or 3) the label-flip probability $\beta$.
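To make these threat models concrete, the following sketch generates a clean-label poisoning and a label-flip poisoning of a synthetic dataset and reports the per-sample budget $B$ and the overall budget $S$. This is only an illustration of the definitions above: the particular perturbation direction (pushing each point against a reference linear separator) and all variable names are assumptions made for the example, not part of our formal setup.

```python
import numpy as np

def clean_label_attack(X, y, w_ref, per_sample_budget):
    """Clean-label attack: add a perturbation delta_i to each x_i, keep y_i.

    The direction of delta_i (against a reference separator w_ref) is a
    placeholder choice; any delta_i respecting the budgets is admissible.
    """
    direction = -y[:, None] * (w_ref / np.linalg.norm(w_ref))[None, :]
    delta = per_sample_budget * direction            # ||delta_i|| = per_sample_budget
    B = np.max(np.linalg.norm(delta, axis=1))        # per-sample budget B = max_i ||delta_i||
    S = np.sum(np.linalg.norm(delta, axis=1))        # overall budget S = sum_i ||delta_i||
    return X + delta, y.copy(), B, S

def label_flip_attack(X, y, beta, rng):
    """Label-flip attack: keep each x_i, flip each label with probability beta."""
    flips = rng.random(len(y)) < beta
    return X.copy(), np.where(flips, -y, y)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_ref = rng.standard_normal(10)
y = np.sign(X @ w_ref)                               # labels from the reference separator

X_cl, y_cl, B, S = clean_label_attack(X, y, w_ref, per_sample_budget=0.5)
X_lf, y_lf = label_flip_attack(X, y, beta=0.1, rng=rng)
print(f"clean-label attack: B = {B:.2f}, S = {S:.1f}")
print(f"label-flip attack: flipped {np.mean(y_lf != y):.1%} of the labels")
```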
We denote scalars, vectors, and matrices, respectively, with lowercase italic, lowercase bold, and uppercase bold Roman letters, e.g., $u$, $\mathbf{u}$, and $\mathbf{U}$. The $\ell_2$ norm is denoted by $\|\cdot\|$. Throughout, we use the standard $O$-notation ($O$ and $\Omega$). Further, we use $\lesssim$ and $O$ interchangeably. We use $\widetilde{O}$ to hide poly-logarithmic dependence on the parameters.

1.2. Related Work

In this section, we survey related prior work on data poisoning attacks and defense strategies, and on the convergence analysis of gradient-descent-based methods for training wide networks.

Data poisoning attacks and defenses. A data poisoning attack, or causative attack, aims at manipulating training samples or the model architecture so as to cause misclassification of subsequent inputs associated with a specific label (a targeted attack) or to manipulate predictions on data from all classes (an indiscriminate attack). A popular data poisoning attack is the backdoor attack, where the adversary injects strategically manipulated samples (referred to as a backdoor, or trigger, pattern) with a target label into the training data. At prediction time, samples that do not contain the trigger pattern are categorized correctly, but samples that carry the trigger are likely misclassified as belonging to the target label class (Gu et al., 2017; Liu et al., 2017; Chen et al., 2017). One of the shortcomings of the standard backdoor attack is that the poisoned samples are clearly mislabeled, which can arouse suspicion if subjected to human inspection. This led to research on what are known as clean label attacks (Koh & Liang, 2017; Shafahi et al., 2018; Zhu et al., 2019), which focus on adding human-imperceptible perturbations to input features without flipping the labels of the corrupted inputs. Another attack category is that of label-flip attacks, where the adversary can change the labels of a constant fraction of the training sample (Biggio et al., 2011; Xiao et al., 2012; Zhao et al., 2017).

Several defense mechanisms have been proposed to counter the data poisoning attacks described above. For label-flip attacks, Awasthi et al. (2014) focus on the malicious noise model and construct an algorithm that finds a halfspace achieving $\epsilon$ error while tolerating a noise rate of $\Omega(\epsilon)$ for isotropic log-concave distributions. Recently, Diakonikolas et al. (2019a) proposed a $\mathrm{poly}(d, 1/\epsilon)$-time algorithm that solves the same problem under Massart noise. For backdoor attacks, Liu et al. (2018) and Tran et al. (2018) propose strategies to identify the trigger pattern and the poisoned samples. Several other works have followed up on this idea of data sanitization (outlier removal) (Barreno et al., 2010; Suciu et al., 2018; Jagielski et al., 2018; Diakonikolas et al., 2019b; Wang et al., 2019). For certified defenses, Steinhardt et al. (2017) analyze oracle and data-dependent defenses by constructing an approximate upper bound on the loss. Recently, Rosenfeld et al. (2020) applied randomized smoothing to build a certifiably robust linear classifier against label-flip attacks.

Convergence analysis of gradient descent for wide networks. Our analysis builds on recent advances in the theoretical deep learning literature, which focuses on analyzing the trajectory of first-order optimization methods in the limit that the network width goes to infinity (Li & Liang, 2018; Du et al., 2019b;a; Allen-Zhu et al., 2018; Zou et al., 2018; Cao & Gu, 2019).
The main insight from this body of work is that when training a sufficiently over-parameterized network using gradient descent, if the initialization is large and the learning rate is small, the weights of the network remain close to the initialization; therefore, the dynamics of the network predictions is approximately linear in the feature space induced by the gradient of the network at the initialization (Li & Liang, 2018; Chizat et al., 2018; Du et al., 2019b; Lee et al., 2019). We are particularly inspired by a recent work of Ji & Telgarsky (2019), which studies the setting where the data distribution is separable in this feature space, an assumption that was first introduced and studied by Nitanda & Suzuki (2019). While our assumptions and proof techniques are similar to this line of work, our work is distinct in that, to the best of our knowledge, none of these prior works studies the robustness of SGD to adversarial perturbations. Furthermore, while the existing results suggest that the generalization error decreases as the width of the network increases, curiously, we find that the robust generalization error exhibits a U-curve as a function of the network width. Our guarantees, accordingly, involve both a lower bound and an upper bound on the degree of over-parametrization of the network.

2. Warm-up: Convex Learning Problems

In convex learning problems, the parameter space $\mathcal{W}$ is a convex set, and the loss $w \mapsto \ell(y f(x; w))$ is convex. This framework includes a simple yet powerful class of machine learning problems, such as support vector machines and kernel methods. Here, we seek to understand the robustness of SGD based on corrupted (likely biased) gradient estimates $\nabla \ell(\widetilde{y} f(\widetilde{x}; w))$ computed on poisoned data $(\widetilde{x}, \widetilde{y})$. We begin with a simple observation that, under standard regularity conditions, a bounded perturbation in the input/label domain translates to a bounded perturbation in the gradient domain; for example, in the clean label attack, when $f(x; w) = \langle w, x \rangle$ is a linear function, the following holds.

Proposition 2.1. Assume $\|w\| \leq D$ for all $w \in \mathcal{W} \subseteq \mathbb{R}^d$, $\|x\| \leq R$ for all $x \in \mathcal{X} \subseteq \mathbb{R}^d$, and that the loss function $\ell(\cdot)$ is $L$-Lipschitz and $\alpha$-smooth. Then, for any linear function $f(x; w) = \langle w, x \rangle$ with $w \in \mathcal{W}$, the following holds for any $(x, y) \in \mathcal{X} \times \mathcal{Y}$ and $\delta \in \mathbb{R}^d$:
$$\|\nabla \ell(y f(x + \delta; w)) - \nabla \ell(y f(x; w))\| \leq (\alpha D R + L)\, \|\delta\|.$$

In fact, other poisoning attacks, such as the label flip attack, can also be viewed in terms of poisoning of the first-order information about the stochastic objective. In other words, various data poisoning attacks can be studied in a unified framework of oracle poisoning, which we define formally next.

Definition ((G, B)-PSFO). Given a function $F: \mathcal{W} \to \mathbb{R}$, a poisoned stochastic first-order oracle for $F$ takes $w \in \mathcal{W}$ as input and returns a random vector $g(w) = \hat{g}(w) + \zeta$, where $\mathbb{E}[\hat{g}(w)] \in \partial F(w)$, $\mathbb{E}\|\hat{g}(w)\|^2 \leq G^2$, and $\zeta$ is an arbitrary perturbation that satisfies $\|\zeta\| \leq B$.

Given a step size $\eta > 0$ and an initial parameter $w_0 \in \mathcal{W}$, SGD makes $T$ queries to the PSFO, receives poisoned stochastic first-order information $g_t := g(w_t) = \hat{g}(w_t) + \zeta_t$, and generates a sequence of parameters $w_1, \ldots, w_T$, where $w_{t+1} = \Pi_{\mathcal{W}}(w_t - \eta g_t)$ for $t \in \{0, 1, \ldots, T-1\}$, and $\Pi_{\mathcal{W}}$ denotes the projection onto the convex set $\mathcal{W}$.
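As a concrete illustration of this oracle model and the projected SGD procedure just described, the sketch below runs averaged projected SGD against a PSFO for the logistic risk of a linear model. The specific corruption $\zeta_t$ (a norm-$B$ vector opposing the clean stochastic gradient), the $\ell_2$-ball constraint set of radius $D$, and all names are illustrative assumptions, not part of our formal setup.

```python
import numpy as np

def project_l2_ball(w, D):
    """Projection Pi_W onto W = {w : ||w|| <= D}."""
    norm = np.linalg.norm(w)
    return w if norm <= D else (D / norm) * w

def psfo(w, X, y, B, rng):
    """A (G, B)-poisoned stochastic first-order oracle for the logistic risk.

    Returns g_hat(w) + zeta, where g_hat(w) is an unbiased stochastic gradient
    and zeta is adversarial with ||zeta|| <= B (here it opposes g_hat).
    """
    i = rng.integers(len(y))
    margin = y[i] * (w @ X[i])
    g_hat = -y[i] * X[i] / (1.0 + np.exp(margin))     # gradient of log(1 + e^{-y<w,x>})
    zeta = -B * g_hat / max(np.linalg.norm(g_hat), 1e-12)
    return g_hat + zeta

def sgd_with_psfo(X, y, D, B, T, rng):
    """Averaged projected SGD making T calls to the PSFO."""
    G = np.max(np.linalg.norm(X, axis=1))             # bound on E||g_hat||^2 <= G^2
    eta = D / (np.sqrt(T) * (G + B))                  # step size used in the theorem below
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)
    for _ in range(T):
        w = project_l2_ball(w - eta * psfo(w, X, y, B, rng), D)
        w_sum += w
    return w_sum / T                                   # averaged iterate

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
w_star = rng.standard_normal(20)
y = np.sign(X @ w_star)
w_bar = sgd_with_psfo(X, y, D=5.0, B=0.1, T=2000, rng=rng)
print("0-1 error of the averaged iterate on clean data:",
      np.mean(np.sign(X @ w_bar) != y))
```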
With this setup in place, we prove the following robustness guarantee for SGD.

Theorem 2.2 (Robustness of SGD). Let $F: \mathcal{W} \to \mathbb{R}$ be a convex function. Assume that all $w \in \mathcal{W}$ satisfy $\|w\| \leq D$. Let $\bar{w} := \frac{1}{T} \sum_{t=1}^{T} w_t$ be the average of the SGD iterates after $T$ calls to a $(G, B)$-PSFO for $F$, with step size $\eta = \frac{D}{\sqrt{T}(G + B)}$, starting from an arbitrary initialization $w_0 \in \mathcal{W}$. Then it holds that
$$\mathbb{E}[F(\bar{w})] - F(w^\star) \leq \frac{5D(G + B)}{\sqrt{T}} + \frac{2D \sum_{t=1}^{T} \|\zeta_t\|}{T},$$
where $w^\star \in \operatorname{argmin}_{w \in \mathcal{W}} F(w)$.

The proof of Theorem 2.2 can be found in Appendix B.1. Theorem 2.2 implies that SGD can robustly learn convex learning problems as long as the cumulative perturbation norm due to the PSFO is sublinear in the number of oracle calls. In particular, when $\sum_{t=1}^{T} \|\zeta_t\| = O(\sqrt{T})$, the poisoning attack cannot impose any significant statistical overhead on the learning problem. Furthermore, the upper bound presented in Theorem 2.2 is tight in an information-theoretic sense.

Theorem 2.3 (Optimality of SGD). There exists a function $F: [-1, 1] \to \mathbb{R}$ and a $(1, 1)$-PSFO for $F$ such that any optimization algorithm making $T$ calls to the oracle incurs an excess error of
$$\mathbb{E}[F(\bar{w})] - F(w^\star) \geq \Omega\left(\frac{1}{\sqrt{T}} + \frac{\sum_{t=1}^{T} \|\zeta_t\|}{T}\right).$$

We note that inexact first-order oracles have been studied in several previous papers (Schmidt et al., 2011; Honorio, 2012; Devolder et al., 2014; Hu et al., 2016; Dvurechensky, 2017; Hu et al., 2020; Ajalloeian & Stich, 2020). Most of these works, however, make strong distributional assumptions on the perturbations, which are impractical in real adversarial settings. In a closely related line of work, Hu et al. (2016; 2020), Amir et al. (2020), and Ajalloeian & Stich (2020) focus on biased SGD and give convergence guarantees for several classes of important machine learning problems. However, we are not aware of any previous work studying the robustness of SGD for neural networks, which is the subject of the next section.

3. Neural Networks

Next, we focus on two-layer neural networks with the ReLU activation function and characterize sufficient conditions under which SGD can efficiently and robustly learn the network. A two-layer ReLU network, parameterized by a pair of weight matrices $(a, W)$, computes the following function:
$$f(x; a, W) := \frac{1}{\sqrt{m}} \sum_{s=1}^{m} a_s \sigma(w_s^\top x).$$
Here, $m$ corresponds to the number of hidden nodes, i.e., the network width, $W = [w_1, \ldots, w_m]$, $a = [a_1, \ldots, a_m]$, and $\sigma(z) := \max\{0, z\}$ is the ReLU. We initialize the top-layer weights as $a_s \sim \mathrm{unif}(\{-1, +1\})$ and keep them fixed throughout training. The bottom-layer weights are initialized as $w_{s,0} \sim \mathcal{N}(0, I_d)$ and are updated using SGD on the logistic loss $\ell(z) := \log(1 + e^{-z})$. We denote the weight matrix at the $t$-th iterate of SGD as $W_t$ and the incoming weight vector into the $s$-th hidden node at iteration $t$ as $w_{s,t}$. Since $a$ is fixed during training, for simplicity of presentation, we denote the network output on the $i$-th clean and perturbed sample, respectively, as $f_i(W) := f(x_i; a, W)$ and $\widetilde{f}_i(W) := f(\widetilde{x}_i; a, W)$. Therefore, at time $t$, the network weights are updated according to $W_{t+1} = W_t - \eta_t \nabla \ell(\widetilde{y}_t \widetilde{f}_t(W_t))$.

In this section, we assume that the data is normalized so that $\|x\| = 1$. This assumption is standard in the literature on over-parameterized neural networks (Du et al., 2019b; Allen-Zhu et al., 2018; Cao & Gu, 2019; Ji & Telgarsky, 2019); however, the results can be extended to the setting where the norm of the data is both upper- and lower-bounded by constants. Moreover, following Ji & Telgarsky (2019), we assume that the distribution is separable by a positive margin in the reproducing kernel Hilbert space (RKHS) induced by the gradient of the infinite-width network at initialization.

Assumption 1 ((Ji & Telgarsky, 2019)). Let $z \sim \mathcal{N}(0, I_d)$ be a $d$-dimensional standard Gaussian random vector. There exists a margin parameter $\gamma > 0$ and a linear separator $v: \mathbb{R}^d \to \mathbb{R}^d$ satisfying (A) $\mathbb{E}_z[\|v(z)\|^2] < \infty$; (B) $\|v(z)\|_2 \leq 1$ for all $z \in \mathbb{R}^d$; and (C) $y\, \mathbb{E}_z[\langle v(z), x \rangle\, \mathbb{1}[z^\top x \geq 0]] \geq \gamma$ for almost all $(x, y) \sim \mathcal{D}$.
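To make the training setup of this section concrete, here is a minimal sketch of the width-$m$ network with fixed random $\pm 1$ top-layer weights, standard Gaussian initialization of the bottom layer, and a single online pass of SGD on the logistic loss. The synthetic data, width, and step size are placeholder choices for illustration and do not correspond to the quantities appearing in our theorems.

```python
import numpy as np

def net_output(X, a, W):
    """f(x; a, W) = (1/sqrt(m)) * sum_s a_s * relu(<w_s, x>), applied to each row of X."""
    m = W.shape[0]
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

def train_sgd(X, y, m, eta, rng):
    """One online pass of SGD on the logistic loss; the top layer a stays fixed."""
    n, d = X.shape
    a = rng.choice([-1.0, 1.0], size=m)               # a_s ~ unif({-1, +1}), fixed
    W = rng.standard_normal((m, d))                   # rows w_{s,0} ~ N(0, I_d)
    for t in range(n):
        x, label = X[t], y[t]
        pre = W @ x                                   # pre-activations <w_s, x>
        f = (a @ np.maximum(pre, 0.0)) / np.sqrt(m)
        lprime = -1.0 / (1.0 + np.exp(label * f))     # l'(z) for l(z) = log(1 + e^{-z})
        # Gradient of l(y * f(x; a, W)) with respect to each row w_s.
        grad = (lprime * label / np.sqrt(m)) * (a * (pre > 0))[:, None] * x[None, :]
        W = W - eta * grad
    return a, W

rng = np.random.default_rng(0)
d, m, n = 10, 256, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)         # normalize so that ||x|| = 1
y = np.sign(X[:, 0])                                  # a simple separable labeling

a, W = train_sgd(X, y, m=m, eta=0.5, rng=rng)
print("training 0-1 error:", np.mean(np.sign(net_output(X, a, W)) != y))
```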
We note that the assumption above, pertaining to the linear separability of the data after mapping it into a high-dimensional non-linear feature space, is mild and reasonable; this very idea has been the cornerstone of kernel methods using the radial basis function (RBF) kernel, for example, and of learning with neural networks. Next, we specify three data poisoning regimes under which SGD can efficiently and robustly learn two-layer ReLU networks under Assumption 1. Recall that the misclassification error of $f(\cdot\,; a, W)$ is denoted by $L(W) := \mathbb{P}_{\mathcal{D}}(y f(x; a, W) \leq 0)$; note that $a$ is fixed after initialization and hence is dropped from the arguments of $L$.

3.1. Regime A (clean label attacks): large per-sample perturbation, small overall perturbation

Our first result concerns the setting where each individual sample can be arbitrarily poisoned as long as the overall perturbation budget is small compared to the sample size.

Theorem 3.1 (Regime A). Under Assumption 1, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random initialization and the training samples, the iterates of SGD with constant step size $\eta = \frac{1}{(1+B)^2 \sqrt{n}}$ satisfy