Does Data Augmentation Lead to Positive Margin?

Shashank Rajput*1, Zhili Feng*1, Zachary Charles2, Po-Ling Loh3, Dimitris Papailiopoulos2

Abstract

Data augmentation (DA) is commonly used during model training, as it significantly improves test error and model robustness. DA artificially expands the training set by applying random noise, rotations, crops, or even adversarial perturbations to the input data. Although DA is widely used, its capacity to provably improve robustness is not fully understood. In this work, we analyze the robustness that DA begets by quantifying the margin that DA enforces on empirical risk minimizers. We first focus on linear separators, and then a class of nonlinear models whose labeling is constant within small convex hulls of data points. We present lower bounds on the number of augmented data points required for non-zero margin, and show that commonly used DA techniques may only introduce significant margin after adding exponentially many points to the data set.

*Equal contribution. 1Department of Computer Science, University of Wisconsin-Madison. 2Department of Electrical and Computer Engineering, University of Wisconsin-Madison. 3Department of Statistics, University of Wisconsin-Madison. Correspondence to: Shashank Rajput, Zhili Feng. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

1. Introduction

Modern machine learning has ushered in a plethora of advances in data science and engineering, which leverage models with millions of tunable parameters and achieve unprecedented accuracy on many vision, speech, and text prediction tasks. For state-of-the-art performance, model training involves stochastic gradient descent (SGD), combined with regularization, momentum, data augmentation, and other heuristics. Several empirical studies (Zhang et al., 2016; Zantedeschi et al., 2017) observe that among these methods, data augmentation plays a central role in improving the test error performance and robustness of these models.

Data augmentation (DA) expands the training set with artificial data points. For example, Krizhevsky et al. (2012) augmented ImageNet using translations, horizontal reflections, and altered intensities of the RGB channels of images in the training set. Others have augmented datasets by adding labels to sparsely annotated videos (Misra et al., 2015; Kuznetsova et al., 2015; Prest et al., 2012).

Another important class of data augmentation methods is referred to broadly as adversarial training. Such methods use adversarial examples (Szegedy et al., 2013; Madry et al., 2017) to enlarge the training set. Many works have since shown that by training models on these adversarial examples, we can increase the robustness of learned models (Bastani et al., 2016; Carlini & Wagner, 2017; Szegedy et al., 2013; Goodfellow et al., 2014). Recently, Ford et al. (2019) studied the use of additive Gaussian DA in ensuring robustness of learned classifiers. While they showed that the approach can have some limited success, ensuring robustness to adversarial attacks requires augmenting the data set with Gaussian noise of particularly high variance.

The high-level motivation of DA is clear: a reliable model should be trained to predict the same class even if an image is slightly perturbed. Despite its empirical effectiveness, relatively few works theoretically analyze the performance and limitations of DA.
Bishop (1995) shows that training with noise is equivalent to Tikhonov regularization in expectation. Wager et al. (2013) show that training generalized linear models while randomly dropping features is approximately equivalent to $\ell_2$-regularization normalized by the inverse diagonal Fisher information matrix. Dao et al. (2018) study data augmentation as feature-averaging and variance regularization, using a Markov process to augment the dataset. Wong & Kolter (2018) provide a provable defense against bounded $\ell_\infty$-attacks by training on a convex relaxation of the adversarial polytope, which is also a form of DA.

We take a different path by analyzing how DA impacts the margin of a classifier, i.e., the minimum distance from the training data to its decision boundary. We focus on margin since it acts as a proxy for both generalization (Shalev-Shwartz & Ben-David, 2014) and worst-case robustness. In particular, we analyze how much data augmentation is necessary in order to ensure that any empirical risk minimization algorithm achieves positive, or even large, margin. To the best of our knowledge, no existing work has explicitly analyzed data augmentation through the lens of margin.

1.1. Contributions

We consider the following empirical risk minimization (ERM) problem:
$$\mathcal{A}(S) = \operatorname*{argmin}_{f \in \mathcal{F}} \sum_{i=1}^{n} \ell(f(x_i), y_i),$$
where $S = \{(x_i, y_i)\}_{i=1}^{n}$ is the training set, $x_i \in \mathbb{R}^d$ are the feature vectors, and $y_i \in \{-1, +1\}$ their labels. $\mathcal{F}$ is the set of classifiers we are optimizing over, and $\ell(f(x), y) = \mathbf{1}\{f(x) \neq y\}$ is the 0/1 loss quantifying the discrepancy between the predicted label $f(x)$ and the truth. For the purpose of better generalization and robustness, we often desire an ERM solution with large margin. A classifier $f$ has margin $\epsilon$ with respect to some $p$-norm if, for every $(x, y) \in S$ and any $\delta \in \mathbb{R}^d$ with $\|\delta\|_p \le \epsilon$, we have $f(x) = f(x + \delta) = y$.

While margin can be explicitly enforced through constraints or regularization for linear classifiers, doing so efficiently and provably for general classifiers remains a challenging open problem. Since data augmentation has had success in offering better robustness in practice, we ask the following question: Can data augmentation guarantee non-zero margin? That is, can we use an augmented data set $S_{\mathrm{aug}}$, such that by applying any ERM to it, the output classifier $\mathcal{A}(S_{\mathrm{aug}})$ has some margin? Figure 1 provides a sketch of this problem for linear classification.

Figure 1. A linearly separable data set with two data points, each in its own class, and two input dimensions. If we wish to guarantee a positive margin for all feasible linear separators, i.e., all linear ERMs, we need to augment the training set with additional data points. Otherwise, a linear separator exists with zero margin.

Lower bounds on the number of augmentations. We first consider linear classification of linearly separable data. We develop lower bounds on the number of augmented data points needed to guarantee that any linear separator of the augmented data has positive margin with respect to the original data set. We show that in $d$ dimensions, $d+1$ augmented data points are necessary for any data augmentation strategy to achieve positive margin. Moreover, there is some strategy that achieves the best possible margin with only $d + 1$ augmented points.
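To make the setup concrete, here is a minimal numpy sketch (our own illustration, not code from the paper). It trains a plain perceptron, which is an ERM for the 0/1 loss on separable data that enforces no margin, and measures the $\ell_2$ margin the learned linear classifier attains on the training set; the two-point data mirrors Figure 1 and the function names are ours.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """A plain perceptron: an ERM for the 0/1 loss on separable data
    that enforces no margin whatsoever."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:        # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                mistakes += 1
        if mistakes == 0:                     # zero empirical 0/1 risk reached
            break
    return w, b

def margin(w, b, X, y):
    """gamma_h(S): minimum distance from S to the hyperplane of h,
    or -inf if h does not linearly separate S."""
    scores = X @ w + b
    if np.any(y * scores <= 0):
        return -np.inf
    return np.min(np.abs(scores)) / np.linalg.norm(w)

# Toy two-point data set in the spirit of Figure 1 (hypothetical numbers).
X = np.array([[0.0, 1.0], [0.0, -1.0]])
y = np.array([1, -1])
w, b = perceptron(X, y)
print(margin(w, b, X, y))   # margin of the learned separator on the original data
```

Nothing in the learner rewards margin: whatever margin the output happens to have depends entirely on the data and the update order, which is exactly why the question above asks what augmentation can guarantee.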
However, if the augmented points are formed by bounded perturbations of the training set, we need at least as many augmented data points as true training points to ensure positive margin.

Upper bounds for additive random perturbations. In practice, many data augmentation methods employ random perturbations, including random crops, rotations, and additive noise. As a first step towards analyzing these methods, we focus on the setting where the augmented data set is formed by adding spherical random noise to the original training data. We specifically quantify how the dimension of the data, the number of augmentations per data point, and their norm can impact the worst-case margin. Our results show that if the norm of the additive noise is proportional to the margin, then the number of augmented data points must be exponential to ensure a constant factor approximation of the best possible margin. However, if the norm of the additive noise is carefully chosen, then polynomially many augmentations are sufficient to guarantee that any separator of the augmented data set has margin that is a constant fraction of the max margin of the original data set.

Nonlinear classification and margin. Finally, we extend our results to nonlinear classifiers that assign the same label within small convex hulls of the training data. We provide lower bounds on the number of augmentations needed for such respectful classifiers to achieve positive margin, and also analyze their margin under random DA methods. Despite respectful classifiers being significantly more general than linear ones, we show that their worst-case margin after augmentation can be comparable to that of linear classifiers.

1.2. Related Work

DA is closely related to robust optimization methods (Xu et al., 2009; Caramanis et al., 2012; Sinha et al., 2018; Wong & Kolter, 2018). While DA aims at improving model robustness via finitely many perturbations of the input data, robust optimization methods solve robust versions of ERM, which typically involve worst-case perturbations over infinite sets. Our work has particularly strong connections to Xu et al. (2009), which shows that regularized SVMs are equivalent to robust versions of linear classification. Our results can be viewed as attempting to train robust models without the need to perform robust optimization.

Our work may also be viewed as quantifying the robustness of classifiers trained with DA against adversarial (i.e., worst-case) perturbations. Many recent works have analyzed the robustness of various classifiers to adversarial perturbations from a geometric perspective. Fawzi et al. (2016) introduce a notion of semi-random noise and study the robustness of classifiers to this noise in terms of the curvature of the decision boundary. Moosavi-Dezfooli et al. (2018) also relate the robustness of a classifier to the local curvature of its decision boundary, and provide an empirical analysis of the curvature of decision boundaries of neural networks. Fawzi et al. (2018a) relate the robustness of a classifier to its empirical risk and show that guaranteeing worst-case robustness is much more difficult than robustness to random noise. Franceschi et al. (2018) provide a geometric characterization of the robustness of linear and locally approximately flat classifiers. Their results analyze the relation between the robustness of a classifier to noise and its robustness to adversarial perturbations.
2. Margin via Data Augmentation

Our work aims to quantify the potential of DA to guarantee margin for generic ERMs. We first examine linear classification on linearly separable data, and then extend our results to nonlinear classification. Although we can find max-margin linear classifiers efficiently through quadratic programming (Shalev-Shwartz & Ben-David, 2014), generalizing this to nonlinear classifiers has proved difficult; if this were a simple task for neural networks, the problem of adversarial examples would be non-existent. Hence, linear classification serves as a valuable entry point for our study of data augmentation.

We first introduce some notation. Let $A, B \subseteq \mathbb{R}^d$, $x, y \in \mathbb{R}^d$, and $r \ge 0$. Let $d(x, y)$ denote the $\ell_2$ distance between $x$ and $y$, and let $d(A, B) = \inf_{x \in A, y \in B} d(x, y)$. Define $A^r := \{z \in \mathbb{R}^d \mid d(z, A) \le r\}$. Let $|A|$, $\mathrm{int}(A)$, and $\mathrm{conv}(A)$ denote the cardinality, interior, and convex hull of $A$. Let $S^{d-1}$ denote the unit sphere in $\mathbb{R}^d$, and for $r > 0$ let $r\,S^{d-1}$ denote the sphere of radius $r$. Let $S \subseteq \mathbb{R}^d \times \{\pm 1\}$ be our training set. For $(x, y) \in S$, $x$ is the feature vector and $y \in \{\pm 1\}$ is the label. For any such $S$, we define
$$X_+ = \{x \mid (x, +1) \in S\}, \qquad X_- = \{x \mid (x, -1) \in S\}. \quad (2.1)$$

Linear classification. We next recall some background on linear classification. As in Section 1.1, we assume we have access to an algorithm $\mathcal{A}$ that solves the ERM problem over the set of linear classifiers. A linear classifier is a function of the form $h(x) = \mathrm{sign}(\langle w, x \rangle + b)$, for $w \in \mathbb{R}^d$, $b \in \mathbb{R}$. We often identify $h$ with the hyperplane $H = \{x \mid \langle w, x \rangle + b = 0\}$. We say that $h$ linearly separates $S$ if $h(x) \ge 0$ for all $x \in X_+$ and $h(x) \le 0$ for all $x \in X_-$. If such an $h$ exists, $S$ is linearly separable. Let $H(S)$ denote the set of linear separators of $S$.

Margin. Suppose $S$ is linearly separable. The margin of a linear separator $h \in H(S)$ is defined as follows:

Definition 1. The margin of a linear separator $h(x) = \mathrm{sign}(\langle w, x \rangle + b)$ with associated hyperplane $H$ is
$$\gamma_h(S) = \inf_{(x,y) \in S} d(x, H) = \inf_{(x,y) \in S} \frac{|\langle w, x \rangle + b|}{\|w\|_2}.$$
We define $\gamma_h(S) = -\infty$ if $h$ does not linearly separate $S$.

If $S$ is linearly separable, there is a linear classifier $h^*$ corresponding to $(w^*, b^*)$ with maximal margin $\gamma^*$. This classifier is the most robust linear classifier with respect to bounded $\ell_2$ perturbations of samples in $S$. In this work, we analyze the margin of ERMs that are trained without any explicit margin constraints or regularization.

Let $S$ denote the true dataset. To achieve margin, we create an artificial dataset $S'$. We then assume we have access to an algorithm that outputs (if possible) a linear separator $h$ of the augmented dataset $S_{\mathrm{aug}} := S \cup S'$. We define $X'_\pm$ and $X_{\mathrm{aug},\pm}$ analogously to $X_\pm$ in (2.1). We will analyze the margin of $h$ with respect to the true training data $S$. If $S$ is linearly separable and we add no artificial points, then some $h \in H(S)$ must have zero margin. If $S'$ is designed properly, one might hope that $S_{\mathrm{aug}}$ is still linearly separable and that any $h \in H(S_{\mathrm{aug}})$ has positive margin with respect to $S$. The following notion formalizes this idea, illustrated in Figure 2.

Definition 2. The worst-case margin of a linear separator of $S_{\mathrm{aug}}$ with respect to the original data $S$ is defined as
$$\alpha(S, S') = \min_{h \in H(S_{\mathrm{aug}})} \gamma_h(S).$$
We define this to be $-\infty$ if $H(S_{\mathrm{aug}}) = \emptyset$.

We are generally interested in the following question:

Question. How do we design $S'$ so that $\alpha(S, S')$ is as large as possible?

In Section 3.1, we analyze how large $S'$ must be to ensure that $\alpha(S, S')$ is positive. We show that $|S'| > d$ is necessary to ensure positive worst-case margin. Moreover, if $S'$ is formed via bounded perturbations of $S$, we need $|S'| \ge |S|$ to guarantee positive margin.
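Definition 2 takes a minimum over all separators of $S_{\mathrm{aug}}$, which is not directly computable. The sketch below (our own illustration, with hypothetical function names) approximates $\alpha(S, S')$ from above by sampling many random hyperplanes, keeping only those that separate $S_{\mathrm{aug}}$, and taking the smallest margin they attain on $S$.

```python
import numpy as np

def linear_margin(w, b, X, y):
    """Margin of the hyperplane (w, b) on (X, y); -inf if it fails to separate."""
    scores = X @ w + b
    if np.any(y * scores <= 0):
        return -np.inf
    return np.min(np.abs(scores)) / np.linalg.norm(w)

def worst_case_margin_estimate(X, y, X_aug, y_aug, trials=50_000, seed=0):
    """Monte Carlo over-estimate of alpha(S, S'): alpha is a minimum over *all*
    separators of S_aug, so minimizing over a finite random sample can only overshoot."""
    rng = np.random.default_rng(seed)
    worst = np.inf
    for _ in range(trials):
        w = rng.standard_normal(X.shape[1])
        b = rng.standard_normal()
        if linear_margin(w, b, X_aug, y_aug) <= 0:   # not a separator of the augmented set
            continue
        worst = min(worst, linear_margin(w, b, X, y))
    return worst   # np.inf means no sampled hyperplane separated S_aug
```

On a small example, running this with $S_{\mathrm{aug}} = S$ (no augmentation) drives the estimate toward zero as the number of trials grows, while well-placed augmented points can keep it bounded away from zero, which is the effect Figure 2 depicts.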
In Section 3.2, we analyze the setting where $S'$ is formed by spherical random perturbations of $S$ of radius $r$, a technique that mirrors random noise perturbations used in practice. We show that if $r$ is not well-calibrated, exponentially many perturbations are required to achieve a margin close to $\gamma^*$. However, if $r$ is set correctly, then it suffices to have $|S'|$ polynomial in $n$ and $d$ to ensure that any linear separator of $S_{\mathrm{aug}}$ will achieve margin close to $\gamma^*$ on $S$. In Section 4, we generalize this notion to a class of nonlinear classifiers, which we refer to as respectful classifiers, and derive analogous results to those described above. We show that this class includes classifiers of general interest, such as nearest neighbor classifiers.

Figure 2. Solid dots represent the true data points and hollow dots represent artificial data points. Convex hulls of the true and augmented data are represented by solid and dashed lines, respectively. Classifiers are shown in blue. (a) Without DA, we may obtain a zero margin classifier. (b) Carefully chosen augmentations can guarantee positive margin. (c) Large augmentations may violate linear separability.

3. Linear Classifiers

3.1. How Much Augmentation Is Necessary?

Suppose $S$ is linearly separable with max-margin $\gamma^*$. We wish to determine the required size of $S'$ to ensure that $\alpha(S, S') > 0$. We first show that to achieve a positive worst-case margin, the total number of perturbations must exceed the ambient dimension.

Theorem 1. If $|S'| < d + 1$, then $\alpha(S, S') \le 0$.

Therefore, we need to augment by at least $d + 1$ points to ensure positive margin. We now wish to understand what margin is possible using data augmentation. We have the following lemma.

Lemma 1. Let $\gamma^*$ be the maximum margin on $S$. For all $S' \subseteq \mathbb{R}^d \times \{\pm 1\}$, $\alpha(S, S') \le \gamma^*$.

In fact, if we know the max-margin hyperplane, then $d + 1$ points are sufficient to achieve $\alpha(S, S') = \gamma^*$.

Theorem 2. Let $S$ be linearly separable with max-margin $\gamma^*$. Then there exists $S'$ such that $|S'| = d + 1$ and $\alpha(S, S') = \gamma^*$.

The augmentation method in the proof (see Section ??) requires explicit knowledge of the maximum-margin hyperplane. In practice, most augmentation methods avoid such global computations, and instead apply bounded perturbations to the true data. Recall that for $A \subseteq \mathbb{R}^d$, $A^r = \{x \mid d(x, A) \le r\}$. For $S \subseteq \mathbb{R}^d \times \{\pm 1\}$, we define
$$S^r = \big((X_+)^r \times \{+1\}\big) \cup \big((X_-)^r \times \{-1\}\big). \quad (3.1)$$
If $S'$ is formed from $S$ by perturbations of size at most $r$, then $S' \subseteq S^r$. The following result shows that if $S' \subseteq S^r$, then $|S'| \ge |S|$ is necessary to guarantee that $\alpha(S, S') > 0$.

Theorem 3. Fix $(n, m) \in \mathbb{N}^2$ and $r > 0$. Then there exists $S \subseteq \mathbb{R}^d \times \{\pm 1\}$ with $|X_+| = n$ and $|X_-| = m$, such that if $S' \subseteq S^r$ and $|X'_+| < n$, then $\alpha(S, S') = 0$.

Figure 3 provides an illustration. Given $r$, we choose $X_+$ to lie on a parabola $P$ such that the tangent lines at these points are at distance at least $r$ from all other points. We choose $X_-$ to be far enough below the $x$-axis so that these tangent lines linearly separate $X_+$ from $X_{\mathrm{aug},-}$. Suppose we do not augment some point $s \in X_+$. Then the tangent at that point linearly separates $X_+$ from $X_{\mathrm{aug},-}$, while being at distance 0 from $s$. Thus, we need augmentations at every point in $X_+$ to guarantee positive margin.

Figure 3. Points in $X_+$ lie on the parabola $P$ defined by $y = 9x^2$. The tangent at each point $s \in X_+$ does not intersect the ball of radius $r$ around any other point in $X_+$. We choose $X_-$ to have points far enough below the $x$-axis so that the tangents at $X_+$ separate $(X_+)^r$ from any $X'_- \subseteq (X_-)^r$.
Points in $X_+$ and their $r$-balls are in red, their tangents are in blue, and $X_-$ is in black.

3.2. Random Perturbations

We now analyze the setting where $S'$ is formed by random perturbations of $S$. Our results reveal a fundamental trade-off between the size of perturbations, the number of perturbations, the margin achieved, and whether or not linear separability is maintained. If we construct many large perturbations, we may violate linear separability, but if we use too few perturbations that are too small in size, we may only achieve small margin guarantees. In the rest of this section, we assume that each point in $S'$ is of the form $(x + z, y)$, where $(x, y) \in S$ and $z$ is drawn uniformly at random from $r\,S^{d-1}$, the sphere of radius $r$. Due to the construction of $S'$, the following lemma about the inner products of random points on the sphere will be useful throughout.

Lemma 2. Let $a$ be a unit vector and let $z$ be generated uniformly at random from the sphere of radius $\gamma$. Then with probability at least $1 - e^{-d\epsilon^2/2\gamma^2}$, $\langle a, z \rangle \le \epsilon$.

For further reference, see Chapter 3 of Vershynin (2011).

Upper bounds on margin. By Theorem 1, we know that $|S'| \ge d + 1$ is necessary to achieve positive margin on $S$. Since $S' \subseteq S^r$, we must have $\alpha(S, S') \le r$. In general, we hope that with high probability, $\alpha(S, S') \approx r$. We show below that the margin and perturbation size can be close only if $|S'|$ is exponential in $d$. The result follows by using bounds on the measure of spherical caps to control the distance between $S'$ and the max-margin hyperplane.

Theorem 4. For all $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have
$$\alpha(S, S') \le r\,\sqrt{\frac{2\ln(|S'|) + 2\ln(1/\delta)}{d}}.$$

This result shows that to achieve minimum-margin close to $r$, we need the number of perturbations to be exponential in $d$. Thus, if $r \approx \gamma^*$, we require exponentially many augmentations. However, by making $r$ much larger than $\gamma^*$, we may be able to achieve a large margin, provided linear separability is maintained.

Maintaining linear separability. We now show that if $r$ is too large, the augmented sets will often not be linearly separable. Specifically, we show that when $S$ has just two points, if $r = \Omega(\sqrt{d}\,\gamma^*)$ and $|X'_+| = \Omega(d)$, then linear separability is violated with high probability. For Theorem 5, suppose $S = \{(x_1, +1), (x_2, -1)\}$, where $d(x_1, x_2) = 2\gamma^*$ (i.e., the max-margin is $\gamma^*$).

Theorem 5. If $|X'_+| \ge 16d$ and $r \ge \frac{8e^2\sqrt{2d}}{\pi^{3/2}}\gamma^*$, then with probability at least $1 - 2e^{-d/6}$, $S_{\mathrm{aug}}$ is not linearly separable.

To prove this, we first show that with high probability, there are $\Omega(d)$ points in $X'_+$ labeled $-1$ by the max-margin classifier. We then use estimates of when random points on the sphere are contained in a hemisphere to show that with high probability, the convex hull of these points contains $x_2$. This analysis can be extended directly to the setting where $X_+$ and $X_-$ are contained in balls of sufficiently small radius compared to $\gamma^*$.

On the other hand, we show that if $r$ is slightly smaller than $\sqrt{d}\,\gamma^*$, linear separability holds with high probability.

Theorem 6. Suppose $S$ is linearly separable and $|S'| \le N$. If $r \le \beta^{-1/2}\sqrt{d/\log(N)}\,\gamma^*$ for $\beta > 1$, then with probability at least $1 - N^{1-\beta}$, $S_{\mathrm{aug}}$ is linearly separable.

A short proof sketch is as follows. Let $w^*$ be a unit vector orthogonal to the max-margin hyperplane $H^*$. Suppose $(x + z, y) \in S'$, where $(x, y) \in S$ and $z$ is sampled uniformly on the sphere of radius $r$. By Lemma 2, with high probability $\langle w^*, x + z \rangle$ will be close to $\langle w^*, x \rangle$, and so $x$ and $x + z$ will fall on the same side of $H^*$. The result then follows by a union bound.
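As a sanity check on Lemma 2, the following sketch (our own, not from the paper) draws points uniformly from a sphere of radius $\gamma$ by normalizing Gaussian vectors, which is also how one would sample the spherical perturbations described above, and compares the empirical tail probability of $\langle a, z \rangle$ with the bound $e^{-d\epsilon^2/2\gamma^2}$; the parameter values are hypothetical.

```python
import numpy as np

def sample_sphere(num, d, radius, rng):
    """Draw `num` points uniformly from the sphere of the given radius in R^d."""
    g = rng.standard_normal((num, d))
    return radius * g / np.linalg.norm(g, axis=1, keepdims=True)

def check_lemma2(d=200, gamma=1.0, eps=0.2, num=100_000, seed=0):
    """Compare P(<a, z> > eps) estimated by simulation with exp(-d eps^2 / (2 gamma^2))."""
    rng = np.random.default_rng(seed)
    a = np.zeros(d)
    a[0] = 1.0                                  # any fixed unit vector works by symmetry
    z = sample_sphere(num, d, gamma, rng)
    empirical = np.mean(z @ a > eps)
    bound = np.exp(-d * eps**2 / (2 * gamma**2))
    return empirical, bound                     # the empirical tail should not exceed the bound

print(check_lemma2())
```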
Theorems 5 and 6 together imply that if $r = \Omega(\sqrt{d}\,\gamma^*)$, we cannot hope to maintain linear separability. Instead, setting $r = O(\sqrt{d/\log N}\,\gamma^*)$, we will maintain linear separability with high probability. We will use the latter result in the next section to show that for such $r$, we can actually provide lower bounds on the adversarial margin $\alpha(S, S')$ achieved.

Lower bounds on margin. By Theorem 4, we know that if $r \approx \gamma^*$, we need $N$ to be exponential in $d$ to achieve a margin close to $\gamma^*$. By Theorem 6, we can set $r$ to be as large as $O(\sqrt{d/\log N}\,\gamma^*)$ and maintain linear separability. We might hope that in this latter setting, we can achieve a margin close to $\gamma^*$ with substantially fewer points than when $r \approx \gamma^*$. Suppose $S'$ is formed by taking $N$ perturbations of each point in $S = \{(x_i, y_i)\}_{i \in [n]}$. Formally, for $i \in [n]$, $j \in [N]$, let $z_i^{(j)}$ be drawn uniformly at random from $r\,S^{d-1}$. Then,
$$S' = \{(x_i + z_i^{(j)}, y_i)\}_{i \in [n],\, j \in [N]}. \quad (3.2)$$
We show the following theorem:

Theorem 7. Suppose $S$ is linearly separable with max-margin $\gamma^*$. Let $S'$ be as in (3.2). There is a universal constant $C$ such that if $N \ge Cd$ and $r \le \beta^{-1/2}\sqrt{d/\log N}\,\gamma^*$ for $\beta > 1$, then with probability at least $1 - ne^{-d} - nN^{1-\beta}$, we have
$$\alpha(S, S') \ge \frac{1}{2\sqrt{2}}\sqrt{\frac{\log(N/d)}{d}}\; r.$$

Taking $r = \beta^{-1/2}\sqrt{d/\log N}\,\gamma^*$ and $\beta$ sufficiently large, we can ensure that the worst-case margin among linear separators is a constant fraction of the max-margin. Thus, with high probability, we can achieve a constant approximation of the best possible margin with $|S'| = O(nd^2)$. While Theorems 1 and 3 indicate that $|S'|$ should grow linearly in $n$ and $d$, determining whether $O(nd^2)$ is tight for some $S$ is an open problem.

Remark 1. Theorem 7 can be extended to the setting where we only take perturbations of each point in a $\tau$-cover of $X_+$ and $X_-$. Recall that $A$ is a $\tau$-cover of $B$ if for every $x \in B$ there is $x' \in A$ with $d(x, x') \le \tau$. The same result (with the constant $2\sqrt{2}$ replaced by $4\sqrt{2}$) holds when $S'$ is formed according to (3.2), but with $S$ replaced by $(A_+ \times \{+1\}) \cup (A_- \times \{-1\})$, where $A_+, A_-$ are $\tau$-covers of $X_+, X_-$ for sufficiently small $\tau$. Thus, we only need $|S'| = O(md^2)$ perturbations, where $m = \max\{|A_+|, |A_-|\}$. When $S$ is highly clustered, this could result in a much smaller sample complexity, as $m$ may be much smaller than $n$.

To give a sketch of the proof, suppose $(0, +1) \in S$. Thus, $S'$ contains $N$ points of the form $(z_i, +1)$ where $z_i \in r\,S^{d-1}$. We wish to guarantee that any linear separator, with associated hyperplane $H$, has some margin at 0. Consider $K = \mathrm{conv}(\{z_i\}_{i \in [N]})$. Since each $z_i$ has label $+1$, we know that $H$ cannot intersect the interior of $K$. Hence, if 0 is in the interior of $K$, then $H$ has positive margin at 0. In fact, we extract a strengthening of this from the proof of Lemma 3.1 of Alonso-Gutierrez (2008):

Lemma 3. Let $z_1, \ldots, z_N$ be drawn uniformly at random on $r\,S^{d-1}$, and let $K = \mathrm{conv}(z_1, \ldots, z_N)$. Then there exists a constant $C > 0$ such that if $N \ge Cd$, then with high probability $B_\rho(0) \subseteq K$, where $\rho = \Omega\big(\sqrt{\log(N/d)/d}\;r\big)$.

The margin of $H$ at 0 is therefore at least $\rho$ with high probability. Applying Theorem 6, we derive Theorem 7. A pictorial explanation of the proof is given in Figure 4.

Figure 4. A pictorial explanation of the proof of Theorem 7. Suppose $X'_+$ is drawn uniformly at radius $r$ from $X_+$. With $r$ as in the theorem statement, with high probability $X'_+$ will not prevent linear separability of $S_{\mathrm{aug}}$. Moreover, with high probability $\mathrm{conv}(X'_+)$ will contain a ball of radius $\rho$ around each point in $X_+$. This then implies that any $h \in H(S_{\mathrm{aug}})$ has margin at least $\rho$.
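The following sketch (ours, with hypothetical parameter choices) implements the augmentation scheme (3.2) and numerically probes Lemma 3: it estimates the radius of the largest origin-centered ball inside the convex hull of $N$ uniform points on $r\,S^{d-1}$ via the support function over random directions, for comparison against the rate $\sqrt{\log(N/d)/d}\;r$.

```python
import numpy as np

def augment_spherical(X, y, N, r, rng):
    """Build S' as in (3.2): N uniform perturbations of radius r for every training point."""
    n, d = X.shape
    g = rng.standard_normal((n, N, d))
    z = r * g / np.linalg.norm(g, axis=2, keepdims=True)
    X_aug = (X[:, None, :] + z).reshape(n * N, d)
    y_aug = np.repeat(y, N)
    return X_aug, y_aug

def inradius_estimate(Z, num_directions=5000, seed=0):
    """Over-estimate of the largest rho with B_rho(0) inside conv(Z): when 0 is in conv(Z),
    the true value is the minimum over *all* unit directions u of max_i <u, z_i>;
    here we minimize over a random sample of directions."""
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((num_directions, Z.shape[1]))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    return float(np.min(np.max(U @ Z.T, axis=1)))

# Example with hypothetical parameters: a single training point at the origin.
rng = np.random.default_rng(0)
d, N, r = 50, 500, 1.0
Z, _ = augment_spherical(np.zeros((1, d)), np.array([1]), N, r, rng)
print(inradius_estimate(Z), np.sqrt(np.log(N / d) / d) * r)   # compare with Lemma 3's rate
```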
4. Nonlinear Classifiers

We now consider more general binary-valued classifiers. Given $S \subseteq \mathbb{R}^d \times \{\pm 1\}$, a classifier $f : \mathbb{R}^d \to \{\pm 1\}$ separates $S$ if $f(x) = y$ for all $(x, y) \in S$. Let $R(S)$ denote the collection of separators of $S$. If $R(S)$ is non-empty, we say that $S$ is separable. Given $f : \mathbb{R}^d \to \{\pm 1\}$, we define a generalization of the notion of margin in Definition 1.

Definition 3. If $f \in R(S)$, its margin on $S$ is given by
$$\gamma_f(S) := \min_{(x,y) \in S} d\big(x, f^{-1}(-y)\big).$$
We define $\gamma_f(S) = -\infty$ if $f \notin R(S)$.

Suppose we have a function class $\mathcal{F}$ and we wish to find an ERM of the 0/1 loss on $S$ (more generally, of any nonnegative loss function where $\ell(f(x), y) = 0$ iff $f(x) = y$). The set of ERMs is simply $R(S) \cap \mathcal{F}$. To find ERMs with positive margin, we will again form a perturbed dataset $S'$, and then find some ERM of $S_{\mathrm{aug}} = S \cup S'$. We define the margin of $f$ with respect to $S$ and $S'$ as follows.

Definition 4. The margin $\gamma_f(S, S')$ of $f$ with respect to $S, S'$ is defined by $\gamma_f(S, S') = \gamma_f(S)$ if $f \in R(S_{\mathrm{aug}})$, and $-\infty$ otherwise.

If $S_{\mathrm{aug}}$ is separable and $\mathcal{F}$ is sufficiently expressive, one can always find an ERM with zero margin. Instead, we will restrict to a collection of functions that is expressive, but still admits meaningful margin guarantees. We refer to these as respectful functions.

Respectful classifiers. If $x_1, x_2 \in \mathbb{R}^d$ are sufficiently close and have the same label, it is reasonable to expect a well-behaved classifier to assign the same label to every point between $x_1$ and $x_2$. In fact, Fawzi et al. (2018b) show that, empirically, state-of-the-art deep nets often remain constant on straight lines connecting different points of the same class. If a linear classifier $f$ labels all points in $A$ as $+1$, we know that $f$ assigns $+1$ to the entire set $\mathrm{conv}(A)$. With this in mind, we give the following definition:

Definition 5. A function $f : \mathbb{R}^d \to \{\pm 1\}$ is respectful of $S$ if $f(x) = +1$ for all $x \in \mathrm{conv}(X_+)$ and $f(x) = -1$ for all $x \in \mathrm{conv}(X_-)$.

Intuitively, $f$ must respect the operation of taking convex hulls of points with the same label. However, assigning all of $\mathrm{conv}(X_+)$ and $\mathrm{conv}(X_-)$ the same label is a relatively strict condition. To relax this condition, we define a class of functions that are respectful only on small clusters of points. Recall the notion of a circumradius:

Definition 6. The circumradius $R(A)$ of a set $A \subseteq \mathbb{R}^d$ is the radius of the smallest ball containing $A$.

Figure 5. Suppose that $R(X_+) \le \epsilon$. The classifier with a blue decision boundary is not $\epsilon$-respectful of $S$, but the classifier with a green decision boundary is $\epsilon$-respectful of $S$.

We now define $\epsilon$-respectful classifiers:

Definition 7. For $\epsilon \in [0, \infty]$, we say that a classifier $f : \mathbb{R}^d \to \{\pm 1\}$ is $\epsilon$-respectful of $S$ if for all $A \subseteq X_+$ with $R(A) \le \epsilon$ and all $x \in \mathrm{conv}(A)$, $f(x) = +1$; and for all $B \subseteq X_-$ with $R(B) \le \epsilon$ and all $x \in \mathrm{conv}(B)$, $f(x) = -1$.

Let $R_\epsilon(S)$ denote the set of $\epsilon$-respectful classifiers of $S$. An illustration is provided in Figure 5. Note that the set of separators of $S$ is simply $R_0(S)$, and the set of respectful classifiers is $R_\infty(S)$. Smaller values of $\epsilon$ lead to more expressive function classes $R_\epsilon(S)$. We now show that this definition includes some function classes of interest:

Example 1 (Linear Classifiers). Recall that $H(S)$ is the set of linear separators of $S$. It is straightforward to see that such functions are respectful of $S$, so $H(S) \subseteq R_\infty(S)$. By the hyperplane separation theorem (see Lemma ??), we have $H(S) = \emptyset$ if and only if $R_\infty(S) = \emptyset$. In general, $H(S)$ is a proper subset of $R_\infty(S)$.

Example 2 (Nearest Neighbor). Let $f_{\mathrm{NN}}$ denote the 1-nearest-neighbor classifier on $S$: for $x \in \mathbb{R}^d$, we have $f_{\mathrm{NN}}(x) = +1$ if $d(x, X_+) \le d(x, X_-)$, and $f_{\mathrm{NN}}(x) = -1$ otherwise.
For $\epsilon \in \big[0, \frac{d(X_+, X_-)}{2}\big)$, we can argue that $f_{\mathrm{NN}} \in R_\epsilon(S)$ as follows. Suppose $x \in \mathrm{conv}(A)$ where $A \subseteq X_+$ and $R(A) \le \epsilon$. Then $d(x, X_+) \le \epsilon$. For all $u \in X_-$, we have $d(u, X_+) \ge d(X_+, X_-)$, so by the triangle inequality $d(x, u) \ge d(u, X_+) - d(x, X_+) > \frac{d(X_+, X_-)}{2} \ge d(x, X_+)$. Hence, $f_{\mathrm{NN}}(x) = +1$.

We now consider the following adversarial problem. Given $S$, we form a perturbed version $S'$. An adversary can pick an $\epsilon$-respectful classifier $f \in R_\epsilon(S_{\mathrm{aug}})$. The smaller the value of $\epsilon$, the more powerful the adversary. We hope that no matter which $f$ the adversary chooses, the value of $\gamma_f(S, S')$ is not too small. We first provide bounds on how large $S'$ must be to ensure a positive margin, and then derive results for random perturbations when $S$ is (non)-linearly separable. Our results are versions of Theorem 7 for respectful classifiers. Finally, we will show that for respectful classifiers, our bounds for random perturbations are tight up to constants for some $S$.

4.1. How Much Augmentation Is Necessary?

We first show that for any $\epsilon \in [0, \infty]$, we must have $|S'| > 2d$ in order to achieve a positive margin.

Theorem 8. Suppose $S$ is separable. If $|X'_+| \le d$ or $|X'_-| \le d$, then for any $\epsilon \in [0, \infty]$, either $R_\epsilon(S_{\mathrm{aug}}) = \emptyset$, or there exists $f \in R_\epsilon(S_{\mathrm{aug}})$ such that $\gamma_f(S, S') = 0$.

Suppose we limit ourselves to bounded perturbations of $S$, so that $S' \subseteq S^r$ for some $r > 0$. We will show that in this setting, we may need as many as $|S|(d + 1)$ perturbations to guarantee a positive margin.

Theorem 9. For all $n \ge 1$ and $\epsilon, r \in (0, \infty)$, there is some $S$ of size $n$ such that if $S' \subseteq S^r$ and $|S'| < |S|(d + 1)$, then there exists $f \in R_\epsilon(S_{\mathrm{aug}})$ such that $\gamma_f(S, S') = 0$.

Next, we consider the problem of ensuring a positive margin with bounded perturbations. The following lemma shows that if $\epsilon < r$, there is some $S$ such that the adversary can find a zero-margin classifier for any $S' \subseteq S^r$.

Lemma 4. For any $\epsilon \in (0, \infty)$ and $r > \epsilon$, there is $S$ such that for any $S' \subseteq S^r$, there exists $f \in R_\epsilon(S_{\mathrm{aug}})$ such that $\gamma_f(S, S') = 0$.

Therefore, for $S' \subseteq S^r$, to ensure that any $f \in R_\epsilon(S_{\mathrm{aug}})$ has positive margin, we need $r \le \epsilon$, $|S'| \ge 2d + 2$, and $|S'| \ge |S|(d + 1)$. In fact, these three conditions are sufficient to ensure positive margin.

Theorem 10. For any $S$, if $\epsilon \in (0, \infty]$ and $r \le \epsilon$, then there exists $S' \subseteq S^r$ with $|S'| = |S|(d+1)$ such that for all $f \in R_\epsilon(S_{\mathrm{aug}})$, $\gamma_f(S, S') > 0$.

While this theorem does not guarantee that $R_\epsilon(S_{\mathrm{aug}}) \ne \emptyset$, we will show in Lemma ?? that if $S' \subseteq S^r$ for $r \le \epsilon < \frac{d(X_+, X_-)}{4}$, then $R_\epsilon(S_{\mathrm{aug}})$ is guaranteed to be non-empty.

4.2. Random Perturbations

We now analyze how random perturbations affect the margin of $\epsilon$-respectful classifiers. Just as in the linear setting, we focus on the case where the points in $S'$ are of the form $(x + z, y)$, where $z$ is drawn uniformly at random from the sphere of radius $r$. We provide lower bounds on the margin that are analogous to the linear setting, and show that our margin bounds are tight up to constants in some settings.

Linearly separable data. We first show that when $S$ is linearly separable and we perform random augmentations, the results in Section 3.2 still hold, even though the adversary is allowed to select classifiers in the larger set $R_\infty(S_{\mathrm{aug}})$.

Theorem 11. Let $S'$ be generated as in (3.2). There is a universal constant $C$ such that if $N \ge Cd$ and $r \le \beta^{-1/2}\sqrt{d/\log N}\,\gamma^*$ for $\beta > 1$, then with probability at least $1 - ne^{-d} - nN^{1-\beta}$, we have $R_\infty(S_{\mathrm{aug}}) \ne \emptyset$. Furthermore, for all $f \in R_\infty(S_{\mathrm{aug}})$, we have
$$\gamma_f(S, S') \ge \frac{1}{2\sqrt{2}}\sqrt{\frac{\log(N/d)}{d}}\; r.$$

The proof uses a generalization of Theorem 7 to respectful functions. We show in Theorem 14 that this bound is tight up to constants under certain assumptions on $S$. As in the linear case, a perturbation radius of $r = O(\sqrt{d}\,\gamma^*)$ is necessary to maintain separability.
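To illustrate Example 2 and Definition 7, here is a small sketch (our own, with hypothetical data and function names) of the 1-nearest-neighbor rule together with a Monte Carlo spot-check of the $\epsilon$-respectful property on a single cluster; it samples random points of the convex hull rather than proving the property.

```python
import numpy as np

def nn_classifier(X_pos, X_neg):
    """1-nearest-neighbor rule f_NN from Example 2: +1 iff d(x, X_+) <= d(x, X_-)."""
    def f(x):
        d_pos = np.min(np.linalg.norm(X_pos - x, axis=1))
        d_neg = np.min(np.linalg.norm(X_neg - x, axis=1))
        return 1 if d_pos <= d_neg else -1
    return f

def spot_check_respectful(f, cluster, label, trials=5000, seed=0):
    """Sample random convex combinations of one cluster (assumed to have small
    circumradius) and verify f assigns them the cluster's label. A check, not a proof."""
    rng = np.random.default_rng(seed)
    weights = rng.dirichlet(np.ones(len(cluster)), size=trials)
    return all(f(x) == label for x in weights @ cluster)

# Hypothetical data: a tight positive cluster far from a tight negative cluster.
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=0.0, scale=0.05, size=(10, 3))
X_neg = rng.normal(loc=5.0, scale=0.05, size=(10, 3))
f = nn_classifier(X_pos, X_neg)
print(spot_check_respectful(f, X_pos, +1), spot_check_respectful(f, X_neg, -1))
```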
Suppose $S = \{(x_1, +1), (x_2, -1)\}$ with $d(x_1, x_2) = 2\gamma^*$, and $S'$ is as in (3.2). Applying the hyperplane separation theorem and Theorem 5, we have the following result:

Theorem 12. If $N \ge 16d$ and $r \ge \frac{8e^2\sqrt{2d}}{\pi^{3/2}}\gamma^*$, then $\mathbb{P}\big(R_\infty(S_{\mathrm{aug}}) = \emptyset\big) \ge 1 - 2e^{-d/6}$.

In short, spherical random data augmentation behaves similarly whether the adversary selects linear classifiers or classifiers in $R_\infty(S_{\mathrm{aug}})$, both in terms of the margin achieved and the upper bounds on perturbation size needed to maintain separability.

Nonlinearly separable data. When $S$ consists of more than two points, the margin obtained by some $f \in R_\epsilon(S)$ may be much larger than that of the max-margin linear classifier. Moreover, $R_\epsilon(S)$ may be non-empty even though $S$ is not linearly separable. Thus, we would like to derive versions of the results in Section 3.2 for settings where $S$ may not be linearly separable, but $R_\epsilon(S) \ne \emptyset$. In fact, if $R_\epsilon(S) \ne \emptyset$ and we generate $S'$ as in (3.2), we can derive the following theorem, comparable to Theorem 7 above:

Theorem 13. If $r \le \epsilon$, then there is a universal constant $C$ such that if $N \ge Cd$, then with probability at least $1 - ne^{-d}$, for all $f \in R_\epsilon(S_{\mathrm{aug}})$,
$$\gamma_f(S, S') \ge \frac{1}{2\sqrt{2}}\sqrt{\frac{\log(N/d)}{d}}\; r.$$
Furthermore, if $\epsilon < \frac{d(X_+, X_-)}{4}$, then $R_\epsilon(S_{\mathrm{aug}}) \ne \emptyset$.

The first part of the proof proceeds similarly to that of Theorem 7, using the definition of $\epsilon$-respectful classifiers. For the second, we use nearest neighbor classifiers (as in Example 2) to construct $\epsilon$-respectful classifiers of $S_{\mathrm{aug}}$. Although $r \le \epsilon < \frac{d(X_+, X_-)}{4}$ is sufficient to guarantee that $R_\epsilon(S_{\mathrm{aug}}) \ne \emptyset$, this may be overly conservative. Whereas Theorems 11 and 12 provide a characterization of the range of $r$ for which $R_\infty(S_{\mathrm{aug}})$ is non-empty with high probability, a tighter characterization for $\epsilon < \infty$ remains open.

Upper bounds on margin. Finally, we show that for certain $S$, the margin bounds in Theorems 11 and 13 are tight up to constants. While it is as yet unknown whether Theorem 7 is asymptotically tight, the increased expressive capability of respectful classifiers allows us to exhibit upper bounds on the worst-case margin matching the lower bounds above. Suppose $S = \{(x_1, +1), (x_2, -1)\}$, and $S'$ is generated as in (3.2). We have the following result:

Theorem 14. Fix $\epsilon \in [0, \infty]$ and $r > 0$. There are absolute constants $C_1, C_2$ such that if $N > d$ and $R_\epsilon(S_{\mathrm{aug}}) \ne \emptyset$, then with probability at least $1 - 2e^{-C_2 d \log(N/d)}$, there exists $f \in R_\epsilon(S_{\mathrm{aug}})$ such that
$$\gamma_f(S, S') \le C_1 \sqrt{\frac{\log(2N/d)}{d}}\; r. \quad (4.1)$$

The proof relies on estimates of the inradius of random convex polytopes from Alonso-Gutierrez (2008). The theorem can also be extended to settings where $X_+$ and $X_-$ are not singletons. Suppose we can decompose $X_+$ and $X_-$ into clusters $\{A_i\}_{i=1}^k$ and $\{B_j\}_{j=1}^l$ such that each cluster has size at most $m$, circumradius at most $O\big(\sqrt{\log(N/d)/d}\;r\big)$, and the distance between any two clusters is $\Omega(\epsilon)$. If $S'$ is generated as in (3.2), then with high probability there is some $f \in R_\epsilon(S_{\mathrm{aug}})$ satisfying (4.1), where $N$ is replaced by $mN$.

5. Conclusion and Open Problems

Data augmentation is commonly used in practice, since it significantly improves test error and model robustness. In this work, we have analyzed the performance of data augmentation through the lens of margin. We have demonstrated how data augmentation can guarantee positive margin for unconstrained empirical risk minimizers. For both linear and nonlinear respectful classifiers, we provided lower bounds on the number of points needed to ensure positive margin, and analyzed the margin attained by additive spherical data augmentation. There are several interesting open problems that we plan to tackle in the future.
First, it would be interesting to theoretically analyze practical state-of-the-art augmentation methods, such as random crops, flips, and rotations. Such perturbations often fall outside our framework, as they are not bounded in the $\ell_2$ norm. Another fruitful direction would be to examine the performance of adaptive data augmentation techniques. For example, robust adversarial training (such as in Madry et al. (2017)) can be viewed as a form of adaptive data augmentation. By taking a data augmentation viewpoint, we hope to derive theoretical benefits of using adversarial training methods. One final direction would be to develop improved augmentation methods. In particular, we would like methods that can exploit domain knowledge and the geometry of the underlying problem in order to find models with better robustness and generalization properties.

References

Alonso-Gutierrez, D. On the isotropy constant of random convex sets. Proceedings of the American Mathematical Society, 136(9):3293-3300, 2008.

Andrews, G. E., Askey, R., and Roy, R. Special Functions, volume 71. Cambridge University Press, 2000.

Bastani, O., Ioannou, Y., Lampropoulos, L., Vytiniotis, D., Nori, A., and Criminisi, A. Measuring neural net robustness with constraints. In Advances in Neural Information Processing Systems, pp. 2613-2621, 2016.

Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108-116, 1995.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Caramanis, C., Mannor, S., and Xu, H. Robust optimization in machine learning. Optimization for Machine Learning, pp. 369, 2012.

Carlini, N. and Wagner, D. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39-57. IEEE, 2017.

Dao, T., Gu, A., Ratner, A. J., Smith, V., De Sa, C., and Ré, C. A kernel theory of modern data augmentation. arXiv preprint arXiv:1803.06084, 2018.

Fawzi, A., Moosavi-Dezfooli, S.-M., and Frossard, P. Robustness of classifiers: From adversarial to random noise. In Advances in Neural Information Processing Systems, pp. 1632-1640, 2016.

Fawzi, A., Fawzi, O., and Frossard, P. Analysis of classifiers' robustness to adversarial perturbations. Machine Learning, 107(3):481-508, 2018a.

Fawzi, A., Moosavi-Dezfooli, S.-M., Frossard, P., and Soatto, S. Empirical study of the topology and geometry of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762-3770, 2018b.

Ford, N., Gilmer, J., Carlini, N., and Cubuk, D. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.

Franceschi, J., Fawzi, A., and Fawzi, O. Robustness of classifiers to uniform $\ell_p$ and Gaussian noise. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, pp. 1280-1288, 2018.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. CoRR, abs/1412.6572, 2014. URL http://arxiv.org/abs/1412.6572.

Huber, G. Gamma function derivation of n-sphere volumes. The American Mathematical Monthly, 89(5):301-302, 1982.

Klartag, B. and Kozma, G. On the hyperplane conjecture for random convex sets. Israel Journal of Mathematics, 170(1):253-268, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.
Kuznetsova, A., Ju Hwang, S., Rosenhahn, B., and Sigal, L. Expanding object detector's horizon: Incremental learning framework for object detection in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 28-36, 2015.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Misra, I., Shrivastava, A., and Hebert, M. Watch and learn: Semi-supervised learning for object detectors from video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.

Moosavi-Dezfooli, S.-M., Fawzi, A., Fawzi, O., Frossard, P., and Soatto, S. Robustness of classifiers to universal perturbations: A geometric perspective. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ByrZyglCb.

Prest, A., Leistner, C., Civera, J., Schmid, C., and Ferrari, V. Learning object class detectors from weakly annotated video. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3282-3289. IEEE, 2012.

Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

Sinha, A., Namkoong, H., and Duchi, J. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Vershynin, R. Lectures in geometric functional analysis. Preprint, University of Michigan, 2011.

Wager, S., Wang, S., and Liang, P. S. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pp. 351-359, 2013.

Wendel, J. G. A problem in geometric probability. Mathematica Scandinavica, 11(1):109-111, 1963. ISSN 0025-5521, 1903-1807. URL http://www.jstor.org/stable/24490189.

Wong, E. and Kolter, Z. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283-5292, 2018.

Xu, H., Caramanis, C., and Mannor, S. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485-1510, 2009.

Zantedeschi, V., Nicolae, M.-I., and Rawat, A. Efficient defenses against adversarial attacks. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 39-49. ACM, 2017.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.