# Data Augmentation as Feature Manipulation

Ruoqi Shen¹, Sébastien Bubeck², Suriya Gunasekar²

¹University of Washington; part of this work was done as an intern at Microsoft Research. ²Microsoft Research. Correspondence to: Ruoqi Shen.

Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).

Abstract

Data augmentation is a cornerstone of the machine learning pipeline, yet its theoretical underpinnings remain unclear. Is it merely a way to artificially augment the data set size? Or is it about encouraging the model to satisfy certain invariances? In this work we consider another angle, and we study the effect of data augmentation on the dynamics of the learning process. We find that data augmentation can alter the relative importance of various features, effectively making certain informative but hard-to-learn features more likely to be captured in the learning process. Importantly, we show that this effect is more pronounced for non-linear models, such as neural networks. Our main contribution is a detailed analysis of the effect of data augmentation on the learning dynamics of a two-layer convolutional neural network in the recently proposed multi-view data model of Allen-Zhu & Li (2020b). We complement this analysis with further experimental evidence that data augmentation can be viewed as feature manipulation.

1. Introduction

Data augmentation is a powerful technique for inexpensively increasing the size and diversity of training data. Empirically, even minimal data augmentation can substantially increase the performance of neural networks. It is commonly argued that data augmentation is useful to impose domain-specific symmetries on the model, which would be difficult to enforce directly in the architecture (Simard et al., 2000; 2003; Chapelle et al., 2001; Yaeger et al., 1996; Shorten & Khoshgoftaar, 2019). For example, the semantics of a natural image is invariant under translation and scaling, so it is reasonable to augment an image data set with translated and scaled variations of its inputs. Simple augmentation with random crops of up to 4 pixels can lead to gains in the range of 5-10% (Ciregan et al., 2012; Krizhevsky et al., 2017). Another explanation often proposed for the role of data augmentation is simply that it increases the sample size.

As an alternative to symmetry inducing or sample size increase, we consider in this work the possibility that data augmentation should in fact be viewed as a more subtle feature manipulation mechanism on the data. Consider, for illustration, an image data set with the task of learning to detect whether there is a cow in the image. A simplified view would be that there are true cow features that generate the cow images, and we hope to learn those true features. At the same time, because most images of cows contain grass, it would not be surprising if a neural network learned to detect the spurious grass feature for the task, and perhaps simply overfit the rare images, such as desert cows, that are not explained by the grass feature (and similarly overfit the perhaps few images with grass and no cows). Now consider a simple data augmentation technique such as Gaussian smoothing (let us assume black and white images, or else use additional color space augmentations).
The grass feature, sans color, is essentially a high frequency texture information, so we can expect the smoothing operation to make this feature significantly diminished. In this example, the feature manipulation that data augmentation performs is effectively to render the spurious feature harder to detect, or more precisely to make it harder to learn, which in turn boosts the true cow features to become the dominant features. Continuing the illustration above, let us explore further the idea of data augmentation as feature manipulation. First note that the true cow feature need not be one single well-defined object , but rather we may have many different true cow features. For example, true cow features could be different for left-facing and right-facing cows. An imbalance in the training data with respect to those different features could make the rarer features hard to learn compared to the more common features, similarly to how the spurious grass feature was occluding the true cow features. In the example above, it could happen that in most images in the training data, the cows are facing right, which in turn could mean that the neural network will learn a cow feature with an orientation (right-facing), and then simply memorize/overfit the cows facing left. Yet another commonly used data augmentation technique such as random horizontal flip would solve this by balancing the occurrence of cow features with right-orientation and those with left-orientation, hopefully leading to a neural network dynamic that would discover both of those types of cow features. Note that one might be tempted to interpret this as inducing a mirror symmetry invariance in the model, but we emphasize that the effect is more subtle: the learned invariance is only for the relevant features, rather than being an invariance for all images (e.g., on non-cow images one might not be invariant to the orientation). More generally, to understand feature learning with and without data augmentation in gradient descent trained neural networks, we can think of three types of features of interest: (a) The easy to learn and good features, which are accurate features for the learning problem and are easy to learn in the sense that they have large relative contribution in the gradient descent updates of the network. (b) The hard to learn and good features, which are more nuanced to detect but are essential to fit the harder samples in the population distribution (e.g., examples with rare object orientations). These are features that despite being accurate have small relative contribution in the gradient descent updates (perhaps due to lack of sufficient representation in the training data), which in turn makes them hard to learn. (c) Finally, there are the easy to learn and bad features, which while inaccurate, nevertheless interfere with the learning process as they have a large contribution in the gradient updates. Such features often correspond to spurious correlations or dominating noise patterns (e.g., the grass feature) which arise due to limitations in training data size or data collection mechanisms.1 In this paper, we study data augmentation as a technique for manipulating the easiness and hardness of features by essentially changing their relative contributions in the gradient updates for the neural network. 
We believe that this view of data augmentation as a feature manipulation mechanism is more insightful (and closer to the truth) than the complementary and more straightforward views of symmetry inducing or "it's just more data". For one, data augmentation with a specific symmetry does not necessarily lead to a model that is correspondingly invariant. For example, Azulay & Weiss (2019) show that even models trained with extensive translation and scale augmentation can be sensitive to single-pixel changes in translation and scaling on inputs far from the training distribution, suggesting the inductive bias from data augmentation is more subtle. Further, this view could form a basis for studying more recent data augmentation techniques like MixUp (Zhang et al., 2017), Cutout (DeVries & Taylor, 2017), and variants, which in spite of being widely successful in image tasks do not fit the conventional narrative of data augmentation.

¹We do not mention hard to learn and inaccurate features as they are conceptually irrelevant for the training dynamics or accuracy of the model.

Contributions. Given the diversity of data augmentation techniques (e.g., see Shorten & Khoshgoftaar (2019); Feng et al. (2021) for a survey), it is a formidable challenge to understand and analyze the corresponding feature manipulation for each case, and this task is beyond the scope of the present paper. Our more modest objective is to start this program by studying a simple mathematical model where data augmentation can be provably shown to perform feature manipulation along the lines described in the illustration above. Specifically, we consider a variant of the multi-view data setting introduced in the pioneering work of Allen-Zhu & Li (2020b) on ensemble learning. In our data model, each data point is viewed as a set of patches, with each patch represented by a high-dimensional vector in $\mathbb{R}^d$. Moreover, there is a set of $K$ true/good features $v_1, v_2, \ldots, v_K \in \mathbb{R}^d$. For any data point, each patch is then some combination of noise and features. Specifically, at least one patch contains a good feature whose orientation indicates the label, i.e., for some $k \in [K]$ this patch is $y v_k$, where $y \in \{-1, 1\}$ is the binary class label to be predicted. In this case we say that the data point is of the $k$th type. The other patches contain different forms of noise. If the training data contains sufficiently many type-$k$ data points, then the corresponding feature $v_k$ is "easy to learn and good", while the features corresponding to rare types are "hard to learn and good". To model the "easy to learn and bad" features, we assume that one patch per data point receives a large (Gaussian) noise, which we call the dominant noise. See Section 2 for exact details of the model. Given such training data we show the following for a two-layer patch-wise convolutional network (see (3)) trained using gradient descent (there are a number of caveats, see below for a list):

1. When one or more features are sufficiently rare, the network will only learn the frequent "easy to learn and good" features, and will overfit the remaining data using the "easy to learn and bad" noise component.

2. On the other hand, with any data augmentation technique that can permute or balance the features, the network will learn all $K$ features, and thus achieve better test loss (and, importantly, learn a better representation of this data²).
We show that this happens because the representation of the "hard to learn and good" features in the gradient updates will be boosted, and simultaneously the relative contribution of the dominant noise, or the "easy to learn and bad" features, will be diminished.

3. We show that this phenomenon is more pronounced for gradient descent dynamics in non-linear models, in the following sense: we prove that even at high signal-to-noise ratio (SNR) non-linear models might memorize through the noise components, while gradient descent on linear models overfits to noise only at much lower SNR. This shows that data augmentation is useful in a wider range of cases for non-linear models than for linear models. Moreover, our non-linear model can learn the distribution even in the presence of feature noise (in the form of $\alpha y v_k$ for some small $\alpha > 0$, which points to the wrong class). On the other hand, a linear model cannot have low test error with such feature noise, thus showing a further separation between linear and non-linear models.

²As a consequence of learning all the $K$ features, the learned model will not only be more accurate on the data distribution of the training samples, but will also be robust to distribution shifts that alter the proportion of data of the $K$ feature types.

Some of the caveats to our theoretical results include the following points (none seem essential, but for some of them going beyond would require significant technical work):

- Neural network architecture: we study a two-layer neural network with a special activation function (the latter can be viewed as a smoothed ReLU with a fixed bias). We also assume poly-logarithmic (in $d$) width.
- Training: we study gradient descent rather than stochastic gradient descent, and furthermore we assume a specific training time (the same one with and without data augmentation).
- Data model: the distribution can be generalized in many ways, including having data points with mixed types (e.g., multi-view as in (Allen-Zhu & Li, 2020b)), heterogeneous noise components, or even correlated noise components (see below for more on this). We also assume a very high dimensional regime $d \gg n^2$ (where $n$ is the training set size), although we believe our results should hold for $d \gg n$.

Even though our theoretical results are in a limited setting, the feature manipulation effect of data augmentation is conceptually broader. We complement our analysis with experiments on CIFAR-10 and synthetic datasets, where we study data augmentation in more generality. We circle back to our motivating problem with spurious features (à la the cow-grass story) in a classification task. Our experiments show that simply shifting the spurious feature position randomly by up to 2 pixels in each epoch can significantly improve the test performance by making the spurious feature hard to learn. This happens even when we do not change any non-spurious pixels/features (and hence control for learning additional image priors). We further formulate experiments to evaluate the value of a single data augmented image compared to a fully independent sample, and see on the CIFAR-10 dataset that once 50% independent samples are available, a data augmented sample is almost as effective as an independent sample for the learning task. Finally, we show on a synthetic dataset that the problem arising from imbalance in views (as studied in our main result) also holds for deeper convolutional architectures, even when the views are merely translations of each other.
Related Work Starting with (Bishop, 1995) there is a long line of work casting data augmentation as an effective regularization technique, see (Dao et al., 2019; Rajput et al., 2019; Wu et al., 2020; Yang et al., 2022) for recent developments in that direction. Other theoretical analyses have studied and quantified the gains of data augmentation from an invariance perspective (Chen et al., 2020; Mei et al., 2021). The viewpoint we take here, based on studying directly the effect of augmentation on the learning dynamic, is strongly influenced by the work of Zeyuan Allen-Zhu and Yuanzhi Li in the last few years. For example in Allen-Zhu & Li (2020a) they develop this perspective for adversarial training (which in some ways can be thought as a form of data augmentation, where each data point is augmented to its adversarial version). There they show that adversarial training leads to a certain form of feature purification, which in essence means that the filters learned by a convolutional neural network become closer to some ground truth features. In (Allen-Zhu & Li, 2020b) they introduce the multi-view model that we study here, and they used it to study (among other things) ensemble learning. In a nutshell, in their version of the model each data point has several views that can be used for classification, and the idea is that each model might learn only one of those views, hence there is benefit to ensembling in that it will allow to uncover all the features, just like here we suggest that data augmentation is a way to uncover all the features. Other notable works which share the philosophy of studying the dynamic of learning (although focused on linear models) include (Hanin & Sun, 2021) which investigates the impact of data augmentation on optimization, and (Wu et al., 2020) which considers the overparametrized setting and show that data augmentation can improve generalization in this case. Notation We use tilde notation e O, eΘ, eΩto hide log factors in standard asymptotic notation. For an integer K, [K] = {1, 2, . . . , K}. We interchangeably use a b, a, b , or a b for standard inner product between two vectors. 2. A mathematical model for understanding feature manipulation Our data model defined below is a variation of the multiview data distribution in (Allen-Zhu & Li, 2020b) for a binary classification task. We represent the inputs x as a collection of P non-overlapping patches x = (x1, x2, . . . , x P ) Rd P , where each patch is a d dimensional vector. The task is associated with K unknown good features denoted as v1, v2, . . . , v K Rd, such that for labels y { 1, 1}, their orientation as {yvk}k [K] constitutes the K views or sub-types of the class y.3 Each input xp patches either contain one of the good feature {yvk} or a bad feature in the form of random and/or feature noise. Formally, our distribution is defined below. Definition 1. D is parametrized by ρ, σξ, σζ, α , where ρ = (ρ1, ρ2, . . . , ρK) is a discrete distribution over the features {vk}k [K], and σξ,σζ, and α are noise parameters. Without loss of generality, let ρ1 . . . ρK. A sample (x, y) D is generated as follows: (a) Sample y {1, 1} uniformly. (b) Given y, the input x = (x1, x2, . . . , x P ) Rd P is sampled as below: Feature patch: Choose the main feature patch p [P] arbitrarily and set xp = yvk , where k ρ.. Dominant noise: Choose a dominant noise patch pξ = p and generate xpξ = ξ, where ξ i.i.d N 0, σ2 ξ d Id . 
Background: For the remaining background patches4 p [P] \ {p , pξ}, select 0 αp α and set xp = αpyvkp + ζp, where kp ρ, ζp N(0, σ2 ζId). Assumption 1. We assume the features {vk}k [K] are orthonormal, i.e., k,k [K], vk vk = 1k=k . The training dataset consists of n i.i.d., samples from D, Dtrain = {(x(i), y(i)) : i [n]} D n. We are interested in the high dimensional regime where n d. n, P and K can grow with d. Note that, in Definition 1 k , p , pξ, ξ, and (αp, kp, ζp)p/ {p ,pξ} all depend on x, but we have dropped this dependence in the notation to avoid clutter. In our analysis, for i = 1, 2, . . . , n, we use k i , p i , pξ i , ξ(i), 3For M-class classification, our analysis can be adapted by using separate set of features {vk,m}k for each class m [M], rather than { vk}k. For M = 2, under our learning algorithm, using (vk, 1, vk,1) as features for y = 1, 1 is equivalent to using vk, vk with vk = vk,1 vk, 1. 4In our definition, the dominant noise ξ and the main feature vk appear in exactly one patch. But our results also hold (by virtue of parameter sharing in (3)) when for any disjoint nonempty subsets Pf, Pn [P], we set p Pf, xp = yvk and p Pn, xp = ξp i.i.d N(0, σ2 ξId/d). and (αp,i, kp,i, ζp,i)p/ {p i ,pξ}i to denote the corresponding quantities for the sample (x(i), y(i)) in the training dataset. Data augmentation Let D(aug) train denote the augmented dataset obtained by transforming the i.i.d. training dataset Dtrain. Our model for data augmentation is such that D(aug) train has equal number of samples with main feature yvk for each k [K]. Concretely, consider linear transformations T1, . . . TK 1, such that for all k, Tk : Rd Rd and satisfies k [K], Tk(vk ) = v((k +k 1) mod K)+1). (1) Such transformations are well defined for K d, and in essence permute the feature vectors vk on patches with true feature or feature noise. At the same time, the Gaussian noise patches before and after transformation are no longer i.i.d. We slightly abuse notation and define Tk(x) on x Rd P as Tk(x) = (Tk(x1), Tk(x2), . . . , Tk(xp)) Rd P , as well as Tk(Dtrain) on the training dataset as Tk(Dtrain) = {(Tk(x(i)), y(i)) : i [n]}. Our augmented dataset D(aug) train consists Dtrain along with the K 1 transformations of of Dtrain as defined below: D(aug) train = Dtrain T1(Dtrain) . . . TK 1(Dtrain). (2) Note that in D(aug) train all the views are equally represented, i.e., for each k [K], we will have exactly n samples from the feature yvk, and further D(aug) train has more samples compared to Dtrain with |D(aug) train | = n K, but they are no longer i.i.d. Since the features {vk}k are orthonormal (Assumption 1) and all the non-feature noise are spherically symmetric, without loss of generality, we can assume that {vk}k [K] are simply the first K standard basis vectors in Rd, i.e., vk = ek. In this case, we can choose Tk for k [K 1] as a permutation of coordinates satisfying (1) on the first K coordinate. If we further assume that the the permutations Tk do not have any fixed points, i.e., i [d], Tk(z)[i] = z[i], then at initialization and updates of gradient descent, the augmented samples in D(aug) train satisfy the same properties as i.i.d. samples in Dtrain (upto constants and log factors). In this rest of the proof, we thus assume that Tk are permutations of coordinates without any fixed points in the orthogonal basis extended from {vk}k, and satisfies (1). 
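To make the data model and the balancing augmentation concrete, here is a minimal NumPy sketch of a sampler for Definition 1 (with $\sigma_\zeta = 0$ and $\alpha_p = \alpha$) and of the permutation-based augmentation in (2). The function names and the specific fixed-point-free permutation are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def sample_point(d, P, K, rho, sigma_xi, alpha, rng):
    """One (x, y) following Definition 1, with sigma_zeta = 0 and alpha_p = alpha.
    Features are taken as the first K standard basis vectors, v_k = e_k
    (without loss of generality under Assumption 1)."""
    y = rng.choice([-1, 1])
    k_star = rng.choice(K, p=rho)                         # main feature type, k* ~ rho
    p_star, p_xi = rng.choice(P, size=2, replace=False)   # feature patch and dominant-noise patch
    x = np.zeros((P, d))
    x[p_star, k_star] = y                                 # feature patch: x_{p*} = y v_{k*}
    x[p_xi] = rng.normal(0.0, sigma_xi / np.sqrt(d), d)   # dominant noise xi ~ N(0, sigma_xi^2 I_d / d)
    for p in set(range(P)) - {p_star, p_xi}:              # background patches: feature noise alpha * y * v_{k_p}
        x[p, rng.choice(K, p=rho)] = alpha * y
    return x, y

def T_k(x, k, K):
    """A fixed-point-free coordinate permutation satisfying (1): cyclically shift the first K
    coordinates by k (permuting the features), and the remaining d - K coordinates by k as well."""
    z = np.empty_like(x)
    z[..., :K] = np.roll(x[..., :K], k, axis=-1)
    z[..., K:] = np.roll(x[..., K:], k, axis=-1)
    return z

def augmented_dataset(X, Y, K):
    """D_train^(aug) of (2): the original samples together with their images under T_1, ..., T_{K-1}."""
    X_aug = np.concatenate([X] + [T_k(X, k, K) for k in range(1, K)])
    return X_aug, np.concatenate([Y] * K)
```

Stacking $n$ calls to `sample_point` into arrays `X` of shape $(n, P, d)$ and `Y` of shape $(n,)$ and then calling `augmented_dataset` yields the balanced set with $nK$ samples, in which every feature type appears exactly $n$ times.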
Role of different noise components. Our main result shows that when the dominant noise parameter $\sigma_\xi$ is sufficiently large, a neural network can overfit to this noise rather than learn all the views. However, with the right data augmentation, we can show that all the views can be accurately learned using a non-linear network. Furthermore, in the presence of feature noise $\{\alpha_p y v_{k_p}\}$ (pointing to the wrong class), linear models are unable to fit our data distribution, thus establishing a gap with linear models. We choose the noise parameters $\sigma_\xi, \sigma_\zeta, \alpha$ such that the dominant noise $\xi$ and the true features $\{y v_{k^*}\}$ have the main contribution to the learning dynamics, compared to the feature noise (i.e., $\alpha_p y v_{k_p}$) or the minor noise (i.e., $\zeta_p$). Thus, our results do not necessarily require noise in the background patches beyond establishing the gap with linear models. Since the minor noise $\sigma_\zeta$ does not provide any additional insight, we assume $\sigma_\zeta = 0$. Our analysis can handle small $\sigma_\zeta$ with more tedious bookkeeping.

2.1. Learning algorithm

We use the following patch-wise convolutional network architecture with $C$ channels: let $w = \{w_1, w_2, \ldots, w_C\} \in \mathbb{R}^{d \times C}$ denote the learnable parameters of the model,

$$F(w, x) = \sum_{c \in [C]} \sum_{p \in [P]} \psi(w_c \cdot x_p), \qquad (3)$$

where $\psi$ is the non-linear activation function defined below:

$$\psi(z) = \begin{cases} \frac{1}{q}\,\mathrm{sign}(z)\,|z|^q & \text{if } |z| \le 1, \\ z - \frac{q-1}{q} & \text{if } z \ge 1, \\ z + \frac{q-1}{q} & \text{if } z \le -1. \end{cases}$$

Our activation is a smoothed version of a symmetrized ReLU with a fixed bias, $\phi(z) = \mathrm{ReLU}(z-1) - \mathrm{ReLU}(-z-1)$. In fact, as $q \to \infty$, $\psi \to \phi$. Note that since we do not train the second layer weights, we choose an odd function as the activation to ensure that the outputs can be negative.

Consider the following logistic loss over the training dataset $D_{\mathrm{train}} = \{(x^{(i)}, y^{(i)}) : i \in [n]\}$: $L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y^{(i)} F(w, x^{(i)})\big)$, where $\ell(z) = \log(1 + \exp(-z))$. We learn the model using gradient descent on the above loss with step size $\eta$, i.e., for $c \in [C]$, the weights $w_c$ at time step $t$ are given by

$$w_c(t) = w_c(t-1) - \frac{\eta}{n} \sum_{i=1}^{n} y^{(i)}\, \ell'\big(y^{(i)} F(w(t-1), x^{(i)})\big)\, \nabla_{w_c} F(w(t-1), x^{(i)}).$$

The following lemma summarizes the conditions at Gaussian initialization $w(0) = \{w_c(0) \sim N(0, \sigma_0^2 I_d) : c \in [C]\}$.

Lemma 1 (Ginit conditions). Consider $n$ i.i.d. samples $D_{\mathrm{train}} = \{(x^{(i)}, y^{(i)}) : i \in [n]\}$ from the distribution in Definition 1. Let the parameters $w$ of the network in (3) be initialized as $w_c(0) \sim N(0, \sigma_0^2 I_d)$ for all $c \in [C]$. If the number of channels is $C = \Omega(\log d)$, then with probability greater than $1 - O\big(\frac{n^2 K C}{\mathrm{poly}(d)}\big)$, the following conditions hold:

1. Feature-vs-parameter: $\forall k \in [K]$, $\max_{c \in [C]} w_c(0) \cdot v_k \ge \Omega(\sigma_0)$, and $\max_{c \in [C]} |w_c(0) \cdot v_k| \le \tilde{O}(\sigma_0)$.
2. Noise-vs-parameter: $\forall i \in [n]$, $\max_{c \in [C]} w_c(0) \cdot y^{(i)}\xi^{(i)} \ge \tilde{\Omega}(\sigma_0 \sigma_\xi)$, and $\max_{c \in [C]} |w_c(0) \cdot \xi^{(i)}| \le \tilde{O}(\sigma_0 \sigma_\xi)$.
3. Noise-vs-noise: $\forall i \in [n]$, $\xi^{(i)} \cdot \xi^{(i)} = \Theta(\sigma_\xi^2)$, and $\forall i, j \in [n]$ with $i \ne j$, $|\xi^{(i)} \cdot \xi^{(j)}| \le \tilde{O}(\sigma_\xi^2/\sqrt{d})$.
4. Feature-vs-noise: $\forall i \in [n]$, $k \in [K]$, $|\xi^{(i)} \cdot v_k| \le \tilde{O}(\sigma_\xi/\sqrt{d})$.
5. Parameter norm: $\forall c \in [C]$, $\|w_c(0)\| = \Theta(\sigma_0\sqrt{d})$.

The above lemma, proved in Appendix D, follows from standard Gaussian concentration bounds. Further, we can show that Ginit also holds for the augmented dataset $D^{(\mathrm{aug})}_{\mathrm{train}}$ even though the samples in $D^{(\mathrm{aug})}_{\mathrm{train}}$ are not i.i.d.

Lemma 1a. Ginit in Lemma 1 also holds for $D^{(\mathrm{aug})}_{\mathrm{train}}$ defined in (2), with $n$ replaced by $nK$.

2.2. Clarification on capacity in this model

We now informally discuss the size of our model class in the context of our data distribution. Consider the convolutional model (3) with $C = 1$ and, say, $\alpha = 0$ for the sake of simplicity in the data distribution. Using $w_1 = w_{\mathrm{gen}} = \gamma \sum_{k=1}^{K} v_k$ for some large $\gamma > 0$ will yield excellent training and test error. This is a model that would "generalize".
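Before continuing the capacity discussion, the following is a minimal PyTorch-style sketch of the architecture in (3) and the full-batch gradient descent training described in Section 2.1. The class and function names are ours, and this is an illustrative sketch rather than the exact setup analyzed in the proofs.

```python
import torch

def psi(z, q=3):
    """Smoothed symmetrized ReLU with fixed bias: sign(z)|z|^q / q on [-1, 1], linear outside."""
    inner = torch.sign(z) * z.abs().pow(q) / q
    outer = torch.sign(z) * (z.abs() - (q - 1) / q)
    return torch.where(z.abs() <= 1, inner, outer)

class PatchConvNet(torch.nn.Module):
    """F(w, x) = sum_{c, p} psi(w_c . x_p) as in (3); the second layer is fixed (all ones)."""
    def __init__(self, d, C, sigma0, q=3):
        super().__init__()
        self.w = torch.nn.Parameter(sigma0 * torch.randn(C, d))  # w_c(0) ~ N(0, sigma0^2 I_d)
        self.q = q

    def forward(self, x):                                   # x: (batch, P, d)
        pre = torch.einsum('bpd,cd->bpc', x, self.w)        # w_c . x_p for every patch and channel
        return psi(pre, self.q).sum(dim=(1, 2))             # sum over patches and channels

def train_gd(model, X, Y, eta, T):
    """Full-batch gradient descent with step size eta on the logistic loss L(w)."""
    for _ in range(T):
        loss = torch.nn.functional.softplus(-Y * model(X)).mean()  # log(1 + exp(-y F))
        grad, = torch.autograd.grad(loss, model.w)
        with torch.no_grad():
            model.w -= eta * grad
    return model
```

Here `X` is a float tensor of shape $(n, P, d)$ and `Y` holds labels in $\{-1, 1\}$; with $C = \Theta(\log d)$ channels and a small step size this mirrors the gradient descent iteration written above.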
On the other hand for a fixed training set {(x(i), y(i))}i [n], one could also obtain almost perfect training error by using w1 = woverfit = γ Pn i=1 y(i)ξ(i), whenever σξ and d n (noise components {ξ(i)}i [n] are near orthonormal). Indeed with high probability, i [n], y(i)f(woverfit, x(i)) = y(i) P p [P ] ψ(woverfit x(i) p ) is exactly ψ γσ2 ξ(1+ e O( p n/d)) +ψ γσξO( p n/d) = γσ2 ξ(1+o(1)) . In other words the model with woverfit will almost perfectly memorize the training set, while on the other hand it is clear that it will completely fail to generalize. This shows that the model class is large enough so that any classical measure of complexity, e.g., Rademacher complexity, would fail to predict generalization (even datadependent Rademacher complexity where the x(i) follow our data distribution). In fact, our arguments below show that gradient descent could lead to a model of the form woverfit in a Rademacher complexity setting (i.e., with random label y(i) independent of the inputs x(i)). Thus, even restricting to models reached by gradient descent would still yield a high Rademacher complexity. This phenomenon has also been empirically observed in practical neural networks (Neyshabur et al., 2015; Zhang et al., 2021), and shown theoretically in simpler models in (Nagarajan & Kolter, 2019). Thus, we are in a case where not only do we need to leverage the fact that we are using gradient descent to prove generalization, but we also need to use the specific target function (i.e., the relation between y and x) that we are working with. 2.3. Our argument in a nutshell At a high level we show that there is a cutoff point in the features, denote it Kcut, such that running gradient descent on the above architecture and data distribution will lead to a model which is essentially a mixture of parts of wgen and parts of woverfit described above. Roughly it will be: X k Kcut vk + X i:k i >Kcut y(i)ξ(i) . (4) In words, the frequent enough features will be learned, and the data points that correspond to infrequent enough features will be memorized through their noise component. Quite naturally, this cutoff point will be decreasing with the magnitude of the noise σξ, i.e., the bigger the noise the fewer features will be learned. While this argument also holds for gradient descent dynamics on linear models, the cutoff point Kcut of linear models can be higher than that of the non-linear models, which shows that non-linear models can memorize through the noise component at a higher SNR (see Section 3.3 for the exact cutoff point). Where data augmentation will come in is that it can effectively change the frequency of the features, and in the extreme case we consider to make them all equal,i.e., all with frequency 1/K. We then show that there exists a setting of the parameters such that frequency 1/K is learned at noise magnitude σξ, so that with data augmentation all the features are learned. 2.4. Linear and tensor models Before diving into the dynamics of gradient descent for our neural network architecture and data distribution, let us expand briefly on linear models. In Appendix E we study the max-ℓ2 margin linear classifier for our data, but for sake of simplicity we consider here an even more basic predictor that is specifically tailored to our data distribution: θ := 1 n Pn i=1 P p [P ] y(i)x(i) p . Note that θ is a linear function on Rd, and we naturally extend it to the domain Rd P of our data points (with slight overloading of notation) as θ(x) = P p [P ] θ xp. 
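As a concrete reference, here is a minimal NumPy sketch of this plug-in predictor $\hat{\theta}$, together with the degree-$q$ tensor analogue discussed in the next paragraphs; the function names are ours, and the constructions are only the tailored estimators of this subsection, not a gradient descent learned model.

```python
import numpy as np

def linear_score(X, Y, x):
    """Plug-in linear predictor: theta_hat = (1/n) sum_{i,p} y^(i) x_p^(i),
    evaluated as theta_hat(x) = sum_p theta_hat . x_p.  X: (n, P, d), Y: (n,), x: (P, d)."""
    theta = np.einsum('i,ipd->d', Y, X) / len(Y)
    return float((x @ theta).sum())

def tensor_score(X, Y, x, q=3):
    """Degree-q tensor analogue: contracting T = (1/n) sum_{i,p'} y^(i) (x_{p'}^(i))^{tensor q}
    against sum_p x_p^{tensor q} reduces to q-th powers of patch inner products."""
    inner = np.einsum('ipd,rd->ipr', X, x)          # <x_{p'}^(i), x_r> over all patch pairs
    return float((Y[:, None, None] * inner ** q).sum() / len(Y))
```

On data from Definition 1 (e.g., generated with the sampler sketched earlier), one can check numerically that the signal term of `linear_score` on a type-$k$ point is about $\rho_k$ while its noise term concentrates around $\sigma_\xi^2/\sqrt{nd}$, and that the corresponding noise level for `tensor_score` is about $\sigma_\xi^{2q}/\sqrt{n d^q}$, matching the cutoff frequencies derived next.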
Compared to a gradient descent learned model, it is not clear whether this predictor is meaningful beyond our specific data distribution, and we emphasize that we study it merely as a shortest path to get quantitative estimates for the discussion in Section 2.3 (e.g., for the cutoff point and for the SNR of interest). In fact the (gradient descent learnable) max margin linear classifier has even better properties than the estimator w, see the Appendix E for more details. Derivation of a cutoff point. It is easy to check that with our data distribution we have θ = θS + θN where θS = PK k=1 ρkvk (say the fraction of examples of type k is exactly ρk) and θN = 1 n Pn i=1 y(i)ξ(i) (assume α = 0 for this discussion). In particular for x sampled from our distribution, we have with high probability | θN(x)| σ2 ξ nd and θS(x) ρky if x is of type k. This means that the predictor θ has successfully learned feature vk iff nd. In other words for this linear model the cutoff frequency is at ρcut = σ2 ξ nd. With a small leap of faith (related to the fact that after data augmentation the noise terms are no longer i.i.d., which we show to be not a in our proof of non-linear model) we can see that as long as this cutoff frequency is smaller than 1 K , data augmentation would enable full learning of all the views, since in that case the post-augmentation frequencies 1 K are larger than the cutoff frequency with n replaced by n K, i.e., n Kd = ρcut Effect of simple non-linearity on SNR. The simplest type of non-linearity would be to consider a tensor method for this problem (note that this is nothing but a kernel method). Specifically, let T = 1 n Pn i=1 P p [P ] yi x(i) p q , be the natural empirical tensor for this problem, for some odd q N, whose domain is extended from Rd to Rd P as before, i.e., T(x) = P p [P ] T(xp). Note that this function can be realized in our architecture with a pure polynomial activation function ψ(z) = zq, see (Bubeck et al., 2021) for more on neural network memorization with tensors. Similarly to the linear case one can decompose the tensor into a signal and noise components: T = S+N, where S = k=1 ρkv q k , N = 1 i=1 yi ξ(i) q . For x sampled from our distribution, we have with high probability, |N(x)| σ2q ξ ndq and S(x) ρky if x has vk as its main feature. Thus here the cutoff frequency is at ρ(q) cut = σ2q ξ ndq . In particular we see that even at high SNR, d (in which case ρ(1) cut = o(1)) we might have ρ(q) cut = Ω(1) for q > 1. To put it differently, the tensor methods will overfit to the noise at a different SNR from the pure linear model would, which in turns mean that there is a different range of SNR where data augmentation will be useful for non-linear models such as tensors. We will see this story repeating itself for the gradient descent on our neural network architecture. Quantitative comparison with the neural network results. We note that the thresholds derived here are better than those we obtain via our neural network analysis (note also that the tensor method can handle α > 0 similarly to what our non-linearity allows). However we emphasize again that, on the contrary to gradient descent on neural networks, the predictors here are artificial and specifically tailored to the data distribution at hand. Furthermore the complexity of the tensor method scales up with q, on the contrary to the neural network dynamic. 3. 
Overview of gradient descent dynamics Let us do some heuristic calculation in the simple case where α = 0 (so that effectively there are only two relevant patches in inputs, xp = yvk and xpξ = ξ, respectively). Recall that wc(0) N(0, σ2 0Id) and ξ N(0, σ2 ξId/d). Thus, E[|wc(0) xp |2] = σ2 0 and E[|wc(0) xpξ|2] = σ2 0σ2 ξ for all channels c. We will initialize so that these quantities are o(1), and thus f(w(0), x) = o(1) for (x, y) D. We study the gradient flow on minimizing f in this section. 3.1. When you really learn... For f to correctly classify a datapoint x with feature vk, it is morally sufficient that |wc vk| is of order 1 for some channel c. Let us look at the dynamics starting close to initialization (when f(w(0), x) = o(1)), i [n] y(i) ℓ y(i)F(wc, x(i)) h wc F(w, x(i)) vk i (a) = 1 + o(1) p [P ] ψ (|wc x(i) p |) y(i)x(i) p vk i [n] ψ (|wc vk i |)vk i vk+ i [n] ψ (|wc ξ(i)|)y(i)ξ(i) vk (b) = 1 + o(1) 2 ρk ψ (|wc vk|) + ϑ , (5) where in (a), we use ℓ (o(1)) = 1/2 + o(1) for logistic loss ℓ, ψ (z) = ψ(|z|) since ψ is odd, and (b) follows from {vk} being orthogonal. If we can ignore ϑ, resulting dynamic reduces to an ODE of the form g (t) = ρkψ (g(t)) (ignoring constants) with g(0) σ0 = o(1). As long as g(t) = wc(t) vk is smaller than 1 this can be rewritten as g (t) = ρkg(t)q 1 (because of the form of ψ we chose), or equivalently (g(t)2 q) = ρk up to constants. In particular, we see that after time t = g(0)2 q/ρk, we will have g(t) = Θ(1). This suggests that by time of order 1/(σq 2 0 ρk) at least one channel should have learned vk5. 5We assume q 3. For the case q = 1 or q = 2, the time When can we indeed ignore (morally) the noise term ϑ? At initialization this term is of order σq 1 0 σq ξ nd . On the other hand the main term wc vk in (5) is of order ρkσq 1 0 . Thus we see that we need σq ξ nd ρk. In fact we will need a slightly more stringent condition, because the cancellation in ϑ leading to a scaling of 1/ n becomes more complicated to analyze after initialization due to the dependencies getting introduced. So we will use the more brutal bound |ϑ| σq 1 0 σq ξ d which in turn means we need σq ξ Summarizing the above, we expect that if σq ξ/ d ρk, then by time 1/(σq 2 0 ρk) we will have one channel that has learned the feature vk. 3.2. ... and when you overfit ... Another sufficient condition to correctly classify a datapoint (x(j), y(j)) would be to overfit to its dominant noise part ξ(j), i.e., |wc ξ(j)| is of order 1 for some channel c. Here we have at initialization: d dtwc ξ(j) = 1 + o(1) p [P ] ψ (|wc x(i) p |)y(i)x(i) p ξ(j) y(j)ψ (|wc ξ(j)|) ξ(j) 2 + ψ (|wc vk j |)vk j ξ(j) i =j,p [P ] ψ (|wc x(i) p |) y(i)x(i) p ξ(j) ! = (1 + o(1))σ2 ξ 2n y(j)ψ (|wc ξ(j)|) + Γ (6) where Γ is the last two term from the penultimate step. Assuming Γ can be ignored, we can mimic the reasoning above (for wc vk) with h(t) = y(j)wc ξ(j) and h(0) = O(σ0σξ). We thus expect to correctly classify a datapoint by overfitting to its noise after time O(n/(σq 2 0 σq ξ)). When can we ignore the noise term Γ? The order of Γ is σq+1 ξ σq 1 0 / d (at initialization it is in fact this times 1/ n but we ignore this improvement due to the dependencies arising through learning). On the other hand the main term in (6) is of order σq+1 ξ σq 1 0 /n at initialization, so we obtain the condition d n (which could possibly be improved to d n if cancellation remained correct throughout learning). 
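Before the summary below, here is a rough numerical illustration of the two heuristic ODEs just derived, one for learning a feature and one for memorizing through the dominant noise, with the $\vartheta$ and $\Gamma$ cross terms ignored. The constants and parameter values are arbitrary and purely illustrative.

```python
def time_to_reach_one(rate, g0, q=3, dt=0.1, t_max=1e6):
    """Integrate g'(t) = rate * psi'(g) with psi'(g) = min(g, 1)^(q-1), starting from g(0) = g0,
    and return the first time at which g reaches 1 (the corresponding direction is 'learned')."""
    g, t = g0, 0.0
    while g < 1.0 and t < t_max:
        g += dt * rate * min(g, 1.0) ** (q - 1)
        t += dt
    return t

# Heuristic comparison with constants dropped and cross terms ignored: a view of frequency rho_k
# is learned around t_learn, while a data point is memorized through its dominant noise around
# t_noise; learning wins roughly when rho_k >> sigma_xi^q / n.
rho_k, sigma0, sigma_xi, n, q = 1e-3, 1e-2, 5.0, 1000, 3
t_learn = time_to_reach_one(rho_k, sigma0, q)
t_noise = time_to_reach_one(sigma_xi ** 2 / (2 * n), sigma0 * sigma_xi, q)
print(t_learn, t_noise, "predicted threshold sigma_xi^q / n =", sigma_xi ** q / n)
```

With these (hypothetical) values $\rho_k < \sigma_\xi^q/n$, so the noise direction reaches order one well before the feature direction, which is exactly the regime where data augmentation is needed.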
Summarizing again, if d n2, by time in the order of n/(σq 2 0 σq ξ), we can expect the data points that were not fit before this time to be overfit using noise parameters. needed is 1/(σq 1 0 ρk). 3.3. ... and in what order Let us assume d n2 and σq ξ d ρk. Then the above discussion reveals that if n/(σq 2 0 σq ξ) 1/(σ0ρk) ρk σq ξ/n, we will not be able to learn vk because we will overfit before learning (In fact, in this case, we do not need the condition σq ξ d ρk). This essentially gives rise to a channel filter (or a combination thereof) of the form (4), with the cutoff point Kout = {k : ρk σq ξ/n} being now specified. Data augmentation can fix the order by effectively permuting the features. After data augmentation, we get the proportion of any feature to be 1/K and the training set size to be n K. Note that our data augmentation only permutes the coordinates so that the inner product between ξ and Tk(ξ) should be at the same order as two independent noise. The learning process only depend on the inner product between the samples so our previous analysis still holds. Then, after data augmentation, for every view k [K], we have ρ(aug) k = 1/K. Then, as long as σq ξ/n = o(1), we have ρ(aug) k σq ξ/(n K) and are able to learn vk before overfitting. 3.4. What about spurious features? In addition to overfitting noise, the model can also overfit spurious features. The spurious features can be viewed as noise vectors that appear in more than one sample. We do not prove this case formally in our main theorems for simplicity, but we will give the proof intuition here. Let u Rd, u = 1, be some spurious feature. Now assume that in addition to the dominant noise patch and the feature patch, u appears in 1 > ρ( 1) u > 0 fraction of the datapoints with label y = 1 and ρ( 1) u < ρ(1) u fraction of the datapoints with label y = 1. We assume u is orthogonal to the main features v1, ..., v K. Let Iu be the set of samples with u. We have at initialization: p [P ] ψ (|wc x(i) p |)y(i)x(i) p u i Iu y(i)ψ (|wc u|) u 2 i [n] ψ (|wc ξ(i)|)y(i)ξ(i) u 2n (ρ(1) u ρ( 1) u )ψ (|wc u|) u 2 + Υ. (7) Assuming Υ can be ignored, we can mimic the reasoning for wc vk with h(t) = wc u and h(0) = O(σ0). We thus expect to correctly classify a datapoint in class y = 1 with spurious feature u by overfitting to u after time O(n/(σq 2 0 (ρ(1) u ρ( 1) u ))). When can we ignore the noise term Υ? Similar to the term ϑ in (5), the order of Υ is σq ξσq 1 0 / nd. On the other hand the main term in (7) is of order (ρ(1) u ρ( 1) u )σq 1 0 . Thus we need σq ξ nd ρ(1) u ρ( 1) u . Summarizing above, since u can appear in both class y = 1 and class y = 1, it should not be used as an indicator of the label y. However, when u appears predominantly in one class (e.g., when σq ξ nd ρ(1) u ρ( 1) u ), the model can overfit u and use u to classify the datapoints. 4. Main Results We learn the model F(w, x) in (3) using gradient descent with step size η on loss L(w) in (??). The weight wc, c [C], at time step t is denoted as wc(t). The weight wc(t) for training on D(aug) train is obtained similarly, with the samples replaced by D(aug) train = (x(i), y(i)), i [Kn] . In addition to the assumptions we have discussed in Section 3, we make some additional assumptions for controlling the omitted quantities arising through training and testing. Assumption 2. We assume the following holds. For some constant q 3, 1. The first view is dominant, 1 ρ1 Ω(1). The other views k [K]\ {1} are minor views, nρk o σq ξ . 2. 
The standard deviation of the dominant noise satisfies ω(1) σq ξ o(n). 3. The standard deviation of the weights at initialization is bounded, σ0 o(1/σξ). 4. The number of samples and views are bounded, n K o σq 1 0 σq 1 ξ d1/2 . 5. The feature noise satisfies, for T = eΘ max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o , ω(P 1) α o η 1T 1P 1 q σξ min{d 1 Condition 1-3 in Assumption 2 have been explained in Section 3. σ0 o(1) and σ0σξ o(1) guarantee that at initialization, the main features and the dominant noise have o(1) correlation with the weight. We choose σξ ω(1) so that without properly learning the main feature, the inner product between random initialized weights and the dominant patch can dominate the model output. Condition 4 is a more stringent version of the condition n d1/2 in Section 3 to control all the terms during training. In Condition 5, we assume an upper bound on the feature noise α. We assume the existence of feature noise only for establishing gap with linear models, so we did not optimize the upper bound on α. It is possible the proof can go through with milder constraints on α. An example of a set of parameters that satisfy the above assumption is q = 3, σ0 = d 0.15, σξ = d0.1, n = d0.33, K = d0.06, ρ1 = 1 2, ρ2 = ρ3 = ... = ρK = 1 2(K 1), α = d 0.95, P = d. In Theorem 3, we show that under the above conditions, without data augmentation, gradient descent can find a classifier with perfect training accuracy without learning the minor views. On the other hand, Theorem 4 shows that with data augmentation, all k views can be learned without overfitting to noise. Theorem 3 (Training without data augmentation). Suppose that Assumption 2 holds. Let T be the first time step such that w(T) can classify all (x(i), y(i)) Dtrain with constant margin, i.e., , y(i)F(w(T), x(i)) eΩ(1), for all (x(i),y(i)) Dtrain. For hidden channel number C = Θ(log d), and small step size η, with probability at least 1 O( n2K poly(d)), T = eΘ nη 1σ q ξ σ q+2 0 . Moreover, at time step T, views v2, . . . , v K have never been learned, so that 0 t T , Pr (x,y) D [y F(w(t), x) < 0] 1 Theorem 4 (Training with data augmentation). Suppose assumption 2 holds. Let T aug be the first time step such that w(T aug) can classify all (x(i), y(i)) D(aug) train with constant margin, i.e., y(i)F(w(T aug), x(i)) eΩ(1), for all (x(i),y(i)) D(aug) train . For hidden channels number C = Θ(log d), and small step size η, with probability at least 1 O( n2K3 poly(d)), T aug = eΘ Kη 1σ q+2 0 , and at T aug, y F(w(T aug), x) < 0 n K poly(d). Remark 5. In Theorem 3 and Theorem 4, we evaluate the testing accuracy at the earliest time step T when the trained neural network with weights w(T) can classify all samples in the training set Dtrain with a constant margin. Our result does not rule out the possibility that if trained longer than T, the network can learn the minor views as well. However, we should expect the gradients on the training set stay small after the network can classify all sample correctly. The main reason we assume an upper bound on ηT is when training too long, the norm of the weights w can blow up. One possible strategy to avoid such upper bound on ηT is to add weight decay to the gradient descent algorithm in training. Remark 6. For simplicity of the proof, we only keep track of the channel with the maximum correlation with the main feature or the noise, arg maxc [C] wc(t) vk and arg maxc [C] ywc(t) ξ. For the other channels, we only give a rough bound on their correlation. 
For this reason, we assume the number of channels is C = Θ(log d) so that the output is dominated by the channel with the maximum correlation. To extend the result to higher number of channels, such as polynomial in d, we need to keep track of all channels and scale the output layer by 1 C . Remark 7. In our model, we show that when there exists some large dominant noise, the neural network overfits to the noise instead of learning the minor features. In practice, the model can overfit to any vector that contributes significantly to the gradient of the loss. For example, our proof can be extended to the case where there exists some spurious feature that appears in sufficiently many samples. In such case, even when the magnitude of the spurious feature is smaller than the dominant noise in our distribution, the network can still overfit it. 5. Experiment Our theoretical results showed that data augmentation can make it harder to overfit to the noise components (the easy to learn and bad feature in our model) by manipulating the relative gradient contribution of noise vs true features. To simplify our analysis, we assumed independent dominant noise in each sample. We hypothesize that the feature manipulation effect of data augmentation is broader in practice. In particular, our high level argument suggests that a model can also overfit to spurious features, like the grass feature in our story of cows in the introduction, which have strong class dependent correlations. In Appedix A, we complement our theory using experiments. Acknowledgements We would like to thank Yi Zhang for valuable feedback. Allen-Zhu, Z. and Li, Y. Feature purification: How adversarial training performs robust deep learning. ar Xiv preprint ar Xiv:2005.10190, 2020a. Allen-Zhu, Z. and Li, Y. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. ar Xiv preprint ar Xiv:2012.09816, 2020b. Azulay, A. and Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? Journal of Machine Learning Research, 20:1 25, 2019. Berry, A. C. The accuracy of the gaussian approximation to the sum of independent variates. Transactions of the american mathematical society, 49(1):122 136, 1941. Bishop, C. M. Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1):108 116, 1995. Bubeck, S., Li, Y., and Nagaraj, D. A law of robustness for two-layers neural networks. Conference on Learning Theory (COLT), 2021. Chapelle, O., Weston, J., Bottou, L., and Vapnik, V. Vicinal risk minimization. Advances in neural information processing systems, pp. 416 422, 2001. Chen, S., Dobriban, E., and Lee, J. H. A group-theoretic framework for data augmentation. Journal of Machine Learning Research, 21(245):1 71, 2020. Ciregan, D., Meier, U., and Schmidhuber, J. Multicolumn deep neural networks for image classification. In 2012 IEEE conference on computer vision and pattern recognition, pp. 3642 3649. IEEE, 2012. Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., and R e, C. A kernel theory of modern data augmentation. In International Conference on Machine Learning, pp. 1528 1537. PMLR, 2019. De Vries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. ar Xiv preprint ar Xiv:1708.04552, 2017. Elandt, R. C. The folded normal distribution: Two methods of estimating parameters from moments. Technometrics, 3(4):551 562, 1961. Feng, S. Y., Gangal, V., Wei, J., Chandar, S., Vosoughi, S., Mitamura, T., and Hovy, E. 
A survey of data augmentation approaches for nlp. ar Xiv preprint ar Xiv:2105.03075, 2021. Hanin, B. and Sun, Y. How data augmentation affects optimization for linear regression. Advances in Neural Information Processing Systems, 34, 2021. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84 90, 2017. Mei, S., Misiakiewicz, T., and Montanari, A. Learning with invariances in random features and kernel models. ar Xiv preprint ar Xiv:2102.13219, 2021. Nagarajan, V. and Kolter, J. Z. Uniform convergence may be unable to explain generalization in deep learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alch e-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings. neurips.cc/paper/2019/file/ 05e97c207235d63ceb1db43c60db7bbb-Paper. pdf. Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR (Workshop), 2015. Rajput, S., Feng, Z., Charles, Z., Loh, P.-L., and Papailiopoulos, D. Does data augmentation lead to positive margin? In International Conference on Machine Learning, pp. 5321 5330. PMLR, 2019. Shorten, C. and Khoshgoftaar, T. M. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1 48, 2019. Simard, P. Y., Le Cun, Y. A., Denker, J. S., and Victorri, B. Transformation invariance in pattern recognition: Tangent distance and propagation. International Journal of Imaging Systems and Technology, 11(3):181 197, 2000. Simard, P. Y., Steinkraus, D., and Platt, J. C. Best practices for convolutional neural networks applied to visual document analysis. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., volume 3, pp. 958 958. IEEE Computer Society, 2003. Wu, S., Zhang, H., Valiant, G., and R e, C. On the generalization effects of linear transformations in data augmentation. In International Conference on Machine Learning, pp. 10410 10420. PMLR, 2020. Yaeger, L., Lyon, R., and Webb, B. Effective training of a neural network character classifier for word recognition. Advances in neural information processing systems, 9: 807 816, 1996. Yang, S., Dong, Y., Ward, R., Dhillon, I. S., Sanghavi, S., and Lei, Q. Sample efficiency of data augmentation consistency regularization. ar Xiv preprint ar Xiv:2202.12230, 2022. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. Zhang, H., Cisse, M., Dauphin, Y. N., and Lopez-Paz, D. mixup: Beyond empirical risk minimization. ar Xiv preprint ar Xiv:1710.09412, 2017. A. Experiments Our theoretical results showed that data augmentation can make it harder to overfit to the noise components (the easy to learn and bad feature in our model) by manipulating the relative gradient contribution of noise vs true features. To simplify our analysis, we assumed independent dominant noise in each sample. We hypothesize that the feature manipulation effect of data augmentation is broader in practice. In particular, our high level argument suggests that a model can also overfit to spurious features, like the grass feature in our story of cows in the introduction, which have strong class dependent correlations. In Section A.1, we show experiments to this effect that complement our theory. 
We further conduct two additional experiments that support this paper s thesis. In Section A.2, we show an experiment with a modified data augmentation pipeline that demonstrates that the benefits of data augmentation cannot be fully explained by the learning of right invariance by the model. Finally, in Section A.3 we elaborate on the problem with unbalanced views, where we show that adding extra samples from one dominant view to balanced dataset can hurt the performance of the learned models. A.1. Spurious Feature We use images of the dog class and the cat class from CIFAR-10 dataset, which are of size 32 32 pixels and 3 channels. We generate a row of random pixels u N(0, σ2Id), where d = 32 and σ = 25, which is added as a synthetic spurious feature to a class dependent position in an image. The spurious feature u is added to the first channel in the row rcat for cat images, and in row rdog for dog images. For each image x in the dataset, with probability p < 1 we introduce a spurious feature, and with probability (1 p) we leave it unperturbed. We always select rcat {0, 1, . . . , 15} in the upper half of the image, and rdog {16, 17, . . . , 31} in the lower half. In this way, the spurious feature position has a weak correlation to the class label. See Figure 1 for sample images with spurious features. We consider three types training sets with varying degrees of data augmentation as shown Figure 1-(b,c,d). 1. No augmentation: As a baseline without augmentation, we center-crop the image to size [3, 28, 28]. 2. Random crop: In each epoch, we randomly crop a [3, 28, 28] from the original [3, 32, 32] image a standard technique used in practice. This would in essence disperse the position of spurious feature u. For example, cat images with u in row rcat = 9, will now contain u in a row uniformly chosen from raug cat U({5, 6, 7, 8, 9}). 3. Randomized noise position: Random crop, although standard, has a confounding effect that in addition to perturbing the position of u, it might also incorporate other useful inductive biases about images. For a more direct comparison to the baseline, we also look at a special augmentation, wherein we perturb just the spurious feature row position by a uniform random number in [ 2, 2] in each epoch and then use a simple center crop. As in the case of random crop, this would again disperse the spurious feature from rcat = 9 to raug cat U({5, 6, 7, 8, 9}). But the non-spurious features/pixels remain the same as baseline. (a) Original images (b) No augmentation (c) Random crop (d) Random noise position Figure 1: Examples of training images in the spurious features experiment (Section A.1). For ease of visualization, we use a green line rather than random row vector u to indicate the spurious feature. In the original [3, 32, 32] images shown in (a), the spurious feature is added to the first channel of row rcat = 9 for the cat class (lower images), and of row rdog = 22 for the dog class (upper images). Sub-figures (b,c,d) correspond to samples from different data augmentation methods described in the experiment. We compare the testing accuracy of training on these three types of training set in Figure 2 for different values of rcat and rdog. When (rcat, rdog) = (15, 16) (Figure 2, right), after data augmentation with either random noise position or random crop, the position of u in the perturbed imaged has a large overlap across classes. 
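Before returning to the results in Figure 2, here is a minimal NumPy sketch of the spurious-feature pipeline just described: the spurious row $u$, its class-dependent placement, and the three training-set variants. The helper names are ours; the ResNet20 training itself is the standard setup given in the Figure 2 caption.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(0, 25, size=32)        # the fixed spurious row, u ~ N(0, 25^2 I_32)

def mark_example(label, p, r_cat=9, r_dog=22):
    """Decide once per image whether it carries the spurious feature, and at which base row."""
    has_u = rng.random() < p
    return has_u, (r_cat if label == 'cat' else r_dog)

def epoch_view(clean_img, has_u, base_row, mode):
    """Training view of one (3, 32, 32) image for the current epoch.
    mode = 'none': center crop only; 'noise_pos': shift only the spurious row by an offset in
    {-2, ..., 2}; 'random_crop': standard 4-pixel random crop. Output is (3, 28, 28)."""
    img = clean_img.copy()
    if has_u:
        shift = rng.integers(-2, 3) if mode == 'noise_pos' else 0
        img[0, base_row + shift] += u                     # add u to channel 0 at the (possibly shifted) row
    if mode == 'random_crop':
        i, j = rng.integers(0, 5, size=2)
        return img[:, i:i + 28, j:j + 28]
    return img[:, 2:30, 2:30]                             # center crop for 'none' and 'noise_pos'
```

Note that only `'noise_pos'` keeps all non-spurious pixels identical to the baseline; `'random_crop'` perturbs them as well, which is why it can bring additional gains.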
So it is not surprising that the test accuracy with augmentation remains about the same for almost all values of p (fraction of images with spurious features). On the other hand, for positions (9, 22) and (12, 19) (Figure 2, left & center)), although the two data augmentation techniques disperse the positions of spurious feature, its location in the two classes still stays separated. The cat images always have u in the upper half of the image while the dog images always have u in the lower half of the image. Interestingly, even so, the data augmentation, specially even the simple random feature position, can improve the test accuracy. In this case, while augmentation does not eliminate the existence of spurious features, it still diminishes them by making the spurious features harder to be learned and overfitted. In addition to shifting the spurious features, random crop can shift other important features as well to boost the minor views, so the testing accuracy when training with random crop can be even higher than only shifting the spurious feature position. 0.0 0.2 0.4 0.6 0.8 1.0 Test accuracy rcat = 9, rdog = 22 Random noise position Random crop No augmentation 0.0 0.2 0.4 0.6 0.8 1.0 Fraction p of images with spurious features rcat = 12, rdog = 19 Random noise position Random crop No augmentation 0.0 0.2 0.4 0.6 0.8 1.0 rcat = 15, rdog = 16 Random noise position Random crop No augmentation Figure 2: Comparison of different data augmentation strategies for the CIFAR-10 cat-vs-dog classification task with a synthetic spurious feature. The plots show results for different sets of positions of spurious feature (rcat, rdog) as we vary the fraction p of all the images that have the spurious feature. The plots are averaged over five runs with error bars of one standard deviation. The test datapoints are always center-cropped images of size [3, 28, 28] with no spurious feature. In all configurations, we train a Res Net20 network using SGD for 120 epochs with momentum 0.9, weight decay 0.005, and learning rate starting at 0.1 and annealed to (0.01, 0.001) at epochs (40, 80). A.2. Augmented samples vs. independent samples When using data augmentation, typically a new random transformation (e.g., random flip or crop at a random position of an image) is used in each epoch of training. This procedure effectively increases the training dataset size (albeit with non i.i.d correlated samples). In this experiment, we control for the number of unique samples seen by the training algorithm and ask the question: how effective is a single data augmented sample compared to an independent sample? For this experiment, we work with the full CIFAR-10 dataset which has 50000 training examples for 10 classes. Given a ratio p of independent samples to total sample size, we generate a training set of size n = 50000 as follows: We first select pn independent samples for the task. We then cyclically generate a data augmented variant these pn independent samples until we obtain the remaining (1 p)n datapoint. For example, in the CIFAR-10 dataset with n = 50000, if p = 0.6, the training set consists of 30000 independent samples, of which 20000 have one additional augmented sample. If p = 0.2, the training set has 10000 independent samples and four data augmented versions of each of the 10000 independent samples. Thus, for p = 1, there is no augmentation, and for smaller p, there are more augmented samples, but less independent samples. The dataset thus generated is then fixed for all epochs. 
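As a concrete reference for this construction (the comparison itself continues below), here is a minimal sketch; the function and argument names are ours, and `augment` is a stand-in for whatever stochastic transform is being evaluated.

```python
def partially_augmented_set(independent_imgs, p, n_total, augment, rng):
    """A fixed (across epochs) training set with round(p * n_total) independent samples; the
    remaining (1 - p) * n_total samples are augmented variants generated by cycling through
    the independent ones.  `augment` is any stochastic transform, e.g. random crop + flip."""
    n_ind = int(round(p * n_total))
    base = list(independent_imgs[:n_ind])
    out = list(base)
    i = 0
    while len(out) < n_total:
        out.append(augment(base[i % n_ind], rng))   # next augmented copy of the i-th independent sample
        i += 1
    return out   # e.g. p = 0.2, n_total = 50000 -> 10000 independent images, 4 augmented copies each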
In this way, the number of unique samples seen by training algorithm is always n = 50000 for all p. In Figure 3, we compare the accuracies of a Res Net20 model trained on such partially data augmented samples to the baseline of training with just the pn independent samples without any augmentation. Our experiment shows that even this partial data augmentation can significantly improve the testing accuracy. In this experiment, since each example has only a small number of augmented variations (e.g., for p 0.5 at most one augmented variant of the an example is seen throughout training), it is unlikely that they lead to learning any kind of task specific invariance, which is the usual motivation. However, by having the important feature appearing at a slightly different location, data augmentation can still facilitate the learning of the important features via the feature manipulation view described in our paper. Furthermore, comparing the accuracy of un-augmented full dataset with p = 1.0 on blue-dashed curve to that of data augmented training for p 0.5 on the red curve, we see that a fixed data augmented image can improve the test accuracy nearly as much as an independent sample does. This shows that if we have an important feature in an image, e.g., a cat ear, shifting it two pixels can help nearly as effectively as a completely new cat ear. A.3. Unbalanced Dataset In this experiment, we train a simple convolutional neural network on a synthetic dataset with unbalanced views. We show that when one view is much more prevalent in the dataset than the other views, having more samples of the dominant view can hurt learning. Our data consist of samples (x, y) from two classes y { 1, 1}. The input x R3 15 has 3 channels, 0.2 0.4 0.6 0.8 1.0 Ratio p of independent data to total data Testing accuracy With data augmentation Without data augmentation Figure 3: Augmented vs independent samples: for each p on the x-axis, the data augmented training (red-solid curve) uses 50000p independent images from CIFAR-10, along with 50000(1 p) data augmented samples. The augmented dataset is fixed across epochs. For the baseline without data augmentation (blue-dashed curve) we simply use the 50000p independent samples. We use the standard CIFAR-10 test dataset and the results are averaged over 3 runs. In each instance, we train a Res Net20 for 160 epochs using SGD with momentum 0.9, weight decay 0.005, and learning rate starting at 0.1 and annealed to (0.01, 0.001) at epochs (80, 120). each with 15 pixels. After sampling y uniformly, we generate x by setting one of the 15 pixels to the main feature [y, y, y]. The other pixels are set to a Gaussian noise N(0, σ2 ξI3). For different choices of σξ, we first construct a balanced dataset Dbal of size nbal such that roughly equal number of samples that have the good feature [y, y, y] present at each pixel. Our full dataset Dfull with nfull samples consists of Dbal along with additional nfull nbal samples with the main feature only at pixel 3. We use a balanced testing dataset. Test accuracy σξ = 0.4 σξ = 0.5 0.25 0.50 0.75 1.00 nbal/nfull Test accuracy 0.25 0.50 0.75 1.00 nbal/nfull 0.25 0.50 0.75 1.00 nbal/nfull Figure 4: Comparison of training on Dbal to Dfull as we vary the ratio of balanced examples nbal/nfull for different values of noise magnitude σξ. We learn the data using a simple convolutional neural network with two convolutional layers with Re LU activation, a maxpool layer and a linear layer. 
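The architecture and training details continue below; as a reference for the data just described, here is a minimal NumPy sketch of the balanced and unbalanced training sets. The helper names are ours, and `dominant_pos=3` mirrors "pixel 3" above.

```python
import numpy as np

def sample_view(y, pos, sigma_xi, rng):
    """One synthetic example: 3 channels x 15 pixels, the feature [y, y, y] at pixel `pos`,
    Gaussian noise N(0, sigma_xi^2 I_3) at every other pixel."""
    x = rng.normal(0, sigma_xi, size=(3, 15))
    x[:, pos] = y
    return x

def build_datasets(n_bal, n_full, sigma_xi, rng, dominant_pos=3):
    """D_bal: n_bal samples with the feature position spread (roughly) evenly over the 15 pixels.
    D_full: D_bal plus n_full - n_bal extra samples whose feature sits only at `dominant_pos`."""
    def draw(pos):
        y = rng.choice([-1, 1])
        return sample_view(y, pos, sigma_xi, rng), y
    D_bal = [draw(i % 15) for i in range(n_bal)]
    D_full = D_bal + [draw(dominant_pos) for _ in range(n_full - n_bal)]
    return D_bal, D_full
```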
We learn the data using a simple convolutional neural network with two convolutional layers with ReLU activations, a max-pool layer, and a linear layer. The two convolutional layers and the max-pool layer have kernel size 4, with strides 2, 1, and 2, respectively. The models are trained for 200 epochs using SGD with momentum 0.9, weight decay 0.05, and learning rate starting at 0.1 and annealed to 0.01 at epoch 100. For all training sets, the training accuracy at the end of training is at least 0.99.

In Figure 4, we see that although the full dataset $D_{\mathrm{full}}$ has strictly more samples with the accurate kind of features than the balanced dataset $D_{\mathrm{bal}}$, when $\sigma_\xi$ is not too large the test accuracy of training on $D_{\mathrm{full}}$ is consistently on par with, or even lower than, that of training on just the balanced subset. In this setting, the views are simply features positioned at different pixels. For very large $\sigma_\xi$, the test accuracy of the balanced subset can be lower because in that regime the full dataset can still learn the dominant view well, whereas the balanced subset has too few samples to learn any view reliably. The experiment shows that even for architectures such as convolutional networks, which are believed to have some translation invariance, we should not expect samples from one view to help the learning of other views.

We clarify that throughout the appendix $c_1, c_2, \ldots$ denote constants, while $C$ denotes the number of channels in our model (3) and is not a constant but a function of $d$. Throughout the appendix, for any sample $(x^{(i)}, y^{(i)})$, we let $P^{(i)}_{bp}$ denote the background patches of $(x^{(i)}, y^{(i)})$ and, for $k \in [K]$, $P^{(i)}_{bp,k}$ the background patches with feature noise $\alpha_{p,i} y v_k$.

B. Useful concentration lemmas

We first state the following standard results on Gaussian samples, which will be used frequently in our proofs.

Lemma 2 (Laurent-Massart $\chi^2$ tail bound). Consider a standard Gaussian vector $z \sim N(0, I_d)$. For any non-negative vector $a \in \mathbb{R}^d_{\ge 0}$ and any $t > 0$, the following concentration holds:
$$\Pr\Big[\textstyle\sum_{i=1}^d a_i z_i^2 \ge \|a\|_1 + 2\|a\|_2 \sqrt{t} + 2\|a\|_\infty t\Big] \le \exp(-t), \qquad \Pr\Big[\textstyle\sum_{i=1}^d a_i z_i^2 \le \|a\|_1 - 2\|a\|_2 \sqrt{t}\Big] \le \exp(-t).$$

The following corollary immediately follows from using $t = \log(2/\delta)$ and $a_i = 1$ in the above lemma.

Corollary 3 ($\ell_2$ norm of a Gaussian vector). Consider $z \sim N(0, \sigma^2 I_d)$. For any $\delta \in (0, 1)$ and large enough $d$, with probability greater than $1 - \delta$,
$$\big| \|z\|_2^2 - \sigma^2 d \big| \le c_1 \sigma^2 \sqrt{d \log(2/\delta)}, \quad \text{and in particular } \|z\|_2^2 = (1 \pm o(1))\, \sigma^2 d.$$

Lemma 4 (Gaussian correlation). Consider independently sampled Gaussian vectors $z_1 \sim N(0, \sigma_1^2 I_d)$ and $z_2 \sim N(0, \sigma_2^2 I_d)$. For any $\delta \in (0, 1)$ and large enough $d$, there exist constants $c_1, c_2$ such that
$$|z_1 \cdot z_2| \le c_1 \sigma_1 \sigma_2 \sqrt{d \log(2/\delta)} \ \text{ w.p. } 1 - \delta, \qquad z_1 \cdot z_2 \ge c_2 \sigma_1 \sigma_2 \sqrt{d} \ \text{ w.p. at least } 1/4.$$

Proof. Let $u = \|z_2\|_2$ and $v = \frac{z_1 \cdot z_2}{\|z_2\|_2}$, so that $z_1 \cdot z_2 = uv$. Since $z_1$ is spherically symmetric, $v \sim N(0, \sigma_1^2)$ and is independent of $u$. We first show the upper bound. For any $c > 0$,
$$\Pr(|uv| \ge t) = \Pr(|v| \ge t/u,\, u \le c) + \Pr(|v| \ge t/u,\, u > c) \le \Pr(|v| \ge t/c) + \Pr(u > c).$$
Using Lemma 2 on $u^2 = \|z_2\|_2^2$ with $c = 2\sigma_2\sqrt{d}$ gives $\Pr(u > c) \le \exp(-d/2)$, and the Gaussian tail bound on $v \sim N(0, \sigma_1^2)$ gives $\Pr(|v| \ge t/c) \le 2\exp\big(-\tfrac{t^2}{2c^2\sigma_1^2}\big)$. Taking $t = c_1 \sigma_1\sigma_2\sqrt{d\log(2/\delta)}$ for a large enough constant $c_1$, both terms are at most $\delta/2$ once $d \ge 4\log(2/\delta)$, which yields the first inequality. For the lower bound, for any $c > 0$ we have $\Pr(uv \ge t) \ge \Pr(v \ge t/c) - \Pr(u \le c)$. Taking $c = \sigma_2\sqrt{d/2}$ and $t = \sigma_1 c/2 \ge \tfrac{1}{4}\sigma_1\sigma_2\sqrt{d}$, Lemma 2 gives $\Pr(u \le c) \le \exp(-d/16)$, while $\Pr(v \ge \sigma_1/2) \ge 0.3$ by the standard normal cdf. Hence $\Pr\big(uv \ge \tfrac14 \sigma_1\sigma_2\sqrt{d}\big) \ge 0.3 - \exp(-d/16) \ge 1/4$ for large enough $d$. This concludes the proof of the lemma.

Lemma 5 (Gaussian tail concentration). Consider i.i.d. samples $\{z_c \sim N(0, \sigma^2) : c \in [C]\}$. We have the following:
$$\max_{c \in [C]} |z_c| \le \sigma \sqrt{2\log(2C/\delta)} \ \text{ w.p. } 1 - \delta, \qquad \max_{c \in [C]} z_c \ge \frac{\sigma}{2} \ \text{ w.p. } 1 - \exp(-C/4).$$

Proof. These are standard Gaussian tail bounds, which we prove here for completeness. We have
$$\Pr\Big[\max_{c \in [C]} z_c \ge t\Big] \le \sum_{c \in [C]} \Pr(z_c \ge t) \le C \exp\Big(-\frac{t^2}{2\sigma^2}\Big).$$
Using the same argument over the $2C$ variables $\{z_c, -z_c\}_{c \in [C]}$ along with $t = \sigma\sqrt{2\log(2C/\delta)}$, we get the first inequality: $\max_{c \in [C]} |z_c| \le \sigma\sqrt{2\log(2C/\delta)}$ w.p. $1 - \delta$.
Furthermore, c [C], we have Pr(zc σ/2) 1/4, hence Pr max c [C] zc σ/2 1 1 1/4 C 1 exp( C/4) This concludes the proof of the lemma. Lemma 8 (Berry Esseen theorem (Berry, 1941)). Consider i.i.d samples {ui : i [n]} with Eui = 0, Eu2 i = σ2 > 0 and E |ui|3 = ρ < . Let Fn be the cumulative distribution function of 1 σ n Pn i=1 ui, and Φ be the cumulative distribution function of the standard normal distribution. For all t, there exists a constant c1 such that |Fn(t) Φ(t)| c1ρ σ3 n. Lemma 9 (Anti-concentration of q-th power of Gaussian random variables). Consider i.i.d samples {zc N(0, 1) : c [C]}. For constant integer q 1, there exist constants c1, c2 > 0 such that for any t o(1), c [C] zq c c1t Proof. For constant q, Ez2q c O(1) and E |zc|3q O(1) (Elandt, 1961). Then, by Lemma 8, for any t, there exist c1 and c2 such that c [C] zq c t Pr [z1 t] c2 Choosing t = o(1) proves the lemma. C. Additional notation Recall the data distribution D from Definition 1. Further recall that, for i [n], we use k i , p i , pξ i , ξ(i), and (αp,i, kp,i)p/ {p i ,pξ i } to denote the respective quantities k , p , pξ, ξ, and (αp, kp)p/ {p ,pξ} in Definition 1 for the ith training sample (x(i), y(i)) Dtrain D. In addition to these notation in Section 2, we introduce the following additional notation for the proofs. 1. k [K], let Ik = {i [n] : k i = k} denote the set of indices of the training data (x, y) with yvk as the main feature. Further, let nk = |Ik| 2. i [n] and k [K], let P(i) bp,k be the background patches of the ith sample with kth-type feature noise, i.e., P(i) bp,k = {p [P] \ {p i , pξ i } : x(i) p = αp,iyvk}; and let P(i) bp = S k [K] P(i) bp,k = [P] \ {p i , pξ i } denote the set of all background patches of the ith sample. Remark 1. For k [K], let ˆρk = 1 n|Ik| denote the empirical fraction in the training data of kth. Recall that ki are sampled independently with Pr(k i = k) = ρk. Thus, with with high probability, ρk and ˆρk differ at most by q n . In the rest of the paper, for simplicity we assume ρk = ˆρk. Similarly, let ˆρ(noise) k be the proportion of feature noise yvk in dataset Dtrain, i.e., ˆρ(noise) k = 1 n(P 2)|{i [n], p [P] \ {p i , pξ i }]|kp,i = k}| Again from standard concentration, we have ρk and ˆρ(noise) k differ by negligible quantity with high probability, thus we also assume ρk = ˆρ(noise) k . D. Proof of initialization conditions in Lemma 1 Lemma 1. [Ginit-conditions] Consider n i.i.d. samples Dtrain = {(x(i), y(i)) : i [n]} from the distribution in Definition 1. Let the parameters w of the network in (3) be initialized as wc(0) N(0, σ2 0Id) c [C]. If the number of channels is C = Ω(log d), then with probability greater than 1 O( n2KC poly(d)), the following conditions hold : 1. Feature-vs-parameter: k [K], max c [C] wc(0) vk Ω(σ0), and max c [C] |wc(0) vk| e O (σ0) . 2. Noise-vs-parameter: i [n], max c [C] wc(0) y(i)ξ(i) eΩ(σ0σξ), and max c [C] |wc(0) ξ(i)| e O (σ0σξ) . 3. Noise-vs-noise: i [n], ξ(i) ξ(i) = Θ(σ2 ξ) and i, j [n], i = j, |ξ(i) ξ(j)| e O(σ2 ξ/ 4. Feature-vs-noise: i [n], k [K], |ξ(i) vk| e O(σξ/ 5. Parameter norm: c [C], wc(0) = Θ(σ0 Proof. Recall the setting of the lemma: k [K], vk 2 = 1, i [n], y(i)ξ(i) i.i.d N(0, σ2 ξ d Id), and c [C], wc(0) i.i.d N(0, σ2 0Id). We have the following arguments that prove the lemma, where we use δ = 1 poly(d). 1. Feature parameter correlations: k [K], we have {wc(0) vk N(0, σ2 0)}c [C] are C i.i.d Gaussian. Thus, using union bound on the Gaussian tail concentration in Lemma 5 we have condition (1) holds w.p. 1 Kδ K exp( C/4). 2. 
Noise-parameter correlation: i [n] and c [C] using Gaussian correlation bound from Lemma 4, we have |wc(0) ξ(i)| e O(σ0σξ) w.p. 1 n Cδ. Furthermore, using the second inequality in Lemma 4, we have wc(0) y(i)ξ(i) c2 σ0σξ w.p. 1/4. Hence, maxc [C] wc(0) y(i)ξ(i) c2 σ0σξ w.p. 1 (1 1/4)C 1 exp(C). Thus, summing over failure probabilities, we have that condition (2) holds w.p. 1 n Cδ n exp( C/4) 3. Noise-noise correlations: Using the ℓ2 norm bound from Corollary 3 on ξ(i) 2 2, and the correlation tail bound on |ξ(i) ξ(j)| for i = j from Lemma 4, we have condition (3) holds w.p. 1 2n2δ 4. Feature noise correlation: k [K], we have {ξ(i) vk N(0, σ2 ξ/d)}i [n] are n i.i.d Gaussians. Thus, again using union bound on the Gaussian tail concentration in Lemma 5 condition (4) holds w.p. 1 nδ. 5. Parameter norm: From concentration of ℓ2 norm of Gaussian vector in Corollary 3, condition (5) holds w.p. 1 2Cδ . The lemma follows from using δ = 1 poly(d) and C = Ω(log d) exp( C) = O( 1 poly(d)). Lemma 1a. Ginit in Lemma 1 also holds for D(aug) train defined in (2) with n replaced by n K. Proof. Recall that since the features {vk}k are orthonormal (Assumption 1) and all the non-feature noise are spherically symmetric, without loss of generality, we assume that {vk}k [K] are simply the first K standard basis vectors in Rd, i.e., vk = ek. In this case, we choose Tk for k [K 1] as a permutation of coordinates of Rd without any fixed points, i.e., i [d], Tk(z)[i] = z[i] that satisfies (1) on the first K coordinate. We now show that the Ginit conditions in Lemma 1 holds for D(aug) train = Dtrain T1(Dtrain) T2(Dtrain) . . . TK 1(Dtrain) defined with transformations {Tk}k [K 1] described above. First, among the Ginit conditions, (1) and (5) are independent of the samples and hence immediately hold. Secondly, i [n] and k [K], Tk(ξ(i)) is simply some permutation of the coordinates of ξ(i) N(0, σ2 ξId/d), and hence Tk(ξ(i)) N(0, σ2 ξId/d) has the same marginal distribution as ξ(i). This implies that conditions (2) and (4), as well the norm condition in (3) of Lemma 1 also holds for D(aug) train . Finally, note that i = j, k, k , Tk(ξ(i)) and Tk (ξ(j)) are independent Gaussians. Thus, the correlation bounds in (3) of the form |Tk(ξ(i)) Tk (ξ(j))| = e O(σ2 ξ/ d) for all i = j also follow from the proof of Lemma 1. The only non-trivial condition we want to show is the following bound on the noise correlations of distinct transformations of the same sample, i.e., we only need to show that |ξ(i) Tk(ξ(i))| e O(σ2 ξ/ d) with high probability for all k [K 1]. Note that for any 1 k < k K 1, Tk(ξ(i)) Tk (ξ(i)) is equivalent in distribution to ξ(i) Tk k(ξ(i)). Claim 1. If ξ N(0, σ2 ξId/d) then k [K 1], |ξ Tk(ξ)| O σ2 ξ Proof. At a high level, we create a non-overlapping partition of the entries of ξ into three vectors ξ , ξ , and ξ , each of which of length at least d/6. The partition is chosen such that same partitioning of entries of Tk(ξ) denoted as ξ , ξ , and ξ are independent of ξ , ξ , and ξ , respectively. We then have ξ Tk(ξ) = ξ ξ + ξ ξ + ξ ξ , where each term is a dot product of two independent spherical Gaussians of length at least d/6 and entrywise variance of σ2 ξ/d. The claim then follows from bounding each term using Lemma 4. We divide the coordinates of ξ into disjoint and ordered lists L1, L2, . . ., constructed as follows. The first list is L1 = ξ[1], Tk(ξ)[1], T 2 k (ξ)[1], . . . 
, T s1 k (ξ)[1] , where T m k denotes composition of Tk for m times, and we stop the list at the first s1 d 1 such that T s1+1 k (e1) = e1 (when T s1+1 k (ξ)[1] = ξ[1]). We claim that this stopping criteria ensures that L1 has s1 unique coordinate of ξ without any duplicates. If not, there exists some 0 s < s s1 such that T s k (e1) = T s k (e1). Since Tk is a permutation (hence invertible), this would imply that T s s k (e1) = e1 for s s s1, which contradicts the stopping criteria. Note that if s1 = d 1, we have included all the coordinates of ξ in L1, and we stop our stop our construction here. If L1 does not contain all coordinates of ξ, let 1 < j2 d be the first coordinate such that ξ[j2] / L1. Let, L2 = ξ[j2], Tk(ξ)[j2], T 2 k (ξ)[j2], . . . , T s2 k (ξ)[j2] , where we stop either when all the entries of ξ have been included in L(m) 1 or L(m) 2 , or at the first integer s2 such that T s2+1 k (ej2) = ej2 (when T s2+1 k (ξ)[j2] = ξ[j2]). With a similar argument as with L1, there are no duplicate coordinates in L2. Furthermore, we either have have L2 and L1 containing disjoint coordinates of ξ, or have L1 L2. To see this, suppose for 0 s s1 and 0 s s2, we have T s k (e1) = T s k (ej2). If s s , again from invertibility of Tk, we would have T s s k (e1) = ej2 for s s s1, which is contradiction for ξ[j2] / L1. On the other hand, if s < s , then T s s k (ej2) = e1, and the entire construction of L1 would also be contained in L2. This would imply that all the coordinates of L1 are contained in L2 exactly once (since L2 does not have duplicates). Without loss of generality, we assume the former condition that L2 and L1 are disjoint holds as otherwise, L1 L2 and we can simply drop the first list L1 from our construction, and our proof follows exactly. We construct L3, L4, . . . , Lℓsimilarly until all coordinates of ξ belong to exactly one list. We also define T L1, T L2, . . . , T Lℓas lists obtained by circularly shifting the coordinates of L1, L2, . . . , Lℓ, respectively, by one index. For example, T L1 = Tk(ξ)[1], T 2 k (ξ)[1], . . . , T s1 k (ξ)[1], ξ[1] . By construction, for l = 1, 2 . . . ℓ, for every coordinate of ξ that is included in Ll, has the same coordinate of Tk(ξ) is included in T Ll at the same position, i.e., for all i sl, j d, Ll[i] = ξ[j] = T Ll[i] = T (ξ)[j]. We now construct ξ , ξ , and ξ . For l = 1, 2 . . . , ℓ, do the following: Sequentially distribute all the elements except the last element of Ll to ξ , ξ , ξ , e.g., the 1st element of Ll goes to ξ , 2nd to ξ , 3rd to ξ , 4th to ξ and so on. This assignment ensures that ξ , ξ , ξ do not contain any adjacent entries of Ll, i.e., if Ll[i] is in ξ , then Ll[i + 1] is not in ξ , and same is true for ξ , and ξ . Include the last element of Ll to a list among ξ , ξ , ξ that does not contain the first or the second last element of Ll. Thus the last element of Ll is not in the same list as its circularly adjacent neighbors ξ[jl] and T sl 1 k (ξ)[jl]. Repeat the exact assignment as above to distribute the elements of T Ll to ξ , ξ ( , ξ . By construction, {ξ , ξ , ξ } and { ξ , ξ , ξ } satisfy the following properties: (a) ξ Tk(ξ) = ξ ξ + ξ ξ + ξ ξ . (b) ξ , ξ , and ξ are independent of ξ , ξ , and ξ , respectively. Furthermore, each of these vectors is a spherical Gaussian with entrywise variance of σ2 ξ/d. (c) we have included at least d/3 1 = Θ(d) entries of ξ in each of ξ , ξ , and ξ . The claim now follows from using Lemma 4 on ξ ξ , ξ ξ , and ξ ξ . The above claim completes the proof of Lemma 1a. E. 
Linear models In this section we discuss the behavior of linear models for data from our distribution D in Definition 1. We consider the same patchwise convolutional model in (3), but without non-linearity. Without loss of generality, assume C = 1. Thus, for θ Rd, the model effectively becomes f linear(θ, x) = θ x, where x = P Linear models without feature noise. In the first result stated and proved below, we assume no feature noise αp = 0. In this case, x(i) = y(i)vk i + ξ(i). Recall the notation that for k [K], Ik = {i [n] : k i = k} and nk = |Ik|. Theorem 6. With high probability, the max ℓ2 margin linear model over Dtrain = {( x(i), y(i)) : i [n]} is given by 1 1 + (1 + o(1))σ2 ξ/nk i Ik y(i)ξ(i) ! Proof. Without loss of generality, assume the data points are grouped by the feature type k i , such that I1 = {1, 2, . . . , n1}, I2 = {n1 + 1, n1 + 2, . . . n1 + n2}, and so on. Also let X Rn d denote a matrix containing y(i) x(i) as rows and let K = XX Rn n denote the corresponding kernel matrix. The ℓ2 max margin classifier is given by bθℓ2 = minθ θ 2 2 s.t. Xθ 1. From the optimality conditions of the max-margin problem, we know that there exists a dual variable ν Rn +, s.t. bθℓ2 = X ν. We use notation νk Rnk + such that ν = [ν 1 , ν 2 , . . . ν K] . We can now write the objective and constraints of the max margin problem in terms of dual variables as follows: θ 2 2 = ν Kν and the margin condition is Kν 1. Let us first look at structure of K. Recall that x(i) = y(i)vk i +ξ(i), where {vk}k are orthonormal and ξ(i) N(0, σ2 ξId/d). Using the standard concentration inequalities in Appendix B, the following holds with high probability. Kij = y(i) x(i) y(j) x(j) = 1 + σ2 ξ + e O( σ2 ξ d) if i = j 1 + e O( σ2 ξ d) if i = j, k i = k j e O( σ2 ξ d) if i = j, k i = k j We can combine all the e O( σ2 ξ d) terms in , and write K = K + where Kij = 1k i =k j + σ2 ξ1i=j. Thus, K is a block diagonal matrix which is dominant compared to lower order terms in . Based on this block dominant structure of K, for w = X ν and ν 0, the margin on data points is given by, i Ik, (Kν)i = νk 1 + σ2 ξνk,i + ( ν)i, (9) and the ℓ2 norm is given by, θ 2 2 = ν Kν = k [K] νk 2 1 + σ2 ξ νk 2 2 + ν ν. (10) Recall that ij = e O(σ2 ξ/ d), we have the following two possibilities of ν: Case 1. ν = O(1): In this case ( ν)i = o(σ2 ξ) and we have (Kν)i = νk 1 + σ2 ξνk,i + o(σ2 ξ). Thus the margin constraint requires that mink mini Ik νk 1 + σ2 ξνk,i + o(σ2 ξ) 1. Furthermore, for large enough d, θ 2 2 is monotonic in νk,i (for positive νk,i). Thus the optimal ν is given by k [K], i Ik, νk,i = 1 nk + (1 + o(1))σ2 ξ . (11) In this case, θ 2 2 = 1 1+σ2 ξ/nk (1 + o(1)) = O(1). Case 2. If ν = ω(1), then θ 2 2 = ω(1) which is suboptimal compared to the above case. In conclusion, we have the optimal ν for the max-margin problem given by (11). Thus, bθℓ2 = X ν = X i Ik νk,iy(i) x(i) vk + y(i)ξ(i) nk + (1 + o(1))σ2 ξ 1 1 + (1 + o(1))σ2 ξ/nk i Ik y(i)ξ(i) ! This concludes the proof of the theorem. For the above classifier, for simplicity, we look at the case when there are only two views, k = 2. Corollary 7 follows from direct calculation on bθ ℓ2x for a sample x from our distribution. The thresholds given in Corollary 7 are better than the threshold we derive for our neural network. Corollary 7. Suppose k = 2, ω(1) σ2 ξ nd and n d. The ℓ2 max-margin linear model in (8) can successfully learn feature v1. To successfully learn feature v2, we need ρ2 σ2 ξ nd if n o(σ2 ξ) and ρ2 σ3 ξ n d otherwise. Linear models with feature noise. 
In the second result, we study linear models in the presence of feature noise. We show linear models are not able to learn samples from our data distribution D while the non-linear model we study can learn D. To facilitate the proof of linear models, we make some additional simplifications. These simplifications are not necessary for our main results. For linear model results alone, we consider the case when the dominant noise ξ is zero, i.e., σξ = 0. Note that letting σξ > 0 can only make the classification harder. Let Λ(x) be the sum of the coefficients of the feature noise if x, i.e., Λ(x) = P p Pbp αp. Let µΛ be the probability that Λ(x) > 1 for each (x, y). We assume that the patch with the main feature is chosen uniform randomly from [P]. Let D be the distribution satisfies the above assumptions. Theorem 10. For any linear classifier θ Rd P , we have Pr (x,y) D [sign x, θ = sign y] > 1 P min {µΛ, 1 µΛ} min k [K] ρk. Moreover, there exists a non-linear model F in (3) with weights w, such that Pr (x,y) D [sign F(w, x) = sign y] = 0. Proof. Let = minp [P ],k [K] θ(p 1)d+k 1 and p , k = arg minp [P ],k [K] θ(p 1)d+k 1. If 0, for any sample with main feature yvk in patch p , and Λ(x) 1, y x, θ + Λ(x) < 0. If > 0, then for any sample with main feature yvk in patch p , with Λ(x) > 1, y x, θ Λ(x) 0. Then, for both the case that > 0 and the case that 0, with probability at least min {µΛ, 1 µΛ} mink [K] ρk/P, sign x, θ = sign y. Now, consider the non-linear model given by weights w1 = P k [K] vk and wc = 0 for all c [C]\ {1}. For any datapoint (x, y) with main feature yvk , y F(w, x) = y X p [P ] ψ ( wc, xp ) = ψ ( w1, vk ) X Pbp,k ψ ( w1, αvk ) Thus, we have sign F(w, x) = sign y for all samples (x, y). F. Proof of the Main Results F.1. Dynamics of network weights: learning features and noise We first present a few lemmas useful for the proof of the main results. We derive the training trajectories for the dataset without data augmentation Dtrain. All lemmas in this section also hold for the dataset with data augmentation D(aug) train with n replaced Kn and ρk replaced by ρ(aug) k = 1 K . We defer the proof of the lemmas to Appendix G. Lemma 11 and Lemma 12 give some rough bounds on wc(t), vk and wc(t), ξ(i) , which are used repeatedly in the proof. Lemma 11 (Rough upper and lower bound on wc(t), vk ). Suppose Ginit holds and q σ0 + ηTρk + ηTσξd 1/2 q 1 For all 0 t t T and k [K], we have max c [C] wc(t), vk max c [C] wc(t ), vk + η(t t ) e O ρk + σξd 1/2 e O σ0 + ηT ρk + σξd 1/2 , min c [C] wc(t), vk min c [C] wc(t ), vk η(t t ) e O σξd 1/2 e O σ0 + ηTσξd 1/2 . Lemma 12 (Rough lower bound on wc(t), ξ(i) ). Suppose Ginit holds and 2q P 1/q σ0 + ηT max k [K] ρk + σξd 1/2 (q 1)/q! For all 0 t t T and i [n], we have min c [C] y(i) D wc(t), ξ(i)E min c [C] y(i) D wc(t ), ξ(i)E η(t t ) e O σ2 ξd 1/2 + σξd 1/2 . Combining Lemma 11 and Lemma 12, we can show that when the time step T is bounded, wc(t), vk and wc(t), ξ(i) are lower bounded. Lemma 13 (Lower bound on wc(t), vk and wc(t), ξ(i) ). Suppose Ginit holds, n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , and 2q P 1/q σ0 + ηT max k [K] ρk + σξd 1/2 (q 1)/q! for some T = eΘ max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o . For all 0 t t T, and c [C], wc(t), vk wc(t ), vk o(σ0), and for all i [n], y(i) D wc(t), ξ(i)E y(i) D wc(t ), ξ(i)E o (σ0σξ) . Next, Lemma 14 and Lemma 15 compute the time it takes for the model to learn the main feature vk, k [K] and overfit the noise ξ(i), i [n]. 
Lemma 16 and Lemma 17 upper bound wc(t), vk and wc(t), ξ(i) for t smaller than the time identified in Lemma 14 and Lemma 15. Lemma 14 (Learning the main feature). Suppose Ginit holds, C = Θ(log d), σ0σξ o(1), σq ξd 1/2 o(ρk) and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T eΩ ηρkσq 2 0 1 . For any k [K] and 0 t T, if max c [C] wc(t), vk O(C 1/q), and max i [n],c [C] y(i) D wc(t), ξ(i)E e O(σ0σξ), max c [C] wc(t + 1), vk = max c [C] wc(t), vk + Θ ηρkψ max c [C] wc(t), vk . Moreover, if maxi [n],c [C] D wc(t), ξ(i)E e O(σ0σξ) for all t e O 1 ηρkσq 2 0 , there exists T e O 1 ηρkσq 2 0 maxc [C] wc(T ), vk Ω C 1/q . Lemma 15 (Overfitting the dominant noise). Suppose Ginit holds, C = Θ(log d), n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T eΩ nη 1σ q ξ σ q+2 0 . Let i [n] be some sample such that for all 0 t T, maxc [C] wc(t), vk i O(C 1/q). For any time step 0 t T , if max c [C] y(i) D wc(t), ξ(i)E O(C 1/q), max c [C] y(i) D wc(t + 1), ξ(i)E = max c [C] y(i) D wc(t), ξ(i)E + η n eΘ σ2 ξψ max c [C] y(i) D wc(t), ξ(i)E . Moreover, there exists times step T e O nη 1σ q ξ σ q+2 0 such that maxc [C] y(i) D wc(T ), ξ(i)E Ω C 1/q . Lemma 16 (Upper bound on wc(t), vk ). If Ginit holds, for all k [K] and t o σ0 ηρkσq 1 0 +ησξd 1/2 , max c [C] wc(t), vk e O (σ0) . Lemma 17 (Upper bound on wc(t), ξ(i) ). Suppose Ginit holds, n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 for some T eΩ nη 1σ q ξ σ q+2 0 . For all t o(nη 1σ q ξ σ q+2 0 ) and i [n], maxc [C] y(i) D wc(t), ξ(i)E e O(σ0σξ). Finally, Lemma 18 bounds wc(t), ξ for some noise patch ξ from the testing set. Lemma 18 is useful in proving the test accuracy. Lemma 18 (Bound on wc(t), ξ for ξ from the testing set). Let ξ N(0, σ2 ξId) be a random noise vector independent of the dataset. Suppose Ginit holds, C = Θ(log d), n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T = eΘ max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o . With probability at least 1 n K polyd, for all c [C] and 0 t T, | wc(t), ξ wc(0), ξ | o(σ0σξ). F.2. Proof of main results from Lemmas in Appendix F.1 We first derive some implications of Assumption 2 that we use as conditions in the lemmas in F.1. 1. n K o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o follows from n K o(σq 1 0 σq 1 ξ d1/2) and σξ ω(1). 2. K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o follows from n K o(σq 1 0 σq 1 ξ d1/2), σξ ω(1) and n ω(σq ξ). 3. σξd 1/2 o(1) follows from n K o σq 1 0 σq 1 ξ d1/2 , σ0σξ o(1) and n ω(σq ξ). 4. σq ξK o(d1/2) follows from n K o(σq 1 0 σq 1 ξ d1/2), σξσ0 o(1) and o(n) σq ξ ω(1). q σξ min d 1/2, σ0 σ0 + ηT maxk [K] ρk + ηTσξd 1/2 1 follows from σξd 1/2 o(1), σ0 o(1) and ηT ω(1). Now, using Lemma 11 - 18, we prove the main theorems. Theorem 3 (Training without data augmentation). Suppose that Assumption 2 holds. Let T be the first time step such that w(T) can classify all (x(i), y(i)) Dtrain with constant margin, i.e., , y(i)F(w(T), x(i)) eΩ(1), for all (x(i),y(i)) Dtrain. For hidden channel number C = Θ(log d), and small step size η, with probability at least 1 O( n2K poly(d)), T = eΘ nη 1σ q ξ σ q+2 0 . Moreover, at time step T, views v2, . . . , v K have never been learned, so that 0 t T , Pr (x,y) D [y F(w(t), x) < 0] 1 Proof. 
By Lemma 1, with probability at least 1 O n2K log d polyd , Ginit holds. We first show that all (x(i), y(i)) Dtrain can be classified correctly with constant margin at some T = eΘ nη 1σ q ξ σ q+2 0 . We first consider the samples i [n] such that k i = 1. If Assumption 2 holds, ω(σq ξ) n, so η 1ρ 1 1 σ q+2 0 o nη 1σ q ξ σ q+2 0 . By Lemma 17, maxc [C] y(i) D wc(t), ξ(i)E e O(σ0σξ) for all t e O η 1ρ1σ q+2 0 . Then, by Lemma 14, there exists some t e O η 1ρ 1 1 σ q+2 0 such that maxc [C] wc(t ), v1 = Θ C 1/q . Moreover, by Lemma 13, at any time step t t e O nη 1σ q ξ σ q+2 0 , the feature v1 satisfies, max c [C] wc(t ), v1 max c [C] wc(t ), v1 o (σ0) Ω C 1/q . We can further show for all c [C] and t e O nη 1σ q ξ σ q+2 0 , wc(t ), v1 and y(i) D wc(t ), ξ(i)E are lower bounded. By Lemma 13, when Ginit holds, wc(t ), v1 wc(0), v1 o (σ0) e O(σ0), and for all i [n], y(i) D wc(t ), ξ(i)E y(i) D wc(0), ξ(i)E o (σ0σξ) e O(σ0σξ). Then, there exists some T = eΘ nη 1σ q ξ σ q+2 0 such that for i with k i = 1, y(i)F(w(T), x(i)) = y(i) X p [P ] ψ D wc(T), x(i) p E c [C] ψ D wc(T), y(i)vk i E + y(i) X p P(i) bp,k ψ D wc(T), αp,iy(i)vk E c [C] ψ D wc(T), ξ(i)E (12) C e O(σq 0) CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q C e O(σq 0σq ξ) The third step follows from maxc [C] wc(T), v1 Ω C 1/q , minc [C] wc(T), v1 e O(σ0), minc [C] y(i) D wc(T), ξ(i)E e O(σ0σξ) and Lemma 11. The last step follows from the the upper bound assumption on α, σ0 o(1) and σ0σξ o(1). We next show that the training accuracy is perfect for all i [n] such that k i = 1. By Lemma 16 and Assumption 2 that ρk o n 1σq ξ and n o(σq 1 ξ σq 1 0 d1/2), we have σ0 ηρkσq 1 0 +ησξd 1/2 ω nη 1σ q ξ σ q+2 0 , and therefore maxc [C] wc(t), vk e O (σ0) for all 0 t e O nη 1σ q ξ σ q+2 0 and k = 1. Then, for any i [n] such that k i = 1, by Lemma 15, there exists some time step t(i) such that maxc [C] y(i) D wc(t(i)), ξ(i)E Ω C 1/q . Moreover, by Lemma 13, for all t(i) t e O nη 1σ q ξ σ q+2 0 , maxc [C] y(i) D wc(t ), ξ(i)E Ω C 1/q . Then, there exists some T = eΘ nη 1σ q ξ σ q+2 0 such that for all (x(i), y(i)) Dtrain such that k i = 1, y(i)F(w(T), x(i)) Ω 1 C e O(σq 0) CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q C e O(σq 0σq ξ) The first step follows from (12), and Lemma 11. The second step follows from the upper bound assumption on α, σ0 o(1) and σ0σξ o(1). Thus, at some T = eΘ nη 1σ q ξ σ q+2 0 , for all i [n], we have y(i)F(w(t), x(i)) Ω 1 Next, we show that the margin is o(1) at t o nη 1σ q ξ σ q+2 0 for any (x(i), y(i)) such that k i = 1. Since t o σ0 ηρkσq 1 0 +ησξd 1/2 , by Lemma 16, maxc [C] wc(t), vk i e O (σ0). Since t o nη 1σ q ξ σ q+2 0 , by Lemma 17, y(i) D wc(T), ξ(i)E e O(σ0σξ). Then, y(i)F(w(t), x(i)) C e O(σq 0) + CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q + C e O(σq 0σq ξ) The first step follows from (12). The second step follows from the upper bound assumption on α, σ0 o(1) and σ0σξ o(1). Thus, we have show that T = eΘ nη 1σ q ξ σ q+2 0 . Finally, we show that the testing accuracy is bad on the testing dataset. For any (x, y) D with the main feature vk such that k = 1 and dominant noise ξ, since maxc [C] | wc(t), vk | e O (σ0) for any t T, following (12), y F(w(t), x) C e O(σq 0) + CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q + y X c [C] ψ ( wc(t), ξ ) C e O(σq 0) + C e O σq ξσq 0 c [C] ψ ( wc(0), ξ ) + c [C] ψ ( wc(t), ξ ) y X c [C] ψ ( wc(0), ξ ) Co(σq ξσq 0) + y X c [C] ψ ( wc(0), ξ ) + c [C] ψ ( wc(t), ξ ) y X c [C] ψ ( wc(0), ξ ) The second step uses the upper bound on α. The last step follows the assumption σξ ω(1). 
For any c [C], by Lemma 4, with probability at least 1 1 polyd, | wc(0), ξ | e O(σ0σξ). Then, by Lemma 18, | wc(t), ξ wc(0), ξ | o(σ0σξ) with probability at least 1 n K polyd and therefore | wc(t), ξ | e O(σ0σξ) and y X c [C] ψ ( wc(t), ξ ) y X c [C] ψ ( wc(0), ξ ) c [C] q e O(σq 1 0 σq 1 ξ ) | wc(t), ξ wc(0), ξ | Co(σq ξσq 0) For t = 0, wc(0), ξ N(0, σ2 0 ξ 2) and { wc(0), ξ : c [C]} are independent. By Lemma 3, ξ 2 = Θ(σ2 ξ). Then, by Lemma 9, with probability at least 1 y F(w(t), x) Co(σq ξσq 0) + y X c [C] ψ ( wc(0), ξ ) < 0. Theorem 4 (Training with data augmentation). Suppose assumption 2 holds. Let T aug be the first time step such that w(T aug) can classify all (x(i), y(i)) D(aug) train with constant margin, i.e., y(i)F(w(T aug), x(i)) eΩ(1), for all (x(i),y(i)) D(aug) train . For hidden channels number C = Θ(log d), and small step size η, with probability at least 1 O( n2K3 poly(d)), T aug = eΘ Kη 1σ q+2 0 , and at T aug, y F(w(T aug), x) < 0 n K poly(d). Proof. By Lemma 1a, Ginit holds with probability at least 1 O n2K3 log d We first show that T aug = e O Kη 1σ q+2 0 . For the augmented dataset, we have ρ(aug) k = 1 K for all k [K] and the size of the dataset is Kn. For any k [K], if Assumption 2 holds, ω(σq ξ) n, so η 1ρ(aug) k 1σ q+2 0 o Knη 1σ q ξ σ q+2 0 . Then, for any i [Kn] with k i = k, by Lemma 17 maxc [C] y(i) D wc(T), ξ(i)E e O (σ0σξ) for all 0 t e O Kη 1σ q+2 0 . Then, under the assumption σq ξK o d1/2 , by Lemma 14, there exists some tk = eΘ 1 ηρ(aug) k σq 2 0 such that maxc [C] wc(tk), vk Ω C 1/q . By Lemma 13, for any tk t eΘ Kη 1σ q+2 0 , maxc [C] wc(t ), vk Ω C 1/q . Then, there exists some T = eΘ Kη 1σ q+2 0 such that for all (x(i), y(i)) D(aug) train , y(i)F(w(T), x(i)) = y(i) X p [P ] ψ D wc(T), x(i) p E c [C] ψ D wc(T), y(i)vk i E + y(i) X p P(i) bp,k ψ D wc(T), αp,iy(i)vk E c [C] ψ D wc(T), ξ(i)E (14) C e O(σq 0) CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q C e O(σq 0σq ξ) The third step follows from maxc [C] wc(T), vk Ω C 1/q , Lemma 11 and Lemma 13. The last step follows from the upper bound assumption on α, σ0 o(1) and σ0σξ o(1). Next, when t = o 1 ηρ(aug) k σq 2 0 , by Lemma 16 , wc(t), vk i e O(σ0). By Lemma 17, y(i) D wc(T), ξ(i)E e O(σ0σξ). y(i)F(w(t), x(i)) C e O(σq 0) + CPαq e O σ0 + ηT max k [K] ρk + σξd 1/2 q + C e O(σq 0σq ξ) The second step follows from the upper bound assumption on α, σ0 o(1) and σ0σξ o(1). Thus, we have shown that T (aug) = eΘ Kη 1σ q+2 0 . Finally, we show that the testing accuracy is perfect at T (aug) = eΘ Kη 1σ q+2 0 . For any sample (x, y) in the testing set with dominant noise ξ, if Ginit hold, by (14), y F(w(T (aug)), x) Ω 1 C e O(σq 0) CPαq e O σ0 + ηT (aug) max k [K] ρk + σξd 1/2 q c [C] ψ wc(T (aug)), ξ c [C] ψ ( wc(0), ξ ) c [C] ψ wc(T (aug)), ξ y X c [C] ψ ( wc(0), ξ ) For any c [C], by Lemma 4, with probability at least 1 1 polyd, | wc(0), ξ | e O(σ0σξ). Then, by Lemma 18, wc(T (aug)), ξ wc(0), ξ o(σ0σξ) with probability at least 1 n K polyd and therefore wc(T (aug)), ξ e O(σ0σξ) and y X c [C] ψ wc(T (aug)), ξ y X c [C] ψ ( wc(0), ξ ) c [C] q e O(σq 1 0 σq 1 ξ ) wc(T (aug)), ξ wc(0), ξ Co(σq ξσq 0). Thus, with probability at least 1 n K polyd, y F(w(T (aug)), x) eΩ(1). G. Deferred Proof of Lemmas in Appendix F In this section, we present the proof of lemmas necessary for proving our main result. Lemma 11 (Rough upper and lower bound on wc(t), vk ). 
Suppose Ginit holds and q σ0 + ηTρk + ηTσξd 1/2 q 1 For all 0 t t T and k [K], we have max c [C] wc(t), vk max c [C] wc(t ), vk + η(t t ) e O ρk + σξd 1/2 e O σ0 + ηT ρk + σξd 1/2 , min c [C] wc(t), vk min c [C] wc(t ), vk η(t t ) e O σξd 1/2 e O σ0 + ηTσξd 1/2 . Proof. For any k [K], c [C] and 0 t < T, wc(t + 1), vk = wc(t), vk + η 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E y(i) D ξ(i), vk E (15) We bound each term separately. Since 1 1+ey(i)F (w(t),x(i)) 1 for all i [n], vk 2 2 = 1, and ψ ( wc(t), vk ) 1 for all k [K], we have η n 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 The feature noise term 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) vk 2 2 When Ginit holds, D ξ(i), vk E e O(σξd 1/2) for all i [n]. Since 1 1+ey(i)F (w(t),x(i)) 1 and ψ D wc(t), ξ(i)E 1, 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E y(i) D ξ(i), vk E e O ησξd 1/2 . Then, for all 0 t < T, wc(t + 1), vk wc(t), vk + η e O ρk + σξd 1/2 , which implies for any 0 t t T, wc(t), vk wc(t ), vk + η(t t ) e O ρk + σξd 1/2 . Next, we lower bound wc(t), vk using induction. When Ginit holds, wc(0), vk e O σ0 + ηTσξd 1/2 . Assume for all 0 t t, min c [C] wc(t), vk min c [C] wc(t ), vk η(t t ) e O σξd 1/2 e O σ0 + ηTσξd 1/2 for induction. We have 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 We have shown that wc(t), vk e O σ0 + ηTρk + ηTσξd 1/2 for all c [C] and k [K]. By the induction hypothesis, 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) vk 2 2 ηαq P e O σ0 + ηTρk + ηTσξd 1/2 q 1 e O ησξd 1/2 . The last inequality follows from α e O σ q / σ0 + ηTρk + ηTσξd 1/2 q 1 q . When Ginit holds, 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E y(i) D ξ(i), vk E e O ησξd 1/2 . Then, plugging into (15), for any 0 t t + 1 T, wc(t + 1), vk wc(t ), vk + η(t + 1 t ) e O σξd 1/2 . Thus, we have completed the induction and therefore min c [C] wc(t), vk min c [C] wc(t ), vk η(t t ) e O σξd 1/2 . Finally, for t = 0, when Ginit holds, | wc(0), vk | e O(σ0). Lemma 12 (Rough lower bound on wc(t), ξ(i) ). Suppose Ginit holds and 2q P 1/q σ0 + ηT max k [K] ρk + σξd 1/2 (q 1)/q! For all 0 t t T and i [n], we have min c [C] y(i) D wc(t), ξ(i)E min c [C] y(i) D wc(t ), ξ(i)E η(t t ) e O σ2 ξd 1/2 + σξd 1/2 . Proof. For any c [C] and i [n], we have y(i) D wc(t + 1), ξ(i)E = y(i) D wc(t), ξ(i)E + η n 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E ξ(i) 2 1 + ey(j)F (w(t),x(j)) ψ D wc(t), ξ(j)E D ξ(j), ξ(i)E (16) 1 + ey(j)F (w(t),x(j)) ψ D wc(t), vk j E D vk j , ξ(i)E (17) 1 + ey(j)F (w(t),x(j)) X p P(j) bp,k ψ ( wc(t), αp,jvk ) D αp,jvk, ξ(i)E 1 1+ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E ξ(i) 2 positive for any i [n]. Since 1 1+ey(j)F (w(t),x(j)) 1 and ψ D wc(t), ξ(j)E 1 for all j [n] , if Ginit holds, (16) η e O σ2 ξd 1/2 , (17) η e O σξd 1/2 . (18) η e O αq Pσξd 1/2 max k [K] | wc(t), vk |q 1 η e O αq Pσξd 1/2 σ0 + ηT ρk + σξd 1/2 q 1 η e O σξd 1/2 The second inequality follows from Lemma 11. The third inequality follows from the upper bound on α. Then, y(i) D wc(t + 1), ξ(i)E y(i) D wc(t), ξ(i)E η e O σ2 ξd 1/2 + σξd 1/2 , which gives min c [C] y(i) D wc(t), ξ(i)E min c [C] y(i) D wc(t ), ξ(i)E η(t t ) e O σ2 ξd 1/2 + σξd 1/2 . Lemma 13 (Lower bound on wc(t), vk and wc(t), ξ(i) ). Suppose Ginit holds, n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , and 2q P 1/q σ0 + ηT max k [K] ρk + σξd 1/2 (q 1)/q! for some T = eΘ max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o . 
For all 0 t t T, and c [C], wc(t), vk wc(t ), vk o(σ0), and for all i [n], y(i) D wc(t), ξ(i)E y(i) D wc(t ), ξ(i)E o (σ0σξ) . Proof. By Lemma 12, for any (x(i), y(i)), max c [C] y(i) D wc(t), ξ(i)E max c [C] y(i) D wc(t ), ξ(i)E η(t t ) e O σ2 ξd 1/2 + σξd 1/2 . Then, when t t e O nη 1σ q ξ σ q+2 0 and n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , or when t t e O Kη 1σ q+2 0 and K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , η(t t ) e O σ2 ξd 1/2 + σξd 1/2 o (σ0σξ). By Lemma 11, max c [C] wc(t), vk max c [C] wc(t ), vk η(t t ) e O σξd 1/2 . Then, when t t e O nη 1σ q ξ σ q+2 0 and n o σq 1 0 σq 1 ξ d1/2 , or when t t e O Kη 1σ q+2 0 and K o(σq 1 0 σ 1 ξ d1/2), η(t t ) e O σξd 1/2 o (σ0) , which completes the proof. Lemma 14 (Learning the main feature). Suppose Ginit holds, C = Θ(log d), σ0σξ o(1), σq ξd 1/2 o(ρk) and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T eΩ ηρkσq 2 0 1 . For any k [K] and 0 t T, if max c [C] wc(t), vk O(C 1/q), and max i [n],c [C] y(i) D wc(t), ξ(i)E e O(σ0σξ), max c [C] wc(t + 1), vk = max c [C] wc(t), vk + Θ ηρkψ max c [C] wc(t), vk . Moreover, if maxi [n],c [C] D wc(t), ξ(i)E e O(σ0σξ) for all t e O 1 ηρkσq 2 0 , there exists T e O 1 ηρkσq 2 0 maxc [C] wc(T ), vk Ω C 1/q . Proof. By the upper bound on α and Lemma 11, for any i [n] and c [C], X p P(i) bp,k ψ D wc(t), y(i)αp,ivk E X p P(i) bp,k | wc(t), αp,ivk |q e O αq P σ0 + ηT max k ρk + σξd 1/2 q Then, since maxc [C] wc(t), vk O(C 1/q), and maxc [C] y D wc(t), ξ(i)E o(1) for all i [n], we have for all i such that k i = k, y(i)F(w(t), x(i)) O(1) and 1 1+ey(i)F (w(t),x(i)) Ω(1). Now, we compute the update wc(t + 1), vk wc(t), vk , wc(t + 1), vk = wc(t), vk + η 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) ψ D w(t) c , ξ(i)E y(i) D ξ(i), vk E (20) Then, when Ginit holds, since 1 1+ey(i)F (w(t),x(i)) Ω(1) for all i [n] such that k i = k, 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 = Θ ηρk | wc(t), vk |q 1 . We can bound the term (19) as p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) e O ηρkαq P | wc(t), vk |q 1 o(ηρk | wc(t), vk |q 1). For the term (20), if Ginit holds, 1 1 + ey(i)F (w(t),x(i)) ψ D w(t) c , ξ(i)E D ξ(i), vk E D w(t) c , ξ(i)E q 1 D ξ(i), vk E! e O ησq 1 0 σq ξd 1/2 . For t = 0, if Ginit holds, maxc [C] wc(0), vk eΩ(σ0). Then, if σq 1 0 σq ξd 1/2 o(ρkσq 1 0 ), we have max c [C] wc(t + 1), vk = max c [C] wc(t), vk + Θ ηρkψ max c [C] wc(t), vk , (21) which shows maxc [C] wc(t), vk is increasing. Then, (21) holds for all t. Starting from some wc(t ), vk , the number of iterations it takes to reach maxc [C] wc(t), vk 2 maxc [C] wc(t ), vk is at most O maxc [C] wc(t ),vk ηρk(maxc [C] wc(t ),vk ) q 1 . Then, starting from Θ(σ0), it takes at most 2iσ0 ηρk(2iσ0)q 1 1 ηρkσq 2 0 time steps to reach maxc [C] wc(t), vk Ω C 1/q . Lemma 15 (Overfitting the dominant noise). Suppose Ginit holds, C = Θ(log d), n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T eΩ nη 1σ q ξ σ q+2 0 . Let i [n] be some sample such that for all 0 t T, maxc [C] wc(t), vk i O(C 1/q). For any time step 0 t T , if max c [C] y(i) D wc(t), ξ(i)E O(C 1/q), max c [C] y(i) D wc(t + 1), ξ(i)E = max c [C] y(i) D wc(t), ξ(i)E + η n eΘ σ2 ξψ max c [C] y(i) D wc(t), ξ(i)E . Moreover, there exists times step T e O nη 1σ q ξ σ q+2 0 such that maxc [C] y(i) D wc(T ), ξ(i)E Ω C 1/q . Proof. 
By the upper bound on α and Lemma 11, for any c [C], p P(i) bp,k ψ ( wc(t), αp,ivk ) p P(i) bp,k | wc(t), αp,ivk |q e O αq P σ0 + ηT max k ρk + σξd 1/2 q For i, when maxc [C] wc(t), vk i O(C 1/q), maxc [C] y(i) D wc(t), ξ(i)E O(C 1/q) and p P(i) bp,k ψ ( wc(t), yαp,ivk ) we have y(i)F(w(t), x(i)) O(1) and therefore 1 1+ey(i)F (w(t),x(i)) Ω(1). Then, y(i) D wc(t + 1), ξ(i)E = y(i) D wc(t), ξ(i)E + η 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E ξ(i) 2 1 + ey(j)F (w(t),x(j)) ψ D wc(t), ξ(j)E D ξ(j), ξ(i)E (22) 1 + ey(j)F (wc(t),x(j)) ψ D wc(t), vk j E D vk j , ξ(i)E (23) 1 + ey(j)F (wc(t),x(j)) X p P(j) bp,k ψ ( wc(t), αp,jvk ) D αp,jvk, ξ(i)E If 1 1+ey(i)F (w(t),x(i)) Ω(1), and eΩ(σ0σξ) maxc [C] y(i) D wc(t), ξ(i)E , 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E ξ(i) 2 n (σ0σξ)q 1 σ2 ξ . When Ginit holds, since 1 1+ey(i)F (w(t),x(i)) 1, ψ D wc(t), ξ(j)E 1, and ψ D wc(t), vk j E 1, |(22)| e O ησ2 ξd 1/2 and |(23)| e O ησξd 1/2 . By Lemma 11 and the upper bound on α, |(24)| e O ηαq P max k [K] | wc(t), vk |q 1 σξd 1/2 e O ηαq P (σ0 + ηT (ρk + σξ))q 1 σξd 1/2 e O ησξd 1/2 . When Ginit holds, eΩ(σ0σξ) y(i) maxc [C] D wc(0), ξ(i)E . Then, when n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , for t = 0, max c [C] y(i) D wc(t + 1), ξ(i)E = max c [C] y(i) D wc(t), ξ(i)E + η n eΘ σ2 ξψ max c [C] y(i) D wc(t), ξ(i)E , (25) which shows maxc [C] y(i) D wc(t), ξ(i)E is increasing. Then, (25) holds for all 0 t T. Starting from maxc [C] y(i) D wc(t ), ξ(i)E , the number of iterations it takes to reach maxc [C] y(i) D wc(t), ξ(i)E 2 maxc [C] y(i) D wc(t ), ξ(i)E is at most O n maxc [C] y(i) wc(t ),ξ(i) ησ2 ξ(maxc [C] y(i) wc(t ),ξ(i) ) q 1 . Then, starting from maxc [C] y(i) D wc(0), ξ(i)E eΩ(σ0σξ), it takes at most n2iσ0σξ ησ2 ξ(2iσ0σξ)q 1 n ησ2 ξ(σ0σξ)q 2 time steps to reach maxc [C] y(i) D wc(T ), ξ(i)E Ω(C 1/q). Lemma 16 (Upper bound on wc(t), vk ). If Ginit holds, for all k [K] and t o σ0 ηρkσq 1 0 +ησξd 1/2 , max c [C] wc(t), vk e O (σ0) . Proof. For every k [K], wc(t + 1), vk = wc(t), vk + η 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k αp,iψ ( wc(t), αp,ivk ) vk 2 2 1 1 + ey(i)F (w(t),x(i)) ψ D w(t) c , ξ(i)E y(i) D ξ(i), vk E (27) Then, since 1 1+ey(i)F (w(t),x(i)) 1 and ψ ( wc(t), vk ) O | wc(t), vk |q 1 , 1 1 + ey(i)F (w(t),x(i)) ψ ( wc(t), vk ) vk 2 2 O ηρk | wc(t), vk |q 1 . The second term (26) 0. For (27), since y(i) 1+ey(i)F (w(t),x(i)) 1 and ψ D w(t) c , ξ(i)E 1, if Ginit holds, (27) e O σξd 1/2 . Finally, if Ginit holds, wc(0), vk e O (σ0), so it takes at least t eΩ σ0 ηρkσq 1 0 +ησξd 1/2 time steps to reach wc(t), vk 2 wc(0), vk . Lemma 17 (Upper bound on wc(t), ξ(i) ). Suppose Ginit holds, n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 for some T eΩ nη 1σ q ξ σ q+2 0 . For all t o(nη 1σ q ξ σ q+2 0 ) and i [n], maxc [C] y(i) D wc(t), ξ(i)E e O(σ0σξ). Proof. We prove using induction. At t = 0, when Ginit holds, maxi [n],c [C] y(i) D wc(0), ξ(i)E e O(σ0σξ). Assume maxi [n],c [C] y(i) D wc(t ), ξ(i)E e O(σ0σξ) for any 0 t t for induction. 
For any c [C], y(i) D wc(t + 1), ξ(i)E = y(i) D wc(t), ξ(i)E + η 1 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E ξ(i) 2 1 + ey(j)F (w(t),x(j)) ψ D wc(t), ξ(j)E D ξ(j), ξ(i)E (28) 1 + ey(j)F (wc(t),x(j)) ψ D wc(t), vk j E D vk j , ξ(i)E (29) 1 + ey(j)F (wc(t),x(j)) X p P(j) bp,k ψ ( wc(t), αp,jvk ) D αp,jvk, ξ(i)E Then, when Ginit holds, by 1 1+ey(i)F (w(t),x(i)) 1 and ψ ( ) 1, y(i) D wc(t + 1), ξ(i)E y(i) D wc(t), ξ(i)E + η e O(n 1σq 1 0 σq+1 ξ + σ2 ξd 1/2 + σξd 1/2 + αq Pσξd 1/2 σ0 + ηT ρk + σξd 1/2 q 1 ) y(i) D wc(0), ξ(i)E + ηt e O(n 1σq 1 0 σq+1 ξ + σ2 ξd 1/2 + σξd 1/2). The last step uses the upper bound on α and the induction hypothesis. Since n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o and t o(nη 1σ q ξ σ q+2 0 ), maxi [n],c [C] y(i) D wc(t), ξ(i)E e O(σ0σξ). Lemma 18 (Bound on wc(t), ξ for ξ from the testing set). Let ξ N(0, σ2 ξId) be a random noise vector independent of the dataset. Suppose Ginit holds, C = Θ(log d), n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , and 2q σ0 + ηT max k [K] ρk + ηTσξd 1/2 q 1 q , σ0 + ηT max k [K] ρk + ηTσξd 1/2 1)! for some T = eΘ max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o . With probability at least 1 n K polyd, for all c [C] and 0 t T, | wc(t), ξ wc(0), ξ | o(σ0σξ). Proof. For any 0 t < T, wc(t + 1), ξ = wc(t), ξ + η 1 + ey(i)F (w(t),x(i)) ψ D wc(t), ξ(i)E D ξ(i), ξ E 1 1 + ey(i)F (w(t),x(i)) ψ wc(t), vk i vk i , ξ 1 1 + ey(i)F (w(t),x(i)) X p P(i) bp,k ψ ( wc(t), αp,ivk ) αp,ivk, ξ By Lemma 4, with probability at least 1 n K polyd, for all i [n], D ξ(i), ξ E e O(σ2 ξd 1/2) and for all k [K], vk, ξ e O(σξd 1/2). Then, by 1 1+ey(i)F (w(t),x(i)) 1 , ψ D wc(t), ξ(i)E 1, ψ wc(t), vk i 1 and ψ ( wc(t), αp,ivk ) 1, | wc(t + 1), ξ wc(t), ξ | e O(ησ2 ξd 1/2) + e O(ησξd 1/2) + e O ηαq Pσξd 1/2 σ0 + ηT max k ρk + σξd 1/2 q 1! η e O(σ2 ξd 1/2 + σξd 1/2). The second step uses Lemma 11. The third step uses the upper bound on α. Summing over 0 t t, | wc(t), ξ wc(0), ξ | ηT e O(σ2 ξd 1/2 + σξd 1/2). When n o min n σq 1 0 σq ξd1/2, σq 1 0 σq 1 ξ d1/2o , K o min n σq 1 0 σ 1 ξ d1/2, σq 1 0 d1/2o , and T e O max n nη 1σ q ξ σ q+2 0 , Kη 1σ q+2 0 o , | wc(t), ξ wc(0), ξ | o(σ0σξ).
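As an informal illustration of the separation between Theorem 3 and Theorem 4, the following sketch iterates the simplified one-dimensional growth recursions suggested by Lemma 14 (feature correlations grow at rate roughly $\eta\rho_k\, q\, a^{q-1}$) and Lemma 15 (per-sample dominant-noise correlations grow at rate roughly $(\eta/n)\,\sigma_\xi^2\, q\, b^{q-1}$), ignoring logarithmic factors, constants, and all cross terms. The parameter values are hypothetical and chosen only to exhibit the reversal of hitting times with and without augmentation; this is an illustration, not the actual network dynamics.

```python
def hitting_time(x0, rate, q, threshold, max_steps=10_000_000):
    """Iterate x <- x + rate * q * x**(q-1) until x >= threshold (Lemma 14/15 style growth)."""
    x, t = x0, 0
    while x < threshold and t < max_steps:
        x += rate * q * x ** (q - 1)
        t += 1
    return t

# Hypothetical parameter values; the paper's regime has sigma_xi = omega(1),
# sigma_0 = o(1), and a rare view with small rho_k (other views carry the rest).
q, eta, sigma0, sigma_xi = 3, 1e-2, 0.05, 3.0
n, K, rho_rare, threshold = 1000, 10, 1e-3, 0.3

# Without augmentation: the rare-view signal grows at rate ~ eta * rho_k,
# while the per-sample noise correlation grows at rate ~ (eta / n) * sigma_xi^2.
t_feature = hitting_time(sigma0, eta * rho_rare, q, threshold)
t_noise = hitting_time(sigma0 * sigma_xi, (eta / n) * sigma_xi ** 2, q, threshold)
print(f"no augmentation:   rare view at ~{t_feature} steps, noise overfit at ~{t_noise} steps")

# With augmentation: every view has weight rho_k = 1/K and the dataset size is K * n.
t_feature_aug = hitting_time(sigma0, eta / K, q, threshold)
t_noise_aug = hitting_time(sigma0 * sigma_xi, (eta / (K * n)) * sigma_xi ** 2, q, threshold)
print(f"with augmentation: rare view at ~{t_feature_aug} steps, noise overfit at ~{t_noise_aug} steps")
```

With these values, the dominant noise is overfit long before the rare view is learned when training on Dtrain, whereas under augmentation every view is learned well before the noise, mirroring the time scales $\tilde\Theta(n\eta^{-1}\sigma_\xi^{-q}\sigma_0^{-q+2})$ of Theorem 3 and $\tilde\Theta(K\eta^{-1}\sigma_0^{-q+2})$ of Theorem 4.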