# On Convergence and Generalization of Dropout Training

Poorya Mianjy, Department of Computer Science, Johns Hopkins University, mianjy@jhu.edu
Raman Arora, Department of Computer Science, Johns Hopkins University, arora@cs.jhu.edu

Abstract

We study dropout in two-layer neural networks with rectified linear unit (ReLU) activations. Under mild overparametrization and assuming that the limiting kernel can separate the data distribution with a positive margin, we show that dropout training with logistic loss achieves ϵ-suboptimality in the test error in O(1/ϵ) iterations.

1 Introduction

Dropout is an algorithmic regularization approach that endows deep learning models with excellent generalization ability despite the non-convex nature of the underlying learning problem and the capacity of modern over-parameterized models to over-fit. Introduced by Hinton et al. [2012], Srivastava et al. [2014], dropout involves randomly pruning the network at every iteration of backpropagation by turning off a random subset of hidden nodes. Like many popular algorithmic approaches that emerged as heuristics from practitioners with deep insight into the learning problem, dropout, while extremely successful in practice, lacks a strong theoretical justification, especially from a computational learning theoretic perspective.

Dropout has been successful in several application areas including computer vision [Szegedy et al., 2015], natural language processing [Merity et al., 2017], and speech recognition [Dahl et al., 2013]. Motivated by explaining the empirical success of dropout, and inspired by the simple, intuitive nature of the method, numerous works in recent years have focused on understanding its theoretical underpinnings [Baldi and Sadowski, 2013, Wager et al., 2013, Helmbold and Long, 2015, Gal and Ghahramani, 2016, Wei et al., 2020]. Most of these works, however, steer clear of the algorithmic and computational learning aspects of dropout. More precisely, none of the prior work, to the best of our knowledge, yields insights into the runtime of learning using dropout on non-linear neural networks.

In this paper, we initiate a study into the iteration complexity of dropout training for achieving ϵ-suboptimality on the true error, i.e., the misclassification error with respect to the underlying population, in two-layer neural networks with ReLU activations. We leverage recent advances in the theory of deep learning in over-parameterized settings with extremely (or infinitely) wide networks [Jacot et al., 2018, Lee et al., 2019]. Analyzing two-layer ReLU networks in such a regime has led to a series of exciting results recently establishing that gradient descent (GD) or stochastic gradient descent (SGD) can successfully minimize the empirical error and the true error [Li and Liang, 2018, Du et al., 2019, Daniely, 2017, Zou et al., 2018, Allen-Zhu et al., 2019, Song and Yang, 2019, Arora et al., 2019, Cao and Gu, 2019, Oymak and Soltanolkotabi, 2020]. In a related line of research, several works attribute generalization in over-parametrized settings to the implicit inductive bias of optimization algorithms (through the geometry of local search methods) [Neyshabur et al., 2017]. However, many real-world state-of-the-art systems employ various explicit regularizers, and there is growing evidence that implicit bias may be unable to explain generalization even in the simpler setting of stochastic convex optimization [Dauber et al., 2020].
Here, we extend convergence guarantees and generalization bounds for GD-based methods with explicit regularization due to dropout. We show that the key insights from the analysis of GD-based methods in over-parameterized settings carry over to dropout training. We summarize our key contributions as follows.

1. We give precise non-asymptotic convergence rates for achieving ϵ-suboptimality in the test error via dropout training in two-layer ReLU networks.
2. We show that dropout training implicitly compresses the network. In particular, we show that there exists a sub-network, i.e., one of the iterates of dropout training, that can generalize as well as any complete network.
3. This work contributes to a growing body of work geared toward a theoretical understanding of GD-based methods for regularized risk minimization in over-parameterized settings.

The rest of the paper is organized as follows. In Section 2, we survey the related work. In Section 3, we formally introduce the problem setup and dropout training, state the key assumptions, and introduce the notation. In Section 4, we give the main results of the paper. In Section 5, we present a sketch of the proofs of our main results; the detailed proofs are deferred to the Appendix. We conclude the paper by providing empirical evidence for our theoretical results in Section 6.

2 Related Work

Empirical success of dropout has inspired a series of works aimed at understanding its theoretical underpinnings. Most of these works have focused on explaining the algorithmic regularization due to dropout in terms of conventional regularizers. Dropout training has been shown to be similar to weight decay in linear regression [Srivastava et al., 2014], in generalized linear models [Wager et al., 2013], and in a PAC-Bayes setting [McAllester, 2013]. These results have recently been extended to multi-variate regression models, where dropout induces a nuclear norm penalty in single hidden-layer linear networks [Mianjy et al., 2018] and deep linear networks [Mianjy and Arora, 2019], and a weighted trace-norm penalty in matrix completion [Arora et al., 2020]. In a recent work, Wei et al. [2020] characterize the explicit regularization due to dropout in terms of the derivatives of the loss and the model, and argue for an implicit regularization effect that stems from the stochasticity in dropout updates.

A parallel strand of research has focused on bounding the generalization gap in dropout training, leveraging tools from uniform convergence. The early work of Wager et al. [2014] showed that under a certain topic model assumption on the data, dropout in linear classification can improve the decay of the excess risk of the empirical risk minimizer. Assuming certain norm-bounds on the weights of the network, the works of Wan et al. [2013], Zhai and Wang [2018], Gao and Zhou [2016] showed that the Rademacher complexity of networks trained by dropout decreases with the dropout rate. Finally, Arora et al. [2020] showed further that the Rademacher complexity can be bounded merely in terms of the explicit regularizer induced by dropout. Despite the crucial insights provided by the previous art, not much is known in the literature about the non-asymptotic convergence behaviour of dropout training.
A very recent work by Senen-Cerda and Sanders [2020] shows, for deep neural networks with polynomially bounded activations with continuous derivatives, under the squared loss, that the network weights converge to a stationary set of a system of ODEs. In contrast, our results leverage over-parameterization in two-layer networks with non-differentiable ReLU activations, work with the logistic loss, and establish ϵ-suboptimality in the true misclassification error.

Our results are inspired by the recent advances in over-parameterized settings. A large body of literature has focused on deriving optimization theoretic guarantees for (S)GD in this setting. In particular, Li and Liang [2018], Du et al. [2019] were among the first to provide convergence rates for empirical risk minimization using GD. Several subsequent works extended those results beyond two layers, for smooth activation functions [Du et al., 2018], and general activation functions [Allen-Zhu et al., 2018, Zou et al., 2018].

Learning theoretic aspects of GD-based methods have been studied for several important target concept classes. Under a linear-separability assumption, via a compression scheme, Brutzkus et al. [2017] showed that SGD can efficiently learn a two-layer ReLU network. Li and Liang [2018] further showed that SGD enjoys small generalization error on two-layer ReLU networks if the data follows a well-separated mixture of distributions. Allen-Zhu et al. [2019] showed generalization error bounds for SGD in two- and three-layer networks with smooth activations where the concept class has fewer parameters. Arora et al. [2019] proved data-dependent generalization error bounds based on the neural tangent kernel by analyzing the Rademacher complexity of the class of networks reachable by GD. When the data distribution can be well-classified in the random feature space induced by the gradient of the network at initialization, Cao and Gu [2019] provide generalization guarantees for SGD in networks with arbitrary depth. Nitanda and Suzuki [2019] studied convergence of GD in two-layer networks with smooth activations, when the data distribution is further separable in the infinite-width limit of the random feature space. Ji and Telgarsky [2019] adopted the same margin assumption and improved the convergence rate as well as the over-parameterization size for non-smooth ReLU activations. Here, we generalize the margin assumption in Nitanda and Suzuki [2019] to take into account the randomness injected by dropout into the gradient of the network at initialization, or equivalently, the scale of the corresponding random feature. Our work is most closely related to and inspired by Ji and Telgarsky [2019]; however, we analyze dropout training as opposed to plain SGD, give generalization bounds in expectation, and show the compression benefits of dropout training.

We emphasize that all of the results above focus on (S)GD in the absence of any explicit regularization. We summarize a few papers that study regularization in the over-parameterized setting. The work of Wei et al. [2019] showed that even simple explicit regularizers such as weight decay can indeed provably improve the sample complexity of training using GD in the Neural Tangent Kernel (NTK) regime, appealing to a margin-maximization argument in homogeneous networks. We also note the recent works by Li et al. [2019] and Hu et al. [2020], which studied the robustness of GD to noisy labels, with explicit regularization in the form of early stopping and the squared norm of the distance from initialization, respectively.
3 Preliminaries and Notation

Let $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} = \{\pm 1\}$ denote the input and label spaces, respectively. We assume that the data is jointly distributed according to an unknown distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$. Given $T$ i.i.d. examples $S_T = \{(x_t, y_t)\}_{t=1}^T \sim \mathcal{D}^T$, the goal of learning is to find a hypothesis $f(\cdot\,; \Theta): \mathcal{X} \to \mathbb{R}$, parameterized by $\Theta$, that has a small misclassification error $\mathcal{R}(\Theta) := \mathbb{P}\{y f(x; \Theta) < 0\}$. Given a convex surrogate loss function $\ell: \mathbb{R} \to \mathbb{R}_{\geq 0}$, a common approach to the above learning problem is to solve the stochastic optimization problem $\min_\Theta \mathcal{L}(\Theta) := \mathbb{E}_{\mathcal{D}}[\ell(y f(x; \Theta))]$. In this paper, we focus on the logistic loss $\ell(z) = \log(1 + e^{-z})$, which is one of the most popular loss functions for classification tasks.

We consider two-layer ReLU networks of width $m$, parameterized by the weights $\Theta = (W, a) \in \mathbb{R}^{m \times d} \times \mathbb{R}^m$, computing the function $f(\cdot\,; \Theta): x \mapsto \frac{1}{\sqrt{m}} a^\top \sigma(Wx)$. We initialize the network with $a_r \sim \mathrm{Unif}(\{+1, -1\})$ and $w_{r,1} \sim \mathcal{N}(0, I)$, for all hidden nodes $r \in [m]$. We then fix the top-layer weights and train the hidden layer $W$ using the dropout algorithm. We denote the weight matrix at time $t$ by $W_t$, and $w_{r,t}$ represents its $r$-th row. For simplicity of presentation, we drop non-trainable arguments from all functions; e.g., we use $f(\cdot\,; W)$ in lieu of $f(\cdot\,; \Theta)$.

Let $B_t \in \mathbb{R}^{m \times m}$, $t \in [T]$, be a random diagonal matrix with diagonal entries drawn independently and identically from a Bernoulli distribution with parameter $q$, i.e., $b_{r,t} \sim \mathrm{Bern}(q)$, where $b_{r,t}$ is the $r$-th diagonal entry of $B_t$. At the $t$-th iterate, dropout entails an SGD step on (the parameters of) the sub-network $g(W; x, B_t) = \frac{1}{\sqrt{m}} a^\top B_t \sigma(Wx)$, yielding updates of the form $W_{t+\frac{1}{2}} \leftarrow W_t - \eta \nabla \ell(y_t g(W_t; x_t, B_t))$. The iteration concludes with projecting the incoming weights, i.e., the rows of $W_{t+\frac{1}{2}}$, onto a pre-specified Euclidean norm ball. We note that such max-norm constraints are standard in the practice of deep learning, and have been a staple of dropout training since it was proposed in Srivastava et al. [2014]¹. Finally, at test time, the weights are multiplied by $q$ so as to make sure that the output at test time is on par with the expected output at training time. The pseudo-code for dropout training is given in Algorithm 1².

¹Quote from Srivastava et al. [2014]: "One particular form of regularization was found to be especially useful for dropout: constraining the norm of the incoming weight vector at each hidden unit to be upper bounded by a fixed constant c."

²In a popular variant that is used in machine learning frameworks such as PyTorch, known as inverted dropout, the (inverse) scaling is applied at training time instead of test time. Inverted dropout is equivalent to the method we study here, and can be analyzed in a similar manner.

Algorithm 1 Dropout in Two-Layer Networks
Input: data $S_T = \{(x_t, y_t)\}_{t=1}^T \sim \mathcal{D}^T$; Bernoulli masks $\mathcal{B}_T = \{B_t\}_{t=1}^T$; dropout rate $1 - q$; max-norm constraint parameter $c$; learning rate $\eta$
1: initialize: $w_{r,1} \sim \mathcal{N}(0, I)$ and $a_r \sim \mathrm{Unif}(\{+1, -1\})$, $r \in [m]$
2: for $t = 1, \ldots, T - 1$ do
3:   forward: $g(W_t; x_t, B_t) = \frac{1}{\sqrt{m}} a^\top B_t \sigma(W_t x_t)$
4:   backward: $\nabla L_t(W_t) = \nabla \ell(y_t g(W_t; x_t, B_t)) = \ell'(y_t g(W_t; x_t, B_t))\, y_t \nabla g(W_t; x_t, B_t)$
5:   update: $W_{t+\frac{1}{2}} \leftarrow W_t - \eta \nabla L_t(W_t)$
6:   max-norm: $W_{t+1} \leftarrow \Pi_c(W_{t+\frac{1}{2}})$
7: end for
Test time: re-scale the weights as $W_t \leftarrow q W_t$
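To make the update rule concrete, the following is a minimal NumPy sketch of Algorithm 1. The function name `dropout_train` and the data format are our own illustrative choices, and the sketch omits appendix-level details of the analysis (e.g., the precise value of the constant c); it is not the authors' implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dropout_train(data, m, q, c, eta, seed=0):
    """Sketch of Algorithm 1: dropout training of a two-layer ReLU net.

    data : list of (x, y) with x a length-d array and y in {-1, +1}
    m    : width, q : Bernoulli keep probability,
    c    : max-norm radius, eta : learning rate.
    """
    rng = np.random.default_rng(seed)
    d = len(data[0][0])
    W = rng.standard_normal((m, d))          # w_{r,1} ~ N(0, I)
    a = rng.choice([-1.0, 1.0], size=m)      # a_r ~ Unif({+1, -1}), kept fixed

    for x, y in data:
        b = rng.binomial(1, q, size=m)       # diagonal of the Bernoulli mask B_t
        pre = W @ x                          # pre-activations, shape (m,)
        g = (a * b * relu(pre)).sum() / np.sqrt(m)    # sub-network output g(W; x, B_t)
        # derivative of the logistic loss: ell'(z) = -1 / (1 + e^{z})
        ell_prime = -1.0 / (1.0 + np.exp(y * g))
        # gradient of the instantaneous loss with respect to each row w_r
        grad = (ell_prime * y / np.sqrt(m)) * (a * b * (pre > 0))[:, None] * x[None, :]
        W = W - eta * grad                   # SGD step on the sub-network
        # max-norm projection: rows of W onto the l2-ball of radius c
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        W = W * np.minimum(1.0, c / np.maximum(norms, 1e-12))

    return q * W, a                          # test-time rescaling W <- qW
```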
Our analysis is motivated by recent developments in understanding the dynamics of (S)GD in the so-called lazy regime. Under certain initialization, learning rate, and network width requirements, these results show that the iterates of (S)GD tend to stay close to initialization; therefore, a first-order Taylor expansion of the $t$-th iterate around initialization, i.e.,
$$f(x; W_t) \approx f(x; W_1) + \langle \nabla f(x; W_1), W_t - W_1 \rangle,$$
can be used as a proxy to track the evolution of the network predictions [Li and Liang, 2018, Chizat et al., 2018, Du et al., 2019, Lee et al., 2019]. In other words, training in the lazy regime reduces to finding a linear predictor in the reproducing kernel Hilbert space (RKHS) associated with the gradient of the network at initialization, $\nabla f(\cdot\,; W_1)$. In this work, following Nitanda and Suzuki [2019], Ji and Telgarsky [2019], we assume that the data distribution is separable by a positive margin in the limiting RKHS:

Assumption 1 ((q, γ)-Margin). Let $z \sim \mathcal{N}(0, I_d)$ and $b \sim \mathrm{Bern}(q)$ be a $d$-dimensional standard normal random vector and a Bernoulli random variable with parameter $q$, respectively. There exists a margin parameter $\gamma > 0$ and a transformation $\psi: \mathbb{R}^d \to \mathbb{R}^d$ satisfying A) $\mathbb{E}_z[\|\psi(z)\|^2] < \infty$; B) $\|\psi(z)\|_2 \leq 1$ for all $z \in \mathbb{R}^d$; and C) $\mathbb{E}_{z,b}[y \langle \psi(z), b x\, \mathbb{I}[z^\top x \geq 0] \rangle] \geq \gamma$ for almost all $(x, y) \sim \mathcal{D}$.

The above assumption provides an infinite-width extension of the separability of the data in the RKHS induced by $\nabla g(W_1; \cdot, B_1)$. To see that, define $V := [v_1, \ldots, v_m]^\top \in \mathbb{R}^{m \times d}$, where $v_r = \frac{1}{\sqrt{m}} a_r \psi(w_{r,1})$ for all $r \in [m]$, satisfying $\|V\|_F \leq 1$. For any given point $(x, y) \in \mathcal{X} \times \mathcal{Y}$, the margin attained by $V$ is at least $y \langle \nabla g(W_1; x, B_1), V \rangle = \frac{1}{m} \sum_{r=1}^m y \langle \psi(w_{r,1}), b_{r,1} x\, \mathbb{I}\{w_{r,1}^\top x > 0\} \rangle$, which is a finite-width approximation of the quantity $\mathbb{E}[y \langle \psi(z), b x\, \mathbb{I}\{z^\top x > 0\} \rangle]$ in Assumption 1.

We remark that when $q = 1$ (pure SGD, no dropout), with probability one it holds that $b = 1$, so that Assumption 1 boils down to that of Nitanda and Suzuki [2019] and Ji and Telgarsky [2019]. When $q < 1$, this assumption translates to a margin of $\gamma/q$ on the full features $\nabla f(\cdot\,; W_1)$, which is the appropriate scaling given that $\nabla f(\cdot\,; W_1) = \frac{1}{q}\mathbb{E}_B[\nabla g(W_1; \cdot, B)]$. Alternatively, dropout training eventually outputs a network with weights scaled down as $q W_t$, which (in expectation) corresponds to the shrinkage caused by the Bernoulli mask in $b x\, \mathbb{I}\{z^\top x > 0\}$. Regardless, we note that our analysis can be carried out even without this scaling; however, new polynomial factors of $1/q$ would be introduced in the required width in our results in Section 4.
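As a sanity check on Assumption 1, the quantity in condition C can be estimated by Monte Carlo for a candidate map ψ on a finite labeled sample. The sketch below is ours and assumes such a candidate `psi` and a sample `(X, y)` are supplied; neither is an object constructed in the paper.

```python
import numpy as np

def empirical_margin(X, y, psi, q, n_mc=10000, seed=0):
    """Monte Carlo estimate of the (q, gamma)-margin quantity of Assumption 1:
    the minimum over the sample of E_{z,b}[ y <psi(z), b x 1{z^T x >= 0}> ],
    for a user-supplied candidate psi : R^d -> R^d (illustrative, not from the paper).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Z = rng.standard_normal((n_mc, d))            # z ~ N(0, I_d)
    B = rng.binomial(1, q, size=n_mc)             # b ~ Bern(q)
    Psi = np.apply_along_axis(psi, 1, Z)          # psi(z) for each Monte Carlo draw
    margins = []
    for x, label in zip(X, y):
        active = (Z @ x >= 0).astype(float)       # indicator 1{z^T x >= 0}
        # y * <psi(z), x> * b * 1{z^T x >= 0}, averaged over (z, b)
        vals = label * (Psi @ x) * B * active
        margins.append(vals.mean())
    return min(margins)                           # empirical surrogate for gamma

# Example (illustrative only): psi(z) = z / max(||z||, 1), X of shape (n, d), y in {-1, +1}
# gamma_hat = empirical_margin(X, y, lambda z: z / max(np.linalg.norm(z), 1.0), q=0.5)
```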
3.1 Notation

We denote matrices, vectors, scalar variables, and sets by Roman capital letters, Roman small letters, small letters, and script letters, respectively (e.g., a matrix Y, a vector y, a scalar y, and a set $\mathcal{Y}$). The $r$-th entry of a vector $y$ and the $r$-th row of a matrix $Y$ are denoted by $y_r$ and $y_r$, respectively. Furthermore, for a sequence of matrices $W_t$, $t \in \mathbb{N}$, the $r$-th row of the $t$-th matrix is denoted by $w_{r,t}$. Let $\mathbb{I}$ denote the indicator of an event, i.e., $\mathbb{I}\{y \in \mathcal{Y}\}$ is one if $y \in \mathcal{Y}$, and zero otherwise. For any integer $d$, we represent the set $\{1, \ldots, d\}$ by $[d]$. Let $\|x\|$ represent the ℓ2-norm of a vector $x$; and let $\|X\|$ and $\|X\|_F$ represent the operator norm and the Frobenius norm of a matrix $X$, respectively. $\langle \cdot, \cdot \rangle$ represents the standard inner product, for vectors or matrices, where $\langle X, X' \rangle = \mathrm{Tr}(X^\top X')$. For a matrix $W \in \mathbb{R}^{m \times d}$ and a scalar $c > 0$, $\Pi_c(W)$ projects the rows of $W$ onto the Euclidean ball of radius $c$ with respect to the ℓ2-norm. For any $t \in [T]$ and any $W$, let $f_t(W) := f(x_t; W)$ denote the network output given input $x_t$, and let $g_t(W) := g(W; x_t, B_t)$ denote the corresponding output of the sub-network associated with the dropout pattern $B_t$.

Let $L_t(W) = \ell(y_t g_t(W))$ and $Q_t(W) = -\ell'(y_t g_t(W))$ be the associated instantaneous loss and its negative derivative. The partial derivative of $g_t$ with respect to the $r$-th hidden weight vector is given by $\frac{\partial g_t(W)}{\partial w_r} = \frac{1}{\sqrt{m}} a_r b_{r,t}\, \mathbb{I}\{w_r^\top x_t \geq 0\}\, x_t$. We denote the linearization of $g_t(W)$ based on the features at time $k$ by $g_t^{(k)}(W) := \langle \nabla g_t(W_k), W \rangle$, and its corresponding instantaneous loss and negative derivative by $L_t^{(k)}(W) := \ell(y_t g_t^{(k)}(W))$ and $Q_t^{(k)}(W) := -\ell'(y_t g_t^{(k)}(W))$, respectively. $Q$ plays an important role in deriving generalization bounds for dropout sub-networks $g(W_t; x, B_t)$; it has recently been used in [Cao and Gu, 2019, Ji and Telgarsky, 2019] for analyzing the convergence of SGD and bounding its generalization error.

We conclude this section by listing a few useful identities that are used throughout the paper. First, due to homogeneity of the ReLU, it holds that $g_t^{(t)}(W_t) = \langle \nabla g_t(W_t), W_t \rangle = g_t(W_t)$. Moreover, the norm of the network gradient and the norm of the gradient of the instantaneous loss can be upper-bounded as $\|\nabla g_t(W)\|_F^2 = \sum_{r=1}^m \left\| \frac{\partial g_t(W)}{\partial w_r} \right\|^2 \leq \frac{\|B_t\|_F^2}{m} \leq 1$ and $\|\nabla L_t(W)\|_F = \|\ell'(y_t g_t(W))\, y_t\, \nabla g_t(W)\|_F \leq Q_t(W)$, respectively. Finally, the logistic loss satisfies $|\ell'(z)| \leq \ell(z)$, so that $Q_t(W) \leq L_t(W)$.
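The short NumPy snippet below numerically illustrates two of these identities on a random instance: the homogeneity identity $\langle \nabla g_t(W), W \rangle = g_t(W)$ and the gradient-norm bound $\|\nabla g_t(W)\|_F^2 \leq \|B_t\|_F^2 / m$ for $\|x\| \leq 1$. The dimensions, seed, and variable names are arbitrary illustrative choices.

```python
import numpy as np

# Check <grad g(W), W> = g(W) and ||grad g(W)||_F^2 <= ||B||_F^2 / m
# for the sub-network g(W; x, B) = (1/sqrt(m)) a^T B relu(Wx).
rng = np.random.default_rng(0)
m, d, q = 8, 5, 0.5
W = rng.standard_normal((m, d))
a = rng.choice([-1.0, 1.0], size=m)
b = rng.binomial(1, q, size=m)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                       # normalize so that ||x|| <= 1

pre = W @ x
g = (a * b * np.maximum(pre, 0.0)).sum() / np.sqrt(m)
grad = (a * b * (pre > 0))[:, None] * x[None, :] / np.sqrt(m)   # row r: dg/dw_r

assert np.isclose((grad * W).sum(), g)                  # homogeneity of the ReLU
assert (grad ** 2).sum() <= (b ** 2).sum() / m + 1e-12  # ||grad||_F^2 <= ||B||_F^2 / m <= 1
```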
4 Main Results

We begin with a simple observation: given the random initialization scheme in Algorithm 1, the ℓ2-norms of the rows of $W_1$ are expected to concentrate around $\sqrt{d}$. In fact, using the Gaussian concentration inequality (Theorem A.1 in the appendix), it holds with probability at least $1 - 1/m$, uniformly for all $r \in [m]$, that $\|w_{r,1}\| \leq \sqrt{d} + 2\sqrt{\ln(m)}$. For the sake of simplicity of the presentation, we assume that the event $\max_{r \in [m]} \|w_{r,1}\| \leq \sqrt{d} + 2\sqrt{\ln(m)}$ holds through the run of dropout training. Alternatively, we can re-initialize the weights until this condition is satisfied, or multiply the probability of success in our theorems by a factor of $1 - 1/m$. Our first result establishes that the true misclassification error of dropout training vanishes as $O(1/T)$.

Theorem 4.1 (Learning with Dropout). Let $c = \sqrt{d} + \max\{\frac{1}{14\gamma^2}, 2\sqrt{\ln(m)}\} + 1$ and $\lambda := 5\gamma^{-1}\ln(2\eta T) + \sqrt{44\gamma^{-2}\ln(24\eta c \sqrt{m}\, T^2)}$. Under Assumption 1, for any learning rate $\eta \in (0, \ln(2)]$ and any network of width satisfying $m \geq 2401\gamma^{-6}\lambda^2$, with probability one over the randomization due to dropout, we have that
$$\min_{t \in [T]} \mathbb{E}[\mathcal{R}(q W_t)] \leq \frac{1}{T}\sum_{t=1}^T \mathbb{E}[\mathcal{R}(q W_t)] \leq \frac{4\lambda^2 \ln(T)^2 + \ln(mdT)}{\eta T},$$
where the expectation is with respect to the initialization and the training samples.

Theorem 4.1 shows that dropout successfully trains the complete network $f(\cdot\,; W_t)$. Perhaps more interestingly, our next result shows that dropout successfully trains a potentially significantly narrower sub-network $g(W_t; \cdot, B_t)$. For this purpose, denote the misclassification error of a network with weights $W$ given a Bernoulli mask $B$ by $\mathcal{R}(W; B) := \mathbb{P}\{y g(W; x, B) < 0\}$. Then the following result holds for the misclassification error of the iterates of dropout training.

Theorem 4.2 (Compression with Dropout). Under the setting of Theorem 4.1, with probability at least $1 - \delta$ over the initialization, the training data, and the randomization due to dropout, we have that
$$\min_{t \in [T]} \mathcal{R}(W_t; B_t) \leq \frac{1}{T}\sum_{t=1}^T \mathcal{R}(W_t; B_t) \leq \frac{12\lambda^2}{\eta T} + \frac{6\ln(1/\delta)}{T} = O\!\left(\frac{\ln(mT)^2}{T}\right).$$

A few remarks are in order. Theorem 4.1 gives a generalization error bound in expectation. A technical challenge here stems from the unboundedness of the logistic loss. In our analysis, the max-norm constraint in Algorithm 1 is essential to guarantee that the logistic loss remains bounded through the run of the algorithm, thereby controlling the loss of the iterates in expectation. However, akin to the analysis of SGD in the lazy regime, the iterates of dropout training are unlikely to leave a small neighborhood of the initialization in the first place. Therefore, for the particular choice of $c$ in the above theorems, the max-norm projections in Algorithm 1 will be virtually inactive in a typical run.

The expected width of the sub-networks in Theorem 4.2 is only $qm$. Using Hoeffding's inequality and a union bound argument, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, it holds for all $t \in [T]$ that $g(W_t; x, B_t)$ has at most $qm + \sqrt{2m\ln(T/\delta)}$ active hidden neurons. That is, in a typical run of dropout training, with high probability, there exists a sub-network of width $qm + O(\sqrt{m})$ whose generalization error is no larger than $O(1/T)$. In Section 6, we further provide empirical evidence to verify this compression result. We note that dropout has long been considered as a means of network compression, improving post-hoc pruning [Gomez et al., 2019], in Bayesian settings [Molchanov et al., 2017], and in connection with the Lottery Ticket Hypothesis [Frankle and Carbin, 2019]. However, we are not aware of any theoretical result supporting that claim prior to our work.

Finally, the sub-optimality results in both Theorem 4.1 and Theorem 4.2 are agnostic to the dropout rate $1 - q$. This is precisely due to the (q, γ)-Margin assumption: if it holds, then so does the (q′, γ)-Margin assumption for any $q' \in [q, 1]$. That is, these theorems hold for any dropout rate not exceeding $1 - q$. Therefore, in light of the remark above, larger admissible dropout rates are preferable, since they result in higher compression rates while enjoying the same generalization error guarantees.

5 Proof Sketch

We begin by bounding $\mathbb{E}_{S_t}[\mathcal{R}(q W_t)]$, the expected population error of the iterates, in terms of $\mathbb{E}_{S_t, B_t}[L_t(W_t)]$, the expected instantaneous loss of the random sub-networks. In particular, using simple arguments, including the smoothing property, the fact that $W_t$ is independent of $(x_t, y_t)$ given $S_{t-1}$, and the fact that the logistic loss upper-bounds the zero-one loss, it is easy to bound the expected population risk as $\mathbb{E}_{S_t}[\mathcal{R}(q W_t)] \leq \mathbb{E}_{S_t}[\ell(y_t f(x_t; q W_t))]$. Furthermore, using Jensen's inequality, we have that $\ell(y_t f_t(q W_t)) \leq \mathbb{E}_{B_t}[L_t(W_t)]$. The upper bound then follows from these two inequalities. In the following, we present the main ideas in bounding the average instantaneous loss of the iterates.

Under Algorithm 1, dropout iterates are guaranteed to remain in the set $\mathcal{W}_c := \{W \in \mathbb{R}^{m \times d} : \|w_r\| \leq c,\ r \in [m]\}$. Using this property and the dropout update rule, we track the distance of consecutive iterates $(W_{t+1}, W_t)$ from any competitor $U \in \mathcal{W}_c$, which leads to the following upper bound on the average instantaneous loss of the iterates.

Lemma 5.1. Let $W_1, \ldots, W_T$ be the sequence of dropout iterates with a learning rate satisfying $\eta \leq \ln(2)$. Then, it holds for any $U \in \mathcal{W}_c$ that
$$\frac{1}{T}\sum_{t=1}^T L_t(W_t) \leq \frac{\|W_1 - U\|_F^2}{\eta T} + \frac{2}{T}\sum_{t=1}^T L_t^{(t)}(U). \qquad (1)$$

Note that the upper bound in Equation (1) holds for any competitor $U \in \mathcal{W}_c$; however, we seek to minimize the upper bound on the right-hand side of Equation (1) by finding a sweet spot that maintains a trade-off between 1) the distance from initialization, and 2) the average instantaneous loss of the linearized models. Following Ji and Telgarsky [2019], we represent such a competitor as an interpolation between the initial weights $W_1$ and the max-margin competitor $V$, i.e.,
$$U := W_1 + \lambda V,$$
where $\lambda$ is the trade-off parameter. Recall that $V := [v_1, \ldots, v_m]^\top \in \mathbb{R}^{m \times d}$, where $v_r = \frac{1}{\sqrt{m}} a_r \psi(w_{r,1})$ for any $r \in [m]$, and $\psi$ is given by Assumption 1. Thus, the first term on the right-hand side of Equation (1) can be conveniently bounded by $\frac{\lambda^2}{\eta T}$; Lemma 5.2 bounds the second term as follows.

Lemma 5.2. Under the setting of Theorem 4.1, it holds with probability at least $1 - \delta$, simultaneously for all iterates $t \in [T]$, that 1) $\|w_{r,t} - w_{r,1}\| \leq \frac{7\lambda}{2\gamma\sqrt{m}}$ for all $r \in [m]$; and 2) $L_t^{(t)}(U) \leq \frac{\lambda^2}{2\eta T}$.

Therefore, the left-hand side of Equation (1), i.e., the average instantaneous loss of the iterates, can be bounded with high probability by $\frac{2\lambda^2}{\eta T}$. To get the bound in expectation, as presented in Theorem 4.1, we need to control $L_t^{(t)}(U)$ in the worst-case scenario. We take advantage of the max-norm constraints in Algorithm 1, and show in Lemma B.5 that with probability one all iterates satisfy $L_t^{(t)}(U) \leq \frac{c\sqrt{m}}{\ln(2)} + 1$. The proof of Theorem 4.1 then follows from carefully choosing $\delta$, the confidence parameter.

To prove the compression result in Theorem 4.2, we use the fact that the zero-one loss can be bounded in terms of the negative derivative of the logistic loss [Cao and Gu, 2019]. Therefore, we can bound $\mathcal{R}(W_t; B_t)$, the population risk of the sub-networks, in terms of $\mathcal{Q}(W_t; B_t) := \mathbb{E}_{\mathcal{D}}[-\ell'(y\, g(W_t; x, B_t))]$. Following Ji and Telgarsky [2019], the proof of Theorem 4.2 then entails showing that $\sum_{t=1}^T \mathcal{Q}(W_t; B_t)$ is close to $\sum_{t=1}^T Q_t(W_t)$, which itself is bounded by the average instantaneous loss of the iterates.

We now present the main ideas in proving Lemma 5.2, which closely follows Ji and Telgarsky [2019]. Since $L_t^{(t)}(U) \leq e^{-y_t \langle \nabla g_t(W_t), U \rangle}$, the proof entails lower bounding $y_t \langle \nabla g_t(W_t), U \rangle$, which can be decomposed as follows:
$$y_t \langle \nabla g_t(W_t), U \rangle = y_t \langle \nabla g_t(W_1), W_1 \rangle + y_t \langle \nabla g_t(W_t) - \nabla g_t(W_1), W_1 \rangle + \lambda y_t \langle \nabla g_t(W_1), V \rangle + \lambda y_t \langle \nabla g_t(W_t) - \nabla g_t(W_1), V \rangle. \qquad (2)$$
By homogeneity of the ReLU activations, the first term in Equation (2) precisely computes $y_t g_t(W_1)$, which cannot be too negative under the initialization scheme used in Algorithm 1, as we show in Lemma B.3. On the other hand, we show in Lemma B.4 that under Assumption 1, $V$ has a good margin with respect to the randomly initialized weights $W_1$, so that the third term in Equation (2) concentrates around the margin parameter $\gamma$. The second and the fourth terms in Equation (2) can be bounded thanks to the lazy regime, where $W_t$ remains close to $W_1$ at all times. In particular, provided $\|w_{r,t} - w_{r,1}\| \leq \frac{7\lambda}{2\gamma\sqrt{m}}$, we show in Lemma B.1 that at most an $O(1/\sqrt{m})$-fraction of the neural activations change, and thus $\nabla g_t(W_t) - \nabla g_t(W_1)$ has a small norm. Lemma 5.2 then follows from carefully choosing $\lambda$ and $m$ such that the right-hand side of Equation (2) is sufficiently positive.

Figure 1: Test accuracy of the full network $f(\cdot\,; W_t)$ as well as the sub-networks $g(W_t; \cdot, B_t)$ drawn by dropout iterates, as a function of the number of iterations $t$, for (top) fixed width m = 250K and several dropout rates 1 - p ∈ {0.1, 0.5, 0.7}; and (bottom) fixed dropout rate 1 - p = 0.5 and several widths m ∈ {100, 10K, 100K}.

6 Experimental Results

The goal of this section is to investigate whether dropout indeed compresses the model, as predicted by Theorem 4.2.
In particular, we seek to understand whether the (sparse) dropout sub-networks $g(W; \cdot, B)$, regardless of whether they were visited by dropout during training, are comparable to the full network $f(\cdot\,; W)$ in terms of the test accuracy.

We train a convolutional neural network with a dropout layer on the top hidden layer, using the cross-entropy loss, on the MNIST dataset. The network consists of two convolutional layers with max-pooling, followed by three fully-connected layers. All the activations are ReLU. We use a constant learning rate η = 0.01 and a batch size of 64 for all the experiments. We train several networks where, except for the top-layer widths (m ∈ {100, 500, 1K, 5K, 10K, 50K, 100K, 250K}), all other architectural parameters are fixed. We track the test accuracy over 25 epochs as a function of the number of iterations, for the full network, the sub-networks visited by dropout, as well as random but fixed sub-networks that are drawn independently using the same dropout rate. We run the experiments for several values of the dropout rate, 1 - p ∈ {0.1, 0.2, 0.3, ..., 0.9}.

Figure 1 shows the test accuracy of the full network $f(\cdot\,; W_t)$ (blue, solid curve) as well as the dropout iterates $g(W_t; \cdot, B_t)$ (orange, dashed curve), as a function of the number of iterations. Both curves are averaged over 50 independent runs of the experiment; the grey region captures the standard deviation. It can be seen that the (sparse) sub-networks drawn by dropout during training are indeed comparable to the full network in terms of the generalization error. As expected, the gap between the full network and the sparse sub-networks is larger for narrower networks and for higher dropout rates. This figure verifies our compression result in Theorem 4.2.

Next, we show that dropout also generalizes to sub-networks that were not observed during training. In other words, random sub-networks drawn from the same Bernoulli distribution also perform well. We run the following experiment on the same convolutional network architecture described above, with widths m ∈ {100, 1K, 10K, 100K}. We draw 100 sub-networks $g(W; \cdot, B_1), \ldots, g(W; \cdot, B_{100})$, corresponding to diagonal Bernoulli matrices $B_1, \ldots, B_{100}$, all generated by the same Bernoulli distribution used at training (Bernoulli parameter p = 0.2, i.e., dropout rate 1 - p = 0.8). In Figure 2, we plot the generalization error of these sub-networks as well as the full network as a function of the iteration number, as orange and blue curves, respectively. We observe that, as the width increases, the sub-networks become increasingly more competitive; it is remarkable that the effective width of these typical sub-networks is only 1/5 of that of the full network.

Figure 2: Test accuracy of the full network $f(\cdot\,; W_t)$ as well as 100 random sub-networks $g(W_t; \cdot, B_1), \ldots, g(W_t; \cdot, B_{100})$ with dropout rate 1 - p = 0.8, as a function of the number of iterations $t$, for several widths m ∈ {100, 1K, 10K, 100K}.
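For illustration, the snippet below is a minimal PyTorch sketch of this evaluation protocol: dropout is applied to the top hidden layer during training, and fixed Bernoulli sub-networks are evaluated at test time by masking that layer. The class name `ConvNet`, the specific layer sizes, and the use of `nn.Dropout` are our own illustrative choices under the architecture described above, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    """Two conv layers with max-pooling, three fully-connected layers, ReLU activations."""
    def __init__(self, width=1000):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 5)
        self.conv2 = nn.Conv2d(16, 32, 5)
        self.fc1 = nn.Linear(32 * 4 * 4, 256)
        self.fc2 = nn.Linear(256, width)      # top hidden layer (dropout applied here)
        self.out = nn.Linear(width, 10)
        self.drop = nn.Dropout(p=0.8)         # PyTorch's p is the drop probability; keep prob = 0.2

    def forward(self, x, mask=None):
        h = F.max_pool2d(F.relu(self.conv1(x)), 2)
        h = F.max_pool2d(F.relu(self.conv2(h)), 2)
        h = F.relu(self.fc1(h.flatten(1)))
        h = F.relu(self.fc2(h))
        if self.training:
            h = self.drop(h)                  # fresh Bernoulli mask at every training step
        elif mask is not None:
            h = h * mask                      # fixed sub-network B at evaluation time
        return self.out(h)

# Evaluating one fixed random sub-network (keep probability 0.2, i.e., dropout rate 0.8):
model = ConvNet(width=1000).eval()
mask = torch.bernoulli(torch.full((1000,), 0.2))
with torch.no_grad():
    logits = model(torch.randn(64, 1, 28, 28), mask=mask)
```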
7 Conclusion

Most of the results in the literature on over-parameterized neural networks focus on GD-based methods without any explicit regularization. On the other hand, recent theoretical investigations have challenged the view that the implicit bias of such GD-based methods can explain generalization in deep learning [Dauber et al., 2020]. Therefore, it seems crucial to explore algorithmic regularization techniques in over-parameterized neural networks. This paper takes a step towards understanding a popular algorithmic regularization technique in deep learning. In particular, assuming that the data distribution is separable in the RKHS induced by the neural tangent kernel, this paper presents precise iteration complexity results for dropout training in two-layer ReLU networks using the logistic loss.

Broader Impact

We investigate the convergence and generalization of a popular algorithmic regularization technique in deep learning. Although we cannot think of any direct social impacts per se, we hope such theoretical studies serve the community in the long term, by improving our understanding of the foundations, which shall eventually lead to more powerful machine learning systems.

Acknowledgements

This research was supported, in part, by NSF BIGDATA award IIS-1546482, NSF CAREER award IIS-1943251, and NSF TRIPODS award CCF-1934979. Poorya Mianjy acknowledges support as a MINDS fellow. Raman Arora acknowledges support from the Simons Institute as part of the program on the Foundations of Deep Learning, and the Institute for Advanced Study (IAS), Princeton, New Jersey, as part of the special year on Optimization, Statistics, and Theoretical Machine Learning.

References

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, pages 6155-6166, 2019.

Raman Arora, Peter Bartlett, Poorya Mianjy, and Nathan Srebro. Dropout: Explicit forms and capacity control. arXiv preprint arXiv:2003.03397, 2020.

Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

Pierre Baldi and Peter J. Sadowski. Understanding dropout. In Advances in Neural Information Processing Systems (NIPS), pages 2814-2822, 2013.

Alina Beygelzimer, John Langford, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 19-26, 2011.

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10835-10845, 2019.

Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. arXiv preprint arXiv:1812.07956, 2018.

George E. Dahl, Tara N. Sainath, and Geoffrey E. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 8609-8613. IEEE, 2013.

Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422-2430, 2017.

Assaf Dauber, Meir Feder, Tomer Koren, and Roi Livni. Can implicit bias explain generalization? Stochastic convex optimization as a case study. arXiv preprint arXiv:2003.06152, 2020.
Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=S1eK3i09YQ.

Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJl-b3RcF7.

Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Int. Conf. Machine Learning (ICML), 2016.

Wei Gao and Zhi-Hua Zhou. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 59(7):072104, 2016.

Aidan N. Gomez, Ivan Zhang, Kevin Swersky, Yarin Gal, and Geoffrey E. Hinton. Learning sparse networks using targeted dropout. arXiv preprint arXiv:1905.13678, 2019.

David P. Helmbold and Philip M. Long. On the inductive bias of dropout. Journal of Machine Learning Research (JMLR), 16:3403-3454, 2015.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Wei Hu, Zhiyuan Li, and Dingli Yu. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hke3gyHYwH.

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pages 8571-8580, 2018.

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292, 2019.

Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pages 8570-8581, 2019.

Mingchen Li, Mahdi Soltanolkotabi, and Samet Oymak. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pages 8157-8166, 2018.

David McAllester. A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118, 2013.

Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

Poorya Mianjy and Raman Arora. On dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575-4584, 2019.

Poorya Mianjy, Raman Arora, and Rene Vidal. On the implicit bias of dropout. In International Conference on Machine Learning, pages 3540-3548, 2018.

Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 2498-2507. JMLR.org, 2017.
Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.

Atsushi Nitanda and Taiji Suzuki. Refined generalization analysis of gradient descent for over-parameterized two-layer neural networks with smooth activations on classification problems. arXiv preprint arXiv:1905.09870, 2019.

Samet Oymak and Mahdi Soltanolkotabi. Towards moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 2020.

Albert Senen-Cerda and Jaron Sanders. Almost sure convergence of dropout algorithms for neural networks. arXiv preprint arXiv:2002.02247, 2020.

Zhao Song and Xin Yang. Quadratic suffices for over-parametrization via matrix Chernoff bound. arXiv preprint arXiv:1906.03593, 2019.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.

Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.

Stefan Wager, Sida Wang, and Percy S. Liang. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems (NIPS), 2013.

Stefan Wager, William Fithian, Sida Wang, and Percy S. Liang. Altitude training: Strong bounds for single-layer dropout. In Advances in Neural Information Processing Systems, 2014.

Li Wan, Matthew Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pages 1058-1066, 2013.

Colin Wei, Jason D. Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. In Advances in Neural Information Processing Systems, pages 9709-9721, 2019.

Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. arXiv preprint arXiv:2002.12915, 2020.

Ke Zhai and Huan Wang. Adaptive dropout with Rademacher complexity regularization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1uxsye0Z.

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.