# On Dropout and Nuclear Norm Regularization

Poorya Mianjy¹, Raman Arora¹

¹Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA. Correspondence to: Raman Arora. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

**Abstract.** We give a formal and complete characterization of the explicit regularizer induced by dropout in deep linear networks with squared loss. We show that (a) the explicit regularizer is composed of an $\ell_2$-path regularizer and other terms that are also rescaling invariant, (b) the convex envelope of the induced regularizer is the squared nuclear norm of the network map, and (c) for a sufficiently large dropout rate, we characterize the global optima of the dropout objective. We validate our theoretical findings with empirical results.

## 1. Introduction

Deep learning is revolutionizing the technological world with recent advances in artificial intelligence. However, a formal understanding of when or why deep learning algorithms succeed has remained elusive. Recently, a series of works focusing on computational learning theoretic aspects of deep learning have implicated inductive biases due to various algorithmic choices as a crucial potential explanation (Zhang et al., 2016; Gunasekar et al., 2018a; Neyshabur et al., 2014; Martin & Mahoney, 2018; Mianjy et al., 2018). Here, we examine such an implicit bias of dropout in deep linear networks.

Dropout is a popular algorithmic approach that helps train deep neural networks that generalize better (Hinton et al., 2012; Srivastava et al., 2014). Inspired by the reproduction model in the evolution of advanced organisms, dropout training aims at breaking co-adaptation among neurons by dropping them independently and identically according to a Bernoulli random variable. Here, we restrict ourselves to simpler networks; we consider multi-layered feedforward networks with linear activations (Goodfellow et al., 2016). While the overall function is linear, the representation in factored form makes the optimization landscape non-convex and hence challenging to analyze. More importantly, we argue that the fact that we choose to represent a linear map in factored form has important implications for the learning problem, akin in many ways to the implicit bias due to stochastic optimization algorithms and various algorithmic heuristics used in deep learning (Gunasekar et al., 2017; Li et al., 2018; Azizan & Hassibi, 2019).

Several recent works have investigated the optimization landscape of deep linear networks (Baldi & Hornik, 1989; Saxe et al., 2013; Kawaguchi, 2016; Hardt & Ma, 2016; Laurent & Brecht, 2018), as well as the implicit bias due to first-order optimization algorithms for training such networks (Gunasekar et al., 2018b; Ji & Telgarsky, 2018), and the convergence rates of such algorithms (Bartlett et al., 2018; Arora et al., 2018). The focus here is to have a similar development for dropout when training deep linear networks.

### 1.1. Notation

For an integer $i$, $[i]$ represents the set $\{1, \ldots, i\}$, $e_i$ denotes the $i$-th standard basis vector, and $\mathbf{1}_i \in \mathbb{R}^i$ is the vector of all ones. The set of all $k$-combinations of a set $\mathcal{S}$ is denoted by $\binom{\mathcal{S}}{k}$. We denote linear operators and vectors by Roman capital and lowercase letters, respectively, e.g. $Y$ and $y$. Scalar variables are denoted by lowercase letters (e.g. $y$) and sets by script letters, e.g. $\mathcal{Y}$.
We denote the $\ell_2$ norm of a vector $x$ by $\|x\|$. For a matrix $X$, $\|X\|_F$ denotes the Frobenius norm, $\|X\|_*$ denotes the nuclear norm, and $\sigma_i(X)$ is the $i$-th largest singular value of $X$. For $X \in \mathbb{R}^{d_2 \times d_1}$ and a positive definite matrix $C \in \mathbb{R}^{d_1 \times d_1}$, $\|X\|_C^2 := \operatorname{Tr}(XCX^\top)$. The standard inner product between two matrices $X, X'$ is denoted by $\langle X, X' \rangle := \operatorname{Tr}(X^\top X')$. We denote the $i$-th column and the $j$-th row of a matrix $X$ by the vectors $x_{:i}$ and $x_{j:}$, both in column form. The vector of diagonal elements of $X$ is denoted by $\operatorname{diag}(X)$. The diagonal matrix with diagonal entries given by the elements of a vector $x$ is denoted by $\operatorname{diag}(x)$. With a slight abuse of notation, we use $\{W_i\}$ as a shorthand for the tuple $(W_1, \ldots, W_{k+1})$.

Table 1. Key terms, corresponding symbols, and descriptions.

| Term | Symbol | Description |
| --- | --- | --- |
| architecture | $\{d_i\}$ | $(d_0, \ldots, d_{k+1})$ |
| implementation | $\{W_i\}$ | $(W_1, \ldots, W_{k+1})$ |
| network map | $W_{k+1:1}$ | $W_{k+1} W_k \cdots W_1$ |
| population risk | $L(\{W_i\})$ | $\mathbb{E}_{x,y}[\|y - W_{k+1:1}x\|^2]$ |
| dropout objective | $L_\theta(\{W_i\})$ | see Equation (1) |
| explicit regularizer | $R(\{W_i\})$ | $L_\theta(\{W_i\}) - L(\{W_i\})$ |
| induced regularizer | $\Theta(M)$ | see Equation (2) |

### 1.2. Problem Setup

We consider the hypothesis class of multilayer feed-forward linear networks with input dimension $d_0$, output dimension $d_{k+1}$, $k$ hidden layers of widths $d_1, \ldots, d_k$, and linear transformations $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ for $i = 1, \ldots, k+1$:
$$\mathcal{L}_{\{d_i\}} = \{\, g : x \mapsto W_{k+1} \cdots W_1 x, \; W_i \in \mathbb{R}^{d_i \times d_{i-1}} \,\}.$$
We refer to the set of $k+2$ integers $\{d_i\}_{i=0}^{k+1}$ specifying the width of each layer as the architecture of the function class, the set of weight parameters $\{W_i\}_{i=1}^{k+1}$ as an implementation, or an element of the function class, and $W_{k+1:1} := W_{k+1} W_k \cdots W_1$ as the network map.

The focus here is on deep regression with dropout under the $\ell_2$ loss, which is widely used in computer vision tasks, including human pose estimation (Toshev & Szegedy, 2014), facial landmark detection, and age estimation (Lathuilière et al., 2019). More formally, we study the following learning problem for deep linear networks. Let $\mathcal{X} \subseteq \mathbb{R}^{d_0}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_{k+1}}$ denote the input feature space and the output label space, respectively. Let $\mathcal{D}$ denote the joint probability distribution on $\mathcal{X} \times \mathcal{Y}$. We assume that $\mathbb{E}[xx^\top]$ has full rank. Given a training set $\{x_i, y_i\}_{i=1}^n$ drawn i.i.d. from the distribution $\mathcal{D}$, the goal of the learning problem is to minimize the population risk under the squared loss
$$L(\{W_i\}) := \mathbb{E}_{x,y}\big[\|y - W_{k+1:1}x\|^2\big].$$
Note that the population risk $L$ depends only on the network map and not on a specific implementation of it, i.e. $L(\{W_i\}) = L(\{W_i'\})$ whenever $W_{k+1} \cdots W_1 = W_{k+1}' \cdots W_1'$. For that reason, with a slight abuse of notation, we write $L(W_{k+1:1}) := L(\{W_i\})$.

Dropout is an iterative procedure wherein at each iteration each node in the network is dropped independently and identically according to a Bernoulli random variable with parameter $\theta$. Equivalently, we can view dropout, algorithmically, as an instance of stochastic gradient descent for minimizing the following objective over $\{W_i\}$:
$$L_\theta(\{W_i\}) := \mathbb{E}_{(x,y,\{b_i\})}\big[\|y - \widetilde{W}_{k+1:1}x\|^2\big], \tag{1}$$
where $\widetilde{W}_{i:j} := \frac{1}{\theta^{i-j}} W_i B_{i-1} W_{i-1} \cdots B_j W_j$, and $B_l = \operatorname{diag}([b_l(1), \ldots, b_l(d_l)])$ represents the dropout pattern in the $l$-th layer, with Bernoulli random variables on the diagonal. If $B_l(i,i) = 0$, then all paths from the input to the output that pass through the $i$-th hidden node in the $l$-th layer are turned off, i.e., those paths make no contribution to the output of the network for that instance of the dropout pattern. We refer to the parameter $1 - \theta$ as the dropout rate. The map $\widetilde{W}_{i:j}$ is an unbiased estimator of $W_{i:j}$, i.e. $\mathbb{E}_{\{b_i\}}[\widetilde{W}_{i:j}] = W_{i:j}$.
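To make the stochastic map concrete, here is a minimal numerical sketch (illustrative only, not the authors' code; the architecture and weights below are arbitrary) that samples the Bernoulli masks $B_l$, forms the rescaled map $\widetilde{W}_{k+1:1}$, and checks by Monte Carlo that it is an unbiased estimator of $W_{k+1:1}$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.8                       # keep probability; dropout rate is 1 - theta
dims = [6, 5, 4, 3]               # illustrative architecture (d0, d1, d2, d3), k = 2 hidden layers
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def network_map(Ws):
    """Deterministic network map W_{k+1} ... W_1."""
    M = Ws[0]
    for W in Ws[1:]:
        M = W @ M
    return M

def dropout_map(Ws, theta, rng):
    """One realization of the rescaled stochastic map: each hidden node is kept w.p. theta."""
    M = Ws[0]
    for W in Ws[1:]:
        # mask the hidden layer feeding into W and rescale by 1/theta so the estimator stays unbiased
        b = rng.binomial(1, theta, size=M.shape[0]) / theta
        M = W @ (b[:, None] * M)
    return M

# Monte Carlo check of unbiasedness: the average dropped-out map approaches the network map.
avg = np.mean([dropout_map(Ws, theta, rng) for _ in range(100_000)], axis=0)
print(np.abs(avg - network_map(Ws)).max())   # close to 0, up to Monte Carlo error
```

Running SGD on the squared loss with such sampled masks is exactly stochastic gradient descent on the objective $L_\theta$ in Equation (1).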
We say that the dropout algorithm succeeds in training a network if it returns a map $W_{k+1:1}$ that (approximately) minimizes $L_\theta$. In this paper, the central question under investigation is which network maps and architectures a successful dropout training is biased towards. To answer this question, we begin with the simple observation that
$$L_\theta(\{W_i\}) = L(\{W_i\}) + \mathbb{E}_{(x,\{b_i\})}\big[\|\widetilde{W}_{k+1:1}x - W_{k+1:1}x\|^2\big].$$
In other words, the dropout objective is composed of the population risk $L(\{W_i\})$ plus an explicit regularizer
$$R(\{W_i\}) := \mathbb{E}_{(x,\{b_i\})}\big[\|\widetilde{W}_{k+1:1}x - W_{k+1:1}x\|^2\big]$$
induced by dropout. Denoting the second moment of $x$ by $C := \mathbb{E}[xx^\top]$, we note that $R(\{W_i\}) = \mathbb{E}_{\{b_i\}}\big[\|\widetilde{W}_{k+1:1} - W_{k+1:1}\|_C^2\big]$. Since the stochastic network map specified by $\widetilde{W}_{k+1:1}$ is an unbiased estimator of the network map specified by $W_{k+1:1}$, the explicit regularizer captures the variance of the network implemented by the weights $\{W_i\}$ under Bernoulli perturbations. By minimizing this variance term, dropout training aims at breaking co-adaptation between hidden nodes: it biases towards networks whose random sub-networks yield similar outputs (Srivastava et al., 2014).

If $\{W_i^*\}$ is a minimizer of (1), then it minimizes the explicit regularizer among all implementations of the network map $M^* = W_{k+1}^* \cdots W_1^*$, i.e., $R(\{W_i^*\}) = \inf_{W_{k+1}\cdots W_1 = M^*} R(\{W_i\})$. We refer to the infimum of the explicit regularizer over all implementations of a given network map $M$ as the induced regularizer:
$$\Theta(M) := \inf_{W_{k+1}\cdots W_1 = M} R(\{W_i\}). \tag{2}$$
The domain of the induced regularizer $\Theta$ is the set of linear maps implementable by the network, i.e., the set $\{M : \operatorname{rank}(M) \le \min_i d_i\}$. Since the infimum in the induced regularizer is always attained (see Lemma A.2 in the appendix), we can equivalently study the following problem to understand the solution to Problem (1) in terms of the network map:
$$\min_M \; L(M) + \Theta(M) \quad \text{s.t.} \quad \operatorname{rank}(M) \le \min_{i \in [k+1]} d_i. \tag{3}$$

To characterize which networks are preferred by dropout training, one needs to understand the explicit regularizer $R$, understand the induced regularizer $\Theta$, and explore the global minima of Problem (3). In this regard, we make several contributions, summarized as follows.

1. We derive a closed-form expression for the explicit regularizer $R(\{W_i\})$ induced by dropout training in deep linear networks. The explicit regularizer is composed of the $\ell_2$-path regularizer as well as other rescaling-invariant sub-regularizers.
2. We show that the convex envelope of the induced regularizer is proportional to the squared nuclear norm of the network map, generalizing a similar result for matrix factorization (Cavazza et al., 2018) and single-hidden-layer linear networks (Mianjy et al., 2018). Furthermore, we show that the induced regularizer equals its convex envelope if and only if the network is equalized, a notion that quantitatively measures co-adaptation between hidden nodes (Mianjy et al., 2018).
3. We completely characterize the global minima of the dropout objective $L_\theta$ in Problem (1), despite the objective being non-convex, under a simple eigengap condition (see Theorem 2.7). This gap condition depends on the model, the data distribution, the network architecture and the dropout rate, and is always satisfied by deep linear network architectures with one output neuron.
4. We empirically verify our theoretical findings.

The rest of the paper is organized as follows. In Section 2, we present the main results of the paper. In Section 3, we discuss the proof ideas and the key insights.
Section 4 details the experimental results, and Section 5 concludes with a discussion of future work. We refer the reader to Table 1 for a quick reference to the most useful notation.

## 2. Main Results

### 2.1. The explicit regularizer

In this section, we give the closed-form expression for the explicit regularizer $R(\{W_i\})$ and discuss some of its important properties.

**Proposition 2.1.** The explicit regularizer is composed of $k$ sub-regularizers, $R(\{W_i\}) = \sum_{l \in [k]} \lambda^l R_l(\{W_i\})$, where $\lambda := \frac{1-\theta}{\theta}$. Each of the sub-regularizers has the form
$$R_l(\{W_i\}) = \sum_{(j_l, \ldots, j_1) \in \binom{[k]}{l}} \;\; \sum_{(i_l, \ldots, i_1) \in [d_{j_l}] \times \cdots \times [d_{j_1}]} \alpha_{j_1, i_1}^2 \;\prod_{p=1}^{l-1} \beta_p^2 \;\; \gamma_{j_l, i_l}^2,$$
where $\alpha_{j_1,i_1} := \|W_{j_1:1}(i_1,:)\|_C$, $\beta_p := W_{j_{p+1}:j_p+1}(i_{p+1}, i_p)$, and $\gamma_{j_l,i_l} := \|W_{k+1:j_l+1}(:, i_l)\|$.

**Understanding the regularizer.** For simplicity, we consider here the case where the data distribution is whitened, i.e. $C = I$. This assumption is by no means restrictive, as we can always redefine $W_1 \leftarrow W_1 C^{\frac{1}{2}}$ to absorb the second moment into the first layer. Moreover, it is common practice to whiten the data as a preprocessing step.

The $l$-th sub-regularizer, $R_l(\{W_i\})$, partitions the network graph (see Figure 1) into $l+1$ subgraphs. This partitioning is done via the choice of $l$ pivot layers, a set of $l$ distinct hidden layers indexed by $(j_1, \ldots, j_l) \in \binom{[k]}{l}$. The sub-regularizer enumerates over all such combinations of pivot layers, and over pivot nodes within them indexed by $(i_1, \ldots, i_l) \in [d_{j_1}] \times \cdots \times [d_{j_l}]$. For a given set of pivot layers and pivot nodes, the corresponding summand in the sub-regularizer is a product of three types of terms: a head term $\alpha_{j_1,i_1}$, middle terms $\beta_p$, $p \in [l-1]$, and a tail term $\gamma_{j_l,i_l}$.

It is easy to see that each of the head, middle and tail terms computes a summation over products of the weights along certain walks on the (undirected) graph associated with the network (see Figure 1). For instance, a head term
$$\alpha_{j_1,i_1}^2 = \sum_{i_0 \in [d_0]} \Big( \sum_{i_1', \ldots, i_{j_1-1}'} W_{j_1}(i_1, i_{j_1-1}') \cdots W_2(i_2', i_1')\, W_1(i_1', i_0) \Big)^2$$
is precisely the sum of the products of all weights along all walks from a node $i_0$ in the input layer to the node $i_1$ in layer $j_1$ and back to $i_0$, i.e., walks $i_0 \to i_1' \to \cdots \to i_{j_1-1}' \to i_1 \to i_{j_1-1}'' \to \cdots \to i_1'' \to i_0$. Similarly, a middle term $\beta_p^2$ is the sum of the products of the weights along walks from the pivot node $i_p$ in layer $j_p$ up to the pivot node $i_{p+1}$ in layer $j_{p+1}$ and back to $i_p$.

A few remarks are in order.

**Remark 2.2.** For $k = 1$, the explicit regularizer reduces to $R(W_2, W_1) = \lambda \sum_{i \in [d_1]} \|W_1(i,:)\|_C^2 \, \|W_2(:,i)\|^2$, which recovers the regularizer studied in the previous work of Cavazza et al. (2018) and Mianjy et al. (2018) in the settings of matrix factorization and single-hidden-layer linear networks, respectively.

**Remark 2.3.** All sub-regularizers, and consequently the explicit regularizer itself, are rescaling invariant. That is, for any given implementation $\{W_i\}$ and any sequence of scalars $\alpha_1, \ldots, \alpha_{k+1}$ such that $\prod_i \alpha_i = 1$, it holds that $R_l(\{W_i\}) = R_l(\{\alpha_i W_i\})$. In particular, $R_k$ equals
$$R_k(\{W_i\}) = \sum_{(i_k,\ldots,i_1) \in [d_k]\times\cdots\times[d_1]} \|W_1(i_1,:)\|^2 \, W_2(i_2,i_1)^2 \, W_3(i_3,i_2)^2 \cdots W_k(i_k,i_{k-1})^2 \, \|W_{k+1}(:,i_k)\|^2.$$
Note that $R_k(\{W_i\}) = \psi_2^2(W_{k+1}, \ldots, W_1)$, the $\ell_2$-path regularizer, which has been studied recently in (Neyshabur et al., 2015b) and (Neyshabur et al., 2015a).

*Figure 1.* Illustration of the explicit regularizer as given in Proposition 2.1 for $k = 9$, $l = 3$, $(i_1, i_2, i_3) = (2, 3, 2)$ and $(j_1, j_2, j_3) = (2, 5, 7)$; the figure marks the input layer, the three pivot layers, and the output layer. The head term $\alpha_{j_1,i_1}^2$ corresponds to the summation over the products of the weights on any pair of paths from an input node to the $i_1$-th node in the $j_1$-th hidden layer. Similarly, the tail term $\gamma_{j_l,i_l}^2$ accounts for the products of the weights along any pair of paths between the output and the $i_l$-th node in the $j_l$-th hidden layer. Each of the middle terms $\beta_p^2$ accumulates the products of the weights along pairs of paths between the $i_p$-th node in the $j_p$-th hidden layer and the $i_{p+1}$-th node in the $j_{p+1}$-th hidden layer.
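As a quick numerical illustration of Remark 2.2 (again illustrative only, not the authors' code; it assumes $C = I$, so $\|\cdot\|_C$ becomes the Euclidean norm), the following sketch compares the $k = 1$ closed form against a Monte Carlo estimate of the dropout variance $\mathbb{E}_{b}\,\|\widetilde{W}_{2:1} - W_2 W_1\|_F^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.7
lam = (1 - theta) / theta                     # lambda = (1 - theta) / theta
d0, d1, d2 = 6, 4, 3
W1, W2 = rng.standard_normal((d1, d0)), rng.standard_normal((d2, d1))

# Closed form of Remark 2.2 with C = I: R(W2, W1) = lam * sum_i ||W1(i,:)||^2 * ||W2(:,i)||^2
closed_form = lam * np.sum(np.sum(W1**2, axis=1) * np.sum(W2**2, axis=0))

# Monte Carlo estimate of E_b || (1/theta) W2 diag(b) W1 - W2 W1 ||_F^2  (Frobenius norm, since C = I)
M = W2 @ W1
mc = np.mean([
    np.linalg.norm(W2 @ (rng.binomial(1, theta, d1)[:, None] / theta * W1) - M, "fro") ** 2
    for _ in range(200_000)
])
print(closed_form, mc)    # the two agree up to Monte Carlo error
```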
### 2.2. The induced regularizer

In this section, we study the induced regularizer as given by the optimization problem in Equation (2). We show that the convex envelope of $\Theta$ factors into a product of two terms: a term that depends only on the network architecture and the dropout rate, and a term that depends only on the network map. These two factors are captured by the following definitions.

**Definition 2.4** (effective regularization parameter). For given $\{d_i\}$ and $\lambda$, we refer to the following quantity as the effective regularization parameter:
$$\nu_{\{d_i\}} := \sum_{l \in [k]} \lambda^l \sum_{(j_l, \ldots, j_1) \in \binom{[k]}{l}} \; \prod_{i \in [l]} \frac{1}{d_{j_i}}.$$
We drop the subscript $\{d_i\}$ whenever it is clear from the context.

The effective regularization parameter arises naturally when we lower-bound the explicit regularizer (see Equation (5)). It is only a function of the network architecture and the dropout rate and does not depend on the weights: it increases with the dropout rate and the depth of the network, but decreases with the width.

**Definition 2.5** (equalized network). A network implemented by $\{W_i\}_{i=1}^{k+1}$ is said to be equalized if $\|W_{k+1}\cdots W_1 C^{\frac{1}{2}}\|_*$ is equally distributed among all the summands in Proposition 2.1, i.e. for any $l \in [k]$, $(j_l, \ldots, j_1) \in \binom{[k]}{l}$, and $(i_l, \ldots, i_1) \in [d_{j_l}] \times \cdots \times [d_{j_1}]$, it holds that
$$\big|\alpha_{j_1,i_1}\,\beta_1\cdots\beta_{l-1}\,\gamma_{j_l,i_l}\big| = \frac{\|W_{k+1}\cdots W_1 C^{\frac{1}{2}}\|_*}{\prod_{i \in [l]} d_{j_i}}.$$

We are now ready to state the main result of this section. Recall that the convex envelope of a function is its largest convex under-estimator. We show that, irrespective of the architecture, the convex envelope of the induced regularizer is proportional to the squared nuclear norm of the network map times the principal root of the second moment.

**Theorem 2.6** (Convex Envelope). For any architecture $\{d_i\}$ and any network map $M \in \mathbb{R}^{d_{k+1}\times d_0}$ implementable by that architecture, it holds that
$$\Theta^{**}(M) = \nu_{\{d_i\}}\, \|MC^{\frac{1}{2}}\|_*^2.$$
Furthermore, $\Theta(M) = \Theta^{**}(M)$ if and only if the network is equalized.

This result is particularly interesting because it connects dropout, an algorithmic heuristic for avoiding overfitting, to nuclear norm regularization, a classical regularization method with strong theoretical foundations. We remark that a result similar to Theorem 2.6 was recently established for matrix factorization (Cavazza et al., 2018).
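To make Definition 2.4 and Theorem 2.6 concrete, the following sketch (illustrative only, not the authors' code; it assumes the expression for $\nu_{\{d_i\}}$ given in Definition 2.4) computes the effective regularization parameter for a given architecture and dropout rate, and evaluates the convex envelope $\Theta^{**}(M) = \nu_{\{d_i\}}\|MC^{1/2}\|_*^2$.

```python
import numpy as np
from itertools import combinations

def effective_reg_param(hidden_widths, theta):
    """nu_{d_i} = sum_{l=1}^k lambda^l * sum over size-l subsets J of hidden layers of prod_{j in J} 1/d_j."""
    lam = (1.0 - theta) / theta
    k = len(hidden_widths)
    nu = 0.0
    for l in range(1, k + 1):
        for J in combinations(hidden_widths, l):
            nu += lam ** l / np.prod(J)
    return nu

def convex_envelope(M, C, nu):
    """Theta**(M) = nu * ||M C^{1/2}||_*^2 (squared nuclear norm of the map times the principal root of C)."""
    w, V = np.linalg.eigh(C)                     # C is positive definite
    C_half = (V * np.sqrt(w)) @ V.T
    return nu * np.linalg.norm(M @ C_half, "nuc") ** 2

# nu increases with depth and with the dropout rate, and decreases with width:
print(effective_reg_param([5, 5], theta=0.9), effective_reg_param([5, 5, 5], theta=0.9))
print(effective_reg_param([5, 5], theta=0.5), effective_reg_param([50, 50], theta=0.5))

# Sanity check against the closed form nu_{d,d} = (2*lam*d + lam^2) / d^2 used in Section 3:
d, theta = 7, 0.8
lam = (1 - theta) / theta
assert np.isclose(effective_reg_param([d, d], theta), (2 * lam * d + lam ** 2) / d ** 2)
```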
### 2.3. Global optimality

Theorem 2.6 provides a necessary and sufficient condition under which the induced regularizer equals its convex envelope. If any network map can be implemented by an equalized network, then $\Theta(M) = \Theta^{**}(M) = \nu_{\{d_i\}}\|MC^{\frac{1}{2}}\|_*^2$, and the learning problem in Equation (3) is a convex program. In particular, for the case of linear networks with a single hidden layer, Mianjy et al. (2018) show that any network map can be implemented by an equalized network, which enables them to characterize the set of global optima under the additional generative assumption $y = Mx$. However, it is not clear if the same holds for general deep linear networks, since the regularizer here is more complex.

Nonetheless, the following result provides a sufficient condition under which the global optima of $L_\theta(\{W_i\})$ are completely characterized.

**Theorem 2.7.** Let $C_{yx} := \mathbb{E}[yx^\top]$ and $C := \mathbb{E}[xx^\top]$, and denote $\bar{M} := C_{yx} C^{-\frac{1}{2}}$. If $\sigma_1(\bar{M}) - \sigma_2(\bar{M}) \ge \frac{1}{\nu}\,\sigma_2(\bar{M})$, then $M^*$, the global optimum of Problem (3), is given by
$$W^*_{k+1:1} = \mathcal{S}_{\frac{\nu\sigma_1(\bar{M})}{1+\nu}}(\bar{M})\, C^{-\frac{1}{2}},$$
where $\mathcal{S}_\alpha(\bar{M})$ shrinks the spectrum of the matrix $\bar{M}$ by $\alpha$ and thresholds it at zero. Furthermore, it is possible to implement $M^*$ by an equalized network $\{W_i^*\}$, which is a global optimum of $L_\theta(\{W_i\})$.

The gap condition in the theorem above can always be satisfied (e.g. by increasing the dropout rate or the depth, or by decreasing the width) as long as there is a gap between the first and the second singular values of $\bar{M}$. Moreover, for the special case of deep linear networks with one output neuron (Ji & Telgarsky, 2018; Nacson et al., 2018), this condition is always satisfied, since $\bar{M} \in \mathbb{R}^{1\times d_0}$ and $\sigma_2(\bar{M}) = 0$.

**Corollary 2.8.** Consider the class of deep linear networks with a single output neuron. Let $\{W_i^*\}$ be a minimizer of $L_\theta$. For any architecture $\{d_i\}$ and any network map $W_{k+1:1}$, it holds that: (1) $\Theta(W_{k+1:1}) = \nu\,\|W_{k+1:1}\|_C^2$; (2) $W^*_{k+1:1} = \frac{1}{1+\nu}\,C_{yx}C^{-1}$; (3) the network specified by $\{W_i^*\}$ is equalized.

We conclude this section with a remark. We know from the early work of Srivastava et al. (2014) that feature dropout in linear regression is closely related to ridge regression. Corollary 2.8 generalizes the results of Srivastava et al. (2014) to deep linear networks, and establishes a similar connection between dropout training and ridge regression.

## 3. Proof Ideas

Here, we sketch the proofs of the main results from Section 2; complete proofs are deferred to the supplementary.

**Sketch of the proof of Theorem 2.6.** The key steps are:

1. First, in Lemma 3.1, we show that for any set of weights $\{W_i\}$, it holds that $R(\{W_i\}) \ge \nu_{\{d_i\}}\|W_{k+1:1}C^{\frac{1}{2}}\|_*^2$. In particular, this implies that $\Theta(M) \ge \nu_{\{d_i\}}\|MC^{\frac{1}{2}}\|_*^2$ holds for any $M$.
2. Next, in Lemma 3.2, we show that $\Theta^{**}(M) \le \nu_{\{d_i\}}\|MC^{\frac{1}{2}}\|_*^2$ holds for all $M$.
3. The claim is implied by Lemmas 3.1 and 3.2, and the fact that $\|\cdot\|_*^2$ is a convex function.

**Lemma 3.1.** Let $\{W_i\}$ be an arbitrary set of weights. The explicit regularizer $R(\{W_i\})$ satisfies $R(\{W_i\}) \ge \nu_{\{d_i\}}\|W_{k+1}W_k\cdots W_1 C^{\frac{1}{2}}\|_*^2$, and equality holds if and only if the network is equalized.

We sketch the proof for isotropic distributions ($C = I$), and emphasize the role of equalization and of the effective regularization parameter.

**Sketch of the proof of Lemma 3.1.** We show that each term in the explicit regularizer is lower bounded by some multiple of the square of the nuclear norm of the linear map implemented by the network. We begin by lower bounding a particular summand in the expansion of $R_l(\{W_i\})$ given in Proposition 2.1, with pivot layers $j_1, \ldots, j_l$:
$$R_{\{j_l,\ldots,j_1\}}(\{W_i\}) = \sum_{(i_l,\ldots,i_1)} \alpha_{j_1,i_1}^2 \prod_{p=1}^{l-1}\beta_p^2 \; \gamma_{j_l,i_l}^2 \;\ge\; \frac{1}{\prod_{i\in[l]} d_{j_i}}\Big(\sum_{(i_l,\ldots,i_1)} |\alpha_{j_1,i_1}\beta_1\cdots\beta_{l-1}\gamma_{j_l,i_l}|\Big)^2,$$
where the inequality follows from the Cauchy-Schwarz inequality and holds with equality if and only if all the summands in $\phi := \sum_{(i_l,\ldots,i_1)} |\alpha_{j_1,i_1}\beta_1\cdots\beta_{l-1}\gamma_{j_l,i_l}|$ are equal, over all $(i_l, \ldots, i_1) \in [d_{j_l}]\times\cdots\times[d_{j_1}]$. At the same time, we have that
$$\phi = \sum_{(i_l,\ldots,i_1)} \|W_{k+1:j_l+1}(:,i_l)\|\;|\beta_1\cdots\beta_{l-1}|\;\|W_{j_1:1}(i_1,:)\| \;\overset{(a)}{=}\; \sum_{(i_l,\ldots,i_1)} \big\|W_{k+1:j_l+1}(:,i_l)\,\beta_1\cdots\beta_{l-1}\,W_{j_1:1}(i_1,:)^\top\big\|_* \;\ge\; \Big\|\sum_{(i_l,\ldots,i_1)} W_{k+1:j_l+1}(:,i_l)\,\beta_1\cdots\beta_{l-1}\,W_{j_1:1}(i_1,:)^\top\Big\|_* \;\overset{(b)}{=}\; \|W_{k+1}\cdots W_1\|_*,$$
where (a) follows since the outer product inside the nuclear norm has rank equal to one, the inequality is due to the triangle inequality for matrix norms, and (b) follows since the sum of these rank-one outer products telescopes to the full product $W_{k+1}\cdots W_1$. In fact, the inequalities above hold with equality if each of the summands in (a) equals
$$|\alpha_{j_1,i_1}\beta_1\cdots\beta_{l-1}\gamma_{j_l,i_l}| = \frac{\|W_{k+1}\cdots W_1\|_*}{\prod_{i\in[l]} d_{j_i}}$$
for all $(i_l, \ldots, i_1) \in [d_{j_l}]\times\cdots\times[d_{j_1}]$.
Each of the sub-regularizers can then be lower bounded by noting that $R_l(\{W_i\}) = \sum_{(j_l,\ldots,j_1)} R_{\{j_l,\ldots,j_1\}}(\{W_i\})$.

Lemma 3.1 is central to our analysis for two reasons. First, it gives a sufficient and necessary condition for the induced regularizer to equal the square of the nuclear norm of the network map; this also motivates the concept of equalized networks in Definition 2.5. We note that for the special case of single-hidden-layer linear networks, i.e., for $k = 1$, this lower bound can always be achieved (Mianjy et al., 2018); it remains to be seen whether the lower bound can be achieved for deeper networks. Second, summing over $(j_l, \ldots, j_1) \in \binom{[k]}{l}$, we conclude that
$$R_l(\{W_i\}) \;\ge\; \|W_{k+1:1}\|_*^2 \sum_{(j_l,\ldots,j_1)\in\binom{[k]}{l}} \prod_{i\in[l]}\frac{1}{d_{j_i}} \;=:\; \mathrm{LB}_l(\{W_i\}). \tag{5}$$
The right-hand side above is the lower bound for the $l$-th sub-regularizer, denoted by $\mathrm{LB}_l$. Summing over $l \in [k]$, we get the following lower bound on the explicit regularizer:
$$R(\{W_i\}) \;\ge\; \|W_{k+1:1}\|_*^2 \; \underbrace{\sum_{l\in[k]} \lambda^l \sum_{(j_l,\ldots,j_1)\in\binom{[k]}{l}} \prod_{i\in[l]}\frac{1}{d_{j_i}}}_{\nu_{\{d_i\}}},$$
which motivates the notion of the effective regularization parameter in Definition 2.4. As an immediate corollary of Lemma 3.1, for any matrix $M$ we have that $\Theta(M) \ge \nu_{\{d_i\}}\|M\|_*^2$.

We now focus on the biconjugate of the induced regularizer, and show that it is upper bounded by the same function, i.e. the effective regularization parameter times the square of the nuclear norm of the network map.

**Lemma 3.2.** For any architecture $\{d_i\}$ and any network map $M$, it holds that $\Theta^{**}(M) \le \nu_{\{d_i\}}\|MC^{\frac{1}{2}}\|_*^2$.

To convey the main ideas of the proof, here we include a sketch for the simple case of $k = 2$, $d_1 = d_2 = d$ under the isotropic assumption ($C = I$); for the general case, please refer to the appendix.

**Sketch of the proof of Lemma 3.2.** First, the induced regularizer is non-negative, so the domain of $\Theta^*$ is $\mathbb{R}^{d_{k+1}\times d_0}$. The Fenchel dual of the induced regularizer $\Theta(\cdot)$ is given by
$$\Theta^*(M) = \max_P\;\langle M, P\rangle - \Theta(P) = \max_P\;\langle M, P\rangle - \min_{W_3W_2W_1 = P} R(W_3, W_2, W_1) = \max_{W_3,W_2,W_1}\;\langle M, W_{3:1}\rangle - R(W_3, W_2, W_1). \tag{7}$$
Denote the objective in (7) by $\Phi(W_3, W_2, W_1) := \langle M, W_{3:1}\rangle - R(W_3, W_2, W_1)$. Let $(u_1, v_1)$ be the top singular vectors of $M$. For any $\alpha \in \mathbb{R}$, consider the following assignments to the weights: $W_1^\alpha = \alpha\,\mathbf{1}_d v_1^\top$, $W_2^\alpha = \mathbf{1}_d\mathbf{1}_d^\top$ and $W_3^\alpha = u_1\mathbf{1}_d^\top$. Note that
$$\max_{W_3,W_2,W_1}\Phi(W_3, W_2, W_1) \;\ge\; \max_\alpha \Phi(W_3^\alpha, W_2^\alpha, W_1^\alpha).$$
We can express the objective on the right-hand side merely in terms of $\alpha$, $d$ and $\|M\|_2$ as follows:
$$R(W_3^\alpha, W_2^\alpha, W_1^\alpha) = \lambda\Big(\sum_i \|W_1^\alpha(i,:)\|^2\|W_{3:2}^\alpha(:,i)\|^2 + \sum_i \|W_{2:1}^\alpha(i,:)\|^2\|W_3^\alpha(:,i)\|^2\Big) + \lambda^2\sum_{i,j}\|W_1^\alpha(i,:)\|^2\,W_2^\alpha(j,i)^2\,\|W_3^\alpha(:,j)\|^2 = 2\lambda\alpha^2 d^3 + \lambda^2\alpha^2 d^2.$$
Similarly, the inner product $\langle M, W_{3:1}^\alpha\rangle$ reduces to
$$\langle M, W_{3:1}^\alpha\rangle = \sum_{i,j=1}^{d}\big\langle M,\; W_3^\alpha(:,j)\,W_2^\alpha(j,i)\,W_1^\alpha(i,:)^\top\big\rangle = \sum_{i,j=1}^{d}\alpha\, u_1^\top M v_1 = \alpha d^2\|M\|_2.$$
Plugging the above equalities back into $\Phi$, we get
$$\Phi(W_3^\alpha, W_2^\alpha, W_1^\alpha) = \alpha d^2\|M\|_2 - 2\lambda\alpha^2 d^3 - \lambda^2\alpha^2 d^2.$$
Maximizing the right-hand side above with respect to $\alpha$, we obtain
$$\Theta^*(M) \;\ge\; \max_\alpha \Phi(W_3^\alpha, W_2^\alpha, W_1^\alpha) = \frac{d^2}{4(2\lambda d + \lambda^2)}\,\|M\|_2^2.$$
Since the Fenchel conjugate is order reversing, we get
$$\Theta^{**}(M) \;\le\; \frac{2\lambda d + \lambda^2}{d^2}\,\|M\|_*^2 = \nu_{\{d,d\}}\|M\|_*^2, \tag{8}$$
where we used the basic duality between the spectral norm and the nuclear norm. Lemma 3.1 implies that the biconjugate $\Theta^{**}(M)$ is lower bounded by $\nu_{\{d,d\}}\|M\|_*^2$, which is a convex function. On the other hand, inequality (8) shows that the biconjugate is upper bounded by $\nu_{\{d,d\}}\|M\|_*^2$. Since the square of the nuclear norm is a convex function, and $\Theta^{**}(\cdot)$ is the largest convex function that lower bounds $\Theta(\cdot)$, we conclude that $\Theta^{**}(M) = \nu_{\{d,d\}}\|M\|_*^2$.
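For completeness, the maximization over $\alpha$ invoked above is an elementary concave quadratic maximization; spelling it out under the same assumptions as the sketch ($k = 2$, $d_1 = d_2 = d$, $C = I$):

```latex
% Maximizing \Phi(W_3^\alpha, W_2^\alpha, W_1^\alpha) = a\alpha - b\alpha^2 with
% a = d^2\|M\|_2 and b = 2\lambda d^3 + \lambda^2 d^2:
\[
  \alpha^\star = \frac{a}{2b} = \frac{d^2\|M\|_2}{2\,(2\lambda d^3 + \lambda^2 d^2)},
  \qquad
  \max_{\alpha\in\mathbb{R}} \Phi(W_3^\alpha, W_2^\alpha, W_1^\alpha)
    = \frac{a^2}{4b}
    = \frac{d^4\|M\|_2^2}{4\,(2\lambda d^3 + \lambda^2 d^2)}
    = \frac{d^2}{4\,(2\lambda d + \lambda^2)}\,\|M\|_2^2 .
\]
```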
**Sketch of the proof of Theorem 2.7.** In light of Theorem 2.6, if the optimal network map $W^*_{k+1:1}$, i.e. the optimum of the problem in Equation (3), can be implemented by an equalized network, then $\Theta(W^*_{k+1:1}) = \Theta^{**}(W^*_{k+1:1}) = \nu_{\{d_i\}}\|W^*_{k+1:1}C^{\frac{1}{2}}\|_*^2$. Thus, the learning problem essentially boils down to the following convex program:
$$\min_W \; \mathbb{E}_{x,y}\big[\|y - Wx\|^2\big] + \nu_{\{d_i\}}\,\|WC^{\frac{1}{2}}\|_*^2. \tag{9}$$
Following the previous work of (Cavazza et al., 2018; Mianjy et al., 2018), we show that the solution to problem (9) is given by $W^* = \mathcal{S}_{\alpha_\rho}(C_{yx}C^{-\frac{1}{2}})\,C^{-\frac{1}{2}}$, where $\alpha_\rho := \frac{\nu\sum_{i=1}^{\rho}\sigma_i(C_{yx}C^{-\frac{1}{2}})}{1+\rho\nu}$, $\rho$ is the rank of $W^*$, and $\mathcal{S}_{\alpha_\rho}(M)$ shrinks the spectrum of the input matrix $M$ by $\alpha_\rho$ and thresholds it at zero. However, as mentioned above, it is not clear whether every network map can be implemented by an equalized network. Nonetheless, it is easy to see that the equalization property is satisfied for rank-one network maps.

**Proposition 3.3.** Let $\{d_i\}_{i=0}^{k+1}$ be an architecture and $M \in \mathbb{R}^{d_{k+1}\times d_0}$ be a rank-one network map. Then, there exists a set of weights $\{W_i\}_{i=1}^{k+1}$ that implements $M$ and is equalized.

For example, for deep networks with a single output neuron, the weights $W_1 = \frac{1}{d_1}\mathbf{1}_{d_1}w^\top$ and $W_i = \frac{1}{d_i}\mathbf{1}_{d_i}\mathbf{1}_{d_{i-1}}^\top$ for $i \neq 1$ implement the network map $w^\top$ and are equalized.

Denote $\bar{M} := C_{yx}C^{-\frac{1}{2}}$. Equipped with Proposition 3.3, the key here is to ensure that $\mathcal{S}_\alpha(\bar{M})$ has rank equal to one. In this case, $W^*$ will also have rank at most one and can be implemented by a network that is equalized. To that end, it suffices to have $\alpha = \frac{\nu\sigma_1(\bar{M})}{1+\nu} \ge \sigma_2(\bar{M})$, which is equivalent to $\sigma_1(\bar{M}) - \sigma_2(\bar{M}) \ge \frac{1}{\nu}\sigma_2(\bar{M})$; this gives the sufficient condition in Theorem 2.7.

## 4. Experimental Results

Dropout is widely used for training modern deep learning architectures, resulting in state-of-the-art performance in numerous machine learning tasks (Srivastava et al., 2014; Krizhevsky et al., 2012; Szegedy et al., 2015; Toshev & Szegedy, 2014). The purpose of this section is not to make a case for (or against) dropout when training deep networks, but rather to verify empirically the theoretical results from the previous sections.¹

¹The code for the experiments can be found at: https://github.com/r3831/dln_dropout

For simplicity, the training data $\{x_i\}$ are sampled from a standard Gaussian distribution, which in particular ensures that $C = I$. The labels $\{y_i\}$ are generated as $y_i = Nx_i$, where $N \in \mathbb{R}^{d_{k+1}\times d_0}$. $N$ is composed of $UV^\top$ plus noise, where $U \in \mathbb{R}^{d_{k+1}\times r}$ and $V \in \mathbb{R}^{d_0\times r}$ are sampled from a standard Gaussian, and the entries of the noise are sampled independently from a Gaussian distribution with small standard deviation. At each step of dropout training, we use a minibatch of size 1000 to train the network. The learning rate is tuned over the set $\{1, 0.1, 0.01\}$. All experiments are repeated 50 times; the curves correspond to the average of the runs, and the grey region shows the standard deviation.

*Figure 2.* Distribution of the singular values of the trained network for different values of the dropout rate $1-\theta$ (the figure plots the singular values, by index, for $\theta \in \{1.00, 0.95, 0.90, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15\}$ alongside those of the true model). It can be seen that dropout training performs a more sophisticated form of shrinkage and thresholding on the spectrum of the model matrix $\bar{M}$.

The experiments are organized as follows. First, since the convex envelope of the induced regularizer equals the squared nuclear norm of the network map (Theorem 2.6), it is natural to expect that dropout training performs a shrinkage-thresholding on the spectrum of $C_{yx}C^{-\frac{1}{2}} = \bar{M}$ (see Lemma A.4 in the appendix). We verify this experimentally in Section 4.1. Second, in Section 4.2, we focus on the equalization property. We corroborate Theorem 2.7 by showing that dropout training equalizes deep networks with one output neuron.
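Before turning to the experiments, here is a minimal sketch of the shrinkage-thresholding operator $\mathcal{S}_\alpha$ from Theorem 2.7 (illustrative only, not the authors' code; the matrix, its singular values, and the value of $\nu$ are made up, and $C = I$ is assumed so that $\bar{M} = C_{yx}$).

```python
import numpy as np

def shrink_threshold(M, alpha):
    """S_alpha(M): shrink the singular values of M by alpha and threshold them at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - alpha, 0.0)) @ Vt

# Illustrative instance: a matrix with singular values (3, 1, 0.4) and an assumed effective
# regularization parameter nu large enough that the gap condition of Theorem 2.7 holds.
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((6, 3)))
V, _ = np.linalg.qr(rng.standard_normal((5, 3)))
M_bar = U @ np.diag([3.0, 1.0, 0.4]) @ V.T

nu = 0.6                                   # assumed (hypothetical) effective regularization parameter
assert 3.0 - 1.0 >= 1.0 / nu * 1.0         # gap condition: sigma1 - sigma2 >= sigma2 / nu
W_star = shrink_threshold(M_bar, nu * 3.0 / (1 + nu))   # S_{nu*sigma1/(1+nu)}(M_bar), with C = I
print(np.linalg.matrix_rank(W_star))       # 1: all but the top singular value are thresholded away
print(np.linalg.svd(W_star, compute_uv=False))
```

With a smaller $\nu$ (smaller dropout rate, shallower or wider network), the threshold $\nu\sigma_1/(1+\nu)$ falls below $\sigma_2$ and more of the spectrum survives, which is the qualitative behavior probed in Figure 2.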
### 4.1. Spectral shrinkage and rank control

Note that the induced regularizer $\Theta(M)$ is a spectral function (see Lemma A.2 in the appendix). On the other hand, by Theorem 2.6, $\Theta^{**}(M) = \nu_{\{d_i\}}\|M\|_*^2$ (recall that $C = I$ here). Therefore, if dropout training succeeds in finding an (approximate) minimizer of $L_\theta$, it minimizes an upper bound on the squared nuclear norm of the network map. Hence, it is natural to expect that dropout training performs a shrinkage-thresholding on the spectrum of the model, much like nuclear norm regularization.

Figure 2 confirms this intuition. Here, we plot the singular value distribution of the final network map trained by dropout, for different values of the dropout rate. As can be seen in the figure, dropout training indeed shrinks the spectrum of the model and thresholds it at zero. However, unlike nuclear norm regularization, the shrinkage is not uniform across the singular values that are not thresholded. Moreover, note that the shrinkage parameter in Theorem 2.7 is governed by the effective regularization parameter $\nu_{\{d_i\}}$, which strictly increases with the dropout rate. This suggests that as we increase the dropout rate (decrease $\theta$), the spectrum should be shrunk more severely, and the resulting network map should have a smaller rank. This is indeed the case, as can be seen in Figure 2.

### 4.2. Convergence to equalized networks

One of the key concepts behind our analysis is the notion of equalized networks. In particular, in Lemma 3.1 we see that if a network map can be implemented by an equalized network, then there is no gap between the induced regularizer and its convex envelope. It is natural to ask whether dropout training indeed finds such equalized networks. As we discuss below, Figure 3 answers this question affirmatively.

Recall that a network is equalized if and only if each and every sub-regularizer achieves its respective lower bound in Equation (5), i.e. $R_\ell(\{W_i\}) = \mathrm{LB}_\ell(\{W_i\})$ for all $\ell \in [k]$. Figure 3 illustrates that dropout training consistently decreases the gap between the sub-regularizers and their respective lower bounds. Here, the network has one output neuron, five hidden layers each of width 5, and input dimensionality $d_0 = 5$. In Figure 3 we plot the normalized equalization gap $r_\ell^{(t)} := \frac{R_\ell(\{W_i^{(t)}\})}{\mathrm{LB}_\ell(\{W_i^{(t)}\})} - 1$ of the network under dropout training as a function of the iteration number. Similarly, we define the normalized equalization gap for the explicit regularizer, $r^{(t)} := \frac{R(\{W_i^{(t)}\})}{\Theta^{**}(W_{k+1:1}^{(t)})} - 1$.

*Figure 3.* The normalized equalization gaps $r_1^{(t)}, \ldots, r_5^{(t)}$ and $r^{(t)}$, which capture the gap between the sub-regularizers and their respective lower bounds, plotted as a function of the number of training iterations ($10^2$ to $10^4$). Dropout converges to the set of equalized networks.

The network quickly becomes (approximately) equalized, and thereafter the trajectory of dropout training stays close to the set of equalized networks. We believe that this observation can be helpful in analyzing the dynamics of dropout training, which we leave for future work.
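The quantity plotted in Figure 3 can also be sketched directly for the single-hidden-layer case (illustrative only; it assumes $C = I$ and the $k = 1$ closed form from Remark 2.2, for which $\nu_{\{d_1\}} = \lambda/d_1$; the weights here are random rather than trained, so the gap is far from zero).

```python
import numpy as np

def equalization_gap(W1, W2, theta):
    """Normalized gap r = R(W2, W1) / (nu * ||W2 W1||_*^2) - 1 for k = 1 and C = I."""
    lam = (1 - theta) / theta
    d1 = W1.shape[0]
    R = lam * np.sum(np.sum(W1**2, axis=1) * np.sum(W2**2, axis=0))   # explicit regularizer
    envelope = (lam / d1) * np.linalg.norm(W2 @ W1, "nuc") ** 2       # nu_{d1} * ||W2 W1||_*^2
    return R / envelope - 1

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 6)), rng.standard_normal((3, 4))
print(equalization_gap(W1, W2, theta=0.8))        # > 0 for a generic (non-equalized) network

# One equalized implementation of a rank-one map w v^T achieves gap 0 (cf. Proposition 3.3):
w, v = rng.standard_normal(3), rng.standard_normal(6)
W1_eq, W2_eq = np.ones((4, 1)) @ v[None, :] / 4, w[:, None] @ np.ones((1, 4))
print(np.allclose(W2_eq @ W1_eq, np.outer(w, v)), equalization_gap(W1_eq, W2_eq, theta=0.8))
```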
## 5. Discussion

Motivated by the empirical success of dropout (Srivastava et al., 2014; Krizhevsky et al., 2012), there have been several studies in recent years aimed at understanding its theoretical foundations (Baldi & Sadowski, 2013; Wager et al., 2013; 2014; Van Erven et al., 2014; Helmbold & Long, 2015; Gal & Ghahramani, 2016; Gao & Zhou, 2016; Helmbold & Long, 2017; Mou et al., 2018; Bank & Giryes, 2018). The previous works of Zhai & Zhang (2015), He et al. (2016), Cavazza et al. (2018) and Mianjy et al. (2018) study dropout training with the $\ell_2$ loss in matrix factorization and in shallow linear networks. The works most relevant to ours are those of Cavazza et al. (2018) and Mianjy et al. (2018), whose results are extended to the case of deep linear networks in this paper. In particular, we derive the explicit regularizer induced by dropout, which turns out to be composed of the $\ell_2$-path regularizer and other rescaling-invariant regularizers. Furthermore, we show that the convex envelope of the induced regularizer factors into an effective regularization parameter and the squared nuclear norm of the network map multiplied by the principal root of the second moment of the input distribution. We further highlight equalization as a key network property under which the induced regularizer equals its convex envelope. We specify a subclass of problems satisfying the equalization property, for which we completely characterize the optimal networks that dropout training is biased towards.

Our work suggests several interesting directions for future research. First, given the connections that we establish with the nuclear norm and the $\ell_2$-path regularizer, it is natural to ask what role the dropout regularizer plays in generalization. Second, how does the dropout regularizer change for neural networks with non-linear activation functions? Finally, it is important to understand dropout in networks trained with other loss functions, especially those that are popular for various classification tasks.

## Acknowledgements

This research was supported in part by NSF BIGDATA grant IIS-1546482.

## References

Arora, S., Cohen, N., Golowich, N., and Hu, W. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.

Azizan, N. and Hassibi, B. Stochastic gradient/mirror descent: Minimax optimality and implicit regularization. In International Conference on Learning Representations (ICLR), 2019.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Baldi, P. and Sadowski, P. J. Understanding dropout. In Advances in Neural Information Processing Systems, 2013.

Bank, D. and Giryes, R. On the relationship between dropout and equiangular tight frames. arXiv preprint arXiv:1810.06049, 2018.

Bartlett, P., Helmbold, D., and Long, P. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In ICML, 2018.

Cavazza, J., Haeffele, B. D., Lane, C., Morerio, P., Murino, V., and Vidal, R. Dropout as a low-rank regularizer for matrix factorization. In Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 2018.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Int. Conf. Machine Learning (ICML), 2016.

Gao, W. and Zhou, Z.-H. Dropout Rademacher complexity of deep neural networks. Science China Information Sciences, 59(7):072104, 2016.
Goodfellow, I., Bengio, Y., Courville, A., and Bengio, Y. Deep Learning, volume 1. MIT Press, Cambridge, 2016.

Gunasekar, S., Woodworth, B. E., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pp. 6151–6159, 2017.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Characterizing implicit bias in terms of optimization geometry. arXiv preprint arXiv:1802.08246, 2018a.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018b.

Hardt, M. and Ma, T. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.

He, Z., Liu, J., Liu, C., Wang, Y., Yin, A., and Huang, Y. Dropout non-negative matrix factorization for independent feature learning. In Int. Conf. on Computer Proc. of Oriental Languages. Springer, 2016.

Helmbold, D. P. and Long, P. M. On the inductive bias of dropout. Journal of Machine Learning Research (JMLR), 16:3403–3454, 2015.

Helmbold, D. P. and Long, P. M. Surprising properties of dropout in deep networks. The Journal of Machine Learning Research, 18(1):7284–7311, 2017.

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Ji, Z. and Telgarsky, M. Gradient descent aligns the layers of deep linear networks. arXiv preprint arXiv:1810.02032, 2018.

Kawaguchi, K. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lathuilière, S., Mesejo, P., Alameda-Pineda, X., and Horaud, R. A comprehensive analysis of deep regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.

Laurent, T. and Brecht, J. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pp. 2908–2913, 2018.

Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2–47, 2018.

Martin, C. H. and Mahoney, M. W. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. arXiv preprint arXiv:1810.01075, 2018.

Mianjy, P., Arora, R., and Vidal, R. On the implicit bias of dropout. In International Conference on Machine Learning, pp. 3537–3545, 2018.

Mou, W., Zhou, Y., Gao, J., and Wang, L. Dropout training, data-dependent regularization, and generalization bounds. In International Conference on Machine Learning, pp. 3642–3650, 2018.

Nacson, M. S., Lee, J., Gunasekar, S., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. arXiv preprint arXiv:1803.01905, 2018.

Neyshabur, B., Tomioka, R., and Srebro, N. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.

Neyshabur, B., Salakhutdinov, R. R., and Srebro, N. Path-SGD: Path-normalized optimization in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2422–2430, 2015a.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401, 2015b.
Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120, 2013.

Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15(1), 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9, 2015.

Toshev, A. and Szegedy, C. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660, 2014.

Van Erven, T., Kotłowski, W., and Warmuth, M. K. Follow the leader with dropout perturbations. In Conference on Learning Theory, pp. 949–974, 2014.

Wager, S., Wang, S., and Liang, P. S. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, 2013.

Wager, S., Fithian, W., Wang, S., and Liang, P. S. Altitude training: Strong bounds for single-layer dropout. In Advances in Neural Information Processing Systems, 2014.

Zhai, S. and Zhang, Z. Dropout training of matrix factorization and autoencoder for link prediction in sparse graphs. In ICDM, pp. 451–459, 2015.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.