# Implicit Regularization in Deep Matrix Factorization

Sanjeev Arora (Princeton University and Institute for Advanced Study, arora@cs.princeton.edu), Nadav Cohen (Tel Aviv University, cohennadav@cs.tau.ac.il), Wei Hu (Princeton University, huwei@cs.princeton.edu), Yuping Luo (Princeton University, yupingl@cs.princeton.edu)

Efforts to understand the generalization mystery in deep learning have led to the belief that gradient-based optimization induces a form of implicit regularization, a bias towards models of low complexity. We study the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Our first finding, supported by theory and experiments, is that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions, oftentimes leading to more accurate recovery. Secondly, we present theoretical and empirical arguments questioning a nascent view by which implicit regularization in matrix factorization can be captured using simple mathematical norms. Our results point to the possibility that the language of standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization.

1 Introduction

It is a mystery how deep neural networks generalize despite having far more learnable parameters than training examples. Explicit regularization techniques alone cannot account for this generalization, as they do not prevent the networks from being able to fit random data (see [52]). A view by which gradient-based optimization induces an implicit regularization has thus arisen. Of course, this view would be uninsightful if implicit regularization were treated as synonymous with promoting generalization; the question is whether we can characterize the implicit regularization independently of any validation data. Notably, the mere use of the term "regularization" already predisposes us towards characterizations based on known explicit regularizers (e.g. a constraint on some norm of the parameters), but one must also be open to the possibility that something else is afoot.

An old argument (cf. [25, 29]) traces implicit regularization in deep learning to beneficial effects of noise introduced by small-batch stochastic optimization. The feeling is that solutions that do not generalize correspond to sharp minima, and added noise prevents convergence to such solutions. However, recent evidence (e.g. [26, 51]) suggests that deterministic (or near-deterministic) gradient-based algorithms can also generalize, and thus a different explanation is in order. A major hurdle in this study is that implicit regularization in deep learning seems to kick in only with certain types of data (not with random data, for example), and we lack mathematical tools for reasoning about real-life data. Thus one needs a simple test-bed for the investigation, where data admits a crisp mathematical formulation.

Following earlier works, we focus on the problem of matrix completion: given a randomly chosen subset of entries from an unknown matrix $W^*$, the task is to recover the unseen entries. To cast this as a prediction problem, we may view each entry in $W^*$ as a data point: observed entries constitute the training set, and the average reconstruction error over the unobserved entries is the test error, quantifying generalization.
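To make the prediction-problem framing concrete, the short NumPy sketch below builds a low-rank ground truth, samples a random subset of entries as a training set, and measures test error on the unobserved entries. It is only an illustration of the setup described above; the dimension, rank and number of observations are arbitrary choices, not those of our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, rank, num_observed = 100, 5, 2000  # illustrative sizes

# Low-rank ground truth W* and a random set of observed entries (the "training set").
W_star = rng.standard_normal((d, rank)) @ rng.standard_normal((rank, d))
mask = np.zeros(d * d, dtype=bool)
mask[rng.choice(d * d, size=num_observed, replace=False)] = True
mask = mask.reshape(d, d)

def train_loss(W):
    # Fit is measured only on the observed entries.
    return np.mean((W[mask] - W_star[mask]) ** 2)

def test_error(W):
    # Generalization = average reconstruction error on the unobserved entries.
    return np.mean((W[~mask] - W_star[~mask]) ** 2)
```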
Fitting the observed entries is obviously an underdetermined problem with multiple solutions. However, an extensive body of work (see [11] for a survey) has shown that if $W^*$ is low-rank, certain technical assumptions (e.g. incoherence) are satisfied, and sufficiently many entries are observed, then various algorithms can achieve approximate or even exact recovery. Of these, a well-known method based upon convex optimization finds the minimal nuclear norm matrix among those fitting all observed entries (see [9]); recall that the nuclear norm (also known as trace norm) of a matrix is the sum of its singular values, regarded as a convex relaxation of rank.

Figure 1: Matrix completion via gradient descent over deep matrix factorizations. The left (respectively, right) plot shows reconstruction errors for matrix factorizations of depths 2, 3 and 4, when applied to the completion of a random rank-5 (respectively, rank-10) matrix of size 100 × 100. The x-axis stands for the number of observed entries (randomly chosen), the y-axis represents reconstruction error, and error bars (indiscernible) mark standard deviations of the results over multiple trials. All matrix factorizations are full-dimensional, i.e. have hidden dimensions 100. Both the learning rate and the standard deviation of the (random, zero-centered) initialization for gradient descent were set to the small value $10^{-3}$. Notice that with few observed entries, factorizations of depths 3 and 4 significantly outperform that of depth 2, whereas with more entries all factorizations perform well. For further details, and a similar experiment on matrix sensing tasks, see Appendix E.

One may try to solve matrix completion using shallow neural networks. A natural approach, matrix factorization, boils down to parameterizing the solution as a product of two matrices, $W = W_2 W_1$, and optimizing the resulting (non-convex) objective for fitting the observed entries. Formally, this can be viewed as training a depth-2 linear neural network. It is possible to explicitly constrain the rank of the produced solution by limiting the shared dimension of $W_1$ and $W_2$. However, in practice, even when the rank is unconstrained, running gradient descent with small learning rate (step size) and initialization close to zero tends to produce low-rank solutions, and thus allows accurate recovery if $W^*$ is low-rank. This empirical observation led Gunasekar et al. to conjecture in [20] that gradient descent over a matrix factorization induces an implicit regularization minimizing nuclear norm:

Conjecture 1 (from [20], informally stated). With small enough learning rate and initialization close enough to the origin, gradient descent on a full-dimensional matrix factorization converges to the minimum nuclear norm solution.

Deep matrix factorization. Since standard matrix factorization can be viewed as a two-layer neural network (with linear activations), a natural extension is to consider deeper models. A deep matrix factorization of $W \in \mathbb{R}^{d,d'}$, with hidden dimensions $d_1, \dots, d_{N-1} \in \mathbb{N}$, is the parameterization:
$$W = W_N W_{N-1} \cdots W_1\,, \qquad (1)$$
where $W_j \in \mathbb{R}^{d_j, d_{j-1}}$, $j = 1, \dots, N$, with $d_N := d$, $d_0 := d'$. $N$ is referred to as the depth of the factorization, the matrices $W_1, \dots, W_N$ as its factors, and the resulting $W$ as the product matrix. (Note that the literature includes various usages of the term "deep matrix factorization": some in line with ours, e.g. [47, 53, 33], while others less so, e.g. [50, 16, 49].) Could the implicit regularization of deep matrix factorizations be stronger than that of their shallow counterpart (which Conjecture 1 equates with nuclear norm minimization)?
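For concreteness, here is a minimal PyTorch sketch of the procedure in question: gradient descent on a full-dimensional depth-$N$ factorization, fitting only the observed entries, with small learning rate and near-zero random initialization. It is a simplified stand-in for our experimental code; the mask, targets and hyperparameters below are illustrative placeholders.

```python
import torch

d, depth = 100, 3                        # full-dimensional factorization of depth N = 3
lr, init_std, steps = 1e-3, 1e-3, 20000  # small step size and near-zero init (illustrative)

# Observed entries: a boolean mask and target values (here, a random rank-5 ground truth).
mask = torch.rand(d, d) < 0.2
targets = torch.randn(d, 5) @ torch.randn(5, d)

# Factors W_1, ..., W_N of Equation (1), initialized near zero.
factors = [(torch.randn(d, d) * init_std).requires_grad_(True) for _ in range(depth)]

opt = torch.optim.SGD(factors, lr=lr)
for _ in range(steps):
    W = factors[0]
    for W_j in factors[1:]:
        W = W_j @ W                      # product matrix W = W_N ... W_2 W_1
    loss = ((W - targets)[mask] ** 2).mean()  # fit only the observed entries
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Note that no rank constraint or explicit regularizer appears anywhere in the objective; smaller learning rate and initialization bring the procedure closer to the gradient flow analyzed later in the paper.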
Experiments reported in Figure 1 suggest that this is indeed the case: depth leads to more accurate completion of a low-rank matrix when the number of observed entries is small. Our purpose in the current paper is to mathematically analyze this stronger form of implicit regularization. Can it be described by a matrix norm (or quasi-norm), continuing the line of Conjecture 1, or is a paradigm shift required?

1.1 Paper overview

A review of related work is given in Appendix A (deferred to the supplementary material for lack of space). In Section 2 we investigate the potential of norms for capturing the implicit regularization in deep matrix factorization. Surprisingly, we find that the main theoretical evidence connecting nuclear norm and shallow (depth-2) matrix factorization, the proof given in [20] for Conjecture 1 in a particular restricted setting, extends to arbitrarily deep factorizations as well. This result disqualifies the natural hypothesis by which Schatten quasi-norms replace nuclear norm as the implicit regularization when one adds depth to a shallow matrix factorization. Instead, when interpreted through the lens of [20], it brings forth a conjecture by which the implicit regularization is captured by nuclear norm for any depth. Since our experiments (Figure 1) show that depth changes (enhances) the implicit regularization, we are led to question the theoretical direction proposed in [20], and accordingly conduct additional experiments to evaluate the validity of Conjecture 1.

Typically, when the number of observed entries is sufficiently large with respect to the rank of the matrix to recover, nuclear norm minimization yields exact recovery, and thus it is impossible to distinguish between that and a different implicit regularization which also perfectly recovers. The regime most interesting to evaluate is therefore that in which the number of observed entries is too small for exact recovery by nuclear norm minimization; here there is room for different implicit regularizations to manifest themselves by providing higher quality solutions. Our empirical results show that in this regime, matrix factorizations consistently outperform nuclear norm minimization, suggesting that their implicit regularization admits a stronger bias towards low rank, in contrast to Conjecture 1.

Together, our theory and experiments lead us to suspect that the implicit regularization in matrix factorization (shallow or deep) may not be amenable to description by a simple mathematical norm, and that a detailed analysis of the dynamics of optimization may be necessary. Section 3 carries out such an analysis, characterizing how the singular value decomposition of the learned solution evolves during gradient descent. Evolution rates of singular values turn out to be proportional to their size exponentiated by $2 - 2/N$, where $N$ is the depth of the factorization. This establishes a tendency towards low-rank solutions, which intensifies with depth. Experiments validate the findings, demonstrating the dynamic nature of implicit regularization in deep matrix factorization.
We believe the trajectories traversed in optimization may be key to understanding generalization in deep learning, and hope that our work will inspire further progress along this line.

2 Can the implicit regularization be captured by norms?

In this section we investigate the possibility of extending Conjecture 1 for explaining implicit regularization in deep matrix factorization. Given the experimental evidence in Figure 1, one may hypothesize that gradient descent on a depth-$N$ matrix factorization implicitly minimizes some norm (or quasi-norm) that approximates rank, with the approximation being more accurate the larger $N$ is. For example, a natural candidate would be the Schatten-$p$ quasi-norm to the power of $p$ ($0 < p \le 1$), which for a matrix $W \in \mathbb{R}^{d,d'}$ is defined as $\|W\|_{S_p}^p := \sum_{r=1}^{\min\{d,d'\}} \sigma_r^p(W)$, where $\sigma_1(W), \dots, \sigma_{\min\{d,d'\}}(W)$ are the singular values of $W$. For $p = 1$ this reduces to nuclear norm, which by Conjecture 1 corresponds to a depth-2 factorization. As $p$ approaches zero we obtain a closer approximation of $\operatorname{rank}(W)$, which could be suitable for factorizations of higher depths.

We will focus in this section on matrix sensing, a more general problem than matrix completion. Here, we are given $m$ measurement matrices $A_1, \dots, A_m$, with corresponding labels $y_1, \dots, y_m$ generated by $y_i = \langle A_i, W^*\rangle$, and our goal is to reconstruct the unknown matrix $W^*$. As in the case of matrix completion, well-known methods, and in particular nuclear norm minimization, can recover $W^*$ if it is low-rank, certain technical conditions are met, and sufficiently many observations are given (see [42]).

2.1 Current theory does not distinguish depth-N from depth-2

Our first result is that the theory developed by [20] to support Conjecture 1 can be generalized to suggest that nuclear norm captures the implicit regularization in matrix factorization not just for depth 2, but for arbitrary depth. This is of course inconsistent with the experimental findings reported in Figure 1. We will first recall the existing theory, and then show how to extend it.

[20] studied implicit regularization in shallow (depth-2) matrix factorization by considering recovery of a positive semidefinite matrix from sensing via symmetric measurements, namely:
$$\min_{W \in S^d_+}\ \ell(W) := \tfrac{1}{2}\sum_{i=1}^{m}\big(y_i - \langle A_i, W\rangle\big)^2\,, \qquad (2)$$
where $A_1, \dots, A_m \in \mathbb{R}^{d,d}$ are symmetric and linearly independent, and $S^d_+$ stands for the set of (symmetric and) positive semidefinite matrices in $\mathbb{R}^{d,d}$. Focusing on the underdetermined regime $m \ll d^2$, they investigated the implicit bias brought forth by running gradient flow (gradient descent with infinitesimally small learning rate) on a symmetric full-rank matrix factorization, i.e. on the objective:
$$\psi : \mathbb{R}^{d,d} \to \mathbb{R}_{\ge 0}\,, \qquad \psi(Z) := \ell(ZZ^\top) = \tfrac{1}{2}\sum_{i=1}^{m}\big(y_i - \langle A_i, ZZ^\top\rangle\big)^2\,.$$
For $\alpha > 0$, denote by $W_{\text{sha},\infty}(\alpha)$ ("sha" here stands for shallow) the final solution $ZZ^\top$ obtained from running gradient flow on $\psi(\cdot)$ with initialization $\alpha I$ ($\alpha$ times identity). Formally, $W_{\text{sha},\infty}(\alpha) := \lim_{t\to\infty} Z(t)Z^\top(t)$, where $Z(0) = \alpha I$ and $\dot Z(t) = -\frac{d\psi}{dZ}(Z(t))$ for $t \in \mathbb{R}_{\ge 0}$ ($t$ here is a continuous time index, and $\dot Z(t)$ stands for the derivative of $Z(t)$ with respect to time). Letting $\|\cdot\|_*$ represent matrix nuclear norm, the following result was proven by [20]:

Theorem 1 (adaptation of Theorem 1 in [20]). Assume the measurement matrices $A_1, \dots, A_m$ commute. Then, if $\bar W_{\text{sha}} := \lim_{\alpha\to 0} W_{\text{sha},\infty}(\alpha)$ exists and is a global optimum for Equation (2) with $\ell(\bar W_{\text{sha}}) = 0$, it holds that $\bar W_{\text{sha}} \in \operatorname{argmin}_{W \in S^d_+,\ \ell(W)=0} \|W\|_*$, i.e. $\bar W_{\text{sha}}$ is a global optimum with minimal nuclear norm (see footnote 3).
Footnote 3: The result of [20] is slightly more general: it allows gradient flow to be initialized at $\alpha O$, where $O$ is an arbitrary orthogonal matrix, and it is shown that this leads to the exact same $W_{\text{sha},\infty}(\alpha)$ as one would obtain from initializing at $\alpha I$. For simplicity, we limit our discussion to the latter initialization.

Motivated by Theorem 1 and empirical evidence they provided, [20] raised Conjecture 1, which, formally stated, hypothesizes that the condition in Theorem 1 of $\{A_i\}_{i=1}^m$ commuting is unnecessary, and that an identical statement holds for arbitrary (symmetric, linearly independent) measurement matrices (see footnote 4).

Footnote 4: Their conjecture also relaxes the requirement on the initialization of gradient flow: an initial value of $\alpha Z_0$ is believed to suffice, where $Z_0$ is an arbitrary full-rank matrix (that does not depend on $\alpha$).

While the analysis of [20] covers only symmetric matrix factorizations of the form $ZZ^\top$, they noted that it can be extended to also account for asymmetric factorizations of the type considered in the current paper. Specifically, running gradient flow on the objective:
$$\phi(W_1, W_2) := \ell(W_2 W_1) = \tfrac{1}{2}\sum_{i=1}^{m}\big(y_i - \langle A_i, W_2 W_1\rangle\big)^2\,,$$
with $W_1, W_2 \in \mathbb{R}^{d,d}$ initialized to $\alpha I$, $\alpha > 0$, and denoting by $W_{\text{sha},\infty}(\alpha)$ the product matrix obtained at the end of optimization (i.e. $W_{\text{sha},\infty}(\alpha) := \lim_{t\to\infty} W_2(t)W_1(t)$, where $W_j(0) = \alpha I$ and $\dot W_j(t) = -\frac{\partial\phi}{\partial W_j}(W_1(t), W_2(t))$ for $t \in \mathbb{R}_{\ge 0}$), Theorem 1 holds exactly as stated. For completeness, we provide a proof of this fact in Appendix D.

Next, we show that Theorem 1, the main theoretical justification for the connection between nuclear norm and shallow matrix factorization, extends to arbitrarily deep factorizations as well. Consider gradient flow over the objective:
$$\phi(W_1, \dots, W_N) := \ell(W_N W_{N-1}\cdots W_1) = \tfrac{1}{2}\sum_{i=1}^{m}\big(y_i - \langle A_i, W_N W_{N-1}\cdots W_1\rangle\big)^2\,,$$
with $W_1, \dots, W_N \in \mathbb{R}^{d,d}$ initialized to $\alpha I$, $\alpha > 0$. Using $W_{\text{deep},\infty}(\alpha)$ to denote the product matrix obtained at the end of optimization (i.e. $W_{\text{deep},\infty}(\alpha) := \lim_{t\to\infty} W_N(t)W_{N-1}(t)\cdots W_1(t)$, where $W_j(0) = \alpha I$ and $\dot W_j(t) = -\frac{\partial\phi}{\partial W_j}(W_1(t), \dots, W_N(t))$ for $t \in \mathbb{R}_{\ge 0}$), a result analogous to Theorem 1 holds:

Theorem 2. Suppose $N \ge 3$, and that the matrices $A_1, \dots, A_m$ commute. Then, if $\bar W_{\text{deep}} := \lim_{\alpha\to 0} W_{\text{deep},\infty}(\alpha)$ exists and is a global optimum for Equation (2) with $\ell(\bar W_{\text{deep}}) = 0$, it holds that $\bar W_{\text{deep}} \in \operatorname{argmin}_{W \in S^d_+,\ \ell(W)=0} \|W\|_*$, i.e. $\bar W_{\text{deep}}$ is a global optimum with minimal nuclear norm (see footnote 5).

Footnote 5: By Appendix C.1, $W_N(t)W_{N-1}(t)\cdots W_1(t) \succeq 0$ for all $t$. Therefore, even though the theorem treats optimization over $S^d_+$ using an unconstrained asymmetric factorization, gradient flow implicitly constrains the search to $S^d_+$, so the assumption of $\bar W_{\text{deep}}$ being a global optimum for Equation (2) with $\ell(\bar W_{\text{deep}}) = 0$ is no stronger than the analogous assumption in Theorem 1 from [20]. The implicit constraining to $S^d_+$ also holds when $N = 2$ (see Appendix D), so the asymmetric extension of Theorem 1 does not involve strengthening assumptions either.

Proof sketch (for complete proof see Appendix C.1). Our proof is inspired by that of Theorem 1 given in [20]. Using the expression for $\dot W(t)$ derived in [3] (Lemma 3 in Appendix B), it can be shown that $W(t)$ commutes with $\{A_i\}_{i=1}^m$ and takes on a particular form. Taking the limits $t \to \infty$ and $\alpha \to 0$, optimality (minimality) of nuclear norm is then established using a duality argument.

Theorem 2 provides a particular setting where the implicit regularization in deep matrix factorizations boils down to nuclear norm minimization. By Proposition 1 below, there exist instances of this setting for which minimization of nuclear norm contradicts minimization (even locally) of the Schatten-$p$ quasi-norm for any $0 < p < 1$.
Therefore, one cannot hope to capture the implicit regularization in deep matrix factorizations through Schatten quasi-norms. Instead, if we interpret Theorem 2 through the lens of [20], we arrive at a conjecture by which the implicit regularization is captured by nuclear norm for any depth.

Proposition 1. For any dimension $d \ge 3$, there exist linearly independent, symmetric, commuting measurement matrices $A_1, \dots, A_m \in \mathbb{R}^{d,d}$, and corresponding labels $y_1, \dots, y_m \in \mathbb{R}$, such that the limit solution $\bar W_{\text{deep}}$ defined in Theorem 2, which has been shown to satisfy $\bar W_{\text{deep}} \in \operatorname{argmin}_{W \in S^d_+,\ \ell(W)=0} \|W\|_*$, is not a local minimum of the following program for any $0 < p < 1$ (see footnote 6):
$$\min_{W \in S^d_+,\ \ell(W)=0}\ \|W\|_{S_p}\,.$$

Footnote 6: Following [20], we take for granted the existence of $\bar W_{\text{deep}}$ and its being a global optimum for Equation (2) with $\ell(\bar W_{\text{deep}}) = 0$. If this is not the case then Theorem 2 does not apply, and hence it obviously does not disqualify minimization of Schatten quasi-norms as the implicit regularization.

Proof sketch (for complete proof see Appendix C.2). We choose $A_1, \dots, A_m$ and $y_1, \dots, y_m$ such that: (i) $\bar W_{\text{deep}} = \operatorname{diag}(1, 1, 0, \dots, 0)$; and (ii) adding $\epsilon \in (0, 1)$ to entries $(1,2)$ and $(2,1)$ of $\bar W_{\text{deep}}$ maintains optimality. The result then follows from the fact that the addition of $\epsilon$ decreases the Schatten-$p$ quasi-norm for any $0 < p < 1$.

Figure 2: Evaluation of nuclear norm as the implicit regularization in deep matrix factorization. Each plot compares gradient descent over matrix factorizations of depths 2 and 3 (results for depth 4 were indistinguishable from those of depth 3; we omit them to reduce clutter) against the minimum nuclear norm solution and the ground truth in matrix completion tasks. The top (respectively, bottom) row corresponds to completion of a random rank-5 (respectively, rank-10) matrix of size 100 × 100. The left, middle and right columns display (on the y-axis) reconstruction error, nuclear norm and effective rank (cf. [43]), respectively. In each plot, the x-axis stands for the number of observed entries (randomly chosen), and error bars (indiscernible) mark standard deviations of the results over multiple trials. All matrix factorizations are full-dimensional, i.e. have hidden dimensions 100. Both the learning rate and the standard deviation of the (random, zero-centered) initialization for gradient descent were initially set to $10^{-3}$. Running with a smaller learning rate did not yield a noticeable change in final results. Initializing with smaller standard deviation had no observable effect on the results of depth 3 (and 4), but did impact those of depth 2; the outcomes of dividing the standard deviation by 2 and by 4 are included in the plots (see footnote 7). Notice that with many observed entries the minimum nuclear norm solution coincides with the ground truth (minimum rank solution), and matrix factorizations of all depths converge to these. On the other hand, when there are fewer observed entries the minimum nuclear norm solution does not coincide with the ground truth, and matrix factorizations prefer to lower the effective rank at the expense of higher nuclear norm, in a manner that is more potent for deeper factorizations. For further details, and a similar experiment on matrix sensing tasks, see Appendix E.

Footnote 7: As can be seen, using smaller initialization enhanced the implicit tendency of the depth-2 matrix factorization towards low rank. It is possible that this tendency can eventually match that of depth 3 (and 4), but only if the initialization size goes far below what is customary in deep learning.

2.2 Experiments challenging Conjecture 1

Subsection 2.1 suggests that from the perspective of current theory, it is natural to apply Conjecture 1 to matrix factorizations of arbitrary depth. On the other hand, the experiment reported in Figure 1 implies that depth changes (enhances) the implicit regularization. To resolve this tension we conduct a more refined experiment, which ultimately puts in question the validity of Conjecture 1.
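The quantities compared in Figure 2 (nuclear norm, effective rank as in [43], and, where relevant, Schatten-$p$ quasi-norms) are all simple functions of a solution's singular values. The NumPy sketch below is a straightforward, illustrative rendering of these definitions (not code from our experiments) that can be used to reproduce such an evaluation.

```python
import numpy as np

def nuclear_norm(W):
    # Sum of singular values (the convex relaxation of rank appearing in Conjecture 1).
    return np.linalg.svd(W, compute_uv=False).sum()

def schatten_p(W, p):
    # Schatten-p quasi-norm raised to the power p: sum_r sigma_r(W)^p, for 0 < p <= 1.
    return (np.linalg.svd(W, compute_uv=False) ** p).sum()

def effective_rank(W, eps=1e-12):
    # Effective rank (cf. [43]): exponential of the Shannon entropy of the
    # distribution obtained by normalizing the singular values to sum to one.
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    return float(np.exp(-(p * np.log(p + eps)).sum()))
```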
Our experimental protocol is as follows. For different matrix completion tasks with a varying number of observed entries, we compare the minimum nuclear norm solution to those brought forth by running gradient descent on matrix factorizations of different depths. For each depth, we apply gradient descent with different choices of learning rate and standard deviation for the (random, zero-centered) initialization, observing the trends as these become smaller. The outcome of the experiment is presented in Figure 2. As can be seen, when the number of observed entries is sufficiently large with respect to the rank of the matrix to recover, factorizations of all depths indeed admit solutions that tend to minimum nuclear norm. However, when fewer entries are observed, precisely the data-poor setting where implicit regularization matters most, neither shallow (depth-2) nor deep (depth-$N$ with $N \ge 3$) factorizations minimize nuclear norm. Instead, they put more emphasis on lowering the effective rank (cf. [43]), in a manner which is stronger for deeper factorizations.

A close look at the experiments of [20] reveals that there too, in situations where the number of observed entries (or sensing measurements) was small (less than required for reliable recovery), a discernible gap appeared between the minimal nuclear norm and that returned by (gradient descent on) a matrix factorization. In light of Figure 2, we believe that if [20] had included in its plots an accurate surrogate for rank (e.g. effective rank or Schatten-$p$ quasi-norm with small $p$), scenarios where matrix factorization produced sub-optimal (higher than minimum) nuclear norm would have manifested superior (lower) rank. More broadly, our experiments suggest that the implicit regularization in (shallow or deep) matrix factorization is somehow geared towards low rank, and just so happens to minimize nuclear norm in cases with sufficiently many observations, where minimum nuclear norm and minimum rank are known to coincide (cf. [9, 42]). We note that the theoretical analysis of [32] supporting Conjecture 1 is limited to such cases, and thus cannot truly distinguish between nuclear norm minimization and some other form of implicit regularization favoring low rank.

Given that Conjecture 1 seems to hold in some settings (Theorems 1 and 2; [32]) but not in others (Figure 2), we hypothesize that capturing the implicit regularization in (shallow or deep) matrix factorization through a single mathematical norm (or quasi-norm) may not be possible, and that a detailed account of the optimization process might be necessary. This is carried out in Section 3.

3 Dynamical analysis

This section characterizes trajectories of gradient flow (gradient descent with infinitesimally small learning rate) on deep matrix factorizations. The characterization significantly extends past analyses for linear neural networks (e.g. [44, 3]): we derive differential equations governing the dynamics of singular values and singular vectors of the product matrix $W$ (Equation (1)).
Evolution rates of singular values turn out to be proportional to their size exponentiated by $2 - 2/N$, where $N$ is the depth of the factorization. For singular vectors, we show that lack of movement implies a particular form of alignment with the gradient, and by this strengthen past results, which have only established the converse. Via theoretical and empirical demonstrations, we explain how our findings imply a tendency towards low-rank solutions, which intensifies with depth.

Our derivation treats a setting which includes matrix completion and sensing as special cases. We assume minimization of a general analytic loss $\ell(\cdot)$ (see footnote 8), overparameterized by a deep matrix factorization:
$$\phi(W_1, \dots, W_N) := \ell(W_N W_{N-1}\cdots W_1)\,. \qquad (3)$$
We study gradient flow over the factorization:
$$\dot W_j(t) := \tfrac{d}{dt} W_j(t) = -\tfrac{\partial}{\partial W_j}\,\phi(W_1(t), \dots, W_N(t))\,, \quad t \ge 0\,,\ j = 1, \dots, N\,, \qquad (4)$$
and, in accordance with past work, assume that the factors are balanced at initialization, i.e.:
$$W_{j+1}^\top(0)\,W_{j+1}(0) = W_j(0)\,W_j^\top(0)\,, \quad j = 1, \dots, N-1\,. \qquad (5)$$

Footnote 8: A function $f(\cdot)$ is analytic on a domain $D$ if at every $x \in D$ it is infinitely differentiable and its Taylor series converges to it on some neighborhood of $x$ (see [30] for further details).

Equation (5) is satisfied approximately in the common setting of near-zero initialization (it holds exactly in the residual setting of identity initialization, cf. [23, 5]). The condition played an important role in the analysis of [3], facilitating derivation of a differential equation governing the product matrix of a linear neural network (see Lemma 3 in Appendix B). It was shown empirically in [3] that there is an excellent match between the theoretical predictions of gradient flow with balanced initialization and its practical realization via gradient descent with small learning rate and near-zero initialization. Other works (e.g. [4, 28]) later supported this match theoretically.

We note that by Section 6 in [3], for depth $N \ge 3$, the dynamics of the product matrix $W$ (Equation (1)) cannot be exactly equivalent to gradient descent on the loss $\ell(\cdot)$ regularized by a penalty term. This preliminary observation already hints at the possibility that the effect of depth is different from those of standard regularization techniques.

Employing results of [3], we will characterize the evolution of singular values and singular vectors of $W$. As a first step, we show that $W$ admits an analytic singular value decomposition ([7, 12]):

Lemma 1. The product matrix $W(t)$ can be expressed as:
$$W(t) = U(t)\,S(t)\,V^\top(t)\,, \qquad (6)$$
where $U(t) \in \mathbb{R}^{d,\min\{d,d'\}}$, $S(t) \in \mathbb{R}^{\min\{d,d'\},\min\{d,d'\}}$ and $V(t) \in \mathbb{R}^{d',\min\{d,d'\}}$ are analytic functions of $t$; and for every $t$, the matrices $U(t)$ and $V(t)$ have orthonormal columns, while $S(t)$ is diagonal (elements on its diagonal may be negative and may appear in any order).

Proof sketch (for complete proof see Appendix C.3). We show that $W(t)$ is an analytic function of $t$ and then invoke Theorem 1 in [7].

The diagonal elements of $S(t)$, which we denote by $\sigma_1(t), \dots, \sigma_{\min\{d,d'\}}(t)$, are signed singular values of $W(t)$; the columns of $U(t)$ and $V(t)$, denoted $u_1(t), \dots, u_{\min\{d,d'\}}(t)$ and $v_1(t), \dots, v_{\min\{d,d'\}}(t)$, are the corresponding left and right singular vectors (respectively). With Lemma 1 in place, we are ready to characterize the evolution of singular values:
Theorem 3. The signed singular values of the product matrix $W(t)$ evolve by:
$$\dot\sigma_r(t) = -N\,\big(\sigma_r^2(t)\big)^{1-1/N}\,\big\langle \nabla\ell(W(t)),\, u_r(t)\,v_r^\top(t)\big\rangle\,, \quad r = 1, \dots, \min\{d,d'\}\,. \qquad (7)$$
If the matrix factorization is non-degenerate, i.e. has depth $N \ge 2$, the singular values need not be signed (we may assume $\sigma_r(t) \ge 0$ for all $t$).

Proof sketch (for complete proof see Appendix C.4). Differentiating the analytic singular value decomposition (Equation (6)) with respect to time, multiplying from the left by $U^\top(t)$ and from the right by $V(t)$, and using the fact that $U(t)$ and $V(t)$ have orthonormal columns, we obtain $\dot\sigma_r(t) = u_r^\top(t)\,\dot W(t)\,v_r(t)$. Equation (7) then follows from plugging in the expression for $\dot W(t)$ developed by [3] (Lemma 3 in Appendix B).

Strikingly, given a value for $W(t)$, the evolution of singular values depends on $N$, the depth of the matrix factorization, only through the multiplicative factors $N\,(\sigma_r^2(t))^{1-1/N}$ (see Equation (7)). In the degenerate case $N = 1$, i.e. when the product matrix $W(t)$ is simply driven by gradient flow over the loss $\ell(\cdot)$ (no matrix factorization), the multiplicative factors reduce to 1, and the singular values evolve by $\dot\sigma_r(t) = -\langle\nabla\ell(W(t)), u_r(t)v_r^\top(t)\rangle$. With $N \ge 2$, i.e. when depth is added to the factorization, the multiplicative factors become non-trivial, and while the constant $N$ does not differentiate between singular values, the terms $(\sigma_r^2(t))^{1-1/N}$ do: they enhance the movement of large singular values, and on the other hand attenuate that of small ones. Moreover, the enhancement/attenuation becomes more significant as $N$ (the depth of the factorization) grows.

Next, we turn to the evolution of singular vectors:

Lemma 2. Assume that at initialization, the singular values of the product matrix $W(t)$ are distinct and different from zero (see footnote 9). Then, its singular vectors evolve by:
$$\dot U(t) = -U(t)\Big(F(t)\circ\big[U^\top(t)\,\nabla\ell(W(t))\,V(t)\,S(t) + S(t)\,V^\top(t)\,\nabla\ell^\top(W(t))\,U(t)\big]\Big) - \big(I_d - U(t)U^\top(t)\big)\,\nabla\ell(W(t))\,V(t)\,\big(S^2(t)\big)^{\frac{1}{2}-\frac{1}{N}}\,, \qquad (8)$$
$$\dot V(t) = -V(t)\Big(F(t)\circ\big[S(t)\,U^\top(t)\,\nabla\ell(W(t))\,V(t) + V^\top(t)\,\nabla\ell^\top(W(t))\,U(t)\,S(t)\big]\Big) - \big(I_{d'} - V(t)V^\top(t)\big)\,\nabla\ell^\top(W(t))\,U(t)\,\big(S^2(t)\big)^{\frac{1}{2}-\frac{1}{N}}\,, \qquad (9)$$
where $I_d$ and $I_{d'}$ are the identity matrices of sizes $d \times d$ and $d' \times d'$ respectively, $\circ$ stands for the Hadamard (element-wise) product, and the matrix $F(t) \in \mathbb{R}^{\min\{d,d'\},\min\{d,d'\}}$ is skew-symmetric with $\big((\sigma_{r'}^2(t))^{1/N} - (\sigma_r^2(t))^{1/N}\big)^{-1}$ in its $(r,r')$-th entry, $r \ne r'$ (see footnote 10).

Footnote 9: This assumption can be relaxed significantly; all that is needed is that no singular value be identically zero ($\forall r\ \exists t$ s.t. $\sigma_r(t) \ne 0$), and that no pair of singular values be identical through time ($\forall r \ne r'\ \exists t$ s.t. $\sigma_r(t) \ne \sigma_{r'}(t)$).

Footnote 10: Equations (8) and (9) are well-defined whenever $t$ is such that $\sigma_1(t), \dots, \sigma_{\min\{d,d'\}}(t)$ are distinct and different from zero. By analyticity, this is either the case for every $t$ besides a set of isolated points, or it is not the case for any $t$. Our assumption on initialization disqualifies the latter option, so any $t$ for which Equations (8) or (9) are ill-defined is isolated. The derivatives of $U$ and $V$ at such $t$ may thus be inferred by continuity.
Proof sketch (for complete proof see Appendix C.5). We follow a series of steps adopted from [46] to obtain expressions for $\dot U(t)$ and $\dot V(t)$ in terms of $U(t)$, $V(t)$, $S(t)$ and $\dot W(t)$. Plugging in the expression for $\dot W(t)$ developed by [3] (Lemma 3 in Appendix B) then yields Equations (8) and (9).

Corollary 1. Assume the conditions of Lemma 2, and that the matrix factorization is non-degenerate, i.e. has depth $N \ge 2$. Then, for any time $t$ such that the singular vectors of the product matrix $W(t)$ are stationary, i.e. $\dot U(t) = 0$ and $\dot V(t) = 0$, it holds that $U^\top(t)\,\nabla\ell(W(t))\,V(t)$ is diagonal, meaning the singular vectors of $W(t)$ align with those of $\nabla\ell(W(t))$.

Proof sketch (for complete proof see Appendix C.6). By Equations (8) and (9), $U^\top(t)\,\dot U(t)\,S(t) - S(t)\,V^\top(t)\,\dot V(t)$ is equal to the Hadamard product between $U^\top(t)\,\nabla\ell(W(t))\,V(t)$ and a (time-dependent) square matrix with zeros on its diagonal and non-zeros elsewhere. When $\dot U(t) = 0$ and $\dot V(t) = 0$, obviously $U^\top(t)\,\dot U(t)\,S(t) - S(t)\,V^\top(t)\,\dot V(t) = 0$, and so the Hadamard product is zero. This implies that $U^\top(t)\,\nabla\ell(W(t))\,V(t)$ is diagonal.

Earlier papers studying gradient flow for linear neural networks (e.g. [44, 1, 31]) could show that singular vectors are stationary if they align with the singular vectors of the gradient. Corollary 1 is significantly stronger and implies a converse: if singular vectors are stationary, they must be aligned with the gradient. Qualitatively, this suggests that a goal of gradient flow on a deep matrix factorization is to align the singular vectors of the product matrix with those of the gradient.

Figure 3: Dynamics of gradient descent over deep matrix factorizations; specifically, evolution of singular values and singular vectors of the product matrix during training for matrix completion. The top row corresponds to the task of completing a random rank-5 matrix of size 100 × 100 based on 2000 randomly chosen observed entries; the bottom row corresponds to training on 10000 entries chosen randomly from the MovieLens 100K dataset (completion of a 943 × 1682 matrix, cf. [24]; see footnote 11). The first (left) three columns show top singular values for, respectively, depths 1 (no matrix factorization), 2 (shallow matrix factorization) and 3 (deep matrix factorization). The last (right) column shows singular vectors for a depth-2 factorization, by comparing on- vs. off-diagonal entries of the matrix $U^\top(t)\,\nabla\ell(W(t))\,V(t)$ (see Corollary 1): for each group of entries, the mean of the absolute values is plotted, along with a shaded area marking the standard deviation. All matrix factorizations are full-dimensional (hidden dimensions 100 in top row plots, 943 in bottom row plots). Notice that increasing depth makes singular values move slower when small and faster when large (in accordance with Theorem 3), which results in solutions of effectively lower rank. Notice also that $U^\top(t)\,\nabla\ell(W(t))\,V(t)$ is diagonally dominant so long as there is movement, showing that the singular vectors of the product matrix align with those of the gradient (in accordance with Corollary 1). For further details, and a similar experiment on matrix sensing, see Appendix E.

Footnote 11: Observations of MovieLens 100K were subsampled solely to reduce run-time.

3.1 Implicit regularization towards low rank

Figure 3 presents empirical demonstrations of our conclusions from Theorem 3 and Corollary 1. It shows that for a non-degenerate deep matrix factorization, i.e. one with depth $N \ge 2$, under gradient descent with small learning rate and near-zero initialization, singular values of the product matrix are subject to an enhancement/attenuation effect as described above: they progress very slowly after initialization, when close to zero; then, upon reaching a certain threshold, the movement of a singular value becomes rapid, with the transition from slow to rapid movement being sharper for a deeper factorization (larger $N$). In terms of singular vectors, the figure shows that those of the product matrix indeed align with those of the gradient.
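The enhancement/attenuation effect can also be seen in isolation by integrating Equation (7) directly. The toy sketch below is our illustrative simplification, not an experiment from the paper: it assumes a single singular value whose singular vectors are already aligned with those of the gradient, and that $\langle\nabla\ell(W(t)), u_r(t)v_r^\top(t)\rangle = \sigma_r(t) - y_r$, as would hold for example when $\ell(W) = \frac{1}{2}\|W - Y\|_F^2$ and the factorization is aligned with $Y$; the target, initialization and step size are arbitrary.

```python
import numpy as np

def sigma_trajectory(depth, target=1.0, sigma0=1e-4, dt=1e-3, steps=25000):
    """Euler-integrate Equation (7) for a single singular value, assuming
    <grad l(W(t)), u(t) v(t)^T> = sigma(t) - target (aligned singular vectors)."""
    sigma, traj = sigma0, []
    for _ in range(steps):
        sigma += dt * depth * sigma ** (2.0 - 2.0 / depth) * (target - sigma)
        traj.append(sigma)
    return np.array(traj)

# With larger depth, a small singular value stays almost frozen for much longer,
# and then climbs to its final value quickly once it passes a threshold.
for N in (1, 2, 3):
    traj = sigma_trajectory(N)
    t10 = int(np.argmax(traj > 0.1))   # first step above 10% of the target
    t90 = int(np.argmax(traj > 0.9))   # first step above 90% of the target
    print(f"depth {N}: reaches 10% at step {t10}, 90% at step {t90}")
```

Under these assumptions, the deeper the factorization, the longer the dormant phase and the sharper the subsequent rise relative to it, mirroring the qualitative behavior in the first three columns of Figure 3.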
Overall, the dynamics promote solutions that have a few large singular values and many small ones, with a gap that is more extreme the deeper the matrix factorization is. This is an implicit regularization towards low rank, which intensifies with depth.

Theoretical illustration. Consider the simple case of square matrix sensing with a single measurement, fit via $\ell_2$ loss: $\ell(W) = \frac{1}{2}(\langle A, W\rangle - y)^2$, where $A \in \mathbb{R}^{d,d}$ is the measurement matrix and $y \in \mathbb{R}$ the corresponding label. Suppose we learn by running gradient flow over a depth-$N$ matrix factorization, i.e. over the objective $\phi(\cdot)$ defined in Equation (3). Corollary 1 states that the singular vectors of the product matrix, $\{u_r(t)\}_r$ and $\{v_r(t)\}_r$, are stationary only when they diagonalize the gradient, meaning $\{|u_r^\top(t)\,\nabla\ell(W(t))\,v_r(t)| : r = 1, \dots, d\}$ coincides with the set of singular values of $\nabla\ell(W(t))$. In our case $\nabla\ell(W) = (\langle A, W\rangle - y)A$, so stationarity of singular vectors implies $|u_r^\top(t)\,\nabla\ell(W(t))\,v_r(t)| = |\delta(t)|\,\rho_r$, where $\delta(t) := \langle A, W(t)\rangle - y$ and $\rho_1, \dots, \rho_d$ are the singular values of $A$ (in no particular order). We will assume that starting from some time $t_0$ the singular vectors are stationary, and accordingly $u_r^\top(t)\,\nabla\ell(W(t))\,v_r(t) = \delta(t)\,e_r\,\rho_r$ for $r = 1, \dots, d$, where $e_1, \dots, e_d \in \{-1, 1\}$. Theorem 3 then implies that the (signed) singular values of the product matrix evolve by:
$$\dot\sigma_r(t) = -N\,\big(\sigma_r^2(t)\big)^{1-1/N}\,\delta(t)\,e_r\,\rho_r\,, \quad t \ge t_0\,. \qquad (10)$$
Let $r_1, r_2 \in \{1, \dots, d\}$. By Equation (10):
$$\int_{t_0}^{t} \big(\sigma_{r_1}^2(t')\big)^{-1+1/N}\,\dot\sigma_{r_1}(t')\,dt' \;=\; e_{r_1}\rho_{r_1}\big(e_{r_2}\rho_{r_2}\big)^{-1}\int_{t_0}^{t} \big(\sigma_{r_2}^2(t')\big)^{-1+1/N}\,\dot\sigma_{r_2}(t')\,dt'\,.$$
Computing the integrals, we may express $\sigma_{r_1}(t)$ as a function of $\sigma_{r_2}(t)$ (see footnote 12):
$$\sigma_{r_1}(t) \;=\; \begin{cases} \alpha_{r_1,r_2}\,\sigma_{r_2}(t) + \mathrm{const}\,, & N = 1\,, \\ \big(\sigma_{r_2}(t)\big)^{\alpha_{r_1,r_2}}\cdot \mathrm{const}\,, & N = 2\,, \\ \Big(\alpha_{r_1,r_2}\,\big(\sigma_{r_2}(t)\big)^{-\frac{N-2}{N}} + \mathrm{const}\Big)^{-\frac{N}{N-2}}\,, & N \ge 3\,, \end{cases} \qquad (11)$$
where $\alpha_{r_1,r_2} := e_{r_1}\rho_{r_1}(e_{r_2}\rho_{r_2})^{-1}$, and $\mathrm{const}$ stands for a value that does not depend on $t$. Equation (11) reveals a gap between $\sigma_{r_1}(t)$ and $\sigma_{r_2}(t)$ that enhances with depth. For example, consider the case where $0 < \alpha_{r_1,r_2} < 1$. If the depth $N$ is one, i.e. the matrix factorization is degenerate, $\sigma_{r_1}(t)$ will grow linearly with $\sigma_{r_2}(t)$. If $N = 2$ (shallow matrix factorization), $\sigma_{r_1}(t)$ will grow polynomially slower than $\sigma_{r_2}(t)$ ($\mathrm{const}$ here is positive). Increasing depth further will lead $\sigma_{r_1}(t)$ to asymptote as $\sigma_{r_2}(t)$ grows, at a value which can be shown to be lower the larger $N$ is. Overall, adding depth to the matrix factorization leads to more significant gaps between singular values of the product matrix, i.e. to a stronger implicit bias towards low rank.

Footnote 12: In accordance with Theorem 3, if $N \ge 2$, we assume without loss of generality that $\sigma_{r_1}(t), \sigma_{r_2}(t) \ge 0$, while disregarding the trivial case of equality to zero.

4 Conclusion

The implicit regularization of gradient-based optimization is key to generalization in deep learning. As a stepping stone towards understanding this phenomenon, we studied deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization. Through theory and experiments, we questioned prevalent norm-based explanations for implicit regularization in matrix factorization (cf. [20]), and offered an alternative, dynamical approach. Our characterization of the dynamics induced by gradient flow on the singular value decomposition of the learned matrix significantly extends prior work on linear neural networks. It reveals an implicit tendency towards low rank which intensifies with depth, supporting the empirical superiority of deeper matrix factorizations. An emerging view is that understanding optimization in deep learning necessitates a detailed account of the trajectories traversed in training (cf. [4]).
Our work adds another dimension to the potential importance of trajectories: we believe they are necessary for understanding generalization as well, and in particular, may be key to analyzing implicit regularization for non-linear neural networks.

Acknowledgments

This work was supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. Nadav Cohen was a member at the Institute for Advanced Study, and was additionally supported by the Zuckerman Israeli Postdoctoral Scholars Program. The authors thank Nathan Srebro for illuminating discussions which helped improve the paper.

References

[1] Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
[2] Akshay Agrawal, Robin Verschueren, Steven Diamond, and Stephen Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42-60, 2018.
[3] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In International Conference on Machine Learning, pages 244-253, 2018.
[4] Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. International Conference on Learning Representations, 2019.
[5] Peter Bartlett, Dave Helmbold, and Phil Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations. In International Conference on Machine Learning, pages 520-529, 2018.
[6] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873-3881, 2016.
[7] Angelika Bunse-Gerstner, Ralph Byers, Volker Mehrmann, and Nancy K Nichols. Numerical computation of an analytic singular value decomposition of a matrix valued function. Numerische Mathematik, 60(1):1-39, 1991.
[8] Samuel Burer and Renato DC Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329-357, 2003.
[9] Emmanuel J Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717, 2009.
[10] Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. arXiv preprint arXiv:1809.09573, 2018.
[11] Mark A Davenport and Justin Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE Journal of Selected Topics in Signal Processing, 10(4):608-622, 2016.
[12] B De Moor and S Boyd. Analytic properties of singular values and vectors. Katholieke Univ. Leuven, Belgium, Tech. Rep., 28:1989, 1989.
[13] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
[14] Simon S Du and Wei Hu. Width provably matters in optimization for deep linear neural networks. arXiv preprint arXiv:1901.08572, 2019.
[15] Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. arXiv preprint arXiv:1806.00900, 2018.
[16] Jicong Fan and Jieyu Cheng. Matrix completion by deep matrix factorization.
Neural Networks, 98:34-41, 2018.
[17] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973-2981, 2016.
[18] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1233-1242. JMLR.org, 2017.
[19] Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in deep linear neural networks. arXiv preprint arXiv:1904.13262, 2019.
[20] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, pages 6151-6159, 2017.
[21] Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 1832-1841, 2018.
[22] Suriya Gunasekar, Jason D Lee, Daniel Soudry, and Nati Srebro. Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems, pages 9461-9471, 2018.
[23] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. International Conference on Learning Representations, 2016.
[24] F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.
[25] Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1-42, 1997.
[26] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731-1741, 2017.
[27] Yulij Ilyashenko and Sergei Yakovenko. Lectures on Analytic Differential Equations, volume 86. American Mathematical Society, 2008.
[28] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. International Conference on Learning Representations, 2019.
[29] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. International Conference on Learning Representations, 2017.
[30] Steven G Krantz and Harold R Parks. A Primer of Real Analytic Functions. Springer Science & Business Media, 2002.
[31] Andrew K Lampinen and Surya Ganguli. An analytic theory of generalization dynamics and transfer learning in deep linear networks. International Conference on Learning Representations, 2019.
[32] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Proceedings of the 31st Conference On Learning Theory, pages 2-47, 2018.
[33] Zechao Li and Jinhui Tang. Deep matrix factorization for social image tag refinement and assignment. In 2015 IEEE 17th International Workshop on Multimedia Signal Processing (MMSP), pages 1-6. IEEE, 2015.
[34] Junhong Lin, Raffaello Camoriano, and Lorenzo Rosasco. Generalization properties and implicit regularization for multiple passes SGM. In International Conference on Machine Learning, pages 2340-2348, 2016.
[35] Cong Ma, Kaizheng Wang, Yuejie Chi, and Yuxin Chen.
Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval and matrix completion. In International Conference on Machine Learning, pages 3351-3360, 2018.
[36] Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Pedro Henrique Pamplona Savarese, Nathan Srebro, and Daniel Soudry. Convergence of gradient descent on separable data. In Proceedings of Machine Learning Research, volume 89, pages 3420-3428, 2019.
[37] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[38] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947-5956, 2017.
[39] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pages 65-74, 2017.
[40] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[41] Nasim Rahaman, Devansh Arpit, Aristide Baratin, Felix Draxler, Min Lin, Fred A Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of deep neural networks. arXiv preprint arXiv:1806.08734, 2018.
[42] Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471-501, 2010.
[43] Olivier Roy and Martin Vetterli. The effective rank: A measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pages 606-610. IEEE, 2007.
[44] Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations, 2014.
[45] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822-2878, 2018.
[46] James Townsend. Differentiating the singular value decomposition. Technical report, 2016.
[47] George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W Schuller. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):417-429, 2017.
[48] Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. In International Conference on Machine Learning, pages 964-973, 2016.
[49] Qi Wang, Mengying Sun, Liang Zhan, Paul Thompson, Shuiwang Ji, and Jiayu Zhou. Multi-modality disease modeling via collective deep matrix factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1155-1164. ACM, 2017.
[50] Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. Deep matrix factorization models for recommender systems. In IJCAI, pages 3203-3209, 2017.
[51] Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32K for ImageNet training. arXiv preprint arXiv:1708.03888, 2017.
[52] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. International Conference on Learning Representations, 2017.
[53] Handong Zhao, Zhengming Ding, and Yun Fu. Multi-view clustering via deep matrix factorization. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.