Implicit Regularization in Matrix Factorization

Suriya Gunasekar, TTI at Chicago, suriya@ttic.edu
Blake Woodworth, TTI at Chicago, blake@ttic.edu
Srinadh Bhojanapalli, TTI at Chicago, srinadh@ttic.edu
Behnam Neyshabur, TTI at Chicago, behnam@ttic.edu
Nathan Srebro, TTI at Chicago, nati@ttic.edu

Abstract: We study implicit regularization when optimizing an underdetermined quadratic objective over a matrix $X$ with gradient descent on a factorization of $X$. We conjecture and provide empirical and theoretical evidence that with small enough step sizes and initialization close enough to the origin, gradient descent on a full dimensional factorization converges to the minimum nuclear norm solution.

1 Introduction

When optimizing underdetermined problems with multiple global minima, the choice of optimization algorithm can play a crucial role in biasing us toward a specific global minimum, even though this bias is not explicitly specified in the objective or problem formulation. For example, using gradient descent to optimize an unregularized, underdetermined least squares problem would yield the minimum Euclidean norm solution, while using coordinate descent or preconditioned gradient descent might yield a different solution. Such implicit bias, which can also be viewed as a form of regularization, can play an important role in learning.

In particular, implicit regularization has been shown to play a crucial role in training deep models [14, 13, 18, 11]: deep models often generalize well even when trained purely by minimizing the training error without any explicit regularization, and when there are more parameters than samples and the optimization problem is underdetermined. Consequently, there are many zero training error solutions, all global minima of the training objective, many of which generalize badly. Nevertheless, our choice of optimization algorithm, typically a variant of gradient descent, seems to prefer solutions that do generalize well. This generalization ability cannot be explained by the capacity of the explicitly specified model class (namely, the functions representable in the chosen architecture). Instead, it seems that the optimization algorithm biases us toward a "simple" model, minimizing some implicit regularization measure, and that generalization is linked to this measure. But what are the regularization measures that are implicitly minimized by different optimization procedures?

As a first step toward understanding implicit regularization in complex models, in this paper we carefully analyze implicit regularization in matrix factorization models, which can be viewed as two-layer networks with linear transfer. We consider gradient descent on the entries of the factor matrices, which is analogous to gradient descent on the weights of a multilayer network. We show how such an optimization approach can indeed yield good generalization properties even when the problem is underdetermined. We identify the implicit regularizer as the nuclear norm, and show that even when we use a full dimensional factorization, imposing no constraints on the factored matrix, optimization by gradient descent on the factorization biases us toward the minimum nuclear norm solution. Our empirical study leads us to conjecture that with small step sizes and initialization close to zero, gradient descent converges to the minimum nuclear norm solution, and we provide empirical and theoretical evidence for this conjecture, proving it in certain restricted settings.
2 Factorized Gradient Descent for Matrix Regression

We consider least squares objectives over matrices $X \in \mathbb{R}^{n \times n}$ of the form:
$$\min_{X \succeq 0} F(X) = \|\mathcal{A}(X) - y\|_2^2, \qquad (1)$$
where $\mathcal{A} : \mathbb{R}^{n \times n} \to \mathbb{R}^m$ is a linear operator specified by $\mathcal{A}(X)_i = \langle A_i, X \rangle$, $A_i \in \mathbb{R}^{n \times n}$, and $y \in \mathbb{R}^m$. Without loss of generality, we consider only symmetric positive semidefinite (p.s.d.) $X$ and symmetric linearly independent $A_i$ (otherwise, consider optimization over a larger matrix $\begin{bmatrix} W & X \\ X^\top & Z \end{bmatrix}$ with $\mathcal{A}$ operating symmetrically on the off-diagonal blocks). In particular, this setting covers problems including matrix completion (where the $A_i$ are indicators, [5]), matrix reconstruction from linear measurements [15], and multi-task training (where each column of $X$ is a predictor for a different task and the $A_i$ have a single non-zero column, [2, 1]).

We are particularly interested in the regime where $m \ll n^2$, in which case (1) is underdetermined with many global minima satisfying $\mathcal{A}(X) = y$. For such underdetermined problems, merely minimizing (1) cannot ensure recovery (in matrix completion or recovery problems) or generalization (in prediction problems). For example, in a matrix completion problem (without diagonal observations), we can minimize (1) by setting all non-diagonal unobserved entries to zero, or to any arbitrary value.

Instead of working on $X$ directly, we will study a factorization $X = UU^\top$. We can write (1) equivalently as an optimization over $U$:
$$\min_{U \in \mathbb{R}^{n \times d}} f(U) = \|\mathcal{A}(UU^\top) - y\|_2^2. \qquad (2)$$
When $d < n$, this imposes a constraint on the rank of $X$, but we will be mostly interested in the case $d = n$, under which no additional constraint is imposed on $X$ (beyond being p.s.d.) and (2) is equivalent to (1). Thus, if $m \ll n^2$, then (2) with $d = n$ is similarly underdetermined and can be optimized in many ways, and merely finding a global optimum cannot ensure generalization (e.g. imputing zeros in a matrix completion objective). Let us investigate what happens when we optimize (2) by gradient descent on $U$.

To simulate such a matrix reconstruction problem, we generated $m \ll n^2$ random measurement matrices and set $y = \mathcal{A}(X^\star)$ according to some planted $X^\star \succeq 0$. We minimized (2) by performing gradient descent on $U$ to convergence, and then measured the relative reconstruction error $\|X - X^\star\|_F / \|X^\star\|_F$ for $X = UU^\top$. Figure 1 shows the normalized training objective and reconstruction error as a function of the dimensionality $d$ of the factorization, for different initialization and step-size policies, and three different planted $X^\star$.
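To make the setup concrete, the following is a minimal sketch (not the authors' original code) of the experiment just described: gradient descent on $U$ for objective (2) with a full dimensional factorization, random symmetric Gaussian measurements, a small initialization, and a small fixed step size. The problem size ($n = 30$ rather than the paper's $50$), the normalization of $X^\star$, the step size, and the iteration budget are all illustrative assumptions made for a quick NumPy run.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 30, 2                 # smaller than the paper's 50x50 for a quick run
d = n                        # full dimensional factorization
m = 3 * n * r                # number of measurements, m << n^2

# Planted low-rank p.s.d. X* (normalized for numerical convenience) and symmetric Gaussian A_i
G = rng.standard_normal((n, r))
X_star = G @ G.T
X_star /= np.linalg.norm(X_star)                 # ||X*||_F = 1
A = rng.standard_normal((m, n, n))
A = (A + A.transpose(0, 2, 1)) / 2               # symmetrize each A_i
y = np.einsum('kij,ij->k', A, X_star)            # y_i = <A_i, X*>

# Gradient descent on U with small initialization and small fixed step size
U = rng.standard_normal((n, d))
U *= 1e-4 / np.linalg.norm(U)                    # ||U_0||_F = 1e-4
eta = 1e-3
for _ in range(50000):                           # iteration budget is illustrative; run until the loss is near zero
    residual = np.einsum('kij,ij->k', A, U @ U.T) - y     # r = A(UU^T) - y
    grad = 4 * np.einsum('k,kij->ij', residual, A) @ U    # grad f(U) = 4 A*(r) U since the A_i are symmetric
    U -= eta * grad

X = U @ U.T
final_residual = np.einsum('kij,ij->k', A, X) - y
print("training loss       :", np.sum(final_residual ** 2))
print("relative rec. error :", np.linalg.norm(X - X_star) / np.linalg.norm(X_star))
print("nuclear norm of X   :", np.trace(X))      # X = UU^T is p.s.d., so ||X||_* = tr(X)
```

With a small initialization and step size one should expect the printed nuclear norm to be close to the minimum nuclear norm consistent with the measurements (a reference value can be computed as in the sketch further below); with a larger initialization the gap typically grows.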
First, we see that (for sufficiently large $d$) gradient descent indeed finds a global optimum, as evidenced by the training error (the optimization objective) being zero. This is not surprising, since with large enough $d$ this non-convex problem has no spurious local minima [4, 9] and gradient descent converges almost surely to a global optimum [12]; there has also been recent work establishing conditions for global convergence for low $d$ [3, 7].

The more surprising observation is that in panels (a) and (b), even when $d > m/n$, indeed even for $d = n$, we still get good reconstructions from the solution of gradient descent with initialization $U_0$ close to zero and small step size. In this regime, (2) is underdetermined and minimizing it does not ensure generalization. To emphasize this, we plot the reference behavior of a rank unconstrained global minimizer $X_{gd}$ obtained via projected gradient descent for (1) on the $X$ space. For $d < n$ we also plot an example of an alternate "bad" rank-$d$ global optimum obtained with an initialization based on the SVD of $X_{gd}$ ("SVD Initialization").

[Figure 1: three panels with x-axis "dimension d" and y-axis "relative error"; panel titles (a) Low rank $X^\star$, (b) Low nuclear norm $X^\star$, (c) Low rank $X^\star$, $m = nr/4$. Legend: Training error; $\|U_0\|_F = 10^{-4}$, $\eta_{ELS}$; $\|U_0\|_F = 10^{-4}$, $\eta = 10^{-3}$; SVD Initialization; $\|U_0\|_F = 1$, $\eta = 10^{-3}$.]

Figure 1: Reconstruction error of the global optima for $50 \times 50$ matrix reconstruction. (Left) $X^\star$ is of rank $r = 2$ and $m = 3nr$; (Center) $X^\star$ has a spectrum decaying as $O(1/k^{1.5})$, normalized to have $\|X^\star\|_\star = \sqrt{r}\,\|X^\star\|_F$ for $r = 2$, and $m = 3nr$; (Right) a non-reconstructable setting where the number of measurements $m = nr/4$ is much smaller than the requirement to reconstruct a rank $r = 2$ matrix. The plots compare the reconstruction error of gradient descent on $U$ for different choices of initialization $U_0$ and step size $\eta$, including fixed step size and exact line search clipped for stability ($\eta_{ELS}$). Additionally, the orange dashed reference line represents the performance of $X_{gd}$, a rank unconstrained global optimum obtained by projected gradient descent for (1) on $X$ space, and SVD Initialization is an example of an alternate rank-$d$ global optimum, where the initialization $U_0$ is picked based on the SVD of $X_{gd}$ and gradient descent is run on factor space with small step size. Training error behaves similarly in all these settings (zero for $d \geq 2$) and is plotted for reference. Results are averaged across 3 random initializations and (near zero) error bars indicate the standard deviation.

[Figure 2: three panels with x-axis "dimension d" and y-axis "nuclear norm"; panel titles (a) Low rank $X^\star$, (b) Low nuclear norm $X^\star$, (c) Low rank $X^\star$, $m = nr/4$. Legend: $\|U_0\|_F = 10^{-4}$, $\eta = 10^{-3}$; SVD Initialization; $\|U_0\|_F = 10^{-4}$, $\eta_{ELS}$; $\min_{\mathcal{A}(X)=y} \|X\|_\star$.]

Figure 2: Nuclear norm of the solutions from Figure 1. In addition to the reference of $X_{gd}$ from Figure 1, the magenta dashed line (almost overlapped by the plot of $\|U_0\|_F = 10^{-4}$, $\eta = 10^{-3}$) is added as a reference for the (rank unconstrained) minimum nuclear norm global optimum. The error bars indicate the standard deviation across 3 random initializations. We have dropped the plot for $\|U_0\|_F = 1$, $\eta = 10^{-3}$ to reduce clutter.

When $d < m/n$, we understand how the low-rank structure can guarantee generalization [16] and reconstruction [10, 3, 7]. What ensures generalization when $d \geq m/n$? Is there a strong implicit regularization at play for the case of gradient descent on factor space with initialization close to zero?

Observing the nuclear norm of the resulting solutions plotted in Figure 2 suggests that gradient descent implicitly induces a low nuclear norm solution. This is the case even for $d = n$, when the factorization imposes no explicit constraints. Furthermore, we do not include any explicit regularization, and optimization is run to convergence without any early stopping. In fact, we can see a clear bias toward low nuclear norm even in problems where reconstruction is not possible: in panel (c) of Figure 2 the number of samples $m = nr/4$ is much smaller than what is required to reconstruct a rank $r$ ground truth matrix $X^\star$. The optimization in (2) is highly underdetermined and there are many possible zero-error global minima, but gradient descent still prefers a lower nuclear norm solution.

The emerging story is that gradient descent biases us toward a low nuclear norm solution, and we already know how having low nuclear norm can ensure generalization [17, 6] and how minimizing the nuclear norm ensures reconstruction [15, 5].
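As a point of reference throughout the figures, the minimum nuclear norm global optimum can be computed directly as a semidefinite program (for p.s.d. $X$, the nuclear norm equals the trace). The snippet below is a hedged sketch, not taken from the paper, of how such a reference could be obtained with the CVXPY library, assuming CVXPY and a bundled SDP solver are installed; the names `A` and `y` refer to the measurement tensor and observations from the earlier simulation sketch.

```python
import cvxpy as cp
import numpy as np

def min_nuclear_norm_solution(A, y):
    """Solve min ||X||_* s.t. <A_i, X> = y_i, X p.s.d.  (for p.s.d. X, ||X||_* = tr X)."""
    n = A.shape[1]
    X = cp.Variable((n, n), PSD=True)
    constraints = [cp.sum(cp.multiply(A[i], X)) == y[i] for i in range(A.shape[0])]
    prob = cp.Problem(cp.Minimize(cp.trace(X)), constraints)
    prob.solve()
    return X.value

# X_min = min_nuclear_norm_solution(A, y)
# print("minimum nuclear norm:", np.trace(X_min))
```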
Can we more explicitly characterize this bias? We see that we do not always converge precisely to the minimum nuclear norm solution. In particular, the choice of step size and initialization affects which solution gradient descent converges to. Nevertheless, as we formalize in Section 3, we argue that when $U$ is full dimensional, the step size becomes small enough, and the initialization approaches zero, gradient descent will converge precisely to a minimum nuclear norm solution, i.e. to $\operatorname{argmin}_{X \succeq 0} \|X\|_\star$ s.t. $\mathcal{A}(X) = y$.

3 Gradient Flow and Main Conjecture

The behavior of gradient descent with infinitesimally small step size is captured by the differential equation $\dot{U}_t := \frac{dU_t}{dt} = -\nabla f(U_t)$ with an initial condition for $U_0$. For the optimization in (2) this is
$$\dot{U}_t = -\mathcal{A}^*\big(\mathcal{A}(U_t U_t^\top) - y\big)\, U_t, \qquad (3)$$
where $\mathcal{A}^* : \mathbb{R}^m \to \mathbb{R}^{n \times n}$ is the adjoint of $\mathcal{A}$, given by $\mathcal{A}^*(r) = \sum_i r_i A_i$. Gradient descent can be seen as a discretization of (3), and approaches (3) as the step size goes to zero.

The dynamics (3) define the behavior of the solution $X_t = U_t U_t^\top$, and using the chain rule we can verify that $\dot{X}_t = \dot{U}_t U_t^\top + U_t \dot{U}_t^\top = -\mathcal{A}^*(r_t) X_t - X_t \mathcal{A}^*(r_t)$, where $r_t = \mathcal{A}(X_t) - y$ is the vector of residuals. That is, even though the dynamics are defined in terms of the specific factorization $X_t = U_t U_t^\top$, they are actually independent of the factorization and can be equivalently characterized as
$$\dot{X}_t = -\mathcal{A}^*(r_t)\, X_t - X_t\, \mathcal{A}^*(r_t). \qquad (4)$$
We can now define the limit point $X_\infty(X_{\text{init}}) := \lim_{t \to \infty} X_t$ for the factorized gradient flow (4) initialized at $X_0 = X_{\text{init}}$. We emphasize that these dynamics are very different from the standard gradient flow dynamics of (1) on $X$, corresponding to gradient descent on $X$, which take the form $\dot{X}_t = -\nabla F(X_t) = -\mathcal{A}^*(r_t)$.

Based on the preliminary experiments in Section 2 and a more comprehensive numerical study discussed in Section 5, we state our main conjecture as follows:

Conjecture. For any full rank $X_{\text{init}}$, if $\hat{X} = \lim_{\alpha \to 0} X_\infty(\alpha X_{\text{init}})$ exists and is a global optimum for (1) with $\mathcal{A}(\hat{X}) = y$, then $\hat{X} \in \operatorname{argmin}_{X \succeq 0} \|X\|_\star$ s.t. $\mathcal{A}(X) = y$.

Requiring a full-rank initial point demands a full dimensional $d = n$ factorization in (2). The assumption of global optimality in the conjecture is generally satisfied: for almost all initializations, gradient flow will converge to a local minimizer [12], and when $d = n$ any such local minimizer is also a global minimum [9]. Since we are primarily concerned with underdetermined problems, we expect the global optimum to achieve zero error, i.e. satisfy $\mathcal{A}(X) = y$. We already know from this existing literature that gradient descent (or gradient flow) will generally converge to a solution satisfying $\mathcal{A}(X) = y$; the question we address here is which of those solutions it will converge to.

The conjecture implies the same behavior for an asymmetric factorization $X = UV^\top$ with gradient flow on $(U, V)$, since this is equivalent to gradient flow on the p.s.d. factorization of $\begin{bmatrix} W & X \\ X^\top & Z \end{bmatrix}$.
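One way to probe the conjecture numerically, as is done with a SciPy ODE solver in Section 5, is to integrate the $X$-space dynamics (4) directly from $X_0 = \alpha I$ and compare the limit point against the minimum nuclear norm solution. The following is a minimal sketch under the same simulated setup as the earlier snippet (the tensor `A` and observations `y` are assumed from there); the use of `scipy.integrate.solve_ivp`, the initialization scale `alpha`, the tolerances, and the integration horizon are illustrative choices, not the authors' exact protocol.

```python
import numpy as np
from scipy.integrate import solve_ivp

def gradient_flow_limit(A, y, alpha, t_final=20.0):
    """Integrate the X-space dynamics (4): dX/dt = -A*(r) X - X A*(r), with X(0) = alpha * I."""
    n = A.shape[1]

    def dynamics(t, x_flat):
        X = x_flat.reshape(n, n)
        residual = np.einsum('kij,ij->k', A, X) - y        # r_t = A(X_t) - y
        A_star_r = np.einsum('k,kij->ij', residual, A)     # A*(r_t)
        dX = -A_star_r @ X - X @ A_star_r
        return dX.ravel()

    sol = solve_ivp(dynamics, (0.0, t_final), (alpha * np.eye(n)).ravel(),
                    rtol=1e-6, atol=1e-9)
    return sol.y[:, -1].reshape(n, n)                      # approximate X_infinity(alpha * I)

# X_inf = gradient_flow_limit(A, y, alpha=1e-4)
# print("nuclear norm of gradient flow limit:", np.trace(X_inf))
```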
4 Theoretical Analysis

We will prove our conjecture for the special case where the matrices $A_i$ commute, and discuss the more challenging non-commutative case. But first, let us begin by reviewing the behavior of straightforward gradient descent on $X$ for the convex problem in (1).

Warm up: Consider gradient descent updates on the original problem (1) in $X$ space, ignoring the p.s.d. constraint. The gradient direction $\nabla F(X) = \mathcal{A}^*(\mathcal{A}(X) - y)$ is always spanned by the $m$ matrices $A_i$. Initializing at $X_{\text{init}} = 0$, we will therefore always remain in the $m$-dimensional subspace $\mathcal{L} = \{X = \mathcal{A}^*(s) \mid s \in \mathbb{R}^m\}$. Now consider the optimization problem $\min_X \|X\|_F^2$ s.t. $\mathcal{A}(X) = y$. The KKT optimality conditions for this problem are $\mathcal{A}(X) = y$ and $\exists \nu$ s.t. $X = \mathcal{A}^*(\nu)$. As long as we are in $\mathcal{L}$, the second condition is satisfied, and if we converge to a zero-error global minimum, then the first condition is also satisfied. Since gradient descent stays on this manifold, this establishes that if gradient descent converges to a zero-error solution, it is the minimum Frobenius norm solution.

Getting started: $m = 1$. Consider the simplest case of the factorized problem, when $m = 1$, with $A_1 = A$ and $y_1 = y$. The dynamics of (4) are given by $\dot{X}_t = -r_t(A X_t + X_t A)$, where $r_t$ is simply a scalar, and the solution for $X_t$ is given by
$$X_t = \exp(s_t A)\, X_0 \exp(s_t A), \quad \text{where } s_T = -\int_0^T r_t\, dt.$$
Assuming $\hat{X} = \lim_{\alpha \to 0} X_\infty(\alpha X_0)$ exists and $\mathcal{A}(\hat{X}) = y$, we want to show that $\hat{X}$ is an optimum for the following problem:
$$\min_{X \succeq 0} \|X\|_\star \quad \text{s.t.} \quad \mathcal{A}(X) = y. \qquad (5)$$
The KKT optimality conditions for (5) are:
$$\exists \nu \in \mathbb{R}^m \ \text{s.t.} \quad \mathcal{A}(X) = y, \quad X \succeq 0, \quad \mathcal{A}^*(\nu) \preceq I, \quad (I - \mathcal{A}^*(\nu))\, X = 0. \qquad (6)$$
We already know that the first condition holds, and the p.s.d. condition is guaranteed by the factorization of $X$. The remaining complementary slackness and dual feasibility conditions effectively require that $\hat{X}$ is spanned by the top eigenvector(s) of $A$. Informally, looking at the gradient flow path above, for any non-zero $y$, as $\alpha \to 0$ it is necessary that $|s_\infty| \to \infty$ in order to converge to a global optimum; thus eigenvectors corresponding to the top eigenvalues of $A$ will dominate the span of $X_\infty(\alpha X_{\text{init}})$.

What we can prove: Commutative $\{A_i\}_{i \in [m]}$. The characterization of the gradient flow path from the previous section can be extended to arbitrary $m$ in the case that the matrices $A_i$ commute, i.e. $A_i A_j = A_j A_i$ for all $i, j$. Defining the vector integral $s_T = -\int_0^T r_t\, dt$, we can verify by differentiating that the solution of (4) is
$$X_t = \exp(\mathcal{A}^*(s_t))\, X_0 \exp(\mathcal{A}^*(s_t)). \qquad (7)$$

Theorem 1. In the case where the matrices $\{A_i\}_{i=1}^m$ commute, if $\hat{X} = \lim_{\alpha \to 0} X_\infty(\alpha I)$ exists and is a global optimum for (1) with $\mathcal{A}(\hat{X}) = y$, then $\hat{X} \in \operatorname{argmin}_{X \succeq 0} \|X\|_\star$ s.t. $\mathcal{A}(X) = y$.

Proof. It suffices to show that such an $\hat{X}$ satisfies the complementary slackness and dual feasibility KKT conditions in (6). Since the matrices $A_i$ commute and are symmetric, they are simultaneously diagonalizable by a basis $v_1, \ldots, v_n$, and so is $\mathcal{A}^*(s)$ for any $s \in \mathbb{R}^m$. This implies that for any $\alpha$, $X_\infty(\alpha I)$ given by (7) and its limit $\hat{X}$ also have the same eigenbasis. Furthermore, since $X_\infty(\alpha I)$ converges to $\hat{X}$, the scalars $v_k^\top X_\infty(\alpha I) v_k \to v_k^\top \hat{X} v_k$ for each $k \in [n]$. Therefore, $\lambda_k(X_\infty(\alpha I)) \to \lambda_k(\hat{X})$, where $\lambda_k(\cdot)$ is defined as the eigenvalue corresponding to eigenvector $v_k$, not necessarily the $k$-th largest eigenvalue.

Let $\beta = -\tfrac{1}{2}\log \alpha$; then using $X_0 = \alpha I = e^{-2\beta} I$ in (7), $\lambda_k(X_\infty(\alpha I)) = \exp\big(2\lambda_k(\mathcal{A}^*(s_\infty(\beta))) - 2\beta\big)$. For all $k$ such that $\lambda_k(\hat{X}) > 0$, by the continuity of $\log$, we have
$$2\lambda_k(\mathcal{A}^*(s_\infty(\beta))) - 2\beta \to \log \lambda_k(\hat{X}) \;\implies\; \lambda_k\!\Big(\mathcal{A}^*\!\Big(\tfrac{s_\infty(\beta)}{\beta}\Big)\Big) - 1 - \tfrac{\log \lambda_k(\hat{X})}{2\beta} \to 0. \qquad (8)$$
Defining $\nu(\beta) = s_\infty(\beta)/\beta$, we conclude that for all $k$ such that $\lambda_k(\hat{X}) > 0$, $\lim_{\beta \to \infty} \lambda_k(\mathcal{A}^*(\nu(\beta))) = 1$. Similarly, for each $k$ such that $\lambda_k(\hat{X}) = 0$,
$$\exp\big(2\lambda_k(\mathcal{A}^*(s_\infty(\beta))) - 2\beta\big) \to 0 \;\implies\; \exp\big((\lambda_k(\mathcal{A}^*(\nu(\beta))) - 1)\, 2\beta\big) \to 0. \qquad (9)$$
Thus, for every $\epsilon \in (0, 1]$, for sufficiently large $\beta$,
$$\exp\big((\lambda_k(\mathcal{A}^*(\nu(\beta))) - 1)\, 2\beta\big) < \epsilon \;\implies\; \exp\big(\lambda_k(\mathcal{A}^*(\nu(\beta))) - 1\big) < \epsilon^{\frac{1}{2\beta}} < 1 \;\implies\; \lambda_k(\mathcal{A}^*(\nu(\beta))) < 1. \qquad (10)$$
Therefore, we have shown that $\lim_{\beta \to \infty} \mathcal{A}^*(\nu(\beta)) \preceq I$ and $\lim_{\beta \to \infty} \mathcal{A}^*(\nu(\beta))\, \hat{X} = \hat{X}$, establishing the optimality of $\hat{X}$ for (5).

Interestingly, and similarly to gradient descent on $X$, this proof does not exploit the particular form of the "control" $r_t$ and only relies on the fact that the gradient flow path stays within the manifold
$$\mathcal{M} = \{X = \exp(\mathcal{A}^*(s))\, X_{\text{init}} \exp(\mathcal{A}^*(s)) \mid s \in \mathbb{R}^m\}. \qquad (11)$$
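For completeness, here is the differentiation check referred to above (a short worked step, not spelled out in the original text, using only the commutativity of the $A_i$); it verifies that (7) solves (4), and at the same time shows why the flow never leaves $\mathcal{M}$. With $s_t = -\int_0^t r_\tau\, d\tau$, so that $\dot{s}_t = -r_t$, and $X_t = \exp(\mathcal{A}^*(s_t))\, X_0 \exp(\mathcal{A}^*(s_t))$,
\[
\dot{X}_t
= \mathcal{A}^*(\dot{s}_t)\, e^{\mathcal{A}^*(s_t)} X_0\, e^{\mathcal{A}^*(s_t)}
+ e^{\mathcal{A}^*(s_t)} X_0\, e^{\mathcal{A}^*(s_t)}\, \mathcal{A}^*(\dot{s}_t)
= -\mathcal{A}^*(r_t)\, X_t - X_t\, \mathcal{A}^*(r_t),
\]
where the first equality uses $\frac{d}{dt} e^{\mathcal{A}^*(s_t)} = \mathcal{A}^*(\dot{s}_t)\, e^{\mathcal{A}^*(s_t)}$, which is valid because $\mathcal{A}^*(s_t)$ and $\mathcal{A}^*(\dot{s}_t)$ commute whenever the $A_i$ commute.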
Since the $A_i$'s commute, we can verify that the tangent space of $\mathcal{M}$ at a point $X$ is given by $T_X\mathcal{M} = \operatorname{Span}\{A_i X + X A_i\}_{i \in [m]}$, thus gradient flow will always remain in $\mathcal{M}$. For any control $r_t$ such that following $\dot{X}_t = -\mathcal{A}^*(r_t) X_t - X_t \mathcal{A}^*(r_t)$ leads to a zero-error global optimum, that optimum will be a minimum nuclear norm solution. This implies in particular that the conjecture extends to gradient flow on (2) even when the Euclidean norm is replaced by certain other norms, or when only a subset of measurements is used for each step (as in stochastic gradient descent). However, unlike gradient descent on $X$, the manifold $\mathcal{M}$ is not flat, and the tangent space at each point is different. Taking finite length steps, as in gradient descent, would cause us to "fall off" the manifold. To avoid this, we must take infinitesimal steps, as in the gradient flow dynamics.

In the case that $X_{\text{init}}$ and the measurements $A_i$ are diagonal matrices, gradient descent on (2) is equivalent to a vector least squares problem, parametrized in terms of the square root of the entries:

Corollary 2. Let $x_\infty(x_{\text{init}})$ be the limit point of gradient flow on $\min_{u \in \mathbb{R}^n} \|A x(u) - y\|_2^2$ with initialization $x_{\text{init}}$, where $x(u)_i = u_i^2$, $A \in \mathbb{R}^{m \times n}$ and $y \in \mathbb{R}^m$. If $\hat{x} = \lim_{\alpha \to 0} x_\infty(\alpha \mathbf{1})$ exists and $A\hat{x} = y$, then $\hat{x} \in \operatorname{argmin}_{x \in \mathbb{R}^n_+} \|x\|_1$ s.t. $Ax = y$.

The plot thickens: Non-commutative $\{A_i\}_{i \in [m]}$. Unfortunately, in the case that the matrices $A_i$ do not commute, the analysis is much more difficult. For a matrix-valued function $F_t$, $\frac{d}{dt}\exp(F_t)$ is equal to $\dot{F}_t \exp(F_t)$ only when $F_t$ and $\dot{F}_t$ commute. Therefore, (7) is no longer a valid solution for (4). Discretizing the solution path, we can express the solution as the "time ordered exponential":
$$X_t = \lim_{\epsilon \to 0} \prod_{\tau = t/\epsilon}^{1} \exp\big(-\epsilon \mathcal{A}^*(r_{\tau\epsilon})\big)\; X_0 \prod_{\tau = 1}^{t/\epsilon} \exp\big(-\epsilon \mathcal{A}^*(r_{\tau\epsilon})\big), \qquad (12)$$
where the order in the products is important. If the $A_i$ commute, the product of exponentials is equal to an exponential of sums, which in the limit evaluates to the solution in (7). However, since in general $\exp(A_1)\exp(A_2) \neq \exp(A_1 + A_2)$, the path (12) is not contained in the manifold $\mathcal{M}$ defined in (11).

It is tempting to try to construct a new manifold $\mathcal{M}'$ such that $\operatorname{Span}\{A_i X + X A_i\}_{i \in [m]} \subseteq T_X\mathcal{M}'$ and $X_0 \in \mathcal{M}'$, ensuring the gradient flow remains in $\mathcal{M}'$. However, since the $A_i$'s do not commute, by combining infinitesimal steps along different directions it is possible to move (very slowly) in directions that are not of the form $\mathcal{A}^*(s) X + X \mathcal{A}^*(s)$ for any $s \in \mathbb{R}^m$. The possible directions of movement indeed correspond to the Lie algebra defined by the closure of $\{A_i\}_{i=1}^m$ under the commutator operator $[A_i, A_j] := A_i A_j - A_j A_i$. Even when $m = 2$, this closure will generally encompass all of $\mathbb{R}^{n \times n}$, allowing us to approach any p.s.d. matrix $X$ with some (wild) control $r_t$. Thus, we cannot hope to ensure the KKT conditions for an arbitrary control as we did in the commutative case; it is necessary to exploit the structure of the residuals $\mathcal{A}(X_t) - y$ in some way.

Nevertheless, in order to make finite progress moving along a commutator direction like $[A_i, A_j] X_t + X_t [A_i, A_j]^\top$, it is necessary to use an extremely non-smooth control, e.g., looping $1/\epsilon^2$ times between $\epsilon$ steps in the directions $A_i, A_j, -A_i, -A_j$, each such loop making an $\epsilon^2$ step in the desired direction. We expect the actual residuals $r_t$ to behave much more smoothly, and that for smooth controls the non-commutative terms in the expansion of the time ordered exponential (12) are asymptotically lower order than the direct term $\mathcal{A}^*(s)$ (as $X_{\text{init}} \to 0$). This is indeed confirmed numerically, both for the actual residual controls of the gradient flow path, and for other random controls.
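Before moving to the empirical study, note that the diagonal setting of Corollary 2 is easy to probe directly, since it reduces to a small vector problem. Below is a hedged sketch, not from the paper: gradient descent with a small step size on $u$ for $\min_u \|A x(u) - y\|_2^2$ with $x(u)_i = u_i^2$ and a small positive initialization, compared against the minimum $\ell_1$ norm nonnegative solution, computed here with `scipy.optimize.linprog` as an illustrative choice. The dimensions, planted vector, step size, and iteration budget are all assumptions for the sake of the demonstration.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n_dim, m_meas = 20, 8
A_mat = rng.standard_normal((m_meas, n_dim))
x_star = np.zeros(n_dim)
x_star[:2] = [1.0, 2.0]                                  # sparse nonnegative planted vector
y_vec = A_mat @ x_star

# Gradient descent on u with x(u)_i = u_i^2 and small positive initialization
u = np.full(n_dim, 1e-4)
eta = 1e-3
for _ in range(200000):                                  # iteration budget is illustrative
    r = A_mat @ (u ** 2) - y_vec
    u -= eta * 4 * (A_mat.T @ r) * u                     # grad_u ||A x(u) - y||_2^2 = 4 u * (A^T r)
x_gd = u ** 2

# Minimum l1 norm nonnegative solution: min 1^T x  s.t.  A x = y, x >= 0  (a linear program)
lp = linprog(c=np.ones(n_dim), A_eq=A_mat, b_eq=y_vec, bounds=[(0, None)] * n_dim)
print("||x_gd||_1 =", x_gd.sum(), "  min ||x||_1 =", lp.x.sum(),
      "  residual =", np.linalg.norm(A_mat @ x_gd - y_vec))
```

In line with Corollary 2, one should expect the two $\ell_1$ norms to roughly agree for a small enough initialization, with the gap growing as the initialization scale increases.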
5 Empirical Evidence

Beyond the matrix reconstruction experiments of Section 2, we also conducted experiments with similarly simulated matrix completion problems, including problems where entries are sampled from power-law distributions (thus not satisfying incoherence), as well as matrix completion problems on non-simulated Movielens data. In addition to gradient descent, we also looked more directly at the gradient flow ODE (3) and used a numerical ODE solver provided as part of SciPy [8] to solve (3), though we still use a finite (non-zero) initialization. We also emulated staying on a valid "steering path" by numerically approximating the time ordered exponential of (12): for a finite discretization $\eta$, instead of moving linearly in the direction of the gradient $\nabla f(U)$ (as in gradient descent), we multiply $X_t$ on the right and left by $e^{-\eta \mathcal{A}^*(r_t)}$. The results of these experiments are summarized in Figure 3.

In these experiments, we again observe trends similar to those in Section 2. In some panels in Figure 3, we do see a discernible gap between the minimum nuclear norm global optimum and the nuclear norm of the gradient flow solution with $\|U_0\|_F = 10^{-4}$. This discrepancy could be due to starting at a non-limit point $U_0$, or to numerical issues arising from approximations to the ODE, or it could potentially suggest a weakening of the conjecture. Even if the latter were the case, the experiments so far provide strong evidence for at least approximate versions of our conjecture being true under a wide range of problems.

[Figure 3: each sub-figure (i)-(iii) has three panels, (a) Low rank $X^\star$, (b) Low nuclear norm $X^\star$, (c) Low rank $X^\star$, $m = nr/4$, with y-axis "nuclear norm". Legend: $\min_{\mathcal{A}(X)=y} \|X\|_\star$; ODE approx., $\|U_0\|_F = 10^{-4}$; Time ordered exp., $\|U_0\|_F = 10^{-4}$, $\eta = 0.1$; Gradient descent, $\|U_0\|_F = 10^{-4}$, $\eta = 10^{-3}$; $X_{gd}$.]

(i) Gaussian random measurements. We report the nuclear norm of the gradient flow solutions from three different approximations to (3): a numerical ODE solver (ODE approx.), the time ordered exponential specified in (12) (Time ordered exp.), and standard gradient descent with small step size (Gradient descent). The nuclear norm of the solution from gradient descent on $X$ space, $X_{gd}$, and the minimum nuclear norm global optimum are provided as references. In (a) $X^\star$ is rank $r$ and $m = 3nr$, in (b) $X^\star$ has a decaying spectrum with $\|X^\star\|_\star = \sqrt{r}\,\|X^\star\|_F$ and $m = 3nr$, and in (c) $X^\star$ is rank $r$ with $m = nr/4$, where $n = 50$, $r = 2$.

(ii) Uniform matrix completion: each $A_i$ measures a uniform random entry of $X^\star$. Details on $X^\star$, the number of measurements, and the legend follow Figure 3-(i).

(iii) Power law matrix completion: each $A_i$ measures a random entry of $X^\star$ chosen according to a power law distribution. Details on $X^\star$, the number of measurements, and the legend follow Figure 3-(i).

(iv) Benchmark movie recommendation dataset Movielens 100k. The dataset contains 100k ratings from $n_1 = 943$ users on $n_2 = 1682$ movies. In this problem, gradient updates are performed on the asymmetric matrix factorization space $X = UV^\top$ with dimension $d = \min(n_1, n_2)$. The training data is completely fit to $< 10^{-2}$ error. Test error is computed on held out data of 10 ratings per user.

                                                              Test Error   Nuclear norm
  $\operatorname{argmin}_{\mathcal{A}(X)=y}\|X\|_\star$        0.2880       8391
  Gradient descent ($\|U_0\|_F = 10^{-3}$, $\eta = 10^{-2}$)   0.2631       8876
  $X_{gd}$                                                     1.000        20912

Here we are not interested in the recommendation performance (test error) itself, but in observing the bias of gradient flow with initialization close to zero to return a low nuclear norm solution; the test error is provided merely to demonstrate the effectiveness of such a bias in this application. Also, due to the scale of the problem, we only report a coarse approximation of the gradient flow (3) from gradient descent with $\|U_0\|_F = 10^{-3}$, $\eta = 10^{-2}$.

Figure 3: Additional matrix reconstruction experiments.
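The "time ordered exponential" discretization mentioned above is straightforward to implement: rather than a linear gradient step, each update conjugates $X_t$ by a matrix exponential. Below is a minimal sketch of this update rule (not the authors' code), reusing the `A` and `y` from the earlier snippets; the step size, iteration count, and use of `scipy.linalg.expm` are illustrative assumptions, with a smaller step size than the paper's $\eta = 0.1$ chosen for stability with the simulated data above.

```python
import numpy as np
from scipy.linalg import expm

def time_ordered_exponential(A, y, alpha=1e-4, eta=1e-3, num_steps=20000):
    """Approximate (12): X <- e^{-eta A*(r)} X e^{-eta A*(r)}, starting from X_0 = alpha * I."""
    n = A.shape[1]
    X = alpha * np.eye(n)
    for _ in range(num_steps):
        residual = np.einsum('kij,ij->k', A, X) - y       # r_t = A(X_t) - y
        step = expm(-eta * np.einsum('k,kij->ij', residual, A))
        X = step @ X @ step                               # conjugation by a symmetric matrix keeps X p.s.d.
    return X

# X_toe = time_ordered_exponential(A, y)
# print("nuclear norm:", np.trace(X_toe))
```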
Exhaustive search. Finally, we also ran an exhaustive grid search over small problems, capturing essentially all possible problems of this size. We performed an exhaustive grid search over matrix completion problem instances on symmetric p.s.d. $3 \times 3$ matrices. With $m = 4$, there are 15 unique masks, i.e. sets $\{A_i\}_{i \in [4]}$, that are valid symmetric matrix completion observations. For each mask, we fill the $m = 4$ observations with all possible combinations of 10 uniformly spaced values in the interval $[-1, 1]$. This gives us a total of $15 \times 10^4$ problem instances. Of these problem instances, we discard the ones that do not have a valid p.s.d. completion and run the ODE solver on every remaining instance with a random $U_0$ such that $\|U_0\|_F = \alpha$, for different values of $\alpha$. Results on the deviation from the minimum nuclear norm are reported in Figure 4. For small $\alpha = 10^{-5}, 10^{-3}$, most instances of our grid search returned solutions with near minimal nuclear norm, and the deviations are within the possibility of numerical error. This behavior degrades for the larger initialization $\alpha = 1$.

[Figure 4: three histograms of $\Delta(X_\infty)$ (x-axis approximately $[-0.2, 0.2]$, y-axis "number of experiments"); panels (a) $\alpha = 10^{-5}$, (b) $\alpha = 10^{-3}$, (c) $\alpha = 1$.]

Figure 4: Histogram of the relative sub-optimality of the nuclear norm of $X_\infty$ in the grid search experiments. We plot the histogram of $\Delta(X_\infty) = \frac{\|X_\infty\|_\star - \|X_{\min}\|_\star}{\|X_{\min}\|_\star}$, where $X_{\min} \in \operatorname{argmin}_{\mathcal{A}(X)=y} \|X\|_\star$. The panels correspond to different values of the norm of the initialization $\alpha = \|U_0\|_F$: (Left) $\alpha = 10^{-5}$, (Center) $\alpha = 10^{-3}$, and (Right) $\alpha = 1$.

6 Discussion

It is becoming increasingly apparent that biases introduced by optimization procedures, especially for underdetermined problems, are playing a key role in learning. Yet, so far we have very little understanding of the implicit biases associated with different non-convex optimization methods. In this paper we carefully study such an implicit bias in a two-layer non-convex problem, identify it, and show how, even though there is no difference in the model class (problems (1) and (2) are equivalent when $d = n$, both with very high capacity), the non-convex modeling induces a potentially much more useful implicit bias.

We also discuss how the bias in the non-convex case is much more delicate than in convex gradient descent: since we are not restricted to a flat manifold, the bias introduced by optimization depends on the step sizes taken. Furthermore, for linear least squares problems (i.e. methods based on the gradients with respect to $X$ in our formulation), any global optimization method that uses linear combinations of gradients, including conjugate gradient descent, Nesterov acceleration and momentum methods, remains on the manifold spanned by the gradients, and so leads to the same minimum norm solution. This is not true if the manifold is curved, as using momentum or past gradients will lead us to shoot off the manifold.

Much of the recent work on non-convex optimization, and matrix factorization in particular, has focused on global convergence: whether, and how quickly, we converge to a global minimum.
In contrast, we address the complementary question of which global minimum we converge to. There has also been much work on methods ensuring good matrix reconstruction or generalization based on structural and statistical properties. We do not assume any such properties, nor that reconstruction is possible, or even that there is anything to reconstruct: for any problem of the form (1), we conjecture that (4) leads to the minimum nuclear norm solution. Whether such a minimum nuclear norm solution is good for reconstruction or learning is a separate issue already well addressed by the above literature.

We based our conjecture on extensive numerical simulations, with random, skewed, reconstructible, non-reconstructible, incoherent, non-incoherent, and exhaustively enumerated problems, some of which are reported in Section 5. We believe our conjecture holds, perhaps with some additional technical conditions or corrections. We explain how the conjecture is related to control on manifolds and the time ordered exponential, and discuss a possible approach for proving it.

References

[1] Yonatan Amit, Michael Fink, Nathan Srebro, and Shimon Ullman. Uncovering shared structures in multiclass classification. In Proceedings of the 24th International Conference on Machine Learning, pages 17–24. ACM, 2007.

[2] Andreas Argyriou, Theodoros Evgeniou, and Massimiliano Pontil. Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41, 2007.

[3] Srinadh Bhojanapalli, Behnam Neyshabur, and Nathan Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, 2016.

[4] Samuel Burer and Renato D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming, 95(2):329–357, 2003.

[5] Emmanuel J. Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.

[6] Rina Foygel and Nathan Srebro. Concentration-based guarantees for low-rank matrix reconstruction. In COLT, pages 315–340, 2011.

[7] Rong Ge, Jason D. Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[8] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001.

[9] Michel Journée, Francis Bach, P.-A. Absil, and Rodolphe Sepulchre. Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization, 20(5):2327–2351, 2010.

[10] Raghunandan Hulikal Keshavan. Efficient algorithms for collaborative filtering. PhD thesis, Stanford University, 2012.

[11] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2016.

[12] Jason D. Lee, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In 29th Annual Conference on Learning Theory, 2016.

[13] Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017.

[14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In International Conference on Learning Representations, 2015.
[15] Benjamin Recht, Maryam Fazel, and Pablo A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.

[16] Nathan Srebro, Noga Alon, and Tommi S. Jaakkola. Generalization error bounds for collaborative prediction with low-rank matrices. In Advances in Neural Information Processing Systems, pages 1321–1328, 2005.

[17] Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer, 2005.

[18] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.