# incremental_learning_in_diagonal_linear_networks__ea0cf172.pdf

Journal of Machine Learning Research 24 (2023) 1-26 Submitted 12/22; Revised 4/23; Published 5/23

Incremental Learning in Diagonal Linear Networks

Rapha el Berthier raphael.berthier@epfl.ch EPFL Lausanne, Switzerland

Editor: Lorenzo Rosasco

Diagonal linear networks (DLNs) are a toy simpliﬁcation of artiﬁcial neural networks; they consist in a quadratic reparametrization of linear regression inducing a sparse implicit regularization. In this paper, we describe the trajectory of the gradient ﬂow of DLNs in the limit of small initialization. We show that incremental learning is eﬀectively performed in the limit: coordinates are successively activated, while the iterate is the minimizer of the loss constrained to have support on the active coordinates only. This shows that the sparse implicit regularization of DLNs decreases with time. This work is restricted to the underparametrized regime with anti-correlated features for technical reasons.

Keywords: diagonal linear networks, incremental learning, saddle-to-saddle dynamics, implicit bias, Lotka-Volterra

1. Introduction

Artiﬁcial neural networks are the state of the art for many machine learning tasks (Le Cun et al., 2015); however, we lack theoretical understanding of this success (Zhang et al., 2021). Indeed, the parametrization of neural networks induces a non-convex loss, and consequently it is challenging to analyze the optimization error of gradient descent methods. Moreover, neural networks can be successful even without any (explicit) regularizer; this challenges the statistical wisdom in overparametrized settings. Recent research suggests that these two problems are intertwined: through its nonconvex parametrization, the gradient descent dynamics of neural networks induce an implicit regularization that controls the statistical performance (Bartlett et al., 2021). However, this phenomenon is diﬃcult to describe because it is a joint eﬀect of the parametrization, the gradient descent dynamics and the initialization. As a consequence, theoretical research has focused on studying implicit regularization in toy simpliﬁcations of neural networks (Soudry et al., 2018; Gunasekar et al., 2017; Li et al., 2018; Chizat and Bach, 2020; Li et al., 2020). We are interested in an extreme simpliﬁcation, called diagonal linear networks (DLNs) (Vaskevicius et al., 2019; Zhao et al., 2019; Woodworth et al., 2020; Hao Chen et al., 2021; Li et al., 2021; Azulay et al., 2021; Pesme et al., 2021; Pillaud-Vivien et al., 2022; Nacson et al., 2022; Chou et al., 2023). In fact, it is only a linear regression where regressors θi are parametrized quadratically; speciﬁcally, in this paper, we parametrize θi = u2 i /4. We then perform a gradient descent in terms of ui and not θi (see Section 2.1 for more details). This quadratic reparametrization is loosely argued to have an eﬀect similar to the composition of two layers in a neural

c 2023 Rapha el Berthier.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v24/22-1395.html.

network. When started from a small initialization, DLNs were rigorously shown to enforce an implicit sparse regularization; in an overparametrized setting, DLNs converge to a sparse interpolator. Previous works have thus focused on describing the limit point of the dynamics of DLNs. Instead, in this work, we study the full trajectory of the continuous-time gradient ﬂow dynamics. In the limit of small initialization, we show that incremental learning (see, e.g., Saxe et al. (2019); Gidel et al. (2019); Gissin et al. (2019); Li et al. (2020)) is eﬀectively performed: as time increases, coordinates are successively activated and the iterate is the minimizer of the loss constrained to have support on the active coordinates only. The main contribution of this paper is the description of the time-dependent set of active coordinates, and the rigorous proof of the convergence to a regressor whose sparsity depends on the stopping time. The take-home message is that DLNs enforce a sparse implicit regularization that decreases as the stopping time increases. As a corollary of our description of the dynamics, we obtain an asymptotic equivalent of the convergence time to the minimizer of the loss in the limit of small initialization. It is quite remarkable that such a precise estimate of the global convergence time can be obtained for a non-convex optimization problem. For technical reasons, our work is restricted to the special case of anti-correlated feature; consequently, we study only underparametrized problems (see Section 2.2). However, we do not expect this restriction to be necessary for incremental learning to occur in DLNs. We leave the proof of this for future work. The rest of this paper is organized as follows. In Section 2, we present our setting, our assumptions and our results. In Section 3, we articulate the related work in more detail. Sections 4 and 5 prove our results.

Notations. We use bold notations for vectors and matrices: for instance, if θ Rd is a multi-dimensional vector, we denote θi its coordinates. Similarly, if M Rd d, we denote Mij its entries. If I is a subset of {1, . . . , d}, we denote Ic its complement and |I| its cardinality. We denote θI R|I| the subvector obtained from θ Rd by keeping only the coordinates indexed by i I. Similarly, if I, J are subsets of {1, . . . , d}, we denote M IJ the submatrix obtained from M by keeping only the rows indexed by i I and columns indexed by j J. If θ, ν Rd, we write θ ν (resp. θ > ν) to denote that for all i {1, . . . , d}, θi νi (resp. θi > νi). In particular, θ 0 (resp. θ > 0) denotes that all coordinates of θ are non-negative (resp. positive). We use the notations ., . and . to denote the Euclidean dot product and norm respectively.

2. Main Results

In this section, we ﬁrst introduce the parametrization of DLNs and the induced gradient ﬂow dynamics (Section 2.1). Then, we state the assumption that features are anti-correlated and discuss the consequences (Section 2.2). Finally, we state our results (Section 2.3).

Incremental Learning in Diagonal Linear Networks

2.1 Setting

We perform a linear regression of n output variables y1, . . . , yn R from n corresponding input variables x1, . . . , xn Rd. The traditional approach is to minimize the quadratic loss:

k=1 (yk θ, xk )2 )

Denote y = (y1, . . . , yn) Rn, X Rn d the matrix whose rows are x1, . . . , xn Rd, and X1, . . . , Xd Rn the columns of the matrix X, i.e., the features of the regression problem. The quadratic loss f can be expressed as a function of the covariance M = X X Rd d

of the features and of the covariance r = X y between the features and the output:

2 y 2 r, θ + 1

A strategy to minimize this convex function is to perform a gradient descent, i.e., an Euler discretization of the gradient ﬂow

j=1 Mijθj , (2)

where θ = θ(t) is a function from R 0 to Rd. It is widely known that for any initial point, this gradient ﬂow converges exponentially fast to a minimizer of f. In this paper, we are interested in the eﬀect of reparametrizing

This reparametrization of linear regression is called a diagonal linear network (DLN). We perform the gradient ﬂow in terms of u Rd instead of θ Rd: dui

dui . Using that dθi = 1

2uidui, we compute the resulting equation in θi:

2ui df dui = 1

4u2 i df dθi ,

dt = θi df dθi = θi

Compare (3) with (2). The reparametrization has added a factor θi in the derivative of θi. This implies that if θi is initialized at 0, then it remains at 0 in the DLN dynamics (3). In particular, θ = 0 Rd is a stable point of the dynamics. In this paper, we are interested in the DLN when initialized close to this stable point. More precisely, for ε > 0, deﬁne θ(ε) = θ(ε)(t) as the solution of the DLN dynamics (3) initialized from θ(ε)(0) = (C1εk1, . . . , Cdεkd), where C = (C1, . . . , Cd) > 0 and k = (k1, . . . , kd) > 0 are constants.

2.2 Assumptions

Before we get to a rigorous statement of our results, let us state our assumptions.

(A1) r = X y > 0, i.e., the covariance ri = Xi, y between the output y and the feature Xi is positive for all i {1, . . . , d}.

The reparametrization θi = 1

4u2 i constrains the linear regression to have non-negative weights. In this situation, it is natural to preprocess the data by potentially changing the signs of the features X1, . . . , Xd so that the output is positively correlated with the features. Assumption (A1) assumes that this pre-processing has been done, and for technical reasons that the correlations are non-zero.

(A2) For all i = j, Mij = Xi, Xj 0, i.e., the features are anti-correlated.

We assume that once the features have been positively correlated with the output, they are anti-correlated. This assumption is a strong restriction to the class of studied problems and weakening it is left as an open problem. A major motivation for this assumption is that it implies that the trajectories of the DLN dynamics are nondecreasing.

Lemma 1 Assume (A1)-(A2). There exists ε0 > 0 such that for all ε (0, ε0], for all i {1, . . . , d}, θ(ε) i (t) is nondecreasing in t.

The proof of this result is postponed to Appendix B.

As a side comment, note that Assumptions (A1) and (A2) jointly constrain the problem to be in the underparametrized regime n d.

Proposition 1 Assume (A1)-(A2). Then M = X X is positive deﬁnite. In particular, as M Rd d and X Rn d, we have n d.

The proof of this result is postponed to Appendix C.

2.3 Statement of the Results

Our main result (Theorem 1 below) states that the DLN spends long periods of time in the vicinity of ﬁxed points of (3), and describes the times at which transitions occur. To start with, we describe this family of ﬁxed points, using the notations introduced in Section 1.

Proposition 2 Assume (A1)-(A2). For all I {1, . . . , d}, there exists a unique θ 0 ﬁxed point of (3) with support {i {1, . . . , d} | θi > 0} equal to I. We denote this ﬁxed point as θ(I) . Its non-zero coordinates are (θ(I) )I = (M II) 1r I. There are thus 2d ﬁxed points of (3).

Incremental Learning in Diagonal Linear Networks

0.0 0.2 0.4 0.6 0.8

Figure 1: In dimension d = 2, we show the vector ﬁeld V (θ1, θ2) = (θ1(r1 A11θ1 A12θ2), θ2(r2 A21θ1 A22θ2)) associated to the DLN dynamics (3) (gray arrows), its ﬁxed points θ(I) for I {1, 2} (black crosses) and the trajectories of θ(ε)(t) for ε = 10 8 (blue) and ε = 10 20 (green). In this simulation, n = 3 and the data X Rn d, y Rn is generated randomly with i.i.d. standard Gaussian entries, conditionally on the event that Assumptions (A1) and (A2) hold. The initialization is θ(ε)(0) = (ε, ε).

The proof that (M II) 1 exists and the proof of the proposition are postponed to Appendix D. We give here a high-level intuition. For each i {1, . . . , d}, there are two ways of canceling out the right hand side of (3): either θi = 0 or ri P j Mijθj = 0. For the

ﬁxed point θ(I) , the set I {1, . . . , d} is the set of coordinates i such that θ(I) ,i = 0 and

ri P j Mijθ(I) ,j = 0; conversely for i / I, θ(I) ,i = 0. We say that the coordinates in I are the

active coordinates of θ(I) .

If no coordinate is active, we obtain the ﬁxed point θ( ) = 0 of (3). If all coordinates are active, θ({1,...,d}) = M 1r is the minimum of f, thus a ﬁxed point of both gradient ﬂows (2) and (3). In Figure 1, we provide an illustration in dimension d = 2. We show the vector ﬁeld deﬁned by the DLN dynamics (3), its 2d = 4 ﬁxed points enumerated above and the trajectories θε(t) for diﬀerent values of ε.

We are now in position to state our main theorem. Recall that for ε > 0, θ(ε) = θ(ε)(t) is the solution of the DLN dynamics (3) initialized from θ(ε)(0) = (C1εk1, . . . , Cdεkd), where C = (C1, . . . , Cd) > 0 and k = (k1, . . . , kd) > 0 are constants.

Theorem 1 Assume (A1)-(A2). For s > 0, deﬁne µ(s) as the unique minimizer of the regularized and constrained minimization problem

min. θ Rd, θ 0

and I(s) = {i {1, . . . , d} | µi(s) > 0} .

Then we have the following:

1. The minimizer µ(s) is nondecreasing in s, i.e., for all i {1, . . . , d}, µi(s) is nondecreasing in s. Consequently, I(s) is nondecreasing in s for the inclusion.

2. Denote s1, . . . , sq the points of discontinuity of the function s 7 I(s). For all s > 0, s = s1, . . . , sq,

θ(ε) s log 1

ε 0 θ(I(s)) .

Moreover, the convergence is uniform for s in compact subsets of R>0\{s1, . . . , sq}.

3. For all s > 0, 1 s log 1

0 dt θ(ε)(t) ε 0 µ(s) .

Moreover, the convergence is uniform for s in compact subsets of R>0.

This theorem is proved in Section 4. It states that θ(ε) s log 1

ε converges to a piecewise constant function, taking values at the ﬁxed points of the DLN dynamics (3). Moreover, the set I(s) of active coordinates of the limit is nondecreasing, showing that there are successive coordinate activations. We provide an illustration of these successive coordinate activations in Figure 2. Note that when a new coordinate is activated, all other coordinates are perturbed. Moreover, as ε decreases from 10 8 to 10 20, one can observe that θ(I(s)) is becoming a sharper approximation of θ(ε) s log 1

ε and f(θ(I(s)) ) is becoming a sharper approximation of f θ(ε) s log 1

ε . The set I(s) of active coordinates of the limit is obtained by solving a regularized and constrained version (4) of the original optimization problem (1). The non-active constraints at the optimum µ(s) correspond to the active coordinates of θ(I(s)) . The regularization term +1

s k, θ in (4) has a decreasing sparse regularizing eﬀect. The author did not ﬁnd a ﬁner high-level motivation to explain why I(s) should be deﬁned through (4); his insights come only from the proof of the theorem. However, Theorem 1.(3) states a second relation between the DLN dynamics (3) and the optimization problem (4):

the average 1 s log 1

ε 0 dt θ(ε)(t) of the trajectory converges to the minimizer µ(s) as

ε 0; said diﬀerently, the average of the trajectory computes the regularization path (Hastie et al., 2009, Section 3) of the regularized optimization problem (4).

Incremental Learning in Diagonal Linear Networks

0.0 0.5 1.0 1.5 2.0 2.5 3.0 s = t/log1

(I(s)) *, 1

(I(s)) *, 2

(I(s)) *, 3

(I(s)) *, 4

0.0 0.5 1.0 1.5 2.0 2.5 3.0 s = t/log1

f( ( )(t)) f( (I(s)) * )

(a) ε = 10 8

0.0 0.5 1.0 1.5 2.0 2.5 3.0 s = t/log1

0.0 0.5 1.0 1.5 2.0 2.5 3.0 s = t/log1

(b) ε = 10 20

Figure 2: Comparison between the coordinates θ(ε) i s log 1

ε and their asymptotic approxi-

mation θ(I(s)) ,i (upper plots) and between the losses f θ(ε) s log 1

ε and f(θ(I(s)) )

(lower plots). The simulations are run with ε = 10 8 (left plots) and ε = 10 20

(right plots). In this simulation, n = 5, d = 4 and the data X Rn d, y Rn is generated randomly with i.i.d. standard Gaussian entries, conditionally on the event that Assumptions (A1) and (A2) hold. The initialization is θ(ε)(0) = (ε, . . . , ε) and thus k = (1, . . . , 1).

Remark 3 In Theorem 1.(2), it is not possible to have uniform convergence in neighborhoods of s1, . . . , sq as the functions θ(ε) s log 1

ε are continuous while θ(I(s)) is discontinuous at s1, . . . , sq; a uniform convergence would contradict the Arzel a theorem (Dunford and Schwartz, 1988, Section IV.6).

The deﬁnition of the asymptotic process θ(I(s)) is rather complex as one has to solve an optimization problem. Nevertheless, in the following corollary, we show that it is still possible to deduce a simple expression for the convergence time of θ(ε)(t) to the minimizer θ({1,...,d}) of f.

Corollary 1 (convergence time to the minimizer) Assume (A1)-(A2). For all η > 0, denote τ (ε) η = inf n t 0 θ(ε)(t) θ({1,...,d}) η o

the hitting time of the ball centered around the minimizer θ({1,...,d}) of f and of radius η. Then for η small enough, τ (ε) η log 1

ε ε 0 max i {1,...,d} (M 1k)i (M 1r)i .

This corollary is proved in Section 5. Note that we describe the hitting time τ (ε) η only in the asymptotic limit ε 0, and in this limit, τ (ε) η / log 1

ε is independent of η (for η small enough). This surprising property is due to the fact that the limit trajectory reaches the global minimizer θ({1,...,d}) through a last jump of the iterates (see Figure 2) and the duration of this jump is negligible before log 1

3. Related Work

Diagonal linear networks (DLNs). Previous studies of DLNs show that their dynamics select sparse estimators in an overparametrized setting (Vaskevicius et al., 2019; Zhao et al., 2019; Woodworth et al., 2020; Hao Chen et al., 2021; Li et al., 2021; Azulay et al., 2021; Pesme et al., 2021; Pillaud-Vivien et al., 2022; Nacson et al., 2022; Chou et al., 2023). Our work diﬀers from previous studies in two ways: ﬁrst, we describe the limit of the full trajectory of the dynamics, but second, we are technically restricted to the underparametrized setting. Many studies consider the more general quadratic reparametrization θi = (u2 i v2 i )/4 or equivalently θi = uivi, which do not constrain θi to be non-negative, while our reparametrization θi = u2 i /4 does. However, under our Assumptions (A1) and (A2), the restriction to nonnegative regressors is benign. Heuristically, the regressors θi are nondecreasing (Lemma 1), thus they do not try to become negative. We thus claim that under the more general parametrization θi = (u2 i v2 i )/4, the variables vi would remain negligible and the results would be the same. When M = X X = Id, the DLNs dynamics (3) are separable across coordinates and can be solved using the logistic equation. In this special case, the activation of a coordinate does not aﬀect the other coordinates. Further, one can check that the coordinates are activated in the decreasing order of the loss decrease that they induce. In (Vaskevicius et al., 2019; Zhao et al., 2019; Li et al., 2021), a restricted isometry property or incoherence

Incremental Learning in Diagonal Linear Networks

property controls the deviation from this special case. On the contrary, in this paper, we do not make such an assumption and observe richer phenomena. In Figure 2, we observe each coordinate activation has a large inﬂuence on other active coordinates; moreover, the coordinate introduced last is the second largest coordinate of the optimum θ({1,...,4}) and induces the largest loss decrease. To the best of our knowledge, previous analyses of DLNs have focused on the case where the initializations θ(ε) i (0) = Ciεki of all coordinates have the same order of magnitude, i.e., k = (1, . . . , 1). In this paper, we generalize to k = (1, . . . , 1): this has the eﬀect of weighting the sparse regularizing term of (4). We believe that the techniques of this paper can be adapted to deeper DLNs, i.e., when θi ul i, l > 2. One would only need to assume additionally that k1 = = kd. In this case, the time rescaling to have a limiting trajectory would change from log 1/ε for l = 2 to ε2/l 1 for l > 2. Moreover, in this latter case, one observes that the eﬀective regularization in Theorem 1 depends on the constants C1, . . . , Cd (but is still linear in θ). This is observed by repeating the proof of Section 4, redeﬁning w(ε) i as (θi/ε)2/l 1. We have omitted this adaptation for simplicity. Finally, we note that when a ℓ2 penalization on u (or on u and v) is added to f, DLNs are related to iterative reweighted least-squares, a reparametrization of the Lasso problem appreciated for computational purposes, see (Bach et al., 2012, Section 5) or (Poon and Peyr e, 2021). However, in this paper, there is no explicit ℓ2 penalty on u and thus no explicit ℓ1 penality on θ.

Incremental learning. Incremental learning describes some learning curves observed in human and machine learning that are almost piecewise constant: they consist of stages where little progress is made, separated by sharp transitions. For instance, this phenomenon occurs in non-diagonal linear networks (Saxe et al., 2019; Gidel et al., 2019; Gissin et al., 2019; Arora et al., 2019; Chou et al., 2020; Li et al., 2020), in tensor decomposition (Ge et al., 2021; Razin et al., 2021, 2022; Hariz et al., 2022) and in shallow Re LU networks (Boursier et al., 2022). In general, obtaining a mathematical description of the process of the times of the transitions and the progress made is mathematically challenging. To the best of our knowledge, existing works obtain a rigorous and complete mathematical description only in separable cases where the learning dynamics can be separated into several onedimensional learning dynamics. For instance, Gissin et al. (2019) study DLNs but only in the special case M = Id. As a consequence, a major contribution of our work is to describe precisely some non-separable incremental learning dynamics.

Heteroclinic networks. From a dynamical systems perspective, the dynamics (3) form a heteroclinic network (Bakhtin, 2011): it has several ﬁxed points (also called saddle points) (θ(I) )I {1,...,d} connected by geodesics of the ﬂow. Such a dynamical system spends large amounts of time in the vicinity of ﬁxed points, with sharp transitions between them. In our case, this is closely related to incremental learning. For our dynamical system, we describe the sequence of visited ﬁxed points and the transition times. The paper of Jacot et al. (2021) attempted a similar study for linear networks; we prove rigorously such results in the special case of diagonal linear networks.

Lotka Volterra equations. To ﬁnish, we note that the quadratic system (3) of ordinary diﬀerential equations are Lotka Volterra (LV) equations (Hofbauer and Sigmund, 1998; Baigent,

2017). Traditionally, in mathematical biology, these equations represent the evolution the populations sizes θ1, . . . , θd of d interacting species. The parameter ri represents the intrinsic growth of population i while the parameter Mij represents the interaction between populations i and j. This point of view, and in particular the paper of Goh (1979), inspired the author to use the function (10) in the proof of Theorem 1. In general, our paper can be interpreted as a study of LV equations for cooperative and symmetric interactions from inﬁnitesimal initial population sizes. To the best of our knowledge, such a study did not exist in the literature on LV equations; its implications will be the subject of a forthcoming paper.

4. Proof of Theorem 1

In this proof, we use both time variables t and s, with t = s log 1

ε. As it is frequent in the literature on ordinary diﬀerential equations (ODEs), we abusively use the same notation for functions of t and s. For instance, by convention, θ(ε)(s) := θ(ε)(t) with t = s log 1

ε. In fact, we often drop the dependence on time. For instance, θ(ε) := θ(ε)(s) = θ(ε)(t). We start with a crude estimate of the trajectories θ(ε)(t) that is useful several times later in the proof.

Lemma 2 The trajectory θ(ε)(t) is bounded uniformly for ε (0, 1] and t R 0, i.e., there exists a constant B > 0 such that ε (0, 1], t R 0, θ(ε)(t) B.

Proof As Equation (3) is a (reparametrized) gradient ﬂow of f, f is a Lyapunov function, i.e.,

Thus for all ε (0, 1], for all t 0,

f(θ(ε)(t)) f(θ(ε)(0)) sup ε (0,1] f(θ(ε)(0)) . (5)

This supremum is ﬁnite as f is continuous and θ(ε)(0) is uniformly bounded for ε (0, 1]. Further, as M is positive deﬁnite (Proposition 1),

2 y 2 r, θ + 1

2 θ, Mθ as θ .

Thus the uniform bound (5) implies a uniform bound on θ(ε)(t) .

The central idea of the proof of Theorem 1 is to keep track of the size of the coordinates of θ(ε)(t), in order to be able to determine which coordinates of θ(ε)(t) are activated depending on time t. More precisely, deﬁne

w(ε) i = log θ(ε) i log ε . (6)

Incremental Learning in Diagonal Linear Networks

Equivalently, this gives θ(ε) i = εw(ε) i . This logarithmic transformation of the coordinates is particularly convenient because its time derivative is aﬃne in θ(ε):

dw(ε) i ds = dt

ds dw(ε) i dt = log 1

θ(ε) i log ε

dθ(ε) i dt = (3)

j=1 Mijθ(ε) j ri ,

or, using the vector notation w(ε) = (w(ε) 1 , . . . , w(ε) d ),

ds = Mθ(ε) r .

Our proof technique determines the limit of w(ε) as ε 0. The limit is described as the Lagrange multiplier of an optimization problem closely related to (4). We start with a brief reminder on duality in optimization.

Proposition 4 Let q, z Rd. The two following statements are equivalent:

1. z is the unique minimizer of the constrained optimization problem

min. θ Rd, θ 0

2 θ, Mθ . (7)

2. There exists w Rd such that (w, z) is the unique solution of

w = q + Mz ,

w 0 , z 0 , w z = 0 .

The four conditions above form a so-called linear complementarity problem (LCP), where q and M are the parameters and w and z are the variables.

The linear complementarity problem is widely studied; see for instance the monograph of Cottle et al. (2009) or Appendix A. In this connection with the quadratic programming problem (7), the variable w should be seen as the Lagrange multiplier associated to the constraint θ 0. The LCP expresses the Karush Kuhn Tucker (KKT) conditions for optimality to hold. More precisely, w = q + Mz is a condition of stationarity of the Lagrangian; w 0 and z 0 are respectively dual and primal feasibility conditions; and w z = 0 is a complementarity slackness condition. Put together, the conditions w 0, z 0 and w z = 0 impose that for all i {1, . . . , d}, either wi = 0 or zi = 0. Proposition 4 is classical; nevertheless we detail the appropriate references in Appendix E.

We are now in position to describe the asymptotic behavior of w(ε) i = log θ(ε) i log ε . Deﬁne

z(ε)(s) = Z s

0 du θ(ε)(u) .

Proposition 5 Let (w(s), z(s)) be the unique solution of the linear complementarity problem

w = k sr + Mz ,

w 0 , z 0 , w z = 0 . (8)

Then w(ε)(s) and z(ε)(s) converge respectively to w(s) and z(s) as ε 0, uniformly on compact subsets of R 0.

Proof In this proof, we deﬁne ε0 > 0 as in Lemma 1 and B > 0 as in Lemma 2. Further we take ε1 = min ε0, 1

2 and assume that ε ε1. Fix S > 0 and deﬁne the continuous functions

ϕ(ε) : s [0, S] 7 w(ε)(s), z(ε)(s) (Rd)2 ,

ϕ : s [0, S] 7 (w(s), z(s)) (Rd)2 .

We want to show that ϕ(ε) ϕ uniformly as ε 0. First, we use the Arzel a Ascoli theorem to check that the set {ϕ(ε), ε (0, ε1)} is relatively compact in the space of continuous functions from [0, S] to (Rd)2. The reader can consult the monograph of Dunford and Schwartz (1988, Section IV.6) for a reference on the Arzel a Ascoli theorem for real-valued functions; the multidimensional extension is straightforward.

For ε (0, ε1), s [0, S], we bound ϕ(ε)(s) 2 = w(ε)(s) 2 + z(ε)(s) 2. We bound the two terms separately.

From Lemmas 1 and 2,

Ciεki = θ(ε) i (0) θ(ε) i (s) B ,

thus (log ε < 0),

log ε w(ε) i = log θ(ε) i log ε log Ci

log ε + ki .

As ε 1/2, log ε is bounded away from 0. This shows that w(ε) i is bounded uniformly for ε (0, ε1] and s [0, S]. From Lemma 2,

z(ε)(s) Z s

0 du θ(ε)(u) s B SB .

The two points above show that ϕ(ε)(s) 2 is bounded uniformly for ε (0, ε1] and s [0, S].

The square norm of the derivative dϕ(ε)

= Mθ(ε)(s) r 2 + θ(ε)(s) 2

Incremental Learning in Diagonal Linear Networks

is bounded uniformly for ε (0, ε1] and s [0, S] by Lemma 2. Thus the set {ϕ(ε), ε (0, ε1)} is equicontinuous.

The two points above show that we can apply the Arzel a Ascoli theorem: {ϕ(ε), ε (0, ε1)} is relatively compact in the space of continuous functions from [0, S] to (Rd)2. To conclude on Proposition 5, it is then suﬃcient to show that the only subsequential uniform limit of ϕ(ε) as ε 0 is ϕ. Let ϕ = (w , z ) be a subsequential uniform limit of ϕ(ε) = (w(ε), z(ε)) as ε 0. There exists ε(n) (0, ε1] such that ε(n) 0 and ϕ(ε(n)) ϕ uniformly as n . Then w(ε(n)) w and z(ε(n)) z uniformly as n . We check that (w , z ) is a solution of the LCP (8).

w(ε(n))(s) k + sr Mz(ε(n))(s) = dw(ε(n))

ds (s) + r M dz(ε(n))

= Mθ(ε(n))(s) r + r Mθ(ε(n))(s)

Thus w(ε(n))(s) k + sr Mz(ε(n))(s) is constant in s, equal to its initial value w(ε(n))(0) k. Moreover, for all i,

w(ε(n)) i (0) ki = log Ciε(n)ki

log ε(n) ki = log Ci

log ε(n) n 0 .

Thus w(ε(n))(0) k = w(ε(n))(s) k + sr Mz(ε(n))(s) 0 as n . By identiﬁcation of the limit, we have w (s) k + sr Mz (s) = 0.

Second, using Lemma 2 and that log ε(n) < 0,

w(ε(n)) i (s) = log θ(ε(n)) i (s) log ε(n) log B log ε(n) ,

thus taking n , we obtain w 0.

Third, we have z(ε(n))(s) 0 trivially from the deﬁnition, and thus taking n , we obtain z 0.

Finally, w(ε(n))(s) z(ε(n))(s) X

w(ε(n)) i (s)z(ε(n)) i (s)

| log θ(ε(n)) i (s)| | log ε(n)|

0 du θ(ε(n)) i (u)

s | log ε(n)|

i | log θ(ε(n)) i (s)|θ(ε(n)) i (s) ,

where in this last inequality we use Lemma 1. The function x 7 | log x|x can be continuously extended in 0; it is thus bounded on [0, B]. Thus w(ε(n))(s) z(ε(n))(s) sd | log ε(n)| max x [0,B] | log x|x .

Taking n , we obtain w (s) z (s) = 0.

The four points above show that for all s [0, S], (w (s), z (s)) is a solution of the LCP (8). As the solution is unique (Proposition 6), ϕ = (w , z ) = (w, z) = ϕ. Thus ϕ is the unique subsequent limit of ϕ(ε) as ε 0. Thus ϕ(ε) ε 0 ϕ uniformly on [0, S].

We now show how the proof of Theorem 1 essentially follows from Proposition 5. Proof [Proof of Theorem 1] First, note that it is shown in Proposition 4 that there is a unique minimizer to (4). We continue by proving successively the three points of the theorem.

(1) The fact that µ(s) is nondecreasing follows from the connection with the LCP (Proposition 4) and the antitonicity property of the solution of the LCP (Proposition 7). We detail this argument.

Recall that µ(s) is the unique minimizer of

min. θ Rd, θ 0

sk r, θ + 1

Thus by Proposition 4, there exists v(s) Rd such that (v(s), µ(s)) is the unique solution of the LCP

sk r + Mµ ,

v 0 , µ 0 , v µ = 0 .

Let us make a side remark for later purposes. We rewrite the LCP above as

sv = k sr + M(sµ) ,

sv 0 , sµ 0 , (sv) (sµ) = 0 .

Thus (sv(s), sµ(s)) is a solution of the LCP (8), of which (w(s), z(s)) is also the solution. By unicity of the solution of the LCP (Proposition 6),

z(s) = sµ(s) . (9)

We now return to the proof of point (1) of the theorem. Consider s1 s2. Then (v(s1), µ(s1)) (resp. (v(s2), µ(s2))) is the unique solution of the LCP with parameters q(1) = 1

s1 k r (resp. q(2) = 1

s2 k r) and M. Note that as k 0, q(1) q(2). As M is symmetric positive deﬁnite (Proposition 1) with non-positive oﬀ-diagonal entries (Assumption (A2)), we apply the antitonicity property (Proposition 7) and obtain that µ(s1) µ(s2).

Incremental Learning in Diagonal Linear Networks

(3) Abusing again on the notation of the time indexations, we have

0 dt θ(ε)(t) = 1

0 du θ(ε)(u) = 1

By Proposition 5, z(ε)(s) converges to z(s) uniformly on compact subsets of R 0. Thus 1

sz(ε)(s) converges uniformly to 1

sz(s) on compact subsets of R>0. (Note that as the factor 1/s diverges at 0, we need to consider compact subsets bounded away from 0.) As 1

sz(s) = µ(s) (Equation (9)), this concludes point (3) of the theorem.

(2) Fix p {1, . . . , q, q + 1} and u, v R>0 such that sp 1 < u < v < sp (with the conventions s0 = 0, sq+1 = ). Let I be the constant value of I(s) for s (sp 1, sp). To prove point (2) of the theorem, it is suﬃcient to show θ(ε) s log 1

ε ε 0 θ(I) uniformly for s [u, v].

For θ Rd, deﬁne V (θ) = X

θi θ(I) ,i log θi . (10)

The function V is inspired from a Lyapunov function used in the study of Lotka Volterra equations (Goh, 1979). Here, we show that V is almost decreasing on the interval (sp 1, sp). We ﬁrst compute

d dt V (θ(ε)) = (3)

θ(ε) i θ(I) ,i ri Mθ(ε)

Using the deﬁnition of θ(I) in Proposition 2, we have

(Mθ(I) )I = M IIθ(I) ,I + M IIcθ(I) ,Ic = M IIM 1 II r I + M IIc0 = r I ,

thus we can rewrite (11) as

d dt V (θ(ε)) = X

θ(ε) i θ(I) ,i Mθ(I)

= D θ(ε) θ(I) , M(θ(ε) θ(I) ) E X

i/ I θ(ε) i Mθ(I)

Fix u , v such that sp 1 < u < u < v < v < sp. We integrate the equality above for s [u , v ]:

V (θ(ε)(v )) V (θ(ε)(u )) = Z v

ds V (θ(ε))

dt V (θ(ε))

u ds D θ(ε) θ(I) , M(θ(ε) θ(I) ) E

u ds θ(ε) i Mθ(I)

Denote λmin(M) the minimal eigenvalue of M. By Proposition 1, λmin(M) > 0.

u ds θ(ε) θ(I) 2 1 λmin(M)

u ds D θ(ε) θ(I) , M(θ(ε) θ(I) ) E

= 1 λmin(M)

V θ(ε)(u ) V θ(ε)(v )

u ds θ(ε) i Mθ(I)

Take ε0 > 0 as deﬁned in Lemma 1 and now assume ε min(1, ε0) so that both Lemmas 1 and 2 apply. Then we have the following estimates:

Fix i / I. From Lemma 2, there exists a constant C independent of ε such that

u ds θ(ε) i Mθ(I)

u ds θ(ε) i

0 ds θ(ε) i u 1

0 ds θ(ε) i

Using Theorem 1.(3), this last quantity converges as ε 0 to v µi(v ) u µi(u ), which is equal to 0 as i / I. We thus have

u ds θ(ε) i Mθ(I)

ε 0 0 . (13)

Further, if s = u or s = v ,

V (θ(ε)(s)) = X

θ(ε) i (s) θ(I) ,i log θ(ε) i (s)

i I θ(ε) i (s) + log 1

i I θ(I) ,i w(ε) i (s) (14)

by the deﬁnition of w(ε) i in Equation (6). The ﬁrst term P i I θ(ε) i (s) is bounded independently of ε by Lemma 2. Further, for all i I, µi(s) > 0 and thus by Equation (9), zi(s) > 0. By the complementary slackness of z(s) and w(s), we must have wi(s) = 0. As a consequence, using Proposition 5, X

i I θ(I) ,i w(ε) i (s) ε 0

i I θ(I) ,i wi(s) = 0 .

Returning to Equation (14), we obtain that

V (θ(ε)(u )) = o log 1

, V (θ(ε)(v )) = o log 1

Incremental Learning in Diagonal Linear Networks

Putting together Equations (12), (13) and (15), we obtain

u ds θ(ε) θ(I) 2 ε 0 0 . (16)

To conclude on uniform convergence, we use an elementary argument based on monotonicity:

max s [u,v]

θ(ε) i (s) θ(I) ,i = max s [u,v] max θ(ε) i (s) θ(I) ,i , θ(I) ,i θ(ε) i (s) . (17)

By Lemma 1, for all s [u, v] and s [v, v ], θ(ε) i (s) θ(I) ,i θ(ε) i (s ) θ(I) ,i , thus

θ(ε) i (s) θ(I) ,i 1 v v

v ds θ(ε) i (s ) θ(I) ,i .

Using H older s inequality, we obtain

θ(ε) i (s) θ(I) ,i 1

v ds θ(ε) i (s ) θ(I) ,i 2 !1/2

v ds θ(ε)(s ) θ(I) 2 !1/2 .

θ(I) ,i θ(ε) i (s) 1

u ds θ(ε)(s ) θ(I) 2 1/2 .

Finally, plugging these estimates back in Equation (17) and using Equation (16), we obtain

max s [u,v]

θ(ε) i (s) θ(I) ,i max

v ds θ(ε)(s ) θ(I) 2 !1/2 ,

u ds θ(ε)(s ) θ(I) 2 1/2 !

This being true for all i {1, . . . , d}, we conclude that θ(ε) converges to θ(I) uniformly on [u, v] and thus point (2) of the theorem holds.

5. Proof of Corollary 1

We ﬁrst study under which condition on s we have I(s) = {1, . . . , d}. By deﬁnition of I(s), this is equivalent to having µ(s) > 0 and thus I(s) = {1, . . . , d} if and only if f(θ)+ 1

s k, θ is minimized at a positive point. Note that

is minimized at M 1 r 1

I(s) = {1, . . . , d} M 1r 1

s > s , s := max i (M 1k)i (M 1r)i . (18)

Consider s > s . From Theorem 1.(2),

θ(ε) s log 1

ε 0 θ({1,...,d}) .

Take η > 0. Then there exists ε1 > 0 such that for all ε ε1, θ(ε) s log 1

θ({1,...,d})

Thus τ (ε) η s log 1

ε. This being true for all ε ε1, we have lim supε 0 τ (ε) η log 1/ε s. This

being true for all s > s , we have lim supε 0 τ (ε) η log 1/ε s .

We are left with showing that lim infε 0 τ (ε) η log 1/ε s . We assume that η < minj θ({1,...,d}) ,j ,

which is possible as from Proposition 2, θ({1,...,d}) > 0. Consider s < s . Then from (18), I(s) = {1, . . . , d} thus we can consider i / I(s). Assume further that s = s1, . . . , sq. Then by Theorem 1.(2),

ε 0 θ(I(s)) ,i = 0 .

Denote ν = minj θ({1,...,d}) ,j η > 0. There exists ε2 > 0 such that for all ε ε2,

Deﬁne ε0 as in Lemma 1 and assume ε min(ε0, ε2) so that both Lemma 1 and Equation (19) apply. Then for all t s log 1

θ(ε) i (t) (Lemma 1) θ(ε) i

< (Equation (19)) ν .

Incremental Learning in Diagonal Linear Networks

Thus θ(ε)(t) θ({1,...,d}) θ(ε) i (t) θ({1,...,d}) ,i θ({1,...,d}) ,i θ(ε) i (t)

> θ({1,...,d}) ,i ν min j θ({1,...,d}) ,j ν = η .

This being true for all t s log 1

ε, this gives τ (ε) η s log 1

ε. This being true for all ε

min(ε0, ε2), this gives lim infε 0 τ (ε) η log 1/ε s. This being true for all s < s , s = s1, . . . , sq,

this gives lim infε 0 τ (ε) η log 1/ε s .

We thus conclude that limε 0 τ (ε) η log 1/ε = s .

6. Conclusion

In this paper, we have shown how the implicit regularization of DLNs is generated by incremental learning with successive coordinate activations. We obtain a sharp description of the incremental learning process using an associated regularized optimization problem with decreasing regularization. An immediate open question is to obtain a similar description without the anti-correlation assumption (A2). This would cover the overparametrized setting. In this case, it should be necessary to parametrize θi = (u2 i v2 i )/4 (as in the article of Vaskevicius et al. (2019), for instance) so that the sign of θi is not constrained. Further, we leave open whether our strategy can be adapted to study incremental learning in matrix factorization problems and more general neural networks, as well as the statistical beneﬁts of the induced implicit regularization.

Acknowledgements

The author thanks Loucas Pillaud-Vivien for many discussions motivating this project.

Appendix A. Properties of the Linear Complementarity Problem

This section gathers a few properties of the linear complementarity problem (LCP) from the monograph of Cottle et al. (2009). Let q Rd and M Rd d. We recall that the LCP associated to the parameters q and M is the problem of ﬁnding (w, z) Rd 2 such that

w = q + Mz ,

w 0 , z 0 , w z = 0 . (20)

Proposition 6 Assume that M is symmetric positive deﬁnite. Then the LCP (20) has a unique solution.

This result is provided in the monograph of Cottle et al. (2009, Theorem 3.1.6) (actually without the symmetry requirement).

Proposition 7 (antitonicity property) Assume that M is symmetric positive deﬁnite, with non-positive oﬀ-diagonal entries. Consider q(1) q(2) and let (w1, z1), (w2, z2) be the unique solutions of (20) with q = q(1), q(2) respectively. Then z1 z2.

Proof This result is provided in the monograph of Cottle et al. (2009, Theorem 3.11.9). Indeed, the fact that M has non-positive oﬀ-diagonal entries means that M is a Zmatrix in the sense of Cottle et al. (2009, Deﬁnition 3.11.1). Further, M is a symmetric positive deﬁnite matrix, thus M is a P -matrix in the sense of Cottle et al. (2009, Section 3.3). Thus M is a K-matrix in the sense of Cottle et al. (2009, Deﬁnition 3.11.1). Thus Theorem 3.11.9 of Cottle et al. (2009) applies.

Appendix B. Proof of Lemma 1

The DLN dynamics (3) form an autonomous ordinary diﬀerential equation (ODE)

dt = Ψ(θ) , (Ψ(θ))i = θi

We ﬁrst show that the set

i {1, . . . , d}, ri

j=1 Mijθj 0

is positively invariant for this ODE, i.e., if θ(0) Q, then θ(t) Q for all t 0. The proof is based on Nagumo s theorem, see the original result of Nagumo (1942) or the recent introduction of Blanchini and Miani (2008, Section 4.2) for instance. Heuristically, Nagumo s theorem states that Q is positively invariant if the vector ﬁeld Ψ(θ) points in the set Q if θ is on the boundary of Q.

Incremental Learning in Diagonal Linear Networks

More precisely, for θ Q, denote

i {1, . . . , d}

j Mijθj = 0

the set of active constraints at θ and

i Act(θ), X

the tangent cone to Q at θ (Blanchini and Miani, 2008, Eq. (4.6)). Then Nagumo s theorem (Blanchini and Miani, 2008, Corollary 4.8) states that Q is positively invariant for the dynamics (21) if for all θ Q, Ψ(θ) TQ(θ). We now check that this latter condition is satisﬁed. Let θ Q and i Act(θ). We need to show that P j Mij (Ψ(θ))j 0. We have

j Mij (Ψ(θ))j = X

For the ﬁrst term, we have i Act(θ) and thus ri P k Mikθk = 0. For the sum, we have Mij 0 (Assumption (A2)), θj 0 and rj P k Mjkθk 0 (because θ Q). Thus we indeed have P j Mij (Ψ(θ))j 0 and thus Ψ(θ) TQ(θ). We conclude that Q is positively invariant. As θ(ε)(0) = (C1εk1, . . . , Cdεkd) 0 as ε 0 and r > 0, there exists ε0 > 0 such that ε (0, ε0], θ(ε)(0) Q. Thus ε (0, ε0], t 0, θ(ε)(t) Q. Thus ε (0, ε0], t 0, i {1, . . . , d},

Appendix C. Proof of Proposition 1

Consider the block matrix

f M = X y X y =

From Assumptions (A1)-(A2), f M is a matrix with non-positive oﬀ-diagonal entries. Thus there exists µ R such that A = µId+1 f M is a matrix with non-negative entries. Moreover, from Assumption (A1), for i {1, . . . , d}, Ai,d+1 = Ad+1,i = ri > 0. This implies that A is irreducible (see Cottle et al., 2009, Section 2.2 for a deﬁnition). By the Perron Frobenius theorem (Cottle et al., 2009, Theorem 2.2.21), the largest eigenvalue of A is

simple and there exists an eigenvector ev with positive entries associated to this eigenvalue. As a consequence, the smallest eigenvalue λ of f M = µId+1 A is simple and associated to ev. We now have two cases:

If the smallest eigenvalue λ is positive, then f M is positive deﬁnite and thus so is the principal submatrix M.

If λ = 0, then ker f M = {αev, α R}. We want to show that ker M = {0}. Consider v Rd such that Mv = X Xv = 0. This implies that Xv = 0. (This can be seen, for instance, using a singular value decomposition of X.) Then we perform the block computation

ker f M = {αev, α R}. As ev > 0, elements of ker f M have all entries

non-zero or all entries equal to 0. Thus it must be that v = 0. This concludes that ker M = {0} and thus that M is positive deﬁnite.

Appendix D. Proof of Proposition 2

Let I {1, . . . , d}. We ﬁrst prove that (M II) 1 exists. The matrix M II has non-positive oﬀ-diagonal entries thus M II is a Z-matrix in the sense of Cottle et al. (2009, Deﬁnition 3.11.1). Further, M II is a symmetric positive deﬁnite matrix as a principal submatrix of a positive deﬁnite matrix (Proposition 1). Thus M II is a P -matrix in the sense of Cottle et al. (2009, Section 3.3). Thus M II is a K-matrix in the sense of Cottle et al. (2009, Deﬁnition 3.11.1). From Theorem 3.11.10 of Cottle et al. (2009), (M II) 1 exists and has non-negative entries. Thus, we can deﬁne θ(I) Rd by the equations (θ(I) )I = (M II) 1r I and (θ(I) )Ic = 0. We check that (θ(I) )I > 0. Fix i I. Then θ(I) ,i = P j I (M II) 1

ij rj 0 from Assumption (A1) and the fact that (M II) 1 has non-negative entries. Moreover, assume by contradiction that θ(I) ,i = 0. As from Assumption (A1), r > 0, we have for all j I, (M II) 1

ij = 0. Thus a full row of (M II) 1 is 0, which contradicts the fact that (M II) 1

is invertible. Thus for all i I, θ(I) ,i > 0.

We now check that θ(I) is a ﬁxed point of (3). Fix i {1, . . . , d}. If i I,

j=1 Mijθ(I) ,j = ri X

j I Mij (M II) 1r I

j = r I M II(M II) 1r I

Incremental Learning in Diagonal Linear Networks

If i / I, by deﬁnition, θ(I) ,i = 0. In both cases, θ(I) ,i ri Pd j=1 Mijθ(I) ,j = 0. As this is true

for all i {1, . . . , d}, this proves that θ(I) is a ﬁxed point of (3). We now take a ﬁxed point θ of (3) with support I, and show that θ = θ(I) . It is suﬃcient to show the equality on the support of the vectors, namely that θI = θ(I)

I = (M II) 1r I.

Consider i I. As θ is a ﬁxed point, θi ri Pd j=1 Mijθj = 0. But as i I, the ﬁrst

factor is non-zero. Thus ri Pd j=1 Mijθj = ri P j I Mijθj = 0. With vector notation, we proved r I M IIθI = 0, which gives the claimed equality.

Appendix E. Proof of Proposition 4

In this proof, we use convex duality (Boyd and Vandenberghe, 2004, Section 5.5). The Lagrangian associated to (7) is

L(θ, w) = q, θ + 1

2 θ, Mθ w, θ

where w Rd is the Lagrange multiplier associated to the constraint θ 0. As the optimization problem (7) is convex, the KKT conditions are necessary and suﬃcient for optimality. The stationarity condition is

0 = θL(θ, w) = q + Mθ w ,

the feasibility conditions are θ 0 and w 0, and the complementary slackness condition is w θ = 0. At this point, we have proven the equivalence between:

1. z is a minimizer of the constrained optimization problem

min. θ Rd, θ 0

2. There exists w Rd such that (w, z) is a solution of

w = q + Mz ,

w 0 , z 0 , w z = 0 .

We are left with proving that the solutions of both problems are indeed unique. For the LCP, this is given by Proposition 6 as M is positive deﬁnite (Proposition 1). We can then use the equivalence shown above to prove that the constrained optimization problem (7) has a unique solution.

Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.

Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: Beyond inﬁnitesimal mirror descent. In International Conference on Machine Learning, pages 468 477. PMLR, 2021.

Francis Bach, Rodolphe Jenatton, Julien Mairal, and Guillaume Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends R in Machine Learning, 4(1): 1 106, 2012.

Stephen Baigent. Lotka Volterra dynamical systems. In Dynamical and Complex Systems, pages 157 188. World Scientiﬁc, 2017.

Yuri Bakhtin. Noisy heteroclinic networks. Probability Theory and Related Fields, 150(1): 1 42, 2011.

Peter Bartlett, Andrea Montanari, and Alexander Rakhlin. Deep learning: a statistical viewpoint. Acta Numerica, 30:87 201, 2021.

Franco Blanchini and Stefano Miani. Set-Theoretic Methods in Control. Springer, 2008.

Etienne Boursier, Loucas Pillaud-Viven, and Nicolas Flammarion. Gradient ﬂow dynamics of shallow Re LU networks for square loss and orthogonal inputs. In Advances in Neural Information Processing Systems, volume 35, pages 20105 20118, 2022.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305 1338. PMLR, 2020.

Hung-Hsu Chou, Carsten Gieshoﬀ, Johannes Maly, and Holger Rauhut. Gradient descent for deep matrix factorization: Dynamics and implicit bias towards low rank. ar Xiv preprint ar Xiv:2011.13772, 2020.

Hung-Hsu Chou, Johannes Maly, and Holger Rauhut. More is less: inducing sparsity via overparameterization. Information and Inference: A Journal of the IMA, 12(3), 2023.

Richard Cottle, Jong-Shi Pang, and Richard Stone. The Linear Complementarity Problem. SIAM, 2009.

Nelson Dunford and Jacob Schwartz. Linear Operators, Part 1: General Theory, volume 10. John Wiley & Sons, 1988.

Rong Ge, Yunwei Ren, Xiang Wang, and Mo Zhou. Understanding deﬂation process in overparametrized tensor decomposition. Advances in Neural Information Processing Systems, 34:1299 1311, 2021.

Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Incremental Learning in Diagonal Linear Networks

Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. In International Conference on Learning Representations, 2019.

Bean-San Goh. Stability in models of mutualism. The American Naturalist, 113(2):261 275, 1979.

Suriya Gunasekar, Blake Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. Advances in Neural Information Processing Systems, 30, 2017.

JeﬀHao Chen, Colin Wei, Jason Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315 2357. PMLR, 2021.

Kais Hariz, Hachem Kadri, St ephane Ayache, Maher Moakher, and Thierry Arti eres. Implicit regularization with polynomial growth in deep tensor factorization. In International Conference on Machine Learning, pages 8484 8501. PMLR, 2022.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.

Josef Hofbauer and Karl Sigmund. Evolutionary Games and Population Dynamics. Cambridge University Press, 1998.

Arthur Jacot, Fran cois Ged, Berﬁn S im sek, Cl ement Hongler, and Franck Gabriel. Saddleto-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. ar Xiv preprint ar Xiv:2106.15933, 2021.

Yann Le Cun, Yoshua Bengio, and Geoﬀrey Hinton. Deep learning. Nature, 521(7553): 436 444, 2015.

Jiangyuan Li, Thanh Nguyen, Chinmay Hegde, and Ka Wai Wong. Implicit sparse regularization: The impact of depth and early stopping. Advances in Neural Information Processing Systems, 34:28298 28309, 2021.

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Conference on Learning Theory, pages 2 47. PMLR, 2018.

Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2020.

Mor Shpigel Nacson, Kavya Ravichandran, Nathan Srebro, and Daniel Soudry. Implicit bias of the step size in linear diagonal neural networks. In International Conference on Machine Learning, pages 16270 16295. PMLR, 2022.

Mitio Nagumo. Uber die lage der integralkurven gew ohnlicher diﬀerentialgleichungen. Proceedings of the Physico-Mathematical Society of Japan. 3rd Series, 24:551 559, 1942.

Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable beneﬁt of stochasticity. Advances in Neural Information Processing Systems, 34:29218 29230, 2021.

Loucas Pillaud-Vivien, Julien Reygner, and Nicolas Flammarion. Label noise (stochastic) gradient descent implicitly solves the Lasso for quadratic parametrisation. In Conference on Learning Theory, pages 2127 2159. PMLR, 2022.

Clarice Poon and Gabriel Peyr e. Smooth bilevel programming for sparse regularization. Advances in Neural Information Processing Systems, 34:1543 1555, 2021.

Noam Razin, Asaf Maman, and Nadav Cohen. Implicit regularization in tensor factorization. In International Conference on Machine Learning, pages 8913 8924. PMLR, 2021.

Noam Razin, Asaf Maman, and Nadav Cohen. Implicit regularization in hierarchical tensor factorization and deep convolutional neural networks. In International Conference on Machine Learning, pages 18422 18462. PMLR, 2022.

Andrew Saxe, James Mc Clelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537 11546, 2019.

Daniel Soudry, Elad Hoﬀer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822 2878, 2018.

Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. Advances in Neural Information Processing Systems, 32, 2019.

Blake Woodworth, Suriya Gunasekar, Jason Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635 3673. PMLR, 2020.

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021.

Peng Zhao, Yun Yang, and Qiao-Chu He. Implicit regularization via Hadamard product over-parametrization in high-dimensional linear regression. ar Xiv preprint ar Xiv:1903.09367, 2019.