# Saddle-to-Saddle Dynamics in Diagonal Linear Networks

Scott Pesme (EPFL, scott.pesme@epfl.ch) and Nicolas Flammarion (EPFL, nicolas.flammarion@epfl.ch)

Abstract. In this paper we fully describe the trajectory of gradient flow over 2-layer diagonal linear networks for the regression setting in the limit of vanishing initialisation. We show that the limiting flow successively jumps from a saddle of the training loss to another until reaching the minimum ℓ1-norm solution. We explicitly characterise the visited saddles as well as the jump times through a recursive algorithm reminiscent of the LARS algorithm used for computing the Lasso path. Starting from the zero vector, coordinates are successively activated until the minimum ℓ1-norm solution is recovered, revealing an incremental learning behaviour. Our proof leverages a convenient arc-length time-reparametrisation which enables us to keep track of the transitions between the jumps. Our analysis requires negligible assumptions on the data, applies to both under- and overparametrised settings and covers complex cases where the number of active coordinates is not monotone. We provide numerical experiments to support our findings.

1 Introduction

Strikingly simple algorithms such as gradient descent are driving forces for deep learning and have led to remarkable empirical results. Nonetheless, understanding the performance of such methods remains a challenging and exciting mystery: (i) their global convergence on highly non-convex losses is far from trivial and (ii) the fact that they lead to solutions which generalise well [53] is still not fully understood. To explain this second point, a major line of work has focused on the concept of implicit regularisation: amongst the infinite space of zero-loss solutions, the optimisation process must be implicitly biased towards solutions which have good generalisation properties for the considered real-world prediction tasks. Many papers have therefore shown that gradient methods have the fortunate property of asymptotically leading to solutions which have a well-behaving structure [38, 24, 16]. Aside from these results, which mostly focus on characterising the asymptotic solution, a slightly different point of view has been to describe the full trajectory. Indeed it has been experimentally observed that gradient methods with small initialisations have the property of learning models of increasing complexity across the training of neural networks [29]. This behaviour is usually referred to as incremental learning or as a saddle-to-saddle process and describes learning curves which are piecewise constant: the training process makes very little progress for some time, followed by a sharp transition where a new feature is suddenly learned. In terms of optimisation trajectory, this corresponds to the iterates "jumping" from a saddle of the training loss to another. Several settings exhibiting such dynamics for small initialisation have been considered: matrix and tensor factorisation [44, 27], simplified versions of diagonal linear networks [23, 7], linear networks [22, 45, 26], 2-layer neural networks with orthogonal inputs [10], learning leap functions with 2-layer neural networks [1] and matrix sensing [2, 33, 28].
Figure 1: Gradient flow (β^α_t)_t with small initialisation scale α over a 2-layer diagonal linear network (for the precise experimental setting, see Appendix A). Left: the training loss across time; the learning curve is piecewise constant. Middle: the magnitudes of the coordinates across time; the process is piecewise constant and one coordinate de-activates. Right: in the R^3 space in which the iterates evolve (the remaining coordinates stay at 0), the iterates jump from a saddle of the training loss to another until reaching the global minimum β*. The jump times t_i as well as the visited saddles β_i are entirely predicted by our theory.

However, all these results require restrictive assumptions on the data or only characterise the first jump. Obtaining a complete picture of the saddle-to-saddle process by describing all the visited saddles and jump times is mathematically challenging and still missing. We intend to fill this gap by considering diagonal linear networks, which are simplified neural networks that have received significant attention lately [50, 48, 25, 43, 20] as they are ideal proxy models for gaining a deeper understanding of complex phenomena such as saddle-to-saddle dynamics.

1.1 Informal statement of the main result

In this paper, we provide a full description of the trajectory of gradient flow over 2-layer diagonal linear networks in the limit of vanishing initialisation. The main result is informally presented here.

Theorem 1 (Main result, informal). In the regression setting and in the limit of vanishing initialisation, the trajectory of gradient flow over a 2-layer diagonal linear network converges towards a limiting process which is piecewise constant: the iterates successively jump from a saddle of the training loss to another, and each visited saddle and jump time can be computed recursively through an algorithm (Algorithm 1) reminiscent of the LARS algorithm for the Lasso.

The incremental learning stems from the particular structure of the saddles as they correspond to minimisers of the training loss with a constraint on the set of non-zero coordinates. The saddles therefore correspond to sparse vectors which partially fit the dataset. For simple datasets, a consequence of our main result is that the limiting trajectory starts from the zero vector and successively learns the support of the sparse ground-truth vector until reaching it. However, we make minimal assumptions on the data and our analysis also holds for complex datasets. In that case, the successive active sets are not necessarily increasing in size and coordinates can deactivate as well as activate until reaching the minimum ℓ1-norm solution (see Figure 1 (middle) for an example of a deactivating coordinate). The regression setting and the diagonal network architecture are introduced in Section 2. Section 3 provides an intuitive construction of the limiting saddle-to-saddle dynamics and presents the algorithm that characterises it.
Our main result regarding the convergence of the iterates towards this process is presented in Section 4 and further discussion is provided in Section 5.

2 Problem setup and leveraging the mirror structure

Linear regression. We study a linear regression problem with inputs (x_1, . . . , x_n) ∈ (R^d)^n and outputs (y_1, . . . , y_n) ∈ R^n. We consider the typical quadratic loss:

$$L(\beta) = \frac{1}{2n}\sum_{i=1}^n \big(\langle \beta, x_i\rangle - y_i\big)^2. \qquad (1)$$

We make no assumption on the number of samples n nor on the dimension d. The only assumption we make on the data throughout the paper is that the inputs (x_1, . . . , x_n) are in general position. In order to state this assumption, let X ∈ R^{n×d} be the feature matrix whose i-th row is x_i and let x̃_j ∈ R^n be its j-th column for j ∈ [d].

Assumption 1 (General position). For any k ≤ min(n, d) and arbitrary signs σ_1, . . . , σ_k ∈ {−1, 1}, the affine span of any k points σ_1 x̃_{j_1}, . . . , σ_k x̃_{j_k} does not contain any element of the set {± x̃_j , j ∉ {j_1, . . . , j_k}}.

This assumption is slightly technical but is standard in the Lasso literature [47]. Note that it is not restrictive as it is almost surely satisfied when the data is drawn from a continuous probability distribution [47, Lemma 4]. Letting S = arg min_β L(β) denote the affine space of solutions, Assumption 1 ensures that the minimisation problem min_{β ∈ S} ‖β‖_1 has a unique minimiser which we denote β*_{ℓ1} and which corresponds to the minimum ℓ1-norm solution.

2-layer diagonal linear network. In an effort to understand the training dynamics of neural networks, we consider a 2-layer diagonal linear network which corresponds to writing the regression vector β as

$$\beta_w = u \odot v \quad \text{where } w = (u, v) \in \mathbb{R}^{2d}. \qquad (2)$$

This parametrisation can be interpreted as a simple neural network x ↦ ⟨u, σ(diag(v) x)⟩ where u are the output weights, the diagonal matrix diag(v) represents the inner weights, and the activation σ is the identity function. We refer to w = (u, v) ∈ R^{2d} as the weights and to β := u ⊙ v ∈ R^d as the prediction parameter. With the parametrisation (2), the loss function F over the parameters w = (u, v) ∈ R^{2d} is defined as:

$$F(w) := L(u \odot v) = \frac{1}{2n}\sum_{i=1}^n \big(\langle u \odot v, x_i\rangle - y_i\big)^2. \qquad (3)$$

Though this reparametrisation is simple, the associated optimisation problem is non-convex and highly non-trivial training dynamics already occur. The critical points of the function F exhibit a very particular structure, as highlighted in the following proposition proven in Appendix B.

Proposition 1. All the critical points w_c of F which are not global minima, i.e., ∇F(w_c) = 0 and F(w_c) > min_w F(w), are necessarily saddle points (i.e., not local extrema). They map to parameters β_c = u_c ⊙ v_c which satisfy |β_c| ⊙ ∇L(β_c) = 0 and:

$$\beta_c \in \operatorname*{arg\,min}_{\beta[i]=0 \text{ for } i \notin \mathrm{supp}(\beta_c)} L(\beta), \qquad (4)$$

where supp(β_c) = {i ∈ [d], β_c[i] ≠ 0} corresponds to the support of β_c.

The optimisation problem in Eq. (4) states that the saddle points of the training loss F correspond to sparse vectors that minimise the loss function L over their non-zero coordinates. This property already shows that the saddle points possess interesting properties from a learning perspective. In the following we loosely use the term saddle to refer to points β_c ∈ R^d solving Eq. (4), even though they are not saddle points of the convex loss function L. We adopt this terminology because they correspond to points w_c ∈ R^{2d} which are indeed saddles of the non-convex loss F.

Gradient flow and necessity of accelerating time. We minimise the loss F using gradient flow:

$$dw_t = -\nabla F(w_t)\, dt, \qquad (5)$$

initialised at u_0 = √2 α 1 ∈ R^d_{>0} with α > 0, and v_0 = 0 ∈ R^d. This initialisation results in β_0 = 0 ∈ R^d independently of the chosen weight initialisation scale α.
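To make the parametrisation and the flow (5) concrete, the following is a minimal illustrative sketch (not the implementation used for the paper's experiments): an explicit Euler discretisation of dw_t = −∇F(w_t) dt for the quadratic loss (1). The step size, number of steps, initialisation scale and toy data are illustrative choices.

```python
import numpy as np

def simulate_diagonal_network_flow(X, y, alpha=1e-6, step=5e-4, n_steps=400_000):
    """Explicit Euler discretisation of the gradient flow (5) on F(w) = L(u * v).

    Initialisation u_0 = sqrt(2) * alpha * 1 and v_0 = 0, so that beta_0 = 0.
    Step size and number of steps are illustrative, not tuned.
    """
    n, d = X.shape
    u = np.sqrt(2) * alpha * np.ones(d)
    v = np.zeros(d)
    trajectory = []
    for _ in range(n_steps):
        grad_L = X.T @ (X @ (u * v) - y) / n        # gradient of L at beta = u * v
        # chain rule: dF/du = grad_L * v and dF/dv = grad_L * u
        u, v = u - step * grad_L * v, v - step * grad_L * u
        trajectory.append(u * v)
    return np.array(trajectory)

# toy usage (illustrative data, not the paper's experimental setting)
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 7))
beta_star = np.array([10.0, -20.0, 0.0, 0.0, 0.0, 0.0, 0.0])
trajectory = simulate_diagonal_network_flow(X, X @ beta_star)
print(np.round(trajectory[-1], 2))   # expected close to the min l1-norm interpolator for small alpha
```

For small α the stored trajectory exhibits the long plateaus and sudden jumps discussed in the remainder of the paper.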
We denote β^α_t := u^α_t ⊙ v^α_t the prediction iterates generated from the gradient flow, to highlight their dependency on the initialisation scale α.¹ The origin 0 ∈ R^{2d} is a critical point of the function F and taking the initialisation α → 0 therefore arbitrarily slows down the dynamics. In fact, it can easily be shown, for any fixed time t, that (u^α_t, v^α_t) → 0 as α → 0, indicating that the iterates are stuck at the origin. Therefore, if we restrict ourselves to a finite-time analysis, there is no hope of exhibiting the observed saddle-to-saddle behaviour. To do so, we must find an appropriate bijection t_α of R_{≥0} which accelerates time (i.e. t_α(t) → +∞ as α → 0 for all t) and consider the accelerated iterates β^α_{t_α(t)}, which can escape the saddles. Finding this bijection becomes very natural once the mirror structure is unveiled.

¹We point out that the trajectory of β^α_t exactly matches that of another common parametrisation β_w := ½(w_+² − w_−²), with initialisation w_{+,0} = w_{−,0} = α 1.

2.2 Leveraging the mirror flow structure

While the iterates (w^α_t)_t follow a gradient flow on the non-convex loss F, it is shown in [5] that the iterates β^α_t follow a mirror flow on the convex loss L with potential ϕ_α and initialisation β^α_{t=0} = 0:

$$d\nabla\phi_\alpha(\beta^\alpha_t) = -\nabla L(\beta^\alpha_t)\, dt, \qquad (6)$$

where ϕ_α is the hyperbolic entropy function [21] defined as:

$$\phi_\alpha(\beta) = \frac{1}{2}\sum_{i=1}^d \Big( \beta_i \,\mathrm{arcsinh}\big(\tfrac{\beta_i}{\alpha^2}\big) - \sqrt{\beta_i^2 + \alpha^4} + \alpha^2 \Big). \qquad (7)$$

Unveiling the mirror flow structure enables us to leverage convex optimisation tools to prove convergence of the iterates to a global minimiser β*_α, as well as a simple proof of the implicit regularisation problem it solves. As shown by Woodworth et al. [50], in the overparametrised setting where d > n and where there exists an infinite number of global minima, the limit β*_α is the solution of the problem:

$$\beta^\star_\alpha = \operatorname*{arg\,min}_{\beta : \ y_i = \langle x_i, \beta\rangle \ \forall i} \phi_\alpha(\beta). \qquad (8)$$

Furthermore, a simple function analysis shows that ϕ_α behaves as a rescaled ℓ1-norm as α goes to 0, meaning that the recovered solution β*_α converges to the minimum ℓ1-norm solution β*_{ℓ1} = arg min_{y_i = ⟨x_i, β⟩ ∀i} ‖β‖_1 as α goes to 0 (see [49] for a precise rate). To bring to light the saddle-to-saddle dynamics which occur as we take the initialisation to 0, we make substantial use of the nice mirror structure from Eq. (6).

Appropriate time rescaling. To understand the limiting dynamics of β^α_t, it is natural to consider the limit α → 0 in Eq. (6). However, the potential ϕ_α is such that ϕ_α(β) ∼ ln(1/α) ‖β‖_1 for small α and therefore degenerates as α → 0. Similarly, for β ≠ 0, ∇ϕ_α(β) diverges as α → 0. The formulation from Eq. (6) is thus not appropriate to take the limit α → 0. We can nonetheless obtain a meaningful limit by considering the opportune time acceleration t_α(t) = ln(1/α) · t and looking at the accelerated iterates

$$\tilde\beta^\alpha_t := \beta^\alpha_{t_\alpha(t)} = \beta^\alpha_{\ln(1/\alpha)\, t}. \qquad (9)$$

Indeed, a simple chain rule leads to the accelerated mirror flow d∇ϕ_α(β̃^α_t) = − ln(1/α) ∇L(β̃^α_t) dt. The accelerated iterates (β̃^α_t)_t therefore follow a mirror flow with a rescaled potential:

$$d\nabla\tilde\phi_\alpha(\tilde\beta^\alpha_t) = -\nabla L(\tilde\beta^\alpha_t)\, dt, \quad \text{where } \tilde\phi_\alpha := \tfrac{1}{\ln(1/\alpha)}\, \phi_\alpha, \qquad (10)$$

with β̃^α_{t=0} = 0 and where ϕ_α is defined in Eq. (7). Our choice of time acceleration ensures that the rescaled potential ϕ̃_α is non-degenerate as the initialisation goes to 0, since ϕ̃_α(β) → ‖β‖_1 as α → 0.
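The limit ϕ̃_α(β) → ‖β‖_1 can be checked numerically. The sketch below is purely illustrative (not from the paper) and assumes the form of ϕ_α given in Eq. (7) above, including its ½ prefactor; the test vector and the values of α are arbitrary choices.

```python
import numpy as np

def hyperbolic_entropy(beta, alpha):
    """phi_alpha(beta) as written in Eq. (7) above (prefactor 1/2 assumed)."""
    beta = np.asarray(beta, dtype=float)
    return 0.5 * np.sum(beta * np.arcsinh(beta / alpha**2)
                        - np.sqrt(beta**2 + alpha**4) + alpha**2)

def rescaled_potential(beta, alpha):
    """tilde(phi)_alpha = phi_alpha / ln(1/alpha), as in Eq. (10)."""
    return hyperbolic_entropy(beta, alpha) / np.log(1.0 / alpha)

beta = np.array([1.5, -0.3, 0.0, 2.0])
for alpha in [1e-2, 1e-5, 1e-20, 1e-60]:
    # the rescaled potential approaches the l1-norm, here 3.8, as alpha -> 0
    print(alpha, rescaled_potential(beta, alpha), np.abs(beta).sum())
```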
3 Intuitive construction of the limiting flow and saddle-to-saddle algorithm

In this section, we aim to give a comprehensible construction of the limiting flow. We therefore choose to provide intuition over pure rigour, and defer the full and rigorous proof to Appendix E. The technical crux of our analysis is to demonstrate the existence of a piecewise constant limiting process towards which the iterates β̃^α converge. The convergence result is deferred to the following Section 4. In this section we assume this convergence and refer to this piecewise constant limiting process as (β̃⁰_t)_t. Our goal is then to determine the jump times (t_1, . . . , t_p) as well as the saddles (β_0, . . . , β_p) which fully define this process. To do so, it is natural to examine the limiting equation obtained when taking the limit α → 0 in Eq. (10). We first turn to its integral form, which writes:

$$-\int_0^t \nabla L(\tilde\beta^\alpha_s)\, ds = \nabla\tilde\phi_\alpha(\tilde\beta^\alpha_t). \qquad (11)$$

Provided the convergence of the flow β̃^α towards β̃⁰, the left-hand side of the previous equation converges to −∫_0^t ∇L(β̃⁰_s) ds. For the right-hand side, recall that ϕ̃_α(β) → ‖β‖_1 as α → 0; it is therefore natural to expect the right-hand side of Eq. (11) to converge towards an element of ∂‖β̃⁰_t‖_1, where we recall the definition of the subderivative of the ℓ1-norm, coordinate-wise:

$$\partial |\beta| = \{1\} \ \text{if } \beta > 0, \quad \{-1\} \ \text{if } \beta < 0, \quad [-1, 1] \ \text{if } \beta = 0.$$

The arising key equation which the limiting process β̃⁰ must satisfy is then, for all t ≥ 0:

$$-\int_0^t \nabla L(\tilde\beta^0_s)\, ds \in \partial \|\tilde\beta^0_t\|_1. \qquad (12)$$

We show that this equation uniquely determines the piecewise constant process β̃⁰ by imposing the number of jumps p, the jump times, as well as the saddles which are visited between the jumps. Indeed the relation described in Eq. (12) provides 4 restrictive properties which enable us to construct β̃⁰. To state them, let s_t = −∫_0^t ∇L(β̃⁰_s) ds and notice that it is continuous and piecewise linear since β̃⁰ is piecewise constant. For each coordinate i ∈ [d], it holds that:

(K1) s_t[i] ∈ [−1, 1]
(K2) s_t[i] = 1 ⟹ β̃⁰_t[i] ≥ 0 and s_t[i] = −1 ⟹ β̃⁰_t[i] ≤ 0
(K3) s_t[i] ∈ (−1, 1) ⟹ β̃⁰_t[i] = 0
(K4) β̃⁰_t[i] > 0 ⟹ s_t[i] = 1 and β̃⁰_t[i] < 0 ⟹ s_t[i] = −1

To understand how these conditions lead to the algorithm which determines the jump times and the visited saddles, we present a 2-dimensional example for which we can walk through each step. The general case then naturally follows from this simple example.

3.1 Construction of the saddle-to-saddle algorithm with an illustrative 2d example

Let us consider n = d = 2 and a data matrix X ∈ R^{2×2} such that X⊤X = ((1, 0.2), (0.2, 0.2)). We consider β* = (−0.2, 2) ∈ R^2 and outputs y = Xβ*. This setting is such that the loss L has β* as its unique minimum and L(β*) = 0. Furthermore the non-convex loss F has 3 saddles which map to: β_{c,0} := (0, 0) = arg min_{β_i = 0 ∀i} L(β), β_{c,1} := (0.2, 0) = arg min_{β[2] = 0} L(β) and β_{c,2} := (0, 1.6) = arg min_{β[1] = 0} L(β). The loss function L is sketched in Figure 2 (Left). Notice that by the definition of β_{c,1} and β_{c,2}, the gradients of the loss at these points are orthogonal to the axes they belong to. When running gradient flow with a small initialisation over our diagonal linear network, we obtain the plots illustrated in Figure 2 (Middle and Right). We observe three jumps: the iterates jump from the saddle at the origin to β_{c,1} at time t_1, then to β_{c,2} at time t_2 and finally to the global minimum β* at time t_3.
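Before walking through the example, note that conditions (K1)-(K4) can be verified numerically for any candidate piecewise constant process. The following sketch is illustrative (not from the paper): it assumes the quadratic loss (1), and the tolerance and time grid are arbitrary choices.

```python
import numpy as np

def grad_L(X, y, beta):
    """Gradient of L(beta) = ||X beta - y||^2 / (2n)."""
    return X.T @ (X @ beta - y) / X.shape[0]

def check_K_conditions(X, y, jump_times, saddles, t_max, n_grid=2000, tol=1e-6):
    """Check (K1)-(K4) on a grid for the path equal to saddles[k] on (t_k, t_{k+1}).

    jump_times = [t_1, ..., t_p], saddles = [beta_0, ..., beta_p], t_0 = 0, t_{p+1} = t_max.
    (K2) is implied by the (K4)-style checks below together with (K1).
    """
    ts = np.linspace(0.0, t_max, n_grid)
    s = np.zeros(X.shape[1])
    prev_t, k = 0.0, 0
    boundaries = list(jump_times) + [np.inf]
    for t in ts:
        while t > boundaries[k]:                     # move to the correct constant piece
            s = s - (boundaries[k] - prev_t) * grad_L(X, y, saddles[k])
            prev_t, k = boundaries[k], k + 1
        beta = saddles[k]
        s_t = s - (t - prev_t) * grad_L(X, y, beta)  # s_t = -int_0^t grad L
        assert np.all(np.abs(s_t) <= 1 + tol)                   # (K1)
        assert np.all(beta[np.abs(s_t) < 1 - tol] == 0)         # (K3)
        assert np.all(s_t[beta > 0] > 1 - tol)                  # (K4), positive coordinates
        assert np.all(s_t[beta < 0] < -1 + tol)                 # (K4), negative coordinates
    return True
```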
Figure 2: Left: Sketch of the 2d loss. Middle and right: Outputs of gradient flow with small initialisation scale: the iterates are piecewise constant and s_t is piecewise linear across time. We refer to the main text for further details.

Let us show how Eq. (12) enables us to theoretically recover this trajectory. A simple observation, which we will use several times below, is that for any t' > t such that β̃⁰ is constant equal to β over the time interval (t, t'), the definition of s enables us to write that s_{t'} = s_t − (t' − t) ∇L(β).

Zeroth saddle: The iterates are at the saddle at the origin: β̃⁰_t = β_0 := β_{c,0} and therefore s_t = −t ∇L(β_0). Our key equation Eq. (12) is verified as long as s_t = −t ∇L(β_0) ∈ ∂‖β_0‖_1 = [−1, 1]^d. However, the iterates cannot stay at the origin after the time t_1 := 1/|∇L(β_0)[1]| which corresponds to the time at which the first coordinate of s_t hits +1: s_{t_1}[1] = 1. If the iterates stayed at the origin after t_1, (K1) for i = 1 would be violated. The iterates must hence jump.

First saddle: The iterates can only jump to a point different from the origin which keeps Eq. (12) valid. We denote this point as β_1. Notice that: s_{t_1}[2] = −t_1 ∇L(β_0)[2] ∈ (−1, 1) and, since s_t is continuous, we must have β_1[2] = 0 by (K3). Also, s_{t_1}[1] = 1 and hence for t ≥ t_1, s_t[1] = 1 − (t − t_1) ∇L(β_1)[1]. We cannot have ∇L(β_1)[1] < 0 by (K1), and neither ∇L(β_1)[1] > 0, since otherwise s_t[1] ∈ (−1, 1) for t > t_1 and hence β_1 = 0 by (K3), contradicting the jump. The two conditions β_1[2] = 0 and ∇L(β_1)[1] = 0 uniquely define β_1 as equal to β_{c,1}. We now want to know if and when the iterates jump again. We saw that s_t[1] remains at the value +1. However, since β_1 is not a global minimum, ∇L(β_1)[2] ≠ 0 and s_t[2] hits +1 at the time t_2 defined such that −(t_1 ∇L(β_0) + (t_2 − t_1) ∇L(β_1))[2] = 1. The iterates must jump, otherwise (K1) would break.

The iterates cannot jump to β* yet! As the second coordinate of the iterates can activate, one could expect the iterates to be able to jump to the global minimum. However, note that s_t is a continuous function and that s_{t_2} is equal to the vector (1, 1). If the iterates jumped to the global minimum, then the first coordinate of the iterates would change sign from +0.2 to −0.2. Due to (K4) this would lead to s_t[1] jumping from +1 to −1, violating its continuity.

Second saddle: We denote as β_2 the point to which the iterates jump. s_{t_2} is now equal to the vector (1, 1) and therefore (i) β_2 ≥ 0 (coordinate-wise) from (K2), (K3) and the continuity of s. Since s_t = s_{t_2} − (t − t_2) ∇L(β_2), we must also have: (ii) ∇L(β_2) ≥ 0 from (K1), and (iii) for i ∈ {1, 2}, if β_2[i] ≠ 0 then ∇L(β_2)[i] = 0 from (K4). The three conditions (i), (ii) and (iii) precisely correspond to the optimality conditions of the following problem: arg min_{β[1] ≥ 0, β[2] ≥ 0} L(β). The unique minimiser of this problem is β_{c,2}, hence β_2 = β_{c,2}, which means that the first coordinate deactivates. Similarly to before, (K1) is valid until the time t_3 at which the first coordinate of s_t = s_{t_2} − (t − t_2) ∇L(β_2) reaches −1, due to the fact that ∇L(β_2)[1] > 0.

Global minimum: We follow the exact same reasoning as for the second saddle. We now have s_{t_3} equal to the vector (−1, 1) and the iterates must jump to a point β_3 such that (i) β_3[1] ≤ 0, β_3[2] ≥ 0 by (K2) and (K3), (ii) ∇L(β_3)[1] ≤ 0, ∇L(β_3)[2] ≥ 0 by (K1), and (iii) for i ∈ {1, 2}, if β_3[i] ≠ 0 then ∇L(β_3)[i] = 0 by (K4). Again, these are the optimality conditions of the following problem: arg min_{β[1] ≤ 0, β[2] ≥ 0} L(β). β* is the unique minimiser of this problem and β_3 = β*. For t ≥ t_3 we have s_t = s_{t_3} and Eq. (12) is satisfied for all following times: the iterates do not need to move anymore.
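The quantities appearing in this walkthrough (the axis-restricted minimisers β_{c,1}, β_{c,2} and the first jump time t_1) can be computed for any small instance with a few lines of code. The sketch below is illustrative: it uses generic 2d data rather than the exact (partially specified) matrix of the example above, so the numerical values differ from those of the text.

```python
import numpy as np

def restricted_minimiser(X, y, support):
    """Solve arg min_{beta[i]=0 for i not in support} L(beta), as in Eq. (4).

    Solves the normal equations on the chosen support (X[:, support] is assumed
    to have full column rank, which holds under Assumption 1 when |support| <= n).
    """
    beta = np.zeros(X.shape[1])
    Xs = X[:, list(support)]
    beta[list(support)] = np.linalg.solve(Xs.T @ Xs, Xs.T @ y)
    return beta

# a generic 2d instance (illustrative numbers)
X = np.array([[1.0, 0.3], [0.1, 0.4]])
beta_star = np.array([-0.2, 2.0])
y = X @ beta_star

beta_c1 = restricted_minimiser(X, y, [0])     # second coordinate forced to 0
beta_c2 = restricted_minimiser(X, y, [1])     # first coordinate forced to 0
grad_at_origin = X.T @ (X @ np.zeros(2) - y) / X.shape[0]
t1 = 1.0 / np.max(np.abs(grad_at_origin))     # time at which some |s_t[i]| first hits 1
print(beta_c1, beta_c2, t1)
```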
3.2 Presentation of the full saddle-to-saddle algorithm

We can now provide the full algorithm (Algorithm 1) which computes the jump times (t_1, . . . , t_p) and saddles (β_0 = 0, β_1, . . . , β_p) as the values and vectors such that the associated piecewise constant process satisfies Eq. (12) for all t. This algorithm therefore defines our limiting process β̃⁰.

Algorithm 1: Successive saddles and jump times of lim_{α→0} β̃^α

    Initialise: (t, β, s) ← (0, 0, 0)
    while ∇L(β) ≠ 0 do
        A ← {j ∈ [d], ∇L(β)[j] ≠ 0}
        Δ ← inf {δ > 0 s.t. ∃ i ∈ A, |s[i] − δ ∇L(β)[i]| = 1}
        (t, s) ← (t + Δ, s − Δ ∇L(β))
        β ← arg min_{β' ∈ B_s} L(β'),  where  B_s = {β' ∈ R^d s.t. β'_i ≥ 0 if s[i] = +1, β'_i ≤ 0 if s[i] = −1, β'_i = 0 if s[i] ∈ (−1, 1)}
    end
    Output: successive values of β and t

Algorithm 1 in words. The algorithm is a concise representation of the steps we followed in the previous section to construct β̃⁰. We explain each step in words below. Starting from k = 0, assume that we enter loop number k at the saddle β_k computed in the previous loop.

The set A_k contains the coordinates which are "unstable": each of them has a non-zero derivative, so the loss could be decreased by moving along it, and one of these coordinates will have to activate.

The time gap Δ_k corresponds to the time spent at the saddle β_k. It is computed as the elapsed time just before (K1) breaks if the iterates do not jump.

We update t_{k+1} = t_k + Δ_k and s_{k+1} = s_k − Δ_k ∇L(β_k): t_{k+1} corresponds to the time at which the iterates leave the saddle β_k and s_{k+1} constrains the signs of the next saddle β_{k+1}.

The solution β_{k+1} of the constrained minimisation problem is the saddle to which the flow jumps at time t_{k+1}. The optimality conditions of this problem are such that Eq. (12) is maintained for t ≥ t_{k+1}.

Various comments on Algorithm 1. First we point out that any solution β_c of the constrained minimisation problem which appears in Algorithm 1 also satisfies β_c = arg min_{β[i]=0 for i∉supp(β_c)} L(β) as in Eq. (4): the algorithm hence indeed outputs saddles, as expected. Up until now we have not checked whether the algorithm's constrained minimisation problem has a unique minimum. This is crucial, as otherwise the assignment step would be ill-defined. Showing uniqueness is non-trivial and is guaranteed thanks to the general position Assumption 1 on the data (see Proposition 7 in Appendix D.1). In this same proposition, we also show that the algorithm terminates in at most min(2^d, Σ_{k=0}^n (d choose k)) steps, that the loss strictly decreases at each step, and that the final output β_p is the minimum ℓ1-norm solution. These last two properties are expected, given that the algorithm arises as the limit process of β̃^α which follows the mirror flow of Eq. (10).
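For concreteness, here is a minimal numerical sketch of Algorithm 1 (not the authors' implementation). It assumes the quadratic loss L(β) = ‖Xβ − y‖²/(2n), uses scipy.optimize.lsq_linear as a stand-in for the sign-constrained minimisation step, and relies on simple tolerances where the analysis works in exact arithmetic (degenerate cases are handled by Proposition 7, not by this sketch).

```python
import numpy as np
from scipy.optimize import lsq_linear

def grad_L(X, y, beta):
    return X.T @ (X @ beta - y) / X.shape[0]

def constrained_minimiser(X, y, s, tol=1e-9):
    """arg min L over {beta_i >= 0 if s_i = +1, beta_i <= 0 if s_i = -1, beta_i = 0 if |s_i| < 1}."""
    beta = np.zeros(X.shape[1])
    free = np.abs(s) >= 1 - tol                 # coordinates allowed to be non-zero
    if not free.any():
        return beta
    lb = np.where(s[free] > 0, 0.0, -np.inf)
    ub = np.where(s[free] > 0, np.inf, 0.0)
    beta[free] = lsq_linear(X[:, free], y, bounds=(lb, ub)).x
    return beta

def saddle_to_saddle(X, y, max_loops=100, tol=1e-8):
    """Sketch of Algorithm 1: returns the lists of jump times and visited saddles."""
    d = X.shape[1]
    t, beta, s = 0.0, np.zeros(d), np.zeros(d)
    times, saddles = [0.0], [beta.copy()]
    for _ in range(max_loops):
        g = grad_L(X, y, beta)
        if np.max(np.abs(g)) < tol:             # beta is a global minimiser: stop
            break
        active = np.abs(g) > tol
        # smallest delta > 0 such that some |s_i - delta * g_i| reaches 1
        # (exact arithmetic assumed; a production version would guard against delta ~ 0)
        deltas = np.where(g[active] > 0, (1 + s[active]) / g[active],
                                         (1 - s[active]) / (-g[active]))
        delta = deltas.min()
        t, s = t + delta, s - delta * g
        beta = constrained_minimiser(X, y, s)
        times.append(t)
        saddles.append(beta.copy())
    return times, saddles
```

Calling saddle_to_saddle(X, X @ beta_star) on a sparse beta_star returns the predicted jump times and visited saddles that the gradient-flow trajectory with small α is expected to follow.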
Links with the LARS algorithm for the Lasso. Recall that the Lasso problem [46, 15] is formulated as:

$$\beta^\star_\lambda = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^d} L(\beta) + \lambda \|\beta\|_1. \qquad (13)$$

The optimality condition of Eq. (13) writes −∇L(β*_λ) ∈ λ ∂‖β*_λ‖_1. Now notice the similarity with Eq. (12): the two would be equivalent with λ = 1/t if the integral on the left-hand side of Eq. (12) did not average over the whole trajectory but only involved the final iterate, in which case −∫_0^t ∇L(β̃⁰_t) ds = −t ∇L(β̃⁰_t). Though the difference is small, our limiting trajectory β̃⁰ and the Lasso path (β*_λ)_λ are quite different: one has jumps, whereas the other is continuous. Nonetheless, the construction of Algorithm 1 shares many similarities with that of the Least Angle Regression (LARS) algorithm [19] (originally named the Homotopy algorithm [39]) which is used to compute the Lasso path. A notable difference, however, is that each step of our algorithm depends on the whole trajectory through the vector s, whereas the LARS algorithm can be started from any point on the path.

3.3 Outputs of the algorithm under a RIP and gap assumption on the data

In contrast with previous results on incremental learning, complex behaviours can occur when the feature matrix is ill-designed: several coordinates can activate and deactivate at the same time (see Appendix A for various cases). However, if the feature matrix satisfies the 2r-restricted isometry property (RIP) [14] and there exists an r-sparse solution β*, the visited saddles can be easily approximated using Algorithm 1. We provide the precise characterisation below.

Sparse regression with RIP and gap assumption.

(RIP) Assume that there exists an r-sparse vector β* such that y_i = ⟨x_i, β*⟩ for all i. Furthermore, we assume that the feature matrix X ∈ R^{n×d} satisfies the 2r-restricted isometry property with constant ε < √2 − 1 < 1/2: i.e. for every submatrix X_s obtained by extracting any s ≤ 2r columns of X, the matrix X_s⊤X_s/n of size s × s has all its eigenvalues in the interval [1 − ε, 1 + ε].

(Gap assumption) Furthermore, we assume that the coordinates of the r-sparse vector β* have a sufficient gap. W.l.o.g. we write β* = (β*_1, . . . , β*_r, 0, . . . , 0) with |β*_1| ≥ . . . ≥ |β*_r| > 0 and we define λ := min_{i∈[r]} (|β*_i| − |β*_{i+1}|) ≥ 0, which corresponds to the smallest gap between the entries of |β*|. We assume that 5ε‖β*‖_2 < λ/2 and we let ε̃ := 5ε.

A classical result from compressed sensing (see Candes [13, Theorem 1.2]) is that the 2r-restricted isometry property with constant √2 − 1 ensures that the ℓ0-minimisation problem has a unique r-sparse solution, which is β*. This means that Algorithm 1 will have β* as final output, and the following proposition shows that we can precisely characterise each of its outputs when the data satisfies the previous assumptions.

Proposition 2. Under the restricted isometry property and the gap assumption stated right above, Algorithm 1 terminates in r loops and outputs:

β_1 = (β_1[1], 0, . . . , 0) with β_1[1] ∈ [β*_1 − ε̃‖β*‖, β*_1 + ε̃‖β*‖],
β_2 = (β_2[1], β_2[2], 0, . . . , 0) with β_2[i] ∈ [β*_i − ε̃‖β*‖, β*_i + ε̃‖β*‖] for i ∈ {1, 2},
...
β_{r−1} = (β_{r−1}[1], . . . , β_{r−1}[r−1], 0, . . . , 0) with β_{r−1}[i] ∈ [β*_i − ε̃‖β*‖, β*_i + ε̃‖β*‖],
β_r = β* = (β*_1, . . . , β*_r, 0, . . . , 0),

at times t_1, . . . , t_r such that t_i ∈ [ 1/(|β*_i| + ε̃‖β*‖), 1/(|β*_i| − ε̃‖β*‖) ], and where ‖·‖ denotes the ℓ2-norm.

Informally, this means that the algorithm terminates in exactly r loops and outputs jump times and saddles roughly equal to t_i = 1/|β*_i| and β_i = (β*_1, . . . , β*_i, 0, . . . , 0). Therefore, in simple settings, the support of the sparse vector is learnt one coordinate at a time, without any deactivation. We refer to Appendix D.2 for the proof.

4 Convergence of the iterates towards the process defined by Algorithm 1

We are now fully equipped to state our main result, which formalises the convergence of the accelerated iterates towards the limiting process β̃⁰ built in the previous section.

Theorem 2. Let the saddles (β_0 = 0, β_1, . . . , β_{p−1}, β_p = β*_{ℓ1}) and jump times (t_0 = 0, t_1, . . . , t_p) be the outputs of Algorithm 1 and let (β̃⁰_t)_t be the piecewise constant process defined as follows:

(Saddles) β̃⁰_t = β_k for t ∈ (t_k, t_{k+1}) and 0 ≤ k ≤ p, with t_{p+1} = +∞.

The accelerated flow (β̃^α_t)_t defined in Eq. (9) uniformly converges towards the limiting process (β̃⁰_t)_t on any compact subset of R_{≥0} ∖ {t_1, . . . , t_p}.
Convergence result. We recall that, from a technical point of view, showing the existence of the limiting process lim_{α→0} β̃^α is the toughest part. Theorem 2 provides this existence as well as the uniform convergence of the accelerated iterates towards β̃⁰ over all closed intervals of R_{≥0} which do not contain the jump times. We highlight that this is the strongest type of convergence we could expect: a uniform convergence over all intervals of the form [0, T] is impossible given that the limiting process β̃⁰ is discontinuous. In Proposition 3, we give an even stronger result by showing a graph convergence of the iterates which takes into account the path followed between the jumps. We also point out that we can easily show the same type of convergence for the accelerated weights w̃^α_t := w^α_{t_α(t)}. Indeed, using the bijective mapping which links the weights w_t and the predictors β_t (see Lemma 1 in Appendix C), we immediately get that the accelerated weights (ũ^α, ṽ^α) uniformly converge towards the limiting process (√|β̃⁰|, sign(β̃⁰) ⊙ √|β̃⁰|) on any compact subset of R_{≥0} ∖ {t_1, . . . , t_p}.

Estimates for the non-accelerated iterates β^α_t. We point out that our result provides no speed of convergence of β̃^α towards β̃⁰. We believe that a non-asymptotic result is challenging and leave it as future work. Note that we experimentally notice that the convergence rate quickly degrades after each saddle. Nonetheless, we can still write for the non-accelerated iterates that β^α_t = β̃^α_{t/ln(1/α)} ≈ β̃⁰_{t/ln(1/α)} as α → 0. Hence, for α small enough, the iterates β^α_t are roughly equal to 0 until time t_1 ln(1/α) and the minimum ℓ1-norm interpolator is reached at time t_p ln(1/α). Such a precise estimate of the global convergence time is rather remarkable and goes beyond classical Lyapunov analyses, which only lead to L(β^α_t) ≲ ln(1/α)/t (see Proposition 4 in Appendix C).

Natural extensions of our setting. More general initialisations can easily be dealt with. For instance, initialisations of the form u_{t=0} = α u_0 ∈ R^d lead to the exact same result, as it is shown in [50] (discussion after Theorem 1) that the associated mirror potential still converges to the ℓ1-norm. Initialisations of the form [u_{t=0}]_i = α^{k_i}, where k_i > 0, lead to the associated potential converging towards a weighted ℓ1-norm, and one should modify Algorithm 1 by accordingly weighting ∇L(β) in the algorithm. Also, deeper linear architectures of the form β_w = w_+^D − w_−^D as in [50] do not change our result, as the associated mirror potential still converges towards the ℓ1-norm. Though we only consider the square loss in the paper, we believe that all our results should hold for any loss of the type L(β) = Σ_{i=1}^n ℓ(y_i, ⟨x_i, β⟩) where, for all y ∈ R, ℓ(y, ·) is strictly convex with a unique minimiser at y. In fact, the only property which cannot directly be adapted from our results is the uniform boundedness of the iterates (see the discussion before Proposition 5 in Appendix C).

4.1 High-level sketch of proof of β̃^α → β̃⁰ leveraging an arc-length parametrisation

In this section, we give the high-level ideas behind the proof of the convergence β̃^α → β̃⁰ given in Theorem 2. A full and detailed proof can be found in Appendix E. The main difficulty stems from the discontinuity of the limit process β̃⁰. To circumvent this difficulty, a clever trick which we borrow from [18, 36] is to slow down time when the jumps occur by considering an arc-length parametrisation of the path.
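Before the formal construction, the effect of this reparametrisation can be illustrated on a sampled trajectory. The following discrete analogue is purely illustrative (not from the paper): it computes the arc-length time change and the two components of the unit speed budget.

```python
import numpy as np

def arc_length_reparametrise(ts, betas):
    """tau(t) = t + (path length of beta on [0, t]) for a sampled trajectory.

    ts: (T,) strictly increasing times; betas: (T, d) sampled iterates.
    Returns taus together with dt/dtau and ||dbeta||/dtau, which sum to 1:
    both stay in [0, 1], so the reparametrised path moves at speed at most 1
    even where the original trajectory jumps arbitrarily fast.
    """
    steps = np.linalg.norm(np.diff(betas, axis=0), axis=1)   # ||beta_{i+1} - beta_i||
    taus = ts + np.concatenate([[0.0], np.cumsum(steps)])
    dtau = np.diff(taus)
    return taus, np.diff(ts) / dtau, steps / dtau
```

Applied to a trajectory produced by the Euler sketch of Section 2, both returned ratios stay in [0, 1] by construction, which is the discrete counterpart of the bounded-speed property used below.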
We consider the R_{≥0} arc-length bijection τ_α and leverage it to define the appropriately slowed-down iterates β̂^α_τ as:

$$\hat\beta^\alpha_\tau = \tilde\beta^\alpha_{\hat t_\alpha(\tau)}, \quad \text{where } \hat t_\alpha(\tau) = (\tau_\alpha)^{-1}(\tau) \ \text{ and } \ \tau_\alpha(t) = t + \int_0^t \big\|\dot{\tilde\beta}^\alpha_s\big\|\, ds.$$

This time reparametrisation has the fortunate but crucial property that, by a simple chain rule, \hat t_\alpha'(\tau) + \|\tfrac{d}{d\tau}\hat\beta^\alpha_\tau\| = 1, which means that the speed of (β̂^α_τ)_τ is uniformly upper bounded by 1, independently of α. This behaviour is in stark contrast with the process (β̃^α_t)_t, whose speed explodes at the jumps. This change of time now allows us to use the Arzelà-Ascoli theorem to extract a subsequence which uniformly converges to a limiting process which we denote β̂. Importantly, β̂ enables us to keep track of the path followed between the jumps, as we show that its trajectory has two regimes:

Saddles: β̂_τ = β_k,
Connections: dβ̂_τ/dτ = − (|β̂_τ| ⊙ ∇L(β̂_τ)) / ‖ |β̂_τ| ⊙ ∇L(β̂_τ) ‖.

The red curves in the accompanying illustration correspond to the paths which the iterates follow during the jumps; these paths are called heteroclinic orbits in the dynamical systems literature [31, 3]. To prove Theorem 2, we can map back the convergence of β̂^α to show that of β̃^α. Moreover, from the convergence β̂^α → β̂ we get a more complete picture of the limiting dynamics of β̃^α, as it naturally implies that the graph of the iterates (β̃^α_t)_t converges towards that of (β̂_τ)_τ. This graph convergence is formalised in the following proposition.

Proposition 3. For all T > t_p, the graph of the iterates (β̃^α_t)_{t ≤ T} converges to that of (β̂_τ)_{τ ≥ 0} in Hausdorff distance:

$$\mathrm{dist}\big(\{\tilde\beta^\alpha_t\}_{t \le T}, \{\hat\beta_\tau\}_{\tau \ge 0}\big) \xrightarrow[\alpha \to 0]{} 0.$$

5 Further discussion and conclusion

Link between incremental learning and saddle-to-saddle dynamics. The incremental learning phenomenon and the saddle-to-saddle process are complementary facets of the same idea and refer to the same phenomenon. Indeed, for gradient flows dw_t = −∇F(w_t) dt, fixed points of the dynamics correspond to critical points of the loss. Stages with little progress in learning and minimal movement of the iterates necessarily correspond to the iterates being in the vicinity of a critical point of the loss. It turns out that in many settings (linear networks [30], matrix sensing [8, 41]), critical points are necessarily saddle points of the loss (if not global minima) and that they have a very particular structure (high sparsity, low rank, etc.). We finally note that an alternative approach to realising saddle-to-saddle dynamics is through the perturbation of the gradient flow by a vanishing noise, as studied in [6].

Characterisation of the visited saddles. A common belief is that the saddle-to-saddle trajectory can be found by successively computing the direction of most negative curvature of the loss (i.e. the eigenvector corresponding to the most negative eigenvalue of the Hessian) and following this direction until reaching the next saddle [26]. However, this statement cannot be accurate as it is inconsistent with our algorithm in our setting. In fact, it can be shown that such a procedure would match the orthogonal matching pursuit (OMP) algorithm [42, 17], which does not necessarily lead to the minimum ℓ1-norm interpolator. In [7], which is the closest work to ours and the first to prove convergence of the iterates towards a piecewise constant process, the successive saddles are entirely characterised and connected to the Lasso regularisation path in the underparametrised setting.
Recently, [9] extended the diagonal linear network setting to more general diagonal parametrisations, but at the cost of stronger assumptions on the trajectory.

Adaptive Inverse Scale Space Method. Following the submission of our paper, we were informed that Algorithm 1 had already been proposed and analysed in the compressed sensing literature. Indeed, it exactly corresponds to the Adaptive Inverse Scale Space Method (aISS) proposed in [11]. The motivations behind its study are extremely different from ours and originate from the study of Bregman iterations [12, 40, 52], which are an efficient method for solving ℓ1-related minimisation problems. The so-called inverse scale space flow, which corresponds to Eq. (12) in our paper, can be seen as the continuous-time version of the Bregman iteration. As in our paper, [11] show that this equation can be solved through an iterative algorithm. We refer to [51, Section 2] for further details. However, we did not find any results in this literature concerning the uniqueness of the constrained minimisation problem under Assumption 1, nor on the maximum number of iterations, the behaviour under RIP assumptions or the maximum number of active coordinates.

Subdifferential equations and rate-independent systems. As in Eq. (12), subdifferential inclusions of the form −∇L(β_t) ∈ (d/dt) ∂h(β_t) for non-differentiable functions h have been studied by Attouch et al. [4], but for strongly convex functions h. In this case, the solutions are continuous and do not exhibit jumps. On the other hand, [18, 36, 37] consider so-called rate-independent systems of the form −∇_q E(t, q_t) ∈ ∂h(q̇_t) for 1-homogeneous dissipation potentials h. Examples of such systems are ubiquitous in mechanics and appear in problems related to friction, crack propagation, elastoplasticity and ferromagnetism, to name a few [35, Ch. 6 for a survey]. As in our case, the main difficulty with such processes is the possible appearance of jumps when the energy E is non-convex.

Conclusion. Our study examines the behaviour of gradient flow with vanishing initialisation over diagonal linear networks. We prove that it leads to the flow jumping from one saddle point of the loss to another. Our analysis characterises each visited saddle as well as the jump times through an algorithm which is reminiscent of the LARS method used in the Lasso framework. There are several avenues for further exploration. The most compelling one is the extension of these techniques to broader contexts for which the implicit bias of gradient flow has not yet been fully understood.

Acknowledgments. S.P. would like to thank Loucas Pillaud-Vivien for introducing him to this beautiful topic and for the many insightful discussions. S.P. also thanks Quentin Rebjock for the many helpful discussions and Johan S. Wind for reaching out and providing the reference [11]. The authors also thank Jérôme Bolte for the discussions concerning subdifferential equations, Aris Daniilidis for the reference [32], as well as Aditya Varre and Mathieu Even for proofreading the paper.

References

[1] Emmanuel Abbe, Enric Boix Adsera, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. In The Thirty Sixth Annual Conference on Learning Theory, pages 2552-2623. PMLR, 2023.
[2] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization. Advances in Neural Information Processing Systems, 32, 2019.
[3] Peter Ashwin and Michael Field. Heteroclinic networks in coupled cell systems. Arch.
Ration. Mech. Anal., 148(2):107-143, 1999.
[4] H. Attouch, J. Bolte, P. Redont, and M. Teboulle. Singular Riemannian barrier methods and gradient-projection dynamical systems for constrained optimization. Optimization, 53(5-6):435-454, 2004.
[5] Shahar Azulay, Edward Moroshko, Mor Shpigel Nacson, Blake E Woodworth, Nathan Srebro, Amir Globerson, and Daniel Soudry. On the implicit bias of initialization shape: Beyond infinitesimal mirror descent. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 468-477. PMLR, 18-24 Jul 2021.
[6] Yuri Bakhtin. Noisy heteroclinic networks. Probab. Theory Related Fields, 150(1-2):1-42, 2011.
[7] Raphaël Berthier. Incremental learning in diagonal linear networks. arXiv preprint arXiv:2208.14673, 2022.
[8] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, volume 29, 2016.
[9] Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, and Joshua Susskind. Transformers learn through gradual rank increase. arXiv preprint arXiv:2306.07042, 2023.
[10] Etienne Boursier, Loucas Pillaud-Vivien, and Nicolas Flammarion. Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022.
[11] Martin Burger, Michael Möller, Martin Benning, and Stanley Osher. An adaptive inverse scale space method for compressed sensing. Mathematics of Computation, 82(281):269-299, 2013.
[12] Jian-Feng Cai, Stanley Osher, and Zuowei Shen. Split Bregman methods and frame based image restoration. Multiscale Modeling & Simulation, 8(2):337-369, 2010.
[13] Emmanuel J Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9-10):589-592, 2008.
[14] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207-1223, 2006.
[15] Scott Shaobing Chen, David L Donoho, and Michael A Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129-159, 2001.
[16] Lénaïc Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 1305-1338. PMLR, 09-12 Jul 2020.
[17] Geoff Davis, Stephane Mallat, and Marco Avellaneda. Adaptive greedy approximations. Constructive Approximation, 13:57-98, 1997.
[18] Messoud A. Efendiev and Alexander Mielke. On the rate-independent limit of systems with dry friction and small viscosity. J. Convex Anal., 13(1):151-167, 2006.
[19] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. 2004.
[20] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over diagonal linear networks: Implicit regularisation, large stepsizes and edge of stability. arXiv preprint arXiv:2302.08982, 2023.
[21] Udaya Ghai, Elad Hazan, and Yoram Singer. Exponentiated gradient meets gradient descent. In Aryeh Kontorovich and Gergely Neu, editors, Proceedings of the 31st International Conference on Algorithmic Learning Theory, volume 117 of Proceedings of Machine Learning Research, pages 386-407. PMLR, 08-11 Feb 2020.
[22] Gauthier Gidel, Francis Bach, and Simon Lacoste-Julien. Implicit regularization of discrete gradient dynamics in linear neural networks. Advances in Neural Information Processing Systems, 32, 2019.
[23] Daniel Gissin, Shai Shalev-Shwartz, and Amit Daniely. The implicit bias of depth: How incremental learning drives generalization. In International Conference on Learning Representations, 2020.
[24] Suriya Gunasekar, Blake E Woodworth, Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, volume 30, 2017.
[25] Jeff Z. HaoChen, Colin Wei, Jason Lee, and Tengyu Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, pages 2315-2357. PMLR, 2021.
[26] Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021.
[27] Liwei Jiang, Yudong Chen, and Lijun Ding. Algorithmic regularization in model-free overparametrized asymmetric matrix factorization. arXiv preprint arXiv:2203.02839, 2022.
[28] Jikai Jin, Zhiyuan Li, Kaifeng Lyu, Simon S Du, and Jason D Lee. Understanding incremental learning of gradient descent: A fine-grained analysis of matrix sensing. arXiv preprint arXiv:2301.11500, 2023.
[29] Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32, 2019.
[30] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, volume 29, 2016.
[31] M. Krupa. Robust heteroclinic cycles. J. Nonlinear Sci., 7(2):129-176, 1997.
[32] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. Ann. Inst. Fourier (Grenoble), 48(3):769-783, 1998.
[33] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2021.
[34] Julien Mairal and Bin Yu. Complexity analysis of the lasso regularization path. arXiv preprint arXiv:1205.0079, 2012.
[35] Alexander Mielke. Evolution of rate-independent systems. Evolutionary Equations, 2:461-559, 2005.
[36] Alexander Mielke, Riccarda Rossi, and Giuseppe Savaré. Modeling solutions with jumps for rate-independent systems on metric spaces. Discrete Contin. Dyn. Syst., 25(2):585-615, 2009.
[37] Alexander Mielke, Riccarda Rossi, and Giuseppe Savaré. Variational convergence of gradient flows and rate-independent evolutions in metric spaces. Milan Journal of Mathematics, 80:381-410, 2012.
[38] Behnam Neyshabur. Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953, 2017.
[39] Michael R Osborne, Brett Presnell, and Berwin A Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389-403, 2000.
[40] Stanley Osher, Martin Burger, Donald Goldfarb, Jinjun Xu, and Wotao Yin. An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation, 4(2):460-489, 2005.
[41] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach.
In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 65-74. PMLR, 20-22 Apr 2017.
[42] Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of 27th Asilomar Conference on Signals, Systems and Computers, pages 40-44. IEEE, 1993.
[43] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: a provable benefit of stochasticity. In Advances in Neural Information Processing Systems, 2021.
[44] Noam Razin, Asaf Maman, and Nadav Cohen. Implicit regularization in tensor factorization. In International Conference on Machine Learning, pages 8913-8924. PMLR, 2021.
[45] Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23):11537-11546, 2019.
[46] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267-288, 1996.
[47] Ryan J. Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456-1490, 2013.
[48] Tomas Vaskevicius, Varun Kanade, and Patrick Rebeschini. Implicit regularization for optimal sparse recovery. Advances in Neural Information Processing Systems, 32, 2019.
[49] Johan S Wind, Vegard Antun, and Anders C Hansen. Implicit regularization in AI meets generalized hardness of approximation in optimization - sharp results for diagonal linear networks. arXiv preprint arXiv:2307.07410, 2023.
[50] Blake Woodworth, Suriya Gunasekar, Jason D. Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 3635-3673. PMLR, 09-12 Jul 2020.
[51] Yi Yang, Michael Möller, and Stanley Osher. A dual split Bregman method for fast l1 minimization. Mathematics of Computation, 82(284):2061-2085, 2013.
[52] W Yin, S Osher, D Goldfarb, and J Darbon. Bregman iterative algorithms for l1-minimization with applications to compressed sensing. SIAM Journal on Imaging Sciences, 1(1):143-168, 2008.
[53] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Organisation of the Appendix.
1. In Appendix A, we give the experimental setup and provide additional experiments.
2. In Appendix B, we prove Proposition 1 and provide additional comments concerning the uniqueness of the minimisation problem which appears in the proposition.
3. In Appendix C, we provide some general results on the flow.
4. In Appendix D, we prove Proposition 2 and give standalone properties of Algorithm 1.
5. In Appendix E, we detail the arc-length parametrisation introduced in the main text and prove Theorem 2 and Proposition 3.
6. In Appendix F, we provide technical lemmas which are useful to prove the main results.

A Experimental setup and additional experiments, extensions and related works

Experimental setup and additional experiments.
For each experiment we generate the dataset as y_i = ⟨x_i, β*⟩ where x_i ∼ N(0, H) for a diagonal covariance matrix H, and β* is a vector of R^d. Gradient descent is run with a small step size from the initialisation u_{t=0} = √2 α 1 ∈ R^d and v_{t=0} = 0 for some initialisation scale α > 0.

Figure 1 and Figure 4 (Left): (n, d, α) = (5, 7, 10^{−120}), H = I_d, β* = (10, 20, 0, 0, 0, 0, 0) ∈ R^7.
Figure 4 (Right): (n, d, α) = (6, 6, 10^{−10}), H = diag(1, 10, 10, 10, 10, 10) ∈ R^{6×6}, β* = (1, 0, 0, 0, 0, 0) ∈ R^6.
Figure 3 (Left): (n, d, α_1, α_2) = (7, 2, 10^{−100}, 10^{−10}), H = I_d, β* = (10, 20) ∈ R^2.
Figure 3 (Right): (n, d, α) = (3, 3, 10^{−100}), X is the square root of the matrix ((20, 6, 1.4), (6, 2, 0.4), (1.4, 0.4, 0.12)) ∈ R^{3×3}, β* = (1, 9, 10).

Figure 3: Left: Visualisation of the uniform convergence of the accelerated iterates β̃^α_t = β^α_{ln(1/α)t} towards β̃⁰ as α → 0, for α_1 = 10^{−100} and α_2 = 10^{−10}. Right: In some cases, two coordinates can activate at the same time. Note that the time axis is in log-scale for better visualisation.

Figure 4: Complex dynamics can occur. Left and right: The coordinates are not monotonic, and neither is the number of active coordinates, as several coordinates can deactivate at the same time. The piecewise constant process plotted in black is the limiting process β̃⁰ predicted by our theory.

B Proof of Proposition 1

Proposition 1. All the critical points w_c of F which are not global minima, i.e., ∇F(w_c) = 0 and F(w_c) > min_w F(w), are necessarily saddle points (i.e., not local extrema). They map to parameters β_c = u_c ⊙ v_c which satisfy |β_c| ⊙ ∇L(β_c) = 0 and:

$$\beta_c \in \operatorname*{arg\,min}_{\beta[i]=0 \text{ for } i \notin \mathrm{supp}(\beta_c)} L(\beta), \qquad (4)$$

where supp(β_c) = {i ∈ [d], β_c[i] ≠ 0} corresponds to the support of β_c.

Proof. Non-existence of maxima / non-global minima. This is a simpler version of results which appear in [30]; for the sake of completeness we provide here a simple proof adapted to our setting. The intuition follows from the fact that if there existed a local maximum / non-global local minimum for F, then this would translate into the existence of a local maximum / non-global local minimum for the convex loss L, which is absurd.

Assume that there exists a local maximum w* = (u*, v*), i.e. assume that there exists ε > 0 such that for all w = (u, v) with ‖w − w*‖²_2 ≤ ε, F(w) ≤ F(w*). We show that this would imply that β* = u* ⊙ v* is a local maximum of L, which is absurd. The mapping

$$g : (u, v) \mapsto \big(u \odot v, \ \sqrt{(u^2 - v^2)/2}\big),$$

defined on {(u, v) ∈ R^d_{>0} × R^d, u > |v|}, is a bijection onto R^d × R^d_{>0} with inverse

$$g^{-1} : (\beta, \alpha) \mapsto \Big(\sqrt{\sqrt{\beta^2 + \alpha^4} + \alpha^2}, \ \mathrm{sign}(\beta) \odot \sqrt{\sqrt{\beta^2 + \alpha^4} - \alpha^2}\Big). \qquad (14)$$

Also notice that F(g^{−1}(β, α)) = L(β) for all β and α. Now let ε' > 0 and let β ∈ R^d be such that ‖β − β*‖²_2 ≤ ε'. Then for (u, v) = g^{−1}(β, α_*) where α_* = √(((u*)² − (v*)²)/2), we have that

$$\|(u, v) - (u^*, v^*)\|_2^2 \le 2 \big\| \sqrt{\alpha_*^4 + \beta^2} - \sqrt{\alpha_*^4 + (\beta^*)^2} \big\|_1 \le 2 \|\beta^2 - (\beta^*)^2\|_1 = 2 \|(\beta - \beta^*)^2 + 2 (\beta - \beta^*) \odot \beta^*\|_1 \le 2 \|(\beta - \beta^*)^2\|_1 + 4 \|(\beta - \beta^*) \odot \beta^*\|_1 \le \varepsilon,$$

where the last inequality holds for ε' small enough. This means that L(β) = F(w) ≤ F(w*) = L(β*) and β* is a local maximum of L, which is absurd. The exact same proof holds to show that there are no local minima of F which are not global minima.

Critical points. The gradient of the loss function F writes:

$$\nabla_w F(w) = \begin{pmatrix} \nabla_u F(w) \\ \nabla_v F(w) \end{pmatrix} = \begin{pmatrix} \nabla L(\beta) \odot v \\ \nabla L(\beta) \odot u \end{pmatrix}.$$

Therefore ∇F(w_c) = 0 ∈ R^{2d} implies that ∇L(β_c) ⊙ β_c = 0 ∈ R^d.
Now consider such a β_c and let supp(β_c) = {i ∈ [d] such that β_c[i] ≠ 0} denote the support of β_c. Since [∇L(β_c)]_i = 0 for every i ∈ supp(β_c), we can therefore write that β_c ∈ arg min_{β_i=0 for i∉supp(β_c)} L(β). Furthermore, we point out that since supp(β_c) ⊂ [d], there are at most 2^d distinct sets supp(β_c), and therefore at most 2^d values F(w_c) = L(β_c), where w_c is a critical point of F.

Additional comment concerning the uniqueness of arg min_{β_i=0, i∉supp(β_c)} L(β). We point out that the constrained minimisation problem (4) does not necessarily have a unique solution, even when β_c is not a global solution. Though not required for any of our results, for the sake of completeness, we show here that under an additional mild assumption on the data, we can ensure that the minimisation problem (4) which appears in Proposition 1 has a unique minimum when L(β_c) > 0. Under this additional assumption, there is therefore a finite number of saddles β_c. Recall that we let X ∈ R^{n×d} be the feature matrix and (x̃_1, . . . , x̃_d) be its columns. Now assume temporarily that the following assumption holds.

Assumption 2 (Assumption used just in this short section). Any subset of (x̃_1, . . . , x̃_d) of size smaller than min(n, d) is linearly independent.

One can easily check that this assumption holds with probability 1 as soon as the data is drawn from a continuous probability distribution, similarly to [47, Lemma 4]. In the following, for a subset ξ = {i_1, . . . , i_k} ⊂ [d], we write X_ξ = (x̃_{i_1}, . . . , x̃_{i_k}) ∈ R^{n×k} (we extract the corresponding columns from X). For a vector β ∈ R^d we write β[ξ] = (β_{i_1}, . . . , β_{i_k}) and β[ξ^C] = (β_i)_{i∉ξ}. We distinguish two different settings.

Underparametrised setting (n ≥ d): in this case, for any ξ = {i_1, . . . , i_k} ⊂ [d], β† := arg min_{β_i=0, i∉ξ} L(β) is unique. Indeed, we simply set the gradient to 0 and notice that, due to Assumption 2, there exists a unique solution: it is β† such that β†[ξ] = (X_ξ⊤ X_ξ)^{−1} X_ξ⊤ y and β†[ξ^C] = 0.

Overparametrised setting (d > n):

Global solutions: arg min_{β∈R^d} L(β) is an affine space directed by the orthogonal complement of span(x_1, . . . , x_n) in R^d. Since span(x̃_1, . . . , x̃_d) = R^n from Assumption 2, any β* ∈ arg min_{β∈R^d} L(β) satisfies Xβ* = y and L(β*) = 0.

"Saddle points": now let β_c ∈ R^d be such that β_c ∈ arg min_{β_i=0, i∉supp(β_c)} L(β) and assume that L(β_c) > 0 (i.e., β_c is not a global solution). Then: (1) β_c has at most n non-zero entries; indeed, if this were not the case, then y would necessarily belong to span((x̃_i)_{i∈supp(β_c)}) due to the assumption on the data, and this would lead to L(β_c) = 0; (2) therefore, similarly to the underparametrised case, arg min_{β_i=0, i∉supp(β_c)} L(β) is unique, equal to β_c, and we have that β_c[ξ] = (X_ξ⊤ X_ξ)^{−1} X_ξ⊤ y and β_c[ξ^C] = 0 where ξ = supp(β_c).

Thus, in both the underparametrised and overparametrised settings, the minimisation problem (4) appearing in Proposition 1 has a unique minimum when L(β_c) > 0 and Assumption 2 holds.

C General results on the iterates

In the following lemma we recall a few results concerning the gradient flow from Eq. (5):

$$dw_t = -\nabla F(w_t)\, dt, \qquad (15)$$

where F is defined in Eq. (3) as F(w) := L(u ⊙ v) = (1/2n) Σ_{i=1}^n (⟨u ⊙ v, x_i⟩ − y_i)².

Lemma 1. For the initialisation u_0 = √2 α 1, v_0 = 0, the flow w^α_t = (u^α_t, v^α_t) from Eq. (15) is such that the quantity (u^α_t)² − (v^α_t)² is constant and equal to 2α² 1. Furthermore u^α_t > |v^α_t| ≥ 0 and therefore, from the bijection of Eq. (14), we have that:

$$u^\alpha_t = \sqrt{\sqrt{(\beta^\alpha_t)^2 + \alpha^4} + \alpha^2}, \qquad v^\alpha_t = \mathrm{sign}(\beta^\alpha_t) \odot \sqrt{\sqrt{(\beta^\alpha_t)^2 + \alpha^4} - \alpha^2}.$$

Proof.
From the expression of ∇F(w), notice that the time derivative of (u^α_t)² − (v^α_t)² is equal to 0, so that this quantity remains equal to its initial value 2α² 1. Since (u^α_t)² − (v^α_t)² = (u^α_t + v^α_t) ⊙ (u^α_t − v^α_t) > 0, by continuity we get that u^α_t + v^α_t > 0 and u^α_t − v^α_t > 0, and therefore u^α_t > |v^α_t|.

In this section we consider the accelerated iterates from Eq. (9), which follow:

$$d\nabla\tilde\phi_\alpha(\tilde\beta^\alpha_t) = -\nabla L(\tilde\beta^\alpha_t)\, dt, \quad \text{where } \tilde\phi_\alpha := \tfrac{1}{\ln(1/\alpha)}\phi_\alpha, \qquad (16)$$

with β̃^α_{t=0} = 0 and where ϕ_α is defined in Eq. (7).

Proposition 4. For all α > 0 and any minimiser β* ∈ arg min_β L(β), the loss values L(β̃^α_t) and the Bregman divergence D_{ϕ̃_α}(β*, β̃^α_t) are decreasing. Moreover,

$$L(\tilde\beta^\alpha_t) - L(\beta^*) \le \frac{\tilde\phi_\alpha(\beta^*)}{2t} \quad \text{and} \quad L\Big(\frac{1}{t}\int_0^t \tilde\beta^\alpha_s\, ds\Big) - L(\beta^*) \le \frac{\tilde\phi_\alpha(\beta^*)}{2t}.$$

Proof. The loss is decreasing since

$$\frac{d}{dt} L(\tilde\beta^\alpha_t) = \langle \nabla L(\tilde\beta^\alpha_t), \dot{\tilde\beta}^\alpha_t \rangle = -\langle \dot{\tilde\beta}^\alpha_t, \nabla^2\tilde\phi_\alpha(\tilde\beta^\alpha_t)\, \dot{\tilde\beta}^\alpha_t \rangle \le 0.$$

Also, d/dt D_{ϕ̃_α}(β*, β̃^α_t) = −⟨∇L(β̃^α_t), β̃^α_t − β*⟩ = −2(L(β̃^α_t) − L(β*)) (since L is the quadratic loss and ∇L(β*) = 0), therefore the Bregman divergence is decreasing. We can also integrate this last equality from 0 to t and divide by 2t:

$$\frac{1}{t}\int_0^t L(\tilde\beta^\alpha_s)\, ds - L(\beta^*) = \frac{D_{\tilde\phi_\alpha}(\beta^*, \tilde\beta^\alpha_0 = 0) - D_{\tilde\phi_\alpha}(\beta^*, \tilde\beta^\alpha_t)}{2t} \le \frac{\tilde\phi_\alpha(\beta^*)}{2t}.$$

Since the loss is decreasing, we get that L(β̃^α_t) − L(β*) ≤ ϕ̃_α(β*)/(2t), and from the convexity of L we get that L((1/t)∫_0^t β̃^α_s ds) − L(β*) ≤ ϕ̃_α(β*)/(2t).

In the following proposition, we show that for α small enough, the iterates are bounded independently of α. Note that this result unfortunately only holds for the quadratic loss; we expect it to hold for other convex losses of the type L(β) = (1/n) Σ_i ℓ(y_i, ⟨x_i, β⟩), where ℓ(y, ·) is strictly convex with a unique root at y, but we do not know how to show it. Also note that bounding the accelerated iterates β̃^α is equivalent to bounding the iterates β^α since β̃^α_t = β^α_{ln(1/α) t}.

Proposition 5. For α < α_0, where α_0 depends on β*_{ℓ1}, the iterates β̃^α_t are bounded independently of α: ‖β̃^α_t‖_1 ≤ 3‖β*_{ℓ1}‖_1 + 1.

Proof. From Eq. (16), integrating and using that L is the quadratic loss, we get:

$$\nabla\tilde\phi_\alpha(\tilde\beta^\alpha_t) = \frac{t}{n} X^\top (y - X \bar\beta^\alpha_t) = -\frac{t}{n} X^\top X (\bar\beta^\alpha_t - \beta^*),$$

where we recall that X ∈ R^{n×d} is the feature matrix and where we denote the averaged iterate by β̄^α_t = (1/t) ∫_0^t β̃^α_s ds. Thus we get

$$\langle \nabla\tilde\phi_\alpha(\tilde\beta^\alpha_t), \tilde\beta^\alpha_t - \beta^* \rangle = -\frac{t}{n} (\bar\beta^\alpha_t - \beta^*)^\top X^\top X (\tilde\beta^\alpha_t - \beta^*). \qquad (19)$$

By convexity of ϕ̃_α we have ϕ̃_α(β̃^α_t) − ϕ̃_α(β*) ≤ ⟨∇ϕ̃_α(β̃^α_t), β̃^α_t − β*⟩. By the Cauchy-Schwarz inequality, we also have (β̄^α_t − β*)⊤X⊤X(β̃^α_t − β*) ≤ ‖X(β̄^α_t − β*)‖ ‖X(β̃^α_t − β*)‖. Using Proposition 4, ‖X(β̃^α_t − β*)‖² ≤ n ϕ̃_α(β*)/t and ‖X(β̄^α_t − β*)‖² ≤ n ϕ̃_α(β*)/t, so we can further bound the right-hand side of Eq. (19) in absolute value by (t/n) ‖X(β̄^α_t − β*)‖ ‖X(β̃^α_t − β*)‖ ≤ ϕ̃_α(β*). Thus it yields ϕ̃_α(β̃^α_t) − ϕ̃_α(β*) ≤ ϕ̃_α(β*), i.e. ϕ̃_α(β̃^α_t) ≤ 2 ϕ̃_α(β*). From [50] (proof of Lemma 1 in the appendix) we get that for α < min{1, √(‖β*‖_1), (2‖β*‖_1)^{−1}} and α < exp(−d/2):

$$\|\beta\|_1 - \frac{d}{\ln(1/\alpha^2)} \le \tilde\phi_\alpha(\beta) \le \|\beta\|_1 + 1,$$

which finally leads, for α < α_0 := min{1, √(‖β*_{ℓ1}‖_1), (2‖β*_{ℓ1}‖_1)^{−1}, exp(−d/2)}, to the result.

The following proposition shows that we can bound the path length of the flow β^α independently of α. Keep in mind that the path length of β̃^α is the same as that of β^α, as the first is just a time acceleration of the second: β̃^α_t = β^α_{ln(1/α) t}.

Proposition 6. For α < α_0, where α_0 is the same as in Proposition 5, the path length of the iterates (β^α_t)_{t≥0} is bounded independently of α > 0:

$$\int_0^{+\infty} \|\dot\beta^\alpha_t\|\, dt < C,$$

where C does not depend on α. Hence the path length of the accelerated flow β̃^α is also bounded independently of α.

Proof. Having shown that the iterates β^α_t are bounded independently of α, it also follows from Lemma 1 that the weights w_t = (u_t, v_t) are bounded.
The following proposition shows that the path length of the flow $\beta^\alpha$ is bounded independently of $\alpha$. Keep in mind that the path length of $\beta^\alpha$ is the same as that of $\tilde\beta^\alpha$, as the latter is just an acceleration of the former: $\tilde\beta^\alpha_t = \beta^\alpha_{\ln(1/\alpha)t}$.

Proposition 6. For $\alpha < \alpha_0$, where $\alpha_0$ is the same as in Proposition 5, the path length of the iterates $(\beta^\alpha_t)_{t\geq 0}$ is bounded independently of $\alpha > 0$:
$$\int_0^{+\infty} \|\dot\beta^\alpha_t\|\,\mathrm{d}t < C,$$
where $C$ does not depend on $\alpha$. Hence the path length of the accelerated flow $\tilde\beta^\alpha$ is also bounded independently of $\alpha$.

Proof. Having shown that the iterates $\beta^\alpha_t$ are bounded independently of $\alpha$, it also follows from Lemma 1 that the iterates $w_t = (u_t, v_t)$ are bounded. Since the loss $w \mapsto F(w)$ is a multivariate polynomial function, it is a semialgebraic function and we can consequently apply the result of Kurdyka [32, Theorem 2], which grants that
$$\int_0^{+\infty} \|\dot w_t\|\,\mathrm{d}t < C,$$
where the constant $C$ only depends on the loss and on the bound on the iterates. We further use that $\dot\beta = \dot u \odot v + u \odot \dot v$, that $\|\dot u \odot v + u \odot \dot v\| \leq C_1(\|\dot u\| + \|\dot v\|)$ since $u$ and $v$ are bounded, and that $\|\dot u\| + \|\dot v\| \leq C_2\|\dot w\|$ by the equivalence of norms. Therefore $\int_0^{+\infty}\|\dot\beta^\alpha_t\|\,\mathrm{d}t < C'$ for some $C'$ which is independent of the initialisation scale $\alpha$.

D Standalone properties of Algorithm 1

D.1 Well-definedness of Algorithm 1 and upper bound on its number of loops

Notice that the following proposition highlights the fact that Algorithm 1 is, on its own, an algorithm of interest for finding the minimum $\ell_1$-norm solution in an overparametrised regression setting. We point out that the provided upper bound on the number of iterations is very crude and could certainly be improved.

Proposition 7. Algorithm 1 is well defined: at each iteration, (i) the attribution of $\Delta$ is well defined as $\Delta < +\infty$, and (ii) the constrained minimisation problem has a unique solution, so the attribution of the value of $\beta$ is well-founded. Furthermore, along the loops: the iterates $\beta$ have at most $n$ non-zero coordinates, the loss is strictly decreasing, and the algorithm terminates in at most $\min\big(2^d, \sum_{k=0}^n \binom{d}{k}\big)$ steps by outputting the minimum $\ell_1$-norm solution $\beta^\star_{\ell_1} := \arg\min_{\beta \in \arg\min L}\|\beta\|_1$.

Proof. In the following, for the matrix $X$ and for a subset $I = \{i_1, \dots, i_k\} \subset [d]$, we write $X_I = (\tilde x_{i_1}, \dots, \tilde x_{i_k}) \in \mathbb{R}^{n\times k}$ (we extract the columns from $X$). For a vector $\beta \in \mathbb{R}^d$ we write $\beta_I = (\beta_{i_1}, \dots, \beta_{i_k})$.

(1) The constrained minimisation problem has a unique solution: we follow the proof of [47, Lemma 2]. Following the notations of Algorithm 1, we define $I = \{i \in [d],\ |s_i| = 1\}$ and we point out that after $k$ loops of the algorithm, the value of $s$ is equal to
$$s = -\big(\Delta_1\nabla L(\beta^0) + \dots + \Delta_k\nabla L(\beta^{k-1})\big) \in \mathrm{span}(x_1, \dots, x_n).$$
We can therefore write $s = X^\top\tilde r$ for some $\tilde r \in \mathbb{R}^n$. Now assume that $\ker(X_I) \neq \{0\}$. Then, for some $i \in I$, we have $\tilde x_i = \sum_{j\in I\setminus\{i\}} c_j\tilde x_j$ where $c_j \in \mathbb{R}$. Without loss of generality, we can assume that $I\setminus\{i\}$ has at most $n$ elements: otherwise we can always find $n$ elements $I' \subset I\setminus\{i\}$ such that $\tilde x_i = \sum_{j\in I'} c_j\tilde x_j$. Rewriting the previous equality, we get
$$s_i\tilde x_i = \sum_{j\in I\setminus\{i\}} (s_i s_j c_j)(s_j\tilde x_j). \qquad (20)$$
Now, by definition of the set $I$ and of $\tilde r$, we have that $\langle\tilde x_j, \tilde r\rangle = s_j \in \{+1, -1\}$ for any $j \in I$. Taking the inner product of Eq. (20) with $\tilde r$, we obtain that $1 = \sum_{j\in I\setminus\{i\}} s_i s_j c_j$. Consequently, we have shown that if $\ker(X_I) \neq \{0\}$, then necessarily, for some $i \in I$ and coefficients $a_j := s_i s_j c_j$,
$$s_i\tilde x_i = \sum_{j\in I\setminus\{i\}} a_j\,(s_j\tilde x_j) \qquad\text{with}\qquad \sum_{j\in I\setminus\{i\}} a_j = 1,$$
which means that $s_i\tilde x_i$ lies in the affine space generated by $(s_j\tilde x_j)_{j\in I\setminus\{i\}}$. This is however impossible due to Assumption 1 (recall that without loss of generality $I\setminus\{i\}$ has at most $n$ elements, and trivially fewer than $d$ elements). Therefore $X_I$ is full rank and $\mathrm{Card}(I) \leq n$. Now notice that the constrained minimisation problem corresponds to $\arg\min_{\beta_i\geq 0,\, i\in I_+;\ \beta_i\leq 0,\, i\in I_-} \|y - X_I\beta_I\|_2^2$. Since $X_I$ is full rank, this restricted loss is strictly convex and the constrained minimisation problem has a unique minimum.

(2) $\Delta < +\infty$: notice that the optimality conditions of
$$\beta = \arg\min_{\substack{\beta_i\geq 0,\ i\in I_+ \\ \beta_i\leq 0,\ i\in I_- \\ \beta_i=0,\ i\notin I}} \|y - X_I\beta_I\|_2^2$$
are (i) $\beta$ satisfies the constraints, (ii) if $i \in I_+$ (resp. $i \in I_-$) then $[\nabla L(\beta)]_i \geq 0$ (resp. $[\nabla L(\beta)]_i \leq 0$), and (iii) if $\beta_i \neq 0$ for $i \in I$ then $[\nabla L(\beta)]_i = 0$. One can notice that condition (ii) ensures that at each iteration, for $\delta \leq \Delta_k$, $s_{k-1} - \delta\nabla L(\beta^{k-1}) \in [-1, 1]$ coordinate-wise.
Also, if $\nabla L(\beta^{k-1}) \neq 0$, then a coordinate of the vector $|s_{k-1} - \delta\nabla L(\beta^{k-1})|$ must necessarily hit $1$ for some finite $\delta > 0$; this value of $\delta$ corresponds to $\Delta_k$.

(3) The loss is strictly decreasing: let $I_{k-1}$ and $I_k$ be the equicorrelation sets defined in the algorithm at steps $k-1$ and $k$, and let $\beta^{k-1}$ and $\beta^k$ be the solutions of the corresponding constrained minimisation problems. Also, let $i_k$ be the newly added coordinate which hits the constraint at step $k$ (which we assume to be unique for simplicity). Without loss of generality, assume that $s_k(i_k) = +1$. Since the sets $I_{k-1,+}\setminus I_{k,+}$ and $I_{k-1,-}\setminus I_{k,-}$ are (if not empty) only composed of indices of coordinates of $\beta^{k-1}$ which are equal to $0$, one can notice that $\beta^{k-1}$ also satisfies the new constraints at step $k$. Therefore $L(\beta^k) \leq L(\beta^{k-1})$. Now, since $s(i_k)$ reaches $+1$ at step $k$, we have $[\nabla L(\beta^{k-1})]_{i_k} < 0$; from the strict convexity of the restricted loss on $I_k$, this implies that $\beta^k(i_k) > 0$ (the newly added coordinate $i_k$ must indeed activate), and therefore $\beta^{k-1} \neq \beta^k$ and $L(\beta^k) < L(\beta^{k-1})$.

(4) The algorithm terminates in at most $\min\big(2^d, \sum_{k=0}^n\binom{d}{k}\big)$ steps: recall that we showed in part (1) of the proof that, at each iteration $k$ of the algorithm, $I_k$ has at most $\min(n, d)$ elements. Since $\mathrm{supp}(\beta^k) \subset I_k$, the iterate $\beta^k$ has at most $\min(n, d)$ non-zero elements; also recall that we always have $\beta^k = \arg\min_{\beta_i=0,\, i\notin\mathrm{supp}(\beta^k)} L(\beta)$ (this minimisation problem has a unique solution by part (1) of the proof). There are hence at most $\min\big(2^d, \sum_{k=0}^n\binom{d}{k}\big)$ such minimisation problems. The loss being strictly decreasing, the algorithm cannot output the same solution $\beta$ at two different loops, and it must therefore terminate in at most $\min\big(2^d, \sum_{k=0}^n\binom{d}{k}\big)$ iterations by outputting a vector $\beta^\star$ such that $\nabla L(\beta^\star) = 0$, i.e. $\beta^\star \in \arg\min_\beta L(\beta)$ (and $L(\beta^\star) = 0$ in the overparametrised setting).

(5) The algorithm outputs the minimum $\ell_1$-norm solution: let $\beta^\star$ be the output of the algorithm after $p$ iterations. Notice that, by the definition of the successive sets $I_k$ and of the constraints of the minimisation problem, we have at each iteration $\langle s_k, \beta^k\rangle = \|\beta^k\|_1$. Therefore $\langle s_p, \beta^\star\rangle = \|\beta^\star\|_1$. Also, recall from part (1) of the proof that $s_p \in \mathrm{span}(x_1, \dots, x_n)$, which means that there exists $\tilde r \in \mathbb{R}^n$ such that $s_p = X^\top\tilde r$. Putting the two together, and since $\|s_p\|_\infty \leq 1$, we get that $X^\top\tilde r \in \partial\|\beta^\star\|_1$; this condition, along with the fact that $L(\beta^\star) = \min_\beta L(\beta)$, corresponds exactly to the KKT conditions of $\arg\min_{\beta\in\arg\min L}\|\beta\|_1$.

To put our upper bound on the number of iterations into perspective, the worst-case number of iterations of the LARS algorithm is $(3^d+1)/2$ [34]. Hence Algorithm 1 has fewer iterations in the worst-case setting. Whether an exponential dependency on the dimension is unavoidable for Algorithm 1 is unknown and we leave this question for future work. However, when the number of samples is much smaller than the dimension, we lose the exponential dependency. Indeed, for $\varepsilon := n/d \leq 1/2$, we have the upper bound $\sum_{k=0}^n\binom{d}{k} \leq 2^{H(\varepsilon)d}$, where $H(\varepsilon) = -\varepsilon\log_2(\varepsilon) - (1-\varepsilon)\log_2(1-\varepsilon)$ is the binary entropy. Since $H(\varepsilon) \leq -2\varepsilon\log_2(\varepsilon)$ for $\varepsilon \leq 1/2$, we get the upper bound $\sum_{k=0}^n\binom{d}{k} \leq 2^{H(\varepsilon)d} \leq (d/n)^{2n}$, which is much better than $2^d$. A short numerical sketch of the recursion behind Algorithm 1 is given below.
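The following Python sketch implements the recursion analysed in Proposition 7: maintain the dual variable $s$, find the smallest $\Delta$ at which a coordinate of $s - \Delta\nabla L(\beta)$ hits $\pm 1$, then solve the sign-constrained least-squares problem on the equicorrelation set. It is our own reading of Algorithm 1, not the paper's code: the variable names, tolerances and the use of scipy's bounded least-squares solver are assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

def algorithm_1_sketch(X, y, max_loops=50, tol=1e-10):
    """Sketch of the LARS-like recursion: returns the visited saddles and jump times."""
    n, d = X.shape
    beta, s, t = np.zeros(d), np.zeros(d), 0.0
    saddles, times = [beta.copy()], [0.0]
    for _ in range(max_loops):
        g = X.T @ (X @ beta - y) / n                     # gradient of L at the current saddle
        if np.linalg.norm(g) < tol:                      # global minimum reached: stop
            break
        # Smallest delta > 0 at which a coordinate of s - delta * g hits +1 or -1.
        with np.errstate(divide="ignore", invalid="ignore"):
            hit = np.where(g > tol, (s + 1.0) / g,
                  np.where(g < -tol, (s - 1.0) / g, np.inf))
        delta = hit[hit > tol].min()
        t += delta
        s = s - delta * g
        # Equicorrelation sets and sign-constrained least squares on them.
        I_plus, I_minus = np.isclose(s, 1.0), np.isclose(s, -1.0)
        I = I_plus | I_minus
        lb = np.where(I_plus[I], 0.0, -np.inf)           # beta_i >= 0 on I_+, <= 0 on I_-
        ub = np.where(I_plus[I], np.inf, 0.0)
        beta = np.zeros(d)
        beta[I] = lsq_linear(X[:, I], y, bounds=(lb, ub)).x
        saddles.append(beta.copy())
        times.append(t)
    return saddles, times

# Toy run on Gaussian data (illustrative): the final iterate should be the minimum
# l1-norm interpolator and the jump times should be increasing.
rng = np.random.default_rng(0)
n, d = 20, 40
X = rng.standard_normal((n, d))
y = X @ np.concatenate([np.array([3.0, -2.0, 1.0]), np.zeros(d - 3)])
saddles, times = algorithm_1_sketch(X, y)
print("jump times :", np.round(times[1:], 3))
print("active sets:", [np.flatnonzero(np.abs(b) > 1e-8).tolist() for b in saddles[1:]])
```

On sparse, well-conditioned problems the active sets typically grow one coordinate at a time; for general data, coordinates may also deactivate, as discussed next.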
D.2 Proof of Proposition 2

As mentioned several times, for general feature matrices $X$, complex behaviours can occur, with coordinates deactivating and changing sign several times. Here we show that for simple datasets whose feature matrix $X$ satisfies the restricted isometry property (RIP) [14], we can explicitly determine the jump times and the saddles as a function of the sparse predictor which we seek to recover. The non-realistic but enlightening extreme case of the RIP assumption is to consider a feature matrix such that $X^\top X/n = I_d$. In this case, letting $\beta^\star$ be the unique vector such that $y_i = \langle x_i, \beta^\star\rangle$ and assuming that $\beta^\star = (\beta^\star_1, \dots, \beta^\star_r, 0, \dots, 0)$ with $|\beta^\star_1| > \dots > |\beta^\star_r| > 0$, the loss writes $L(\beta) = \|\beta - \beta^\star\|_2^2/2$ and one can easily check that Algorithm 1 terminates in $r$ loops and outputs exactly $t_i = \frac{1}{|\beta^\star_i|}$ and $\beta^i = (\beta^\star_1, \dots, \beta^\star_i, 0, \dots, 0)$ for $i \leq r$ (the case where several coordinates of $\beta^\star$ are strictly equal can also be treated: for example, if $\beta^\star_1 = \beta^\star_2$ then the first output of the algorithm is directly $\beta^1 = (\beta^\star_1, \beta^\star_2, 0, \dots, 0)$). We now present the more realistic RIP setting, which is an adaptation of the previous observation.

Sparse regression with RIP and gap assumption.

(RIP) Assume that there exists an $r$-sparse vector $\beta^\star$ such that $y_i = \langle x_i, \beta^\star\rangle$. Furthermore, we assume that the feature matrix $X \in \mathbb{R}^{n\times d}$ satisfies the $2r$-restricted isometry property with constant $\tilde\varepsilon < \sqrt{2} - 1 < 1/2$: i.e., for every submatrix $X_s$ obtained by extracting any $s \leq 2r$ columns of $X$, the matrix $X_s^\top X_s/n$ of size $s\times s$ has all its eigenvalues in the interval $[1-\tilde\varepsilon, 1+\tilde\varepsilon]$.

(Gap assumption) Furthermore, we assume that the coordinates of the $r$-sparse vector $\beta^\star$ are sufficiently separated. W.l.o.g. we write $\beta^\star = (\beta^\star_1, \dots, \beta^\star_r, 0, \dots, 0)$ with $|\beta^\star_1| \geq \dots \geq |\beta^\star_r| > 0$ and we define $\lambda := \min_{i\in[r]}(|\beta^\star_i| - |\beta^\star_{i+1}|) \geq 0$ (with the convention $\beta^\star_{r+1} = 0$), which corresponds to the smallest gap between the entries of $|\beta^\star|$. We assume that $5\tilde\varepsilon\|\beta^\star\|_2 < \lambda/2$ and we let $\varepsilon := 5\tilde\varepsilon$.

A classic result from compressed sensing (see Candès [13, Theorem 1.2]) is that the $2r$-restricted isometry property with constant smaller than $\sqrt{2}-1$ ensures that the $\ell_0$-minimisation problem has a unique $r$-sparse solution, which is $\beta^\star$. Furthermore, it ensures that the minimum $\ell_1$-norm solution is unique and equal to $\beta^\star$. This means that Algorithm 1 will output $\beta^\star$ as its final iterate. We now recall the result which characterises the outputs of Algorithm 1 when the data satisfies the previous assumptions.

Proposition 2. Under the restricted isometry property and the gap assumption stated right above, Algorithm 1 terminates in $r$ loops and outputs
$$\beta^1 = (\beta^1[1], 0, \dots, 0) \quad\text{with}\quad \beta^1[1] \in [\beta^\star_1 - \varepsilon\|\beta^\star\|,\ \beta^\star_1 + \varepsilon\|\beta^\star\|],$$
$$\beta^2 = (\beta^2[1], \beta^2[2], 0, \dots, 0) \quad\text{with}\quad \beta^2[i] \in [\beta^\star_i - \varepsilon\|\beta^\star\|,\ \beta^\star_i + \varepsilon\|\beta^\star\|] \text{ for } i \leq 2,$$
$$\vdots$$
$$\beta^{r-1} = (\beta^{r-1}[1], \dots, \beta^{r-1}[r-1], 0, \dots, 0) \quad\text{with}\quad \beta^{r-1}[i] \in [\beta^\star_i - \varepsilon\|\beta^\star\|,\ \beta^\star_i + \varepsilon\|\beta^\star\|],$$
$$\beta^r = \beta^\star = (\beta^\star_1, \dots, \beta^\star_r, 0, \dots, 0),$$
at times $t_1, \dots, t_r$ such that $t_i \in \big[\tfrac{1}{|\beta^\star_i| + \varepsilon\|\beta^\star\|},\ \tfrac{1}{|\beta^\star_i| - \varepsilon\|\beta^\star\|}\big]$, where $\|\cdot\|$ denotes the $\ell_2$-norm.

Proof. Throughout the proof, $\|\cdot\|$ denotes the $\ell_2$-norm $\|\cdot\|_2$. For simplicity we assume that $\beta^\star_i > 0$ for all $i \in [r]$; the proof can easily be adapted to the general case. We first define $\xi := X^\top X/n - I_d$. By the restricted isometry property, for any $k \leq 2r$, any $k\times k$ square submatrix of $\xi$, which we denote $\xi_{kk}$, has its eigenvalues in $[-\tilde\varepsilon, \tilde\varepsilon]$. It also means that the eigenvalues of $(I_k + \xi_{kk})^{-1} - I_k$ lie in $[\tfrac{1}{1+\tilde\varepsilon}-1,\ \tfrac{1}{1-\tilde\varepsilon}-1] \subset [-2\tilde\varepsilon, 2\tilde\varepsilon]$. We now proceed by induction with the following induction hypothesis:
- $\beta^{k-1}$ has its support on its $(k-1)$ first coordinates, with $|\beta^{k-1}[i] - \beta^\star_i| \leq 5\tilde\varepsilon\|\beta^\star\|$ for $i < k$;
- $t_k \in \big[\tfrac{1}{\beta^\star_k + 5\tilde\varepsilon\|\beta^\star\|},\ \tfrac{1}{\beta^\star_k - 5\tilde\varepsilon\|\beta^\star\|}\big]$ and $s_{t_k}[k] = 1$;
- $s_{t_k}[i] \in [t_k(\beta^\star_i - 5\tilde\varepsilon\|\beta^\star\|),\ t_k(\beta^\star_i + 5\tilde\varepsilon\|\beta^\star\|)] \subset (-1, 1)$ for $i > k$.
From the induction hypothesis, the output of the algorithm at step $k$ is hence $\beta^k = \arg\min L(\beta)$ under the constraints $\beta[i] \geq 0$ for $i \leq k$ and $\beta[i] = 0$ otherwise. We first search for the solution of the minimisation problem without the sign constraint and still (abusively) denote it $\beta^k$: we will show that it turns out to satisfy the sign constraint and that it is therefore indeed $\beta^k$.
In the following, for a vector $v$ we denote by $v[:k]$ its $k$ first coordinates. Setting the $k$ first coordinates of the gradient to $0$, we get $[X^\top X(\beta^k - \beta^\star)][:k] = 0$, which leads to $(I_k + \xi_{kk})\beta^k[:k] = \beta^\star[:k] + [\xi\beta^\star][:k]$, and therefore
$$\beta^k[:k] = (I_k + \xi_{kk})^{-1}\big(\beta^\star[:k] + [\xi\beta^\star][:k]\big) = \beta^\star[:k] + [\xi\beta^\star][:k] + v_1,$$
where, from the bound on the eigenvalues of $(I_k + \xi_{kk})^{-1} - I_k$ and from $\|\xi\beta^\star\| \leq \tilde\varepsilon\|\beta^\star\|$,
$$\|v_1\| \leq 2\tilde\varepsilon\,\big\|\beta^\star[:k] + [\xi\beta^\star][:k]\big\| \leq 2\tilde\varepsilon\big(\|\beta^\star\| + \|\xi\beta^\star\|\big) \leq 2\tilde\varepsilon\big(\|\beta^\star\| + \tilde\varepsilon\|\beta^\star\|\big) \leq 4\tilde\varepsilon\|\beta^\star\|.$$
Therefore $\beta^k[:k] = \beta^\star[:k] + v_2$ where $v_2 = [\xi\beta^\star][:k] + v_1$, hence $\|v_2\| \leq 5\tilde\varepsilon\|\beta^\star\|$. Notice that, from the definition of $\lambda$ and the fact that $5\tilde\varepsilon\|\beta^\star\| < \lambda/2$, we have that $\beta^k[:k] \geq 0$ coordinate-wise, so that $\beta^k$ indeed verifies the sign constraint. Also note that $\|\beta^k\| \leq \|\beta^\star\| + 5\tilde\varepsilon\|\beta^\star\| \leq 4\|\beta^\star\|$.

For $t \geq t_k$, $s_t = s_{t_k} - (t - t_k)\nabla L(\beta^k)$, and $[\nabla L(\beta^k)][:k] = 0$, therefore $s_t[:k] = s_{t_k}[:k]$. Now, for $i > k$, $[\nabla L(\beta^k)]_i = n^{-1}[X^\top X(\beta^k - \beta^\star)]_i = -\beta^\star_i + [\xi(\beta^k - \beta^\star)]_i$. Since $(\beta^k - \beta^\star)$ is (at most) $2r$-sparse, we have
$$\big\|\xi(\beta^k - \beta^\star)\big\|_\infty \leq \big\|\xi(\beta^k - \beta^\star)\big\| \leq \tilde\varepsilon\|\beta^k - \beta^\star\| \leq \tilde\varepsilon\big(\|\beta^k\| + \|\beta^\star\|\big) \leq 5\tilde\varepsilon\|\beta^\star\| < \lambda/2. \qquad (21)$$
Now, from $s_t[i] = s_{t_k}[i] + (t - t_k)\beta^\star_i - (t - t_k)[\xi(\beta^k - \beta^\star)]_i$ and the induction hypothesis $s_{t_k}[i] \in [t_k(\beta^\star_i - 5\tilde\varepsilon\|\beta^\star\|),\ t_k(\beta^\star_i + 5\tilde\varepsilon\|\beta^\star\|)]$, we get (using the bound Eq. (21)) that $s_t[i] \in [t(\beta^\star_i - 5\tilde\varepsilon\|\beta^\star\|),\ t(\beta^\star_i + 5\tilde\varepsilon\|\beta^\star\|)]$. From the gap assumption $5\tilde\varepsilon\|\beta^\star\| < \lambda/2$, the next coordinate to activate is therefore necessarily the $(k+1)$-th, at a time $t_{k+1}$ with $s_{t_{k+1}}[k+1] = 1$ and
$$t_{k+1} \in \Big[\frac{1}{\beta^\star_{k+1} + 5\tilde\varepsilon\|\beta^\star\|},\ \frac{1}{\beta^\star_{k+1} - 5\tilde\varepsilon\|\beta^\star\|}\Big].$$
This proves the induction step. The algorithm cannot stop before iteration $r$, as $\beta^\star$ is the unique minimiser of $L$ with at most $r$ non-zero coordinates. It stops at iteration $r$ since $\beta^\star$ is the unique minimiser of $L(\beta)$ under the constraints $\beta_i \geq 0$ for $i \leq r$ and $\beta_i = 0$ otherwise.

E Proof of Theorem 2 and Proposition 3 through the arc-length parametrisation

In this section, we explain in more detail the arc-length reparametrisation which circumvents the appearance of discontinuous jumps and leads to the proof of Theorem 2. The main difficulty in showing the convergence stems from the discontinuity of the limit process $\tilde\beta^\circ$. We therefore cannot expect uniform convergence of $\tilde\beta^\alpha$ towards $\tilde\beta^\circ$ as $\alpha \to 0$. In addition, $\tilde\beta^\circ$ does not provide any insight into the path followed between the jumps.

Arc-length parametrisation. The high-level idea is to slow down time when the jumps occur. To do so, we follow the approach from [18, 36] and consider an arc-length parametrisation of the path, i.e., we consider
$$\tau_\alpha(t) = t + \int_0^t \big\|\dot{\tilde\beta}^\alpha_s\big\|\,\mathrm{d}s.$$
In Proposition 6, we showed that the full path length $\int_0^{+\infty}\|\dot{\tilde\beta}^\alpha_s\|\,\mathrm{d}s$ is finite and bounded independently of $\alpha$. Therefore $\tau_\alpha$ is a bijection on $\mathbb{R}_{\geq 0}$. We can then define the quantities $\hat t^\alpha_\tau = (\tau_\alpha)^{-1}(\tau)$ and $\hat\beta^\alpha_\tau = \tilde\beta^\alpha_{\hat t^\alpha_\tau}$. By construction, a simple chain rule leads to $\dot{\hat t}^\alpha_\tau + \|\dot{\hat\beta}^\alpha_\tau\| = 1$, which means that the speed of $(\hat\beta^\alpha_\tau)_\tau$ is always upper bounded by $1$, independently of $\alpha$. This behaviour is in stark contrast with the process $(\tilde\beta^\alpha_t)_t$, whose speed explodes at the jumps. It presents a major advantage, as we can now use the Arzelà–Ascoli theorem to extract a converging subsequence. A simple change of variable shows that the new process satisfies
$$-\int_0^\tau \dot{\hat t}^\alpha_s\,\nabla L(\hat\beta^\alpha_s)\,\mathrm{d}s = \nabla\tilde\phi_\alpha(\hat\beta^\alpha_\tau) \qquad\text{and}\qquad \dot{\hat t}^\alpha_\tau + \|\dot{\hat\beta}^\alpha_\tau\| = 1, \qquad (22)$$
started from $\hat\beta^\alpha_0 = 0$ and $\hat t^\alpha_0 = 0$.
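The effect of the arc-length reparametrisation can be visualised on a discretised trajectory: integrate the accelerated flow, accumulate $\tau_\alpha(t) = t + \int_0^t\|\dot{\tilde\beta}^\alpha_s\|\,\mathrm{d}s$, and compare the speeds in the two time scales. The sketch below does this with a crude Euler scheme on the $(u, v)$ parametrisation run in accelerated time ($\mathrm{d}w = -\ln(1/\alpha)\nabla F(w)\,\mathrm{d}t$); the data, step size and value of $\alpha$ are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, alpha = 20, 10, 1e-3
X = rng.standard_normal((n, d))
y = X @ np.concatenate([np.array([2.0, -1.0]), np.zeros(d - 2)])

# Accelerated flow: d(u, v) = -ln(1/alpha) * grad F(u, v) dt, with beta = u * v.
u, v = np.sqrt(2) * alpha * np.ones(d), np.zeros(d)
dt, T = 1e-4, 3.0
betas = [u * v]
for _ in range(int(T / dt)):
    g = X.T @ (X @ (u * v) - y) / n             # grad L at beta = u * v
    u, v = u - dt * np.log(1 / alpha) * g * v, v - dt * np.log(1 / alpha) * g * u
    betas.append(u * v)
betas = np.array(betas)

# Arc-length reparametrisation: tau(t) = t + cumulative path length of beta.
step_norms = np.linalg.norm(np.diff(betas, axis=0), axis=1)
speed_t = step_norms / dt                                        # ||d beta / dt||, explodes at the jumps
tau = dt * np.arange(len(betas)) + np.concatenate([[0.0], np.cumsum(step_norms)])
speed_tau = step_norms / np.diff(tau)                            # ||d beta / d tau||

print("max speed in t  :", round(speed_t.max(), 2))              # large near the jumps
print("max speed in tau:", round(speed_tau.max(), 3))            # <= 1 by construction of tau
```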
The next proposition states the convergence of the rescaled process, up to a subsequence.

Proposition 8. Let $T \geq 0$. For every $\alpha > 0$, let $(\hat t^\alpha, \hat\beta^\alpha)$ be the solution of Eq. (22). Then there exist a subsequence $(\hat t^{\alpha_k}, \hat\beta^{\alpha_k})_{k\in\mathbb{N}}$ and $(\hat t, \hat\beta)$ such that, as $\alpha_k \to 0$:
$$(\hat t^{\alpha_k}, \hat\beta^{\alpha_k}) \to (\hat t, \hat\beta) \quad\text{in } \big(C^0([0,T], \mathbb{R}\times\mathbb{R}^d), \|\cdot\|_\infty\big), \qquad (23)$$
$$(\dot{\hat t}^{\alpha_k}, \dot{\hat\beta}^{\alpha_k}) \rightharpoonup (\dot{\hat t}, \dot{\hat\beta}) \quad\text{weakly in } L^1[0,T]. \qquad (24)$$

Limiting dynamics. The limits $(\hat t, \hat\beta)$ satisfy
$$-\int_0^\tau \dot{\hat t}_s\,\nabla L(\hat\beta_s)\,\mathrm{d}s \in \partial\|\hat\beta_\tau\|_1 \qquad\text{and}\qquad \dot{\hat t}_\tau + \|\dot{\hat\beta}_\tau\| \leq 1. \qquad (25)$$

Heteroclinic orbit. In addition, when $\hat\beta_\tau$ is such that $|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau) \neq 0$, we have
$$\dot{\hat\beta}_\tau = -\frac{|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)}{\big\||\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)\big\|} \qquad\text{and}\qquad \dot{\hat t}_\tau = 0. \qquad (26)$$
Furthermore, the loss strictly decreases along the heteroclinic orbits and the path length $\int_0^T\|\dot{\hat\beta}_\tau\|\,\mathrm{d}\tau$ is upper bounded independently of $T$.

Proof. Differentiating Eq. (22) and using the Hessian of $\tilde\phi_\alpha$, we get
$$\dot{\hat\beta}^\alpha_\tau = -\dot{\hat t}^\alpha_\tau\,\big(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau)\big)^{-1}\nabla L(\hat\beta^\alpha_\tau) = -\big(1 - \|\dot{\hat\beta}^\alpha_\tau\|\big)\big(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau)\big)^{-1}\nabla L(\hat\beta^\alpha_\tau).$$
Taking norms, we obtain
$$\|\dot{\hat\beta}^\alpha_\tau\| = \frac{\big\|(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau))^{-1}\nabla L(\hat\beta^\alpha_\tau)\big\|}{1 + \big\|(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau))^{-1}\nabla L(\hat\beta^\alpha_\tau)\big\|},
\qquad\text{and therefore}\qquad
\dot{\hat\beta}^\alpha_\tau = -\frac{(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau))^{-1}\nabla L(\hat\beta^\alpha_\tau)}{1 + \big\|(\nabla^2\tilde\phi_\alpha(\hat\beta^\alpha_\tau))^{-1}\nabla L(\hat\beta^\alpha_\tau)\big\|}. \qquad (27)$$

Subsequence extraction. By construction, Eq. (22) gives $\dot{\hat t}^\alpha_\tau + \|\dot{\hat\beta}^\alpha_\tau\| = 1$, therefore the sequences $(\dot{\hat t}^\alpha)_\alpha$, $(\dot{\hat\beta}^\alpha)_\alpha$ as well as $(\hat t^\alpha)_\alpha$, $(\hat\beta^\alpha)_\alpha$ are uniformly bounded and uniformly Lipschitz on $[0,T]$. The Arzelà–Ascoli theorem yields that, up to a subsequence, there exists $(\hat t, \hat\beta)$ such that $(\hat t^{\alpha_k}, \hat\beta^{\alpha_k}) \to (\hat t, \hat\beta)$ in $(C^0([0,T], \mathbb{R}\times\mathbb{R}^d), \|\cdot\|_\infty)$. Since $\|\dot{\hat\beta}^\alpha_\tau\|, \dot{\hat t}^\alpha_\tau \leq 1$, the Banach–Alaoglu theorem gives, up to a new subsequence,
$$(\dot{\hat t}^{\alpha_k}, \dot{\hat\beta}^{\alpha_k}) \rightharpoonup (\dot{\hat t}, \dot{\hat\beta}) \quad\text{weakly-* in } L^\infty(0,T), \qquad (28)$$
with $\|\dot{\hat\beta}_\tau\| \leq \liminf_{\alpha_k}\|\dot{\hat\beta}^{\alpha_k}_\tau\| \leq 1$ and thus $\dot{\hat t}_\tau + \|\dot{\hat\beta}_\tau\| \leq 1$. Moreover,
$$\int_0^T \|\dot{\hat\beta}_\tau\|\,\mathrm{d}\tau \leq \int_0^T \liminf_{\alpha_k}\|\dot{\hat\beta}^{\alpha_k}_\tau\|\,\mathrm{d}\tau \leq \int_0^{+\infty} \liminf_{\alpha_k}\|\dot{\hat\beta}^{\alpha_k}_\tau\|\,\mathrm{d}\tau \leq \liminf_{\alpha_k}\int_0^{+\infty}\|\dot{\hat\beta}^{\alpha_k}_\tau\|\,\mathrm{d}\tau < C,$$
where the third inequality is Fatou's lemma. Note that since $[0,T]$ is bounded, the weak-* convergence also implies weak convergence in any $L^p(0,T)$, $1 \leq p < \infty$. Since $(\hat\beta^{\alpha_k})$ converges uniformly on $[0,T]$ and $\nabla L$ is continuous, $\nabla L(\hat\beta^{\alpha_k})$ converges uniformly to $\nabla L(\hat\beta)$. Since $\dot{\hat t}^{\alpha_k} \rightharpoonup \dot{\hat t}$ in $L^1(0,T)$, passing to the limit in the equation $\nabla\tilde\phi_\alpha(\hat\beta^\alpha_\tau) = -\int_0^\tau \dot{\hat t}^\alpha_s\nabla L(\hat\beta^\alpha_s)\,\mathrm{d}s$ leads to $-\int_0^\tau \dot{\hat t}_s\nabla L(\hat\beta_s)\,\mathrm{d}s \in \partial\|\hat\beta_\tau\|_1$, due to Lemma 2.

Recall from Eq. (27) and the definition of $\phi_\alpha$ that
$$\dot{\hat\beta}^\alpha_\tau = -\frac{\sqrt{(\hat\beta^\alpha_\tau)^2 + \alpha^4}\odot\nabla L(\hat\beta^\alpha_\tau)}{1/\ln(1/\alpha) + \big\|\sqrt{(\hat\beta^\alpha_\tau)^2 + \alpha^4}\odot\nabla L(\hat\beta^\alpha_\tau)\big\|}. \qquad (29)$$
Hence, assuming that $\hat\beta_\tau$ is such that $|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau) \neq 0$, we can ensure that $|\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'}) \neq 0$ for $\tau' \in [\tau, \tau+\varepsilon]$ and $\varepsilon$ small enough. The right-hand side of Eq. (29) then converges uniformly towards $-\frac{|\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})}{\||\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})\|}$ on $[\tau, \tau+\varepsilon]$. Using the dominated convergence theorem,
$$\int_\tau^{\tau+\varepsilon} -\frac{\sqrt{(\hat\beta^\alpha_{\tau'})^2+\alpha^4}\odot\nabla L(\hat\beta^\alpha_{\tau'})}{1/\ln(1/\alpha) + \|\sqrt{(\hat\beta^\alpha_{\tau'})^2+\alpha^4}\odot\nabla L(\hat\beta^\alpha_{\tau'})\|}\,\mathrm{d}\tau' \;\longrightarrow\; \int_\tau^{\tau+\varepsilon} -\frac{|\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})}{\||\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})\|}\,\mathrm{d}\tau'.$$
We therefore obtain $\dot{\hat\beta}_{\tau'} = -\frac{|\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})}{\||\hat\beta_{\tau'}|\odot\nabla L(\hat\beta_{\tau'})\|}$ in $L^1([\tau, \tau+\varepsilon])$. Consequently $\|\dot{\hat\beta}_{\tau'}\| = 1$ and $\dot{\hat t}_{\tau'} = 0$.

Proof that the loss strictly decreases along the heteroclinic orbits. Assume $\hat\beta_\tau$ is such that $|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau) \neq 0$; then the flow follows $\dot{\hat\beta}_\tau = -\frac{|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)}{\||\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)\|}$. Letting $\gamma(\tau) = \frac{1}{\||\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)\|}$, we get
$$\frac{\mathrm{d}}{\mathrm{d}\tau} L(\hat\beta_\tau) = -\gamma(\tau)\sum_i |\hat\beta_\tau(i)|\,[\nabla L(\hat\beta_\tau)]_i^2 < 0,$$
because $|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau) \neq 0$.

Borrowing terminology from [18], we can distinguish two regimes: when $\dot{\hat\beta}_\tau = 0$, the system sticks to the saddle point; when $\dot{\hat t}_\tau = 0$ and $\|\dot{\hat\beta}_\tau\| = 1$, the system switches to a viscous slip which follows the normalised flow Eq. (26). We use the term heteroclinic orbit, as in the dynamical-systems literature, since in the weight space $(u, v)$ it corresponds to a path which links two distinct critical points of the loss $F$. Since $\dot{\hat t}_\tau = 0$, this regime happens instantly in the original time scale $t$ (i.e. a jump occurs).
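The normalised flow (26) is straightforward to integrate numerically. The sketch below follows one unit of arc length of the orbit leaving the first saddle $\beta^0 = 0$ (nudging one coordinate away from zero so that $|\beta|\odot\nabla L(\beta) \neq 0$) and checks that the loss is strictly decreasing along it; the toy data and the choice of the perturbed coordinate are our illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 10
X = rng.standard_normal((n, d))
y = X @ np.concatenate([np.array([2.0, -1.0]), np.zeros(d - 2)])
grad_L = lambda beta: X.T @ (X @ beta - y) / n
loss = lambda beta: 0.5 * np.mean((X @ beta - y) ** 2)

# Start just off the first saddle beta^0 = 0: a tiny positive first coordinate
# makes |beta| * grad L(beta) non-zero, so the normalised flow of Eq. (26) is defined.
beta = np.zeros(d)
beta[0] = 1e-6
dtau, losses = 1e-3, [loss(beta)]
for _ in range(1000):                                  # follow one unit of arc length of the orbit
    driver = np.abs(beta) * grad_L(beta)               # |beta| entrywise-times grad L(beta)
    beta = beta - dtau * driver / np.linalg.norm(driver)   # unit-speed step of Eq. (26)
    losses.append(loss(beta))

print("arc length covered:", 1000 * dtau)
print("loss strictly decreases along the orbit:",
      all(a > b for a, b in zip(losses, losses[1:])))
```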
From Proposition 8, following the same reasoning as in Section 3, we can show that the rescaled process converges uniformly to a continuous saddle-to-saddle process in which the saddles are linked by normalised flows.

Theorem 3. Let $T > 0$. For all subsequences defined in Proposition 8, there exist times $0 = \tau'_0 < \tau_1 < \tau'_1 < \dots < \tau_p < \tau'_p < \tau_{p+1} = +\infty$ such that the iterates $(\hat\beta^{\alpha_k}_\tau)_\tau$ converge uniformly on $[0, T]$ to the following limit trajectory:
(Saddle) $\hat\beta_\tau = \beta^k$ for $\tau \in [\tau'_k, \tau_{k+1}]$, where $0 \leq k \leq p$;
(Orbit) $\dot{\hat\beta}_\tau = -\frac{|\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)}{\||\hat\beta_\tau|\odot\nabla L(\hat\beta_\tau)\|}$ for $\tau \in [\tau_{k+1}, \tau'_{k+1}]$, where $0 \leq k \leq p-1$;
where the saddles $(\beta^0 = 0, \beta^1, \dots, \beta^p = \beta^\star_{\ell_1})$ are those constructed in Algorithm 1. Also, the loss $(L(\hat\beta_\tau))_\tau$ is constant on the saddles and strictly decreasing on the orbits. Finally, independently of the chosen subsequence, for $k \in [p]$ we have $\hat t_{\tau_k} = \hat t_{\tau'_k} = t_k$, where the times $(t_k)_{k\in[p]}$ are defined through Algorithm 1.

Proof. Some parts of the proof are slightly technical. To simplify the exposition, we make use of auxiliary lemmas which are stated in Appendix F. The overall spirit follows the intuitive ideas given in Section 3 and relies on showing that Eq. (25) can only be satisfied if the iterates visit the saddles from Algorithm 1. We let $\hat s_\tau := -\int_0^\tau \dot{\hat t}_s\nabla L(\hat\beta_s)\,\mathrm{d}s$, which is continuous and satisfies $\hat s_\tau \in \partial\|\hat\beta_\tau\|_1$ from Eq. (25). Let $S = \{\beta \in \mathbb{R}^d,\ |\beta|\odot\nabla L(\beta) = 0\}$ denote the set of critical points and let $(\beta^k, t_k, s_k)$ be the successive values of $(\beta, t, s)$ which appear in the loops of Algorithm 1. We proceed by induction: we start by assuming that the iterates are stuck at the saddle $\beta^{k-1}$ at time $\tau'_{k-1}$, with $\hat t_{\tau'_{k-1}} = t_{k-1}$ and $\hat s_{\tau'_{k-1}} = s_{k-1}$ (induction hypothesis); we then show that they can only move at a time $\tau_k$ and must follow the normalised flow Eq. (26); we finally show that they must end up stuck at the next critical point $\beta^k$, which validates the induction hypothesis at the next step.

Jump time $\tau_k$ such that $\hat t_{\tau_k} = t_k$: we place ourselves at a time $\tau \geq \tau'_{k-1}$ at which the iterates are still stuck at the saddle $\beta^{k-1}$. Let $\tau_k := \sup\{\tau,\ \hat t_\tau \leq t_k\}$; we have $\tau_k < \infty$ from Lemma 3, and by continuity of $\hat t$ it holds that $\hat t_{\tau_k} = t_k$. Now notice that, while the iterates are stuck at $\beta^{k-1}$,
$$\hat s_\tau = \hat s_{\tau'_{k-1}} - (\hat t_\tau - \hat t_{\tau'_{k-1}})\nabla L(\beta^{k-1}) = s_{k-1} - (\hat t_\tau - t_{k-1})\nabla L(\beta^{k-1}).$$
We argue that, for any $\varepsilon > 0$, we cannot have $\hat\beta_\tau = \beta^{k-1}$ on $(\tau_k, \tau_k+\varepsilon)$. Indeed, by the definition of $\tau_k$ and from the algorithmic construction of the time $t_k$, this would lead to $|\hat s_\tau(i)| > 1$ for some coordinate $i \in [d]$, which contradicts Eq. (25). Therefore the iterates must move at time $\tau_k$.

Heteroclinic orbit leaving $\beta^{k-1}$ for $\tau \in [\tau_k, \tau'_k]$: contrary to before, our time rescaling enables us to capture what happens during the "jump". We have shown that for any $\varepsilon$ there exists $\tau_\varepsilon \in (\tau_k, \tau_k+\varepsilon)$ such that $\hat\beta_{\tau_\varepsilon} \neq \beta^{k-1}$. From Lemma 4, since the saddles visited along the flow are distinct, we must have $\hat\beta_{\tau_\varepsilon} \notin S$ for $\varepsilon$ small enough. The iterates therefore follow a heteroclinic flow leaving $\beta^{k-1}$ with a speed of $1$, given by Eq. (26). We now define $\tau'_k := \inf\{\tau > \tau_k :\ \exists\,\varepsilon_0 > 0,\ \forall\varepsilon\in[0,\varepsilon_0],\ \hat\beta_{\tau+\varepsilon} \in S\}$, which corresponds to the time at which the iterates reach a new critical point and stay there for at least a small time $\varepsilon_0$. We have just shown that $\tau'_k > \tau_k$. Now, from Proposition 8, the path length of $\hat\beta$ is finite, and from Lemma 4 the flow visits a finite number of distinct saddles at a speed of $1$. Putting these two arguments together, we get that $\tau'_k < +\infty$ and also that $\hat\beta_{\tau'_k+\varepsilon} = \hat\beta_{\tau'_k}$ for all $\varepsilon \in [0, \varepsilon_0]$. On another note, since $\dot{\hat t}_\tau = 0$ for $\tau \in [\tau_k, \tau'_k]$, we have $\hat t_{\tau'_k} = \hat t_{\tau_k}\,(= t_k)$ as well as $\hat s_{\tau_k} = \hat s_{\tau'_k}\,(= s_k)$.

Landing point $\beta^k$: we now want to determine to which saddle $\hat\beta_{\tau'_k} \in S$ the iterates have moved.
To that end, we consider the following sets, which also appear in Algorithm 1:
$$I_{\pm,k} := \{i \in \{1, \dots, d\}\ \text{s.t.}\ \hat s_{\tau'_k}(i) = \pm 1\} \qquad\text{and}\qquad I_k = I_{+,k}\cup I_{-,k}. \qquad (30)$$
The set $I_k$ corresponds to the coordinates of $\hat\beta_{\tau'_k}$ which are allowed (but not obliged) to be activated (i.e. non-zero). For $\tau \in [\tau'_k, \tau'_k+\varepsilon_0]$ we have $\hat s_\tau = \hat s_{\tau'_k} - (\hat t_\tau - t_k)\nabla L(\hat\beta_{\tau'_k})$. By continuity of $\hat s$ and the fact that $\hat s_\tau \in \partial\|\hat\beta_{\tau'_k}\|_1$, this equality translates into:
- if $i \notin I_k$, then $\hat\beta_{\tau'_k}(i) = 0$;
- if $i \in I_{+,k}$, then $[\nabla L(\hat\beta_{\tau'_k})]_i \geq 0$ and $\hat\beta_{\tau'_k}(i) \geq 0$;
- if $i \in I_{-,k}$, then $[\nabla L(\hat\beta_{\tau'_k})]_i \leq 0$ and $\hat\beta_{\tau'_k}(i) \leq 0$;
- for $i \in I_k$, if $\hat\beta_{\tau'_k}(i) \neq 0$, then $[\nabla L(\hat\beta_{\tau'_k})]_i = 0$.
One can then notice that these conditions exactly correspond to the optimality conditions of the following constrained minimisation problem:
$$\arg\min_{\substack{\beta_i\geq 0,\ i\in I_{+,k} \\ \beta_i\leq 0,\ i\in I_{-,k} \\ \beta_i=0,\ i\notin I_k}} L(\beta). \qquad (31)$$
We showed in Proposition 7 that the solution of this problem is unique and equal to $\beta^k$ from Algorithm 1. Therefore $\hat\beta_\tau = \beta^k$ for $\tau \in [\tau'_k, \tau'_k+\varepsilon_0]$.

It finally remains to show that $\hat\beta_\tau = \beta^k$ for all $\tau \leq \tau_{k+1}$, where $\tau_{k+1} := \sup\{\tau,\ \hat t_\tau \leq t_{k+1}\}$. For this, let $\tau \in [\tau'_k, \tau_{k+1}]$ and notice that for $i \notin I_k$ we necessarily have $\hat\beta_\tau(i) = \beta^k(i) = 0$, as otherwise we would break the continuity of $\hat s_\tau$. Similarly, for $i \in I_{+,k}$ we necessarily have $\hat\beta_\tau(i) \geq 0$, and for $i \in I_{-,k}$, $\hat\beta_\tau(i) \leq 0$, for the same continuity reasons. Now assume that $\hat\beta_\tau(I_k) \neq \beta^k(I_k)$. Then, from Lemma 4 and the continuity of the flow, there exists $\tau' \in (\tau'_k, \tau)$ such that $\hat\beta_{\tau'} \notin S$, and there must exist a heteroclinic flow Eq. (26) starting from $\beta^k$ and passing through $\hat\beta_{\tau'}$. This is absurd, since along this flow the loss strictly decreases, which contradicts the definition of $\beta^k$ as the minimiser of problem Eq. (31).

E.1 Proof of Theorem 2

Theorem 3 enables us to prove Theorem 2, which we recall below, without difficulty. Indeed, we can show that any extracted limit $\hat\beta$ maps back to the unique discontinuous process $\tilde\beta^\circ$.

Theorem 2. Let the saddles $(\beta^0 = 0, \beta^1, \dots, \beta^{p-1}, \beta^p = \beta^\star_{\ell_1})$ and jump times $(t_0 = 0, t_1, \dots, t_p)$ be the outputs of Algorithm 1, and let $(\tilde\beta^\circ_t)_t$ be the piecewise constant process defined as follows:
(Saddles) $\tilde\beta^\circ_t = \beta^k$ for $t \in (t_k, t_{k+1})$ and $0 \leq k \leq p$, with $t_{p+1} = +\infty$.
The accelerated flow $(\tilde\beta^\alpha_t)_t$ defined in Eq. (9) converges uniformly towards the limiting process $(\tilde\beta^\circ_t)_t$ on any compact subset of $\mathbb{R}_{\geq 0}\setminus\{t_1, \dots, t_p\}$.

Proof. We directly apply Theorem 3; let $\alpha_k$ be the subsequence from the theorem. Let $\varepsilon > 0$; for simplicity we prove the result on $[t_1+\varepsilon, t_2-\varepsilon]$, all the other compacts following the same line of proof. Since $\hat t^{\alpha_k}(\tau'_1) \to t_1$ and $\hat t^{\alpha_k}(\tau_2) \to t_2$, for $\alpha_k$ small enough we have $\hat t^{\alpha_k}(\tau'_1) \leq t_1+\varepsilon$ and $\hat t^{\alpha_k}(\tau_2) \geq t_2-\varepsilon$; by the monotonicity of $\tau_{\alpha_k}$, this means that for $\alpha_k$ small enough, $\tau'_1 \leq \tau_{\alpha_k}(t_1+\varepsilon)$ and $\tau_2 \geq \tau_{\alpha_k}(t_2-\varepsilon)$. Therefore
$$\sup_{t\in[t_1+\varepsilon,\, t_2-\varepsilon]}\|\tilde\beta^{\alpha_k}_t - \beta^1\| = \sup_{\tau\in[\tau_{\alpha_k}(t_1+\varepsilon),\, \tau_{\alpha_k}(t_2-\varepsilon)]}\|\hat\beta^{\alpha_k}_\tau - \beta^1\| \leq \sup_{\tau\in[\tau'_1, \tau_2]}\|\hat\beta^{\alpha_k}_\tau - \beta^1\|,$$
which goes to $0$ by Theorem 3. Since this result is independent of the chosen subsequence $\alpha_k$, we obtain the result of Theorem 2.

E.2 Proof of Proposition 3

We restate and prove Proposition 3 below.

Proposition 3. For all $T > t_p$, the graph of the iterates $(\tilde\beta^\alpha_t)_{t\leq T}$ converges to that of $(\hat\beta_\tau)_{\tau\geq 0}$:
$$\mathrm{dist}\big(\{\tilde\beta^\alpha_t\}_{t\leq T},\ \{\hat\beta_\tau\}_{\tau\geq 0}\big) \xrightarrow[\alpha\to 0]{} 0 \qquad\text{(Hausdorff distance)}.$$

Proof. For $\alpha$ small enough, we have $\hat t^\alpha_{\tau'_p} \leq t_p + \varepsilon \leq T$, and therefore
$$\sup_{\tau\geq 0}\ d\big(\hat\beta_\tau, \{\tilde\beta^\alpha_t\}_{t\leq T}\big) = \sup_{\tau\leq\tau'_p}\ d\big(\hat\beta_\tau, \{\tilde\beta^\alpha_t\}_{t\leq T}\big) \leq \sup_{\tau\leq\tau'_p}\ \big\|\hat\beta_\tau - \tilde\beta^\alpha_{\hat t^\alpha_\tau}\big\| = \sup_{\tau\leq\tau'_p}\ \big\|\hat\beta_\tau - \hat\beta^\alpha_\tau\big\| \xrightarrow[\alpha\to 0]{} 0,$$
according to Theorem 3. Similarly,
$$\sup_{t\leq T}\ d\big(\tilde\beta^\alpha_t, \{\hat\beta_\tau\}_\tau\big) = \sup_{\tau\leq\tau_\alpha(T)}\ d\big(\hat\beta^\alpha_\tau, \{\hat\beta_\tau\}_\tau\big) \leq \sup_{\tau\leq\tau_\alpha(T)}\ \big\|\hat\beta^\alpha_\tau - \hat\beta_\tau\big\| \xrightarrow[\alpha\to 0]{} 0,$$
according to Theorem 3, which concludes the proof.
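Proposition 3 is stated in terms of the Hausdorff distance between (sampled) graphs of trajectories. For completeness, here is a small self-contained helper, ours and not from the paper, that computes this distance for two discretised curves; it is the kind of quantity one would monitor when reproducing Figure 1 (right)-style comparisons.

```python
import numpy as np

def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite samplings of curves,
    given as arrays of shape (num_points, dim)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)   # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# Tiny illustration: a curve and a slightly perturbed resampling of it.
t = np.linspace(0.0, 1.0, 200)
curve = np.stack([t, t**2], axis=1)
perturbed = np.stack([t, t**2 + 0.01 * np.sin(20 * t)], axis=1)
print(hausdorff(curve, perturbed))   # small, of the order of the perturbation
```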
F Technical lemmas

The following lemma describes the behaviour of $\nabla\tilde\phi_\alpha(\beta^\alpha)$ as $\alpha\to 0$ in terms of the subdifferential $\partial\|\cdot\|_1$.

Lemma 2. Let $(\beta^\alpha)_{\alpha>0}$ be such that $\beta^\alpha \to \beta \in \mathbb{R}^d$ as $\alpha\to 0$. Then:
- if $\beta_i > 0$, then $[\nabla\tilde\phi_\alpha(\beta^\alpha)]_i$ converges to $1$;
- if $\beta_i < 0$, then $[\nabla\tilde\phi_\alpha(\beta^\alpha)]_i$ converges to $-1$.
Moreover, if we assume that $\nabla\tilde\phi_\alpha(\beta^\alpha)$ converges to some $\eta \in \mathbb{R}^d$, we have that $\eta_i \in (-1, 1) \Rightarrow \beta_i = 0$ and $\beta_i = 0 \Rightarrow \eta_i \in [-1, 1]$. Overall, assuming that $(\beta^\alpha, \nabla\tilde\phi_\alpha(\beta^\alpha)) \to (\beta, \eta)$ as $\alpha\to 0$, we can write $\eta \in \partial\|\beta\|_1$.

Proof. We have that
$$[\nabla\tilde\phi_\alpha(\beta^\alpha)]_i = \frac{1}{2\ln(1/\alpha)}\,\mathrm{arcsinh}\Big(\frac{\beta^\alpha_i}{\alpha^2}\Big) = \frac{1}{2\ln(1/\alpha)}\,\ln\Big(\frac{\beta^\alpha_i}{\alpha^2} + \sqrt{1 + \frac{(\beta^\alpha_i)^2}{\alpha^4}}\Big).$$
Now assume that $\beta^\alpha_i \to \beta_i > 0$; then $[\nabla\tilde\phi_\alpha(\beta^\alpha)]_i \to 1$. If $\beta_i < 0$, we conclude using that $\mathrm{arcsinh}$ is an odd function. All the remaining claims are simple consequences of this computation.

The following lemma shows that any extracted limit $\hat t$ as defined in Proposition 8 diverges to $+\infty$. This divergence is crucial as it implies that the rescaled iterates $(\hat\beta_\tau)_\tau$ explore the whole trajectory.

Lemma 3. For any extracted limit $\hat t$ as defined in Proposition 8, we have $\tau - C \leq \hat t_\tau$, where $C$ is the upper bound on the length of the curves given in Proposition 6.

Proof. Recall that $\tau_\alpha(t) = t + \int_0^t\|\dot{\tilde\beta}^\alpha_s\|\,\mathrm{d}s$. From Proposition 6, the full path length $\int_0^{+\infty}\|\dot{\tilde\beta}^\alpha_s\|\,\mathrm{d}s$ is finite and bounded by some constant $C$ independently of $\alpha$. Therefore $\tau_\alpha$ is a bijection on $\mathbb{R}_{\geq 0}$ and we defined $\hat t^\alpha_\tau = (\tau_\alpha)^{-1}(\tau)$. Furthermore, $\tau_\alpha(t) \leq t + C$ leads to $t \leq \hat t^\alpha(t+C)$, and therefore $\tau - C \leq \hat t^\alpha(\tau)$ for all $\tau \geq 0$. This inequality still holds for any converging subsequence, which proves the result.

Under a mild additional assumption on the data (see Assumption 2), we showed after the proof of Proposition 1 in Appendix B that the number of saddles of $F$ is finite. Without this assumption, the number of saddles is a priori not finite. However, the following lemma shows that the number of saddles which can potentially be visited along the flow of $\hat\beta$ is indeed finite.

Lemma 4. The limiting flow $\hat\beta$ as defined in Proposition 8 can only visit a finite number of critical points $\beta \in S := \{\beta \in \mathbb{R}^d,\ \beta\odot\nabla L(\beta) = 0\}$, and it can visit each one of them at most once.

Proof. Let $\tau \geq 0$ and assume that $\hat\beta_\tau \in S$, i.e., we are at a critical point at time $\tau$. From Proposition 1, we have that
$$\hat\beta_\tau \in \arg\min_{\beta_i = 0 \text{ for } i\notin\mathrm{supp}(\hat\beta_\tau)} L(\beta). \qquad (32)$$
Let us define the sets $I_\pm := \{i \in \{1, \dots, d\}\ \text{s.t.}\ \hat s_\tau(i) = \pm 1\}$ and $I = I_+\cup I_-$. The set $I$ corresponds to the coordinates of $\hat\beta_\tau$ which are allowed (but not obliged) to be non-zero, since from Eq. (25) we have $\mathrm{supp}(\hat\beta_\tau) \subset I$. Now, given that the sub-matrix $X_I = (\tilde x_i)_{i\in I} \in \mathbb{R}^{n\times\mathrm{Card}(I)}$ is full rank (see part (1) of the proof of Proposition 7 for the explanation), the solution of the minimisation problem (32) is unique and equal to the vector $\beta$ with $\beta[\xi] = (X_\xi^\top X_\xi)^{-1}X_\xi^\top y$ and $\beta[\xi^C] = 0$, where $\xi = \mathrm{supp}(\hat\beta_\tau)$. There are $2^d = \mathrm{Card}(\mathcal{P}([d]))$ constraint sets of the form $\{\beta_i = 0,\ i\notin A\}$ with $A \subset [d]$ (where $\mathcal{P}([d])$ contains all the subsets of $[d]$), and $\hat\beta_\tau$ is the unique solution for one of them. $\hat\beta_\tau$ can therefore take at most $2^d$ values (a very crude upper bound). There is therefore a finite number of critical points which can be reached by the flow $\hat\beta$. Furthermore, from Proposition 8 the loss is strictly decreasing along the heteroclinic orbits, so each of these critical points can be visited at most once.
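Lemma 2 can be sanity-checked numerically: the quantity $\mathrm{arcsinh}(\beta_i/\alpha^2)/(2\ln(1/\alpha))$ converges to $\mathrm{sign}(\beta_i)$ for $\beta_i \neq 0$ and stays at $0$ for $\beta_i = 0$. A short illustration (the values of $\beta$ and $\alpha$ are arbitrary choices of ours):

```python
import numpy as np

def grad_phi_tilde(beta, alpha):
    # [grad phi_tilde_alpha(beta)]_i = arcsinh(beta_i / alpha^2) / (2 ln(1/alpha)), cf. Lemma 2
    return np.arcsinh(beta / alpha**2) / (2 * np.log(1 / alpha))

beta = np.array([0.5, -2.0, 0.0])
for alpha in [1e-2, 1e-4, 1e-8]:
    print(alpha, np.round(grad_phi_tilde(beta, alpha), 3))
# Non-zero coordinates approach sign(beta_i) = +/-1 as alpha -> 0 (possibly from outside
# [-1, 1] for finite alpha); the zero coordinate stays at 0, an interior point of [-1, 1].
```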