# Smooth Bilevel Programming for Sparse Regularization

Clarice Poon (Department of Mathematical Sciences, University of Bath, Bath BA2 7AY, UK, cmshp20@bath.ac.uk) and Gabriel Peyré (CNRS and DMA, École Normale Supérieure, PSL University, 45 rue d'Ulm, F-75230 Paris cedex 05, France, gabriel.peyre@ens.fr)

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Iteratively reweighted least squares (IRLS) is a popular approach to solve sparsity-enforcing regression problems in machine learning. State-of-the-art approaches are more efficient but typically rely on specific coordinate pruning schemes. In this work, we show how a surprisingly simple re-parametrization of IRLS, coupled with a bilevel resolution (instead of an alternating scheme), is able to achieve top performance on a wide range of sparsity regularizations (such as Lasso, group Lasso and trace norm regularizations), regularization strengths (including hard constraints), and design matrices (ranging from correlated designs to differential operators). Similarly to IRLS, our method only involves linear system resolutions, but in sharp contrast, it corresponds to the minimization of a smooth function. Despite being non-convex, we show that there are no spurious minima and that saddle points are "ridable", so that there always exists a descent direction. We thus advocate the use of a BFGS quasi-Newton solver, which makes our approach simple, robust and efficient. We perform a numerical benchmark of the convergence speed of our algorithm against state-of-the-art solvers for Lasso, group Lasso, trace norm and linearly constrained problems. These results highlight the versatility of our approach, removing the need to use different solvers depending on the specificity of the ML problem under study.

1 Introduction

Regularized empirical risk minimization is a workhorse of supervised learning, and for a linear model it reads

$\min_{\beta \in \mathbb{R}^n} R(\beta) + \frac{1}{\lambda} L(X\beta, y)$   (Pλ)

where $X \in \mathbb{R}^{m \times n}$ is the design matrix (m being the number of samples and n the number of features), $L : \mathbb{R}^m \times \mathbb{R}^m \to [0, \infty)$ is the loss function, and $R : \mathbb{R}^n \to [0, \infty)$ the regularizer. Here λ ≥ 0 is the regularisation parameter, which is typically tuned by cross-validation; in the limit case λ = 0, (P0) is a constrained problem min_β R(β) under the constraint L(Xβ, y) = 0.

In this work, we focus our attention on sparsity-enforcing penalties, which induce some form of structure on the solution of (Pλ), the most celebrated examples (reviewed in Section 2) being the Lasso, group-Lasso and trace norm regularizers. All these regularizers, and many more (as detailed in Section 2), can be conveniently re-written as an infimum of quadratic functions. While Section 2 reviews more general formulations, this so-called quadratic variational form is especially simple in the case of block-separable functionals (such as the Lasso and group-Lasso), where one has

$R(\beta) = \min_{\eta \in \mathbb{R}^k_+} \frac{1}{2} \sum_{g \in G} \frac{\|\beta_g\|_2^2}{\eta_g} + \frac{1}{2} h(\eta)$   (1)

where G is a partition of {1, ..., n}, k = |G| is the number of groups and $h : \mathbb{R}^k_+ \to [0, \infty)$. An important example is the group-Lasso, where $R(\beta) = \sum_g \|\beta_g\|_2$ is a group-ℓ¹ norm, in which case $h(\eta) = \sum_i \eta_i$. The special case of the Lasso, corresponding to the ℓ¹ norm, is obtained when g = {i} for i = 1, ..., n and k = n. This quadratic variational form (1) is at the heart of the iteratively reweighted least squares (IRLS) approach, reviewed in Section 1.1. We refer to Section 2 for an in-depth exposition of these formulations.
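To make (1) concrete, here is a small numerical check (an illustrative sketch, not code from the paper): for the group-Lasso with h(η) = Σ_g η_g, minimising the right-hand side of (1) over η recovers the group norm, the per-group minimiser being η_g = ‖β_g‖₂.

```python
# Numerical sanity check of the quadratic variational form (1) for the group-Lasso,
# where h(eta) = sum_g eta_g: for each group, the minimum over eta_g of
#   (1/2) (||beta_g||^2 / eta_g + eta_g)
# is attained at eta_g = ||beta_g|| and equals ||beta_g||.
# Illustrative sketch only; the group structure below is arbitrary.
import numpy as np

rng = np.random.default_rng(0)
groups = [[0, 1, 2], [3, 4], [5]]          # a partition of {0, ..., 5}
beta = rng.standard_normal(6)

R = sum(np.linalg.norm(beta[g]) for g in groups)          # group-Lasso norm

# brute-force minimisation of (1) over a fine grid of eta values
grid = np.linspace(1e-4, 10, 200001)
val = sum(np.min(0.5 * (np.linalg.norm(beta[g]) ** 2 / grid + grid)) for g in groups)

print(R, val)   # the two values agree up to the grid resolution
```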
Sparsity regularized problems (Pλ) are notoriously difficult to solve, especially for small λ, because R is a non-smooth function. It is the non-smoothness of R which forces the solutions of (Pλ) to belong to low-dimensional spaces (or more generally manifolds), the canonical example being spaces of sparse vectors when solving a Lasso problem. We refer to Bach et al. [2011] for an overview of sparsity-enforcing regularization methods. The core idea of our algorithm is that a simple reparameterization of (1), combined with bilevel programming (i.e. solving two nested optimization problems), turns (Pλ) into a smooth program which is much better conditioned and can be tackled using standard but highly efficient optimization techniques such as quasi-Newton (L-BFGS). Indeed, by doing the change of variable $(v_g, u_g) = (\sqrt{\eta_g}, \beta_g/\sqrt{\eta_g})$ in (1), (Pλ) is equivalent to

$\min_{v \in \mathbb{R}^k} f(v)$ where $f(v) \triangleq \min_{u \in \mathbb{R}^n} G(u, v)$,   (2)

$G(u, v) \triangleq \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 + \frac{1}{\lambda} L(X(v \odot_G u), y)$.   (3)

Throughout, we define ⊙ to be the standard Hadamard product and, for $v \in \mathbb{R}^k$ and $u \in \mathbb{R}^n$, we define $v \odot_G u \in \mathbb{R}^n$ to be such that $(v \odot_G u)_g = v_g u_g$. Provided that $v \mapsto h(v \odot v)$ is differentiable and L(·, y) is a convex, proper, lower semicontinuous function, the inner minimisation problem has a unique solution and f is differentiable. Moreover, in the case of the quadratic loss $L(z, y) \triangleq \frac{1}{2}\|z - y\|_2^2$, the gradient of f can be computed in closed form by solving a linear system of dimension m or n. This paper is thus devoted to studying the theoretical and algorithmic implications of this simple twist on the celebrated IRLS approach.

Comparison with proximal gradient. To provide some intuition about the proposed approach, the figure on the right contrasts the iterations of a gradient descent on f and of the iterative soft thresholding algorithm (ISTA) on the Lasso. We consider a random Gaussian matrix $X \in \mathbb{R}^{10 \times 20}$ with $\lambda = \|X^\top y\|_\infty / 10$. The ISTA trajectory is non-smooth when some feature crosses 0. In particular, if a coefficient (such as the red one on the figure) is initialized with the wrong sign, it takes many iterations for ISTA to flip its sign. In sharp contrast, the gradient flow of f does not exhibit such a singularity and displays smooth geometric convergence. We refer to the appendix for an analysis of this phenomenon.

Contributions. Our main contribution is a new versatile algorithm for sparse regularization, which applies standard smooth optimization methods to minimize the function f in (2). We first propose in Section 2 a generic class of regularizers R that enjoy a quadratic variational form. This section recaps existing results under a common umbrella and shows the generality of our approach. Section 3.1 then gathers the theoretical analysis of the method, and in particular the proof that, while being non-convex, the function f has no spurious local minima and only ridable saddle points. As a result, one can guarantee convergence to a global minimum for many optimisation schemes, such as gradient descent with random perturbations [Lee et al., 2017, Jin et al., 2017] or trust region type methods [Pascanu et al., 2014]. Furthermore, for the case of the group Lasso, we show that f is an infinitely differentiable function with uniformly bounded Hessian. Consequently, standard solvers such as Newton's method/BFGS can be applied, with a superlinear convergence guarantee. Section 4 performs a detailed numerical study of the method and benchmarks it against several popular competing algorithms for Lasso, group-Lasso and trace norm regularization. Our method is consistently amongst the best performers, is in particular very efficient for small values of λ, and can even cope with the constrained case λ = 0. A minimal code sketch of the reparametrization is given below.
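```python
# Illustrative sketch (not the authors' implementation) of the reparametrized function f
# for the Lasso with quadratic loss: the inner problem in u is a plain least squares,
# and the closed-form gradient (Theorem 2 / Corollary 1 below) is checked by finite differences.
import numpy as np

rng = np.random.default_rng(0)
m, n = 10, 20
X, y = rng.standard_normal((m, n)), rng.standard_normal(m)
lam = 0.1 * np.max(np.abs(X.T @ y))

def f_and_grad(v):
    Xv = X * v                                                  # X diag(v)
    u = np.linalg.solve(Xv.T @ Xv + lam * np.eye(n), Xv.T @ y)  # inner least squares, cf. (11)
    alpha = (y - Xv @ u) / lam                                  # dual variable of the inner problem
    val = 0.5 * (v @ v + u @ u) + np.sum((Xv @ u - y) ** 2) / (2 * lam)
    grad = v * (1.0 - (X.T @ alpha) ** 2)                       # closed-form gradient, cf. (9)
    return val, grad

v0 = rng.standard_normal(n)
val, grad = f_and_grad(v0)
e = np.zeros(n); e[3] = 1e-6                                    # finite-difference check of one coordinate
print(grad[3], (f_and_grad(v0 + e)[0] - val) / 1e-6)            # the two values agree
```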
1.1 Related works

State-of-the-art solvers for sparse optimisation. Popular approaches to nonsmooth sparse optimisation include proximal methods. The simplest instance is the Forward-Backward algorithm [Lions and Mercier, 1979, Daubechies et al., 2004], which handles the case where L is a smooth function. There are many related inertial acceleration schemes, such as FISTA [Beck and Teboulle, 2009], and particularly effective techniques that lead to substantial speedups are the adaptive choice of stepsizes and restarting strategies [O'Donoghue and Candes, 2015]. Other related approaches are proximal quasi-Newton and variable metric methods [Combettes and Vũ, 2014, Becker et al., 2019], which incorporate variable metrics into the quadratic term of the proximal operator. Another popular approach, particularly for Lasso-like problems, is coordinate descent [Friedman et al., 2010]. These schemes are typically combined with support pruning schemes [Ghaoui et al., 2010, Ndiaye et al., 2017, Massias et al., 2018]. In the case where both the regularisation and loss terms are nonsmooth (e.g. in the basis pursuit setting), typical solvers are the primal-dual [Chambolle and Pock, 2011], ADMM [Boyd et al., 2011] and Douglas-Rachford algorithms [Douglas and Rachford, 1956]. Although these schemes are very popular due to their relatively low per-iteration complexity, they have sublinear convergence rates in general, with linear convergence under strong convexity.

The quadratic variational formulation and IRLS. Quadratic variational formulations such as (1) have been exploited in many early computer vision works, such as Geiger and Yuille [1991] and Geman and Reynolds [1992]. One of the first theoretical results can be found in Geman and Reynolds [1992]; see also Black and Rangarajan [1996], which lists many examples. Our result in Theorem 1 can be seen as a generalisation of these early works. Norm regularizers of this form were introduced in Micchelli et al. [2013], and further studied in the monograph Bach et al. [2011] under the name of subquadratic norms. The main algorithmic consequence of the quadratic variational formulation in the literature is the iteratively reweighted least squares (IRLS) algorithm. When $L(z, y) = \frac{1}{2}\|z - y\|_2^2$ is the quadratic loss, a natural optimisation strategy is alternating minimisation. However, due to the $1/\eta_g$ term, one needs to introduce some regularisation to ensure convergence. One popular approach is to add $\varepsilon \sum_g \eta_g^{-1}$ to the formulation (1), which leads to the IRLS algorithm [Daubechies et al., 2010]; a minimal sketch is given below. The ε-term can be seen as a barrier keeping $\eta_g$ positive, which is reminiscent of interior point methods. The idea of IRLS is to alternate minimisation over β and η, where the minimisation with respect to β is a least squares problem and the minimisation with respect to η admits a closed form solution. A nuclear norm version of IRLS has been used in Argyriou et al. [2008], where an alternating minimisation algorithm was introduced. Finally, we remark that although nonconvex formulations for various low complexity regularizers have appeared in the literature, see for instance Rennie and Srebro [2005], Hastie et al. [2015], Mardani and Giannakis [2015], Hoff [2017] for the case of the ℓ¹ and nuclear norms, they are typically associated with alternating minimisation algorithms.
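```python
# Minimal IRLS sketch for the Lasso, as described above (this is the classical baseline,
# not the method proposed in the paper). The eta-update follows from minimising (1) with
# the extra term eps * sum_g 1/eta_g, which gives eta_g = sqrt(beta_g^2 + 2*eps); other
# smoothing conventions (e.g. eps inside the square root without the factor 2) are common.
import numpy as np

def irls_lasso(X, y, lam, eps=1e-8, n_iter=200):
    n = X.shape[1]
    eta = np.ones(n)
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # beta-step: weighted least squares, minimising 1/2 sum_i beta_i^2/eta_i + 1/(2 lam) |X beta - y|^2
        beta = np.linalg.solve(XtX + lam * np.diag(1.0 / eta), Xty)
        # eta-step: closed-form minimiser of 1/2 (beta_i^2/eta_i + eta_i) + eps/eta_i
        eta = np.sqrt(beta ** 2 + 2 * eps)
    return beta
```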
Variable projection/reduced gradient approaches. IRLS methods are quite slow because the resulting minimization problem is poorly conditioned. Adding the smoothing term $\varepsilon \sum_g \eta_g^{-1}$ only partly alleviates this, and also breaks the sparsity-enforcing property of the regularizer R. We avoid both issues in (2) by solving a reduced problem which is much better conditioned and smooth. This idea of solving a bi-variate problem by re-casting it as a bilevel program is classical; we refer in particular to [Rockafellar and Wets, 2009, Chap. 10] for some general theoretical results on reduced gradients, and also to Danskin's theorem (although this is restricted to the convex setting) in Bertsekas [1997]. Our formulation falls directly into the framework of variable projection [Ruhe and Wedin, 1980, Golub and Pereyra, 2003], introduced initially for solving nonlinear least squares problems. Properties and advantages of variable projection have been studied in Ruhe and Wedin [1980]; we refer also to Hong et al. [2017], Zach and Bourmaud [2018] for more recent studies. Nonsmooth variable projection is studied in van Leeuwen and Aravkin [2016], although the present work is in the classical setting of variable projection thanks to our smooth reparametrization. Reduced gradients have also been associated with the quadratic variational formulation in several works [Bach et al., 2011, Pong et al., 2010, Rakotomamonjy et al., 2008]. The idea is to apply descent methods to $g(\eta) = \min_\beta R_0(\eta, \beta) + \frac{1}{2}\|X\beta - y\|_2^2$, where $R_0(\eta, \cdot)$ denotes the quadratic upper bound on R appearing in (1). Although the joint function of η and β is discontinuous, the function g of η is smooth, and one can apply first order methods, such as proximal gradient descent, to minimise g under positivity constraints. While quasi-Newton methods can be applied in this setting with bound constraints, we show in Section 4.1 that this approach is typically less effective than our nonconvex bilevel approach. In the setting of the trace norm, the optimisation problem is constrained to the set of positive semidefinite matrices, so one is restricted to using first order methods [Pong et al., 2010].

2 Quadratic variational formulations

We describe in this section some general results about when a regulariser has a quadratic variational form. Our first result brings together results which are scattered in the literature: it is closely related to Theorem 1 in Geman and Reynolds [1992], but their proof was only for strictly concave differentiable functions and did not explicitly connect to convex conjugates, while the setting of norms has been characterized in the monograph Bach et al. [2011] under the name of subquadratic norms.

Theorem 1. Let $R : \mathbb{R}^n \to \mathbb{R}$. The following are equivalent:
(i) $R(\beta) = \varphi(\beta \odot \beta)$ where φ is proper, concave and upper semi-continuous, with domain $\mathbb{R}^n_+$.
(ii) There exists a convex function ψ for which $R(\beta) = \inf_{z \in \mathbb{R}^n_+} \frac{1}{2} \sum_{i=1}^n z_i \beta_i^2 + \psi(z)$.
Furthermore, $\psi(z) = (-\varphi)^*(-z/2)$ is defined via the convex conjugate $(-\varphi)^*$ of $-\varphi$, leading to (1) using the change of variable $\eta = 1/z$ and $h(\eta) = 2\psi(1/\eta)$. When R is a norm, the function h can be written in terms of the dual norm $R^*$ as $h(\eta) = \max_{R^*(w) \leq 1} \sum_i w_i^2 \eta_i$. Moreover, $R(\beta)^2 = \inf_{\eta \in \mathbb{R}^n_+} \{ \sum_i \beta_i^2 \eta_i^{-1} : h(\eta) \leq 1 \}$.

See Appendix B for the proof of this theorem. Some additional properties of ψ are derived in Lemma 1 of Appendix B; one property is that if R is coercive, then $\lim_{\|z\| \to 0} \psi(z) = +\infty$, so the function f is coercive (see also Remark 2).
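As a worked instance of Theorem 1, the following sketch of the computation for the ℓ¹ norm recovers the value h(η) = Σ_i η_i quoted in the introduction for the (group-)Lasso.

```latex
% l1 norm written as a concave function of beta ⊙ beta, as in Theorem 1(i)
R(\beta)=\|\beta\|_1=\varphi(\beta\odot\beta),\qquad
\varphi(s)=\sum_i\sqrt{s_i}\quad\text{(proper, concave, u.s.c. on }\mathbb{R}^n_+\text{)}.
% conjugate computation giving psi in Theorem 1(ii); the sup is attained at sqrt(s_i)=1/z_i
\psi(z)=(-\varphi)^*(-z/2)
       =\sup_{s\in\mathbb{R}^n_+}\sum_i\Big(\sqrt{s_i}-\tfrac{1}{2}z_i s_i\Big)
       =\sum_i\frac{1}{2z_i}\qquad(z>0).
% the change of variable eta = 1/z then yields (1) with h(eta) = 2 psi(1/eta) = sum_i eta_i
R(\beta)=\inf_{\eta\in\mathbb{R}^n_+}\;\frac12\sum_i\frac{\beta_i^2}{\eta_i}+\frac12 h(\eta),
\qquad h(\eta)=\sum_i\eta_i.
```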
2.1 Examples

Let us first give some simple examples of both convex and non-convex norms:

Euclidean norms: for $R = \|\cdot\|_2$, making use of $R^*$ as stated in Theorem 1, one has $h = \|\cdot\|_\infty$.

Group norms: for G a partition of {1, ..., n}, the group norm is $R(\beta) = \sum_{g \in G} \|\beta_g\|_2$. Using the previous result for the Euclidean norm, one has $h(\eta) = \sum_{g \in G} \|(\eta_i)_{i \in g}\|_\infty$. This expression can be further simplified to obtain (1) with a reduced vector $\eta \in \mathbb{R}^{|G|}_+$ in place of $\mathbb{R}^n_+$, by noticing that the optimal η is constant within each group g.

ℓq (quasi) norms: for $R(\beta) = |\beta|^q$ where q ∈ (0, 2), one has $\varphi(u) = u^{q/2}$ and one verifies that $h(\eta) = C_q \eta^{q/(2-q)}$ where $C_q = (2-q)\, q^{q/(2-q)}$. Note that for q > 2/3, $v \mapsto h(v^2) = C_q v^\gamma$ with γ > 1, which is differentiable. Analysis and numerics for this nonconvex setting can be found in Appendix F.

Matrix regularizer. The extension of Theorem 1 to the case where β = B is a matrix can be found in the appendix. When $R(B) = \varphi(B B^\top)$ is a function on matrices, the analogous quadratic variational formulation is

$R(B) = \min_{Z \in S^n_+} \frac{1}{2} \mathrm{tr}(B^\top Z^{-1} B) + \frac{1}{2} h(Z)$,   (4)

where $S^n_+$ denotes the set of symmetric positive semidefinite matrices and $h(Z) = 2(-\varphi)^*(-Z^{-1}/2)$. Letting $U = Z^{-1/2} B$ and $V = Z^{1/2}$, we have B = V U and the equivalence

$\min_B R(B) + \frac{1}{2\lambda}\|A(B) - y\|_2^2$   (5)

$= \min_{V \in \mathbb{R}^{n \times n}} f(V) \triangleq \min_{U \in \mathbb{R}^{n \times r}} \frac{1}{2}\|U\|_F^2 + \frac{1}{2} h(V V^\top) + \frac{1}{2\lambda}\|A(V U) - y\|_2^2$,   (6)

where $A : \mathbb{R}^{n \times r} \to \mathbb{R}^m$ is a linear operator. Again, provided that $V \mapsto h(V V^\top)$ is differentiable, f is a differentiable function with $\nabla f(V) = V \nabla h(V V^\top) + \frac{1}{\lambda} A^*(A(V U) - y) U^\top$, where U is such that $\lambda U + V^\top A^*(A(V U) - y) = 0$. For the trace norm, $R(B) = \mathrm{tr}(\sqrt{B^\top B})$, we have $h(Z) = \mathrm{tr}(Z)$ and $\nabla h(Z) = \mathrm{Id}$. Note that, just as in the vectorial case, one could write the inner minimisation problem over the dual variable $\alpha \in \mathbb{R}^m$ and handle the case of λ = 0.

3 Theoretical analysis

Our first result shows the equivalence between (Pλ) and a smooth bilevel problem.

Theorem 2. Denote $L_y \triangleq L(\cdot, y)$ and let $L_y^*$ denote the convex conjugate of $L_y$. Assume that $L_y$ is a convex, proper, lower semicontinuous function and R takes the form (1). The problem (Pλ) is equivalent to

$\min_{v \in \mathbb{R}^k} f(v) \triangleq \min_{u \in \mathbb{R}^n} \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 + \frac{1}{\lambda} L_y(X(v \odot_G u))$   (7)

$= \max_{\alpha \in \mathbb{R}^m} \frac{1}{2} h(v \odot v) - \frac{1}{\lambda} L_y^*(-\lambda\alpha) - \frac{1}{2}\|v \odot_G (X^\top \alpha)\|_2^2$,   (8)

where $v \odot_G u \triangleq (u_g v_g)_{g \in G}$. The minimiser β of (Pλ) and the minimiser v of (7) are related by $\beta = v \odot_G u = (v \odot v) \odot_G (X^\top \alpha)$. Provided that $v \mapsto h(v \odot v)$ is differentiable, the function f is differentiable with gradient

$\nabla f(v) = v \odot \nabla h(v \odot v) - v \odot \big(\|X_g^\top \alpha\|_2^2\big)_{g \in G}$, where $\alpha \in \mathrm{argmax}_\alpha\, -\frac{1}{\lambda} L_y^*(-\lambda\alpha) - \frac{1}{2}\|v \odot_G (X^\top \alpha)\|_2^2$.   (9)

Note that ∇f is uniquely defined even if α is not unique. If λ > 0 and $L_y$ is differentiable, then we have the additional formula, with $u \in \mathrm{argmin}_u \frac{1}{2}\|u\|_2^2 + \frac{1}{\lambda} L_y(X(v \odot_G u))$,

$\nabla f(v) = v \odot \nabla h(v \odot v) + \frac{1}{\lambda} \big(\langle u_g, X_g^\top \nabla L_y(X(v \odot_G u)) \rangle\big)_{g \in G}$.   (10)

Proof. The equivalence between (Pλ) and (7) simply follows from the quadratic variational form of R and the change of variable $v_g = \sqrt{\eta_g}$, $u_g = \beta_g/\sqrt{\eta_g}$. The equivalence with (8) follows by convex duality on the inner minimisation problem, that is,

$f(v) = \min_u G_1(v, u) \triangleq \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 + \frac{1}{\lambda} L_y(X(v \odot_G u))$
$= \min_{u, z} \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 + \frac{1}{\lambda} L_y(z)$ where $z = X(v \odot_G u)$
$= \min_{u, z} \max_\alpha \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 + \frac{1}{\lambda} L_y(z) + \langle \alpha, z \rangle - \langle \alpha, X(v \odot_G u) \rangle$
$= \max_\alpha \min_u \frac{1}{2} h(v \odot v) + \frac{1}{2}\|u\|^2 - \langle \alpha, X(v \odot_G u) \rangle - \frac{1}{\lambda} L_y^*(-\lambda\alpha)$.

Using the optimality condition over u, we obtain $u = v \odot_G X^\top \alpha$ and hence

$f(v) = \max_\alpha G_2(v, \alpha) \triangleq \frac{1}{2} h(v \odot v) - \frac{1}{2}\|v \odot_G X^\top \alpha\|_2^2 - \frac{1}{\lambda} L_y^*(-\lambda\alpha)$.

By [Rockafellar and Wets, 2009, Theorem 10.58], if the set $S(v) \triangleq \{ \nabla_v G_2(v, \alpha) = v \odot \nabla h(v \odot v) - v \odot_G (\|X_g^\top \alpha\|_2^2)_{g \in G} : \alpha \in \mathrm{argmax}_\alpha G_2(v, \alpha) \}$ is single-valued, then f is differentiable with $\nabla f(v) = \nabla_v G_2(v, \alpha)$ for $\alpha \in \mathrm{argmax}_\alpha G_2(v, \alpha)$.
Observe that even if $\mathrm{argmax}_\alpha G_2(v, \alpha)$ is not single-valued, since $G_2$ is strongly concave with respect to $v \odot_G X^\top \alpha$, this quantity is the same for all maximisers, so S(v) is single-valued and hence f is a differentiable function. In the case where $L_y$ is differentiable, we can again apply [Rockafellar and Wets, 2009, Theorem 10.58] to obtain $\nabla f(v) = \nabla_v G_1(v, u)$ with $u = \mathrm{argmin}_u G_1(v, u)$ (noting that $G_1(v, \cdot)$ is strongly convex and has a unique minimiser), which is precisely the gradient formula (10).

For the Lasso and basis pursuit settings, the gradient of f can be computed in closed form:

Corollary 1. If R has a quadratic variational form and $L_y(z) = \frac{1}{2}\|z - y\|_2^2$, then $\nabla L_y(z) = z - y$, $L_y^*(\alpha) = \langle y, \alpha \rangle + \frac{1}{2}\|\alpha\|_2^2$, and the gradient of f can be written as in (9), and additionally as in (10) when λ > 0. Furthermore, $\alpha \in \mathbb{R}^m$ in (9) and $u \in \mathbb{R}^n$ in (10) solve

$(X \mathrm{diag}(\tilde v^2) X^\top + \lambda \mathrm{Id})\alpha = y$, and $(\mathrm{diag}(\tilde v) X^\top X \mathrm{diag}(\tilde v) + \lambda \mathrm{Id})u = \tilde v \odot (X^\top y)$,   (11)

where $\tilde v \in \mathbb{R}^n$ is defined by $\tilde v \odot u = (v_g u_g)_{g \in G}$ for all $u \in \mathbb{R}^n$ (i.e. $\tilde v$ repeats each $v_g$ over the coordinates of group g) and Id denotes the identity matrix.

This shows that our method caters for the case λ = 0 with the same algorithm, in a seamless manner. This is unlike most existing approaches, which work well for λ > 0 (and typically do not require matrix inversion) but fail when λ is small, whereas solvers dedicated to λ = 0 might require inverting a linear system; see Section 4.4 for an illustrative example.

3.1 Properties of the projected function f

In this section, we analyse the case of the group Lasso. The following theorem ensures that the projected function f has only strict saddle points or global minima. We say that v is a second order stationary point if $\nabla f(v) = 0$ and $\nabla^2 f(v) \succeq 0$. We say that v is a strict saddle point (often called "ridable") if it is a stationary point but not a second order stationary point. Outside the set of global minima, one can thus always find a descent direction. This can be exploited to derive convergence guarantees to second order stationary points for trust region methods [Pascanu et al., 2014] and gradient descent methods [Lee et al., 2017, Jin et al., 2017].

Theorem 3. In the case $h(z) = \sum_i z_i$ and $L(z, y) = \frac{1}{2}\|z - y\|_2^2$, the projected function f is infinitely continuously differentiable and, for $v \in \mathbb{R}^k$, $\nabla f(v) = v \odot \big(1 - (\|\xi_g\|_2^2)_{g \in G}\big)$ where $\xi_g = -\frac{1}{\lambda} X_g^\top (X(u \odot_G v) - y)$ and u solves the inner least squares problem for v. Let J denote the support of v. By rearranging the columns and rows, the Hessian of f can be written as the block diagonal matrix

$\nabla^2 f(v) = \begin{pmatrix} \mathrm{diag}(1 - \|\xi_g\|_2^2)_{g \in J} + 4 U^\top W U & 0 \\ 0 & \mathrm{diag}(1 - \|\xi_g\|_2^2)_{g \in J^c} \end{pmatrix}$   (12)

where $W \triangleq \mathrm{Id} - \lambda \big[ (v_g X_g^\top X_h v_h)_{g,h \in J} + \lambda \mathrm{Id}_J \big]^{-1}$ and U is the block diagonal matrix with blocks $(\xi_g)_{g \in J}$. Moreover, $\max_{g \in G} \|\xi_g\|_2 \leq C$ and $\|\nabla^2 f(v)\| \leq 1 + 3C^2$, where $C \triangleq \|y\|_2 \max_{g \in G} \|X_g\| / \lambda$. All stationary points of f are either global minima or strict saddles. At stationary points, the eigenvalues of the Hessian of f are at most 4 and at least $\min\big( 4(1 - \lambda/(\lambda + \hat\sigma)),\ \min_{g \in J^c} (1 - \|\xi_g\|^2) \big)$, where $\hat\sigma$ is the smallest eigenvalue of $(v_g X_g^\top X_h v_h)_{g,h \in J}$.

The proof can be found in Appendix C. We simply mention here that, by examining the first order condition of (Pλ), we see that β is a minimizer if and only if ξ satisfies $\xi_g = \beta_g/\|\beta_g\|_2$ for all $g \in \mathrm{Supp}(\beta)$ and $\|\xi_g\|_2 \leq 1$ for all $g \in G$. The first condition, on the support of β, is always satisfied at stationary points of the nonconvex function (2), and by examining (12), the second condition is also satisfied unless the stationary point is strict.
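Before turning to the remarks and experiments, here is a minimal sketch (not the authors' released implementation) of the resulting Lasso solver: L-BFGS, as advocated in Section 4, applied to f, with the value and gradient obtained from (9) and the linear system (11). The dense solve, the all-ones initialisation and the random test instance are illustrative choices.

```python
# Minimal "Noncvx-Pro"-style sketch for the Lasso: minimise the smooth bilevel function f
# with L-BFGS, using the closed-form value/gradient of Theorem 2 and Corollary 1.
import numpy as np
from scipy.optimize import minimize

def noncvx_pro_lasso(X, y, lam):
    m, n = X.shape

    def fg(v):
        K = (X * v ** 2) @ X.T + lam * np.eye(m)   # X diag(v^2) X^T + lam Id, see (11)
        alpha = np.linalg.solve(K, y)
        z = X.T @ alpha
        u = v * z                                  # inner minimiser: u = v ⊙ X^T alpha
        r = X @ (v * u) - y
        val = 0.5 * (v @ v + u @ u) + (r @ r) / (2 * lam)
        grad = v * (1.0 - z ** 2)                  # formula (9) with h(eta) = sum(eta)
        return val, grad

    res = minimize(fg, np.ones(n), jac=True, method="L-BFGS-B")
    v = res.x
    alpha = np.linalg.solve((X * v ** 2) @ X.T + lam * np.eye(m), y)
    return v ** 2 * (X.T @ alpha)                  # beta = v^2 ⊙ X^T alpha (Theorem 2)

# usage on a small random instance
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 200)), rng.standard_normal(50)
beta = noncvx_pro_lasso(X, y, lam=0.1 * np.max(np.abs(X.T @ y)))
print(np.sum(np.abs(beta) > 1e-8))                 # the recovered beta is sparse
```

For large or sparse designs, the dense `np.linalg.solve` call would be replaced by a sparse Cholesky factorisation or conjugate gradient, as done in the experiments below.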
Remark 1 (Example of a strict saddle point for our f). One can observe that v = 0 is always a stationary point, as the solution to the associated linear system yields u = 0 and hence ∇f(0) = 0. If $\lambda \geq \max_{g \in G}\|X_g^\top y\|_2$, then u = v = 0 corresponds to a global minimum; otherwise, there exists g such that $1 - \|\xi_g\|^2 < 0$ and v = 0 is a strict saddle point.

Remark 2. Since $f \in C^\infty$, it is Lipschitz smooth on any bounded domain. As mentioned, f is coercive when R is coercive, and hence its sublevel sets are bounded. So, for any descent algorithm, we can apply convergence results that require f to be Lipschitz smooth along the iterates.

Remark 3. The nondegeneracy condition that $\|\xi_g\|_2 < 1$ outside the support of v, together with the invertibility of $(X_g^\top X_h)_{g,h \in J}$, is often used to derive consistency results [Bach, 2008, Zhao and Yu, 2006]. By Proposition 1, we see that this condition guarantees that the Hessian of f is positive definite at the minimum, and hence, combining with the smoothness properties of f explained in the previous remark, BFGS is guaranteed to converge superlinearly for starting points sufficiently close to the optimum [Nocedal and Wright, 2006, Theorem 6.6].

4 Numerical experiments

In this section, we use L-BFGS [Byrd et al., 1995] to optimise our bilevel function f, and we denote the resulting algorithm "Noncvx-Pro". Throughout, the inner problem is solved exactly using either a full or a sparse Cholesky solver. One observation from our numerics below is that although Noncvx-Pro is not always the best performing, unlike other solvers, it is robust over a wide range of settings: for example, our solver is mostly unaffected by the choice of λ, while one can observe in Figures 1 and 2 that this has a large impact on the proximal based methods and coordinate descent. Moreover, Noncvx-Pro is simple to code and relies on existing robust numerical routines (Cholesky/conjugate gradient + BFGS) which naturally handle sparse/implicit operators, and we thus inherit their nice convergence properties. All numerics are conducted on a 2.4 GHz quad-core Intel Core i5 processor with 16GB RAM. The code to reproduce the results of this article is available online at https://github.com/gpeyre/2021-NonCvxPro.

Figure 1: Lasso comparisons with different regularisation parameters (panels λ = λmax/2, λmax/10, λmax/50; one panel is labelled with the dataset size (60000, 683)).

4.1 Lasso

We first consider the Lasso problem where $R(\beta) = \sum_{i=1}^n |\beta_i|$.

Datasets. We tested on 8 datasets from the Libsvm repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). These datasets are mean subtracted and normalised by m.

Solvers. We compare against the following 10 methods:
1. 0-mem SR1: a proximal quasi-Newton method [Becker et al., 2019].
2. FISTA w/ BB: FISTA with Barzilai-Borwein stepsize [Barzilai and Borwein, 1988] and restarts [O'Donoghue and Candes, 2015].
3. SPG/SpaRSA: spectral projected gradient [Wright et al., 2009].
4. CGIST: an active set method with conjugate gradient iterative shrinkage/thresholding [Goldstein and Setzer, 2010].
5. Interior point method: from Koh et al. [2007].
6. CELER: a coordinate descent method with support pruning [Massias et al., 2018].
7. Non-cvx-Alternating-min: alternating minimisation of u and v in G from (3) [Hoff, 2017].
8. Non-cvx-LBFGS: apply L-BFGS to minimise the function (u, v) ↦ G(u, v) in (3).
9. L-BFGS-B [Byrd et al., 1995]: apply L-BFGS-B under positivity constraints to $\min_{u, v \in \mathbb{R}^n_+} \sum_i u_i + \sum_i v_i + \frac{1}{2\lambda}\|X(u - v) - y\|_2^2$. This is the standard approach for applying L-BFGS to ℓ¹ minimisation and corresponds to splitting β into its positive and negative parts.
10. Quad-variational: based on our idea of Noncvx-Pro, another natural (and to our knowledge novel) approach is to apply L-BFGS-B to the bilevel formulation of (1) without the nonconvex reparametrization.
Indeed, by applying (1) and using convex duality, the Lasso can be solved by minimizing $g(\eta) \triangleq \frac{1}{2}\sum_i \eta_i + \max_{\alpha \in \mathbb{R}^m} \langle \alpha, y \rangle - \frac{\lambda}{2}\|\alpha\|_2^2 - \frac{1}{2}\sum_i \eta_i |\langle x_i, \alpha \rangle|^2$. The gradient of g is $\nabla g(\eta) = \frac{1}{2}(1 - |X^\top \alpha|^2)$, where $|\cdot|^2$ is taken pointwise and α maximises the inner problem, and we apply L-BFGS-B with positivity constraints to minimise g.

Experiments. The results are shown in Figure 1 (with further experiments in the appendix). We show comparisons at different regularisation strengths, with λ being the regularisation parameter found by 10-fold cross validation on the mean squared error, and $\lambda_{\max} = \|X^\top y\|_\infty$ being the smallest parameter value at which the Lasso solution is guaranteed to be trivial.

4.2 Group Lasso

The multi-task Lasso [Gramfort et al., 2012] is the problem (Pλ) where one minimises over $\beta \in \mathbb{R}^{n \times q}$, the observed data is $y \in \mathbb{R}^{m \times q}$, $R(\beta) = \sum_{j=1}^n \|\beta_j\|_2$ where $\beta_j \in \mathbb{R}^q$ denotes the jth row of the matrix β, and the loss function is $\frac{1}{2}\|y - X\beta\|_F^2$ in terms of the Frobenius norm.

Figure 2: Comparisons for multi-task Lasso on MEG/EEG data (λ = λmax/10, λmax/20, λmax/50, λmax/100).

Datasets. We benchmark our proposed scheme on joint MEG/EEG data from Ndiaye et al. [2015]. X denotes the forward operator with n = 22494 source locations, and is constructed from 301 MEG sensors and 59 EEG sensors, so that m = 360. The observations $y \in \mathbb{R}^{m \times q}$ represent time series measurements at the m sensors, with q = 181 timepoints each. The coefficients β correspond to the source locations, which are assumed to remain constant across time.

Solvers. We perform a comparison against FISTA with restarts and BB step, the spectral projected gradient method SPG/SpaRSA, and CELER. Since n is much larger than q and m, we use for Noncvx-Pro the saddle point formulation (8), where the maximisation is over $\alpha \in \mathbb{R}^{m \times q}$. Computation of α in ∇f(v) involves solving $(\mathrm{Id}_m + \frac{1}{\lambda} X \mathrm{diag}(v^2) X^\top)\alpha = y$, that is, q linear systems, each of size m; a sketch of this gradient computation is given below.

Experiments. Figure 2 displays the objective convergence plots against running time for different λ: λ = λmax/r with $\lambda_{\max} = \max_i \|X_i^\top y\|_2$, for r = 10, 20, 50, 100. We observe substantial performance gains over the benchmarked methods: in MEG/EEG problems, the design matrix tends to exhibit high correlation between columns, and proximal-based algorithms tend to perform poorly here. Coordinate descent with pruning is known to perform well here when the regularisation parameter is large [Massias et al., 2018], but its performance deteriorates as λ decreases.
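```python
# Illustrative sketch of the multi-task (group) Lasso gradient computation described above.
# It uses the normalisation of Corollary 1 (whose alpha differs from the dual variable in the
# text only by a factor lambda) and a dense solve in place of the Cholesky factorisation used
# in the experiments. Groups are the rows of beta in R^{n x q}.
import numpy as np

def f_and_grad_multitask(v, X, Y, lam):
    m, n = X.shape
    K = (X * v ** 2) @ X.T + lam * np.eye(m)      # X diag(v^2) X^T + lam Id
    A = np.linalg.solve(K, Y)                     # alpha in R^{m x q}: q systems of size m
    Z = X.T @ A                                   # X^T alpha in R^{n x q}
    U = v[:, None] * Z                            # inner minimiser U = diag(v) X^T alpha
    Res = X @ (v[:, None] * U) - Y
    val = 0.5 * (v @ v + np.sum(U ** 2)) + np.sum(Res ** 2) / (2 * lam)
    grad = v * (1.0 - np.sum(Z ** 2, axis=1))     # formula (9), groups = rows of beta
    return val, grad
```

Since the m × m system matrix is shared by the q right-hand sides, a single factorisation per gradient evaluation suffices, which is what makes this formulation cheap when n is much larger than m.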
4.3 Trace norm

In multi-task feature learning [Argyriou et al., 2008], for each task t = 1, ..., T, we aim to find $f_t : \mathbb{R}^n \to \mathbb{R}$ given training examples $(x_{t,i}, y_{t,i}) \in \mathbb{R}^n \times \mathbb{R}$ for $i = 1, \ldots, m_t$. One approach is to jointly solve these T regression problems by minimising (5), where R is the trace norm (also called the nuclear norm), r = T, $y = (y_t)_{t=1}^T$, and $A(B) = (X_t B_t)_{t=1}^T$, with $X_t$ being the matrix whose ith row is $x_{t,i}$ and $B_t$ being the tth column of the matrix B. Note that, letting $u_t \in \mathbb{R}^n$ denote the tth column of U, the computation of U in ∇f(V) involves solving T linear systems $(\lambda \mathrm{Id}_n + V X_t^\top X_t V) u_t = V X_t^\top y_t$. Here, the trace norm encourages the tasks to share a small number of linear features.

Datasets. We consider the three datasets commonly considered in previous works. The Schools dataset (https://home.ttic.edu/~argyriou/code/) from Argyriou et al. [2008] consists of the scores of 15362 students from 139 schools. There are therefore T = 139 tasks with $15362 = \sum_t m_t$ data points in total, and the goal is to map n = 27 student attributes to exam performance. The SARCOS dataset (http://www.gaussianprocess.org/gpml/data/) [Zhang and Yang, 2021, Zhang and Yeung, 2012] has 7 tasks, each corresponding to learning the dynamics of a SARCOS anthropomorphic robot arm. There are n = 21 features and m = 48,933 data points, which are shared across all tasks. The Parkinsons dataset (http://archive.ics.uci.edu/ml/datasets/Parkinsons+Telemonitoring) [Tsanas et al., 2009] is made up of m = 5875 datapoints from T = 42 patients; the goal is to map n = 19 biomarkers to Parkinson's disease symptom scores for each patient.

Solvers. Figure 3 reports a comparison against FISTA with restarts and IRLS. The IRLS algorithm for (5) was introduced in Argyriou et al. [2008] (see also Bach et al. [2011]) and applies alternating minimisation after adding the regularisation term $\varepsilon \lambda\, \mathrm{tr}(Z^{-1})/2$ to (4). The update of B is a simple least squares problem, while the update for Z is $Z \leftarrow (B B^\top + \varepsilon \mathrm{Id})^{1/2}$. Our nonconvex approach has the same per-iteration complexity as IRLS, but one advantage is that we deal directly with (5), without any ε regularisation.

Figure 3: Comparisons for multi-task feature learning (Schools, Parkinsons, SARCOS); IRLS-d corresponds to IRLS with ε = 10^{-d}.

Remark 4. Quad-variational, mentioned in Section 4.1, does not extend to the trace norm case, since the function g would be minimised over $S^n_+$, for which the application of L-BFGS-B is unclear. For this reason, bilevel formulations for the trace norm [Pong et al., 2010] have been restricted to the use of first order methods.

Experiments. For each dataset, we compute the regularisation parameter λ by 10-fold cross-validation on the RMSE, averaged across 10 random splits. Then, with this regularisation parameter, we compare Noncvx-Pro, FISTA and IRLS with different choices of ε. The convergence plots are shown in Figure 3. We observe substantial computational gains for the Schools and Parkinsons datasets. For the SARCOS dataset, IRLS performed best and Noncvx-Pro is comparatively less effective here, although we remark that the number of tasks is much smaller (T = 7) and the recorded times are much shorter (less than 0.2s). Further numerical illustrations with synthetic data are shown in the appendix; our method is typically less sensitive to problem variations.

4.4 Constrained Group Lasso and Optimal Transport

A salient feature of our method is that it can handle arbitrarily small regularization parameters λ and can even cope with the constrained formulation λ = 0, which cannot be tackled by most state-of-the-art Lasso solvers. To illustrate this, we consider the computation of an optimal transport (OT) map, which has recently gained a lot of attention in ML [Peyré et al., 2019]. We focus on the Monge problem, where the ground cost is the geodesic distance on either a graph or a surface (which extends the original Monge problem, where the cost is the Euclidean distance). This type of OT problem has been used for instance to analyze and process brain signals in M/EEG [Gramfort et al., 2015] and for applications in computer graphics [Solomon et al., 2014], and is now being applied to genomic datasets [Schiebinger et al., 2019].
As explained for instance in [Santambrogio, 2015, Sec. 4.2], the optimal transport between two probability measures a and b on a surface can be computed by advecting the mass along a vector field $v(x) \in \mathbb{R}^3$ (tangent to the surface) with minimum vectorial L¹ norm $\int \|v(x)\|\, \mathrm{d}x$ (where dx is the surface area measure), subject to the conservation of mass div(v) = a − b. Once discretized on a 3-D mesh, this boils down to solving a constrained group Lasso problem (P0), where $\beta_g \in \mathbb{R}^3$ is a discretization of $v(z_g)$ at some vertex $z_g$ of the mesh, X is a finite element discretization of the divergence, and y = a − b. The same applies on a graph, in which case the vector field is aligned with the edges of the graph and the divergence is the discrete divergence associated with the graph adjacency matrix; see Peyré et al. [2019]. This formulation is often called the Beckmann problem.

Datasets. We consider two experiments: (i) following Gramfort et al. [2015], a 3-D mesh of the brain with n = 20000 vertices and localized Gaussian-like distributions a and b (in blue and red); (ii) a 5-nearest-neighbors graph in a 7-dimensional space of gene expression (corresponding to 7 different time steps) of baker's yeast, which is the dataset from DeRisi et al. [1997]. The n = 614 nodes correspond to the most active genes (maximum variance across time), and this results in a graph with 1927 edges. The distributions are synthetic data, where a is a localized source whereas b is more delocalized.

Figure 4: Resolution of the Beckmann problem on a graph (left) and a 3-D mesh (right). The probability distributions a and b are displayed in blue and red and the optimal flow β is represented as green segments (on the edges of the graph and on the faces of the mesh). The convergence curves display the decay of log₁₀(‖β_t − β*‖₁) during the iterations of the algorithms (DR = Douglas-Rachford, PD = Primal-Dual) as a function of time t in seconds.

Solvers. We test our method against two popular first order schemes: the Douglas-Rachford (DR) algorithm [Lions and Mercier, 1979, Combettes and Pesquet, 2007] and the primal-dual (PD) scheme of Chambolle and Pock [2011]. DR is used in its dual formulation (the alternating direction method of multipliers, ADMM), on a re-parameterized problem, in Solomon et al. [2014]. The details of these algorithms are given in Appendix E.

Experiments. Figure 4 shows the solution of this Beckmann problem in these two settings. While DR has the same complexity per iteration as the computation of the gradient of f (the resolution of a linear system), a chief advantage of PD is that it only involves the application of X and Xᵀ at each iteration. Both DR and PD have step size parameters (denoted µ and σ), which have been tuned manually (the rightmost figure shows two other sub-optimal choices of parameters). In contrast, our algorithm has no parameter to tune and is faster than both DR and PD on these two problems.
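For completeness, here is a minimal sketch of the λ = 0 oracle used by our solver in this constrained setting (illustrative code only: the explicit group structure and the dense least-squares solve replace the sparse Cholesky factorisation used in the experiments).

```python
# Value and gradient of f in the constrained case lambda = 0 (Beckmann-type problem
# min R(beta) s.t. X beta = y, with R a group Lasso): from (8)/(11) with lambda = 0,
# alpha solves X diag(v~^2) X^T alpha = y and grad f(v)_g = v_g (1 - ||X_g^T alpha||^2).
import numpy as np

def f_and_grad_constrained(v, X, y, groups):
    # expand the per-group weights v to one weight per column of X (the vector v~ of Corollary 1)
    w = np.concatenate([np.full(len(g), v[i]) for i, g in enumerate(groups)])
    K = (X * w ** 2) @ X.T                         # X diag(v~^2) X^T  (lambda = 0 in (11))
    alpha = np.linalg.lstsq(K, y, rcond=None)[0]   # lstsq copes with a rank-deficient K (e.g. a divergence)
    z = X.T @ alpha
    # plugging the minimal-norm u = diag(v~) X^T alpha into (7) gives f(v) = (1/2)(|v|^2 + <alpha, y>)
    val = 0.5 * (v @ v + alpha @ y)
    grad = np.array([v[i] * (1.0 - np.sum(z[g] ** 2)) for i, g in enumerate(groups)])
    return val, grad
```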
5 Conclusion

Most existing approaches to sparse regularisation involve careful smoothing of the nonsmooth term, either through proximal operators or through explicit regularisation as in IRLS. We propose a different direction: a simple reparameterization leads to a smooth optimisation problem and allows for the use of standard numerical tools, such as BFGS. Our numerical results demonstrate that this approach is versatile, effective and can handle a wide range of problems.

We end by making some remarks on possible future research directions. The application of our method to other loss functions requires the use of an inexact solver for the inner problems, and controlling the impact of this approximation is an interesting avenue for future work. Furthermore, it is possible that one can obtain further acceleration by combining our approach with screening rules or active set techniques.

Acknowledgments

The work of G. Peyré was supported by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and by the European Research Council (ERC project NORIA).

References

A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243-272, 2008.
F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. arXiv preprint arXiv:1108.0775, 2011.
F. R. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9(6), 2008.
J. Barzilai and J. M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8(1):141-148, 1988.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
S. Becker, J. Fadili, and P. Ochs. On quasi-Newton forward-backward splitting: Proximal calculus and convergence. SIAM Journal on Optimization, 29(4):2445-2481, 2019.
D. P. Bertsekas. Nonlinear programming. Journal of the Operational Research Society, 48(3):334-334, 1997.
M. J. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1):57-91, 1996.
S. Boyd, N. Parikh, and E. Chu. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011.
R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190-1208, 1995.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
P. L. Combettes and J.-C. Pesquet. A Douglas-Rachford splitting approach to nonsmooth convex variational signal recovery. IEEE Journal of Selected Topics in Signal Processing, 1(4):564-574, 2007.
P. L. Combettes and B. C. Vũ. Variable metric forward-backward splitting with applications to monotone inclusions in duality. Optimization, 63(9):1289-1318, 2014.
I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413-1457, 2004.
I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively reweighted least squares minimization for sparse recovery. Communications on Pure and Applied Mathematics, 63(1):1-38, 2010.
J. L. DeRisi, V. R. Iyer, and P. O. Brown. Exploring the metabolic and genetic control of gene expression on a genomic scale. Science, 278(5338):680-686, 1997.
J. Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421-439, 1956.
J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1, 2010.
D. Geiger and A. Yuille. A common framework for image segmentation. International Journal of Computer Vision, 6(3):227-243, 1991.
D. Geman and G. Reynolds. Constrained restoration and the recovery of discontinuities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(3):367-383, 1992.
L. E. Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination for the lasso and sparse supervised learning problems. arXiv preprint arXiv:1009.4219, 2010.
T. Goldstein and S. Setzer. High-order methods for basis pursuit. UCLA CAM Report 10-41, 2010.
G. Golub and V. Pereyra. Separable nonlinear least squares: the variable projection method and its applications. Inverse Problems, 19(2):R1, 2003.
A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Physics in Medicine & Biology, 57(7):1937, 2012.
A. Gramfort, G. Peyré, and M. Cuturi. Fast optimal transport averaging of neuroimaging data. In International Conference on Information Processing in Medical Imaging, pages 261-272. Springer, 2015.
T. Hastie, R. Mazumder, J. D. Lee, and R. Zadeh. Matrix completion and low-rank SVD via fast alternating least squares. The Journal of Machine Learning Research, 16(1):3367-3402, 2015.
P. D. Hoff. Lasso, fractional norm and structured sparse estimation using a Hadamard product parametrization. Computational Statistics & Data Analysis, 115:186-198, 2017.
J. H. Hong, C. Zach, and A. Fitzgibbon. Revisiting the variable projection method for separable nonlinear least squares problems. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5939-5947. IEEE, 2017.
C. Jin, R. Ge, P. Netrapalli, S. M. Kakade, and M. I. Jordan. How to escape saddle points efficiently. In International Conference on Machine Learning, pages 1724-1732. PMLR, 2017.
K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale l1-regularized logistic regression. Journal of Machine Learning Research, 8(Jul):1519-1555, 2007.
J. D. Lee, I. Panageas, G. Piliouras, M. Simchowitz, M. I. Jordan, and B. Recht. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964-979, 1979.
M. Mardani and G. B. Giannakis. Estimating traffic and anomaly maps via network tomography. IEEE/ACM Transactions on Networking, 24(3):1533-1547, 2015.
M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3315-3324. PMLR, 2018.
C. A. Micchelli, J. M. Morales, and M. Pontil. Regularizers for structured sparsity. Advances in Computational Mathematics, 38(3):455-489, 2013.
E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparse multi-task and multi-class models. arXiv preprint arXiv:1506.03736, 2015.
E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparsity enforcing penalties. The Journal of Machine Learning Research, 18(1):4671-4703, 2017.
J. Nocedal and S. Wright. Numerical Optimization. Springer Science & Business Media, 2006.
B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715-732, 2015.
R. Pascanu, Y. N. Dauphin, S. Ganguli, and Y. Bengio. On the saddle point problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
G. Peyré, M. Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355-607, 2019.
T. K. Pong, P. Tseng, S. Ji, and J. Ye. Trace norm regularization: Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization, 20(6):3465-3489, 2010.
A. Rakotomamonjy, F. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. Journal of Machine Learning Research, 9:2491-2521, 2008.
J. D. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the 22nd International Conference on Machine Learning, pages 713-719, 2005.
R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
A. Ruhe and P. Å. Wedin. Algorithms for separable nonlinear least squares problems. SIAM Review, 22(3):318-337, 1980.
F. Santambrogio. Optimal Transport for Applied Mathematicians. Birkhäuser, NY, 2015.
G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, J. Gould, S. Liu, S. Lin, P. Berube, et al. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell, 176(4):928-943, 2019.
J. Solomon, R. Rustamov, L. Guibas, and A. Butscher. Earth mover's distances on discrete surfaces. ACM Transactions on Graphics (TOG), 33(4):1-12, 2014.
A. Tsanas, M. Little, P. McSharry, and L. Ramig. Accurate telemonitoring of Parkinson's disease progression by non-invasive speech tests. Nature Precedings, 2009.
T. van Leeuwen and A. Aravkin. Non-smooth variable projection. arXiv preprint arXiv:1601.05011, 2016.
S. J. Wright, R. D. Nowak, and M. A. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7):2479-2493, 2009.
C. Zach and G. Bourmaud. Descending, lifting or smoothing: Secrets of robust cost optimization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 547-562, 2018.
Y. Zhang and Q. Yang. A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering, 2021.
Y. Zhang and D.-Y. Yeung. A convex formulation for learning task relationships in multi-task learning. arXiv preprint arXiv:1203.3536, 2012.
P. Zhao and B. Yu. On model selection consistency of Lasso. The Journal of Machine Learning Research, 7:2541-2563, 2006.