# Dual Extrapolation for Sparse Generalized Linear Models

Journal of Machine Learning Research 21 (2020) 1-33. Submitted 7/19; Revised 8/20; Published 10/20.

Mathurin Massias (mathurin.massias@inria.fr), Université Paris-Saclay, Inria, CEA, Palaiseau, France
Samuel Vaiter (samuel.vaiter@u-bourgogne.fr), CNRS & Institut de Mathématiques de Bourgogne, 21078, Dijon, France
Alexandre Gramfort (alexandre.gramfort@inria.fr), Université Paris-Saclay, Inria, CEA, Palaiseau, France
Joseph Salmon (joseph.salmon@umontpellier.fr), IMAG, Univ Montpellier, CNRS, 34095, Montpellier, France

Editor: David Wipf

© 2020 Mathurin Massias, Samuel Vaiter, Alexandre Gramfort and Joseph Salmon. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v21/19-587.html.

Abstract

Generalized Linear Models (GLM) form a wide class of regression and classification models, where prediction is a function of a linear combination of the input variables. For statistical inference in high dimension, sparsity inducing regularizations have proven to be useful while offering statistical guarantees. However, solving the resulting optimization problems can be challenging: even for popular iterative algorithms such as coordinate descent, one needs to loop over a large number of variables. To mitigate this, techniques known as screening rules and working sets diminish the size of the optimization problem at hand, either by progressively removing variables, or by solving a growing sequence of smaller problems. For both techniques, significant variables are identified thanks to convex duality arguments. In this paper, we show that the dual iterates of a GLM exhibit a Vector Auto Regressive (VAR) behavior after sign identification, when the primal problem is solved with proximal gradient descent or cyclic coordinate descent. Exploiting this regularity, one can construct dual points that offer tighter certificates of optimality, enhancing the performance of screening rules and working set algorithms.

Keywords: Convex optimization, extrapolation, screening rules, working sets, Lasso, sparse logistic regression, generalized linear models

1. Introduction

Since the introduction of the Lasso (Tibshirani, 1996), similar to the Basis Pursuit denoising (Chen and Donoho, 1995) in signal processing, sparsity inducing penalties have had a tremendous impact on machine learning (Bach et al., 2012). They have been applied to a variety of statistical estimators, both for regression and classification tasks: sparse logistic regression (Koh et al., 2007), Group Lasso (Yuan and Lin, 2006), Sparse Group Lasso (Simon et al., 2013), multitask Lasso (Obozinski et al., 2010), Square-Root Lasso (Belloni et al., 2011). All of these estimators fall under the framework of Generalized Linear Models (McCullagh and Nelder, 1989), where the output is assumed to follow an exponential family distribution whose mean is a linear combination of the input variables. The key property of ℓ1 regularization is that it allows to perform feature selection and prediction jointly, which is particularly useful in high dimensional settings. Indeed, it can drastically reduce the number of variables needed for prediction, thus improving model interpretability and computation time for prediction.
Amongst the algorithms proposed to solve these problems, coordinate descent¹ (Tseng, 2001; Friedman et al., 2007) is the most popular in machine learning scenarios (Fan et al., 2008; Friedman et al., 2010; Richtárik and Takáč, 2014; Fercoq and Richtárik, 2015; Perekrestenko et al., 2017; Karimireddy et al., 2018). It consists in updating the vector of parameters one coefficient at a time, looping over all the predictors until convergence.

Since only a fraction of the coefficients are non-zero in the optimal parameter vector, a recurring idea to speed up solvers is to limit the size of the optimization problem by ignoring features which are not included in the solution. To do so, two approaches can be distinguished: screening rules, introduced by El Ghaoui et al. (2012) and later developed by Ogawa et al. (2013); Wang et al. (2012); Xiang et al. (2016); Bonnefoy et al. (2014); Fercoq et al. (2015); Ndiaye et al. (2016, 2017), progressively remove features from the problem in a backward approach; working set techniques (Fan and Lv, 2008; Roth and Fischer, 2008; Kowalski et al., 2011; Tibshirani et al., 2012; Johnson and Guestrin, 2015) solve a sequence of smaller problems restricted to a growing number of features. One common idea between the current state-of-the-art methods for screening (Gap Safe rules, Fercoq et al. 2015; Ndiaye et al. 2017) and working sets (Blitz, Johnson and Guestrin 2015, 2018) is to rely heavily on a dual point to identify useful features. The quality of this dual point is critical here, as it has a direct impact on performance. However, although a lot of attention has been devoted to creating a sequence of primal iterates that converges fast to the optimum (Fercoq and Richtárik, 2015), the construction of dual iterates has not been scrutinized, and the standard approach to obtain dual iterates from primal ones (Mairal, 2010), although converging, is crude.

In this paper, we propose a principled way to construct a sequence of dual points that converges faster than the standard approach proposed by Mairal (2010). Based on an extrapolation procedure inspired by Scieur et al. (2016), it comes with no significant extra computational cost, while retaining the convergence guarantees of the standard approach. This construction was first introduced for non-smooth optimization by Massias et al. (2018) for the Lasso case only, while we generalize it here to any Generalized Linear Model (GLM). More precisely, we properly define, quantify and prove the asymptotic Vector Auto Regressive (VAR) behavior of dual iterates for sparse GLMs solved with proximal gradient descent or cyclic coordinate descent. The resulting new construction provides a tighter control of optimality through duality gap evaluation, improves the performance of Gap Safe rules, and is easy to implement and combine with other solvers.

The article proceeds as follows. We introduce the framework of ℓ1-regularized GLMs and duality in Section 2. As a seminal example, we present our results on dual iterates regularity and dual extrapolation for the Lasso in Section 3. We generalize them to a variety of problems in Sections 4 and 5. Results of Section 6 demonstrate a systematic improvement in computing time when dual extrapolation is used together with Gap Safe rules or working set policies.

1. Throughout the paper, this means cyclic and proximal coordinate descent.

Notation. For any integer $d \in \mathbb{N}$, we denote by $[d]$ the set $\{1, \ldots, d\}$.
The design matrix $X \in \mathbb{R}^{n \times p}$ is composed of observations $x_i \in \mathbb{R}^p$ stored row-wise, and its $j$-th column is $x_j \in \mathbb{R}^n$; the vector $y \in \mathbb{R}^n$ (resp. $\{-1, 1\}^n$) is the response vector for regression (resp. binary classification). The support of $\beta \in \mathbb{R}^p$ is $\mathcal{S}(\beta) = \{j \in [p] : \beta_j \neq 0\}$, of cardinality $\|\beta\|_0$. For $\mathcal{W} \subset [p]$, $\beta_{\mathcal{W}}$ and $X_{\mathcal{W}}$ are $\beta$ and $X$ restricted to features in $\mathcal{W}$. As much as possible, exponents between parentheses (e.g., $\beta^{(t)}$) denote iterates and subscripts (e.g., $\beta_j$) denote vector entries or matrix columns. The sign function is $\mathrm{sign} : x \mapsto x/|x|$ with the convention $0/0 = 0$. The sigmoid function is $\sigma : x \mapsto 1/(1 + e^{-x})$. The soft-thresholding of $x$ at level $\nu$ is $\mathrm{ST}(x, \nu) = \mathrm{sign}(x)\max(0, |x| - \nu)$. Applied to vectors, $\mathrm{sign}$, $\sigma$ and $\mathrm{ST}(\cdot, \nu)$ (for $\nu \in \mathbb{R}_+$) act element-wise. The element-wise product between vectors of the same length is denoted by $\odot$. The vector of size $n$ whose entries are all equal to 0 (resp. 1) is denoted by $0_n$ (resp. $\mathbf{1}_n$). On square matrices, $\|\cdot\|_2$ is the spectral norm (and the standard Euclidean norm for vectors reads $\|\cdot\|$); $\|\cdot\|_1$ is the ℓ1-norm. For a symmetric positive definite matrix $H$, $\langle x, y \rangle_H = x^\top H y$ is the $H$-weighted inner product, whose associated norm is denoted $\|\cdot\|_H$. We extend the small-o notation to vector valued functions in the following way: for $f : \mathbb{R}^n \to \mathbb{R}^n$ and $g : \mathbb{R}^n \to \mathbb{R}^n$, $f = o(g)$ if and only if $\|f\| = o(\|g\|)$, i.e., $\|f\|/\|g\|$ tends to 0 when $\|g\|$ tends to 0. For a convex and proper function $f : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$, its Fenchel-Legendre conjugate $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{\infty\}$ is defined by $f^*(u) = \sup_{x \in \mathbb{R}^n} u^\top x - f(x)$, and its subdifferential at $x \in \mathbb{R}^n$ is $\partial f(x) = \{u \in \mathbb{R}^n : \forall y \in \mathbb{R}^n, f(y) \geq f(x) + u^\top (y - x)\}$.

2. GLMs, Vector Auto Regressive sequences and sign identification

We first introduce the class of optimization problems we consider.

Definition 1 (Sparse Generalized Linear Model). We call Sparse Generalized Linear Model the following optimization problem:
$$\hat\beta \in \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \ \underbrace{\sum_{i=1}^{n} f_i(\beta^\top x_i) + \lambda \|\beta\|_1}_{\mathcal{P}(\beta)} \,, \tag{1}$$
where all $f_i$ are convex², differentiable functions with $1/\gamma$-Lipschitz gradients. The parameter $\lambda$ is a non-negative scalar, controlling the trade-off between data fidelity and regularization.

Two popular instances of Problem (1) are the Lasso ($f_i(t) = \tfrac{1}{2}(y_i - t)^2$, $\gamma = 1$) and Sparse Logistic regression ($f_i(t) = \log(1 + \exp(-y_i t))$, $\gamma = 4$); our naming is an abuse of language since for some choices of the $f_i$'s, e.g., the Huber loss, there is no underlying statistical GLM. A more complex regularizer could be used in Problem (1), to handle group penalties for example. For the sake of clarity we rather remain specific, and generalize to other penalties when needed in Section 4.2.

2. By that we mean closed, convex and proper, following the framework of Bauschke and Combettes (2011).

Proposition 2 (Duality for sparse GLMs). A dual formulation of Problem (1) reads:
$$\hat\theta = \operatorname*{arg\,max}_{\theta \in \Delta_X} \ \underbrace{-\sum_{i=1}^{n} f_i^*(-\lambda \theta_i)}_{\mathcal{D}(\theta)} \,, \tag{2}$$
where $\Delta_X = \{\theta \in \mathbb{R}^n : \|X^\top \theta\|_\infty \leq 1\}$. The dual solution $\hat\theta$ is unique because the $f_i^*$'s are $\gamma$-strongly convex (Hiriart-Urruty and Lemaréchal, 1993, Thm 4.2.1), and the KKT conditions read:
$$\forall i \in [n], \quad \hat\theta_i = -f_i'(\hat\beta^\top x_i)/\lambda \qquad \text{(link equation)} \tag{3}$$
$$\forall j \in [p], \quad x_j^\top \hat\theta \in \partial |\cdot|(\hat\beta_j) \qquad \text{(subdifferential inclusion)} \tag{4}$$
If for $u \in \mathbb{R}^n$ we write $F(u) \overset{\text{def.}}{=} \sum_{i=1}^{n} f_i(u_i)$, the link equation reads $\hat\theta = -\nabla F(X\hat\beta)/\lambda$. For any $(\beta, \theta) \in \mathbb{R}^p \times \Delta_X$, one has $\mathcal{D}(\theta) \leq \mathcal{P}(\beta)$, and $\mathcal{D}(\hat\theta) = \mathcal{P}(\hat\beta)$. The duality gap $\mathcal{P}(\beta) - \mathcal{D}(\theta)$ can thus be used as an upper bound for the sub-optimality of a primal vector $\beta$: for any $\epsilon > 0$, any $\beta \in \mathbb{R}^p$, and any feasible $\theta \in \Delta_X$:
$$\mathcal{P}(\beta) - \mathcal{D}(\theta) \leq \epsilon \implies \mathcal{P}(\beta) - \mathcal{P}(\hat\beta) \leq \epsilon \,. \tag{5}$$
These results hold because Slater's condition is met: Problem (1) is unconstrained and the objective function has domain $\mathbb{R}^p$ (Boyd and Vandenberghe, 2004, §5.2.3), therefore strong duality holds.

Remark 3. Equation (5) shows that even though $\hat\beta$ is unknown in practice and the suboptimality gap cannot be evaluated, creating a dual feasible point $\theta \in \Delta_X$ allows to compute an upper bound which can be used as a tractable stopping criterion.
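To make Remark 3 concrete, here is a minimal sketch (ours, not part of the original paper) of the duality gap stopping criterion for the Lasso case ($f_i(t) = \tfrac{1}{2}(y_i - t)^2$, $\gamma = 1$), using a rescaled version of the residuals to produce a feasible dual point (this is Equation (7) below); the function name is ours.

```python
import numpy as np

def lasso_gap(X, y, beta, lam):
    """Duality gap of Remark 3 for the Lasso, i.e., f_i(t) = 0.5 * (y_i - t) ** 2.

    A feasible dual point is obtained by rescaling the residuals so that
    ||X.T @ theta||_inf <= 1 (residual rescaling, Equation (7) below).
    """
    r = y - X @ beta  # residuals, equal to -grad F(X beta) for the quadratic loss
    theta = r / max(lam, np.linalg.norm(X.T @ r, ord=np.inf))  # feasible dual point
    p_obj = 0.5 * r @ r + lam * np.abs(beta).sum()              # primal objective P(beta)
    d_obj = 0.5 * y @ y - 0.5 * lam ** 2 * np.linalg.norm(y / lam - theta) ** 2  # D(theta)
    return p_obj - d_obj  # upper bounds the suboptimality P(beta) - P(beta_hat)
```

A solver can then stop as soon as this quantity falls below a prescribed ϵ, as done in Algorithm 1 below.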
In high dimension, solvers such as proximal gradient descent (PG) and coordinate descent (CD) are slowed down due to the large number of features. However, by design of the ℓ1 penalty, $\hat\beta$ is known to be sparse, especially for large values of λ. Thus, a key idea to speed up these solvers is to identify the support of $\hat\beta$ so that features outside of it can be safely ignored: this leads to a smaller problem that is faster to solve. Removing features when it is guaranteed that they are not in the support of the solution is at the heart of the so-called Gap Safe screening rules (Fercoq et al., 2015; Ndiaye et al., 2017).

Proposition 4 (Gap Safe screening rule, Ndiaye et al. 2017, Thm. 6). The Gap Safe screening rule for Problem (1) reads:
$$\forall j \in [p],\ \forall \beta \in \mathbb{R}^p,\ \forall \theta \in \Delta_X, \quad \frac{1 - |x_j^\top \theta|}{\|x_j\|} > \sqrt{\frac{2}{\gamma \lambda^2}\big(\mathcal{P}(\beta) - \mathcal{D}(\theta)\big)} \implies \hat\beta_j = 0 \,. \tag{6}$$

Therefore, while running an iterative solver, the criterion (6) can be tested periodically for all features $j$, and the features guaranteed to be inactive at optimum can be ignored.³ Equations (5) and (6) do not require a specific choice of θ, provided it lies in $\Delta_X$. This choice is up to the user, and so far it has not attracted much attention in the literature. Thanks to the link equation $\hat\theta = -\nabla F(X\hat\beta)/\lambda$, a natural way to construct a dual feasible point $\theta^{(t)} \in \Delta_X$ at iteration $t$, when only a primal vector $\beta^{(t)}$ is available, is:
$$\theta^{(t)}_{\mathrm{res}} \overset{\text{def.}}{=} -\nabla F(X\beta^{(t)}) \big/ \max\big(\lambda, \|X^\top \nabla F(X\beta^{(t)})\|_\infty\big) \,. \tag{7}$$
This was coined residuals rescaling (Mairal, 2010), following the terminology used for the Lasso case where $-\nabla F(X\beta)$ is equal to the residuals, $y - X\beta$.

To improve the control of sub-optimality and the identification of useful features, the aim of our proposed dual extrapolation is to obtain a better dual point (i.e., closer to the optimum $\hat\theta$). The idea is to do so at a low computational cost by exploiting the structure of the sequence of dual iterates $(X\beta^{(t)})_{t \in \mathbb{N}}$; we explain what this structure is, and how to exploit it, in the following.

Definition 5 (Vector Auto Regressive sequence). We say that $(r^{(t)})_{t \in \mathbb{N}} \in (\mathbb{R}^n)^{\mathbb{N}}$ is a Vector Auto Regressive (VAR) sequence (of order 1) if there exist $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$ such that for $t \in \mathbb{N}$:
$$r^{(t+1)} = A r^{(t)} + b \,. \tag{8}$$
We also say that the sequence $(r^{(t)})_{t \in \mathbb{N}}$, converging to $\hat r$, is an asymptotic VAR sequence if there exist $A \in \mathbb{R}^{n \times n}$ and $b \in \mathbb{R}^n$ such that for $t \in \mathbb{N}$:
$$r^{(t+1)} - A r^{(t)} - b = o(r^{(t)} - \hat r) \,. \tag{9}$$

Proposition 6 (Extrapolation for VAR sequences, Scieur 2018, Thm 3.2.2). Let $(r^{(t)})_{t \in \mathbb{N}}$ be a VAR sequence in $\mathbb{R}^n$, satisfying $r^{(t+1)} = A r^{(t)} + b$ with $A \in \mathbb{R}^{n \times n}$ a symmetric positive definite matrix such that $\|A\|_2 < 1$, $b \in \mathbb{R}^n$ and $K < n$. Assume that for $t \geq K$, the family $\{r^{(t-K)} - r^{(t-K+1)}, \ldots, r^{(t-1)} - r^{(t)}\}$ is linearly independent and define
$$U^{(t)} \overset{\text{def.}}{=} [r^{(t-K)} - r^{(t-K+1)}, \ldots, r^{(t-1)} - r^{(t)}] \in \mathbb{R}^{n \times K} \,, \tag{10}$$
$$(c_1, \ldots, c_K)^\top \overset{\text{def.}}{=} \frac{(U^{(t)\top} U^{(t)})^{-1} \mathbf{1}_K}{\mathbf{1}_K^\top (U^{(t)\top} U^{(t)})^{-1} \mathbf{1}_K} \in \mathbb{R}^K \,, \tag{11}$$
$$r_{\mathrm{extr}} \overset{\text{def.}}{=} \sum_{k=1}^{K} c_k r^{(t-K-1+k)} \in \mathbb{R}^n \,. \tag{12}$$
Then, $r_{\mathrm{extr}}$ satisfies
$$\|A r_{\mathrm{extr}} + b - r_{\mathrm{extr}}\| = O(\rho^K) \,, \tag{13}$$
for some $\rho < 1$.

3. Johnson and Guestrin (2018, Thm. 5.1) improved the RHS in (6) by a factor 2.
In our experiments, it did not lead to a noticeable speed-up as the bulk of the computation is spent on iterations before the screening rule discards variables, and the 2 factor is not large enough to make this happen much earlier Massias, Vaiter, Gramfort and Salmon The justification for this extrapolation procedure is the following: since A 2 < 1, (r(t))t N converges, say to ˆr. For t N, we have r(t+1) ˆr = A(r(t) ˆr). Let (a0, . . . , an) Rn+1 be the coefficients of A s characteristic polynomial. By Cayley-Hamilton s theorem, Pn k=0 ak Ak = 0. Given that A 2 < 1, 1 is not an eigenvalue of A and Pn k=0 ak = 0, so we can normalize these coefficients to have Pn k=0 ak = 1. For t n, we have: k=0 ak r(t n+k) ˆr = k=0 ak Ak ! (r(t n) ˆr) = 0 , (14) k=0 akr(t n+k) = k=0 akˆr = ˆr . (15) Hence, ˆr Span(r(t n), . . . , r(t)). Therefore, it is natural to seek to approximate ˆr as an affine combination of the (n + 1) last iterates (r(t n), . . . , r(t)). Using (n + 1) iterates might be costly for large n, so one might rather consider only a smaller number K, i.e., find (c1, . . . , c K) RK such that PK k=1 ckr(t K 1+k) approximates ˆr. Since ˆr is a fixed point of r 7 Ar+b, PK k=1 ckr(t K 1+k) should be one too. Under the normalizing condition PK k=1 ck = 1, this means that k=1 ckr(t K 1+k) A k=1 ckr(t K 1+k) b = k=1 ckr(t K 1+k) k=1 ck r(t K+k) b b k=1 ck r(t K 1+k) r(t K+k) (16) should be as close to 0n as possible; this leads to solving: ˆc = arg min c RK c 1K=1 k=1 ck r(t K+k) r(t K 1+k) , (17) which admits a closed-form solution if U(t) def. = [r(t K+1) r(t K), . . . , r(t) r(t 1)] Rn K has full column rank (Scieur et al., 2016, Lemma 2.4): ˆc = (U(t) U(t)) 11K 1 K(U(t) U(t)) 11K . (18) In practice, the next proposition shows that when U(t) does not have full column rank, it is theoretically sound to use a lower value for the number of extrapolation coefficients K. Proposition 7. If U(t) U(t) is not invertible, then ˆr Span(r(t 1), . . . , r(t K)). Proof Let x RK \ {0K} be such that U(t) U(t)x = 0K, with x K = 0 (the proof is similar if x K = 0, x K 1 = 0, etc.). Then U(t)x = PK k=1 xk(r(t K+k) r(t K+k 1)) = 0 and, setting x0 def. = 0, r(t) = 1 x K PK k=1(xk xk 1)r(t K+k 1) Span(r(t 1), . . . , r(t K)). Since Dual Extrapolation for Sparse GLMs 1 x K PK k=k(xk xk 1) = 1, it follows that r(t+1) = Ar(t) + b k=1 (xk xk 1)(Ar(t K+k 1) + b) k=1 (xk xk+1)r(t K+k) Span(r(t 1), . . . , r(t K)) , (19) and subsequently r(s) Span(r(t 1), . . . , r(t K)) for all s t. By going to the limit, ˆr Span(r(t 1), . . . , r(t K)). Finally, we state the results on sign identification, which implies support identification. For these results, which connect sparse GLMs to VAR sequences and extrapolation, we need to make the following assumption. Assumption 8. Problem (1) is non degenerate: f(ˆβ)/λ relint 1, where relint denotes the relative interior and f(β) = F(Xβ). This non-degeneracy condition is frequently used in works on support identification (Fuchs, 2004; Hare and Lewis, 2007; Cand es and Recht, 2013; Vaiter et al., 2015). Using it, we can extend results by Hale et al. (2008) about sign identification from proximal gradient to coordinate descent. Theorem 9 (Sign identification for proximal gradient and coordinate descent). Let Assumption 8 hold. Let (β(t))t N be the sequence of iterates converging to ˆβ and produced by proximal gradient descent or coordinate descent when solving Problem (1) (reminded in lines 10 and 13 of Algorithm 1). There exists T N such that: j [p], t T = sign(β(t) j ) = sign(ˆβj). 
The smallest epoch T for which this holds is when sign identification is achieved. Proof For lighter notation in this proof, we denote lj = xj 2/γ and hj(β) = βj 1 lj x j F(Xβ). For j [p], the subdifferential inclusion (4) reads: x j F(X ˆβ) {1} , if ˆβj > 0 , { 1} , if ˆβj < 0 , [ 1, 1] , if ˆβj = 0 . Motivated by these conditions, the equicorrelation set introduced by Tibshirani (2013) is: E def. = {j [p] : |x j F(X ˆβ)| = λ} = {j [p] : |x j ˆθ| = 1} . (21) We introduce the saturation gap associated to Problem (1): ˆδ def. = min 1 |x j F(X ˆβ)| 1 |x j ˆθ| : j / E > 0 . (22) Massias, Vaiter, Gramfort and Salmon As ˆθ is unique, ˆδ is well-defined, and strictly positive by definition of E. By Equation (20), the support of any solution is included in the equicorrelation set. By Assumption 8, we even have equality. We will now show that the coefficients outside the equicorrelation eventually vanish. The proof requires to study the primal iterates after each update (instead of after each epoch), hence we use the notation β(s) for the primal iterate after the s-th update of coordinate descent. This update only modifies the j-th coordinate, with s j 1 mod p : β(s+1) j = ST hj( β(s)), λ Note that at optimality, for every j [p], one has: ˆβj = ST hj(ˆβ), λ Let us consider an update s N of coordinate descent such that the updated coordinate j verifies β(s+1) j = 0 and j / E, hence, ˆβj = 0. Then: | β(s+1) j ˆβj| = ST hj( β(s)), λ ST hj(ˆβ), λ hj( β(s)) hj(ˆβ) λ lj |hj(ˆβ)| , (25) where we used the following inequality (Hale et al., 2008, Lemma 3.2): ST(x, ν) = 0, ST(y, ν) = 0 = | ST(x, ν) ST(y, ν)| |x y| (ν |y|) . (26) Now notice that by definition of the saturation gap (22), and since j / E : 1 |x j F(X ˆβ)| that is, λ lj |hj(ˆβ)| ˆδ (using ˆβj = 0) . (27) Combining Equations (25) and (27) yields | β(s+1) j ˆβj| hj( β(s)) hj(ˆβ) ˆδ . (28) This can only be true for a finite number of updates, since otherwise taking the limit would give 0 ˆδ, and ˆδ > 0 (Eq. (22)). Therefore, after a finite number of updates, β(s) j = 0 for j / E. For j E, ˆβj = 0 by Assumption 8, so β(t) j has the same sign eventually since it converges to ˆβj. The proof for proximal gradient descent is a result of Hale et al. (2008, Theorem 4.5), who provide the bound T β(s) ˆβ 2 2/ˆδ2. Dual Extrapolation for Sparse GLMs Algorithm 1 PG/cyclic CD for Problem (1) with dual extrapolation input : X = [x1| . . . |xp], y, λ, β(0), ϵ param: T, K = 5, fdual = 10 init : Xβ = Xβ(0), θ(0) = F(Xβ(0))/ max(λ, X F(Xβ(0)) ) 1 for t = 1, . . . , T do 2 if t = 0 mod fdual then // compute θ and gap every f epoch only 3 t = t/fdual // dual point indexing 4 r(t ) = Xβ 5 compute θ(t ) res and θ(t ) acc with eqs. (7), (37) and (38) 6 θ(t ) = arg max n D(θ) : θ {θ(t 1), θ(t ) acc, θ(t ) res } o // robust dual extr. with (39) 7 if P(β(t)) D(θ(t )) < ϵ then break 8 if PG then // proximal gradient descent: 9 Xβ = Xβ(t) 10 β(t+1) = ST β(t) γ X X 2 X F(Xβ), λγ X X 2 11 else if CD then // cyclic coordinate descent: 12 for j = 1, . . . , p do 13 β(t+1) j = ST β(t) j γx j F(Xβ) xj 2 ), γλ xj 2 14 Xβ += (β(t+1) j β(t) j )xj 15 return β(t), θ(t ) 3. A seminal example: the Lasso case Dual extrapolation was originally proposed for the Lasso in the Celer algorithm (Massias et al., 2018). As the VAR model holds exactly in this case, we first devote special attention to it. We will make use of asymptotic VAR models and generalize Celer to all sparse GLMs in Section 4. 
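Before detailing the theory, here is a small numerical sketch (ours, not part of the original paper) of the extrapolation step of Equations (10)-(12), applied to the $K + 1$ most recent Lasso residuals $r^{(t)} = y - X\beta^{(t)}$, together with the rescaling of Equations (7)/(38) that turns the extrapolated vector into a feasible dual point; Definition 12 below makes this construction precise, and the function names are ours.

```python
import numpy as np

def extrapolate(residuals):
    """Extrapolate a list of the K + 1 most recent residuals r^(t-K), ..., r^(t),
    following Equations (10)-(12): find the affine combination of past iterates
    that is (approximately) a fixed point of r -> A r + b."""
    R = np.column_stack(residuals)   # shape (n, K + 1)
    U = np.diff(R, axis=1)           # differences r^(s+1) - r^(s), shape (n, K)
    try:
        z = np.linalg.solve(U.T @ U, np.ones(U.shape[1]))
    except np.linalg.LinAlgError:
        # singular system: fall back to the last residual (the paper simply does
        # not compute the extrapolated point in that case)
        return residuals[-1]
    c = z / z.sum()                  # extrapolation coefficients, summing to 1
    return R[:, 1:] @ c              # combination of the K most recent residuals

def dual_point(X, lam, r):
    """Rescale a residual-like vector into a feasible dual point (Equations (7)/(38))."""
    return r / max(lam, np.linalg.norm(X.T @ r, ord=np.inf))
```

In the solvers, the best of the rescaled residuals, the extrapolated point and the previous dual point (in terms of dual objective) is kept, as in Equation (39) below.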
Using the identification property of coordinate descent and proximal gradient descent, we can formalize the VAR behavior of dual iterates. Proposition 10. When (β(t))t N is obtained by cyclic coordinate descent or proximal gradient descent applied to the Lasso problem, (Xβ(t))t N is a VAR sequence after sign identification. Proof Let us first recall that the strong convexity constant γ is equal to 1 in the Lasso case. Let t N denote an epoch after sign identification. The respective updates of proximal gradient descent and coordinate descent are reminded in lines 10 and 13 of Algorithm 1. Coordinate descent: Let j1, . . . , j S be the indices of the support of ˆβ, in increasing order. As the sign is identified, coefficients outside the support are 0 and remain 0. We decompose the t-th epoch of coordinate descent into individual coordinate updates: let β(0) Rp denote the initialization (i.e., the beginning of the epoch, β(0) = β(t)), β(1) Rp the iterate after coordinate j1 has been updated, etc., up to β(S) after coordinate j S has been updated, i.e., at the end of the epoch ( β(S) = β(t+1)). Massias, Vaiter, Gramfort and Salmon Let s [S], then β(s) and β(s 1) are equal everywhere, except at coordinate js: β(s) js = ST β(s 1) js + 1 xjs 2 x js y X β(s 1) , λ xjs 2 = β(s 1) js + 1 xjs 2 x js y X β(s 1) λ sign(ˆβjs) xjs 2 , (29) where we have used sign identification: sign( β(s) js ) = sign(ˆβjs). Therefore X β(s) X β(s 1) = xjs β(s) js β(s 1) js x js(y X β(s 1)) λ sign(ˆβjs) = 1 xjs 2 xjsx js y X β(s 1) λ sign(ˆβjs) xjs 2 xjs . (30) This leads to the following linear recurrent equation: X β(s) = Idn 1 xjs 2 xjsx js | {z } As Rn n X β(s 1) + x jsy λ sign(ˆβjs) xjs 2 xjs | {z } bs Rn Hence, one gets recursively X β(S) = ASX β(S 1) + b S = ASAS 1X β(S 2) + ASb S 1 + b S = AS . . . A1 | {z } A X β(0) + AS . . . A2b1 + + ASb S 1 + b S | {z } b We can thus write the following VAR equations for Xβ at the end of each epoch coordinate descent: Xβ(t+1) = AXβ(t) + b , (33) Xβ(t+1) X ˆβ = A(Xβ(t) X ˆβ) . (34) Proximal gradient: Let β(t) S , ˆβS and XS denote respectively β(t), ˆβ and X restricted to features in the support S(ˆβ). Notice that since we are in the identified sign regime, Xβ(t) = XSβ(t) S . With L = X X 2 , a proximal gradient descent update reads: β(t+1) S = ST β(t) S 1 LX S (XSβ(t) S y), λ LX S XSβ(t) S y λ L sign(ˆβS) LX S XS β(t) S + 1 L sign(ˆβS) . (35) Dual Extrapolation for Sparse GLMs Figure 1: Illustration of the VAR nature of the dual iterates of the Lasso, on a toy dataset with n = 2 and p = 3. Left: dual of the Lasso problem; the dual optimum ˆθ is the projection of y/λ onto X. Right: sequence of residuals after each update of coordinate descent (first iterates in blue, last in yellow). After four updates, the iterates alternate geometrically between the same two constraint hyperplanes. Hence the equivalent of Equation (33) for proximal gradient descent is: Xβ(t+1) = Idn 1 LXSX S Xβ(t) + 1 LXS sign(ˆβS) . (36) Figure 1 represents the Lasso dual for a toy problem and illustrates the VAR nature of r(t)/λ. As highlighted in Tibshirani (2017), the iterates r(t)/λ correspond to the iterates of Dykstra s algorithm to project y/λ onto X. During the first updates, the dual iterates do not have a regular trajectory. However, after a certain number of updates (corresponding to sign identification), they alternate in a geometric fashion between two hyperplanes. In this regime, it becomes beneficial to use extrapolation to obtain a point closer to ˆθ. Remark 11. 
Equation (32) shows why we combine extrapolation with cyclic coordinate descent: if the coefficients are not always updated in the same order (see Massias et al. 2018, Figure 1(c-d)), the matrix A depends on the epoch, and the VAR structure may no longer hold. Having highlighted the VAR behavior of (Xβ(t))t N, we can introduce our proposed dual extrapolation. Definition 12 (Extrapolated dual point for the Lasso). For a fixed number K of proximal gradient descent or coordinate descent epochs, let r(t) denote the residuals y Xβ(t) at epoch t of the algorithm. We define the extrapolated residuals r(t), if t K , K X k=1 ckr(t+1 k), if t > K . (37) Massias, Vaiter, Gramfort and Salmon where c = (c1, . . . , c K) RK is defined as in (18) with U(t) = [r(t+1 K) r(t K), . . . , r(t) r(t 1)] Rn K. Then, we define the extrapolated dual point as: θ(t) acc def. = r(t) acc/ max(λ, X r(t) acc ) . (38) In practice, we use K = 5 and do not compute θ(t) acc if U(t) U(t) cannot be inverted. Additionally, to impose monotonicity of the dual objective, and guarantee a behavior at least as good at θres, we use as dual point at iteration t: θ(t) = arg max θ {θ(t 1),θ(t) acc,θ(t) res} D(θ) . (39) There are two reasons why the results of Theorem 6 cannot be straightforwardly applied to Equation (38): 1. the analysis by Scieur et al. (2016) requires A to be symmetrical, which is the case for proximal gradient descent but not for cyclic coordinate descent (as Idn xjsx js/ xjs 2 and Idn xjs x js / xjs 2 only commute if xjs and xjs are collinear). To circumvent this issue, we can make A symmetrical: instead of considering cyclic updates, we could consider that iterates β(t) are produced by a cyclic pass over the coordinates, followed by a cyclic pass over the coordinates in reverse order. The matrix of the VAR in this case is no longer A = AS . . . A1, but A1 . . . ASAS . . . A1 = A 1 . . . A S AS . . . A1 = A A (the As s are symmetrical). Experiments of Section 6, where a simple cyclic order is used, tend to indicate that there is in fact no need for A to be symmetrical. 2. for both proximal gradient and coordinate descent we have A = 1 instead of A < 1 as soon as S < n: if the support of ˆβ is of size smaller than n (S < n), 1 is an eigenvalue of A. Indeed, for coordinate descent, if S < n, there exists a vector u Rn, orthogonal to the S vectors xj1, . . . , xj S. The matrix As = Idn 1 xjs 2 xjsx js being the orthogonal projection onto Span(xjs) , we therefore have Asu = u for every s [S], hence Au = u. For proximal gradient descent, 1 L XS XS is not invertible when S < n, hence 1 is an eigenvalue of Idn 1 L XS XS . This seems to contradict the convergence of the VAR sequence but is addressed in Theorems 13 and 14. Lemma 13. For coordinate descent, if an eigenvalue of A = AS . . . A1 has modulus 1, it is equal to 1. Proof The matrix As = Idn 1 xjs 2 xjsx js is the orthogonal projection onto Span(xjs) . Hence, x Rn, Asx = x = Asx = x . (40) Let (µ, x) C Rn s.t. |µ| = 1, x = 1 and Ax = µx. This means Ax = 1. Because A1x < 1 = AS . . . A1x AS . . . A2 A1x < 1 = Ax < 1, we must have A1x 1. Since it holds that A1x x = 1, we have A1x = x , thus A1x = x because A1 is an orthogonal projection. By a similar reasoning, A2x = x, etc. up to ASx = x, hence Ax = x and µ = 1. Dual Extrapolation for Sparse GLMs Lemma 14. For coordinate descent (resp. proximal gradient descent) applied to solve the Lasso, the VAR parameters A Rn n and b Rn defined in (32) (resp. (36)) satisfy b Ker(Idn A) . 
Coordinate descent case: Let us remind that b = AS . . . A2b1 + + ASb S 1 + b S in this case, with bs = x jsy λ sign(ˆβjs)xjs/ xjs 2. Let v Ker(Idn A). Following the proof of Theorem 13, we have A1v = = ASv = v. For s [S], since As is the projection on Span(xjs) , this means that v is orthogonal to xjs. Additionally, v AS . . . As+1bs = (As+1 . . . ASv) bs = v bs = 0 since bs is co-linear to xjs. Thus, v is orthogonal to the S terms which compose b, and b Ker(Idn A). Proximal gradient descent case: Let v Ker(Idn A) = Ker(XSX S ). We have v XSX S v = 0 = X S v 2, hence X S v = 0. It is now clear that v b = v ( XSX S y + λXS sign ˆβ)/L = 0, hence b Ker(Idn A). Proposition 15. Theorem 6 holds for the residuals r(t) (produced either by proximal gradient descent or coordinate descent) even though A 2 = 1 in both cases. Proof Let us write A = A + A with A the orthogonal projection on Ker(Idn A). By Theorem 13, A < 1. Then, one can check that AA = A2 and A A = A2 = A and Ab = Ab. Let T be the epoch when support identification is achieved. For t T, we have r(t+1) = Ar(t) + b + Ar(T) . (41) Indeed, it is trivially true for t = T and if it holds for t, r(t+2) = Ar(t+1) + b = A(Ar(t) + b + Ar(T)) + b = A2r(t) + Ab + Ar(T) + b = A(Ar(t) + b) + Ar(T) + b = Ar(t+1) + Ar(T) + b . (42) Therefore, on the space Ker(Idn A), the sequence r(t) is constant, and on its orthogonal Ker(Idn A) , it is a VAR sequence with associated matrix A, whose spectral normal is strictly less than 1. Therefore, the results of Theorem 6 still hold. Remark 16 (Connection with primal-dual techniques). The goal of our construction is to improve convergence for the primal, by constructing a better dual certificate which provides a tighter stopping criterion. In our scheme, the primal iterates directly influence the dual ones either through the link equation (residuals rescaling), either through extrapolation Massias, Vaiter, Gramfort and Salmon but (apart from the influence of screening or working set selection), the primal iterates do not depend on the dual ones. An alternative technique to improve convergence in the dual would be to solve simultaneously the primal and the dual. The objective function in Problem (1) is F(Xβ) + λ β 1, hence since strong duality holds, an equivalent saddle point formulation is max θ Rn min β Rp,z Rn F(z) + λ β 1 + λθ (z Xβ) , i.e., max θ Rn min β Rp F ( λθ) + λ β 1 λθ Xβ | {z } L(β,θ) To solve this problem, the primal-dual Arrow-Hurwicz (Arrow et al., 1958) method alternates proximal maximization steps in θ and proximal minimization steps in β. Here, the maximization step can even be performed exactly, yielding: ( β(t+1) = proxλ/L 1(β(t) + λ LX θ(t)) , λ F ( λθ(t+1)) λXβ(t+1) = 0 , (44) and the last line is equivalent to θ(t+1) = F(Xβ(t+1))/λ (Hiriart-Urruty and Lemar echal, 1993, Cor. 1.4.4), as in Equation (3). Using inertial variants of the scheme (44), such as the one by Chambolle and Pock (2011) is a potential lead, which we do not investigate further. In our opinion, a more promising direction of research would be to design extrapolation methods for the primal-dual coordinate descent method of Fercoq and Bianchi (2015), which is left to future work. Finally, we are not aware of algorithms working directly in the dual; a reason for that is that getting feasible iterates by other means than rescaling requires the knowledge of the projection onto X, which is as difficult as the primal (see Tibshirani (2017) on this matter). D unner et al. 
(2016) use a so-called Lipschitzing trick to make the dual unconstrained, but the rough bound λ ˆβ 1 F(0n) they used is likely to lead to poor values of convergence rate constants in practice. Although so far we have proven results for both coordinate descent and proximal gradient descent for the sake of generality, we observed that coordinate descent consistently converges faster. Hence from now on, we only consider the latter. 4. Generalized linear models 4.1 Coordinate descent for ℓ1 regularization Proposition 17 (VAR for coordinate descent and Sparse GLM). When Problem (1) is solved by cyclic coordinate descent, the dual iterates (Xβ(t))t N form an asymptotical VAR sequence. Proof As in the proof of Theorem 10, we place ourselves in the identified sign regime, and consider only one epoch t of CD: let β(0) denote the value of the primal iterate at the beginning of the epoch ( β(0) = β(t)), and for s [S], β(s) Rp denotes its value after the js coordinate has been updated ( β(S) = β(t+1)). Recall that in the framework of Problem (1), the data-fitting functions fi have 1/γ-Lipschitz gradients, and F(u) = (f 1(u1), . . . , f n(un)). Dual Extrapolation for Sparse GLMs For s [S], β(s) and β(s 1) are equal everywhere except at entry js, for which the coordinate descent update with fixed step size γ xjs 2 is β(s) js = ST β(s 1) js γ xjs 2 x js F(X β(s 1)), γ xjs 2 λ = β(s 1) js γ xjs 2 x js F(X β(s 1)) γ xjs 2 λ sign(ˆβjs) . (45) X β(s) X β(s 1) = xjs β(s) js β(s 1) js = xjs γ xjs 2 x js F(X β(s 1)) γ xjs 2 λ sign(ˆβjs) . (46) Using point-wise linearization of the function F around X ˆβ, we have: F(Xβ) = F(X ˆβ) + D(Xβ X ˆβ) + o(Xβ X ˆβ) , (47) where D def. = diag(f 1 (ˆβ x1), . . . , f n(ˆβ xn)) Rn n. Therefore X β(s) = Idn γ xjs 2 xjsx js D X β(s 1) + γ xjs 2 x js(DX ˆβ F(X ˆβ)) λ sign(ˆβjs) xjs + o(X β(s) X ˆβ) , D1/2X β(s) = Idn γ xjs 2 D1/2xjsx js D1/2 D1/2X β(s 1) + γ xjs 2 x js(DX ˆβ)D1/2xjs | {z } bs + o(X β(s) X ˆβ) , (48) since the subdifferential inclusion (4) gives x js F(X ˆβ) λ sign(ˆβjs) = 0. Thus, the sequence (D1/2Xβ(t))t N is an asymptotical VAR sequence: D1/2Xβ(t+1) = AS . . . A1D1/2Xβ(t) + b S + . . . + AS . . . A2b1 + o(Xβ(t) X ˆβ) , (49) and so is (Xβ(t))t N: Xβ(t+1) = D 1 2 AS . . . A1D 1 2 | {z } A Xβ(t) + D 1 2 (b S + . . . + AS . . . A2b1) | {z } b +o(Xβ(t) X ˆβ) . (50) Proposition 18. As in Theorems 13 and 14, for the VAR parameters A and b defined in Equation (50), 1 is the only eigenvalue of A whose modulus is 1 and b Ker(Idn A). Massias, Vaiter, Gramfort and Salmon Proof First, notice that as in the Lasso case, we have Idn As 0. Indeed, because f i takes values in ]0, 1/γ[, D1/2 exists and 1 γ Idn D1/2 0. For any u Rn, u D1/2xjsx js D1/2u = (x js D1/2u)2 0, (51) and x js D1/2u xjs D1/2u 1 γ xjs u , (52) γ Idn D1/2xjsx js D1/2 0 and Idn As 0. However, contrary to the Lasso case, because D1/2xjs = γ xjs , As is not the orthogonal projection on (Span D1/2xjs) . Nevertheless, we still have As = A s , As 1, and for v Rn, Asv = v means that v D1/2xjs = 0, so the proof of Theorem 13 can be applied to show that the only eigenvalue of AS . . . A1 which has modulus 1 is 1. Then, observing that A = D 1/2AS . . . A1D1/2 has the same spectrum as AS . . . A1 concludes the first part of the proof. For the second result, let v Ker(Idn A), i.e., Av = v, hence AS . . . A1D1/2v = D1/2Av = D1/2v. Therefore D1/2v is a fixed point of AS . . . A1, and as in the Lasso case this means that for all s [S], As D1/2v = D1/2v and (D1/2v) D1/2xjs = 0. Now recall that b = D 1/2(b S + . . . + AS . 
. . A2b1) , (53) bs = γ xjs 2 x js(DX ˆβ F(X ˆβ)) λ sign(ˆβjs) D1/2xjs = γ xjs 2 (x js DX ˆβ)D1/2xjs . (54) Additionally, v D 1/2AS . . . As+1bs = (As+1 . . . ASD 1/2v) bs = (D 1/2v) bs = 0. Hence v is orthogonal to all the terms which compose b, hence v b = 0. Theorem 17 and Theorem 18 show that we can construct an extrapolated dual point for any sparse GLM, by extrapolating the sequence (r(t) = Xβ(t))t N with the construction of Equation (37), and creating a feasible point with: θ(t) acc def. = F(r(t) acc)/ max(λ, X F(r(t) acc) ) . (55) 4.2 Multitask Lasso Let q N be a number of tasks, and consider an observation matrix Y Rn q, whose i-th row is the target in Rq for the i-th sample. For B Rp q, let B 2,1 = Pp 1 Bj (with Bj R1 q the j-th row of B). Definition 19. The multitask Lasso estimator is defined as the solution of: ˆB arg min B Rn q 1 2 Y XB 2 F + λ B 2,1 . (56) Let j1 < < j S denote the (row-wise) support of ˆB, and let t denote an iteration after support identification. Note that the guarantees of support identification for multitask Lasso Dual Extrapolation for Sparse GLMs requires more assumptions than the case of the standard Lasso. In particular it requires a source condition which depends on the design matrix X. This was investigated for instance by Vaiter et al. (2018) when considering a proximal gradient descent algorithm. Let B(0) = B(t), and for s [S], let B(s) denote the primal iterate after coordinate js has been updated. Let s [S], with B(s) and B(s 1) being equal everywhere, except for their js row for which one iteration of proximal block coordinate descent gives, with φ(B) def. = Bjs + 1 xjs 2 x js(Y XB) R1 q: φ( B(s 1)) . (57) From Equation (57), X B(s) X B(s 1) = xjs( B(s) js B(s 1) js ) 1 xjs 2 x js(Y X B(s 1)) λ/ xjs 2 φ( B(s 1)) φ( B(s 1)) x + h x + h = x x + 1 x h + o( h ), (59) and introducing ψ def. = e js 1 xjs 2 x js X R1 p, so that φ(B) = φ(ˆB) + ψ(B ˆB), one has the following linearization: φ(B) φ(B) = φ(ˆB) φ(ˆB) ψ(B ˆB) Idq φ(ˆB) φ(ˆB) + o(B ˆB) , (60) which does not allow to exhibit a VAR structure, as B should appear only on the right. Despite this negative result, empirical results of Section 6 show that dual extrapolation still provides a tighter dual point in the identified support regime. Celer s generalization to multitask Lasso consists in using d(t) j = (1 x j Θ(t) )/ xj with the dual iterate Θ(t) Rn q. The inner solver is cyclic block coordinate descent (BCD), and the extrapolation coefficients are obtained by solving Equation (17), which is an easy to solve matrix leastsquares problem. Remark 20. As a concluding remark, we point that for the three models studied here, a VAR structure in the dual implies a VAR structure in the primal, provided XS(ˆβ) has full column rank. Indeed, for any matrix B such that BXS(ˆβ) = Id ˆβ 0, after support identification one has β(t+1) S(ˆβ) = BAXS(ˆβ)β(t) S(ˆβ) +Bb. This paves the way for applying the techniques introduced here to extrapolation in the primal, which we leave to future work. 5. Working sets Being able to construct a better dual point leads to a tighter gap and a smaller upper bound in Equation (6), hence to more features being discarded and a greater speed-up for Gap Safe screening rules. As we detail in this section, it can easily be integrated in a efficient working set policy. 
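Before turning to the working set policy itself, here is a small sketch (ours, not part of the original paper) showing how the extrapolated dual point of Equation (55) can be formed for sparse logistic regression (labels y ∈ {−1, 1}ⁿ) and how it yields the feature scores $d_j$ of Equation (62) below that drive the working set selection; `Xbeta_acc` stands for an extrapolated version of the sequence $(X\beta^{(t)})$, obtained as in Section 3, and the function names are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dual_point_logreg(X, y, Xbeta_acc, lam):
    """Feasible dual point of Equation (55) for sparse logistic regression,
    with f_i(t) = log(1 + exp(-y_i t)): here -grad F(u) = y * sigmoid(-y * u)."""
    minus_grad = y * sigmoid(-y * Xbeta_acc)
    return minus_grad / max(lam, np.linalg.norm(X.T @ minus_grad, ord=np.inf))

def feature_scores(X, theta):
    """Scores d_j = (1 - |x_j^T theta|) / ||x_j|| of Equation (62) below: small
    scores indicate features close to the equicorrelation set, hence likely to
    be active and worth keeping in the next working set."""
    return (1.0 - np.abs(X.T @ theta)) / np.linalg.norm(X, axis=0)
```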
Massias, Vaiter, Gramfort and Salmon 5.1 Improved working sets policy Working set methods originated in the domains of linear and quadratic programming (Thompson et al., 1966; Palacios-Gomez et al., 1982; Myers and Shih, 1988), where they are called active set methods. In the context of this paper, a working set approach starts by solving Problem (1) restricted to a small set of features W(0) [p] (the working set), then defines iteratively new working sets W(t) and solves a sequence of growing problems (Kowalski et al., 2011; Boisbunon et al., 2014; Santis et al., 2016). It is easy to see that when W(t) W(t+1) and when the subproblems are solved up to the precision required for the whole problem, then working sets techniques converge. It was shown by Massias et al. (2017) that every screening rule which writes j [p], dj > τ ˆβj = 0 , (61) allows to define a working set policy. For example for Gap Safe rules, dj = dj(θ) def. = 1 |x j θ| xj , (62) is defined as a function of a dual point θ X. The value dj can be seen as measuring the importance of feature j, and so given an initial size p(1) the first working set can be defined as: W(1) = {j(1) 1 , . . . , j(1) p(1)} , (63) with dj(1) 1 (θ) dj(1) p(1)(θ) < dj(θ), j / W(0), i.e., the indices of the p(1) smallest values of d(θ). Then, the subproblem solver is launched on XW(1). New primal and dual iterates are returned, which allow to recompute dj s and define iteratively: W(t+1) = {j(t+1) 1 , . . . , j(t+1) p(t+1)} , (64) where we impose dj(θ) = 1 when β(t) j = 0 to keep the active features in the next working set. As in Massias et al. (2018), we choose p(t) = min(p, 2 β(t) 0) to ensure a fast initial growth of the working set, and avoid growing too much when the support is nearly identified. The stopping criterion for the inner solver on W(t) is to reach a gap lower than a fraction ρ = 0.3 of the duality gap for the whole problem, P(β(t)) D(θ(t)). These adaptive working set policies are commonly used in practice (Johnson and Guestrin, 2015, 2018). Combined with coordinate descent as an inner solver, this algorithm was coined Celer (Constraint Elimination for the Lasso with Extrapolated Residuals) when addressing the Lasso problem. The results of Section 4 justify the use of dual extrapolation for any sparse GLM, thus enabling us to generalize Celer to the whole class of models (Algorithm 2). 5.2 Newton-Celer When using a squared ℓ2 loss, the curvature of the loss is constant: for the Lasso and multitask Lasso, the Hessian does not depend on the current iterate. This is however not true for other GLM data fitting terms, e.g., Logistic regression, for which taking into Dual Extrapolation for Sparse GLMs Algorithm 2 Celer for Problem (1) input : X, y, λ, β(0), θ(0) param: K = 5, p(1) = 100, ϵ, MAX WS init : W(0) = 1 if β(0) = 0p then p(1) = |S(β(0))| // warm start 2 for t = 1, . . . 
, MAX WS do 3 compute θ(t) res // Equation (7) 4 if solver is Prox-Celer then 5 do K passes of CD on the support of β(t), extrapolate to produce θ(t 1) acc 6 θ(t 1) inner = arg maxθ {θ(t 1),θ(t 1) inner } D(θ) 7 θ(t) = arg maxθ {θ(t 1),θ(t 1) inner ,θ(t) res} D(θ) 8 g(t) = P(β(t 1)) D(θ(t)) // global gap 9 if g(t) ϵ then break 10 ϵ(t), W(t) = create WS() // get tolerance and working set with Algorithm 3 // Subproblem solver is Algorithm 1 or 4 for Prox-Celer: 11 get β(t), θ(t) inner with subproblem solver applied to (XW(t), y, λ, (β(t 1))W(t), ϵ(t)) 12 θ(t) inner = θ(t) inner/ max(1, X θ(t) inner ) 13 set β(t) = 0p and (β(t))W(t) = β(t) 14 return β(t), θ(t) Algorithm 3 create WS input : X, y, λ, β(t 1), θ(t), W(t 1), g(t) param: p(1) = 100, ρ = 0.3 init : d = 0p 1 for j = 1, . . . , p do 2 if β(t 1) j = 0 then d(t) j = 1 3 else d(t) j = (1 |x j θ(t)|)/ xj 4 ϵ(t) = ρg(t) 5 if t 2 then p(t) = min(2 β(t 1) 0, p) 6 W(t) = {j [p] : d(t) j among p(t) smallest values of d(t)} 7 return ϵ(t), W(t) account the second order information proves to be very useful for fast convergence (Hsieh et al., 2014). To leverage this information, we can use a prox-Newton method (Lee et al., 2012; Scheinberg and Tang, 2013) as inner solver; an advantage of dual extrapolation is that it can be combined with any inner solver, as we detail below. For reproducibility and completeness, we first briefly detail the Prox-Newton procedure used. In the following and in Algorithms 4 to 6 we focus on a single subproblem optimization, so for lighter notation we assume that the design matrix X is already restricted to features in the working set. Massias, Vaiter, Gramfort and Salmon The reader should be aware that in the rest of this section, β, X and p in fact refers to βW(t), XW(t), and p(t). Writing the data-fitting term f(β) = F(Xβ), we have 2f(β) = X DX, where D Rn n is diagonal with f i (β xi) as its i-th diagonal entry. Using H = 2f(β(t)) we can approximate the primal objective by4 f(β(t)) + f(β(t)) (β β(t)) + 1 2(β β(t)) H(β β(t)) + λ β 1 . (65) Minimizing this approximation yields the direction (t) for the proximal Newton step: (t) + β(t) = arg min β β β(t) + H 1 f(β(t)) 2 H + λ β 1 . (66) Then, a step size α(t) is found by backtracking line search (Algorithm 6), and: β(t+1) = β(t) + α(t) (t) . (67) Solving (66) amounts to solving the following Lasso problem: u = arg min u 1 2 y Xu 2 2 + λ u 1 , (68) where X = D1/2X, y = D1/2Xβ(t) D 1/2X X F(Xβ(t)) and X is the pseudoinverse of X. While this may seem costly computationally, it turns out that the terms X , y and X are not needed to solve (68) with coordinate descent. A coordinate descent update for (68) reads: uj ST uj + 1 lj x j y Xu , λ x j ( y Xu) = x j DXβ(t) x j F(Xβ(t)) x j DXu , (70) lj = x j Dxj . (71) Therefore, the update only involves X, y and inner products weighted by D. The algorithm is summarized in Algorithm 5. Contrary to coordinate descent, Newton steps do not lead to an asymptotic VAR, which is required to guarantee the success of dual extrapolation. To address this issue, we compute K passes of cyclic coordinate descent restricted to the support of the current estimate β before defining a working set (Algorithm 2, line 5). The K values of Xβ obtained allow for the computation of both θacc and θres. The motivation for restricting the coordinate descent to the support of the current estimate β comes from the observation that dual extrapolation proves particularly useful once the support is identified. 
The Prox-Newton solver we use is detailed in Algorithm 4. When Algorithm 2 is used with Algorithm 4 as inner solver, we refer to it as the Newton-Celer variant. 4. H and D should read H(t) and D(t) as they depend on β(t); we omit the exponent for brevity. Dual Extrapolation for Sparse GLMs Algorithm 4 Prox-Newton subproblem solver (illustrated on logistic regression) input : X = [x1| . . . |xp] Rn p, y Rn, λ, β(0) Rp, ϵ param: MAX CD = 20, MAX BACKTRACK = 10, K = 5 init : β = 0p, X β = 0n, θ(0) = 0n, D = 0n n, L = 0p, 1 for t = 1, . . . , T do 2 for i = 1, . . . , n do Dii = f i (β xi) = exp(yiβ xi)/ 1 + exp(yiβ xi) 2 3 for j = 1, . . . , p do Lj = xj, xj D = Pn i=1 x2 ijexp(yiβ xi)/ 1 + exp(yiβ xi) 2 4 if t = 1 then MAX CD = 1 5 else MAX CD = 20 6 β = newton direction(X, y, β(t 1), D, L = (L1, . . . , Lp), MAX CD) 7 α(t) = backtracking( β, X β, y, λ, MAX BACKTRACK) 8 β(t) = β(t 1) + α(t) β 9 θ(t) res = F(Xβ(t))/λ = y/(λ1n + λ exp(y Xβ(t))) 10 θ(t) res = θ(t) res/ max(1, X θ(t) res ) 11 θ(t) = arg maxθ {θ(t 1),θres} D(θ) if P(β(t)) D(θ(t)) < ϵ then 13 return β(t), θ(t) Algorithm 5 newton direction (illustrated on logistic regression) input : X = [x1| . . . |xp] Rn p, y Rn, β Rp, D Rn n, L Rp, MAX CD param: ϵ, MIN CD = 2 init : β = 0p, X β = 0n 1 for k = 1, . . . , MAX CD do 2 τ = 0 // stopping condition 3 for j = 1, . . . , p do 4 uj = βj + ( β)j 5 uj = ST βj + ( β)j 1 x j F(Xβ(t)) xj, X β D , λ // see (69) 6 ( β)j = uj βj 7 X β += ( uj uj)xj 8 τ += ( uj uj)2 L2 j 9 if τ ϵ and k MIN CD then break 10 return β Values of parameters and implementation details In practice, Prox-Newton implementations such as GLMNET (Friedman et al., 2010), new GLMNET (Yuan et al., 2012) or QUIC (Hsieh et al., 2014) only solve the direction approximately in Equation (66). How inexactly the problem is solved depends on some heuristic values. For reproducibility, we expose the default values of these parameters as inputs to the algorithms. Importantly, the variable MAX CD is set to 1 for the computation of the first Prox-Newton direction. Experiments have indeed revealed that a rough Newton direction for the first update was Massias, Vaiter, Gramfort and Salmon Algorithm 6 backtracking (illustrated on logistic regression) input : β, X β, λ param: MAX BACKTRACK = 20 init : α = 1 1 for k = 1, . . . , MAX BACKTRACK do 3 for j = 1, . . . , p do 4 if βj + α ( β)j < 0 then δ = λ( β)j 5 else if βj + α ( β)j > 0 then δ += λ( β)j 6 else if βj + α ( β)j = 0 then δ = λ |( β)j| 7 θ = F(Xβ + α X β) (= y σ( y (Xβ + α X β))) 8 δ += (X β) θ 9 if δ < 0 then break 10 else α = α/2 11 return α sufficient and resulted in a substantial speed-up. Other parameters are set based on existing Prox-Newton implementations such as Blitz. 6. Experiments In this section, we numerically illustrate the benefits of dual extrapolation on various data sets. Implementation is done in Python, Cython (Behnel et al., 2011) and numba (Lam et al., 2015) for the low-level critical parts. The solvers exactly follow the scikit-learn API (Pedregosa et al., 2011; Buitinck et al., 2013), so that Celer can be used as a drop-in replacement in existing code. The package is available under BSD3 license at https://github. com/mathurinm/celer, with documentation and examples at https://mathurinm.github. io/celer. In all this section, the estimator-specific λmax refers to the smallest value giving a null solution (for instance λmax = X y in the Lasso case, λmax = X y /2 for sparse logistic regression, and λmax = X Y 2, for the Multitask Lasso). 
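As a small illustration (ours), the values of λmax mentioned above can be computed as follows; the choice of norms (ℓ∞ for the Lasso and logistic regression, row-wise ℓ2 for the Multitask Lasso) is our reading of the criteria stated in the previous paragraph, assuming y ∈ {−1, 1}ⁿ for logistic regression.

```python
import numpy as np

def lasso_lambda_max(X, y):
    # smallest lambda for which 0 solves the Lasso: ||X^T y||_inf
    return np.linalg.norm(X.T @ y, ord=np.inf)

def logreg_lambda_max(X, y):
    # sparse logistic regression with labels y in {-1, 1}: ||X^T y||_inf / 2
    return np.linalg.norm(X.T @ y, ord=np.inf) / 2.0

def multitask_lambda_max(X, Y):
    # multitask Lasso: largest row-wise l2 norm of X^T Y
    return np.linalg.norm(X.T @ Y, axis=1).max()
```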
Table 1: Characteristics of datasets used.

| name | n | p | q | density |
|---|---|---|---|---|
| leukemia | 72 | 7,129 | - | 1 |
| news20 | 19,996 | 632,983 | - | 6.1 × 10⁻⁴ |
| rcv1 train | 20,242 | 19,960 | - | 3.7 × 10⁻³ |
| finance (log1p) | 16,087 | 1,668,738 | - | 3.4 × 10⁻³ |
| Magnetoencephalography (MEG) | 305 | 7,498 | 49 | 1 |

[Figure 2 — panels: (a) Lasso, on news20 for λ = λmax/5; (b) Log. reg., on rcv1 (train) for λ = λmax/20; (c) Multitask Lasso, on MEG data for λ = λmax/10; each panel plots P(β^(t)) − D(θ^(t)_res), P(β^(t)) − D(θ^(t)_acc) and P(β^(t)) − P(β̂) against CD epoch t.]
Figure 2: Dual objectives with classical and proposed approach, for Lasso (top left), Logistic regression (top right), Multitask Lasso (bottom). The dashed line marks sign identification (support identification for Multitask Lasso).

6.1 Illustration of dual extrapolation

For the Lasso (Figure 2a), Logistic regression (Figure 2b) and Multitask Lasso (Figure 2c), we illustrate the applicability of dual extrapolation. Monotonicity of the duality gap computed with extrapolation is enforced via the construction of Equation (39). For all problems, the figures show that θacc gives a better dual objective after sign identification, with a duality gap sometimes even matching the suboptimality gap. They also show that the behavior is stable before identification. In particular, Figure 2c hints that dual extrapolation works in practice for the Multitask Lasso, even though there is no such result as sign identification, and we are not able to exhibit a VAR behavior for (XB^(t))_{t∈ℕ}.

Figure 1 suggests that the lower the stopping criterion threshold ϵ, the higher the impact of dual extrapolation is. However, when combined with screening, this improvement can be less visible in terms of time: if a large number of variables are screened before support identification, the later iterations concern a very small number of features. In this case, decreasing the duality gap by running the solver longer after screening is not costly.

6.2 Alternative exploitation of VAR structure

Once one postulates that θ̂ is a linear combination of the K most recent residuals, alternatives to our proposed dual extrapolation can be investigated to determine the coefficients of this combination. This is particularly appealing in the Lasso case, for which the dual Problem (2) is:
$$\hat\theta = \operatorname*{arg\,max}_{\theta \in \Delta_X} \ \frac{1}{2}\|y\|^2 - \frac{\lambda^2}{2}\Big\|\frac{y}{\lambda} - \theta\Big\|^2 \,. \tag{72}$$
In this case, assuming that θ̂ belongs to $\mathrm{Span}(r^{(t)}, \ldots, r^{(t-K+1)})$, we can reformulate Problem (72) as a K-dimensional quadratic program, and directly optimize over the extrapolation coefficients. If we write $R = [r^{(t)} | \ldots | r^{(t-K+1)}] \in \mathbb{R}^{n \times K}$ and assume that $\hat\theta = R\hat c$, then Problem (72) is equivalent to:
$$\hat c = \operatorname*{arg\,min}_{c \in \mathbb{R}^K} \ \frac{1}{2}\Big\|\frac{y}{\lambda} - Rc\Big\|^2 \quad \text{subject to} \quad -\mathbf{1}_p \leq X^\top R c \leq \mathbf{1}_p$$
$$\phantom{\hat c} = \operatorname*{arg\,min}_{c \in \mathbb{R}^K} \ \frac{\lambda}{2} c^\top (R^\top R) c - (R^\top y)^\top c \quad \text{subject to} \quad A c \leq \mathbf{1}_{2p} \,, \tag{73}$$
where $A = [X^\top R; -X^\top R] \in \mathbb{R}^{2p \times K}$. Problem (73) can be solved straightforwardly with solvers such as CVXPY (Diamond and Boyd, 2016), which we use in Figure 3.

[Figure 3 — duality gaps P(β^(t)) − D(θ^(t)_res), P(β^(t)) − D(θ^(t)_acc), P(β^(t)) − D(θ^(t)_QP) and P(β^(t)) − P(β̂), plotted against CD epoch t.]
Figure 3: Duality gaps evaluated with rescaled residuals (yellow), our proposed dual extrapolation (blue), QP approach (purple) and optimal dual point (green), on the leukemia dataset, with λ = λmax/5 resulting in 23 non-zero coefficients at optimum. The dashed line marks support identification. Peaks occur because we set the duality gap to 0 when a method numerically fails to produce extrapolation coefficients.
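For completeness, here is a minimal sketch (ours, not the code used in the paper) of how the QP reformulation (73) can be set up with CVXPY; R stacks the K most recent residuals column-wise, and the function name is ours.

```python
import cvxpy as cp
import numpy as np

def qp_dual_point(X, y, R, lam):
    """Solve the K-dimensional QP (73): project y / lam onto the dual feasible set,
    restricted to the span of the K stored residuals (the columns of R)."""
    K = R.shape[1]
    c = cp.Variable(K)
    objective = cp.Minimize(0.5 * cp.sum_squares(y / lam - R @ c))
    constraints = [cp.abs(X.T @ (R @ c)) <= 1]   # |x_j^T theta| <= 1 for all j
    cp.Problem(objective, constraints).solve()
    return R @ c.value                            # candidate dual point theta_QP
```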
As visible on Figure 3, the QP approach seems to suffer more from numerical instabilities: at some iterations, CVXPY does not converge, which we represent by setting the dual objective to 0, hence the visible peaks. Although it performs similarly to dual extrapolation at first, the QP dual point eventually performs the same as residuals rescaling. We do not perform an extensive timing study of the compared approaches, but we have observed that the 2p constraints of Problem (73) make it orders of magnitude slower to solve. In practice, we therefore had to limit the experiment to the rather small leukemia dataset to get reasonable running times. Finally, the QP approach does not lead to simple optimization problems for sparse logistic regression and the Multitask Lasso.

6.3 Improved screening and working set policy

In order to have a stopping criterion scaling with n, the solvers are stopped when the duality gap goes below ϵ·F(0_n). Features are normalized to have unit norm, and for sparse datasets, features with strictly fewer than 4 non-zero entries are removed.

6.3.1 Lasso

Path computation. For a fine (resp. coarse) grid of 100 (resp. 10) values of λ geometrically distributed between λmax and λmax/100, the competing algorithms solve the Lasso on various real world datasets from LIBSVM⁵ (Fan et al., 2008). Warm start is used for all algorithms: except for the first value of λ, the algorithms are initialized with the solution obtained for the previous value of λ on the path. Note that paths are computed following a decreasing sequence of λ (from high values to low). Computing Lasso solutions for various values of λ is a classical task, in cross-validation for example. The values we choose for the grid are the default ones in scikit-learn and GLMNET.

For Gap Safe Rules (GSR), we use the strong warm start variant which was shown by Ndiaye et al. (2017, Section 4.6.4) to have the best performance. We refer to GSR + extr. when, on top of this, our proposed dual extrapolation technique is used to create the dual points for screening. To evaluate separately the performance of working sets and extrapolation, we also implement Celer w/o extr., i.e., Algorithm 2 without using the extrapolated dual point. Doing this, GSR can be compared to GSR + extrapolation, and Celer without extrapolation to Celer. Finally, we also add the performance of Blitz⁶ (Johnson and Guestrin, 2018) and StingyCD⁷ (Johnson and Guestrin, 2017), the latter being a Lasso-specific coordinate descent designed to skip zero-to-zero updates. Note that dual extrapolation could easily be combined with the update policy of StingyCD. For fair comparison, all algorithms use the duality gap as a stopping criterion.

On Figures 4 to 6, one can see that using acceleration systematically improves the performance of Gap Safe rules, up to a factor 3. Similarly, dual extrapolation makes Celer more efficient than a WS approach without extrapolation (Blitz or Celer w/o extr.). This improvement is more visible for low values of the stopping criterion ϵ, as dual extrapolation is beneficial once the support is identified. Generally, working set approaches tend to perform better on coarse grids, while screening is beneficial on fine grids, a finding corroborating the Lasso experiments in Ndiaye et al. (2017, Sec. 6.1). Indeed, on a fine grid, the value of the regularizer λ changes slowly and each solution on the grid is close to the previous one.
In this case, when warm start is used, the initialization (the approximate solution for the previous value of the regularizer) is close to the solution for the new value of the regularizer, and the duality gap always remains low, allowing to quickly screen features. On the contrary, if the grid is coarse, each problem on the grid is quite different from the previous one. Warm start here provides a less useful initialization, as the duality gap is higher for the early iterations of each problem. This results in a reduced efficiency of screening.

5. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
6. https://github.com/tbjohns/BlitzL1
7. https://github.com/tbjohns/StingyCD

[Figure 4 — bar plots of time / time GSR for GSR, GSR + extr., Blitz, StingyCD, Celer w/o extr. and Celer, at ϵ = 10⁻⁴, 10⁻⁶, 10⁻⁸; GSR reference times: 0.56, 0.94 and 1.28 s (coarse grid), 0.86, 2.97 and 5.04 s (fine grid).]
Figure 4: Time to compute a Lasso path from λmax to λmax/100 on the leukemia dataset (left: coarse grid of 10 values, right: fine grid of 100 values). λmax/100 gives a solution with 60 nonzero coefficients.

[Figure 5 — same layout; GSR reference times: 35.24, 120.20 and 152.97 s (coarse grid), 48.51, 346.92 and 782.29 s (fine grid).]
Figure 5: Time to compute a Lasso path from λmax to λmax/100 on the news20 dataset (left: coarse grid of 10 values, right: fine grid of 100 values). λmax/100 gives a solution with 14,817 nonzero coefficients.

[Figure 6 — same layout; GSR reference times: 5.38, 12.00 and 16.59 s (coarse grid), 11.77, 36.07 and 137.33 s (fine grid).]
Figure 6: Time to compute a Lasso path from λmax to λmax/100 on the rcv1 dataset (left: coarse grid of 10 values, right: fine grid of 100 values). λmax/100 gives a solution with 4,610 nonzero coefficients.

Single λ. The performance observed in the previous paragraph is not only due to the sequential setting: in the experiment of Table 2, we solve the Lasso for a single value of λ = λmax/20. The duality gap stopping criterion ϵ varies between 10⁻² and 10⁻⁶. Celer is orders of magnitude faster than scikit-learn, which uses vanilla CD. The working set approach of Blitz is also outperformed, especially for low ϵ values.

Table 2: Computation time (in seconds) for Celer, Blitz and scikit-learn to reach a given precision ϵ on the Finance dataset with λ = λmax/20 (without warm start: β^(0) = 0_p).

| ϵ | 10⁻² | 10⁻³ | 10⁻⁴ | 10⁻⁶ |
|---|---|---|---|---|
| Celer | 5 | 7 | 8 | 10 |
| Blitz | 25 | 26 | 27 | 30 |
| scikit-learn | 470 | 1,350 | 2,390 | - |

6.3.2 Logistic regression

In this section, we evaluate separately the first order solvers (Gap Safe, Gap Safe with extrapolation, Celer with coordinate descent as inner solver) and the Prox-Newton solvers: Blitz, Newton-Celer with working set but without dual extrapolation (PN WS), and Newton-Celer.

[Figure 7 — bar plots of time / time GSR for GSR, GSR + extr., Celer w/o extr. and Celer, at ϵ = 10⁻⁴, 10⁻⁶, 10⁻⁸; GSR reference times: 85, 245 and 356 s (coarse grid), 72, 404 and 1246 s (fine grid).]
Figure 7: Time to compute a Logistic regression path from λmax to λmax/100 on the news20 dataset (left: coarse grid of 10 values, right: fine grid of 100 values). λmax/100 gives 5319 non-zero coefficients.

Figure 7 shows that when cyclic coordinate descent is used, extrapolation improves the performance of screening rules, and that using a dual-based working set policy further reduces the computational burden.
Figure 8 shows the limitation of dual extrapolation when second-order information is taken into account with a Prox-Newton solver: because the Prox-Newton iterations do not create a VAR sequence, it is necessary to perform some passes of coordinate descent to create θacc, as detailed in Section 5.2. This particular experiment reveals that this additional time unfortunately offsets the gains obtained from better working sets and a tighter stopping criterion.

Figure 8: Time to solve a Logistic regression problem for different values of λ on the rcv1 dataset (ϵ = 10⁻⁶). Solver times are reported as a percentage of the Blitz time, equal to 1.0 / 2.0 / 3.4 / 13.1 / 18.5 seconds for λmax/λ = 2 / 5 / 10 / 50 / 100; the compared solvers are Blitz, PN WS and Newton-Celer, with the extrapolation time shown separately.

6.3.3 Multitask Lasso

The data for this experiment consist of magnetoencephalography (MEG) recordings, collected for neuroscience studies. Here we use the sample dataset of the MNE software (Gramfort et al., 2014); the data were obtained using auditory stimulation. There are n = 305 sensors, p = 7,498 source locations in the brain, and the measurements are time series of length q = 49. Using a Multitask Lasso formulation allows one to reconstruct brain activity exhibiting a sparsity pattern that is stable across time (Gramfort et al., 2012). The inner solver for Celer is block coordinate descent, which is also used for the Gap Safe solver (Ndiaye et al., 2015).
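To illustrate the kind of update this inner solver performs, the sketch below runs one epoch of block coordinate descent for a Multitask Lasso objective of the form (1/2)‖Y − XB‖_F² + λ Σ_j ‖B_j‖_2 with unit-norm features, each row B_j being one block. This is a plain NumPy sketch under that assumed scaling, not the compiled implementation benchmarked here, and the paper's exact objective and notation may differ.

```python
import numpy as np

def bcd_epoch_multitask(X, Y, B, lam):
    """One pass of block coordinate descent for the Multitask Lasso
    (rows of B are the blocks), assuming the columns of X have unit norm."""
    R = Y - X @ B                      # residual, shape (n, q)
    for j in range(X.shape[1]):
        old_Bj = B[j].copy()
        z = old_Bj + X[:, j] @ R       # gradient step on block j
        norm_z = np.linalg.norm(z)
        # Block soft-thresholding: the whole row is shrunk towards zero,
        # and set exactly to zero when ||z|| <= lam.
        if norm_z <= lam:
            B[j] = 0.0
        else:
            B[j] = (1.0 - lam / norm_z) * z
        # Keep the residual consistent with the new value of the block.
        R += np.outer(X[:, j], old_Bj - B[j])
    return B, R

# Example usage with placeholder data:
# X = np.random.randn(30, 50); X /= np.linalg.norm(X, axis=0)
# Y = np.random.randn(30, 7)
# B, R = bcd_epoch_multitask(X, Y, np.zeros((50, 7)), lam=1.0)
```

A complete solver would repeat such epochs, periodically recompute the duality gap for the stopping criterion, and combine the updates with screening or a working set policy.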
Figure 9: Time to compute a Multitask Lasso path from λmax to λmax/100 on MEG data (left: coarse grid of 10 values, right: fine grid of 100 values). λmax/100 gives 254 non-zero rows for B̂. GSR times (in seconds): 20 / 36 / 48 (coarse grid) and 11 / 52 / 77 (fine grid) for ϵ = 10⁻⁴ / 10⁻⁶ / 10⁻⁸.

While Figure 2c showed that dual extrapolation also yields an improved duality gap for the Multitask Lasso, Figure 9 shows that the working set policy of Celer performs better than Gap Safe rules with strong active warm start. We could not include Blitz in this benchmark, as there is no standard public implementation for this problem.

7. Conclusion

In this work, we generalize the dual extrapolation procedure for the Lasso (Celer) to a wider class of ℓ1-penalized problems, in particular sparse Logistic regression. Theoretical guarantees based on sign identification of coordinate descent are provided. Experiments show that dual extrapolation yields more efficient Gap Safe screening rules and working set solvers. Finally, we adapt Celer to make it compatible with prox-Newton solvers, and we empirically demonstrate its applicability to the Multitask Lasso, for which we leave the proof to future work.

Acknowledgments

This work was funded by the ERC Starting Grant SLAB ERC-StG-676943, by the Chair Machine Learning for Big Data at Télécom ParisTech, and by the ANR grant GraVa ANR-18-CE40-0005.

References

K. Arrow, L. Hurwicz, and H. Uzawa. Studies in Nonlinear Programming. Stanford University Press, 1958.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Convex optimization with sparsity-inducing norms. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. Springer, New York, 2011.

S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science & Engineering, 13(2):31–39, 2011.

A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

A. Boisbunon, R. Flamary, and A. Rakotomamonjy. Active set strategy for high-dimensional non-convex sparse optimization problems. In ICASSP, pages 1517–1521, 2014.

A. Bonnefoy, V. Emiya, L. Ralaivola, and R. Gribonval. A dynamic screening principle for the lasso. In EUSIPCO, 2014.

S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.

L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Mueller, O. Grisel, V. Niculae, P. Prettenhofer, A. Gramfort, J. Grobler, R. Layton, J. Vanderplas, A. Joly, B. Holt, and G. Varoquaux. API design for machine learning software: experiences from the scikit-learn project. arXiv e-prints, 2013.

E. Candès and B. Recht. Simple bounds for recovering low-complexity models. Mathematical Programming, 141(1-2):577–589, 2013.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis., 40(1):120–145, 2011.

S. S. Chen and D. L. Donoho. Atomic decomposition by basis pursuit. In SPIE, 1995.

S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. J. Mach. Learn. Res., 17(83):1–5, 2016.

C. Dünner, S. Forte, M. Takáč, and M. Jaggi. Primal-dual rates and certificates. In ICML, pages 783–792, 2016.

L. El Ghaoui, V. Viallon, and T. Rabbani. Safe feature elimination in sparse supervised learning. J. Pacific Optim., 8(4):667–698, 2012.

R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res., 9:1871–1874, 2008.

J. Fan and J. Lv. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Ser. B Stat. Methodol., 70(5):849–911, 2008.

O. Fercoq and P. Bianchi. A coordinate descent primal-dual algorithm with large step size and possibly non separable functions. arXiv preprint arXiv:1508.04625, 2015.

O. Fercoq and P. Richtárik. Accelerated, parallel and proximal coordinate descent. SIAM J. Optim., 25(3):1997–2013, 2015.

O. Fercoq, A. Gramfort, and J. Salmon. Mind the duality gap: safer rules for the lasso. In ICML, pages 333–342, 2015.

J. Friedman, T. J. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2):302–332, 2007.

J. Friedman, T. J. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw., 33(1):1, 2010.

J.-J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Trans. Inf. Theory, 50(6):1341–1344, 2004.

A. Gramfort, M. Kowalski, and M. Hämäläinen. Mixed-norm estimates for the M/EEG inverse problem using accelerated gradient methods. Phys. Med. Biol., 57(7):1937–1961, 2012.

A. Gramfort, M. Luessi, E. Larson, D. A. Engemann, D. Strohmeier, C. Brodbeck, L. Parkkonen, and M. S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86:446–460, 2014.

E. Hale, W. Yin, and Y. Zhang. Fixed-point continuation for ℓ1-minimization: Methodology and convergence. SIAM J. Optim., 19(3):1107–1130, 2008.

W. L. Hare and A. S. Lewis. Identifying active manifolds. Algorithmic Operations Research, 2(2):75–75, 2007.

J.-B. Hiriart-Urruty and C. Lemaréchal. Convex analysis and minimization algorithms. II, volume 306. Springer-Verlag, Berlin, 1993.

C.-J. Hsieh, M. Sustik, I. Dhillon, and P. Ravikumar. QUIC: Quadratic approximation for sparse inverse covariance estimation. J. Mach. Learn. Res., 15:2911–2947, 2014.
T. B. Johnson and C. Guestrin. Blitz: A principled meta-algorithm for scaling sparse optimization. In ICML, pages 1171–1179, 2015.

T. B. Johnson and C. Guestrin. StingyCD: Safely avoiding wasteful updates in coordinate descent. In ICML, pages 1752–1760, 2017.

T. B. Johnson and C. Guestrin. A fast, principled working set algorithm for exploiting piecewise linear structure in convex problems. arXiv preprint arXiv:1807.08046, 2018.

P. Karimireddy, A. Koloskova, S. Stich, and M. Jaggi. Efficient greedy coordinate descent for composite problems. arXiv preprint arXiv:1810.06999, 2018.

K. Koh, S.-J. Kim, and S. Boyd. An interior-point method for large-scale ℓ1-regularized logistic regression. J. Mach. Learn. Res., 8(8):1519–1555, 2007.

M. Kowalski, P. Weiss, A. Gramfort, and S. Anthoine. Accelerating ISTA with an active set strategy. In OPT 2011: 4th International Workshop on Optimization for Machine Learning, page 7, 2011.

S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVM-based Python JIT compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pages 1–6. ACM, 2015.

J. Lee, Y. Sun, and M. Saunders. Proximal Newton-type methods for convex optimization. In NIPS, pages 827–835, 2012.

J. Mairal. Sparse coding for machine learning, image processing and computer vision. PhD thesis, École normale supérieure de Cachan, 2010.

M. Massias, A. Gramfort, and J. Salmon. From safe screening rules to working sets for faster lasso-type solvers. In 10th NIPS Workshop on Optimization for Machine Learning, 2017.

M. Massias, A. Gramfort, and J. Salmon. Celer: a fast solver for the Lasso with dual extrapolation. In ICML, 2018.

P. McCullagh and J. A. Nelder. Generalized Linear Models, Second Edition. Chapman and Hall/CRC Monographs on Statistics and Applied Probability Series. 1989.

D. Myers and W. Shih. A constraint selection technique for a class of linear programs. Operations Research Letters, 7(4):191–195, 1988.

E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparse multi-task and multi-class models. In NIPS, pages 811–819, 2015.

E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. GAP safe screening rules for sparse-group lasso. In NIPS, 2016.

E. Ndiaye, O. Fercoq, A. Gramfort, and J. Salmon. Gap safe screening rules for sparsity enforcing penalties. J. Mach. Learn. Res., 18(128):1–33, 2017.

G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2):231–252, 2010.

K. Ogawa, Y. Suzuki, and I. Takeuchi. Safe screening of non-support vectors in pathwise SVM computation. In ICML, pages 1382–1390, 2013.

F. Palacios-Gomez, L. Lasdon, and M. Engquist. Nonlinear optimization by successive linear programming. Management Science, 28(10):1106–1120, 1982.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830, 2011.

D. Perekrestenko, V. Cevher, and M. Jaggi. Faster coordinate descent via adaptive importance sampling. In AISTATS, pages 869–877, 2017.

P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Mathematical Programming, 144(1-2):1–38, 2014.
V. Roth and B. Fischer. The group-lasso for generalized linear models: uniqueness of solutions and efficient algorithms. In ICML, pages 848–855, 2008.

M. De Santis, S. Lucidi, and F. Rinaldi. A fast active set block coordinate descent algorithm for ℓ1-regularized least squares. SIAM J. Optim., 26(1):781–809, 2016.

K. Scheinberg and X. Tang. Complexity of inexact proximal Newton methods. arXiv preprint arXiv:1311.6547, 2013.

D. Scieur. Acceleration in Optimization. PhD thesis, École normale supérieure, 2018.

D. Scieur, A. d'Aspremont, and F. Bach. Regularized nonlinear acceleration. In NIPS, pages 712–720, 2016.

N. Simon, J. Friedman, T. J. Hastie, and R. Tibshirani. A sparse-group lasso. J. Comput. Graph. Statist., 22(2):231–245, 2013. ISSN 1061-8600.

G. Thompson, F. Tonge, and S. Zionts. Techniques for removing nonbinding constraints and extraneous variables from linear programming problems. Management Science, 12(7):588–608, 1966.

R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol., 58(1):267–288, 1996.

R. Tibshirani, J. Bien, J. Friedman, T. J. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. J. R. Stat. Soc. Ser. B Stat. Methodol., 74(2):245–266, 2012.

R. J. Tibshirani. The lasso problem and uniqueness. Electron. J. Stat., 7:1456–1490, 2013.

R. J. Tibshirani. Dykstra's algorithm, ADMM, and coordinate descent: Connections, insights, and extensions. In NIPS, pages 517–528, 2017.

P. Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. J. Optim. Theory Appl., 109(3):475–494, 2001.

S. Vaiter, M. Golbabaee, J. Fadili, and G. Peyré. Model selection with low complexity priors. Information and Inference: A Journal of the IMA, 4(3):230–287, 2015.

S. Vaiter, G. Peyré, and J. M. Fadili. Model consistency of partly smooth regularizers. IEEE Trans. Inf. Theory, 64(3):1725–1737, 2018.

J. Wang, P. Wonka, and J. Ye. Lasso screening rules via dual polytope projection. arXiv preprint arXiv:1211.3966, 2012.

Z. J. Xiang, Y. Wang, and P. J. Ramadge. Screening tests for lasso problems. IEEE Trans. Pattern Anal. Mach. Intell., PP(99), 2016.

G. Yuan, C.-H. Ho, and C.-J. Lin. An improved GLMNET for ℓ1-regularized logistic regression. J. Mach. Learn. Res., 13:1999–2030, 2012.

M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(1):49–67, 2006.