# Curvature-Exploiting Acceleration of Elastic Net Computations

Vien V. Mai¹  Mikael Johansson¹

¹Department of Automatic Control, School of Electrical Engineering and Computer Science, Royal Institute of Technology (KTH), Stockholm, Sweden. Correspondence to: V. V. Mai, M. Johansson.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

This paper introduces an efficient second-order method for solving the elastic net problem. Its key innovation is a computationally efficient technique for injecting curvature information into the optimization process which admits a strong theoretical performance guarantee. In particular, we show improved run time over popular first-order methods and quantify the speed-up in terms of statistical measures of the data matrix. The improved time complexity is the result of an extensive exploitation of the problem structure and a careful combination of second-order information, variance reduction techniques, and momentum acceleration. Besides the theoretical speed-up, experimental results demonstrate great practical performance benefits of curvature information, especially for ill-conditioned data sets.

1. Introduction

Lasso, ridge and elastic net regression are fundamental problems in statistics and machine learning, with countless applications in science and engineering (Zou & Hastie, 2005). Elastic net regression amounts to solving the following convex optimization problem

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{2n}\|Ax-b\|_2^2 + \frac{\gamma_2}{2}\|x\|_2^2 + \gamma_1\|x\|_1 \qquad (1)$$

for a given data matrix $A\in\mathbb{R}^{n\times d}$, response vector $b\in\mathbb{R}^n$, and regularization parameters $\gamma_1$ and $\gamma_2$. Setting $\gamma_1 = 0$ results in ridge regression, $\gamma_2 = 0$ yields the lasso, and letting $\gamma_1 = \gamma_2 = 0$ reduces the problem to classical least squares. The lasso promotes sparsity of the optimal solution, which sometimes helps to improve interpretability of the results. Adding the additional $\ell_2$-regularizer helps to improve the performance when features are highly correlated (Tibshirani et al., 2015; Zou & Hastie, 2005).

The convergence rates of iterative methods for solving (1) are typically governed by the condition number of the Hessian matrix of the ridge loss, $C+\gamma_2 I$, where $C = \frac{1}{n}A^\top A$ is the sample correlation matrix. Real-world data sets often have a few dominant features, while the other features are highly correlated with the stronger ones (Gonen et al., 2016; Tibshirani et al., 2015). This translates to a rapidly decaying spectrum of $C$. In this paper, we demonstrate how this property can be exploited to reduce the effect of ill-conditioning and to design faster algorithms for solving the elastic net regression problem (1).

1.1. Related work

Over the past few years, there has been great attention to developing efficient optimization algorithms for minimizing composite objective functions

$$\min_{x\in\mathbb{R}^d} F(x) := f(x) + h(x), \qquad (2)$$

where $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ is a finite sum of smooth and convex component functions $f_i(x)$, and $h(x)$ is a possibly non-smooth convex regularizer. In machine learning applications, the function $f$ typically models the empirical data loss and the regularizer $h$ is used to promote desired properties of a solution. For example, the elastic net objective fits this form with

$$f_i(x) = \frac{1}{2}\left(a_i^\top x - b_i\right)^2 + \frac{\gamma_2}{2}\|x\|_2^2, \qquad h(x) = \gamma_1\|x\|_1.$$
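For concreteness, the following Python sketch (our illustration, not code from the paper) spells out this composite form for the elastic net: the smooth finite-sum part $f$, the $\ell_1$ term $h$, and the proximal operator of $h$ that first-order proximal methods rely on.

```python
import numpy as np

def f_i(x, a_i, b_i, gamma2):
    """One smooth component: f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*gamma2*||x||^2."""
    return 0.5 * (a_i @ x - b_i) ** 2 + 0.5 * gamma2 * (x @ x)

def f(x, A, b, gamma2):
    """Smooth part: f(x) = (1/n) sum_i f_i(x) = (1/(2n))*||Ax - b||^2 + (gamma2/2)*||x||^2."""
    n = A.shape[0]
    r = A @ x - b
    return 0.5 * (r @ r) / n + 0.5 * gamma2 * (x @ x)

def h(x, gamma1):
    """Non-smooth part: h(x) = gamma1 * ||x||_1."""
    return gamma1 * np.abs(x).sum()

def prox_h(y, t, gamma1):
    """prox_{t*h}(y): coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - t * gamma1, 0.0)
```

The soft-thresholding operator above is reused in the later sketches as the proximal map of the $\ell_1$ term.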
1.1.1. FIRST-ORDER METHODS

Standard deterministic first-order methods for solving (2), such as proximal gradient descent, enjoy linear convergence for strongly convex objective functions and are able to find an $\epsilon$-approximate solution in time $O(dn\kappa\log\frac{1}{\epsilon})$, where $\kappa$ is the condition number of $f$. This runtime can be improved to $O(dn\sqrt{\kappa}\log\frac{1}{\epsilon})$ if the method is combined with Nesterov acceleration (Beck & Teboulle, 2009; Nesterov, 2013). However, the main drawback of these methods is that they need to access the whole data set in every iteration, which is too costly in many machine learning tasks.

For large-scale problems, methods based on stochastic gradients have become the standard choice for solving (2). Many linearly convergent proximal methods, such as SAGA (Defazio et al., 2014) and Prox-SVRG (Xiao & Zhang, 2014), have been introduced and shown to outperform standard first-order methods under certain regularity assumptions. These methods improve the time complexity to $O(d(n+\bar{\kappa})\log\frac{1}{\epsilon})$, where $\bar{\kappa}$ is a condition number satisfying $\bar{\kappa}\le\kappa$. When the component functions do not vary substantially in smoothness, $\bar{\kappa}\approx\kappa$, and this complexity is far better than those of the deterministic methods above. By exploiting Nesterov momentum in different ways (see, e.g., (Allen-Zhu, 2016; Defazio, 2016; Frostig et al., 2015; Lin et al., 2015)), one can improve the complexity to $O(d(n+\sqrt{n\bar{\kappa}})\log\frac{1}{\epsilon})$, which is also optimal for this class of problems (Woodworth & Srebro, 2016).

1.1.2. SECOND-ORDER METHODS

Second-order methods are known to have superior performance compared to their first-order counterparts both in theory and practice, especially when the problem at hand is highly nonlinear and/or ill-conditioned. However, such methods often have a very high computational cost per iteration. Recently, there has been an intense effort to develop algorithms which use second-order information with a more reasonable computational burden (see, e.g., (Agarwal et al., 2017; Byrd et al., 2016; Erdogdu & Montanari, 2015; Moritz et al., 2016; Roosta-Khorasani & Mahoney, 2016; Xiao & Zhang, 2014; Xu et al., 2016) and references therein). These methods use techniques such as random sketching, matrix sampling, and iterative estimation to construct an approximate Hessian matrix. Local and global convergence guarantees have been derived under various assumptions. Although many experimental results have shown excellent performance of these methods on many machine learning tasks, current second-order methods for finite-sum optimization tend to have much higher time complexities than their first-order counterparts (see (Xu et al., 2016) for a detailed comparison).

Apart from having high time complexities, none of the methods cited above have any guarantees in the composite setting, since their analyses hinge on differentiability of the objective function. Instead, one has to rely on methods that build on proximal Newton updates (see, e.g., (Ghanbari & Scheinberg, 2016; Lee et al., 2012; Liu et al., 2017; Rodomanov & Kropotov, 2016)). However, these methods still inherit the high update and storage costs of conventional second-order methods, or require elaborate tuning of several parameters and stopping criteria depending on a phase transition which occurs in the algorithm.
1.1.3. RIDGE REGRESSION

For the smooth ridge regression problem, the authors in (Gonen et al., 2016) have developed a preconditioning method based on linear sketching which, when coupled with SVRG, yields a significant speed-up over stochastic first-order methods. This is a rare second-order method that has a comparable or even better time complexity than stochastic first-order methods. More precisely, it has a guaranteed running time of $O(d(n+\kappa_H)\log\frac{1}{\epsilon})$, where $\kappa_H$ is a new condition number that can be dramatically smaller than $\kappa$, especially when the spectrum of $C$ decays rapidly. When $d \ll n$, the authors in (Wang & Zhang, 2017) combine sub-sampled Newton methods with mini-batch SVRG to obtain some further improvements.

1.2. Contributions

Recently, the work (Arjevani & Shamir, 2017) shows that under some mild algorithmic assumptions, and if the dimension is sufficiently large, the iteration complexity of second-order methods for smooth finite-sum problems composed of quadratics is no better than that of first-order methods. It is therefore natural to ask whether one can develop a second-order method for solving the elastic net problem which has improved practical performance but still enjoys a strong worst-case time complexity, as the stochastic first-order methods do. It should be emphasized that, due to the non-smooth objective, achieving this goal is much more challenging than for ridge regression. The preconditioning approach in (Gonen et al., 2016) is not applicable, and the current theoretical results for second-order methods are not likely to offer the desired running time.

In this paper, we provide a positive answer to this question. Our main contribution is the design and analysis of a simple second-order method for the elastic net problem which has a strong theoretical time complexity and superior practical performance. The convergence bound adapts to the problem structure and is governed by the spectrum and a statistical measure of the data matrix. These quantities often yield significantly stronger time complexity guarantees for practical data sets than those of stochastic first-order methods (see Table 1). To achieve this, we first leverage recent advances in randomized low-rank approximation to generate a simple, one-shot approximation of the Hessian matrix. We then exploit the composite and finite-sum structure of the problem to develop a variant of the Prox SVRG method that builds upon Nesterov's momentum acceleration and inexact computations of scaled proximal operators, which may be of independent interest. We provide a simple convergence proof based on an explicit Lyapunov function, thus avoiding the use of sophisticated stochastic estimate sequences.

Table 1. Summary of different algorithms solving the elastic net problem. Here, $\kappa$ and $\bar{\kappa}$ are conventional condition numbers satisfying $\bar{\kappa}\le\kappa$, while $\kappa_H$ is a new condition number defined w.r.t. the $H$-norm. When $H$ is an approximate Hessian of the ridge loss, $\kappa_H$ is often much smaller than $\kappa$, especially on practical data sets.

Algorithm | Time complexity | 2nd-order
PGD | $O(dn\kappa\log\frac{1}{\epsilon})$ | No
FISTA | $O(dn\sqrt{\kappa}\log\frac{1}{\epsilon})$ | No
Prox SVRG | $O(d(n+\bar{\kappa})\log\frac{1}{\epsilon})$ | No
Katyusha | $O(d(n+\sqrt{n\bar{\kappa}})\log\frac{1}{\epsilon})$ | No
Ours | $O(d(n+\kappa_H)\log\frac{1}{\epsilon})$ | Yes

2. Preliminaries and Notation

Vectors are indicated by bold lower-case letters, and matrices are denoted by bold upper-case letters. We denote the dot product between $x$ and $y$ by $\langle x,y\rangle = x^\top y$, and the Euclidean norm of $x$ by $\|x\|_2 = \sqrt{\langle x,x\rangle}$.
For a symmetric positive definite matrix $H$, $\langle x,y\rangle_H = x^\top H y$ is the $H$-inner product of two vectors $x$ and $y$, and $\|x\|_H = \sqrt{\langle x,x\rangle_H}$ is the Mahalanobis norm of $x$. We denote by $\lambda_i(A)$ the $i$-th largest eigenvalue of $A$. Finally, $\lambda_i$ denotes the $i$-th largest eigenvalue of the correlation matrix $C$.

In the paper, we shall frequently use the notions of strong convexity and smoothness in the $H$-norm, introduced in the next two assumptions.

Assumption 1. The function $h(x)$ is lower semi-continuous and convex, and $\mathrm{dom}\, h := \{x\in\mathbb{R}^d \mid h(x) < +\infty\}$ is closed. Each function $f_i$ is $L_i$-smooth w.r.t. the $H$-norm, i.e., there exists a positive constant $L_i$ such that
$$\|\nabla f_i(x) - \nabla f_i(y)\|_{H^{-1}} \le L_i\|x-y\|_H, \quad \forall x,y\in\mathbb{R}^d.$$

Assumption 1 implies that $\nabla f$ is $L$-Lipschitz: $\|\nabla f(x) - \nabla f(y)\|_{H^{-1}} \le L\|x-y\|_H$ for some $L \le L_{\mathrm{avg}} := \frac{1}{n}\sum_{i=1}^n L_i$. As a consequence, we have the following bound:
$$f(y) \le f(x) + \langle\nabla f(x),\, y-x\rangle + \frac{L_{\mathrm{avg}}}{2}\|y-x\|_H^2.$$

Assumption 2. The function $f(x)$ is $\mu$-strongly convex w.r.t. the $H$-norm, i.e., there exists a positive constant $\mu$ such that
$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y) - \frac{\mu}{2}\lambda(1-\lambda)\|x-y\|_H^2$$
holds for all $x,y\in\mathbb{R}^d$ and $\lambda\in[0,1]$.

Assumption 2 is equivalent to the requirement that
$$f(y) \ge f(x) + \langle\nabla f(x),\, y-x\rangle + \frac{\mu}{2}\|y-x\|_H^2$$
holds for all $x,y\in\mathbb{R}^d$. We will use both of these definitions of strong convexity in our proofs.

At the core of our method is the concept of scaled proximal mappings, defined as follows:

Definition 1 (Scaled Proximal Mapping). For a convex function $h$ and a symmetric positive definite matrix $H$, the scaled proximal mapping of $h$ at $y$ is
$$\mathrm{prox}_h^H(y) = \operatorname*{argmin}_{x\in\mathbb{R}^d}\left\{ h(x) + \frac{1}{2}\|x-y\|_H^2 \right\}. \qquad (3)$$

Scaled proximal mappings generalize the conventional ones:
$$\mathrm{prox}_h(y) = \operatorname*{argmin}_{x\in\mathbb{R}^d}\left\{ h(x) + \frac{1}{2}\|x-y\|_2^2 \right\}. \qquad (4)$$

However, while many conventional prox-mappings admit analytical solutions, this is almost never the case for scaled proximal mappings. This makes it hard to extend efficient first-order proximal methods to second-order ones. Fortunately, scaled proximal mappings do share some key properties with the conventional ones. We collect a few of them in the following result:

Property 1 ((Lee et al., 2012)). The following properties hold:
1. $\mathrm{prox}_h^H(x)$ exists and is unique for $x\in\mathrm{dom}\, h$.
2. Let $\partial h(x)$ be the subdifferential of $h$ at $x$; then $H\left(x - \mathrm{prox}_h^H(x)\right) \in \partial h\left(\mathrm{prox}_h^H(x)\right)$.
3. $\mathrm{prox}_h^H(\cdot)$ is non-expansive in the $H$-norm:
$$\left\|\mathrm{prox}_h^H(x) - \mathrm{prox}_h^H(y)\right\|_H \le \|x-y\|_H, \quad \forall x,y\in\mathrm{dom}\, h.$$

Finally, in our algorithm, it will be enough to solve (3) approximately in the following sense:

Definition 2 (Inexact subproblem solutions). We say that $x^+\in\mathbb{R}^d$ is an $\epsilon$-optimal solution to (3) if
$$h(x^+) + \frac{1}{2\eta}\|x^+ - y\|_H^2 \le \min_{x\in\mathbb{R}^d}\left\{ h(x) + \frac{1}{2\eta}\|x-y\|_H^2 \right\} + \epsilon. \qquad (5)$$
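As a concrete illustration of Definitions 1 and 2 (our sketch, not the authors' code), the scaled proximal mapping of the $\ell_1$ term can be approximated by running plain proximal gradient steps on the strongly convex subproblem $h(x) + \frac{1}{2\eta}\|x-y\|_H^2$; only matrix-vector products with $H$ are needed, and stopping after a fixed number of iterations yields an $\epsilon$-optimal solution in the sense of (5). The function and parameter names are ours.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (coordinate-wise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def scaled_prox_l1(y, apply_H, lam1_H, eta, gamma1, num_iters=50, x0=None):
    """Approximately evaluate prox^H_{eta*h}(y) for h = gamma1*||.||_1, i.e. solve
        min_x  gamma1*||x||_1 + (1/(2*eta)) * ||x - y||_H^2
    by proximal gradient on the smooth quadratic part.

    apply_H : callable computing H @ v (only matrix-vector products are needed)
    lam1_H  : (an upper bound on) the largest eigenvalue of H, so that the
              smooth part has Lipschitz gradient with constant lam1_H / eta
    """
    x = y.copy() if x0 is None else x0.copy()
    step = eta / lam1_H                    # 1 / Lipschitz constant of the smooth part
    for _ in range(num_iters):
        grad = apply_H(x - y) / eta        # gradient of (1/(2*eta))*||x - y||_H^2
        x = soft_threshold(x - step * grad, step * gamma1)
    return x
```

Because the quadratic part is strongly convex, these steps contract the subproblem objective gap linearly at a rate governed by $\lambda_1(H)/\lambda_d(H)$, which is what later permits a fixed, precision-independent iteration budget (Sections 5 and 6).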
3. Building Block 1: Randomized Low-Rank Approximation

The computational cost of many Newton-type methods is dominated by the time required to compute the update direction $d = H^{-1}g$ for some vector $g\in\mathbb{R}^d$ and approximate Hessian $H$. A naive implementation using the SVD would take $O(nd^2)$ flops, which is prohibitive for large-scale data sets. A natural way to reduce this cost is to use a truncated SVD. However, standard deterministic methods such as the power method and the Lanczos method have run times that scale inversely with the gap between the eigenvalues of the input matrix. This gap can be arbitrarily small for practical data sets, thereby preventing us from obtaining the desired time complexity. In contrast, randomized sketching schemes usually admit gap-free run times (Halko et al., 2011). However, unlike other methods, the block Lanczos method, detailed in Algorithm 1, admits both fast run times and strong guarantees on the errors between the true and the computed approximate singular vectors. This property turns out to be critical for deriving bounds on the condition number of the elastic net.

Algorithm 1 Randomized Block Lanczos Method (Musco & Musco, 2015)
Input: Data matrix $A\in\mathbb{R}^{n\times d}$, target rank $r$, target precision $\epsilon'\in(0,1)$
1: Let $q = O(\log d/\sqrt{\epsilon'})$, and draw $\Pi\in\mathbb{R}^{d\times r}$ with i.i.d. $\mathcal{N}(0,1)$ entries
2: Compute $K = \left[A\Pi,\ (AA^\top)A\Pi,\ \dots,\ (AA^\top)^q A\Pi\right]$
3: Orthonormalize the columns of $K$ to obtain $Q$
4: Compute the truncated rank-$r$ SVD of $Q^\top A$ as $W_r\Sigma_r V_r^\top$
5: Compute $U_r = QW_r$
Output: $U_r$, $\Sigma_r$, $V_r$

Proposition 1 ((Musco & Musco, 2015)). Assume that $U_r$, $\Sigma_r$, and $V_r$ are matrices generated by Algorithm 1. Let $A_r = U_r\Sigma_r V_r^\top = \sum_{i=1}^r \sigma_i u_i v_i^\top$ and let $A = \sum_{i=1}^d \bar{\sigma}_i\bar{u}_i\bar{v}_i^\top$ be the SVD of $A$. Then, the following bounds hold with probability at least 9/10:
$$\|A - A_r\|_2 \le (1+\epsilon')\,\bar{\sigma}_{r+1},$$
$$\left| u_i^\top A A^\top u_i - \bar{u}_i^\top A A^\top \bar{u}_i \right| \le \epsilon'\,\bar{\sigma}_{r+1}^2, \quad i\in\{1,\dots,r\}.$$
The total running time is $O\!\left(\frac{ndr\log d}{\sqrt{\epsilon'}}\right)$.

Note that we only run Algorithm 1 once and that $\epsilon' = 1/2$ is sufficient in our work. Thus, the computational cost of this step is negligible, in theory and in practice.

3.1. Approximating the Hessian

In this work, we consider the following approximate Hessian matrix of the ridge loss:
$$H = V_r\left(\Sigma_r^2 + \gamma_2 I\right)V_r^\top + \left(\sigma_r^2 + \gamma_2\right)\left(I - V_rV_r^\top\right), \qquad (6)$$
where $V_r$ and $\Sigma_r = \mathrm{diag}(\sigma_1,\dots,\sigma_r)$ are produced by Algorithm 1 applied to $\frac{1}{\sqrt{n}}A$. Here, the first term is a natural rank-$r$ approximation of the true Hessian, while the second term is used to capture information in the subspace orthogonal to the column space of $V_r$. The inverse of $H$ in (6) admits the explicit expression
$$H^{-1} = V_r\left(\Sigma_r^2 + \gamma_2 I\right)^{-1}V_r^\top + \frac{1}{\sigma_r^2 + \gamma_2}\left(I - V_rV_r^\top\right),$$
so the evaluation of $H^{-1}x$ has time complexity $O(rd)$.
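The following sketch (ours, not the authors' code) makes the structure of (6) explicit. For clarity it uses NumPy's exact SVD as a stand-in for Algorithm 1 (at scale, the randomized block Lanczos routine would be used instead), and it applies $H$ and $H^{-1}$ using only the rank-$r$ factors, so each product indeed costs $O(rd)$.

```python
import numpy as np

def build_approx_hessian_factors(A, gamma2, r):
    """Return the factors of H in (6) for C = (1/n) A^T A:
    V_r, the top-r eigenvalues sigma_i^2 + gamma2, and the complement scalar sigma_r^2 + gamma2.
    An exact SVD is used here for clarity; Algorithm 1 would replace it on large problems."""
    n = A.shape[0]
    _, s, Vt = np.linalg.svd(A / np.sqrt(n), full_matrices=False)
    Vr = Vt[:r].T                    # d x r, top right singular vectors of A/sqrt(n)
    top = s[:r] ** 2 + gamma2        # eigenvalues of H on the range of V_r
    tail = s[r - 1] ** 2 + gamma2    # sigma_r^2 + gamma2, used on the orthogonal complement
    return Vr, top, tail

def apply_H(v, Vr, top, tail):
    """Compute H v in O(r d)."""
    c = Vr.T @ v
    return Vr @ (top * c) + tail * (v - Vr @ c)

def apply_H_inverse(v, Vr, top, tail):
    """Compute H^{-1} v in O(r d) using the explicit inverse of (6)."""
    c = Vr.T @ v
    return Vr @ (c / top) + (v - Vr @ c) / tail
```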
3.2. Bounding the Condition Number

We now turn our attention to studying how the approximate Hessian affects the relevant condition number of the elastic net problem. We first introduce a condition number that usually determines the iteration complexity of stochastic first-order methods under non-uniform sampling.

Definition 3. The average condition number of (2) is $\kappa_H := \frac{1}{n\mu}\sum_{i=1}^n L_i$, where $L_i$ and $\mu$ are the constants from Assumptions 1 and 2.

For the elastic net problem (1), the smooth part of the objective is the ridge loss
$$f_i(x) = \frac{1}{2}\left(a_i^\top x - b_i\right)^2 + \frac{\gamma_2}{2}\|x\|_2^2.$$
Since we define smoothness and strong convexity of $f_i(x)$ in the $H$-norm, the relevant constants are
$$L_i = \left\|H^{-1}\left(a_ia_i^\top + \gamma_2 I\right)\right\|_2, \qquad \mu = \lambda_d\!\left(H^{-1/2}(C+\gamma_2 I)H^{-1/2}\right).$$
For comparison, we also define the conventional condition number $\kappa$, which characterizes the smoothness and strong convexity of $f_i(x)$ in the Euclidean norm. In this case $\kappa = \sum_i \bar{L}_i/(n\bar{\mu})$, where $\bar{L}_i = \|a_ia_i^\top + \gamma_2 I\|_2$ and $\bar{\mu} = \lambda_d(C+\gamma_2 I)$. It will become apparent that $\kappa_H$ can be expressed in terms of a statistical measure of the ridge loss and that it may be significantly smaller than $\kappa$.

We start by introducing a statistical measure that has been widely used in the analysis of ridge regression (see, e.g., (Hsu et al., 2012) and the references therein).

Definition 4 (Effective Dimension). For a positive constant $\lambda$, the effective dimension of $C$ is defined as
$$d_\lambda := \sum_{i=1}^d \frac{\lambda_i}{\lambda_i + \lambda}.$$

The effective dimension generalizes the ordinary dimension and satisfies $d_\lambda \le d$, with equality if and only if $\lambda = 0$. It is typical that when $C$ has a rapidly decaying spectrum, most of the $\lambda_i$'s are dominated by $\lambda$, and hence $d_\lambda$ can be much smaller than $d$.

The following lemma bounds the eigenvalues of the matrix $H^{-1/2}(C+\gamma_2 I)H^{-1/2}$, which can be seen as the effective Hessian matrix.

Lemma 1 ((Gonen et al., 2016)). Invoking Algorithm 1 with data matrix $\frac{1}{\sqrt{n}}A$, target rank $r$, and target precision $\epsilon' = 1/2$, it holds with probability at least 9/10 that
$$\lambda_1\!\left(H^{-1/2}(C+\gamma_2 I)H^{-1/2}\right) \le 2, \qquad \lambda_d\!\left(H^{-1/2}(C+\gamma_2 I)H^{-1/2}\right) \ge \frac{17\gamma_2}{19(\lambda_r+\gamma_2)}.$$

Equipped with Lemma 1, we can now connect $\kappa_H$ with $d_\lambda$ using the following result.

Theorem 1. With probability at least 9/10, the following bound holds up to a multiplicative constant:
$$\kappa_H \lesssim \frac{r\lambda_r + \sum_{i>r}\lambda_i}{\gamma_2} + d.$$

Proof. See Appendix B.

Since $\kappa = (\sum_i\lambda_i + d\gamma_2)/\gamma_2$, $\kappa_H$ is reduced by a factor
$$\frac{\sum_{i\le r}\lambda_i + \sum_{i>r}\lambda_i}{r\lambda_r + \sum_{i>r}\lambda_i}$$
compared to $\kappa$. If the spectrum of $C$ decays rapidly, then the terms $\sum_{i>r}\lambda_i$ are negligible and the ratio is approximately $\sum_{i\le r}\lambda_i/(r\lambda_r)$. If the first eigenvalues are much larger than $\lambda_r$, this ratio will be large. For example, for the australian data set (Chang & Lin, 2011), this ratio can be as large as $1.34\times 10^4$ and $1.6\times 10^5$ for $r = 3$ and $r = 4$, respectively. This indicates that it is possible to improve the iteration complexity of stochastic first-order methods if one can capitalize on the notions of strong convexity and smoothness w.r.t. the $H$-norm in the optimization algorithm. Of course, this is only meaningful if there is an efficient way to inject curvature information into the optimization process without significantly increasing the computational cost. In the smooth case, i.e., $\gamma_1 = 0$, this task can be done by a preconditioning step (Gonen et al., 2016). However, this approach is not applicable for the elastic net, and we need to make use of another building block.

4. Building Block 2: Inexact Accelerated Scaled Proximal SVRG

In this section, we introduce an inexact scaled accelerated Prox SVRG algorithm for solving the generic finite-sum minimization problem in (2). We then characterize the convergence rate of the proposed algorithm.

4.1. Description of the Algorithm

To motivate our algorithm, we first recall the Prox SVRG method from (Xiao & Zhang, 2014): for the $s$-th outer iteration with corresponding outer iterate $\tilde{x}_s$, let $x_0 = \tilde{x}_s$ and for $k = 0, 1, \dots, T-1$ do
$$v_k = \frac{\nabla f_{i_k}(x_k) - \nabla f_{i_k}(\tilde{x}_s)}{np_{i_k}} + \nabla f(\tilde{x}_s), \qquad (7)$$
$$x_{k+1} = \mathrm{prox}_{\eta h}\left(x_k - \eta v_k\right), \qquad (8)$$
where $i_k$ is drawn randomly from $\{1,\dots,n\}$ with probability $p_{i_k} = L_{i_k}/(nL_{\mathrm{avg}})$. Since we are provided with an approximate Hessian matrix $H$, it is natural to use the following update:
$$x_{k+1} = \mathrm{prox}^H_{\eta h}\left(x_k - \eta H^{-1}v_k\right), \qquad (9)$$
which can be seen as a proximal Newton step with the full gradient vector replaced by $v_k$. Note that when $h(\cdot)$ is the $\ell_1$-penalty, Prox SVRG can evaluate (8) in time $O(d)$, while evaluating (9) amounts to solving an optimization problem. It is thus critical to keep the number of such evaluations small, which then translates into making sufficient progress at each iteration. A natural way to achieve this goal is to reduce the variance of the noisy gradient $v_k$. This suggests using large mini-batches, i.e., instead of using a single component function $f_{i_k}$, we use multiple ones to form
$$v_k = \frac{1}{b}\sum_{i\in B_k}\frac{\nabla f_i(x_k) - \nabla f_i(\tilde{x}_s)}{np_i} + \nabla f(\tilde{x}_s),$$
where $B_k\subseteq\{1,\dots,n\}$ is a set of indices with cardinality $|B_k| = b$. It is easy to verify that $v_k$ is an unbiased estimate of $\nabla f(x_k)$.
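For reference, here is a small sketch (ours; the routine name and the choice of $p_i$ shown in the comments are illustrative) of this importance-weighted, variance-reduced mini-batch estimator, instantiated for the elastic-net components $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2 + \frac{\gamma_2}{2}\|x\|_2^2$.

```python
import numpy as np

def svrg_minibatch_gradient(A, b, gamma2, x, x_snap, g_snap, probs, batch_idx):
    """Variance-reduced mini-batch gradient estimator
        v = (1/b) * sum_{i in B} [grad f_i(x) - grad f_i(x_snap)] / (n * p_i) + grad f(x_snap)
    for f_i(x) = 0.5*(a_i^T x - b_i)^2 + 0.5*gamma2*||x||^2.

    g_snap    : full gradient of f at the snapshot x_snap
    probs     : sampling probabilities p_i (e.g. p_i proportional to L_i)
    batch_idx : indices of the mini-batch B, sampled i.i.d. according to probs,
                e.g. batch_idx = np.random.choice(n, size=b, p=probs)
    """
    n, bsz = A.shape[0], len(batch_idx)
    A_B = A[batch_idx]                         # rows a_i^T for i in B
    w = 1.0 / (n * probs[batch_idx])           # importance weights 1/(n * p_i)
    diff = x - x_snap
    # grad f_i(x) - grad f_i(x_snap) = a_i a_i^T (x - x_snap) + gamma2 * (x - x_snap)
    corr = A_B.T @ (w * (A_B @ diff)) + gamma2 * w.sum() * diff
    return corr / bsz + g_snap
```

Unbiasedness follows since, for a single index $i\sim p$, $\mathbb{E}\big[\frac{\nabla f_i(x)-\nabla f_i(\tilde{x}_s)}{np_i}\big] = \nabla f(x) - \nabla f(\tilde{x}_s)$.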
Notice that naively increasing the batch size makes the algorithm increasingly similar to its deterministic counterpart, and hence it inherits a high time complexity. This makes it hard to retain the runtime of Prox SVRG in the presence of second-order information.

In the absence of second-order information, and under the assumption that the proximal step is computed exactly, the work (Nitanda, 2014) introduced a method called Acc Prox SVRG that enjoys the same time complexity as Prox SVRG but allows for much larger mini-batch sizes. In fact, it can tolerate a mini-batch of size $O(\sqrt{\kappa})$ thanks to the use of Nesterov momentum. This indicates that an appropriate use of Nesterov momentum in our algorithm could allow for the larger mini-batches required to balance the computational cost of using scaled proximal mappings. The improved iteration complexity of the scaled proximal mappings will then give an overall acceleration in terms of wall-clock time. As discussed in (Allen-Zhu, 2016), the momentum mechanism in Acc Prox SVRG fails to accelerate Prox SVRG unless $\kappa \ge n^2$. In contrast, as we will see, our algorithm will be able to accelerate the convergence also in these scenarios.

In summary, our algorithm is designed to run in an inner-outer fashion as Prox SVRG, with large mini-batch sizes and Nesterov momentum to compensate for the increased computational cost of the subproblems. The overall procedure is summarized in Algorithm 2.

Algorithm 2 Inexact Accelerated Scaled Proximal SVRG
Input: $\tilde{x}_0$, $H$, $\{B_k\}_{k=0}^{T-1}$, $\eta$, $\tau$
1: for $s = 0, 1, \dots, S$ do
2:   $\nabla f(\tilde{x}_s) \leftarrow \frac{1}{n}\sum_{i=1}^n \nabla f_i(\tilde{x}_s)$
3:   $x_0 \leftarrow z_0 \leftarrow \tilde{x}_s$
4:   for $k = 0, 1, \dots, T-1$ do
5:     $y_k \leftarrow \frac{1}{1+\tau}x_k + \frac{\tau}{1+\tau}z_k$
6:     $v_k \leftarrow \nabla f_{B_k}(y_k) - \nabla f_{B_k}(\tilde{x}_s) + \nabla f(\tilde{x}_s)$
7:     $x_{k+1} \approx \mathrm{prox}^H_{\eta h}\left(y_k - \eta H^{-1}v_k\right)$
8:     $g_{k+1} \leftarrow \frac{1}{\eta}\left(y_k - x_{k+1}\right)$
9:     $z_{k+1} \leftarrow z_k + \tau\left(y_k - z_k\right) - \frac{\tau}{\mu}g_{k+1}$
10:   end for
11:   $\tilde{x}_{s+1} \leftarrow x_T$
12: end for
Output: $\tilde{x}_{S+1}$

Here $\nabla f_{B_k}(x) := \frac{1}{b}\sum_{i\in B_k}\frac{1}{np_i}\nabla f_i(x)$, so that step 6 coincides with the mini-batch estimator above.
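The following Python skeleton (our sketch, not the authors' implementation) renders Algorithm 2 in code; the gradient estimator, the $H^{-1}$ application, and the inexact scaled proximal step are the routines sketched earlier, passed in as callables.

```python
def accelerated_scaled_prox_svrg(x0, grad_f, minibatch_grad, apply_Hinv, scaled_prox,
                                 eta, tau, mu, T, S):
    """Skeleton of Algorithm 2 (inexact accelerated scaled proximal SVRG).

    grad_f(x)                : full gradient of the smooth part f        (step 2)
    minibatch_grad(y, xs, g) : variance-reduced estimator v_k            (step 6)
    apply_Hinv(v)            : H^{-1} v
    scaled_prox(u, x_warm)   : eps_k-approximate prox^H_{eta h}(u),
                               warm-started at x_warm                    (step 7)
    """
    x_snap = x0.copy()
    for s in range(S + 1):
        g_snap = grad_f(x_snap)                           # step 2: full gradient at snapshot
        x, z = x_snap.copy(), x_snap.copy()               # step 3
        for k in range(T):
            y = (x + tau * z) / (1.0 + tau)               # step 5: momentum combination
            v = minibatch_grad(y, x_snap, g_snap)         # step 6: variance-reduced gradient
            x_new = scaled_prox(y - eta * apply_Hinv(v), x)  # step 7: inexact scaled prox
            g = (y - x_new) / eta                         # step 8: gradient mapping
            z = z + tau * (y - z) - (tau / mu) * g        # step 9
            x = x_new
        x_snap = x                                        # step 11: new snapshot
    return x_snap
```

With the theory-suggested choices $\eta = 1/L_{\mathrm{avg}}$ and $\tau = \sqrt{\mu/(2L_{\mathrm{avg}})}$ (see Lemma 2 below), the epoch length $T = O(1/\tau)$ corresponds to $O(\sqrt{\kappa_H})$ inner iterations per epoch.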
4.2. Convergence Argument

In this subsection, we will show that as long as the errors in evaluating the scaled proximal mappings are controlled in an appropriate way, the iterates generated by the outer loop of Algorithm 2 converge linearly in expectation to the optimal solution. Recall that in Step 7 of Algorithm 2, we want to find an $\epsilon_k$-optimal solution, in the sense of (5), to the following problem:
$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad \frac{1}{2\eta}\left\|x - \left(y_k - \eta H^{-1}v_k\right)\right\|_H^2 + h(x). \qquad (10)$$

The next lemma quantifies the progress made by one inner iteration of the algorithm. Our proof builds on a Lyapunov argument using a Lyapunov function of the form
$$V_k = F(x_k) - F(x^\star) + \frac{\mu}{2}\|z_k - x^\star\|_H^2. \qquad (11)$$

Lemma 2. Let Assumptions 1 and 2 hold and let $x^\star = \operatorname*{argmin}_x F(x)$, $\eta = 1/L_{\mathrm{avg}}$ and $\tau = \sqrt{\mu/(2L_{\mathrm{avg}})}$. If the mini-batch size is chosen such that $b \ge 60\sqrt{L_{\mathrm{avg}}/\mu}$, then for any $k\in\{0,\dots,T-1\}$, there exists a vector $\xi_k\in\mathbb{R}^d$ such that $\|\xi_k\|_{H^{-1}} \le \sqrt{2\eta\epsilon_k}$ and
$$\mathbb{E}\left[V_{k+1}\right] \le (1-\tau)\,\mathbb{E}\left[V_k\right] + \tau L_{\mathrm{avg}}\,\mathbb{E}\left[\langle\xi_k,\, x^\star - z_k\rangle\right] + 5\epsilon_k + \frac{\tau}{5}\,\mathbb{E}\left[F(x_k) - F(x^\star) + F(\tilde{x}_s) - F(x^\star)\right]. \qquad (12)$$

Proof. See Appendix D.

Remark 1. Our proof is direct and based on natural Lyapunov functions, thereby avoiding the use of stochastic estimate sequences as in (Nitanda, 2014), which is already very complicated even when the subproblems are solved exactly and $H = I$. We stress that the result in Lemma 2 also holds for smaller mini-batch sizes, namely $b\in\{1,\dots,O(\sqrt{L_{\mathrm{avg}}/\mu})\}$, provided that the step size $\eta$ is reduced accordingly. In favor of a simple proof, we only report the large mini-batch result here.

Equipped with Lemma 2, we can now characterize the progress made by one outer iteration of Algorithm 2.

Theorem 2. Let Assumptions 1 and 2 hold. Suppose that the parameters $\eta$, $b$, and $\tau$ are chosen according to Lemma 2 and define $\rho = 9\tau/10$. Then, if the errors in solving the subproblems satisfy $\epsilon_k \le (1-\rho)^k V_0$ for all $k\in\{0,\dots,T-1\}$ and $T \ge (4\log c)/(3\rho)$, where $c$ is a universal constant, then for every $s\in\mathbb{N}_+$,
$$\mathbb{E}\left[F(\tilde{x}_s) - F(x^\star)\right] \le \frac{2}{3}\,\mathbb{E}\left[F(\tilde{x}_{s-1}) - F(x^\star)\right].$$

Proof. See Appendix E.

Remark 2. The theorem indicates that if the errors in solving the subproblems are controlled appropriately, the outer iterates generated by Algorithm 2 converge linearly in expectation to the optimal solution. Since $V_0$ depends on $x^\star$, it is difficult to provide a general closed-form expression for the target precisions $\epsilon_k$. However, we will show below that with a certain policy for selecting the initial point, it is sufficient to run the solver a constant number of iterations, independently of $x^\star$. We stress that the results in this section are valid for minimizing general convex composite functions (2) and are not limited to the elastic net problem.

5. Warm-Start

The overall complexity of Algorithm 2 depends strongly on our ability to solve (10) in a reasonable computational time. If one naively starts the solver at a random point, it may take many iterations to meet the target precision. Thus, it is necessary to have a well-designed warm-start procedure for initializing the solver. Intuitively, the current iterate $x_k$ can be a reasonable starting point, since the next iterate $x_{k+1}$ should not be too far away from $x_k$. However, in order to achieve a strong theoretical running time, we use a rather different scheme inspired by (Lin et al., 2015).

Let us first define the vector $u_k = y_k - \eta H^{-1}v_k$ for $k\in\{0,1,\dots,T-1\}$ and the function
$$p(z,u) = h(z) + \frac{1}{2\eta}\|z - u\|_H^2.$$
Then, the $k$-th subproblem seeks an $x_{k+1}$ such that
$$p(x_{k+1}, u_k) - p(x^\star_{k+1}, u_k) \le \epsilon_k, \qquad (13)$$
where $x^\star_{k+1}$ is the exact solution. We consider the initialization policy
$$z_0 = \mathrm{prox}_{\gamma h}\!\left(x_k - \frac{\gamma}{\eta}H\left(x_k - u_{k-1}\right)\right), \qquad (14)$$
which can be seen as one step of the proximal gradient method applied to $p(z, u_{k-1})$, starting at the current iterate $x_k$. The following proposition characterizes the difference in objective realized by $z_0$ and $x^\star_{k+1}$.

Proposition 2. Let $z_0$ be defined by (14) with $\gamma = \eta/\lambda_1(H)$. Let $\kappa_{\mathrm{sub}} = \lambda_1(H)/\lambda_r(H)$ be the condition number of the subproblems. Assume that the errors in solving the subproblems satisfy $\epsilon_k \le (1-\rho)^k V_0$ for all $k\in\{0,1,\dots,T-1\}$. Then,
$$p(z_0, u_k) - p(x^\star_{k+1}, u_k) \le \frac{\kappa_{\mathrm{sub}}}{1-\rho}\,\epsilon_k.$$

Proof. See Appendix F.

The proposition, together with (13), implies that it suffices to find an $x_{k+1}$ such that
$$p(x_{k+1}, u_k) - p(x^\star_{k+1}, u_k) \le \frac{1-\rho}{\kappa_{\mathrm{sub}}}\left(p(z_0, u_k) - p(x^\star_{k+1}, u_k)\right). \qquad (15)$$
This is significant, since one only needs to reduce the residual error by a constant factor independent of the target precision. Note also that the condition number $\kappa_{\mathrm{sub}}$ is much smaller than $\kappa \approx \lambda_1(H)/(\lambda_d + \gamma_2)$, and computing the gradient of the smooth part of $p(z,u)$ only takes time $O(rd)$ instead of $O(nd)$ as in the original problem. These properties imply that the subproblems can be solved efficiently by iterative methods, where only a small (and known) constant number of iterations is needed.
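A compact sketch of this warm-started subproblem solve is given below (ours, not the authors' code). Here `soft_threshold` and `scaled_prox_l1` are the helpers from the sketch after Definition 2, `apply_H` is any callable computing $Hv$ (for instance the factored product from the sketch in Section 3.1, bound to its factors), and `budget` stands in for the constant iteration count discussed in the next section.

```python
def warm_started_subproblem(y_k, Hinv_v_k, x_k, u_prev, apply_H, lam1_H,
                            eta, gamma1, budget):
    """Solve the k-th subproblem  min_z gamma1*||z||_1 + (1/(2*eta))*||z - u_k||_H^2
    with u_k = y_k - eta*H^{-1} v_k, initialized by the warm-start policy (14)."""
    u_k = y_k - eta * Hinv_v_k
    gamma = eta / lam1_H                        # step size used in (14)
    # (14): one proximal gradient step on p(z, u_{k-1}), started at the current iterate x_k
    z0 = soft_threshold(x_k - (gamma / eta) * apply_H(x_k - u_prev), gamma * gamma1)
    # refine with a fixed number of proximal gradient steps on p(z, u_k);
    # by Proposition 2, a constant budget suffices to reach an eps_k-optimal solution
    x_next = scaled_prox_l1(u_k, apply_H, lam1_H, eta, gamma1,
                            num_iters=budget, x0=z0)
    return x_next, u_k
```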
The next section develops the final details of the convergence proof.

6. Global Time Complexity

We start with the time complexity of Algorithm 2. Let $T(\alpha)$ be the number of gradient evaluations that a subproblem solver takes to reduce the residual error by a factor $\alpha$. Then, by Proposition 2, one can find an $\epsilon_k$-optimal solution to the $k$-th subproblem with at most $T(\kappa_{\mathrm{sub}}/(1-\rho))$ gradient evaluations, where one gradient evaluation is equivalent to $O(d)$ flops. Consider the same setting as Theorem 2 and suppose that the subproblems are initialized by (14). Then, the time complexity of Algorithm 2 is given by
$$O\!\left(d\left(n + \kappa_H + \sqrt{\kappa_H}\,T\!\left(\frac{\kappa_{\mathrm{sub}}}{1-\rho}\right)\right)\log\frac{1}{\epsilon}\right), \qquad (16)$$
where the first summand is due to the full gradient evaluation at each outer loop; the second one comes from the fact that one needs $O(\sqrt{\kappa_H})$ inner iterations, each of which uses a mini-batch of size $O(\sqrt{\kappa_H})$; and the third one is the result of $O(\sqrt{\kappa_H})$ inner iterations, each of which solves a subproblem that needs $T(\kappa_{\mathrm{sub}}/(1-\rho))$ gradient evaluations. We can now put things together and state our main result.

Proposition 3. Suppose that the approximate Hessian matrix $H$ is given by (6) and that Algorithm 2 is invoked with $f(x) = \frac{1}{2n}\|Ax-b\|_2^2 + \frac{\gamma_2}{2}\|x\|_2^2$ and $h(x) = \gamma_1\|x\|_1$. Assume further that the subproblems are solved by the accelerated proximal gradient descent method (Beck & Teboulle, 2009; Nesterov, 2013). Our method can find an $\epsilon$-optimal solution in time
$$O\!\left(d\left(n+\kappa_H\right)\log\frac{1}{\epsilon}\right).$$

Proof. The task reduces to evaluating the term $T(\kappa_{\mathrm{sub}}/(1-\rho))$ in (16). Recall that the iteration complexity of the accelerated proximal gradient descent method for minimizing a function $F(x) = f(x) + h(x)$, where $f$ is a smooth and strongly convex function and $h$ is a possibly non-smooth convex regularizer, initialized at $x_0$, is given by
$$O\!\left(\sqrt{\kappa}\,\log\frac{F(x_0) - F(x^\star)}{\epsilon}\right),$$
where $\kappa$ is the condition number. By invoking the above result with $F(x) = p(x, u_k)$, $x^\star = x^\star_{k+1}$, $x_0 = z_0$, $\kappa = \kappa_{\mathrm{sub}}$, and $\epsilon$ equal to the right-hand side of (15), it follows that the number of iterations for each subproblem can be bounded by
$$O\!\left(\sqrt{\kappa_{\mathrm{sub}}}\,\log\frac{\kappa_{\mathrm{sub}}}{1-\rho}\right). \qquad (17)$$
In addition, each iteration takes time $O(rd)$ to compute the gradient, implying the time complexity
$$O\!\left(d\left(n + \kappa_H + r\sqrt{\kappa_{\mathrm{sub}}\,\kappa_H}\,\log\frac{\kappa_{\mathrm{sub}}}{1-\rho}\right)\log\frac{1}{\epsilon}\right).$$
Selecting $r$ such that $r\sqrt{\kappa_{\mathrm{sub}}\,\kappa_H}\log\frac{\kappa_{\mathrm{sub}}}{1-\rho} = O(\kappa_H)$ completes the proof.

We can easily recognize that this time complexity has the same form as that of the stochastic first-order methods discussed in Section 1.1.1, with the condition number $\kappa$ replaced by $\kappa_H$. It has been shown in Theorem 1 that $\kappa_H$ can be much smaller than $\kappa$, especially when $C$ has a rapidly decaying spectrum. Note also that $\kappa_{\mathrm{sub}}$ is available for free to us after having approximated the Hessian matrix.
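As a rough illustration of how these bounds translate into concrete loop sizes (our reading of the analysis, not a prescription from the paper; in particular, treating the $\log c$ term of Theorem 2 as $O(1)$ and using $\sqrt{\kappa_{\mathrm{sub}}}\log\kappa_{\mathrm{sub}}$ as the per-subproblem budget are simplifying assumptions):

```python
import numpy as np

def algorithm2_budgets(L_avg, mu, kappa_sub):
    """Illustrative parameter and budget choices implied by the analysis:
    eta = 1/L_avg, tau = sqrt(mu/(2*L_avg)), rho = 9*tau/10,
    mini-batch b ~ 60*sqrt(L_avg/mu), epoch length T ~ 4/(3*rho),
    and a fixed accelerated-solver budget ~ sqrt(kappa_sub)*log(kappa_sub) per subproblem."""
    eta = 1.0 / L_avg
    tau = np.sqrt(mu / (2.0 * L_avg))
    rho = 0.9 * tau
    b = int(np.ceil(60.0 * np.sqrt(L_avg / mu)))   # mini-batch size from Lemma 2
    T = int(np.ceil(4.0 / (3.0 * rho)))            # epoch length (log c treated as O(1))
    sub_iters = int(np.ceil(np.sqrt(kappa_sub) * np.log(max(kappa_sub, np.e))))
    return {"eta": eta, "tau": tau, "b": b, "T": T, "sub_iters": sub_iters}
```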
7. Experimental Results

In this section, we perform numerical experiments to verify the efficacy of the proposed method on real-world data sets (Chang & Lin, 2011; Guyon et al., 2008). We compare our method with the coordinate descent algorithm (BCD) (Tibshirani et al., 2015) and several first-order methods: FISTA (Beck & Teboulle, 2009) with optimal step size; Prox-SVRG (Xiao & Zhang, 2014) with epoch length $2n/b$ as suggested by the authors; Katyusha1 (Allen-Zhu, 2016) with epoch length $2n/b$ and Katyusha momentum $\tau_2 = 0.5/b$ as suggested by the author; and our method with epoch length $2n/b$. Since Katyusha1 can use a mini-batch of size $\sqrt{n}$ without slowing down the convergence, we set $b = \sqrt{n}$ for all stochastic methods unless otherwise stated. Finally, to make a fair comparison, for each algorithm above we tune only the step size, selecting $\eta$ from the set $\{10^k\bar{\eta},\ 2\cdot 10^k\bar{\eta},\ 5\cdot 10^k\bar{\eta} \mid k\in\{0,1,2\}\}$, where $\bar{\eta}$ is the theoretical step size, and report the one having the smallest objective value. Other hyper-parameters are set to their theory-predicted values. All methods are initialized at 0. For the subproblems in Algorithm 2, we simply run FISTA with $\sqrt{\kappa_{\mathrm{sub}}}\log\kappa_{\mathrm{sub}}$ iterations, as discussed previously, without any further tuning. The value of $r$ is chosen as a small fraction of $d$ so that the preprocessing time of Algorithm 1 is negligible. Note that the spectrum of $C$ available after running Algorithm 1 also provides an insightful way to choose $r$.

Figure 1. Spectrum of the matrix $C$ for different data sets.

Figure 2. Suboptimality versus the number of epochs (FISTA, Katyusha, Prox SVRG, BCD, Ours).

Figure 2 shows the suboptimality in objective versus the number of epochs for different algorithms solving the elastic net problem. We can see that our method systematically outperforms the others in all settings, and that there is a clear correspondence between the spectrum of $C$ in Figure 1 and the potential speed-up. Notably, for ill-conditioned data sets such as australian and cina0, all the first-order methods make almost no progress in the first 100 epochs, while our method can find a high-accuracy solution within tens of epochs, demonstrating a great benefit of second-order information. On the other hand, for a well-conditioned data set that does not exhibit high curvature, such as real-sim, Prox SVRG is comparable to our method and even outperforms Katyusha1. This agrees with the time complexities summarized in Table 1.

Figure 3. Suboptimality versus runtime (gisette-scale with $b = 500$; real-sim with $b = 2000$).

For runtime comparisons, we only report the performance of BCD and our method, since they outperform the others by a large margin. We observe that for problems with small or medium dimensions, it is useful to have BCD as the subproblem solver in our method, since BCD is very effective in solving LASSO-type problems. However, for high-dimensional problems, BCD needs to loop over $d$ coordinates, which can be excessive since we only need to reduce the error of the subproblem by a small constant factor. For this reason, we use BCD as the subproblem solver for australian and cina0, and FISTA for gisette and real-sim. The runtime plots in Figure 3 demonstrate that curvature information offers significant acceleration over one of the state-of-the-art methods for solving the elastic net problem.

8. Conclusions

We have proposed and analyzed a novel second-order method for solving the elastic net problem. By carefully exploiting the problem structure, we demonstrated that it is possible to deal with the non-smooth objective and to efficiently inject curvature information into the optimization process without (significantly) increasing the computational cost per iteration. The combination of second-order information, fast iterative solvers, and a well-designed warm-start procedure results in a significant improvement of the total runtime complexity over popular first-order methods. An interesting direction for future research would be to go beyond the quadratic loss.

Acknowledgements

This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish Research Council and the Swedish Foundation for Strategic Research. We thank the anonymous reviewers for their useful comments and suggestions.

References

Agarwal, N., Bullins, B., and Hazan, E. Second-order stochastic optimization for machine learning in linear time. Journal of Machine Learning Research, 18(1):4148–4187, 2017.

Allen-Zhu, Z. Katyusha: The first direct acceleration of stochastic gradient methods.
arXiv preprint arXiv:1603.05953, 2016.

Arjevani, Y. and Shamir, O. Oracle complexity of second-order methods for finite-sum problems. In International Conference on Machine Learning, pp. 205–213, 2017.

Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

Bertsekas, D. P., Nedić, A., and Ozdaglar, A. E. Convex analysis and optimization. Athena Scientific, 2003.

Byrd, R. H., Hansen, S. L., Nocedal, J., and Singer, Y. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.

Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.

Defazio, A. A simple practical accelerated method for finite sums. In Advances in Neural Information Processing Systems, pp. 676–684, 2016.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Erdogdu, M. A. and Montanari, A. Convergence rates of sub-sampled Newton methods. In Advances in Neural Information Processing Systems, pp. 3052–3060. MIT Press, 2015.

Frostig, R., Ge, R., Kakade, S., and Sidford, A. Un-regularizing: Approximate proximal point and faster stochastic algorithms for empirical risk minimization. In International Conference on Machine Learning, pp. 2540–2548, 2015.

Ghanbari, H. and Scheinberg, K. Proximal quasi-Newton methods for convex optimization. arXiv preprint arXiv:1607.03081, 2016.

Gonen, A., Orabona, F., and Shalev-Shwartz, S. Solving ridge regression using sketched preconditioned SVRG. In International Conference on Machine Learning, pp. 1397–1405, 2016.

Guyon, I., Aliferis, C., Cooper, G., Elisseeff, A., Pellet, J.-P., Spirtes, P., and Statnikov, A. Design and analysis of the causation and prediction challenge. In Causation and Prediction Challenge, pp. 1–33, 2008.

Halko, N., Martinsson, P.-G., and Tropp, J. A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288, May 2011.

Hsu, D., Kakade, S. M., and Zhang, T. Random design analysis of ridge regression. In Conference on Learning Theory, 2012.

Hu, C., Kwok, J. T., and Pan, W. Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems, pp. 781–789, Vancouver, Canada, Dec 2009.

Lee, J. D., Sun, Y., and Saunders, M. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems, pp. 827–835, 2012.

Lin, H., Mairal, J., and Harchaoui, Z. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems, pp. 3384–3392, 2015.

Liu, X., Hsieh, C.-J., Lee, J. D., and Sun, Y. An inexact subsampled proximal Newton-type method for large-scale machine learning. arXiv preprint arXiv:1708.08552, 2017.

Moritz, P., Nishihara, R., and Jordan, M. A linearly-convergent stochastic L-BFGS algorithm. In Artificial Intelligence and Statistics, pp. 249–258, 2016.

Musco, C. and Musco, C. Randomized block Krylov methods for stronger and faster approximate singular value decomposition. In Advances in Neural Information Processing Systems, pp. 1396–1404, 2015.

Nesterov, Y. Gradient methods for minimizing composite functions.
Mathematical Programming, 140(1):125–161, 2013.

Nitanda, A. Stochastic proximal gradient descent with acceleration techniques. In Advances in Neural Information Processing Systems, pp. 1574–1582, Montréal, Canada, Dec 2014.

Rodomanov, A. and Kropotov, D. A superlinearly-convergent proximal Newton-type method for the optimization of finite sums. In International Conference on Machine Learning, pp. 2597–2605, 2016.

Roosta-Khorasani, F. and Mahoney, M. W. Sub-sampled Newton methods I: Globally convergent algorithms. arXiv preprint arXiv:1601.04737, 2016.

Schmidt, M., Roux, N. L., and Bach, F. Convergence rates of inexact proximal-gradient methods for convex optimization. In Advances in Neural Information Processing Systems, pp. 1458–1466, Granada, Spain, Dec 2011.

Tibshirani, R., Wainwright, M., and Hastie, T. Statistical learning with sparsity: The lasso and generalizations. Chapman and Hall/CRC, 2015.

Wang, J. and Zhang, T. Improved optimization of finite sums with minibatch stochastic variance reduced proximal iterations. arXiv preprint arXiv:1706.07001, 2017.

Woodworth, B. E. and Srebro, N. Tight complexity bounds for optimizing composite objectives. In Advances in Neural Information Processing Systems, pp. 3639–3647, 2016.

Xiao, L. and Zhang, T. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, Dec 2014.

Xu, P., Yang, J., Roosta-Khorasani, F., Ré, C., and Mahoney, M. W. Sub-sampled Newton methods with non-uniform sampling. In Advances in Neural Information Processing Systems, pp. 3000–3008, 2016.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.