# Approximate Steepest Coordinate Descent

Sebastian U. Stich¹, Anant Raj², Martin Jaggi¹

¹EPFL  ²Max Planck Institute for Intelligent Systems. Correspondence to: Sebastian U. Stich.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract. We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates. In many practical applications, our scheme can be implemented at no extra cost, with per-iteration computational efficiency very close to that of the cheaper uniform selection. Numerical experiments with Lasso and Ridge regression show promising improvements, in line with our theoretical guarantees.

1. Introduction

Coordinate descent (CD) methods have attracted substantial interest in the optimization community in the last few years (Nesterov, 2012; Richtárik & Takáč, 2016). Due to their computational efficiency and scalability, as well as their ease of implementation, these methods are the state-of-the-art for a wide selection of machine learning and signal processing applications (Fu, 1998; Hsieh et al., 2008; Wright, 2015). This is also theoretically well justified: the complexity estimates for CD methods are in general better than the estimates for methods that compute the full gradient in one batch pass (Nesterov, 2012; Nesterov & Stich, 2017).

In many CD methods, the active coordinate is picked at random, according to a probability distribution. For smooth functions it is theoretically well understood how the sampling procedure is related to the efficiency of the scheme and which distributions give the best complexity estimates (Nesterov, 2012; Zhao & Zhang, 2015; Allen-Zhu et al., 2016; Qu & Richtárik, 2016; Nesterov & Stich, 2017). For nonsmooth and composite functions that appear in many machine learning applications the picture is less clear. For instance, uniform sampling (UCD) is used in (Shalev-Shwartz & Zhang, 2013; Friedman et al., 2007; 2010; Shalev-Shwartz & Tewari, 2011), whereas other papers propose adaptive sampling strategies that change over time (Papa et al., 2015; Csiba et al., 2015; Osokin et al., 2016; Perekrestenko et al., 2017).

A very simple deterministic strategy is to move along the direction corresponding to the component of the gradient with the maximal absolute value (steepest coordinate descent, SCD) (Boyd & Vandenberghe, 2004; Tseng & Yun, 2009). For smooth functions this strategy always yields better progress than UCD, and the speedup can reach a factor of the dimension (Nutini et al., 2015). However, SCD requires the computation of the whole gradient vector in each iteration, which is prohibitive (except for special applications, cf. Dhillon et al. (2011); Shrivastava & Li (2014)).

In this paper we propose approximate steepest coordinate descent (ASCD), a novel scheme which combines the best parts of the aforementioned strategies: (i) ASCD maintains an approximation of the full gradient in each iteration and selects the active coordinate among the components of this vector that have large absolute values (similar to SCD); and (ii) in many situations the gradient approximation can be updated cheaply at no extra cost (similar to UCD).
We show that regardless of the errors in the gradient approximation (even if they are infinite), ASCD always performs better than UCD. Similar to the methods proposed in (Tseng & Yun, 2009) we also present variants of ASCD for composite problems. We confirm our theoretical findings by numerical experiments for Lasso and Ridge regression on a synthetic dataset as well as on the RCV1 (binary) dataset.

Structure of the Paper and Contributions. In Sec. 2 we review the existing theory for SCD and (i) extend it to the setting of smooth functions. We present (ii) a novel lower bound, showing that the complexity estimates for SCD and UCD can be equal in general. We (iii) introduce ASCD and the safe selection rules for both smooth (Sec. 3) and composite (Sec. 5) functions. We prove that (iv) ASCD always performs better than UCD (Sec. 3) and (v) it can reach the performance of SCD (Sec. 6). In Sec. 4 we discuss important applications where the gradient estimate can efficiently be maintained. Our theory is supported by numerical evidence in Sec. 7, which reveals that (vi) ASCD performs extremely well on real data.

Notation. Define $[x]_i := \langle x, e_i \rangle$ with $e_i$ the standard unit vectors in $\mathbb{R}^n$. We abbreviate $\nabla_i f := [\nabla f]_i$. A convex function $f\colon \mathbb{R}^n \to \mathbb{R}$ with coordinate-wise $L_i$-Lipschitz continuous gradients¹ for constants $L_i > 0$, $i \in [n] := \{1, \dots, n\}$, satisfies by the standard reasoning
$$f(x + \eta e_i) \leq f(x) + \eta \nabla_i f(x) + \tfrac{L_i}{2}\eta^2 \qquad (1)$$
for all $x \in \mathbb{R}^n$ and $\eta \in \mathbb{R}$. A function is coordinate-wise $L$-smooth if $L_i \leq L$ for $i = 1, \dots, n$. For an optimization problem $\min_{x \in \mathbb{R}^n} f(x)$ define $X^\star := \arg\min_{x \in \mathbb{R}^n} f(x)$ and denote by $x^\star \in \mathbb{R}^n$ an arbitrary element $x^\star \in X^\star$.

¹ That is, $|\nabla_i f(x + \eta e_i) - \nabla_i f(x)| \leq L_i |\eta|$ for all $x \in \mathbb{R}^n$, $\eta \in \mathbb{R}$.

2. Steepest Coordinate Descent

In this section we present SCD and discuss its theoretical properties. The functions of interest are composite convex functions $F\colon \mathbb{R}^n \to \mathbb{R}$ of the form
$$F(x) := f(x) + \Psi(x) \qquad (2)$$
where $f$ is coordinate-wise $L$-smooth and $\Psi$ is convex and separable, that is $\Psi(x) = \sum_{i=1}^n \Psi_i([x]_i)$. In the first part of this section we focus on smooth problems, i.e. we assume that $\Psi \equiv 0$.

Coordinate descent methods with constant step size generate a sequence $\{x_t\}_{t \geq 0}$ of iterates that satisfy the relation
$$x_{t+1} = x_t - \tfrac{1}{L} \nabla_{i_t} f(x_t)\, e_{i_t}. \qquad (3)$$
In UCD the active coordinate $i_t$ is chosen uniformly at random from the set $[n]$. SCD chooses the coordinate according to the Gauss-Southwell (GS) rule:
$$i_t = \arg\max_{i \in [n]} |\nabla_i f(x_t)|. \qquad (4)$$

2.1. Convergence analysis

With the quadratic upper bound (1) one easily gets a lower bound on the one step progress,
$$\mathbb{E}\left[f(x_t) - f(x_{t+1}) \mid x_t\right] \geq \mathbb{E}_{i_t}\left[\tfrac{1}{2L} |\nabla_{i_t} f(x_t)|^2\right]. \qquad (5)$$
For UCD and SCD the expression on the right hand side evaluates to
$$\tau_{\mathrm{UCD}}(x_t) := \tfrac{1}{2nL} \|\nabla f(x_t)\|_2^2, \qquad \tau_{\mathrm{SCD}}(x_t) := \tfrac{1}{2L} \|\nabla f(x_t)\|_\infty^2. \qquad (6)$$
With Cauchy-Schwarz we find
$$\tfrac{1}{n}\,\tau_{\mathrm{SCD}}(x_t) \leq \tau_{\mathrm{UCD}}(x_t) \leq \tau_{\mathrm{SCD}}(x_t). \qquad (7)$$
Hence, the lower bound on the one step progress of SCD is always at least as large as the lower bound on the one step progress of UCD. Moreover, the one step progress could even be larger by a factor of $n$. However, it is very difficult to formally prove that this linear speed-up holds for more than one iteration, as the expressions in (7) depend on the (a priori unknown) sequence of iterates $\{x_t\}_{t \geq 0}$.
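To make the two selection rules concrete, the following sketch contrasts uniform selection with the Gauss-Southwell rule (4), both combined with the fixed step size update (3). The quadratic least-squares objective used here is a placeholder of our choosing, not an example from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder problem: f(x) = 0.5 * ||Ax - b||^2, which is coordinate-wise smooth
# with L_i = ||a_i||^2 (so L = max_i ||a_i||^2 is a valid common constant).
A = rng.standard_normal((50, 20))
b = rng.standard_normal(50)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)
L = float(np.max(np.sum(A * A, axis=0)))

def ucd_step(x):
    """Uniform coordinate descent: i_t drawn u.a.r., then update (3)."""
    i = int(rng.integers(x.size))
    g_i = A[:, i] @ (A @ x - b)       # a single directional derivative
    y = x.copy()
    y[i] -= g_i / L
    return y

def scd_step(x):
    """Steepest coordinate descent: Gauss-Southwell rule (4), then update (3)."""
    g = grad(x)                       # requires the full gradient
    i = int(np.argmax(np.abs(g)))
    y = x.copy()
    y[i] -= g[i] / L
    return y

x = np.zeros(20)
for _ in range(200):
    x = scd_step(x)
print(f(x))
```

With exact gradients the two methods differ only in the selection step; the point of ASCD is to retain most of the benefit of (4) without recomputing the full gradient in every iteration.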
Strongly Convex Objectives. Nutini et al. (2015) present an elegant solution of this problem for $\mu_2$-strongly convex functions². They propose to measure the strong convexity of the objective function in the 1-norm instead of the 2-norm. This gives rise to the lower bound
$$\tau_{\mathrm{SCD}}(x_t) \geq \tfrac{\mu_1}{L}\left(f(x_t) - f(x^\star)\right), \qquad (8)$$
where $\mu_1$ denotes the strong convexity parameter in the 1-norm. By this, they get a uniform upper bound on the convergence that does not directly depend on local properties of the function, like for instance $\tau_{\mathrm{SCD}}(x_t)$, but just on $\mu_1$. It always holds $\mu_1 \leq \mu_2$, and for functions where both quantities are equal, SCD enjoys a linear speedup over UCD.

² A function is $\mu_p$-strongly convex in the $p$-norm, $p \geq 1$, if $f(y) \geq f(x) + \langle \nabla f(x), y - x \rangle + \tfrac{\mu_p}{2}\|y - x\|_p^2$ for all $x, y \in \mathbb{R}^n$.

Smooth Objectives. When the objective function $f$ is just smooth (but not necessarily strongly convex), then the analysis mentioned above is not applicable. We here extend the analysis from (Nutini et al., 2015) to smooth functions.

Theorem 2.1. Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be convex and coordinate-wise $L$-smooth. Then for the sequence $\{x_t\}_{t \geq 0}$ generated by SCD it holds:
$$f(x_t) - f(x^\star) \leq \frac{2L R_1^2}{t}, \qquad (9)$$
for $R_1 := \max_{x^\star \in X^\star} \max_{x \in \mathbb{R}^n} \left\{ \|x - x^\star\|_1 \mid f(x) \leq f(x_0) \right\}$.

Proof. In the proof we first derive a lower bound on the one step progress (Lemma A.1), similar to the analysis in (Nesterov, 2012). The lower bound for the one step progress of SCD can in each iteration differ by up to a factor of $n$ from the analogous bound derived for UCD (similar as in (7)). All details are given in Section A.1 in the appendix.

Note that $R_1$ is essentially the diameter of the level set at $f(x_0)$ measured in the 1-norm. In the complexity estimate of UCD, $R_1^2$ in (9) is replaced by $n R_2^2$, where $R_2$ is the diameter of the level set at $f(x_0)$ measured in the 2-norm (cf. Nesterov (2012); Wright (2015)). As in (7) we observe with Cauchy-Schwarz
$$\tfrac{1}{n} R_1^2 \leq R_2^2 \leq R_1^2, \qquad (10)$$
i.e. SCD can accelerate by up to a factor of $n$ over UCD.

2.2. Lower bounds

In the previous section we provided complexity estimates for the methods SCD and UCD and showed that SCD can converge up to a factor of the dimension $n$ faster than UCD. In this section we show that this analysis is tight. In Theorem 2.2 below we give a function $q\colon \mathbb{R}^n \to \mathbb{R}$ for which the one step progress $\tau_{\mathrm{SCD}}(x_t)$ matches $\tau_{\mathrm{UCD}}(x_t)$ up to a constant factor, for all iterates $\{x_t\}_{t \geq 0}$ generated by SCD.

By a simple technique we can also construct functions for which the speedup is equal to an arbitrary factor $\lambda \in [1, n]$. For instance we can consider functions with a (separable) low dimensional structure. Fix integers $s \leq n$ such that $n/s \leq \lambda$, and define the function $f\colon \mathbb{R}^n \to \mathbb{R}$ as
$$f(x) := q(\pi_s(x)), \qquad (11)$$
where $\pi_s$ denotes the projection to $\mathbb{R}^s$ (being the first $s$ out of $n$ coordinates) and $q\colon \mathbb{R}^s \to \mathbb{R}$ is the function from Theorem 2.2. Then
$$\tau_{\mathrm{SCD}}(x_t) \leq \lambda\, \tau_{\mathrm{UCD}}(x_t), \qquad (12)$$
up to the constant factor from Theorem 2.2, for all iterates $\{x_t\}_{t \geq 0}$ generated by SCD.

Theorem 2.2. Consider the function $q(x) = \tfrac{1}{2}\langle Qx, x\rangle$ for $Q := I_n - \tfrac{99}{100n} J_n$, where $J_n = \mathbf{1}_n \mathbf{1}_n^\top$, $n > 2$. Then there exists $x_0 \in \mathbb{R}^n$ such that for the sequence $\{x_t\}_{t \geq 0}$ generated by SCD it holds
$$n\, \|\nabla q(x_t)\|_\infty^2 \leq c\, \|\nabla q(x_t)\|_2^2 \qquad (13)$$
for a universal constant $c$.

Proof. In the appendix we discuss a family of functions defined by matrices $Q := (\alpha - 1)\tfrac{1}{n} J_n + I_n$ and define corresponding parameters $0 < c_\alpha < 1$ such that for $x_0$ defined as $[x_0]_i = c_\alpha^{i-1}$ for $i = 1, \dots, n$, SCD cycles through the coordinates, that is, the sequence $\{x_t\}_{t \geq 0}$ generated by SCD satisfies
$$[x_t]_{1 + (t-1 \bmod n)} = c_\alpha^n\, [x_{t-1}]_{1 + (t-1 \bmod n)}. \qquad (14)$$
We verify that for this sequence property (13) holds.
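The construction of Theorem 2.2 can be probed numerically. The sketch below (our own illustration, not part of the paper) builds $Q = I_n - \tfrac{99}{100n} J_n$, runs SCD with the fixed step size (3), and records the ratio $n\|\nabla q(x_t)\|_\infty^2 / \|\nabla q(x_t)\|_2^2$ that the theorem bounds; the starting point used here is only a guess, not the specific $x_0$ from the proof.

```python
import numpy as np

n = 50
Q = np.eye(n) - (99.0 / (100.0 * n)) * np.ones((n, n))   # Q = I_n - (99/100n) J_n
grad_q = lambda x: Q @ x                                  # gradient of q(x) = 0.5 <Qx, x>
L = float(np.max(np.diag(Q)))                             # coordinate-wise smoothness Q_ii

x = np.geomspace(1.0, 0.1, n)     # illustrative start; the proof works with a specific
ratios = []                       # geometrically decaying x_0
for _ in range(1000):
    g = grad_q(x)
    ratios.append(n * np.max(np.abs(g)) ** 2 / np.sum(g ** 2))  # n ||g||_inf^2 / ||g||_2^2
    i = int(np.argmax(np.abs(g)))                               # GS rule (4)
    x[i] -= g[i] / L                                            # update (3)
print(max(ratios))   # compare against n; Thm. 2.2 exhibits an x_0 keeping this O(1)
```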
2.3. Composite Functions

The generalization of the GS rule (4) to composite problems (2) with nontrivial $\Psi$ is not straightforward. The steepest direction is not always meaningful in this setting; consider for instance a constrained problem where this rule could yield no progress at all when stuck at the boundary. Nutini et al. (2015) discuss several generalizations of the Gauss-Southwell rule for composite functions. The GS-s rule is defined to choose the coordinate with the most negative directional derivative (Wu & Lange, 2008). This rule is identical to (4) but requires the calculation of subgradients of $\Psi_i$. However, the length of a step could be arbitrarily small. In contrast, the GS-r rule was defined to pick the coordinate direction that yields the longest step (Tseng & Yun, 2009). The rule that enjoys the best theoretical properties (cf. Nutini et al. (2015)) is the GS-q rule, which is defined so as to maximize the progress assuming a quadratic upper bound on $f$ (Tseng & Yun, 2009). Consider the coordinate-wise models
$$V_i(x, y, s) := sy + \tfrac{L}{2} y^2 + \Psi_i([x]_i + y), \qquad (15)$$
for $i \in [n]$. The GS-q rule is formally defined as
$$i_{\mathrm{GS}\text{-}q} = \arg\min_{i \in [n]} \min_{y \in \mathbb{R}} V_i(x, y, \nabla_i f(x)). \qquad (16)$$

2.4. The Complexity of the GS rule

So far we only studied the iteration complexity of SCD, but we have disregarded the fact that the computation of the GS rule (4) can be as expensive as the computation of the whole gradient. The application of coordinate descent methods is only justified if the complexity to compute one directional derivative is approximately $n$ times cheaper than the computation of the full gradient vector (cf. Nesterov (2012)). By Theorem 2.2 this reasoning also applies to SCD. A class of functions with this property is given by functions $F\colon \mathbb{R}^n \to \mathbb{R}$,
$$F(x) := f(Ax) + \sum_{i=1}^n \Psi_i([x]_i), \qquad (17)$$
where $A$ is a $d \times n$ matrix, and where $f\colon \mathbb{R}^d \to \mathbb{R}$ and $\Psi_i\colon \mathbb{R} \to \mathbb{R}$ are convex and simple, that is, the time complexity $T$ for computing their gradients is linear: $T(\nabla_y f(y), \nabla_x \Psi(x)) = O(d + n)$. This class of functions includes least squares, logistic regression, Lasso, and SVMs (when solved in dual form).

Assuming the matrix is dense, the complexity to compute the full gradient of $F$ is $T(\nabla_x F(x)) = O(dn)$. If the value $w = Ax$ is already computed, one directional derivative can be computed in time $T(\nabla_i F(x)) = O(d)$. The recursive update of $w$ after one step needs the addition of one column of the matrix $A$ with some factors and can be done in time $O(d)$. However, we note that recursively updating the full gradient vector takes time $O(dn)$, and consequently the computation of the GS rule cannot be done efficiently. Nutini et al. (2015) consider sparse matrices, for which the computation of the Gauss-Southwell rule becomes tractable. In this paper, we propose an alternative approach. Instead of updating the exact gradient vector, we keep track of an approximation of the gradient vector and recursively update this approximation in time $O(n \log n)$. With these updates, the use of coordinate descent is still justified in case $d = \Omega(n)$.
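To make this cost accounting concrete, here is a small illustration (least squares standing in for a general smooth $f$ in (17), $\Psi \equiv 0$, and all names our own) of how maintaining $w = Ax$ keeps one directional derivative at $O(d)$, while recomputing the full gradient would cost $O(dn)$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 200, 1000
A = rng.standard_normal((d, n))
b = rng.standard_normal(d)
L = float(np.max(np.sum(A * A, axis=0)))   # coordinate-wise smoothness of 0.5*||Ax - b||^2

def coordinate_step(i, x, w):
    """One step on F(x) = f(Ax) with f(w) = 0.5*||w - b||^2 and Psi = 0."""
    g_i = A[:, i] @ (w - b)                # directional derivative in O(d), thanks to w = Ax
    gamma = -g_i / L                       # fixed step size as in (3)
    x[i] += gamma
    w += gamma * A[:, i]                   # rank-one update keeps w = Ax, O(d)

x = np.zeros(n)
w = A @ x                                  # computed once, O(dn)
for _ in range(5 * n):
    coordinate_step(int(rng.integers(n)), x, w)
print(0.5 * np.sum((w - b) ** 2))          # objective evaluated from the maintained w
```

Algorithm 1 in the next section builds on exactly this kind of bookkeeping and, in addition, maintains for every passive coordinate an estimate of its gradient entry together with an error bound.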
3. Algorithm

Is it possible to get the significantly improved convergence speed of SCD when one is only willing to pay the computational cost of the much simpler UCD? In this section, we give a formal definition of our proposed approximate SCD method, which we denote by ASCD. The core idea of the algorithm is the following: while performing coordinate updates, ideally we would like to efficiently track the evolution of all elements of the gradient, not only the one coordinate which is updated in the current step.

The formal definition of the method is given in Algorithm 1 for smooth objective functions.

Algorithm 1 Approximate SCD (ASCD)
Input: $f$, $x_0$, $T$, $\delta$-gradient oracle $g$, method $\mathcal{M}$.
Initialize $[\tilde g_0]_i = 0$, $[r_0]_i = \infty$ for $i \in [n]$.
for $t = 0$ to $T$ do
  For $i \in [n]$ define (compute upper and lower bounds)
    $[u_t]_i := \max\{|[\tilde g_t]_i - [r_t]_i|,\; |[\tilde g_t]_i + [r_t]_i|\}$
    $[\ell_t]_i := \min_{y \in \mathbb{R}}\{|y| \mid [\tilde g_t]_i - [r_t]_i \leq y \leq [\tilde g_t]_i + [r_t]_i\}$
  $\mathrm{av}(I) := \tfrac{1}{|I|}\sum_{i \in I} [\ell_t]_i^2$
  Compute active set $I_t := \arg\min_I \{|I| \mid I \subseteq [n],\; [u_t]_i^2 < \mathrm{av}(I)\ \forall i \notin I\}$
  Pick $i_t$ u.a.r. from $\arg\max_{i \in I_t} [\ell_t]_i$ (active coordinate)
  $(x_{t+1}, [\tilde g_{t+1}]_{i_t}, [r_{t+1}]_{i_t}) := \mathcal{M}(x_t, \nabla_{i_t} f(x_t))$
  $\gamma_t := [x_{t+1}]_{i_t} - [x_t]_{i_t}$ (update the $\nabla f(x_{t+1})$ estimate)
  Update $[\tilde g_{t+1}]_j := [\tilde g_t]_j + \gamma_t\, g_{i_t j}(x_t)$ for all $j \neq i_t$
  Update $[r_{t+1}]_j := [r_t]_j + |\gamma_t|\, \delta_{i_t j}$ for all $j \neq i_t$
end for

In each iteration, only one coordinate is modified according to some arbitrary update rule $\mathcal{M}$. The coordinate update rule $\mathcal{M}$ provides two things: first the new iterate $x_{t+1}$, and secondly also an estimate $g$ of the $i_t$-th entry of the gradient at the new iterate³. Formally,
$$(x_{t+1}, g, r) := \mathcal{M}(x_t, \nabla_{i_t} f(x_t)) \qquad (18)$$
such that the quality of the new gradient estimate $g$ satisfies
$$|\nabla_{i_t} f(x_{t+1}) - g| \leq r. \qquad (19)$$
The non-active coordinates are updated with the help of gradient oracles with accuracy $\delta \geq 0$ (see the next subsection for details). The scenario of exact updates of all gradient entries is obtained for accuracy parameters $\delta = r = 0$, and in this case ASCD is identical to SCD.

³ For instance, for updates by exact coordinate optimization (line search), we have $g = r = 0$.

3.1. Safe bounds for gradient evolution

ASCD maintains lower and upper bounds for the absolute value of each component of the gradient, $[\ell]_i \leq |\nabla_i f(x)| \leq [u]_i$. These bounds allow to identify the coordinates on which the absolute values of the gradient are small (and which hence cannot be the steepest one). More precisely, the algorithm maintains a set $I_t$ of active coordinates (similar in spirit to active set methods, see e.g. Kim & Park (2008); Wen et al. (2012)). A coordinate $j$ is excluded from $I_t$ if the estimated progress in this direction (cf. (5)) is lower than the average of the estimated progress along the coordinate directions in $I_t$, that is if $[u_t]_j^2 < \tfrac{1}{|I_t|}\sum_{i \in I_t} [\ell_t]_i^2$. The active set $I_t$ can be computed in $O(n \log n)$ time by sorting. All other operations take linear $O(n)$ time.

Gradient Oracle. The selection mechanism in ASCD crucially relies on the following definition of a $\delta$-gradient oracle. While the update $\mathcal{M}$ delivers the estimated active entry of the new gradient, the additional gradient oracle is used to update all other coordinates $j \neq i_t$ of the gradient, as in the last two lines of Algorithm 1.

Definition 3.1 ($\delta$-gradient oracle). For a function $f\colon \mathbb{R}^n \to \mathbb{R}$ and indices $i, j \in [n]$, an $(i, j)$-gradient oracle with error $\delta_{ij} \geq 0$ is a function $g_{ij}\colon \mathbb{R}^n \to \mathbb{R}$ satisfying, for all $x \in \mathbb{R}^n$ and $\gamma \in \mathbb{R}$:
$$|\nabla_j f(x + \gamma e_i) - \nabla_j f(x) - \gamma\, g_{ij}(x)| \leq |\gamma|\, \delta_{ij}. \qquad (20)$$
We denote by a $\delta$-gradient oracle a family $\{g_{ij}\}_{i, j \in [n]}$ of $\delta_{ij}$-gradient oracles.

We discuss the availability of good gradient oracles for many problem classes in more detail in Section 4. For example, for least squares problems and general linear models, a $\delta$-gradient oracle is for instance given by a scalar product estimator as in (24) below. Note that ASCD can also handle very bad estimates, as long as the property (20) is satisfied (possibly even with accuracy $\delta_{ij} = \infty$).
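A minimal sketch of the selection step of Sec. 3.1, assuming a gradient estimate $\tilde g$ and error bounds $r$ as maintained by Algorithm 1. The greedy scan below returns a valid safe set (every excluded coordinate provably fails the comparison against the kept average); the set $I_t$ of Algorithm 1 is the smallest such set, which can likewise be obtained after sorting.

```python
import numpy as np

def select_active_coordinate(g_est, r, rng):
    """One selection step in the spirit of Algorithm 1 (illustrative sketch)."""
    u = np.abs(g_est) + r                        # upper bounds: |grad_i f(x)| <= u[i]
    lo = np.maximum(np.abs(g_est) - r, 0.0)      # lower bounds: |grad_i f(x)| >= lo[i]
    order = np.argsort(-u)                       # coordinates by decreasing upper bound
    lo_sq = lo[order] ** 2
    active = [order[0]]                          # the largest-u coordinate is always kept
    avg = lo_sq[0]
    for k in range(1, order.size):
        j = order[k]
        if u[j] ** 2 < avg:                      # provably below the kept average: safe to drop
            break                                #   (all remaining coordinates have smaller u)
        active.append(j)
        avg = lo_sq[: k + 1].mean()
    active = np.array(active)
    best = active[np.isclose(lo[active], lo[active].max())]
    return int(rng.choice(best))                 # u.a.r. among the largest lower bounds

# Example: with r = inf everywhere (the default initialization) every coordinate stays
# active and the pick is uniform, i.e. the step behaves like UCD.
rng = np.random.default_rng(0)
print(select_active_coordinate(np.zeros(8), np.full(8, np.inf), rng))
```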
Initialization. In ASCD the initial estimate $\tilde g_0$ of the gradient is just arbitrarily set to $0$, with uncertainty $r_0 = \infty$. Hence in the worst case it takes $\Theta(n \log n)$ iterations until each coordinate gets picked at least once (cf. Dawkins (1991)) and until the corresponding gradient estimates are set to a realistic value. If better estimates of the initial gradient are known, they can be used for the initialization, as long as a strong error bound as in (19) is known as well. For instance, the initialization can be done with $\nabla f(x_0)$ if one is willing to compute this vector in one batch pass.

Convergence Rate Guarantee. We present our first main result, showing that the performance of ASCD is provably between UCD and SCD. First observe that if in Algorithm 1 the gradient oracle is always exact, i.e. $\delta_{ij} \equiv 0$, and if $\tilde g_0$ is initialized with $\nabla f(x_0)$, then in each iteration $|\nabla_{i_t} f(x_t)| = \|\nabla f(x_t)\|_\infty$ and ASCD is identical to SCD.

Lemma 3.1. Let $i_{\max} := \arg\max_{i \in [n]} |\nabla_i f(x_t)|$. Then $i_{\max} \in I_t$, for $I_t$ as in Algorithm 1.

Proof. This is immediate from the definitions of $I_t$ and the upper and lower bounds. Suppose $i_{\max} \notin I_t$; then there exists $j \neq i_{\max}$ with $[\ell_t]_j > [u_t]_{i_{\max}}$, and consequently $|\nabla_j f(x_t)| > |\nabla_{i_{\max}} f(x_t)|$, a contradiction.

Theorem 3.2. Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be convex and coordinate-wise $L$-smooth, let $\tau_{\mathrm{UCD}}, \tau_{\mathrm{SCD}}, \tau_{\mathrm{ASCD}}$ denote the expected one step progress (6) of UCD, SCD and ASCD, respectively, and suppose all methods use the same step-size rule $\mathcal{M}$. Then
$$\tau_{\mathrm{UCD}}(x) \leq \tau_{\mathrm{ASCD}}(x) \leq \tau_{\mathrm{SCD}}(x) \qquad \forall x \in \mathbb{R}^n. \qquad (21)$$

Proof. By (5) we get $\tau_{\mathrm{ASCD}}(x) = \tfrac{1}{2L|I|}\sum_{i \in I} |\nabla_i f(x)|^2$, where $I$ denotes the corresponding index set of ASCD when at iterate $x$. Note that for $j \notin I$ it must hold that $|\nabla_j f(x)|^2 \leq [u]_j^2 < \tfrac{1}{|I|}\sum_{i \in I} [\ell]_i^2 \leq \tfrac{1}{|I|}\sum_{i \in I} |\nabla_i f(x)|^2$ by the definition of $I$. Hence the average of $|\nabla_i f(x)|^2$ over $I$ is at least as large as the average over all of $[n]$, which gives the left inequality; the right inequality holds since an average never exceeds the maximum.

Observe that the above theorem holds for all gradient oracles and coordinate update variants, as long as they are used with the corresponding quality parameters $r$ (as in (19)) and $\delta_{ij}$ (as in (20)) as part of the algorithm.

Heuristic variants. Below we also propose three heuristic variants of ASCD. For all these variants the active set $I_t$ can be computed in $O(n)$ time, but the statement of Theorem 3.2 does not apply. These variants only differ from ASCD in the choice of the active set in Algorithm 1:

u-ASCD: $I_t := \arg\max_{i \in [n]} [u_t]_i$
ℓ-ASCD: $I_t := \arg\max_{i \in [n]} [\ell_t]_i$
a-ASCD: $I_t := \{ i \in [n] \mid [u_t]_i \geq \max_{j \in [n]} [\ell_t]_j \}$

4. Approximate Gradient Update

In this section we argue that for a large class of objective functions of interest in machine learning, the change in the gradient along every coordinate direction can be estimated efficiently.

Lemma 4.1. Consider $F\colon \mathbb{R}^n \to \mathbb{R}$ as in (17) with twice-differentiable $f\colon \mathbb{R}^d \to \mathbb{R}$. Then for two iterates $x_t, x_{t+1} \in \mathbb{R}^n$ of a coordinate descent algorithm, i.e. $x_{t+1} = x_t + \gamma_t e_{i_t}$, there exists an $\bar x \in \mathbb{R}^n$ on the line segment between $x_t$ and $x_{t+1}$, $\bar x \in [x_t, x_{t+1}]$, with
$$\nabla_i F(x_{t+1}) - \nabla_i F(x_t) = \gamma_t \langle a_i, \nabla^2 f(A\bar x)\, a_{i_t}\rangle \qquad \forall i \neq i_t, \qquad (22)$$
where $a_i$ denotes the $i$-th column of the matrix $A$.

Proof. For coordinates $i \neq i_t$ the gradient (or subgradient set) of $\Psi_i([x_t]_i)$ does not change. Hence it suffices to calculate the change of the gradient of the smooth part $x \mapsto f(Ax)$. This is detailed in the appendix.

Least-Squares with Arbitrary Regularizers. The least squares problem is defined as problem (17) with $f(Ax) = \tfrac{1}{2}\|Ax - b\|_2^2$ for a $b \in \mathbb{R}^d$. This function is twice differentiable with $\nabla^2 f \equiv I_d$. Hence (22) reduces to
$$\nabla_i F(x_{t+1}) - \nabla_i F(x_t) = \gamma_t \langle a_i, a_{i_t}\rangle \qquad \forall i \neq i_t. \qquad (23)$$
This formulation gives rise to various gradient oracles (20) for least squares problems. For $i \neq i_t$ we easily verify that the condition (20) is satisfied for:

1. $g^1_{ij} := \langle a_i, a_j \rangle$; $\delta_{ij} = 0$,
2. $g^2_{ij} := \max\{-\|a_i\|\|a_j\|, \min\{S(i, j), \|a_i\|\|a_j\|\}\}$; $\delta_{ij} = \epsilon \|a_i\|\|a_j\|$, where $S\colon [n] \times [n] \to \mathbb{R}$ denotes a function with the property
$$|S(i, j) - \langle a_i, a_j\rangle| \leq \epsilon \|a_i\|\|a_j\| \qquad \forall i, j \in [n], \qquad (24)$$
3. $g^3_{ij} := 0$; $\delta_{ij} = \|a_i\|\|a_j\|$,
4. $g^4_{ij} \sim_{\text{u.a.r.}} [-\|a_i\|\|a_j\|,\ \|a_i\|\|a_j\|]$; $\delta_{ij} = \|a_i\|\|a_j\|$.
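A sketch of the simplest of these oracles for the least squares case, in the interface suggested by (23): each oracle returns an estimate of $\langle a_i, a_j\rangle$ together with its error bound $\delta_{ij}$. The function names and interface are our own; oracle $g^2$ requires a scalar product estimator $S$ as in (24), for which a random-projection construction is sketched at the end of this section.

```python
import numpy as np

def make_least_squares_oracles(A):
    """Gradient oracles g1 and g3 from Sec. 4 for F(x) = 0.5*||Ax - b||^2 + Psi(x).

    By (23), an estimate g_ij of <a_i, a_j> with |<a_i, a_j> - g_ij| <= delta_ij
    yields a delta-gradient oracle in the sense of (20).  Each function below
    returns the pair (g_ij, delta_ij).
    """
    norms = np.linalg.norm(A, axis=0)            # column norms ||a_i||, computed once

    def g1(i, j):                                # exact scalar product, delta = 0 (cost O(d))
        return float(A[:, i] @ A[:, j]), 0.0

    def g3(i, j):                                # trivial estimate 0, delta = ||a_i|| ||a_j||
        return 0.0, float(norms[i] * norms[j])   #   (valid by Cauchy-Schwarz, cost O(1))

    return g1, g3
```

Oracle $g^4$ replaces the zero estimate of $g^3$ by a random guess in the same Cauchy-Schwarz range, and $g^2$ plugs in an approximate scalar product $S(i, j)$ satisfying (24).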
Oracle $g^1$ can be used in the rare cases where the dot product matrix is accessible to the optimization algorithm without any extra cost. In this case the updates will all be exact. If this matrix is not available, then the computation of each scalar product takes time $O(d)$. Hence, they cannot be recomputed on the fly, as argued in Section 2.4. In contrast, the oracles $g^3$ and $g^4$ are extremely cheap to compute, but the error bounds are worse. In the numerical experiments in Section 7 we demonstrate that these oracles perform surprisingly well. The oracle $g^2$ can for instance be realized by low-dimensional embeddings, such as given by the Johnson-Lindenstrauss lemma (cf. Achlioptas (2003); Matoušek (2008)). By embedding each vector in a lower-dimensional space of dimension $O(\epsilon^{-2} \log n)$ and computing the scalar products of the embeddings in time $O(\log n)$, relation (24) is satisfied.

Updating the gradient of the active coordinate. So far we only discussed the update of the passive coordinates. For the active coordinate the best strategy depends on the update rule $\mathcal{M}$ from (18). If exact line search is used, then $\nabla_{i_t} f(x_{t+1}) = 0$. For other update rules we can update the gradient $\nabla_{i_t} f(x_{t+1})$ with the same gradient oracles as for the other coordinates; however, we also need to take into account the change of the gradient of $\Psi_{i_t}([x_t]_{i_t})$. If $\Psi_i$ is simple, like for instance in ridge or lasso, the subgradients at the new point can be computed efficiently.

Bounded variation. In many applications the Hessian $\nabla^2 f(A\bar x)$ is not as simple as in the case of the square loss. If we assume that the Hessian of $f$ is bounded, i.e. $\nabla^2 f(Ax) \preceq M \cdot I_d$ for a constant $M \geq 0$ and all $x \in \mathbb{R}^n$, then it is easy to see that the following holds:
$$-M \|a_i\|\|a_j\| \leq \langle a_i, \nabla^2 f(A\bar x)\, a_j\rangle \leq M \|a_i\|\|a_j\|.$$
Using this relation, we can define gradient oracles for more general functions, by taking the additional approximation factor $M$ into account. The quality can be improved if we have access to local bounds on $\nabla^2 f(Ax)$.

Heuristic variants. By design, ASCD is robust to high errors in the gradient estimations: the steepest descent direction is always contained in the active set. However, instead of using only the very crude oracle $g^4$ to approximate all scalar products, it might be advantageous to compute some scalar products with higher precision. We propose to use a caching technique to compute the scalar products with high precision for all vectors in the active set (storing a matrix of size $O(|I_t| \cdot n)$). This presumably works well if the active set does not change much over time.
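One way to realize the estimator $S(i, j)$ from (24), hinted at by the Johnson-Lindenstrauss argument above, is to sketch the columns of $A$ once with a random projection and take scalar products of the sketches. The code below is an illustration rather than a tuned implementation: the sketch dimension $k$ is left as a free parameter and is not calibrated to a particular $\epsilon$.

```python
import numpy as np

def make_sketched_estimator(A, k, rng):
    """Approximate scalar products <a_i, a_j> from a k-dimensional Gaussian sketch.

    For k = O(eps^-2 log n) the sketched products are, with high probability,
    within eps * ||a_i|| * ||a_j|| of the true ones, which is what (24) asks for
    (up to the failure probability).  Clipping into the Cauchy-Schwarz range,
    as in the definition of oracle g2, keeps the estimate in the admissible interval.
    """
    d = A.shape[0]
    P = rng.standard_normal((k, d)) / np.sqrt(k)   # random projection, drawn once
    SA = P @ A                                     # sketched columns, O(kdn) preprocessing
    norms = np.linalg.norm(A, axis=0)

    def S(i, j):
        est = float(SA[:, i] @ SA[:, j])           # O(k) per query instead of O(d)
        bound = float(norms[i] * norms[j])
        return max(-bound, min(est, bound))        # clip as in oracle g2

    return S
```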
5. Extension to Composite Functions

The key ingredients of ASCD are the coordinate-wise upper and lower bounds on the gradient and the definition of the active set $I_t$, which ensures that the steepest descent direction is always kept and that only provably bad directions are removed from the active set. These ideas can also be generalized to the setting of composite functions (2). We already discussed several generalized GS update rules in Section 2.3. Implementing ASCD for the GS-s rule is straightforward, and we comment on the GS-r rule in the appendix in Sec. D.2. Here we detail the modification for the GS-q rule (16) as an example, which turns out to be the most involved (the same reasoning also applies to the GSL-q rule from (Nutini et al., 2015)). In Algorithm 2 we show the construction of the active set $I$, based just on approximations of the gradient of the smooth part $f$. For this we compute upper and lower bounds $v, w$ on $\min_{y \in \mathbb{R}} V(x, y, \nabla_i f(x))$, such that
$$[v]_i \leq \min_{y \in \mathbb{R}} V(x, y, \nabla_i f(x)) \leq [w]_i \qquad \forall i \in [n]. \qquad (25)$$
The selection of the active coordinate is then based on these bounds. Similar as in Lemma 3.1 and Theorem 3.2, this set has the property $i_{\mathrm{GS}\text{-}q} \in I$, and directions are only discarded in such a way that the efficiency of ASCD-q cannot drop below the efficiency of UCD. The proof can be found in the appendix in Section D.1.

Algorithm 2 Adaptation of ASCD for the GS-q rule
Input: gradient estimate $\tilde g$, error bounds $r$.
For $i \in [n]$ define: (compute upper and lower bounds)
  $[u]_i := [\tilde g]_i + [r]_i$, $\quad [\ell]_i := [\tilde g]_i - [r]_i$
  $[u^\star]_i := \arg\min_{y \in \mathbb{R}} V(x, y, [u]_i)$ (minimize the model)
  $[\ell^\star]_i := \arg\min_{y \in \mathbb{R}} V(x, y, [\ell]_i)$
  (compute upper and lower bounds on $\min_{y \in \mathbb{R}} V(x, y, \nabla_i f(x))$)
  $[\omega_u]_i := V(x, [u^\star]_i, [u]_i) + \max\{0, [u^\star]_i([\ell]_i - [u]_i)\}$
  $[\omega_\ell]_i := V(x, [\ell^\star]_i, [\ell]_i) + \max\{0, [\ell^\star]_i([u]_i - [\ell]_i)\}$
  $[v]_i := \min\{V(x, [u^\star]_i, [u]_i),\; V(x, [\ell^\star]_i, [\ell]_i)\}$
  $[w]_i := \min\{[\omega_u]_i, [\omega_\ell]_i, \Psi_i([x]_i)\}$
$\mathrm{av}(I) := \tfrac{1}{|I|}\sum_{i \in I} [w]_i$
Compute active set $I_t := \arg\min_I \{|I| \mid I \subseteq [n],\; [v]_i > \mathrm{av}(I)\ \forall i \notin I\}$

6. Analysis of Competitive Ratio

In Section 3 we derived in Thm. 3.2 that the one step progress of ASCD lies between the bounds on the one step progress of UCD and SCD. However, we know that the efficiency of the latter two methods can differ by up to a factor of $n$. In this section we will argue that in certain cases where SCD performs much better than UCD, ASCD will accelerate as well. To measure this effect, we could for instance consider the ratio of the one step progress of ASCD to the one step progress of SCD,
$$\varrho_t := \frac{\tfrac{1}{|I_t|}\sum_{i \in I_t} |\nabla_i f(x_t)|^2}{\|\nabla f(x_t)\|_\infty^2}. \qquad (26)$$
For general functions this expression is a bit cumbersome to study, therefore we restrict our discussion to the class of objective functions (11) as introduced in Sec. 2.2. Of course not all real-world objective functions will fall into this class, however this problem class is still very interesting in our study, as we will see in the following, because it highlights the ability (or inability) of the algorithms to eventually identify the right set of active coordinates.

For functions with the structure (11) (and $q$ as in Thm. 2.2), only the first $s$ coordinates carry nonzero gradient entries. Hence it is reasonable to approximate $\varrho_t$ by the competitive ratio
$$\rho_t := \frac{|I_t \cap [s]|}{|I_t|}. \qquad (27)$$
It is also reasonable to assume that in the limit ($t \to \infty$) a constant fraction of $[s]$ will be contained in the active set $I_t$ (it might not hold that $[s] \subseteq I_t$ for all $t$, as for instance with exact line search the directional derivative vanishes just after the update). In the following theorem we calculate $\rho_t$ for $t \to \infty$; the proof is given in the appendix.

Theorem 6.1. Let $f\colon \mathbb{R}^n \to \mathbb{R}$ be of the form (11). For indices $i \notin [s]$ define $K_i := \{t \mid i \notin I_t, i \in I_{t-1}\}$. For $j \in K_i$ define $T^i_j := \min\{t \mid i \in I_{j+t}\}$, i.e. the number of iterations outside the active set, $T^i := \lim_{k \to \infty} \mathbb{E}_{j \in K_i}\left[T^i_j \mid j > k\right]$, and the average $\overline T := \mathbb{E}_{i \notin [s]}\left[T^i\right]$. If there exists a constant $c > 0$ such that $\lim_{t \to \infty} |[s] \cap I_t| = cs$, then (with the notation $\bar\rho := \lim_{t \to \infty} \mathbb{E}[\rho_t]$)
$$\bar\rho \geq \frac{2cs}{cs + n - s - \overline T + \sqrt{\theta}}, \qquad (28)$$
where
$$\theta := n^2 + (c-1)^2 s^2 + 2n\big((c-1)s - \overline T\big) + 2(1+c)s\overline T + \overline T^2.$$
Especially, $\bar\rho \geq 1 - \tfrac{n-s}{\overline T}$.

[Figure 1: three panels showing the competitive ratio over epochs (measured, exact, lower bound), one panel per value of $\overline T$.]
Figure 1. Competitive ratio $\rho_t$ (blue) in comparison with $\bar\rho$ from (28) (red) and the lower bound $\bar\rho \geq 1 - \tfrac{n-s}{\overline T}$ (black). Simulation for parameters $n = 100$, $s = 10$, $c = 1$ and $\overline T \in \{50, 100, 400\}$.
In Figure 1 we compare the lower bound (28) on the competitive ratio in the limit ($t \to \infty$) with actual measurements of $\rho_t$ for a simulated example with parameters $n = 100$, $s = 10$, $c = 1$ and various $\overline T \in \{50, 100, 400\}$. We initialized the active set as $I_0 = [s]$, but we see that the equilibrium is reached quickly.

6.1. Estimates of the competitive ratio

Based on Thm. 6.1, we can now estimate the competitive ratio in various scenarios. On the class (11) it holds $c \approx 1$, as we argued before. Hence the competitive ratio (28) just depends on $\overline T$. This quantity measures how many iterations a coordinate $j \notin [s]$ spends on average outside of the active set $I_t$. From the lower bound we see that the competitive ratio $\rho_t$ approaches a constant for $t \to \infty$ if $\overline T = \Theta(n)$; for instance $\bar\rho \geq 0.8$ if $\overline T \geq 5n$.

As an approximation to $\overline T$, we estimate the quantities $T^j_{t_0}$ defined in Thm. 6.1. $T^j_{t_0}$ denotes the number of iterations it takes until coordinate $j$ enters the active set again, assuming it left the active set at iteration $t_0 - 1$. We estimate $T^j_{t_0} \approx \hat T$, where $\hat T$ denotes the maximum number of iterations such that
$$\sum_{t = t_0}^{t_0 + \hat T} |\gamma_t|\, \delta_{i_t j} \leq |\nabla_k f(x_{t_0 + \hat T})|, \qquad k \in [s],\ j \notin [s]. \qquad (29)$$
For smooth functions, the steps satisfy $\gamma_t = \Theta(|\nabla_{i_t} f(x_t)|)$, and if we additionally assume that the errors of the gradient oracle are uniformly bounded, $\delta_{ij} \leq \delta$, the sum in (29) simplifies to $\delta \sum_{t = t_0}^{t_0 + \hat T} |\nabla_{i_t} f(x_t)|$.

For a smooth, but not strongly convex function $q$, the norm of the gradient changes very slowly, with a rate independent of $s$ or $n$, and we get $\hat T = \Theta(\tfrac{1}{\delta})$. Hence, the competitive ratio is constant for $\delta = \Theta(\tfrac{1}{n})$. For a strongly convex function $q$, the norm of the gradient decreases linearly, say $\|\nabla f(x_t)\|_2^2 \leq e^{-\kappa t}$ for $\kappa = \Theta(\tfrac{1}{s})$, i.e. it decreases by half after every $\Theta(s)$ iterations. Therefore, to guarantee $\hat T = \Theta(n)$ it needs to hold that $\delta = e^{-\Theta(n/s)}$. This result seems to indicate that the use of ASCD is only justified if $s$ is large, for instance $s \geq \tfrac{1}{4}n$. Otherwise the convergence on $q$ is too fast, and the gradient approximations are too weak. However, notice that we assumed $\delta$ to be a uniform bound on all errors. If the errors have a large discrepancy, the estimates become much better (this holds for instance on datasets where the norms of the data vectors differ a lot, or when caching techniques as mentioned in Sec. 4 are employed).

7. Empirical Observations

In this section we evaluate the empirical performance of ASCD on synthetic and real datasets. We consider the following regularized general linear models:
$$\min_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \tfrac{\lambda}{2}\|x\|_2^2, \qquad (30)$$
$$\min_{x \in \mathbb{R}^n}\ \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1, \qquad (31)$$
that is, $\ell_2$-regularized least squares (30) as well as $\ell_1$-regularized linear regression (Lasso) in (31), respectively.

Datasets. The data matrices $A \in \mathbb{R}^{d \times n}$ in problems (30) and (31) were chosen as follows for our experiments. For the synthetic data, we follow the same generation procedure as described in (Nutini et al., 2015), which generates very sparse data matrices. For completeness, full details of the data generation process are also provided in the appendix in Sec. E. For the synthetic data we choose $n = 5000$ for problem (31) and $n = 1000$ for problem (30). The dimension $d = 1000$ is fixed for both cases. For the real datasets, we perform the experimental evaluation on RCV1 (binary, training), which consists of 20,242 samples, each of dimension 47,236 (Lewis et al., 2004). We use the un-normalized version with all non-zero values set to 1 (bag-of-words features).

Gradient oracles and implementation details. On the RCV1 dataset, we approximate the scalar products with the oracle $g^4$ that was introduced in Sec. 4.
This oracle is extremely cheap to compute, as the norms $\|a_i\|$ of the columns of $A$ only need to be computed once. On the synthetic data, we simulate the oracle $g^2$ for various precision values $\epsilon$. For this, we sample a value uniformly at random from the allowed error interval (24). Figs. 2d and 3d show the convergence for different accuracies. For the $\ell_1$-regularized problems, we used ASCD with the GS-s rule (the experiments in (Nutini et al., 2015) revealed almost identical performance of the different GS rules).

We compare the performance of UCD, SCD and ASCD. We also implement the heuristic version a-ASCD that was introduced in Sec. 3. All algorithm variants use the same step size rule (i.e. the method $\mathcal{M}$ in Algorithm 1). We use exact line search for the experiment in Fig. 3c; for all others we used a fixed step size rule (the convergence is slower for all algorithms, but the different effects of the selection of the active coordinate are more distinctly visible). ASCD is either initialized with the true gradient (Figs. 2a, 2b, 2d, 3c, 3d) or arbitrarily, with error bounds set to $\infty$, in Figs. 3a and 3b (Fig. 2c compares both initializations). Fig. 2 shows results on the synthetic data, Fig. 3 on the RCV1 dataset. All plots also show the size of the active set $I_t$. The plots in Figs. 3c and 3d are generated on a restriction of RCV1 to 10,000 and 5,000 randomly chosen columns, respectively.

Figure 2. Experimental results on synthetically generated datasets. Panels: (a) convergence for $\ell_2$, (b) convergence for $\ell_1$, (c) true vs. no initialization for $\ell_2$, (d) error variation (ASCD).
Figure 3. Experimental results on the RCV1-binary dataset. Panels: (a) convergence for $\ell_2$, (b) convergence for $\ell_1$, (c) line search for $\ell_1$, (d) error variation (ASCD).

Here are the highlights of our experimental study:

1. No initialization needed. We observe (see e.g. Figs. 2c, 3a, 3b) that initialization with the true gradient values is not needed at the beginning of the optimization process (the cost of the initialization being as expensive as one epoch of ASCD). Instead, the algorithm performs strongly in terms of learning the active set on its own, and the set converges very fast, after just one epoch.

2. High error tolerance. The gradient oracle $g^4$ gives very crude approximations, yet the convergence of ASCD is excellent on RCV1 (Fig. 3). Here the size of the true active set is very small (on the order of 0.1% on RCV1) and ASCD is able to identify this set. Fig. 3d shows that almost nothing can be gained from more precise (and more expensive) oracles.

3. Heuristic a-ASCD performs well. The convergence behavior of ASCD follows the theory. The heuristic version a-ASCD (which computes the active set slightly faster, but for which Thm. 3.2 does not hold) performs identically to ASCD in practice (cf. Figs. 2, 3), and sometimes slightly better. This is explained by the active set used in ASCD typically being larger than the active set of a-ASCD (Figs. 2a, 2b, 3a, 3b).

8. Concluding Remarks

We proposed ASCD, a novel selection mechanism for the active coordinate in CD methods. Our scheme enjoys three favorable properties: (i) its performance can reach the performance of steepest CD both in theory and practice, (ii) the performance is never worse than uniform CD, and (iii) in many important applications, the scheme can be implemented at no extra cost per iteration. ASCD calculates the active set in a safe manner, and picks the active coordinate uniformly at random from this smaller set.
It seems possible that an adaptive sampling strategy on the active set could boost the performance even further. Here we only studied CD methods where a single coordinate gets updated in each iteration; ASCD can immediately be generalized to block-coordinate descent methods as well. However, the exact implementation in a distributed setting can be challenging. Finally, it is an interesting direction to extend ASCD also to the stochastic gradient descent setting (not only heuristically, but with the same strong guarantees as derived in this paper).

References

Achlioptas, Dimitris. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.

Allen-Zhu, Z., Qu, Z., Richtárik, P., and Yuan, Y. Even faster accelerated coordinate descent using non-uniform sampling. 2016.

Boyd, Stephen P. and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Csiba, Dominik, Qu, Zheng, and Richtárik, Peter. Stochastic Dual Coordinate Ascent with Adaptive Probabilities. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, 2015.

Dawkins, Brian. Siobhan's problem: The coupon collector revisited. The American Statistician, 45(1):76–82, 1991.

Dhillon, Inderjit S., Ravikumar, Pradeep, and Tewari, Ambuj. Nearest Neighbor based Greedy Coordinate Descent. In NIPS 2011 - Advances in Neural Information Processing Systems 24, 2011.

Friedman, Jerome, Hastie, Trevor, Höfling, Holger, and Tibshirani, Robert. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010.

Fu, Wenjiang J. Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.

Hsieh, Cho-Jui, Chang, Kai-Wei, Lin, Chih-Jen, Keerthi, S. Sathiya, and Sundararajan, S. A Dual Coordinate Descent Method for Large-scale Linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pp. 408–415, New York, USA, 2008.

Kim, Hyunsoo and Park, Haesun. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.

Lee, Daniel D. and Seung, H. Sebastian. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

Lewis, David D., Yang, Yiming, Rose, Tony G., and Li, Fan. RCV1: A new benchmark collection for text categorization research. JMLR, 5:361–397, 2004.

Matoušek, Jiří. On variants of the Johnson-Lindenstrauss lemma. Random Structures & Algorithms, 33(2):142–156, 2008.

Nesterov, Yu. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341–362, 2012.

Nesterov, Yurii and Stich, Sebastian U. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110–123, 2017.

Nutini, Julie, Schmidt, Mark W., Laradji, Issam H., Friedlander, Michael P., and Koepke, Hoyt A. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. In ICML, pp. 1632–1641, 2015.
Osokin, Anton, Alayrac, Jean-Baptiste, Lukasewitz, Isabella, Dokania, Puneet K., and Lacoste-Julien, Simon. Minding the Gaps for Block Frank-Wolfe Optimization of Structured SVMs. In ICML 2016 - Proceedings of the 33rd International Conference on Machine Learning, pp. 593–602. PMLR, 2016.

Papa, Guillaume, Bianchi, Pascal, and Clémençon, Stéphan. Adaptive Sampling for Incremental Optimization Using Stochastic Gradient Descent. In ALT 2015 - 26th International Conference on Algorithmic Learning Theory, pp. 317–331, 2015.

Perekrestenko, Dmytro, Cevher, Volkan, and Jaggi, Martin. Faster Coordinate Descent via Adaptive Importance Sampling. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pp. 869–877, Fort Lauderdale, FL, USA, 2017. PMLR.

Qu, Zheng and Richtárik, Peter. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829–857, 2016.

Richtárik, Peter and Takáč, Martin. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433–484, 2016.

Shalev-Shwartz, Shai and Tewari, Ambuj. Stochastic Methods for ℓ1-regularized Loss Minimization. JMLR, 12:1865–1892, 2011.

Shalev-Shwartz, Shai and Zhang, Tong. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. JMLR, 14:567–599, 2013.

Shrivastava, Anshumali and Li, Ping. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). In NIPS 2014 - Advances in Neural Information Processing Systems 27, pp. 2321–2329, 2014.

Tseng, Paul and Yun, Sangwoon. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117(1):387–423, 2009.

Wen, Zaiwen, Yin, Wotao, Zhang, Hongchao, and Goldfarb, Donald. On the convergence of an active-set method for ℓ1 minimization. Optimization Methods and Software, 27(6):1127–1146, 2012.

Wright, Stephen J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Wu, Tong Tong and Lange, Kenneth. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.

Zhao, Peilin and Zhang, Tong. Stochastic optimization with importance sampling for regularized loss minimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1–9, Lille, France, 2015. PMLR.