# Stochastic Spectral and Conjugate Descent Methods

Dmitry Kovalev^{1,2}, Eduard Gorbunov^{1}, Elnur Gasanov^{1,2}, Peter Richtárik^{2,3,1}

^{1}Moscow Institute of Physics and Technology, Dolgoprudny, Russia
^{2}King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
^{3}University of Edinburgh, Edinburgh, United Kingdom

Abstract. The state-of-the-art methods for solving optimization problems in big dimensions are variants of randomized coordinate descent (RCD). In this paper we introduce a fundamentally new type of acceleration strategy for RCD based on the augmentation of the set of coordinate directions by a few spectral or conjugate directions. As we increase the number of extra directions to be sampled from, the rate of the method improves, and interpolates between the linear rate of RCD and a linear rate independent of the condition number. We also develop and analyze inexact variants of these methods, in which the spectral and conjugate directions are allowed to be approximate only. We motivate the above development by proving several negative results which highlight the limitations of RCD with importance sampling.

1 Introduction

An increasing array of learning and training tasks reduces to optimization problems in very large dimensions. The state-of-the-art algorithms in this regime are based on randomized coordinate descent (RCD). Various acceleration strategies for RCD have been proposed in the literature in recent years, based on techniques such as Nesterov's momentum [12, 9, 5, 1, 14], heavy ball momentum [16, 11], importance sampling [13, 19], adaptive sampling [4], random permutations [8], greedy rules [15], mini-batching [20], and locality breaking [21]. These techniques enable faster rates in theory and practice.

In this paper we introduce a fundamentally new type of acceleration strategy for RCD which relies on the idea of enriching the set of (unit) coordinate directions {e_1, e_2, ..., e_n} in ℝ^n, which are used in RCD as directions of descent, via the addition of a few spectral or conjugate directions. The algorithms we develop and analyze in this paper randomize over this enriched, larger set of directions. For expositional simplicity¹, we focus on quadratic minimization

$$\min_{x \in \mathbb{R}^n} f(x) := \tfrac{1}{2} x^\top A x - b^\top x, \qquad (1)$$

where A is an n × n symmetric positive definite matrix. The optimal solution is unique, and equal to x* = A^{-1} b.

¹Many of our results can be extended to convex functions of the form f(x) = φ(Ax) − b^⊤x, where φ is a smooth and strongly convex function. However, due to space limitations, and the fact that we already have a lot to say in the special case φ(y) = ½‖y‖², we leave these more general developments to a follow-up paper.

| Method Name | Algorithm | Rate | Reference |
| --- | --- | --- | --- |
| stochastic descent (SD) | (4), Alg 1 | (5), Lem 1 | Gower and Richtárik [6] |
| stochastic spectral descent (SSD) | Alg 2 | (6), Thm 2 | NEW |
| stochastic conjugate descent (SconD) | Sec 2.2 | Thm 2 | NEW |
| randomized coordinate descent (RCD) | (2), Alg 3 | (3), (11) | Gower and Richtárik [6] |
| stochastic spectral coord. descent (SSCD) | Alg 4 | (7), Thm 8 | NEW |
| mini-batch SD (mSD) | Alg 5 | Lem 9 | Richtárik and Takáč [18] |
| mini-batch SSCD (mSSCD) | Alg 6 | Thm 10 | NEW |
| inexact SconD (iSconD) | Alg 7 | Thm 15 | NEW |
| inexact SSD (iSSD) | Alg 8 | Sec F.2 | NEW |

Table 1: Algorithms described in this paper.

1.1 Randomized coordinate descent

Applied to (1), RCD performs the iteration

$$x^{t+1} = x^t - \frac{A_{i:} x^t - b_i}{A_{ii}}\, e_i, \qquad (2)$$
where at each iteration, i is chosen with probability p_i > 0. It was shown by Leventhal and Lewis [10] that if the probabilities are proportional to the diagonal elements of A (i.e., p_i ∝ A_ii), then the random iterates of RCD satisfy E‖x^t − x*‖²_A ≤ (1 − ρ)^t ‖x^0 − x*‖²_A, where ρ = λ_min(A)/Tr(A) and λ_min(A) is the minimal eigenvalue of A. That is, as long as the number of iterations t is at least

$$\frac{\mathrm{Tr}(A)}{\lambda_{\min}(A)} \log\frac{1}{\epsilon}, \qquad (3)$$

we have E[‖x^t − x*‖²_A] ≤ ε ‖x^0 − x*‖²_A. Note that Tr(A)/λ_min(A) ≥ n, and that this ratio can be arbitrarily larger than n.

1.2 Stochastic descent

Recently, Gower and Richtárik [6] developed an iterative "sketch and project" framework for solving linear systems and quadratic optimization; see [7] for extensions. In the context of problem (1), their method reads

$$x^{t+1} = x^t - \frac{s_t^\top (A x^t - b)}{s_t^\top A s_t}\, s_t, \qquad (4)$$

where s_t ∈ ℝ^n is a random vector sampled from some fixed distribution D. In this paper we will refer to this method as stochastic descent (SD). Note that x^{t+1} is obtained from x^t by minimizing f(x^t + h s_t) over h ∈ ℝ and setting x^{t+1} = x^t + h s_t. Further, note that RCD arises as the special case with D being a discrete probability distribution over the set {e_1, ..., e_n}. However, SD converges for virtually any distribution D, including discrete and continuous distributions. In particular, Gower and Richtárik [6] show that as long as E_{s∼D}[H] is invertible, where H := s s^⊤ / (s^⊤ A s), then SD converges as

$$O\!\left(\frac{1}{\lambda_{\min}(W)} \log\frac{1}{\epsilon}\right), \qquad \text{where } W := \mathbb{E}_{s \sim D}\!\left[A^{1/2} H A^{1/2}\right] \qquad (5)$$

(see Lemma 1 for a more refined result due to Richtárik and Takáč [18]). The rate of RCD in (3) can be obtained as a special case of (5).

1.3 Stochastic spectral and conjugate descent

The starting point of this paper is the new observation that stochastic descent attains the rate

$$O\!\left(n \log\frac{1}{\epsilon}\right) \qquad (6)$$

in the special case when D is chosen to be the uniform distribution over the eigenvectors of A (see Theorem 2). For obvious reasons, we refer to this new method as stochastic spectral descent (SSD). To the best of our knowledge, SSD has not been explicitly considered in the literature before. We should note that SSD is fundamentally different from spectral gradient descent [3, 2], which refers to a family of gradient methods with a special choice of stepsize depending on the spectrum of ∇²f. The rate (6) does not merely provide an improvement on the rate of RCD given in (3); what is remarkable is that this rate is completely independent of the properties (such as conditioning) of A.

| Result | Thm |
| --- | --- |
| Uniform probabilities are optimal for n = 2 | 3 |
| Uniform probabilities are optimal for any n ≥ 2 as long as A is diagonal | 4 |
| Importance sampling p_i ∝ A_ii can lead to an arbitrarily worse rate than p_i = 1/n | 5 |
| Importance sampling p_i ∝ ‖A_{i:}‖² can lead to an arbitrarily worse rate than p_i = 1/n | 5 |
| For every n ≥ 2 and T > 0, ∃ A: rate of RCD with opt. probabilities is O(T log(1/ε)) | 6 |
| For every n ≥ 2 and T > 0, ∃ A: rate of RCD with opt. probabilities is Ω(T log(1/ε)) | 7 |

Table 2: Summary of results on importance and optimal sampling in RCD.

Moreover, we show that this method is optimal among the class of stochastic descent methods (4) parameterized by the choice of the distribution D (see Theorem 8). Despite the attractiveness of its rate, SSD is not a practical method. This is because once we have the eigenvectors of A available, the optimal solution x* can be assembled directly without the need for an iterative method. We extend all results discussed above for SSD, including the rate (6), to the more general class of methods we call stochastic conjugate descent (SconD), for which D is the uniform distribution over vectors v_1, ..., v_n which are mutually A-conjugate: v_i^⊤ A v_j = 0 for i ≠ j and v_i^⊤ A v_i = 1.
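To make the preceding definitions concrete, the following sketch (our own illustration, assuming numpy; none of the names come from the paper) implements the stochastic descent step (4) and instantiates the three direction sets discussed above: coordinate vectors (RCD), eigenvectors (SSD), and A-conjugate vectors obtained here from a Cholesky factor (SconD).

```python
import numpy as np

def sd_step(x, A, b, s):
    """One stochastic descent step (4): exact line search along direction s."""
    r = A @ x - b                              # gradient of f at x
    return x - (s @ r) / (s @ A @ s) * s

def run_sd(A, b, x0, directions, probs, iters, rng):
    """Run SD with D = discrete distribution over the columns of `directions`."""
    x = x0.copy()
    for _ in range(iters):
        i = rng.choice(directions.shape[1], p=probs)
        x = sd_step(x, A, b, directions[:, i])
    return x

rng = np.random.default_rng(0)
n = 50
M = rng.standard_normal((n, n))
A = M @ M.T + np.eye(n)                        # symmetric positive definite
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)
x0 = np.zeros(n)

E = np.eye(n)                                  # coordinate directions -> RCD
lam, U = np.linalg.eigh(A)                     # eigenvectors -> SSD
V = np.linalg.inv(np.linalg.cholesky(A)).T     # A-conjugate directions -> SconD

uniform = np.full(n, 1.0 / n)
for name, D in [("RCD", E), ("SSD", U), ("SconD", V)]:
    x = run_sd(A, b, x0, D, uniform, 5 * n, rng)
    print(name, np.sqrt((x - x_star) @ A @ (x - x_star)))   # error in the A-norm
```

With the uniform distribution over eigenvectors or conjugate directions, roughly n log(1/ε) steps suffice regardless of the conditioning of A, whereas the progress of the coordinate version depends on the spectrum, as quantified by (3) and (5).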
1.4 Optimizing probabilities in RCD

The idea of speeding up RCD via the use of non-uniform probabilities was pioneered by Nesterov [13] in the context of smooth convex minimization, and later built on by many authors [19, 17, 1]. In the case of non-accelerated RCD, and in the context of smooth convex optimization, the most popular choice of probabilities is to set p_i ∝ L_i, where L_i is the Lipschitz constant of the gradient of the objective corresponding to coordinate i [13, 19]. For problem (1), we have L_i = A_ii. Gower and Richtárik [6] showed that the optimal probabilities for (1) can in principle be computed through semidefinite programming (SDP); however, no theoretical properties of the optimal solution of the SDP were given.

As a warm-up, we first ask: how important is importance sampling? More precisely, we investigate RCD with probabilities p_i ∝ A_ii, and RCD with probabilities p_i ∝ ‖A_{i:}‖², both commonly viewed as "RCD with importance sampling", and compare these with the baseline RCD with uniform probabilities. Our result (see Theorem 5) contradicts conventional "wisdom". In particular, we show that for every n there is a matrix A for which uniform probabilities lead to the best rate. Moreover, the rate of RCD with importance sampling can be arbitrarily worse than the rate of RCD with uniform probabilities. The same result applies to probabilities proportional to the squared norm of the ith row of A.

We then switch gears and, motivated by the nature of SSD, we ask the following question: in order to obtain a condition-number-independent rate such as (6), do we have to consider new (and hard to compute) descent directions, such as eigenvectors of A, or can a similar effect be obtained using RCD with a better selection of probabilities? We give two negative results to this question (see Theorems 6 and 7). First, we show that for any n ≥ 2 and any T > 0, there is a matrix A such that the rate of RCD with any probabilities (including the optimal probabilities) is O(T log(1/ε)). Second, we give a similar but much stronger statement where we reach the same conclusion, but for the lower bound as opposed to the upper bound; that is, O is replaced by Ω. As a by-product of our investigations into importance sampling, we establish that for n = 2, uniform probabilities are optimal for all matrices A (see Theorem 3). For a summary of these results, see Table 2.

1.5 Interpolating between RCD and SSD

RCD and SSD lie on opposite ends of a continuum of stochastic descent methods for solving (1). RCD minimizes the work per iteration without any regard for the number of iterations, while SSD minimizes the number of iterations without any regard for the cost per iteration (or the pre-processing cost). Indeed, one step of RCD costs O(‖A_{i:}‖_0) (the number of nonzero entries in the ith row of A), and hence RCD can be implemented very efficiently for sparse A. If uniform probabilities are used, no pre-processing (for computing probabilities) is needed. These advantages are paid for by the rate (3), which can be arbitrarily high. On the other hand, the rate of SSD does not depend on A. This advantage is paid for by a high pre-processing cost: the computation of the eigenvectors. This pre-processing cost makes the method utterly impractical.
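Returning briefly to the importance-sampling question of Section 1.4, the comparison can be probed numerically. The sketch below (assuming numpy; the toy matrix is ours and is not one of the matrices constructed in the proofs) computes λ_min(W) from (5) for uniform, diagonal and squared-row-norm probabilities; by (5), the iteration complexity of RCD scales as 1/λ_min(W).

```python
import numpy as np

def rcd_complexity(A, p):
    """Iteration complexity 1 / lambda_min(W) of RCD (up to the log(1/eps)
    factor), with W = E[A^{1/2} H A^{1/2}] and H = e_i e_i^T / A_ii w.p. p_i."""
    lam, U = np.linalg.eigh(A)
    A_half = U @ np.diag(np.sqrt(lam)) @ U.T
    W = sum(p[i] * np.outer(A_half[:, i], A_half[:, i]) / A[i, i]
            for i in range(len(p)))
    return 1.0 / np.linalg.eigvalsh(W).min()

# A toy diagonal matrix with one large entry (ours, for illustration only).
A = np.diag([1.0, 1.0, 1.0, 1.0, 100.0])
n = A.shape[0]

uniform  = np.full(n, 1.0 / n)
diagonal = np.diag(A) / np.trace(A)              # p_i ~ A_ii
row_sq   = np.linalg.norm(A, axis=1) ** 2
row_sq  /= row_sq.sum()                          # p_i ~ ||A_i:||^2

for name, p in [("uniform", uniform), ("p_i ~ A_ii", diagonal), ("p_i ~ ||A_i:||^2", row_sq)]:
    print(name, rcd_complexity(A, p))
```

For this matrix, uniform probabilities give complexity n = 5, whereas p_i ∝ A_ii gives Tr(A)/λ_min(A) = 104, and the squared-row-norm choice is worse still, matching the message of Theorems 4 and 5.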
One of the main contributions of this paper is the development of a new parametric family of algorithms which, in some sense, interpolates between RCD and SSD. In particular, we consider the stochastic descent algorithm (4) with D being a discrete distribution over the search directions {e_1, ..., e_n} ∪ {u_1, ..., u_k}, where u_i is the eigenvector of A corresponding to the ith smallest eigenvalue of A. We call this new method stochastic spectral coordinate descent (SSCD). We compute the optimal probabilities of this distribution, which turn out to be unique, and show that for k ≥ 1 they depend on the k + 1 smallest eigenvalues of A: 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_{k+1}. In particular, we prove (see Theorem 8) that the rate of SSCD with optimal probabilities is

$$O\!\left(\frac{(k+1)\lambda_{k+1} + \sum_{i=k+2}^n \lambda_i}{\lambda_{k+1}} \log\frac{1}{\epsilon}\right). \qquad (7)$$

For k = 0, SSCD reduces to RCD with p_i ∝ A_ii, and the rate (7) reduces to (3). For k = n − 1, SSCD does not reduce to SSD; however, the rates match: in this case the rate (7) reduces to (6). Moreover, the rate improves monotonically as k increases, from O(Tr(A)/λ_min(A) · log(1/ε)) (for k = 0) to O(n log(1/ε)) (for k = n − 1).

SSCD removes the effect of the k smallest eigenvalues. Note that the rate (7) does not depend on the k smallest eigenvalues of A. That is, by adding the eigenvectors u_1, ..., u_k corresponding to the k smallest eigenvalues to the set of descent directions, we have removed the effect of these eigenvalues.

| | general spectrum | n − k largest eigenvalues are γ-clustered: c ≤ λ_i ≤ γc for k + 1 ≤ i ≤ n | α-exponentially decaying eigenvalues |
| --- | --- | --- | --- |
| RCD (p_i ∝ A_ii) | Tr(A)/λ_1 | Tr(A)/λ_1 | 1/α^{n−1} |
| SSCD | ((k+1)λ_{k+1} + Σ_{i=k+2}^n λ_i)/λ_{k+1} | γn | 1/α^{n−k−1} |
| SSD | n | n | n |

Table 3: Comparison of complexities of RCD, SSCD (with parameter 0 ≤ k ≤ n − 1) and SSD under various regimes on the spectrum of A. In all terms we suppress a log(1/ε) factor.

Clustered eigenvalues. Assume that the n − k largest eigenvalues are clustered: c ≤ λ_i ≤ γc for some c > 0 and γ > 1, for all k + 1 ≤ i ≤ n. In this case, the rate (7) can be estimated as a function of the clustering tightness parameter γ: O(γ n log(1/ε)). See Table 3. This can be arbitrarily better than the rate of RCD, even for k = 1. In other words, there are situations where, by enriching the set of directions used by RCD by a single eigenvector only, the resulting method accelerates dramatically. To give a concrete and simplified example, assume that λ_1 = δ > 0, while λ_2 = ... = λ_n = 1. In this case, RCD has the rate O((1 + (n−1)/δ) log(1/ε)), while SSCD with k = 1 has the rate O(n log(1/ε)). So, SSCD is roughly 1/δ times better than RCD, and the difference grows to infinity as δ approaches zero, even for fixed dimension n.

Exponentially decaying eigenvalues. If the eigenvalues of A follow an exponential decay with factor 0 < α < 1, then the rate of RCD is O(α^{−(n−1)} log(1/ε)), while the rate of SSCD is O(α^{−(n−k−1)} log(1/ε)). This is an improvement by the factor 1/α^k, which can be very large even for small k if α is small. See Table 3. For an experimental confirmation of this prediction, see Figure 5.

Adding a few largest eigenvectors does not help. We show that, in contrast with the situation above, adding a few of the largest eigenvectors to the coordinate directions of RCD does not help. This is captured formally in the supplementary material as Theorem 12.

Mini-batching. We extend SSCD to a mini-batch setting; we call the new method mSSCD. We show that the rate of mSSCD interpolates between the rate of mini-batch RCD and the rate of SSD. Moreover, we show that mSSCD is optimal among a certain parametric family of methods, and that its rate improves as k increases. See Theorem 10.
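Before moving on, the following sketch (assuming numpy; the spectra are illustrative choices of ours) evaluates the complexity (7) under the two spectral regimes of Table 3 as k varies.

```python
import numpy as np

def sscd_complexity(eigs, k):
    """Iteration complexity (7) of SSCD, up to the log(1/eps) factor.

    eigs: eigenvalues of A in increasing order; k: number of smallest
    eigenvectors added to the coordinate directions (k = 0 recovers RCD)."""
    eigs = np.sort(eigs)
    return ((k + 1) * eigs[k] + eigs[k + 1:].sum()) / eigs[k]

n = 30
# Clustered spectrum: one tiny eigenvalue delta, the rest equal to 1.
delta = 1e-4
clustered = np.array([delta] + [1.0] * (n - 1))
# Exponentially decaying spectrum with factor alpha.
alpha = 0.5
decaying = alpha ** np.arange(n - 1, -1, -1.0)

for k in [0, 1, 5, n - 1]:
    print(k, sscd_complexity(clustered, k), sscd_complexity(decaying, k))
```

For the clustered spectrum this reproduces the 1 + (n−1)/δ versus n comparison made above; for the decaying spectrum the complexity drops by roughly a factor of 1/α per unit increase of k.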
1.6 Inexact Directions

Finally, we relax the need to compute exact eigenvectors or A-conjugate vectors, and analyze the behavior of our methods with inexact directions. Moreover, we propose and analyze an inexact variant of SSD which does not arise as a special case of SD. See Sections E and F.

2 Stochastic Descent

The stochastic descent method was described in (4). We now formalize it as Algorithm 1, and equip it with a stepsize, which will be useful in Section A.1, where we study the mini-batch version of SD.

Algorithm 1: Stochastic Descent (SD)
Parameters: distribution D; stepsize parameter ω > 0
Initialize: choose x^0 ∈ ℝ^n
for t = 0, 1, 2, ... do
    Sample search direction s_t ∼ D and set x^{t+1} = x^t − ω (s_t^⊤(Ax^t − b))/(s_t^⊤ A s_t) · s_t
end for

In order to guarantee convergence of SD, we restrict our attention to the class of proper distributions.

Assumption 1. Distribution D is proper with respect to A. That is, E_{s∼D}[H] is invertible, where

$$H := \frac{s s^\top}{s^\top A s}. \qquad (8)$$

Next we present the main convergence result for SD.

Lemma 1 (Convergence of stochastic descent [6, 18]). Let D be proper with respect to A, and let 0 < ω < 2. Stochastic descent (Algorithm 1) converges linearly in expectation,

$$\mathbb{E}\,\|x^t - x^*\|_A^2 \le \left(1 - \omega(2-\omega)\lambda_{\min}(W)\right)^t \|x^0 - x^*\|_A^2,$$

and we also have the lower bound

$$\left(1 - \omega(2-\omega)\lambda_{\max}(W)\right)^t \|x^0 - x^*\|_A^2 \le \mathbb{E}\,\|x^t - x^*\|_A^2, \qquad \text{where } W := \mathbb{E}_{s\sim D}\!\left[A^{1/2} H A^{1/2}\right]. \qquad (9)$$

Finally, the statement remains true if we replace ‖x^t − x*‖²_A by f(x^t) − f(x*) for all t.

It is easy to observe that the stepsize choice ω = 1 is optimal. This is why we have decided to present the SD method (4) with this choice of stepsize. Moreover, notice that due to linearity of expectation,

$$\mathrm{Tr}(W) \overset{(9)}{=} \mathbb{E}\!\left[\mathrm{Tr}\!\left(A^{1/2} H A^{1/2}\right)\right] \overset{(8)}{=} \mathbb{E}\!\left[\mathrm{Tr}\!\left(\frac{z z^\top}{z^\top z}\right)\right] = 1, \qquad \text{where } z = A^{1/2} s.$$

Therefore, 0 < λ_min(W) ≤ 1/n ≤ λ_max(W) ≤ 1.

2.1 Stochastic Spectral Descent

Let A = Σ_{i=1}^n λ_i u_i u_i^⊤ be the eigenvalue decomposition of A. That is, 0 < λ_1 ≤ λ_2 ≤ ... ≤ λ_n are the eigenvalues of A and u_1, ..., u_n are the corresponding orthonormal eigenvectors. Consider now the SD method with D being the uniform distribution over the set {u_1, ..., u_n}, and ω = 1. This gives rise to a new variant of SD which we call stochastic spectral descent (SSD).

Algorithm 2: Stochastic Spectral Descent (SSD)
Initialize: x^0 ∈ ℝ^n; (u_1, λ_1), ..., (u_n, λ_n): eigenvectors and eigenvalues of A
for t = 0, 1, 2, ... do
    Choose i ∈ [n] uniformly at random and set x^{t+1} = x^t − (u_i^⊤(A x^t − b))/λ_i · u_i
end for

For SSD we can establish an unusually strong convergence result, both in terms of speed and tightness.

Theorem 2 (Convergence of stochastic spectral descent). Let {x^t} be the sequence of random iterates produced by stochastic spectral descent (Algorithm 2). Then

$$\mathbb{E}\!\left[\|x^t - x^*\|_A^2\right] = \left(1 - \frac{1}{n}\right)^t \|x^0 - x^*\|_A^2. \qquad (10)$$

This theorem implies the rate (6) mentioned in the introduction. Up to a log factor, SSD only needs n iterations to converge. Notice that (10) is an identity, and hence the rate is not improvable.

2.2 Stochastic Conjugate Descent

The same rate as in Theorem 2 holds for the stochastic conjugate descent (SconD) method, which arises as a special case of stochastic descent with ω = 1 and D being a uniform distribution over a set of A-orthogonal (i.e., conjugate) vectors. The proof follows by combining Lemmas 1 and 13.

2.3 Randomized Coordinate Descent

RCD (Algorithm 3) arises as a special case of SD with unit stepsize (ω = 1) and distribution D given by s_t = e_i with probability p_i > 0.

Algorithm 3: Randomized Coordinate Descent (RCD)
Parameters: probabilities p_1, ..., p_n > 0
Initialize: x^0 ∈ ℝ^n
for t = 0, 1, 2, ... do
    Choose i ∈ [n] with probability p_i > 0 and set x^{t+1} = x^t − (A_{i:} x^t − b_i)/A_{ii} · e_i
end for
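As a complement to Algorithm 3, here is one possible implementation sketch (assuming numpy; the residual-maintenance trick is a standard device and is not prescribed by the paper), in which the residual r = Ax − b is kept up to date so that each step touches only one column of A, in line with the O(‖A_{i:}‖_0) per-iteration cost mentioned in Section 1.5.

```python
import numpy as np

def rcd(A, b, x0, probs, iters, rng):
    """Randomized coordinate descent (Algorithm 3) with an explicitly
    maintained residual r = A x - b, so one step costs O(nnz(A[:, i]))."""
    x = x0.copy()
    r = A @ x - b                      # full residual, computed once
    for _ in range(iters):
        i = rng.choice(len(b), p=probs)
        delta = -r[i] / A[i, i]        # exact line search along e_i
        x[i] += delta
        r += delta * A[:, i]           # cheap update; a sparse column for sparse A
    return x

rng = np.random.default_rng(1)
n = 100
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)
b = rng.standard_normal(n)
p = np.diag(A) / np.trace(A)           # importance sampling p_i ~ A_ii
x = rcd(A, b, np.zeros(n), p, 20 * n, rng)
print(np.linalg.norm(A @ x - b))       # residual norm after 20n steps
```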
The rate of RCD (Algorithm 3) can therefore be deduced from Lemma 1. Notice that in view of (8), we have E[H] = Σ_i p_i e_i e_i^⊤ / A_ii = Diag(p_1/A_11, ..., p_n/A_nn). So, as long as all probabilities are positive, Assumption 1 is satisfied. Therefore, Lemma 1 applies and RCD enjoys the rate

$$O\!\left(\frac{1}{\lambda_{\min}\!\left(A^{1/2}\,\mathrm{Diag}\!\left(\frac{p_1}{A_{11}}, \dots, \frac{p_n}{A_{nn}}\right) A^{1/2}\right)} \log\frac{1}{\epsilon}\right). \qquad (11)$$

2.3.1 Uniform probabilities can be optimal

We first prove that uniform probabilities are optimal in 2D.

Theorem 3. Let n = 2 and consider RCD (Algorithm 3) with probabilities p_1 > 0 and p_2 > 0, p_1 + p_2 = 1. Then the choice p_1 = p_2 = 1/2 optimizes the rate of RCD in (11).

Next we claim that uniform probabilities are optimal in any dimension n as long as A is diagonal.

Theorem 4. Let n ≥ 2 and let A be diagonal. Then uniform probabilities (p_i = 1/n for all i) optimize the rate of RCD in (11).

2.3.2 Importance sampling can be unimportant

In our next result we contradict conventional wisdom about typical choices of importance sampling probabilities. We claim that diagonal and squared-row-norm probabilities can lead to arbitrarily worse performance than uniform probabilities.

Theorem 5. For every n ≥ 2 and T > 0, there exists A such that: (i) the rate of RCD with p_i ∝ A_ii is T times worse than the rate of RCD with uniform probabilities; (ii) the rate of RCD with p_i ∝ ‖A_{i:}‖² is T times worse than the rate of RCD with uniform probabilities.

2.3.3 Optimal probabilities can be bad

Finally, we show that there is no hope for an adjustment of the probabilities in RCD to lead to a rate independent of the data A, as is the case for SSD. Our first result states that such a result cannot be obtained from the generic rate (11).

Theorem 6. For every n ≥ 2 and T > 0, there exists A such that the number of iterations (as expressed by formula (11)) of RCD with any choice of probabilities p_1, ..., p_n > 0 is O(T log(1/ε)).

However, this does not mean, by itself, that such a result cannot possibly be obtained via a different analysis. Our next result shatters these hopes, as we establish a lower bound which can be arbitrarily larger than the dimension n.

Theorem 7. For every n ≥ 2 and T > 0, there exists an n × n positive definite matrix A and a starting point x^0, such that the number of iterations of RCD with any choice of probabilities p_1, ..., p_n > 0 is Ω(T log(1/ε)).

3 Interpolating Between RCD and SSD

Assume now that we have some partial spectral information available. In particular, fix k ∈ {0, 1, ..., n − 1} and assume we know the eigenvectors u_i and eigenvalues λ_i for i = 1, ..., k. We now define a parametric distribution D(α, β_1, ..., β_k), with parameters α > 0 and β_1, ..., β_k ≥ 0, as follows. A sample s ∼ D(α, β_1, ..., β_k) arises through the process

$$s = \begin{cases} e_i & \text{with probability } p_i = \frac{\alpha A_{ii}}{C_k}, \quad i \in [n], \\ u_i & \text{with probability } p_{n+i} = \frac{\beta_i}{C_k}, \quad i \in [k], \end{cases}$$

where C_k := α Tr(A) + Σ_{i=1}^k β_i is a normalizing factor ensuring that the probabilities sum up to 1.

3.1 The method and its convergence rate

Applying the SD method with the distribution D = D(α, β_1, ..., β_k) gives rise to a new method which we call stochastic spectral coordinate descent (SSCD).

Algorithm 4: Stochastic Spectral Coordinate Descent (SSCD)
Parameters: distribution D(α, β_1, ..., β_k)
Initialize: x^0 ∈ ℝ^n
for t = 0, 1, 2, ... do
    Sample s_t ∼ D(α, β_1, ..., β_k) and set x^{t+1} = x^t − (s_t^⊤(Ax^t − b))/(s_t^⊤ A s_t) · s_t
end for
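A minimal sketch of Algorithm 4 (assuming numpy; it simply instantiates the SD step with the distribution D(α, β_1, ..., β_k) defined above, for whichever k eigenpairs happen to be available):

```python
import numpy as np

def sscd(A, b, x0, U_k, lam_k, alpha, beta, iters, rng):
    """SSCD (Algorithm 4): SD over {e_1,...,e_n} augmented by {u_1,...,u_k}.

    U_k: n-by-k matrix of the k known eigenvectors, lam_k: their eigenvalues,
    alpha > 0, beta: length-k array of nonnegative weights."""
    n, k = U_k.shape
    C_k = alpha * np.trace(A) + beta.sum()
    probs = np.concatenate([alpha * np.diag(A), beta]) / C_k
    I = np.eye(n)
    x = x0.copy()
    for _ in range(iters):
        j = rng.choice(n + k, p=probs)
        s = I[:, j] if j < n else U_k[:, j - n]
        # s^T A s equals A_jj for a coordinate direction and lambda_i for an
        # exact eigenvector; for approximate directions use s @ A @ s instead.
        sAs = A[j, j] if j < n else lam_k[j - n]
        x = x - (s @ (A @ x - b)) / sAs * s
    return x
```

With α = 1 and β_i = λ_{k+1} − λ_i, the optimal choice identified in Theorem 8 below, this sampling yields the rate (7).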
Theorem 8. Consider stochastic spectral coordinate descent (Algorithm 4) for fixed k ∈ {0, 1, ..., n − 1}. The method converges linearly for all α > 0 and all nonnegative β_i. The best rate is obtained for the parameters α = 1 and β_i = λ_{k+1} − λ_i, and this is the unique choice of parameters leading to the best rate. In this case,

$$\mathbb{E}\,\|x^t - x^*\|_A^2 \le \left(1 - \frac{\lambda_{k+1}}{C_k}\right)^t \|x^0 - x^*\|_A^2, \qquad \text{where } C_k := (k+1)\lambda_{k+1} + \sum_{i=k+2}^n \lambda_i.$$

Moreover, the rate improves as k grows, and we have

$$\frac{\lambda_1}{\mathrm{Tr}(A)} = \frac{\lambda_1}{C_0} \le \frac{\lambda_{k+1}}{C_k} \le \frac{\lambda_n}{C_{n-1}} = \frac{1}{n}.$$

If k = 0, SSCD reduces to RCD (with diagonal probabilities). Since λ_1/C_0 = λ_1/Tr(A), we recover the rate of RCD of Leventhal and Lewis [10]. With the choice k = n − 1 our method does not reduce to SSD; however, the rates match. Indeed, λ_n/C_{n−1} = λ_n/(n λ_n) = 1/n (compare with Theorem 2).

3.2 Largest eigenvectors do not help

It is natural to ask whether there is any benefit in considering a few of the largest eigenvectors instead. Unfortunately, for the same parametric family as in Theorem 8, the answer is negative: the optimal parameters suggest that RCD has a better rate without these directions. See Theorem 12 in the supplementary material.

4 Experiments

4.1 Stochastic spectral coordinate descent (SSCD)

In our first experiment we study how the practical behavior of SSCD (Algorithm 4) depends on the choice of k. What we study here does not depend on the dimensionality of the problem (n), and hence it suffices to perform the experiments on small-dimensional problems (n = 30). In this experiment we consider the regime of clustered eigenvalues described in the introduction and summarized in Table 3. In particular, we construct a synthetic matrix A ∈ ℝ^{30×30} with the smallest 15 eigenvalues clustered in the interval (5, 5 + Δ) and the largest 15 eigenvalues clustered in the interval (θ, θ + Δ).

[Figure 1: panels for θ ∈ {100, 200, 400} and Δ ∈ {10, 25, 50}; each panel plots expected precision versus the number of iterations for k ∈ {0, 6, 12, 18, 24, 29}.]

Figure 1: Expected precision E‖x^t − x*‖²_A / ‖x^0 − x*‖²_A versus the number of iterations of SSCD for symmetric positive definite matrices A of size 30 × 30 with different structures of spectra.
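For reproducibility, one simple way to generate such test matrices (our reconstruction of the setup, assuming numpy; not necessarily the authors' exact generator) is to rotate a prescribed spectrum by a random orthogonal matrix:

```python
import numpy as np

def matrix_with_spectrum(eigenvalues, rng):
    """Return a dense symmetric positive definite matrix with the given eigenvalues."""
    n = len(eigenvalues)
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal basis
    return Q @ np.diag(eigenvalues) @ Q.T

rng = np.random.default_rng(3)
theta, delta = 100.0, 10.0
spectrum = np.concatenate([
    rng.uniform(5, 5 + delta, 15),           # cluster of small eigenvalues
    rng.uniform(theta, theta + delta, 15),   # cluster of large eigenvalues
])
A = matrix_with_spectrum(spectrum, rng)
```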
The spectrum of A thus consists of 2 equally sized clusters of eigenvalues: one in the interval (5, 5 + Δ), and the other in the interval (θ, θ + Δ). We vary the tightness parameter Δ and the separation parameter θ, and study the performance of SSCD for various choices of k. See Figure 1.

Our first finding is a confirmation of the phase-transition phenomenon predicted by our theory. Recall that the rate of SSCD (see Theorem 8) is O(((k+1)λ_{k+1} + Σ_{i=k+2}^n λ_i)/λ_{k+1}). If k < 15, we know that λ_i ∈ (5, 5 + Δ) for i = 1, 2, ..., k + 1, and λ_i ∈ (θ, θ + Δ) for i = k + 2, ..., n. Therefore, the rate can be estimated as

$$r_{\text{small}} := O\!\left(k + 1 + (n - k - 1)\frac{\theta + \Delta}{5}\right).$$

On the other hand, if k ≥ 15, we know that λ_i ∈ (θ, θ + Δ) for i = k + 1, ..., n, and hence the rate can be estimated as

$$r_{\text{large}} := O\!\left(k + 1 + (n - k - 1)\frac{\theta + \Delta}{\theta}\right).$$

Note that if the separation θ between the two clusters is large, the rate r_large is much better than the rate r_small. Indeed, in this regime the rate r_large becomes O(n), while r_small can be arbitrarily large. Going back to Figure 1, notice that this can be observed in the experiments. There is a clear phase transition at k = 15, as predicted by the above analysis. Methods using k ∈ {0, 6, 12} are relatively slow (although still enjoying a linear rate), and tend to have similar behaviour, especially when Δ is small. On the other hand, methods using k ∈ {18, 24, 29} are much faster, with a behaviour nearly independent of θ and Δ. Moreover, as θ increases, the difference in the rates between the slow methods using k ∈ {0, 6, 12} and the fast methods using k ∈ {18, 24, 29} grows. We have performed more experiments with three clusters; see Figure 4 in the supplementary material.

4.2 Mini-batch SSCD

In Figure 2 we report on the behavior of mSSCD, the mini-batch version of SSCD, for four choices of the mini-batch parameter τ and several choices of k. A mini-batch of size τ is processed in parallel on τ processors, and the cost of a single iteration of mSSCD is (roughly) the same for all τ. For τ = 1, the method reduces to SSCD, considered in the previous experiment (but on a different dataset). Since the number of iterations is small, there are no noticeable differences across different values of k. As τ grows, however, all methods become faster. Mini-batching seems to be more useful as k is larger. Moreover, we can observe that acceleration through mini-batching starts more aggressively for small values of k, and its added benefit for increasing values of k gets smaller and smaller. This means that even for relatively small values of k, mini-batching can be expected to lead to substantial speed-ups.

[Figure 2: four panels, one per mini-batch size τ; each plots expected precision versus the number of iterations for k ∈ {0, 6, 12, 18, 24, 29}.]

Figure 2: Expected precision versus the number of iterations of mini-batch SSCD for A ∈ ℝ^{30×30} and several choices of mini-batch size τ. The spectrum of A was chosen as a uniform discretization of the interval [1, 60].

4.3 Matrix with 10 billion entries

In Figure 3 we report on an experiment with a synthetic problem whose data matrix A has dimension n = 10^5 (i.e., potentially with 10^{10} entries). As all experiments were done on a laptop, we worked with sparse matrices with 10^6 nonzeros only.
In the first row of Figure 3 we consider a matrix A with all eigenvalues distributed uniformly over the interval [1, 100]. We observe that SSCD with k = 10^4 (just 10% of n) requires about an order of magnitude fewer iterations than SSCD with k = 0 (i.e., RCD). In the second row we consider a scenario where l eigenvalues are small, contained in [1, 2], with the rest of the eigenvalues contained in [100, 200]. We consider l = 10 and l = 1000 and study the behaviour of SSCD with k = l. We see that for l = 10, SSCD performs dramatically better than RCD: it is able to achieve machine precision, while RCD struggles to reduce the initial error by a factor larger than 10^6. For l = 1000, SSCD achieves error 10^{−9}, while RCD struggles to push the error below 10^{−4}. These tests show that, in terms of the number of iterations, SSCD has the capacity to improve on RCD by many orders of magnitude.

[Figure 3: three panels comparing k = 0 with k = 10^4, k = 10 and k = 1000, respectively; each plots expected precision versus the number of iterations (on the order of 10^6).]

Figure 3: Expected precision E‖x^t − x*‖²_A / ‖x^0 − x*‖²_A versus the number of iterations of SSCD for a matrix A ∈ ℝ^{10^5 × 10^5}. Top row: the spectrum of A is uniformly distributed on [1, 100]. Bottom row: the spectrum is contained in two clusters: [1, 2] and [100, 200].

5 Extensions

Our algorithms and convergence results can be extended to eigenvectors and conjugate directions which are only computed approximately. Some of this development can be found in the supplementary material (see Section E). Finally, as mentioned in the introduction, our results can be extended to the more general problem of minimizing f(x) = φ(Ax) − b^⊤x, where φ is smooth and strongly convex.
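In practice, approximate spectral directions of the kind discussed above could come, for instance, from a few iterations of a Lanczos-type eigensolver. The sketch below (assuming scipy; this is merely one way to obtain approximate eigenpairs and is not the inexactness model analyzed in the supplementary material) computes k approximate smallest eigenpairs of a sparse matrix, which could then be passed to SSCD in place of exact ones.

```python
import numpy as np
from scipy.sparse import diags, identity
from scipy.sparse.linalg import eigsh

n, k = 10_000, 10
# A sparse SPD test matrix (ours, for illustration): a shifted 1D discrete Laplacian.
A = (diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)) + 0.01 * identity(n)).tocsc()

# Approximate k smallest eigenpairs via shift-invert Lanczos; a loose tolerance
# yields inexact directions, the regime the inexact variants are meant to cover.
vals, vecs = eigsh(A, k=k, sigma=0.0, which="LM", tol=1e-3)
order = np.argsort(vals)
lam_k, U_k = vals[order], vecs[:, order]   # feed these to SSCD in place of exact pairs
```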
References

[1] Zeyuan Allen-Zhu, Zheng Qu, Peter Richtárik, and Yang Yuan. Even faster accelerated coordinate descent using non-uniform sampling. In ICML, pages 1110-1119, 2016.
[2] Jonathan Barzilai and Jonathan M. Borwein. Two-point step size gradient methods. IMA Journal of Numerical Analysis, 8:141-148, 1988.
[3] Ernesto G. Birgin, José Mario Martínez, and Marcos Raydan. Spectral projected gradient methods: Review and perspectives. Journal of Statistical Software, 60(3):1-21, 2014.
[4] Dominik Csiba, Zheng Qu, and Peter Richtárik. Stochastic dual coordinate ascent with adaptive probabilities. In ICML, pages 674-683, 2015.
[5] Olivier Fercoq and Peter Richtárik. Accelerated, parallel, and proximal coordinate descent. SIAM Journal on Optimization, 25(4):1997-2023, 2015.
[6] Robert M. Gower and Peter Richtárik. Randomized iterative methods for linear systems. SIAM Journal on Matrix Analysis and Applications, 36(4):1660-1690, 2015.
[7] Robert M. Gower and Peter Richtárik. Stochastic dual ascent for solving linear systems. arXiv preprint arXiv:1512.06890, 2015.
[8] Ching-Pei Lee and Stephen J. Wright. Random permutations fix a worst case for cyclic coordinate descent. arXiv preprint arXiv:1607.08320, 2016.
[9] Yin Tat Lee and Aaron Sidford. Efficient accelerated coordinate descent methods and faster algorithms for solving linear systems. In FOCS, 2013.
[10] Dennis Leventhal and Adrian Lewis. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35:641-654, 2010.
[11] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
[12] Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372-376, 1983.
[13] Yurii Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM Journal on Optimization, 22(2):341-362, 2012. doi:10.1137/100802001. First appeared in 2010 as CORE discussion paper 2010/2.
[14] Yurii Nesterov and Sebastian Stich. Efficiency of the accelerated coordinate descent method on structured optimization problems. SIAM Journal on Optimization, 27(1):110-123, 2017.
[15] Julie Nutini, Mark Schmidt, Issam H. Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate descent converges faster with the Gauss-Southwell rule than random selection. In ICML, 2015.
[16] Boris Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1-17, 1964.
[17] Zheng Qu and Peter Richtárik. Coordinate descent with arbitrary sampling I: algorithms and complexity. Optimization Methods and Software, 31(5):829-857, 2016.
[18] Peter Richtárik and Martin Takáč. Stochastic reformulations of linear systems: Algorithms and convergence theory. arXiv preprint arXiv:1706.01108, 2017.
[19] Peter Richtárik and Martin Takáč. On optimal probabilities in stochastic coordinate descent methods. Optimization Letters, 10(6):1233-1243, 2016.
[20] Peter Richtárik and Martin Takáč. Parallel coordinate descent methods for big data optimization. Mathematical Programming, 156(1):433-484, 2016.
[21] Stephen Tu, Shivaram Venkataraman, Ashia C. Wilson, Alex Gittens, Michael I. Jordan, and Benjamin Recht. Breaking locality accelerates block Gauss-Seidel. In ICML, 2017.