# Breaking Locality Accelerates Block Gauss-Seidel

Stephen Tu 1, Shivaram Venkataraman 1, Ashia C. Wilson 1, Alex Gittens 2, Michael I. Jordan 1, Benjamin Recht 1

1 UC Berkeley, Berkeley, California, USA. 2 Rensselaer Polytechnic Institute, Troy, New York, USA. Correspondence to: Stephen Tu. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

## Abstract

Recent work by Nesterov and Stich (2016) showed that momentum can be used to accelerate the rate of convergence for block Gauss-Seidel in the setting where a fixed partitioning of the coordinates is chosen ahead of time. We show that this setting is too restrictive, constructing instances where breaking locality by running non-accelerated Gauss-Seidel with randomly sampled coordinates substantially outperforms accelerated Gauss-Seidel with any fixed partitioning. Motivated by this finding, we analyze the accelerated block Gauss-Seidel algorithm in the random coordinate sampling setting. Our analysis captures the benefit of acceleration with a new data-dependent parameter which is well behaved when the matrix sub-blocks are well-conditioned. Empirically, we show that accelerated Gauss-Seidel with random coordinate sampling provides speedups for large scale machine learning tasks when compared to non-accelerated Gauss-Seidel and the classical conjugate-gradient algorithm.

## 1. Introduction

The randomized Gauss-Seidel method is a commonly used iterative algorithm to compute the solution of an $n \times n$ linear system $Ax = b$ by updating a single coordinate at a time in a randomized order. While this approach is known to converge linearly to the true solution when $A$ is positive definite (see e.g. Leventhal & Lewis, 2010), in practice it is often more efficient to update a small block of coordinates at a time due to the effects of cache locality. In extending randomized Gauss-Seidel to the block setting, a natural question that arises is how one should sample the next block. At one extreme, a fixed partition of the coordinates is chosen ahead of time. The algorithm is restricted to randomly selecting blocks from this fixed partitioning, thus favoring data locality. At the other extreme, we break locality by sampling a new set of random coordinates to form a block at every iteration.

Theoretically, the fixed partition case is well understood both for Gauss-Seidel (Qu et al., 2015; Gower & Richtárik, 2015) and its Nesterov accelerated variant (Nesterov & Stich, 2016). More specifically, at most $O(\mu_{\mathrm{part}}^{-1}\log(1/\varepsilon))$ iterations of Gauss-Seidel are sufficient to reach a solution with at most $\varepsilon$ error, where $\mu_{\mathrm{part}}$ is a quantity which measures how well the $A$ matrix is preconditioned by the block diagonal matrix containing the sub-blocks corresponding to the fixed partitioning. When acceleration is used, Nesterov and Stich (2016) show that the rate improves to $O\!\left(\sqrt{\tfrac{n}{p}\mu_{\mathrm{part}}^{-1}}\log(1/\varepsilon)\right)$, where $p$ is the partition size. For the random coordinate selection model, the existing literature is less complete. While it is known (Qu et al., 2015; Gower & Richtárik, 2015) that the iteration complexity with random coordinate selection is $O(\mu_{\mathrm{rand}}^{-1}\log(1/\varepsilon))$ for an $\varepsilon$ error solution, $\mu_{\mathrm{rand}}$ is another instance dependent quantity which is not directly comparable to $\mu_{\mathrm{part}}$. Hence it is not obvious how much better, if at all, one expects random coordinate selection to perform compared to fixed partitioning.
Our first contribution in this paper is to show that, when compared to the random coordinate selection model, the fixed partition model can perform very poorly in terms of iteration complexity to reach a pre-specified error. Specifically, we present a family of instances (similar to the matrices recently studied by Lee and Wright (2016)) where non-accelerated Gauss-Seidel with random coordinate selection performs arbitrarily faster than both non-accelerated and even accelerated Gauss-Seidel using any fixed partition. Our result thus shows the importance of the sampling strategy and that acceleration cannot make up for a poor choice of sampling distribution.

This finding motivates us to further study the benefits of acceleration under the random coordinate selection model. Interestingly, the benefits are more nuanced under this model. We show that acceleration improves the rate from $O(\mu_{\mathrm{rand}}^{-1}\log(1/\varepsilon))$ to $O\!\left(\sqrt{\nu\,\mu_{\mathrm{rand}}^{-1}}\log(1/\varepsilon)\right)$, where $\nu$ is a new instance dependent quantity that satisfies $\nu \le \mu_{\mathrm{rand}}^{-1}$. We derive a bound on $\nu$ which suggests that if the sub-blocks of $A$ are all well conditioned, then acceleration can provide substantial speedups. We note that this is merely a sufficient condition, and our experiments suggest that our bound is conservative.

In the process of deriving our results, we also develop a general proof framework for randomized accelerated methods based on Wilson et al. (2016) which avoids the use of estimate sequences in favor of an explicit Lyapunov function. Using our proof framework we are able to recover recent results (Nesterov & Stich, 2016; Allen-Zhu et al., 2016) on accelerated coordinate descent. Furthermore, our proof framework allows us to immediately transfer our results on Gauss-Seidel over to the randomized accelerated Kaczmarz algorithm, extending a recent result by Liu and Wright (2016) on updating a single constraint at a time to the block case.

Finally, we empirically demonstrate that despite its theoretical nuances, accelerated Gauss-Seidel using random coordinate selection can provide significant speedups in practical applications over Gauss-Seidel with fixed partition sampling, as well as the classical conjugate-gradient (CG) algorithm. As an example, for a kernel ridge regression (KRR) task in machine learning on the augmented CIFAR-10 dataset (n = 250,000), acceleration with random coordinate sampling performs up to 1.5× faster than acceleration with a fixed partitioning to reach an error tolerance of $10^{-2}$, with the gap substantially widening for smaller error tolerances. Furthermore, it performs over 3.5× faster than conjugate-gradient on the same task.

## 2. Background

We assume that we are given an $n \times n$ matrix $A$ which is positive definite, and an $n$ dimensional response vector $b$. We also fix an integer $p$ which denotes a block size. Under the assumption of $A$ being positive definite, the function $f(x) = \frac{1}{2}x^{\mathsf T}Ax - x^{\mathsf T}b$ is strongly convex and smooth. Recent analysis of Gauss-Seidel (Gower & Richtárik, 2015) proceeds by noting the connection between Gauss-Seidel and (block) coordinate descent on $f$. This is the point of view we will take in this paper.

### 2.1. Existing rates for randomized block Gauss-Seidel

We first describe the sketching framework of (Qu et al., 2015; Gower & Richtárik, 2015) and show how it yields rates on Gauss-Seidel when blocks are chosen via a fixed partition or randomly at every iteration.
While we will only focus on the special case when the sketch matrix represents column sampling, the sketching framework allows us to provide a unified analysis of both cases. To be more precise, let $\mathcal{D}$ be a distribution over $\mathbb{R}^{n \times p}$, and let $S_k \sim \mathcal{D}$ be drawn iid from $\mathcal{D}$. If we perform block coordinate descent by minimizing $f$ along the range of $S_k$, then the randomized block Gauss-Seidel update is given by

$$x_{k+1} = x_k - S_k(S_k^{\mathsf T} A S_k)^{\dagger} S_k^{\mathsf T}(Ax_k - b). \tag{1}$$

Column sampling. Every index set $J \in 2^{[n]}$ with $|J| = p$ induces a sketching matrix $S(J) = (e_{J(1)}, \ldots, e_{J(p)})$, where $e_i$ denotes the $i$-th standard basis vector in $\mathbb{R}^n$, and $J(1), \ldots, J(p)$ is any ordering of the elements of $J$. By equipping different probability measures on $2^{[n]}$, one can easily describe fixed partition sampling as well as random coordinate sampling (and many other sampling schemes). The former puts uniform mass on the index sets $J_1, \ldots, J_{n/p}$, whereas the latter puts uniform mass on all $\binom{n}{p}$ index sets of size $p$. Furthermore, in the sketching framework there is no restriction to using a uniform distribution, nor to using a fixed $p$ for every iteration. For this paper, however, we will restrict our attention to these cases.

Existing rates. Under the assumptions stated above, (Qu et al., 2015; Gower & Richtárik, 2015) show that for every $k \ge 0$, the sequence (1) satisfies

$$\mathbb{E}[\|x_k - x_*\|_A] \le (1 - \mu)^{k/2}\|x_0 - x_*\|_A, \tag{2}$$

where $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}])$. The expectation in (2) is taken with respect to the randomness of $S_0, S_1, \ldots$, and the expectation in the definition of $\mu$ is taken with respect to $S \sim \mathcal{D}$. Under both fixed partitioning and random coordinate selection, $\mu > 0$ is guaranteed (see e.g. (Gower & Richtárik, 2015), Lemma 4.3). Thus, (1) achieves a linear rate of convergence to the true solution, with the rate governed by the $\mu$ quantity shown above.

We now specialize (2) to fixed partitioning and random coordinate sampling, and provide some intuition for why we expect the latter to outperform the former in terms of iteration complexity. We first consider the case when the sampling distribution corresponds to fixed partitioning. Assume for notational convenience that the fixed partitioning corresponds to placing the first $p$ coordinates in the first partition $J_1$, the next $p$ coordinates in the second partition $J_2$, and so on. Here, $\mu = \mu_{\mathrm{part}}$ corresponds to a measure of how close the product of $A$ with the inverse of the block diagonal is to the identity matrix, defined as

$$\mu_{\mathrm{part}} = \frac{p}{n}\,\lambda_{\min}\!\left(A \cdot \mathrm{blkdiag}\!\left(A_{J_1}^{-1}, \ldots, A_{J_{n/p}}^{-1}\right)\right). \tag{3}$$

Above, $A_{J_i}$ denotes the $p \times p$ matrix corresponding to the sub-matrix of $A$ indexed by the $i$-th partition. A loose lower bound on $\mu_{\mathrm{part}}$ is

$$\mu_{\mathrm{part}} \ge \frac{p}{n}\,\frac{\lambda_{\min}(A)}{\max_{1 \le i \le n/p}\lambda_{\max}(A_{J_i})}. \tag{4}$$

On the other hand, in the random coordinate case, Qu et al. (2015) derive a lower bound on $\mu = \mu_{\mathrm{rand}}$ as

$$\mu_{\mathrm{rand}} \ge \frac{p}{n}\,\frac{\lambda_{\min}(A)}{\beta\,\lambda_{\min}(A) + (1-\beta)\max_{1 \le i \le n}A_{ii}}, \quad \text{where } \beta = \frac{p-1}{n-1}. \tag{5}$$

Using the lower bounds (4) and (5), we can upper bound the iteration complexity of fixed partition Gauss-Seidel $N_{\mathrm{part}}$ by $O\!\left(\frac{n}{p}\,\frac{\max_{1 \le i \le n/p}\lambda_{\max}(A_{J_i})}{\lambda_{\min}(A)}\log(1/\varepsilon)\right)$ and of random coordinate Gauss-Seidel $N_{\mathrm{rand}}$ as $O\!\left(\frac{n}{p}\,\frac{\max_{1 \le i \le n}A_{ii}}{\lambda_{\min}(A)}\log(1/\varepsilon)\right)$. Comparing the bound on $N_{\mathrm{part}}$ to the bound on $N_{\mathrm{rand}}$, it is not unreasonable to expect that random coordinate sampling has better iteration complexity than fixed partition sampling in certain cases. In Section 3, we verify this by constructing instances $A$ such that fixed partition Gauss-Seidel takes arbitrarily more iterations to reach a pre-specified error tolerance compared with random coordinate Gauss-Seidel.
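To make the column-sampling update (1) concrete, here is a minimal NumPy sketch (the function names are ours, not taken from the paper's code). Fixed partition sampling and random coordinate sampling differ only in how the index set $J$ is drawn at each iteration.

```python
import numpy as np

def gauss_seidel_block(A, b, x0, sample_block, num_iters, rng):
    """Non-accelerated block Gauss-Seidel, i.e. update (1) with column sampling.

    sample_block(rng) returns the index set J used at the current iteration, so
    fixed partition and random coordinate sampling differ only in the sampler.
    """
    x = x0.copy()
    for _ in range(num_iters):
        J = sample_block(rng)                        # columns of the sketch S = S(J)
        r = A[J, :] @ x - b[J]                       # S^T (A x_k - b)
        x[J] -= np.linalg.solve(A[np.ix_(J, J)], r)  # x <- x - S (S^T A S)^{-1} S^T (A x_k - b)
    return x

def fixed_partition_sampler(n, p, rng):
    """Choose one of the n/p blocks of a partition fixed ahead of time."""
    perm = rng.permutation(n)
    parts = [perm[i:i + p] for i in range(0, n, p)]
    return lambda rng: parts[rng.integers(len(parts))]

def random_coordinate_sampler(n, p):
    """Draw a fresh set of p coordinates at every iteration (breaking locality)."""
    return lambda rng: rng.choice(n, size=p, replace=False)
```

For example, `random_coordinate_sampler(n, p)` re-draws a size-$p$ block every iteration, which is exactly the locality-breaking scheme analyzed in Section 3.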
### 2.2. Accelerated rates for fixed partition Gauss-Seidel

Based on the interpretation of Gauss-Seidel as block coordinate descent on the function $f$, we can use Theorem 1 of Nesterov and Stich (2016) to recover a procedure and a rate for accelerating (1) in the fixed partition case; the specific details are discussed in the full version of the paper (Tu et al., 2017). We will refer to this procedure as ACDM. The convergence guarantee of the ACDM procedure is that for all $k \ge 0$,

$$\mathbb{E}[\|x_k - x_*\|_A] \le O\!\left(\left(1 - \sqrt{\tfrac{p}{n}\mu_{\mathrm{part}}}\right)^{k/2} D_0\right). \tag{6}$$

Above, $D_0 = \|x_0 - x_*\|_A$ and $\mu_{\mathrm{part}}$ is the same quantity defined in (3). Comparing (6) to the non-accelerated Gauss-Seidel rate given in (2), we see that acceleration improves the iteration complexity to reach a solution with $\varepsilon$ error from $O(\mu_{\mathrm{part}}^{-1}\log(1/\varepsilon))$ to $O\!\left(\sqrt{\tfrac{n}{p}\mu_{\mathrm{part}}^{-1}}\log(1/\varepsilon)\right)$, as discussed in Section 1.

## 3. Main Results

We now present the main results of the paper. Proofs can be found in the full version (Tu et al., 2017) of this paper.

### 3.1. Fixed partition vs random coordinate sampling

Our first result is to construct instances where Gauss-Seidel with fixed partition sampling runs arbitrarily slower than random coordinate sampling, even if acceleration is used. Consider the family of $n \times n$ positive definite matrices $\mathcal{A}$ given by $\mathcal{A} = \{A_{\alpha,\beta} : \alpha > 0, \alpha + \beta > 0\}$ with $A_{\alpha,\beta}$ defined as $A_{\alpha,\beta} = \alpha I + \frac{\beta}{n}\mathbf{1}_n\mathbf{1}_n^{\mathsf T}$. The family $\mathcal{A}$ exhibits a crucial property that $\Pi^{\mathsf T}A_{\alpha,\beta}\Pi = A_{\alpha,\beta}$ for every $n \times n$ permutation matrix $\Pi$. Lee and Wright (2016) recently exploited this invariance to illustrate the behavior of cyclic versus randomized permutations in coordinate descent.

We explore the behavior of Gauss-Seidel as the matrices $A_{\alpha,\beta}$ become ill-conditioned. To do this, we consider a particular parameterization which holds the minimum eigenvalue equal to one and sends the maximum eigenvalue to infinity via the sub-family $\{A_{1,\beta}\}_{\beta > 0}$. Our first proposition characterizes the behavior of Gauss-Seidel with fixed partitions on this sub-family.

Proposition 3.1. Fix $\beta > 0$ and positive integers $n, p, k$ such that $n = pk$. Let $\{J_i\}_{i=1}^k$ be any partition of $\{1, \ldots, n\}$ with $|J_i| = p$, and denote $S_i \in \mathbb{R}^{n \times p}$ as the column selector for partition $J_i$. Suppose $S \in \mathbb{R}^{n \times p}$ takes on value $S_i$ with probability $1/k$. For every $A_{1,\beta} \in \mathcal{A}$ we have that

$$\mu_{\mathrm{part}} = \frac{p}{n + \beta p}. \tag{7}$$

Next, we perform a similar calculation under the random column sampling model.

Proposition 3.2. Fix $\beta > 0$ and integers $n, p$ such that $1 < p < n$. Suppose each column of $S \in \mathbb{R}^{n \times p}$ is chosen uniformly at random from $\{e_1, \ldots, e_n\}$ without replacement. For every $A_{1,\beta} \in \mathcal{A}$ we have that

$$\mu_{\mathrm{rand}} = \frac{p}{n + \beta p} + \frac{(p-1)\beta p}{(n-1)(n + \beta p)}. \tag{8}$$

The differences between (7) and (8) are striking. Let us assume that $\beta$ is much larger than $n$. Then, we have that $\mu_{\mathrm{part}} \approx 1/\beta$ for (7), whereas $\mu_{\mathrm{rand}} \approx \frac{p-1}{n-1}$ for (8). That is, $\mu_{\mathrm{part}}$ can be made arbitrarily smaller than $\mu_{\mathrm{rand}}$ as $\beta$ grows. Our next proposition states that the rate of Gauss-Seidel from (2) is tight order-wise, in that for any instance there always exists a starting point which saturates the bound.

Proposition 3.3. Let $A$ be an $n \times n$ positive definite matrix, and let $S$ be a random matrix such that $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}]) > 0$. Let $x_*$ denote the solution to $Ax = b$. There exists a starting point $x_0 \in \mathbb{R}^n$ such that the sequence (1) satisfies for all $k \ge 0$,

$$\mathbb{E}[\|x_k - x_*\|_A] \ge (1 - \mu)^k \|x_0 - x_*\|_A. \tag{9}$$

From (2) we see that Gauss-Seidel using random coordinates computes a solution $x_k$ satisfying $\mathbb{E}[\|x_k - x_*\|_{A_{1,\beta}}] \le \varepsilon$ in at most $k = O(\frac{n}{p}\log(1/\varepsilon))$ iterations. On the other hand, Proposition 3.3 states that for any fixed partition, there exists an input $x_0$ such that $k = \Omega(\beta\log(1/\varepsilon))$ iterations are required to reach the same $\varepsilon$ error tolerance.
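Propositions 3.1 and 3.2 are easy to check numerically on small instances. The sketch below (helper names are ours) computes $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}])$ exactly for a fixed partition of $A_{1,\beta}$ and compares it against the closed forms (7) and (8); as $\beta$ grows, the fixed partition value collapses toward $1/\beta$ while the random coordinate value stays near $(p-1)/(n-1)$.

```python
import numpy as np

def mu_from_projector_average(A, index_sets, probs):
    """mu = lambda_min(E[P_{A^{1/2} S}]) for column-sampling sketches S = S(J)."""
    n = A.shape[0]
    w, V = np.linalg.eigh(A)
    A_half = (V * np.sqrt(w)) @ V.T                           # symmetric square root of A
    E = np.zeros((n, n))
    for J, q in zip(index_sets, probs):
        B = A_half[:, J]                                      # A^{1/2} S(J)
        E += q * B @ np.linalg.solve(A[np.ix_(J, J)], B.T)    # q * A^{1/2} S (S^T A S)^{-1} S^T A^{1/2}
    return np.linalg.eigvalsh(E).min()

def mu_part_closed(n, p, beta):       # Proposition 3.1
    return p / (n + beta * p)

def mu_rand_closed(n, p, beta):       # Proposition 3.2
    return p / (n + beta * p) + (p - 1) * beta * p / ((n - 1) * (n + beta * p))

n, p, beta = 12, 3, 1000.0
A = np.eye(n) + (beta / n) * np.ones((n, n))                  # A_{1,beta}
partition = [list(range(i, i + p)) for i in range(0, n, p)]
print(mu_from_projector_average(A, partition, [p / n] * (n // p)), mu_part_closed(n, p, beta))
print(mu_rand_closed(n, p, beta))     # stays near (p-1)/(n-1) even as beta grows
```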
Furthermore, the situation does not improve even if we use ACDM from (Nesterov & Stich, 2016). Proposition 3.6, which we describe later, implies that for any fixed partition there exists an input $x_0$ such that $k = \Omega\!\left(\sqrt{\tfrac{n}{p}\beta}\,\log(1/\varepsilon)\right)$ iterations are required for ACDM to reach $\varepsilon$ error. Hence as $\beta \to \infty$, the gap between random coordinate and fixed partitioning can be made arbitrarily large. These findings are numerically verified in Section 5.1.

### 3.2. A Lyapunov analysis of accelerated Gauss-Seidel and Kaczmarz

Motivated by our findings, our goal is to understand the behavior of accelerated Gauss-Seidel under random coordinate sampling. In order to do this, we establish a general framework from which the behavior of accelerated Gauss-Seidel with random coordinate sampling follows immediately, along with rates for accelerated randomized Kaczmarz (Liu & Wright, 2016) and the accelerated coordinate descent methods of (Nesterov & Stich, 2016) and (Allen-Zhu et al., 2016).

For conciseness, we describe a simpler version of our framework which is still able to capture both the Gauss-Seidel and Kaczmarz results, deferring the general version to the full version of the paper. Our general result requires a bit more notation, but follows the same line of reasoning. Let $H$ be a random $n \times n$ positive semi-definite matrix. Put $G = \mathbb{E}[H]$, and suppose that $G$ exists and is positive definite. Furthermore, let $f : \mathbb{R}^n \to \mathbb{R}$ be strongly convex and smooth, and let $\mu$ denote the strong convexity constant of $f$ w.r.t. the $G^{-1}$ norm. Consider the following sequence $\{(x_k, y_k, z_k)\}_{k \ge 0}$ defined by the recurrence

$$x_{k+1} = \frac{1}{1+\tau}y_k + \frac{\tau}{1+\tau}z_k, \tag{10a}$$
$$y_{k+1} = x_{k+1} - H_k\nabla f(x_{k+1}), \tag{10b}$$
$$z_{k+1} = z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu}H_k\nabla f(x_{k+1}), \tag{10c}$$

where $H_0, H_1, \ldots$ are independent realizations of $H$ and $\tau$ is a parameter to be chosen. Following (Wilson et al., 2016), we construct a candidate Lyapunov function $V_k$ for the sequence (10) defined as

$$V_k = f(y_k) - f_* + \frac{\mu}{2}\|z_k - x_*\|_{G^{-1}}^2. \tag{11}$$

The following theorem demonstrates that $V_k$ is indeed a Lyapunov function for $(x_k, y_k, z_k)$.

Theorem 3.4. Let $f, G, H$ be as defined above. Suppose further that $f$ has 1-Lipschitz gradients w.r.t. the $G^{-1}$ norm, and for every fixed $x \in \mathbb{R}^n$,

$$f(\Phi(x; H)) \le f(x) - \frac{1}{2}\|\nabla f(x)\|_H^2 \tag{12}$$

holds for a.e. $H$, where $\Phi(x; H) = x - H\nabla f(x)$. Set $\tau$ in (10) as $\tau = \sqrt{\mu/\nu}$, where $\nu = \lambda_{\max}\!\left(\mathbb{E}\!\left[(G^{-1/2}HG^{-1/2})^2\right]\right)$. Then for every $k \ge 0$, we have $\mathbb{E}[V_k] \le (1-\tau)^k V_0$.

We now proceed to specialize Theorem 3.4 to both the Gauss-Seidel and Kaczmarz settings.

#### 3.2.1. Accelerated Gauss-Seidel

Let $S \in \mathbb{R}^{n \times p}$ denote a random sketching matrix. As suggested in Section 2, we set $f(x) = \frac{1}{2}x^{\mathsf T}Ax - x^{\mathsf T}b$ and put $H = S(S^{\mathsf T}AS)^{\dagger}S^{\mathsf T}$. Note that $G = \mathbb{E}[S(S^{\mathsf T}AS)^{\dagger}S^{\mathsf T}]$ is positive definite iff $\lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}]) > 0$, and is hence satisfied for both fixed partition and random coordinate sampling (c.f. Section 2). Next, the fact that $f$ is 1-Lipschitz w.r.t. the $G^{-1}$ norm and the condition (12) are standard calculations. All the hypotheses of Theorem 3.4 are thus satisfied, and the conclusion is Theorem 3.5, which characterizes the rate of convergence for accelerated Gauss-Seidel (Algorithm 1); a NumPy sketch of the algorithm is given below.

Algorithm 1 Accelerated randomized block Gauss-Seidel.
Require: $A \in \mathbb{R}^{n \times n}$, $A \succ 0$, $b \in \mathbb{R}^n$, sketching matrices $\{S_k\}_{k=0}^{T-1} \subseteq \mathbb{R}^{n \times p}$, $x_0 \in \mathbb{R}^n$, $\mu \in (0, 1)$, $\nu \ge 1$.
1: Set $\tau = \sqrt{\mu/\nu}$.
2: Set $y_0 = z_0 = x_0$.
3: for $k = 0, \ldots, T-1$ do
4: &nbsp;&nbsp; $x_{k+1} = \frac{1}{1+\tau}y_k + \frac{\tau}{1+\tau}z_k$.
5: &nbsp;&nbsp; $H_k = S_k(S_k^{\mathsf T}AS_k)^{\dagger}S_k^{\mathsf T}$.
6: &nbsp;&nbsp; $y_{k+1} = x_{k+1} - H_k(Ax_{k+1} - b)$.
7: &nbsp;&nbsp; $z_{k+1} = z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu}H_k(Ax_{k+1} - b)$.
8: end for
9: Return $y_T$.
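The following is a minimal NumPy sketch of Algorithm 1 under random coordinate sampling (the function name is ours, and this is not the authors' implementation). It assumes $\mu$ and $\nu$ are supplied, as in the experiments of Section 5 where they are tuned; applying $H_k$ in line 5 never requires forming the pseudo-inverse explicitly, only a $p \times p$ solve on the sampled block.

```python
import numpy as np

def accelerated_block_gauss_seidel(A, b, x0, p, mu, nu, num_iters, seed=0):
    """Sketch of Algorithm 1 with random coordinate sampling.

    mu and nu are assumed to be supplied (the paper tunes them per problem).
    Applying H_k = S_k (S_k^T A S_k)^+ S_k^T amounts to a p x p solve on the
    sampled block J, so the pseudo-inverse is never formed explicitly.
    """
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    tau = np.sqrt(mu / nu)                          # line 1
    y, z = x0.copy(), x0.copy()                     # line 2
    for _ in range(num_iters):                      # line 3
        x = (y + tau * z) / (1.0 + tau)             # line 4
        J = rng.choice(n, size=p, replace=False)    # sketch S_k: a fresh random block
        g = A[J, :] @ x - b[J]                      # S_k^T (A x_{k+1} - b)
        hg = np.zeros(n)
        hg[J] = np.linalg.solve(A[np.ix_(J, J)], g) # line 5: apply H_k to (A x_{k+1} - b)
        y = x - hg                                  # line 6
        z = z + tau * (x - z) - (tau / mu) * hg     # line 7
    return y                                        # line 9
```

Fixed partition sampling is recovered by replacing the draw of $J$ with a uniform choice from a partition fixed ahead of time, which also allows the $p \times p$ blocks to be factored once and cached (see Section 5).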
Theorem 3.5. Let $A$ be an $n \times n$ positive definite matrix and $b \in \mathbb{R}^n$. Let $x_* \in \mathbb{R}^n$ denote the unique vector satisfying $Ax_* = b$. Suppose each $S_k$, $k = 0, 1, 2, \ldots$ is an independent copy of a random matrix $S \in \mathbb{R}^{n \times p}$. Put $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{1/2}S}])$, and suppose the distribution of $S$ satisfies $\mu > 0$. Invoke Algorithm 1 with $\mu$ and $\nu$, where

$$\nu = \lambda_{\max}\!\left(\mathbb{E}\!\left[(G^{-1/2}HG^{-1/2})^2\right]\right), \tag{13}$$

with $H = S(S^{\mathsf T}AS)^{\dagger}S^{\mathsf T}$ and $G = \mathbb{E}[H]$. Then with $\tau = \sqrt{\mu/\nu}$, for all $k \ge 0$,

$$\mathbb{E}[\|y_k - x_*\|_A] \le \sqrt{2}\,(1 - \tau)^{k/2}\|x_0 - x_*\|_A. \tag{14}$$

Note that in the setting of Theorem 3.5, by the definition of $\nu$ and $\mu$, it is always the case that $\nu \le 1/\mu$. Therefore, the iteration complexity of acceleration is at least as good as the iteration complexity without acceleration. We conclude our discussion of Gauss-Seidel by describing the analogue of Proposition 3.3 for Algorithm 1, which shows that our analysis in Theorem 3.5 is tight order-wise. The following proposition applies to ACDM as well; we show in the full version of the paper how ACDM can be viewed as a special case of Algorithm 1.

Proposition 3.6. Under the setting of Theorem 3.5, there exist starting positions $y_0, z_0 \in \mathbb{R}^n$ such that the iterates $\{(y_k, z_k)\}_{k \ge 0}$ produced by Algorithm 1 satisfy

$$\mathbb{E}[\|y_k - x_*\|_A + \|z_k - x_*\|_A] \ge (1 - \tau)^k\|y_0 - x_*\|_A.$$

#### 3.2.2. Accelerated Kaczmarz

The argument for Theorem 3.5 can be slightly modified to yield a result for randomized accelerated Kaczmarz in the sketching framework, for the case of a consistent overdetermined linear system. Specifically, suppose we are given an $m \times n$ matrix $A$ which has full column rank, and $b \in \mathcal{R}(A)$. Our goal is to recover the unique $x_*$ satisfying $Ax_* = b$. To do this, we apply a similar line of reasoning as (Lee & Sidford, 2013). We set $f(x) = \frac{1}{2}\|x - x_*\|_2^2$ and $H = P_{A^{\mathsf T}S}$, where $S$ again is our random sketching matrix. At first, it appears our choice of $f$ is problematic since we do not have access to $f$ and $\nabla f$, but a quick calculation shows that $H\nabla f(x) = (S^{\mathsf T}A)^{\dagger}S^{\mathsf T}(Ax - b)$. Hence, with $r_k = Ax_k - b$, the sequence (10) simplifies to

$$x_{k+1} = \frac{1}{1+\tau}y_k + \frac{\tau}{1+\tau}z_k, \tag{15a}$$
$$y_{k+1} = x_{k+1} - (S_k^{\mathsf T}A)^{\dagger}S_k^{\mathsf T}r_{k+1}, \tag{15b}$$
$$z_{k+1} = z_k + \tau(x_{k+1} - z_k) - \frac{\tau}{\mu}(S_k^{\mathsf T}A)^{\dagger}S_k^{\mathsf T}r_{k+1}. \tag{15c}$$

The remainder of the argument proceeds nearly identically, and leads to the following theorem.

Theorem 3.7. Let $A$ be an $m \times n$ matrix with full column rank, and $b = Ax_*$. Suppose each $S_k$, $k = 0, 1, 2, \ldots$ is an independent copy of a random sketching matrix $S \in \mathbb{R}^{m \times p}$. Put $H = P_{A^{\mathsf T}S}$ and $G = \mathbb{E}[H]$. The sequence (15) with $\mu = \lambda_{\min}(\mathbb{E}[P_{A^{\mathsf T}S}])$, $\nu = \lambda_{\max}\!\left(\mathbb{E}\!\left[(G^{-1/2}HG^{-1/2})^2\right]\right)$, and $\tau = \sqrt{\mu/\nu}$ satisfies for all $k \ge 0$,

$$\mathbb{E}[\|y_k - x_*\|_2] \le \sqrt{2}\,(1 - \tau)^{k/2}\|x_0 - x_*\|_2. \tag{16}$$

Specialized to the setting of (Liu & Wright, 2016) where each row of $A$ has unit norm and is sampled uniformly at every iteration, it can be shown (Section A.5.1) that $\nu \le m$ and $\mu = \frac{1}{m}\lambda_{\min}(A^{\mathsf T}A)$. Hence, the above theorem states that the iteration complexity to reach $\varepsilon$ error is $O\!\left(\frac{m}{\sqrt{\lambda_{\min}(A^{\mathsf T}A)}}\log(1/\varepsilon)\right)$, which matches Theorem 5.1 of (Liu & Wright, 2016) order-wise. However, Theorem 3.7 applies in general for any sketching matrix.

### 3.3. Specializing accelerated Gauss-Seidel to random coordinate sampling

We now instantiate Theorem 3.5 to random coordinate sampling. The $\mu$ quantity which appears in Theorem 3.5 is identical to the quantity appearing in the rate (2) of non-accelerated Gauss-Seidel. That is, the iteration complexity to reach tolerance $\varepsilon$ is $O\!\left(\sqrt{\nu\,\mu_{\mathrm{rand}}^{-1}}\log(1/\varepsilon)\right)$, and the only new term here is $\nu$. In order to provide a more intuitive interpretation of the $\nu$ quantity, we present an upper bound on $\nu$ in terms of an effective block condition number defined as follows.
Given an index set $J \in 2^{[n]}$, define the effective block condition number of a matrix $A$ as $\kappa_{\mathrm{eff},J}(A) = \frac{\max_{i \in J}A_{ii}}{\lambda_{\min}(A_J)}$. Note that $\kappa_{\mathrm{eff},J}(A) \le \kappa(A_J)$ always. The following lemma gives upper and lower bounds on the $\nu$ quantity.

Lemma 3.8. Let $A$ be an $n \times n$ positive definite matrix and let $p$ satisfy $1 < p < n$. We have that

$$\frac{n}{p} \le \nu \le \frac{n}{p}\,\kappa_{\mathrm{eff},p}(A)\left(1 + \frac{1}{p-1}\right),$$

where $\kappa_{\mathrm{eff},p}(A) = \max_{J \in 2^{[n]} : |J| = p}\kappa_{\mathrm{eff},J}(A)$, $\nu$ is defined in (13), and the distribution of $S$ corresponds to uniformly selecting $p$ coordinates without replacement.

Lemma 3.8 states that if the $p \times p$ sub-blocks of $A$ are well-conditioned as defined by the effective block condition number $\kappa_{\mathrm{eff},J}(A)$, then the speed-up of accelerated Gauss-Seidel with random coordinate selection over its non-accelerated counterpart parallels the case of fixed partitioning sampling (i.e. the rate described in (6) versus the rate in (2)). This is a reasonable condition, since very ill-conditioned sub-blocks will lead to numerical instabilities in solving the sub-problems when implementing Gauss-Seidel. On the other hand, we note that Lemma 3.8 provides merely a sufficient condition for speed-ups from acceleration, and is conservative. Our numerical experiments in Section A.7.2 suggest that in many cases the $\nu$ parameter behaves closer to the lower bound $n/p$ than Lemma 3.8 suggests.

We can now combine Theorem 3.5 with (5) to derive the following upper bound on the iteration complexity of accelerated Gauss-Seidel with random coordinates:

$$N_{\mathrm{rand,acc}} \le O\!\left(\frac{n}{p}\sqrt{\frac{\max_{1 \le i \le n}A_{ii}}{\lambda_{\min}(A)}\,\kappa_{\mathrm{eff},p}(A)}\,\log(1/\varepsilon)\right).$$

Illustrative example. We conclude our results by illustrating our bounds on a simple example. Consider the sub-family $\{A_\delta\}_{\delta > 0} \subseteq \mathcal{A}$, with

$$A_\delta = A_{n+\delta,\,-n}, \quad \delta > 0. \tag{17}$$

A simple calculation yields that $\kappa_{\mathrm{eff},p}(A_\delta) = \frac{n-1+\delta}{n-p+\delta}$, and hence Lemma 3.8 states that $\nu(A_\delta) \le \left(1 + \frac{1}{p-1}\right)\frac{n}{p}\,\frac{n-1+\delta}{n-p+\delta} = O(n/p)$. Furthermore, by a similar calculation to Proposition 3.2, $\mu_{\mathrm{rand}} = \frac{p\delta}{n(n-p+\delta)}$. Assuming for simplicity that $p = o(n)$ and $\delta \in (0, 1)$, Theorem 3.5 states that at most $O\!\left(\frac{n^{3/2}}{p\sqrt{\delta}}\log(1/\varepsilon)\right)$ iterations are sufficient for an $\varepsilon$-accurate solution. On the other hand, without acceleration, (2) states that $O\!\left(\frac{n^2}{p\delta}\log(1/\varepsilon)\right)$ iterations are sufficient, and Proposition 3.3 shows there exists a starting position for which this is necessary. Hence, as either $n$ grows large or $\delta$ tends to zero, the benefits of acceleration become more pronounced.
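On small instances, both sides of Lemma 3.8 can be evaluated exactly by enumerating all $\binom{n}{p}$ index sets. The sketch below (helper names are ours) computes $\nu$ directly from definition (13) for the $A_\delta$ example and prints it next to the $n/p$ lower bound and the upper bound as stated above; treat the comparison as illustrative, since it assumes the form of the bound given here.

```python
import numpy as np
from itertools import combinations

def kappa_eff(A, J):
    """Effective block condition number: max_{i in J} A_ii / lambda_min(A_J)."""
    AJ = A[np.ix_(J, J)]
    return np.diag(AJ).max() / np.linalg.eigvalsh(AJ).min()

def nu_random_coordinates(A, p):
    """nu = lambda_max(E[(G^{-1/2} H G^{-1/2})^2]) for uniform size-p column sampling,
    computed exactly by enumerating all C(n, p) index sets (small n only)."""
    n = A.shape[0]
    Hs = []
    for J in combinations(range(n), p):
        H = np.zeros((n, n))
        H[np.ix_(J, J)] = np.linalg.inv(A[np.ix_(J, J)])  # S (S^T A S)^{-1} S^T
        Hs.append(H)
    G = sum(Hs) / len(Hs)
    w, V = np.linalg.eigh(G)
    G_mhalf = (V / np.sqrt(w)) @ V.T                      # G^{-1/2}
    M = np.zeros((n, n))
    for H in Hs:
        B = G_mhalf @ H @ G_mhalf
        M += B @ B
    return np.linalg.eigvalsh(M / len(Hs)).max()

n, p, delta = 12, 3, 0.5
A = (n + delta) * np.eye(n) - np.ones((n, n))             # A_delta from (17)
kappa = max(kappa_eff(A, list(J)) for J in combinations(range(n), p))
print(nu_random_coordinates(A, p),                        # nu itself
      n / p,                                              # lower bound from Lemma 3.8
      (1 + 1 / (p - 1)) * (n / p) * kappa)                # upper bound as stated above
```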
## 4. Related Work

We split the related work into two broad categories of interest: (a) work related to coordinate descent (CD) methods on convex functions and (b) randomized solvers designed for solving consistent linear systems.

When $A$ is positive definite, Gauss-Seidel can be interpreted as an instance of coordinate descent on a strongly convex quadratic function. We therefore review related work on both non-accelerated and accelerated coordinate descent, focusing on the randomized setting instead of the more classical cyclic order or Gauss-Southwell rule for selecting the next coordinate. See (Tseng & Yun, 2009) for a discussion on non-random selection rules, (Nutini et al., 2015) for a comparison of random selection versus Gauss-Southwell, and (Nutini et al., 2016) for efficient implementations of Gauss-Southwell.

Nesterov's original paper (2012) first considered randomized CD on convex functions, assuming a partitioning of coordinates fixed ahead of time. The analysis included both non-accelerated and accelerated variants for convex functions. This work sparked a resurgence of interest in CD methods for large problems. Most relevant to our paper are extensions to the block setting (Richtárik & Takáč, 2014), handling arbitrary sampling distributions (Qu & Richtárik, 2014a;b; Fountoulakis & Tappenden, 2016), and second order updates for quadratic functions (Qu et al., 2016).

For accelerated CD, Lee and Sidford (2013) generalize the analysis of Nesterov (2012). While the analysis of (Lee & Sidford, 2013) was limited to selecting a single coordinate at a time, several follow-on works (Qu & Richtárik, 2014a; Lin et al., 2014; Lu & Xiao, 2015; Fercoq & Richtárik, 2015) generalize to block and non-smooth settings. More recently, both Allen-Zhu et al. (2016) and Nesterov and Stich (2016) independently improve the results of (Lee & Sidford, 2013) by using a different non-uniform sampling distribution. One of the most notable aspects of the analysis in (Allen-Zhu et al., 2016) is a departure from the (probabilistic) estimate sequence framework of Nesterov. Instead, the authors construct a valid Lyapunov function for coordinate descent, although they do not explicitly mention this. In our work, we make this Lyapunov point of view explicit. The constants in our acceleration updates arise from a particular discretization and Lyapunov function outlined by Wilson et al. (2016). Using this framework makes our proof particularly transparent, and allows us to recover results for strongly convex functions from (Allen-Zhu et al., 2016) and (Nesterov & Stich, 2016) as a special case.

From the numerical analysis side, both the Gauss-Seidel and Kaczmarz algorithms are classical methods. Strohmer and Vershynin (2009) were the first to prove a linear rate of convergence for randomized Kaczmarz, and Leventhal and Lewis (2010) provide a similar kind of analysis for randomized Gauss-Seidel. Both of these were in the single constraint/coordinate setting. The block setting was later analyzed by Needell and Tropp (2014). More recently, Gower and Richtárik (2015) provide a unified analysis for both randomized block Gauss-Seidel and Kaczmarz in the sketching framework. We adopt this framework in this paper. Finally, Liu and Wright (2016) provide an accelerated analysis of randomized Kaczmarz, once again in the single constraint setting, and we extend this to the block setting.

## 5. Experiments

In this section we experimentally validate our theoretical results on how our accelerated algorithms can improve convergence rates. Our experiments use a combination of synthetic matrices and matrices from large scale machine learning tasks.

Setup. We run all our experiments on a 4-socket Intel Xeon CPU E7-8870 machine with 18 cores per socket and 1TB of DRAM. We implement all our algorithms in Python using numpy, and use the Intel MKL library with 72 OpenMP threads for numerical operations. We report errors as relative errors, i.e. $\|x_k - x_*\|_A^2 / \|x_*\|_A^2$. Finally, we use the best values of $\mu$ and $\nu$ found by tuning each experiment.

We implement fixed partitioning by creating random blocks of coordinates at the beginning of the experiment and caching the corresponding matrix blocks to improve performance. For random coordinate sampling, we select a new block of coordinates at each iteration. For our fixed partition experiments, we restrict our attention to uniform sampling. While Gower and Richtárik (2015) propose a non-uniform scheme based on $\mathrm{Tr}(S^{\mathsf T}AS)$, for translation-invariant kernels this reduces to uniform sampling. Furthermore, as the kernel block Lipschitz constants were also roughly the same, other non-uniform schemes (Allen-Zhu et al., 2016) also reduce to nearly uniform sampling.
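A hypothetical sketch of the two implementations described above (not the paper's code): with a fixed partition, the $p \times p$ blocks and their Cholesky factors can be computed once and cached, whereas random coordinate sampling must gather and factor a fresh block at every iteration.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def make_cached_block_solver(A, partition):
    """Fixed partitioning favors locality: factor each p x p block A_JJ once and reuse it."""
    cache = [(np.asarray(J), cho_factor(A[np.ix_(J, J)])) for J in partition]
    def apply_H(residual, i):            # H_k r for the i-th cached partition block
        J, factor = cache[i]
        out = np.zeros(A.shape[0])
        out[J] = cho_solve(factor, residual[J])
        return out
    return apply_H

def apply_H_random_block(A, residual, p, rng):
    """Random coordinate sampling breaks locality: gather and factor a fresh block
    every iteration (non-contiguous access plus an O(p^3) factorization)."""
    J = rng.choice(A.shape[0], size=p, replace=False)
    out = np.zeros(A.shape[0])
    out[J] = cho_solve(cho_factor(A[np.ix_(J, J)]), residual[J])
    return out
```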
Figure 1. Experiments comparing fixed partitions versus random coordinate sampling for the example from Section 3.1 ("Id+Rank One") with n = 5000 coordinates, block size p = 500. (Plot: relative error $\|x_k - x_*\|_A^2/\|x_*\|_A^2$ versus iteration for GS and GS-Acc with fixed partitions and with random coordinates.)

### 5.1. Fixed partitioning vs random coordinate sampling

Our first set of experiments numerically verifies the separation between fixed partition sampling and random coordinate sampling. Figure 1 shows the progress per iteration on solving $A_{1,\beta}x = b$, with the $A_{1,\beta}$ defined in Section 3.1. Here we set n = 5000, p = 500, β = 1000, and $b \sim N(0, I)$. Figure 1 verifies our analytical findings in Section 3.1, that the fixed partition scheme is substantially worse than random coordinate sampling on this instance. It also shows that in this case, acceleration provides little benefit in the case of random coordinate sampling. This is because both $\mu$ and $1/\nu$ are order-wise $p/n$, and hence the rates for accelerated and non-accelerated coordinate descent coincide. However, we note that this only applies to matrices where $\mu$ is as large as it can be (i.e. $p/n$), that is, instances for which Gauss-Seidel is already converging at the optimal rate (see (Gower & Richtárik, 2015), Lemma 4.2).

### 5.2. Kernel ridge regression

We next evaluate how fixed partitioning and random coordinate sampling affect the performance of Gauss-Seidel on large scale machine learning tasks. We use the popular image classification dataset CIFAR-10 and evaluate a kernel ridge regression (KRR) task with a Gaussian kernel. Specifically, given a labeled dataset $\{(x_i, y_i)\}_{i=1}^n$, we solve the linear system $(K + \lambda I)\alpha = Y$ with $K_{ij} = \exp(-\gamma\|x_i - x_j\|_2^2)$, where $\lambda, \gamma > 0$ are tunable parameters. The key property of KRR is that the kernel matrix $K$ is positive semi-definite, and hence Algorithm 1 applies.

Figure 2. Experiments comparing fixed partitions versus uniform random sampling for the CIFAR-10 augmented matrix while running kernel ridge regression. The matrix has n = 250000 coordinates and we set the block size to p = 10000.

Figure 3. Comparing conjugate gradient with accelerated and unaccelerated Gauss-Seidel methods for the CIFAR-10 augmented matrix while running kernel ridge regression. The matrix has n = 250000 coordinates and we set the block size to p = 10000.

For the CIFAR-10 dataset, we augment the dataset (similar to https://github.com/akrizhevsky/cuda-convnet2) to include five reflections and translations per image, and then apply standard pre-processing steps used in image classification (Coates & Ng, 2012; Sparks et al., 2017). We finally apply a Gaussian kernel on our pre-processed images, and the resulting kernel matrix has n = 250000 coordinates.
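As a small, hypothetical illustration of this setup (stand-in data and parameter values, not the paper's CIFAR-10 configuration), the sketch below builds a Gaussian kernel system $(K + \lambda I)\alpha = Y$ and hands it to the `accelerated_block_gauss_seidel` sketch from Section 3.2.1.

```python
import numpy as np

def gaussian_kernel(X, gamma):
    """K_ij = exp(-gamma * ||x_i - x_j||_2^2)."""
    sq = np.sum(X * X, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(D, 0.0))

# Stand-in data and parameters (the paper's CIFAR-10 setup is far larger and tuned).
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 50))
Y = rng.standard_normal(2000)
gamma, lam, p = 0.02, 1e-3, 200

A = gaussian_kernel(X, gamma) + lam * np.eye(X.shape[0])      # (K + lambda I)
alpha = accelerated_block_gauss_seidel(A, Y, np.zeros(X.shape[0]), p,
                                       mu=1e-3, nu=X.shape[0] / p,  # placeholder values; mu, nu are tuned in the paper
                                       num_iters=500)
print(np.linalg.norm(A @ alpha - Y))                          # residual of (K + lambda I) alpha = Y
```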
We also include experiments on a smaller MNIST kernel matrix (n = 60000) in Section A.7. Results from running 500 iterations of the random coordinate sampling and fixed partitioning algorithms are shown in Figure 2. Comparing convergence across iterations, similar to the previous section, we see that un-accelerated Gauss-Seidel with random coordinate sampling is better than accelerated Gauss-Seidel with fixed partitioning. However, we also see that using acceleration with random sampling can further improve the convergence rates, especially to achieve errors of $10^{-3}$ or lower.

Figure 4. The effect of block size on the accelerated Gauss-Seidel method. For the MNIST dataset (pre-processed using random features, n = 5000) we see that a block size of p = 500 works best. (Axis: time in seconds to reach $10^{-5}$ error.)

We also compare the convergence with respect to running time in Figure 3. Fixed partitioning has better performance in practice, as random access is expensive in multi-core systems. However, we see that this speedup in implementation comes at a substantial cost in terms of convergence rate. For example, in the case of CIFAR-10, using fixed partitions leads to an error of $1.2 \times 10^{-2}$ after around 7000 seconds. In comparison, we see that random coordinate sampling achieves a similar error in around 4500 seconds and is thus 1.5× faster. We also note that this speedup increases for lower error tolerances.

### 5.3. Comparing Gauss-Seidel to Conjugate-Gradient

We also compared Gauss-Seidel with random coordinate sampling to the classical conjugate-gradient (CG) algorithm. CG is an important baseline to compare with, as it is the de-facto standard iterative algorithm for solving linear systems in the numerical analysis community. While we report the results of CG without preconditioning, we remark that the performance using a standard banded preconditioner was not any better. However, for KRR specifically, there have been recent efforts (Avron et al., 2017; Rudi et al., 2017) to develop better preconditioners, and we leave a more thorough comparison for future work.

The results of our experiment are shown in Figure 3. We note that Gauss-Seidel, both with and without acceleration, outperforms CG. As an example, we note that to reach error $10^{-1}$ on CIFAR-10, CG takes roughly 7000 seconds, compared to less than 2000 seconds for accelerated Gauss-Seidel, which is a 3.5× improvement.

To understand this performance difference, we recall that our matrices $A$ are fully dense, and hence each iteration of CG takes $O(n^2)$ time. On the other hand, each iteration of both non-accelerated and accelerated Gauss-Seidel takes $O(np^2 + p^3)$. Hence, as long as $p = O(n^{2/3})$, the time per iteration of Gauss-Seidel is order-wise no worse than that of CG. In terms of iteration complexity, standard results state that CG takes at most $O(\sqrt{\kappa}\log(1/\varepsilon))$ iterations to reach an $\varepsilon$ error solution, where $\kappa$ denotes the condition number of $A$. On the other hand, Gauss-Seidel takes at most $O\!\left(\frac{n}{p}\kappa_{\mathrm{eff}}\log(1/\varepsilon)\right)$ iterations, where $\kappa_{\mathrm{eff}} = \frac{\max_{1 \le i \le n}A_{ii}}{\lambda_{\min}(A)}$. In the case of any (normalized) kernel matrix associated with a translation-invariant kernel such as the Gaussian kernel, we have $\max_{1 \le i \le n}A_{ii} = 1$, and hence generally speaking $\kappa_{\mathrm{eff}}$ is much lower than $\kappa$.
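The CG baseline in this comparison can be reproduced with SciPy's solver; the sketch below (with $x_*$ assumed to be available from a direct solve, for evaluation only) records the same relative $A$-norm error reported in the paper's plots.

```python
import numpy as np
from scipy.sparse.linalg import cg

def cg_with_error_trace(A, b, x_star, maxiter=500):
    """Run SciPy's conjugate gradient, recording the relative A-norm error
    ||x_k - x_*||_A^2 / ||x_*||_A^2 after every iteration."""
    errors = []
    denom = x_star @ A @ x_star
    def track(xk):
        d = xk - x_star
        errors.append((d @ A @ d) / denom)
    x, info = cg(A, b, maxiter=maxiter, callback=track)
    return x, errors
```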
### 5.4. Effect of block size

We next analyze the importance of the block size p for the accelerated Gauss-Seidel method. As the values of $\mu$ and $\nu$ change for each setting of $p$, we use a smaller MNIST matrix for this experiment. We apply a random feature transformation (Rahimi & Recht, 2007) to generate an $n \times d$ matrix $F$ with d = 5000 features. We then use $A = F^{\mathsf T}F$ and $b = F^{\mathsf T}Y$ as inputs to the algorithm. Figure 4 shows the wall clock time to converge to $10^{-5}$ error as we vary the block size from p = 50 to p = 1000. Increasing the block size improves the amount of progress that is made per iteration, but the time taken per iteration increases as $O(p^3)$ (Line 5, Algorithm 1). However, using efficient BLAS-3 primitives usually affords a speedup from systems techniques like cache blocking. We see the effects of this in Figure 4, where using p = 500 performs better than using p = 50. We also see that these benefits diminish for much larger block sizes, and thus p = 1000 is slower.

## 6. Conclusion

In this paper, we extended the accelerated block Gauss-Seidel algorithm beyond fixed partition sampling. Our analysis introduced a new data-dependent parameter $\nu$ which governs the speed-up of acceleration. Specializing our theory to random coordinate sampling, we derived an upper bound on $\nu$ which shows that well conditioned blocks are a sufficient condition to ensure speedup. Experimentally, we showed that random coordinate sampling is readily accelerated beyond what our bound suggests.

The most obvious question remains to derive a sharper bound on the $\nu$ constant from Theorem 3.5. Another interesting question is whether or not the iteration complexity of random coordinate sampling is always bounded above by the iteration complexity with fixed partition sampling. We also plan to study an implementation of accelerated Gauss-Seidel in a distributed setting (Tu et al., 2016). The main challenges here are in determining how to sample coordinates without significant communication overheads, and how to efficiently estimate $\mu$ and $\nu$. To do this, we wish to explore other sampling schemes such as shuffling the coordinates at the end of every epoch (Recht & Ré, 2013).

## Acknowledgements

We thank Ross Boczar for assisting us with Mathematica support for non-commutative algebras, Orianna De Masi for providing useful feedback on earlier drafts of this manuscript, and the anonymous reviewers for their helpful feedback. ACW is supported by an NSF Graduate Research Fellowship. BR is generously supported by ONR awards N00014-11-1-0723 and N00014-13-1-0129, NSF award CCF-1359814, the DARPA Fundamental Limits of Learning (FunLoL) Program, a Sloan Research Fellowship, and a Google Research Award. This research is supported in part by DHS Award HSHQDC-16-3-00083, NSF CISE Expeditions Award CCF-1139158, DOE Award SN10040 DE-SC0012463, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, IBM, SAP, The Thomas and Stacey Siebel Foundation, Apple Inc., Arimo, Blue Goji, Bosch, Cisco, Cray, Cloudera, Ericsson, Facebook, Fujitsu, HP, Huawei, Intel, Microsoft, Mitre, Pivotal, Samsung, Schlumberger, Splunk, State Farm and VMware.

## References

Allen-Zhu, Zeyuan, Richtárik, Peter, Qu, Zheng, and Yuan, Yang. Even Faster Accelerated Coordinate Descent Using Non-Uniform Sampling. In ICML, 2016.

Avron, Haim, Clarkson, Kenneth L., and Woodruff, David P. Faster Kernel Ridge Regression Using Sketching and Preconditioning. arXiv:1611.03220, 2017.

Coates, Adam and Ng, Andrew Y. Learning Feature Representations with K-Means. In Neural Networks: Tricks of the Trade. Springer, 2012.
Fercoq, Olivier and Richtárik, Peter. Accelerated, Parallel, and Proximal Coordinate Descent. SIAM J. Optim., 25(4), 2015.

Fountoulakis, Kimon and Tappenden, Rachael. A Flexible Coordinate Descent Method. arXiv:1507.03713, 2016.

Gower, Robert M. and Richtárik, Peter. Randomized Iterative Methods for Linear Systems. SIAM Journal on Matrix Analysis and Applications, 36, 2015.

Lee, Ching-Pei and Wright, Stephen J. Random Permutations Fix a Worst Case for Cyclic Coordinate Descent. arXiv:1607.08320, 2016.

Lee, Yin Tat and Sidford, Aaron. Efficient Accelerated Coordinate Descent Methods and Faster Algorithms for Solving Linear Systems. In FOCS, 2013.

Leventhal, Dennis and Lewis, Adrian S. Randomized Methods for Linear Constraints: Convergence Rates and Conditioning. Mathematics of Operations Research, 35(3), 2010.

Lin, Qihang, Lu, Zhaosong, and Xiao, Lin. An Accelerated Proximal Coordinate Gradient Method. In NIPS, 2014.

Liu, Ji and Wright, Stephen J. An Accelerated Randomized Kaczmarz Algorithm. Mathematics of Computation, 85(297), 2016.

Lu, Zhaosong and Xiao, Lin. On the Complexity Analysis of Randomized Block-Coordinate Descent Methods. Mathematical Programming, 152(1-2), 2015.

Needell, Deanna and Tropp, Joel A. Paved with Good Intentions: Analysis of a Randomized Block Kaczmarz Method. Linear Algebra and its Applications, 441, 2014.

Nesterov, Yurii. Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM J. Optim., 22(2), 2012.

Nesterov, Yurii and Stich, Sebastian. Efficiency of Accelerated Coordinate Descent Method on Structured Optimization Problems. Technical report, Université catholique de Louvain, CORE Discussion Papers, 2016.

Nutini, Julie, Schmidt, Mark, Laradji, Issam H., Friedlander, Michael, and Koepke, Hoyt. Coordinate Descent Converges Faster with the Gauss-Southwell Rule Than Random Selection. In ICML, 2015.

Nutini, Julie, Sepehry, Behrooz, Laradji, Issam, Schmidt, Mark, Koepke, Hoyt, and Virani, Alim. Convergence Rates for Greedy Kaczmarz Algorithms, and Faster Randomized Kaczmarz Rules Using the Orthogonality Graph. In UAI, 2016.

Qu, Zheng and Richtárik, Peter. Coordinate Descent with Arbitrary Sampling I: Algorithms and Complexity. arXiv:1412.8060, 2014a.

Qu, Zheng and Richtárik, Peter. Coordinate Descent with Arbitrary Sampling II: Expected Separable Overapproximation. arXiv:1412.8063, 2014b.

Qu, Zheng, Richtárik, Peter, and Zhang, Tong. Randomized Dual Coordinate Ascent with Arbitrary Sampling. In NIPS, 2015.

Qu, Zheng, Richtárik, Peter, Takáč, Martin, and Fercoq, Olivier. SDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization. In ICML, 2016.

Rahimi, Ali and Recht, Benjamin. Random Features for Large-Scale Kernel Machines. In NIPS, 2007.

Recht, Benjamin and Ré, Christopher. Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Mathematical Programming Computation, 5(2):201-226, 2013.

Richtárik, Peter and Takáč, Martin. Iteration Complexity of Randomized Block-Coordinate Descent Methods for Minimizing a Composite Function. Mathematical Programming, 114, 2014.

Rudi, Alessandro, Carratino, Luigi, and Rosasco, Lorenzo. FALKON: An Optimal Large Scale Kernel Method. arXiv:1705.10958, 2017.

Sparks, Evan R., Venkataraman, Shivaram, Kaftan, Tomer, Franklin, Michael, and Recht, Benjamin. KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. In ICDE, 2017.
Strohmer, Thomas and Vershynin, Roman. A Randomized Kaczmarz Algorithm with Exponential Convergence. Journal of Fourier Analysis and Applications, 15(1), 2009.

Tseng, Paul and Yun, Sangwoon. A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization. Mathematical Programming, 117(1), 2009.

Tu, Stephen, Roelofs, Rebecca, Venkataraman, Shivaram, and Recht, Benjamin. Large Scale Kernel Learning using Block Coordinate Descent. arXiv:1602.05310, 2016.

Tu, Stephen, Venkataraman, Shivaram, Wilson, Ashia C., Gittens, Alex, Jordan, Michael I., and Recht, Benjamin. Breaking Locality Accelerates Block Gauss-Seidel. arXiv:1701.03863, 2017.

Wilson, Ashia C., Recht, Benjamin, and Jordan, Michael I. A Lyapunov Analysis of Momentum Methods in Optimization. arXiv:1611.02635, 2016.