# Coded Distributed Computing for Inverse Problems

Yaoqing Yang, Pulkit Grover and Soummya Kar
Carnegie Mellon University
{yyaoqing, pgrover, soummyak}@andrew.cmu.edu

Computationally intensive distributed and parallel computing is often bottlenecked by a small set of slow workers known as stragglers. In this paper, we utilize the emerging idea of coded computation to design a novel error-correcting-code-inspired technique for solving linear inverse problems under specific iterative methods in a parallelized implementation affected by stragglers. Example machine-learning applications include inverse problems such as personalized Page Rank and sampling on graphs. We provably show that our coded-computation technique can reduce the mean-squared error under a computational deadline constraint. In fact, the ratio of the mean-squared error of replication-based and coded techniques diverges to infinity as the deadline increases. Our experiments for personalized Page Rank performed on real systems and real social networks show that this ratio can be as large as $10^4$. Further, unlike coded-computation techniques proposed thus far, our strategy combines the outputs of all workers, including the stragglers, to produce more accurate estimates at the computational deadline. This also ensures that the accuracy degrades gracefully in the event that the number of stragglers is large.

1 Introduction

The speed of distributed computing is often limited by a few slow workers known as stragglers [1-4]. This issue is often addressed by replicating tasks across workers and using this redundancy to ignore some of the stragglers. Recently, methods from error-correcting codes (ECC) have been used for speeding up distributed computing [5-15], building on classical works on algorithm-based fault tolerance [16]. The key idea is to treat stragglers as erasures and use ECC to retrieve the result after a subset of fast workers has finished. In some cases (e.g., [6, 8] for matrix multiplications), techniques that utilize ECC achieve scaling-sense speedups in average computation time compared to replication.

In this work, we propose a novel coding-inspired technique to deal with stragglers in distributed computing of linear inverse problems using iterative solvers [17]. Existing techniques that use coding to deal with stragglers treat straggling workers as erasures, that is, they ignore the computation results of the stragglers. In contrast, when using iterative methods for linear inverse problems, even if the computation result at a straggler has not converged, the proposed algorithm does not ignore the result, but instead combines it (with appropriate weights) with results from other workers. This is in part because the results of iterative methods often converge gradually to the true solutions. We use a small example, shown in Fig. 1, to illustrate this idea. Suppose we want to solve two linear inverse problems with solutions $x_1$ and $x_2$. We encode the computation by adding an extra linear inverse problem with solution $x_1 + x_2$ (see Section 3), and distribute these three problems to three workers. Using this method, the solutions $x_1$ and $x_2$ can be obtained from the results of any combination of two fast workers that first return their solutions. But what if we have a computational deadline, $T_{dl}$, by which only one worker converges? The natural extension of existing strategies (e.g., [6]) will declare a failure because it needs at least two workers to respond.
However, our strategy does not require convergence: even intermediate results can be utilized to estimate solutions. In other words, our strategy degrades gracefully as the number of stragglers increases, or as the deadline is pulled earlier. Indeed, we show that it is suboptimal to treat stragglers as erasures and ignore them, and we design strategies that treat the difference from the optimal solution as soft additive noise (see Section 3). We use an algorithm that is similar to weighted least-squares for decoding, giving each worker a weight based on its proximity to convergence. In this way, we can expect to fully utilize the computation results from all workers and obtain better speedup.

Figure 1: A comparison between the existing scheme in [6] and the proposed algorithm. (Figure omitted. Left panel, "General coded computation": the decoder waits for two fast workers and fails at the deadline $T_{dl}$. Right panel, "Proposed coded method": the decoder takes a weighted combination of all partial results $x_1 + e_1$, $x_2 + e_2$, $x_1 + x_2 + e_3$.)

Theoretically, we show that for a specified deadline time $T_{dl}$, under certain conditions on the worker speed distributions, the coded linear inverse solver using structured codes has smaller mean-squared error than the replication-based linear solver (Theorem 4.4). In fact, under more relaxed conditions on the worker speed distributions, as the computation time $T_{dl}$ increases, the ratio of the mean-squared error (MSE) of replication-based and coded linear solvers can get arbitrarily large (Theorem 4.5)!

For validation of our theory, we performed experiments to compare coded and replication-based computation for a graph mining problem, namely personalized Page Rank [18], using the classical power-iteration method [19]. We conduct experiments on the Twitter and Google Plus social networks under a deadline on computation time using a given number of workers on a real computation cluster (Section 6). We observe that the MSE of coded Page Rank is smaller than that of replication by a factor of $10^4$ at $T_{dl} = 2$ seconds. From an intuitive perspective, the advantage of coding over replication is that coding utilizes the diversity of all heterogeneous workers, whereas replication cannot (see Section 7 for details). To compare with the existing coded technique in [6], we adapt it to inverse problems by inverting only the partial results from the fast workers. However, in our experiments, if only the results from the fast workers are used, the error is amplified by inverting an ill-conditioned submatrix during decoding (Section 6). This ill-conditioning issue of real-number erasure codes has also been recognized in a recent communication problem [20]. In contrast, our novel way of combining all the partial results, including those from the stragglers, helps bypass the difficulty of inverting an ill-conditioned matrix.

The focus of this work is on utilizing computations to deliver the minimal MSE in solving linear inverse problems. Our algorithm does not reduce the communication cost. However, because each worker performs sophisticated iterative computations in our problem, such as the power-iteration computations, the time required for computation dominates that of communication (Section 5.2). This is unlike some recent works (e.g., [21-24]) where communication costs are observed to dominate because the per-processor computation is smaller.
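To make the weighted-combination idea of Fig. 1 concrete, the following is a minimal numerical sketch (ours, not from the paper). It assumes a scalar contraction $x \leftarrow a x + (1-a) r$ whose fixed point is $r$, three workers (the third running the "coded" problem with input $r_1 + r_2$), and hypothetical iteration counts in which worker 2 straggles. The decoder weights each worker by the inverse of its expected remaining squared error, as in Section 3.

```python
# Toy illustration of Fig. 1: weighted decoding that uses all partial results,
# including the straggler's. Numbers and iteration counts are illustrative.
import numpy as np

a = 0.85                                   # contraction factor, rho(B) = a < 1
r = np.array([3.0, -1.0])                  # true solutions x1 = 3, x2 = -1
inputs = np.array([r[0], r[1], r[0] + r[1]])
G = np.array([[1.0, 0.0, 1.0],             # generator matrix of the (3, 2) code
              [0.0, 1.0, 1.0]])

l = np.array([40, 3, 35])                  # iterations done by the deadline; worker 2 straggles
y = np.zeros(3)                            # initial estimates (mean assumed to be 0)
for i in range(3):
    for _ in range(l[i]):
        y[i] = a * y[i] + (1 - a) * inputs[i]

# Weight each worker by 1 / a^(2 l_i), the inverse of its expected remaining MSE.
w = a ** (-2.0 * l)
L = np.linalg.solve((G * w) @ G.T, G * w)  # (G W G^T)^-1 G W with W = diag(w)
x_hat = L @ y
print("estimate:", x_hat, " true:", r)
```

With these numbers the straggler receives a very small weight, so the decoder effectively recovers $x_2$ from workers 1 and 3 while still using worker 2's partial result, instead of declaring a failure.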
Finally, we summarize our main contributions in this paper: we propose a coded computing algorithm for multiple instances of a linear inverse problem; we theoretically analyze the mean-squared error of coded, uncoded and replication-based iterative linear solvers under a deadline constraint, and show a scaling-sense advantage of coded solvers in theory and orders-of-magnitude smaller error in data experiments; and this is the first work that treats stragglers as soft errors instead of erasures, which leads to graceful degradation in the event that the number of stragglers is large.

2 System Model and Problem Formulation

2.1 Preliminaries on Solving Linear Systems using Iterative Methods

Consider the problem of solving $k$ inverse problems with the same linear transform matrix $M$ and different inputs $r_i$: $M x_i = r_i$, $i = 1, 2, \ldots, k$. When $M$ is a square matrix, the closed-form solution is $x_i = M^{-1} r_i$. When $M$ is a non-square matrix, the regularized least-squares solution is $x_i = (M^\top M + \lambda I)^{-1} M^\top r_i$, $i = 1, 2, \ldots, k$, with an appropriate regularization parameter $\lambda$. Since matrix inversion is hard, iterative methods are often used. We now look at two ordinary iterative methods, namely the Jacobian method [17] and the gradient descent method. For a square matrix $M = D + L$, where $D$ is diagonal, the Jacobian iteration is written as $x_i^{(l+1)} = D^{-1}(r_i - L x_i^{(l)})$. Under certain conditions on $D$ and $L$ ([17, p. 115]), the computation result converges to the true solution. One example is the Page Rank algorithm discussed in Section 2.2. For the $\ell_2$-minimization problem with a non-square $M$, the gradient descent method has the form $x_i^{(l+1)} = ((1-\lambda)I - \epsilon M^\top M)\, x_i^{(l)} + \epsilon M^\top r_i$, where $\epsilon$ is an appropriate step-size. We can see that both the Jacobian iteration and the gradient descent iteration mentioned above have the form

$$x_i^{(l+1)} = B x_i^{(l)} + K r_i, \quad i = 1, 2, \ldots, k, \qquad (1)$$

for two appropriate matrices $B$ and $K$, which solves the following equation with true solution $x_i^*$:

$$x_i^* = B x_i^* + K r_i, \quad i = 1, 2, \ldots, k. \qquad (2)$$

Therefore, subtracting (2) from (1), we have that the computation error $e_i^{(l)} = x_i^{(l)} - x_i^*$ satisfies

$$e_i^{(l+1)} = B e_i^{(l)}. \qquad (3)$$

For the iterative method to converge, we always assume that the spectral radius satisfies $\rho(B) < 1$ (see [17, p. 115]). We study iterative methods of the form (1) throughout this paper.

2.2 Motivating Applications of Linear Inverse Problems

Our coded computation technique requires solving multiple inverse problems with the same linear transform matrix $M$. One such problem is personalized Page Rank. For a directed graph, the Page Rank algorithm [19] aims to measure the nodes' importance by solving the linear problem $x = \frac{d}{N}\mathbf{1}_N + (1-d)Ax$, where $d = 0.15$ is called the teleport probability, $N$ is the number of nodes and $A$ is the column-normalized adjacency matrix. The personalized Page Rank problem [18] considers the more general equation $x = dr + (1-d)Ax$ for any vector $r \in \mathbb{R}^N$ that satisfies $\mathbf{1}^\top r = 1$. Compared to Page Rank [19], personalized Page Rank [18] incorporates $r$ as the preference of different users or topics. A classical method to solve Page Rank is power iteration, which repeats the computation $x^{(l+1)} = dr + (1-d)Ax^{(l)}$ until convergence. This iterative method has the form (1) and is essentially the Jacobian method mentioned above. Another example application is the sampling and recovery problem in the emerging field of graph signal processing [25, 26], a non-square system, which is discussed in Supplementary Section 8.1.
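As a concrete instance of the iteration (1) with $B = (1-d)A$ and $K = dI$, here is a minimal sketch of personalized Page Rank power iteration. This is our illustration, not the paper's code; the function name `personalized_pagerank` and the toy 4-node graph are assumptions made for the example.

```python
# Minimal sketch: personalized Page Rank via power iteration
# x^(l+1) = d*r + (1-d)*A*x^(l), an instance of the generic iteration (1).
import numpy as np
import scipy.sparse as sp

def personalized_pagerank(A, r, d=0.15, num_iters=100):
    """A: column-normalized (sparse) adjacency matrix, r: preference vector."""
    x = np.full(r.shape, 1.0 / len(r))       # arbitrary initial estimate
    for _ in range(num_iters):
        x = d * r + (1 - d) * (A @ x)         # one iteration of the form (1)
    return x

# Toy usage: a 4-node directed cycle with a uniform preference vector.
rows, cols = [0, 1, 2, 3], [1, 2, 3, 0]
A = sp.csc_matrix((np.ones(4), (rows, cols)), shape=(4, 4))
A = A @ sp.diags(1.0 / np.asarray(A.sum(axis=0)).ravel())   # column-normalize
r = np.ones(4) / 4
print(personalized_pagerank(A, r))
```

In the coded algorithm of Section 3, the initial estimate would instead be set to the estimated mean of the solutions, and the loop would be stopped at the deadline rather than after a fixed iteration count.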
2.3 Problem Formulation: Distributed Computing and the Straggler Effect

Consider solving $k$ linear inverse problems $M x_i = r_i$, $i = 1, 2, \ldots, k$, on $n > k$ workers using the iterative method (1), where each worker solves one inverse problem. Due to the straggler effect, the computations at different workers can proceed at different speeds. The goal is to obtain minimal MSE in solving the linear inverse problems before a deadline time $T_{dl}$. Suppose that by $T_{dl}$ the $i$-th worker has completed $l_i$ iterations of (1). Then, from (3), the residual error at the $i$-th worker is

$$e_i^{(l_i)} = B^{l_i} e_i^{(0)}. \qquad (4)$$

For our theoretical results, we sometimes need the following assumption.

Assumption 1. We assume that the optimal solutions $x_i^*$, $i = 1, 2, \ldots, k$, are i.i.d. Denote by $\mu_E$ and $C_E$ respectively the mean and the covariance of each $x_i^*$.

Note that Assumption 1 is equivalent to assuming that the inputs $r_i$, $i = 1, 2, \ldots, k$, are i.i.d., because $r_i$ and $x_i^*$ are related by the linear equation (2). For the personalized Page Rank problem discussed above, this assumption is reasonable because queries from different users or topics are unrelated. Assume we have estimated the mean $\mu_E$ beforehand and we start with the initial estimate $x_i^{(0)} = \mu_E$. Then, $e_i^{(0)} = x_i^{(0)} - x_i^*$ has mean $\mathbf{0}_N$ and covariance $C_E$. We also extend our results to the case when the $x_i^*$'s (or equivalently, the $r_i$'s) are correlated. Since the extension is rather long and may hinder the understanding of the main paper, we provide it in Supplementary Sections 8.2 and 8.5.

2.4 Preliminaries on Error-Correcting Codes

We use "encode" and "decode" to denote the preprocessing and post-processing before and after the parallel computation. In this paper, the encoder multiplies the inputs to the parallel workers by a generator matrix $G$, and the decoder multiplies the outputs of the workers by a decoding matrix $L$ (see Algorithm 1). We call a code an $(n, k)$ code if the generator matrix has size $k \times n$. We often use generator matrices $G$ with orthonormal rows, which means $G_{k \times n} G_{k \times n}^\top = I_k$. An example of such a matrix is the submatrix formed by any $k$ rows of an $n \times n$ orthonormal matrix (e.g., a Fourier matrix). Under this assumption, $G_{k \times n}$ can be augmented to form an $n \times n$ orthonormal matrix using another matrix $H_{(n-k) \times n}$, i.e., the square matrix $F_{n \times n} = \begin{bmatrix} G_{k \times n} \\ H_{(n-k) \times n} \end{bmatrix}$ satisfies $F^\top F = I_n$.

3 Coded Distributed Computing of Linear Inverse Problems

The proposed coded linear inverse algorithm (Algorithm 1) has three stages: (1) preprocessing (encoding) at the central controller, (2) parallel computing at $n > k$ parallel workers, and (3) post-processing (decoding) at the central controller. As we show later in the analysis of the computation error, the entries $\mathrm{trace}(C^{(l_i)})$ of the diagonal matrix $\Lambda$ are the expected MSE at each worker prior to decoding. The decoding matrix $L_{k \times n}$ in the decoding step (7) is chosen to be $(G\Lambda^{-1}G^\top)^{-1}G\Lambda^{-1}$ to reduce the mean-squared error of the estimates of the linear inverse solutions by assigning different weights to different workers based on the estimated accuracy of their computation (which is what $\Lambda$ provides). This particular choice is inspired by the weighted least-squares solution.

Algorithm 1 Coded Distributed Linear Inverse

Input: Input vectors $[r_1, r_2, \ldots, r_k]$, generator matrix $G_{k \times n}$, the linear system matrices $B$ and $K$ defined in (1).

Initialize (Encoding): Encode the input vectors and the initial estimates by multiplying by $G$:

$$[s_1, s_2, \ldots, s_n] = [r_1, r_2, \ldots, r_k]\, G, \qquad (5)$$

$$[y_1^{(0)}, y_2^{(0)}, \ldots, y_n^{(0)}] = [x_1^{(0)}, x_2^{(0)}, \ldots, x_k^{(0)}]\, G. \qquad (6)$$
Parallel Computing:
for i = 1 to n (in parallel) do
    Send $s_i$ and $y_i^{(0)}$ to the $i$-th worker.
    Execute the iterative method (1) with initial estimate $y_i^{(0)}$ and input $s_i$ at each worker.
end for
After the deadline time $T_{dl}$, collect all linear inverse results $y_i^{(l_i)}$ from the $n$ workers. The superscript $l_i$ in $y_i^{(l_i)}$ indicates that the $i$-th worker finished $l_i$ iterations. Denote by $Y^{(T_{dl})}$ the collection of all results, $Y^{(T_{dl})}_{N \times n} = [y_1^{(l_1)}, y_2^{(l_2)}, \ldots, y_n^{(l_n)}]$.

Post-processing (decoding at the central controller): Compute an estimate of the linear inverse solutions using the following matrix multiplication:

$$\hat{X}^\top = L\, (Y^{(T_{dl})})^\top := (G\Lambda^{-1}G^\top)^{-1} G \Lambda^{-1}\, (Y^{(T_{dl})})^\top, \qquad (7)$$

where the estimate is $\hat{X}_{N \times k} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k]$ and the matrix $\Lambda$ is

$$\Lambda = \mathrm{diag}\big[\mathrm{trace}(C^{(l_1)}), \ldots, \mathrm{trace}(C^{(l_n)})\big], \qquad (8)$$

where the matrices $C^{(l_i)}$, $i = 1, \ldots, n$, are defined as

$$C^{(l_i)} = B^{l_i} C_E (B^\top)^{l_i}. \qquad (9)$$

In the computation of $\Lambda$, if the values $\mathrm{trace}(C^{(l_i)})$ are not available, one can use precomputed estimates of this trace, as discussed in Supplementary Section 8.9, with negligible computational complexity and theoretically guaranteed accuracy.

3.1 Bounds on the Performance of the Coded Linear Inverse Algorithm

Define $l = [l_1, l_2, \ldots, l_n]$ as the vector of the numbers of iterations at all workers. $E[\,\cdot\,|\,l]$ denotes the conditional expectation taken with respect to the randomness of the optimal solutions $x_i^*$ (see Assumption 1) conditioned on fixed iteration numbers $l_i$ at each worker, i.e., $E[X|l] = E[X|l_1, l_2, \ldots, l_n]$. Define $X^*_{N \times k} = [x_1^*, x_2^*, \ldots, x_k^*]$ as the matrix composed of all the true solutions.

Theorem 3.1. Define $E = \hat{X} - X^*$, i.e., the error of the decoding result (7). Assume that the solutions of the linear inverse problems are chosen i.i.d. (across all problems) according to a distribution with covariance $C_E$. Then, the error $E$ satisfies

$$E[\|E\|^2\,|\,l] \le \sigma_{\max}(G^\top G)\,\mathrm{trace}\big((G\Lambda^{-1}G^\top)^{-1}\big), \qquad (10)$$

where the norm is the Frobenius norm, $\sigma_{\max}(G^\top G)$ is the maximum eigenvalue of $G^\top G$, and the matrix $\Lambda$ is defined in (8). Further, when $G$ has orthonormal rows,

$$E[\|E\|^2\,|\,l] \le \mathrm{trace}\big((G\Lambda^{-1}G^\top)^{-1}\big). \qquad (11)$$

Proof overview. See Supplementary Section 8.3 for the complete proof. Here we provide the main intuition by analyzing a scalar version of the linear inverse problem, in which case the matrix $B$ is equal to a scalar $a$. For $B = a$, the inputs and the initial estimates in (5) and (6) are vectors instead of matrices. As we show in Supplementary Section 8.3, by encoding both the inputs and the initial estimates using (5) and (6), we also encode the error:

$$[\epsilon_1^{(0)}, \epsilon_2^{(0)}, \ldots, \epsilon_n^{(0)}] = [e_1^{(0)}, e_2^{(0)}, \ldots, e_k^{(0)}]\, G =: E_0 G, \qquad (12)$$

where $\epsilon_i^{(0)} = y_i^{(0)} - y_i^*$ is the initial error at the $i$-th worker, $e_i^{(0)} = x_i^{(0)} - x_i^*$ is the initial error of the $i$-th linear inverse problem, and $E_0 := [e_1^{(0)}, e_2^{(0)}, \ldots, e_k^{(0)}]$. Suppose $\mathrm{var}[e_i^{(0)}] = c_e$, which is a scalar version of $C_E$ under Assumption 1. From (4), the error satisfies

$$\epsilon_i^{(l_i)} = a^{l_i} \epsilon_i^{(0)}, \quad i = 1, 2, \ldots, n. \qquad (13)$$

Denote $D = \mathrm{diag}\{a^{l_1}, a^{l_2}, \ldots, a^{l_n}\}$. Therefore, from (12) and (13), the error before the decoding step (7) can be written as

$$[\epsilon_1^{(l_1)}, \epsilon_2^{(l_2)}, \ldots, \epsilon_n^{(l_n)}] = [\epsilon_1^{(0)}, \epsilon_2^{(0)}, \ldots, \epsilon_n^{(0)}]\, D = E_0 G D. \qquad (14)$$

We can show (see Supplementary Section 8.3 for details) that after the decoding step (7), the error vector is also multiplied by the decoding matrix $L = (G\Lambda^{-1}G^\top)^{-1}G\Lambda^{-1}$:

$$E^\top = L\, [\epsilon_1^{(l_1)}, \epsilon_2^{(l_2)}, \ldots, \epsilon_n^{(l_n)}]^\top = L D G^\top E_0^\top. \qquad (15)$$
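The following is a minimal end-to-end sketch of Algorithm 1 on a toy problem; it is our illustration, not the authors' implementation. It assumes $B$ and $K$ come from personalized Page Rank ($B = (1-d)A$, $K = dI$), $C_E = I$ with $\mu_E = 0$, a random generator matrix with orthonormal rows (the paper's experiments use a DFT submatrix instead), and worker speeds simulated by drawing the iteration counts $l_i$ directly.

```python
# Sketch of Algorithm 1: encode inputs and initial estimates, run the iteration
# at n simulated workers for l_i steps each, then decode with the weighted
# least-squares matrix L = (G Lam^-1 G^T)^-1 G Lam^-1.
import numpy as np

rng = np.random.default_rng(0)
N, k, n, d = 50, 4, 6, 0.15

A = rng.random((N, N))
A /= A.sum(axis=0, keepdims=True)          # column-normalized adjacency matrix
B, K = (1 - d) * A, d * np.eye(N)          # iteration x <- B x + K r, eq. (1)

X_star = rng.standard_normal((N, k))       # true solutions; C_E = I (Assumption 1)
R = (np.eye(N) - B) @ X_star / d           # inputs r_i consistent with eq. (2)

G = np.linalg.qr(rng.standard_normal((n, n)))[0][:k]   # k x n, G @ G.T = I_k

S = R @ G                                  # encode inputs, eq. (5)
Y = np.zeros((N, n))                       # encoded initial estimates (mu_E = 0), eq. (6)

l = rng.integers(5, 60, size=n)            # iterations finished by each worker by T_dl
for i in range(n):                         # "parallel" computation, eq. (1)
    for _ in range(l[i]):
        Y[:, i] = B @ Y[:, i] + K @ S[:, i]

# Lambda_ii = trace(C^(l_i)) = trace(B^l_i C_E (B^T)^l_i) with C_E = I, eqs. (8)-(9)
Lam = np.array([np.trace(np.linalg.matrix_power(B, int(li)) @
                         np.linalg.matrix_power(B, int(li)).T) for li in l])
L = np.linalg.solve((G / Lam) @ G.T, G / Lam)   # (G Lam^-1 G^T)^-1 G Lam^-1
X_hat = (L @ Y.T).T                             # decode, eq. (7)
print("coded MSE  :", np.mean((X_hat - X_star) ** 2))

Xu = np.zeros((N, k))                      # uncoded baseline: first k workers solve
for i in range(k):                         # the original (unencoded) problems
    for _ in range(l[i]):
        Xu[:, i] = B @ Xu[:, i] + K @ R[:, i]
print("uncoded MSE:", np.mean((Xu - X_star) ** 2))
```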
Therefore,

$$\begin{aligned}
E[\|E\|^2\,|\,l] &= E[\mathrm{trace}(E^\top E)\,|\,l] = \mathrm{trace}\big(L D G^\top\, E[E_0^\top E_0\,|\,l]\, G D L^\top\big) \\
&\overset{(a)}{=} \mathrm{trace}\big(L D G^\top (c_e I_k)\, G D L^\top\big) = c_e\,\mathrm{trace}\big(L D G^\top G D L^\top\big) \\
&\overset{(b)}{\le} c_e\, \sigma_{\max}(G^\top G)\, \mathrm{trace}\big(L D^\top D L^\top\big) = \sigma_{\max}(G^\top G)\, \mathrm{trace}\big(L (c_e D^\top D) L^\top\big) \\
&\overset{(c)}{=} \sigma_{\max}(G^\top G)\, \mathrm{trace}\big(L \Lambda L^\top\big) \overset{(d)}{=} \sigma_{\max}(G^\top G)\, \mathrm{trace}\big((G\Lambda^{-1}G^\top)^{-1}\big),
\end{aligned}$$

where (a) holds because $E_0 := [e_1^{(0)}, \ldots, e_k^{(0)}]$ and $\mathrm{var}[e_i^{(0)}] = c_e$, (b) holds because $G^\top G \preceq \sigma_{\max}(G^\top G)\, I_n$, and (c) holds because $c_e D^\top D = \Lambda$, which follows from the fact that for a scalar linear system matrix $B = a$, the entries of the $\Lambda$ matrix in (8) satisfy

$$\mathrm{trace}(C^{(l_i)}) = a^{l_i}\, c_e\, a^{l_i} = c_e\, a^{2 l_i}, \qquad (17)$$

which is the same as the entries of the diagonal matrix $c_e D^\top D$. Finally, (d) is obtained by directly plugging in $L := (G\Lambda^{-1}G^\top)^{-1}G\Lambda^{-1}$. Inequality (11) holds because when $G$ has orthonormal rows, $\sigma_{\max}(G^\top G) = 1$.

Additionally, we note that in (10), the term $\mathrm{trace}\big((G\Lambda^{-1}G^\top)^{-1}\big)$ resembles the MSE of an ordinary weighted least-squares solution, while the term $\sigma_{\max}(G^\top G)$ represents the inaccuracy incurred by using the weighted least-squares solution as the decoding result, because the inputs to different workers become correlated after multiplying the i.i.d. inputs by the matrix $G$ (see (5)).

4 Comparison with Uncoded Schemes and Replication-based Schemes

Here, we often assume (stated explicitly in each theorem) that the numbers of iterations $l_i$ at different workers are i.i.d. We use $E_f[\cdot]$ to denote the expectation over the randomness of both the linear inverse solutions $x_i^*$ and the numbers of iterations $l_i$ (this is different from the notation $E[\,\cdot\,|\,l]$).

Assumption 2. Within time $T_{dl}$, the number of iterations of the linear inverse computation (see (1)) at each worker follows an i.i.d. distribution, $l_i \sim f(l)$.

4.1 Comparison between the coded and uncoded linear inverse before a deadline

First, we compare the coded linear inverse scheme with an uncoded scheme, in which the first $k$ workers solve the $k$ linear inverse problems in (2) without coding. The following theorem quantifies the overall mean-squared error of the uncoded scheme given $l_1, l_2, \ldots, l_k$. The proof is in Supplementary Section 8.6.

Theorem 4.1. In the uncoded scheme, the error satisfies $E\big[\|E_{\mathrm{uncoded}}\|^2 \,\big|\, l\big] = E\big[\|[e_1^{(l_1)}, \ldots, e_k^{(l_k)}]\|^2 \,\big|\, l\big] = \sum_{i=1}^{k} \mathrm{trace}\big(C^{(l_i)}\big)$. Further, when the i.i.d. Assumption 2 holds,

$$E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] = k\, E_f\big[\mathrm{trace}(C^{(l_1)})\big]. \qquad (18)$$

Next, we compare the overall mean-squared error of the coded and uncoded linear inverse algorithms. Note that this comparison is not fair because the coded algorithm uses more workers than the uncoded one. However, we still include Theorem 4.2 because we need it for the fair comparison between the coded and replication-based linear inverse. The proof is in Supplementary Section 8.4.

Theorem 4.2. (Coded linear inverse beats uncoded) Suppose the i.i.d. Assumptions 1 and 2 hold and suppose $G$ is a $k \times n$ submatrix of an $n \times n$ Fourier transform matrix $F$, i.e., $F_{n\times n} = \begin{bmatrix} G_{k\times n} \\ H_{(n-k)\times n} \end{bmatrix}$. Then, the expected error of the coded linear inverse is strictly less than that of the uncoded one:

$$E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] - E_f\big[\|E_{\mathrm{coded}}\|^2\big] \ge E_f\big[\mathrm{trace}(J_2 J_4^{-1} J_2^\top)\big], \qquad (19)$$

where $J_2$ and $J_4$ are the submatrices of $F\Lambda F^\top := \begin{bmatrix} J_1 & J_2 \\ J_2^\top & J_4 \end{bmatrix}_{n\times n}$ and the matrix $\Lambda$ is defined in (8). That is, $(J_1)_{k\times k}$ is $G\Lambda G^\top$, $(J_2)_{k\times (n-k)}$ is $G\Lambda H^\top$, and $(J_4)_{(n-k)\times(n-k)}$ is $H\Lambda H^\top$.

4.2 Comparison between the replication-based and coded linear inverse before a deadline

Consider an alternative way of doing linear inverse computations using $n > k$ workers. In this paper, we only consider the case $n - k < k$, i.e., the total number of workers is only slightly larger than the number of problems (both in theory and in experiments).
Since we have $n - k$ extra workers, a natural way is to pick any $n - k$ linear inverse problems and replicate them on these extra $n - k$ workers. After we obtain two computation results for the same equation, we use one of two natural decoding strategies for this replication-based linear inverse: (i) choose the result of the worker with the higher number of iterations; (ii) compute the weighted average using weights $\frac{w_1}{w_1+w_2}$ and $\frac{w_2}{w_1+w_2}$, where $w_1 = 1/\sqrt{\mathrm{trace}(C^{(l_1)})}$, $w_2 = 1/\sqrt{\mathrm{trace}(C^{(l_2)})}$, and $l_1$ and $l_2$ are the numbers of iterations completed at the two workers (recall that $\mathrm{trace}(C^{(l_i)})$ represents the residual MSE at the $i$-th worker).

Theorem 4.3. The replication-based schemes satisfy the following lower bound on the MSE:

$$E_f\big[\|E_{\mathrm{rep}}\|^2\big] > E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] - (n-k)\, E_f\big[\mathrm{trace}(C^{(l_1)})\big]. \qquad (20)$$

Proof overview. The goal here is to obtain a lower bound on the MSE of the replication-based linear inverse and compare it with an upper bound on the MSE of the coded linear inverse. Note that if an extra worker is used to replicate the computation at the $i$-th worker, i.e., the linear inverse problem with input $r_i$ is solved on two workers, the expected error of the result of the $i$-th problem could at best be reduced from $E_f[\mathrm{trace}(C^{(l_1)})]$ (see Theorem 4.1) to zero.¹ Therefore, $n - k$ extra workers make the error decrease by at most (and strictly less than) $(n-k)\, E_f[\mathrm{trace}(C^{(l_1)})]$.

Using this lower bound, we can provably show that coded linear inverse beats replication-based linear inverse when certain conditions are satisfied. One crucial condition is that the distribution of the random variable $\mathrm{trace}(C^{(l)})$ (i.e., the expected MSE at each worker) satisfies a variance heavy-tail property defined as follows.

Definition 1. The random variable $\mathrm{trace}(C^{(l)})$ is said to have the $\rho$-variance heavy-tail property if

$$\mathrm{var}_f\big[\mathrm{trace}(C^{(l)})\big] > \rho\, E_f^2\big[\mathrm{trace}(C^{(l)})\big], \qquad (21)$$

for some constant $\rho > 1$. Notice that the term $\mathrm{trace}(C^{(l)})$ is essentially the remaining MSE after $l$ iterations at a single machine. Therefore, this property simply means that the remaining error at a single machine has large variance.

For the coded linear inverse, we will use a "Fourier code", whose generator matrix $G$ is a submatrix of a Fourier matrix. This particular choice of code is only for ease of analysis in comparing the coded and replication-based linear inverse. In practice, the code that minimizes the mean-squared error should be chosen.

¹Although this is clearly a loose bound, it makes for a convenient comparison with the coded linear inverse.

Theorem 4.4. (Coded linear inverse beats replication) Suppose the i.i.d. Assumptions 1 and 2 hold and $G$ is the submatrix formed by $k$ rows of an $n \times n$ Fourier matrix $F$. Further, suppose $n - k = o(\sqrt{n})$. Then, the expected error of the coded linear inverse satisfies

$$\lim_{n\to\infty} \frac{1}{n-k}\Big[ E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] - E_f\big[\|E_{\mathrm{coded}}\|^2\big] \Big] \ge \frac{\mathrm{var}_f\big[\mathrm{trace}(C^{(l_1)})\big]}{E_f\big[\mathrm{trace}(C^{(l_1)})\big]}. \qquad (22)$$

Moreover, if the random variable $\mathrm{trace}(C^{(l)})$ satisfies the $\rho$-variance heavy-tail property for some $\rho > 1$, the coded linear inverse outperforms the replication-based linear inverse in the following sense:

$$\lim_{n\to\infty} \frac{1}{n-k}\Big( E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] - E_f\big[\|E_{\mathrm{rep}}\|^2\big] \Big) < \frac{1}{\rho}\, \lim_{n\to\infty} \frac{1}{n-k}\Big( E_f\big[\|E_{\mathrm{uncoded}}\|^2\big] - E_f\big[\|E_{\mathrm{coded}}\|^2\big] \Big). \qquad (23)$$
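The following Monte Carlo sketch (ours, not from the paper) numerically illustrates the comparison in Theorems 4.1, 4.3 and 4.4 in the scalar setting of Section 3.1 with $B = a$. It substitutes a random orthonormal-row $G$ for the Fourier code assumed in the theorem, uses replication decoding strategy (i), and draws the $l_i$ from an arbitrary illustrative distribution under which $\mathrm{trace}(C^{(l)}) = c_e a^{2l}$ has large variance relative to its squared mean, in the spirit of the variance heavy-tail property. The theorem's proof overview follows.

```python
# Monte Carlo comparison of uncoded, replication-based (strategy (i)) and
# coded schemes in the scalar case B = a. All distributions are illustrative.
import numpy as np

rng = np.random.default_rng(1)
a, c_e, k, n, trials = 0.85, 1.0, 100, 120, 1000

G = np.linalg.qr(rng.standard_normal((n, n)))[0][:k]   # k x n, orthonormal rows

mse = {"uncoded": 0.0, "replication": 0.0, "coded": 0.0}
for _ in range(trials):
    e0 = rng.normal(0.0, np.sqrt(c_e), size=k)         # initial errors, variance c_e
    l = rng.integers(1, 40, size=n)                    # iterations done by the deadline

    err_unc = a ** l[:k] * e0                          # first k workers, no coding
    mse["uncoded"] += np.sum(err_unc ** 2)

    err_rep = err_unc.copy()                           # extra workers replicate the
    for j in range(n - k):                             # first n-k problems; keep the
        err_rep[j] = a ** max(l[j], l[k + j]) * e0[j]  # copy with more iterations
    mse["replication"] += np.sum(err_rep ** 2)

    eps = (e0 @ G) * a ** l                            # encoded errors after l_i steps, eq. (14)
    Lam = c_e * a ** (2 * l)                           # Lambda_ii = c_e a^(2 l_i), eq. (17)
    L = np.linalg.solve((G / Lam) @ G.T, G / Lam)      # weighted least-squares decoder
    mse["coded"] += np.sum((L @ eps) ** 2)             # decoded error, eq. (15)

for name, v in mse.items():
    print(f"{name:>11s} MSE: {v / trials:.4g}")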
Proof overview. See Supplementary Section 8.7 for a complete and rigorous proof. Here we only provide the main intuition behind the proof. From Theorem 4.2, we have $E_f[\|E_{\mathrm{uncoded}}\|^2] - E_f[\|E_{\mathrm{coded}}\|^2] \ge E_f[\mathrm{trace}(J_2 J_4^{-1} J_2^\top)]$. Therefore, to prove (22), the main technical difficulty is to simplify the term $\mathrm{trace}(J_2 J_4^{-1} J_2^\top)$. For a Fourier matrix $F$, we are able to show that the matrix $F\Lambda F^\top = \begin{bmatrix} J_1 & J_2 \\ J_2^\top & J_4 \end{bmatrix}$ (see Theorem 4.2) is a Toeplitz matrix, which provides a good structure for studying its behavior. Then, we use the Gershgorin circle theorem [27] (with some algebraic manipulations) to show that the maximum eigenvalue of $J_4$ satisfies $\sigma_{\max}(J_4) \approx E_f[\mathrm{trace}(C^{(l_1)})]$, and separately, using some algebraic manipulations, we show

$$\mathrm{trace}(J_2 J_2^\top) \approx (n-k)\, \mathrm{var}_f\big[\mathrm{trace}(C^{(l_1)})\big] \qquad (24)$$

for large matrix size $n$. Since $\mathrm{trace}(J_2 J_4^{-1} J_2^\top) \ge \mathrm{trace}\big(J_2 (\sigma_{\max}(J_4))^{-1} J_2^\top\big) = \frac{1}{\sigma_{\max}(J_4)}\mathrm{trace}(J_2 J_2^\top)$, we obtain

$$\mathrm{trace}(J_2 J_4^{-1} J_2^\top) \gtrsim \frac{(n-k)\, \mathrm{var}_f\big[\mathrm{trace}(C^{(l_1)})\big]}{E_f\big[\mathrm{trace}(C^{(l_1)})\big]} \qquad (25)$$

for large $n$. Then, (22) can be proved by plugging (25) into (19). After that, we combine (22), (20) and the variance heavy-tail property to prove (23).

4.3 Asymptotic Comparison between Coded, Uncoded and Replication-based Linear Inverse as the Deadline $T_{dl} \to \infty$

Assumption 3. We assume that the computation time of one power iteration is fixed at each worker for each linear inverse computation, i.e., there exist $n$ independent (not necessarily identically distributed) random variables $v_1, v_2, \ldots, v_n$ such that $l_i = \lfloor T_{dl}/v_i \rfloor$, $i = 1, 2, \ldots, n$.

The above assumption is validated by experiments in Supplementary Section 8.13. The $k$-th order statistic of a sample is its $k$-th smallest value. Suppose the order statistics of the sequence $v_1, v_2, \ldots, v_n$ are $v_{i_1} < v_{i_2} < \ldots < v_{i_n}$, where $\{i_1, i_2, \ldots, i_n\}$ is a permutation of $\{1, 2, \ldots, n\}$. Denote by $[k]$ the set $\{1, 2, \ldots, k\}$ and by $[n]$ the set $\{1, 2, \ldots, n\}$.

Theorem 4.5. (Error exponent comparison when $T_{dl} \to \infty$) Suppose the i.i.d. Assumption 1 and Assumption 3 hold. Suppose $n - k < k$. Then, the error exponents of the coded and uncoded computation schemes satisfy

$$\lim_{T_{dl}\to\infty} -\frac{1}{T_{dl}} \log E\big[\|E_{\mathrm{coded}}\|^2 \,\big|\, l\big] \ge \frac{2}{v_{i_k}} \log\frac{1}{1-d}, \qquad (26)$$

$$\lim_{T_{dl}\to\infty} -\frac{1}{T_{dl}} \log E\big[\|E_{\mathrm{uncoded}}\|^2 \,\big|\, l\big] = \lim_{T_{dl}\to\infty} -\frac{1}{T_{dl}} \log E\big[\|E_{\mathrm{rep}}\|^2 \,\big|\, l\big] = \frac{2}{\max_{i\in[k]} v_i} \log\frac{1}{1-d}. \qquad (27)$$

That is, the error exponents satisfy coded > replication = uncoded. Here the expectation $E[\,\cdot\,|\,l]$ is taken only with respect to the randomness of the linear inverse solutions $x_i^*$, $i = 1, 2, \ldots, k$.

Proof overview. See Supplementary Section 8.8 for a detailed proof. The main intuition behind this result is the following: when $T_{dl}$ approaches infinity, the error of uncoded computation is dominated by the slowest worker among the first $k$ workers, which has per-iteration time $\max_{i\in[k]} v_i$. For the replication-based scheme, since the number of extra workers satisfies $n - k < k$, there is a non-zero probability (which does not change with $T_{dl}$) that the $n - k$ extra workers do not replicate the computation of the slowest one among the first $k$ workers. Therefore, replication with $n - k < k$ does not improve the error exponent, because the error is dominated by this slowest worker. For coded computation, we show in Supplementary Section 8.8 that the slowest $n - k$ workers among the overall $n$ workers do not affect the error exponent, which means that the error is dominated by the $k$-th fastest worker, which has per-iteration time $v_{i_k}$. Since the $k$-th fastest worker among all $n$ workers cannot be slower than the slowest one among the first (unordered) $k$ workers, the error exponent of coded linear inverse is larger than that of the uncoded and replication-based linear inverse.

5 Analyzing the Computational Complexity

5.1 Encoding and decoding complexity

We first show that the encoding and decoding complexity of Algorithm 1 is, in a scaling sense, smaller than that of the computation at each worker. This ensures that straggling comes from the parallel workers, not from the encoder or decoder.
The proof of Theorem 5.1 is in Supplementary Section 8.10. In our experiment on the Google Plus graph (see Section 6) for computing Page Rank, the computation time at each worker is 30 seconds and the encoding and decoding time at the central controller is about 1 second.

Theorem 5.1. The computational complexity of the encoding and decoding is $\Theta(nkN)$, where $N$ is the number of rows of the matrix $B$ and $k$, $n$ depend on the number of available workers, assuming that each worker performs a single linear inverse computation. For a general dense matrix $B$, the computational complexity of computing the linear inverse at each worker is $\Theta(N^2 l)$, where $l$ is the number of iterations of the specified iterative algorithm.

The complexity of encoding and decoding is therefore smaller than that of the computation at each worker for large $B$ matrices (large $N$).

5.2 Analysis of the cost of communication versus computation

In this work, we focus on optimizing the computation cost. However, what if the computation cost is small compared to the overall cost, including the communication cost? If this were the case, optimizing the computation cost would not be very useful. In Theorem 5.2 (proof in Supplementary Section 8.11), we show that the computation cost is larger than the communication cost in a scaling sense.

Theorem 5.2. The ratio between the number of operations (computation) and the number of bits transmitted (communication) at the $i$-th worker is $\mathrm{COST}_{\mathrm{computation}}/\mathrm{COST}_{\mathrm{communication}} = \Theta(l_i d)$ operations per integer, where $l_i$ is the number of iterations at the $i$-th worker and $d$ is the average number of non-zeros in each row of the $B$ matrix.

6 Experiments on Real Systems

We test the performance of the coded linear inverse algorithm on the Page Rank problem on the Twitter graph and the Google Plus graph from the SNAP datasets [28]. The Twitter graph has 81,306 nodes and 1,768,149 edges, and the Google Plus graph has 107,614 nodes and 13,673,453 edges. We use the HTCondor framework on a cluster to conduct the experiments. The task is to solve $k = 100$ personalized Page Rank problems in parallel using $n = 120$ workers. The uncoded algorithm picks the first $k$ workers and uses one worker for each Page Rank problem. The two replication-based schemes replicate the computation of the first $n - k$ Page Rank problems on the extra $n - k$ workers (see Section 4.2). The coded Page Rank uses all $n$ workers to solve these $k = 100$ equations using Algorithm 1. We use a (120, 100) code whose generator matrix is the submatrix composed of the first 100 rows of a $120 \times 120$ DFT matrix. The computation results are shown in the left two panels of Fig. 2. Note that the two graphs are of different sizes, so the computations in the two experiments take different amounts of time. From Fig. 2, we can see that the mean-squared error of the uncoded and replication-based schemes is larger than that of coded computation by a factor of $10^4$ for large deadlines.

We also compare Algorithm 1 with the coded computing algorithm proposed in [6]. As we discussed in Figure 1, the original coded technique in [6] ignores partial results and is suboptimal even in the toy example of three workers. However, it has a natural extension to iterative methods, which we discuss next.
Figure 2: From left to right: (1, 2) Experimentally computed overall MSE of uncoded, replication-based and coded personalized Page Rank on the Twitter and Google Plus graphs on a cluster with 120 workers. The ratio of the MSE of the repetition-based schemes to that of the coded linear inverse increases as $T_{dl}$ increases. (3) Comparison between an extended version of the algorithm in [6] and Algorithm 1 on the Google Plus graph; naively extending the general coded method using a matrix inverse introduces error amplification. (4) Comparison of different codes; in this experiment the DFT code outperforms the other candidates in MSE. (Plots omitted; all panels show average mean-squared error versus the deadline $T_{dl}$ in seconds.)

The third panel of Fig. 2 compares the performance of Algorithm 1 with this extension of the algorithm from [6]. The extension uses the (unfinished) partial results from the $k$ fastest workers to retrieve the required Page Rank solutions. More concretely, suppose $S \subset [n]$ is the index set of the $k$ fastest workers. Then, the extension retrieves the solutions to the original $k$ Page Rank problems by solving the equation $Y_S = [x_1, x_2, \ldots, x_k]\, G_S$, where $Y_S$ is composed of the (partial) computation results obtained from the fastest $k$ workers and $G_S$ is the $k \times k$ submatrix composed of the columns of the generator matrix $G$ with indexes in $S$. However, since there is some remaining error at each worker (i.e., the computation results $Y_S$ have not converged yet), when conducting the matrix-inverse-based decoding from [6], the error is magnified by the large condition number of $G_S$. This is why the algorithm in [6] should not be naively extended to the coded linear inverse problem.

One question remains: what is the best code design for the coded linear inverse algorithm? Although we do not have a concrete answer to this question, we have tested different codes (with different generator matrices $G$) in the Twitter graph experiment, all using Algorithm 1. The results are shown in the fourth panel of Fig. 2. The generator matrix used for the "binary" curve has i.i.d. binary entries in $\{-1, 1\}$. The generator matrix used for the "sparse" curve has random sparse binary entries. The generator matrix for the "Gaussian" curve has i.i.d. standard Gaussian entries. In this experiment, the DFT code performs the best. However, finding the best code in general remains meaningful future work.
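For concreteness, the sketch below constructs the four kinds of generator matrices compared in the fourth panel of Fig. 2 and checks whether their rows are orthonormal. This is our illustration; the 5% density of the sparse code and the $1/\sqrt{n}$ scalings are our assumptions, not values specified in the paper.

```python
# Candidate generator matrices for the (n, k) = (120, 100) code: DFT submatrix,
# i.i.d. {-1,+1}, sparse binary, and i.i.d. Gaussian. The DFT submatrix is the
# only one with exactly orthonormal rows.
import numpy as np
from scipy.linalg import dft

rng = np.random.default_rng(0)
k, n = 100, 120

G_dft = dft(n, scale="sqrtn")[:k]                     # first k rows of a unitary n x n DFT
G_bin = rng.choice([-1.0, 1.0], size=(k, n)) / np.sqrt(n)
G_sparse = (rng.random((k, n)) < 0.05).astype(float)  # ~5% nonzero binary entries (assumed)
G_gauss = rng.standard_normal((k, n)) / np.sqrt(n)

for name, G in [("DFT", G_dft), ("binary", G_bin),
                ("sparse", G_sparse), ("Gaussian", G_gauss)]:
    GG = G @ G.conj().T
    print(f"{name:>8s}: rows orthonormal? {np.allclose(GG, np.eye(k))}, "
          f"cond(G G^H) = {np.linalg.cond(GG):.2e}")
```

Any of these matrices can be passed as $G$ to the decoding step of Algorithm 1 (using the conjugate transpose in place of the transpose when $G$ is complex); the conditioning printed above gives a rough indication of how sensitive the decoding is to the code choice.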
7 Conclusions

By studying coding for iterative algorithms designed for distributed inverse problems, we aim to introduce new applications and analytical tools to the problem of coded computing with stragglers. Since these iterative algorithms for inverse problems commonly have decreasing error with time, the partial computation results at stragglers can provide useful information for the final outputs. Note that this is unlike recent works on coding for multi-stage computing problems [29, 30], where the computation error can accumulate with time and coding has to be applied repeatedly to suppress this error accumulation. An important connection worth discussing is the diversity gain in this coded computing problem. The distributed computing setting in this work resembles random fading channels, which means coding can be used to exploit straggling diversity, just as coding is used in communication channels to turn diverse channel fading into an advantage.

What makes coding even more suitable in our setting is that the amount of diversity gain achievable here through replication is actually smaller than what can be achieved by replication in fading channels. This is because, for two computers that solve the same equation $M x_i = r_i$, the remaining error at the slow worker is a deterministic multiple of the remaining error at the fast worker (see equation (3)). Therefore, taking a weighted average of the two computation results through replication does not reduce the error as it does in independent fading channels. How the diversity gain can be achieved optimally here is worth deeper investigation. Our next goals are two-fold: (1) extend the current method to solving a single large-scale inverse problem, such as graph mining with graphs that exceed the memory of a single machine; (2) carry out experiments on faster distributed systems such as Amazon EC2.

References

[1] J. Dean and L. A. Barroso. The tail at scale. Communications of the ACM, 56(2):74-80, 2013.
[2] G. Joshi, Y. Liu, and E. Soljanin. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE Journal on Selected Areas in Communications, 32(5):989-997, 2014.
[3] D. Wang, G. Joshi, and G. Wornell. Efficient task replication for fast response times in parallel computation. In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 599-600. ACM, 2014.
[4] D. Wang, G. Joshi, and G. Wornell. Using straggler replication to reduce latency in large-scale parallel computing. ACM SIGMETRICS Performance Evaluation Review, 43(3):7-11, 2015.
[5] L. Huang, S. Pawar, H. Zhang, and K. Ramchandran. Codes can reduce queueing delay in data centers. In IEEE International Symposium on Information Theory Proceedings (ISIT), pages 2766-2770. IEEE, 2012.
[6] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Speeding up distributed machine learning using codes. In IEEE International Symposium on Information Theory (ISIT), pages 1143-1147. IEEE, 2016.
[7] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis. Gradient coding. 2016.
[8] S. Dutta, V. Cadambe, and P. Grover. Short-dot: Computing large linear transforms distributedly using coded short dot products. In Advances in Neural Information Processing Systems, pages 2092-2100, 2016.
[9] N. S. Ferdinand and S. C. Draper. Anytime coding for distributed computation. In 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 954-960. IEEE, 2016.
[10] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. A unified coding framework for distributed computing with straggling servers. In IEEE Globecom Workshops (GC Wkshps), pages 1-6. IEEE, 2016.
[11] A. Reisizadehmobarakeh, S. Prakash, R. Pedarsani, and S. Avestimehr. Coded computation over heterogeneous clusters. In IEEE International Symposium on Information Theory (ISIT), pages 2408-2412. IEEE, 2017.
[12] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. Coding for distributed fog computing. IEEE Communications Magazine, 55(4):34-40, 2017.
[13] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr. Polynomial codes: An optimal design for high-dimensional coded matrix multiplication. In Advances in Neural Information Processing Systems, 2017.
[14] K. Lee, C. Suh, and K. Ramchandran. High-dimensional coded matrix multiplication. In IEEE International Symposium on Information Theory (ISIT), pages 2418-2422. IEEE, 2017.
[15] K. Lee, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran. Coded computation for multicore setups. In IEEE International Symposium on Information Theory (ISIT), pages 2413-2417. IEEE, 2017.
[16] K.-H. Huang et al. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 100(6):518-528, 1984.
[17] Y. Saad. Iterative Methods for Sparse Linear Systems. SIAM, 2003.
[18] T. H. Haveliwala. Topic-sensitive PageRank. In Proceedings of the 11th International Conference on World Wide Web, pages 517-526. ACM, 2002.
[19] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[20] M. Haikin and R. Zamir. Analog coding of a source with erasures. In IEEE International Symposium on Information Theory, pages 2074-2078. IEEE, 2016.
[21] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Transactions on Information Theory, 56(9):4539-4551, 2010.
[22] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing elephants: Novel erasure codes for big data. In Proceedings of the VLDB Endowment, volume 6, pages 325-336. VLDB Endowment, 2013.
[23] M. A. Maddah-Ali and U. Niesen. Decentralized coded caching attains order-optimal memory-rate tradeoff. IEEE/ACM Transactions on Networking, 23(4):1029-1040, 2015.
[24] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr. Coded MapReduce. In 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 964-971. IEEE, 2015.
[25] D. I. Shuman, S. K. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83-98, 2013.
[26] A. Sandryhaila and J. M. F. Moura. Discrete signal processing on graphs. IEEE Transactions on Signal Processing, 61(7):1644-1656, 2013.
[27] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
[28] J. Leskovec and J. J. Mcauley. Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems, pages 539-547, 2012.
[29] Y. Yang, P. Grover, and S. Kar. Computing linear transformations with unreliable components. IEEE Transactions on Information Theory, 2017.
[30] Y. Yang, P. Grover, and S. Kar. Rate distortion for lossy in-network linear function computation and consensus: Distortion accumulation and sequential reverse water-filling. IEEE Transactions on Information Theory, 2017.
[31] X. Wang, P. Liu, and Y. Gu. Local-set-based graph signal reconstruction. IEEE Transactions on Signal Processing, 63(9):2432-2444, 2015.
[32] S. K. Narang, A. Gadde, E. Sanou, and A. Ortega. Localized iterative methods for interpolation in graph structured data. In 2013 IEEE Global Conference on Signal and Information Processing (GlobalSIP), pages 491-494. IEEE, 2013.
[33] S. Chen, R. Varma, A. Sandryhaila, and J. Kovačević. Discrete signal processing on graphs: Sampling theory. IEEE Transactions on Signal Processing, 63(24):6510-6523, 2015.
[34] S. Chen, Y. Yang, C. Faloutsos, and J. Kovačević. Monitoring Manhattan's traffic at 5 intersections? In 2016 IEEE Global Conference on Signal and Information Processing (GlobalSIP), 2016.
[35] A. M. Mood, F. A. Graybill, and D. C. Boes. Introduction to the Theory of Statistics, 3rd edition. 1974.
[36] H. Zhang and F. Ding. On the Kronecker products and their applications. Journal of Applied Mathematics, 2013, 2013.