# Iterative Methods via Locally Evolving Set Process

Baojian Zhou1,2, Yifan Sun3, Reza Babanezhad Harikandeh4, Xingzhi Guo3, Deqing Yang1,2, Yanghua Xiao2
1 School of Data Science, Fudan University; 2 Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University; 3 Department of Computer Science, Stony Brook University; 4 Samsung SAIT AI Lab

Abstract

Given the damping factor α and precision tolerance ϵ, Andersen et al. [2] introduced Approximate Personalized PageRank (APPR), the de facto local method for approximating the PPR vector, with runtime bounded by Θ(1/(αϵ)) independent of the graph size. Recently, Fountoulakis & Yang [12] asked whether faster local algorithms could be developed using O(1/(√α·ϵ)) operations. By noticing that APPR is a local variant of Gauss-Seidel, this paper explores the question of whether standard iterative solvers can be effectively localized. We propose the locally evolving set process, a novel framework for characterizing algorithm locality, and demonstrate that many standard solvers can be effectively localized. Let vol(St) and γt be the running averages of the volume and the residual ratio of the active nodes St during the process. We show vol(St)/γt ≤ 1/ϵ and prove that APPR admits a new runtime bound O(vol(St)/(αγt)) mirroring its actual performance. Furthermore, when the geometric mean of the residual reduction is Θ(√α), there exists c ∈ (0, 2) such that the local Chebyshev method has runtime O(vol(St)/(√α(2 − c))) without the monotonicity assumption. Numerical results confirm the efficiency of this novel framework and show up to a hundredfold speedup over corresponding standard solvers on real-world graphs.

1 Introduction

Personalized PageRank (PPR) vectors are key tools for graph problems such as clustering [2, 3, 30, 36, 54, 57], diffusion [10, 14, 15, 29], random walks [25, 32, 44], neural net training [7, 27, 20, 21], and many others [17, 48]. The Approximate PPR (APPR) algorithm [2] and its many variants [6, 9, 13, 37] efficiently approximate PPR vectors by exploring the neighbors of a specific node at each step, requiring access to only a tiny part of the graph; hence the number of operations needed is independent of the graph size. These local solvers are well-suited for large-scale graphs in modern graph data analysis. Specifically, let A and D be the adjacency and degree matrices of a graph G, respectively. Given a source node s and the damping factor α ∈ (0, 1), this paper studies local solvers for the linear system

(I − (1 − α)(I + AD^{-1})/2) π = α e_s, (1)

where e_s is the standard basis vector of node s and π is the PPR vector [2, 12, 37]. Given the error tolerance ϵ, a local solver needs to find π̂ such that ‖D^{-1}(π̂ − π)‖_∞ ≤ ϵ without accessing the entire graph G.2

Correspondence to: Baojian Zhou, bjzhou@fudan.edu.cn
2 Local methods for solving Equ. (1) can be naturally extended to other linear systems defined on G.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).

Andersen et al. [2] proposed the local APPR algorithm, which pushes large residuals to neighboring nodes until all residuals are small. Its runtime is upper bounded by Θ(1/(αϵ)), independent of the graph size. Based on a variational characterization of Equ. (1), Fountoulakis et al. [13] reformulated the problem as optimizing a quadratic objective plus ℓ1-regularization and later asked [12] whether there exists a local solver with runtime O(1/(√α·ϵ)). This corresponds to an accelerated rate since α is the strongly convex parameter.
Recently, Martínez-Rubio et al. [37] provided a method based on a nested subspace pursuit strategy, and the corresponding iteration complexity is bounded by O(|S |/ α) where S is the support of the optimal solution. This bound deteriorates to O(n/ α) when the solution is dense, with n representing the number of nodes in G, which could be less favorable than that of standard solvers under similar conditions. Moreover, the nested computational structure provides a constant factor overhead, which could be significant in practice. The bound analysis of the above local methods critically depends on the monotonicity properties of the designed algorithms. These requirements may hinder the development of simpler and faster local linear solvers that lack such monotonicity properties. Specifically, the runtime analysis of APPR relies on the non-negativity and decreasing monotonicity of residuals. Conversely, the runtime bounds developed in Fountoulakis et al. [13] and Martínez-Rubio et al. [37] depend on the monotonicity of variable updates, ensuring that the sparsity of intermediate variables increases monotonically. Our contributions. Based on a refined analysis of APPR, our starting point is to demonstrate that APPR is a local variant of Gauss-Seidel Successive Overrelaxation (GS-SOR) that can be treated as an evolving set process.3 This insight leads us to explore whether standard solvers can be effectively localized to solve Equ. (1). To develop faster local methods with improved local bounds, we propose a novel locally evolving set process framework inspired by its stochastic counterpart [38]. This framework enables the development of faster local methods and circumvents the monotonicity requirement barrier in runtime complexity analysis in the existing literature. For example, our analysis of the local Chebyshev method does not depend on the monotonicity of residual or the active node sets processed. Specifically, As a core tool, we propose an algorithm framework based on the locally evolving set process. We show that APPR is a local variant of GS-SOR using this process. This framework is powerful enough to facilitate the development of new local solvers. Specifically, standard gradient descent (GD) can be effectively localized for solving this problem and admits Θ(1/(αϵ)) runtime bound. This local evolving set process provides a novel way to characterize the algorithm locality; hence, new runtime bounds can be derived. Let vol(St) and γt be the running average of volume and the residual ratio of active nodes St during the process; we prove the ratio vol(St)/γt serving as a lower bound of 1/ϵ. We further show both APPR and local GD have O(vol(St)/(αγt)) runtime bound mirroring the actual performance of these two methods. Using our framework, we show there exists c (0, 2) such that both the localized Chebyshev and Heavy-Ball methods admit runtime bound O(vol(St)/( α(2 c))) with the assumption that the geometric mean of active ratio factors is Θ( α). Importantly, our analysis does not require any monotonicity property. The technical novelty is that we effectively characterize residuals of these two methods by using second-order difference equations with parameterized coefficients. We demonstrate, over 17 large graphs, that these localized methods can significantly accelerate their standard counterparts by a large margin. Furthermore, our proposed LOCSOR, LOCCH, and LOCHB are significantly faster than APPR and ℓ1-based solvers on two huge-scale graphs. Paper structure. 
We begin by clarifying notations and reviewing APPR in Sec. 2. Sec. 3 introduces the locally evolving set process. Sec. 4 presents localized Chebyshev and Heavy-Ball methods along with our novel techniques. We discuss open questions in Sec. 5. Experiments and conclusions are covered in Sec. 6 and 7, respectively. Detailed related works and all missing proofs are included in the Appendix. Our code is available at https: // github. com/ baojian/ Local CH . 2 Notations and Preliminaries Notations. We consider an undirected simple graph G(V, E) where V = {1, 2, . . . , n} and E V V with |E| = m are the node and edge sets, respectively. The set of neighbors of v is denoted as 3The local variant of GS-SOR is defined in Appendix B.1 N(v) V. The adjacency matrix A of G assigns unit weight au,v = 1 if (u, v) E and 0 otherwise. The v-th entry of the degree matrix D is dv = |N(v)|. Given S V, we define the volume of S as vol(S) P v S dv. The support of x Rn is the set of nonzero indices supp(x) {v : xv = 0, v V}. The eigendecomposition of D 1/2AD 1/2 = V ΛV where each column of V is an eigenvector and Λ = diag(λ1, λ2, . . . , λn) with 1 = λ1 λ2 λn 1. 2.1 Revisiting Anderson s APPR and its local runtime bound We use (p, z) for solving Equ. (1) while use (x, r) for solving Equ. (3) or Equ. (4). With the initial setting p 0, z es, APPR obtains a local estimate of π denoted as p by using a sequence of PUSH operations defined as APPR(α, ϵ, s, G) : Repeat (p, z) PUSH(u, α, p, z) Until v, zv < ϵdv; Return p. (2) Algo. 1 PUSH(u, α, p, z) 1: ν = zu 2: pu pu + α ν 3: zu (1 α) ν/2 4: for v N(u) do 5: zv zv + (1 α)ν 2du 6: Return (p, z) At each repeat step (p, z) PUSH(u, α, p, z), it synchronously updates both p and residual z whenever there exists an active node u V (a node with a large residual, i.e., zu ϵdu). Specifically, for each active u, it updates pu, zu, and zv for v N(u) by using a PUSH operator illustrated on the left. It stops when no active nodes are left. APPR can be implemented locally so that the runtime is independent of G. In particular, vol(supp(p)) is locally bounded, demonstrating the sparsity effect. We restate the existing main results as follows. Lemma 2.1 (Runtime bound of APPR [2]). Given α (0, 1) and the precision ϵ 1/ds for node s V with p 0, z es at the initial, APPR(α, ϵ, s, G) defined in (2) returns an estimate p of π. There exists a real implementation of (2) (e.g., Algo. 2) such that the runtime TAPPR satisfies TAPPR Θ (1/(αϵ)) . Furthermore, the estimate ˆπ := p satisfies D 1(ˆπ π) ϵ and vol(supp(ˆπ)) 2/((1 α)ϵ). The main argument for proving Lemma 2.1 is critically based on: 1) z 0 and z 1 decreases during the updates; 2) for each active u, zu ϵdu, implying that z 1 is decreased by at least αϵdu, consecutively leading to P u du 1/(αϵ). 2.2 Problem reformulation To approximate the PPR vector π, the original linear system in Equ. (1) can be reformulated as an equivalent symmetric version defined as Qx = b, with Q I 1 α 1+αD 1/2AD 1/2 and b 2α (1+α)D 1/2es, (3) where again es is the standard basis of s, and Q is a symmetric positive-definite M-matrix with all eigenvalues in [ 2α 1+α, 2 1+α]. To solve Equ. (3) is equivalent to solving a quadratic problem x = arg min x Rn 2x Qx x b o , (4) where f is strongly convex with condition number 1/α. Indeed, Equ. (3) is a symmetrized version of Equ. (1) and has a unique solution x = Q 1b. The PPR vector π can be recovered from x by π = D1/2x . It is convenient to denote estimate of π as π(t) D1/2x(t). 
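To make the reformulation concrete, the following is a minimal NumPy sketch (our own toy illustration, not the paper's implementation) that builds Q and b of Equ. (3) for a small path graph, solves both linear systems densely, and checks that π = D^{1/2}x* recovers the solution of Equ. (1). The example graph and variable names are ours; local solvers never form Q explicitly, and the dense solves here only serve to verify the equivalence.

```python
import numpy as np

# Toy 4-node path graph 0-1-2-3 (illustration only).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n = A.shape[0]
d = A.sum(axis=1)                                   # node degrees
D_inv, D_sqrt = np.diag(1 / d), np.diag(np.sqrt(d))
D_isqrt = np.diag(1 / np.sqrt(d))
alpha, s = 0.1, 0
e_s = np.zeros(n); e_s[s] = 1.0

# Equ. (1): (I - (1 - alpha)(I + A D^{-1}) / 2) pi = alpha * e_s
pi = np.linalg.solve(np.eye(n) - (1 - alpha) * (np.eye(n) + A @ D_inv) / 2,
                     alpha * e_s)

# Equ. (3): Q x = b with Q = I - (1-alpha)/(1+alpha) * D^{-1/2} A D^{-1/2}
#           and b = 2*alpha/(1+alpha) * D^{-1/2} e_s
Q = np.eye(n) - (1 - alpha) / (1 + alpha) * (D_isqrt @ A @ D_isqrt)
b = 2 * alpha / (1 + alpha) * (D_isqrt @ e_s)
x_star = np.linalg.solve(Q, b)

print(np.allclose(D_sqrt @ x_star, pi))             # True: pi = D^{1/2} x*
```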
Given x(t), we define the residual r(t) ≜ b − Qx(t). If x(t) is returned by a local solver for either Equ. (3) or Equ. (4), we then equivalently require ‖D^{-1/2}(x(t) − x*)‖_∞ ≤ ϵ. Hence, it is enough to use the stop condition ‖D^{-1/2}r(t)‖_∞ ≤ 2αϵ/(1 + α) for local solvers of Equ. (3) and Equ. (4).4 Fountoulakis et al. [13] demonstrated that APPR is equivalent to a coordinate descent solver for minimizing f in Equ. (4) and introduced an ISTA-style solver that minimizes f(x) + ϵα‖D^{1/2}x‖_1, which yields a method with runtime bound O(1/(ϵα)) achieving the same estimation guarantee as APPR. On the one hand, the runtime bound Θ(1/(αϵ)) provided in Lemma 2.1 becomes less valuable when ϵ ≤ 1/m; on the other hand, all previous local variants of APPR [6, 9, 13, 37] are critically based on some monotonicity property. This limitation could impede the development of faster local methods that might violate the monotonicity assumption. The following two sections present the techniques and tools to address these challenges.

4 See a justification in Appendix B.

3 Local Methods via Evolving Set Process

Our investigation begins with the locally evolving set process, inspired by its stochastic counterpart [38]. The process reveals that APPR is essentially a local variant of GS-SOR. We then show how to use this process to build faster local solvers based on GS-SOR. We further develop a local, parallelizable gradient descent with runtime Θ(1/(αϵ)).

3.1 Locally evolving set process

Given α, ϵ, s, and G, a local solver for Equ. (3) keeps track of an active set St ⊆ V at each iteration t. That is, only nodes in St are used to update x or r. The next set St+1 is determined by the current St and an associated local solver A. We define this process as the following locally evolving set system.

Definition 3.1 (Locally evolving set process). Given a parameter configuration θ ≜ (α, ϵ, s, G) and a local iterative method A, the locally evolving set process generates a sequence of triples (St, x(t), r(t)) through the dynamic system

(St+1, x(t+1), r(t+1)) = Φθ(St, x(t), r(t), A), t ≥ 0, (5)

where St+1 ⊆ St ∪ (∪_{u∈St} N(u)) and we denote the active set St = {u1, u2, . . . , u|St|}. The set St is maintained via a queue data structure. We say this process converges when the last set ST = ∅ for some T; the generated sequence of active nodes is (S0, x(0), r(0)) → (S1, x(1), r(1)) → · · · → (ST = ∅, x(T), r(T)). The runtime of the local solver A for this whole local process is then defined as5 TA ≜ Σ_{t=0}^{T−1} vol(St).

The framework of this set process provides a new way to design local methods. Furthermore, it helps to analyze the convergence and runtime bounds of local solvers by characterizing the sequences {vol(St)} and {‖r(t)‖} generated by Φθ. To analyze a new runtime bound, for T ≥ 1, we define the averages of the volumes of the active node sets {vol(St)} and of the active ratio sequence {γt} as

vol(ST) ≜ (1/T) Σ_{t=0}^{T−1} vol(St),  γT ≜ (1/T) Σ_{t=0}^{T−1} γt,  where γt ≜ (Σ_{i=1}^{|St|} √(d_{u_i}) |r_{u_i}(t+τ_i)|) / ‖D^{1/2}r(t)‖_1, (6)

and τ_i denotes a smaller time scale within iteration t. We set τ_i = (i − 1)/|St| for the analysis of APPR and LOCSOR, while τ_i = 0 for LOCGD in our later analysis. In the rest, we denote IT = supp(r(T)).

Figure 1: Runtime of APPR in the locally evolving set process on the com-dblp graph with s = 0, α = 0.1, and ϵ = 1/m (plotted quantities: vol(St), γt, and ln(2/(ϵ(1 − α)|It|))). The red region of the left figure is TAPPR. The right two figures show the active ratios and vol(ST)/γT ≤ 1/ϵ.

These two metrics vol(ST) and γT characterize the locality of local methods. To demonstrate this local process, Fig.
1 shows vol(St) of APPR peaks at the early stage, and the active ratio decreases as the active volume diminishes. The quantity vol(ST )/γT is strictly smaller than 1/ϵ, indicating that it could serve as a better factor in the runtime analysis. 5In practice, TA := PT 1 t=0 (vol(St) + |St|) where we ignore |St| for simplicity as vol(St) dominates |St|. 3.2 APPR via locally evolving set process We first demonstrate how this locally evolving set process can represent APPR. For solving Equ. (1), the set S0 = {s} and the queue-based of APPR (see Algo. 2 in Appendix A) naturally forms a sequence of active sets from S0 = {s} to ST = , hence converging. Active nodes u in queue satisfy zu ϵdu. To delineate successive iterations St and St+1, one can insert at the beginning of St. After processing St, it serves as an indicator for the next iteration. The star is reinserted into the queue iteratively until the queue is empty. We use a slightly different notation for presenting tuple (St, p(t), z(t)) to consistent with Sec. 2.1 and write out such evolving process as follows Φθ St, p(t), z(t), A = APPR : for ui in St := {u1, u2, . . . , u|St|} do p(t+ i+1) p(t+ i) + αz(t+ i) ui eui, i := (i 1)/|St| z(t+ i+1) z(t+ i) (1+α) 2 z(t+ i) ui eui + (1 α) 2 z(t+ i) ui AD 1eui The following lemma establishes the equivalence between APPR and the local variant of GS-SOR method (see Appendix B.1) and provides a new evolving-based bound. Lemma 3.2 (New local evolving-based bound for APPR). Let M = α 1 I 1 α and s = es. The linear system Mπ = s is equivalent to Equ. (1). Given p(0) = 0, z(0) = es with ω (0, 2), the local variant of GS-SOR (15) for Mπ = s can be formulated as p(t+ i+1) p(t+ i) + ωz(t+ i) ui Muiui eui, z(t+ i+1) z(t+ i) ωz(t+ i) ui Muiui Meui, where ui is an active node in St satisfying zui ϵdui and i = (i 1)/|St|. Furthermore, when ω = 1+α 2 , this method reduces to APPR given in (7), and there exists a real implementation (Aglo. 2) of APPR such that the runtime TAPPR is bounded, that is TAPPR vol(ST ) ϵ , where vol(ST ) ϵ , CT = 2 (1 α)|IT |, ˆγT 1 (P|St| i=1 |z(t+ i) ui | z(t) 1 3.3 Faster local variant of GS-SOR Lemma 3.2 points to the sub-optimality of APPR, as GS-SOR allows for a larger ω. For solving Equ. (3), since APPR essentially serves as a local variant of GS-SOR, we can develop a faster local variant based SOR. To extend this method to solve Equ. (3), we propose a local GS-SOR based on an evolving set process, namely LOCSOR, as the following Φθ St, x(t), r(t), A = LOCSOR : for ui in St := {u1, . . . , u|St|} and do x(t+ i+1) x(t+ i) + ωr(t+ i) ui eui, i = (i 1)/|St| r(t+ i+1) r(t+ i) ωr(t+ i) ui eui + (1 α)ω 1+α r(t+ i) ui D 1/2AD 1/2eui When ω (0, 1], the residual r is still nonnegative and monotonically decreasing, we establish the convergence of LOCSOR stated in the following theorem. Theorem 3.3 (Runtime bound of LOCSOR (ω = 1)). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCSOR defined in (8) for solving Equ. (3). There exists a real implementation of (8) such that the runtime TLOCSOR is bounded by 1 D1/2r(T ) 1 TLOCSOR 1 + α αϵ, vol(ST ) where vol(ST ) and γT are defined in (6) and C = 1+α (1 α)|IT | with IT = supp(r(T )). Furthermore, vol(ST )/γT 1/ϵ and the local estimate ˆπ := D1/2x(T ) satisfies D 1(ˆπ π) ϵ. 
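For concreteness, the following is a minimal Python sketch of the LOCSOR updates in (8), maintained with a FIFO queue of active nodes (a simplified rendering under our own naming; the actual implementation is Algo. 3 in the appendix, which stores D^{1/2}-scaled vectors instead). It operates directly on (x, r) of Equ. (3), assuming adjacency lists `adj` and degrees `deg`, and a node u is active while |r_u| ≥ 2αϵ√d_u/(1 + α).

```python
import numpy as np
from collections import deque

def loc_sor(adj, deg, alpha, eps, s, omega=1.0):
    """Sketch of the LOCSOR push updates (8) with a FIFO queue of active nodes.

    adj[u]: list of neighbors of u; deg[u] = |N(u)|. Works directly on the
    (x, r) variables of Q x = b from Equ. (3)."""
    n = len(adj)
    x, r = np.zeros(n), np.zeros(n)
    r[s] = 2 * alpha / ((1 + alpha) * np.sqrt(deg[s]))     # r(0) = b
    active = lambda u: abs(r[u]) >= 2 * alpha * eps * np.sqrt(deg[u]) / (1 + alpha)
    queue, in_q = deque([s]), {s}                          # assumes eps <= 1/deg[s]
    while queue:
        u = queue.popleft(); in_q.discard(u)
        if not active(u):
            continue
        rho = omega * r[u]
        x[u] += rho                                        # x <- x + omega * r_u * e_u
        r[u] -= rho                                        # r <- r - omega * r_u * Q e_u ...
        coef = (1 - alpha) / (1 + alpha) * rho / np.sqrt(deg[u])
        for v in adj[u]:                                   # ... spread via D^{-1/2} A D^{-1/2} e_u
            r[v] += coef / np.sqrt(deg[v])
            if active(v) and v not in in_q:
                queue.append(v); in_q.add(v)
        if active(u) and u not in in_q:                    # only possible when omega != 1
            queue.append(u); in_q.add(u)
    return x, r                                            # PPR estimate: np.sqrt(deg) * x
```

The returned estimate is π̂ = D^{1/2}x(T), i.e., np.sqrt(deg) * x entrywise; with ω = 1 the residual stays nonnegative and monotonically decreasing, so the loop terminates as guaranteed by Theorem 3.3.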
0.0 0.5 1.0 1.5 # Operations 106 ln π(T) π 1 Loc SOR (ω = 1) Loc SOR (ω = ω ) APPR 12 10 8 6 4 log(#Operations) Anderson s Upper Bound Our Upper Bound Actual Operations Our Lower Bound Figure 2: Comparison of runtime between APPR and LOCSOR (left) and runtime bounds (right) as a function of ϵ. We used the same setting as in Fig. 1. Our new evolving bound O(vol(ST )/(αγT )) mirroring the actual performance of APPR and empirically much smaller than Θ(1/(αϵ)) as illustrated in Fig. 2. Our lower bounds are quite effective when ϵ is relatively large, while our upper bound is better than Anderson s when ϵ is small. When ϵ Θ(1/m), this new bound is superior to both O(1/(αϵ)) and O(1/( αϵ)). This superiority is evident when compared to algorithms like ISTA or FISTA [5] to minimize the ℓ1-regularization of f for obtaining an approximate solution of Equ. (3). Additionally, when ω (1, 2) and recalling that Q is an M-matrix, the standard analysis of SOR shows that the spectral norm of the iteration matrix must be larger than |ω 1|. Hence, 0 < ω < 2 if and only if global SOR converges [55]. When ω is optimal (the point that the spectral radius of the iteration matrix is minimized), we have the following result. Corollary 3.4. Let ω = ω 2/(1 + p 1 (1 α)2/(1 + α)2) and St = V, t 0 during the updates, the global version of LOCSOR has the following convergence bound r(t) 2 2 (1 + α) ds where ϵt are small positive numbers with limt ϵt = 0. Asymptotically, when ϵt = o( α), then the runtime of global LOCSOR is O(m/ α) where O hides log 1/ϵ. The main difficulty of analyzing the optimal local LOCSOR is that the nonnegativity and monotonicity of r(t) do not hold. Instead, by using a parameterized second-order difference equation, we develop new techniques based on the Chebyshev method detailed in Sec. 4. 3.4 Parallelizable local gradient descent One disadvantage of LOCSOR is its limited potential for parallelization. The standard GD x(t+1) = x(t) f(x(t)) (step size = 1), in contrast, is easy to parallelize across the coordinates of the update. Instead of updating r and x synchronously per-coordinate, we propose the following Φθ St, x(t), r(t), A = LOCGD : x(t+1) x(t) + r(t) St , r(t+1) r(t) Qr(t) St (9) Every coordinate in St is updated in parallel at iteration t. Interestingly, LOCGD exhibits nonnegativity and monotonicity properties, and its runtime complexity is similar to that of LOCSOR, as stated in the following theorem (To remind, i = 0 for γT of LOCGD in Equ. (6) ). Theorem 3.5 (Runtime bound of LOCGD). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCGD defined in (9) for solving Equ. (4). There exists a real implementation of (9) such that the runtime TLOCGD is bounded by 1 D1/2r(T ) 1 TLOCGD 1 + α αϵ, vol(ST ) where C = (1 + α)/((1 α)|IT |), IT = supp(r(T )). Furthermore, vol(ST )/γT 1/ϵ and the estimate ˆπ := D1/2x(T ) satisfies D 1(ˆπ π) ϵ. Note that γT of LOCGD is empirically smaller than that of LOCSOR. Hence, LOCGD is empirically slower than LOCSOR by only a small constant factor (e.g., twice as slow), a finding consistent with observations of their standard counterparts [19]. Nonetheless, LOCGD is much simpler and more amenable to parallelization on platforms such as GPUs compared to APPR. 4 Accelerated Local Iterative Methods This section presents our key contributions where we propose faster local methods based on the Chebyshev method for solving Equ. (3) and the Heavy-Ball (HB) method for Equ. (4). 
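Before turning to these accelerated methods, the following is a minimal vectorized sketch of the LOCGD update (9) from Sec. 3.4 (our own illustration using SciPy sparse matrices, not the paper's numba implementation; Algo. 4 in the appendix gives the queue-based version). It makes the parallelism explicit: every active coordinate in St is updated through a single sparse matrix-vector product.

```python
import numpy as np
import scipy.sparse as sp

def loc_gd(A, alpha, eps, s, max_iters=10_000):
    """Vectorized sketch of the LOCGD update (9).

    A: scipy.sparse adjacency matrix (symmetric, unweighted, connected graph)."""
    n = A.shape[0]
    deg = np.asarray(A.sum(axis=1)).ravel()
    d_isqrt = 1.0 / np.sqrt(deg)
    W = sp.diags(d_isqrt) @ A @ sp.diags(d_isqrt)          # D^{-1/2} A D^{-1/2}
    Q = sp.identity(n, format="csr") - (1 - alpha) / (1 + alpha) * W
    b = np.zeros(n); b[s] = 2 * alpha / (1 + alpha) * d_isqrt[s]
    x, r = np.zeros(n), b.copy()
    thresh = 2 * alpha * eps * np.sqrt(deg) / (1 + alpha)  # u active iff |r_u| >= thresh[u]
    for _ in range(max_iters):
        active = np.abs(r) >= thresh                       # the set S_t
        if not active.any():
            break
        r_S = np.where(active, r, 0.0)                     # r restricted to S_t
        x = x + r_S                                        # x(t+1) = x(t) + r_S
        r = r - Q @ r_S                                    # r(t+1) = r(t) - Q r_S
    return np.sqrt(deg) * x, r                             # (PPR estimate, residual)
```

For clarity the sketch keeps dense length-n vectors; a genuinely local implementation touches only St and its neighbors, as in the queue-based Algo. 4.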
4.1 Local Chebyshev method Compared with GS and GD, the standard Chebyshev method offers optimal acceleration in solving Equ. (3). Following existing techniques (e.g., see d Aspremont et al. [11]), we show there exists an upper runtime bound O(m/ α) to meeting the stopping condition where O hides log 1/ϵ (we presented it in Theorem C.6). Hence, the Chebyshev method is one of the optimal first-order linear solvers for solving Equ. (3). However, localizing Chebyshev poses greater challenges due to the additional momentum vector involved in updating x(t). Our key observation is that if a substantial reduction in the magnitudes of r(t) is required within a subset of St, then the corresponding momentum coordinates are likely to possess significant acceleration energy. Intuitively, a viable strategy involves localizing both the residual and momentum vectors. For t 1, denote the momentum vector as (t) := x(t) x(t 1) and δt:t+1 = δtδt+1, we propose the localized Chebyshev as the following Φθ St, x(t), r(t), A = LOCCH ˆx(t) (1 + δt:t+1)r(t) St + δt:t+1 (t) St , δt+1 = 2 1+α x(t+1) x(t) + ˆx(t), r(t+1) r(t) ˆx(t) + 1 α 1+αW ˆx(t), where t 1 with the initials x(0) = 0, x(1) = r(0), δ0 = 0, δ1 = (1 α)/(1 + α), and W = D 1/2AD 1/2 is normalized adjacency matrix. Our key strategy for analyzing (10) is to rewrite the updates of r(t) as a nonhomogeneous second-order difference equation (see details in Lemma C.8) r(t+1) 2δt+1W r(t) + δt:t+1r(t 1) = (1 + δj:j+1) r=j+1 δr:r+1Qr(j) Sj,t where we denote Sj,t = Sj St 1 St given t j 0 where St = V\St. In the rest, we define α = (1 α)/(1 + α) and recall the eigendecomposition of D 1/2AD 1/2 = V ΛV . Based on the above Equ. (11), we have the following key lemma. Lemma 4.1. Given t 1, x(0) = 0, x(1) = r(0). The residual r(t) of LOCCH defined in (10) can be expressed as the following V r(t) = δ1:t Zt V r(0) + δ1:ttu0,t + 2 k=1 δk+1:t(t k)uk,t, where Zt is a diagonal matrix such that Zt 2 1 and Pt 1 j=1 δ2:j t Hj,t I 1 α 1+αΛ V r(0) S0,j if k = 0 Pt 1 j=k δk+1:j (t k) Hj,t 1+α 1 αI Λ V r(k) Sk,j if k 1, where Hk,t is a diagonal matrix such that Hk,t 2 t k. This key lemma essentially captures the process of residual reduction r(t) of LOCCH. Specifically, given current iteration t, we define the running residual reduction rate for r(k) with k = 0, 1, 2, . . . , t 1 of step t as βk,t, that is, βk,t uk,t 2 r(k) 2 , βk max t βk,t. (12) βk,t 2 r(k) (1 α) r(k) 2 | {z } O(ϵ) 4 αj k(t j) (1 α)(t k) r(k) Sk,j 2 r(k) 2 | {z } 4 α (1 α)(1 α) where whether the last term can be even smaller depends on r(k) Sk,j 2 for Sk,j = Sk Sj 1 Sj. However, we notice that the running geometric mean βt (Qt 1 j=0(1 + βj))1/t is even smaller in practice. Based on these observations and the assumption on βt, we establish the following theorem. Theorem 4.2 (Runtime bound of LOCCH). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCCH defined in (10) for solving Equ. (3). For t 1, the residual magnitude r(t) 2 has the following convergence bound r(t) 2 δ1:t j=0 (1 + βj)yt, where yt is a sequence of positive numbers solving yt+1 2yt + yt 1/((1 + βt 1)(1 + βt)) = 0 with y0 = y1 = r(0) 2. Suppose the geometric mean βt (Qt 1 j=0(1 + βj))1/t of βt be such that βt = 1 + c α 1 α where c [0, 2). 
There exists a real implementation of (10) such that the runtime TLOCCH is bounded by TLOCCH Θ (1 + α)vol(ST ) α(2 c) ln 2y T 0 200000 400000 600000 TA = X ASPR (ˆϵ =1.00e-05) ASPR (ˆϵ =1.00e-07) APPR 0 1 2 TA = X t vol(St) 107 ASPR (ˆϵ =3.15e-07) ASPR (ˆϵ =3.15e-09) APPR Figure 3: Comparison of runtime between APPR and ASPR. The setting is the same as in Fig. 1. Left ϵ = 10 4 while 1 n for right. Golub & Overton [18] considered the approximate Chebyshev method by assuming that the inexact residual is sufficiently smaller than ϵ r(t) 2, where ϵ must be small enough to ensure convergence. However, this assumption is overly stringent for our case. The novelty of our analysis lies in a more elegant treatment of a parameterized second-order difference equation, allowing us to circumvent this assumption. The nested APGD(ˆϵ), namely ASPR proposed in Martínez-Rubio et al. [37] has runtime complexity O(|S | f vol (S ) / α + |S | vol (S )) where S is the optimal support of arg minx f(x) + ϵα D1/2x 1 and f vol(S ) = nnz (QS ,S ). Although it is difficult to compare our bound to this, one limitation of ASPR is that it assumes to call APGD(ˆϵ) O(|S |) times to finish in the worst case. However, our iteration complexity is O(1/( α(2 c))). Asymptotically, c = o( α) (ϵ 0), our complexity is O(1/ α) could be better than O(|S |/ α). Fig. 3 presents a preliminary study on ASPR, indicating that it requires more operations than APPR. We conclude our analysis by presenting a similar result for the local Heavy Ball (HB). Note the HB method is the one when δtδt+1 α2 where α = (1 α)/(1 + α). Hence, it has similar convergence analyses as to LOCCH shown in Theorem D.8. The LOCHB has the following updates Φ (St; s, ϵ, α, G, A = LOCHB) : ˆx(t) (1 + α2)r(t) St + α2 (t) St x(t+1) x(t) + ˆx(t), r(t+1) r(t) ˆx(t) + 1 α 1+αW ˆx(t). 5 Generalization and Open Problems Our framework can be applied to various local methods for large-scale linear systems. Extensions of this framework to other linear systems are detailed in Tab. 2 of Appendix E. More broadly, we consider the feasibility of local methods for solving Qx = b, where b is a sparse vector (| supp(b)| n) and Q is a positive definite, graph-induced matrix with bounded eigenvalues. This leads us to question whether all standard iterative methods can be effectively localized, raising two key questions 1. Given a graph-induced matrix Q and its spectral radius ρ(Q) < 1, a standard solver A, and the corresponding local evolving process Φθ(St, x(t), r(t), LOCA), does a localized version of A (over St) converge and have local runtime bounds? 2. Based on current analysis, Theorem 4.2 relies on the geometric mean of residual reduction on r(k) 2 being small. How feasible is acceleration within locality constraints? Specifically, a stronger bound could be established for solving Equ. (3) via LOCHB and LOCCH, with a graph-independent bound of TLOCA = Θ vol(ST ) αγT ln C , where C a graph-independent constant. Additionally, this work primarily focuses on using first-order neighbors at each iteration. An area for future exploration is generalizing to higher-order neighbors to determine if this leads to faster or more efficient methodologies, which remains an open question. 6 Experiments We conduct experiments over 17 graphs to solve (3) and explore the local clustering task. We address the following questions: 1) Can iterative solvers be effectively localized? 2) How does the performance of accelerated local methods compare to non-accelerated ones? 
3) Can our proposed methods reduce the number of operations required for local clustering? 6 Baselines. We consider four baselines: 1) Conjugate Gradient Method (CGM) as a benchmark to compare local and non-local methods; 2) ISTA, the local method proposed by Fountoulakis et al. [13]; 3) FISTA, the momentum-based local algorithm proposed by Hu [22]; and 4) APPR, the classic local method proposed by Andersen et al. [2]. All methods are implemented in Python 3.10 with the numba library [33]. Efficiency of localized algorithms. To compare local solvers to their standard counterparts, we set α = 0.1, randomly select 50 nodes from each graph to serve as es in (3), and run standard GD, SOR, HB, and CH solvers along with their local counterparts: LOCGD, LOCSOR, LOCHB, and LOCCH. We measure the efficiency by the speedup, defined as the ratio between the runtime of the standard and local solver. The range of ϵ is ϵ [ α 2(1+α)ds , 10 4/n]. The results, presented in Fig. 4, clearly indicate that our design demonstrates significant speedup, especially around ϵ = 1/n. Remarkably, they still show better performance even when ϵ 10 4/n (Fig. 5). These results suggest that local solvers are preferred over non-local ones when the precision requirement is in this range. 10 8 10 5 10 2 ϵ 10 8 10 5 10 2 ϵ 10 8 10 5 10 2 ϵ 10 8 10 5 10 2 ϵ ogbn-products 10 8 10 5 10 2 ϵ ogbn-proteins Figure 4: The speedup of local solvers as a function of ϵ. The vertical line is ϵ = 1/n. 0.0 0.5 1.0 1.5 2.0 Operations 109 ln x(T) x 1 Loc SOR Loc HB Loc CH ISTA FISTA CGM APPR 0 1 2 3 Operations 109 ogbn-products Figure 5: Estimation error as a function of operations required. (ϵ = 10 4/n) Comparison with local baselines and CGM. We next compare our three accelerated methods with four baselines. Fig. 5 presents the ℓ1-estimation error in terms of the number of operations (quantified as t vol(St)) executed. It is evident that our three solvers use significantly fewer operations compared to CGM and the other three local methods. Again, due to maintaining a nondecreasing set of active nodes, ISTA and FISTA require more operations than the locally evolving set process. Ours are more efficient than APPR, where r(0) = es is used. 0 1 2 Operations 107 ln x(T) x 1 com-friendster APPR Loc HB Loc CH Loc SOR 0 2 4 Operations 107 ogbn-papers100M Figure 6: Performance on large-scale graphs. Efficiency in terms of α and huge-graph tests. We demonstrate the performance of local solvers in terms of different α ranging from 0.005 to 0.25. Interestingly, in Fig. 13, LOCGD show faster convergence when α is small; this may be because of the advantages of monotonicity properties, which is not present in the accelerated methods. However, in other regions of α, accelerated methods are faster. We also tested local solvers on two large-scale graphs where papers100M has 111M nodes and 1.6B edges while com-friendster has 65M nodes with 1.8B edges. Results are shown in Fig. 6; compared with current default local methods, it is several times faster, especially on ogbn-papers100M. 6Additional experimental results, setups, and algorithm parameters are provided in Appendix F. Case study on local clustering. Following the experimental setup in Fountoulakis et al. [13], we consider the task of local clustering on 15 graphs. As partially demonstrated in Tab. 1, compared with APPR and FISTA, LOCSOR uses the least operations and is the fastest, demonstrating the advantages of our proposed local solvers. Table 1: Operations/runtime comparison on local clustering. 
G Operations Run Time (Seconds) APPR Loc SOR FISTA APPR Loc SOR FISTA G1 6.9e+05 6.5e+04 5.7e+05 0.127 0.043 0.093 G2 6.7e+05 8.9e+04 4.4e+05 0.362 0.125 0.308 G3 4.3e+05 3.5e+04 2.9e+05 0.069 0.014 0.042 G4 5.7e+05 7.6e+04 4.4e+05 0.357 0.175 0.229 G5 5.4e+05 9.0e+04 5.0e+05 0.072 0.055 0.084 7 Limitations and Conclusion Our proposed algorithms may have the following limitations: 1) When α is small, the acceleration effect partially disappears, as observed in Fig. 13. This may be due to the limitations of global counterparts, where the residual may not decrease early; 2) Our new accelerated bound for Loc CH depends on an empirically reasonable assumption of residual reduction but lacks theoretical justification. We propose using a new locally evolving set process framework to characterize algorithm locality and demonstrate that several standard iterative solvers can be effectively localized, significantly speeding up current local solvers. Our local methods could be efficiently implemented into GPU architecture to accelerate the training of GNNs such as APPNP [27] and PPRGo [7]. We also offer open problems in developing faster local methods. It is worth exploring whether subsampling active nodes stochastically or using different queue strategies (priority rather than FIFO) could help speed up the framework further. It also remains interesting to see how to design local algorithms for conjugate direction-based methods such as CGM. Acknowledgments and Disclosure of Funding The authors would like to thank the anonymous reviewers for their helpful comments. The work of Baojian Zhou is sponsored by Shanghai Pujiang Program (No. 22PJ1401300) and the National Natural Science Foundation of China (No. KRH2305047). The work of Deqing Yang is supported by Chinese NSF Major Research Plan No.92270121. The computations in this research were performed using the CFFF platform of Fudan University. [1] Alon, N., Rubinfeld, R., Vardi, S., and Xie, N. Space-efficient local computation algorithms. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms (SODA), pp. 1132 1139. SIAM, 2012. [2] Andersen, R., Chung, F., and Lang, K. Local graph partitioning using Page Rank vectors. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2006. [3] Andersen, R., Gharan, S. O., Peres, Y., and Trevisan, L. Almost optimal local graph clustering using evolving sets. Journal of the ACM (JACM), 63(2):1 31, 2016. [4] Anikin, A., Gasnikov, A., Gornov, A., Kamzolov, D., Maximov, Y., and Nesterov, Y. Efficient numerical methods to solve sparse linear equations with application to Page Rank. Optimization Methods and Software, pp. 1 29, 2020. [5] Beck, A. and Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2:183 202, 2009. [6] Berkhin, P. Bookmark-coloring algorithm for personalized pagerank computing. Internet Mathematics, 3(1):41 62, 2006. [7] Bojchevski, A., Klicpera, J., Perozzi, B., Kapoor, A., Blais, M., Rózemberczki, B., Lukasik, M., and Günnemann, S. Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), 2020. [8] Boyd, S., Diaconis, P., and Xiao, L. Fastest mixing markov chain on a graph. SIAM review, 46 (4):667 689, 2004. [9] Chen, Z., Guo, X., Zhou, B., Yang, D., and Skiena, S. Accelerating personalized Page Rank vector computation. 
In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp. 262 273, 2023. [10] Chung, F. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences (PNAS), 104(50):19735 19740, 2007. [11] d Aspremont, A., Scieur, D., Taylor, A., et al. Acceleration methods. Foundations and Trends in Optimization, 5(1-2):1 245, 2021. [12] Fountoulakis, K. and Yang, S. Open problem: Running time complexity of accelerated ℓ1regularized Page Rank. In Conference on Learning Theory (COLT), 2022. [13] Fountoulakis, K., Roosta-Khorasani, F., Shun, J., Cheng, X., and Mahoney, M. W. Variational perspective on local graph clustering. Mathematical Programming, 174(1):553 573, 2019. [14] Fountoulakis, K., Wang, D., and Yang, S. p-norm flow diffusion for local graph clustering. In ICML, 2020. [15] Gasteiger, J., Weißenberger, S., and Günnemann, S. Diffusion improves graph learning. In Advances in neural information processing systems (Neur IPS), 2019. [16] Gleich, D. and Mahoney, M. Anti-differentiating approximation algorithms: A case study with min-cuts, spectral, and flow. In International Conference on Machine Learning, pp. 1018 1025. PMLR, 2014. [17] Gleich, D. F. Page Rank beyond the web. siam REVIEW, 57(3):321 363, 2015. [18] Golub, G. H. and Overton, M. L. The convergence of inexact chebyshev and richardson iterative methods for solving linear systems. Numerische Mathematik, 53(5):571 593, 1988. [19] Golub, G. H. and Van Loan, C. F. Matrix computations (4th Edition). JHU press, 2013. [20] Guo, X., Zhou, B., and Skiena, S. Subset node representation learning over large dynamic graphs. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pp. 516 526, 2021. [21] Guo, X., Zhou, B., and Skiena, S. Subset node anomaly tracking over large dynamic graphs. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 475 485, 2022. [22] Hu, C. Local graph clustering using l1-regularized Page Rank algorithms. Master s thesis, University of Waterloo, 2020. [23] Hu, W., Fey, M., Zitnik, M., Dong, Y., Ren, H., Liu, B., Catasta, M., and Leskovec, J. Open graph benchmark: Datasets for machine learning on graphs. Advances in neural information processing systems, 33:22118 22133, 2020. [24] Johnson, R. and Zhang, T. Graph-based semi-supervised learning and spectral kernel design. IEEE Transactions on Information Theory, 54(1):275 288, 2008. [25] Kapralov, M., Lattanzi, S., Nouri, N., and Tardos, J. Efficient and local parallel random walks. Advances in Neural Information Processing Systems (Neur IPS), 2021. [26] Klamkin, M. and Newman, D. J. Extensions of the weierstrass product inequalities. Mathematics Magazine, 43(3):137 141, 1970. [27] Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), 2019. [28] Kloster, K. and Gleich, D. F. A nearly-sublinear method for approximating a column of the matrix exponential for matrices from large, sparse networks. In Algorithms and Models for the Web Graph: 10th International Workshop, WAW 2013, Cambridge, MA, USA, December 14-15, 2013, Proceedings 10, pp. 68 79. Springer, 2013. [29] Kloster, K. and Gleich, D. F. Heat kernel based community detection. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 2014. [30] Kloumann, I. M. and Kleinberg, J. M. 
Community membership identification from small seed sets. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD), 2014. [31] Koutis, I., Miller, G. L., and Peng, R. A nearly-m log n time solver for sdd linear systems. In 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 590 598. IEEE, 2011. [32] Ł acki, J., Mitrovi c, S., Onak, K., and Sankowski, P. Walking randomly, massively, and efficiently. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing (STOC), pp. 364 377, 2020. [33] Lam, S. K., Pitrou, A., and Seibert, S. Numba: A llvm-based python jit compiler. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, pp. 1 6, 2015. [34] Leskovec, J., Lang, K. J., Dasgupta, A., and Mahoney, M. W. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29 123, 2009. [35] Leventhal, D. and Lewis, A. S. Randomized methods for linear constraints: convergence rates and conditioning. Mathematics of Operations Research, 35(3):641 654, 2010. [36] Mahoney, M. W., Orecchia, L., and Vishnoi, N. K. A local spectral method for graphs: With applications to improving graph partitions and exploring data graphs locally. The Journal of Machine Learning Research (JMLR), 13(1):2339 2365, 2012. [37] Martínez-Rubio, D., Wirth, E., and Pokutta, S. Accelerated and sparse algorithms for approximate personalized Page Rank and beyond. In Proceedings of Thirty Sixth Conference on Learning Theory (COLT), volume 195 of Proceedings of Machine Learning Research, pp. 2852 2876. PMLR, 2023. [38] Morris, B. and Peres, Y. Evolving sets and mixing. In Proceedings of the Thirty-Fifth Annual ACM Symposium on Theory of Computing (STOC), pp. 279 286, New York, NY, USA, 2003. Association for Computing Machinery. [39] Nutini, J., Schmidt, M., Laradji, I., Friedlander, M., and Koepke, H. Coordinate descent converges faster with the gauss-southwell rule than random selection. In ICML, pp. 1632 1641. PMLR, 2015. [40] Polyak, B. T. Introduction to optimization. New York, Optimization Software 1987. [41] Rakhlin, A. and Sridharan, K. Efficient online multiclass prediction on graphs via surrogate losses. In Artificial Intelligence and Statistics, pp. 1403 1411. PMLR, 2017. [42] Rubinfeld, R. and Shapira, A. Sublinear time algorithms. SIAM Journal on Discrete Mathematics, 25(4):1562 1588, 2011. [43] Saad, Y. Iterative methods for sparse linear systems. SIAM, 2003. [44] Schaub, M. T., Benson, A. R., Horn, P., Lippner, G., and Jadbabaie, A. Random walks on simplicial complexes and the normalized hodge 1-laplacian. SIAM Review, 62(2):353 391, 2020. [45] Spielman, D. A. Algorithms, graph theory, and linear equations in laplacian matrices. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II IV: Invited Lectures, pp. 2698 2722. World Scientific, 2010. [46] Spielman, D. A. and Teng, S.-H. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM Journal on Matrix Analysis and Applications, 35(3):835 885, 2014. [47] Stevi c, S. Bounded solutions to nonhomogeneous linear second-order difference equations. Symmetry, 9(10):227, 2017. [48] Teng, S.-H. et al. Scalable algorithms for data and network analysis. Foundations and Trends in Theoretical Computer Science, 12(1 2):1 274, 2016. [49] Tseng, P. 
and Yun, S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming, 117:387 423, 2009. [50] Tu, S., Venkataraman, S., Wilson, A. C., Gittens, A., Jordan, M. I., and Recht, B. Breaking locality accelerates block gauss-seidel. In ICML, pp. 3482 3491. PMLR, 2017. [51] Vishnoi, N. K. et al. Lx= b. Foundations and Trends in Theoretical Computer Science, 8(1 2): 1 141, 2013. [52] Wang, J.-K., Lin, C.-H., and Abernethy, J. D. A modular analysis of provable acceleration via Polyak s momentum: Training a wide Re LU network and a deep linear network. In ICML, pp. 10816 10827. PMLR, 2021. [53] Wu, F., Souza, A., Zhang, T., Fifty, C., Yu, T., and Weinberger, K. Simplifying graph convolutional networks. In ICML, pp. 6861 6871. PMLR, 2019. [54] Yin, H., Benson, A. R., Leskovec, J., and Gleich, D. F. Local higher-order graph clustering. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (KDD), 2017. [55] Young, D. M. Iterative solution of large linear systems. Elsevier, 2014. [56] Zeng, H., Zhang, M., Xia, Y., Srivastava, A., Malevich, A., Kannan, R., Prasanna, V., Jin, L., and Chen, R. Decoupling the depth and scope of graph neural networks. Advances in Neural Information Processing Systems, 34:19665 19679, 2021. [57] Zhou, B., Sun, Y., and Harikandeh, R. B. Fast online node labeling for very large graphs. In ICML, pp. 42658 42697. PMLR, 2023. Appendix / supplemental material A Notations and Proof of Lemma 2.1 16 A.1 List of Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Proof of Lemma 2.1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 B Local Iterative Methods via Evolving Set Process 18 B.1 Local Variant of GS-SOR and Proof of Lemma 3.2 . . . . . . . . . . . . . . . . . 18 B.2 LOCSOR and Proof of Theorem 3.3 . . . . . . . . . . . . . . . . . . . . . . . . . 20 B.3 Optimal GS-SOR and Proof of Corollary 3.4 . . . . . . . . . . . . . . . . . . . . . 22 B.4 LOCGD and Proof of Theorem 3.5 . . . . . . . . . . . . . . . . . . . . . . . . . 23 C Local Chebyshev Method - LOCCH 25 C.1 Nonhomogeneous of Second-order Difference Equation . . . . . . . . . . . . . . . 25 C.2 Properties on Ratio of Chebyshev Polynomials . . . . . . . . . . . . . . . . . . . 28 C.3 Standard Chebyshev (CH) Method and Proof of Theorem C.6 . . . . . . . . . . . . 29 C.4 Residual Updates of LOCCH and Proof of Lemma 4.1 . . . . . . . . . . . . . . . 30 C.5 Convergence of LOCCH and Proof of Theorem 4.2 . . . . . . . . . . . . . . . . . 34 C.6 Implementation of LOCCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 D Local Heavy-Ball Method - LOCHB 37 D.1 Standard HB and Proof Theorem D.2 . . . . . . . . . . . . . . . . . . . . . . . . . 37 D.2 Residual Updates of LOCHB and Proof of Theorem D.6 . . . . . . . . . . . . . . 40 D.3 Convergence of LOCHB of Proof of Theorem D.8 . . . . . . . . . . . . . . . . . . 43 D.4 Implementation of LOCHB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 E Instances of Sparse Linear Systems 45 E.1 Table of Popular Graph-induced Linear Systems . . . . . . . . . . . . . . . . . . 45 F Experimental Details and Missing Results 46 F.1 Datasets and Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 F.2 Problems Settings and Baseline Methods . . . . . . . . . . . . . . . . . . . . . . . 46 F.3 Full results of Fig. 15 4 5 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
47 F.4 Results on local clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 G Related work 53 A Notations and Proof of Lemma 2.1 A.1 List of Notations In the rest of the appendix, we use the following notations: Description G An undirected connected simple graph with unit weights. A The adjacency matrix of G. D The diagonal degree matrix of G. W The normalized Laplacian matrix W D 1/2AD 1/2. V ΛV The eigendecomposition of W is given by W = D 1/2AD 1/2 = V ΛV , where Λ = diag(λ1, λ2, . . . , λn), with 1 = λ1 λ2 λn 1. When λn = 1, G is a bipartite graph. Q The underlying matrix of Equ. (3) is Q = I 1 α 1+αD 1/2AD 1/2. r(t) Given an estimate x(t), the residual is defined as r(t) b Qx(t). r(t) The D1/2-shifted residual is defined as r(t) = D1/2r(t). α The damping factor α, which lies in the interval (0,1). α A pre-defined constant α = (1 α)/(1 + α). TA The total runtime of a local algorithm A. Tt(x) For t 1, Tt denotes the Chebyshev polynomial of the first kind, defined as Tt+1(x) = 2x Tt(x) Tt 1(x), with T0(x) = 1, and T1(x) = x. δt The ratio of Tt 1 and Tt, i.e., δt = Tt 1( 1+α 1 α)/Tt( 1+α 1 α), and δt+1 = (2 1+α 1 α δt) 1, with δ1 = 1 α δ1:t The product of all δt, i.e., δ1:t Qt i=1 δi. By default, we set δ0:1 = 0. S1:t The intersection of all t-ordered sets, i.e., S1:t = S1 S2 St. By default, we set St:t 1 = V. St The complement of St, i.e., St = V\St. Sj,t Sj,t = Sj Sj+1 St 1 St = Sj:t 1 St. By default, we set St,t = St. A.2 Proof of Lemma 2.1 Lemma 2.1 (Runtime bound of APPR [2]). Given α (0, 1) and the precision ϵ 1/ds for node s V with p 0, z es at the initial, APPR(G, ϵ, α, s) defined in (2) returns an estimate p of π. There exists a real implementation of (2) (e.g., Algo. 2) such that the runtime TAPPR satisfies TAPPR Θ (1/(αϵ)) . Furthermore, the estimate ˆπ := p satisfies D 1(ˆπ π) ϵ and vol(supp(ˆπ)) 2/((1 α)ϵ). Proof. To find an upper bound of TAPPR, we add a time index for all active nodes u1, u2, . . . , ut processed in APPR of Algo. 2, a real implementation of (2). The parameter t is the number of active nodes processed, and ui is the node dequeued at time i. So, updates of p and z are from (0, es) = (p(0), z(0)) to (p(t), z(t)) as follows: (p(0), z(0)) u1 (p(1), z(1)) u2 (p(2), z(2)) ut (z(t), p(t)). For each active ui, the updates of (p, z) by definition can be represented as p(i) = p(i 1) + αz(i 1) ui eui, z(i) = z(i 1) (1 + α)z(i 1) ui 2 eui + (1 α)z(i 1) ui 2 AD 1eui. Since z(0) = es 0, then z(i) 0 and p(i) 0 for all i by induction. Note that AD 1eui 1 = 1, we have the following relation from the updates of z z(i) 1 = z(i 1) 1 αz(i 1) ui z(i 1) ui = z(i 1) 1 z(i) 1 Note ϵdui z(i 1) ui for each active ui. Summing the above equation over i = 1, 2, . . . , t, we have i=1 z(i 1) ui = z(i 1) 1 z(i) 1 = z(0) 1 z(t) 1 where note z(0) 1 = es 1 = 1 by the initial condition. Since Pt i=1 dui exactly captures the number of operations needed, the runtime of Algo. 2 is then bounded as To check the quality of estimate p, using the updates of p(i) and summing over all i, we have i=1 z(i 1) ui eui = α (1 + α) z(i 1) z(i) = π Πz(t), where Π is the PPR matrix. The above gives us π p(t) = Π z(t). Since G is undirected, the Π matrix satisfies πv[u] = (du/dv)πu[v] where πv[u] is the u-th element of PPR vector of sourcing node v. Consider each u-th element of Π z(t) (Π z(t))u = X v V z(t) v πv[u] = X v V z(t) v du dv πu[v] ϵdu X v V πu[v] = ϵdu, where the last equality is due to P v V πu[v] = 1. Hence, (π p(t))u = (Π z(t))u ϵdu, which indicates D 1(p(t) π) ϵ. 
To see the bound of vol(supp(p(t))), note for any u supp(p(t)), it was an active node and there was at least zu(1 α)/2 remain in u-th entry of z(t) where we denote zu as the residual before the last push operation of node u; hence u supp(p) du X u supp(p) z(t) u ϵ(1 α)/2 2 (1 α)ϵ. Algo. 2 APPR(α, ϵ, s, G) via FIFO Queue 1: Initialize: p 0, z es, Q { , s}, t = 1 2: while true do 3: u Q.dequeue() 4: if u == * then 5: if Q = then 6: break 7: t t + 1 // Starting time of St 8: Q.enqueue(*) // Marker for next St+1 9: continue 10: z zu 11: pu pu + α z 12: zu z (1 α)/2 13: for v N(u) do 14: zv zv + (1 α) 2 z du 15: if zv ϵdv and v / Q then 16: Q.enqueue(v) 17: if zu ϵdu and u / Q then 18: Q.enqueue(u) 19: return p Indeed, Lemma 2.1 is a special case of Theorem 1 in [2]. The proof outlined above adheres to the key strategy demonstrated in that theorem, which involves exploring the monotonicity and nonnegativity of z. The real implementation of APPR, as shown in Algo. 2, presents a typical queue-based method. It has monotonic properties during the updates of p and z (Lines 10-16 of Algo. 2). It also holds element-wise that p 0 and z 0. The operations of Q.enqueue(u), Q.dequeue(), and v / Q are all in O(1). Line 4 to Line 9 is to design the marker for distinguishing between St and St+1. If all active nodes are processed and no more active nodes are added into Q, then Q will be empty, and finally, the algorithm returns an estimate p of π. B Local Iterative Methods via Evolving Set Process Justification of an equivalent condition. We make the justification of an equivalent stop condition for solving (3). Note we require a local solver to return an estimate ˆπ satisfies D 1 (ˆπ π) ϵ (14) Since we define Qx = b as I 1 α 1+αD 1/2AD 1/2 x = 2α 1+αD 1/2es. With r(t) = b Qx(t) and the stop condition D 1/2r(t) 2αϵ 1 + α ensures the estimate ˆπ = D1/2x(t) satisfies (14). To see this, since π = D1/2x , we have D 1 (ˆπ π) = D 1/2(x(t) x ) = D 1/2(Q 1b Q 1r(t) Q 1b) = D 1/2Q 1D1/2D 1/2r(t) D 1/2Q 1D1/2 D 1/2r(t) D 1/2Q 1D1/2 2αϵ where D 1/2Q 1D1/2 = (I 1 α 1+αD 1A) 1 = P i=0( 1 α 1+αD 1A)i. This leads to 1+αD 1A)i 2αϵ B.1 Local Variant of GS-SOR and Proof of Lemma 3.2 The Gauss-Seidel Successive Over-Relaxation (GS-SOR) solver (see Section 11.2.7 of Golub & Van Loan [19]) for the linear system Mπ = s via the following forward substitution for i in V := {1, 2, . . . , n} do : p(t+1) i = ω j=1 Mijp(t+1) j j=i+1 Mijp(t) j /Mii + (1 ω)p(t) i , where p is updated from p(t) to p(t+1). When the relaxation parameter ω = 1, GS-SOR reduces to the standard GS method. Equivalently, let i = (i 1)/n for i = 1, 2, . . . , n, then GS-SOR updates can be sequentially represented as for i in V := {1, 2, . . . , n} do : p(t+ i+1) p(t+ i) + ω Mii j=1 Mijp(t+ i) j j=i Mijp(t+ i) j Therefore, it is natural to define the following local variant of GS-SOR. Definition B.1 (Local variant of GS-SOR). Consider the linear system Mπ = s. For t 0, we are given an active node set St = {u1, u2, . . . , u|St|} and let i = (i 1)/|St| for i = 1, 2, . . . , |St|, providing ω (0, 2), it is natural to define the local variant of GS-SOR as follows: for ui in St := {u1, u2, . . . , u|St|} do : p(t+ i+1) p(t+ i) + ω Muiui j=1 Muiujp(t+ i) uj j=i Muiujp(t+ i) uj where St V. When ω = 1 and St = V = {1, 2, . . . , n}, it reduces to the standard GS. We use the above definition to show APPR is a local variant of GS-SOR as the following. Lemma 3.2 (New local evolving-based bound for APPR). Let M = α 1 I 1 α and s = es. 
The linear system Mπ = s is equivalent to Equ. (1). Given p(0) = 0, z(0) = es with ω (0, 2), the local variant of GS-SOR (15) for Mπ = s can be formulated as p(t+ i+1) p(t+ i) + ωz(t+ i) ui Muiui eui, z(t+ i+1) z(t+ i) ωz(t+ i) ui Muiui Meui, where ui is an active node in St satisfying zui ϵdui and i = (i 1)/|St|. Furthermore, when ω = 1+α 2 , this method reduces to APPR given in (7), and there exists a real implementation (Aglo. 2) of APPR such that the runtime TAPPR is bounded by TAPPR vol(ST ) ϵ , where vol(ST ) ϵ , CT = 2 (1 α)|IT |, ˆγT 1 (P|St| i=1 |z(t+ i) ui | z(t) 1 Proof. Note that Mπ = s is equivalent to Equ. (1). We first rewrite p(t+ i+1) in terms of the residual z. The residual z at time t + i can be written as z(t+ i) = s Mp(t+ i). Note sui Pi 1 j=1 Muiujp(t+ i) uj Pn j=i Muiujp(t+ i) uj = (s Mp(t+ i))ui = z(t+ i) ui . Then, the updates of local GS-SOR defined in (15) can be rewritten as p(t+ i+1) = p(t+ i) + ω Muiui z(t+ i) ui eui. Hence, the updates of z(t+ i) can be written as z(t+ i+1) = s Mp(t+ i+1) = s M p(t+ i) + ωz(t+ i) ui Muiui eui = z(t+ i) ωz(t+ i) ui Muiui Meui. Note the diagonal element Muiui = (1 + α)/(2α). Hence, when ω = 1+α 2 , we have p(t+ i+1) = p(t+ i) + αz(t+ i) ui eui z(t+ i+1) = z(t+ i) αz(t+ i) ui Meui = z(t+ i) z(t+ i) ui (1 + α 2 AD 1)eui. The above updates match APPR s evolving set process formulation in (7). The rest is to show a new runtime bound. Adding ℓ1-norm on both sides of the above equation, then note z(t+ i+1) 1 = z(t+ i) 1 αz(t+ i) ui for i = 1, 2, . . . , |St|. We have 1 α P|St| i=1 z(t+ i) ui z(t) 1 z(t) 1 = (1 αβt) z(t) 1 = i=0 (1 αβt) z(0) 1, where we define βt := P|St| i=1 |z (t+ i) ui | z(t) 1 . Let t = T 1, we have t=0 ln (1 αβt) t=0 αβt T 1 αˆγT ln z(0) 1 where the first inequality is due to ln(1 + x) x for x > 1. For each nonzero node u IT = {z(T ) u : z(T ) u = 0, u V}, consider the last time t that it was altered. Then, either the alteration came from u being an active node, with z(t ) u duϵ, and after the PUSH operation it became z(t ) u (1 α)du 2 ϵ; or the alteration came from a neighboring node vu N(u) pushing its mass onto u, which ensures that z(t ) u (1 α) 2dvu z(t ) vu (1 α) 2 ϵ. These two cases provide a lower bound of 1 α Hence, z(T ) 1 ϵ(1 α)|IT | 2 , which leads to the corresponding constant CT . To see the lower bound of 1/ϵ, note ϵdui z(t+ i) ui for all i = 1, 2, . . . , |St|. Then we have i=1 z(t+ i) ui = βt z(t) 1 βt, where we defined βt := P|St| i=1 |z (t+ i) ui | z(t) 1 and the last inequality is due to the monotonic decreasing of z(t) 1, i.e., 1 z(0) 1 z(T ) 1. Applying the above inequality for all t = 0, 1, 2 . . . , T 1, it leads to ϵ vol(St) βt t=0 vol(St) where the last derivation is from the fact that ˆγT = 1 PT 1 t=0 βt := P|St| i=1 |z (t+ i) ui | z(t) 1 Remark B.2. The connection between APPR and the Gauss-Seidel is not new [28, 29, 16, 9]. Our work is the first work that has linked APPR and the Gauss-Seidel with a locally evolving set process. B.2 LOCSOR and Proof of Theorem 3.3 In this subsection, recall we defined r(t) = D1/2r(t). Lemma B.3 (Local iteration complexity of LOCSOR (ω 1)). Denote St = {u1, u2, . . . , u|St|} as the active node set at the t-th iteration. When ω (0, 1], all vectors r(t) 0 are nonnegative and magnitudes are decreasing r(t+1) 1 < r(t) 1. Let T be the total number of iterations needed. Then, at iteration T, we have r(0) 1 , ln r(0) 1 r(t+ i) ui r(t) 1 where γt = t 1 Pt 1 τ=0 γτ is the mean of active ratio factors defined in Equ. (6). Proof. Recall ui St = {u1, . . . 
, u|St|} and i = i 1 |St|, LOCSOR in Algo. 3 updates x(t+ i+1) = x(t+ i) + ωr(t+ i) ui eui r(t+ i+1) = r(t+ i) ωr(t+ i) ui eui + (1 α)ω 1+α r(t+ i) ui D 1/2AD 1/2eui Note r 0 during updates when ω (0, 1] and recall r(t) = D1/2r(t), we have r(t+ i+1) + ω r(t+ i) ui eu 1 = r(t+ i) + (1 α)ω 1+α r(t+ i) ui AD 1eui 1 r(t+ i+1) 1 + ω r(t+ i) ui = r(t+ i) 1 + (1 α)ω 1+α r(t+ i) ui . Summing over the above equations over ui, we have r(t+1) 1 = r(t) 1 2αω i=1 r(t+ i) ui = r(t+ i) ui r(t) 1 | {z } γt r(t) 1. (17) Given {xi}T 1 i=0 and xi (0, 1), the Weierstrass product inequality provides 1 PT 1 i=0 xi QT 1 i=0 (1 xi). By using this inequality, we continue to have a lower bound of T as the following r(T ) 1 r(0) 1 (1 + α) 1 r(T ) 1/ r(0) 1 To get upper bound of γt, note each active residual r(t+ i) ui pushes at most (1 α)ω (1+α) times magnitude to rui+1, rui+2, and ru|St|; hence, P|St| j=i r(t+ j) uj will increase by at most r(t+ i) ui (1 α)ω (1+α) r(t+ i) ui in total. Hence, overall ui, we have r(t) St 1 = i=1 r(t) ui i=1 r(t+ i) ui 2 r(t) St 1. We reach the following lower and upper bounds of γt, r(t) St 1 r(t) 1 γt := P|St| i=1 r (t+ i) ui r(t) 1 2 r(t) St 1 r(t) 1 . To check the upper bound of T, from Equ. (17), r(T ) 1 = QT 1 t=0 1 2αωγt 1+α r(0) 1 and t=0 ln 1 2αωγt 1 + α T (1 + α) 2αωγT ln r(0) 1 where the first inequality is due to ln(1 + x) x for x > 1. Theorem 3.3 (Runtime bound of LOCSOR (ω = 1)). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCSOR defined in (8) for solving Equ. (3). There exists a real implementation of (8) such that the runtime TLOCSOR is bounded by 1 D1/2r(T ) 1 TLOCSOR 1 + α αϵ, vol(ST ) where vol(ST ) and γT are defined in (6) and C = 1+α (1 α)|IT | with IT = supp(r(T )). Furthermore, vol(ST )/γT 1/ϵ and the local estimate ˆπ := D1/2x(T ) satisfies D 1(ˆπ π) ϵ. Proof. After the last iteration T, for each nonzero residual r(T ) u = 0, u IT , there must be at least one update that happened at node u: Node u has a neighbor vu N(u), which was active. This neighbor vu pushed some residual (1 α) r(t ) vu (1+α)dvu to u where t < T. Hence, for all u IT , we have r(T ) 1 = X u IT r(T ) u X (1 α) r(t ) vu (1 + α)dvu X (1 α)2αϵdvu/(1 + α) (1 + α)dvu = ϵ|IT |2α(1 α) where the second inequality is because r(t ) vu was active before the push operation. Applying the above lower bound of r(T ) 1 to Equ. (16) of Lemma B.3 and note r(0) 1 = 2α/(1 + α), we obtain r(0) 1 r(T ) 1 r(0) 1 ϵ|IT | 2α(1 α) (1+α)2 = 1 + α ϵ(1 α)|IT | := C1 The rest is to prove an upper bound 1/(αϵ) of TLOCSOR. Recall that for any active node u, we have residual updates from Algo. 3 as the following D1/2r(t+1) = D1/2r(t) ωr(t) u D1/2eu + (1 α)ωr(t) u 1 + α AD 1D1/2eu. Move ωr(t) u D1/2eu to the left and note AD 1D1/2eu 1 = du, we then obtain D1/2r(t+1) 1 + ω p dur(t) u = D1/2r(t) 1 + (1 α)ω Hence, for each active u, we have 2αω dur(t) u 1+α = D1/2r(t) 1 D1/2r(t+1) 1. Summing them over all active nodes u and noticing r(t) u 2αϵ du/(1 + α) by the active condition. Note ω = 1 and ||D1/2r(0)||1 = 2α 1+α, we have run time bounded by TLOCSOR = X 2 P t( D1/2r(t) 1 D1/2r(t+1) 1) Combining the above bound and the bound T shown in Lemma B.3, we prove the lower and upper bound of TLoc SOR. To check the lower bound of 1/ϵ,i.e., vol(ST )/γT 1/ϵ, note 2αϵdui 1+α r(t+ i) ui for all i = 1, 2, . . . , |St|. 
Then we have 2αϵ 1 + α vol(St) i=1 r(t+ i) ui = γt D1/2r(t) 1 γt D1/2r(0) 1 = 2αγt where the last inequality is due to the monotonic decreasing of D1/2r(t) 1, i.e., 2α 1+α D1/2r(0) 1 r(T ) 1. Applying the above inequality over all t = 0, 1, 2 . . . , T 1, it leads to ϵ vol(St) γt t=0 vol(St) Algo. 3 LOCSOR(α, ϵ, s, G, ω) via FIFO Queue 1: Initialize: r ces, x 0, c = 2α 1+α, t = 1 2: Q { , s} // As we assume ϵ 1/ds 3: while true do 4: u Q.dequeue() 5: if u == * then 6: if Q = then 7: break 8: t t + 1 // Starting time of St 9: Q.enqueue(*) // Marker for next St+1 10: continue 11: r ru 12: if |ru| < c ϵdu then 13: continue 14: xu xu + ω r 15: ru ru ω r 16: for v N(u) do 17: rv rv + (1 α)ω (1+α) r du 18: if |rv| c ϵdv and v / Q then 19: Q.enqueue(v) 20: if |ru| c ϵdu and u / Q then 21: Q.enqueue(u) 22: return x The real queue-based implementation of LOCSOR is presented in Algo. 3. It has monotonic and nonnegative properties during the updates of r 0 and x 0 when ω (0, 1]. Same as APPR, the operations of Q.enqueue(u), Q.dequeue(), and v / Q are all in O(1). During the updates, one should note that the real vector r presents D1/2r(t) while the vector x is D1/2x(t). In this case, the original active node condition is implicitly shifting from |ru| 2αϵ du 1+α to du|ru| 2αϵdu 1+α . We use this shifted active condition in Lines 11 and 13 and inactive condition in Line 5. When ω (1, 2), it is possible |ru| < c ϵdu and LOCSOR will ignore this inactive node u during the updates. This step makes sure St = {ui : |r (t+ i) ui | 2αϵ du 1+α } during the updates. B.3 Optimal GS-SOR and Proof of Corollary 3.4 We introduce the following standard result. Lemma B.4 (Young [55], Section 12.2, Theorem 2.1 ). Given the GS-SOR method for solving Qx = b, if the underlying matrix Q is a Stieltjes matrix and set relaxation parameter ω as 1 ρ(B)2 = 1 + where ρ(B) is the largest eigenvalue (in magnitude) of B = I diag(Q) 1Q, then where Lω := (diag(Q) ωQL) 1(ωQU (ω 1)diag(Q)) with Q = diag(Q) QU QL. Corollary 3.4. Let ω = ω 2/(1 + p 1 (1 α)2/(1 + α)2) and St = V, t 0, the global version of LOCSOR has the following convergence bound r(t) 2 2 (1 + α) ds where ϵt are small positive numbers with limt ϵt = 0. Proof. Recall Q = I 1 α 1+αD 1/2AD 1/2 and we consider the underlying graph as simple which means A has 0 diagonal. Hence, diag(Q) = I and B is defined as 1 + αD 1/2AD 1/2, ρ(B) = 1 α Since Q is a Stieltjes matrix, then Lemma B.4 gives a bound on the spectral radius of Lω as = 2(1 + α) 1 + α + 2 α 1 1/2 = 1 α Recall that Gelfand s formula states [52]: Given spectral radius ρ(Lω ) := maxi [n] |λi(Lω )|, where λi( ) is the i-th eigenvalue, there exists a sequence {ϵt} 0 such that Lt ω 2 = (ρ(Lω ) + ϵt)t and limt ϵt = 0. The standard SOR method is defined as the following x(t+1) = (diag(Q) ω QL) 1 ω b + (ω QU (ω 1)diag(Q)) x(t) = Lω x(t) + ω (diag(Q) ω QL) 1b, Note r(t) = Qe(t) = QLt ω Q 1Qe(0) = QLt ω Q 1r(0). We have r(t) 2 = QLt ω Q 1r(0) 2 Q 2 Lt ω 2 Q 1 2 r(0) 2 2 1 + α Lt ω 2 1 + α 2α 2α (1 + α) ds = 2 Lt ω 2 (1 + α) ds . To meet the stop condition, we require |r(t) u | 2αϵ du 1+α . It is enough to make sure r(t) 2 2αϵ (1+α) ds . This leads to find t such that r(t) 2 2 Lt ω 2 (1 + α) ds 2αϵ (1 + α) ds 1 α When ϵt = o( α), then the runtime of global LOCSOR is O(m/ α) where O hides log 1 B.4 LOCGD and Proof of Theorem 3.5 The local gradient descent, namely LOCGD is to use x(t+1) = x(t) +r(t) St and r(t+1) = r(t) Qr(t) St , where St = {ui : |r(t+ i) ui | 2αϵ du/(1 + α)} where i = 0. Algo. 
4 presents our actual implementation of LOCGD via FIFO Queue. Algo. 4 LOCGD(α, ϵ, s, G) via FIFO Queue 1: Initialize: r ces, x 0, c = 2α 1+α 2: Q {s} // Assume ϵ 1 ds 3: t = 0 4: while Q = do 5: St [] 6: while Q = do 7: u Q.dequeue() 8: St.append((u, ru)) 9: xu xu + ru 10: ru 0 11: for (u, r) St do 12: for v N(u) do 13: rv rv + (1 α) r (1+α)du 14: if |rv| c ϵdv and v / Q then 15: Q.enqueue(v) 16: t t + 1 17: return x, r Algo. 4 presents LOCGD similar to the real queue-based implementation of LOCSOR. It has monotonic and nonnegative properties during the updates of r 0 and x 0. Again, the operations of Q.enqueue(u), Q.dequeue(), and v / Q are all in O(1). During the updates, one should note that r presents D1/2r(t) while x is D1/2x(t). All shifted conditions are the same as of LOCSOR. The key advantage of LOCGD is that it is highly parallelizable, while LOCSOR is truly an online update, so it is hard to parallelize. Lemma B.5 (Iterations of LOCGD). With the initial x(0) = 0, r(0) = b, S0 = supp(r(0)), denote r(t) = D1/2r(t). LOCGD defined in (9) has the following properties: 1) x(t) 0, r(t) 0 and r(t) r(t+1) 1; 2) The residual and estimation error satisfies r(t+1) 1 = 1 2αγt r(t) 1, γt = r(t+ i) ui r(t) 1 , where i = 0. Proof. We first show x(t) 0, r(t) 0 are all nonnegative vectors during the updates when b 0. This can be seen from the induction. At the initial stage, x(0) 0 and r(0) = b 0. Now assume that for any t 0, x(t) 0 and r(t) 0. Then x(t+1) = x(t) + r(t) Sk 0, and r(t+1) = r(t) 1+αD 1/2AD 1/2r(t) St 0. Therefore, x(t) 0 and r(t) 0 for all t. Note r(t+1) = r(t) 1+αAD 1 r(t) St and since AD 1 r(t) St 1 = r(t) St 1, we will have 1 2α 1 + α r(t) St 1 r(t) 1 r(t) 1, where γt := r(t) St 1/ r(t) 1. Then, we can bound the total residual as the following theorem. Theorem 3.5 (Runtime bound of LOCGD). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCGD defined in (9) for solving Equ. (4). There exists a real implementation of (9) such that the runtime TLOCGD is bounded by TLOCGD 1 + α αϵ, vol(ST ) where C = (1 + α)/((1 α)|IT |), IT = supp(r(T )). Furthermore, vol(ST )/γT 1/ϵ and the estimate ˆπ := D1/2x(T ) satisfies D 1(ˆπ π) ϵ. Proof. We first show bound 1/(αϵ). We first rearrange r(t+1) = r(t) Qr(t) St into D1/2r(t+1) + D1/2r(t) St = D1/2r(t) + 1 α 1 + αAD 1D1/2r(t) St . Note r(t) 0 and AD 1D1/2r(t) St 1 = D1/2r(t) St 1. Hence, it leads to D1/2r(t) St 1 = 1 + α D1/2r(t) 1 D1/2r(t+1) 1 . At each local iterative t, by the active node condition 2αϵ du/(1 + α) r(t) u , we have ϵ vol(St) = X (1 + α) dur(t) u 2α = 1 + α D1/2r(t) St Then the total run time of LOCGD presented in Algo. 4 is t=0 vol(St) 1 2 D1/2r(0) 1 D1/2r(T ) 1 1 + α Therefore, the total run time is at most TLOCGD := PT 1 t=0 vol(St) 1+α 2αϵ . For estimating the bounds of T, by the Weierstrass product inequality [26] and Lemma B.5, we use the similar argument made in Lemma B.3 and continue to have 2αγT ln r(0) 1 Note that each nonzero r(T ) u has at least part of the magnitude from the push operation of an active node, say vu at time t < T. This means each nonzero of D1/2r(T ) satisfies r(T ) u (1 α) r(t ) vu (1 + α)dvu (1 α) 2αϵdvu/(1 + α) (1 + α)dvu = 2α(1 α)ϵ (1 + α)2 , for u IT . Hence, we have r(T ) 1 2α(1 α)ϵ|IT | (1+α)2 and T is further bounded as 2αγT ln r(0) 1 2α(1 α)ϵ|IT | (1+α)2 := (1 + α) ϵ , where CT = (1 + α) (1 α)|IT |. 
The lower bound of 1/ϵ, i.e., vol(St)/γT 1/ϵ, directly follows a similar strategy of previous proof by noticing that D1/2r(t) 1 is monotonically decreasing. Remark B.6. One may consider designing local methods based on Jacobi and Richardson s iterations. Indeed, these two methods have the same updates as standard GD. Recall the standard GD method to solve (4) is x(t+1) = x(t) + r(t), r(t+1) = r(t) Qr(t). The Richardson s iteration is x(t+1) = (I ωW )x(t) + ωb, i.e., x(t+1) = x(t) ω(W x(t) b). The optimal ω = 2/(λmin + λmax) where λmin = 2α/(1 + α) and λmax 2/(1 + α). Hence one can choose ω = 1 ω [19]. It leads to x(t+1) = x(t) + r(t). One can get the same result for the Jacobi method. C Local Chebyshev Method - LOCCH C.1 Nonhomogeneous of Second-order Difference Equation We begin by providing the solutions of the second-order nonhomogeneous equation as the following Lemma C.1 (Stevi c [47]). The solution of the second-order nonhomogeneous difference equation xt+1 + pxt + qxt 1 = ft, t = 1, 2, . . . (21) is characterized by the following two cases ˆλt 1 ˆλ2x0 x1 Pt k=1 fk ˆλk 1 + ˆλt 2 x1 ˆλ1x0 + Pt k=1 fk ˆλk 2 2)t x0 Pt k=1 kfk ( p/2)k+1 + t( p 2 x0 + Pt k=1 fk ( p/2)k p2 = 4q , where ˆλ1, ˆλ2 are two roots of λ2 + pλ + q = 0, and the summation follows convention P0 k=1 = 0. Based on the above lemma, we have the following corollary Corollary C.2 (Second-order nonhomogeneous equation). Given |a| 1, the second-order nonhomogeneous equation xt+1 2axt + xt 1 = ft has the following solution x0 + t(x1 x0) + Pt k=1(t k)fk if a = 1 sin(θt)x1 sin(θ(t 1))x0 Pt k=1 sin(θ(t k))fk sin θ if |a| < 1 where θ = arccos(a) ( 1)t(x0 t(x0 + x1)) + ( 1)t Pt k=1( 1) k 1(t k)fk if a = 1. Proof. The first and last two cases can be directly followed from Lemma C.1. Since ˆλ1 and ˆλ2 are the two complex roots of λ2 2aλ + 1 = 0, we write ˆλ1 = reiθ = r(cos θ + i sin θ) and ˆλ2 = re iθ = r(cos θ i sin θ). It indicates λ2 2aλ + 1 = λ reiθ λ re iθ = λ2 r(eiθ + e iθ)λ + r2 = 0. Since 1 > a2, ˆλ1 = a i 1 a2, and Re(ˆλ1)2 + Im(ˆλ1)2 = a2 + (1 a2) = 1, then r = 1. Then θ = arccos(Re(ˆλ1)) = arccos(a), and sin(θ) = Im(ˆλ1) = 1 a2. Finally, ˆλt 1 = eitθ = cos(tθ) + i sin(tθ), and ˆλ1 = cos θ + i sin θ, ˆλ2 = cos θ i sin θ ˆλ1ˆλ2 = eitθe itθ = 1 ˆλ2 ˆλ1 = 2i sin θ ˆλt 1 = cos(θt) + i sin(θt), ˆλt 2 = cos(θt) i sin(θt) ˆλt 2 ˆλt 1 = e itθ eitθ = 2i sin(θt). ˆλt 1 1 ˆλt 1 2 = 2i sin(θ(t 1)) ˆλ1ˆλ2=1 = ˆλt 1ˆλ2 ˆλt 2ˆλ1 Based on these, we have xt = 1 ˆλ2 ˆλ1 = ˆλt 1(ˆλ2x0 x1) + ˆλt 2(x1 ˆλ1x0) ˆλ2 ˆλ1 + 1 ˆλ2 ˆλ1 fk ˆλk 1 + ˆλt 2 = (2i sin(θ(t 1))x0 2i sin(θt)x1) 2i sin θ + 1 2i sin θ fk ˆλk 1 + ˆλt 2 = sin(θt)x1 sin(θ(t 1))x0 sin θ + 1 2i sin θ ˆλt k 2 ˆλt k 1 fk = sin(θt)x1 sin(θ(t 1))x0 sin θ + 1 2i sin θ k=1 2i sin(θ(t k))fk = sin(θt)x1 sin(θ(t 1))x0 sin θ + Pt k=1 sin(θ(t k))fk Lemma C.3. For |λi| 1, i = 1, 2, . . . , n, equations y(t+1) i 2λiy(t) i + y(t 1) i = 0 have the following solutions y(0) i + t(y(1) i y(0) i ) if λi = 1 sin(θit)y(1) i sin(θi(t 1))y(0) i sin(θi) if |λi| < 1 where θi = arccos(λi) ( 1)t(zi,0 t(y(0) i + y(1) i )) if λi = 1. Furthermore, when y(1) i = λiy(0) i for i = 1, 2, . . . , n, then solutions can be simplified as y(0) i if λi = 1 y(0) i cos(θit) if |λi| < 1 where θi = arccos(λi) zi,0( 1)t if λi = 1. Proof. The first part is a consequence of Corollary C.2 by letting ft = 0. To see the second identity of this lemma, note that (λi sin(θit) sin(θi(t 1))) sin θi = cos(θit). 
Indeed, by expanding sin(θi(t 1)), we have sin(θi(t 1)) = sin(θit θi) = sin(θit) cos( θi) + cos(θit) sin( θi) = sin(θit) cos(θi) cos(θit) sin(θi) = λi sin(θit) cos(θit) sin(θi) Hence, when y(1) i = λiy(0) i , we have sin(θit)y(1) i sin(θi(t 1))y(0) i sin(θi) = (λi sin(θit) sin(θi(t 1))) y(0) i sin θi = cos(θit)y(0) i . Lemma C.4. Given t 1, |λi| 1 , the n second-order difference equations y(t+1) i 2λiy(t) i + y(t 1) i = hi,t, i = 1, 2, . . . , n. have the following solutions y(0) i + t(y(1) i y(0) i ) + Pt 1 k=1(t k)hi,k if λi = 1 sin(θit)y(1) i sin(θi(t 1))y(0) i sin(θi) + Pt 1 k=1 sin(θi(t k)) sin θi hi,k if |λi| < 1 where θi = arccos(λi) ( 1)t(zi,0 t(y(0) i + y(1) i )) + Pt 1 k=1( 1)t k 1(t k)hi,k if λi = 1. (25) Furthermore, with initial conditions y(1) i = λiy(0) i , y(t) i can be simplified as y(0) i + Pt 1 k=1(t k)hi,k if λi = 1 cos(θit)y(0) i + Pt 1 k=1 sin(θi(t k)) sin θi hi,k if |λi| < 1 where θi = arccos(λi) ( 1)ty(0) i + Pt 1 k=1( 1)t k 1(t k)hi,k if λi = 1. Proof. The first part is a consequence of Corollary C.2 by letting ft = hi,t. We prove the second part by considering three cases of λi as Case 1. When λi = 1, we have y(t+1) i 2y(t) i + y(t 1) i = hi,t. For t 1, the solution the above is y(t) i = y(0) i + t(y(1) i y(0) i ) + k=1 (t k)hi,k = y(0) i + k=1 (t k)hi,k, where the second equation is due to y(1) i = λ1y(0) i = y(0) i . Case 2. When λn = 1 (G is a bipartite graph), we have y(t+1) i + 2y(t) i + y(t 1) i = hi,t. For t 1, the solution is y(t) i = ( 1)ty(0) i + ( 1)t t 1 X ( 1)k+1 = ( 1)ty(0) i + k=1 ( 1)t k 1(t k)hi,k. Case 3. When |λi| < 1, and define θi = arccos(λi). We use a similar argument in Lemma C.3 and continue to have y(t) i = sin(θit)y(1) i sin(θi(t 1))y(0) i sin θi + Pt 1 k=1 sin(θi(t k))hi,k = (λi sin(θit) sin(θi(t 1))) y(0) i sin θi + Pt 1 k=1 sin(θi(t k))hi,k = cos(θit)y(0) i + sin(θi(t k)) sin θi hi,k. C.2 Properties on Ratio of Chebyshev Polynomials The next lemma presents the properties of Chebyshev polynomials. Lemma C.5 (Chebyshev polynomial bound). For t 1, the Chebyshev polynomial of the first kind is defined recursively as Tt+1(x) = 2x Tt(x) Tt 1(x) with T0(x) = 1, T1(x) = x. For t 1, define δt = Tt 1( 1+α 1 α)/Tt( 1+α 1. Tt(x = 1+α 1 α) and δt defines the following sequence δt+1 = 21 + α 1 , where δ1 = 1 α 2. The closed-form δ1:t can be upper bounded as δ1:t = 1 Tt( 1+α 1 α) = 2 αt + α t 2 1 α 3. Note δ1 = T0/T1 = 1/x = 1 α 1+α, the sequence {δt} satisfies δt < 1, t 1 and 1 = 2δt+1x δtδt+1, t = 1, 2, . . . . 4. Denote δj:t = Qt i=j δi for t j 0 and set the default value δj:j 1 = 1 for j 0, then δ1:t/(δ1:k) = δk+1:t 2 αt k, for t k 0. Proof. For the first item, let x = 1+α 1 α, use the Chebyshev equation, we have 1 = 2 1 + α Tt+1 = 2 1 + α δt+1 δtδt+1 δ 1 t+1 = 2 1 + α For the second item, for all t 0, if ξ = x+x 1 2 = 0, it is well known that the Tt can be rewritten as ξ = x + x 1 For our problem, recall we defined α = 1 α 1+ α and using x = 1+α 1 α = 0, one can verify that 1 α = α + α 1 j=1 δj = T0 Tt = 2 αt + α t = 2 αt α2t + 1 2 αt = 2 1 α For the third item, it is sufficient to show that δt 1 x for all t 1, for x = 1+α 1 α. This can be done recursively, since δ1 = x and δt+1 = 1 2x δt 1 2x 1 For the last item, note when t 1 and k 1, we have the following inequalities j=1 δ 1 j = 2 αt + α t αk + α k 2 = αk + α k αt + α t = αt k α2k + 1 α2t + 1 2 αt k, where note α2k+1 α2t+1 [1, 2] for t k. C.3 Standard Chebyshev (CH) Method and Proof of Theorem C.6 This subsection introduces the standard Chebyshev algorithm. 
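Before stating the formal result, the following minimal NumPy sketch illustrates the (global) Chebyshev iteration analyzed in Theorem C.6 below. The dense-matrix setup, the function name, and the stopping rule based on the inactive-node threshold are illustrative assumptions rather than the paper's released implementation.

```python
import numpy as np

def chebyshev_ppr(A, alpha, s, eps, max_iter=1000):
    """Global Chebyshev iteration for Q x = b with
    Q = I - ((1-alpha)/(1+alpha)) * D^{-1/2} A D^{-1/2}
    and b = 2*alpha/(1+alpha) * D^{-1/2} e_s (Equ. (3)).
    A is a dense symmetric adjacency matrix with no isolated nodes."""
    n = A.shape[0]
    d = A.sum(axis=1)                       # node degrees
    Dm12 = 1.0 / np.sqrt(d)
    W = Dm12[:, None] * A * Dm12[None, :]   # D^{-1/2} A D^{-1/2}
    Q = np.eye(n) - (1 - alpha) / (1 + alpha) * W
    b = np.zeros(n)
    b[s] = 2 * alpha / (1 + alpha) * Dm12[s]

    x_prev, x = np.zeros(n), b.copy()       # x(0) = 0, x(1) = x(0) + r(0)
    r_prev, r = b.copy(), b - Q @ x         # r(0) = b, r(1) = b - Q x(1)
    delta = (1 - alpha) / (1 + alpha)       # delta_1
    thresh = 2 * alpha * eps * np.sqrt(d) / (1 + alpha)
    for _ in range(max_iter):
        if np.all(np.abs(r) < thresh):      # every node inactive -> stop
            break
        delta_next = 1.0 / (2 * (1 + alpha) / (1 - alpha) - delta)
        dd = delta * delta_next             # delta_{t:t+1}
        x_new = x + (1 + dd) * r + dd * (x - x_prev)
        r_new = 2 * delta_next * (W @ r) - dd * r_prev  # equals b - Q x_new
        x_prev, x, r_prev, r = x, x_new, r, r_new
        delta = delta_next
    return x, r                             # PPR estimate: pi_hat = D^{1/2} x
```

On a small test graph one can verify the three-term recursion by asserting np.allclose(r, b - Q @ x) inside the loop.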
Our following theorem is to prove the runtime complexity of the Chebyshev polynomial iteration for solving Equ. (3). Theorem C.6 (Standard CH). For t 1, consider the Chebyshev polynomials to solve Equ. (3) as x(t+1) = x(t) + (1 + δt:t+1)r(t) + δt:t+1 x(t) x(t 1) , r(t+1) = 2δt+1W r(t) δt:t+1r(t 1), where x(0) = 0, x(1) = x(0) + r(0) and δt+1 = 2 1+α 1 α δt 1 with δ1 = 1 α 1+α. Assume ϵ < 1/ds, then the residual has the following convergence bound r(t) 2 2 1 α Let the estimate be ˆπ = D1/2x(t), the the runtime of CH for reaching D 1/2r(t) 2αϵ 1+α with D 1(ˆπ π) ϵ guarantee is at most TCH Θ m 1 + α Proof. Recall eigendecomposition of W = V ΛV where V = [v1, v2, . . . , vn] and each vi is the eigenvector. For t 1, the residual r(t) can be written as n second-order difference equations as V r(t+1) 2δt+1ΛV r(t) + δt:t+1V r(t 1) = 0, where each i-th element-wise equation of the above can be written as the following v i r(t+1) 2δt+1λiv i r(t) + δt+1δtv i r(t 1) = 0, i = 1, 2, . . . , n. Define V r(t) = δ1:ty(t). Each component v i r(t) is v i r(t) = δ1:ty(t) i where v i r(0) := y(0) i by default. The above can be rewritten δ1:t+1y(t+1) i 2δt+1δ1:tλiy(t) i + δt+1δtδ1:t 1y(t 1) i = 0 y(t+1) i 2λiy(t) i + y(t 1) i = 0, (27) where follows from δ1:t+1 = 0. Note V r(1) = 1 α 1+αV V ΛV r(0) = δ1ΛV r(0), we have V r(0) = y(0) V r(1) = δ1y(1) = δ1ΛV r(0) = δ1Λy(0), where it follows from δ1 = 0. As y(1) = Λy(0), follow Equ. (24) of Lemma C.3, y(t) i has the solution ( y(0) i cos(θit) |λi| < 1 y(0) i λt i |λi| = 1 ( |y(0) i | |λi| < 1 |y(0) i | |λi| = 1 , where θi = arccos(λi). We can write down r(t) in terms of y(t) V r(t) = δ1:ty(t) = δ1:t Zty(0), where Zt has two possible forms (diag (1, . . . , cos(θit), . . . , ( 1)t) for bipartite graphs diag (1, . . . , cos(θit), . . . , cos(θnt)) for non-bipartite graphs. Hence, Zt 2 1 and Zty(0) 2 y(0) 2. We have r(t) 2 = V r(t) 2 δ1:t y(0) 2 2 1 α where the last inequality follows Lemma C.5 and note z(0) = V r(0). To meet the stop condition of |r(t) u | < 2αϵ du/(1 + α), u V = , it is sufficient to choose a minimal integer t such that t b 2 < 2αϵ (1 + α) ds . To see this, if the above inequality is satisfied, then note for any node u, we have |r(t) u | r(t) r(t) 2 2 1 α t b 2 < 2αϵ (1 + α) ds 2αϵ du where all nodes are inactive. So, it gives t ln 1 α 1+ α ln ϵ 2 by noticing b 2 = 2α/((1 + α) ds). It indicates t ln 2 ln 1+ α 1 α . As 1 ln(1+x) 1+x x for x > 0, it is sufficient to choose t C.4 Residual Updates of LOCCH and Proof of Lemma 4.1 We propose the following local Chebyshev iteration procedure x(t+1) = x(t) + (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) Our next Lemma is to expanding x(t) x(t 1) Lemma C.7. For t 1 with initials x(0) = 0 and x(1) = x(0) + r(0) S0 , the local Chebyshev iterative is the following x(t+1) = x(t) + (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) Denote (t) = x(t) x(t 1), then (t) = Pt 1 j=0 (1 + δj:j+1) Qt 1 r=j+1 δr:r+1r(j) Sj:t 1 , where δ0:1 = 0, S0:t = S0 S1 St and δj:j+1 = δjδj+1. We have the following (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj,t where Sj,t Sj:t 1 St. Proof. Recall we defined δ0 = 0 so that δ0:1 = 0. We prove this lemma by induction. For t = 1, note δ0:1 = 0 and the support of r(0) is S0 = supp(r(0)), then x(1) x(0) = (1 + δ0:1)r(0). For t = 2, note S1:1 = S1 and S0:1 = S0 S1 by our notation, then x(2) x(1) = (1 + δ1:2)r(1) S1:1 + δ1:2(x(1) x(0))S1 = (1 + δ1:2)r(1) S1:1 + (1 + δ0:1)δ1:2r(0) S0:1. 
For t = 3, one can build x(3) x(2) based on x(2) x(1) and recall S2:2 = S2, S1:2 = S1 S2 x(3) x(2) = (1 + δ2:3)r(2) S2:2 + δ2:3(x(2) x(1))S2 = (1 + δ2:3)r(2) S2:2 + δ2:3((1 + δ1:2)r(1) S1:1 + (1 + δ0:1)δ1:2r(0) S0:1)S2 = (1 + δ2:3)r(2) S2:2 + (1 + δ1:2)δ2:3r(1) S1:2 + (1 + δ0:1)δ1:2δ2:3r(0) S0:2. For t = 4, we continue to have x(4) x(3) = (1 + δ3:4)r(3) S3:3 + δ3:4(x(3) x(2))S3:3 = (1 + δ3:4)r(3) S3:3 + δ3:4((1 + δ2:3)r(2) S2:2 + (1 + δ1:2)δ2:3r(1) S1:2 + (1 + δ0:1)δ1:2δ2:3r(0) S0:2)S3:3 = (1 + δ3:4)r(3) S3:3 + (1 + δ2:3)δ3:4r(2) S2:3 + (1 + δ1:2)δ3:4δ2:3r(1) S1:3 + (1 + δ0:1)δ1:2δ2:3δ3:4r(0) S0:3 (x(4) x(3))S4 = (1 + δ3:4)r(3) S3:3 + δ3:4(x(3) x(2))S3:3 By induction t 1, x(t) x(t 1) = (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj:t 1 where the convention notation P0 i=1 = 0 and Qi j=i+1 = 1. To verify the inductive step, consider for t + 1, we have x(t+1) x(t) = (1 + δt:t+1)r(t) St + δt:t+1(x(t) x(t 1))St = (1 + δt:t+1)r(t) St + δt:t+1(x(t) x(t 1))St = (1 + δt:t+1)r(t) St + δt:t+1 (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj:t 1 (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj:t To see the second equation, note (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) = (1 + δt:t+1)r(t) (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj:t 1 St (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj,t where recall we denote Sj,t Sj:t 1 St. Lemma C.8 (Local Chebyshev updates). Given the updates of x(t+1) as defined by LOCCH in (10), we have the following local updates x(t+1) = x(t) + (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) r(t+1) 2δt+1W r(t) + δt:t+1r(t 1) = Pt j=0 (1 + δj:j+1) Qt r=j+1 δr:r+1Qr(j) Sj,t Proof. We only need to show the second equation of (28). The residual of LOCCH updates is r(t+1) = b Qx(t+1) r(t+1) = b Q x(t) + (1 + δt:t+1)r(t) St + δt:t+1(x(t) x(t 1))St = r(t) (1 + δt:t+1)Qr(t) St δt:t+1Q(x(t) x(t 1))St r(t+1) + (1 + δt:t+1)Qr(t) + δt:t+1Q(x(t) x(t 1)) r(t) | {z } u = (1 + δt:t+1)Qr(t) St + δt:t+1Q x(t) x(t 1) St | {z } small noisy part Note Q(x(t) x(t 1)) = b Qx(t 1) (b Q(x(t)) = r(t 1) r(t) and then u becomes u = r(t+1) + (1 + δt:t+1)Qr(t) + δt:t+1(r(t 1) r(t)) r(t) = r(t+1) 2δt+1W r(t) + δt:t+1r(t 1), where the last equality is due to (1 + δt:t+1)Qr(t) = (1 + δt:t+1)r(t) 2δt:t+1W r(t) by noticing (1 + δtδt+1) 1 α 1+α = 2δt+1 in Lemma C.5. Hence, we have the second equation. To see the noisy part, note by Lemma C.7 (1 + δt:t+1)r(t) St + δt:t+1 x(t) x(t 1) St = (1 + δt:t+1)r(t) (1 + δj:j+1) r=j+1 δr:r+1r(j) Sj,t (1 + δt:t+1)Qr(t) St + δt:t+1Q x(t) x(t 1) (1 + δj:j+1) r=j+1 δr:r+1Qr(j) Sj,t Lemma 4.1 (Residual updates of LOCCH). Given t 1, x(0) = 0, x(1) = x0 + r(0) S0 . The residual r(t) of LOCCH defined in Equ. (28) satisfies V r(t) = δ1:t Zt V r(0) + δ1:ttu0,t + 2 k=1 δk+1:t(t k)uk,t, (29) Pt 1 j=1 δ2:j Hj,t I 1 α 1+αΛ V r(0) S0,j/t if k = 0 Pt 1 j=k δk+1:j Hj,t 1+α 1 αI Λ V r(k) Sk,j /(t k) if k 1, (diag (1, . . . , cos(θit), . . . , ( 1)t) for bipartite graphs diag (1, . . . , cos(θit), . . . , cos(θnt)) for non-bipartite graphs, diag t k, . . . , sin(θi(t k)) sin θi , . . . , ( 1)t k 1(t k) for bipartite graphs diag t k, . . . , sin(θi(t k)) sin θi , . . . , sin(θn(t k)) for non-bipartite graphs. Proof. We first decompose the residual equation in (28) as V r(t+1) 2δt+1ΛV r(t) + δt:t+1V r(t 1) (1 + δj:j+1) r=j+1 (δr:r+1) I 1 α 1 + αΛ V r(j) Sj,t | {z } f (t) Define V r(t) = δ1:ty(t) and V r(0) = δ1:0y(0) = y(0) by default. Then we have δ1:t+1y(t+1) 2δ1:t+1Λy(t) + δ1:t+1y(t 1) = f (t) y(t+1) 2Λy(t) + y(t 1) = f (t) Note V r(1) = δ1y(1) = V δ1W r(0) = δ1Λy(0), which indicates y(1) = Λy(0). 
Then, by the Lemma C.4, each y(t) i has the solution y(0) i + Pt 1 k=1(t k)f (k) i /(δ1:k+1) if λi = 1 cos(θit)y(0) i + Pt 1 k=1 sin(θi(t k)) sin θi f (k) i /(δ1:k+1) if |λi| < 1 ( 1)ty(0) i + Pt 1 k=1( 1)t k 1(t k)f (k) i /(δ1:k+1) if λi = 1. Use Zt and Hk,t, we write the solution of the second-order difference equation as y(t) = Zty(0) + k=1 Hk,tf (k)/(δ1:k+1) V r(t) = δ1:ty(t) = δ1:t Zt V r(0) + δ1:t k=1 Hk,tf (k)/(δ1:k+1) V r(t) = δ1:t Zt V r(0) + k=1 δk+2:t Hk,t (1 + δj:j+1) r=j+1 (δr:r+1) I 1 α 1 + αΛ V r(j) Sj,k Note 1 + δj:j+1 = 2δj+1 1+α 1 α, j 1, then (1 + δj:j+1) Qk r=j+1 (δr:r+1) = 2 1+α 1 αδj+1:kδj+1:k+1. Then, we have V r(t) = δ1:t Zt V r(0) + k=1 δk+2:t Hk,t 1 + αΛ V r(0) S0,k k=1 δk+2:t Hk,t δj+1:kδj+1:k+1 1 αI Λ V r(j) Sj,k = δ1:t Zt V r(0) + δ1:t k=1 δ2:k Hk,t 1 + αΛ V r(0) S0,k k=1 δk+2:t Hk,t δj+1:kδj+1:k+1 1 αI Λ V r(j) Sj,k where the last term can be expanded as k=1 δk+2:t Hk,t δj+1:kδj+1:k+1 1 αI Λ V r(j) Sj,k | {z } δ3:t H1,tδ2:1δ2:2w(1) S1,1+ δ4:t H2,tδ2:2δ2:3w(1) S1,2 + δ4:t H2,tδ3:2δ3:3u(2) S2,2+ δ5:t H3,tδ2:3δ2:4w(1) S1,3 + δ5:t H3,tδ3:3δ3:4w(2) S2,3 + δ5:t H3,tδ4:3δ4:4w(3) S3,3+ δ6:t H4,tδ2:4δ2:5w(1) S1,4 + δ6:t H4,tδ3:4δ3:5w(2) S2,4 + δ6:t H4,tδ4:4δ4:5w(3) S3,4 + δ6:t H4,tδ5:4δ5:5w(4) S4,4+ δ7:t H5,tδ2:5δ2:6w(1) S1,5 + δ7:t H5,tδ3:5δ3:6w(2) S2,5 + δ7:t H5,tδ4:5δ4:6w(3) S3,5 + δ7:t H5,tδ5:5δ5:6w(4) S4,5 + δ7:t H5,tδ6:5δ6:6w(5) S5,5+ δk+1:j Hj,t 1 αI Λ V r(k) Sk,j Here, we denote δk+1:k = 1. The final iterative update is V r(t) = δ1:t Zt V r(0) + δ1:tt j=1 δ2:j Hj,t 1 + αΛ V r(0) S0,j | {z } u0,t k=1 δk+1:t(t k) δk+1:j Hj,t 1 αI Λ V r(k) Sk,j | {z } uk,t C.5 Convergence of LOCCH and Proof of Theorem 4.2 Corollary C.9. Let βk be lower bound of residual reduction satisfies uk,t 2 βk r(k) 2, then the upper bound of r(t) 2 can be characterized as r(t) 2 δ1:t j=0 (1 + βj)yt, where yt+1 2yt + yt 1 (1 + βt 1)(1 + βt) = 0, (30) where y0 = y1 = r0 2. Proof. Since uk,t 2 βk r(k) 2, the final iterative updates (29) can be bounded as V r(t) = δ1:t Zt V r(0) + δ1:ttu0,t + 2 k=1 δk+1:t(t k)uk,t r(t) 2 δ1:t r(0) 2 + δ1:ttβ0 r(0) 2 + 2 k=1 δk+1:t(t k)βk r(k) 2 k=1 δk+1:t(t k)βk r(k) 2 δ1:t(1 + tβ0) r(0) 2, (31) where t = 0, 1, . . . , T. These T + 1 (including a trivial one where r(0) 2 r(0) 2) inequalities shown in Equ. (31) form a system of linear inequality matrix as z31 z32 1 0 ... ... ... ... ... z T 1 z T 2 z T 3 1 | {z } I ZL δ1:1(1 + 1β0) r(0) 2 δ1:2(1 + 2β0) r(0) 2 δ1:3(1 + 3β0) r(0) 2 ... δ1:T (1 + Tβ0t) r(0) 2 where (ZL)tk = 2δk+1:t(t k)βk for t = 2, 3, . . . , T and k = 1, 2, . . . , t 1. Assume that N RT T is a strictly lower triangular matrix, then we know the established formula (I +N) 1 = I + PT 1 k=1 ( 1)k N k. Hence, we have the following (I ZL) 1 = I + Given that (I ZL) 1 0, then we obtain an upper-bound of z (I ZL) 1 c. It leads to the following new second-order difference equation k=1 δk+1:t(t k)βkzk = δ1:t(1 + tβ0)z0, for t = 1, 2, . . . , T, where the initial value of z0 = r(0) 2. Following the argument in Theorem 1 of Golub & Overton [18], we construct a second-order homogeneous equation for zt as zt = δ1:t(1 + tβ0)z0 + 2 k=1 δk+1:t(t k)βkzk zt+1 = δ1:t+1(1 + (t + 1)β0)z0 + 2 k=1 δk+1:t+1(t + 1 k)βkzk δt+1zt = δ1:t+1(1 + tβ0)z0 + 2 k=1 δk+1:t+1(t k)βkzk zt+1 δt+1zt = δ1:t+1β0z0 + 2 k=1 δk+1:t+1βkzk, (32) where Equ. (32) is obtained by the difference between the second equation and the third equation. 
Similarly, zt 1 = δ1:t 1(1 + (t 1)β0)z0 + 2 k=1 δk+1:t 1(t k 1)βkzk δt:t+1zt 1 = δ1:t+1(1 + (t 1)β0)z0 + 2 k=1 δk+1:t+1(t k 1)βkzk δt+1zt δt:t+1zt 1 = δ1:t+1β0z0 + 2 k=1 δk+1:t+1βkzk, (33) where Equ. (33) is obtained by the difference of the first two. Hence, Equ. (32) (33) gives us zt+1 2(1 + βt)δt+1zt + δt:t+1zt 1 = 0, for t = 1, 2, . . . , T, where two initials are z0 = r(0) 0 and z1 = δ1(1 + β0) r(0) 2. Let zt = δ1:tˆzt, then ˆzt+1 2(1 + βt)ˆzt + ˆzt 1 = 0, where two initials are ˆz0 = z0 = r(0) 0 and ˆz1 = (1 + β0) r(0) 2. We finish the proof by setting Qt 1 j=0(1 + βj)yt = ˆzt. Remark C.10. Key points of the above proof strategy largely follow from Golub & Overton [18]. However, different from the original technique, we generalize the strategy to a parameterized version. Lemma C.11. Given βj 0, the following second-order difference equation xt+1 2(1 + βt)xt + xt 1 = 0. has the following solution j=0 (1 + βj)yt, where yt+1 2yt + yt 1 (1+βt 1)(1+βt) = 0 with y0 = x0 and y1 = x1/(1 + β0). Proof. Assume xt = 1 2 t Qt 1 j=0 ( 2(1 + βj)) yt. Then, following the equation, we have j=0 ( 2(1 + βj)) yt+1 2(1 + βt) 1 j=0 ( 2(1 + βj)) yt + 1 j=0 ( 2(1 + βj)) yt 1 = 0. Since βj 0, the term Qt j=0 ( 2(1 + βj)) = 0, we divide it on both sides to have t+1 yt+1 + 1 t 1 1 4(1 + βt 1)(1 + βt)yt 1 = 0. Hence, it is simplified into yt+1 2yt + yt 1 (1+βt 1)(1+βt) = 0. To make a simplification on xt, we prove the lemma. Theorem 4.2 (Runtime bound of LOCCH). Given the configuration θ = (α, ϵ, s, G) with α (0, 1) and ϵ 1/ds and let r(T ) and x(T ) be returned by LOCCH defined in (10) for solving Equ. (3). For t 1, the residual magnitude r(t) 2 has the following convergence bound r(t) 2 δ1:t j=0 (1 + βj)yt, where yt is a sequence of positive numbers solving yt+1 2yt + yt 1/((1 + βt 1)(1 + βt)) = 0 with y0 = y1 = r(0) 2. Suppose the geometric mean βt (Qt 1 j=0(1 + βj))1/t of βt be such that βt = 1 + c α 1 α where c [0, 2). There exists a real implementation of (9) such that the runtime TLOCCH is bounded by TLOCCH Θ (1 + α)vol(ST ) α(2 c) ln 2y T Proof. The convergence bound of r(t) directly follows from Corollary C.9. Since we assume that there exists c [0, 2) such that Qt 1 j=0(1 + βj) 1 + c α 1 α t . Then multiplying both sides by αt, we have j=0 (1 + βj) 1 (2 c) α Then we have r(t) 2 δ1:t j=0 (1 + βj)yt δ1:t 2 αt 2 αt βt tyt ϵ j=0 (1 + βj) 1/t Since βt = Qt 1 j=0(1 + βj) 1/t, and by using 1+x x 1 ln(1+x) and letting x = 1+ α 1 α then t can be lower bounded further by t 1 + α 1 + α (1 α)βt ln 2yt Since we assumed βt = (1 + c α 1 α), which means 1 βt = (1 + c α 1 α), so βt h 1, 1+ α 1 α i . Then, we find such an upper bound of t so that LOCCH converges. t = 1 + α 1 + α (1 α)βt ln 2yt (2 c) α ln 2yt C.6 Implementation of LOCCH We present the implementation of LOCCH as follows: Recall the sequence δt+1 = 2 1+α δt 1, t = 1, 2, . . . with δ1 = 1 α 1+α. Denote x(t) x(t) x(t 1), (t) := (1 + δt:t+1)r(t) + δt:t+1 x(t), we have x(t+1) = x(t) + (1 + δt:t+1)r(t) St + δt:t+1 x(t) St = x(t) + (t) St r(t+1) = b Q x(t) + (1 + δt:t+1)r(t) St + δt:t+1 x(t) St = r(t) Q (t) St x(t+1) = x(t) + (t) St (t 1) St 1 . When t = 0, we have x(0) = 0, r(0) = b, x(0) = 0, (0) = r(0). When t = 1, we have x(1) = r(0) S0 , r(1) = 1 α 1+αW (0) S0 , x(1) = r(0) S0 , (1) = (1 + δ1:2)r(1) + δ1:2 x(1). When t 1, we can recursively calculate the following vectors x(t+1) = x(t) + (t) St r(t+1) = r(t) (t) St + 1 α 1 + αW (t) St x(t+1) = x(t) + (t) St (t 1) St 1 . 
Therefore, at per-iteration, we only need to save sub-vectors St and St 1 and update x locally. D Local Heavy-Ball Method - LOCHB D.1 Standard HB and Proof Theorem D.2 Lemma D.1 (The standard HB updates). The updates x(t) and r(t) of the HB method for solving Equ. (4) can be written as x(t+1) = x(t) + (1 + α2)r(t) + α2 x(t) x(t 1) r(t+1) = 2 αW r(t) α2r(t 1). The residual updates can be rewritten as a second-order homogeneous equation y(t+1) 2Λy(t) + y(t 1) = 0, t = 1, 2, 3, . . . where y(t) is such that r(t) = αt V y(t), t 0 with y(0) = V r(0) = V b. Proof. We follow the standard Polyak s heavy-ball method [40] as x(t+1) = x(t) ηα f(x(t)) + ηβ(x(t) x(t 1)), where f(x(t)) = Qx(t) b and ηα = 4/( p 2/(1 + α)+ p 2α/(1 + α))2 = 2(1+α)/(1+ α)2 = 1 + α2 and ηβ = ( p 2/(1 + α) p 2α/(1 + α))2/( p 2/(1 + α) + p 2α/(1 + α))2 = α2. Hence, it leads to the following updates x(t+1) = x(t) + (1 + α2)r(t) + α2 x(t) x(t 1) . r(t) = Q(x x(t)) = b I 1 α 1 + αW x(t) = b I 2 α 1 + α2 W x(t) then x(t+1) = 2 αW x(t) α2x(t 1) + (1 + α2)b and since 1 + αW W = W I 1 α 1 + αW = W Q r(t+1) = Q(x(t+1) x ) = 2 αQW x(t) + α2Qx(t 1) + (I (1 + α2)Q)b = 2 αQW x(t) + α2Qx(t 1) + α2 + 2 αW b = 2 αW r(t) α2r(t 1) Using r(t) = αt V y(t), t 0 αt+1V y(t+1) = 2 αW αt V y(t) α2 αt 1V y(t 1) V y(t+1) = 2W V y(t) V y(t 1). As W = V ΛV and V = V 1 is orthogonal matrix, we continue to have V V y(t+1) 2V V ΛV V y(t) + V V y(t 1) = 0 y(t+1) 2Λy(t) + y(t 1) = 0. Theorem D.2 (Convergence analysis of Heavy-Ball (HB)). To solve the minimization problem in Equ. (4), we propose the following standard HB updates as x(t+1) = x(t) + (1 + α2)r(t) + α2 x(t) x(t 1) , r(t+1) = 2 αW r(t) α2r(t 1), where the initial condition is x(0) = 0, r(0) = b, x(1) = x(0) + Γr(0), r(1) = b Qx(1). Then there exists a constant τ such that the total iteration complexity to reach the stop condition {u : |ru| ϵdu, u V} = is 2 α ln Ct r(0) 2 where Ct = 1 if Γ = Q 1(I αW ) (ideal case); Ct = max 1+ α 1 1 λ2 2 , 1 + (1 + α 1)t if Γ = 0 (practical case); and Ct = 2 1 λ2 2 if Γ = (1 α)(1+α) 2 I (G is not bi-partite graph). Proof. Recall W = V ΛV and then V r(t+1) = 2 αΛV r(t) α2V r(t 1). By Lemma D.1, we have y(t+1) 2Λy(t) + y(t 1) = 0, where we obtained n second-order difference equations y(t+1) i 2λiy(t) i + y(t 1) i = 0, i = 1, 2, . . . , n. Follow the Lemma C.3, Equ. (23) has the solution sin(θit)y(1) i sin(θi(t 1))y(0) i sin(θi) |λi| < 1 where θi = arccos(λi) (y(0) i + (y(1) i λiy(0) i )t)λt i |λi| = 1, , (34) where in the case of |λi| < 1. We consider the three cases of Γ Ideal Case: We can eliminate t in (34), when y(1) = Λy(0), we get y(1) i = λiy(0) i , and then y(t) i can be simplified into ( (λi sin(θit) sin(θi(t 1)))y(0) i sin θi |λi| < 1 y(0) i λt i |λi| = 1 = ( y(0) i cos(θit) |λi| < 1 y(0) i λt i |λi| = 1 ( |y(0) i | |λi| < 1 |y(0) i | |λi| = 1 . In this case, Γ needs to be Γ = Q 1(I αW ). Therefore, we have V r(t) 2 = r(t) 2 = αt y(t) 2 αt y(0) 2 = αt V r(0) 2 = αt r(0) 2. Practical Case: Just letting x(1) = x(0) = 0, we have αy(1) = y(0), then α 1 sin(θit) sin(θi(t 1)) sin(θi) y(0) i |λi| < 1 (1 + ( α 1 λi)t)y(0) i λt i |λi| = 1 max ( 1 + α 1 p 1 λ2 2 , 1 + (1 + α 1)t where θi = arccos(λi). 
Non-bipartite graph Case: When the graph is non-bipartite, we can eliminate t, as the following: We choose Γ = τI, we have x(1) = x(0) + Γr(0) = Γr(0), r(1) = b Qx(1) = r(0) τQr(0) V r(1) = (1 τ)V r(0) + (1 α)τ 1 + α ΛV r(0), v i r(1) = (1 τ + (1 α)τ 1 + α λi)v i r(0) We have the following relations v i r(0) = y(0) i v i r(1) = αy(1) i = (1 τ + (1 α)τ 1 + α λi)v i r(0) = (1 τ + (1 α)τλi 1 + α )y(0) i , To make t disappear when λi = 1, we need y(0) i = y(1) i , or 1 τ + (1 α)τ 1 + α = 1 α 1 + α τ = 1 + α α + α. In this case, we have |y(t) i | 2 p 1 λ2 2 |y(0) i | To make sure the algorithm stops when the stop condition is met, it is enough for r(t) 2 = αt y(t) 2 αt Ct y(0) 2 = αt Ct V r(0) 2 = αt Ct r(0) 2 ϵ. This means Ct r(0) 2 αt ϵ, which leads to the following 2 α ln Ct r(0) 2 Remark D.3. The constant that appears in the bound involves the second largest eigenvalue λ2 of AD 1. It is deeply related to the mixing time of random walk [8] where the second largest eigenvalue determines the mixing time of the walk. A smaller absolute value of the second largest eigenvalue indicates that a random walk on the graph will mix (i.e., approach its steady-state distribution) more quickly. Our proof is partially inspired by d Aspremont et al. [11] where we directly bound r(t) instead of providing bound for e(t). D.2 Residual Updates of LOCHB and Proof of Theorem D.6 Lemma D.4. Let the local heavy-ball method be defined as x(t+1) = x(t) + (t) St , r(t+1) = r(t) Q (t) St , (t) = (1 + α2)r(t) + α2 x(t) x(t 1) , where x(0) = 0, x(1) = Γr(0) and Γ = diag(Γ1, Γ2, . . . , Γn) is initial step size matrix. We have the following expanding sequence α2(x(t) x(t 1))St = (1 + α2) i=1 α2(t i)r(i) Si:t 1 St + α2tΓr(0) S0:t 1 St, t 1 St = (1 + α2) i=1 α2(t i)r(i) Si:t 1 St + α2tΓr(0) S0:t 1 St, t 1. Furthermore, we have the following sequence 1 + αΛV ) x(t) x(t 1) i=1 α2(t i)(V 1 α 1 + αΛV )r(i) Si,t + Γ α2t(V 1 α 1 + αΛV )r(0) where we denote Si,t Si:t 1 St. Proof. We assume all nonzeros in b are active nodes at time t = 0 and t = 1, i.e., S0 = r(0) = supp(b). The local updates can be expressed as x(t+1) = x(t) + (1 + α2)r(t) St + α2 x(t) x(t 1) St r(t+1) = b Qx(t+1) = r(t) (1 + α2)Qr(t) α2Q x(t) x(t 1) | {z } original updates + (1 + α2)Qr(t) St + α2Q x(t) x(t 1) St | {z } noisy with small magnitudes = 2 αW r(t) α2r(t 1) + (1 + α2)Qr(t) St + α2Q x(t) x(t 1) St | {z } noisy with small magnitudes = 2 αW r(t) α2r(t 1) + (1 + α2) I 1 α 1 + αW r(t) St + α2Q x(t) x(t 1) r(t+1) 2 αW r(t) + α2r(t 1) = (1 + α2)r(t) St 2 αW r(t) St + α2Q x(t) x(t 1) For t 1, we can expand x(t+1) x(t) as the following x(t+1) = x(t) + (1 + α2)r(t) St + α2 x(t) x(t 1) St (t) = x(t+1) x(t) = (1 + α2)r(t) St + α2 (1 + α2)r(t 1) St 1 + α2 x(t 1) x(t 2) = (1 + α2)r(t) St + α2(1 + α2)r(t 1) St 1:t + α4 x(t 1) x(t 2) = (1 + α2)r(t) St + α2(1 + α2)r(t 1) St 1:t + α4 (1 + α2)r(t 2) St 2 + α2 x(t 2) x(t 3) = (1 + α2)r(t) St + α2(1 + α2)r(t 1) St 1:t + α4(1 + α2)r(t 2) St 2:t + α6 x(t 2) x(t 3) i=t 2 α2(t i)r(i) Si:t + α6 x(t 2) x(t 3) i=1 α2(t i)r(i) Si:t + α2t x(1) x(0) i=1 α2(t i)r(i) Si:t + α2tΓr(0) S1:t. Note S0 = supp(r(0)), then r(0) S1:t = r(0) S0:t, we continue to have x(t) x(t 1) = (1 + α2) i=1 α2(t i 1)r(i) Si:t 1 + Γ α2(t 1)r(0) S0:t 1, t 1 α2(x(t) x(t 1))St = (1 + α2) i=1 α2(t i)r(i) Si:t 1 St + Γ α2tr(0) S0:t 1 St, t 1 The rest follows readily. Lemma D.5 (The nonhomogeneous difference equation). Given y(1) = Λy(0), equations y(t+1) 2Λy(t) + y(t 1) := f (t). have the following solutions y(t) = Zty(0) + k=1 Hk,tf (k), (diag (1, . . . 
, cos(θit), . . . , ( 1)t) for bipartite graphs diag (1, . . . , cos(θit), . . . , cos(θnt)) for non-bipartite graphs, diag t k, . . . , sin(θi(t k)) sin θi , . . . , ( 1)t k 1(t k) for bipartite graphs diag t k, . . . , sin(θi(t k)) sin θi , . . . , sin(θn(t k)) for non-bipartite graphs. Proof. This directly follows from Lemma C.4. Theorem D.6 (Representation of r(t) for LOCHB). Given t 1, x(0) = 0 and x(1) = Γr(0) S0 . The residual of r(t) of LOCHB satisfies V r(t) = αt Zt V r(0) + αtt k=1 αk 1Hk,t V QΓr(0) S0,k | {z } u0,t k=1 αt k(t k) j=k αj k Hj,t 1 α Λ V r(k) Sk,j | {z } uk,t Proof. Follow Lemma D.4, we have r(t+1) Q (t) St = r(t) Q (t) = r(t) (1 + α2)Qr(t) α2Q(x(t) x(t 1)) = α2r(t) + 2 αW r(t) α2Q(x(t) x(t 1)) = 2 αW r(t) α2r(t 1) So we can write the updates of LOCHB as r(t+1) 2 αW r(t) + α2r(t 1) = Q (t) i=1 α2(t i)Qr(i) Si,t + α2t QΓr(0) S0,t. Write V r(t) = αty(t), and note r(t) = αt V y(t), r(t) St = αt V y(t) St = αt V V y(t) V r(t+1) 2 αV W r(t) + α2V r(t 1) i=1 α2(t i)V Qr(i) Si,t + α2t V QΓr(0) S0,t αt+1y(t+1) 2 α αt W y(t) + α2 αt 1y(t 1) i=1 α2(t i) αi QV (V y(i)) Si,t + α2t QΓV (V y(0)) S0,t y(t+1) 2W y(t) + y(t 1) f (t) z }| { t X i=1 (1 + α2) αt 1 α i QV (V y(i))Si,t | {z } + αt 1QΓV (V y(0))S0,t | {z } Then, from Lemma D.5 y(t) = Zty(0) + k=1 Hk,tf (k) V r(t) = αt Zt V r(0) + αt t 1 X k=1 Hk,tf (k) so expanding the error term αt Hk,tf (k) = i=1 αt Hk,tf (k) i + αt Hk,tf (k) 0 = (1 + α2) αt+k 1 i k X i=1 Hk,t QV (V y(i))Si,k + αt+k 1Hk,t QΓV (V y(0))S0,k = (1 + α2) αt+k 1 2i k X i=1 Hk,t QV r(i) Si,k + αt+k 1Hk,t QΓV r(0) S0,k i=1 αt+k 2i Hk,t(1 + α 1 α)QV r(i) Si,k + αt+k 1Hk,t QΓV r(0) S0,k k=1 αt Hk,tf (k) = i=1 αt+k 2i Hk,t(1 + α 1 α)QV r(i) Si,k + αt+k 1Hk,t QΓV r(0) S0,k) j=k αt+j 2k Hj,t(1 + α 1 α)QV r(k) Sk,j + k=1 αt+k 1Hk,t QΓV r(0) S0,k which when simplified, gives the relation proposed. D.3 Convergence of LOCHB of Proof of Theorem D.8 Corollary D.7. Define βk,t uk,t 2/ r(k) 2, βk max t βk,t. Then the upper bound of r(t) 2 can be characterized as r(t) 2 αt t 1 Y j=0 (1 + βj)yt, (35) where yt+1 2yt + yt 1/((1 + βt 1)(1 + βt)) = 0 where y0 = y1 = r(0) 2. Proof. Since uk,t 2 βk r(k) 2, then given the final iterative updates (29) V r(t) = αt Zt V r(0) + αttu0,t + 2 k=1 αt k(t k)uk,t and since Zt 2 1 we can bound r(t) 2 αt r(0) 2 + αttβ0 r(0) 2 + 2 k=1 αt k 1(t k)βk r(k) 2 k=1 αt k 1(t k)βk r(k) 2 αt(1 + tβ0) r(0) 2, (36) where t = 0, 1, . . . , T. The rest just follows a similar strategy shown in Corollary C.9. We have the following inequalities v31 v32 1 0 ... ... ... ... ... vn1 vn2 vn3 1 α1(1 + β0) r(0) 2 α2(1 + 2β0) r(0) 2 α3(1 + 3β0) r(0) 2 ... αT (1 + Tβ0) r(0) 2 where (VL)tk = 2 αt k(t k)βk. Denote each upper bound as τt = r(t) 2, we will have k=1 vt,k = αt(1 + tβ0)τ0 + 2 k=1 (t k) αt kβkτk τt+1 = ct+1 + k=1 vt+1,k = αt+1(1 + (t + 1)β0)τ0 + 2 k=1 (t + 1 k) αt k+1βkτk = αt+1(1 + (t + 1)β0)τ0 + 2 k=1 (t k) αt k+1βkτk + 2 k=1 αt k+1βkτk ατt = αt+1(1 + tβ0)τ0 + 2 k=1 (t k) αt k+1βkτk τt+1 ατt = αt+1β0τ0 + 2 k=1 αt k+1βkτk τt 1 = αt 1(1 + (t 1)β0)τ0 + 2 k=1 (t k 1) αt k 1βkτk α2τt 1 = αt+1(1 + (t 1)β0)τ0 + 2 k=1 (t k) αt k+1βkτk 2 k=1 αt k+1βkτk ατt α2τt 1 = αt+1β0τ0 + 2 k=1 αt k+1βkτk. τt+1 2 ατt + α2τt 1 = 2 αβtτt The above analysis finally leads to τt+1 2(1 + βt) ατt + α2τt 1 = 0. So for j=0 (1 + βj)yt = τt we have yt+1 2yt + yt (1 + βt)(1 + βt 1) = 0. Following the same strategy in Corollary C.9, we obtain the upper bound. Theorem D.8 (Convergence of LOCHB). Let the geometric mean of βt be βt Qt 1 j=0(1 + βj)1/t. 
Then the upper bound of r(t) 2 can be characterized as r(t) 2 αt t 1 Y j=0 (1 + βj)yt, (37) where yt+1 2yt + yt 1/((1 + βt 1)(1 + βt)) = 0 where y0 = y1 = r(0) 2. Assume that there exists a constant c [0, 2) such that βt 1 + c α 1 α. Then the total number of iterations can be bounded as Then the total runtime T is bounded by T Θ (1 + α)vol(ST ) (2 c) α ln yt = e O vol(ST ) where e O hides ln(yt/ϵ). Proof. The first part follows from Corollary of D.7. The rest follows the same strategy of LOCCH as in Theorem 4.2. (See also Theorem 4.2.) Table 2: Examples of sparse linear systems Original Linear system Our target Qx = b Q = Λ σD 1/2AD 1/2 [µ, L] Ref. I σW x = αes 1. Λ = I, σ = 1 α b = αes [α, 2 α] [27] [53] 2 L x = αD 1/2es 2. L = I D 1/2AD 1/2 Λ = I, σ = 1 α 1+α b = 2αD 1/2s/(1 + α) h 2α 1+α, 2 1+α i [2] [13] [12] [37] I (1 α)AD 1 y = αes 3. Λ = I, σ = 1 α b = αD 1/2es, α (0, 1) x = D 1/2y [α, 2 α] [7] [56] n In + D A y = 2λes 4. n D 1 + In, σ = 1 b = 2λD 1/2es, λ [1, n] x = D1/2y h λ ndmax , λ+2n n i [41] [57] D.4 Implementation of LOCHB We present the implementation of LOCHB as follows: Recall the updates of LOCHB is x(t+1) = x(t) + (1 + α2)r(t) St + α2 x(t) x(t 1) St, r(t+1) = 2 αW r(t) α2r(t 1). The corresponding local updates are (t) = (1 + α2)r(t) + α2 x(t) x(t+1) = x(t) + (t) St r(t+1) = r(t) (t) + 1 α x(t+1) = x(t) + (t) (t 1), where if we choose x(0) = x(1) = x(1) = 0, r(0) = r(1) = b and (0) = 0. E Instances of Sparse Linear Systems E.1 Table of Popular Graph-induced Linear Systems This section presents most commonly used graph-induced linear system as the following Λ σD 1/2AD 1/2 | {z } Q x = b, where Q is the generalized version of the perturbed normalized graph Laplacian matrix with perturbation parameter σ > 0, and b is a sparse vector. A typical example of Q = I 1 α 1+αD 1/2AD 1/2 with Λ = I and b = 2αD 1/2es/(1 + α) The detailed parameters are: 1. W = D 1/2 A D 1/2 and A = I + A is the adjacency matrix defined on G(V, E) by adding self-loops for all nodes, and D = I + D is defined as the augmented degree matrix by adding self-loops. α (0, 1) and usually α < 0.5, The I (1 α)W is the perturbed augmented normalized Laplacian with perturbed parameter α. 2. Q = D 1/2 D 1 α 2 (D + A) D 1/2 = αI + 1 α 2 L > 0 and Q = 1 α 1+αD 1/2AD 1/2. This is known as the lazy random-walk version of PPR vectors. 3. x = α I (1 α)AD 1 1 es This is the standard Personalized Page Rank vectors widely used for graph embeddings and graph neural network designing [7]. It is also used for decoupling for large-scale GNNs [56]. Dataset ID Dataset Name n m G1 as-skitter 1,694,616 11,094,209 G2 cit-patent 3,764,117 16,511,740 G3 com-dblp 317,080 1,049,866 G4 com-lj 3,997,962 34,681,189 G5 com-orkut 3,072,441 117,185,083 G6 com-youtube 1,134,890 2,987,624 G7 ogbn-arxiv 169,343 1,157,799 G8 ogbn-mag 1,939,743 21,091,072 G9 ogbn-products 2,385,902 61,806,303 G10 ogbn-proteins 132,534 39,561,252 G11 soc-lj1 4,843,953 42,845,684 G12 soc-pokec 1,632,803 22,301,964 G13 wiki-talk 2,388,953 4,656,682 G14 ogbl-ppa 576,039 21,231,776 G15 wiki-en21 6,216,199 160,823,797 G16 com-friendster 65,608,366 1,806,067,135 G17 ogbn-papers100m 111,059,433 1,614,061,934 Table 3: Dataset Statistics 4. Graph kernel computation for online learning. Each computed vector serves as semi-supervised learning feature vectors [24] or as online node labeling learning vectors [41, 57]. Note the target linear system when σ = 1 n In + D A D 1/2D1/2y = 2λD 1/2es λ n D 1 + In D 1/2AD 1/2 D1/2y = 2λD 1/2es. 
Hence, we have Λ = (λ/n)D^{-1} + I_n, σ = 1, and b = 2λD^{-1/2}e_s.

F Experimental Details and Missing Results

F.1 Datasets and Preprocessing

Following Leskovec et al. [34], we treat all 17 graphs as undirected with unit weights. We remove self-loops and keep the largest connected component when the graph is disconnected. After preprocessing, the graphs range from 132,534 nodes in ogbn-proteins to 111,059,433 nodes in ogbn-papers100M, as presented in Table 3.

F.2 Problem Settings and Baseline Methods

For solving Equation (3), we randomly select 50 source nodes s from each graph. The damping factor is fixed at α = 0.1 for all experiments, and it varies within the range {0.005, 0.01, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3} for the others. The tolerance ϵ is chosen from the range [2α/((1 + α)d_s), 10^{-4}/n].

For solving the local clustering problem, we follow the greedy strategy from Andersen et al. [2], where a local cluster is identified by examining the top magnitudes in PPR vectors. Specifically, we denote the boundary of S as ∂(S) = {(u, v) ∈ E : u ∈ S, v ∉ S}. The conductance of S is defined as Φ(S) = |∂(S)| / min(vol(S), 2m − vol(S)).

Figure 7: The LOCSOR method compared with CGM over 18 graphs (x-axis: number of operations).

Figure 8: The speedup of local solvers compared with their standard counterparts, as a function of ϵ.

The goal of local clustering is to obtain PPR vectors using these local methods and then apply clustering algorithms to find clusters with low conductance. For the sorting process, given the approximate PPR vector π̂, we sort D^{-1/2}π̂ in decreasing order of magnitude. Let the ordered nodes be v1, v2, . . . , vt; the local clustering algorithm iteratively checks the conductance of the prefix sets {v1, v2, . . . , vk} for k = 1, 2, . . . , t, and after completing all checks, it returns the subset {v1, v2, . . . , vk} with the minimal conductance among all examined subsets (a short code sketch of this sweep procedure is given further below).

Parameter settings of baselines. For the local ISTA method [13], the precision parameter is set to ˆϵ = 0.5 for all experiments. According to the algorithm's description of ISTA, the corresponding ρ value is given by ϵ/(1 + ˆϵ). For LOCSOR, the parameter ω is calculated as 2(1 + α)/(1 + √α)². For the local FISTA, as demonstrated in [22], we adopt the same settings as for ISTA and follow its implementation guidelines. We also include preliminary results on ASPR [37]. The algorithm incorporates a parameter ˆϵ to control the number of iterations in the nested Accelerated Projected Gradient Descent (APGD). We adjust ˆϵ from low precision, ˆϵ = 0.1/n, to high precision, ˆϵ = 10^{-4}/n, to ensure the identification of a good approximation. For our experiments, we used a server powered by an Intel(R) Xeon(R) Gold 5218R CPU, which features 40 cores (80 threads). The system is equipped with 256 GB of RAM.

F.3 Full results of Figs. 4, 5, and 6

In all 15 graphs, we set α = 0.1 and ϵ = 0.1/n. For each of the testing graphs, we randomly select 50 nodes and run LOCSOR and CGM.
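For concreteness, a minimal Python sketch of the FIFO-queue implementation of LOCSOR (Algo. 3 above) used in these runs is given below. The adjacency-list input format and the function name are illustrative assumptions rather than the released code; setting ω = 1 recovers the monotone variant analyzed in Theorem 3.3.

```python
from collections import deque

def loc_sor(adj, deg, alpha, eps, s, omega=1.0):
    """Sketch of LOCSOR (Algo. 3). adj[u] is the neighbor list of u and
    deg[u] its degree; we assume eps <= 1/deg[s]. The stored dictionaries
    hold D^{1/2} r(t) and D^{1/2} x(t)."""
    c = 2.0 * alpha / (1.0 + alpha)
    push = (1.0 - alpha) * omega / (1.0 + alpha)
    r, x = {s: c}, {}
    SENTINEL = object()                     # marks the end of the current S_t
    Q, in_q = deque([s, SENTINEL]), {s}
    t = 1
    while Q:
        u = Q.popleft()
        if u is SENTINEL:
            if not Q:
                break
            t += 1                          # start of the next active set S_{t+1}
            Q.append(SENTINEL)
            continue
        in_q.discard(u)
        ru = r.get(u, 0.0)
        if abs(ru) < c * eps * deg[u]:      # inactive under the shifted condition
            continue
        x[u] = x.get(u, 0.0) + omega * ru
        r[u] = ru - omega * ru
        for v in adj[u]:
            r[v] = r.get(v, 0.0) + push * ru / deg[u]
            if abs(r[v]) >= c * eps * deg[v] and v not in in_q:
                Q.append(v); in_q.add(v)
        if abs(r[u]) >= c * eps * deg[u] and u not in in_q:
            Q.append(u); in_q.add(u)
    return x, r
```

Since the stored vectors are already D^{1/2}-scaled, the returned x directly gives the local estimate π̂ = D^{1/2}x(T).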
Figure 10: Comparison of three local solvers (LOCSOR, LOCHB, LOCCH) over 15 graphs; each panel plots the estimation error ln ||x(T) − x*||_1 against the number of operations, with ϵ = 1.12e-09 for all panels.

Fig. 8 presents all speedup tests on 15 datasets. It is evident that these standard linear solvers can be localized effectively.

Figure 9: The estimation error reduction tests on 7 solvers (LOCSOR, LOCHB, LOCCH, ISTA, FISTA, CGM, and APPR); each panel plots ln ||x(T) − x*||_1 against the number of operations. The experiments were conducted on 15 datasets.

Fig. 9 presents the missing results on the estimation error reduction for 15 datasets. Compared with the global solver CGM, all local methods show significant speedup in the early stages. To compare our three local solvers, we zoom in on our results and present them in Fig. 10. Empirically, LOCSOR is the fastest algorithm when the parameter ω is chosen optimally.

Figure 11: Estimation error as a function of the number of operations on com-friendster. We randomly select 5 different nodes and use ϵ = 1/n and ϵ = 1/m.

Figure 12: Estimation error as a function of the number of operations on ogbn-papers100m. We randomly select 5 different nodes and use ϵ = 1/n and ϵ = 1/m.
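To make the clustering evaluation concrete, the following is a minimal sketch of the sweep procedure described in Sec. F.2: nodes are ordered by D^{-1/2}π̂ and the best-conductance prefix is returned. The incremental cut/volume bookkeeping and all names are illustrative assumptions.

```python
def sweep_cut(adj, deg, pi_hat):
    """Return the prefix set with minimal conductance
    Phi(S) = |bd(S)| / min(vol(S), 2m - vol(S)), sweeping nodes in
    decreasing order of pi_hat[u] / sqrt(deg[u])."""
    two_m = sum(deg.values())               # 2m for an undirected graph
    order = sorted(pi_hat,
                   key=lambda u: abs(pi_hat[u]) / deg[u] ** 0.5,
                   reverse=True)
    in_s, vol, cut = set(), 0, 0
    best_phi, best_k = float("inf"), 0
    for k, u in enumerate(order, start=1):
        vol += deg[u]
        # edges to nodes already in S leave the cut, the rest enter it
        cut += deg[u] - 2 * sum(1 for v in adj[u] if v in in_s)
        in_s.add(u)
        denom = min(vol, two_m - vol)
        if denom > 0:
            phi = cut / denom
            if phi < best_phi:
                best_phi, best_k = phi, k
    return set(order[:best_k]), best_phi
```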
- Run time (seconds) Number of operations Dataset LOCSOR LOCCH CGM LOCSOR LOCCH CGM as-skitter 0.350 0.054 0.567 0.095 2.626 0.551 7.827e+05 1.026e+06 1.524e+08 cit-patent 0.966 0.120 1.609 0.189 14.660 2.249 1.804e+06 2.298e+06 2.873e+08 com-dblp 0.068 0.058 0.104 0.059 0.469 0.112 8.222e+04 1.166e+05 1.562e+07 com-friendster 15.29 1.89 26.54 3.52 508.50 99.12 7.027e+07 8.063e+07 1.442e+10 com-lj 0.802 0.148 1.410 0.234 7.593 2.361 2.604e+06 3.271e+06 4.122e+08 com-orkut 0.455 0.158 0.815 0.308 13.343 7.951 2.965e+06 3.221e+06 9.220e+08 com-youtube 0.290 0.070 0.501 0.099 1.314 0.257 7.617e+05 9.323e+05 3.561e+07 ogb-mag240m 85.14 16.05 108.86 7.994 549.51 367.2 4.541e+07 5.820e+07 9.426e+09 ogbl-ppa 0.116 0.019 0.202 0.040 5.624 1.108 6.117e+05 6.445e+05 1.682e+08 ogbn-arxiv 0.039 0.060 0.058 0.063 0.239 0.104 1.158e+05 1.354e+05 1.473e+07 ogbn-mag 0.520 0.079 0.921 0.128 6.136 0.974 1.804e+06 1.947e+06 2.050e+08 ogbn-papers100M 27.20 3.94 41.99 6.17 750.96 110.45 8.893e+07 1.095e+08 1.604e+10 ogbn-products 0.695 0.236 1.059 0.249 31.907 4.961 1.777e+06 2.385e+06 7.071e+08 ogbn-proteins 0.021 0.057 0.025 0.057 0.910 0.306 7.941e+04 6.610e+04 1.488e+08 soc-lj1 1.040 0.282 1.751 0.487 7.102 3.410 3.263e+06 4.045e+06 4.833e+08 soc-pokec 0.210 0.027 0.368 0.049 1.568 0.207 2.020e+06 2.225e+06 2.160e+08 wiki-en21 1.436 2.708 1.794 0.215 19.658 4.576 5.996e+06 6.659e+06 1.329e+09 wiki-talk 0.251 0.049 0.455 0.091 0.642 0.071 1.090e+06 1.290e+06 4.284e+07 Table 4: Summary of runtime and operations for 15 datasets. (ϵ = 10 6) ln x(T) x 1 ogbl-ppa (α = 0.005) 0 2 4 6 8 106 ogbl-ppa (α = 0.1) 0.0 0.5 1.0 1.5 ogbl-ppa (α = 0.25) Loc GD Loc SOR Loc HB Loc CH 0 1 2 Operations 107 ln x(T) x 1 ogbn-products (α = 0.005) 0.0 0.5 1.0 Operations 107 ogbn-products (α = 0.1) 0 2 4 Operations 106 ogbn-products (α = 0.25) Figure 13: Estimation error as a function of operations needed. For α = 0.005, α = 0.1, and α = 0.25. Conductance APPR Loc CH Loc HB Loc SOR ISTA FISTA 0 500 1000 1500 0 100 200 300 Conductance com-youtube 0 250 500 750 0 50 100 150 0 2000 4000 ogbn-products 0 500 1000 1500 Support Size Conductance ogbn-proteins 0 50 100 Support Size 0 200 400 Support Size 0 2500 5000 7500 Support Size 0 1000 2000 3000 Support Size Figure 14: The graph conductance found by local graph clustering method using different local approximate methods. Experiments ran on 15 graphs. (ϵ = 10 6) Dataset APPR LOCCH LOCHB LOCSOR ISTA FISTA as-skitter 5.90e-04 5.90e-04 5.90e-04 5.90e-04 5.90e-04 5.90e-04 cit-patent 4.23e-02 4.23e-02 4.23e-02 4.23e-02 4.23e-02 4.23e-02 com-dblp 4.12e-03 4.12e-03 4.12e-03 4.12e-03 4.12e-03 4.12e-03 com-lj 2.94e-04 2.94e-04 2.94e-04 2.94e-04 2.94e-04 2.94e-04 com-orkut 1.75e-01 1.76e-01 1.76e-01 1.75e-01 1.76e-01 1.75e-01 com-youtube 5.46e-03 5.46e-03 5.46e-03 5.46e-03 5.46e-03 5.46e-03 ogbn-arxiv 2.11e-02 2.11e-02 2.11e-02 2.12e-02 2.14e-02 2.11e-02 ogbn-mag 3.18e-03 1.59e-03 3.18e-03 3.18e-03 4.74e-03 3.18e-03 ogbn-products 7.26e-02 1.02e-01 9.68e-02 9.65e-02 9.51e-02 7.24e-02 ogbn-proteins 5.92e-03 5.92e-03 5.92e-03 5.92e-03 5.92e-03 5.92e-03 soc-lj1 5.07e-02 1.18e-01 1.18e-01 1.18e-01 5.25e-02 5.07e-02 soc-pokec 7.80e-05 7.80e-05 7.80e-05 7.80e-05 7.80e-05 7.80e-05 wiki-talk 1.64e-02 1.64e-02 1.64e-02 1.64e-02 1.64e-02 1.64e-02 ogbl-ppa 5.13e-02 5.13e-02 5.13e-02 5.13e-02 5.13e-02 5.13e-02 wiki-en21 1.31e-01 1.31e-01 1.31e-01 1.31e-01 1.31e-01 1.31e-01 Table 5: The local conductance for six local solvers tested on 15 graphs datasets. 
(ϵ = 10 6) Dataset APPR LOCCH LOCHB LOCSOR ISTA FISTA as-skitter 0.127 0.147 0.144 0.043 0.323 0.093 cit-patent 0.362 0.516 0.457 0.125 0.939 0.308 com-dblp 0.069 0.033 0.028 0.014 0.297 0.042 com-lj 0.357 0.440 0.664 0.175 0.493 0.229 com-orkut 0.072 0.141 0.139 0.055 0.108 0.084 com-youtube 0.128 0.176 0.131 0.040 0.682 0.102 ogbl-ppa 0.102 0.047 0.091 0.027 0.146 0.102 ogbn-arxiv 0.042 0.013 0.014 0.006 0.237 0.032 ogbn-mag 0.068 0.137 0.091 0.035 0.253 0.090 ogbn-products 0.072 0.121 0.117 0.045 0.135 0.090 ogbn-proteins 0.005 0.005 0.005 0.002 0.005 0.004 soc-lj1 0.239 0.376 0.350 0.103 0.512 0.194 soc-pokec 0.067 0.096 0.059 0.033 0.222 0.051 wiki-talk 0.197 0.314 0.274 0.109 0.508 0.176 wiki-en21 0.121 0.290 0.279 0.106 0.197 0.147 Table 6: Runtime (seconds) for six local solvers tested on 15 graphs datasets. (ϵ = 10 6) F.4 Results on local clustering Dataset APPR LOCCH LOCHB LOCSOR ISTA FISTA as-skitter 6.9e+05 7.6e+04 8.1e+04 6.5e+04 2.9e+06 5.7e+05 cit-patent 6.7e+05 1.0e+05 1.1e+05 8.9e+04 2.3e+06 4.4e+05 com-dblp 4.3e+05 4.8e+04 5.0e+04 3.5e+04 1.9e+06 2.9e+05 com-lj 5.7e+05 8.9e+04 1.0e+05 7.6e+04 2.8e+06 4.4e+05 com-orkut 5.4e+05 8.9e+04 1.1e+05 9.0e+04 1.3e+06 5.0e+05 com-youtube 5.3e+05 6.5e+04 6.6e+04 5.7e+04 3.8e+06 4.4e+05 ogbl-ppa 6.7e+05 9.6e+04 1.1e+05 9.3e+04 1.5e+06 6.0e+05 ogbn-arxiv 6.8e+05 8.7e+04 9.7e+04 7.6e+04 2.7e+06 5.2e+05 ogbn-mag 5.2e+05 8.2e+04 9.4e+04 8.3e+04 2.4e+06 4.4e+05 ogbn-products 9.0e+05 1.1e+05 1.3e+05 9.8e+04 2.1e+06 7.6e+05 ogbn-proteins 5.3e+05 4.0e+04 5.5e+04 5.0e+04 6.8e+05 6.2e+05 soc-lj1 5.6e+05 8.9e+04 1.0e+05 7.7e+04 2.8e+06 4.4e+05 soc-pokec 5.9e+05 1.1e+05 1.3e+05 1.1e+05 2.5e+06 5.0e+05 wiki-talk 4.4e+05 5.5e+04 5.2e+04 4.6e+04 1.8e+06 3.4e+05 wiki-en21 5.2e+05 8.0e+04 9.7e+04 8.1e+04 1.8e+06 4.9e+05 Table 7: Operations Needed for six local solvers tested on 15 graphs datasets. (ϵ = 10 6) G Related work 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 Number of operations needed 1010 Figure 15: Comparison of the error reduction between the proposed LOCCH and the standard CGM on the papers100M dataset [23], in terms of the number of operations required. Many graph applications [2, 7, 29, 15, 27, 36, 30, 44, 25, 45, 51, 48] only require solving Equ. (1) approximately. The reasons could be either the most energies of π are among a small set of nodes forming small subgraphs, or one wants to study large graphs by checking them locally. Given a graph G with n nodes and m edges, there are two main types of iterative solvers for Equ. (1) as follows: Standard iterative methods. Methods for solving linear systems have been well-established over the past decades (see textbooks of Saad [43], Golub & Van Loan [19], Young [55]). The fastest linear solver for solving the symmetrized version of Equ. (1) is the Conjugate Gradient Method (CGM) with runtime complexity O(m/ α) where m is the number of edges in the graph. It costs Θ(m) to access the entire graph at each iteration; hence, it is much slower than local solvers, as demonstrated in Fig. 15. The symmetric diagonally dominant (SDD) solvers advance CGM further to have complexity O (m logc n log(1/ϵ)) [31, 46]. Anikin et al. [4] considered the Page Rank problem and proposed an algorithm with runtime depending on O(n). This paper focuses on local algorithms where the goal is to avoid the dominant factor m or n by avoiding the full Qx operation. Local algorithms. 
Local solvers, in contrast to their standard counterparts, leverage the fact that the energy of π lives in a small portion of the graph and hence do not require O(m) or O(n) work per iteration. They are advantageous for huge-scale graphs, as demonstrated in Fig. 15. Andersen et al. [2] used APPR to obtain an approximation of π for local clustering. Quite similar algorithms were developed in Berkhin [6] (the bookmark-coloring algorithm) and Kloster & Gleich [28] (the Gauss-Southwell procedure). Under the same stopping condition as APPR, Fountoulakis et al. [13] demonstrated that APPR is equivalent to coordinate descent via a variational characterization, with a runtime of O(1/(αϵ)) using ISTA, where the monotonicity and conservation properties remain. Hence, it is natural to ask, as Fountoulakis & Yang [12] did, whether O(1/(√α ϵ)) could be achieved by FISTA [5]. However, the difficulty is that FISTA violates the monotonicity property, so the volume accessed per iteration cannot be bounded properly. To overcome this, Martínez-Rubio et al. [37] proposed a nested accelerated projected gradient descent (APGD) that gradually expands the solution support so that the monotonicity property still holds. However, nested APGD requires solving subproblems accurately, which in practice may be cumbersome if the precision requirement of the inner problem is too stringent. All current local methods rely on some monotonicity property of the variables to guarantee locality, which does not exist in most accelerated frameworks; thus, developing an accelerated method that is guaranteed to preserve intermediate variable sparsity remains challenging. It is worth mentioning that local methods are also closely related to sublinear-time and local computational algorithms [42, 1]. From the optimization perspective, the equivalence between Gauss-Seidel and coordinate descent has been considered [49, 35, 39, 50], but these works do not focus on local analysis.

NeurIPS Paper Checklist

1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?

Answer: [Yes]

Justification: The abstract and introduction clearly state the main claims made by the paper, including its key contributions and scope.

2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: The accelerated algorithms, such as LOCCH and LOCHB, may exhibit instability when α is small. This limitation arises due to the inherent constraints of global methods.
NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: The abstract and introduction clearly state the main claims made by the paper, including its key contributions and scope.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: The accelerated algorithms, such as LOCCH and LOCHB, may exhibit instability when α is small. This limitation arises due to the inherent constraints of global methods.
Guidelines: The answer NA means that the paper has no limitation, while the answer No means that the paper has limitations but they are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: All assumptions are clearly stated.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition. Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have provided our code for the review process and will make it publicly available upon publication.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: All datasets used in this study are publicly available.
Guidelines: The answer NA means that the paper does not include experiments requiring code. Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All parameter settings of our methods and baselines are included.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: For most of our results, we report the standard error over 50 randomly sampled nodes.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates). If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: For our experiments, we used a server powered by an Intel(R) Xeon(R) Gold 5218R CPU, which features 40 cores (80 threads). The system is equipped with 256 GB of RAM.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The research conducted in the paper conforms in every respect with the NeurIPS Code of Ethics. The authors have thoroughly reviewed and adhered to the guidelines, ensuring that all aspects of their work align with ethical standards.
Guidelines: The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: The advancements in accelerated algorithms like LOCCH and LOCHB can significantly enhance computational efficiency in various applications, contributing to faster and more effective solutions in fields such as data analysis, machine learning, and optimization.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: None.
Guidelines: The answer NA means that the paper poses no such risks.
Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [NA]
Justification: None.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: None.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: No human subjects involved.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: No human subjects involved.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.