# Accelerated consensus via Min-Sum Splitting

Patrick Rebeschini, Department of Statistics, University of Oxford, patrick.rebeschini@stats.ox.ac.uk
Sekhar Tatikonda, Department of Electrical Engineering, Yale University, sekhar.tatikonda@yale.edu

Abstract

We apply the Min-Sum message-passing protocol to solve the consensus problem in distributed optimization. We show that while the ordinary Min-Sum algorithm does not converge, a modified version of it known as Splitting yields convergence to the problem solution. We prove that a proper choice of the tuning parameters allows Min-Sum Splitting to yield subdiffusive accelerated convergence rates, matching the rates obtained by shift-register methods. The acceleration scheme embodied by Min-Sum Splitting for the consensus problem bears similarities with lifted Markov chains techniques and with multi-step first order methods in convex optimization.

1 Introduction

Min-Sum is a local message-passing algorithm designed to distributedly optimize an objective function that can be written as a sum of component functions, each of which depends on a subset of the decision variables. Due to its simplicity, Min-Sum has emerged as a canonical protocol to address large-scale problems in a variety of domains, including signal processing, statistics, and machine learning. For problems supported on tree graphs, the Min-Sum algorithm corresponds to dynamic programming and is guaranteed to converge to the problem solution. For arbitrary graphs, the ordinary Min-Sum algorithm may fail to converge, or it may converge to something different from the problem solution [28]. In the case of strictly convex objective functions, there are known sufficient conditions that guarantee the convergence and correctness of the algorithm. The most general condition requires the Hessian of the objective function to be scaled diagonally dominant [28, 25]. While the Min-Sum scheme can be applied to optimization problems with constraints, by incorporating the constraints into the objective function as hard barriers, the known sufficient conditions do not apply in this case. In [34], a generalization of the traditional Min-Sum scheme was proposed, based on a reparametrization of the original objective function. This algorithm is called Splitting, as it can be derived by creating equivalent graph representations for the objective function by splitting the nodes of the original graph. In the case of unconstrained problems with quadratic objective functions, where Min-Sum is also known as Gaussian Belief Propagation, the algorithm with splitting has been shown to yield convergence in settings where the ordinary Min-Sum does not converge [35]. To date, the rates of convergence of Min-Sum Splitting have not been theoretically established. In this paper we establish rates of convergence for the Min-Sum Splitting algorithm applied to solve the consensus problem, which can be formulated as an equality-constrained problem in optimization. The basic version of the consensus problem is the network averaging problem. In this setting, each node in a graph is assigned a real number, and the goal is to design a distributed protocol that allows the nodes to iteratively exchange information with their neighbors so as to arrive at consensus on the average across the network. Early work includes [42, 41].
The design of distributed algorithms to solve the averaging problem has received a lot of attention recently, as consensus represents a widely-used primitive to compute aggregate statistics in a variety of fields. Applications include, for instance, estimation problems in sensor networks, distributed tracking and localization, multi-agent coordination, and distributed inference [20, 21, 9, 19]. Consensus is typically combined with some form of local optimization over a peer-to-peer network, as in the case of iterative subgradient methods [29, 40, 17, 10, 6, 16, 39]. In large-scale machine learning, consensus is used as a tool to distribute the minimization of a loss function over a large dataset into a network of processors that can exchange and aggregate information, and only have access to a subset of the data [31, 11, 26, 3].

Classical algorithms to solve the network averaging problem involve linear dynamical systems supported on the nodes of the graph. Even when the coefficients that control the dynamics are optimized, these methods are known to suffer from a diffusive rate of convergence, which corresponds to the rate of convergence to stationarity exhibited by the diffusion random walk naturally associated to a graph [44, 2]. This rate is optimal for graphs with good expansion properties, such as complete graphs or expanders. In this case the convergence time, i.e., the number of iterations required to reach a prescribed level of error accuracy ε > 0 in the ℓ2 norm relative to the initial condition, scales independently of the dimension of the problem, as Θ(log 1/ε). For graphs with geometry this rate is suboptimal [7], and it does not yield a convergence time that matches the lower bound Ω(D log 1/ε), where D is the graph diameter [37, 36]. For example, in both cycle graphs and grid-like topologies the number of iterations scales like Θ(D^2 log 1/ε) (if n is the number of nodes, D is of order n in a cycle and of order √n in a two-dimensional torus). Θ(D^2 log 1/ε) is also the convergence time exhibited in random geometric graphs, which represent the relevant topologies for many applications in sensor networks [9]. In [7] it was established that for a class of graphs with geometry (polynomial growth or finite doubling dimension), the mixing time of any reversible Markov chain scales at least like D^2, embodying the fact that symmetric walks on these graphs take D^2 steps to travel distances of order D.

Min-Sum schemes to solve the consensus problem have been previously investigated in [27]. The authors show that the ordinary Min-Sum algorithm does not converge in graphs with cycles. They investigate a modified version of it that uses a soft barrier function to incorporate the equality constraints into the objective function. In the case of d-regular graphs, upon a proper choice of initial conditions, the authors show that the algorithm they propose reduces to a linear process supported on the directed edges of the graph, and they characterize the convergence time of the algorithm in terms of the Cesàro mixing time of a Markov chain defined on the set of directed edges of the original graph. In the case of cycle graphs (i.e., d = 2), they prove that the mixing time scales like O(D), which yields the convergence time O((D/ε) log 1/ε). See Theorem 4 and Theorem 5 in [27].
In the case of (d/2)-dimensional tori (where D is of order n^{2/d}), they conjecture that the mixing time is Θ(D^{2(d−1)/d}), but do not present bounds for the convergence time. See Conjecture 1 in [27]. For other graph topologies, they leave the mixing time (and convergence time) achieved by their method as an open question.

In this paper we show that the Min-Sum scheme based on splitting yields convergence to the consensus solution, and we analytically establish rates of convergence for any graph topology. First, we show that a certain parametrization of the Min-Sum protocol for consensus yields a linear message-passing update for any graph and for any choice of initial conditions. Second, we show that the introduction of the splitting parameters is not only fundamental to guarantee the convergence and correctness of the Min-Sum scheme in the consensus problem, but that proper tuning of these parameters yields accelerated (i.e., subdiffusive) asymptotic rates of convergence. We establish a square-root improvement for the asymptotic convergence time over diffusive methods, which allows Min-Sum Splitting to scale like O(D log(D/ε)) for cycles and tori. Our results show that Min-Sum schemes are competitive and get close to the optimal rate O(D log(1/ε)) recently established for some algorithms based on Nesterov's acceleration [30, 36].

The main tool used for the analysis involves the construction of an auxiliary linear process supported on the nodes of the original graph to track the evolution of the Min-Sum Splitting algorithm, which is instead supported on the directed edges. This construction allows us to relate the convergence time of the Min-Sum scheme to the spectral gap of the matrix describing the dynamics of the auxiliary process, which is easier to analyze than the matrix describing the dynamics on the edges as in [27].

In the literature, overcoming the suboptimal convergence rate of classical algorithms for network averaging consensus has motivated the design of several accelerated methods. Two main lines of research have been developed, and seem to have evolved independently of each other: one involves lifted Markov chains techniques, see [37] for a review; the other involves accelerated first order methods in convex optimization, see [13] for a review. Another contribution of this paper is to show that Min-Sum Splitting bears similarities with both types of accelerated methods. On the one hand, Min-Sum can be seen as a process on a lifted space, which is the space of directed edges in the original graph. Here, splitting is seen to introduce a directionality in the message exchange of the ordinary Min-Sum protocol that is analogous to the directionality introduced in non-reversible random walks on lifted graphs to achieve faster convergence to stationarity. The advantage of the Min-Sum algorithm over lifted Markov chain methods is that no lifted graph needs to be constructed. On the other hand, the directionality induced on the edges by splitting translates into a memory term for the auxiliary algorithm running on the nodes. This memory term, which allows nodes to remember previous values and incorporate them into the next update, directly relates the Min-Sum Splitting algorithm to accelerated multi-step first order methods in convex optimization.
In particular, we show that a proper choice of the splitting parameters recovers the same matrix that supports the evolution of shift-register methods used in numerical analysis for linear solvers, and, as a consequence, we recover the same accelerated rate of convergence for consensus [45, 4, 24].

To summarize, the main contributions of this paper are:
1. First connection of Min-Sum schemes with lifted Markov chains techniques and multi-step methods in convex optimization.
2. First proof of how the directionality embedded in Belief Propagation protocols can be tuned and exploited to accelerate the convergence rate towards the problem solution.
3. First analysis of convergence rates for Min-Sum Splitting. New proof technique based on the introduction of an auxiliary process to track the evolution of the algorithm on the nodes.
4. Design of a Min-Sum protocol for the consensus problem that achieves better convergence rates than the ones established (and conjectured) for the Min-Sum method in [27].
Our results motivate further studies to generalize the acceleration due to splittings to other problems.

The paper is organized as follows. In Section 2 we introduce the Min-Sum Splitting algorithm in its general form. In Section 3 we describe the consensus problem and review the classical diffusive algorithms. In Section 4 we review the main accelerated methods that have been proposed in the literature. In Section 5 we specialize the Min-Sum Splitting algorithm to the consensus problem, and show that a proper parametrization yields a linear exchange of messages supported on the directed edges of the graph. In Section 6 we derive the auxiliary message-passing algorithm that allows us to track the evolution of the Min-Sum Splitting algorithm via a linear process with memory supported on the nodes of the graph. In Section 7 we state Theorem 1, which shows that a proper choice of the tuning parameters recovers the rates of shift-registers. Proofs are given in the supplementary material.

2 The Min-Sum Splitting algorithm

The Min-Sum algorithm is a distributed routine to optimize a cost function that is the sum of components supported on a given graph structure. Given a simple graph G = (V, E) with n := |V| vertices and m := |E| edges, assume that we are given a set of functions φ_v : R → R ∪ {∞}, for each v ∈ V, and φ_{vw} = φ_{wv} : R × R → R ∪ {∞}, for each {v, w} ∈ E, and that we want to solve the following problem over the decision variables x = (x_v)_{v∈V} ∈ R^V:

    minimize  Σ_{v∈V} φ_v(x_v) + Σ_{{v,w}∈E} φ_{vw}(x_v, x_w).   (1)

The Min-Sum algorithm describes an iterative exchange of messages, which are functions of the decision variables, associated to the directed edges of G. Let Ē := {(v, w) ∈ V × V : {v, w} ∈ E} be the set of directed edges associated to the undirected edges in E (each edge in E corresponds to two edges in Ē). In this work we consider the synchronous implementation of the Min-Sum algorithm where at any given time step s, each directed edge (v, w) ∈ Ē supports two messages, ξ̂^s_{vw}, μ̂^s_{vw} : R → R ∪ {∞}. Messages are computed iteratively. Given an initial choice of messages μ̂^0 = (μ̂^0_{vw})_{(v,w)∈Ē}, the Min-Sum scheme that we investigate in this paper is given in Algorithm 1. Henceforth, for each v ∈ V, let N(v) := {w ∈ V : {v, w} ∈ E} denote the neighbors of node v. The formulation of the Min-Sum scheme given in Algorithm 1, which we refer to as Min-Sum Splitting, was introduced in [34]. This formulation admits as tuning parameters the real number δ ∈ R and the symmetric matrix Γ = (Γ_{vw})_{v,w∈V} ∈ R^{V×V}.
Without loss of generality, we assume that the sparsity of Γ respects the structure of the graph G, in the sense that if {v, w} ∉ E then Γ_{vw} = 0 (note that Algorithm 1 only involves summations with respect to nearest neighbors in the graph). The choice of δ = 1 and Γ = A, where A is the adjacency matrix defined as A_{vw} := 1 if {v, w} ∈ E and A_{vw} := 0 otherwise, yields the ordinary Min-Sum algorithm.

Algorithm 1: Min-Sum Splitting
Input: messages μ̂^0 = (μ̂^0_{vw})_{(v,w)∈Ē}; parameters δ ∈ R and Γ ∈ R^{V×V} symmetric; time t ≥ 1.
for s ∈ {1, . . . , t} do
    ξ̂^s_{wv} = φ_v/δ − μ̂^{s−1}_{wv} + Σ_{z∈N(v)} Γ_{zv} μ̂^{s−1}_{zv},   (w, v) ∈ Ē;
    μ̂^s_{wv} = min_{z∈R} { φ_{vw}( · , z)/Γ_{vw} + (δ − 1) ξ̂^s_{wv} + δ ξ̂^s_{vw}(z) },   (w, v) ∈ Ē;
μ^t_v = φ_v + δ Σ_{w∈N(v)} Γ_{wv} μ̂^t_{wv},   v ∈ V;
Output: x^t_v = argmin_{z∈R} μ^t_v(z),   v ∈ V.

For an arbitrary choice of strictly positive integer parameters, Algorithm 1 can be seen to correspond to the ordinary Min-Sum algorithm applied to a new formulation of the original problem, where an equivalent objective function is obtained from the original one in (1) by splitting each term φ_{vw} into Γ_{vw} ∈ N \ {0} terms, and each term φ_v into δ ∈ N \ {0} terms. Namely, minimize Σ_{v∈V} Σ_{k=1}^{δ} φ^k_v(x_v) + Σ_{{v,w}∈E} Σ_{k=1}^{Γ_{vw}} φ^k_{vw}(x_v, x_w), with φ^k_v := φ_v/δ and φ^k_{vw} := φ_{vw}/Γ_{vw}.¹ Hence the name splitting algorithm. Despite this interpretation, Algorithm 1 is defined for any real choice of parameters δ and Γ. In this paper we investigate the convergence behavior of the Min-Sum Splitting algorithm for some choices of δ and Γ, in the case of the consensus problem that we define in the next section.

3 The consensus problem and standard diffusive algorithms

Given a simple graph G = (V, E) with n := |V| nodes, for each v ∈ V let φ_v : R → R ∪ {∞} be a given function. The consensus problem is defined as follows:

    minimize  Σ_{v∈V} φ_v(x_v)   subject to  x_v = x_w, {v, w} ∈ E.   (2)

We interpret G as a communication graph where each node represents an agent, and each edge represents a communication channel between neighboring agents. Each agent v is given the function φ_v, and agents collaborate by iteratively exchanging information with their neighbors in G with the goal of eventually arriving at the solution of problem (2). The consensus problem amounts to designing distributed algorithms to solve problem (2) that respect the communication constraints encoded by G. A classical setting investigated in the literature is the least-squares case yielding the network averaging problem, where for a given b ∈ R^V we have² φ_v(z) := ½ z² − b_v z and the solution of problem (2) is b̄ := (1/n) Σ_{v∈V} b_v. In this setup, each agent v ∈ V is given a number b_v, and agents want to exchange information with their neighbors according to a protocol that allows each of them to eventually reach consensus on the average b̄ across the entire network. Classical algorithms to solve this problem involve a linear exchange of information of the form x^t = W x^{t−1} with x^0 = b, for a given matrix W ∈ R^{V×V} that respects the topology of the graph G (i.e., W_{vw} ≠ 0 only if {v, w} ∈ E or v = w), so that W^t → 11^T/n as t → ∞, where 1 is the all-ones vector. This linear iteration allows for a distributed exchange of information among agents, as at any iteration each agent v ∈ V only receives information from its neighbors N(v) via the update: x^t_v = W_{vv} x^{t−1}_v + Σ_{w∈N(v)} W_{vw} x^{t−1}_w.
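To make the classical diffusive scheme concrete, here is a minimal sketch (our own illustration, not code from the paper; the graph, weights, horizon, and variable names are assumptions) of the linear iteration x^t = W x^{t−1} on a small cycle, converging to the network average b̄.

```python
import numpy as np

# Minimal sketch of diffusive consensus: x^t = W x^{t-1}, x^0 = b.
# Hypothetical setup: a cycle on n nodes with a lazy, doubly-stochastic W
# (weight 1/3 on each neighbor and on the self-loop).
n = 20
rng = np.random.default_rng(0)
b = rng.random(n)                          # initial observations b_v

W = np.zeros((n, n))
for v in range(n):
    W[v, v] = 1.0 / 3.0                    # self-loop
    W[v, (v - 1) % n] = 1.0 / 3.0          # left neighbor on the cycle
    W[v, (v + 1) % n] = 1.0 / 3.0          # right neighbor on the cycle

x = b.copy()
for t in range(2000):
    x = W @ x                              # each node averages with its neighbors

print("node estimates (first 3):", x[:3])
print("network average         :", b.mean())
```

On well-connected graphs a few iterations suffice, whereas on the cycle above the error decays only at the diffusive rate discussed next.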
The original literature on this problem investigates the case where the matrix W has non-negative coefficients and represents the transition matrix of a random walk on the nodes of the graph G, so that W_{vw} is interpreted as the probability that a random walk at node v visits node w in the next time step. A popular choice is given by the Metropolis-Hastings method [37], which involves the doubly-stochastic matrix W^MH defined as W^MH_{vw} := 1/(2 d_max) if {v, w} ∈ E, W^MH_{vw} := 1 − d_v/(2 d_max) if w = v, and W^MH_{vw} := 0 otherwise, where d_v := |N(v)| is the degree of node v, and d_max := max_{v∈V} d_v is the maximum degree of the graph G.

In [44], necessary and sufficient conditions are given for a generic matrix W to satisfy W^t → 11^T/n, namely, 1^T W = 1^T, W1 = 1, and ρ(W − 11^T/n) < 1, where ρ(M) denotes the spectral radius of a given matrix M. The authors show that the problem of choosing the optimal symmetric matrix W that minimizes ρ(W − 11^T/n) = ‖W − 11^T/n‖ (here ‖M‖ denotes the spectral norm of a matrix M, which coincides with ρ(M) if M is symmetric) is a convex problem and can be cast as a semi-definite program. Typically, the optimal matrix involves negative coefficients, hence departing from the random walk interpretation. However, even the optimal choice of symmetric matrix yields a diffusive rate of convergence, which is already attained by the matrix W^MH [7]. This rate corresponds to the speed of convergence to stationarity achieved by the diffusion random walk, defined as the Markov chain with transition matrix diag(d)^{−1} A, where diag(d) ∈ R^{V×V} is the degree matrix, i.e., diagonal with diag(d)_{vv} := d_v, and A ∈ R^{V×V} is the adjacency matrix, i.e., symmetric with A_{vw} := 1 if {v, w} ∈ E, and A_{vw} := 0 otherwise. For instance, the condition ‖(W − 11^T/n)^t‖ ≤ ε, where ‖·‖ is the ℓ2 norm, yields a convergence time that scales like t ∼ Θ(D^2 log(1/ε)) in cycle graphs and tori [33], where D is the graph diameter. The authors in [7] established that for a class of graphs with geometry (polynomial growth or finite doubling dimension) the mixing time of any reversible Markov chain scales at least like D^2, and this bound is achieved by Metropolis-Hastings [37].

¹ As mentioned in [34], one can also consider a more general formulation of the splitting algorithm with δ replaced by a vector of parameters (δ_v)_{v∈V} (possibly also with time-varying parameters). The current choice of the algorithm is motivated by the fact that in the present case the output of the algorithm can be tracked by analyzing a linear system on the nodes of the graph, as we will show in Section 5.
² In the literature, the classical choice is φ_v(z) := ½ (z − b_v)², which yields the same results as the quadratic function that we define in the main text, as constant terms in the objective function do not alter the optimal point of the problem but only the optimal value of the objective function.
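The Metropolis-Hastings construction and the conditions of [44] are easy to check numerically. The sketch below is our own illustration (the test graph and all helper names are assumptions, not from the paper): it builds W^MH for a given adjacency matrix and verifies 1^T W = 1^T, W1 = 1, and ρ(W − 11^T/n) < 1.

```python
import numpy as np

def metropolis_hastings(A):
    """W^MH: 1/(2 d_max) on edges, 1 - d_v/(2 d_max) on the diagonal, 0 elsewhere."""
    d = A.sum(axis=1)                      # degrees d_v
    d_max = d.max()
    return A / (2.0 * d_max) + np.diag(1.0 - d / (2.0 * d_max))

# Hypothetical test graph: a cycle on n nodes.
n = 30
A = np.zeros((n, n))
for v in range(n):
    A[v, (v + 1) % n] = A[(v + 1) % n, v] = 1.0

W = metropolis_hastings(A)
J = np.ones((n, n)) / n                    # the projector 11^T / n

# Conditions from [44] for W^t -> 11^T / n.
print("rows sum to one   :", np.allclose(W @ np.ones(n), 1.0))
print("columns sum to one:", np.allclose(np.ones(n) @ W, 1.0))
rho_W = max(abs(np.linalg.eigvalsh(W - J)))   # W is symmetric, so eigvalsh applies
print("rho(W - 11^T/n)   :", rho_W)           # close to 1 on the cycle: diffusive behavior
```

The printed spectral radius is the quantity whose gap to 1 governs the diffusive convergence time discussed above.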
4 Accelerated algorithms

To overcome the diffusive behavior typical of classical consensus algorithms, two main types of approaches have been investigated in the literature, which seem to have been developed independently. The first approach involves the construction of a lifted graph Ĝ = (V̂, Ê) and of a linear system supported on its nodes, of the form x̂^t = Ŵ x̂^{t−1}, where Ŵ ∈ R^{V̂×V̂} is the transition matrix of a non-reversible Markov chain on the nodes of Ĝ. This approach has its origins in the work of [8] and [5], where it was observed for the first time that certain non-reversible Markov chains on properly-constructed lifted graphs yield better mixing times than reversible chains on the original graphs. For some simple graph topologies, such as cycle graphs and two-dimensional grids, the construction of the optimal lifted graphs is well-understood already from the works in [8, 5]. A general theory of lifting in the context of Gossip algorithms has been investigated in [18, 37]. However, this construction incurs additional overhead, which yields non-optimal computational complexity, even for cycle graphs and two-dimensional grids. Typically, lifted random walks on arbitrary graph topologies are constructed on a case-by-case basis, exploiting the specifics of the graph at hand. This is the case, for instance, for random geometric graphs [22, 23]. The key property that allows non-reversible lifted Markov chains to achieve subdiffusive rates is the introduction of a directionality in the process to break the diffusive nature of reversible chains. The strength of the directionality depends on global properties of the original graph, such as the number of nodes [8, 5] or the diameter [37]. See Figure 1.

Figure 1: (a) Symmetric Markov chain W on the nodes of the ring graph G. (b) Non-reversible Markov chain Ŵ on the nodes of the lifted graph Ĝ [8]. (c) Ordinary Min-Sum algorithm on the directed edges Ē associated to G (i.e., K̂(δ, Γ), Algorithm 2, with δ = 1 and Γ = A, where A is the adjacency matrix of G). (d) Min-Sum Splitting K̂(δ, Γ), Algorithm 2, with δ = 1, Γ = γW, γ = 2/(1 + √(1 − ρ_W²)) as in Theorem 1. Here, ρ_W is Θ(1 − 1/n²) and γ ≈ 2(1 − 1/n) for n large. The matrix K̂(δ, Γ) has negative entries, departing from the Markov chain interpretation. This is also the case for the optimal tuning in classical consensus schemes [44] and for the ADMM lifting in [12].

The second approach involves designing linear updates that are supported on the original graph G and keep track of a longer history of previous iterates. This approach relies on the fact that the original consensus update x^t = W x^{t−1} can be interpreted as a primal-dual gradient ascent method to solve problem (2) with a quadratic objective function [32]. This allows the implementation of accelerated gradient methods. To the best of our knowledge, this idea was first introduced in [14], and since then it has been investigated in many other papers. We refer to [13, 24], and references therein, for a review and comparison of multi-step accelerated methods for consensus. The simplest multi-step extension of gradient methods is Polyak's heavy ball, which involves adding a momentum term to the standard update and yields a primal iterate of the form x^t = W x^{t−1} + γ(x^{t−1} − x^{t−2}). Another popular multi-step method involves Nesterov's acceleration, and yields x^t = (1 + γ) W x^{t−1} − γ W x^{t−2}. Aligned with the idea of adding a momentum term is the idea of adding a shift-register term, which yields x^t = (1 + γ) W x^{t−1} − γ x^{t−2}. For our purposes, we note that these methods can be written as

    (x^t, x^{t−1}) = K (x^{t−1}, x^{t−2}),   (3)

for a certain matrix K ∈ R^{2n×2n}. As in the case of lifted Markov chain techniques, multi-step methods are also able to achieve accelerated rates by exploiting some form of global information: the choice of the parameter γ that yields subdiffusive rates depends on the eigenvalues of W.
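As a concrete, purely illustrative comparison, the sketch below contrasts the diffusive update x^t = W x^{t−1} with the two-step recursion x^t = γ W x^{t−1} + (1 − γ) x^{t−2}, which is the shift-register update written with γ in place of 1 + γ' (so γ' = γ − 1), using the spectrum-dependent tuning γ = 2/(1 + √(1 − ρ_W²)) that reappears in Theorem 1. The test matrix, horizon, and variable names are our own assumptions.

```python
import numpy as np

# Hypothetical test matrix: W = A/2 on a cycle with an odd number of nodes,
# so that W is symmetric, W 1 = 1, and rho_W = rho(W - 11^T/n) < 1.
n = 41
W = np.zeros((n, n))
for v in range(n):
    W[v, (v - 1) % n] = W[v, (v + 1) % n] = 0.5

b = np.random.default_rng(1).random(n)
avg = np.full(n, b.mean())
rho_W = max(abs(np.linalg.eigvalsh(W - np.ones((n, n)) / n)))
gamma = 2.0 / (1.0 + np.sqrt(1.0 - rho_W ** 2))   # spectrum-dependent tuning

x = b.copy()                                       # diffusive iterate
y_prev, y = b.copy(), b.copy()                     # two-step iterates y^{t-2}, y^{t-1}
for t in range(300):
    x = W @ x
    # Shift-register form: y^t = gamma W y^{t-1} + (1 - gamma) y^{t-2}
    # (equivalently (1 + gamma') W y^{t-1} - gamma' y^{t-2} with gamma' = gamma - 1).
    y_prev, y = y, gamma * (W @ y) + (1.0 - gamma) * y_prev

print("diffusive error      :", np.linalg.norm(x - avg))
print("shift-register error :", np.linalg.norm(y - avg))
```

After a few hundred iterations on the cycle, the two-step iterate is already accurate to many digits while the diffusive iterate has barely contracted, illustrating the role of the memory term.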
Remark 1. Beyond lifted Markov chains techniques and accelerated first order methods, many other algorithms have been proposed to solve the consensus problem. The literature is vast. As we focus on Min-Sum schemes, an exhaustive literature review on consensus is beyond the scope of our work. Of particular interest for our results is the distributed ADMM approach [3, 43, 38]. Recently, in [12], for a class of unconstrained problems with quadratic objective functions, it has been shown that message-passing ADMM schemes can be interpreted as liftings of gradient descent techniques. This prompts further investigation to connect Min-Sum, ADMM, and accelerated first order methods.

In the next two sections we show that Min-Sum Splitting bears similarities with both types of accelerated methods described above. On the one hand, in Section 5 we show that the estimates x^t_v of Algorithm 1 applied to the network averaging problem can be interpreted as the result of a linear process supported on a lifted space, i.e., the space Ē of directed edges associated to the undirected edges of G. On the other hand, in Section 6 we show that the estimates x^t_v can be seen as the result of a linear multi-step process supported on the nodes of G, which can be written as in (3). Later on, in Section 7 and Section 8, we will see that the similarities just described go beyond the structure of the processes, and extend to the acceleration mechanism itself. In particular, the choice of splitting parameters that yields subdiffusive convergence rates, matching the asymptotic rates of shift-register methods, is also shown to depend on global information about G.

5 Min-Sum Splitting for consensus

We apply Min-Sum Splitting to solve the network averaging problem. We show that in this case the message-passing protocol is a linear exchange of parameters associated to the directed edges in Ē. Given δ ∈ R and Γ ∈ R^{V×V} symmetric, let ĥ(δ) ∈ R^Ē be the vector defined as ĥ(δ)_{wv} := b_w + (1 − 1/δ) b_v, and let K̂(δ, Γ) ∈ R^{Ē×Ē} be the matrix defined as

    K̂(δ, Γ)_{wv,zu} :=  δ Γ_{zw}                if u = w, z ∈ N(w) \ {v},
                         δ (Γ_{vw} − 1)          if u = w, z = v,
                         (δ − 1) Γ_{zv}          if u = v, z ∈ N(v) \ {w},
                         (δ − 1)(Γ_{wv} − 1)     if u = v, z = w,
                         0                        otherwise.   (4)

Consider Algorithm 2 with initial conditions R̂^0 = (R̂^0_{vw})_{(v,w)∈Ē} ∈ R^Ē and r̂^0 = (r̂^0_{vw})_{(v,w)∈Ē} ∈ R^Ē.

Algorithm 2: Min-Sum Splitting, consensus problem, quadratic case
Input: R̂^0, r̂^0 ∈ R^Ē; δ ∈ R, Γ ∈ R^{V×V} symmetric; K̂(δ, Γ) defined in (4); time t ≥ 1.
for s ∈ {1, . . . , t} do
    R̂^s = (2 − 1/δ) 1 + K̂(δ, Γ) R̂^{s−1};
    r̂^s = ĥ(δ) + K̂(δ, Γ) r̂^{s−1};
Output: x^t_v := (b_v + δ Σ_{w∈N(v)} Γ_{wv} r̂^t_{wv}) / (1 + δ Σ_{w∈N(v)} Γ_{wv} R̂^t_{wv}),   v ∈ V.

Proposition 1. Let δ ∈ R and Γ ∈ R^{V×V} symmetric be given. Consider Algorithm 1 applied to problem (2) with φ_v(z) := ½ z² − b_v z and with quadratic initial messages: μ̂^0_{vw}(z) = ½ R̂^0_{vw} z² − r̂^0_{vw} z, for some R̂^0_{vw} > 0 and r̂^0_{vw} ∈ R. Then the messages remain quadratic, i.e., μ̂^s_{vw}(z) = ½ R̂^s_{vw} z² − r̂^s_{vw} z for any s ≥ 1, and the parameters evolve as in Algorithm 2. If 1 + δ Σ_{w∈N(v)} Γ_{wv} R̂^t_{wv} > 0 for any v ∈ V and t ≥ 1, then the output of Algorithm 2 coincides with the output of Algorithm 1.
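A direct way to see that Algorithm 2 is just a linear system on the 2m directed edges is to assemble K̂(δ, Γ) and ĥ(δ) entrywise and iterate. The sketch below is our own illustration, not code from the paper: the entries of K̂ follow the definition in (4) as reconstructed above, the parameters are the δ = 1, Γ = γW choice used later in Theorem 1, and the graph and helper names are assumptions.

```python
import numpy as np

# Hypothetical setup: W = A/2 on an odd cycle (symmetric, W 1 = 1, zero diagonal).
n = 11
W = np.zeros((n, n))
for v in range(n):
    W[v, (v - 1) % n] = W[v, (v + 1) % n] = 0.5
nbrs = [np.nonzero(W[v])[0] for v in range(n)]

rho_W = max(abs(np.linalg.eigvalsh(W - np.ones((n, n)) / n)))
gamma = 2.0 / (1.0 + np.sqrt(1.0 - rho_W ** 2))
delta, Gamma = 1.0, gamma * W                     # splitting parameters as in Theorem 1

b = np.random.default_rng(2).random(n)
E_dir = [(w, v) for w in range(n) for v in nbrs[w]]   # directed edges (w, v)
idx = {e: i for i, e in enumerate(E_dir)}

# Assemble K_hat(delta, Gamma) and h_hat(delta) entry by entry, following (4).
K_hat = np.zeros((len(E_dir), len(E_dir)))
h_hat = np.zeros(len(E_dir))
for (w, v) in E_dir:
    i = idx[(w, v)]
    h_hat[i] = b[w] + (1.0 - 1.0 / delta) * b[v]
    for z in nbrs[w]:                             # columns (z, w): messages into w
        K_hat[i, idx[(z, w)]] = delta * (Gamma[v, w] - 1.0) if z == v else delta * Gamma[z, w]
    for z in nbrs[v]:                             # columns (z, v): messages into v
        K_hat[i, idx[(z, v)]] = (delta - 1.0) * (Gamma[w, v] - 1.0) if z == w else (delta - 1.0) * Gamma[z, v]

R_hat = np.zeros(len(E_dir))                      # initial conditions as in Theorem 1
r_hat = np.zeros(len(E_dir))
for s in range(100):
    R_hat = (2.0 - 1.0 / delta) * np.ones(len(E_dir)) + K_hat @ R_hat
    r_hat = h_hat + K_hat @ r_hat

x = np.array([(b[v] + delta * sum(Gamma[w, v] * r_hat[idx[(w, v)]] for w in nbrs[v]))
              / (1.0 + delta * sum(Gamma[w, v] * R_hat[idx[(w, v)]] for w in nbrs[v]))
              for v in range(n)])
print("Min-Sum Splitting estimates (first 3):", x[:3])
print("network average                      :", b.mean())
```

With δ = 1 the second family of entries in (4) vanishes and the scheme becomes a purely directed exchange along the edges; the printed estimates can be checked against the node-based auxiliary scheme of the next section.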
6 Auxiliary message-passing scheme

We show that the output of Algorithm 2 can be tracked by a new message-passing scheme that corresponds to a multi-step linear exchange of parameters associated to the nodes of G. This auxiliary algorithm represents the main tool to establish convergence rates for the Min-Sum Splitting protocol, i.e., Theorem 1 below. The intuition behind the auxiliary process is that while Algorithm 1 (hence, Algorithm 2) involves an exchange of messages supported on the directed edges Ē, the computation of the estimates x^t_v only involves the belief functions μ^t_v, which are supported on the nodes of G. Due to the simple nature of the pairwise equality constraints in the consensus problem, in the present case a reparametrization makes it possible to track the output of Min-Sum via an algorithm that directly updates the belief functions on the nodes of the graph, which yields Algorithm 3. Given δ ∈ R and Γ ∈ R^{n×n} symmetric, define the matrix K(δ, Γ) ∈ R^{2n×2n} as

    K(δ, Γ) := [ (1 − δ)I − (1 − δ)diag(Γ1) + δΓ      δI       ]
               [ δI − δ diag(Γ1) + (1 − δ)Γ            (1 − δ)I ],   (5)

where I ∈ R^{V×V} is the identity matrix and diag(Γ1) ∈ R^{V×V} is the diagonal matrix with (diag(Γ1))_{vv} = (Γ1)_v = Σ_{w∈N(v)} Γ_{vw}. Consider Algorithm 3 with initial conditions R^0, r^0, Q^0, q^0 ∈ R^V.

Algorithm 3: Auxiliary message-passing
Input: R^0, r^0, Q^0, q^0 ∈ R^V; δ ∈ R, Γ ∈ R^{V×V} symmetric; K(δ, Γ) defined in (5); time t ≥ 1.
for s ∈ {1, . . . , t} do
    (r^s; q^s) = K(δ, Γ) (r^{s−1}; q^{s−1});
    (R^s; Q^s) = K(δ, Γ) (R^{s−1}; Q^{s−1});
Output: x^t_v := r^t_v / R^t_v,   v ∈ V.

Proposition 2. Let δ ∈ R and Γ ∈ R^{V×V} symmetric be given. The output of Algorithm 2 with initial conditions R̂^0, r̂^0 ∈ R^Ē is the output of Algorithm 3 with R^0_v := 1 + δ Σ_{w∈N(v)} Γ_{wv} R̂^0_{wv}, Q^0_v := 1 − δ Σ_{w∈N(v)} Γ_{wv} R̂^0_{wv}, r^0_v := b_v + δ Σ_{w∈N(v)} Γ_{wv} r̂^0_{wv}, and q^0_v := b_v − δ Σ_{w∈N(v)} Γ_{vw} r̂^0_{vw}.

Proposition 2 shows that, upon proper initialization, the outputs of Algorithm 2 and Algorithm 3 are equivalent. Hence, Algorithm 3 represents a tool to investigate the convergence behavior of the Min-Sum Splitting algorithm. Analytically, the advantage of the formulation given in Algorithm 3 over the one given in Algorithm 2 is that the former involves two coupled systems of n equations whose convergence behavior can be explicitly linked to the spectral properties of the n × n matrix Γ, as we will see in Theorem 1 below. On the contrary, the linear system of 2m equations in Algorithm 2 does not seem to exhibit an immediate link to the spectral properties of Γ. In this respect, we note that the previous paper that investigated Min-Sum schemes for consensus, i.e., [27], characterized the convergence rate of the algorithm under consideration (albeit only in the case of d-regular graphs, and upon initializing the quadratic terms to the fixed point) in terms of the spectral gap of a matrix that controls a linear system of 2m equations. However, the authors only list results on the behavior of this spectral gap in the case of cycle graphs, i.e., d = 2, and present a conjecture for 2d-tori.
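The auxiliary scheme is straightforward to simulate. The sketch below is our own illustration (the block structure of K(δ, Γ) follows (5) as reconstructed above, and the graph, parameters, and names are assumptions): it runs Algorithm 3 with δ = 1 and Γ = γW, i.e., the setting of Theorem 1, with the initialization of Proposition 2 corresponding to R̂^0 = r̂^0 = 0.

```python
import numpy as np

def K_matrix(delta, Gamma):
    """Block matrix K(delta, Gamma) of (5), assembled from I, diag(Gamma 1), Gamma."""
    n = Gamma.shape[0]
    I = np.eye(n)
    D = np.diag(Gamma.sum(axis=1))        # diag(Gamma 1)
    top = np.hstack([(1 - delta) * I - (1 - delta) * D + delta * Gamma, delta * I])
    bot = np.hstack([delta * I - delta * D + (1 - delta) * Gamma, (1 - delta) * I])
    return np.vstack([top, bot])

# Hypothetical setup: W = A/2 on an odd cycle; delta = 1, Gamma = gamma * W.
n = 11
W = np.zeros((n, n))
for v in range(n):
    W[v, (v - 1) % n] = W[v, (v + 1) % n] = 0.5
rho_W = max(abs(np.linalg.eigvalsh(W - np.ones((n, n)) / n)))
gamma = 2.0 / (1.0 + np.sqrt(1.0 - rho_W ** 2))
K = K_matrix(1.0, gamma * W)

b = np.random.default_rng(3).random(n)
# Initialization matching Proposition 2 with R_hat^0 = r_hat^0 = 0.
Rq = np.concatenate([np.ones(n), np.ones(n)])     # (R^0; Q^0)
rq = np.concatenate([b, b])                       # (r^0; q^0)
for s in range(100):
    Rq = K @ Rq
    rq = K @ rq

x = rq[:n] / Rq[:n]                               # output x^t_v = r^t_v / R^t_v
print("auxiliary-scheme estimates (first 3):", x[:3])
print("network average                     :", b.mean())
```

For δ = 1 and Γ = γW the two coupled n-dimensional systems collapse to the recursion r^s = γW r^{s−1} + (1 − γ) r^{s−2}, which is exactly the shift-register iteration of Section 4; this is the observation that Theorem 1 makes precise.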
7 Accelerated convergence rates for Min-Sum Splitting

We investigate the convergence behavior of the Min-Sum Splitting algorithm to solve problem (2) with quadratic objective functions. Henceforth, without loss of generality, let b ∈ R^V be given with 0 < b_v < 1 for each v ∈ V, and let φ_v(z) := ½ z² − b_v z. Define b̄ := (1/n) Σ_{v∈V} b_v. Recall from [27] that the ordinary Min-Sum algorithm (i.e., Algorithm 2 with δ = 1 and Γ = A, where A is the adjacency matrix of the graph G) does not converge if the graph G has a cycle. We now show that a proper choice of the tuning parameters allows Min-Sum Splitting to converge to the problem solution in a subdiffusive way. The proof of this result, which is contained in the supplementary material, relies on the use of the auxiliary method defined in Algorithm 3 to track the evolution of the Min-Sum Splitting scheme. Here, recall that ‖x‖ denotes the ℓ2 norm of a given vector x, ‖M‖ denotes the ℓ2 matrix norm of a given matrix M, and ρ(M) its spectral radius.

Theorem 1. Let W ∈ R^{V×V} be a symmetric matrix with W1 = 1 and ρ_W := ρ(W − 11^T/n) < 1. Let δ = 1 and Γ = γW, with γ = 2/(1 + √(1 − ρ_W²)). Let x^t be the output at time t of Algorithm 2 with initial conditions R̂^0 = r̂^0 = 0. Define

    K := [ γW        I ]        K_∞ := 1/((2 − γ)n) [ 11^T           11^T        ]
         [ (1 − γ)I  0 ],                            [ (1 − γ)11^T   (1 − γ)11^T ].   (6)

Then, for any v ∈ V we have lim_{t→∞} x^t_v = b̄ and ‖x^t − b̄1‖ ≤ (4√(2n)/(2 − γ)) ‖(K − K_∞)^t‖. The asymptotic rate of convergence is given by ρ_K := ρ(K − K_∞) = lim_{t→∞} ‖(K − K_∞)^t‖^{1/t} = √((1 − √(1 − ρ_W²))/(1 + √(1 − ρ_W²))) < ρ_W < 1, which satisfies (1/2)√(1/(1 − ρ_W)) ≤ 1/(1 − ρ_K) ≤ √(1/(1 − ρ_W)).

Theorem 1 shows that the choice of splitting parameters δ = 1 and Γ = γW, where γ and W are defined as in the statement of the theorem, allows the Min-Sum Splitting scheme to achieve the asymptotic rate of convergence given by the second largest eigenvalue in magnitude of the matrix K defined in (6), i.e., the quantity ρ_K. The matrix K is the same matrix that describes shift-register methods for consensus [45, 4, 24]. In fact, the proof of Theorem 1 relies on the spectral analysis previously established for shift-registers, which can be traced back to [15]. See also [13, 24]. Following [27], let us consider the absolute measure of error given by ‖x^t − b̄1‖/√n (recall that we assume 0 < b_v < 1, so that ‖b‖ ≤ √n). From Theorem 1 it follows that, asymptotically, we have ‖x^t − b̄1‖/√n ≤ 4√2 ρ_K^t/(2 − γ). If we define the asymptotic convergence time as the minimum time t so that, asymptotically, ‖x^t − b̄1‖/√n ≤ ε, then the Min-Sum Splitting scheme investigated in Theorem 1 has an asymptotic convergence time that is O(1/(1 − ρ_K) log{[1/(1 − ρ_K)]/ε}). Given the last bound in Theorem 1, this result achieves (modulo logarithmic terms) a square-root improvement over the convergence time of diffusive methods, which scales like Θ(1/(1 − ρ_W) log 1/ε). For cycle graphs and, more generally, for higher-dimensional tori (where 1/(1 − ρ_W) is Θ(D²) so that 1/(1 − ρ_K) is Θ(D) [33, 1]) the convergence time is O(D log(D/ε)), where D is the graph diameter.

As prescribed by Theorem 1, the choice of γ that makes the Min-Sum scheme achieve a subdiffusive rate depends on global properties of the graph G. Namely, γ depends on the quantity ρ_W, the second largest eigenvalue in magnitude of the matrix W. This fact connects the acceleration mechanism induced by splitting in the Min-Sum scheme to the acceleration mechanism of lifted Markov chains techniques (see Figure 1) and multi-step first order methods, as described in Section 4. It remains to be investigated how choices of splitting parameters different from the ones investigated in Theorem 1 affect the convergence behavior of the Min-Sum Splitting algorithm.
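The rate predicted by Theorem 1 can be checked numerically. The sketch below is our own illustration (the graph choice and names are assumptions, and K, K_∞ are assembled exactly as in the reconstructed statement of the theorem): it compares ρ(K − K_∞) with the closed-form expression for ρ_K and reports the two convergence-time scales 1/(1 − ρ_W) and 1/(1 − ρ_K).

```python
import numpy as np

# Hypothetical test graph: W = A/2 on an odd cycle, so rho_W = cos(pi/n) < 1.
n = 41
W = np.zeros((n, n))
for v in range(n):
    W[v, (v - 1) % n] = W[v, (v + 1) % n] = 0.5
J = np.ones((n, n)) / n
rho_W = max(abs(np.linalg.eigvalsh(W - J)))

gamma = 2.0 / (1.0 + np.sqrt(1.0 - rho_W ** 2))
I, Z = np.eye(n), np.zeros((n, n))
K = np.block([[gamma * W, I], [(1.0 - gamma) * I, Z]])
K_inf = np.block([[J, J], [(1.0 - gamma) * J, (1.0 - gamma) * J]]) / (2.0 - gamma)

rho_K_num = max(abs(np.linalg.eigvals(K - K_inf)))       # K is not symmetric: use eigvals
rho_K_formula = np.sqrt((1 - np.sqrt(1 - rho_W**2)) / (1 + np.sqrt(1 - rho_W**2)))

print("rho_W                  :", rho_W)
print("rho_K (numerical)      :", rho_K_num)
print("rho_K (closed form)    :", rho_K_formula)
print("1/(1-rho_W) (diffusive):", 1 / (1 - rho_W))        # grows like D^2 on the cycle
print("1/(1-rho_K) (accel.)   :", 1 / (1 - rho_K_formula))
```

On the cycle, 1/(1 − ρ_W) grows like D² while 1/(1 − ρ_K) grows like D, which is the square-root improvement discussed above.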
8 Conclusions

The Min-Sum Splitting algorithm has been previously observed to yield convergence in settings where the ordinary Min-Sum protocol does not converge [35]. In this paper we proved that the introduction of splitting parameters is not only fundamental to guarantee the convergence of the Min-Sum scheme applied to the consensus problem, but that proper tuning of these parameters yields accelerated convergence rates. As prescribed by Theorem 1, the choice of splitting parameters that yields subdiffusive rates involves global information, via the spectral gap of a matrix associated to the original graph (see the choice of γ in Theorem 1). The acceleration mechanism exploited by Min-Sum Splitting is analogous to the acceleration mechanism exploited by lifted Markov chain techniques, where the transition matrix of the lifted random walk is typically chosen to depend on the total number of nodes in the graph [8, 5] or on its diameter [37] (global pieces of information), and to the acceleration mechanism exploited by multi-step gradient methods, where the momentum/shift-register term is chosen as a function of the eigenvalues of a matrix supported on the original graph [13] (again, global information). Prior to our results, this connection seems not to have been established in the literature. Our findings motivate further studies to generalize the acceleration due to splittings to other problem instances, beyond consensus.

Acknowledgements

This work was partially supported by the NSF under Grant EECS-1609484.

References

[1] David Aldous and James Allen Fill. Reversible Markov chains and random walks on graphs, 2002. Unfinished monograph, recompiled 2014, available at http://www.stat.berkeley.edu/~aldous/RWG/book.html.
[2] S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508-2530, 2006.
[3] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn., 3(1):1-122, 2011.
[4] Ming Cao, Daniel A. Spielman, and Edmund M. Yeh. Accelerated gossip algorithms for distributed computation. In Proc. 44th Ann. Allerton Conf. Commun., Contr., Comput., pages 952-959, 2006.
[5] Fang Chen, László Lovász, and Igor Pak. Lifting Markov chains to speed up mixing. In Proceedings of the Thirty-first Annual ACM Symposium on Theory of Computing, pages 275-281, 1999.
[6] J. Chen and A. H. Sayed. Diffusion adaptation strategies for distributed optimization and learning over networks. IEEE Transactions on Signal Processing, 60(8):4289-4305, 2012.
[7] P. Diaconis and L. Saloff-Coste. Moderate growth and random walk on finite groups. Geometric & Functional Analysis GAFA, 4(1):1-36, 1994.
[8] Persi Diaconis, Susan Holmes, and Radford M. Neal. Analysis of a nonreversible Markov chain sampler. The Annals of Applied Probability, 10(3):726-752, 2000.
[9] A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847-1864, 2010.
[10] John C. Duchi, Alekh Agarwal, and Martin J. Wainwright. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Trans. Automat. Contr., 57(3):592-606, 2012.
[11] Pedro A. Forero, Alfonso Cano, and Georgios B. Giannakis. Consensus-based distributed support vector machines. J. Mach. Learn. Res., 11:1663-1707, 2010.
[12] G. França and J. Bento. Markov chain lifting and distributed ADMM. IEEE Signal Processing Letters, 24(3):294-298, 2017.
[13] E. Ghadimi, I. Shames, and M. Johansson. Multi-step gradient methods for networked optimization. IEEE Transactions on Signal Processing, 61(21):5417-5429, 2013.
[14] Bhaskar Ghosh, S. Muthukrishnan, and Martin H. Schultz. First and second order diffusive methods for rapid, coarse, distributed load balancing (extended abstract). In Proceedings of the Eighth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 72-81, 1996.
[15] Gene H. Golub and Richard S. Varga. Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numer. Math., 3(1):147-156, 1961.
[16] D. Jakovetić, J. Xavier, and J. M. F. Moura. Fast distributed gradient methods. IEEE Transactions on Automatic Control, 59(5):1131-1146, May 2014.
[17] Björn Johansson, Maben Rabi, and Mikael Johansson. A randomized incremental subgradient method for distributed optimization in networked systems. SIAM Journal on Optimization, 20(3):1157-1170, 2010.
[18] K. Jung, D. Shah, and J. Shin. Distributed averaging via lifted Markov chains. IEEE Transactions on Information Theory, 56(1):634-647, 2010.
[19] S. Kar, S. Aldosari, and J. M. F. Moura. Topology for distributed inference on graphs. IEEE Transactions on Signal Processing, 56(6):2609-2613, 2008.
[20] V. Lesser, C. Ortiz, and M. Tambe, editors. Distributed Sensor Networks: A Multiagent Perspective, volume 9. Kluwer Academic Publishers, 2003.
[21] Dan Li, K. D. Wong, Yu Hen Hu, and A. M. Sayeed. Detection, classification, and tracking of targets. IEEE Signal Processing Magazine, 19(2):17-29, 2002.
[22] W. Li and H. Dai. Accelerating distributed consensus via lifting Markov chains. In 2007 IEEE International Symposium on Information Theory, pages 2881-2885, 2007.
[23] W. Li, H. Dai, and Y. Zhang. Location-aided fast distributed consensus in wireless networks. IEEE Transactions on Information Theory, 56(12):6208-6227, 2010.
[24] Ji Liu, Brian D. O. Anderson, Ming Cao, and A. Stephen Morse. Analysis of accelerated gossip algorithms. Automatica, 49(4):873-883, 2013.
[25] Dmitry M. Malioutov, Jason K. Johnson, and Alan S. Willsky. Walk-sums and belief propagation in Gaussian graphical models. J. Mach. Learn. Res., 7:2031-2064, 2006.
[26] G. Mateos, J. A. Bazerque, and G. B. Giannakis. Distributed sparse linear regression. IEEE Transactions on Signal Processing, 58(10):5262-5276, 2010.
[27] C. C. Moallemi and B. Van Roy. Consensus propagation. IEEE Transactions on Information Theory, 52(11):4753-4766, 2006.
[28] Ciamac C. Moallemi and Benjamin Van Roy. Convergence of min-sum message-passing for convex optimization. IEEE Transactions on Information Theory, 56(4):2041-2050, 2010.
[29] A. Nedić and A. Ozdaglar. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48-61, 2009.
[30] A. Olshevsky. Linear Time Average Consensus on Fixed Graphs and Implications for Decentralized Optimization and Multi-Agent Control. arXiv e-prints (1411.4186), 2014.
[31] J. B. Predd, S. R. Kulkarni, and H. V. Poor. A collaborative training algorithm for distributed learning. IEEE Transactions on Information Theory, 55(4):1856-1871, 2009.
[32] M. G. Rabbat, R. D. Nowak, and J. A. Bucklew. Generalized consensus computation in networked systems with erasure links. In IEEE 6th Workshop on Signal Processing Advances in Wireless Communications, pages 1088-1092, 2005.
[33] Sébastien Roch. Bounding fastest mixing. Electron. Commun. Probab., 10:282-296, 2005.
[34] N. Ruozzi and S. Tatikonda. Message-passing algorithms: Reparameterizations and splittings. IEEE Transactions on Information Theory, 59(9):5860-5881, 2013.
[35] Nicholas Ruozzi and Sekhar Tatikonda. Message-passing algorithms for quadratic minimization. Journal of Machine Learning Research, 14:2287-2314, 2013.
[36] Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 3027-3036, 2017.
[37] Devavrat Shah. Gossip algorithms. Foundations and Trends in Networking, 3(1):1-125, 2009.
[38] W. Shi, Q. Ling, K. Yuan, G. Wu, and W. Yin. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750-1761, 2014.
[39] Wei Shi, Qing Ling, Gang Wu, and Wotao Yin. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944-966, 2015.
[40] S. Sundhar Ram, A. Nedić, and V. V. Veeravalli. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516-545, 2010.
[41] J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Control, 31(9):803-812, 1986.
[42] John N. Tsitsiklis. Problems in Decentralized Decision Making and Computation. PhD thesis, Department of EECS, MIT, 1984.
[43] E. Wei and A. Ozdaglar. Distributed alternating direction method of multipliers. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 5445-5450, 2012.
[44] Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1):65-78, 2004.
[45] David M. Young. Second-degree iterative methods for the solution of large linear systems. Journal of Approximation Theory, 5(2):137-148, 1972.