# Distributed Stochastic Gradient Descent with Event-Triggered Communication

*The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)*

Jemin George¹ and Prudhvi Gurram²,¹
¹CCDC Army Research Laboratory, Adelphi, MD 20783 (jemin.george.civ@mail.mil)
²Booz Allen Hamilton, McLean, VA 22102 (gurram_prudhvi@bah.com)

**Abstract.** We develop a Distributed Event-Triggered Stochastic GRAdient Descent (DETSGRAD) algorithm for solving non-convex optimization problems typically encountered in distributed deep learning. We propose a novel communication triggering mechanism that allows the networked agents to update their model parameters aperiodically, and we provide sufficient conditions on the algorithm step-sizes that guarantee asymptotic mean-square convergence. The algorithm is applied to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural networks to perform image classification while aperiodically sharing the model parameters with their one-hop neighbors. Results indicate that all agents report similar performance, comparable to that of a centrally trained neural network, while the event-triggered communication provides a significant reduction in inter-agent communication. Results also show that the proposed algorithm allows the individual agents to classify images even though the training data corresponding to all the classes are not locally available to each agent.

## Introduction

With the advent of smart devices, there has been an exponential growth in the amount of data collected and stored locally on individual devices. Applying machine learning to extract value from such massive data, to provide data-driven insights, decisions, and predictions, has been a popular research topic as well as the focus of numerous businesses. However, porting these vast amounts of data to a data center to conduct traditional machine learning has raised two main issues: (i) the communication challenge associated with transferring vast amounts of data from a large number of devices to a central location, and (ii) the privacy issues associated with sharing raw data. Distributed machine learning techniques based on the server-client architecture (Li et al. 2014a; 2014b; Zhang, Alqahtani, and Demirbas 2017) have been proposed as solutions to this problem. On one extreme end of this architecture, we have the parameter-server approach, where a server or group of servers initiates distributed learning by pushing the current model to a set of client nodes that host the data. The client nodes compute the local gradients or parameter updates and communicate them to the server nodes; the server nodes aggregate these values and update the current model (Zhang et al. 2018; Li et al. 2014b). On the other extreme, we have federated learning, where each client node obtains a local solution to the learning problem and the server node computes a global model by averaging the local models (Konečný et al. 2016; McMahan et al. 2017). Besides the server-client architecture, a shared-memory (multicore/multi-GPU) architecture, where different processors independently compute the gradients and update the global model parameter using a shared memory, has also been proposed as a solution to the distributed machine learning problem (Recht et al. 2011; De Sa et al. 2015; Chaturapruek, Duchi, and Ré 2015; Feyzmahdavian, Aytekin, and Johansson 2016).
However, none of the above-mentioned learning techniques are truly distributed, since they follow a master-slave architecture and do not involve any peer-to-peer communication. Furthermore, these techniques are not always robust: they are rendered useless if the master/server node or the shared memory fails. Therefore, we aim to develop a fully distributed machine learning architecture enabled by client-to-client interaction.

For large-scale machine learning, stochastic gradient descent (SGD) methods are often preferred over batch gradient methods (Bottou, Curtis, and Nocedal 2018) because (i) in many large-scale problems there is a good deal of redundancy in the data, so it is inefficient to use all the data in every optimization iteration; (ii) the computational cost of computing the batch gradient is much higher than that of a stochastic gradient; and (iii) stochastic methods are more suitable for online learning, where data arrive sequentially. Since most machine learning problems are non-convex, there is a need for distributed stochastic gradient methods for non-convex problems. Therefore, here we present a communication-efficient, distributed stochastic gradient algorithm for non-convex problems and demonstrate its utility for distributed machine learning.

## Related work

**Distributed non-convex optimization:** A few early examples of (non-stochastic or deterministic) distributed non-convex optimization algorithms include the Distributed Approximate Dual Subgradient (DADS) algorithm (Zhu and Martínez 2013), the NonconvEx primal-dual SpliTTing (NESTT) algorithm (Hajinezhad et al. 2016), and the proximal primal-dual algorithm (Prox-PDA) (Hong, Hajinezhad, and Zhao 2017). More recently, a non-convex version of the accelerated distributed augmented Lagrangians (ADAL) algorithm was presented in Chatzipanagiotis and Zavlanos (2017), and successive convex approximation (SCA)-based algorithms, such as the iNner cOnVex Approximation (NOVA) and in-Network succEssive conveX approximaTion (NEXT) algorithms, were given in Scutari, Facchinei, and Lampariello (2017) and Lorenzo and Scutari (2016), respectively. References (Hong 2018; Guo, Hug, and Tonguz 2017; Hong, Luo, and Razaviyayn 2016) provide several distributed non-convex optimization algorithms based on the alternating direction method of multipliers (ADMM). Non-convex versions of Decentralized Gradient Descent (DGD) and Proximal Decentralized Gradient Descent (Prox-DGD) are given in Zeng and Yin (2018). Finally, Zeroth-Order NonconvEx (ZONE) optimization algorithms for mesh networks (ZONE-M) and star networks (ZONE-S) are presented in Hajinezhad, Hong, and Garcia (2019). However, almost all of the aforementioned consensus optimization algorithms focus on non-stochastic problems and are extremely communication-heavy, because they require constant communication among the agents.

**Distributed convex SGD:** Within the consensus optimization literature, there exist several works on distributed stochastic gradient methods, but mainly for strongly convex optimization problems.
These include the stochastic subgradient-push method for distributed optimization over time-varying directed graphs given in Nedić and Olshevsky (2016), distributed stochastic optimization over random networks given in Jakovetić et al. (2018), the Stochastic Unbiased Curvature-aided Gradient (SUCAG) method given in Wai et al. (2018), and the distributed stochastic gradient tracking methods of Pu and Nedić (2018). There are very few works on distributed stochastic gradient methods for non-convex optimization (Tatarenko and Touri 2017; Bianchi and Jakubowicz 2013); however, the push-sum algorithm given in Tatarenko and Touri (2017) assumes there are no saddle points, and it often requires up to three times as many internal variables as the proposed algorithm. Compared to Bianchi and Jakubowicz (2013) and Tatarenko and Touri (2017), the proposed algorithm provides an explicit consensus rate and allows the parallel execution of the consensus-communication and gradient-computation steps.

**Parallel SGD:** There exist numerous asynchronous SGD algorithms aimed at parallelizing data-intensive machine learning tasks. The two popular asynchronous parallel implementations of SGD are the computer-network implementation originally proposed in Agarwal and Duchi (2011) and the shared-memory implementation introduced in Recht et al. (2011). The computer-network implementation follows the master-slave architecture, and Agarwal and Duchi (2011) showed that, for smooth convex problems, the delays due to asynchrony are asymptotically negligible. Feyzmahdavian, Aytekin, and Johansson (2016) extended the results of Agarwal and Duchi (2011) to regularized SGD. Extensions of the computer-network implementation of asynchronous SGD with variance reduction and with polynomially growing delays are given in Huo and Huang (2016) and Zhou et al. (2018), respectively. Recht et al. (2011) proposed HOGWILD!, a lock-free asynchronous parallel implementation of SGD on a shared-memory system, and proved a sublinear convergence rate for strongly convex smooth objectives. HOGWILD! has been applied to PageRank approximation (Mitliagkas et al. 2015), deep learning (Noel and Osindero 2014), and recommender systems (Yu et al. 2012). Duchi, Jordan, and McMahan (2013) extended the HOGWILD! algorithm to a dual averaging algorithm that works for non-smooth, non-strongly convex problems with sparse gradients. An extension of HOGWILD! called BUCKWILD! was introduced in De Sa et al. (2015) to account for quantization errors introduced by fixed-point arithmetic. Chaturapruek, Duchi, and Ré (2015) showed that, because of the noise inherent to the sampling process within SGD, the errors introduced by asynchrony in the shared-memory implementation are asymptotically negligible. Recently, several parallel SGD works have focused on adjusting the worker-server interaction period or frequency as a way to decrease the communication overhead: for example, Yu, Jin, and Yang (2019) and Yu, Yang, and Zhu (2019) use a fixed period, while Yu and Jin (2019) and Lin, Stich, and Jaggi (2018) propose an increasing period. A detailed comparison of the computer-network and shared-memory implementations is given in Lian et al. (2015). Again, the aforementioned asynchronous algorithms are not distributed, since they rely on a shared memory or a central coordinator.
**Decentralized SGD:** Recently, numerous decentralized SGD algorithms for non-convex optimization have been proposed as solutions to the communication bottleneck often encountered in the server-client architecture (Lian et al. 2017; Jiang et al. 2017; Tang et al. 2018; Lian et al. 2018; Wang and Joshi 2018; Haddadpour et al. 2019; Assran et al. 2019; Wang et al. 2019). However, almost all of these works focus primarily on the performance of the algorithm over a fixed time interval, and the constant algorithm step-size, which often depends on the final time, is selected to speed up the convergence rate. These SGD algorithms with constant step-size can only guarantee convergence to some ϵ-ball around a stationary point. Furthermore, most of the aforementioned decentralized SGD algorithms provide convergence rates in terms of the average of all local estimates of the global minimizer, without ever proving a similar or faster consensus rate. In fact, most decentralized SGD algorithms can only provide bounded consensus, and they require a centralized averaging step after running the algorithm until the final time (Lian et al. 2017; Tang et al. 2018; Lian et al. 2018; Haddadpour et al. 2019; Wang et al. 2019). Finally, most applications of decentralized SGD focus on distributed learning scenarios where the data are distributed identically across all agents.

**Contribution:** Currently, there exists no distributed SGD algorithm for non-convex problems that does not require constant or periodic communication among the agents. In fact, the algorithms in (Lian et al. 2017; Jiang et al. 2017; Tang et al. 2018; Lian et al. 2018; Wang and Joshi 2018; Haddadpour et al. 2019; Assran et al. 2019; Wang et al. 2019) all rely on periodic communication even when the local model has not changed from the previously communicated model. This is a waste of resources, especially in wireless settings, and we therefore propose an approach that allows a node to transmit only if its local model has significantly changed from the previously communicated model. The contributions of this paper are three-fold: (i) we propose a fully distributed machine learning architecture; (ii) we present a distributed SGD algorithm built on a novel communication triggering mechanism, and provide sufficient conditions on the step-sizes such that the algorithm is mean-square convergent; and (iii) we demonstrate the efficacy of the proposed event-triggered SGD algorithm for distributed supervised learning with i.i.d. and, more importantly, non-i.i.d. data.

**Notation:** Let $\mathbb{R}^{n\times m}$ denote the set of $n\times m$ real matrices. For a vector $\phi$, $\phi_i$ is its $i$-th entry. The $n\times n$ identity matrix is denoted $I_n$, and $\mathbf{1}_n$ denotes the $n$-dimensional vector of all ones. For $p\in[1,\infty]$, the $p$-norm of a vector $x$ is denoted $\|x\|_p$. For matrices $A\in\mathbb{R}^{m\times n}$ and $B\in\mathbb{R}^{p\times q}$, $A\otimes B\in\mathbb{R}^{mp\times nq}$ denotes their Kronecker product. For a graph $\mathcal{G}(\mathcal{V},\mathcal{E})$ of order $n$, $\mathcal{V}=\{1,\ldots,n\}$ represents the agents or nodes, and the communication links between the agents are represented as $\mathcal{E}=\{e_1,\ldots,e_\ell\}\subseteq\mathcal{V}\times\mathcal{V}$. Let $\mathcal{N}_i=\{j\in\mathcal{V} : (i,j)\in\mathcal{E}\}$ denote the set of neighbors of node $i$. Let $\mathcal{A}=[a_{ij}]\in\mathbb{R}^{n\times n}$ be the adjacency matrix, with entries $a_{ij}=1$ if $(i,j)\in\mathcal{E}$ and zero otherwise. Define $\Delta=\mathrm{diag}(\mathcal{A}\mathbf{1}_n)$ as the in-degree matrix and $\mathcal{L}=\Delta-\mathcal{A}$ as the graph Laplacian.
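As a concrete illustration (ours, not the paper's), the following NumPy sketch builds the adjacency matrix and Laplacian of the 10-agent ring used later in the experiments, and verifies the spectral condition on $b$ that Assumption 5 imposes below:

```python
import numpy as np

def ring_laplacian(n: int) -> np.ndarray:
    """Laplacian L = Δ − A of an undirected, unweighted ring of n agents."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i - 1) % n] = 1.0  # left neighbor
        A[i, (i + 1) % n] = 1.0  # right neighbor
    return np.diag(A.sum(axis=1)) - A  # Δ = diag(A 1_n)

L = ring_laplacian(10)
sigma_max = np.linalg.norm(L, 2)  # largest singular value; equals 4 for this even ring
b = 0.9 / sigma_max               # any b < 1/σ_max(L) satisfies Assumption 5
eig = np.sort(np.linalg.eigvalsh(np.eye(10) - b * L))
print(eig[-1], np.abs(eig[:-1]).max())  # one eigenvalue at 1, the rest inside the unit circle
```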
## Distributed machine learning

Our problem formulation closely follows the centralized machine learning problem discussed in Bottou, Curtis, and Nocedal (2018). Consider a networked set of $n$ agents, each with a set of $m_i$, $i=1,\ldots,n$, independently drawn input-output samples $\{x_i^j, y_i^j\}_{j=1}^{m_i}$, where $x_i^j\in\mathbb{R}^{d_x}$ and $y_i^j\in\mathbb{R}^{d_y}$ are the $j$-th input and output data, respectively, associated with the $i$-th agent. For example, the inputs could be images and the outputs could be labels. Let $h(\cdot\,;\cdot):\mathbb{R}^{d_x}\times\mathbb{R}^{d_w}\to\mathbb{R}^{d_y}$ denote the prediction function, fully parameterized by the vector $w\in\mathbb{R}^{d_w}$. Each agent aims to find the parameter vector that minimizes the losses, $\ell(\cdot\,;\cdot):\mathbb{R}^{d_y}\times\mathbb{R}^{d_y}\to\mathbb{R}$, incurred from inaccurate predictions. Thus, the loss function $\ell(h(x_i;w),y_i)$ yields the loss incurred by the $i$-th agent, where $h(x_i;w)$ and $y_i$ are the predicted and true outputs, respectively, for the $i$-th node. Assuming the input-output space $\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}$ associated with the $i$-th agent is endowed with a probability measure $P_i:\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}\to[0,1]$, the objective function an agent wishes to minimize is

$$R_i(w) = \int_{\mathbb{R}^{d_x}\times\mathbb{R}^{d_y}} \ell(h(x_i;w),y_i)\,dP_i(x_i,y_i) = \mathbb{E}_{P_i}\big[\ell(h(x_i;w),y_i)\big]. \tag{1}$$

Here $R_i(w)$ denotes the expected risk given a parameter vector $w$ with respect to the probability distribution $P_i$. The total expected risk across all networked agents is

$$R(w) = \sum_{i=1}^{n} R_i(w) = \sum_{i=1}^{n} \mathbb{E}_{P_i}\big[\ell(h(x_i;w),y_i)\big]. \tag{2}$$

Minimizing the expected risk is desirable but often unattainable, since the distributions $P_i$ are unknown. Thus, in practice, each agent chooses to minimize the empirical risk $\bar{R}_i(w)$, defined as

$$\bar{R}_i(w) = \frac{1}{m_i}\sum_{j=1}^{m_i} \ell\big(h(x_i^j;w),\,y_i^j\big). \tag{3}$$

Here, the assumption is that $m_i$ is large enough that $\bar{R}_i(w)\approx R_i(w)$. The total empirical risk across all networked agents is

$$\bar{R}(w) = \sum_{i=1}^{n}\bar{R}_i(w) = \sum_{i=1}^{n}\frac{1}{m_i}\sum_{j=1}^{m_i} \ell\big(h(x_i^j;w),\,y_i^j\big). \tag{4}$$

To simplify the notation, let us represent a sample input-output pair $(x_i,y_i)$ by a random seed $\xi_i$, and let $\xi_i^j$ denote the $j$-th sample associated with the $i$-th agent. Define the loss incurred for a given $w$ and $\xi_i^j$ as $\ell(w,\xi_i^j)$. Now, the distributed learning problem can be posed as an optimization involving the sum of local empirical risks, i.e.,

$$\min_{w}\, f(w) = \min_{w}\,\sum_{i=1}^{n} f_i(w), \tag{5}$$

where $f_i(w) = \frac{1}{m_i}\sum_{j=1}^{m_i}\ell(w,\xi_i^j)$.

## Distributed event-triggered SGD

Here we propose a distributed event-triggered stochastic gradient method to solve (5). Let $w_i(k)\in\mathbb{R}^{d_w}$ denote agent $i$'s estimate of the optimizer at time instant $k$. Thus, for an arbitrary initial condition $w_i(0)$, the update rule at node $i$ is

$$w_i(k+1) = w_i(k) - \beta_k\sum_{j=1}^{n} a_{ij}\big(\hat{w}_i(k)-\hat{w}_j(k)\big) - \alpha_k\, g_i\big(w_i(k),\xi_i(k)\big), \tag{6}$$

where $\alpha_k$ and $\beta_k$ are hyperparameters to be specified, $a_{ij}$ are the entries of the adjacency matrix, and $g_i(w_i(k),\xi_i(k))$ represents either a simple stochastic gradient, a mini-batch stochastic gradient, or a stochastic quasi-Newton direction, i.e.,

$$g_i\big(w_i(k),\xi_i(k)\big) = \nabla\ell\big(w_i(k),\xi_i^k\big), \quad\text{or}\quad \frac{1}{n_i(k)}\sum_{s=1}^{n_i(k)}\nabla\ell\big(w_i(k),\xi_i^{k,s}\big), \quad\text{or}\quad H_i(k)\,\frac{1}{n_i(k)}\sum_{s=1}^{n_i(k)}\nabla\ell\big(w_i(k),\xi_i^{k,s}\big),$$

where $n_i(k)$ denotes the mini-batch size, $H_i(k)$ is a positive definite scaling matrix, $\xi_i^k$ represents the single random input-output pair sampled at time instant $k$, and $\xi_i^{k,s}$ denotes the $s$-th of the $n_i(k)$ random input-output pairs sampled at time instant $k$. For $i=1,\ldots,n$, the piecewise constant signal $\hat{w}_i(k)$, defined as

$$\hat{w}_i(k) = w_i(t_q^i), \quad k\in\big\{t_q^i,\, t_q^i+1,\,\ldots,\, t_{q+1}^i - 1\big\}, \tag{7}$$

denotes agent $i$'s last broadcast estimate of the optimizer. Here $t_q^i$, $q=0,1,\ldots$, with $t_0^i=0$, denote the triggering instants, i.e., the time instants at which agent $i$ broadcasts $w_i$ to its neighbors. Define $\mathbf{w}(k) = \begin{bmatrix} w_1^\top(k) & \cdots & w_n^\top(k)\end{bmatrix}^\top\in\mathbb{R}^{nd_w}$ and $\hat{\mathbf{w}}(k) = \begin{bmatrix}\hat{w}_1^\top(k) & \cdots & \hat{w}_n^\top(k)\end{bmatrix}^\top\in\mathbb{R}^{nd_w}$. Now (6) can be written as

$$\mathbf{w}(k+1) = \mathbf{w}(k) - \beta_k(\mathcal{L}\otimes I_{d_w})\,\hat{\mathbf{w}}(k) - \alpha_k\, g\big(\mathbf{w}(k),\xi(k)\big), \tag{8}$$

where $\mathcal{L}$ is the network Laplacian and $g(\mathbf{w}(k),\xi(k)) = \begin{bmatrix} g_1^\top(w_1(k),\xi_1(k)) & \cdots & g_n^\top(w_n(k),\xi_n(k))\end{bmatrix}^\top$.
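To make the update concrete, here is a minimal NumPy sketch (ours, not the paper's) of one synchronous pass of (6) over all $n$ agents, with parameters stored row-wise; the gradient inputs may be any of the three options above:

```python
import numpy as np

def detsgrad_step(w, w_hat, grads, A, alpha_k, beta_k):
    """One synchronous pass of update (6) over all n agents.

    w, w_hat : (n, d) arrays of current iterates w_i(k) and last-broadcast
               iterates ŵ_i(k) from (7); grads holds g_i(w_i(k), ξ_i(k)).
    A        : (n, n) adjacency matrix [a_ij] of the communication graph.
    """
    n = w.shape[0]
    w_next = np.empty_like(w)
    for i in range(n):
        # consensus term: β_k Σ_j a_ij (ŵ_i(k) − ŵ_j(k))
        disagreement = sum(A[i, j] * (w_hat[i] - w_hat[j]) for j in range(n))
        # gradient term: α_k g_i(w_i(k), ξ_i(k))
        w_next[i] = w[i] - beta_k * disagreement - alpha_k * grads[i]
    return w_next

# Stacked form (8): with parameters stored row-wise, (L ⊗ I_d) ŵ(k) is simply
# L @ w_hat, so the loop above is equivalent to
#     w - beta_k * (L @ w_hat) - alpha_k * grads
# for the Laplacian L = diag(A @ np.ones(n)) - A.
```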
Let $e_i(k) = w_i(k)-\hat{w}_i(k)$ and $e(k) = \mathbf{w}(k)-\hat{\mathbf{w}}(k)$. Now (8) can be written as

$$\mathbf{w}(k+1) = (W_k\otimes I_{d_w})\,\mathbf{w}(k) + \beta_k(\mathcal{L}\otimes I_{d_w})\,e(k) - \alpha_k\, g\big(\mathbf{w}(k),\xi(k)\big), \tag{9}$$

where $W_k = (I_n - \beta_k\mathcal{L})$. The event instants are defined as

$$t_{q+1}^i = \inf\big\{\,k > t_q^i \;:\; \|e_i(k)\|_1 \ge \upsilon_0\,\alpha_k\,\big\}, \tag{10}$$

where $\upsilon_0$ is a positive constant to be defined. Now we state the following assumption on the individual objective functions:

**Assumption 1.** The objective functions $f_i(\cdot)$ and their gradients $\nabla f_i(\cdot):\mathbb{R}^{d_w}\to\mathbb{R}^{d_w}$ are Lipschitz continuous with Lipschitz constants $L_i^0>0$ and $L_i>0$, respectively; i.e., $\forall\, w_a,w_b\in\mathbb{R}^{d_w}$ and $i=1,\ldots,n$, we have $\|f_i(w_a)-f_i(w_b)\|_2 \le L_i^0\|w_a-w_b\|_2$ and $\|\nabla f_i(w_a)-\nabla f_i(w_b)\|_2 \le L_i\|w_a-w_b\|_2$.

Now we introduce $F(\cdot):\mathbb{R}^{nd_w}\to\mathbb{R}$, an aggregate objective function of the local variables:

$$F(\mathbf{w}(k)) = \sum_{i=1}^{n} f_i\big(w_i(k)\big). \tag{11}$$

Following Assumption 1, the function $F(\cdot)$ is Lipschitz continuous with Lipschitz continuous gradient $\nabla F(\cdot)$; i.e., $\forall\,\mathbf{w}_a,\mathbf{w}_b\in\mathbb{R}^{nd_w}$, we have $\|\nabla F(\mathbf{w}_a)-\nabla F(\mathbf{w}_b)\|_2 \le L\,\|\mathbf{w}_a-\mathbf{w}_b\|_2$, with constant $L=\max_i\{L_i\}$ and $\nabla F(\mathbf{w}) = \begin{bmatrix}\nabla f_1^\top(w_1) & \cdots & \nabla f_n^\top(w_n)\end{bmatrix}^\top\in\mathbb{R}^{nd_w}$.

**Assumption 2.** The function $F(\cdot)$ is lower bounded by $F_{\inf}$, i.e., $F_{\inf}\le F(\mathbf{w})$, $\forall\,\mathbf{w}\in\mathbb{R}^{nd_w}$. Without loss of generality, we assume that $F_{\inf}\ge 0$.

Now we make the following assumption regarding $\{\alpha_k\}$ and $\{\beta_k\}$:

**Assumption 3.** The sequences $\{\alpha_k\}$ and $\{\beta_k\}$ are selected as

$$\alpha_k = \frac{a}{(k+1)^{\delta_2}} \quad\text{and}\quad \beta_k = \frac{b}{(k+1)^{\delta_1}}, \tag{12}$$

where $a>0$, $b>0$, $0<3\delta_1<\delta_2\le 1$, $\delta_1/2+\delta_2>1$, and $\delta_2>1/2$.

For sequences $\{\alpha_k\}$ and $\{\beta_k\}$ satisfying Assumption 3, we have $\sum_{k=1}^{\infty}\alpha_k=\infty$, $\sum_{k=1}^{\infty}\beta_k=\infty$, $\sum_{k=1}^{\infty}\alpha_k^2<\infty$, and $\sum_{k=1}^{\infty}\alpha_k\beta_k^{1/2}<\infty$. Thus $\alpha_k$ and $\beta_k$ are not summable sequences; however, $\alpha_k$ is square-summable and $\alpha_k\beta_k^{1/2}$ is summable.

**Assumption 4.** The interaction topology of the $n$ networked agents is given as a connected undirected graph $\mathcal{G}(\mathcal{V},\mathcal{E})$.

**Assumption 5.** The parameter $b$ in the sequence $\{\beta_k\}$ is selected such that

$$W_0 = (I_n - b\,\mathcal{L}) \tag{13}$$

has a single eigenvalue at 1, corresponding to the right eigenvector $\mathbf{1}_n$, and the remaining $n-1$ eigenvalues of $W_0$ are strictly inside the unit circle. In other words, $b$ is selected such that $b < 1/\sigma_{\max}(\mathcal{L})$, where $\sigma_{\max}(\cdot)$ denotes the largest singular value; thus $b\,\sigma_{\max}(\mathcal{L})<1$.

Let $\mathbb{E}_\xi[\,\cdot\,]$ denote the expected value taken with respect to the distribution of the random variable $\xi_k$ given the filtration $\mathcal{F}_k$ generated by the sequence $\{\mathbf{w}_0,\ldots,\mathbf{w}_k\}$, i.e.,

$$\mathbb{E}_\xi[\mathbf{w}_{k+1}] = \mathbb{E}[\mathbf{w}_{k+1}\,|\,\mathcal{F}_k] = (W_k\otimes I_{d_w})\,\mathbf{w}_k + \beta_k(\mathcal{L}\otimes I_{d_w})\,e(k) - \alpha_k\,\mathbb{E}[\,g(\mathbf{w}_k,\xi_k)\,|\,\mathcal{F}_k] \quad\text{a.s.},$$

where a.s. (almost surely) denotes events that occur with probability one. Now we make the following assumptions regarding the stochastic gradient term $g(\mathbf{w}(k),\xi(k))$:

**Assumption 6.** The stochastic gradients are unbiased, such that

$$\mathbb{E}_\xi\big[\,g(\mathbf{w}_k,\xi_k)\,\big] = \nabla F(\mathbf{w}_k) \quad\text{a.s.} \tag{14}$$

That is to say,

$$\mathbb{E}_\xi\big[\,g(\mathbf{w}_k,\xi_k)\,\big] = \begin{bmatrix}\mathbb{E}_{\xi_1}\big[\,g_1(w_1(k),\xi_1(k))\,\big]\\ \vdots\\ \mathbb{E}_{\xi_n}\big[\,g_n(w_n(k),\xi_n(k))\,\big]\end{bmatrix} = \begin{bmatrix}\nabla f_1(w_1(k))\\ \vdots\\ \nabla f_n(w_n(k))\end{bmatrix}.$$

**Assumption 7.** The stochastic gradients have conditionally bounded second moment, i.e., there exist scalars $\mu_{v_1}\ge 0$ and $\mu_{v_2}\ge 0$ such that

$$\mathbb{E}_\xi\big[\,\|g(\mathbf{w}_k,\xi_k)\|_2^2\,\big] \le \mu_{v_1} + \mu_{v_2}\,\|\nabla F(\mathbf{w}_k)\|_2^2 \quad\text{a.s.} \tag{15}$$

Assumption 7 is the bounded-variance assumption typically made in the SGD literature. Pseudo-code of the proposed distributed event-triggered SGD is given in Algorithm 1.

```
Algorithm 1: DETSGRAD
Input: a, b, υ0, δ1, δ2
Initialization: w(0) = [w1(0)ᵀ ... wn(0)ᵀ]ᵀ
for agent i = 1 to n do
    Sample ξi(0) and compute gi(wi(0), ξi(0))
    Send wi(0) and let ŵi^(i) = wi(0)
    Receive wj(0) and let ŵj^(i) = wj(0), ∀ j ∈ Ni
    Update wi(1) = wi(0) − β0 Σ_{j∈Ni} aij (ŵi^(i) − ŵj^(i)) − α0 gi(wi(0), ξi(0))
end for
for iteration k ≥ 1 do
    for agent i = 1 to n do
        Sample ξi(k) and compute gi(wi(k), ξi(k))
        Compute ei(k) = wi(k) − ŵi^(i)
        if ‖ei(k)‖1 ≥ υ0 αk then
            Send wi(k) and let ŵi^(i) = wi(k)
        end if
        if any wj(k) received then
            let ŵj^(i) = wj(k)
        end if
        Update wi(k+1) = wi(k) − βk Σ_{j∈Ni} aij (ŵi^(i) − ŵj^(i)) − αk gi(wi(k), ξi(k))
    end for
end for
```
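To see the triggering mechanism in action, the following self-contained NumPy simulation (ours, not the paper's) mirrors Algorithm 1 on a toy problem: local quadratics $f_i(w)=\tfrac{1}{2}\|w-c_i\|^2$ with additive gradient noise, so that $\sum_i f_i$ is minimized at the mean $\bar{c}$ of the $c_i$. All constants are illustrative choices satisfying Assumptions 3–5, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 5                                   # 10 agents on a ring, d parameters each
A = np.roll(np.eye(n), 1, axis=1) + np.roll(np.eye(n), -1, axis=1)  # ring adjacency
Lap = np.diag(A.sum(axis=1)) - A               # Laplacian L = Δ − A; σ_max(L) = 4 here

C = rng.normal(size=(n, d))                    # f_i(w) = 0.5‖w − c_i‖²
a, b, v0, d1, d2 = 1.0, 0.24, 20.0, 0.1, 1.0   # b < 1/σ_max(L) = 0.25 (Assumption 5)

w = rng.normal(size=(n, d))                    # w_i(0)
w_hat = w.copy()                               # every agent broadcasts at k = 0
sent = 0

for k in range(1, 20001):
    alpha_k, beta_k = a / (k + 1) ** d2, b / (k + 1) ** d1
    g = (w - C) + 0.1 * rng.normal(size=(n, d))               # unbiased stochastic gradients
    trigger = np.abs(w - w_hat).sum(axis=1) >= v0 * alpha_k   # event (10): ‖e_i(k)‖₁ ≥ υ₀ α_k
    w_hat[trigger] = w[trigger]                               # only triggered agents broadcast
    sent += int(trigger.sum())
    w = w - beta_k * (Lap @ w_hat) - alpha_k * g              # update (8)

print(f"broadcast fraction: {sent / (20000 * n):.2f}")        # well below 1 with this υ0
print(f"max distance to the mean of c_i: {np.abs(w - C.mean(axis=0)).max():.3f}")
```

Because the trigger threshold $\upsilon_0\alpha_k$ and the per-step drift both shrink with $k$, agents broadcast only a fraction of the time while still tracking consensus; the exact fraction depends on $\upsilon_0$.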
## Convergence analysis

Define the average-consensus error as $\tilde{\mathbf{w}}_k = (M\otimes I_{d_w})\,\mathbf{w}_k$, where $M = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^\top$. Note that $M\mathcal{L} = \mathcal{L}$ and $(\mathcal{L}\otimes I_{d_w})\mathbf{w}_k = (\mathcal{L}\otimes I_{d_w})\tilde{\mathbf{w}}_k$. Thus, from (9) we have

$$\tilde{\mathbf{w}}_{k+1} = (W_k\otimes I_{d_w})\,\tilde{\mathbf{w}}_k + \beta_k(\mathcal{L}\otimes I_{d_w})\,e(k) - \alpha_k(M\otimes I_{d_w})\,g(\mathbf{w}_k,\xi_k). \tag{16}$$

Our strategy for proving the convergence of the proposed distributed event-triggered SGD algorithm to a critical point is as follows. First, we show that the consensus error among the agents diminishes at a rate of $O\big(1/(k+1)^{\delta_2}\big)$ (see Theorem 1). Asymptotic convergence of the algorithm is then proved in Theorem 3. Theorem 4 then establishes that the weighted expected average gradient norm is a summable sequence. The convergence rate of the algorithm in the typical weak sense is given in Theorem 5. Finally, Theorem 6 proves the asymptotic mean-square convergence of the algorithm to a critical point.

**Theorem 1.** *Consider the event-triggered SGD algorithm (6) under Assumptions 1–7. Then, there holds*

$$\mathbb{E}\big[\|\tilde{\mathbf{w}}_k\|_2^2\big] = O\!\left(\frac{1}{(k+1)^{\delta_2}}\right). \tag{17}$$

The proof of Theorem 1 is given in George and Gurram (2019). Define

$$\gamma_k = \frac{\alpha_k}{\beta_k} = \frac{a/b}{(k+1)^{\delta_2-\delta_1}}. \tag{18}$$

Now define a non-negative function $V(\gamma_k,\mathbf{w}_k)$ as

$$V(\gamma_k,\mathbf{w}_k) = F(\mathbf{w}_k) + \frac{1}{2\gamma_k}\,\mathbf{w}_k^\top(\mathcal{L}\otimes I_{d_w})\,\mathbf{w}_k. \tag{19}$$

Taking the gradient with respect to $\mathbf{w}_k$ yields

$$\nabla V(\gamma_k,\mathbf{w}_k) = \nabla F(\mathbf{w}_k) + \frac{1}{\gamma_k}(\mathcal{L}\otimes I_{d_w})\,\mathbf{w}_k. \tag{20}$$

**Theorem 2.** *Consider the distributed event-triggered SGD algorithm (6) under Assumptions 1–7. Then, for the gradient $\nabla V(\gamma_k,\mathbf{w}_k)$ given in (20), there holds*

$$\sum_{k=0}^{\infty} \alpha_k\,\mathbb{E}\big[\|\nabla V(\gamma_k,\mathbf{w}_k)\|_2^2\big] < \infty. \tag{21}$$

*Proof.* See Theorem 2 in George and Gurram (2019).

**Theorem 3.** *For the distributed event-triggered SGD algorithm (6) under Assumptions 1–7, we have*

$$\sum_{k=0}^{\infty}\mathbb{E}\big[\|\mathbf{w}_{k+1}-\mathbf{w}_k\|_2^2\big] < \infty \quad\text{and} \tag{22}$$

$$\lim_{k\to\infty}\mathbb{E}\big[\|\mathbf{w}_{k+1}-\mathbf{w}_k\|_2^2\big] = 0. \tag{23}$$

See the supplementary material section of George and Gurram (2019) for the proof of Theorem 3.

Define $\bar{\mathbf{w}}_k = \frac{1}{n}\big(\mathbf{1}_n\mathbf{1}_n^\top\otimes I_{d_w}\big)\mathbf{w}_k$ and $\overline{\nabla F}(\mathbf{w}_k) = \frac{1}{n}\big(\mathbf{1}_n\mathbf{1}_n^\top\otimes I_{d_w}\big)\nabla F(\mathbf{w}_k)$. Note that

$$\big\|\overline{\nabla F}(\mathbf{w}_k)\big\|_2^2 = \frac{1}{n}\Big\|\big(\mathbf{1}_n^\top\otimes I_{d_w}\big)\nabla F(\mathbf{w}_k)\Big\|_2^2 = \frac{1}{n}\Big\|\sum_{i=1}^{n}\nabla f_i\big(w_i(k)\big)\Big\|_2^2.$$

**Theorem 4.** *For the distributed event-triggered SGD algorithm (6) under Assumptions 1–7, we have*

$$\sum_{k=0}^{\infty}\alpha_k\,\mathbb{E}\Big[\big\|\overline{\nabla F}(\mathbf{w}_k)\big\|_2^2\Big] < \infty. \tag{24}$$

*Proof.* See Theorem 4 in George and Gurram (2019).

Theorem 4 establishes a result about the weighted sum of the expected average gradient norm. The key takeaway is that, for the distributed SGD in (8) or (6) with appropriate step-sizes, the expected average gradient norms cannot stay bounded away from zero (see Theorem 9 of Bottou, Curtis, and Nocedal (2018)), i.e., $\liminf_{k\to\infty}\mathbb{E}\big[\|\overline{\nabla F}(\mathbf{w}_k)\|_2^2\big] = 0$, or equivalently $\liminf_{k\to\infty}\mathbb{E}\big[\big\|\sum_{i=1}^{n}\nabla f_i(w_i(k))\big\|_2^2\big] = 0$. The rate of such weak convergence results can be obtained as shown in Theorem 5.

**Theorem 5.** *Let $\{\mathbf{w}_k\}_{k=0}^{K}$ be generated according to the distributed event-triggered SGD given in (6) under Assumptions 1–7. Then for $\delta_2 = 1$ we have*

$$\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\big(z_i^K\big)\Big\|_2^2\Big] = O\!\left(\frac{1}{\log(K+1)}\right), \tag{25}$$

*and for $\delta_2\in(0.5,\,1)$ we have*

$$\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\big(z_i^K\big)\Big\|_2^2\Big] = O\!\left(\frac{1}{(K+1)^{1-\delta_2}}\right). \tag{26}$$

*Here $z^K = \begin{bmatrix}(z_1^K)^\top & \cdots & (z_n^K)^\top\end{bmatrix}^\top$ is a random sample from $\{\mathbf{w}_k\}_{k=0}^{K}$ drawn with probability $\mathbb{P}\big(z^K = \mathbf{w}_k\big) = \alpha_k\big/\sum_{j=0}^{K}\alpha_j$.*

*Proof.* See Theorem 5 in George and Gurram (2019).
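As a quick sanity check (ours, not the paper's), the step-size exponents used in the experiments below, $\delta_1 = 1/10$ and $\delta_2 = 1$, can be verified against Assumption 3, which also selects the applicable rate in Theorem 5:

```python
d1, d2 = 0.1, 1.0                      # exponents used in the experiments below

assert 0 < 3 * d1 < d2 <= 1            # 0 < 3δ1 < δ2 ≤ 1
assert d1 / 2 + d2 > 1                 # δ1/2 + δ2 > 1  =>  Σ α_k β_k^{1/2} < ∞
assert d2 > 0.5                        # δ2 > 1/2       =>  Σ α_k² < ∞

# Weak convergence rate from Theorem 5:
rate = "O(1/log(K+1))" if d2 == 1 else f"O(1/(K+1)^{1 - d2:.2f})"
print(rate)                            # δ2 = 1  =>  O(1/log(K+1))
```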
Finally, we present the following result to illustrate that stronger convergence follows from a continuity assumption on the Hessian, which has not been utilized in our analysis so far.

**Assumption 8.** The Hessians $\nabla^2 f_i(\cdot):\mathbb{R}^{d_w}\to\mathbb{R}^{d_w\times d_w}$ are Lipschitz continuous with Lipschitz constants $L_{H_i}$, i.e., $\forall\, w_a,w_b\in\mathbb{R}^{d_w}$ and $i=1,\ldots,n$, we have

$$\big\|\nabla^2 f_i(w_a) - \nabla^2 f_i(w_b)\big\|_2 \le L_{H_i}\,\|w_a-w_b\|_2. \tag{27}$$

It follows from Assumption 8 that the Hessian $\nabla^2 F(\cdot)$ is Lipschitz continuous, i.e., $\forall\,\mathbf{w}_a,\mathbf{w}_b\in\mathbb{R}^{nd_w}$,

$$\big\|\nabla^2 F(\mathbf{w}_a) - \nabla^2 F(\mathbf{w}_b)\big\|_2 \le L_H\,\|\mathbf{w}_a-\mathbf{w}_b\|_2, \tag{28}$$

with constant $L_H = \max_i\{L_{H_i}\}$.

**Theorem 6.** *For the distributed SGD algorithm (6) under Assumptions 1–8, we have*

$$\lim_{k\to\infty}\mathbb{E}\Big[\big\|\overline{\nabla F}(\mathbf{w}_k)\big\|_2^2\Big] = 0 \quad\text{and} \tag{29}$$

$$\lim_{k\to\infty}\mathbb{E}\Big[\Big\|\frac{1}{n}\sum_{i=1}^{n}\nabla f_i\big(w_i(k)\big)\Big\|_2^2\Big] = 0. \tag{30}$$

*Proof.* See Theorem 6 in George and Gurram (2019).

Similar to the centralized SGD analysis (Bottou, Curtis, and Nocedal 2018), the analysis given here shows the mean-square convergence of the distributed algorithm to a critical point, which includes saddle points. Though SGD has been shown to escape saddle points efficiently (Lee et al. 2017; Fang, Lin, and Zhang 2019; Jin et al. 2019), extensions of such results to distributed SGD are currently nonexistent and are a topic for future research.

## Application to distributed supervised learning

We apply the proposed algorithm to distributedly train neural network agents for an image classification task, and we present extensive results on two different datasets: MNIST (http://yann.lecun.com/exdb/mnist/) and CIFAR-10 (https://www.cs.toronto.edu/~kriz/cifar.html).

### MNIST

The MNIST data set is a handwritten digit recognition data set containing 60000 grayscale images of 10 digits (0–9) for training; 10000 images are used for testing. We distributedly train 10 agents connected in an undirected, unweighted ring topology. The 10-node ring was selected only because it is one of the most sparsely connected networks (besides the path) and MNIST contains 10 classes; the proposed algorithm works for any undirected graph as long as it is connected. Each agent aims to train its own neural network, a randomly initialized LeNet-5 (LeCun et al. 1998). During training, each agent broadcasts its weights to its neighbors either at every iteration or aperiodically, as described in the proposed algorithm.
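The two data regimes used in the experiments below, random sampling (i.i.d.) and a single class per agent (non-i.i.d.), can be set up in a few lines. The following sketch is one plausible reading of that setup, assuming torchvision is available; it is not code from the paper:

```python
import numpy as np
from torchvision import datasets

labels = datasets.MNIST(root="./data", train=True, download=True).targets.numpy()
rng = np.random.default_rng(0)

# Random-sampling (i.i.d.) case: each of the 10 agents gets 6000 random images.
perm = rng.permutation(len(labels))
iid_splits = [perm[6000 * i : 6000 * (i + 1)] for i in range(10)]

# Single-class (non-i.i.d.) case: agent i sees only digit i, capped at 5421
# images, the size of the smallest class (digit 5).
noniid_splits = [np.where(labels == i)[0][:5421] for i in range(10)]
```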
Here we conduct the following five experiments: (i) centralized SGD, where a centralized version of SGD is implemented by a central node with access to all 60000 training images from all classes; (ii) distributed SGD-r, where all agents broadcast their respective weights at every iteration, and each agent has access to 6000 training images randomly sampled from the entire training set, which forms the i.i.d. case; (iii) distributed SGD-s, where all agents broadcast their weights at every iteration, and each agent has access only to the images of a single class, which forms the non-i.i.d. case; (iv) DETSGRAD-r, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access to 6000 randomly sampled training images (i.i.d. case); and (v) DETSGRAD-s, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access only to the images of a single class (non-i.i.d. case). In the single-class case, for ease of programming, we set the number of training images available to each agent to 5421 (the minimum number of training images in a single class, which is digit 5 in the MNIST data set).

Here we select $\alpha_k = \frac{0.1}{(\varepsilon k+1)}$ and $\beta_k = \frac{0.2525}{(\varepsilon k+1)^{1/10}}$, where $\varepsilon = 10^{-5}$, for distributed SGD and DETSGRAD, and $\alpha_k = \frac{0.001}{(\varepsilon k+1)}$ for centralized SGD. Note that using the scale factor $\varepsilon$ does not affect the theoretical results provided in the previous sections. For the DETSGRAD experiments, we select the broadcast event-trigger threshold $\upsilon_0 = 0.2\,N_{\text{parameters}}$, where $N_{\text{parameters}}$ is the total number of parameters in each neural network.

[Figure 1: Empirical risk for all five experiments on the MNIST dataset. Panels: (a) centralized SGD, (b) distributed SGD-r, (c) distributed SGD-s, (d) DETSGRAD-r, (e) DETSGRAD-s.]

The plots of the empirical risk versus the iterations (parameter update steps), illustrated in Figure 1, show the convergence of the proposed algorithm. The final test accuracies of the 10 agents after 40 training epochs using the different algorithms and training settings are shown in Table 1.

Table 1: MNIST: final classification accuracies (%) of the 10 agents after 40 epochs (240000 iterations for the random-sampling/i.i.d. case and 216840 iterations for the single-class/non-i.i.d. case) using different algorithms. The final accuracy of a single agent using centralized SGD after 10 epochs (600000 iterations) is 98.63%.

| Agent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Dist. SGD-r | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 | 98.97 |
| Dist. SGD-s | 98.86 | 98.86 | 98.86 | 98.87 | 98.86 | 98.86 | 98.86 | 98.87 | 98.85 | 98.87 |
| DETSGRAD-r | 98.34 | 98.35 | 98.32 | 98.27 | 98.31 | 98.31 | 98.38 | 98.29 | 98.23 | 98.33 |
| DETSGRAD-s | 98.46 | 98.49 | 98.49 | 98.51 | 98.5 | 98.45 | 98.13 | 98.49 | 98.42 | 98.51 |

These results indicate that, regardless of how the data are distributed (randomly sampled or single class), the agents are able to train their networks, and the distributedly trained networks yield performance similar to that of a centrally trained network. More importantly, in the single-class case, the agents were able to recognize images from all 10 classes even though each had access only to the data of a single class during the training phase. This result has numerous implications for the machine learning community, specifically for federated multi-task learning under information-flow constraints. The total numbers of event-triggered parameter broadcast events for the 10 agents using the DETSGRAD algorithm are shown in Table 2.

Table 2: MNIST: total number of event-triggered broadcast events for the 10 agents after 40 epochs. The total number of continuous broadcast events for each agent after 40 epochs is 240000 in the random-sampling case and 216840 in the single-class case.

| Agent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| DETSGRAD-r | 61759 | 61455 | 61504 | 61636 | 61738 | 61822 | 61746 | 61712 | 61850 | 61795 |
| DETSGRAD-s | 71756 | 71718 | 71762 | 71983 | 71976 | 71773 | 71762 | 72159 | 72233 | 72208 |
In the random-sampling case, the broadcast event-triggering mechanism reduces the inter-agent communications from 240000 broadcasts per agent to an average of 61702 over 40 epochs, a reduction of 74.2% in network communications. In the single-class case, the agents broadcast their parameters continuously for the first 4 epochs, after which the event-trigger mechanism is started; here, the parameter broadcasts per agent drop from 216840 to an average of 71933 over 40 epochs, a reduction of 66.8% in network communications. Yet, as can be seen in Table 1, DETSGRAD gives classification performance similar to that of distributed SGD with continuous parameter sharing, with a significant reduction in network communications. The fractions of broadcast events for the 10 agents over the 40 epochs are presented in Figure 2. As expected, the number of broadcast events decreases with the epoch number as the agents converge to a critical point of the empirical risk function.

[Figure 2: Fraction of event-triggered broadcast events for the 10 agents, compared with the continuous-broadcasting case, for the MNIST dataset. Panels: (a) DETSGRAD-r, (b) DETSGRAD-s.]

### CIFAR-10

The CIFAR-10 data set is an image classification data set containing 50000 color images of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck) for training; 10000 images are used for testing. We distributedly train 8 agents connected in an undirected, unweighted ring topology. Each agent trains its own neural network, a randomly initialized ResNet-20 (He et al. 2016), using the implementation at https://github.com/akamaster/pytorch_resnet_cifar10. We conducted the following three experiments: (i) centralized SGD, where a centralized version of SGD is implemented by a central node with access to all 50000 training images from all classes; (ii) distributed SGD-r, where all agents broadcast their respective weights at every iteration, and each agent has access to 6250 training images randomly sampled from the entire training set; and (iii) DETSGRAD-r, where the agents aperiodically broadcast their weights using the triggering mechanism in (10), and each agent has access to 6250 randomly sampled training images.

Here we select $\alpha_k = \frac{0.1}{(\varepsilon k+1)}$ and $\beta_k = \frac{0.2525}{(\varepsilon k+1)^{1/10}}$, where $\varepsilon = 0.00025$, for distributed SGD and DETSGRAD, and $\alpha_k = \frac{0.1}{(\varepsilon k+1)}$ with $\varepsilon = 10^{-5}$ for centralized SGD. For the DETSGRAD experiment, we select the broadcast event-trigger threshold $\upsilon_0 = 0.01\,N_{\text{parameters}}$, where $N_{\text{parameters}}$ is the total number of parameters in each neural network, with the agents broadcasting their parameters continuously for the first 25 epochs.

[Figure 3: Empirical risk vs. epoch for ResNet-20 trained on the CIFAR-10 dataset. Panels: (a) centralized SGD, (b) distributed SGD-r, (c) DETSGRAD-r.]

The plots of the empirical risk versus epochs, illustrated in Figure 3, show the convergence of the proposed algorithm. The final test accuracies of the 8 agents after 200 training epochs using the two distributed algorithms are shown in Table 3. As in the MNIST case, the results indicate that the distributedly trained networks yield performance similar to that of a centrally trained network.

Table 3: CIFAR-10: final classification accuracies (%) of the 8 agents after 200 epochs of training. The final accuracy of a single agent using centralized SGD after 150 epochs is 85.12%.

| Agent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Dist. SGD-r | 85.02 | 85.04 | 84.88 | 84.93 | 85.2 | 85.7 | 84.47 | 84.92 |
| DETSGRAD-r | 84.96 | 84.84 | 84.42 | 84.37 | 84.46 | 84.86 | 84.84 | 84.6 |

The total numbers of event-triggered parameter broadcast events for the 8 agents using the DETSGRAD algorithm are shown in Table 4. The broadcast event-triggering mechanism reduces the inter-agent communications from 9800 broadcasts per agent to an average of 5482 over 200 epochs, a reduction of 44.1% in network communications. Yet, as can be seen in Table 3, DETSGRAD gives classification performance similar to that of distributed SGD with continuous parameter sharing, with a significant reduction in network communications.

Table 4: CIFAR-10: total number of event-triggered broadcast events for the 8 agents after 200 epochs. The total number of continuous broadcast events for each agent after 200 epochs is 9800.

| Agent | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| DETSGRAD-r | 5484 | 5479 | 5471 | 5483 | 5483 | 5485 | 5493 | 5481 |
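The reported savings follow directly from the broadcast counts in Tables 2 and 4; a quick check of the arithmetic:

```python
# Continuous vs. average event-triggered broadcasts per agent (Tables 2 and 4).
cases = {"MNIST, random sampling": (240000, 61702),
         "MNIST, single class": (216840, 71933),
         "CIFAR-10, random sampling": (9800, 5482)}
for name, (continuous, triggered) in cases.items():
    print(f"{name}: {100 * (1 - triggered / continuous):.1f}% fewer broadcasts")
# -> about 74.3%, 66.8%, and 44.1%, matching the reductions quoted above
```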
## Conclusion

This paper presented the development of a distributed stochastic gradient descent algorithm with an event-triggered communication mechanism for solving non-convex optimization problems. We presented a novel communication triggering mechanism, which allows the agents to substantially reduce the communication overhead by communicating only when the local model has significantly changed from the previously communicated model. We presented sufficient conditions on the algorithm step-sizes that guarantee asymptotic mean-square convergence of the proposed algorithm to a critical point, and we provided the convergence rate of the proposed algorithm. We applied the developed algorithm to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural networks to perform image classification. Results indicate that the distributedly trained networks yield performance similar to that of a centrally trained network. Numerical results also show that the proposed event-triggered communication mechanism significantly reduces inter-agent communication while yielding performance similar to that of a distributedly trained network with constant communication.

## References

Agarwal, A., and Duchi, J. C. 2011. Distributed delayed stochastic optimization. In NIPS, 873–881.
Assran, M.; Loizou, N.; Ballas, N.; and Rabbat, M. 2019. Stochastic gradient push for distributed deep learning. In ICML, 344–353.
Bianchi, P., and Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE TAC 58(2):391–405.
Bottou, L.; Curtis, F.; and Nocedal, J. 2018. Optimization methods for large-scale machine learning. SIAM Review 60(2):223–311.
Chaturapruek, S.; Duchi, J. C.; and Ré, C. 2015. Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In NIPS, 1531–1539.
Chatzipanagiotis, N., and Zavlanos, M. M. 2017. On the convergence of a distributed augmented Lagrangian method for nonconvex optimization. IEEE TAC 62(9):4405–4420.
De Sa, C. M.; Zhang, C.; Olukotun, K.; and Ré, C. 2015. Taming the wild: A unified analysis of HOGWILD-style algorithms. In NIPS, 2674–2682.
Duchi, J.; Jordan, M. I.; and McMahan, B. 2013. Estimation, optimization, and parallelism when data is sparse. In NIPS, 2832–2840.
Fang, C.; Lin, Z.; and Zhang, T. 2019. Sharp analysis for nonconvex SGD escaping from saddle points. arXiv:1902.00247.
Feyzmahdavian, H. R.; Aytekin, A.; and Johansson, M. 2016. An asynchronous mini-batch algorithm for regularized stochastic optimization. IEEE TAC 61(12):3740–3754.
George, J., and Gurram, P. 2019. Distributed deep learning with event-triggered communication. arXiv:1909.05020.
Guo, J.; Hug, G.; and Tonguz, O. K. 2017. A case for non-convex distributed optimization in large-scale power systems. IEEE TPS 32(5):3842–3851.
Haddadpour, F.; Kamani, M. M.; Mahdavi, M.; and Cadambe, V. 2019. Trading redundancy for communication: Speeding up distributed SGD for non-convex optimization. In ICML, 2545–2554.
Hajinezhad, D.; Hong, M.; Zhao, T.; and Wang, Z. 2016. NESTT: A nonconvex primal-dual splitting method for distributed and stochastic optimization. In NIPS, 3215–3223.
Hajinezhad, D.; Hong, M.; and Garcia, A. 2019. ZONE: Zeroth-order nonconvex multi-agent optimization over networks. IEEE TAC.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In IEEE CVPR, 770–778.
Hong, M.; Hajinezhad, D.; and Zhao, M.-M. 2017. Prox-PDA: The proximal primal-dual algorithm for fast distributed nonconvex optimization and learning over networks. In ICML, 1529–1538.
Hong, M.; Luo, Z.; and Razaviyayn, M. 2016. Convergence analysis of alternating direction method of multipliers for a family of nonconvex problems. SIAM JO 26(1):337–364.
Hong, M. 2018. A distributed, asynchronous, and incremental algorithm for nonconvex optimization: An ADMM approach. IEEE TCNS 5(3):935–945.
Huo, Z., and Huang, H. 2016. Asynchronous stochastic gradient descent with variance reduction for non-convex optimization. arXiv:1604.03584.
Jakovetić, D.; Bajović, D.; Sahu, A. K.; and Kar, S. 2018. Convergence rates for distributed stochastic optimization over random networks. In IEEE CDC, 4238–4245.
Jiang, Z.; Balu, A.; Hegde, C.; and Sarkar, S. 2017. Collaborative deep learning in fixed topology networks. In NIPS, 5904–5914.
Jin, C.; Netrapalli, P.; Ge, R.; Kakade, S. M.; and Jordan, M. I. 2019. Stochastic gradient descent escapes saddle points efficiently. arXiv:1902.04811.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016. Federated learning: Strategies for improving communication efficiency. In NIPSW.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proc. of the IEEE 86(11):2278–2324.
Lee, J. D.; Panageas, I.; Piliouras, G.; Simchowitz, M.; Jordan, M. I.; and Recht, B. 2017. First-order methods almost always avoid saddle points. arXiv:1710.07406.
Li, M.; Andersen, D. G.; Park, J. W.; Smola, A. J.; Ahmed, A.; Josifovski, V.; Long, J.; Shekita, E. J.; and Su, B.-Y. 2014a. Scaling distributed machine learning with the parameter server. In USENIX OSDI, 583–598.
Li, M.; Andersen, D. G.; Smola, A. J.; and Yu, K. 2014b. Communication efficient distributed machine learning with the parameter server. In NIPS, 19–27.
Lian, X.; Huang, Y.; Li, Y.; and Liu, J. 2015. Asynchronous parallel stochastic gradient for nonconvex optimization. arXiv:1506.08272.
Lian, X.; Zhang, C.; Zhang, H.; Hsieh, C.-J.; Zhang, W.; and Liu, J. 2017. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In NIPS, 5330–5340.
Lian, X.; Zhang, W.; Zhang, C.; and Liu, J. 2018. Asynchronous decentralized parallel stochastic gradient descent. In ICML, 3043–3052.
Lin, T.; Stich, S. U.; and Jaggi, M. 2018. Don't use large mini-batches, use local SGD. arXiv:1808.07217.
Lorenzo, P. D., and Scutari, G. 2016. NEXT: In-network nonconvex optimization. IEEE TSIPN 2(2):120–136.
McMahan, H. B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In AISTATS.
Mitliagkas, I.; Borokhovich, M.; Dimakis, A. G.; and Caramanis, C. 2015. FrogWild!: Fast PageRank approximations on graph engines. Proc. VLDB Endow. 8(8):874–885.
Nedić, A., and Olshevsky, A. 2016. Stochastic gradient-push for strongly convex functions on time-varying directed graphs. IEEE TAC 61(12):3936–3947.
Noel, C., and Osindero, S. 2014. Dogwild!: Distributed Hogwild for CPU & GPU. In NIPSW.
Pu, S., and Nedić, A. 2018. Distributed stochastic gradient tracking methods. arXiv:1805.11454.
Recht, B.; Re, C.; Wright, S.; and Niu, F. 2011. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 693–701.
Scutari, G.; Facchinei, F.; and Lampariello, L. 2017. Parallel and distributed methods for constrained nonconvex optimization, Part I: Theory. IEEE TSP 65(8):1929–1944.
Tang, H.; Lian, X.; Yan, M.; Zhang, C.; and Liu, J. 2018. D²: Decentralized training over decentralized data. In ICML, 4848–4856.
Tatarenko, T., and Touri, B. 2017. Non-convex distributed optimization. IEEE TAC 62(8):3744–3757.
Wai, H.; Freris, N. M.; Nedić, A.; and Scaglione, A. 2018. SUCAG: Stochastic unbiased curvature-aided gradient method for distributed optimization. In IEEE CDC, 1751–1756.
Wang, J., and Joshi, G. 2018. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv:1808.07576.
Wang, J.; Sahu, A. K.; Yang, Z.; Joshi, G.; and Kar, S. 2019. MATCHA: Speeding up decentralized SGD via matching decomposition sampling. arXiv:1905.09435.
Yu, H., and Jin, R. 2019. On the computation and communication complexity of parallel SGD with dynamic batch sizes for stochastic non-convex optimization. In ICML, 7174–7183.
Yu, H.; Hsieh, C.; Si, S.; and Dhillon, I. 2012. Scalable coordinate descent approaches to parallel matrix factorization for recommender systems. In IEEE ICDM, 765–774.
Yu, H.; Jin, R.; and Yang, S. 2019. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In ICML, 7184–7193.
Yu, H.; Yang, S.; and Zhu, S. 2019. Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning. In AAAI-19, 5693–5700.
Zeng, J., and Yin, W. 2018. On nonconvex decentralized gradient descent. IEEE TSP 66(11):2834–2848.
Zhang, K.; Alqahtani, S.; and Demirbas, M. 2017. A comparison of distributed machine learning platforms. In ICCCN, 1–9.
Zhang, J.; Tu, H.; Ren, Y.; Wan, J.; Zhou, L.; Li, M.; and Wang, J. 2018. An adaptive synchronous parallel strategy for distributed machine learning. IEEE Access 6:19222–19230.
Zhou, Z.; Mertikopoulos, P.; Bambos, N.; Glynn, P.; Ye, Y.; Li, L.-J.; and Fei-Fei, L. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go? In ICML, 5970–5979.
Zhu, M., and Martínez, S. 2013. An approximate dual subgradient algorithm for multi-agent non-convex optimization. IEEE TAC 58(6):1534–1539.