# D2: Decentralized Training over Decentralized Data

Hanlin Tang 1, Xiangru Lian 1, Ming Yan 2,3, Ce Zhang 4, Ji Liu 5

*Equal contribution. 1 Department of Computer Science, University of Rochester; 2 Department of Computational Mathematics, Science and Engineering, Michigan State University; 3 Department of Mathematics, Michigan State University; 4 Department of Computer Science, ETH Zurich; 5 Tencent AI Lab. Correspondence to: Hanlin Tang, Xiangru Lian, Ming Yan, Ce Zhang, Ji Liu.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

When training a machine learning model using multiple workers, each of which collects data from its own data source, it would be useful if the data collected from different workers were unique and different. Ironically, recent analysis of decentralized parallel stochastic gradient descent (D-PSGD) relies on the assumption that the data hosted on different workers are not too different. In this paper, we ask the question: Can we design a decentralized parallel stochastic gradient descent algorithm that is less sensitive to the data variance across workers? We present D2, a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers (imprecisely, "decentralized" data). The core of D2 is a variance reduction extension of D-PSGD. It improves the convergence rate from $O\big(\tfrac{\sigma}{\sqrt{nT}} + \tfrac{(n\zeta^2)^{1/3}}{T^{2/3}}\big)$ to $O\big(\tfrac{\sigma}{\sqrt{nT}}\big)$, where $\zeta^2$ denotes the variance among the data on different workers. As a result, D2 is robust to data variance among workers. We empirically evaluate D2 on image classification tasks where each worker has access to only the data of a limited set of labels, and find that D2 significantly outperforms D-PSGD.

1. Introduction

Training machine learning models in a decentralized way has attracted intensive interest recently (Lian et al., 2017a; Yuan et al., 2016; Colin et al., 2016). In the decentralized setting, there is a set of workers, each of which collects data from a different data source. Instead of sending all data to a centralized place, these workers only communicate with their neighbors. The goal is to obtain a model that is the same as if all data were collected in a centralized place. Decentralized learning algorithms are important in scenarios where centralized communication is expensive or impossible, or where the underlying communication network has high latency.

For decentralized learning to provide benefits, each worker should provide data that is somehow unique, i.e., the variance of the data collected by different workers should be large. However, many recent theoretical results (Lian et al., 2017a;b; Nedic & Ozdaglar, 2009; Yuan et al., 2016) assume a bounded data variance across workers; when the data hosted on different workers are very different, these approaches converge slowly, both empirically and theoretically. In this paper, we aim at bridging this discrepancy between the current theoretical understanding and the requirements of some practical scenarios.

We present D2, a novel decentralized learning algorithm designed to be robust under high data variance. D2 is built upon decentralized parallel stochastic gradient descent (D-PSGD), but benefits from an additional variance reduction component.
In D2, each worker stores the stochastic gradient and its local model from the previous iteration and linearly combines them with the current stochastic gradient and local model. This results in an improved convergence rate over D-PSGD because the effect of the data variation among workers is eliminated. In particular, the convergence rate is improved from $O\big(\tfrac{\sigma}{\sqrt{nT}} + \tfrac{(n\zeta^2)^{1/3}}{T^{2/3}}\big)$ to $O\big(\tfrac{\sigma}{\sqrt{nT}}\big)$, where $\zeta^2$ is the data variation among all workers, $\sigma^2$ is the data variance within each worker, $n$ is the number of workers, and $T$ is the number of iterations. We empirically show that D2 can significantly outperform D-PSGD by training an image classification model in which each worker has access to only the data of a limited set of labels.

Throughout this paper, we consider the following decentralized optimization problem:

$$\min_{x\in\mathbb{R}^N}\; f(x) := \frac{1}{n}\sum_{i=1}^{n}\overbrace{\mathbb{E}_{\xi\sim D_i} F_i(x;\xi)}^{=:f_i(x)}, \qquad (1)$$

where $n$ is the number of workers and $D_i$ is the local data distribution for worker $i$. All workers are connected through a connected graph, and each worker can only exchange information with its neighbors.

Definitions and notation. Throughout this paper, we use the following notation and definitions: $\|\cdot\|_F$ denotes the Frobenius norm of matrices; $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the spectral norm for matrices; $\nabla f(\cdot)$ denotes the gradient of a function $f$; $f^*$ denotes the optimal value of (1); $\lambda_i(\cdot)$ denotes the $i$-th largest eigenvalue of a matrix; $x^{(i)}$ denotes the local model of worker $i$; $\nabla F_i(x^{(i)};\xi^{(i)})$ denotes a local stochastic gradient of worker $i$; and $\mathbf{1} = [1, 1, \ldots, 1]^\top \in \mathbb{R}^n$ denotes the all-one vector.

To state the algorithm more compactly, we define the concatenation of all local variables, all local stochastic gradients, and their averages, respectively:

$$X := \big[x^{(1)}, \ldots, x^{(n)}\big] \in \mathbb{R}^{N\times n}, \qquad G(X;\xi) := \big[\nabla F_1(x^{(1)};\xi^{(1)}), \ldots, \nabla F_n(x^{(n)};\xi^{(n)})\big],$$
$$\overline{G}(X;\xi) := \frac{G(X;\xi)\mathbf{1}}{n} = \frac{1}{n}\sum_{i=1}^{n}\nabla F_i(x^{(i)};\xi^{(i)}), \qquad \overline{\nabla f}(X) := \frac{1}{n}\sum_{i=1}^{n}\nabla f_i(x^{(i)}),$$

where $\xi$ is the collection of randomly sampled data from all workers.

Organization. This paper is organized as follows: Section 2 reviews work related to the proposed approach; Section 3 introduces the state-of-the-art decentralized stochastic gradient descent method and its convergence rate; Section 4 introduces the proposed algorithm and the intuition for why it improves on the state-of-the-art approach; Section 5 provides the theoretical guarantee; Section 6 validates the proposed approach via an empirical study; and Section 7 concludes the paper.

2. Related work

In this section, we review the stochastic gradient descent algorithm, decentralized algorithms and their stochastic variants, and previous variance reduction techniques.

Stochastic gradient descent (SGD). The SGD approach (Ghadimi & Lan, 2013; Moulines & Bach, 2011; Nemirovski et al., 2009) is quite powerful for solving large-scale machine learning problems. It achieves a convergence rate of $O(1/\sqrt{T})$. As an implementation of SGD, Centralized Parallel Stochastic Gradient Descent (C-PSGD) has been widely used in parallel computation. In C-PSGD, a central worker, whose job is to perform the variable updates, is connected to many leaf workers that compute stochastic gradients in parallel. C-PSGD has been applied in many deep learning frameworks, such as CNTK (Seide & Agarwal, 2016), MXNet (Chen et al., 2015), and TensorFlow (Abadi et al., 2016). The convergence rate of C-PSGD is $O(1/\sqrt{nT})$, which shows that it can achieve a linear speedup with respect to the number of leaf workers.
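To make the communication pattern of C-PSGD concrete, the following minimal sketch simulates a training loop in which a central worker averages the leaf workers' stochastic gradients. The toy least-squares problem, worker count, and step size are our own illustrative choices, not code from any of the frameworks cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_workers, dim, samples = 4, 10, 256

# Hypothetical local data: each worker holds its own least-squares problem.
A = [rng.normal(size=(samples, dim)) for _ in range(n_workers)]
b = [A[i] @ rng.normal(size=dim) + 0.1 * rng.normal(size=samples)
     for i in range(n_workers)]

def stochastic_gradient(x, i, batch=32):
    """Mini-batch gradient of worker i's local least-squares loss."""
    idx = rng.integers(0, samples, size=batch)
    Ai, bi = A[i][idx], b[i][idx]
    return Ai.T @ (Ai @ x - bi) / batch

x = np.zeros(dim)   # model kept on the central worker
gamma = 0.05        # step size
for t in range(200):
    # Leaf workers compute stochastic gradients in parallel (simulated serially here).
    grads = [stochastic_gradient(x, i) for i in range(n_workers)]
    # The central worker averages them and performs the update.
    x -= gamma * np.mean(grads, axis=0)
```

The decentralized algorithms discussed next replace this central averaging step with local averaging over a sparse communication graph.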
Decentralized algorithms. Centralized algorithms require a central server to communicate with all other workers (Suresh et al., 2017). In contrast, decentralized algorithms work on any connected network and only rely on information exchange between neighboring workers (Kashyap et al., 2007; Lavaei & Murray, 2012; Nedic et al., 2009). Decentralized algorithms are especially useful on networks with limited bandwidth or high latency, and they are also favorable when data privacy is a concern. These advantages have led to successful applications. The decentralized approach for multi-task reinforcement learning was studied in Omidshafiei et al. (2017); Mhamdi et al. (2017). In Colin et al. (2016), a dual-based decentralized algorithm was proposed to solve pairwise function optimization. Shi et al. (2014) and Mokhtari & Ribeiro (2015) analyzed the decentralized version of the ADMM optimization algorithm. An information-theoretic approach was used to analyze decentralization in Dobbe et al. (2017). The decentralized version of (sub-)gradient descent was studied in Nedic & Ozdaglar (2009); Yuan et al. (2016). Its $O(1/\sqrt{T})$ convergence requires a diminishing stepsize or a constant stepsize that depends on the total number of iterations. This happens because of the variance between the data on different workers, which we call the outer variance to differentiate it from the variance in SGD.

Recently, several deterministic decentralized optimization algorithms have been proposed that allow a constant stepsize. For example, EXTRA (Shi et al., 2015a) is the first modification of decentralized gradient descent that converges under a constant stepsize. Later this algorithm was extended to problems with a sum of smooth and nonsmooth functions at each node (Shi et al., 2015b). The DIGing algorithm was proposed in Nedić et al. (2017), where two exchanges are needed in each iteration; however, its stepsizes depend on both the Lipschitz constant of the differentiable function and the network structure. NIDS is the first algorithm with a constant, network-independent stepsize (Li et al., 2017). This algorithm was simultaneously proposed by Yuan et al. (2017) for the smooth case only, using a different approach.

Decentralized parallel stochastic gradient descent (D-PSGD). The D-PSGD algorithm (Nedic & Ozdaglar, 2009; Ram et al., 2010a;b) requires each worker to compute a stochastic gradient and exchange its local model with its neighbors. In Duchi et al. (2012), a dual-averaging based method was proposed for solving constrained decentralized SGD optimization. In Yuan et al. (2016), the convergence rate of D-PSGD was analyzed under the assumption that the gradient is bounded. In Lan et al. (2017), a decentralized primal-dual type method was proposed with a computational complexity of $O(n/\epsilon^2)$ for general convex objectives. Lian et al. (2017a) proved that D-PSGD admits a linear speedup with respect to the number of workers, with a convergence rate similar to that of C-PSGD.

Variance reduction techniques. Many methods have been developed for reducing the variance in SGD, including SVRG (Johnson & Zhang, 2013), SAGA (Defazio et al., 2014), SAG (Schmidt et al., 2017), MISO (Mairal, 2015), and mS2GD (Konečný et al., 2016). However, most of these techniques are designed for centralized approaches.
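For context, the sketch below illustrates the SVRG-style correction of Johnson & Zhang (2013) on a centralized finite-sum problem; it is a generic illustration of this line of work on a toy objective of our own choosing, not the variance reduction mechanism used by D2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, dim = 512, 10
A = rng.normal(size=(N, dim))
b = A @ rng.normal(size=dim) + 0.1 * rng.normal(size=N)

def grad(x, idx):
    """Gradient of the least-squares loss restricted to the samples in idx."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / len(idx)

x, gamma = np.zeros(dim), 0.02
for epoch in range(20):
    x_snap = x.copy()
    full_grad = grad(x_snap, np.arange(N))   # full gradient at the snapshot point
    for _ in range(N):
        i = rng.integers(N)
        # Variance-reduced direction: its expectation is the true gradient,
        # and its variance shrinks as x and x_snap approach the minimizer.
        v = grad(x, [i]) - grad(x_snap, [i]) + full_grad
        x -= gamma * v
```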
The DSA algorithm (Mokhtari & Ribeiro, 2016) applied variance reduction similar to SAGA to strongly convex decentralized optimization problems and proved a linear convergence rate. However, its speedup property is unclear, and a table of all stochastic gradients needs to be stored.

3. Preliminary: decentralized stochastic gradient descent

Decentralized stochastic gradient descent (D-PSGD) (Lian et al., 2017a; Zhang et al., 2017; Shahrampour & Jadbabaie, 2017) allows each worker (say, worker $i$) to maintain its own local variable $x^{(i)}$. During each iteration (say, iteration $t$), each worker performs the following steps:

1. Query its neighbors' local variables.

2. Take the weighted average of its local variable and its neighbors' local variables: $x^{(i)}_{t+\frac{1}{2}} = \sum_{j=1}^{n} W_{ij} x^{(j)}_t$, where $W_{ij}$ is the $(i,j)$ element of the matrix $W$; $W_{ij} = 0$ means worker $i$ and worker $j$ are not connected.

3. Perform one stochastic gradient descent step: $x^{(i)}_{t+1} = x^{(i)}_{t+\frac{1}{2}} - \gamma\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big)$, where $\xi^{(i)}_t$ represents the data sampled on worker $i$ at iteration $t$ following the distribution $D_i$.

From a global point of view, the update rule of D-PSGD can be written as

$$X_{t+1} = X_t W - \gamma G(X_t;\xi_t).$$

It admits the rate shown in Theorem 1.

Theorem 1 (Convergence rate of D-PSGD (Lian et al., 2017a)). Under certain assumptions, the output of D-PSGD admits the following inequality:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla f(\overline{X}_t)\big\|^2 \leq \frac{2\big(f(0)-f^*\big)}{\gamma T D_1} + \frac{\gamma L\sigma^2}{2n D_1} + \frac{\gamma^2 L^2 n\sigma^2}{(1-\rho)D_2 D_1} + \frac{9\gamma^2 L^2 n\zeta^2}{(1-\rho)^2 D_2 D_1},$$

where $\overline{X}_t := X_t\mathbf{1}/n$ denotes the averaged model, $\rho$ reflects the connectivity of the network, $D_1$ and $D_2$ are defined as

$$D_1 := \frac{1}{2} - \frac{9\gamma^2 L^2 n}{(1-\rho)^2 D_2}, \qquad D_2 := 1 - \frac{18\gamma^2 n L^2}{(1-\rho)^2},$$

and $\sigma$ and $\zeta$ measure the variation within each worker and among all workers, respectively:

$$\mathbb{E}_{\xi\sim D_i}\big\|\nabla F_i(x;\xi) - \nabla f_i(x)\big\|^2 \leq \sigma^2, \quad \forall i, \forall x, \qquad (2)$$
$$\frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(x) - \nabla f(x)\big\|^2 \leq \zeta^2, \quad \forall x. \qquad (3)$$

Choosing the optimal step length $\gamma = \frac{1}{\sigma\sqrt{T/n} + n^{1/3}\zeta^{2/3}T^{1/3}}$, we obtain the following convergence rate:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla f(\overline{X}_t)\big\|^2 \leq O\left(\frac{\sigma}{\sqrt{nT}} + \frac{n^{1/3}\zeta^{2/3}}{T^{2/3}}\right).$$

The proposed D2 algorithm improves this convergence rate by removing the dependence on the global bound $\zeta$ of the outer variance.

4. The D2 algorithm

In the D2 algorithm, each worker (say, worker $i$) repeats the following update rule (say, at iteration $t$):

1. Compute a local stochastic gradient $\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big)$ by sampling $\xi^{(i)}_t$ from the distribution $D_i$;

2. Update the local model using the local models and stochastic gradients from both iteration $t$ and iteration $t-1$:
$$x^{(i)}_{t+\frac{1}{2}} = 2x^{(i)}_t - x^{(i)}_{t-1} - \gamma\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big) + \gamma\nabla F_i\big(x^{(i)}_{t-1};\xi^{(i)}_{t-1}\big);$$

3. When the synchronization barrier is met, exchange $x^{(i)}_{t+\frac{1}{2}}$ with the neighbors and take the weighted average: $x^{(i)}_{t+1} = \sum_{j=1}^{n} W_{ij} x^{(j)}_{t+\frac{1}{2}}$.

From a global point of view, the update rule of D2 can be written as

$$X_{t+1} = \big(2X_t - X_{t-1} - \gamma G(X_t;\xi_t) + \gamma G(X_{t-1};\xi_{t-1})\big)W.$$

The complete procedure is summarized in Algorithm 1.

Algorithm 1 The D2 algorithm
1: Input: initial point $x^{(i)}_0 = 0$, step length $\gamma > 0$, confusion matrix $W$, and the total number of iterations $T$.
2: for $t = 0, 1, 2, \ldots, T$ do
3:   Randomly sample $\xi^{(i)}_t$ from the local data of the $i$-th worker.
4:   Compute a local stochastic gradient $\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big)$ based on $\xi^{(i)}_t$ and the current variable $x^{(i)}_t$.
5:   if $t = 0$ then
6:     $x^{(i)}_{t+\frac{1}{2}} = x^{(i)}_t - \gamma\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big)$
7:   else
8:     $x^{(i)}_{t+\frac{1}{2}} = 2x^{(i)}_t - x^{(i)}_{t-1} - \gamma\nabla F_i\big(x^{(i)}_t;\xi^{(i)}_t\big) + \gamma\nabla F_i\big(x^{(i)}_{t-1};\xi^{(i)}_{t-1}\big)$
9:   end if
10:  Each worker sends $x^{(i)}_{t+\frac{1}{2}}$ to its neighbors and takes the weighted average $x^{(i)}_{t+1} = \sum_{j=1}^{n} W_{ij} x^{(j)}_{t+\frac{1}{2}}$, where $x^{(j)}_{t+\frac{1}{2}}$ is received from worker $j$.
11: end for
12: Output: $\frac{1}{n}\sum_{i=1}^{n} x^{(i)}_T$
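Algorithm 1 translates into only a few lines of code from the global point of view. The sketch below simulates D2 on a synthetic problem in which every worker owns a least-squares objective with a different minimizer; the toy data, the ring confusion matrix, and the step size are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, samples, gamma, T = 5, 10, 200, 0.02, 500

# Decentralized data: each worker's local objective has a different minimizer.
A = [rng.normal(size=(samples, dim)) for _ in range(n)]
x_star = [rng.normal(size=dim) for _ in range(n)]
b = [A[i] @ x_star[i] for i in range(n)]

def G(X, batch=16):
    """Stack of local stochastic gradients, one column per worker (shape dim x n)."""
    cols = []
    for i in range(n):
        idx = rng.integers(0, samples, size=batch)
        Ai, bi = A[i][idx], b[i][idx]
        cols.append(Ai.T @ (Ai @ X[:, i] - bi) / batch)
    return np.stack(cols, axis=1)

# Symmetric ring confusion matrix with W @ 1 = 1.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 0.25

X_prev = np.zeros((dim, n))            # X_0 = 0
g_prev = G(X_prev)                     # gradients at iteration t = 0
X = (X_prev - gamma * g_prev) @ W      # special first step of Algorithm 1
for t in range(1, T):
    g = G(X)
    # D2 update: X_{t+1} = (2 X_t - X_{t-1} - gamma G_t + gamma G_{t-1}) W
    X_next = (2 * X - X_prev - gamma * g + gamma * g_prev) @ W
    X_prev, X, g_prev = X, X_next, g

# The column average of X should approach a stationary point of the average
# objective, even though each worker's own minimizer x_star[i] is different.
x_bar = X.mean(axis=1)
```

Replacing the update line with the D-PSGD rule `X_next = X @ W - gamma * G(X)` gives a direct way to compare the two algorithms under the same data heterogeneity.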
From a global point of view, D2 still essentially runs a stochastic gradient descent step. To understand the intuition behind D2, consider the mean value $\overline{X}_t := X_t\mathbf{1}/n$, which is updated just like standard stochastic gradient descent. Multiplying the global update rule

$$X_{t+1} = \big(2X_t - X_{t-1} - \gamma G(X_t;\xi_t) + \gamma G(X_{t-1};\xi_{t-1})\big)W$$

by $\mathbf{1}/n$ and using $W\mathbf{1} = \mathbf{1}$ gives

$$\overline{X}_{t+1} = 2\overline{X}_t - \overline{X}_{t-1} - \gamma\overline{G}(X_t;\xi_t) + \gamma\overline{G}(X_{t-1};\xi_{t-1}),$$

or equivalently

$$\overline{X}_{t+1} - \overline{X}_t = \overline{X}_t - \overline{X}_{t-1} - \gamma\overline{G}(X_t;\xi_t) + \gamma\overline{G}(X_{t-1};\xi_{t-1}).$$

Since $\overline{X}_1 = \overline{X}_0 - \gamma\overline{G}(X_0;\xi_0)$, applying this recursion repeatedly cancels all but the latest gradient, so that

$$\overline{X}_{t+1} - \overline{X}_t = -\gamma\overline{G}(X_t;\xi_t). \qquad (4)$$

Why does D2 improve on D-PSGD? One may notice that D-PSGD also essentially updates in the stochastic gradient descent form (4). Then why can D2 improve on D-PSGD? Assume that $X_t$ has reached the optimum $X^* := x^*\mathbf{1}^\top$, with all local models equal to the optimum $x^*$ of (1). For D-PSGD, the next update is

$$X_{t+1} = X^* - \gamma G(X^*;\xi_t).$$

This shows that the behavior of D-PSGD when we approach a solution is governed by $\mathbb{E}\big[\|G(X^*;\xi_t)\|_F^2\big]$, which is bounded by $O(\sigma^2 + \zeta^2)$, as we can see from

$$\mathbb{E}\big\|G(X^*;\xi_t)\big\|_F^2 = \sum_{i=1}^{n}\mathbb{E}\big\|\nabla F_i(x^*;\xi^{(i)}_t) - \nabla f_i(x^*) + \nabla f_i(x^*)\big\|^2 \leq \sum_{i=1}^{n}\Big(2\,\mathbb{E}\big\|\nabla F_i(x^*;\xi^{(i)}_t) - \nabla f_i(x^*)\big\|^2 + 2\big\|\nabla f_i(x^*) - \nabla f(x^*)\big\|^2\Big) \leq 2n\big(\sigma^2 + \zeta^2\big),$$

where the last step uses $\nabla f(x^*) = 0$. Next, we apply a similar analysis to D2 by assuming that both $X_{t-1}$ and $X_t$ have reached the optimal solution $X^*$. The next update of D2 is

$$X_{t+1} = \big(X^* - \gamma G(X^*;\xi_t) + \gamma G(X^*;\xi_{t-1})\big)W.$$

This shows that for D2, the behavior when we approach a solution depends on the magnitude of $\mathbb{E}\big[\|G(X^*;\xi_t) - G(X^*;\xi_{t-1})\|_F^2\big]$, which is bounded by $O(\sigma^2)$ and does not depend on $\zeta$, as can be seen from

$$\mathbb{E}\big\|G(X^*;\xi_t) - G(X^*;\xi_{t-1})\big\|_F^2 \leq \sum_{i=1}^{n}\Big(2\,\mathbb{E}\big\|\nabla F_i(x^*;\xi^{(i)}_t) - \nabla f_i(x^*)\big\|^2 + 2\,\mathbb{E}\big\|\nabla F_i(x^*;\xi^{(i)}_{t-1}) - \nabla f_i(x^*)\big\|^2\Big) \leq 4n\sigma^2.$$

5. Theoretical guarantee

This section provides the theoretical guarantee for the proposed D2 algorithm. We first state the required assumptions.

Assumption 1. Throughout this paper, we make the following commonly used assumptions:

1. Lipschitzian gradient: All functions $f_i(\cdot)$ have $L$-Lipschitzian gradients.

2. Bounded variance: The variance of the stochastic gradient within each worker is bounded: $\mathbb{E}_{\xi\sim D_i}\|\nabla F_i(x;\xi) - \nabla f_i(x)\|^2 \leq \sigma^2$, $\forall i, \forall x$.

3. Symmetric confusion matrix: The confusion matrix $W$ is symmetric and satisfies $W\mathbf{1} = \mathbf{1}$.

4. Spectral gap: Let the eigenvalues of $W \in \mathbb{R}^{n\times n}$ be $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, and denote for short $\lambda := \max_{i\in\{2,\ldots,n\}}\lambda_i = \lambda_2$. We assume $\lambda < 1$ and $\lambda_n > -\frac{1}{3}$.

5. Initialization: W.l.o.g., assume all local variables are initialized to zero, that is, $X_0 = 0$.

Existing decentralized consensus algorithms (Shi et al., 2015b; Li et al., 2017) use a modification of the doubly stochastic matrix such that $\lambda_n \geq 0$, i.e., they choose $W = (\widetilde{W} + I)/2$, where $\widetilde{W}$ is a doubly stochastic matrix. Recently, Li & Yan (2017) showed that $\lambda_n > -\frac{1}{3}$ is optimal for the convergence of EXTRA; the optimal $\lambda_n$ for NIDS (Li et al., 2017) is unknown. In this paper, we prove that $-\frac{1}{3}$ is the infimum of $\lambda_n$; when reduced to the deterministic case, this condition is weaker than that in (Li et al., 2017). This is important because we can actually use a $W$ that performs better.
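A quick way to check Assumptions 1.3 and 1.4 for a concrete topology is to build $W$ and inspect its spectrum. The sketch below uses a 5-worker ring (matching the LENET setup in Section 6) and also forms the modified matrix $(\widetilde{W}+I)/2$ mentioned above; the ring weights are our own illustrative choice.

```python
import numpy as np

def ring_matrix(n):
    """Symmetric doubly stochastic confusion matrix for a ring of n workers."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3
    return W

n = 5
W_ring = ring_matrix(n)
for name, W in [("ring", W_ring), ("(ring + I)/2", (W_ring + np.eye(n)) / 2)]:
    assert np.allclose(W, W.T) and np.allclose(W @ np.ones(n), np.ones(n))
    eig = np.sort(np.linalg.eigvalsh(W))[::-1]   # eigenvalues in decreasing order
    lam, lam_n = eig[1], eig[-1]                 # lambda = lambda_2 and lambda_n
    print(f"{name}: lambda = {lam:.3f}, lambda_n = {lam_n:.3f}, "
          f"assumption holds: {lam < 1 and lam_n > -1/3}")
```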
Given Assumption 1, we have the following convergence guarantee for D2.

Theorem 2 (Convergence of Algorithm 1). Choose the step length $\gamma$ in Algorithm 1 to be a constant satisfying $1 - 24C_2\gamma^2L^2 > 0$. Under Assumption 1, we have the following convergence rate for Algorithm 1:

$$A_1\big\|\nabla f(0)\big\|^2 + \sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla f(\overline{X}_t)\big\|^2 + A_2\sum_{t=0}^{T-1}\mathbb{E}\big\|\overline{\nabla f}(X_t)\big\|^2 \leq \frac{2\big(f(0)-f^*\big)}{\gamma} + \frac{\gamma L T}{n}\sigma^2 + \frac{6L^2C_1\gamma^2\zeta_0^2}{C_3} + \frac{12L^2C_2\gamma^2\sigma^2 T}{C_3} + \frac{6L^4C_2\gamma^4\sigma^2 T}{nC_3} + \frac{6L^2C_1\gamma^2\sigma^2}{C_3},$$

where

$$\zeta_0^2 := \frac{1}{n}\sum_{i=1}^{n}\big\|\nabla f_i(0) - \nabla f(0)\big\|^2,$$
$$C_1 := \max\left\{\frac{1}{1-|v|^2},\ \frac{1}{(1-\lambda)^2}\right\}, \qquad C_2 := \max\left\{\frac{\lambda_n^2}{1-|v|^2},\ \frac{\lambda^2}{(1-\lambda)^2}\right\},$$
$$C_3 := 1 - 24C_2\gamma^2L^2, \qquad A_1 := 1 - 6L^2C_1\gamma^2, \qquad A_2 := 1 - L\gamma - 6L^4C_2\gamma^4.$$

By appropriately specifying the step length $\gamma$, we reach the following corollary.

Corollary 3. Choose the step length in Algorithm 1 to be $\gamma = \frac{1}{8\sqrt{C_2}L + 6\sqrt{C_1}L + \sigma\sqrt{T/n}}$, where $C_1$ and $C_2$ are defined in Theorem 2. Under Assumption 1, the following convergence rate holds:

$$\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big\|\nabla f(\overline{X}_t)\big\|^2 \leq O\left(\frac{\sigma}{\sqrt{nT}} + \frac{1}{T} + \frac{\zeta_0^2}{T + \sigma^2T^2}\right),$$

where $\zeta_0$ is defined in Theorem 2 and we treat $f(0) - f^*$, $L$, $\lambda_n$, and $\lambda$ as constants.

Note that we can obtain even better constants by choosing different parameters and applying tighter inequalities; however, the main purpose of this corollary is to show the order of convergence. We highlight a few key observations from our theoretical results in the following.

Tightness of the convergence rate. Setting $\sigma = 0$ and $\zeta_0 = 0$, which reduces the variance-reduced stochastic algorithm to plain gradient descent, the convergence rate becomes $O(1/T)$, which is exactly the rate of GD.

Linear speedup. Since the leading term of the convergence rate is $O\big(\frac{\sigma}{\sqrt{nT}}\big)$, which is consistent with the convergence rate of C-PSGD, we achieve a linear speedup with respect to the number of nodes.

Consistency with NIDS. In NIDS (Li & Yan, 2017), the term that depends on $\zeta_0$ in the convergence rate is $O\big(\frac{\zeta_0^2}{T}\big)$, while the corresponding term in D2 is $O\big(\frac{\zeta_0^2}{T+\sigma^2T^2}\big)$. Since NIDS considers $\sigma = 0$, our result is consistent with NIDS in that case.

Superiority over D-PSGD. Compared with D-PSGD, the convergence rate of D2 depends only on $\zeta_0$, the variation among workers at the initial point, and the corresponding term decays at the rate $\frac{\zeta_0^2}{T^2}$ when $\sigma > 0$. In D-PSGD (Lian et al., 2017a), one needs to assume an upper bound on the global variance between different nodes' datasets at every point, and its influence is comparable to that of $\sigma^2$, the inner variance of each node itself. This means D2 can achieve a much better convergence rate than D-PSGD.

6. Experiments

We evaluate the effectiveness of D2 by comparing it with both centralized and decentralized SGD algorithms.

6.1. Experiment Settings

We conduct experiments in two settings.

1. TRANSFERLEARNING: We test the case where each worker has access to a local pre-trained neural network as a feature extractor, and we want to train a logistic regression model among all these workers. In our experiment, we select the first 16 classes of ImageNet and use Inception-V4 as the feature extractor to extract 2048 features for each image. We conduct data augmentation and generate a blurred version of each image. In total, this dataset contains 16 × 1300 × 2 images.

2. LENET: We test the case where all workers collaboratively train a neural network model. We train a LeNet on the CIFAR-10 dataset. In total, this dataset contains 50,000 images of size 32 × 32. One caveat of training more recent neural networks is that modern architectures often contain a batch normalization layer, which inherently assumes that the data distribution is uniform across different batches; this is not the case we are interested in. In principle, we could also flow the batch information through the network in a decentralized way, but we leave this as future work.

By default, each worker only has exclusive access to a subset of classes. For TRANSFERLEARNING, we use 16 workers and each worker has access to one class; for LENET, we use 5 workers and each worker has access to two classes. For comparison, we also consider the case where the dataset is first shuffled and then uniformly partitioned among all workers; we call this the shuffled case and the default one the unshuffled case. We use a ring topology for both experiments.
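For reference, the unshuffled and shuffled partitions used in these experiments can be produced with a few lines of code. The helper below is a minimal sketch assuming integer class labels (e.g., 0-9 as in CIFAR-10), not the exact pipeline used in our experiments.

```python
import numpy as np

def partition_unshuffled(labels, n_workers):
    """Unshuffled case: each worker gets an exclusive subset of the class labels."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    chunks = np.array_split(classes, n_workers)   # e.g. 10 labels, 5 workers -> 2 each
    return [np.where(np.isin(labels, chunk))[0] for chunk in chunks]

def partition_shuffled(labels, n_workers, seed=0):
    """Shuffled case: a uniform random split, so worker distributions match."""
    idx = np.random.default_rng(seed).permutation(len(labels))
    return np.array_split(idx, n_workers)

# Toy stand-in for CIFAR-10-style labels.
labels = np.random.default_rng(0).integers(0, 10, size=50_000)
unshuffled = partition_unshuffled(labels, 5)   # worker 0 sees only classes {0, 1}, etc.
shuffled = partition_shuffled(labels, 5)
```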
Parameter Tuning. For TRANSFERLEARNING, we use a constant learning rate tuned from {0.01, 0.025, 0.05, 0.075, 0.1}. For LENET, we use a constant learning rate of 0.05, tuned from {0.5, 0.1, 0.05, 0.01} for the centralized algorithm, and a batch size of 128 on each worker.

Metrics. In this paper, we mainly focus on the convergence rate of different algorithms instead of the wall-clock speed. This is because the implementation of D2 is a minor change over the standard D-PSGD algorithm, so the two have almost the same speed per epoch of training, and both are no slower than the centralized algorithm. When the network has high latency, if a decentralized algorithm (D2 or D-PSGD) converges at a speed similar to the centralized algorithm, it can be up to one order of magnitude faster (Lian et al., 2017a). However, the dependence of the convergence rate on the outer variance is different for the two algorithms.

6.2. Unshuffled Case

In the unshuffled case, each worker only has access to the data of its assigned labels, so the data variation across workers is maximized. Figure 1 shows the result. In the unshuffled case, D-PSGD converges more slowly than the centralized algorithm, which is consistent with the original D-PSGD paper (Lian et al., 2017a). On the other hand, D2 converges much faster than D-PSGD and achieves almost the same loss as the centralized algorithm. For the LENET case, each worker only has access to data of its two assigned labels, which means the data variation is very large; D-PSGD does not converge with the given learning rate 0.05.¹

¹We can tune the learning rate to be 50× smaller so that D-PSGD converges in this case, but doing so makes D-PSGD stay close to the starting point for quite a long time.

6.3. Shuffled Case

As a sanity check, Figure 2 shows the result of the three algorithms on the shuffled data. In this case, the data variation among workers is small (in expectation, they are drawn from the same distribution). We see that all strategies have a similar convergence rate. This validates that D2 is most effective when the data variation between workers is large.

Figure 1. Convergence of different distributed training algorithms (unshuffled case): (a) TRANSFERLEARNING, (b) LENET.

Figure 2. Convergence of different distributed training algorithms (shuffled case): (a) TRANSFERLEARNING, (b) LENET.

7. Conclusion

In this paper, we propose a decentralized algorithm, namely the D2 algorithm. D2 integrates the D-PSGD algorithm with a variance reduction technique, which improves the convergence rate of D-PSGD. The variance reduction technique used in this paper is different from commonly used ones such as SVRG and SAGA, which are designed for centralized approaches. Experiments validate the advantage of D2 over D-PSGD: D2 converges at a rate similar to centralized SGD, while D-PSGD does not converge to a solution of similar quality when the data variance is large. While being robust to large data variance among workers, D2 retains the performance benefit that D-PSGD has over the centralized strategy.

Acknowledgements

This project is supported in part by NSF CCF-1718513, NEC fellowship, IBM faculty award, NSF DMS-1621798, Swiss NSF NRP 75 407540 167266, IBM Zurich, Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, Zurich Insurance, and the Chinese Scholarship Council.
References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., Xiao, T., Xu, B., Zhang, C., and Zhang, Z. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.

Colin, I., Bellet, A., Salmon, J., and Clémençon, S. Gossip dual averaging for decentralized optimization of pairwise functions. In International Conference on Machine Learning, pp. 1388–1396, 2016.

Defazio, A., Bach, F., and Lacoste-Julien, S. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.

Dobbe, R., Fridovich-Keil, D., and Tomlin, C. Fully decentralized policies for multi-agent systems: An information theoretic approach. In Advances in Neural Information Processing Systems, pp. 2945–2954, 2017.

Duchi, J. C., Agarwal, A., and Wainwright, M. J. Dual averaging for distributed optimization: Convergence analysis and network scaling. IEEE Transactions on Automatic Control, 57(3):592–606, 2012.

Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013. doi: 10.1137/120880811.

Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323, 2013.

Kashyap, A., Başar, T., and Srikant, R. Quantized consensus. Automatica, 43(7):1192–1203, 2007.

Konečný, J., Liu, J., Richtárik, P., and Takáč, M. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE Journal of Selected Topics in Signal Processing, 10(2):242–255, 2016.

Lan, G., Lee, S., and Zhou, Y. Communication-efficient algorithms for decentralized and stochastic optimization. 2017.

Lavaei, J. and Murray, R. M. Quantized consensus by means of gossip algorithm. IEEE Transactions on Automatic Control, 57(1):19–32, 2012.

Li, Z. and Yan, M. A primal-dual algorithm with optimal stepsizes and its application in decentralized consensus optimization. arXiv preprint arXiv:1711.06785, 2017.

Li, Z., Shi, W., and Yan, M. A decentralized proximal-gradient method with network independent step-sizes and separated convergence rates. arXiv preprint arXiv:1704.07807, 2017.

Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. 2017a.

Lian, X., Zhang, W., Zhang, C., and Liu, J. Asynchronous decentralized parallel stochastic gradient descent. arXiv preprint arXiv:1710.06952, 2017b.

Mairal, J. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2):829–855, 2015.

Mhamdi, E., Mahdi, E., Hendrikx, H., Guerraoui, R., and Maurer, A. D. O. Dynamic safe interruptibility for decentralized multi-agent reinforcement learning. Technical report, EPFL, 2017.

Mokhtari, A. and Ribeiro, A. Decentralized double stochastic averaging gradient. In Signals, Systems and Computers, 2015 49th Asilomar Conference on, pp. 406–410. IEEE, 2015.
Mokhtari, A. and Ribeiro, A. DSA: Decentralized double stochastic averaging gradient algorithm. Journal of Machine Learning Research, 17(61):1–35, 2016.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F., and Weinberger, K. Q. (eds.), Advances in Neural Information Processing Systems 24, pp. 451–459. Curran Associates, Inc., 2011.

Nedic, A. and Ozdaglar, A. Distributed subgradient methods for multi-agent optimization. IEEE Transactions on Automatic Control, 54(1):48–61, 2009.

Nedic, A., Olshevsky, A., Ozdaglar, A., and Tsitsiklis, J. N. On distributed averaging algorithms and quantization effects. IEEE Transactions on Automatic Control, 54(11):2506–2517, 2009.

Nedić, A., Olshevsky, A., and Rabbat, M. G. Network topology and communication-computation tradeoffs in decentralized optimization. arXiv preprint arXiv:1709.08765, 2017.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009. doi: 10.1137/070704277.

Omidshafiei, S., Pazis, J., Amato, C., How, J. P., and Vian, J. Deep decentralized multi-task multi-agent RL under partial observability. arXiv preprint arXiv:1703.06182, 2017.

Ram, S. S., Nedić, A., and Veeravalli, V. V. Asynchronous gossip algorithm for stochastic optimization: Constant stepsize analysis. In Recent Advances in Optimization and its Applications in Engineering, pp. 51–60. Springer, 2010a.

Ram, S. S., Nedić, A., and Veeravalli, V. V. Distributed stochastic subgradient projection algorithms for convex optimization. Journal of Optimization Theory and Applications, 147(3):516–545, 2010b.

Schmidt, M., Le Roux, N., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2):83–112, 2017.

Seide, F. and Agarwal, A. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, pp. 2135–2135, New York, NY, USA, 2016. ACM. ISBN 978-1-4503-4232-2. doi: 10.1145/2939672.2945397.

Shahrampour, S. and Jadbabaie, A. Distributed online optimization in dynamic environments using mirror descent. IEEE Transactions on Automatic Control, 2017.

Shi, W., Ling, Q., Yuan, K., Wu, G., and Yin, W. On the linear convergence of the ADMM in decentralized consensus optimization. IEEE Transactions on Signal Processing, 62(7):1750–1761, 2014.

Shi, W., Ling, Q., Wu, G., and Yin, W. EXTRA: An exact first-order algorithm for decentralized consensus optimization. SIAM Journal on Optimization, 25(2):944–966, 2015a.

Shi, W., Ling, Q., Wu, G., and Yin, W. A proximal gradient algorithm for decentralized composite optimization. IEEE Transactions on Signal Processing, 63(22):6013–6023, 2015b.

Suresh, A. T., Yu, F. X., Kumar, S., and McMahan, H. B. Distributed mean estimation with limited communication. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3329–3337, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.

Yuan, K., Ling, Q., and Yin, W. On the convergence of decentralized gradient descent. SIAM Journal on Optimization, 26(3):1835–1854, 2016. doi: 10.1137/130943170.
Yuan, K., Ying, B., Zhao, X., and Sayed, A. H. Exact diffusion for distributed optimization and learning, Part I: Algorithm development. arXiv preprint arXiv:1702.05122, 2017.

Zhang, W., Zhao, P., Zhu, W., Hoi, S. C., and Zhang, T. Projection-free distributed online learning in networks. In International Conference on Machine Learning, pp. 4054–4062, 2017.