Published as a conference paper at ICLR 2020

DECENTRALIZED DEEP LEARNING WITH ARBITRARY COMMUNICATION COMPRESSION

Anastasia Koloskova (anastasia.koloskova@epfl.ch), Tao Lin (tao.lin@epfl.ch), Sebastian U. Stich (sebastian.stich@epfl.ch), Martin Jaggi (martin.jaggi@epfl.ch)
EPFL, Lausanne, Switzerland

ABSTRACT

Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches are limited by network bandwidth, we propose the use of communication compression in the decentralized training context. We show that CHOCO-SGD achieves linear speedup in the number of workers for arbitrarily high compression ratios on general non-convex functions, and non-IID training data. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over decentralized user devices, connected by a peer-to-peer network, and (ii) in a datacenter.

1 INTRODUCTION

Distributed machine learning, i.e. the training of machine learning models using distributed optimization algorithms, has recently enabled many successful applications in research and industry. Such methods offer two key advantages: 1) computational scalability, by leveraging the simultaneous computational power of many devices, and 2) data-locality, the ability to perform joint training while keeping each part of the training data local to each participating device. Recent theoretical results indicate that decentralized schemes can be as efficient as centralized approaches, at least when considering convergence of training loss vs. iterations (Scaman et al., 2017; 2018; Lian et al., 2017; Tang et al., 2018; Koloskova et al., 2019; Assran et al., 2019).

Gradient compression techniques have been proposed for the standard distributed training case (Alistarh et al., 2017; Wen et al., 2017; Lin et al., 2018; Wangni et al., 2018; Stich et al., 2018) to reduce the amount of data that has to be sent over each communication link in the network. For decentralized training of deep neural networks, Tang et al. (2018) introduce two algorithms (DCD, ECD) which allow for communication compression. However, both these algorithms are restrictive with respect to the compression operators used: they only allow for unbiased compressors and, more significantly, do not support arbitrarily high compression ratios. We here study CHOCO-SGD, recently introduced for convex problems only (Koloskova et al., 2019), which overcomes these constraints. For the evaluation of our algorithm we in particular focus on the generalization performance (on the test-set) on standard machine learning benchmarks, hereby departing from previous work (e.g. Tang et al., 2018; Wang et al., 2019; Tang et al., 2019; Reisizadeh et al., 2019) that mostly considered training performance (on the train-set). We study two different scenarios: firstly, (i) training in a challenging peer-to-peer setting, where the training data is distributed over the training devices (and not allowed to move), similar to the federated learning setting (McMahan et al., 2017). We show speed-ups for CHOCO-SGD over the decentralized baseline (Lian et al., 2017) with much less communication overhead.
Secondly, (ii) training in a datacenter setting, where decentralized communication patterns allow better scalability than centralized approaches. For this setting we show that communication-efficient CHOCO-SGD can improve time-to-accuracy on large tasks, such as ImageNet training. However, when investigating the scaling of decentralized algorithms to larger numbers of nodes we observe that (all) decentralized schemes encounter difficulties and often do not reach the same (test and train) performance as centralized schemes. As these findings point out some deficiencies of current decentralized training schemes (and are not particular to our scheme), we think that reporting these results is a helpful contribution to the community to spur further research on decentralized training schemes that scale to a large number of peers.

Contributions. Our contributions can be summarized as:
- On the theory side, we are the first to show that CHOCO-SGD converges at rate $\mathcal{O}\big(1/\sqrt{nT} + 1/(\rho^2 \delta T)^{2/3}\big)$ on non-convex smooth functions, where n denotes the number of nodes, T the number of iterations, ρ the spectral gap of the mixing matrix and δ the compression ratio. The main term, $\mathcal{O}\big(1/\sqrt{nT}\big)$, matches the centralized baselines with exact communication and shows a linear speedup in the number of workers n. Both ρ and δ only affect the asymptotically smaller second term.
- On the practical side, we present a version of CHOCO-SGD with momentum and analyze its practical performance in two relevant scenarios: for on-device training over a realistic peer-to-peer social network, where lowering the bandwidth requirements of joint training is especially impactful, and in a datacenter setting, for computational scalability of training deep learning models, resource efficiency and improved time-to-accuracy.
- Lastly, we systematically investigate the performance of decentralized schemes when scaling to a larger number of nodes and point out some (shared) difficulties encountered by current decentralized learning approaches.

2 RELATED WORK

For training in communication-restricted settings a variety of methods has been proposed, for instance decentralized schemes (Lian et al., 2017; Nedić et al., 2018; Koloskova et al., 2019), gradient compression (Seide et al., 2014; Strom, 2015; Alistarh et al., 2017; Wen et al., 2017; Lin et al., 2018; Wangni et al., 2018; Bernstein et al., 2018; Alistarh et al., 2018; Stich et al., 2018; Karimireddy et al., 2019), asynchronous methods (Recht et al., 2011; Assran et al., 2019), or performing multiple local SGD steps before averaging (Zhang et al., 2016; McMahan et al., 2017; Lin et al., 2020). This especially covers learning over decentralized data, as extensively studied in the federated learning literature for centralized algorithms (McMahan et al., 2016). In this paper we advocate combining decentralized SGD schemes with gradient compression.

Decentralized SGD. We in particular focus on approaches based on gossip averaging (Kempe et al., 2003; Xiao & Boyd, 2004; Boyd et al., 2006), whose convergence rate typically depends on the spectral gap ρ > 0 of the mixing matrix (Xiao & Boyd, 2004). Lian et al. (2017) combine SGD with gossip averaging and show that the leading term in the convergence rate, $\mathcal{O}\big(1/\sqrt{nT}\big)$, is consistent with the convergence of centralized mini-batch SGD (Dekel et al., 2012) and that the spectral gap only affects the asymptotically smaller terms.
Similar results have been observed very recently for related schemes (Scaman et al., 2017; 2018; Koloskova et al., 2019; Yu et al., 2019).

Quantization. Communication compression with quantization has been popularized in the deep learning community by the reported successes in (Seide et al., 2014; Strom, 2015). Theoretical guarantees were first established for schemes with unbiased compression (Alistarh et al., 2017; Wen et al., 2017; Wangni et al., 2018) but soon extended to biased compression (Bernstein et al., 2018) as well. Schemes with error correction often work best in practice and give the best theoretical guarantees (Lin et al., 2018; Alistarh et al., 2018; Stich et al., 2018; Karimireddy et al., 2019). Recently, also proximal updates and variance reduction have been studied in combination with quantized updates (Mishchenko et al., 2019; Horváth et al., 2019).

Decentralized Optimization with Quantization. It has been observed that gossip averaging can diverge (or not converge to the correct solution) in the presence of quantization noise (Xiao et al., 2005; Carli et al., 2007; Nedić et al., 2008; Dimakis et al., 2010; Carli et al., 2010b; Yuan et al., 2012). Reisizadeh et al. (2018) propose an algorithm that can still converge, though at a slower rate than the exact scheme. Another line of work proposed adaptive schemes (with increasing compression accuracy) that converge at the expense of higher communication cost (Carli et al., 2010a; Doan et al., 2018; Berahas et al., 2019). For deep learning applications, Tang et al. (2018) proposed the DCD and ECD algorithms, which converge at the same rate as the centralized baseline, though only for constant compression ratio. The CHOCO-SGD algorithm that we consider in this work can deal with arbitrarily high compression; it has been introduced in (Koloskova et al., 2019) but only analyzed for convex functions. For non-convex functions we show a rate of $\mathcal{O}\big(1/\sqrt{nT} + 1/(\rho^2 \delta T)^{2/3}\big)$, where δ > 0 measures the compression quality. Simultaneous work of Tang et al. (2019) introduced DeepSqueeze, an alternative method which also converges with arbitrary compression ratio. In our experiments, under the same amount of tuning, CHOCO-SGD achieves higher test accuracy.

3 CHOCO-SGD

In this section we formally introduce the decentralized optimization problem, compression operators, and the gossip-based stochastic optimization algorithm CHOCO-SGD from (Koloskova et al., 2019).

Distributed Setup. We consider optimization problems distributed across n nodes of the form
$$f^\star := \min_{x \in \mathbb{R}^d} \Big[ f(x) := \tfrac{1}{n} \sum_{i=1}^n f_i(x) \Big]\,, \qquad f_i(x) := \mathbb{E}_{\xi_i \sim D_i} F_i(x, \xi_i)\,, \quad i \in [n]\,, \tag{1}$$
where $D_1, \dots, D_n$ are local distributions for sampling data, which can be different on every node, and $F_i \colon \mathbb{R}^d \times \Omega \to \mathbb{R}$ are possibly non-convex (and non-identical) loss functions. This setting covers the important case of empirical risk minimization in distributed machine learning and deep learning applications.

Communication. Every device is only allowed to communicate with its local neighbours defined by the network topology, given as a weighted graph $G = ([n], E)$, with edges E representing the communication links along which messages (e.g. model updates) can be exchanged. We assign a positive weight $w_{ij}$ to every edge ($w_{ij} = 0$ for disconnected nodes $\{i,j\} \notin E$).

Assumption 1 (Mixing matrix). We assume that $W \in [0,1]^{n \times n}$, $(W)_{ij} = w_{ij}$, is a symmetric ($W = W^\top$) doubly stochastic ($W\mathbf{1} = \mathbf{1}$, $\mathbf{1}^\top W = \mathbf{1}^\top$) matrix with eigenvalues $1 = |\lambda_1(W)| > |\lambda_2(W)| \ge \dots \ge |\lambda_n(W)|$ and spectral gap $\rho := 1 - |\lambda_2(W)| \in (0, 1]$.

In our experiments we set the weights based on the local node degrees: $w_{ij} = \big(\max\{\deg(i), \deg(j)\}\big)^{-1}$ for $\{i,j\} \in E$. This not only guarantees ρ > 0, but these weights can also easily be computed in a local fashion on each node (Xiao & Boyd, 2004).
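As an illustration of Assumption 1 and the local-degree weight rule, the following minimal NumPy sketch (not part of the paper's reference implementation) builds such a mixing matrix for a small ring and computes its spectral gap. It assumes that node degrees count the self-loop $\{i\} \in E$ that Algorithm 1 uses, which reproduces the ring value ρ = 0.67 for n = 4 reported later in Tab. 2.

```python
import numpy as np

def local_degree_mixing_matrix(edges, n):
    """W with w_ij = 1 / max(deg(i), deg(j)) for {i, j} in E and self-weights
    chosen so that every row sums to one (symmetric, doubly stochastic).
    Here deg counts the self-loop {i} in E, i.e. graph degree + 1 (assumption)."""
    deg = np.ones(n, dtype=int)          # start at 1 to account for the self-loop
    for i, j in edges:
        deg[i] += 1
        deg[j] += 1
    W = np.zeros((n, n))
    for i, j in edges:
        W[i, j] = W[j, i] = 1.0 / max(deg[i], deg[j])
    W[np.diag_indices(n)] = 1.0 - W.sum(axis=1)   # self-weights w_ii
    return W

def spectral_gap(W):
    """rho = 1 - |lambda_2(W)| for a symmetric mixing matrix W."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - eig[1]

n = 4
ring = [(i, (i + 1) % n) for i in range(n)]       # ring topology on 4 nodes
W = local_degree_mixing_matrix(ring, n)
assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
print(f"spectral gap rho = {spectral_gap(W):.3f}")  # approx. 0.667
```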
Compression. We aim to only transmit compressed (e.g. quantized or sparsified) messages. We formalize this through the notion of compression operators that was e.g. also used in (Tang et al., 2018; Stich et al., 2018).

Definition 3.1 (Compression operator). $Q \colon \mathbb{R}^d \to \mathbb{R}^d$ is a compression operator if it satisfies
$$\mathbb{E}_Q\, \|Q(x) - x\|^2 \le (1 - \delta)\, \|x\|^2\,, \qquad \forall x \in \mathbb{R}^d\,, \tag{2}$$
for a parameter δ > 0. Here $\mathbb{E}_Q$ denotes the expectation over the internal randomness of the operator Q.

In contrast to the quantization operators used e.g. in (Alistarh et al., 2017; Horváth et al., 2019), compression operators defined as in (2) are not required to be unbiased and therefore support a larger class of compression operators. Some examples can be found in (Koloskova et al., 2019) and we further discuss specific compression schemes in Section 5.

Algorithm. CHOCO-SGD is summarized in Algorithm 1. Every worker i stores its own private variable $x_i \in \mathbb{R}^d$ that is updated by a stochastic gradient step (line 9) and a modified gossip averaging step (line 2). The averaging step is a key element of the algorithm, as it preserves the average of the iterates even in the presence of quantization noise (the compression errors are not discarded, but aggregated in the local variables $x_i$, see also (Koloskova et al., 2019)). The nodes communicate with their neighbors (lines 4–7) and update the variables $\hat x_j \in \mathbb{R}^d$ for all their neighbors $\{i,j\} \in E$ using only compressed updates. These $\hat x_i$ are available to all neighbors of node i and represent publicly available copies of the private $x_i$; in general $x_i \neq \hat x_i$, due to the communication restrictions.

Algorithm 1 CHOCO-SGD (Koloskova et al., 2019)
input: initial value $x^{(-1/2)} \in \mathbb{R}^d$, with $x_i^{(-1/2)} = x^{(-1/2)}$ on each node $i \in [n]$; consensus stepsize γ; SGD stepsize η; communication graph $G = ([n], E)$ and mixing matrix W; initialize $\hat x_i^{(0)} := 0$ for all $i \in [n]$
 1: for t in $0 \dots T-1$ do   {in parallel for all workers $i \in [n]$}
 2:   $x_i^{(t)} := x_i^{(t-1/2)} + \gamma \sum_{j : \{i,j\} \in E} w_{ij} \big(\hat x_j^{(t)} - \hat x_i^{(t)}\big)$   (modified gossip averaging)
 3:   $q_i^{(t)} := Q\big(x_i^{(t)} - \hat x_i^{(t)}\big)$   (compression)
 4:   for neighbors $j : \{i,j\} \in E$ (including $\{i\} \in E$) do
 5:     send $q_i^{(t)}$ and receive $q_j^{(t)}$   (communication)
 6:     $\hat x_j^{(t+1)} := q_j^{(t)} + \hat x_j^{(t)}$   (local update)
 7:   end for
 8:   sample $\xi_i^{(t)}$, compute gradient $g_i^{(t)} := \nabla F_i\big(x_i^{(t)}, \xi_i^{(t)}\big)$
 9:   $x_i^{(t+1/2)} := x_i^{(t)} - \eta\, g_i^{(t)}$   (stochastic gradient update)
10: end for

From an implementation aspect, it is worth highlighting that the communication part (lines 2–7) and the gradient computation part (lines 8–9) can be executed in parallel, because they are independent. Moreover, each node only needs to store at most 3 vectors, independent of the number of neighbors (this might not be obvious from the notation used here for additional clarity; for further details cf. (Koloskova et al., 2019)). We further propose a momentum version of CHOCO-SGD in Algorithm 2 (see also Section D for further details).
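To make the structure of Algorithm 1 concrete, here is a minimal single-process NumPy simulation of the update rules on a toy least-squares problem. It is only an illustrative sketch (the paper's actual implementation is the linked PyTorch code); the compression operator, hyperparameters and the toy objective are chosen for illustration, and a deterministic local gradient stands in for the stochastic gradient $g_i^{(t)}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sparsify(x, a=0.1):
    """Biased random sparsification: keep roughly an a-fraction of coordinates.
    Satisfies Definition 3.1 with delta = a in expectation."""
    return x * (rng.random(x.shape) < a)

def choco_sgd(grad_fn, W, d, T=500, eta=0.05, gamma=0.1, Q=random_sparsify):
    """Centralized simulation of Algorithm 1: in a real deployment each node
    stores only its own x_i, x_hat_i and the x_hat_j of its neighbors."""
    n = W.shape[0]
    x = np.zeros((n, d))        # private iterates x_i
    x_hat = np.zeros((n, d))    # public copies \hat x_i known to the neighbors
    for t in range(T):
        # line 2: modified gossip averaging, x_i += gamma * sum_j w_ij (x_hat_j - x_hat_i)
        x = x + gamma * (W @ x_hat - x_hat)
        # lines 3-6: compress the difference and update the public copies
        q = np.stack([Q(x[i] - x_hat[i]) for i in range(n)])
        x_hat = x_hat + q
        # lines 8-9: local (stochastic) gradient step
        g = np.stack([grad_fn(i, x[i]) for i in range(n)])
        x = x - eta * g
    return x

# Toy usage: 8 nodes on a ring minimize (1/n) sum_i (1/m) ||A_i x - b_i||^2.
n, d, m = 8, 20, 32
A = rng.standard_normal((n, m, d)); b = rng.standard_normal((n, m))
grad = lambda i, x: 2 * A[i].T @ (A[i] @ x - b[i]) / m
W = np.zeros((n, n))
for i in range(n):                                  # ring with self-loops, weights 1/3
    W[i, i] = W[i, (i + 1) % n] = W[i, (i - 1) % n] = 1 / 3
x_star = np.linalg.lstsq(A.reshape(-1, d), b.reshape(-1), rcond=None)[0]
x = choco_sgd(grad, W, d)
print("distance of averaged iterate to optimum:", np.linalg.norm(x.mean(0) - x_star))
```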
4 CONVERGENCE OF CHOCO-SGD ON SMOOTH NON-CONVEX PROBLEMS

As the first main contribution, we here extend the analysis of CHOCO-SGD to non-convex problems. For this we make the following technical assumptions.

Assumption 2. Each function $f_i \colon \mathbb{R}^d \to \mathbb{R}$ for $i \in [n]$ is L-smooth, that is $\|\nabla f_i(y) - \nabla f_i(x)\| \le L \|y - x\|$ for all $x, y \in \mathbb{R}^d$, $i \in [n]$, and the variance of the stochastic gradients is bounded on each worker:
$$\mathbb{E}_{\xi_i} \big\|\nabla F_i(x, \xi_i) - \nabla f_i(x)\big\|^2 \le \sigma_i^2\,, \qquad \mathbb{E}_{\xi_i} \big\|\nabla F_i(x, \xi_i)\big\|^2 \le G^2\,, \qquad \forall x \in \mathbb{R}^d,\ i \in [n]\,, \tag{3}$$
where $\mathbb{E}_{\xi_i}[\cdot]$ denotes the expectation over $\xi_i \sim D_i$. We also denote $\sigma^2 := \frac{1}{n}\sum_{i=1}^n \sigma_i^2$ for convenience.

Theorem 4.1. Under Assumptions 1–2 there exist a constant stepsize η and the consensus stepsize from (Koloskova et al., 2019), $\gamma := \frac{\rho^2 \delta}{16\rho + \rho^2 + 4\beta^2 + 2\rho\beta^2 - 8\rho\delta}$ with $\beta = \|I - W\|_2 \in [0, 2]$, such that the averaged iterates $\bar x^{(t)} := \frac{1}{n}\sum_{i=1}^n x_i^{(t)}$ of Algorithm 1 satisfy
$$\frac{1}{T+1} \sum_{t=0}^{T} \mathbb{E}\big\|\nabla f\big(\bar x^{(t)}\big)\big\|_2^2 \le 2\sqrt{\frac{8\,\sigma^2 L F_0}{n(T+1)}} + 17\left(\frac{G L F_0}{c\,(T+1)}\right)^{2/3} + \frac{16\, L F_0}{T+1}\,,$$
where $c := \frac{\rho^2 \delta}{82}$ denotes the convergence rate of the underlying consensus averaging scheme of (Koloskova et al., 2019) and $F_0 := f(x^{(0)}) - f^\star$.

This result shows that CHOCO-SGD converges as $\mathcal{O}\big(1/\sqrt{nT} + 1/(\rho^2 \delta T)^{2/3}\big)$. The first term shows a linear speed-up compared to SGD on a single node, while compression and graph topology affect only the higher-order second term. By choosing the stepsize differently, $\eta := \sqrt{n/(T+1)}$, we can recover the asymptotic convergence rate $\mathcal{O}\big(1/\sqrt{nT} + n/(\rho^4 \delta^2 T)\big)$. For the proofs and the convergence of the individual iterates $x_i$ we refer to Appendix A.

5 COMPARISON TO BASELINES FOR VARIOUS COMPRESSION SCHEMES

In this section we experimentally compare CHOCO-SGD to the relevant baselines for a selection of commonly used compression operators. For the experiments we further leverage momentum in all implemented algorithms. The newly developed momentum version of CHOCO-SGD is given as Algorithm 2.

Algorithm 2 CHOCO-SGD with Momentum
input: same as for Algorithm 1; additionally: weight decay factor λ, momentum factor β, local momentum memory $v_i^{(0)} := 0$ for all $i \in [n]$
Lines 1–8 of Algorithm 1 are left unmodified; line 9 of Algorithm 1 is replaced by the following two lines:
 9: $v_i^{(t+1)} := \big(g_i^{(t)} + \lambda x_i^{(t)}\big) + \beta v_i^{(t)}$   (local momentum with weight decay)
10: $x_i^{(t+1/2)} := x_i^{(t)} - \eta\, v_i^{(t+1)}$   (stochastic gradient update)
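A minimal sketch of the replaced local update (lines 9–10 above), with illustrative default values for η, β and λ:

```python
import numpy as np

def momentum_step(x_i, g_i, v_i, eta=0.1, beta=0.9, lam=1e-4):
    """Lines 9-10 of Algorithm 2: heavy-ball momentum with weight decay,
    applied to the local iterate of worker i (no dampening)."""
    v_next = (g_i + lam * x_i) + beta * v_i   # line 9: momentum buffer update
    x_half = x_i - eta * v_next               # line 10: gradient step with the buffer
    return x_half, v_next
```

This local step matches what torch.optim.SGD with momentum=0.9, dampening=0 and weight_decay=λ computes, which is how the experiments below configure momentum.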
Setup. In order to match the setting in (Tang et al., 2018) for our first set of experiments, we use a ring topology with n = 8 nodes and train the ResNet20 architecture (He et al., 2016) on the Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012). We randomly split the training data between workers and shuffle it after every epoch, following standard procedure as e.g. in (Goyal et al., 2017). We implement DCD and ECD with momentum (Tang et al., 2018), DeepSqueeze with momentum (Tang et al., 2019), CHOCO-SGD with momentum (Algorithm 2) and standard (all-reduce) mini-batch SGD with momentum and without compression (Dekel et al., 2012). Our implementations are open-source and available at https://github.com/epfml/ChocoSGD. The momentum factor is set to 0.9 without dampening. For all algorithms we fine-tune the initial learning rate and gradually warm it up from a relatively small value (0.1) (Goyal et al., 2017) for the first 5 epochs. The learning rate is decayed by a factor of 10 twice, at epochs 150 and 225, and training is stopped at 300 epochs. For CHOCO-SGD and DeepSqueeze the consensus learning rate γ is also tuned. The detailed hyper-parameter tuning procedure is described in Appendix F. Every compression scheme is applied to every layer of ResNet20 separately. We evaluate the top-1 test accuracy on every node separately over the whole dataset and report the average performance over all nodes.

Compression Schemes. We implement two unbiased compression schemes: (i) gsgd_b quantization, which randomly rounds the weights to b-bit representations (Alistarh et al., 2017), and (ii) random_a sparsification, which preserves a randomly chosen a-fraction of the weights and sets the other ones to zero (Wangni et al., 2018). Further, we implement two biased compression schemes: (iii) top_a, which selects the a-fraction of weights with the largest magnitude and sets the other ones to zero (Alistarh et al., 2018; Stich et al., 2018), and (iv) sign compression, which compresses each weight to its sign, scaled by the norm of the full vector (Bernstein et al., 2018; Karimireddy et al., 2019). We refer to Appendix C for exact definitions of the schemes. DCD and ECD have been analyzed only for unbiased quantization schemes, thus the combination with the two biased schemes is not supported by theory. Conversely, CHOCO-SGD and DeepSqueeze have only been studied for (possibly biased) schemes according to Definition 3.1. However, both unbiased compression schemes can be scaled down in order to meet this specification (cf. discussions in (Stich et al., 2018; Koloskova et al., 2019)), and we adopt this for the experiments.

Results. The results are summarized in Tab. 1. For unbiased compression schemes, ECD and DCD only achieve good performance when the compression ratio is small, and sometimes even diverge when the compression ratio is high. This is consistent¹ with the theoretical and experimental results in (Tang et al., 2018). We further observe that the performance of DCD with the biased top_a sparsification is much better than with the unbiased random_a counterpart, though this operator is not yet supported by theory. CHOCO-SGD generalizes reasonably well in all scenarios (at most 1.65% accuracy drop) for a fixed training budget. The sign compression achieves state-of-the-art accuracy and requires approximately 32x fewer bits per weight than the full-precision baseline.

¹ Tang et al. (2018) only consider absolute bounds on the quantization error. Such bounds might be restrictive (i.e. allowing only for low compression) when the input vectors are unbounded. This might be the reason for the instabilities observed here and also in (Tang et al., 2018, Fig. 4), (Koloskova et al., 2019, Figs. 5 and 6).
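For reference, a minimal NumPy sketch of the four operators as described above (exact definitions in Appendix C); parameter defaults are illustrative, and the unbiased variants can be scaled down to satisfy Definition 3.1 as discussed.

```python
import numpy as np

rng = np.random.default_rng(0)

def sign_compress(x):
    """Sign compression: element-wise sign scaled by ||x||_1 / d (Bernstein et al., 2018)."""
    return np.sign(x) * np.abs(x).sum() / x.size

def top_a(x, a=0.1):
    """top_a: keep the a-fraction of entries with largest magnitude (delta = a)."""
    k = max(1, int(a * x.size))
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def random_a(x, a=0.1, unbiased=False):
    """random_a: keep a random a-fraction of entries; the unbiased variant
    rescales by 1/a (Wangni et al., 2018), the biased one does not (delta = a)."""
    k = max(1, int(a * x.size))
    out = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    out[idx] = x[idx] / a if unbiased else x[idx]
    return out

def gsgd_b(x, b=4):
    """gsgd_b: unbiased random dithering onto 2^(b-1) levels (Alistarh et al., 2017)."""
    norm = np.linalg.norm(x)
    if norm == 0:
        return x
    levels = 2 ** (b - 1)
    u = rng.random(x.shape)                        # random dithering
    q = np.floor(levels * np.abs(x) / norm + u)    # integer level per coordinate
    return np.sign(x) * norm * q / levels

# Quick empirical check of Definition 3.1 for top_a with a = 0.1:
x = rng.standard_normal(1000)
assert np.sum((top_a(x) - x) ** 2) <= (1 - 0.1) * np.sum(x ** 2)
```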
Table 1: Top-1 test accuracy for decentralized DCD, ECD, DeepSqueeze and CHOCO-SGD with different compression schemes. Reported top-1 test accuracies are averaged over three runs with fine-tuned hyperparameters (learning rate, weight decay, consensus stepsize). The fine-tuned all-reduce baseline reaches accuracy 92.64, with 1.04 MB gradient transmission per iteration. (* indicates that 2 out of 3 runs diverged.)

| Algorithm | Error feedback | QSGD 16 bits | QSGD 8 bits | QSGD 4 bits | QSGD 2 bits | random 50% | random 10% | random 1% |
|---|---|---|---|---|---|---|---|---|
| transmitted data/iteration | | 0.52 MB | 0.26 MB | 0.13 MB | 0.065 MB | 1.04 MB | 0.21 MB | 0.031 MB |
| DCD-PSGD | ✗ | 92.51 ± 0.05 | 92.36 ± 0.28 | 23.56 ± 2.97 | diverges | 92.05 ± 0.25 | diverges | diverges |
| ECD-PSGD | ✗ | 92.02 ± 0.14 | 59.11 ± 1.57 | diverges | diverges | diverges | diverges | diverges |
| DeepSqueeze | ✓ | 92.27 ± 0.21 | 91.83 ± 0.35 | 91.47 ± 0.21 | 90.96 ± 0.19 | 91.46 ± 0.09 | 90.96 ± 0.16 | 88.55 ± 0.11 |
| CHOCO-SGD | ✓ | 92.34 ± 0.19 | 92.30 ± 0.08 | 91.92 ± 0.27 | 91.41 ± 0.11 | 92.54 ± 0.26 | 91.87 ± 0.21 | 91.32 ± 0.17 |

| Algorithm | Error feedback | top 50% | top 10% | top 1% | Sign+Norm |
|---|---|---|---|---|---|
| transmitted data/iteration | | 1.04 MB | 0.21 MB | 0.031 MB | 0.032 MB |
| DCD-PSGD | ✗ | 92.40 ± 0.11 | 91.97 ± 0.14 | 89.79 ± 0.40 | 92.40 ± 0.14 |
| ECD-PSGD | ✗ | 17.03* | 16.78* | 18.03* | diverges |
| DeepSqueeze | ✓ | 91.55 ± 0.28 | 91.31 ± 0.25 | 90.47 ± 0.17 | 91.38 ± 0.19 |
| CHOCO-SGD | ✓ | 92.54 ± 0.26 | 92.29 ± 0.05 | 91.73 ± 0.11 | 92.46 ± 0.10 |

6 USE CASE I: ON-DEVICE PEER-TO-PEER LEARNING

We now shift our focus to challenging real-world scenarios which are intrinsically decentralized, i.e. each part of the training data remains local to each device, and thus centralized methods either fail or are inefficient to implement. Typical scenarios comprise e.g. sensor networks, mobile devices, or hospitals which jointly train a machine learning model. Common to these applications is that i) each device has access only to locally stored or acquired data, ii) communication bandwidth is limited (either physically, or artificially for e.g. metered connections), iii) the global network topology is typically unknown to a single device, and iv) the number of connected devices is typically large. Additionally, this fully decentralized setting is strongly motivated by privacy aspects, enabling the training data to be kept private on each device at all times.

Modeling. To simulate this scenario, we permanently split the training data between the nodes, i.e. the data is never shuffled between workers during training, and every node has a distinct part of the dataset (a minimal sketch of this split is given after this subsection's setup). To the best of our knowledge, no prior work studied this scenario for decentralized deep learning. For the centralized approach, gathering methods such as all-reduce are not efficiently implementable in this setting, hence we compare to the centralized baseline where all nodes route their updates to a central coordinator for aggregation. For the comparison we consider CHOCO-SGD with sign compression (this combination achieved the best compromise between accuracy and compression level in Tab. 1), decentralized SGD without compression (Lian et al., 2017), and centralized SGD without compression.

Scaling to Large Number of Nodes. To study the scaling properties of CHOCO-SGD, we train on 4, 16, 36 and 64 nodes. We compare the decentralized algorithms on two different topologies: the ring, as the worst possible topology, and the torus, which has a much larger spectral gap. Their parameters are listed in Table 2.

Table 2: Summary of communication topologies.

| Topology | max. node degree | ρ (n = 4) | ρ (n = 16) | ρ (n = 36) | ρ (n = 64) |
|---|---|---|---|---|---|
| ring | 2 | 0.67 | 0.05 | 0.01 | 0.003 |
| torus | 4 | 0.67 | 0.4 | 0.2 | 0.12 |
| fully-connected | n − 1 | 1 | 1 | 1 | 1 |

We train ResNet8 (He et al., 2016) (78K parameters) on the Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012). For simplicity, we keep the learning rate constant and tune it separately for all methods. We tune the consensus learning rate for CHOCO-SGD.
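The permanent, non-shuffled data split described in the Modeling paragraph above can be set up as in the following minimal sketch (illustrative only; shard sizes and the random seed are assumptions), in contrast to the per-epoch reshuffling used in Section 5.

```python
import numpy as np

def fixed_partition(num_samples, n_nodes, seed=0):
    """Permanently split sample indices across n_nodes: each node keeps its
    distinct shard for the entire training run and never exchanges data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_samples)
    return np.array_split(idx, n_nodes)

shards = fixed_partition(50_000, 32)   # e.g. Cifar10 over 32 workers
local_indices = shards[7]              # shard owned by worker 7 for all epochs
```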
The results are summarized in Fig. 1 (and Fig. 6, Tabs. 7–8 in Appendix G). First we compare the testing accuracy reached after 300 epochs (Fig. 1, Left). Centralized SGD performs well for all considered numbers of nodes. CHOCO-SGD slows down due to the influence of the graph topology (visible already in the exact Decentralized SGD curves), consistent with the ordering of the spectral gaps (see Tab. 2), and is additionally influenced by the communication compression (CHOCO curves), which slows down training uniformly for both topologies. We observed that the train performance is similar to the test performance in Fig. 1; the performance degradation is therefore explained by the slower convergence (Theorem 4.1) and is not a generalization issue. Increasing the number of epochs improves the performance of the decentralized schemes. However, even using 10 times more epochs, we were not able to fully close the gap between centralized and decentralized algorithms, for both train and test performance.

Figure 1: Scaling of CHOCO-SGD with sign compression to a large number of devices on the Cifar10 dataset. Left: best testing accuracy of the algorithms reached after a fixed budget of 300 epochs. Right: best testing accuracy reached after a fixed communication budget of 1000 MB.

In the real decentralized scenario, the goal is not to minimize the number of epochs but the amount of communication, in order to reduce the cost of the users' mobile data. We therefore fix the amount of transmitted data to 1000 MB and compare the best testing accuracy reached (Fig. 1, Right). CHOCO-SGD performs best, with only a slight degradation as the number of nodes increases. It is beneficial to use the torus topology when the number of nodes is large because of its good mixing properties; for small networks there is not much difference between the two topologies, as the benefit of the larger spectral gap is canceled by the increased communication caused by the larger node degree of the torus. Both decentralized and centralized SGD require significantly more bits to reach a reasonable accuracy.

Experiments on a Real Social Network Graph. We simulate training models on user devices (e.g. mobile phones) connected by a real social network. We chose the Davis Southern Women social network (Davis et al., 1941) with 32 nodes. We train a ResNet20 model (0.27 million parameters) on the Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012) for image classification, and a three-layer LSTM architecture (Hochreiter & Schmidhuber, 1997) (28.95 million parameters) for a language modeling task on WikiText-2 (600 training and 60 validation articles with a total of 2 088 628 and 217 646 tokens, respectively) (Merity et al., 2016). The depicted curves of the training loss are the averaged local loss over all workers (local model with fixed local data); the test performance is the mean of the evaluations of the local models on the whole test dataset. For the detailed experimental setup we refer to Appendix F.
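The per-gradient communication volumes reported below (cf. Tab. 3) follow directly from the model sizes; a minimal sketch of this accounting, assuming 32-bit floats for full precision and the d + 32 bits of the sign scheme (Appendix C), counted once per model for simplicity rather than once per tensor:

```python
def megabytes(bits):
    return bits / 8 / 2**20

def full_precision_mb(num_params):
    """32-bit float per parameter."""
    return megabytes(32 * num_params)

def sign_norm_mb(num_params):
    """Sign compression: 1 bit per parameter + 32 bits for the scaling norm."""
    return megabytes(num_params + 32)

for name, d in [("ResNet-20", 0.27e6), ("LSTM", 28.95e6)]:
    print(f"{name}: {full_precision_mb(d):6.2f} MB full precision, "
          f"{sign_norm_mb(d):5.2f} MB with sign compression")
# ResNet-20: roughly 1.04 MB vs 0.032 MB; LSTM: roughly 110.4 MB vs 3.45 MB
```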
The results are summarized in Figs. 2 and 3 and in Tab. 3. For the image classification task, when comparing the training accuracy reached after the same number of epochs, we observe that the decentralized algorithm performs best, followed by the centralized one and lastly the quantized decentralized algorithm. However, the test accuracy is highest for the centralized scheme. When comparing the test accuracy reached for the same transmitted data², CHOCO-SGD significantly outperforms the exact decentralized scheme, with the centralized scheme performing worst. We note a slight accuracy drop: after the same number of epochs (but with much less transmitted data), CHOCO-SGD does not reach the same level of test accuracy as the baselines.

² The figure reports the transmitted data on the busiest node, i.e. the max-degree node (degree 14) for the decentralized schemes, and the degree-32 node for the centralized one.

For the language modeling task, both decentralized schemes reach a slightly worse training loss within the epoch budget, while our CHOCO-SGD outperforms centralized SGD in test perplexity. When considering perplexity for a fixed data volume (middle and right subfigures of Fig. 3), CHOCO-SGD performs best, followed by the exact decentralized and the centralized algorithm.

Figure 2: Image classification: ResNet-20 on Cifar10 on the social network topology (training loss vs. epochs, training loss vs. transmitted data, and test top-1 accuracy vs. transmitted data).

Figure 3: Language modeling: LSTM on WikiText-2 on the social network topology (training loss vs. epochs, training loss vs. transmitted data, and test perplexity vs. transmitted data).

Table 3: Summary of performance when training with the same epoch budget (as centralized SGD).

| Algorithm | max. connections/node | ResNet-20 (Fig. 2): data/gradient | ResNet-20: top-1 test acc. | LSTM (Fig. 3): data/gradient | LSTM: test perplexity |
|---|---|---|---|---|---|
| Centralized SGD | 32 | 1.04 MB | 93.00 | 110.43 MB | 89.39 |
| Exact Decentralized SGD | 14 | 1.04 MB | 92.12 | 110.43 MB | 91.38 |
| CHOCO-SGD (Sign+Norm) | 14 | 0.032 MB | 91.80 | 3.45 MB | 86.58 |

In Figure 4 we additionally depict the test accuracy of the averaged model $\bar x^{(t)} = \frac{1}{n}\sum_{i=1}^n x_i^{(t)}$ (left) and the average distance of the local models from the averaged model (right), for CHOCO-SGD on the image classification task. Towards the end of the optimization the local models reach consensus (Figure 4, right), and their individual test performances are the same as the performance of the averaged model. Interestingly, before the stepsize decrease at epoch 225, the local models in general diverge from the averaged model, and the consensus distance shrinks only when the stepsize decreases. The same behavior was also reported in Assran et al. (2019).

Figure 4: Parameter deviations for ResNet20 trained on Cifar10 (using CHOCO-SGD) on the social network topology (32 workers). (Left) Performance of the averaged model compared to the average of the performances of the local models. (Right) Parameter divergence: averaged L2 consensus distance between the local models $x_i$ and the averaged model $\bar x = \frac{1}{n}\sum_{i=1}^n x_i$, i.e. $\frac{1}{n}\sum_{i=1}^n \|x_i - \bar x\|_2^2$.
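The consensus distance shown in Fig. 4 (right) is exactly the quantity defined in the caption; a minimal sketch of how it can be monitored during training (array shapes are an assumption):

```python
import numpy as np

def consensus_distance(local_params):
    """Average squared L2 distance between the local models x_i and their mean,
    (1/n) * sum_i ||x_i - x_bar||_2^2, for local_params of shape (n, d)."""
    x_bar = local_params.mean(axis=0)
    return np.mean(np.sum((local_params - x_bar) ** 2, axis=1))
```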
7 USE CASE II: EFFICIENT LARGE-SCALE TRAINING IN A DATACENTER

Decentralized optimization methods offer a way to address scaling issues even for well-connected devices, such as in a datacenter with fast InfiniBand (100 Gbps) or Ethernet (10 Gbps) connections. Lian et al. (2017) describe scenarios in which decentralized schemes can outperform centralized ones, and recently, Assran et al. (2019) presented impressive speedups for training on 256 GPUs, for the setting where all nodes can access all training data. The main differences of their algorithm to CHOCO-SGD are the asynchronous gossip updates, the time-varying communication topology and, most importantly, exact (uncompressed) communication, making their setup not directly comparable to ours. We note that asynchronous communication and changing topologies for faster mixing are orthogonal to our contribution and could promisingly be combined with it.

Setup. We train ImageNet-1k (1.28M/50K training/validation samples) (Deng et al., 2009) with ResNet-50 (He et al., 2016). We perform our experiments on 8 machines (n1-standard-32 from Google Cloud with the Intel Ivy Bridge CPU platform), where each machine has 4 Tesla P100 GPUs and the machines are interconnected via 10 Gbps Ethernet. Within one machine communication is fast, and we rely on local data parallelism to aggregate the gradients before the inter-machine communication. Between different machines we consider centralized (fully-connected topology) and decentralized (ring topology) communication, with and without compressed communication (sign compression). Several methods, categorized by communication scheme, are evaluated: (i) centralized SGD (full-precision communication), (ii) error-feedback centralized SGD (Karimireddy et al., 2019) with sign-compressed communication, (iii) decentralized SGD (Lian et al., 2017) with parallelized forward pass and gradient communication (full-precision communication), and (iv) CHOCO-SGD with sign-compressed communication. The mini-batch size on each GPU is 128, and we follow the general SGD training scheme in (Goyal et al., 2017) and directly use all their hyperparameters for all evaluated methods. Due to computational resource limitations, we did not heavily tune the consensus stepsize for CHOCO-SGD³.

Figure 5: Large-scale training: ResNet-50 on ImageNet-1k in the datacenter setting (training loss vs. epochs, training loss vs. time, and test top-1 accuracy vs. time). The topology has 8 nodes (each accessing 4 GPUs). We use sign compression for CHOCO-SGD and for centralized SGD with compression. For the centralized SGD baseline without compression, we use all-reduce to aggregate the gradients; we use all-gather for centralized SGD with sign gradient quantization. The benefits of CHOCO-SGD would be further pronounced when scaling to more nodes.

Results. We depict the training loss and top-1 test accuracy in terms of epochs and time in Fig. 5. CHOCO-SGD benefits from its decentralized and parallel structure and takes less time than all-reduce to perform the same number of epochs, while having only a slight 1.5% accuracy loss⁴. In terms of time per epoch, our speedup does not match that of (Assran et al., 2019), as the hardware used and the communication pattern⁵ are very different. Their scheme is orthogonal to our approach and could be integrated for better training efficiency.
Nevertheless, we still demonstrate a time-wise 20% gain over the common all-reduce baseline, on our used commodity hardware cluster. 8 CONCLUSION We propose the use of CHOCO-SGD (and its momentum version) for enabling decentralized deep learning training in bandwidth-constrained environments. We provide theoretical convergence guarantees for the non-convex setting and show that the algorithm enjoys linear speedup in the number of nodes. We empirically study the performance of the algorithm in a variety of settings on the image classification (Image Net-1k, Cifar10) and on the language modeling task (Wiki Text-2). Whilst previous work successfully demonstrated that decentralized methods can be a competitive alternative to centralized training schemes when no communication constraints are present (Lian et al., 2017; Assran et al., 2019), our main contribution is to enable training in strongly communicationrestricted environments, and while respecting the challenging constraint of locality of the training data. We theoretically and practically demonstrate the performance of decentralized schemes for arbitrary high communication compression, and under data-locality, and thus significantly expand the reach of potential applications of fully decentralized deep learning. 3 We estimate the consensus stepsize by running CHOCO-SGD with different values for the first 3 epochs. 4 Centralized SGD with full precision gradients achieved test accuracy of 76.37%, v.s. 76.03% for centralized SGD (with sign compression), v.s. 74.92% for plain decentralized SGD, and vs. 75.15% for CHOCO-SGD (with sign compression). 5 We consider undirected communication, contrary to the directed 1-peer communication (every node sends and receives one message at every iteration) in Assran et al. (2019). Published as a conference paper at ICLR 2020 ACKNOWLEDGEMENTS We acknowledge funding from SNSF grant 200021_175796, as well as a Google Focused Research Award. Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In NIPS - Advances in Neural Information Processing Systems 30, pp. 1709 1720. Curran Associates, Inc., 2017. Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cedric Renggli. The convergence of sparsified gradient methods. In Neur IPS - Advances in Neural Information Processing Systems 31, pp. 5977 5987. Curran Associates, Inc., 2018. Mahmoud Assran, Nicolas Loizou, Nicolas Ballas, and Mike Rabbat. Stochastic gradient push for distributed deep learning. In ICML - Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 344 353. PMLR, 09 15 Jun 2019. Albert S. Berahas, Charikleia Iakovidou, and Ermin Wei. Nested distributed gradient methods with adaptive quantized communication. ar Xiv preprint, pp. ar Xiv:1903.08149, 2019. Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. sign SGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 560 569, Stockholmsmässan, Stockholm Sweden, 10 15 Jul 2018. PMLR. Stephen Boyd, Arpita Ghosh, Balaji Prabhakar, and Devavrat Shah. Randomized gossip algorithms. IEEE/ACM Trans. Netw., 14(SI):2508 2530, June 2006. R. Carli, F. Fagnani, P. Frasca, T. Taylor, and S. Zampieri. Average consensus on networks with transmission noise or quantization. 
In 2007 European Control Conference (ECC), pp. 1852 1857, July 2007. R. Carli, F. Bullo, and S. Zampieri. Quantized average consensus via dynamic coding/decoding schemes. International Journal of Robust and Nonlinear Control, 20:156 175, 2010a. R. Carli, P. Frasca, F. Fagnani, and S. Zampieri. Gossip consensus algorithms via quantized communication. Automatica, 46:70 80, 2010b. A. Davis, B. B. Gardner, and M. R. Gardner. Deep South. University of Chicago Press, Chicago, IL., May 1941. Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. J. Mach. Learn. Res., 13(1):165 202, January 2012. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Image Net: A Large-Scale Hierarchical Image Database. In CVPR09, 2009. A. G. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione. Gossip algorithms for distributed signal processing. Proceedings of the IEEE, 98(11):1847 1864, Nov 2010. Thinh T. Doan, Siva Theja Maguluri, and Justin Romberg. Accelerating the Convergence Rates of Distributed Subgradient Methods with Adaptive Quantization. ar Xiv e-prints, art. ar Xiv:1810.13245, Oct 2018. Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training Image Net in 1 hour. ar Xiv preprint ar Xiv:1706.02677, 2017. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770 778, 2016. Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9:1735 80, 12 1997. Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Peter Richtárik, and Sebastian Urban Stich. Stochastic distributed learning with gradient quantization and variance reduction. ar Xiv preprint, pp. ar Xiv:1904.05115, 2019. Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes Sign SGD and other gradient compression schemes. In ICML - Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 3252 3261. PMLR, 09 15 Jun 2019. Published as a conference paper at ICLR 2020 David Kempe, Alin Dobra, and Johannes Gehrke. Gossip-based computation of aggregate information. In Proceedings of the 44th Annual IEEE Symposium on Foundations of Computer Science, FOCS 03, pp. 482 , Washington, DC, USA, 2003. IEEE Computer Society. Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. Decentralized stochastic optimization and gossip algorithms with compressed communication. In ICML - Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 3478 3487. PMLR, 2019. Alex Krizhevsky. Learning multiple layers of features from tiny images. University of Toronto, 05 2012. Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In NIPS - Advances in Neural Information Processing Systems 30, pp. 5330 5340. Curran Associates, Inc., 2017. Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, and Martin Jaggi. Don t use large mini-batches, use local SGD. In ICLR - International Conference on Learning Representations, 2020. Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. 
In ICLR 2018 - International Conference on Learning Representations, 2018. Brendan Mc Mahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication Efficient Learning of Deep Networks from Decentralized Data. In AISTATS 2017 - Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, pp. 1273 1282, 2017. H. Brendan Mc Mahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. Co RR, abs/1602.05629, 2016. Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. ar Xiv preprint ar Xiv:1609.07843, 2016. Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. ar Xiv preprint, pp. ar Xiv:1708.02182, 2017. Konstantin Mishchenko, Eduard Gorbunov, Martin Takáˇc, and Peter Richtárik. Distributed learning with compressed gradient differences. ar Xiv e-prints, pp. ar Xiv:1901.09269, 2019. A. Nedi c, A. Olshevsky, and M. G. Rabbat. Network topology and communication-computation tradeoffs in decentralized optimization. Proceedings of the IEEE, 106(5):953 976, May 2018. Angelia Nedi c, Alex Olshevsky, Asuman Ozdaglar, and John N. Tsitsiklis. Distributed subgradient methods and quantization effects. In Proceedings of the 47th IEEE Conference on Decision and Control, CDC 2008, pp. 4177 4184, 2008. Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS - Advances in Neural Information Processing Systems 24, pp. 693 701. Curran Associates, Inc., 2011. Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. An exact quantized decentralized gradient descent algorithm. ar Xiv e-prints, pp. ar Xiv:1806.11536, 2018. Amirhossein Reisizadeh, Hossein Taheri, Aryan Mokhtari, Hamed Hassani, and Ramtin Pedarsani. Robust and communication-efficient collaborative learning. ar Xiv e-prints, pp. ar Xiv:1907.10595, 2019. Kevin Scaman, Francis Bach, Sébastien Bubeck, Yin Tat Lee, and Laurent Massoulié. Optimal algorithms for smooth and strongly convex distributed optimization in networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 3027 3036, International Convention Centre, Sydney, Australia, 06 11 Aug 2017. PMLR. Kevin Scaman, Francis Bach, Sebastien Bubeck, Laurent Massoulié, and Yin Tat Lee. Optimal algorithms for non-smooth distributed optimization in networks. In Neur IPS - Advances in Neural Information Processing Systems 31, pp. 2745 2754. Curran Associates, Inc., 2018. Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In INTERSPEECH, pp. 1058 1062. ISCA, 2014. Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for SGD with delayed gradients and compressed communication. ar Xiv preprint, pp. ar Xiv:1909.05350, 2019. Published as a conference paper at ICLR 2020 Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified SGD with memory. In Neur IPS - Advances in Neural Information Processing Systems 31, pp. 4452 4463. 2018. Nikko Strom. Scalable distributed DNN training using commodity GPU cloud computing. In INTERSPEECH, pp. 1488 1492. ISCA, 2015. Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. 
Communication compression for decentralized training. In Neur IPS - Advances in Neural Information Processing Systems 31, pp. 7663 7673. Curran Associates, Inc., 2018. Hanlin Tang, Xiangru Lian, Shuang Qiu, Lei Yuan, Ce Zhang, Tong Zhang, and Ji Liu. Deepsqueeze: Decentralization meets error-compensated compression. ar Xiv preprint, pp. ar Xiv:1907.07346, 2019. Jianyu Wang, Anit Kumar Sahu, Zhouyi Yang, Gauri Joshi, and Soummya Kar. An exact quantized decentralized gradient descent algorithm. ar Xiv e-prints, pp. ar Xiv:1905.09435, 2019. Jianqiao Wangni, Jialei Wang, Ji Liu, and Tong Zhang. Gradient sparsification for communication-efficient distributed optimization. In Neur IPS - Advances in Neural Information Processing Systems 31, pp. 1306 1316. Curran Associates, Inc., 2018. Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In NIPS - Advances in Neural Information Processing Systems 30, pp. 1509 1519. Curran Associates, Inc., 2017. L. Xiao, S. Boyd, and S. Lall. A scheme for robust distributed sensor fusion based on average consensus. In IPSN 2005. Fourth International Symposium on Information Processing in Sensor Networks, 2005., pp. 63 70, April 2005. Lin Xiao and Stephen Boyd. Fast linear iterations for distributed averaging. Systems & Control Letters, 53(1): 65 78, 2004. Hao Yu, Rong Jin, and Sen Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In ICML - Proceedings of the 36th International Conference on Machine Learning, volume 97, pp. 7184 7193. PMLR, 09 15 Jun 2019. Deming Yuan, Shengyuan Xu, Huanyu Zhao, and Lina Rong. Distributed dual averaging method for multi-agent optimization with quantized communication. Systems & Control Letters, 61(11):1053 1061, 2012. Jian Zhang, Christopher De Sa, Ioannis Mitliagkas, and Christopher Ré. Parallel SGD: When does averaging help? arxiv preprint, 2016. Published as a conference paper at ICLR 2020 A CONVERGENCE OF CHOCO-SGD In this section we present the proof of Theorem 4.1. For this, we will first derive a slightly more general statement: in Theorem A.3 we analyze CHOCO-SGD for arbitrary stepsizes η, and then derive Theorem 4.1 as a special case. The structure of the proof follows Koloskova et al. (2019). That is, we first show that Algorithm 1 is a special case of a more general class of algorithms (given in Algorithm 3): Observe that Algorithm 1 consists of two main components: 2 the stochastic gradient update, performed locally on each node, and 1 the (quantized) averaging among the nodes. We can show convergence of all algorithms of this type i.e. stochastic gradient updates 2 followed by an arbitrary averaging step 1 as long as the averaging scheme exhibits linear convergence. For the specific averaging used in CHOCO-SGD, linear convergence has been shown in (Koloskova et al., 2019) and we will use their estimate of the convergence rate of the averaging scheme. A.1 A GENERAL FRAMEWORK FOR DECENTRALIZED SGD WITH ARBITRARY AVERAGING For convenience, we use the following matrix notation in this subsection. X(t) := h x(t) 1 , . . . , x(t) n i Rd n, X (t) := h x(t), . . . , x(t)i Rd n, F(X(t), ξ(t)) := h F1(x(t) 1 , ξ(t) 1 ), . . . , Fn(x(t) n , ξ(t) n ) i Rd n. Decentralized SGD with arbitrary averaging is given in Algorithm 3. Algorithm 3 DECENTRALIZED SGD WITH ARBITRARY AVERAGING SCHEME input: X(0) = x(0), . . . 
, x(0) , stepsize η, averaging function h : Rd n Rd n Rd n Rd n, initialize Y (0) = 0 1: for t in 0 . . . T 1 do {in parallel for all workers i [n]} 2 ) = X(t) η Fi(X(t), ξ(t)) stochastic gradient updates 3: (X(t+1), Y (t+1)) = h(X(t+ 1 2 ), Y (t)) blackbox averaging/gossip Assumption 3. For an averaging scheme h: Rd n Rd n Rd n Rd n let (X+, Y +) := h(X, Y ) for X, Y Rd n. Assume that h preserves the average of iterates: n , X, Y Rd n , (4) and that it converges with linear rate for a parameter 0 < c 1 Eh Ψ(X+, Y +) (1 c)Ψ(X, Y ) , X, Y Rd n , (5) and Laypunov function Ψ(X, Y ) := X X 2 F + X Y 2 F with X := 1 n X11 , where Eh denotes the expectation over internal randomness of averaging scheme h. Example: Exact Averaging. Setting X+ = XW and Y + = X+ gives an exact consensus averaging algorithm with mixing matrix W (Xiao & Boyd, 2004). It converges at the rate c = ρ, where ρ is an eigengap of mixing matrix W, defined in Assumption 1. Substituting it into the Algorithm 3 we recover D-PSGD algorithm, analyzed in Lian et al. (2017). Example: CHOCO-SGD. To recover CHOCO-SGD, we need to choose CHOCOGOSSIP (Koloskova et al., 2019) as consensus averaging scheme, which is defined as X+ = X + γY (W I) and Y + = Y + Q(X+ Y ) (in the main text we write ˆX instead of Y ). This scheme converges with c = ρ2δ 82 . The results from the main part can be recovered by Published as a conference paper at ICLR 2020 substituting this c = ρ2δ 82 in the more general results below. It is important to note that for Algorithm 1 given in the main text, the order of the communication part 1 and the gradient computation part 2 is exchanged. We did this to better illustrate that both these parts are independent and that they can be executed in parallel. The effect of this change can be captured by changing the initial values but does not affect the convergence rate. Remark A.1 (Mini-batch variance). If for functions fi, Fi defined in (1) Assumption 2 holds, i.e. Eξ Fi(x, ξ) fi(x) 2 σ2 i , i [n], then Eξ(t) 1 ,...,ξ(t) n fi(x(t) i ) Fi(x(t) i , ξ(t) i ) Pn i=1 σ2 i n . Proof. This follows from i=1 E Yi 2 + X i =j E Yi, Yj i=1 E Yi 2 1 i=1 σ2 i = σ2 for Yi = fi(x(t) i ) Fi(x(t) i , ξ(t) i ). Expectation of scalar product is equal to zero because ξi is independent of ξj since i = j. Lemma A.2. Under Assumptions 1 3 the iterates of the Algorithm 3 with constant stepsize η satisfy x(t) x(t) i 2 2 η2 12n G2 Proof of Lemma A.2. We start by following the proof of Lemma 21 from Koloskova et al. (2019). Define rt = E X(t) X (t) 2 + E X(t) Y (t) 2, rt+1 (5) (1 c)E X (t+ 1 F + (1 c)E Y (t) X(t+ 1 = (1 c)E X (t) X(t) + η F(X(t), ξ(t)) 11 + (1 c)E Y (t) X(t) + η F(X(t), ξ(t)) 2 F (9) (1 c)(1 + α 1)E X (t) X(t) 2 F + Y (t) X(t) 2 + (1 c)(1 + α)η2E F(X(t), ξ(t)) 11 F + F(X(t), ξ(t)) 2 (1 c) (1 + α 1)E X (t) X(t) 2 F + Y (t) X(t) 2 + 2n(1 + α)η2G2 E X (t) X(t) 2 F + Y (t) X(t) 2 Define A = 3n G2, we got a recursion Verifying that rt η2 4A c2 satisfy recursion completes the proof as E X(t) X (t) 2 rt. Published as a conference paper at ICLR 2020 Indeed, r0 = 0 η2 4A c2 as X(0) = X (0) and Y (0) = 0 Theorem A.3. Under Assumptions 1 3 with constant stepsize η < 1 4L, the averaged iterates x(t) = 1 n Pn i=1 x(t) i of Algorithm 3 satisfy: 2 4 η(T + 1) f(x(0)) f + η 2σ2L n + η2 36G2L2 where c denotes convergence rate of underlying averaging scheme. Proof of Theorem A.3. 
By L-smoothness Et+1 f(x(t+1)) = Et+1 f i=1 Fi(x(t) i , ξ(t) i ) f(x(t)) Et+1 i=1 Fi(x(t) i , ξ(t) i ) | {z } =:T1 i=1 Fi(x(t) i , ξ(t) i ) 2 | {z } =:T2 To estimate the second term, we add and subtract f(x(t)) T1 = η f(x(t)) 2 + η f(x(t)), f(x(t)) 1 i=1 fi(x(t) i ) f(x(t)) 2 + η f(x(t)) fi(x(t) i ) 2 For the last term, we add and subtract f(x(t)) and the sum of fi(x(t) i ) Fi(x(t) i , ξ(t) i ) fi(x(t) i ) i=1 fi(x(t) i ) f(x(t)) (6),(9),(7) σ2 f(x(t)) fi(x(t) i ) 2 2 + 2 f(x(t)) 2 Combining this together and using L-smoothness to estimate f(x(t)) fi(x(t) i ) 2 Et+1 f(x(t+1)) f(x(t)) η 1 2 Lη f(x(t)) 2 2ηL2 + η2L3 1 x(t) x(t) i 2 Using Lemma A.2 to bound the third term and using that η 1 4L in the second and in the third terms Et+1 f(x(t+1)) f(x(t)) η 2 + η3 9L2G2 c2 + η2 Lσ2 Rearranging terms and averaging over t Ef(x(t)) Ef(x(t+1)) + η2 36G2L2 c2 + η 2Lσ2 f(x(0)) f + η 2σ2L n + η2 36G2L2 Published as a conference paper at ICLR 2020 Corollary A.4. Under Assumptions 1 3 with constant stepsize η = q n T +1 for T 16n L2, the averaged iterates x(t) = 1 n Pn i=1 x(t) i of Algorithm 3 satisfy: 2 4 f(x(0)) f + 2σ2L p n(T + 1) + 36G2n L2 where c denotes convergence rate of underlying averaging scheme. The first term shows a linear speed up compared to SGD on one node, whereas the underlying averaging scheme affects only the second-order term. Substituting the convergence rate for exact averaging with W gives the rate O(1/ n T + n/(T ρ2)). CHOCO-SGD with the underlying CHOCO-GOSSIP averaging scheme converges at the rate O(1/ n T + n/(T ρ4δ2)). The dependence on ρ (eigengap of the mixing matrix W) is worse than in the exact case. This might either just be an artifact of our proof technique or a consequence of supporting arbitrary high compression. A.3 CONVERGENCE FOR ARBITRARY T The previous result holds only for T larger than 16n L2. This is not necessary and can be relaxed by carefully tuning the stepsize. Lemma A.5. For any parameters r0 0, b 0, e 0, d 0 there exists constant stepsize η 1 d such that ΨT := r0 η(T + 1) + bη + eη2 2 br0 2 + 2e1/3 r0 T + 1 3 + dr0 T + 1 Proof. Choosing η = min r0 b(T +1) 1 2 , r0 e(T +1) 1 d we have three cases d and is smaller than both r0 b(T +1) 1 2 and r0 e(T +1) 1 ΨT dr0 T + 1 + b 2 + dr0 T + 1 + e1/3 r0 T + 1 η = r0 b(T +1) 1 2 < r0 e(T +1) 1 2 + e r0 b(T + 1) 2 + e 1 3 r0 (T + 1) The last case, η = r0 e(T +1) 1 3 < r0 b(T +1) 1 ΨT 2e 1 3 r0 (T + 1) 3 + b r0 e(T + 1) 3 2e 1 3 r0 (T + 1) Corollary A.6 (Generalized Theorem 4.1). Under Assumptions 1 3 with constant stepsize η tuned as in Lemma A.5, the averaged iterates x(t) = 1 n Pn i=1 x(t) i of Algorithm 3 satisfy: n(T + 1) + 17 GLF0 where c denotes convergence rate of underlying averaging scheme, F0 = f(x(0)) f . Published as a conference paper at ICLR 2020 Proof. The result follows from Theorem A.3 and Lemma A.5 with r0 = 4 f(x(0)) f , b = n , e = 36G2L2 c2 and d = 4L. The corollary gives guarantees for the averaged vector of parameters x, however in a decentralized setting it is very expensive and sometimes impossible to average all the parameters distributed across several machines, especially when the number of machines and the model size is large. We can get similar guarantees on the individual iterates xi as e.g. in (Assran et al., 2019). We summarize these briefly below. Corollary A.7 (Convergence of local weights). Under the same setting as in Corollary A.6, n(T + 1) + 37 GLF0 Proof of Corollary A.7. 
f(x(t) i ) 2 2 f(x(t) i ) f(x(t)) 2 2 + 2 f(x(t)) 2 2L2 x(t) i x(t) 2 2 + 2 f(x(t)) 2 where we used L-smoothness of f. Using Theorem A.3 and tuning the stepsize as in Lemma A.5 we get the statement of the corollary. B USEFUL INEQUALITIES Lemma B.1. For arbitrary set of n vectors {ai}n i=1, ai Rd i=1 ai 2 . (7) Lemma B.2. For given two vectors a, b Rd 2 a, b γ a 2 + γ 1 b 2 , γ > 0 . (8) Lemma B.3. For given two vectors a, b Rd a + b 2 (1 + α) a 2 + (1 + α 1) b 2 , α > 0 . (9) This inequality also holds for the sum of two matrices A, B Rn d in Frobenius norm. C COMPRESSION SCHEMES We implement the compression schemes detailed below. gsgdb (Alistarh et al., 2017). The unbiased gsgdb : Rd Rd compression operator (for b > 1) is given as gsgdb(x) := x 2 sig(x) 2 (b 1) 2(b 1) |x| where u u.a.r. [0, 1]d is a random dithering vector and sig(x) assigns the element-wise sign: (sig(x))i = 1 if (x)i 0 and (sig(x))i = 1 if (x)i < 0. As the value in the right bracket will be rounded to an integer in {0, . . . , 2(b 1) 1}, each coordinate can be encoded with at most (b 1)+1 bits (1 for the sign). For more efficent encoding schemes cf. Alistarh et al. (2017). Published as a conference paper at ICLR 2020 A biased version is given as gsgdb(x) := x 2 τ sig(x) 2 (b 1) 2(b 1) |x| for τ = 1 + min n d 22(b 1) , d 2(b 1) o and is a δ = 1 τ compression operator (Koloskova et al., 2019). randoma (Wangni et al., 2018). Let u {0, 1}d be a masking vector, sampled uniformly at random from the set {u {0, 1}d : u 1 = ad }. Then the unbiased randoma : Rd Rd operator is defined as randoma(x) := d ad x u . The biased version is given as randoma(x) := x u , and is a δ = a compression operator (Stich et al., 2018). Only 32 ad bits are required to send randoma(x) to another node all the values of non-zero entries (we assume that entries are represented as float32 numbers). Receiver can recover positions of these entries if it knows the random seed of uniform sampling operator used to select these entries. This random seed could be communicated once on preprocessing stage (before starting the algorithm). topa (Alistarh et al., 2018; Stich et al., 2018). The biased topa : Rd Rd operator is defined as topa(x) := x u(x) , where u(x) {0, 1}d, u 1 = ad is a masking vector with (u)i = 1 for indices i π 1({1, . . . , ad }) where the permutation π is such that (x)π(1) (x)π(2) (x)π(d) . The topa operator is a δ = a compression operator (Stich et al., 2018). In the case of topa compression 2 32 ad bits are required because along with the values we need to send positions of these values. sign (Bernstein et al., 2018; Karimireddy et al., 2019). The biased (scaled) sign: Rd R compression operator is defined as sign(x) := x 1 The sign operator is a δ = x 2 1 d x 2 2 compression operator (Karimireddy et al., 2019). In total for the sign compression we need to send only d + 32 bits one bit for every entry in x and 32 bits for x 1. D CHOCO-SGD WITH MOMENTUM Algorithm 2 demonstrates how to combine CHOCO-SGD with weight decay and momentum. Nesterov momentum can be analogously adapted for our decentralized setting. E ERROR FEEDBACK INTERPRETATION OF CHOCO-SGD To better understand how does CHOCO-SGD work, we can interpret it as an error feedback algorithm (Stich et al., 2018; Karimireddy et al., 2019; Stich & Karimireddy, 2019). We can equivalently rewrite CHOCO-SGD (Algorithm 1) as Algorithm 4. 
D CHOCO-SGD WITH MOMENTUM
Algorithm 2 demonstrates how to combine CHOCO-SGD with weight decay and momentum. Nesterov momentum can be adapted analogously to our decentralized setting.

E ERROR FEEDBACK INTERPRETATION OF CHOCO-SGD
To better understand how CHOCO-SGD works, we can interpret it as an error feedback algorithm (Stich et al., 2018; Karimireddy et al., 2019; Stich & Karimireddy, 2019). We can equivalently rewrite CHOCO-SGD (Algorithm 1) as Algorithm 4.
The common feature of error feedback algorithms is that quantization errors are saved in an internal memory, which is added to the compressed value at the next iteration. In CHOCO-SGD, the value we want to transmit is the difference $x_i^{(t)} - x_i^{(t-1)}$, which represents the evolution of the local variable $x_i$ at step $t$. Before compressing this value on line 4, the internal memory is added on line 3 to correct for previous errors. Then, on line 5, the internal memory is updated. Note that $m_i^{(t)} = x_i^{(t-1)} - \hat x_i^{(t)}$ in the old notation.

Algorithm 4 CHOCO-SGD (Koloskova et al., 2019) as Error Feedback
input: initial values $x_i^{(0)} \in \mathbb{R}^d$ on each node $i \in [n]$, consensus stepsize $\gamma$, SGD stepsize $\eta$, communication graph $G = ([n], E)$ and mixing matrix $W$; initialize $\hat x_i^{(0)} = x_i^{(-1)} := 0$ for all $i \in [n]$
1: for $t$ in $0 \dots T-1$ do {in parallel for all workers $i \in [n]$}
2:   $x_i^{(t)} := x_i^{(t-\frac12)} + \gamma \sum_{j:\{i,j\} \in E} w_{ij}\big(\hat x_j^{(t)} - \hat x_i^{(t)}\big)$   ▷ modified gossip averaging
3:   $v_i^{(t)} := x_i^{(t)} - x_i^{(t-1)} + m_i^{(t)}$
4:   $q_i^{(t)} := Q(v_i^{(t)})$   ▷ compression
5:   $m_i^{(t+1)} := v_i^{(t)} - q_i^{(t)}$   ▷ memory update
6:   for neighbors $j : \{i,j\} \in E$ (including $\{i\} \in E$) do
7:     Send $q_i^{(t)}$ and receive $q_j^{(t)}$   ▷ communication
8:     $\hat x_j^{(t+1)} := q_j^{(t)} + \hat x_j^{(t)}$   ▷ local update
9:   end for
10:  Sample $\xi_i^{(t)}$, compute gradient $g_i^{(t)} := \nabla F_i(x_i^{(t)}, \xi_i^{(t)})$
11:  $x_i^{(t+\frac12)} := x_i^{(t)} - \eta g_i^{(t)}$   ▷ stochastic gradient update
12: end for

F DETAILED EXPERIMENTAL SETUP AND TUNED HYPERPARAMETERS
In this section we detail the model training procedure as well as the hyperparameter tuning.
Social Network Setup. For the comparison we consider CHOCO-SGD with sign compression (the combination that achieved the best compromise between accuracy and compression level in Table 1), decentralized SGD without compression, and centralized SGD without compression. We train two models: first, ResNet20 (He et al., 2016) (0.27 million parameters) for image classification on the Cifar10 dataset (50K/10K training/test samples) (Krizhevsky, 2012), and second, a three-layer LSTM architecture (Hochreiter & Schmidhuber, 1997) (28.95 million parameters) for a language modeling task on WikiText-2 (600 training and 60 validation articles with a total of 2,088,628 and 217,646 tokens respectively) (Merity et al., 2016). For the language modeling task, we borrowed and adapted the general experimental setup of Merity et al. (2017), using a three-layer LSTM with a hidden dimension of size 650. The loss is averaged over all examples and timesteps. The BPTT length is set to 30. We fine-tune the value of gradient clipping (0.4), and dropout (0.4) is only applied to the output of the LSTM. We train both ResNet20 and the LSTM for 300 epochs, unless mentioned otherwise. The per-node mini-batch size is 32 for both datasets. Momentum (with factor 0.9) is only applied for the ResNet20 training.
Social Network and Datacenter Details. For all algorithms, we gradually warm up (Goyal et al., 2017) the learning rate from a relatively small value (0.1) to the fine-tuned initial learning rate over the first 5 training epochs. During training, the tuned initial learning rate is decayed by a factor of 10 upon reaching 50% and 75% of the total training epochs. The learning rate is tuned by finding the optimal initial learning rate (after the scaling). The optimal $\hat\eta$ is searched on a pre-defined grid, and we ensure that the best performance is attained in the interior of the grid: if the best performance was at one of the extremes of the grid, we would add new grid points. The same search logic applies to the consensus stepsize.
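As a concrete illustration of the learning-rate schedule described above, the following is a minimal sketch. It assumes a linear warmup ramp (one common reading of "gradually warm up") and applies the factor-10 decays cumulatively; the function name and epoch-based interface are ours, not taken from the paper's code.

```python
def learning_rate(epoch, base_lr, total_epochs, warmup_epochs=5, warmup_from=0.1):
    """Warmup + step decay: ramp the LR from `warmup_from` to the tuned
    `base_lr` over the first `warmup_epochs` epochs, then divide it by 10
    upon reaching 50% and again at 75% of `total_epochs`."""
    if epoch < warmup_epochs:
        return warmup_from + (base_lr - warmup_from) * epoch / warmup_epochs
    lr = base_lr
    if epoch >= 0.5 * total_epochs:
        lr /= 10.0
    if epoch >= 0.75 * total_epochs:
        lr /= 10.0
    return lr

# Example with the tuned values from Table 5 (ResNet-20, Cifar10, Sign+Norm):
# base learning rate 1.0 over 300 epochs.
for e in (0, 4, 5, 150, 225, 299):
    print(e, learning_rate(e, base_lr=1.0, total_epochs=300))
```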
Table 4 reports the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20 on Cifar10, while Table 6 reports the fine-tuned hyperparameters of our baselines. Table 5 reports the fine-tuned hyperparameters of CHOCO-SGD for training ResNet-20/LSTM on the social network topology. We estimate the runtime information (depicted in Figure 5) of the different methods from three trials of the evaluation on Google Cloud (Kubernetes Engine). More precisely, we create the cluster on Google Cloud three times, and each time we estimate the time per mini-batch of the different methods (over the first two training epochs).

Table 4: Tuned hyperparameters of CHOCO-SGD for training ResNet-20 on Cifar10, corresponding to the ring topology with 8 nodes in Table 1. We randomly split the training data between nodes and shuffle it after every epoch. The per-node mini-batch size is 128 and the degree of each node is 3.

Compression scheme            Learning rate    Consensus stepsize
QSGD (16-bit)                 1.60             0.2
QSGD (8-bit)                  0.96             0.2
QSGD (4-bit)                  1.60             0.075
QSGD (2-bit)                  0.96             0.025
Sparsification (random-50%)   2.40             0.45
Sparsification (random-10%)   1.20             0.075
Sparsification (random-1%)    0.48             0.00625
Sparsification (top-50%)      1.60             0.45
Sparsification (top-10%)      1.60             0.15
Sparsification (top-1%)       1.20             0.0375
Sign+Norm                     1.60             0.45

Table 5: Tuned hyperparameters of CHOCO-SGD, corresponding to the social network topology with 32 nodes in Table 3. We randomly split the training data between the nodes and keep this partition fixed during the entire training (no shuffling). The per-node mini-batch size is 32 and the maximum node degree is 14.

Configuration                      Learning rate    Consensus stepsize
ResNet-20, Cifar10, Sign+Norm      1.0              0.5
LSTM, WikiText-2, Sign+Norm        25               0.6

Table 6: Tuned hyperparameters of DCD, ECD, and DeepSqueeze for training ResNet-20 on Cifar10, corresponding to the ring topology with 8 nodes in Table 1. We randomly split the training data between nodes and shuffle it after every epoch. The per-node mini-batch size is 128 and the degree of each node is 3. We only report the hyperparameters corresponding to results that reach reasonable performance in our experiments.

Compression scheme                          Learning rate    Consensus stepsize
DCD, QSGD (16-bit)                          2.40             -
DCD, QSGD (8-bit)                           1.20             -
DCD, Sparsification (random-50%)            0.80             -
DCD, Sparsification (top-50%)               1.20             -
DCD, Sparsification (top-10%)               1.60             -
DCD, Sparsification (top-1%)                2.40             -
ECD, QSGD (16-bit)                          0.96             -
ECD, QSGD (8-bit)                           1.20             -
DeepSqueeze, QSGD (4-bit)                   0.60             0.01
DeepSqueeze, QSGD (2-bit)                   0.80             0.005
DeepSqueeze, Sparsification (top-50%)       0.80             0.05
DeepSqueeze, Sparsification (top-10%)       0.60             0.01
DeepSqueeze, Sparsification (top-1%)        0.40             0.005
DeepSqueeze, Sparsification (random-1%)     0.80             0.0005
DeepSqueeze, Sign+Norm                      0.48             0.01

G ADDITIONAL PLOTS
To complement our results on scaling to a large number of nodes, we additionally depict the learning curves (e.g. test accuracy) for training on 64 nodes. We also mark the levels used for Fig. 1.
[Figure: two panels showing test top-1 accuracy versus epochs and versus data transmitted (MB) for CHOCO (Torus), CHOCO (Ring), Decentralized SGD (Torus), Decentralized SGD (Ring), and Centralized SGD.]
Figure 6: Scaling of CHOCO-SGD with sign compression to a large number of devices on the Cifar10 dataset. Convergence curves for 64 nodes. Vertical lines correspond to the epoch/bits budget used in Fig. 1.
Table 7: The exact epoch reached for the same bits budget in Fig. 1.

                        n = 4     n = 16    n = 36    n = 64
Centralized             5         6         6         6
Decentralized (Ring)    7         17        32        54
Decentralized (Torus)   6         10        18        29
CHOCO (Ring)            105       408       904       1588
CHOCO (Torus)           55        206       454       796

Table 8: The exact amount of data transmitted (in MB) for the same epoch budget in Fig. 1.

                        n = 4     n = 16    n = 36    n = 64
Centralized             139683    140041    144299    142899
Decentralized (Ring)    69841     17505     8016      4554
Decentralized (Torus)   139683    35010     16033     9109
CHOCO (Ring)            2208      564       253       144
CHOCO (Torus)           4417      1129      506       288

We additionally visualize the learning curves for the social network topology in Fig. 7 and Fig. 8.
[Figure: three panels comparing Centralized SGD, Decentralized SGD, and CHOCO (Sign+Norm): (a) training top-1 accuracy vs. epochs, (b) training top-1 accuracy vs. data transmitted (MB), (c) test top-1 accuracy vs. epochs.]
Figure 7: Training ResNet-20 on Cifar10 with the decentralized algorithms on a real-world social network topology. The topology has 32 nodes and we assume each node can only access a disjoint subset of the whole dataset. The local mini-batch size is 32.
We additionally provide the learning curves of training top-1/top-5 accuracy and test top-5 accuracy for the datacenter experiment in Fig. 9.
[Figure: two panels comparing Centralized SGD, Decentralized SGD, and CHOCO (Sign+Norm): (a) test loss vs. epochs, (b) test perplexity vs. data transmitted (MB).]
Figure 8: Training the LSTM on WikiText-2 with the decentralized algorithms on a real-world social network topology. The topology has 32 nodes and we assume each node can only access a disjoint subset of the whole dataset. The local mini-batch size is 32.
[Figure: three panels comparing Centralized SGD, Centralized SGD (Sign+Norm), Decentralized SGD, and CHOCO (Sign+Norm): (a) training top-1 accuracy, (b) training top-5 accuracy, (c) test top-1 accuracy, each vs. epochs.]
Figure 9: Large-scale training: ResNet-50 on ImageNet in the datacenter.