Deep learning with Elastic Averaging SGD

Sixin Zhang (Courant Institute, NYU, zsx@cims.nyu.edu), Anna Choromanska (Courant Institute, NYU, achoroma@cims.nyu.edu), Yann LeCun (Center for Data Science, NYU & Facebook AI Research, yann@cims.nyu.edu)

Abstract

We study the problem of stochastic optimization for deep learning in the parallel computing environment under communication constraints. A new algorithm is proposed in this setting where the communication and coordination of work among concurrent processes (local workers) is based on an elastic force which links the parameters they compute with a center variable stored by the parameter server (master). The algorithm enables the local workers to perform more exploration, i.e. the algorithm allows the local variables to fluctuate further from the center variable by reducing the amount of communication between the local workers and the master. We empirically demonstrate that in the deep learning setting, due to the existence of many local optima, allowing more exploration can lead to improved performance. We propose synchronous and asynchronous variants of the new algorithm. We provide the stability analysis of the asynchronous variant in the round-robin scheme and compare it with the more common parallelized method ADMM. We show that the stability of EASGD is guaranteed when a simple stability condition is satisfied, which is not the case for ADMM. We additionally propose a momentum-based version of our algorithm that can be applied in both synchronous and asynchronous settings. The asynchronous variant of the algorithm is applied to train convolutional neural networks for image classification on the CIFAR and ImageNet datasets. Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient.

1 Introduction

One of the most challenging problems in large-scale machine learning is how to parallelize the training of large models that use a form of stochastic gradient descent (SGD) [1]. There have been attempts to parallelize SGD-based training for large-scale deep learning models on a large number of CPUs, including Google's DistBelief system [2]. But practical image recognition systems consist of large-scale convolutional neural networks trained on a few GPU cards sitting in a single computer [3, 4]. The main challenge is to devise parallel SGD algorithms to train large-scale deep learning models that yield a significant speedup when run on multiple GPU cards.

In this paper we introduce the Elastic Averaging SGD method (EASGD) and its variants. EASGD is motivated by the quadratic penalty method [5], but is re-interpreted as a parallelized extension of the averaging SGD algorithm [6]. The basic idea is to let each worker maintain its own local parameter, while the communication and coordination of work among the local workers is based on an elastic force which links the parameters they compute with a center variable stored by the master. The center variable is updated as a moving average where the average is taken in time and also in space over the parameters computed by the local workers. The main contribution of this paper is a new algorithm that provides fast convergent minimization while outperforming the DOWNPOUR method [2] and other baseline approaches in practice.
Simultaneously it reduces the communication overhead between the master and the local workers while maintaining high-quality performance measured by the test error. The new algorithm applies to deep learning settings such as parallelized training of convolutional neural networks.

The article is organized as follows. Section 2 explains the problem setting, Section 3 presents the synchronous EASGD algorithm and its asynchronous and momentum-based variants, Section 4 provides stability analysis of EASGD and ADMM in the round-robin scheme, Section 5 shows experimental results and Section 6 concludes. The Supplement contains additional material, including additional theoretical analysis.

2 Problem setting

Consider minimizing a function F(x) in a parallel computing environment [7] with p ∈ N workers and a master. In this paper we focus on the stochastic optimization problem of the following form

\min_x F(x) := \mathbb{E}[f(x, \xi)],    (1)

where x is the model parameter to be estimated and ξ is a random variable that follows the probability distribution P over Ω such that F(x) = \int_\Omega f(x, \xi) P(d\xi). The optimization problem in Equation 1 can be reformulated as follows

\min_{x^1, \ldots, x^p, \tilde{x}} \sum_{i=1}^{p} \mathbb{E}[f(x^i, \xi^i)] + \frac{\rho}{2} \|x^i - \tilde{x}\|^2,    (2)

where each ξ^i follows the same distribution P (thus we assume each worker can sample the entire dataset). In the paper we refer to the x^i's as local variables and we refer to x̃ as the center variable. The problem of the equivalence of these two objectives is studied in the literature and is known as the augmentability or the global variable consensus problem [8, 9]. The quadratic penalty term with coefficient ρ in Equation 2 is expected to ensure that the local workers do not fall into different attractors that are far away from the center variable. This paper focuses on the problem of reducing the parameter communication overhead between the master and the local workers [10, 2, 11, 12, 13]. The problem of data communication when the data is distributed among the workers [7, 14] is a more general problem and is not addressed in this work. We however emphasize that our problem setting is still highly non-trivial under the communication constraints due to the existence of many local optima [15].

3 EASGD update rule

The EASGD updates captured in Equations 3 and 4 are obtained by taking a gradient descent step on the objective in Equation 2 with respect to the variables x^i and x̃, respectively:

x^i_{t+1} = x^i_t - \eta \left( g^i_t(x^i_t) + \rho (x^i_t - \tilde{x}_t) \right)    (3)
\tilde{x}_{t+1} = \tilde{x}_t + \eta \sum_{i=1}^{p} \rho (x^i_t - \tilde{x}_t),    (4)

where g^i_t(x^i_t) denotes the stochastic gradient of F with respect to x^i evaluated at iteration t, x^i_t and x̃_t denote respectively the values of the variables x^i and x̃ at iteration t, and η is the learning rate. The update rule for the center variable x̃ takes the form of a moving average where the average is taken over both space and time. Denote α = ηρ and β = pα; then Equations 3 and 4 become

x^i_{t+1} = x^i_t - \eta g^i_t(x^i_t) - \alpha (x^i_t - \tilde{x}_t)    (5)
\tilde{x}_{t+1} = (1 - \beta) \tilde{x}_t + \beta \left( \frac{1}{p} \sum_{i=1}^{p} x^i_t \right).    (6)

Note that choosing β = pα leads to an elastic symmetry in the update rule, i.e. there exists a symmetric force equal to α(x^i_t − x̃_t) between the update of each x^i and x̃. It has a crucial influence on the algorithm's stability, as will be explained in Section 4. Also, in order to minimize the staleness [16] of the difference x^i_t − x̃_t between the center and the local variable, the update for the master in Equation 4 involves x^i_t instead of x^i_{t+1}. Note also that α = ηρ, where the magnitude of ρ represents the amount of exploration we allow in the model. In particular, small ρ allows for more exploration as it allows the x^i's to fluctuate further from the center x̃. The distinctive idea of EASGD is to allow the local workers to perform more exploration (small ρ) and the master to perform exploitation. This approach differs from other settings explored in the literature [2, 17, 18, 19, 20, 21, 22, 23], which focus on how fast the center variable converges. In this paper we show the merits of our approach in the deep learning setting. A minimal simulation of the synchronous updates in Equations 5 and 6 is sketched below.
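To make the synchronous update concrete, the following is a minimal NumPy sketch of Equations 5 and 6 (not the authors' implementation); the gradient oracle `grad`, the function names and the toy quadratic example are illustrative assumptions.

```python
import numpy as np

def sync_easgd(grad, x0, p=4, eta=0.01, rho=1.0, steps=1000, rng=None):
    """Minimal simulation of synchronous EASGD (Equations 5-6).

    grad(x, rng) should return a stochastic gradient of F at x.
    Returns the final center variable x_tilde.
    """
    rng = rng or np.random.default_rng(0)
    alpha = eta * rho                  # elastic moving rate
    beta = p * alpha                   # center moving rate (elastic symmetry beta = p*alpha)
    x = [x0.copy() for _ in range(p)]  # local variables x^i
    x_tilde = x0.copy()                # center variable
    for t in range(steps):
        x_old = [xi.copy() for xi in x]           # Eq. 4/6 uses x^i_t, not x^i_{t+1}
        for i in range(p):
            x[i] -= eta * grad(x[i], rng) + alpha * (x[i] - x_tilde)    # Eq. 5
        x_tilde = (1 - beta) * x_tilde + beta * np.mean(x_old, axis=0)  # Eq. 6
    return x_tilde

# Example: one-dimensional quadratic F(x) = x^2 / 2 with additive gradient noise.
if __name__ == "__main__":
    noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    print(sync_easgd(noisy_grad, np.array([5.0])))
```

Note how the center update averages the pre-update local variables, mirroring the symmetric elastic force α(x^i_t − x̃_t) acting on both sides.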
3.1 Asynchronous EASGD

We discussed the synchronous update of the EASGD algorithm in the previous section. In this section we propose its asynchronous variant. The local workers are still responsible for updating the local variables x^i, whereas the master updates the center variable x̃. Each worker maintains its own clock t^i, which starts from 0 and is incremented by 1 after each stochastic gradient update of x^i, as shown in Algorithm 1. The master performs an update whenever the local workers finish τ steps of their gradient updates, where we refer to τ as the communication period. As can be seen in Algorithm 1, whenever τ divides the local clock of the i-th worker, the i-th worker communicates with the master and requests the current value of the center variable x̃. The worker then waits until the master sends back the requested parameter value, and computes the elastic difference α(x − x̃) (this entire procedure is captured in step a) in Algorithm 1). The elastic difference is then sent back to the master (step b) in Algorithm 1), who then updates x̃. The communication period τ controls the frequency of the communication between every local worker and the master, and thus the trade-off between exploration and exploitation.

Algorithm 1: Asynchronous EASGD: Processing by worker i and the master
Input: learning rate η, moving rate α, communication period τ ∈ N
Initialize: x̃ is initialized randomly, x^i = x̃, t^i = 0
Repeat
  x ← x^i
  if (τ divides t^i) then
    a) x^i ← x^i − α(x − x̃)
    b) x̃ ← x̃ + α(x − x̃)
  end
  x^i ← x^i − η g^i_{t^i}(x)
  t^i ← t^i + 1
Until forever

Algorithm 2: Asynchronous EAMSGD: Processing by worker i and the master
Input: learning rate η, moving rate α, communication period τ ∈ N, momentum term δ
Initialize: x̃ is initialized randomly, x^i = x̃, v^i = 0, t^i = 0
Repeat
  x ← x^i
  if (τ divides t^i) then
    a) x^i ← x^i − α(x − x̃)
    b) x̃ ← x̃ + α(x − x̃)
  end
  v^i ← δ v^i − η g^i_{t^i}(x + δ v^i)
  x^i ← x^i + v^i
  t^i ← t^i + 1
Until forever

3.2 Momentum EASGD

The momentum EASGD (EAMSGD) is a variant of Algorithm 1 and is captured in Algorithm 2. It is based on Nesterov's momentum scheme [24, 25, 26], where the update of the local worker of the form captured in Equation 3 is replaced by the following update

v^i_{t+1} = \delta v^i_t - \eta g^i_t(x^i_t + \delta v^i_t)
x^i_{t+1} = x^i_t + v^i_{t+1} - \eta \rho (x^i_t - \tilde{x}_t),    (7)

where δ is the momentum term. Note that when δ = 0 we recover the original EASGD algorithm. As we are interested in reducing the communication overhead in the parallel computing environment where the parameter vector is very large, in the experimental section we explore the asynchronous EASGD algorithm and its momentum-based variant in the relatively large τ regime (less frequent communication). A minimal single-process sketch of the asynchronous worker loop is given below.
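For illustration, here is a single-machine sketch of Algorithms 1 and 2 in which p Python threads play the role of the local workers and a lock-protected array stands in for the parameter server; this is an assumed toy stand-in for the authors' GPU/MPI implementation, with δ = 0 giving EASGD and δ > 0 giving EAMSGD. All names are illustrative.

```python
import threading
import numpy as np

def async_easgd(grad, x0, p=4, eta=0.01, alpha=0.01, tau=4, delta=0.0, steps=2000):
    """Single-machine sketch of Algorithms 1/2: p worker threads share a center variable.

    grad(x, rng) returns a stochastic gradient of F at x.
    """
    center = {"x": x0.copy()}
    lock = threading.Lock()          # stands in for the parameter server

    def worker(i):
        rng = np.random.default_rng(i)
        xi = x0.copy()
        vi = np.zeros_like(x0)
        for ti in range(steps):
            x = xi.copy()            # local copy, as in the first line of the loop
            if ti % tau == 0:        # communicate every tau local steps
                with lock:
                    diff = alpha * (x - center["x"])
                    xi -= diff               # step a): pull the worker toward the center
                    center["x"] += diff      # step b): push the center toward the worker
            vi = delta * vi - eta * grad(x + delta * vi, rng)
            xi += vi                 # with delta = 0 this is exactly x^i <- x^i - eta*g(x)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(p)]
    for th in threads: th.start()
    for th in threads: th.join()
    return center["x"]

# Example on a noisy one-dimensional quadratic F(x) = x^2 / 2.
if __name__ == "__main__":
    g = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
    print(async_easgd(g, np.array([5.0]), delta=0.9))
```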
4 Stability analysis of EASGD and ADMM in the round-robin scheme

In this section we study the stability of the asynchronous EASGD and ADMM methods in the round-robin scheme [20]. We first state the updates of both algorithms in this setting, and then we study their stability. We will show that in the one-dimensional quadratic case, the ADMM algorithm can exhibit chaotic behavior, leading to exponential divergence. The analytic condition for the ADMM algorithm to be stable is still unknown, while for the EASGD algorithm it is very simple (Footnote 1). The analysis of the synchronous EASGD algorithm, including its convergence rate and its averaging property in the quadratic and strongly convex case, is deferred to the Supplement.

Footnote 1: This condition resembles the stability condition for the synchronous EASGD algorithm (Condition 17 for p = 1) in the analysis in the Supplement.

In our setting, the ADMM method [9, 27, 28] involves solving the following minimax problem (Footnote 2):

\max_{\lambda^1, \ldots, \lambda^p} \min_{x^1, \ldots, x^p, \tilde{x}} \sum_{i=1}^{p} F(x^i) - \lambda^i (x^i - \tilde{x}) + \frac{\rho}{2} \|x^i - \tilde{x}\|^2,    (8)

where the λ^i's are the Lagrangian multipliers.

Footnote 2: The convergence analysis in [27] is based on the assumption that "at any master iteration, updates from the workers have the same probability of arriving at the master", which is not satisfied in the round-robin scheme.

The resulting updates of the ADMM algorithm in the round-robin scheme are given next. Let t ≥ 0 be a global clock. At each t, we linearize the function F(x^i) with F(x^i_t) + \langle \nabla F(x^i_t), x^i - x^i_t \rangle + \frac{1}{2\eta} \|x^i - x^i_t\|^2 as in [28]. The updates become

\lambda^i_{t+1} = \begin{cases} \lambda^i_t - (x^i_t - \tilde{x}_t) & \text{if } \mathrm{mod}(t, p) = i - 1 \\ \lambda^i_t & \text{if } \mathrm{mod}(t, p) \neq i - 1 \end{cases}    (9)

x^i_{t+1} = \begin{cases} \dfrac{x^i_t - \eta \nabla F(x^i_t) + \eta\rho(\lambda^i_{t+1} + \tilde{x}_t)}{1 + \eta\rho} & \text{if } \mathrm{mod}(t, p) = i - 1 \\ x^i_t & \text{if } \mathrm{mod}(t, p) \neq i - 1 \end{cases}    (10)

\tilde{x}_{t+1} = \frac{1}{p} \sum_{i=1}^{p} (x^i_{t+1} - \lambda^i_{t+1}).    (11)

Each local variable x^i is periodically updated (with period p). First, the Lagrangian multiplier λ^i is updated with the dual ascent update as in Equation 9. It is followed by the gradient descent update of the local variable as given in Equation 10. Then the center variable x̃ is updated with the most recent values of all the local variables and Lagrangian multipliers as in Equation 11. Note that since the step size for the dual ascent update is chosen to be ρ by convention [9, 27, 28], we have re-parametrized the Lagrangian multiplier to be λ^i_t ← λ^i_t / ρ in the above updates.

The EASGD algorithm in the round-robin scheme is defined similarly and is given below:

x^i_{t+1} = \begin{cases} x^i_t - \eta \nabla F(x^i_t) - \alpha(x^i_t - \tilde{x}_t) & \text{if } \mathrm{mod}(t, p) = i - 1 \\ x^i_t & \text{if } \mathrm{mod}(t, p) \neq i - 1 \end{cases}    (12)

\tilde{x}_{t+1} = \tilde{x}_t + \sum_{i:\, \mathrm{mod}(t, p) = i - 1} \alpha (x^i_t - \tilde{x}_t).    (13)

At time t, only the i-th local worker (whose index i − 1 equals t modulo p) is activated and performs the update in Equation 12, which is followed by the master update given in Equation 13.

We will now focus on the one-dimensional quadratic case without noise, i.e. F(x) = x²/2, x ∈ R. For the ADMM algorithm, let the state of the (dynamical) system at time t be s_t = (λ¹_t, x¹_t, \ldots, λ^p_t, x^p_t, x̃_t) ∈ R^{2p+1}. The local worker i's updates in Equations 9, 10, and 11 are composed of three linear maps which can be written as s_{t+1} = (F^i_3 ∘ F^i_2 ∘ F^i_1)(s_t). For simplicity, we will only write them out below for the case when i = 1 and p = 2:

F^1_1 = \begin{pmatrix} 1 & -1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad
F^1_2 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ \frac{\eta\rho}{1+\eta\rho} & \frac{1-\eta}{1+\eta\rho} & 0 & 0 & \frac{\eta\rho}{1+\eta\rho} \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}, \quad
F^1_3 = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \\ -\frac{1}{2} & \frac{1}{2} & -\frac{1}{2} & \frac{1}{2} & 0 \end{pmatrix}.

For each of the p linear maps, it is possible to find a simple condition such that each map, where the i-th map has the form F^i_3 ∘ F^i_2 ∘ F^i_1, is stable (the absolute values of the eigenvalues of the map are smaller than or equal to one). However, when these non-symmetric maps are composed one after another as F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ \ldots ∘ F^1_3 ∘ F^1_2 ∘ F^1_1, the resulting map F can become unstable! (More precisely, some eigenvalues of the map can sit outside the unit circle in the complex plane.)
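The instability of the composed map can be checked numerically. The sketch below builds the per-worker maps F^i_1, F^i_2, F^i_3 for the one-dimensional quadratic F(x) = x²/2 and general p, composes them in the round-robin order, and reports the largest absolute eigenvalue (values above 1 indicate divergence). The construction follows Equations 9-11 as written above; the function name and the specific (η, ρ) values tried are our own choices, guided by the caption of Figure 1 below.

```python
import numpy as np

def admm_round_robin_map(eta, rho, p):
    """Compose F = F^p_3 F^p_2 F^p_1 ... F^1_3 F^1_2 F^1_1 for F(x) = x^2 / 2.

    State ordering: s = (lambda^1, x^1, ..., lambda^p, x^p, x_tilde) in R^(2p+1).
    """
    n = 2 * p + 1
    F = np.eye(n)
    for i in range(p):                           # workers are activated in the order 1, ..., p
        li, xi, xc = 2 * i, 2 * i + 1, n - 1     # indices of lambda^i, x^i and the center
        F1 = np.eye(n)                           # Eq. 9: lambda^i <- lambda^i - (x^i - x_tilde)
        F1[li, xi], F1[li, xc] = -1.0, 1.0
        F2 = np.eye(n)                           # Eq. 10 with grad F(x) = x
        F2[xi, xi] = (1 - eta) / (1 + eta * rho)
        F2[xi, li] = F2[xi, xc] = eta * rho / (1 + eta * rho)
        F3 = np.eye(n)                           # Eq. 11: x_tilde <- (1/p) sum_j (x^j - lambda^j)
        F3[xc, :] = 0.0
        for j in range(p):
            F3[xc, 2 * j] = -1.0 / p
            F3[xc, 2 * j + 1] = 1.0 / p
        F = F3 @ F2 @ F1 @ F
    return F

if __name__ == "__main__":
    # The caption of Figure 1 suggests eta = 0.001, rho = 2.5 as a diverging setting.
    for p in (3, 8):
        radius = max(abs(np.linalg.eigvals(admm_round_robin_map(0.001, 2.5, p))))
        print(p, radius)
```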
We now present the numerical conditions for which the ADMM algorithm becomes unstable in the round-robin scheme for p = 3 and p = 8, obtained by computing the largest absolute eigenvalue of the map F. Figure 1 summarizes the obtained result.

Figure 1: The largest absolute eigenvalue of the linear map F = F^p_3 ∘ F^p_2 ∘ F^p_1 ∘ \ldots ∘ F^1_3 ∘ F^1_2 ∘ F^1_1 as a function of η ∈ (0, 10⁻²) and ρ ∈ (0, 10) when p = 3 and p = 8. To simulate the chaotic behavior of the ADMM algorithm, one may pick η = 0.001 and ρ = 2.5 and initialize the state s_0 either randomly or with λ^i_0 = 0, x^i_0 = x̃_0 = 1000, ∀i. Figure should be read in color.

On the other hand, the EASGD algorithm involves composing only symmetric linear maps due to the elasticity. Let the state of the (dynamical) system at time t be s_t = (x¹_t, \ldots, x^p_t, x̃_t) ∈ R^{p+1}. The activated local worker i's update in Equation 12 and the master update in Equation 13 can be written as s_{t+1} = F^i(s_t). In the case of p = 2, the maps F¹ and F² are defined as follows

F^1 = \begin{pmatrix} 1-\eta-\alpha & 0 & \alpha \\ 0 & 1 & 0 \\ \alpha & 0 & 1-\alpha \end{pmatrix}, \quad
F^2 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1-\eta-\alpha & \alpha \\ 0 & \alpha & 1-\alpha \end{pmatrix}.

For the composite map F^p ∘ \ldots ∘ F^1 to be stable, the condition that needs to be satisfied is actually the same for each i, and is furthermore independent of p (since each linear map F^i is symmetric). It essentially involves the stability of the 2 × 2 matrix

\begin{pmatrix} 1-\eta-\alpha & \alpha \\ \alpha & 1-\alpha \end{pmatrix},

whose two (real) eigenvalues λ satisfy (1 − η − α − λ)(1 − α − λ) = α². The resulting stability condition (|λ| ≤ 1) is simple and given as 0 ≤ η ≤ 2 and 0 ≤ α ≤ (4 − 2η)/(4 − η).
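The stability condition above can be double-checked numerically: the sketch below sweeps a grid of (η, α) values and verifies that the closed-form condition agrees with a direct eigenvalue computation for the 2 × 2 matrix (grid ranges and tolerances are arbitrary choices).

```python
import numpy as np

def easgd_spectral_radius(eta, alpha):
    """Largest |eigenvalue| of the 2x2 block governing the round-robin EASGD map."""
    M = np.array([[1 - eta - alpha, alpha],
                  [alpha,           1 - alpha]])
    return max(abs(np.linalg.eigvals(M)))

def condition_holds(eta, alpha):
    """Closed-form condition stated above: 0 <= eta <= 2 and 0 <= alpha <= (4 - 2*eta)/(4 - eta)."""
    return 0 <= eta <= 2 and 0 <= alpha <= (4 - 2 * eta) / (4 - eta)

if __name__ == "__main__":
    # Sweep a grid of (eta, alpha) and check that the closed form agrees with the
    # eigenvalue computation, allowing floating-point slack at the boundary.
    for eta in np.linspace(0.0, 3.0, 61):
        for alpha in np.linspace(0.0, 1.5, 61):
            radius = easgd_spectral_radius(eta, alpha)
            if condition_holds(eta, alpha):
                assert radius <= 1 + 1e-8, (eta, alpha, radius)
            else:
                assert radius > 1 - 1e-8, (eta, alpha, radius)
    print("closed-form stability condition agrees with the eigenvalue check")
```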
5 Experiments

In this section we compare the performance of EASGD and EAMSGD with the parallel method DOWNPOUR and the sequential method SGD, as well as their averaging and momentum variants.

All the parallel comparator methods are listed below (Footnote 3):

- DOWNPOUR [2]; the pseudo-code of the implementation of DOWNPOUR used in this paper is enclosed in the Supplement.
- Momentum DOWNPOUR (MDOWNPOUR), where Nesterov's momentum scheme is applied to the master's update (note it is unclear how to apply it to the local workers or for the case when τ > 1). The pseudo-code is in the Supplement.
- A method that we call ADOWNPOUR, where we compute the average over time of the center variable x̃ as follows: z_{t+1} = (1 − α_{t+1}) z_t + α_{t+1} x̃_t, where α_{t+1} = 1/(t+1) is a moving rate and z_0 = x̃_0. Here t denotes the master clock, which is initialized to 0 and incremented every time the center variable x̃ is updated.
- A method that we call MVADOWNPOUR, where we compute the moving average of the center variable x̃ as follows: z_{t+1} = (1 − α) z_t + α x̃_t, where the moving rate α is chosen to be constant and z_0 = x̃_0. Here t denotes the master clock and is defined in the same way as for the ADOWNPOUR method.

Footnote 3: We have compared asynchronous ADMM [27] with EASGD in our setting as well; the performance is nearly the same. However, ADMM's momentum variant is not as stable for large communication periods.

All the sequential comparator methods (p = 1) are listed below:

- SGD [1] with constant learning rate η.
- Momentum SGD (MSGD) [26] with constant momentum δ.
- ASGD [6] with moving rate α_{t+1} = 1/(t+1).
- MVASGD [6] with moving rate α set to a constant.

We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR, Footnote 4) and ImageNet ILSVRC 2013 (we refer to it as ImageNet, Footnote 5). We focus on the image classification task with deep convolutional neural networks. We next explain the experimental setup. The details of the data preprocessing and prefetching are deferred to the Supplement.

Footnote 4: Downloaded from http://www.cs.toronto.edu/~kriz/cifar.html.
Footnote 5: Downloaded from http://image-net.org/challenges/LSVRC/2013.

5.1 Experimental setup

For all our experiments we use a GPU cluster interconnected with InfiniBand. Each node has 4 Titan GPU processors, and each local worker corresponds to one GPU processor. The center variable of the master is stored and updated on the centralized parameter server [2] (Footnote 6).

Footnote 6: Our implementation is available at https://github.com/sixin-zh/mpiT.

To describe the architecture of the convolutional neural network, we first introduce our notation. Let (c, y) denote the size of the input image to each layer, where c is the number of color channels and y is both the horizontal and the vertical dimension of the input. Let C denote the fully-connected convolutional operator, let P denote the max pooling operator, let D denote the linear operator with dropout rate equal to 0.5, and let S denote the linear operator with softmax output non-linearity. We use the cross-entropy loss and all inner layers use rectified linear units. For the ImageNet experiment we use a similar approach to [4] with the following 11-layer convolutional neural network

(3,221)C(96,108)P(96,36)C(256,32)P(256,16)C(384,14)C(384,13)C(256,12)P(256,6)D(4096,1)D(4096,1)S(1000,1).

For the CIFAR experiment we use a similar approach to [29] with the following 7-layer convolutional neural network

(3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1).

In our experiments all the methods we run use the same initial parameter chosen randomly, except that we set all the biases to zero for the CIFAR case and to 0.1 for the ImageNet case. This parameter is used to initialize the master and all the local workers (Footnote 7). We add the l2-regularization term (λ/2)‖x‖² to the loss function F(x). For ImageNet we use λ = 10⁻⁵ and for CIFAR we use λ = 10⁻⁴. We also compute the stochastic gradient using mini-batches of sample size 128. A sketch of the CIFAR network in this layer notation is given at the end of this subsection.

Footnote 7: On the contrary, initializing the local workers and the master with different random seeds traps the algorithm in the symmetry breaking phase.
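As an illustration of the layer notation, below is one possible PyTorch rendering of the 7-layer CIFAR network; the kernel sizes (5, 5, 3), stride-1 valid convolutions, 2 × 2 pooling and the placement of dropout are assumptions inferred from the listed (channels, size) pairs and the text, not the authors' exact Torch implementation.

```python
import torch
import torch.nn as nn

# One plausible reading of (3,28)C(64,24)P(64,12)C(128,8)P(128,4)C(64,2)D(256,1)S(10,1),
# assuming stride-1 "valid" convolutions (kernels 5, 5, 3 reproduce the spatial sizes
# 28->24, 12->8, 4->2) and 2x2 max pooling; ReLU after inner layers as stated in the text.
cifar_net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=5), nn.ReLU(),      # (3,28)  -> (64,24)
    nn.MaxPool2d(2),                                 # (64,24) -> (64,12)
    nn.Conv2d(64, 128, kernel_size=5), nn.ReLU(),    # (64,12) -> (128,8)
    nn.MaxPool2d(2),                                 # (128,8) -> (128,4)
    nn.Conv2d(128, 64, kernel_size=3), nn.ReLU(),    # (128,4) -> (64,2)
    nn.Flatten(),                                    # 64*2*2 = 256 features
    nn.Dropout(0.5), nn.Linear(256, 256), nn.ReLU(), # D(256,1)
    nn.Linear(256, 10),                              # S(10,1): softmax folded into the loss
)

if __name__ == "__main__":
    x = torch.randn(128, 3, 28, 28)                  # mini-batch of size 128
    loss = nn.CrossEntropyLoss()(cifar_net(x), torch.randint(0, 10, (128,)))
    print(loss.item())
```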
5.2 Experimental results

For all experiments in this section we use EASGD with β = 0.9 (Footnote 8); for all momentum-based methods we set the momentum term δ = 0.99 and finally for MVADOWNPOUR we set the moving rate to α = 0.001. We start with the experiment on the CIFAR dataset with p = 4 local workers running on a single computing node. For all the methods, we examined the communication periods from the following set τ = {1, 4, 16, 64}. For comparison we also report the performance of MSGD, which outperformed SGD, ASGD and MVASGD as shown in Figure 6 in the Supplement. For each method we examined a wide range of learning rates (the learning rates explored in all experiments are summarized in Tables 1, 2 and 3 in the Supplement). The CIFAR experiment was run 3 times independently from the same initialization and for each method we report its best performance measured by the smallest achievable test error. From the results in Figure 2, we conclude that all DOWNPOUR-based methods achieve their best performance (test error) for small τ (τ ∈ {1, 4}), and become highly unstable for τ ∈ {16, 64}. EAMSGD, in contrast, significantly outperforms the comparator methods for all values of τ by converging faster. It also finds better-quality solutions measured by the test error, and this advantage becomes more significant for τ ∈ {16, 64}. Note that the tendency to achieve better test performance with larger τ is also characteristic of the EASGD algorithm.

Footnote 8: Intuitively, the effective β is β/τ = pα = pηρ (thus ρ = β/(τpη)) in the asynchronous setting.

Figure 2: Training and test loss and the test error for the center variable versus wallclock time (min) for different communication periods τ on the CIFAR dataset with the 7-layer convolutional neural network.

We next explore different numbers of local workers p from the set p = {4, 8, 16} for the CIFAR experiment, and p = {4, 8} for the ImageNet experiment (Footnote 9). For the ImageNet experiment we report the results of one run with the best setting we have found. EASGD and EAMSGD were run with τ = 10, whereas DOWNPOUR and MDOWNPOUR were run with τ = 1. The results are in Figures 3 and 4. For the CIFAR experiment, it is noticeable that the lowest achievable test error by either EASGD or EAMSGD decreases with larger p. This can potentially be explained by the fact that larger p allows for more exploration of the parameter space. In the Supplement, we discuss further the trade-off between exploration and exploitation as a function of the learning rate (Section 9.5) and the communication period (Section 9.6). Finally, the results obtained for the ImageNet experiment also show the advantage of EAMSGD over the competitor methods.

Footnote 9: For the ImageNet experiment, the training loss is measured on a subset of the training data of size 50,000.

Figure 3: Training and test loss and the test error for the center variable versus wallclock time (min) for different numbers of local workers p for the parallel methods (MSGD uses p = 1) on CIFAR with the 7-layer convolutional neural network. EAMSGD achieves significant accelerations compared to other methods, e.g. the relative speed-up for p = 16 (the best comparator method is then MSGD) to achieve the test error 21% equals 11.1.

6 Conclusion

In this paper we describe a new algorithm called EASGD and its variants for training deep neural networks in the stochastic setting when the computations are parallelized over multiple GPUs. Experiments demonstrate that this new algorithm quickly achieves improvement in test error compared to more common baseline approaches such as DOWNPOUR and its variants. We show that our approach is very stable and plausible under communication constraints. We provide the stability analysis of the asynchronous EASGD in the round-robin scheme, and show the theoretical advantage of the method over ADMM. The different behavior of the EASGD algorithm from its momentum-based variant EAMSGD is intriguing and will be studied in future works.
Figure 4: Training and test loss and the test error for the center variable versus wallclock time (hours) for different numbers of local workers p (MSGD uses p = 1) on ImageNet with the 11-layer convolutional neural network. The initial learning rate is decreased twice, by a factor of 5 and then 2, when we observe that the online predictive loss [30] stagnates. EAMSGD achieves significant accelerations compared to other methods, e.g. the relative speed-up for p = 8 (the best comparator method is then DOWNPOUR) to achieve the test error 49% equals 1.8, and simultaneously it reduces the communication overhead (DOWNPOUR uses communication period τ = 1 and EAMSGD uses τ = 10).

Acknowledgments

The authors thank R. Power and J. Li for implementation guidance, J. Bruna, O. Henaff, C. Farabet, A. Szlam and Y. Bakhtin for helpful discussion, and P. L. Combettes, S. Bengio and the referees for valuable feedback.

References

[1] Bottou, L. Online algorithms and stochastic approximations. In Online Learning and Neural Networks. Cambridge University Press, 1998.
[2] Dean, J, Corrado, G, Monga, R, Chen, K, Devin, M, Le, Q, Mao, M, Ranzato, M, Senior, A, Tucker, P, Yang, K, and Ng, A. Large scale distributed deep networks. In NIPS, 2012.
[3] Krizhevsky, A, Sutskever, I, and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114, 2012.
[4] Sermanet, P, Eigen, D, Zhang, X, Mathieu, M, Fergus, R, and LeCun, Y. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv, 2013.
[5] Nocedal, J and Wright, S. Numerical Optimization, Second Edition. Springer New York, 2006.
[6] Polyak, B. T and Juditsky, A. B. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.
[7] Bertsekas, D. P and Tsitsiklis, J. N. Parallel and Distributed Computation. Prentice Hall, 1989.
[8] Hestenes, M. R. Optimization Theory: The Finite Dimensional Case. Wiley, 1975.
[9] Boyd, S, Parikh, N, Chu, E, Peleato, B, and Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1-122, 2011.
[10] Shamir, O. Fundamental limits of online and distributed algorithms for statistical learning and estimation. In NIPS, 2014.
[11] Yadan, O, Adams, K, Taigman, Y, and Ranzato, M. Multi-GPU training of convnets. arXiv, 2013.
[12] Paine, T, Jin, H, Yang, J, Lin, Z, and Huang, T. GPU asynchronous stochastic gradient descent to speed up neural network training. arXiv, 2013.
[13] Seide, F, Fu, H, Droppo, J, Li, G, and Yu, D. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech, September 2014.
[14] Bekkerman, R, Bilenko, M, and Langford, J. Scaling Up Machine Learning: Parallel and Distributed Approaches. Cambridge University Press, 2011.
[15] Choromanska, A, Henaff, M. B, Mathieu, M, Arous, G. B, and LeCun, Y. The loss surfaces of multilayer networks. In AISTATS, 2015.
[16] Ho, Q, Cipar, J, Cui, H, Lee, S, Kim, J. K, Gibbons, P. B, Gibson, G. A, Ganger, G, and Xing, E. P. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[17] Azadi, S and Sra, S. Towards an optimal stochastic alternating direction method of multipliers. In ICML, 2014.
[18] Borkar, V. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization, 36(3):840-851, 1998.
[19] Nedić, A, Bertsekas, D, and Borkar, V. Distributed asynchronous incremental subgradient methods. In Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, volume 8 of Studies in Computational Mathematics, pages 381-407, 2001.
[20] Langford, J, Smola, A, and Zinkevich, M. Slow learners are fast. In NIPS, 2009.
[21] Agarwal, A and Duchi, J. Distributed delayed stochastic optimization. In NIPS, 2011.
[22] Recht, B, Re, C, Wright, S. J, and Niu, F. Hogwild!: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
[23] Zinkevich, M, Weimer, M, Smola, A, and Li, L. Parallelized stochastic gradient descent. In NIPS, 2010.
[24] Nesterov, Y. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127-152, 2005.
[25] Lan, G. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365-397, 2012.
[26] Sutskever, I, Martens, J, Dahl, G, and Hinton, G. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[27] Zhang, R and Kwok, J. Asynchronous distributed ADMM for consensus optimization. In ICML, 2014.
[28] Ouyang, H, He, N, Tran, L, and Gray, A. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, pages 80-88, 2013.
[29] Wan, L, Zeiler, M. D, Zhang, S, LeCun, Y, and Fergus, R. Regularization of neural networks using DropConnect. In ICML, 2013.
[30] Cesa-Bianchi, N, Conconi, A, and Gentile, C. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[31] Nesterov, Y. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.