# Decoupled Parallel Backpropagation with Convergence Guarantee

Zhouyuan Huo, Bin Gu, Qian Yang, Heng Huang
Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, PA, United States. Correspondence to: Heng Huang.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

Abstract

The backpropagation algorithm is indispensable for training feedforward neural networks. It requires propagating error gradients sequentially from the output layer all the way back to the input layer. This backward locking in the backpropagation algorithm constrains us from updating network layers in parallel and fully leveraging the computing resources. Recently, several algorithms have been proposed to break the backward locking; however, their performance degrades seriously when networks are deep. In this paper, we propose a decoupled parallel backpropagation algorithm for deep learning optimization with a convergence guarantee. Firstly, we decouple the backpropagation algorithm using delayed gradients and show that the backward locking is removed when we split the network into multiple modules. Then, we utilize decoupled parallel backpropagation in two stochastic methods and prove that our method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments training deep convolutional neural networks on benchmark datasets. The experimental results not only confirm our theoretical analysis, but also demonstrate that the proposed method can achieve significant speedup without loss of accuracy.

1. Introduction

We have witnessed a series of breakthroughs in computer vision using deep convolutional neural networks (LeCun et al., 2015). Most neural networks are trained using stochastic gradient descent (SGD) or its variants, in which the gradients of the networks are computed by the backpropagation algorithm (Rumelhart et al., 1988). As shown in Figure 1, the backpropagation algorithm consists of two processes: the forward pass to compute the prediction, and the backward pass to compute the gradients and update the model.

Figure 1. We split a multilayer feedforward neural network into three modules. Each module is a stack of layers. The backpropagation algorithm requires running the forward pass (steps 1 to 3) and the backward pass (steps 4 to 6) in sequential order. For example, module A cannot perform step 6 before receiving $\delta^t_A$, which is an output of step 5 in module B.

After computing the prediction in the forward pass, the backpropagation algorithm requires propagating error gradients from the top (output layer) all the way back to the bottom (input layer). Therefore, in the backward pass, all layers, or more generally, modules, of the network are locked until their dependencies have executed. The backward locking constrains us from updating models in parallel and fully leveraging the computing resources. It has been shown in practice (Krizhevsky et al., 2012; Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Huang et al., 2016) and in theory (Eldan & Shamir, 2016; Telgarsky, 2016; Bengio et al., 2009) that depth is one of the most critical factors contributing to the success of deep learning.
From AlexNet with 8 layers (Krizhevsky et al., 2012) to ResNet-101 with more than one hundred layers (He et al., 2016), the forward and backward times grow from (4.31 ms and 9.58 ms) to (53.38 ms and 103.06 ms) when the networks are trained on a Titan X with input size 16 x 3 x 224 x 224 (Johnson, 2017). Since the backward time is about twice the forward time, parallelizing the backward pass can greatly reduce the training time. We can easily split a deep neural network into modules as in Figure 1 and distribute them across multiple GPUs. However, because of the backward locking, all GPUs are idle until they receive error gradients from their dependent modules in the backward pass.

Several algorithms have been proposed for breaking the backward locking. For example, Jaderberg et al. (2016) and Czarnecki et al. (2017) proposed to remove the lockings in backpropagation by employing additional neural networks to approximate error gradients; in the backward pass, all modules use these synthetic gradients to update the weights of the model without incurring any delay. Nøkland (2016) and Balduzzi et al. (2015) broke the local dependencies between successive layers and made all hidden layers receive error information directly from the output layer. In (Carreira-Perpinan & Wang, 2014; Taylor et al., 2016), the authors loosened the exact connections between layers by introducing auxiliary variables. In each layer, they imposed an equality constraint between the auxiliary variable and the activation, and optimized the new problem using the Alternating Direction Method, which is easy to parallelize. However, for convolutional neural networks, the performance of all of the above methods is much worse than that of the backpropagation algorithm when the network is deep.

In this paper, we focus on breaking the backward locking in the backpropagation algorithm for training feedforward neural networks, such that we can update models in parallel without loss of accuracy. The main contributions of our work are as follows. Firstly, we decouple backpropagation using delayed gradients in Section 3, such that all modules of the network can be updated in parallel without backward locking. Then, we propose two stochastic algorithms using decoupled parallel backpropagation in Section 3 for deep learning optimization. We also provide convergence analysis for the proposed method in Section 4 and prove that it guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments training deep convolutional neural networks in Section 5; the experimental results verify that the proposed method can significantly speed up training without loss of accuracy.

2. Backgrounds

We begin with a brief overview of the backpropagation algorithm for the optimization of neural networks. Suppose that we want to train a feedforward neural network with $L$ layers, where each layer takes an input $h_{l-1}$ and produces an activation $h_l = F_l(h_{l-1}; w_l)$ with weights $w_l$. Letting $d$ be the dimension of the weights in the network, we have $w = [w_1, w_2, \ldots, w_L] \in \mathbb{R}^d$. Thus, the output of the network can be represented as $h_L = F(h_0; w)$, where $h_0$ denotes the input data $x$. Given a loss function $f$ and targets $y$, the training problem is:

$$\min_{w=[w_1,\ldots,w_L]} f(F(x; w), y). \qquad (1)$$

In the following, we write $f(w)$ for simplicity. Gradient-based methods are widely used for deep learning optimization (Robbins & Monro, 1951; Qian, 1999; Hinton et al., 2012; Kingma & Ba, 2014).
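As a concrete illustration of this formulation, the following PyTorch-style sketch composes a few layers $F_l$ and evaluates the loss of problem (1). The layer sizes, the loss, and the data are illustrative placeholders, not anything used in the paper.

```python
import torch
import torch.nn as nn

# A feedforward network viewed as a composition h_l = F_l(h_{l-1}; w_l).
# The sizes below are illustrative placeholders.
layers = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),   # F_1
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),   # F_2
    nn.Sequential(nn.Linear(64, 10)),              # F_3, the output layer
])
loss_fn = nn.CrossEntropyLoss()                    # the loss f

x = torch.randn(8, 32)                             # h_0: a batch of input data
y = torch.randint(0, 10, (8,))                     # targets

h = x
for layer in layers:                               # forward pass
    h = layer(h)                                   # h_l = F_l(h_{l-1}; w_l)
loss = loss_fn(h, y)                               # f(F(x; w), y), problem (1)
loss.backward()                                    # standard backpropagation
```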
In iteration $t$, we feed a data sample $x_{i(t)}$ into the network, where $i(t)$ denotes the index of the sample. Following stochastic gradient descent (SGD), we update the weights of the network through:

$$w^{t+1}_l = w^t_l - \gamma_t \, \nabla f_{l, x_{i(t)}}(w^t), \quad l \in \{1, 2, \ldots, L\}, \qquad (2)$$

where $\gamma_t$ is the stepsize and $\nabla f_{l, x_{i(t)}}(w^t) \in \mathbb{R}^d$ is the gradient of the loss function (1) with respect to the weights at layer $l$ and data sample $x_{i(t)}$, with all coordinates other than those of layer $l$ set to $0$. We always utilize the backpropagation algorithm to compute the gradients (Rumelhart et al., 1988). The backpropagation algorithm consists of two passes over the network. In the forward pass, the activations of all layers are calculated from $l = 1$ to $L$:

$$h^t_l = F_l(h^t_{l-1}; w_l). \qquad (3)$$

In the backward pass, we apply the chain rule and repeatedly propagate error gradients through the network from the output layer $l = L$ to the input layer $l = 1$:

$$\frac{\partial f(w^t)}{\partial w^t_l} = \frac{\partial h^t_l}{\partial w^t_l} \cdot \frac{\partial f(w^t)}{\partial h^t_l}, \qquad (4)$$

$$\frac{\partial f(w^t)}{\partial h^t_{l-1}} = \frac{\partial h^t_l}{\partial h^t_{l-1}} \cdot \frac{\partial f(w^t)}{\partial h^t_l}, \qquad (5)$$

where we let $\nabla f_{l, x_{i(t)}}(w^t) = \frac{\partial f(w^t)}{\partial w^t_l}$. From equations (4) and (5), it is clear that the computation in layer $l$ depends on the error gradient $\frac{\partial f(w^t)}{\partial h^t_l}$ from layer $l+1$. Therefore, the backward locking constrains all layers from updating before receiving error gradients from their dependent layers. When the network is very deep or distributed across multiple resources, the backward locking is the main bottleneck in the training process.

3. Decoupled Parallel Backpropagation

In this section, we propose to decouple the backpropagation algorithm using delayed gradients (DDG). Suppose we split an $L$-layer feedforward neural network into $K$ modules, so that the weights of the network are divided into $K$ groups. Therefore, we have $w = [w_{\mathcal{G}(1)}, w_{\mathcal{G}(2)}, \ldots, w_{\mathcal{G}(K)}]$, where $\mathcal{G}(k)$ denotes the layer indices in group $k$.

Figure 2. We split a multilayer feedforward neural network into three modules (A, B and C), where each module is a stack of layers. After executing the forward pass (steps 1 to 3) to predict, our proposed method allows all modules to run the backward pass (step 4) using delayed gradients without locking. In particular, module A can perform the backward pass using the stale error gradient $\delta^{t-2}_A$. Meanwhile, it also receives $\delta^{t-1}_A$ from module B for the update of the next iteration.

3.1. Backpropagation Using Delayed Gradients

In iteration $t$, data sample $x_{i(t)}$ is input to the network. We run the forward pass from module $k = 1$ to $k = K$; in each module, we compute the activations in sequential order as in equation (3). In the backward pass, all modules except the last one have delayed error gradients in store, so that they can execute their backward computation without locking; the last module updates with the up-to-date gradients. In particular, module $k$ keeps the stale error gradient $\frac{\partial f(w^{t-K+k})}{\partial h^{t-K+k}_{L_k}}$, where $L_k$ denotes the last layer in module $k$. The backward computation in module $k$ is then:

$$\frac{\partial f(w^{t-K+k})}{\partial w^{t-K+k}_l} = \frac{\partial h^{t-K+k}_l}{\partial w^{t-K+k}_l} \cdot \frac{\partial f(w^{t-K+k})}{\partial h^{t-K+k}_l}, \qquad (6)$$

$$\frac{\partial f(w^{t-K+k})}{\partial h^{t-K+k}_{l-1}} = \frac{\partial h^{t-K+k}_l}{\partial h^{t-K+k}_{l-1}} \cdot \frac{\partial f(w^{t-K+k})}{\partial h^{t-K+k}_l}, \qquad (7)$$

where $l \in \mathcal{G}(k)$. Meanwhile, each module also receives an error gradient from its dependent module for further computation. From (6) and (7), we see that the stale error gradients in the modules have different time delays: from module $k = 1$ to $k = K$, the corresponding delays range from $K-1$ to $0$, where delay $0$ indicates that the gradients are up-to-date. In this way, we break the backward locking and achieve parallel updates in the backward pass. Figure 2 shows an example of the decoupled backpropagation, where the error gradients are $\delta := \frac{\partial f(w)}{\partial h}$.
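To make the mechanics concrete, here is a minimal single-iteration PyTorch sketch of this idea; it is not the authors' implementation. The computation graph is cut between two illustrative modules, and the lower module runs its backward pass from a stored error gradient without waiting for the module above it. The stale gradient delta_A is faked with random values purely to show the data flow.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Two illustrative modules; the graph is cut between them with detach().
module_A = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
module_B = nn.Sequential(nn.Linear(64, 10))

x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))

h_A_in = x.requires_grad_()                 # input of module A
h_A = module_A(h_A_in)                      # forward in module A
h_B_in = h_A.detach().requires_grad_()      # activation handed to module B
h_B = module_B(h_B_in)                      # forward in module B

# delta_A stands for the stale error gradient d f / d h_A that module A has
# kept from an earlier iteration; it is faked with random values here.
delta_A = torch.randn_like(h_A)

# Module A's backward pass (eqs. (6)-(7)) needs only delta_A, not module B:
h_A.backward(delta_A)                       # fills the gradients of module_A
delta_prev = h_A_in.grad                    # gradient w.r.t. module A's input
                                            # (sent upstream if A is not first)

# Independently, module B runs backward with the up-to-date loss gradient:
loss = F.cross_entropy(h_B, y)
loss.backward()                             # fills the gradients of module_B
delta_A_next = h_B_in.grad                  # the delta module A will use later
```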
3.2. Speedup of Decoupled Parallel Backpropagation

When $K = 1$, there is no time delay and the proposed method is equivalent to the backpropagation algorithm. When $K > 1$, we can distribute the network across multiple GPUs and fully leverage the computing resources. Table 1 lists the computation time when we sequentially allocate the network across $K$ GPUs. Since the forward time $T_F$ is necessary to compute accurate predictions, we accelerate training by reducing the backward time. Because $T_B$ is much larger than $T_F$, we can achieve a large speedup even when $K$ is small.

Table 1. Comparison of computation time when the network is sequentially distributed across $K$ GPUs. $T_F$ and $T_B$ denote the forward and backward time of the backpropagation algorithm.

| Method | Computation Time |
|---|---|
| Backpropagation | $T_F + T_B$ |
| DDG | $T_F + T_B / K$ |

Relation to model parallelism: Model parallelism usually refers to filter-wise parallelism (Yadan et al., 2013); for example, a convolutional layer with $N$ filters is split across two GPUs, each holding $N/2$ filters. Although filter-wise parallelism accelerates training when we distribute the workload across multiple GPUs, it still suffers from the backward locking. The DDG algorithm can be thought of as layer-wise parallelism, and it is easy to combine filter-wise parallelism with layer-wise parallelism for further speedup.

3.3. Stochastic Methods Using Delayed Gradients

After computing the gradients of the loss function with respect to the weights of the model, we update the model using delayed gradients. Letting

$$\nabla f_{\mathcal{G}(k), x_{i(t-K+k)}}\!\left(w^{t-K+k}\right) := \begin{cases} \left[\dfrac{\partial f_{x_{i(t-K+k)}}(w^{t-K+k})}{\partial w_l}\right]_{l \in \mathcal{G}(k)} & \text{if } t-K+k \ge 0, \\ 0 & \text{otherwise}, \end{cases} \qquad (8)$$

for any $k \in \{1, 2, \ldots, K\}$, we update the weights in module $k$ following SGD:

$$w^{t+1}_{\mathcal{G}(k)} = w^t_{\mathcal{G}(k)} - \gamma_t \left[\nabla f_{x_{i(t-K+k)}}\!\left(w^{t-K+k}\right)\right]_{\mathcal{G}(k)}, \qquad (9)$$

where $\gamma_t$ denotes the stepsize. Different from SGD, we update the weights with delayed gradients; moreover, the delayed iteration $(t-K+k)$ for group $k$ is deterministic. We summarize the proposed method in Algorithm 1.

Algorithm 1 SGD-DDG
Initial weights $w^0 = [w^0_{\mathcal{G}(1)}, \ldots, w^0_{\mathcal{G}(K)}] \in \mathbb{R}^d$; stepsize sequence $\{\gamma_t\}$.
1: for $t = 0, 1, 2, \ldots, T-1$ do
2:   for $k = 1, \ldots, K$ in parallel do
3:     Compute delayed gradient: $g^t_k \leftarrow \left[\nabla f_{\mathcal{G}(k), x_{i(t-K+k)}}(w^{t-K+k})\right]$
4:     Update weights: $w^{t+1}_{\mathcal{G}(k)} \leftarrow w^t_{\mathcal{G}(k)} - \gamma_t \, g^t_k$
5:   end for
6: end for

Moreover, we can also apply the delayed gradients to other variants of SGD, for example Adam in Algorithm 2. In each iteration, we update the weights and the moment vectors with delayed gradients.

Algorithm 2 Adam-DDG
Initial weights $w^0 = [w^0_{\mathcal{G}(1)}, \ldots, w^0_{\mathcal{G}(K)}] \in \mathbb{R}^d$; stepsize $\gamma$; constant $\epsilon = 10^{-8}$; exponential decay rates $\beta_1 = 0.9$ and $\beta_2 = 0.999$; first moment vector $m^0_{\mathcal{G}(k)} \leftarrow 0$ and second moment vector $v^0_{\mathcal{G}(k)} \leftarrow 0$ for all $k \in \{1, 2, \ldots, K\}$.
1: for $t = 0, 1, 2, \ldots, T-1$ do
2:   for $k = 1, \ldots, K$ in parallel do
3:     Compute delayed gradient: $g^t_k \leftarrow \left[\nabla f_{\mathcal{G}(k), x_{i(t-K+k)}}(w^{t-K+k})\right]$
4:     Update biased first moment estimate: $m^{t+1}_{\mathcal{G}(k)} \leftarrow \beta_1 m^t_{\mathcal{G}(k)} + (1-\beta_1)\, g^t_k$
5:     Update biased second moment estimate: $v^{t+1}_{\mathcal{G}(k)} \leftarrow \beta_2 v^t_{\mathcal{G}(k)} + (1-\beta_2)\, (g^t_k)^2$
6:     Compute bias-corrected first moment estimate: $\hat{m}^{t+1}_{\mathcal{G}(k)} \leftarrow m^{t+1}_{\mathcal{G}(k)} / (1-\beta_1^{t+1})$
7:     Compute bias-corrected second moment estimate: $\hat{v}^{t+1}_{\mathcal{G}(k)} \leftarrow v^{t+1}_{\mathcal{G}(k)} / (1-\beta_2^{t+1})$
8:     Update weights: $w^{t+1}_{\mathcal{G}(k)} \leftarrow w^t_{\mathcal{G}(k)} - \gamma\, \hat{m}^{t+1}_{\mathcal{G}(k)} / \left(\sqrt{\hat{v}^{t+1}_{\mathcal{G}(k)}} + \epsilon\right)$
9:   end for
10: end for

We analyze the convergence of Algorithm 1 in Section 4; this analysis is the basis of the analysis for the other methods.
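The update rule (8)-(9) can also be simulated serially by buffering each module's gradients and applying them with the appropriate delay. The sketch below is not the authors' implementation and uses illustrative layer sizes; it computes gradients with ordinary backpropagation and only delays their application, which reproduces the weight updates of Algorithm 1 without exhibiting the parallel speedup.

```python
import collections
import torch
import torch.nn as nn

# Illustrative two-module split of a small network.
modules = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
])
K, gamma = len(modules), 0.01
loss_fn = nn.CrossEntropyLoss()

# buffers[k] holds gradients of module k that have not been applied yet.
# With 0-based k, module k is updated with a delay of K - 1 - k iterations
# (0 for the last module), matching eq. (9).
buffers = [collections.deque() for _ in range(K)]

for t in range(20):
    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    for m in modules:
        m.zero_grad()
    h = x
    for m in modules:
        h = m(h)
    loss_fn(h, y).backward()            # ordinary backpropagation

    for k, m in enumerate(modules):
        grads = [p.grad.detach().clone() for p in m.parameters()]
        buffers[k].append(grads)
        if len(buffers[k]) > K - 1 - k:           # delayed gradient available
            delayed = buffers[k].popleft()        # computed K-1-k steps ago
            with torch.no_grad():
                for p, g in zip(m.parameters(), delayed):
                    p -= gamma * g                # eq. (9)
        # Otherwise the module is not updated yet, as in eq. (8) for t-K+k < 0.
```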
4. Convergence Analysis

In this section, we establish convergence guarantees to critical points for Algorithm 1 when the problem is non-convex. The analysis shows that our method admits a similar convergence rate to vanilla stochastic gradient descent (Bottou et al., 2016). Throughout this paper, we make the following commonly used assumptions.

Assumption 1 (Lipschitz-continuous gradient). The gradient of $f(w)$ is Lipschitz continuous with Lipschitz constant $L > 0$, such that for all $w, v \in \mathbb{R}^d$:

$$\|\nabla f(w) - \nabla f(v)\|_2 \le L\, \|w - v\|_2. \qquad (10)$$

Assumption 2 (Bounded variance). To bound the variance of the stochastic gradient, we assume that its second moment is upper bounded, such that there exists a constant $M \ge 0$ with, for any sample $x_i$ and any $w \in \mathbb{R}^d$:

$$\mathbb{E}\,\|\nabla f_{x_i}(w)\|_2^2 \le M. \qquad (11)$$

Because the stochastic gradient is unbiased, $\mathbb{E}\left[\nabla f_{x_i}(w)\right] = \nabla f(w)$, and by the variance identity $\mathbb{E}\,\|\nabla f_{x_i}(w) - \nabla f(w)\|_2^2 = \mathbb{E}\,\|\nabla f_{x_i}(w)\|_2^2 - \|\nabla f(w)\|_2^2$, the variance of the stochastic gradient is guaranteed to be less than $M$.

Under Assumptions 1 and 2, we obtain the following lemma about the sequence of objective values.

Lemma 1. Assume Assumptions 1 and 2 hold. In addition, let $\sigma := \max_t \gamma_{\max\{0,\, t-K+1\}} / \gamma_t$ and $M_K = KM + \sigma K^4 M$. Then the iterations of Algorithm 1 satisfy, for all $t \in \mathbb{N}$:

$$\mathbb{E}\, f(w^{t+1}) - f(w^t) \le -\frac{\gamma_t}{2}\, \|\nabla f(w^t)\|_2^2 + \gamma_t^2 L M_K. \qquad (12)$$

From Lemma 1, we observe that the expected decrease of the objective function is controlled by the stepsize $\gamma_t$ and $M_K$. Therefore, we can guarantee that the objective values decrease as long as the stepsizes $\gamma_t$ are small enough that the right-hand side of (12) is less than zero. Using the lemma above, we can analyze the convergence properties of Algorithm 1.

4.1. Fixed Stepsize $\gamma_t$

We first analyze the convergence of Algorithm 1 when $\gamma_t$ is fixed, and prove that the learned model converges sub-linearly to a neighborhood of the critical points.

Theorem 1. Assume Assumptions 1 and 2 hold and the fixed stepsize sequence $\{\gamma_t\}$ satisfies $\gamma_t = \gamma$ with $\gamma L \le 1$ for all $t \in \{0, 1, \ldots, T-1\}$. In addition, let $w^*$ be the optimal solution of $f(w)$ and set $\sigma = 1$ so that $M_K = KM + K^4 M$. Then the output of Algorithm 1 satisfies:

$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla f(w^t)\|_2^2 \le \frac{2\left(f(w^0) - f(w^*)\right)}{\gamma T} + 2\gamma L M_K. \qquad (13)$$
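For intuition, the bound (13) follows from Lemma 1 by a standard telescoping argument; a brief informal sketch (not the full proof) is:

```latex
% Sum (12) over t = 0, ..., T-1 with a fixed stepsize \gamma_t = \gamma:
\sum_{t=0}^{T-1} \frac{\gamma}{2}\, \mathbb{E}\,\|\nabla f(w^t)\|_2^2
  \;\le\; f(w^0) - \mathbb{E}\, f(w^T) + T\,\gamma^2 L M_K
  \;\le\; f(w^0) - f(w^*) + T\,\gamma^2 L M_K ,
% using f(w^T) \ge f(w^*).  Multiplying both sides by 2/(\gamma T) gives (13):
\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\,\|\nabla f(w^t)\|_2^2
  \;\le\; \frac{2\,\big(f(w^0) - f(w^*)\big)}{\gamma T} + 2\,\gamma L M_K .
```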
In Theorem 1, we observe that as $T \to \infty$, the average norm of the gradients is upper bounded by $2\gamma L M_K$, and the number of modules $K$ affects the value of this upper bound. Selecting a small stepsize $\gamma$ yields a better neighborhood of the critical points; however, it also seriously decreases the speed of convergence.

Figure 3. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10. Upper: loss function values; bottom: Top-1 classification accuracies. We split the network into two modules such that there is only one split point in the network for DNI and DDG (panels: split point at layer 1, 3, 5 and 7).

4.2. Diminishing Stepsize $\gamma_t$

In this section, we prove that Algorithm 1 with diminishing stepsizes guarantees convergence to critical points for the non-convex problem.

Theorem 2. Assume Assumptions 1 and 2 hold and the diminishing stepsize sequence $\{\gamma_t\}$ satisfies $\gamma_t = \frac{\gamma_0}{1+t}$ and $\gamma_t L \le 1$ for all $t \in \{0, 1, \ldots, T-1\}$. In addition, let $w^*$ be the optimal solution of $f(w)$ and set $\sigma = K$ so that $M_K = KM + K^5 M$. Setting $\Gamma_T = \sum_{t=0}^{T-1} \gamma_t$, the output of Algorithm 1 satisfies:

$$\frac{1}{\Gamma_T} \sum_{t=0}^{T-1} \gamma_t\, \mathbb{E}\,\|\nabla f(w^t)\|_2^2 \le \frac{2\left(f(w^0) - f(w^*)\right)}{\Gamma_T} + \frac{2 L M_K \sum_{t=0}^{T-1} \gamma_t^2}{\Gamma_T}. \qquad (14)$$

Corollary 1. Since $\gamma_t = \frac{\gamma_0}{t+1}$, the stepsize requirements of (Robbins & Monro, 1951) are satisfied:

$$\sum_{t=0}^{\infty} \gamma_t = \infty \quad \text{and} \quad \lim_{T\to\infty} \sum_{t=0}^{T-1} \gamma_t^2 < \infty. \qquad (15)$$

Therefore, according to Theorem 2, as $T \to \infty$ the right-hand side of (14) converges to $0$.

Corollary 2. Suppose $w^s$ is chosen randomly from $\{w^t\}_{t=0}^{T-1}$ with probabilities proportional to $\{\gamma_t\}_{t=0}^{T-1}$. According to Theorem 2, Algorithm 1 guarantees convergence to critical points for the non-convex problem:

$$\lim_{s\to\infty} \mathbb{E}\,\|\nabla f(w^s)\|_2^2 = 0. \qquad (16)$$

5. Experiments

In this section, we experiment with ResNet (He et al., 2016) on the image classification benchmark datasets CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009). In Section 5.1, we evaluate our method by varying the positions and the number of split points in the network; in Section 5.2, we use our method to optimize deeper neural networks and show that its performance is as good as that of backpropagation; finally, we split and distribute ResNet-110 across GPUs in Section 5.3, and the results show that the proposed method achieves a speedup of about two times without loss of accuracy.

Implementation Details: We implement the DDG algorithm using the PyTorch library (Paszke et al., 2017). The trained network is split into $K$ modules, each running in its own subprocess. The subprocesses are spawned using the multiprocessing package (https://docs.python.org/3/library/multiprocessing.html#module-multiprocessing) so that we can fully leverage multiple processors on a given machine. Running modules in different subprocesses makes communication difficult; to make communication fast, we utilize the shared memory objects of the multiprocessing package. As in Figure 2, every two adjacent modules share a pair of activation ($h$) and error gradient ($\delta$) buffers.
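A minimal sketch of this communication pattern is given below. It is not the authors' code: the tensor shapes and the trivial worker body are placeholders, and torch.multiprocessing (a thin wrapper around the standard multiprocessing package) is used here so that shared-memory tensors are handled automatically.

```python
import torch
import torch.multiprocessing as mp

# Two adjacent modules exchange an activation h and an error gradient delta
# through shared-memory tensors; each worker stands for one module.
def module_worker(rank, h, delta):
    # A real worker would run its module's forward/backward computation,
    # reading h (its input) and writing delta (the gradient it sends back).
    if rank == 1:
        delta.copy_(torch.randn_like(delta))   # stand-in for a real backward

if __name__ == "__main__":
    h = torch.zeros(128, 64).share_memory_()       # shared activation buffer
    delta = torch.zeros(128, 64).share_memory_()   # shared gradient buffer
    mp.spawn(module_worker, args=(h, delta), nprocs=2, join=True)
```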
Figure 4. Training and testing curves regarding epochs for ResNet-8 on CIFAR-10. Upper: loss function values; bottom: Top-1 classification accuracies. For DNI and DDG, the number of split points in the network ranges from 2 to 4 (panels: split points at layers {1,3}, {1,3,5} and {1,3,5,7}).

5.1. Comparison of BP, DNI and DDG

In this section, we train ResNet-8 on CIFAR-10 on a single Titan X GPU. The architecture of ResNet-8 is given in Table 2. All experiments are run for 300 epochs and optimized using the Adam optimizer (Kingma & Ba, 2014) with a batch size of 128. The stepsize is initialized at $1 \times 10^{-3}$. We augment the dataset with random cropping and random horizontal flipping, and normalize the images using the mean and standard deviation. We compare three methods in this experiment:

BP: the Adam optimizer in PyTorch, which uses the backpropagation algorithm with data parallelism (Rumelhart et al., 1988) to compute gradients.

DNI: the decoupled neural interface (DNI) of (Jaderberg et al., 2016). Following (Jaderberg et al., 2016), the synthetic network is a stack of three convolutional layers with $L$ filters of size $5 \times 5$ and resolution-preserving padding, where the filter depth $L$ is determined by the position of the DNI. We also input label information into the synthetic network to increase the final accuracy.

DDG: the Adam optimizer using delayed gradients, as in Algorithm 2.

Table 2. Architectural details. "Units" denotes the number of residual units in each group; each unit is a basic residual block without bottleneck. "Channels" indicates the number of filters used in each unit of each group.

| Architecture | Units | Channels |
|---|---|---|
| ResNet-8 | 1-1-1 | 16-16-32-64 |
| ResNet-56 | 9-9-9 | 16-16-32-64 |
| ResNet-110 | 18-18-18 | 16-16-32-64 |

Impact of split position (depth). The position (depth) of the split point determines the number of layers using delayed gradients. Stale or synthetic gradients induce noise in the training process, affecting the convergence of the objective. Figure 3 shows the experimental results when there is only one split point at varying positions. From the first column, we see that all compared methods perform similarly when the split point is at layer 1. DDG performs consistently well when we place the split point at the deeper positions 3, 5 or 7. On the contrary, the performance of DNI degrades as we vary the position, and it does not even converge when the split point is at layer 7.
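For reference, the data pipeline and optimizer described at the beginning of this subsection correspond roughly to the following PyTorch setup. The crop padding and normalization statistics are common CIFAR-10 defaults assumed here, and a torchvision ResNet-18 stands in for the CIFAR-style ResNet-8, whose definition is not reproduced in the paper.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Augmentation and optimizer roughly matching the setup of Section 5.1.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),        # random cropping (assumed padding)
    transforms.RandomHorizontalFlip(),           # random horizontal flipping
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),   # assumed CIFAR-10 stats
                         (0.2023, 0.1994, 0.2010)),
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform_train)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)     # batch size 128 as in the paper

model = torchvision.models.resnet18(num_classes=10)        # stand-in for ResNet-8
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # stepsize 1e-3
```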
Impact of the number of split points. From equation (7), we know that the maximum time delay is determined by the number of modules $K$, and Theorem 2 shows that $K$ also affects the convergence rate. In this experiment, we vary the number of split points in the network from 2 to 4 and plot the results in Figure 4. It is easy to observe that DDG performs as well as BP regardless of the number of split points in the network. However, DNI is very unstable when we place more split points, and sometimes does not converge at all.

Figure 5. Training and testing loss curves for ResNet-110 on CIFAR-10 using multiple GPUs (DDG with 1 to 4 GPUs). (a) Loss function value regarding epochs. (b) Loss function value regarding computation time.

Figure 6. Computation time and best Top-1 accuracy for ResNet-110 on the test data of CIFAR-10. The leftmost bar denotes the computation time of the backpropagation algorithm on one GPU, where the forward time accounts for about 32%. We normalize the computation time of all optimization settings by the time required by backpropagation. (Top-1 accuracies: BP with 1 GPU 93.53%; DDG with 1, 2, 3 and 4 GPUs 93.41%, 93.38%, 93.39% and 93.38%.)

5.2. Optimizing Deeper Neural Networks

In this section, we employ DDG to optimize two very deep neural networks (ResNet-56 and ResNet-110) on CIFAR-10 and CIFAR-100. Each network is split into two modules at the center. We use SGD with momentum 0.9, and the stepsize is initialized to 0.01. Each model is trained for 300 epochs, and the stepsize is divided by a factor of 10 at epochs 150 and 225. The weight decay constant is set to $5 \times 10^{-4}$. We perform the same data augmentation as in Section 5.1. Experiments are run on a single Titan X GPU.

Table 3. Best Top-1 classification accuracy (%) for ResNet-56 and ResNet-110 on the test data of CIFAR-10 and CIFAR-100.

| Architecture | CIFAR-10 BP | CIFAR-10 DDG | CIFAR-100 BP | CIFAR-100 DDG |
|---|---|---|---|---|
| ResNet-56 | 93.12 | 93.11 | 69.79 | 70.17 |
| ResNet-110 | 93.53 | 93.41 | 71.90 | 71.39 |

Figure 7 presents the experimental results of BP and DDG. We do not compare with DNI because its performance is far worse when models are deep. The figures in the first column show the convergence of the loss regarding epochs, indicating that DDG and BP admit similar convergence rates. We can also observe that DDG converges faster when we compare the loss regarding computation time in the second column of Figure 7. In this experiment, the volatile GPU utilization is about 70% when we train the models with BP; our method runs in two subprocesses and thus fully leverages the computing capacity of the GPU. We can draw similar conclusions when we compare the Top-1 accuracy in the third and fourth columns of Figure 7. In Table 3, we list the best Top-1 accuracy on the test data of CIFAR-10 and CIFAR-100; DDG obtains comparable or better accuracy even when the network is deep.

5.3. Scaling the Number of GPUs

In this section, we split ResNet-110 into $K$ modules and allocate them sequentially across $K$ Titan X GPUs. We do not consider filter-wise model parallelism in this experiment. The parameter settings are similar to those in Section 5.2.
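The optimizer and stepsize schedule of Section 5.2 can be written down roughly as follows; a torchvision ResNet-18 again stands in for the actual ResNet-56/110 models.

```python
import torch
import torchvision

# SGD with momentum 0.9, initial stepsize 0.01, weight decay 5e-4, and the
# stepsize divided by 10 at epochs 150 and 225 of a 300-epoch run.
model = torchvision.models.resnet18(num_classes=10)   # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150, 225], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over CIFAR-10/100 would go here ...
    scheduler.step()    # decay the stepsize after epochs 150 and 225
```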
Figure 7. Training and testing curves for ResNet-56 and ResNet-110 on CIFAR-10 and CIFAR-100. Columns 1 and 2 present the loss function value regarding epochs and computation time respectively; columns 3 and 4 present the Top-1 classification accuracy regarding epochs and computation time. For DDG, there is only one split point at the center of the network.

From Figure 5, we see that training the network on multiple GPUs does not affect the convergence rate. For comparison, we also count the computation time of the backpropagation algorithm on a single GPU; the computation time is worse when we run the backpropagation algorithm on multiple GPUs because of the communication overhead. In Figure 6, we observe that the forward time accounts for only about 32% of the total computation time of the backpropagation algorithm; backward locking is therefore the main bottleneck. Figure 6 also shows that when we increase the number of GPUs from 2 to 4, our method reduces the total computation time by about 30% to 50%. In other words, DDG achieves a speedup of about 2 times without loss of accuracy when we train the network across 4 GPUs.

6. Conclusion

In this paper, we propose the decoupled parallel backpropagation algorithm, which breaks the backward locking in the backpropagation algorithm using delayed gradients. We then apply decoupled parallel backpropagation to two stochastic methods for deep learning optimization. We also provide convergence analysis and prove that the proposed method guarantees convergence to critical points for the non-convex problem. Finally, we perform experiments on deep convolutional neural networks; the results verify that our method can accelerate training significantly without loss of accuracy.

Acknowledgement

This work was partially supported by U.S. NIH R01 AG049371, NSF IIS 1302675, IIS 1344152, DBI 1356628, IIS 1619308, IIS 1633753.

References
Balduzzi, D., Vanchinathan, H., and Buhmann, J. M. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI, pp. 485-491, 2015.

Bengio, Y. et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Carreira-Perpinan, M. and Wang, W. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pp. 10-19, 2014.

Czarnecki, W. M., Swirszcz, G., Jaderberg, M., Osindero, S., Vinyals, O., and Kavukcuoglu, K. Understanding synthetic gradients and decoupled neural interfaces. arXiv preprint arXiv:1703.00522, 2017.

Eldan, R. and Shamir, O. The power of depth for feedforward neural networks. In Conference on Learning Theory, pp. 907-940, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, 2012.

Huang, G., Liu, Z., Weinberger, K. Q., and van der Maaten, L. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.

Jaderberg, M., Czarnecki, W. M., Osindero, S., Vinyals, O., Graves, A., and Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.

Johnson, J. Benchmarks for popular CNN models. https://github.com/jcjohnson/cnn-benchmarks, 2017.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436-444, 2015.

Nøkland, A. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pp. 1037-1045, 2016.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145-151, 1999.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400-407, 1951.

Rumelhart, D. E., Hinton, G. E., Williams, R. J., et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

Taylor, G., Burmeister, R., Xu, Z., Singh, B., Patel, A., and Goldstein, T. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning, pp. 2722-2731, 2016.

Telgarsky, M. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
Yadan, O., Adams, K., Taigman, Y., and Ranzato, M. Multi-GPU training of convnets. arXiv preprint arXiv:1312.5853, 2013.