# Training Neural Networks Using Features Replay

Zhouyuan Huo¹˒², Bin Gu², Heng Huang¹˒²
¹Electrical and Computer Engineering, University of Pittsburgh, ²JDDGlobal.com
zhouyuan.huo@pitt.edu, jsgubin@gmail.com, heng.huang@pitt.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Training a neural network with the backpropagation algorithm requires passing error gradients sequentially through the network. This backward locking prevents us from updating network layers in parallel and fully leveraging the computing resources. Recently, several works have tried to decouple and parallelize the backpropagation algorithm; however, all of them suffer from severe accuracy loss or memory explosion when the neural network is deep. To address these challenging issues, we propose a novel parallel-objective formulation for the objective function of the neural network. Building on it, we introduce the features replay algorithm and prove that it is guaranteed to converge to critical points for the non-convex problem under certain conditions. Finally, we apply our method to training deep convolutional neural networks, and the experimental results show that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.

1 Introduction

In recent years, deep convolutional neural networks have made great breakthroughs in computer vision [8, 10, 19, 20, 32, 33], natural language processing [15, 16, 31, 36], and reinforcement learning [21, 23, 24, 25]. The growth of network depth is one of the most critical factors contributing to the success of deep learning, which has been verified both in practice [8, 10] and in theory [2, 7, 35]. Gradient-based methods are the dominant approach to training deep neural networks, including stochastic gradient descent (SGD) [29], ADAGRAD [6], RMSPROP [9], and ADAM [17]. As long as the loss functions are differentiable, we can compute the gradients of the networks using the backpropagation algorithm [30]. The backpropagation algorithm requires two passes over the neural network: the forward pass to compute activations and the backward pass to compute gradients. As shown in Figure 1 (BP), error gradients are repeatedly propagated from the top (output layer) all the way back to the bottom (input layer) in the backward pass. This sequential propagation of the error gradients is called backward locking, because all layers of the network are locked until their dependencies have executed. According to the benchmark report in [14], the computational time of the backward pass is about twice that of the forward pass. When networks are deep, backward locking becomes the bottleneck of making good use of computing resources, preventing us from updating layers in parallel.

There are several works trying to break the backward locking in the backpropagation algorithm. [4] and [34] avoid the backward locking by removing the backpropagation algorithm completely. In [4], the authors proposed the method of auxiliary coordinates (MAC) and simplified the nested functions by imposing quadratic penalties. Similarly, [34] used Lagrange multipliers to enforce equality constraints between auxiliary variables and activations. Neither reformulated problem requires the backpropagation algorithm at all, and both are easy to parallelize. However, neither of
them has been applied to training convolutional neural networks yet.

Figure 1: Illustrations of the backward pass of the backpropagation algorithm (BP) [30], the decoupled neural interface (DNI) [13], and decoupled parallel backpropagation (DDG) [11]. DNI breaks the backward locking by synthesizing error gradients; DDG breaks the backward locking by storing stale gradients.

There are also several works breaking the dependencies between groups of layers or modules in the backpropagation algorithm. In [13], the authors proposed to remove the backward locking by employing a decoupled neural interface to approximate error gradients (Figure 1, DNI). [1, 27] broke the local dependencies between successive layers and made all hidden layers receive error information from the output layer directly. In the backward pass, we can use the synthetic gradients or the direct feedback to update the weights of all modules without incurring any delay. However, these methods work poorly when the neural networks are very deep. In [11], the authors proposed decoupled parallel backpropagation using stale gradients, where modules are updated with gradients from different timestamps (Figure 1, DDG). However, it requires a large amount of memory to store the stale gradients and suffers from a loss of accuracy.

In this paper, we propose the features replay algorithm, which is free of all three issues: backward locking, memory explosion, and accuracy loss. The main contributions of our work are summarized as follows:

- Firstly, we propose a novel parallel-objective formulation for the objective function of neural networks in Section 3. Using this new formulation, we break the backward locking by introducing the features replay algorithm, which is easy to parallelize.
- Secondly, we provide the theoretical analysis in Section 4 and prove that the proposed method is guaranteed to converge to critical points for the non-convex problem under certain conditions.
- Finally, we validate our method with experiments on training deep convolutional neural networks in Section 5. Experimental results demonstrate that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.

2 Background

We assume a feedforward neural network with L layers, where $w = [w_1, w_2, \dots, w_L] \in \mathbb{R}^d$ denotes the weights of all layers. The computation in each layer can be represented as taking an input $h_{l-1}$ and producing an activation $h_l = F_l(h_{l-1}; w_l)$ using weight $w_l$. Given a loss function $f$ and target $y$, we can formulate the objective function of the neural network $f(w)$ as follows:

$$\min_{w}\; f(h_L, y) \quad \text{s.t.}\quad h_l = F_l(h_{l-1}; w_l)\ \ \text{for all } l\in\{1,2,\dots,L\}, \tag{1}$$

where $h_0$ denotes the input data $x$. Using stochastic gradient descent, the weights of the network are updated in the direction of the negative gradient of the loss function:

$$w^{t+1}_l = w^t_l - \gamma_t\, g^t_l \quad \text{for all } l\in\{1,2,\dots,L\}, \tag{2}$$

where $\gamma_t$ denotes the stepsize and $g^t_l := \frac{\partial f_{x_t}(w^t)}{\partial w^t_l}$ denotes the gradient of the loss function (1) with respect to $w^t_l$ for the input samples $x_t$.
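For concreteness, the setup in (1)–(2) corresponds to the following minimal PyTorch-style sketch (the layer sizes, the toy loss, and all names are illustrative assumptions rather than details from the paper); the comment in the backward step marks where the backward locking arises:

```python
import torch
import torch.nn as nn

# A toy instance of (1): h_l = F_l(h_{l-1}; w_l) for a stack of L = 4 layers.
layers = nn.ModuleList([nn.Linear(32, 32) for _ in range(4)])
loss_fn = nn.MSELoss()
gamma = 0.01  # stepsize gamma_t, kept fixed for simplicity

def train_step(x, y):
    # Forward pass: compute activations from the bottom layer to the top layer.
    h = x
    for layer in layers:
        h = torch.relu(layer(h))
    loss = loss_fn(h, y)

    # Backward pass: autograd propagates error gradients sequentially from the
    # top layer back to the bottom layer -- each layer must wait for the layer
    # above it (the backward locking).
    loss.backward()

    # SGD update, Eq. (2): w_l^{t+1} = w_l^t - gamma_t * g_l^t.
    with torch.no_grad():
        for p in layers.parameters():
            p -= gamma * p.grad
            p.grad = None
    return loss.item()

# Example usage with random data.
print(train_step(torch.randn(8, 32), torch.randn(8, 32)))
```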
The backpropagation algorithm [30] is used to compute the gradients of neural networks. At iteration t, it requires two passes over the network: in the forward pass, the activations of all layers are computed from the bottom layer $l = 1$ to the top layer $l = L$ following $h^t_l = F_l(h^t_{l-1}; w^t_l)$; in the backward pass, it applies the chain rule and propagates error gradients through the network from the top layer $l = L$ to the bottom layer $l = 1$ following

$$\frac{\partial f_{x_t}(w^t)}{\partial w^t_l} = \frac{\partial h^t_l}{\partial w^t_l}\cdot\frac{\partial f_{x_t}(w^t)}{\partial h^t_l} \quad\text{and}\quad \frac{\partial f_{x_t}(w^t)}{\partial h^t_{l-1}} = \frac{\partial h^t_l}{\partial h^t_{l-1}}\cdot\frac{\partial f_{x_t}(w^t)}{\partial h^t_l}. \tag{3}$$

According to (3), computing the gradients for the weights $w_l$ of layer $l$ depends on the error gradient $\frac{\partial f_{x_t}(w^t)}{\partial h^t_l}$ from layer $l+1$, which is known as backward locking. The backward locking prevents all layers from updating before receiving error gradients from the layers they depend on. When the networks are deep, the backward locking becomes the bottleneck of the training process.

Figure 2: Backward pass of the features replay algorithm. We divide a 12-layer neural network into four modules, where each module stores its input history and a stale error gradient from the upper module. At each iteration, all modules compute their activations by replaying features from the history and compute their gradients by applying the chain rule. After that, they receive the error gradients from the upper modules for the next iteration.

3 Features Replay

In this section, we propose a novel parallel-objective formulation for the objective function of neural networks. Using this new formulation, we break the backward locking in the backpropagation algorithm with the features replay algorithm.

3.1 Problem Reformulation

As shown in Figure 2, we divide an L-layer feedforward neural network into K modules, where $K \ll L$, such that $w = [w_{\mathcal{G}(1)}, w_{\mathcal{G}(2)}, \dots, w_{\mathcal{G}(K)}] \in \mathbb{R}^d$ and $\mathcal{G}(k)$ denotes the layers in module $k$. Let $L_k$ denote the last layer of module $k$, so that the output of this module can be written as $h_{L_k}$. The error gradient variable used for the gradient computation of module $k$ is denoted by $\delta^t_k$. We split problem (1) into K subproblems. The task of module $k$ (for $k \neq K$) is to minimize the least-squares error between the error gradient variable $\delta^t_k$ and $\frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}$, the gradient of the loss function with respect to $h^t_{L_k}$ when $h^t_{L_k}$ is input into module $k+1$; the task of module $K$ is to minimize the loss between the prediction $h^t_{L_K}$ and the true label $y_t$. From this point of view, we propose a novel parallel-objective loss function at iteration t:

$$\min_{w,\,\delta}\;\sum_{k=1}^{K-1}\left\|\delta^t_k - \frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}\right\|^2 + f\!\left(h^t_{L_K}, y_t\right) \quad \text{s.t.}\quad h^t_{L_k} = F_{\mathcal{G}(k)}\!\left(h^t_{L_{k-1}}; w^t_{\mathcal{G}(k)}\right)\ \text{for all } k\in\{1,\dots,K\}, \tag{4}$$

where $h^t_{L_0}$ denotes the input data $x_t$. It is obvious that the optimal solution of the left term of problem (4) is $\delta^t_k = \frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}$ for all $k\in\{1,\dots,K-1\}$. In other words, the optimal solution of module $k$ depends on the output of the upper modules. Therefore, minimizing problem (1) with the backpropagation algorithm is equivalent to minimizing problem (4) with the first K−1 subproblems attaining their optimal solutions.

Algorithm 1: Features Replay Algorithm
1: Initialize: weights $w^0 = [w^0_{\mathcal{G}(1)}, \dots, w^0_{\mathcal{G}(K)}] \in \mathbb{R}^d$ and stepsize sequence $\{\gamma_t\}$;
2: for $t = 0, 1, 2, \dots, T-1$ do
3:   Sample a mini-batch $(x_t, y_t)$ from the dataset and let $h^t_{L_0} = x_t$;
4:   for $k = 1, \dots, K$ do
5:     Store $h^t_{L_{k-1}}$ in the memory;
6:     Compute $h^t_{L_k} = F_{\mathcal{G}(k)}\!\left(h^t_{L_{k-1}}; w^t_{\mathcal{G}(k)}\right)$;  ▹ Play (forward pass)
7:     Send $h^t_{L_k}$ to module $k+1$ if $k < K$;
8:   end for
9:   Compute the loss $f(w^t) = f\!\left(h^t_{L_K}, y_t\right)$;
10:   for $k = 1, \dots, K$ in parallel do
11:     Compute $h^t_{L_k} = F_{\mathcal{G}(k)}\!\left(h^{t+k-K}_{L_{k-1}}; w^t_{\mathcal{G}(k)}\right)$;  ▹ Replay
12:     Compute the gradient $g^t_{\mathcal{G}(k)}$ following (7);  ▹ Backward pass
13:     Update the weights: $w^{t+1}_{\mathcal{G}(k)} = w^t_{\mathcal{G}(k)} - \gamma_t\, g^t_{\mathcal{G}(k)}$;
14:     Send $\frac{\partial f_{h^{t+k-K}_{L_{k-1}}}(w^t)}{\partial h^{t+k-K}_{L_{k-1}}}$ to module $k-1$ if $k > 1$;
15:   end for
16: end for
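To make the bookkeeping in Algorithm 1 concrete, here is a heavily simplified, single-process PyTorch sketch of one training iteration (the module sizes, the warm-up handling, and all names are illustrative assumptions; the actual implementation distributes the K modules across GPUs and runs the module-wise backward steps in parallel):

```python
import collections
import torch
import torch.nn as nn

K = 4  # number of modules (illustrative)
# F_{G(k)}: each module groups a few layers; tiny MLP blocks stand in for ResNet stages.
modules = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, 32), nn.ReLU())
    for _ in range(K)
])
loss_fn = nn.MSELoss()
gamma = 0.01  # stepsize gamma_t (fixed here for simplicity)

# history[k] (0-based k) keeps the last K-k inputs of module k, which matches the
# paper's history size K - k' + 1 for the 1-based module index k' = k + 1.
history = [collections.deque(maxlen=K - k) for k in range(K)]
# delta[k] is the stale error gradient received from module k+1 at the previous iteration.
delta = [None] * K

def train_iteration(x, y):
    # ---- Forward pass ("play"): propagate fresh features and record the input history.
    h = x
    for k in range(K):
        history[k].append(h.detach())
        with torch.no_grad():
            h = modules[k](h)

    # ---- Backward pass ("replay"): each module recomputes its output from its oldest
    # stored input and applies the chain rule with its stale error gradient, Eq. (7).
    new_delta = [None] * K
    for k in range(K):  # in the paper this loop runs in parallel, one module per GPU
        h_in = history[k][0].requires_grad_(k > 0)  # oldest stored (replayed) input
        out = modules[k](h_in)
        if k == K - 1:
            loss_fn(out, y).backward()    # the last module minimizes the real loss
        elif delta[k] is not None:
            out.backward(delta[k])        # earlier modules use the stale delta
        else:
            continue                      # warm-up: no error gradient received yet
        if k > 0:
            new_delta[k - 1] = h_in.grad.detach().clone()  # sent down to module k-1
            h_in.grad = None

    # ---- SGD update of every module, then hand over the freshly computed deltas.
    with torch.no_grad():
        for p in modules.parameters():
            if p.grad is not None:
                p -= gamma * p.grad
                p.grad = None
    delta[:] = new_delta

# Example usage with random data.
for _ in range(10):
    train_iteration(torch.randn(8, 32), torch.randn(8, 32))
```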
3.2 Breaking Dependencies by Replaying Features

The features replay algorithm is presented in Algorithm 1. In the forward pass, intermediate features are generated and passed through the network, and module $k$ keeps a history of its inputs of size $K - k + 1$. To break the dependencies between modules in the backward pass, we propose to compute the gradients of the modules using intermediate features from different timestamps. Features replay means that the intermediate feature $h^{t+k-K}_{L_{k-1}}$ is input into module $k$ for the first time in the forward pass at iteration $t+k-K$, and input into module $k$ for a second time in the backward pass at iteration $t$. If $t+k-K < 0$, we set $h^{t+k-K}_{L_{k-1}} = 0$. Therefore, the new problem can be written as

$$\min_{w,\,\delta}\;\sum_{k=1}^{K-1}\left\|\delta^t_k - \frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}\right\|^2 + f\!\left(h^t_{L_K}, y_t\right) \quad \text{s.t.}\quad h^t_{L_k} = F_{\mathcal{G}(k)}\!\left(h^{t+k-K}_{L_{k-1}}; w^t_{\mathcal{G}(k)}\right)\ \text{for all } k\in\{1,\dots,K\}, \tag{5}$$

where $\frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}$ denotes the gradient of the loss $f(w^t)$ with respect to $h^t_{L_k}$ when $h^t_{L_k}$ is input into module $k+1$. It is important to note that, since we do not compute the optimal solution of the last subproblem (we only take a gradient step on it), it is not necessary to obtain the optimal solutions of the first K−1 subproblems either. To avoid this tedious computation, we make a trade-off between the error of the left term in (5) and the computational time by setting

$$\delta^t_k = \frac{\partial f_{h^{t+k-K}_{L_k}}(w^{t-1})}{\partial h^{t+k-K}_{L_k}} \quad \text{for all } k\in\{1,\dots,K-1\}, \tag{6}$$

where $\frac{\partial f_{h^{t+k-K}_{L_k}}(w^{t-1})}{\partial h^{t+k-K}_{L_k}}$ denotes the gradient of the loss $f(w^{t-1})$ with respect to $h^{t+k-K}_{L_k}$ when $h^{t+k-K}_{L_k}$ is input into module $k+1$ at the previous iteration. Assuming the algorithm converges as $t \to \infty$, we have $w^t \approx w^{t-1} \approx w^{t+k-K}$, so that $h^t_{L_k} \approx h^{t+k-K}_{L_k}$ and $\left\|\frac{\partial f_{h^{t+k-K}_{L_k}}(w^{t-1})}{\partial h^{t+k-K}_{L_k}} - \frac{\partial f_{h^t_{L_k}}(w^t)}{\partial h^t_{L_k}}\right\|^2 \to 0$ for all $k\in\{1,\dots,K-1\}$. Therefore, (6) is a reasonable approximation of the optimal solutions of the first K−1 subproblems in (5). In this way, we break the backward locking in the backpropagation algorithm, because the error gradient variable $\delta^t_k$ can be determined at the previous iteration $t-1$, so that all modules are independent of each other at iteration $t$. Additionally, we compute the gradients inside each module following

$$\frac{\partial f_{h^{t+k-K}_{L_{k-1}}}(w^t)}{\partial w^t_l} = \frac{\partial h^t_{L_k}}{\partial w^t_l}\cdot \delta^t_k \quad\text{and}\quad \frac{\partial f_{h^{t+k-K}_{L_{k-1}}}(w^t)}{\partial h^t_l} = \frac{\partial h^t_{L_k}}{\partial h^t_l}\cdot \delta^t_k, \tag{7}$$

where $l \in \mathcal{G}(k)$. At the end of each iteration, module $k$ sends $\frac{\partial f_{h^{t+k-K}_{L_{k-1}}}(w^t)}{\partial h^{t+k-K}_{L_{k-1}}}$ to module $k-1$ for the computation at the next iteration.

4 Convergence Analysis

In this section, we provide a theoretical analysis of Algorithm 1. Analyzing the convergence of problem (5) directly is difficult, as it involves variables from different timestamps. Instead, we build a connection between the gradients of Algorithm 1 and stochastic gradient descent in Assumption 1, and prove that the proposed method is guaranteed to converge to critical points of the non-convex problem (1).

Assumption 1 (Sufficient direction). We assume that the expectation of the descent direction $\sum_{k=1}^{K} g^t_{\mathcal{G}(k)}$ in Algorithm 1 is a sufficient descent direction of the loss $f(w^t)$ with respect to $w^t$. Letting $\nabla f(w^t)$ denote the full gradient of the loss, there exists a constant $\sigma > 0$ such that

$$\mathbb{E}\left[\left\langle \nabla f(w^t),\ \sum_{k=1}^{K} g^t_{\mathcal{G}(k)} \right\rangle\right] \;\geq\; \sigma \left\|\nabla f(w^t)\right\|^2_2. \tag{8}$$

The sufficient direction assumption guarantees that the model moves along a descent direction of the loss function.
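As a hedged illustration (not code from the paper), a per-mini-batch estimate of the constant $\sigma$ in (8) can be obtained by comparing the features replay descent direction with the full gradient computed by standard backpropagation on the same mini-batch; Section 5.2 visualizes per-module values of $\sigma$ during training, to which a per-module variant of this ratio would correspond. The function name and measurement protocol below are assumptions:

```python
def sufficient_direction_ratio(full_grads, fr_grads):
    """Largest sigma satisfying Eq. (8) for one mini-batch.

    full_grads: gradients of f(w^t) from standard backpropagation, one tensor per
                parameter (e.g. [p.grad for p in model.parameters()]).
    fr_grads:   the features replay descent direction g^t_{G(k)}, collected in the
                same per-parameter order.
    Returns <grad f(w^t), g^t> / ||grad f(w^t)||^2_2.
    """
    inner = sum((gf * g).sum() for gf, g in zip(full_grads, fr_grads))
    norm_sq = sum((gf * gf).sum() for gf in full_grads)
    return (inner / norm_sq).item()
```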
Assumption 2. Throughout this paper, we make two assumptions following [3]:

(Lipschitz-continuous gradient) The gradient of $f$ is Lipschitz continuous with a constant $L > 0$, such that for any $w_1, w_2 \in \mathbb{R}^d$,
$$\|\nabla f(w_1) - \nabla f(w_2)\|_2 \leq L \|w_1 - w_2\|_2.$$

(Bounded variance) The second moment of the descent direction in Algorithm 1 is upper bounded: there exists a constant $M \geq 0$ such that
$$\mathbb{E}\left\|\sum_{k=1}^{K} g^t_{\mathcal{G}(k)}\right\|^2_2 \leq M.$$

According to the identity $\mathbb{E}\|\xi - \mathbb{E}[\xi]\|^2_2 = \mathbb{E}\|\xi\|^2_2 - \|\mathbb{E}[\xi]\|^2_2$, the variance of the descent direction, $\mathbb{E}\left\|\sum_{k=1}^{K} g^t_{\mathcal{G}(k)} - \mathbb{E}\left[\sum_{k=1}^{K} g^t_{\mathcal{G}(k)}\right]\right\|^2_2$, is guaranteed to be less than $M$.

Under the above assumptions, we prove the convergence rate of the proposed method for two choices of $\gamma_t$. First, we analyze the convergence of Algorithm 1 when $\gamma_t$ is fixed and prove that the learned model converges sub-linearly to a neighborhood of the critical points of the non-convex problem.

Theorem 1. Assume that Assumptions 1 and 2 hold, and the fixed stepsize sequence $\{\gamma_t\}$ satisfies $\gamma_t = \gamma$ for all $t \in \{0, 1, \dots, T-1\}$. In addition, let $w^*$ be the optimal solution of $f(w)$. Then the output of Algorithm 1 satisfies
$$\frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\left\|\nabla f(w^t)\right\|^2_2 \;\leq\; \frac{f(w^0) - f(w^*)}{\sigma \gamma T} + \frac{\gamma L M}{2\sigma}. \tag{9}$$

Therefore, the best solution we can obtain is controlled by the term $\frac{\gamma L M}{2\sigma}$.

Figure 3: Sufficient direction constant σ for ResNet164 and ResNet101 on CIFAR-10.

We also prove that Algorithm 1 guarantees convergence to critical points of the non-convex problem, as long as the diminishing stepsizes satisfy the requirements in [29]:
$$\sum_{t=0}^{\infty} \gamma_t = \infty \quad \text{and} \quad \lim_{T\to\infty}\sum_{t=0}^{T-1} \gamma^2_t < \infty. \tag{10}$$

Theorem 2. Assume that Assumptions 1 and 2 hold and the diminishing stepsize sequence $\{\gamma_t\}$ satisfies (10). In addition, let $w^*$ be the optimal solution of $f(w)$. Setting $\Gamma_T = \sum_{t=0}^{T-1} \gamma_t$, the output of Algorithm 1 satisfies
$$\frac{1}{\Gamma_T}\sum_{t=0}^{T-1} \gamma_t\, \mathbb{E}\left\|\nabla f(w^t)\right\|^2_2 \;\leq\; \frac{f(w^0) - f(w^*)}{\sigma \Gamma_T} + \frac{L M \sum_{t=0}^{T-1}\gamma^2_t}{2\sigma\, \Gamma_T}. \tag{11}$$

Remark 1. Suppose $w^s$ is chosen randomly from $\{w^t\}_{t=0}^{T-1}$ with probabilities proportional to $\{\gamma_t\}_{t=0}^{T-1}$. According to Theorem 2, Algorithm 1 guarantees convergence to critical points of the non-convex problem:
$$\lim_{s\to\infty} \mathbb{E}\left\|\nabla f(w^s)\right\|^2_2 = 0. \tag{12}$$

5 Experiments

In this section, we validate our method with experiments on training deep convolutional neural networks. Experimental results show that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.

5.1 Experimental Setting

Implementations: We implement our method in PyTorch [28] and evaluate it with ResNet models [8] on two image classification benchmark datasets: CIFAR-10 and CIFAR-100 [18]. We adopt the standard data augmentation techniques of [8, 10, 22] for these two datasets: random cropping, random horizontal flipping, and normalization. We use SGD with momentum 0.9, and the stepsize is initialized to 0.01. Each model is trained with batch size 128 for 300 epochs, and the stepsize is divided by a factor of 10 at epochs 150 and 225. The weight decay constant is set to $5 \times 10^{-4}$. In the experiments, a neural network with K modules is sequentially distributed across K GPUs. All experiments are performed on a server with four Titan X GPUs.
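For reference, the training configuration above maps onto a standard PyTorch recipe along the following lines. This is a sketch of the BP baseline only; the crop padding and normalization statistics are the usual CIFAR-10 values and are assumptions (the paper only names the transforms), and the small ResNet stands in for the much deeper models used in the paper:

```python
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms

# Standard CIFAR-10 augmentation: random cropping, horizontal flipping, normalization.
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=transform_train)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Stand-in model; the paper trains ResNet164/101/152 divided into K modules.
model = torchvision.models.resnet18(num_classes=10)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# Divide the stepsize by 10 at epochs 150 and 225; train for 300 epochs in total.
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 225], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(300):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```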
Compared Methods: We compare the performance of four methods in the experiments:

- BP: the backpropagation algorithm [30] as implemented in the PyTorch library.
- DNI: our implementation of the decoupled neural interface of [13]. Following [13], the synthesizer network has two hidden convolutional layers with 5 × 5 filters, padding of size 2, batch normalization [12], and ReLU [26]; the output layer is a convolutional layer with 5 × 5 filters and padding of size 2.
- DDG: our implementation of the decoupled parallel backpropagation of [11].
- FR: the features replay algorithm in Algorithm 1.

Figure 4: Training and testing curves for ResNet164, ResNet101, and ResNet152 on CIFAR-10. Row 1 and row 2 present the convergence of the loss function with respect to epochs and computational time, respectively. Because DNI diverges for all models, we only plot the result of DNI for ResNet164.

5.2 Sufficient Direction

We demonstrate empirically that the proposed method satisfies Assumption 1. In the experiment, we divide ResNet164 and ResNet101 into 4 modules and visualize the variation of the sufficient direction constant σ during training in Figure 3. Firstly, it is clear that the values of σ for these modules are larger than 0 at all times. Therefore, Assumption 1 is satisfied and Algorithm 1 is guaranteed to converge to critical points of the non-convex problem. Secondly, we observe that the values of σ of the lower modules are relatively small during the first half of training and become close to 1 afterwards. The variation of σ indicates the difference between the descent direction of FR and the steepest descent direction. A small σ at early epochs can help the method escape saddle points and find a better local minimum; a large σ at the final epochs can prevent the method from diverging. In the following, we show that our method achieves better generalization error than the compared methods.

5.3 Performance Comparisons

To evaluate the compared methods, we use three criteria: convergence speed, memory consumption, and generalization error.

Faster Convergence: In the experiments, we evaluate the compared methods with three ResNet models: ResNet164 with the basic building block, and ResNet101 and ResNet152 with the bottleneck building block [8]. The performance of the compared methods on CIFAR-10 is shown in Figure 4. There are several nontrivial observations. Firstly, DNI fails to converge for all models: the synthesizer network in [13] is so small that it cannot learn an accurate approximation of the error gradient when the network is deep. Secondly, DDG fails to converge for ResNet152 when we set K = 4; the stale gradients can introduce noise into the optimization and lead to divergence. Thirdly, our method converges much faster than BP as we increase the number of modules. In the experiments, the proposed method FR achieves a speedup of up to 2× over BP. We do not consider data parallelism for BP in this section; in the supplementary material, we show that our method also converges faster than BP with data parallelism.
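For reference, the gradient synthesizer used for the DNI baseline in Section 5.1, whose small capacity is blamed above for DNI's divergence on deep models, corresponds roughly to the following module. The channel width is an assumption, since the synthesizer must match the feature maps of the layer it serves:

```python
import torch.nn as nn

def make_synthesizer(channels: int) -> nn.Sequential:
    """Gradient synthesizer for the DNI baseline (sketch).

    `channels` is the number of feature-map channels of the layer whose error
    gradient is synthesized; keeping the hidden width equal to it is an
    assumption rather than a detail taken from the paper.
    """
    return nn.Sequential(
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),
        nn.BatchNorm2d(channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=5, padding=2),  # output layer
    )
```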
Figure 5: Memory consumption for ResNet164, ResNet101, and ResNet152. We do not report the memory consumption of DNI because it does not converge. DDG also diverges when K = 3, 4 for ResNet152.

| Algorithm | Backward locking | Memory (activations) |
| --- | --- | --- |
| BP [30] | yes | O(L) |
| DNI [13] | no | O(L + K·Ls) |
| DDG [11] | no | O(LK + K²) |
| FR | no | O(L + K²) |

Table 1: Comparison of the memory consumption for a neural network with L layers divided into K modules, where L ≫ K. We use O(L) to represent the memory consumption of the activations. For DNI, each gradient synthesizer has Ls layers; from the experiments, it is reasonable to assume Ls ≥ K for the algorithm to converge. The memory consumed by the weights is negligible compared to the activations.

Lower Memory Consumption: In Figure 5, we present the memory consumption of the compared methods for the three models as we vary the number of modules K. We do not consider DNI because it does not converge for any of the models. It is evident that the memory consumption of FR is very close to that of BP. On the contrary, when K = 4, the memory consumption of DDG is more than twice that of BP. The observations in the experiment are consistent with the analysis in Table 1. For DNI, since a three-layer synthesizer network cannot converge, it is reasonable to assume that Ls should be large if the network is very deep; we do not explore this because it is outside the scope of this paper. We always set K very small, such that K ≪ L and K ≪ Ls. FR can still obtain a good speedup when K is very small, according to the second row of Figure 4.

| Model | CIFAR [18] | BP [30] | DDG [11] | FR |
| --- | --- | --- | --- | --- |
| ResNet164 | C-10 | 6.40 | 6.45 | 6.03 |
| ResNet164 | C-100 | 28.53 | 28.51 | 27.34 |
| ResNet101 | C-10 | 5.25 | 5.35 | 4.97 |
| ResNet101 | C-100 | 23.48 | 24.25 | 23.10 |
| ResNet152 | C-10 | 5.26 | 5.72 | 4.91 |
| ResNet152 | C-100 | 25.20 | 26.39 | 23.61 |

Table 2: Best testing error rates (%) of the compared methods on the CIFAR-10 and CIFAR-100 datasets. For DDG and FR, we set K = 2 in the experiment.

Better Generalization Error: Table 2 shows the best testing error rates of the compared methods. We do not report the result of DNI because it does not converge. We observe that FR consistently obtains better testing error rates than the other two methods, BP and DDG, by a large margin. We think this is related to the variation of the sufficient direction constant σ: a small σ at the early epochs helps FR escape saddle points and find a better local minimum, and a large σ at the final epochs prevents FR from diverging. DDG usually performs worse than BP because the stale gradients introduce noise into the optimization, which is commonly observed in asynchronous algorithms with stale gradients [5].

6 Conclusion

In this paper, we proposed a novel parallel-objective formulation for the objective function of the neural network and broke the backward locking using a new features replay algorithm. Besides the new algorithm, our theoretical contributions include analyzing the convergence property of the proposed method and proving that the new algorithm is guaranteed to converge to critical points of the non-convex problem under certain conditions. We conducted experiments with deep convolutional neural networks on two image classification datasets, and all experimental results verify that the proposed method achieves faster convergence, lower memory consumption, and better generalization error than the compared methods.

References
[1] David Balduzzi, Hastagiri Vanchinathan, and Joachim M. Buhmann. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI, pages 485–491, 2015.
[2] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[3] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.
[4] Miguel Carreira-Perpinan and Weiran Wang. Distributed optimization of deeply nested systems. In Artificial Intelligence and Statistics, pages 10–19, 2014.
[5] Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. arXiv preprint arXiv:1604.00981, 2016.
[6] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[7] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940, 2016.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, 2012.
[10] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[11] Zhouyuan Huo, Bin Gu, Qian Yang, and Heng Huang. Decoupled parallel backpropagation with convergence guarantee. arXiv preprint arXiv:1804.10574, 2018.
[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[13] Max Jaderberg, Wojciech Marian Czarnecki, Simon Osindero, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343, 2016.
[14] Justin Johnson. Benchmarks for popular CNN models. https://github.com/jcjohnson/cnn-benchmarks, 2017.
[15] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188, 2014.
[16] Yoon Kim. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882, 2014.
[17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[21] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[22] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[23] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
[24] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[25] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.
[26] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[27] Arild Nøkland. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, pages 1037–1045, 2016.
[28] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
[30] David E. Rumelhart, Geoffrey E. Hinton, Ronald J. Williams, et al. Learning representations by back-propagating errors. Cognitive Modeling, 5(3):1, 1988.
[31] Cicero D. Santos and Bianca Zadrozny. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826, 2014.
[32] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[33] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[34] Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In International Conference on Machine Learning, pages 2722–2731, 2016.
[35] Matus Telgarsky. Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485, 2016.
[36] Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657, 2015.