Published as a conference paper at ICLR 2018

# PROXIMAL BACKPROPAGATION

Thomas Frerix¹∗, Thomas Möllenhoff¹∗, Michael Moeller², Daniel Cremers¹
thomas.frerix@tum.de, thomas.moellenhoff@tum.de, michael.moeller@uni-siegen.de, cremers@tum.de
¹Technical University of Munich  ²University of Siegen
∗Contributed equally.

ABSTRACT

We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.

1 INTRODUCTION

In recent years neural networks have gained considerable attention in solving difficult correlation tasks such as classification in computer vision (Krizhevsky et al., 2012) or sequence learning (Sutskever et al., 2014), and as building blocks of larger learning systems (Silver et al., 2016). Training neural networks is accomplished by optimizing a nonconvex, possibly nonsmooth, nested function of the network parameters. Since the introduction of stochastic gradient descent (SGD) (Robbins & Monro, 1951; Bottou, 1991), several more sophisticated optimization methods have been studied. One such class is that of quasi-Newton methods, as for example the comparison of L-BFGS with SGD in (Le et al., 2011), Hessian-free approaches (Martens, 2010), and the Sum of Functions Optimizer in (Sohl-Dickstein et al., 2013). Several works consider specific properties of energy landscapes of deep learning models, such as frequent saddle points (Dauphin et al., 2014) and well-generalizable local optima (Chaudhari et al., 2017a). Among the most popular optimization methods in currently used deep learning frameworks are momentum based improvements of classical SGD, notably Nesterov's Accelerated Gradient (Nesterov, 1983; Sutskever et al., 2013), and the Adam optimizer (Kingma & Ba, 2015), which uses estimates of first and second order moments of the gradients for parameter updates. Nevertheless, the optimization of these models remains challenging, as learning with SGD and its variants requires careful weight initialization and a sufficiently small learning rate in order to yield a stable and convergent algorithm. Moreover, SGD often has difficulties in propagating a learning signal deeply into a network, commonly referred to as the vanishing gradient problem (Hochreiter et al., 2001).

Figure 1: Notation overview. For an $L$-layer feed-forward network we denote the explicit layer-wise activation variables as $z_l$ and $a_l$. The transfer functions are denoted as $\phi$ and $\sigma$. Layer $l$ is of size $n_l$.
Training neural networks can be formulated as a constrained optimization problem by explicitly introducing the network activations as variables of the optimization, which are coupled via layer-wise constraints to enforce a feasible network configuration. The authors of (Carreira-Perpiñán & Wang, 2014) have tackled this problem with a quadratic penalty approach, the method of auxiliary coordinates (MAC). Closely related, (Taylor et al., 2016) introduce additional auxiliary variables to further split linear and nonlinear transfer between layers and propose a primal dual algorithm for optimization. From a different perspective, (LeCun, 1988) takes a Lagrangian approach to formulate the constrained optimization problem.

In this work, we start from a constrained optimization point of view on the classical backpropagation algorithm. We show that backpropagation can be interpreted as a method alternating between two steps: first, a forward pass of the data with the current network weights, and secondly, an ordered sequence of gradient descent steps on a quadratic penalty energy. Using this point of view, instead of taking explicit gradient steps to update the network parameters associated with the linear transfer functions, we propose to use implicit gradient steps (also known as proximal steps; for the definition see (6)). We prove that such a model yields a descent direction and can therefore be used in a wide variety of (provably convergent) algorithms under weak assumptions. Since an exact proximal step may be costly, we further consider a matrix-free conjugate gradient (CG) approximation, which can directly utilize the efficient pre-implemented forward and backward operations of any deep learning framework. We prove that this approximation still yields a descent direction and demonstrate the effectiveness of the proposed approach in PyTorch.

2 MODEL AND NOTATION

We propose a method to train a general $L$-layer neural network of the functional form

$$J(\theta; X, y) = L_y\big(\phi(\theta_{L-1}, \sigma(\phi(\theta_{L-2}, \ldots, \sigma(\phi(\theta_1, X)) \ldots )))\big). \quad (1)$$

Here, $J(\theta; X, y)$ denotes the training loss as a function of the network parameters $\theta$, the input data $X$ and the training targets $y$. As the final loss function $L_y$ we choose the softmax cross-entropy for our classification experiments. $\phi$ is a linear transfer function and $\sigma$ an elementwise nonlinear transfer function. As an example, for fully-connected neural networks $\theta = (W, b)$ and $\phi(\theta, a) = Wa + b\mathbf{1}^\top$. While we assume the nonlinearities $\sigma$ to be continuously differentiable functions for analysis purposes, our numerical experiments indicate that the proposed scheme extends to rectified linear units (ReLU), $\sigma(x) = \max(0, x)$. Formally, the functions $\sigma$ and $\phi$ map between spaces of different dimensions depending on the layer. However, to keep the presentation clean, we do not state this dependence explicitly. Figure 1 illustrates our notation for the fully-connected network architecture.

Throughout this paper, we denote the Euclidean norm for vectors and the Frobenius norm for matrices by $\|\cdot\|$, induced by an inner product $\langle\cdot,\cdot\rangle$. We use the gradient symbol $\nabla$ to denote the transpose of the Jacobian matrix, such that the chain rule applies in the form "inner derivative times outer derivative". For all computations involving matrix-valued functions and their gradient/Jacobian, we uniquely identify all involved quantities with their vectorized form by flattening matrices in a column-first order. Furthermore, we denote by $A^*$ the adjoint of a linear operator $A$.
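To make the functional form (1) concrete, the following is a minimal sketch, not the reference implementation, of a fully connected network with $\phi(\theta, a) = Wa + b\mathbf{1}^\top$ and $\sigma = \mathrm{ReLU}$; all function and variable names are illustrative.

```python
# Minimal sketch of the functional form (1) for a fully connected network with
# phi(theta, a) = W a + b 1^T and sigma = ReLU; names are illustrative and this
# is not the reference implementation.
import torch
import torch.nn.functional as F


def phi(W, b, a):
    # Linear transfer function: columns of `a` are the samples of the batch.
    return W @ a + b.unsqueeze(1)


def J(params, X, y):
    # params: list of (W_l, b_l); the last pair parametrizes phi(theta_{L-1}, .),
    # whose output is fed into the softmax cross-entropy loss L_y.
    a = X
    for W, b in params[:-1]:
        a = torch.relu(phi(W, b, a))   # a_l = sigma(z_l), z_l = phi(theta_l, a_{l-1})
    logits = phi(*params[-1], a)       # phi(theta_{L-1}, a_{L-2})
    return F.cross_entropy(logits.t(), y)
```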
Algorithm 1 – Penalty formulation of BackProp
  Input: current parameters $\theta^k$.
  // Forward pass.
  for $l = 1$ to $L-2$ do
    $z_l^k = \phi(\theta_l^k, a_{l-1}^k)$,  // $a_0 = X$.
    $a_l^k = \sigma(z_l^k)$.
  end for
  // Perform minimization steps on (3).
  (a) grad. step on $E$ w.r.t. $(\theta_{L-1}, a_{L-2})$.
  for $l = L-2$ to $1$ do
    (b) grad. step on $E$ w.r.t. $z_l$ and $a_{l-1}$,
    (c) grad. step on $E$ w.r.t. $\theta_l$.
  end for
  Output: new parameters $\theta^{k+1}$.

Algorithm 2 – ProxProp
  Input: current parameters $\theta^k$.
  // Forward pass.
  for $l = 1$ to $L-2$ do
    $z_l^k = \phi(\theta_l^k, a_{l-1}^k)$,  // $a_0 = X$.
    $a_l^k = \sigma(z_l^k)$.
  end for
  // Perform minimization steps on (3).
  (a) grad. step on $E$ w.r.t. $(\theta_{L-1}, a_{L-2})$, Eqs. (8), (12).
  for $l = L-2$ to $1$ do
    (b) grad. step on $E$ w.r.t. $z_l$ and $a_{l-1}$, Eqs. (9), (10).
    (c) prox step on $E$ w.r.t. $\theta_l$, Eq. (11).
  end for
  Output: new parameters $\theta^{k+1}$.

3 PENALTY FORMULATION OF BACKPROPAGATION

The gradient descent iteration on a nested function $J(\theta; X, y)$,

$$\theta^{k+1} = \theta^k - \tau \nabla J(\theta^k; X, y), \quad (2)$$

is commonly implemented using the backpropagation algorithm (Rumelhart et al., 1986). As the basis for our proposed optimization method, we derive a connection between the classical backpropagation algorithm and quadratic penalty functions of the form

$$E(\theta, a, z) = L_y(\phi(\theta_{L-1}, a_{L-2})) + \sum_{l=1}^{L-2} \left( \frac{\gamma}{2}\|\sigma(z_l) - a_l\|^2 + \frac{\rho}{2}\|\phi(\theta_l, a_{l-1}) - z_l\|^2 \right). \quad (3)$$

The approach of (Carreira-Perpiñán & Wang, 2014) is based on the minimization of (3), as under mild conditions the limit $\rho, \gamma \to \infty$ leads to the convergence of the sequence of minimizers of $E$ to the minimizer of $J$ (Nocedal & Wright, 2006, Theorem 17.1). In contrast to (Carreira-Perpiñán & Wang, 2014) we do not optimize (3), but rather use a connection of (3) to the classical backpropagation algorithm to motivate a semi-implicit optimization algorithm for the original cost function $J$. Indeed, the iteration shown in Algorithm 1 of forward passes followed by a sequential gradient descent on the penalty function $E$ is equivalent to the classical gradient descent iteration.

Proposition 1. Let $L_y$, $\phi$ and $\sigma$ be continuously differentiable. For $\rho = \gamma = 1/\tau$ and $\theta^k$ as the input to Algorithm 1, its output $\theta^{k+1}$ satisfies (2), i.e., Algorithm 1 computes one gradient descent iteration on $J$.

Proof. For this and all further proofs we refer to Appendix A.

4 PROXIMAL BACKPROPAGATION

The interpretation of Proposition 1 leads to the natural idea of replacing the explicit gradient steps (a), (b) and (c) in Algorithm 1 with other, possibly more powerful, minimization steps. We propose Proximal Backpropagation (ProxProp) as one such algorithm that takes implicit instead of explicit gradient steps to update the network parameters $\theta$ in step (c). This algorithm is motivated by the step size restriction of gradient descent.

4.1 GRADIENT DESCENT AND PROXIMAL MAPPINGS

Explicit gradient steps pose severe restrictions on the allowed step size $\tau$: even for a convex, twice continuously differentiable, $L$-smooth function $f: \mathbb{R}^n \to \mathbb{R}$, the convergence of the gradient descent algorithm can only be guaranteed for step sizes $0 < \tau < 2/L$. The Lipschitz constant $L$ of the gradient $\nabla f$ is in this case equal to the largest eigenvalue of the Hessian $H$. With the interpretation of backpropagation as in Proposition 1, gradient steps are taken on quadratic functions. As an example for the first layer,

$$\theta \mapsto \frac{1}{2}\|\theta X - z_1\|^2. \quad (4)$$
In this case the Hessian is $H = XX^\top$, which is often ill-conditioned. For the CIFAR-10 dataset the largest eigenvalue is $6.7 \cdot 10^6$, which is seven orders of magnitude larger than the smallest eigenvalue. Similar problems also arise in other layers, where poorly conditioned matrices $a_l$ pose limitations for guaranteeing the energy $E$ to decrease.

The proximal mapping (Moreau, 1965) of a function $f: \mathbb{R}^n \to \mathbb{R}$ is defined as

$$\mathrm{prox}_{\tau f}(y) := \arg\min_{x \in \mathbb{R}^n} \; f(x) + \frac{1}{2\tau}\|x - y\|^2. \quad (5)$$

By rearranging the optimality conditions of (5) and taking $y = x^k$, it can be interpreted as an implicit gradient step evaluated at the new point $x^{k+1}$ (assuming differentiability of $f$):

$$x^{k+1} := \arg\min_{x \in \mathbb{R}^n} \; f(x) + \frac{1}{2\tau}\|x - x^k\|^2 = x^k - \tau \nabla f(x^{k+1}). \quad (6)$$

The iterative algorithm (6) is known as the proximal point algorithm (Martinet, 1970). In contrast to explicit gradient descent this algorithm is unconditionally stable, i.e. the update scheme (6) monotonically decreases $f$ for any $\tau > 0$, since it holds by definition of the minimizer $x^{k+1}$ that $f(x^{k+1}) + \frac{1}{2\tau}\|x^{k+1} - x^k\|^2 \leq f(x^k)$. Thus proximal mappings yield unconditionally stable subproblems in the following sense: the update in $\theta_l$ provably decreases the penalty energy $E(\theta, a^k, z^k)$ from (3) for fixed activations $(a^k, z^k)$ for any choice of step size. This motivates us to use proximal steps as depicted in Algorithm 2.

4.2 PROXPROP

We propose to replace explicit gradient steps with proximal steps to update the network parameters of the linear transfer function. More precisely, after the forward pass

$$z_l^k = \phi(\theta_l^k, a_{l-1}^k), \qquad a_l^k = \sigma(z_l^k), \quad (7)$$

we keep the explicit gradient update equations for $z_l$ and $a_l$. The last layer update is

$$a_{L-2}^{k+1/2} = a_{L-2}^k - \tau\, \nabla_{a_{L-2}} L_y(\phi(\theta_{L-1}, a_{L-2})), \quad (8)$$

and for all other layers,

$$z_l^{k+1/2} = z_l^k - \sigma'(z_l^k)\big(\sigma(z_l^k) - a_l^{k+1/2}\big), \quad (9)$$

$$a_{l-1}^{k+1/2} = a_{l-1}^k - \nabla\Big(\tfrac{1}{2}\|\phi(\theta_l, \cdot) - z_l^{k+1/2}\|^2\Big)(a_{l-1}^k), \quad (10)$$

where we use $a_l^{k+1/2}$ and $z_l^{k+1/2}$ to denote the updated variables before the forward pass of the next iteration, and the multiplication in (9) is componentwise. However, instead of taking explicit gradient steps to update the linear transfer parameters $\theta_l$, we take proximal steps

$$\theta_l^{k+1} = \arg\min_{\theta} \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - z_l^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2. \quad (11)$$

This update can be computed in closed form, as it amounts to a linear solve (for details see Appendix B). While in principle one can take a proximal step on the final loss $L_y$, for efficiency reasons we choose an explicit gradient step, as the proximal step does not have a closed form solution in many scenarios (e.g. the softmax cross-entropy loss in classification problems). Specifically,

$$\theta_{L-1}^{k+1} = \theta_{L-1}^k - \tau\, \nabla_{\theta_{L-1}} L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)). \quad (12)$$

Note that we have eliminated the step sizes in the updates for $z_l$ and $a_{l-1}$ in (9) and (10), as such updates correspond to the choice of $\rho = \gamma = \frac{1}{\tau}$ in the penalty function (3) and are natural in the sense of Proposition 1. For the proximal steps in the parameters $\theta$ in (11) we have introduced a step size $\tau_\theta$ which, as we will see in Proposition 2 below, changes the descent metric, as opposed to $\tau$, which rather rescales the magnitude of the update.

We refer to one sweep of updates according to equations (7) – (12) as ProxProp, as it closely resembles the classical backpropagation (BackProp), but replaces the parameter update by a proximal mapping instead of an explicit gradient descent step. In the following subsection we analyze the convergence properties of ProxProp more closely.
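As an illustration of step (11), the following is a minimal sketch, not the authors' code, of the exact proximal step for a fully connected layer $\phi(\theta, a) = \theta a$ (bias omitted for brevity), using the closed form derived in Appendix B; the helper name and signature are ours.

```python
# Minimal sketch (ours) of the proximal parameter update (11) for a fully
# connected layer phi(theta, a) = theta @ a, using the closed form of Appendix B,
# eq. (44): a linear solve whose size equals the layer's input dimension and is
# independent of the batch size. All names are illustrative.
import torch


def prox_step_fc(theta_k, a_prev, z_half, tau_theta):
    """argmin_theta 0.5*||theta @ a_prev - z_half||^2 + 1/(2*tau_theta)*||theta - theta_k||^2."""
    n_in = a_prev.shape[0]
    eye = torch.eye(n_in, dtype=a_prev.dtype, device=a_prev.device)
    lhs = a_prev @ a_prev.t() + eye / tau_theta       # a a^T + (1/tau_theta) I
    rhs = z_half @ a_prev.t() + theta_k / tau_theta   # z^{k+1/2} a^T + (1/tau_theta) theta^k
    # theta^{k+1} solves theta @ lhs = rhs; lhs is symmetric positive definite.
    return torch.linalg.solve(lhs, rhs.t()).t()
```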
4.2.1 CONVERGENCE OF PROXPROP

ProxProp inherits all convergence-relevant properties from the classical backpropagation algorithm, despite replacing explicit gradient steps with proximal steps: it minimizes the original network energy $J(\theta; X, y)$, as its fixed points are stationary points of $J(\theta; X, y)$, and the update direction $\theta^{k+1} - \theta^k$ is a descent direction, such that it converges when combined with a suitable optimizer. In particular, it is straightforward to combine ProxProp with popular optimizers such as Nesterov's accelerated gradient descent (Nesterov, 1983) or Adam (Kingma & Ba, 2015). In the following, we give a detailed analysis of these properties.

Proposition 2. For $l = 1, \ldots, L-2$, the update direction $\theta^{k+1} - \theta^k$ computed by ProxProp meets

$$\theta_l^{k+1} - \theta_l^k = -\tau \left( \frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^* \right)^{-1} \nabla_{\theta_l} J(\theta^k; X, y). \quad (13)$$

In other words, ProxProp multiplies the gradient $\nabla_{\theta_l} J$ with the inverse of the positive definite, symmetric matrix

$$\frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^*, \quad (14)$$

which depends on the activations $a_{l-1}^k$ of the forward pass.

Proposition 2 has some important implications.

Proposition 3. For any choice of $\tau > 0$ and $\tau_\theta > 0$, fixed points $\theta^*$ of ProxProp are stationary points of the original energy $J(\theta; X, y)$.

Moreover, we can conclude convergence in the following sense.

Proposition 4. The ProxProp direction $\theta^{k+1} - \theta^k$ is a descent direction. Moreover, under the (weak) assumption that the activations $a_l^k$ remain bounded, the angle $\alpha_k$ between $-\nabla_\theta J(\theta^k; X, y)$ and $\theta^{k+1} - \theta^k$ remains uniformly bounded away from $\pi/2$, i.e.

$$\cos(\alpha_k) \geq c > 0 \qquad \forall\, k \geq 0, \quad (15)$$

for some constant $c$.

Proposition 4 immediately implies convergence of a whole class of algorithms that depend only on a provided descent direction. We refer the reader to (Nocedal & Wright, 2006, Chapter 3.2) for examples and more details. Furthermore, Proposition 4 states convergence for any minimization scheme in step (c) of Algorithm 2 that induces a descent direction in parameter space and thus provides the theoretical basis for a wide range of neural network optimization algorithms.

Considering the advantages of proximal steps over gradient steps, it is tempting to also update the auxiliary variables $a$ and $z$ in an implicit fashion. This corresponds to a proximal step in (b) of Algorithm 2. However, one cannot expect an analogous version of Proposition 3 to hold anymore. For example, if the update of $a_{L-2}$ in (8) is replaced by a proximal step, the propagated error does not correspond to the gradient of the loss function $L_y$, but to the gradient of its Moreau envelope. Consequently, one would then minimize a different energy. While in principle this could result in an optimization algorithm with, for example, favorable generalization properties, we focus on minimizing the original network energy in this work and therefore do not further pursue the idea of implicit steps on $a$ and $z$.

4.2.2 INEXACT SOLUTION OF PROXIMAL STEPS

As we can see in Proposition 2, the ProxProp updates differ from vanilla gradient descent by the variable metric induced by the matrices $(M_l^k)^{-1}$, with $M_l^k$ defined in (14). Computing the ProxProp update direction $v_l^k := \frac{1}{\tau}(\theta_l^{k+1} - \theta_l^k)$ therefore reduces to solving the linear equation

$$M_l^k v_l^k = -\nabla_{\theta_l} J(\theta^k; X, y), \quad (16)$$

which requires an efficient implementation.
We propose to use a conjugate gradient (CG) method, not only because it is one of the most efficient methods for iteratively solving linear systems in general, but also because it can be implemented matrix-free: it merely requires the application of the linear operator $M_l^k$, which consists of the identity and an application of $(\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^*$. The latter, however, is pre-implemented for many linear transfer functions $\phi$ in common deep learning frameworks, because $\phi(x, a_{l-1}^k) = (\nabla\phi(\cdot, a_{l-1}^k))^*(x)$ is nothing but a forward pass through $\phi$, and $(\nabla\phi(\cdot, a_{l-1}^k))(z)$ provides the gradient with respect to the parameters $\theta$ if $z$ is the backpropagated gradient up to that layer. Therefore, a CG solver is straightforward to implement in any deep learning framework using the existing, highly efficient and high-level implementations of $\phi$ and $\nabla\phi$: for a fully connected network $\phi$ is a matrix multiplication, and for a convolutional network the convolution operation. As we will analyze in more detail in Section 5.1, we approximate the solution to (16) with a few CG iterations, as the advantage of highly precise solutions does not justify the additional computational effort in obtaining them. Using any number of iterations provably does not harm the convergence properties of ProxProp:

Proposition 5. The direction $\tilde{v}_l^k$ one obtains from approximating the solution $v_l^k$ of (16) with the CG method remains a descent direction for any number of iterations.

4.2.3 CONVERGENCE IN THE STOCHASTIC SETTING

While the above analysis considers only the full batch setting, we remark that convergence of ProxProp can also be guaranteed in the stochastic setting under mild assumptions. Assuming that the activations $a_l^k$ remain bounded (as in Proposition 4), the eigenvalues of $(M_l^k)^{-1}$ are uniformly contained in an interval $[\lambda, \tau_\theta]$ for some fixed $\lambda > 0$. Therefore, our ProxProp updates fulfill Assumption 4.3 in (Bottou et al., 2016), presuming the classic stochastic gradient fulfills them. This guarantees convergence of stochastic ProxProp updates in the sense of (Bottou et al., 2016, Theorem 4.9), i.e. for a suitable sequence of diminishing step sizes.

5 NUMERICAL EVALUATION

ProxProp generally fits well with the API provided by modern deep learning frameworks, since it can be implemented as a network layer with a custom backward pass for the proximal mapping. We chose PyTorch for our implementation¹. In particular, our implementation can use the API's GPU compute capabilities; all numerical experiments reported below were conducted on an NVIDIA Titan X GPU. To directly compare the algorithms, we used our own layer for either proximal or gradient update steps (cf. step (c) in Algorithms 1 and 2). A ProxProp layer can be seamlessly integrated in a larger network architecture, also with other parametrized layers such as Batch Normalization.

¹https://github.com/tfrerix/proxprop

5.1 EXACT AND APPROXIMATE SOLUTIONS TO PROXIMAL STEPS

We study the behavior of ProxProp in comparison to classical BackProp for a supervised visual learning problem on the CIFAR-10 dataset. We train a fully connected network with architecture 3072–4000–1000–4000–10 and ReLU nonlinearities. As the loss function, we chose the cross-entropy between the probability distribution obtained by a softmax nonlinearity and the ground-truth labels. We used a subset of 45000 images for training while keeping 5000 images as a validation set. We initialized the parameters $\theta_l$ uniformly in $[-1/\sqrt{n_{l-1}},\, 1/\sqrt{n_{l-1}}]$, the default initialization of PyTorch.
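To make the matrix-free CG approximation from Section 4.2.2, which the ProxProp (cg) variants below refer to, concrete, here is a minimal sketch written out for a fully connected layer $\phi(\theta, a) = \theta a$; it is our illustration under that assumption, not the reference implementation, and all helper names are ours.

```python
# Minimal matrix-free CG sketch (ours, illustrative names) for approximately
# solving (16), M_l^k v = -grad_{theta_l} J, for a fully connected layer
# phi(theta, a) = theta @ a. The two operator applications inside apply_M
# correspond to a forward pass (v @ a) and a parameter gradient (. @ a^T); in a
# general framework they would be the layer's pre-implemented phi and grad-phi.
import torch


def apply_M(v, a_prev, tau_theta):
    # (1/tau_theta) I + grad_phi grad_phi^* applied to v.
    return v / tau_theta + (v @ a_prev) @ a_prev.t()


def cg_direction(neg_grad, a_prev, tau_theta, iters=3):
    # Standard CG with zero initialization; by Proposition 5 any number of
    # iterations yields a descent direction.
    x = torch.zeros_like(neg_grad)
    r = neg_grad.clone()                 # residual b - M x for x = 0
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(iters):
        Mp = apply_M(p, a_prev, tau_theta)
        alpha = rs / (p * Mp).sum()
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = (r * r).sum()
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```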
[Figure 2: three panels for the CIFAR-10 3072–4000–1000–4000–10 MLP, comparing BackProp with ProxProp (cg3), ProxProp (cg5), ProxProp (cg10) and ProxProp (exact): full batch training loss over epochs, full batch training loss over runtime, and validation accuracy over epochs.]

Figure 2: Exact and inexact solvers for ProxProp compared with BackProp. Left: A more precise solution of the proximal subproblem leads to overall faster convergence, while even a very inexact solution (only 3 CG iterations) already outperforms classical backpropagation. Center & Right: While the run time is comparable between the methods, the proposed ProxProp updates have better generalization performance (about 54% for BackProp and 56% for ours on the test set).

Figure 2 shows the decay of the full batch training loss over epochs (left) and training time (middle) for a Nesterov momentum² based optimizer using a momentum of $\mu = 0.95$ and minibatches of size 500. We used $\tau_\theta = 0.05$ for the ProxProp variants along with $\tau = 1$. For BackProp we chose $\tau = 0.05$ as the optimal value we found in a grid search. As we can see in Figure 2, using implicit steps indeed improves the optimization progress per epoch. Thanks to powerful linear algebra methods on the GPU, the exact ProxProp solution is competitive with BackProp even in terms of runtime. The advantage of the CG-based approximations, however, is that they generalize to arbitrary linear transfer functions in a matrix-free manner, i.e. they are independent of whether the matrices $M_l^k$ can be formed efficiently. Moreover, the validation accuracies (right plot in Figure 2) suggest that these approximations have generalization advantages in comparison to BackProp as well as to the exact ProxProp method. Finally, we found the exact solution to be significantly more sensitive to changes of $\tau_\theta$ than its CG-based approximations. We therefore focus on the CG-based variants of ProxProp in the following. In particular, we can eliminate the hyperparameter $\tau_\theta$ and consistently choose $\tau_\theta = 1$ for the rest of this paper, while one can in principle perform a hyperparameter search just as for the learning rate $\tau$. Consequently, there are no additional parameters compared with BackProp.

5.2 STABILITY FOR LARGER STEP SIZES

We compare the behavior of ProxProp and BackProp for different step sizes. Table 1 shows the final full batch training loss after 50 epochs for various $\tau$. The ProxProp based approaches remain stable over a significantly larger range of $\tau$. Even more importantly, deviating from the optimal step size $\tau$ by one order of magnitude resulted in a divergent algorithm for classical BackProp, but still provides reasonable training results for ProxProp with 3 or 5 CG iterations. These results are in accordance with our motivation developed in Section 4.1. From a practical point of view, this eases hyperparameter search over $\tau$.
| $\tau$ | 50 | 10 | 5 | 1 | 0.5 | 0.1 | 0.05 | $5\cdot 10^{-3}$ | $5\cdot 10^{-4}$ |
|---|---|---|---|---|---|---|---|---|---|
| BackProp | – | – | – | – | – | 0.524 | 0.091 | 0.637 | 1.531 |
| ProxProp (cg1) | 77.9 | 0.079 | 0.145 | 0.667 | 0.991 | 1.481 | 1.593 | 1.881 | 2.184 |
| ProxProp (cg3) | 94.7 | 0.644 | 0.031 | $2\cdot 10^{-3}$ | 0.012 | 1.029 | 1.334 | 1.814 | 2.175 |
| ProxProp (cg5) | 66.5 | 0.190 | 0.027 | $3\cdot 10^{-4}$ | $2\cdot 10^{-3}$ | 0.399 | 1.049 | 1.765 | 2.175 |

Table 1: Full batch loss for conjugate gradient versions of ProxProp and BackProp after training for 50 epochs, where – indicates that the algorithm diverged to NaN. The implicit ProxProp algorithms remain stable for a significantly wider range of step sizes.

²PyTorch's Nesterov SGD for direction $d(\theta^k)$: $m^{k+1} = \mu m^k + d(\theta^k)$, $\theta^{k+1} = \theta^k - \tau(\mu m^{k+1} + d(\theta^k))$.

[Figure 3: four panels for the CIFAR-10 convolutional neural network, comparing Adam + BackProp with Adam + ProxProp (3 cg) and Adam + ProxProp (10 cg): full batch loss over epochs, full batch loss over runtime, validation accuracy over epochs, and validation accuracy over runtime.]

Figure 3: ProxProp as a first-order oracle in combination with the Adam optimizer. The proposed method leads to a faster decrease of the full batch loss in epochs and to an overall higher accuracy on the validation set. The plots on the right hand side show data for a fixed runtime, which corresponds to a varying number of epochs for the different optimizers.

5.3 PROXPROP AS A FIRST-ORDER ORACLE

We show that ProxProp can be used as a gradient oracle for first-order optimization algorithms. In this section, we consider Adam (Kingma & Ba, 2015). Furthermore, to demonstrate our algorithm on a generic architecture with layers commonly used in practice, we trained on a convolutional neural network of the form

Conv[16×32×32] → ReLU → Pool[16×16×16] → Conv[20×16×16] → ReLU → Pool[20×8×8] → Conv[20×8×8] → ReLU → Pool[20×4×4] → FC + Softmax[10×1×1].

Here, the first dimension denotes the respective number of filters with kernel size 5×5, and max pooling downsamples its input by a factor of two. We set the step size $\tau = 10^{-3}$ for both BackProp and ProxProp.

The results are shown in Fig. 3. Using parameter update directions induced by ProxProp within Adam leads to a significantly faster decrease of the full batch training loss in epochs. While the running time is higher than for the highly optimized backpropagation method, we expect that it can be improved through further engineering efforts. We deduce from Fig. 3 that the best validation accuracy (72.9%) of the proposed method is higher than the one obtained with classical backpropagation (71.7%). Such a positive effect of proximal smoothing on the generalization capabilities of deep networks is consistent with the observations of Chaudhari et al. (2017b). Finally, the accuracies on the test set after 50 epochs are 70.7% for ProxProp and 69.6% for BackProp, which suggests that the proposed algorithm can lead to better generalization.

6 CONCLUSION

We have proposed proximal backpropagation (ProxProp) as an effective method for training neural networks. To this end, we first showed the equivalence of the classical backpropagation algorithm with an algorithm that alternates between sequential gradient steps on a quadratic penalty function and forward passes through the network.
Subsequently, we developed a generalization of BackProp, which replaces explicit gradient steps with implicit (proximal) steps, and proved that such a scheme yields a descent direction, even if the implicit steps are approximated by conjugate gradient iterations. Our numerical analysis demonstrates that ProxProp is stable across various choices of step sizes and shows promising results when compared with common stochastic gradient descent optimizers. We believe that the interpretation of error backpropagation as the alternation between forward passes and sequential minimization steps on a penalty functional provides a theoretical basis for the development of further learning algorithms.

REFERENCES

Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Miguel Á. Carreira-Perpiñán and Weiran Wang. Distributed optimization of deeply nested systems. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, AISTATS, 2014.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of the 5th International Conference on Learning Representations, ICLR, 2017a.

Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto, and Guillaume Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. arXiv preprint arXiv:1704.04932, 2017b.

Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS, 2014.

Sepp Hochreiter, Yoshua Bengio, and Paolo Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.

Quoc V. Le, Adam Coates, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011.

Yann LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, pp. 21–28, 1988.

James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010.

Bernard Martinet. Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Inf. Rech. Oper., pp. 154–159, 1970.

Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2006.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning, ICML, 2013.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS, 2014.

Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

A THEORETICAL RESULTS

Proof of Proposition 1. We first take a gradient step on

$$E(\theta, a, z) = L_y(\phi(\theta_{L-1}, a_{L-2})) + \sum_{l=1}^{L-2} \left( \frac{\gamma}{2}\|\sigma(z_l) - a_l\|^2 + \frac{\rho}{2}\|\phi(\theta_l, a_{l-1}) - z_l\|^2 \right) \quad (17)$$

with respect to $(\theta_{L-1}, a_{L-2})$. The gradient step with respect to $\theta_{L-1}$ is the same as in the gradient descent update,

$$\theta^{k+1} = \theta^k - \tau \nabla J(\theta^k; X, y), \quad (18)$$

since $J$ depends on $\theta_{L-1}$ only via $L_y \circ \phi$. The gradient descent step on $a_{L-2}$ in (a) yields

$$a_{L-2}^{k+1/2} = a_{L-2}^k - \tau\, \nabla_a\phi(\theta_{L-1}^k, a_{L-2}^k)\, \nabla_\phi L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)), \quad (19)$$

where we use $a_{L-2}^{k+1/2}$ to denote the updated variable $a_{L-2}$ before the forward pass of the next iteration. To keep the presentation as clear as possible we slightly abused the notation of a right multiplication with $\nabla_a\phi(\theta_{L-1}^k, a_{L-2}^k)$: while this notation is exact in the case of fully connected layers, it represents the application of the corresponding linear operator in the more general case, e.g. for convolutions.

For all layers $l \leq L-2$ note that due to the forward pass in Algorithm 1 we have

$$\sigma(z_l^k) = a_l^k, \qquad \phi(\theta_l^k, a_{l-1}^k) = z_l^k, \quad (20)$$

and we therefore get the following update equation in the gradient step (b) w.r.t. $z_l$,

$$z_l^{k+1/2} = z_l^k - \tau\gamma\, \sigma'(z_l^k)\big(\sigma(z_l^k) - a_l^{k+1/2}\big) = z_l^k - \sigma'(z_l^k)\big(a_l^k - a_l^{k+1/2}\big), \quad (21)$$

and in the gradient step (b) w.r.t. $a_{l-1}$,

$$a_{l-1}^{k+1/2} = a_{l-1}^k - \tau\rho\, \nabla_a\phi(\theta_l^k, a_{l-1}^k)\big(\phi(\theta_l^k, a_{l-1}^k) - z_l^{k+1/2}\big) = a_{l-1}^k - \nabla_a\phi(\theta_l^k, a_{l-1}^k)\big(z_l^k - z_l^{k+1/2}\big). \quad (22)$$

Equations (21) and (22) can be combined to obtain

$$z_l^k - z_l^{k+1/2} = \sigma'(z_l^k)\, \nabla_a\phi(\theta_{l+1}^k, a_l^k)\big(z_{l+1}^k - z_{l+1}^{k+1/2}\big). \quad (23)$$

The above formula allows us to backtrack the differences of the old $z_l^k$ and the updated $z_l^{k+1/2}$ up to layer $L-2$, where we can use equations (21) and (19) to relate the difference to the loss. Altogether, we obtain

$$z_l^k - z_l^{k+1/2} = \tau \left( \prod_{q=l}^{L-2} \sigma'(z_q^k)\, \nabla_a\phi(\theta_{q+1}^k, a_q^k) \right) \nabla_\phi L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)). \quad (24)$$
By inserting (24) into the gradient descent update equation with respect to $\theta_l$ in (c),

$$\theta_l^{k+1} = \theta_l^k - \nabla_\theta\phi(\theta_l^k, a_{l-1}^k)\big(z_l^k - z_l^{k+1/2}\big), \quad (25)$$

we obtain the chain rule for update (18).

Proof of Proposition 2. Since only the updates for $\theta_l$, $l = 1, \ldots, L-2$, are performed implicitly, one can replicate the proof of Proposition 1 exactly up to equation (24). Let us denote the right hand side of (24) by $g_l^k$, i.e. $z_l^{k+1/2} = z_l^k - g_l^k$, and note that

$$\tau\, \nabla_{\theta_l} J(\theta^k; X, y) = \nabla_\theta\phi(\cdot, a_{l-1}^k)\, g_l^k \quad (26)$$

holds by the chain rule (as seen in (25)). We have eliminated the dependence of $\nabla_\theta\phi(\theta_l^k, a_{l-1}^k)$ on $\theta_l^k$ and wrote $\nabla_\theta\phi(\cdot, a_{l-1}^k)$ instead, because we assume $\phi$ to be linear in $\theta$, such that $\nabla_\theta\phi$ does not depend on the point $\theta$ where the gradient is evaluated. We now rewrite the ProxProp update equation of the parameters $\theta$ as follows:

$$\begin{aligned} \theta_l^{k+1} &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - z_l^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - (z_l^k - g_l^k)\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - (\phi(\theta_l^k, a_{l-1}^k) - g_l^k)\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta - \theta_l^k, a_{l-1}^k) + g_l^k\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2, \end{aligned} \quad (27)$$

where we have used that $\phi$ is linear in $\theta$. The optimality condition yields

$$0 = \nabla\phi(\cdot, a_{l-1}^k)\big(\phi(\theta_l^{k+1} - \theta_l^k, a_{l-1}^k) + g_l^k\big) + \frac{1}{\tau_\theta}\big(\theta_l^{k+1} - \theta_l^k\big). \quad (28)$$

Again, due to the linearity of $\phi$ in $\theta$, one has

$$\phi(\theta, a_{l-1}^k) = (\nabla\phi(\cdot, a_{l-1}^k))^*(\theta), \quad (29)$$

where $A^*$ denotes the adjoint of a linear operator $A$. We conclude

$$\begin{aligned} 0 &= \nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*\big(\theta_l^{k+1} - \theta_l^k\big) + \nabla\phi(\cdot, a_{l-1}^k)\, g_l^k + \frac{1}{\tau_\theta}\big(\theta_l^{k+1} - \theta_l^k\big), \\ \left(\frac{1}{\tau_\theta} I + \nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*\right)\big(\theta_l^{k+1} - \theta_l^k\big) &= -\nabla\phi(\cdot, a_{l-1}^k)\, g_l^k = -\tau\, \nabla_{\theta_l} J(\theta^k; X, y), \end{aligned} \quad (30)$$

which yields the assertion.

Proof of Proposition 3. Under the assumption that $\theta^k$ converges, $\theta^k \to \hat{\theta}$, one finds that $a_l^k \to \hat{a}_l$ and $z_l^k \to \hat{z}_l = \phi(\hat{\theta}_l, \hat{a}_{l-1})$ converge to the respective activations of the parameters $\hat{\theta}$, due to the forward pass and the continuity of the network. As we assume $J(\cdot; X, y)$ to be continuously differentiable, we deduce from (30) that $\lim_{k\to\infty} \nabla_{\theta_l} J(\theta^k; X, y) = 0$ for all $l = 1, \ldots, L-2$. The parameters of the last layer $\theta_{L-1}$ are treated explicitly anyway, such that the above equation also holds for $l = L-1$, which then yields the assertion.

Proof of Proposition 4. As the matrices

$$\frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^* \quad (31)$$

(with the convention $M_{L-1}^k = I$) are positive definite, so are their inverses, and the claim that $\theta^{k+1} - \theta^k$ is a descent direction is immediate:

$$\big\langle \theta_l^{k+1} - \theta_l^k,\; \nabla_{\theta_l} J(\theta^k; X, y) \big\rangle = -\tau \big\langle (M_l^k)^{-1} \nabla_{\theta_l} J(\theta^k; X, y),\; \nabla_{\theta_l} J(\theta^k; X, y) \big\rangle. \quad (32)$$

We still have to guarantee that this update direction does not become orthogonal to the gradient in the limit $k \to \infty$. The largest eigenvalue of $(M_l^k)^{-1}$ is bounded from above by $\tau_\theta$. If the $a_{l-1}^k$ remain bounded, then so does $\nabla\phi(\cdot, a_{l-1}^k)$, and the largest eigenvalue of $\nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*$ must be bounded by some constant $c$. Therefore, the smallest eigenvalue of $(M_l^k)^{-1}$ must remain bounded from below by $(\frac{1}{\tau_\theta} + c)^{-1}$. Abbreviating $v = -\nabla_{\theta_l} J(\theta^k; X, y)$, it follows that

$$\cos(\alpha_k) = \frac{\big\langle \tau (M_l^k)^{-1} v,\; v \big\rangle}{\|\tau (M_l^k)^{-1} v\|\, \|v\|} \;\geq\; \frac{\lambda_{\min}\big((M_l^k)^{-1}\big)\, \|v\|^2}{\|(M_l^k)^{-1} v\|\, \|v\|} \;\geq\; \frac{\lambda_{\min}\big((M_l^k)^{-1}\big)}{\lambda_{\max}\big((M_l^k)^{-1}\big)}, \quad (33)$$

which yields the assertion.

Proof of Proposition 5. According to (Nocedal & Wright, 2006, p. 109, Thm. 5.3) and (Nocedal & Wright, 2006, p. 106, Thm. 5.2), the $k$-th iterate $x^k$ of the CG method for solving a linear system $Ax = b$ with starting point $x^0 = 0$ meets

$$x^k = \arg\min_{x \in \mathrm{span}(b, Ab, \ldots, A^{k-1}b)} \; \frac{1}{2}\langle x, Ax \rangle - \langle b, x \rangle, \quad (34)$$

i.e. it optimizes over an order-$k$ Krylov subspace. The starting point $x^0 = 0$ can be chosen without loss of generality.
Indeed, suppose the starting point is $\tilde{x}^0 \neq 0$; then one can optimize the variable $\hat{x} = x - \tilde{x}^0$ with starting point $\hat{x}^0 = 0$ and right hand side $\hat{b} = b - A\tilde{x}^0$. We will assume that the CG iteration has not converged yet, as the claim for a fully converged CG iteration immediately follows from Proposition 4.

Writing the vectors $b, Ab, \ldots, A^{k-1}b$ as columns of a matrix $K^k$, the condition $x \in \mathrm{span}(b, Ab, \ldots, A^{k-1}b)$ can equivalently be expressed as $x = K^k\alpha$ for some $\alpha \in \mathbb{R}^k$. In terms of $\alpha$ our minimization problem becomes

$$x^k = K^k\alpha = \arg\min_{\alpha \in \mathbb{R}^k} \; \frac{1}{2}\big\langle \alpha,\; (K^k)^\top A K^k \alpha \big\rangle - \big\langle (K^k)^\top b,\; \alpha \big\rangle, \quad (35)$$

leading to the optimality condition

$$0 = (K^k)^\top A K^k \alpha - (K^k)^\top b, \qquad x^k = K^k\big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b. \quad (36)$$

Note that $A$ is symmetric positive definite and can therefore be written as $A = \sqrt{A}\sqrt{A}$, leading to

$$(K^k)^\top A K^k = \big(\sqrt{A}\, K^k\big)^\top \big(\sqrt{A}\, K^k\big) \quad (37)$$

being symmetric positive definite. Hence, the matrix $((K^k)^\top A K^k)^{-1}$ is positive definite, too, and

$$\langle x^k, b \rangle = \Big\langle K^k\big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b,\; b \Big\rangle = \Big\langle \big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b,\; (K^k)^\top b \Big\rangle > 0. \quad (38)$$

Note that $(K^k)^\top b$ is nonzero if $b$ is nonzero, as $\|b\|^2$ is its first entry. To translate this general analysis of the CG iteration to our specific case, we find that for any number of CG iterations an approximate solution $\tilde{v}_l^k$ of

$$M_l^k v_l^k = -\nabla_{\theta_l} J(\theta^k; X, y) \quad (39)$$

leads to $\langle \tilde{v}_l^k, -\nabla_{\theta_l} J(\theta^k; X, y) \rangle > 0$, i.e., to $\tilde{v}_l^k$ being a descent direction.

B PROXIMAL OPERATOR FOR LINEAR TRANSFER FUNCTIONS

In order to update the parameters $\theta_l$ of the linear transfer function, we have to solve problem (11),

$$\theta^{k+1} = \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a^k) - z^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta^k\|^2. \quad (40)$$

Since we assume that $\phi$ is linear in $\theta$ for a fixed $a^k$, there exists a matrix $A^k$ such that

$$\mathrm{vec}(\theta^{k+1}) = \arg\min_\theta \; \frac{1}{2}\|A^k \mathrm{vec}(\theta) - \mathrm{vec}(z^{k+1/2})\|^2 + \frac{1}{2\tau_\theta}\|\mathrm{vec}(\theta) - \mathrm{vec}(\theta^k)\|^2, \quad (41)$$

and the optimality condition yields

$$\mathrm{vec}(\theta^{k+1}) = \big(I + \tau_\theta (A^k)^\top A^k\big)^{-1}\big(\mathrm{vec}(\theta^k) + \tau_\theta (A^k)^\top \mathrm{vec}(z^{k+1/2})\big). \quad (42)$$

In the main paper we sometimes use the more abstract but also more concise notion of $\nabla\phi(\cdot, a^k)$, which represents the linear operator

$$\nabla\phi(\cdot, a^k)(Y) = \mathrm{vec}^{-1}\big((A^k)^\top \mathrm{vec}(Y)\big). \quad (43)$$

To also make the above more specific, consider the example of $\phi(\theta, a^k) = \theta a^k$. In this case the variable $\theta$ may remain in matrix form and the solution of the proximal mapping becomes

$$\theta^{k+1} = \left(z^{k+1/2}(a^k)^\top + \frac{1}{\tau_\theta}\theta^k\right)\left(a^k(a^k)^\top + \frac{1}{\tau_\theta}I\right)^{-1}. \quad (44)$$

Since $a^k \in \mathbb{R}^{n \times N}$ for some layer size $n$ and batch size $N$, the size of the linear system is independent of the batch size.
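As a quick numerical illustration of this derivation (ours, not part of the paper, with illustrative names), the following script checks that the matrix form (44) coincides with the vectorized solution (42) for $\phi(\theta, a) = \theta a$, using the column-first vectorization convention of Section 2, for which $\mathrm{vec}(\theta a) = (a^\top \otimes I)\,\mathrm{vec}(\theta)$.

```python
# Small sanity check (ours, illustrative names) that the matrix form (44) agrees
# with the vectorized solution (42) for phi(theta, a) = theta @ a, where
# A^k = a^T kron I under column-first vectorization.
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

n, m, N, tau_theta = 4, 3, 7, 0.5
theta_k = torch.randn(n, m)
a = torch.randn(m, N)
z_half = torch.randn(n, N)


def vec(M):
    # Column-first (Fortran-order) vectorization.
    return M.t().reshape(-1)


# Matrix form (44): an m x m solve, independent of the batch size N.
theta_mat = torch.linalg.solve(
    a @ a.t() + torch.eye(m) / tau_theta,
    (z_half @ a.t() + theta_k / tau_theta).t(),
).t()

# Vectorized form (42) with A = a^T kron I_n: an (n*m) x (n*m) solve.
A = torch.kron(a.t().contiguous(), torch.eye(n))
theta_vec = torch.linalg.solve(
    torch.eye(n * m) + tau_theta * A.t() @ A,
    vec(theta_k) + tau_theta * A.t() @ vec(z_half),
)

assert torch.allclose(theta_mat, theta_vec.reshape(m, n).t())
```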