Published as a conference paper at ICLR 2018

# PROXIMAL BACKPROPAGATION

Thomas Frerix¹∗, Thomas Möllenhoff¹∗, Michael Moeller², Daniel Cremers¹
thomas.frerix@tum.de, thomas.moellenhoff@tum.de, michael.moeller@uni-siegen.de, cremers@tum.de
¹Technical University of Munich  ²University of Siegen
∗Contributed equally.

ABSTRACT

We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.

1 INTRODUCTION

In recent years neural networks have gained considerable attention in solving difficult correlation tasks such as classification in computer vision (Krizhevsky et al., 2012) or sequence learning (Sutskever et al., 2014), and as building blocks of larger learning systems (Silver et al., 2016). Training neural networks is accomplished by optimizing a nonconvex, possibly nonsmooth, nested function of the network parameters. Since the introduction of stochastic gradient descent (SGD) (Robbins & Monro, 1951; Bottou, 1991), several more sophisticated optimization methods have been studied. One such class is that of quasi-Newton methods, as for example the comparison of L-BFGS with SGD in (Le et al., 2011), Hessian-free approaches (Martens, 2010), and the Sum of Functions Optimizer in (Sohl-Dickstein et al., 2013). Several works consider specific properties of energy landscapes of deep learning models, such as frequent saddle points (Dauphin et al., 2014) and well-generalizable local optima (Chaudhari et al., 2017a). Among the most popular optimization methods in currently used deep learning frameworks are momentum based improvements of classical SGD, notably Nesterov's Accelerated Gradient (Nesterov, 1983; Sutskever et al., 2013), and the Adam optimizer (Kingma & Ba, 2015), which uses estimates of first and second order moments of the gradients for parameter updates. Nevertheless, the optimization of these models remains challenging, as learning with SGD and its variants requires careful weight initialization and a sufficiently small learning rate in order to yield a stable and convergent algorithm. Moreover, SGD often has difficulties in propagating a learning signal deeply into a network, commonly referred to as the vanishing gradient problem (Hochreiter et al., 2001).

Figure 1: Notation overview. For an $L$-layer feed-forward network we denote the explicit layer-wise activation variables as $z_l$ and $a_l$. The transfer functions are denoted as $\phi$ and $\sigma$. Layer $l$ is of size $n_l$.
Training neural networks can be formulated as a constrained optimization problem by explicitly introducing the network activations as variables of the optimization, which are coupled via layer-wise constraints to enforce a feasible network configuration. The authors of (Carreira-Perpiñán & Wang, 2014) have tackled this problem with a quadratic penalty approach, the method of auxiliary coordinates (MAC). Closely related, (Taylor et al., 2016) introduce additional auxiliary variables to further split linear and nonlinear transfer between layers and propose a primal dual algorithm for optimization. From a different perspective, (LeCun, 1988) takes a Lagrangian approach to formulate the constrained optimization problem.

In this work, we start from a constrained optimization point of view on the classical backpropagation algorithm. We show that backpropagation can be interpreted as a method alternating between two steps: first, a forward pass of the data with the current network weights, and secondly, an ordered sequence of gradient descent steps on a quadratic penalty energy. Using this point of view, instead of taking explicit gradient steps to update the network parameters associated with the linear transfer functions, we propose to use implicit gradient steps (also known as proximal steps; for the definition see (6)). We prove that such a model yields a descent direction and can therefore be used in a wide variety of (provably convergent) algorithms under weak assumptions. Since an exact proximal step may be costly, we further consider a matrix-free conjugate gradient (CG) approximation, which can directly utilize the efficient pre-implemented forward and backward operations of any deep learning framework. We prove that this approximation still yields a descent direction and demonstrate the effectiveness of the proposed approach in PyTorch.

2 MODEL AND NOTATION

We propose a method to train a general $L$-layer neural network of the functional form

$$J(\theta; X, y) = L_y\big(\phi(\theta_{L-1}, \sigma(\phi(\theta_{L-2}, \ldots, \sigma(\phi(\theta_1, X)) \ldots )))\big). \quad (1)$$

Here, $J(\theta; X, y)$ denotes the training loss as a function of the network parameters $\theta$, the input data $X$ and the training targets $y$. As the final loss function $L_y$ we choose the softmax cross-entropy for our classification experiments. $\phi$ is a linear transfer function and $\sigma$ an elementwise nonlinear transfer function. As an example, for fully-connected neural networks $\theta = (W, b)$ and $\phi(\theta, a) = Wa + b\mathbf{1}^\top$. While we assume the nonlinearities $\sigma$ to be continuously differentiable functions for analysis purposes, our numerical experiments indicate that the proposed scheme extends to rectified linear units (ReLU), $\sigma(x) = \max(0, x)$. Formally, the functions $\sigma$ and $\phi$ map between spaces of different dimensions depending on the layer. However, to keep the presentation clean, we do not state this dependence explicitly. Figure 1 illustrates our notation for the fully-connected network architecture.

Throughout this paper, we denote the Euclidean norm for vectors and the Frobenius norm for matrices by $\|\cdot\|$, induced by an inner product $\langle\cdot,\cdot\rangle$. We use the gradient symbol $\nabla$ to denote the transpose of the Jacobian matrix, such that the chain rule applies in the form "inner derivative times outer derivative". For all computations involving matrix-valued functions and their gradient/Jacobian, we uniquely identify all involved quantities with their vectorized form by flattening matrices in a column-first order. Furthermore, we denote by $A^*$ the adjoint of a linear operator $A$.
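To make the functional form (1) concrete, the following is a minimal sketch, not the reference implementation, of a fully connected network with $\phi(\theta, a) = Wa + b\mathbf{1}^\top$ and $\sigma = \mathrm{ReLU}$; all function and variable names are illustrative.

```python
# Minimal sketch of the functional form (1) for a fully connected network with
# phi(theta, a) = W a + b 1^T and sigma = ReLU; names are illustrative and this
# is not the reference implementation.
import torch
import torch.nn.functional as F


def phi(W, b, a):
    # Linear transfer function: columns of `a` are the samples of the batch.
    return W @ a + b.unsqueeze(1)


def J(params, X, y):
    # params: list of (W_l, b_l); the last pair parametrizes phi(theta_{L-1}, .),
    # whose output is fed into the softmax cross-entropy loss L_y.
    a = X
    for W, b in params[:-1]:
        a = torch.relu(phi(W, b, a))   # a_l = sigma(z_l), z_l = phi(theta_l, a_{l-1})
    logits = phi(*params[-1], a)       # phi(theta_{L-1}, a_{L-2})
    return F.cross_entropy(logits.t(), y)
```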
Algorithm 1 – Penalty formulation of BackProp
  Input: current parameters $\theta^k$.
  // Forward pass.
  for $l = 1$ to $L-2$ do
    $z_l^k = \phi(\theta_l^k, a_{l-1}^k)$,  // $a_0 = X$.
    $a_l^k = \sigma(z_l^k)$.
  end for
  // Perform minimization steps on (3).
  (a) grad. step on $E$ w.r.t. $(\theta_{L-1}, a_{L-2})$.
  for $l = L-2$ to $1$ do
    (b) grad. step on $E$ w.r.t. $z_l$ and $a_{l-1}$,
    (c) grad. step on $E$ w.r.t. $\theta_l$.
  end for
  Output: new parameters $\theta^{k+1}$.

Algorithm 2 – ProxProp
  Input: current parameters $\theta^k$.
  // Forward pass.
  for $l = 1$ to $L-2$ do
    $z_l^k = \phi(\theta_l^k, a_{l-1}^k)$,  // $a_0 = X$.
    $a_l^k = \sigma(z_l^k)$.
  end for
  // Perform minimization steps on (3).
  (a) grad. step on $E$ w.r.t. $(\theta_{L-1}, a_{L-2})$, Eqs. (8), (12).
  for $l = L-2$ to $1$ do
    (b) grad. step on $E$ w.r.t. $z_l$ and $a_{l-1}$, Eqs. (9), (10).
    (c) prox step on $E$ w.r.t. $\theta_l$, Eq. (11).
  end for
  Output: new parameters $\theta^{k+1}$.

3 PENALTY FORMULATION OF BACKPROPAGATION

The gradient descent iteration on a nested function $J(\theta; X, y)$,

$$\theta^{k+1} = \theta^k - \tau \nabla J(\theta^k; X, y), \quad (2)$$

is commonly implemented using the backpropagation algorithm (Rumelhart et al., 1986). As the basis for our proposed optimization method, we derive a connection between the classical backpropagation algorithm and quadratic penalty functions of the form

$$E(\theta, a, z) = L_y(\phi(\theta_{L-1}, a_{L-2})) + \sum_{l=1}^{L-2} \left( \frac{\gamma}{2}\|\sigma(z_l) - a_l\|^2 + \frac{\rho}{2}\|\phi(\theta_l, a_{l-1}) - z_l\|^2 \right). \quad (3)$$

The approach of (Carreira-Perpiñán & Wang, 2014) is based on the minimization of (3), as under mild conditions the limit $\rho, \gamma \to \infty$ leads to the convergence of the sequence of minimizers of $E$ to the minimizer of $J$ (Nocedal & Wright, 2006, Theorem 17.1). In contrast to (Carreira-Perpiñán & Wang, 2014) we do not optimize (3), but rather use a connection of (3) to the classical backpropagation algorithm to motivate a semi-implicit optimization algorithm for the original cost function $J$. Indeed, the iteration shown in Algorithm 1 of forward passes followed by a sequential gradient descent on the penalty function $E$ is equivalent to the classical gradient descent iteration.

Proposition 1. Let $L_y$, $\phi$ and $\sigma$ be continuously differentiable. For $\rho = \gamma = 1/\tau$ and $\theta^k$ as the input to Algorithm 1, its output $\theta^{k+1}$ satisfies (2), i.e., Algorithm 1 computes one gradient descent iteration on $J$.

Proof. For this and all further proofs we refer to Appendix A.

4 PROXIMAL BACKPROPAGATION

The interpretation of Proposition 1 leads to the natural idea of replacing the explicit gradient steps (a), (b) and (c) in Algorithm 1 with other, possibly more powerful, minimization steps. We propose Proximal Backpropagation (ProxProp) as one such algorithm that takes implicit instead of explicit gradient steps to update the network parameters $\theta$ in step (c). This algorithm is motivated by the step size restriction of gradient descent.

4.1 GRADIENT DESCENT AND PROXIMAL MAPPINGS

Explicit gradient steps pose severe restrictions on the allowed step size $\tau$: even for a convex, twice continuously differentiable, $L$-smooth function $f: \mathbb{R}^n \to \mathbb{R}$, the convergence of the gradient descent algorithm can only be guaranteed for step sizes $0 < \tau < 2/L$. The Lipschitz constant $L$ of the gradient $\nabla f$ is in this case equal to the largest eigenvalue of the Hessian $H$. With the interpretation of backpropagation as in Proposition 1, gradient steps are taken on quadratic functions. As an example for the first layer,

$$\theta \mapsto \frac{1}{2}\|\theta X - z_1\|^2. \quad (4)$$
In this case the Hessian is $H = XX^\top$, which is often ill-conditioned. For the CIFAR-10 dataset the largest eigenvalue is $6.7 \cdot 10^6$, which is seven orders of magnitude larger than the smallest eigenvalue. Similar problems also arise in other layers, where poorly conditioned matrices $a_l$ pose limitations for guaranteeing the energy $E$ to decrease.

The proximal mapping (Moreau, 1965) of a function $f: \mathbb{R}^n \to \mathbb{R}$ is defined as

$$\mathrm{prox}_{\tau f}(y) := \arg\min_{x \in \mathbb{R}^n} \; f(x) + \frac{1}{2\tau}\|x - y\|^2. \quad (5)$$

By rearranging the optimality conditions of (5) and taking $y = x^k$, it can be interpreted as an implicit gradient step evaluated at the new point $x^{k+1}$ (assuming differentiability of $f$):

$$x^{k+1} := \arg\min_{x \in \mathbb{R}^n} \; f(x) + \frac{1}{2\tau}\|x - x^k\|^2 = x^k - \tau \nabla f(x^{k+1}). \quad (6)$$

The iterative algorithm (6) is known as the proximal point algorithm (Martinet, 1970). In contrast to explicit gradient descent this algorithm is unconditionally stable, i.e. the update scheme (6) monotonically decreases $f$ for any $\tau > 0$, since it holds by definition of the minimizer $x^{k+1}$ that $f(x^{k+1}) + \frac{1}{2\tau}\|x^{k+1} - x^k\|^2 \leq f(x^k)$. Thus proximal mappings yield unconditionally stable subproblems in the following sense: the update in $\theta_l$ provably decreases the penalty energy $E(\theta, a^k, z^k)$ from (3) for fixed activations $(a^k, z^k)$ for any choice of step size. This motivates us to use proximal steps as depicted in Algorithm 2.

4.2 PROXPROP

We propose to replace explicit gradient steps with proximal steps to update the network parameters of the linear transfer function. More precisely, after the forward pass

$$z_l^k = \phi(\theta_l^k, a_{l-1}^k), \qquad a_l^k = \sigma(z_l^k), \quad (7)$$

we keep the explicit gradient update equations for $z_l$ and $a_l$. The last layer update is

$$a_{L-2}^{k+1/2} = a_{L-2}^k - \tau\, \nabla_{a_{L-2}} L_y(\phi(\theta_{L-1}, a_{L-2})), \quad (8)$$

and for all other layers,

$$z_l^{k+1/2} = z_l^k - \sigma'(z_l^k)\big(\sigma(z_l^k) - a_l^{k+1/2}\big), \quad (9)$$

$$a_{l-1}^{k+1/2} = a_{l-1}^k - \nabla\Big(\tfrac{1}{2}\|\phi(\theta_l, \cdot) - z_l^{k+1/2}\|^2\Big)(a_{l-1}^k), \quad (10)$$

where we use $a_l^{k+1/2}$ and $z_l^{k+1/2}$ to denote the updated variables before the forward pass of the next iteration, and the multiplication in (9) is componentwise. However, instead of taking explicit gradient steps to update the linear transfer parameters $\theta_l$, we take proximal steps

$$\theta_l^{k+1} = \arg\min_{\theta} \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - z_l^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2. \quad (11)$$

This update can be computed in closed form, as it amounts to a linear solve (for details see Appendix B). While in principle one can take a proximal step on the final loss $L_y$, for efficiency reasons we choose an explicit gradient step, as the proximal step does not have a closed form solution in many scenarios (e.g. the softmax cross-entropy loss in classification problems). Specifically,

$$\theta_{L-1}^{k+1} = \theta_{L-1}^k - \tau\, \nabla_{\theta_{L-1}} L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)). \quad (12)$$

Note that we have eliminated the step sizes in the updates for $z_l$ and $a_{l-1}$ in (9) and (10), as such updates correspond to the choice of $\rho = \gamma = \frac{1}{\tau}$ in the penalty function (3) and are natural in the sense of Proposition 1. For the proximal steps in the parameters $\theta$ in (11) we have introduced a step size $\tau_\theta$ which, as we will see in Proposition 2 below, changes the descent metric, as opposed to $\tau$, which rather rescales the magnitude of the update.

We refer to one sweep of updates according to equations (7) – (12) as ProxProp, as it closely resembles the classical backpropagation (BackProp), but replaces the parameter update by a proximal mapping instead of an explicit gradient descent step. In the following subsection we analyze the convergence properties of ProxProp more closely.
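As an illustration of step (11), the following is a minimal sketch, not the authors' code, of the exact proximal step for a fully connected layer $\phi(\theta, a) = \theta a$ (bias omitted for brevity), using the closed form derived in Appendix B; the helper name and signature are ours.

```python
# Minimal sketch (ours) of the proximal parameter update (11) for a fully
# connected layer phi(theta, a) = theta @ a, using the closed form of Appendix B,
# eq. (44): a linear solve whose size equals the layer's input dimension and is
# independent of the batch size. All names are illustrative.
import torch


def prox_step_fc(theta_k, a_prev, z_half, tau_theta):
    """argmin_theta 0.5*||theta @ a_prev - z_half||^2 + 1/(2*tau_theta)*||theta - theta_k||^2."""
    n_in = a_prev.shape[0]
    eye = torch.eye(n_in, dtype=a_prev.dtype, device=a_prev.device)
    lhs = a_prev @ a_prev.t() + eye / tau_theta       # a a^T + (1/tau_theta) I
    rhs = z_half @ a_prev.t() + theta_k / tau_theta   # z^{k+1/2} a^T + (1/tau_theta) theta^k
    # theta^{k+1} solves theta @ lhs = rhs; lhs is symmetric positive definite.
    return torch.linalg.solve(lhs, rhs.t()).t()
```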
4.2.1 CONVERGENCE OF PROXPROP

ProxProp inherits all convergence-relevant properties from the classical backpropagation algorithm, despite replacing explicit gradient steps with proximal steps: it minimizes the original network energy $J(\theta; X, y)$, as its fixed points are stationary points of $J(\theta; X, y)$, and the update direction $\theta^{k+1} - \theta^k$ is a descent direction, such that it converges when combined with a suitable optimizer. In particular, it is straightforward to combine ProxProp with popular optimizers such as Nesterov's accelerated gradient descent (Nesterov, 1983) or Adam (Kingma & Ba, 2015). In the following, we give a detailed analysis of these properties.

Proposition 2. For $l = 1, \ldots, L-2$, the update direction $\theta^{k+1} - \theta^k$ computed by ProxProp meets

$$\theta_l^{k+1} - \theta_l^k = -\tau \left( \frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^* \right)^{-1} \nabla_{\theta_l} J(\theta^k; X, y). \quad (13)$$

In other words, ProxProp multiplies the gradient $\nabla_{\theta_l} J$ with the inverse of the positive definite, symmetric matrix

$$\frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^*, \quad (14)$$

which depends on the activations $a_{l-1}^k$ of the forward pass.

Proposition 2 has some important implications.

Proposition 3. For any choice of $\tau > 0$ and $\tau_\theta > 0$, fixed points $\theta^*$ of ProxProp are stationary points of the original energy $J(\theta; X, y)$.

Moreover, we can conclude convergence in the following sense.

Proposition 4. The ProxProp direction $\theta^{k+1} - \theta^k$ is a descent direction. Moreover, under the (weak) assumption that the activations $a_l^k$ remain bounded, the angle $\alpha_k$ between $-\nabla_\theta J(\theta^k; X, y)$ and $\theta^{k+1} - \theta^k$ remains uniformly bounded away from $\pi/2$, i.e.

$$\cos(\alpha_k) \geq c > 0 \qquad \forall\, k \geq 0, \quad (15)$$

for some constant $c$.

Proposition 4 immediately implies convergence of a whole class of algorithms that depend only on a provided descent direction. We refer the reader to (Nocedal & Wright, 2006, Chapter 3.2) for examples and more details. Furthermore, Proposition 4 states convergence for any minimization scheme in step (c) of Algorithm 2 that induces a descent direction in parameter space and thus provides the theoretical basis for a wide range of neural network optimization algorithms.

Considering the advantages of proximal steps over gradient steps, it is tempting to also update the auxiliary variables $a$ and $z$ in an implicit fashion. This corresponds to a proximal step in (b) of Algorithm 2. However, one cannot expect an analogous version of Proposition 3 to hold anymore. For example, if the update of $a_{L-2}$ in (8) is replaced by a proximal step, the propagated error does not correspond to the gradient of the loss function $L_y$, but to the gradient of its Moreau envelope. Consequently, one would then minimize a different energy. While in principle this could result in an optimization algorithm with, for example, favorable generalization properties, we focus on minimizing the original network energy in this work and therefore do not further pursue the idea of implicit steps on $a$ and $z$.

4.2.2 INEXACT SOLUTION OF PROXIMAL STEPS

As we can see in Proposition 2, the ProxProp updates differ from vanilla gradient descent by the variable metric induced by the matrices $(M_l^k)^{-1}$, with $M_l^k$ defined in (14). Computing the ProxProp update direction $v_l^k := \frac{1}{\tau}(\theta_l^{k+1} - \theta_l^k)$ therefore reduces to solving the linear equation

$$M_l^k v_l^k = -\nabla_{\theta_l} J(\theta^k; X, y), \quad (16)$$

which requires an efficient implementation.
We propose to use a conjugate gradient (CG) method, not only because it is one of the most efficient methods for iteratively solving linear systems in general, but also because it can be implemented matrix-free: it merely requires the application of the linear operator $M_l^k$, which consists of the identity and an application of $(\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^*$. The latter, however, is pre-implemented for many linear transfer functions $\phi$ in common deep learning frameworks, because $\phi(x, a_{l-1}^k) = (\nabla\phi(\cdot, a_{l-1}^k))^*(x)$ is nothing but a forward pass through $\phi$, and $(\nabla\phi(\cdot, a_{l-1}^k))(z)$ provides the gradient with respect to the parameters $\theta$ if $z$ is the backpropagated gradient up to that layer. Therefore, a CG solver is straightforward to implement in any deep learning framework using the existing, highly efficient and high-level implementations of $\phi$ and $\nabla\phi$: for a fully connected network $\phi$ is a matrix multiplication, and for a convolutional network the convolution operation. As we will analyze in more detail in Section 5.1, we approximate the solution to (16) with a few CG iterations, as the advantage of highly precise solutions does not justify the additional computational effort in obtaining them. Using any number of iterations provably does not harm the convergence properties of ProxProp:

Proposition 5. The direction $\tilde{v}_l^k$ one obtains from approximating the solution $v_l^k$ of (16) with the CG method remains a descent direction for any number of iterations.

4.2.3 CONVERGENCE IN THE STOCHASTIC SETTING

While the above analysis considers only the full batch setting, we remark that convergence of ProxProp can also be guaranteed in the stochastic setting under mild assumptions. Assuming that the activations $a_l^k$ remain bounded (as in Proposition 4), the eigenvalues of $(M_l^k)^{-1}$ are uniformly contained in an interval $[\lambda, \tau_\theta]$ for some fixed $\lambda > 0$. Therefore, our ProxProp updates fulfill Assumption 4.3 in (Bottou et al., 2016), presuming the classic stochastic gradient fulfills them. This guarantees convergence of stochastic ProxProp updates in the sense of (Bottou et al., 2016, Theorem 4.9), i.e. for a suitable sequence of diminishing step sizes.

5 NUMERICAL EVALUATION

ProxProp generally fits well with the API provided by modern deep learning frameworks, since it can be implemented as a network layer with a custom backward pass for the proximal mapping. We chose PyTorch for our implementation¹. In particular, our implementation can use the API's GPU compute capabilities; all numerical experiments reported below were conducted on an NVIDIA Titan X GPU. To directly compare the algorithms, we used our own layer for either proximal or gradient update steps (cf. step (c) in Algorithms 1 and 2). A ProxProp layer can be seamlessly integrated in a larger network architecture, also with other parametrized layers such as Batch Normalization.

¹https://github.com/tfrerix/proxprop

5.1 EXACT AND APPROXIMATE SOLUTIONS TO PROXIMAL STEPS

We study the behavior of ProxProp in comparison to classical BackProp for a supervised visual learning problem on the CIFAR-10 dataset. We train a fully connected network with architecture 3072–4000–1000–4000–10 and ReLU nonlinearities. As the loss function, we chose the cross-entropy between the probability distribution obtained by a softmax nonlinearity and the ground-truth labels. We used a subset of 45000 images for training while keeping 5000 images as a validation set. We initialized the parameters $\theta_l$ uniformly in $[-1/\sqrt{n_{l-1}},\, 1/\sqrt{n_{l-1}}]$, the default initialization of PyTorch.
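To make the matrix-free CG approximation from Section 4.2.2, which the ProxProp (cg) variants below refer to, concrete, here is a minimal sketch written out for a fully connected layer $\phi(\theta, a) = \theta a$; it is our illustration under that assumption, not the reference implementation, and all helper names are ours.

```python
# Minimal matrix-free CG sketch (ours, illustrative names) for approximately
# solving (16), M_l^k v = -grad_{theta_l} J, for a fully connected layer
# phi(theta, a) = theta @ a. The two operator applications inside apply_M
# correspond to a forward pass (v @ a) and a parameter gradient (. @ a^T); in a
# general framework they would be the layer's pre-implemented phi and grad-phi.
import torch


def apply_M(v, a_prev, tau_theta):
    # (1/tau_theta) I + grad_phi grad_phi^* applied to v.
    return v / tau_theta + (v @ a_prev) @ a_prev.t()


def cg_direction(neg_grad, a_prev, tau_theta, iters=3):
    # Standard CG with zero initialization; by Proposition 5 any number of
    # iterations yields a descent direction.
    x = torch.zeros_like(neg_grad)
    r = neg_grad.clone()                 # residual b - M x for x = 0
    p = r.clone()
    rs = (r * r).sum()
    for _ in range(iters):
        Mp = apply_M(p, a_prev, tau_theta)
        alpha = rs / (p * Mp).sum()
        x = x + alpha * p
        r = r - alpha * Mp
        rs_new = (r * r).sum()
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```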
[Figure 2: three panels for the CIFAR-10 3072–4000–1000–4000–10 MLP, comparing BackProp with ProxProp (cg3), ProxProp (cg5), ProxProp (cg10) and ProxProp (exact): full batch training loss over epochs, full batch training loss over runtime, and validation accuracy over epochs.]

Figure 2: Exact and inexact solvers for ProxProp compared with BackProp. Left: A more precise solution of the proximal subproblem leads to overall faster convergence, while even a very inexact solution (only 3 CG iterations) already outperforms classical backpropagation. Center & Right: While the run time is comparable between the methods, the proposed ProxProp updates have better generalization performance (about 54% for BackProp and 56% for ours on the test set).

Figure 2 shows the decay of the full batch training loss over epochs (left) and training time (middle) for a Nesterov momentum² based optimizer using a momentum of $\mu = 0.95$ and minibatches of size 500. We used $\tau_\theta = 0.05$ for the ProxProp variants along with $\tau = 1$. For BackProp we chose $\tau = 0.05$ as the optimal value we found in a grid search. As we can see in Figure 2, using implicit steps indeed improves the optimization progress per epoch. Thanks to powerful linear algebra methods on the GPU, the exact ProxProp solution is competitive with BackProp even in terms of runtime. The advantage of the CG-based approximations, however, is that they generalize to arbitrary linear transfer functions in a matrix-free manner, i.e. they are independent of whether the matrices $M_l^k$ can be formed efficiently. Moreover, the validation accuracies (right plot in Figure 2) suggest that these approximations have generalization advantages in comparison to BackProp as well as to the exact ProxProp method. Finally, we found the exact solution to be significantly more sensitive to changes of $\tau_\theta$ than its CG-based approximations. We therefore focus on the CG-based variants of ProxProp in the following. In particular, we can eliminate the hyperparameter $\tau_\theta$ and consistently choose $\tau_\theta = 1$ for the rest of this paper, while one can in principle perform a hyperparameter search just as for the learning rate $\tau$. Consequently, there are no additional parameters compared with BackProp.

5.2 STABILITY FOR LARGER STEP SIZES

We compare the behavior of ProxProp and BackProp for different step sizes. Table 1 shows the final full batch training loss after 50 epochs for various $\tau$. The ProxProp based approaches remain stable over a significantly larger range of $\tau$. Even more importantly, deviating from the optimal step size $\tau$ by one order of magnitude resulted in a divergent algorithm for classical BackProp, but still provides reasonable training results for ProxProp with 3 or 5 CG iterations. These results are in accordance with our motivation developed in Section 4.1. From a practical point of view, this eases hyperparameter search over $\tau$.
| $\tau$ | 50 | 10 | 5 | 1 | 0.5 | 0.1 | 0.05 | $5\cdot 10^{-3}$ | $5\cdot 10^{-4}$ |
|---|---|---|---|---|---|---|---|---|---|
| BackProp | – | – | – | – | – | 0.524 | 0.091 | 0.637 | 1.531 |
| ProxProp (cg1) | 77.9 | 0.079 | 0.145 | 0.667 | 0.991 | 1.481 | 1.593 | 1.881 | 2.184 |
| ProxProp (cg3) | 94.7 | 0.644 | 0.031 | $2\cdot 10^{-3}$ | 0.012 | 1.029 | 1.334 | 1.814 | 2.175 |
| ProxProp (cg5) | 66.5 | 0.190 | 0.027 | $3\cdot 10^{-4}$ | $2\cdot 10^{-3}$ | 0.399 | 1.049 | 1.765 | 2.175 |

Table 1: Full batch loss for conjugate gradient versions of ProxProp and BackProp after training for 50 epochs, where – indicates that the algorithm diverged to NaN. The implicit ProxProp algorithms remain stable for a significantly wider range of step sizes.

²PyTorch's Nesterov SGD for direction $d(\theta^k)$: $m^{k+1} = \mu m^k + d(\theta^k)$, $\theta^{k+1} = \theta^k - \tau(\mu m^{k+1} + d(\theta^k))$.

[Figure 3: four panels for the CIFAR-10 convolutional neural network, comparing Adam + BackProp with Adam + ProxProp (3 cg) and Adam + ProxProp (10 cg): full batch loss over epochs, full batch loss over runtime, validation accuracy over epochs, and validation accuracy over runtime.]

Figure 3: ProxProp as a first-order oracle in combination with the Adam optimizer. The proposed method leads to a faster decrease of the full batch loss in epochs and to an overall higher accuracy on the validation set. The plots on the right hand side show data for a fixed runtime, which corresponds to a varying number of epochs for the different optimizers.

5.3 PROXPROP AS A FIRST-ORDER ORACLE

We show that ProxProp can be used as a gradient oracle for first-order optimization algorithms. In this section, we consider Adam (Kingma & Ba, 2015). Furthermore, to demonstrate our algorithm on a generic architecture with layers commonly used in practice, we trained on a convolutional neural network of the form

Conv[16×32×32] → ReLU → Pool[16×16×16] → Conv[20×16×16] → ReLU → Pool[20×8×8] → Conv[20×8×8] → ReLU → Pool[20×4×4] → FC + Softmax[10×1×1].

Here, the first dimension denotes the respective number of filters with kernel size 5×5, and max pooling downsamples its input by a factor of two. We set the step size $\tau = 10^{-3}$ for both BackProp and ProxProp.

The results are shown in Fig. 3. Using parameter update directions induced by ProxProp within Adam leads to a significantly faster decrease of the full batch training loss in epochs. While the running time is higher than for the highly optimized backpropagation method, we expect that it can be improved through further engineering efforts. We deduce from Fig. 3 that the best validation accuracy (72.9%) of the proposed method is higher than the one obtained with classical backpropagation (71.7%). Such a positive effect of proximal smoothing on the generalization capabilities of deep networks is consistent with the observations of Chaudhari et al. (2017b). Finally, the accuracies on the test set after 50 epochs are 70.7% for ProxProp and 69.6% for BackProp, which suggests that the proposed algorithm can lead to better generalization.

6 CONCLUSION

We have proposed proximal backpropagation (ProxProp) as an effective method for training neural networks. To this end, we first showed the equivalence of the classical backpropagation algorithm with an algorithm that alternates between sequential gradient steps on a quadratic penalty function and forward passes through the network.
Subsequently, we developed a generalization of BackProp, which replaces explicit gradient steps with implicit (proximal) steps, and proved that such a scheme yields a descent direction, even if the implicit steps are approximated by conjugate gradient iterations. Our numerical analysis demonstrates that ProxProp is stable across various choices of step sizes and shows promising results when compared with common stochastic gradient descent optimizers. We believe that the interpretation of error backpropagation as the alternation between forward passes and sequential minimization steps on a penalty functional provides a theoretical basis for the development of further learning algorithms.

REFERENCES

Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Miguel Á. Carreira-Perpiñán and Weiran Wang. Distributed optimization of deeply nested systems. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, AISTATS, 2014.

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-SGD: Biasing gradient descent into wide valleys. In Proceedings of the 5th International Conference on Learning Representations, ICLR, 2017a.

Pratik Chaudhari, Adam Oberman, Stanley Osher, Stefano Soatto, and Guillaume Carlier. Deep relaxation: partial differential equations for optimizing deep neural networks. arXiv preprint arXiv:1704.04932, 2017b.

Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS, 2014.

Sepp Hochreiter, Yoshua Bengio, and Paolo Frasconi. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In Field Guide to Dynamical Recurrent Networks. IEEE Press, 2001.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR, 2015.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS, 2012.

Quoc V. Le, Adam Coates, Bobby Prochnow, and Andrew Y. Ng. On optimization methods for deep learning. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011.

Yann LeCun. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, pp. 21–28, 1988.

James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, ICML, 2010.

Bernard Martinet. Régularisation d'inéquations variationnelles par approximations successives. Rev. Française Inf. Rech. Oper., pp. 154–159, 1970.

Jean-Jacques Moreau. Proximité et dualité dans un espace hilbertien. Bulletin de la Société mathématique de France, 93:273–299, 1965.

Yurii Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27(2):372–376, 1983.

Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. Springer, 2006.
Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Jascha Sohl-Dickstein, Ben Poole, and Surya Ganguli. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning, ICML, 2013.

Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, ICML, 2013.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS, 2014.

Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, and Tom Goldstein. Training neural networks without gradients: A scalable ADMM approach. In Proceedings of the 33rd International Conference on Machine Learning, ICML, 2016.

A THEORETICAL RESULTS

Proof of Proposition 1. We first take a gradient step on

$$E(\theta, a, z) = L_y(\phi(\theta_{L-1}, a_{L-2})) + \sum_{l=1}^{L-2} \left( \frac{\gamma}{2}\|\sigma(z_l) - a_l\|^2 + \frac{\rho}{2}\|\phi(\theta_l, a_{l-1}) - z_l\|^2 \right) \quad (17)$$

with respect to $(\theta_{L-1}, a_{L-2})$. The gradient step with respect to $\theta_{L-1}$ is the same as in the gradient descent update,

$$\theta^{k+1} = \theta^k - \tau \nabla J(\theta^k; X, y), \quad (18)$$

since $J$ depends on $\theta_{L-1}$ only via $L_y \circ \phi$. The gradient descent step on $a_{L-2}$ in (a) yields

$$a_{L-2}^{k+1/2} = a_{L-2}^k - \tau\, \nabla_a\phi(\theta_{L-1}^k, a_{L-2}^k)\, \nabla_\phi L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)), \quad (19)$$

where we use $a_{L-2}^{k+1/2}$ to denote the updated variable $a_{L-2}$ before the forward pass of the next iteration. To keep the presentation as clear as possible we slightly abused the notation of a right multiplication with $\nabla_a\phi(\theta_{L-1}^k, a_{L-2}^k)$: while this notation is exact in the case of fully connected layers, it represents the application of the corresponding linear operator in the more general case, e.g. for convolutions.

For all layers $l \leq L-2$ note that due to the forward pass in Algorithm 1 we have

$$\sigma(z_l^k) = a_l^k, \qquad \phi(\theta_l^k, a_{l-1}^k) = z_l^k, \quad (20)$$

and we therefore get the following update equation in the gradient step (b) w.r.t. $z_l$,

$$z_l^{k+1/2} = z_l^k - \tau\gamma\, \sigma'(z_l^k)\big(\sigma(z_l^k) - a_l^{k+1/2}\big) = z_l^k - \sigma'(z_l^k)\big(a_l^k - a_l^{k+1/2}\big), \quad (21)$$

and in the gradient step (b) w.r.t. $a_{l-1}$,

$$a_{l-1}^{k+1/2} = a_{l-1}^k - \tau\rho\, \nabla_a\phi(\theta_l^k, a_{l-1}^k)\big(\phi(\theta_l^k, a_{l-1}^k) - z_l^{k+1/2}\big) = a_{l-1}^k - \nabla_a\phi(\theta_l^k, a_{l-1}^k)\big(z_l^k - z_l^{k+1/2}\big). \quad (22)$$

Equations (21) and (22) can be combined to obtain

$$z_l^k - z_l^{k+1/2} = \sigma'(z_l^k)\, \nabla_a\phi(\theta_{l+1}^k, a_l^k)\big(z_{l+1}^k - z_{l+1}^{k+1/2}\big). \quad (23)$$

The above formula allows us to backtrack the differences of the old $z_l^k$ and the updated $z_l^{k+1/2}$ up to layer $L-2$, where we can use equations (21) and (19) to relate the difference to the loss. Altogether, we obtain

$$z_l^k - z_l^{k+1/2} = \tau \left( \prod_{q=l}^{L-2} \sigma'(z_q^k)\, \nabla_a\phi(\theta_{q+1}^k, a_q^k) \right) \nabla_\phi L_y(\phi(\theta_{L-1}^k, a_{L-2}^k)). \quad (24)$$
By inserting (24) into the gradient descent update equation with respect to $\theta_l$ in (c),

$$\theta_l^{k+1} = \theta_l^k - \nabla_\theta\phi(\theta_l^k, a_{l-1}^k)\big(z_l^k - z_l^{k+1/2}\big), \quad (25)$$

we obtain the chain rule for update (18).

Proof of Proposition 2. Since only the updates for $\theta_l$, $l = 1, \ldots, L-2$, are performed implicitly, one can replicate the proof of Proposition 1 exactly up to equation (24). Let us denote the right hand side of (24) by $g_l^k$, i.e. $z_l^{k+1/2} = z_l^k - g_l^k$, and note that

$$\tau\, \nabla_{\theta_l} J(\theta^k; X, y) = \nabla_\theta\phi(\cdot, a_{l-1}^k)\, g_l^k \quad (26)$$

holds by the chain rule (as seen in (25)). We have eliminated the dependence of $\nabla_\theta\phi(\theta_l^k, a_{l-1}^k)$ on $\theta_l^k$ and wrote $\nabla_\theta\phi(\cdot, a_{l-1}^k)$ instead, because we assume $\phi$ to be linear in $\theta$, such that $\nabla_\theta\phi$ does not depend on the point $\theta$ where the gradient is evaluated. We now rewrite the ProxProp update equation of the parameters $\theta$ as follows:

$$\begin{aligned} \theta_l^{k+1} &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - z_l^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - (z_l^k - g_l^k)\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a_{l-1}^k) - (\phi(\theta_l^k, a_{l-1}^k) - g_l^k)\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2 \\ &= \arg\min_\theta \; \frac{1}{2}\|\phi(\theta - \theta_l^k, a_{l-1}^k) + g_l^k\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta_l^k\|^2, \end{aligned} \quad (27)$$

where we have used that $\phi$ is linear in $\theta$. The optimality condition yields

$$0 = \nabla\phi(\cdot, a_{l-1}^k)\big(\phi(\theta_l^{k+1} - \theta_l^k, a_{l-1}^k) + g_l^k\big) + \frac{1}{\tau_\theta}\big(\theta_l^{k+1} - \theta_l^k\big). \quad (28)$$

Again, due to the linearity of $\phi$ in $\theta$, one has

$$\phi(\theta, a_{l-1}^k) = (\nabla\phi(\cdot, a_{l-1}^k))^*(\theta), \quad (29)$$

where $A^*$ denotes the adjoint of a linear operator $A$. We conclude

$$\begin{aligned} 0 &= \nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*\big(\theta_l^{k+1} - \theta_l^k\big) + \nabla\phi(\cdot, a_{l-1}^k)\, g_l^k + \frac{1}{\tau_\theta}\big(\theta_l^{k+1} - \theta_l^k\big), \\ \left(\frac{1}{\tau_\theta} I + \nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*\right)\big(\theta_l^{k+1} - \theta_l^k\big) &= -\nabla\phi(\cdot, a_{l-1}^k)\, g_l^k = -\tau\, \nabla_{\theta_l} J(\theta^k; X, y), \end{aligned} \quad (30)$$

which yields the assertion.

Proof of Proposition 3. Under the assumption that $\theta^k$ converges, $\theta^k \to \hat{\theta}$, one finds that $a_l^k \to \hat{a}_l$ and $z_l^k \to \hat{z}_l = \phi(\hat{\theta}_l, \hat{a}_{l-1})$ converge to the respective activations of the parameters $\hat{\theta}$, due to the forward pass and the continuity of the network. As we assume $J(\cdot; X, y)$ to be continuously differentiable, we deduce from (30) that $\lim_{k\to\infty} \nabla_{\theta_l} J(\theta^k; X, y) = 0$ for all $l = 1, \ldots, L-2$. The parameters of the last layer $\theta_{L-1}$ are treated explicitly anyway, such that the above equation also holds for $l = L-1$, which then yields the assertion.

Proof of Proposition 4. As the matrices

$$\frac{1}{\tau_\theta} I + (\nabla\phi(\cdot, a_{l-1}^k))(\nabla\phi(\cdot, a_{l-1}^k))^* \quad (31)$$

(with the convention $M_{L-1}^k = I$) are positive definite, so are their inverses, and the claim that $\theta^{k+1} - \theta^k$ is a descent direction is immediate:

$$\big\langle \theta_l^{k+1} - \theta_l^k,\; \nabla_{\theta_l} J(\theta^k; X, y) \big\rangle = -\tau \big\langle (M_l^k)^{-1} \nabla_{\theta_l} J(\theta^k; X, y),\; \nabla_{\theta_l} J(\theta^k; X, y) \big\rangle. \quad (32)$$

We still have to guarantee that this update direction does not become orthogonal to the gradient in the limit $k \to \infty$. The largest eigenvalue of $(M_l^k)^{-1}$ is bounded from above by $\tau_\theta$. If the $a_{l-1}^k$ remain bounded, then so does $\nabla\phi(\cdot, a_{l-1}^k)$, and the largest eigenvalue of $\nabla\phi(\cdot, a_{l-1}^k)(\nabla\phi(\cdot, a_{l-1}^k))^*$ must be bounded by some constant $c$. Therefore, the smallest eigenvalue of $(M_l^k)^{-1}$ must remain bounded from below by $(\frac{1}{\tau_\theta} + c)^{-1}$. Abbreviating $v = -\nabla_{\theta_l} J(\theta^k; X, y)$, it follows that

$$\cos(\alpha_k) = \frac{\big\langle \tau (M_l^k)^{-1} v,\; v \big\rangle}{\|\tau (M_l^k)^{-1} v\|\, \|v\|} \;\geq\; \frac{\lambda_{\min}\big((M_l^k)^{-1}\big)\, \|v\|^2}{\|(M_l^k)^{-1} v\|\, \|v\|} \;\geq\; \frac{\lambda_{\min}\big((M_l^k)^{-1}\big)}{\lambda_{\max}\big((M_l^k)^{-1}\big)}, \quad (33)$$

which yields the assertion.

Proof of Proposition 5. According to (Nocedal & Wright, 2006, p. 109, Thm. 5.3) and (Nocedal & Wright, 2006, p. 106, Thm. 5.2), the $k$-th iterate $x^k$ of the CG method for solving a linear system $Ax = b$ with starting point $x^0 = 0$ meets

$$x^k = \arg\min_{x \in \mathrm{span}(b, Ab, \ldots, A^{k-1}b)} \; \frac{1}{2}\langle x, Ax \rangle - \langle b, x \rangle, \quad (34)$$

i.e. it optimizes over an order-$k$ Krylov subspace. The starting point $x^0 = 0$ can be chosen without loss of generality.
Indeed, suppose the starting point is $\tilde{x}^0 \neq 0$; then one can optimize the variable $\hat{x} = x - \tilde{x}^0$ with starting point $\hat{x}^0 = 0$ and right hand side $\hat{b} = b - A\tilde{x}^0$. We will assume that the CG iteration has not converged yet, as the claim for a fully converged CG iteration immediately follows from Proposition 4.

Writing the vectors $b, Ab, \ldots, A^{k-1}b$ as columns of a matrix $K^k$, the condition $x \in \mathrm{span}(b, Ab, \ldots, A^{k-1}b)$ can equivalently be expressed as $x = K^k\alpha$ for some $\alpha \in \mathbb{R}^k$. In terms of $\alpha$ our minimization problem becomes

$$x^k = K^k\alpha = \arg\min_{\alpha \in \mathbb{R}^k} \; \frac{1}{2}\big\langle \alpha,\; (K^k)^\top A K^k \alpha \big\rangle - \big\langle (K^k)^\top b,\; \alpha \big\rangle, \quad (35)$$

leading to the optimality condition

$$0 = (K^k)^\top A K^k \alpha - (K^k)^\top b, \qquad x^k = K^k\big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b. \quad (36)$$

Note that $A$ is symmetric positive definite and can therefore be written as $A = \sqrt{A}\sqrt{A}$, leading to

$$(K^k)^\top A K^k = \big(\sqrt{A}\, K^k\big)^\top \big(\sqrt{A}\, K^k\big) \quad (37)$$

being symmetric positive definite. Hence, the matrix $((K^k)^\top A K^k)^{-1}$ is positive definite, too, and

$$\langle x^k, b \rangle = \Big\langle K^k\big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b,\; b \Big\rangle = \Big\langle \big((K^k)^\top A K^k\big)^{-1}(K^k)^\top b,\; (K^k)^\top b \Big\rangle > 0. \quad (38)$$

Note that $(K^k)^\top b$ is nonzero if $b$ is nonzero, as $\|b\|^2$ is its first entry. To translate this general analysis of the CG iteration to our specific case, we find that for any number of CG iterations an approximate solution $\tilde{v}_l^k$ of

$$M_l^k v_l^k = -\nabla_{\theta_l} J(\theta^k; X, y) \quad (39)$$

leads to $\langle \tilde{v}_l^k, -\nabla_{\theta_l} J(\theta^k; X, y) \rangle > 0$, i.e., to $\tilde{v}_l^k$ being a descent direction.

B PROXIMAL OPERATOR FOR LINEAR TRANSFER FUNCTIONS

In order to update the parameters $\theta_l$ of the linear transfer function, we have to solve problem (11),

$$\theta^{k+1} = \arg\min_\theta \; \frac{1}{2}\|\phi(\theta, a^k) - z^{k+1/2}\|^2 + \frac{1}{2\tau_\theta}\|\theta - \theta^k\|^2. \quad (40)$$

Since we assume that $\phi$ is linear in $\theta$ for a fixed $a^k$, there exists a matrix $A^k$ such that

$$\mathrm{vec}(\theta^{k+1}) = \arg\min_\theta \; \frac{1}{2}\|A^k \mathrm{vec}(\theta) - \mathrm{vec}(z^{k+1/2})\|^2 + \frac{1}{2\tau_\theta}\|\mathrm{vec}(\theta) - \mathrm{vec}(\theta^k)\|^2, \quad (41)$$

and the optimality condition yields

$$\mathrm{vec}(\theta^{k+1}) = \big(I + \tau_\theta (A^k)^\top A^k\big)^{-1}\big(\mathrm{vec}(\theta^k) + \tau_\theta (A^k)^\top \mathrm{vec}(z^{k+1/2})\big). \quad (42)$$

In the main paper we sometimes use the more abstract but also more concise notion of $\nabla\phi(\cdot, a^k)$, which represents the linear operator

$$\nabla\phi(\cdot, a^k)(Y) = \mathrm{vec}^{-1}\big((A^k)^\top \mathrm{vec}(Y)\big). \quad (43)$$

To also make the above more specific, consider the example of $\phi(\theta, a^k) = \theta a^k$. In this case the variable $\theta$ may remain in matrix form and the solution of the proximal mapping becomes

$$\theta^{k+1} = \left(z^{k+1/2}(a^k)^\top + \frac{1}{\tau_\theta}\theta^k\right)\left(a^k(a^k)^\top + \frac{1}{\tau_\theta}I\right)^{-1}. \quad (44)$$

Since $a^k \in \mathbb{R}^{n \times N}$ for some layer size $n$ and batch size $N$, the size of the linear system is independent of the batch size.
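As a quick numerical illustration of this derivation (ours, not part of the paper, with illustrative names), the following script checks that the matrix form (44) coincides with the vectorized solution (42) for $\phi(\theta, a) = \theta a$, using the column-first vectorization convention of Section 2, for which $\mathrm{vec}(\theta a) = (a^\top \otimes I)\,\mathrm{vec}(\theta)$.

```python
# Small sanity check (ours, illustrative names) that the matrix form (44) agrees
# with the vectorized solution (42) for phi(theta, a) = theta @ a, where
# A^k = a^T kron I under column-first vectorization.
import torch

torch.manual_seed(0)
torch.set_default_dtype(torch.float64)

n, m, N, tau_theta = 4, 3, 7, 0.5
theta_k = torch.randn(n, m)
a = torch.randn(m, N)
z_half = torch.randn(n, N)


def vec(M):
    # Column-first (Fortran-order) vectorization.
    return M.t().reshape(-1)


# Matrix form (44): an m x m solve, independent of the batch size N.
theta_mat = torch.linalg.solve(
    a @ a.t() + torch.eye(m) / tau_theta,
    (z_half @ a.t() + theta_k / tau_theta).t(),
).t()

# Vectorized form (42) with A = a^T kron I_n: an (n*m) x (n*m) solve.
A = torch.kron(a.t().contiguous(), torch.eye(n))
theta_vec = torch.linalg.solve(
    torch.eye(n * m) + tau_theta * A.t() @ A,
    vec(theta_k) + tau_theta * A.t() @ vec(z_half),
)

assert torch.allclose(theta_mat, theta_vec.reshape(m, n).t())
```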