# monotone_operator_equilibrium_networks__ee926318.pdf

Monotone operator equilibrium networks

Ezra Winston School of Computer Science

Carnegie Mellon University

Pittsburgh, United States ewinston@cs.cmu.edu

J. Zico Kolter School of Computer Science

Carnegie Mellon University

& Bosch Center for AI Pittsburgh, United States

zkolter@cs.cmu.edu

Implicit-depth models such as Deep Equilibrium Networks have recently been shown to match or exceed the performance of traditional deep networks while being much more memory efﬁcient. However, these models suffer from unstable convergence to a solution and lack guarantees that a solution exists. On the other hand, Neural ODEs, another class of implicit-depth models, do guarantee existence of a unique solution but perform poorly compared with traditional networks. In this paper, we develop a new class of implicit-depth model based on the theory of monotone operators, the Monotone Operator Equilibrium Network (mon DEQ). We show the close connection between ﬁnding the equilibrium point of an implicit network and solving a form of monotone operator splitting problem, which admits efﬁcient solvers with guaranteed, stable convergence. We then develop a parameterization of the network which ensures that all operators remain monotone, which guarantees the existence of a unique equilibrium point. Finally, we show how to instantiate several versions of these models, and implement the resulting iterative solvers, for structured linear operators such as multi-scale convolutions. The resulting models vastly outperform the Neural ODE-based models while also being more computationally efﬁcient. Code is available at http://github.com/locuslab/monotone_op_net.

1 Introduction

Recent work in deep learning has demonstrated the power of implicit-depth networks, models where features are created not by explicitly iterating some number of nonlinear layers, but by ﬁnding a solution to some implicitly deﬁned equation. Instances of such models include the Neural ODE [8], which computes hidden layers as the solution to a continuous-time dynamical system, and the Deep Equilibrium (DEQ) Model [5], which ﬁnds a ﬁxed point of a nonlinear dynamical system corresponding to an effectively inﬁnite-depth weight-tied network. These models, which trace back to some of the original work on recurrent backpropagation [2, 23], have recently regained attention since they have been shown to match or even exceed to performance of traditional deep networks in domains such as sequence modeling [5]. At the same time, these models show drastically improved memory efﬁciency over traditional networks since backpropagation is typically done analytically using the implicit function theorem, without needing to store the intermediate hidden layers.

However, implict-depth models that perform well require extensive tuning in order to achieve stable convergence to a solution. Obtaining convergence in DEQs requires careful initialization and regularization, which has proven difﬁcult in practice [21]. Moreover, solutions to these models are not guaranteed to exist or be unique, making the output of the models potentially ill-deﬁned. While Neural ODEs [8] do guarantee existence of a unique solution, training remains unstable since the ODE problems can become severely ill-posed [10]. Augmented Neural ODEs [10] improve the stability of Neural ODEs by learning ODEs with simpler ﬂows, but neither model achieves efﬁcient

34th Conference on Neural Information Processing Systems (Neur IPS 2020), Vancouver, Canada.

convergence nor performs well on standard benchmarks. Crucial questions remain about how models can have guaranteed, unique solutions, and what algorithms are most efﬁcient at ﬁnding them.

In this paper, we present a new class of implicit-depth equilibrium model, the Monotone Operator Equilibrium Network (mon DEQ), which guarantees stable convergence to a unique ﬁxed point.1 The model is based upon the theory of monotone operators [6, 26], and illustrates a close connection between simple ﬁxed-point iteration in weight-tied networks and the solution to a particular form of monotone operator splitting problem. Using this connection, this paper lays the theoretical and practical foundations for such networks. We show how to parameterize networks in a manner that ensures all operators remain monotone, which establishes the existence and uniqueness of the equilibrium point. We show how to backpropagate through such networks using the implicit function theorem; this leads to a corresponding (linear) operator splitting problem for the backward pass, which also is guaranteed to have a unique solution. We then adapt traditional operator splitting methods, such as forward-backward splitting or Peaceman-Rachford splitting, to naturally derive algorithms for efﬁciently computing these equilibrium points.

Finally, we demonstrate how to practically implement such models and operator splitting methods, in the cases of typical feedforward, fully convolutional, and multi-scale convolutional networks. For convolutional networks, the most efﬁcient ﬁxed-point solution methods require an inversion of the associated linear operator, and we illustrate how to achieve this using the fast Fourier transform. The resulting networks show strong performance on several benchmark tasks, vastly improving upon the accuracy and efﬁciency of Neural ODEs-based models, the other implicit-depth models where solutions are guaranteed to exist and be unique.

2 Related work

Implicit models in deep learning There has been a growing interest in recent years in implicit layers in deep learning. Instead of specifying the explicit computation to perform, a layer speciﬁes some condition that should hold at the solution to the layer, such as a nonlinear equality, or a differential equation solution. Using the implicit function theorem allows for backpropagating through the layer solutions analytically, making these layers very memory efﬁcient, as they do not need to maintain intermediate iterations of the solution procedure. Recent examples include layers that compute inference in graphical models [15], solve optimization problems [12, 3, 13, 1], execute model-based control policies [4], solve two-player games [20], solve gradient-based optimization for meta-learning [24], and many others.

Stability of ﬁxed-point models The issue of model stability has in fact been at the heart of much work in ﬁxed-point models. The original work on attractor-style recurrent models, trained via recurrent backpropagation [2, 23], precisely attempted to ensure that the forward iteration procedure was stable. And indeed, much of the work in recurrent architectures such as LSTMs has focused on these issues of stability [14]. Recent work has revisited recurrent backpropagation in a similar manner to DEQs, with the similar aim of speeding up the computation of ﬁxed points [19]. And other work has looked at the stability of implicit models [11], with an emphasis on guaranteeing the existence of ﬁxed points, but focused on alternative stability conditions, and considered only relatively small-scale experiments. Other recent work has looked to use control-theoretic methods to ensure the stability of implicit models, [25], though again they consider only small-scale evaluations.

Monotone operators in deep learning Although most work in the ﬁeld of monotone operators is concerned with general convex analysis, the recent work of [9] does also highlight connections between deep networks and monotone operator problems. Unlike our current work, however, that work focused largely on the fact that many common non-linearities can be expressed via proximal operators, and analyzed traditional networks under the assumptions that certain of the operators were monotone, but did not address conditions for the networks to be monotone or algorithms for solving or backpropagating through the networks.

3 A monotone operator view of ﬁxed-point networks

This section lays out our main methodological and theoretical contribution, a class of equilibrium networks based upon monotone operators. We begin with some preliminaries, then highlight the

1We largely use the terms ﬁxed point and equilibrium point interchangably in this work, using ﬁxed point in the context of an iterative procedure, and equilibrium point to refer more broadly to the point itself.

basic connection between the ﬁxed point of an inﬁnite-depth network and an associated operator splitting problem; next, we propose a parameterization that guarantees the associated operators to be maximal monotone; ﬁnally, we show how to use operator splitting methods to both compute the ﬁxed point and backpropagate through the ﬁxed point efﬁciently.

3.1 Preliminaries

Monotone operator theory The theory of monotone operators plays a foundational role in convex analysis and optimization. Monotone operators are a natural generalization of monotone functions, which can be used to assess the convergence properties of many forms of iterative ﬁxed-point algorithms. We emphasize that the majority of the work in this paper relies on well-known properties about monotone operators, and we refer to standard references on the topic including [6] and a less formal survey by [26]; we do include a brief recap of the deﬁnitions and results we require in Appendix A. Formally, an operator is a subset of the space F Rn Rn; in our setting this will usually correspond to set-valued or single-valued function. Operator splitting approaches refer to methods for ﬁnding a zero in a sum of operators, i.e., ﬁnd x such that 0 2 (F + G)(x). There are many such methods, but the two we will use mainly in this work are forward-backward splitting (eqn. A9 in the Appendix) and Peaceman-Rachford splitting (eqn. A10). As we will see, both ﬁnding a network equilibrium point and backpropagating through it can be formulated as operator splitting problems, and different operator splitting methods will lead to different approaches in their application to our subsequent implicit networks.

Deep equilibrium models The mon DEQ architecture is closely relate to the DEQ model, which parameterizes a weight-tied, input-injected network of the form zi+1 = g(zi, x), where x denotes the input to the network, injected at each layer; zi denotes the hidden layer at depth i; and g denotes a nonlinear function which is the same for each layer (hence the network is weight-tied). The key aspect of the DEQ model is that in this weight-tied setting, instead of forward iteration, we can simply use any root-ﬁnding approach to ﬁnd an equilibrium point of such an iteration z = g(z , x). Assuming the model is stable, this equilibrium point corresponds to an inﬁnite-depth ﬁxed point of the layer. The mon DEQ architecture can be viewed as an instance of a DEQ model, but one that relies on the theory of monotone operators, and a speciﬁc paramterization of the network weights, to guarantee the existence of a unique ﬁxed point for the network. Crucially, however, as is the case for DEQs, naive forward iteration of this model is not necessarily stable; we therefore employ operator splitting methods to develop provably (linearly) convergent methods for ﬁnding such ﬁxed points.

3.2 Fixed-point networks as operator splitting

As a starting point of our analysis, consider the weight-tied, input-injected network in which x 2 Rd denotes the input, and zk 2 Rn denotes the hidden units at layer k, given by the iteration2

zk+1 = σ(Wzk + Ux + b) (1) where σ : R ! R is a nonlinearity applied elementwise, W 2 Rn n are the hidden unit weights, U 2 Rn x are the input-injection weights and b 2 Rn is a bias term. An equilibrium point, or ﬁxed point, of this system is some point z? which remains constant after an update:

z? = σ(Wz? + Ux + b). (2) We begin by observing that it is possible to characterize this equilibrium point exactly as the solution to a certain operator splitting problem, under certain choices of operators and activation σ. This can be formalized in the following theorem, which we prove in Appendix B:

Theorem 1. Finding a ﬁxed point of the iteration (1) is equivalent to ﬁnding a zero of the operator splitting problem 0 2 (F + G)(z?) with the operators

F(z) = (I W)(z) (Ux + b), G = @f (3) and σ( ) = prox1

f( ) for some convex closed proper (CCP) function f, where prox

f denotes the proximal operator

f (x) argmin

2 + f(z). (4)

2This setting can also be seen as corresponding to a recurrent network with identical inputs at each time (indeed, this is the view of so-called attractor networks [23]). However, because in modern usage recurrent networks typically refer to sequential models with different inputs at each time, we don t adopt this terminology.

It is also well-established that many common nonlinearities used in deep networks can be represented as proximal operators of CCP functions [7, 9]. For example, the Re LU nonlinearity σ(x) = [x]+ is the proximal operator of the indicator of the positive orthant f(x) = I{x 0}, and tanh, sigmoid, and softplus all have close correspondence with proximal operators of simple expressions [7].

In fact, this method establishes that some seemingly unstable iterations can actually still lead to convergent algorithms. Re LU activations, for instance, have traditionally been avoided in iterative models such as recurrent networks, due to exploding or vanishing gradient problems and nonsmoothness. Yet this iteration shows that (with input injection and the above constraint on W), Re LU operators are perfectly well-suited to these ﬁxed-point iterations.

3.3 Enforcing existence of a unique solution

The above connection is straightforward, but also carries interesting implications for deep learning. Speciﬁcally, we can establish the existence and uniqueness of the equilibirum point z? via the simple sufﬁcient criterion that I W is strongly monotone, or in other words3 I W m I for some m > 0 (see Appendix A). The constraint is by no means a trivial condition. Although many layers obey this condition under typical initialization schemes, during training it is normal for W to move outside this regime. Thus, the ﬁrst step of the mon DEQ architecture is to parameterize W in such a way that it always satisﬁes this strong monotonicity constraint. Proposition 1. We have I W m I if and only if there exist A, B 2 Rn n such that

W = (1 m)I AT A + B BT . (5)

We therefore propose to simply parameterize W directly in this form, by deﬁning the A and B matrices directly. While this is an overparameterized form for a dense matrix, we could avoid this issue by, e.g. constraining A to be lower triangular (making it the Cholesky factor of AT A), and by making B strictly upper triangular; in practice, however, simply using general A and B matrices has little impact upon the performance of the method. The parameterization does notably raise additional complications when dealing with convolutional layers, but we defer this discussion to Section 4.2.

3.4 Computing the network ﬁxed point

Given the mon DEQ formulation, the ﬁrst natural question to ask is: how should we compute the equilibrium point z? = σ(Wz? + Ux + b)? Crucially, it can be the case that the simple forward iteration of the network equation (1) does not converge, i.e., the iteration may be unstable. Fortunately, monotone operator splitting leads to a number of iterative methods for ﬁnding these ﬁxed points, which are guaranteed to converge under proper conditions. For example, the forward-backward iteration applied to the monotone operator formulation from Theorem 1 results exactly in a damped version of the forward iteration

zk+1 = prox

f (zk ((I W)zk (Ux + b))) = prox

f ((1 )zk + (Wzk + Ux + b)). (6)

This iteration is guaranteed to converge linearly to the ﬁxed point z? provided that 2m/L2, when the operator I W is Lipschitz and strongly monotone with parameters L (which is simply the operator norm k I Wk2) and m [26].

A key advantage of the mon DEQ formulation is the ﬂexibility to employ alternative operator splitting methods that converge much more quickly to the equilibrium. One such example is Peaceman Rachford splitting which, when applied to the formulation from Theorem 1, takes the form

uk+1/2 = 2zk uk

zk+1/2 = (I + (I W)) 1(uk+1/2 (Ux + b))

uk+1 = 2zk+1/2 uk+1/2

zk+1 = prox

where we use the explicit form of the resolvents for the two monotone operators of the model. The advantage of Peaceman-Rachford splitting over forward-backward is two-fold: 1) it typically converges in fewer iterations, which is a key bottleneck for many implicit models; and 2) it converges

3For non-symmetric matrices, which of course is typically the case with W, positive deﬁniteness is deﬁned as the positive deﬁniteness of the symmetric component I W m I , I (W + W T )/2 m I.

Algorithm 1 Forward-backward equilibrium solving

z := 0; err := 1 while err > do

z+ := (1 )z + (Wz + Ux + b) z+ := prox

err := kz+ zk2

kz+k2 z := z+

Algorithm 2 Peaceman-Rachford equilibrium solving

z, u := 0; err := 1; V := (I + (I W)) 1

while err > do

u1/2 := 2z u z1/2 := V (u1/2 + (Ux + b)) u+ := 2z1/2 u1/2

err := kz+ zk2

kz+k2 z, u := z+, u+

for any > 0 [26], unlike forward-backward splitting which is dependent on the Lipschitz constant of I W. The disadvantage of Peaceman-Rachford splitting, however, is that it requires an inverse involving the weight matrix W. It is not immediately clear how to apply such methods if the W matrix involves convolutions or multi-layer models; we discuss these points in Section 4.2. A summary of these methods for computation of the forward equilibrium point is given in Algorithms 1 and 2.

3.5 Backprogation through the monotone operator layer

Finally, a key challenge for any implicit model is to determine how to perform backpropagation through the layer. As with most implicit models, a potential beneﬁt of the ﬁxed-point conditions we describe is that, by using the implicit function theorem, it is possible to perform backpropagation without storing the intermediate iterates of the operator splitting algorithm in memory, and instead backpropagating directly through the equilibrium point.

To begin, we present a standard approach to differentiating through the ﬁxed point z? using the implicit function theorem. This formulation has some compelling properties for mon DEQ, namely the fact that this (sub)gradient will always exist. When training a network via gradient descent, we need to compute the gradients of the loss function

where ( ) denotes some input to the layer or parameters, i.e. W, x, etc. The challenge here is computing (or left-multiplying by) the Jacobian @z?/@( ), since z? is not an explicit function of the inputs. While it would be possible to simply compute gradients through the unrolled updates, e.g. zk+1 = σ(Wzk + Ux + b) for forward iteration, this would require storing each intermediate state zk, a potentially memory-intensive operation. Instead, the following theorem gives an explicit formula for the necessary (sub)gradients. We state the theorem more directly in terms of the operators mentioned Theorem 1; that is, we use prox1

f( ) in place of σ( ).

Theorem 2. For the equilibrium point z? = prox1

f(Wz? + Ux + b), we have

@z? (I JW) 1J @(Wz? + Ux + b)

J = D prox1

f(Wz? + Ux + b) (10)

denotes the Clarke generalized Jacobian of the nonlinearity evaluated at the point Wz? + Ux + b. Furthermore, for the case that (I W) m I, this derivative always exists.

To apply the theorem in practice to perform reverse-mode differentiation, we need to solve the system

The above system is a linear equation and while it is typically computationally infeasible to compute the inverse (I JW) T exactly, we could compute a solution to (I JW) T v using, e.g., conjugate gradient methods. However, we present an alternative formulation to computing (11) as the solution to a (linear) monotone operator splitting problem:

Algorithm 3 Forward-backward equilibrium backpropagation

u := 0; err := 1; v := @ @z while err > do

u+ := (1 )u + W T u

i + vi 1+ (1+Dii) if Dii < 1 0 if Dii = 1

err := ku+ uk2

ku+k2 u := u+

Algorithm 4 Peaceman-Rachford equilibrium backpropagation

z, u := 0; err := 1; v := @ @z ; V := (I + (I W)) 1

while err > do

u1/2 := 2z u z1/2 := V T u1/2

u+ := 2z1/2 u1/2

i + vi 1+ (1+Dii) if Dii < 1 0 if Dii = 1

err := kz+ zk2

kz+k2 z, u := z+, u+

Theorem 3. Let z? be a solution to the monotone operator splitting problem deﬁned in Theorem 1, and deﬁne J as in (10). Then for v 2 Rn the solution of the equation

u? = (I JW) T v (12) is given by

u? = v + W T u? (13)

where u? is a zero of the operator splitting problem 0 2 ( F + G)(u?), with operators deﬁned as

F( u) = (I W T )( u), G( u) = D u v (14) where D is a diagonal matrix deﬁned by J = (I + D) 1 (where we allow for the possibility of Dii = 1 for Jii = 0).

An advantage of this approach when using Peaceman-Rachford splitting is that it allows us to reuse a fast method for multiplying by (I + (I W)) 1 which is required by Peaceman-Rachford during both the forward pass (equilibrium solving) and backward pass (backpropagation) of training a mon DEQ. Algorithms detailing both the Peaceman-Rachford and forward-backward solvers for the backpropagation problem (14) are given in Algorithms 3 and 4.

4 Example monotone operator networks

With the basic foundations from the previous section, we now highlight several different instantiations of the mon DEQ architecture. In each of these settings, as in Theorem 1, we will formulate the objective as one of ﬁnding a solution to the operator splitting problem 0 2 (F + G)(z?) for

F(z) = (I W)(z) (Ux + b), G = @f (15) or equivalently as computing an equilibrium point z? = prox1

f(Wz? + Ux + b).

In each of these settings we need to deﬁne what the input and hidden state x and z correspond to, what the W and U operators consist of, and what is the function f which determines the network nonlinearity. Key to the application of monotone operator methods are that 1) we need to constrain the W matrix such that I W m I as described in the previous section and 2) we need a method to compute (or solve) the inverse (I + (I W)) 1, needed e.g. for Peaceman-Rachford; while this would not be needed if using only forward-backward splitting, we believe that the full power of the monotone operator view is realized precisely when these more involved methods are possible.

4.1 Fully connected networks

The simplest setting, of course, is the case we have largely highlighted above, where x 2 Rd and z 2 Rn are unstructured vectors, and W 2 Rn n and U 2 Rn d and b 2 Rn are dense matrices and vectors respectively. As indicated above, we parameterize W directly by A, B 2 Rn n as in (5). Since the Ux term simply acts as a bias in the iteration, there is no constraint on the form of U.

We can form an inverse directly by simply forming and inverting the matrix I + (I W), which has cost O(n3). Note that this inverse needs to be formed only once, and can be reused over all iterations of the operator splitting method and over an entire batch of examples (but recomputed, of course, when W changes). Any proximal function can be used as the activation: for example the Re LU, though as mentioned there are also close approximations to the sigmoid, tanh, and softplus.

4.2 Convolutional networks

The real power of mon DEQs comes with the ability to use more structured linear operators such as convolutions. We let x 2 Rds2 be a d-channel input of size s s and z 2 Rns2 be a n-channel hidden layer. We also let W 2 Rns2 ns2 denote the linear form of a 2D convolutional operator and similarly for U 2 Rns2 ds2. As above, W is parameterized by two additional convolutional operators A, B of the same form as W. Note that this implicitly increases the receptive ﬁeld size of W: if A and B are 3 3 convolutions, then W = (1 m)I AT A + B BT will have an effective kernel size of 5.

Inversion The beneﬁt of convolutional operators in this setting is the ability to perform efﬁcient inversion via the fast Fourier transform. Speciﬁcally, in the case that A and B represent circular convolutions, we can reduce the matrices to block-diagonal form via the discrete Fourier transform (DFT) matrix

where Fs denotes (a permuted form of) the 2D DFT operator and DA 2 Cns2 ns2 is a (complex) block diagonal matrix where each block DAii 2 Cn n corresponds to the DFT at one particular location in the image. In this form, we can efﬁciently multiply by the inverse of the convolutional operator, noting that

I + (I W) = (1 + m)I + AT A B + BT

= Fs((1 + m)I + D

The inner term here is itself a block diagonal matrix with complex n n blocks (each block is also guaranteed to be invertible by the same logic as for the full matrix). Thus, we can multiply a set of hidden units z by the inverse of this matrix by simply inverting each n n block, taking the fast Fourier transform (FFT) of z, multiplying each corresponding block of Fsz by the corresponding inverse, then taking the inverse FFT. The details are given in Appendix C.

The computational cost of multiplying by this inverse is O(n2s2 log s + n3s2) to compute the FFT of each convolutional ﬁlter and precompute the inverses, and then O(bns2 log s + bn2s2) to multiply by the inverses for a set of hidden units with a minibatch of size b. Note that just computing the convolutions in a normal manner has cost O(bn2s2), so that these computations are on the same order as performing typical forward passes through a network, though empirically 2-3 times slower owing to the relative complexity of performing the necessary FFTs.

One drawback of using the FFT in this manner is that it requires that all convolutions be circular; however, this circular dependence can be avoided using zero-padding, as detailed in Section C.2.

4.3 Forward multi-tier networks

Although a single fully-connected or convolutional operator within a mon DEQ can be suitable for small-scale problems, in typical deep learning settings it is common to model hidden units at different hierarchical levels. While mon DEQs may seem inherently single-layer, we can model this same hierarchical structure by incorporating structure into the W matrix. For example, assuming a convolutional setting, with input x 2 Rds2 as in the previous section, we could partition z into L different hierarchical resolutions and let W have a multi-tier structure, e.g.

1 z2 2 Rn2s2

2 ... z L 2 Rn Ls2

W11 0 0 0 W21 W22 0 0 0 W32 W33 0 ... ... ... ... ... 0 0 0 WLL

where zi denotes the hidden units at level i, an si si resolution hidden unit with ni channels, and where Wii denotes an ni channel to ni channel convolution, and Wi+1,i denotes an ni to ni+1 channel, strided convolution. This structure of W allows for both interand intra-tier inﬂuence.

One challenge is to ensure that we can represent W with the form (1 m)I AT A + B BT while still maintaining the above structure, which we achieve by parameterizing each Wij block appropriately. Another consideration is the inversion of the multi-tier operator, which can be achieved via the FFT similarly as for single-convolutional W, but with additional complexity arising from the fact that the Ai+1,i convolutions are strided. These details are described in Appendix D.

Method Model size Acc.

Neural ODE 172K 55.3 0.3% Aug. Neural ODE 172K 58.9 2.8% Neural ODE 1M 59.9% Aug. Neural ODE 1M 73.4%

mon DEQ (ours)

Single conv 172K 74.0 0.1% Multi-tier 170K 72.0 0.3% Single conv 854K 82.0 0.3% Multi-tier 1M 89.0 0.3%

Method Model size Acc.

Neural ODE 172K 81.0% Aug. Neural ODE 172K 83.5%

mon DEQ (ours)

Single conv 172K 88.7 1.1% Multi-tier 170K 92.4 0.1%

Method Model size Acc.

Neural ODE 84K 96.4% Aug. Neural ODE 84K 98.2%

mon DEQ (ours)

Fully connected 84K 98.1 0.1% Single conv 84K 99.1 0.1% Multi-tier 81K 99.0 0.1%

Table 1: Test accuracy of mon DEQ models compared to Neural ODEs and Augmented Neural ODEs. *with data augmentation; best test accuracy before training diverges; as reported in [10].

1 10 20 30 40 ESRFhs

% 7est DFFur DFy

6ing Oe FRnv 0u Oti-tier

Figure 1: Test accuracy of mon DEQs during training on CIFAR-10, with NODE [8] and ANODE [10] for comparison. NODE and ANODE curves obtained using code provided by [10].

0 20 40 60 Epo Fhs

Forward-pass iterations

α = 1 α tuned

Figure 2: Iterations required by Peaceman-Rachford equilibrium solving over the course of training, for best and = 1.

5 Experiments

To test the expressive power and training stability of mon DEQs, we evaluate the mon DEQ instantiations described in Section 4 on several image classiﬁcation benchmarks. We take as a point of comparison the Neural ODE (NODE) [8] and Augmented Neural ODE (ANODE) [10] models, the only other implicit-depth models which guarantee the existence and uniqueness of a solution. We also assess the stability of training standard DEQs of the same form as our mon DEQs.

The training process relies upon the operator splitting algorithms derived in Sections 3.4 and 3.5; for each batch of examples, the forward pass of the network involves ﬁnding the network ﬁxed point (Algorithm 1 or 2), and the backward pass involves backpropagating the loss gradient through the ﬁxed point (Algorithm 3 or 4). We analyze the convergence properties of both the forward-backward and Peaceman-Rachford operator splitting methods, and use the more efﬁcient Peaceman-Rachford splitting for our model training. For further training details and model architectures see Appendix E. Experiment code can be found at http://github.com/locuslab/monotone_op_net.

Performance on image benchmarks We train small mon DEQs on CIFAR-10 [17], SVHN [22], and MNIST [18], with a similar number of parameters as the ODE-based models reported in [8] and [10]. The results (averages over three runs) are shown in Table 1. Training curves for mon DEQs, NODE, and ANODE on CIFAR-10 are show in Figure (1) and additional training curves are shown in Figure F1. Notably, except for the fully-connected model on MNIST, all mon DEQs signiﬁcantly outperform the ODE-based models across datasets. We highlight the performance of the small single convolution mon DEQ on CIFAR-10 which outperforms Augmented Neural ODE by 15.1%.

0 50 100 150 200 250 Iterati Rns

FRrward-ba Fkward

0 50 100 150 200 250 Iterati Rns

3ea Fe Pan-5a Fh IRrd

α 1 α 1/4 α 1/8 α 1/16 α 1/32 α 1/64

Figure 3: Convergence of Peaceman-Rachford and forward-backward equilibrium solving, on fully-trained model.

We also attempt to train standard DEQs of the same structure as our small multi-tier convolutional mon DEQ. We train DEQs both with unconstrained W and with W having the monotone parameterization (5), and solve for the ﬁxed point using Broyden s method as in [5]. All models quickly diverge during the ﬁrst few epochs of training, even when allowed 300 iterations of Broyden s method.

Additionally, we train two larger mon DEQs on CIFAR-10 with data augmentation. The strong performance (89% test accuracy) of the multi-tier network, in particular, goes a long way towards closing the performance gap with traditional deep networks. For comparison, we train larger NODE and ANODE models with a comparable number of parameters (~1M). These attain higher test accuracy than the smaller models during training, but diverge after 10-30 epochs (see Figure F1).

Efﬁciency of operator splitting methods We compare the convergence rates of Peaceman Rachford and forward-backward splitting on a fully trained model, using a large multi-tier mon DEQ trained on CIFAR-10. Figure 3 shows convergence for both methods during the forward pass, for a range of . As the theory suggests, the convergence rates depend strongly on the choice of . Forward-backward does not converge for > 0.125, but convergence speed varies inversely with for < 0.125. In contrast, Peaceman-Rachford is guaranteed to converge for any > 0 but the dependence is non-monotonic. We see that, for the optimal choice of , Peaceman-Rachford can converge much more quickly than forward-backward. The convergence rate also depends on the Lipschitz parameter L of I W, which we observe increases during training. Peaceman-Rachford therefore requires an increasing number of iterations during both the forward pass (Figure 2) and backward pass (Figure F2).

Finally, we compare the efﬁciency of mon DEQ to that of the ODE-based models. We report the time and number of function evaluations (OED solver steps or operator splitting iterations) required by the ~170k-parameter models to train on CIFAR-10 for 40 epochs. The mon DEQ, neural ODE, and ANODE training takes respectively 1.4, 4.4, and 3.3 hours, with an average of 20, 96, and 90 function evals per minibatch. Note however that training the larger 1M-parameter mon DEQ on CIFAR-10 requires 65 epochs and takes 16 hours. All experiments are run on a single RTX 2080 Ti GPU.

6 Conclusion

The connection between monotone operator splitting and implicit network equilibria brings a new suite of tools to the study of implicit-depth networks. The strong performance, efﬁciency, and guaranteed stability of mon DEQ indicate that such networks could become practical alternatives to deep networks, while the ﬂexibility of the framework means that performance can likely be further improved by, e.g. imposing additional structure on W or employing other operator splitting methods. At the same time, we see potential for the study of mon DEQs to inform traditional deep learning itself. The guarantees we can derive about what architectures and algorithms work for implicit-depth networks may give us insights into what will work for explicit deep networks.

Broader impact statement

While the main thrust of our work is foundational in nature, we do demonstrate the potential for implicit models to become practical alternatives to traditional deep networks. Owing to their improved memory efﬁciency, these networks have the potential to further applications of AI methods on edge devices, where they are currently largely impractical. However, the work is still largely algorithmic in nature, and thus it is much less clear the immediate societal-level beneﬁts (or harms) that could result from the speciﬁc tehniques we propose and demonstrate in this paper.

Acknowledgements

Ezra Winston is supported by a grant from the Bosch Center for Artiﬁcial Intelligence.

[1] A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter. Differentiable convex

optimization layers. In Neural Information Processing Systems, 2019.

[2] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial

environment. In Artiﬁcial Neural Networks. 1990.

[3] B. Amos and J. Z. Kolter. Opt Net: Differentiable optimization as a layer in neural networks. In

International Conference on Machine Learning, 2017.

[4] B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable mpc for end-to-end

planning and control. In Advances in Neural Information Processing Systems, pages 8299 8310, 2018.

[5] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information

Processing Systems, pages 688 699, 2019.

[6] H. H. Bauschke, P. L. Combettes, et al. Convex analysis and monotone operator theory in

Hilbert spaces, volume 408. Springer, 2011.

[7] A. Bibi, B. Ghanem, V. Koltun, and R. Ranftl. Deep layers as stochastic solvers. In International

Conference on Learning Representations, 2019. URL https://openreview.net/forum? id=ryxx Ci Rq YX.

[8] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential

equations. In Advances in neural information processing systems, pages 6571 6583, 2018.

[9] P. L. Combettes and J.-C. Pesquet. Deep neural network structures solving variational inequali-

ties. Set-Valued and Variational Analysis, pages 1 28, 2020.

[10] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural odes. In Advances in Neural

Information Processing Systems, pages 3134 3144, 2019.

[11] L. El Ghaoui, F. Gu, B. Travacca, and A. Askari. Implicit deep learning. ar Xiv preprint

ar Xiv:1908.06315, 2019.

[12] S. Gould, B. Fernando, A. Cherian, P. Anderson, R. S. Cruz, and E. Guo. On differentiating

parameterized argmin and argmax problems with application to bi-level optimization. ar Xiv preprint ar Xiv:1607.05447, 2016.

[13] S. Gould, R. Hartley, and D. Campbell. Deep declarative networks: A new hope. ar Xiv preprint

ar Xiv:1909.04866, 2019.

[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):

1735 1780, 1997.

[15] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical

models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, 2016.

[16] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and

Y. Le Cun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http: //arxiv.org/abs/1412.6980.

[17] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical

report, University of Toronto, 2009. [18] Y. Le Cun, C. Cortes, and C. Burges. Mnist handwritten digit database. ATT Labs [Online].

Available: http://yann.lecun.com/exdb/mnist, 2, 2010. [19] R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel. Reviving

and improving recurrent back-propagation. In International Conference on Machine Learning, pages 3082 3091, 2018. [20] C. K. Ling, F. Fang, and J. Z. Kolter. What game are we playing? End-to-end learning in normal

and extensive form games. In International Joint Conference on AI, 2018. [21] D. Linsley, A. K. Ashok, L. N. Govindarajan, R. Liu, and T. Serre. Stable and expressive

recurrent vision models, 2020. [22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in nat-

ural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. [23] F. J. Pineda. Generalization of back propagation to recurrent and higher order neural networks.

In Advances in Neural Information Processing Systems, 1988. [24] A. Rajeswaran, C. Finn, S. M. Kakade, and S. Levine. Meta-learning with implicit gradients. In

Advances in Neural Information Processing Systems, 2019. [25] M. Revay and I. R. Manchester. Contracting implicit recurrent neural networks: Stable models

with improved trainability. ar Xiv preprint ar Xiv:1912.10402, 2019. [26] E. K. Ryu and S. Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):

3 43, 2016. [27] L. N. Smith and N. Topin. Super-convergence: Very fast training of neural networks using large

learning rates. In Artiﬁcial Intelligence and Machine Learning for Multi-Domain Operations Applications, volume 11006, page 1100612. International Society for Optics and Photonics,