# Multiscale Deep Equilibrium Models

Shaojie Bai (Carnegie Mellon University), Vladlen Koltun (Intel Labs), J. Zico Kolter (Carnegie Mellon University, Bosch Center for AI)

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

**Abstract.** We propose a new class of implicit networks, the multiscale deep equilibrium model (MDEQ), suited to large-scale and highly hierarchical pattern recognition domains. An MDEQ directly solves for and backpropagates through the equilibrium points of multiple feature resolutions simultaneously, using implicit differentiation to avoid storing intermediate states (and thus requiring only O(1) memory consumption). These simultaneously learned multi-resolution features allow us to train a single model on a diverse set of tasks and loss functions, such as using a single MDEQ to perform both image classification and semantic segmentation. We illustrate the effectiveness of this approach on two large-scale vision tasks: ImageNet classification and semantic segmentation on high-resolution images from the Cityscapes dataset. In both settings, MDEQs match or exceed the performance of recent competitive computer vision models: the first time such performance and scale have been achieved by an implicit deep learning approach. The code and pre-trained models are available at https://github.com/locuslab/mdeq.

## 1 Introduction

State-of-the-art pattern recognition systems in domains such as computer vision and audio processing are almost universally based on multi-layer hierarchical feature extractors [33, 35, 36]. These models are structured in stages: the input is processed via a number of consecutive blocks, each operating at a different resolution [32, 54, 51, 26]. The architectures explicitly express hierarchical structure, with up- and downsampling layers that transition between consecutive blocks operating at different scales. An important motivation for such designs is the prominent multiscale structure and extremely high signal dimensionality in these domains. A typical image, for instance, contains millions of pixels, which must be processed coherently by the model.

An alternative approach to differentiable modeling is exemplified by recent progress on implicit deep networks, such as Neural ODEs (NODEs) [12] and deep equilibrium models (DEQs) [5]. These constructions replace explicit, deeply stacked layers with analytical conditions that the model must satisfy, and are able to simulate models with infinite depth within a constant memory footprint. A notable achievement for implicit modeling is its successful application to large-scale sequences in natural language processing [5].

Is implicit deep learning relevant for general pattern recognition tasks? One clear challenge is that implicit networks do away with flexible layers and stages. It is therefore not clear whether they can appropriately model multiscale structure, which appears essential to high discriminative power in some domains. This is the challenge that motivates our work. Can implicit models that forego deep sequences of layers and stages attain competitive accuracy in domains characterized by rich multiscale structure, such as computer vision?

To address this challenge, we introduce a new class of implicit networks: the multiscale deep equilibrium model (MDEQ). It is inspired by DEQs, which attained high accuracy in sequence modeling [5]. We expand upon the DEQ construction substantially to introduce simultaneous equilibrium modeling
of multiple signal resolutions. MDEQ solves for equilibria of multiple resolution streams simultaneously by directly optimizing for stable representations at all feature scales at once. Unlike standard explicit deep networks, MDEQ does not process different resolutions in succession, with higher resolutions flowing into lower ones or vice versa. Rather, the different feature scales are maintained side by side in a single shallow model that is driven to equilibrium.

This design brings two major advantages. First, like the basic DEQ, our model does not require backpropagation through an explicit stack of layers and has an O(1) memory footprint during training. This is especially important as pattern recognition systems are memory-intensive. Second, MDEQ rectifies one of the drawbacks of DEQ by exposing multiple feature scales at equilibrium, thereby providing natural interfaces for auxiliary losses and for compound training procedures such as pretraining (e.g., on ImageNet) and fine-tuning (e.g., on segmentation or detection tasks). Multiscale modeling enables a single MDEQ to be trained simultaneously with multiple losses defined on potentially very different scales, and its equilibrium features can serve as heads for a variety of tasks.

We demonstrate the effectiveness of MDEQ via extensive experiments on large-scale image classification and semantic segmentation datasets. Remarkably, this shallow implicit model attains accuracy comparable to state-of-the-art, deeply stacked explicit models. On ImageNet classification, MDEQs outperform baseline ResNets (e.g., ResNet-101) with similar parameter counts, reaching 77.5% top-1 accuracy. On Cityscapes semantic segmentation (dense labeling of 2-megapixel images), MDEQs identical to those used for the ImageNet experiments match the performance of recent explicit models while consuming much less memory. Our largest MDEQ surpasses 80% mIoU on the Cityscapes validation set, outperforming strong convolutional networks and coming tantalizingly close to the state of the art. This is by far the largest-scale application of implicit deep learning to date and a remarkable result for a class of models that until recently were applied largely to toy domains.

## 2 Background

Implicit Deep Learning. Virtually all modern deep learning approaches use explicit models, which provide explicit computation graphs for forward propagation. Backward passes proceed in reverse order through the same graph. This approach is the core of popular deep learning frameworks [1] and is associated with the very concept of an architecture. In contrast, implicit models do not have prescribed computation graphs. They instead posit a specific criterion that the model must satisfy (e.g., the endpoint of an ODE flow, or the root of an equation). Importantly, the algorithm that drives the model to fulfill this criterion is not prescribed. Therefore, implicit models can leverage black-box solvers in their forward passes and enjoy analytical backward passes that are independent of the forward-pass trajectory.

Implicit modeling of hidden states has been explored by the deep learning community for decades. Pineda [43] and Almeida [2] studied implicit differentiation techniques for training recurrent dynamics, also known as recurrent back-propagation (RBP) [37]. Implicit approaches to network design have recently attracted renewed interest [20, 24].
For example, Neural ODEs (NODEs) [12, 18] model a recursive residual block using implicit ODE solvers, equivalent to a continuous ResNet taking infinitesimal steps. Deep equilibrium models (DEQs) [5] solve for the fixed point of a sequence model with black-box root-finding methods, equivalent to finding the limit state of an infinite-layer network. Other instantiations of implicit modeling include optimization layers [17, 3], differentiable physics engines [14, 45], logical structure learning [58], and continuous generative models [25].

Our work takes the deep equilibrium approach [5] into signal domains characterized by rich multiscale structure. We develop the first one-layer implicit deep model that scales to realistic visual tasks (e.g., megapixel-level images) and achieves competitive results in these regimes. In comparison, ODE-based models have so far only been applied to relatively low-dimensional signals due to numerical instability. For example, Chen et al. [12] downsampled 28×28 MNIST images to 7×7 before feeding them to Neural ODEs. More broadly, our work can be seen as a new perspective on implicit models, wherein the model defines and optimizes simultaneous criteria over multiple data streams that can have different dimensionalities. While DEQs and NODEs have so far been defined on a single stream of features, a single MDEQ can jointly optimize features for different tasks, such as image segmentation and classification.

Multiscale Modeling in Computer Vision. Computer vision is a canonical application domain for hierarchical multiscale modeling. The field has come to be dominated by deep convolutional networks [33, 32]. Computer vision problems can be viewed in terms of the granularity of the desired output: from low-resolution, such as a label for a whole image [16], to high-resolution output that assigns a label to each pixel, as in semantic segmentation [49, 11, 61, 64]. State-of-the-art models for these problems are explicitly structured into sequential stages of processing that operate at different resolutions [32, 54, 51, 26]. For example, a ResNet [26] typically consists of 4-6 sequential stages, each operating at half the resolution of the preceding one. A dilated ResNet [62] uses a different schedule for the progression of resolutions. A DenseNet [27] uses different connectivity patterns to carry information between layers, but shares the overarching structure: a sequence of stages. Other designs progressively decrease feature resolution and then increase it step by step [46]. Downsampling and upsampling can also be repeated, again in an explicitly choreographed sequence [42, 53].

Multiscale modeling has been a central motif in computer vision. The Laplacian pyramid is an influential early example [7]. Multiscale processing has been integrated with convolutional networks for scene parsing by Farabet et al. [21] and has been explicitly addressed in many subsequent architectures [49, 11, 61, 8, 38, 64, 28, 10, 57].

Our work brings multiscale modeling to implicit deep networks. MDEQ has in essence only one stage, in which the different resolutions coexist side by side. The input is injected at the highest resolution and then propagated implicitly to the other scales, which are optimized simultaneously by a (black-box) solver that drives them to satisfy a joint equilibrium condition. Just like DEQs, an MDEQ is able to represent an infinitely deep network with only a constant memory cost.
## 3 Multiscale Deep Equilibrium Models

We begin by briefly summarizing the basic DEQ construction and some major challenges that arise when extending it to computer vision.

### 3.1 Deep Equilibrium (DEQ): Generic Formulation

One of the core ideas that motivated the DEQ approach was weight-tying: the same set of parameters can be shared across the layers of a deep network. Formally, Bai et al. [5] formulated an L-layer weight-tied transformation with parameters $\theta$ on hidden state $z$ as

$$z^{[i+1]} = f_\theta(z^{[i]}; x), \quad i = 0, \ldots, L-1 \tag{1}$$

where the input representation $x$ is injected into each layer. When sufficient stability conditions are ensured, stacking such layers infinitely (i.e., $L \to \infty$) essentially performs a fixed-point iteration and thus tends to an equilibrium $z^\star = f_\theta(z^\star; x)$. Intuitively, as we iterate the transformation $f_\theta$, the hidden representation converges to a stable state $z^\star$.

This construction has a number of appealing properties. First, we can directly solve for the fixed point, which can be done faster than explicitly iterating through the layers. We formulate this as a root-finding problem:

$$g_\theta(z; x) := f_\theta(z; x) - z = 0 \quad \Longrightarrow \quad z^\star = \mathrm{RootFind}(g_\theta; x) \tag{2}$$

For example, one can leverage Newton or quasi-Newton methods to achieve quadratic or superlinear convergence to the root. Second, one can directly backpropagate through the equilibrium state using the Jacobian of $g_\theta$ at $z^\star$, without tracing through the forward root-finding process. Formally, given a loss $\ell = \mathcal{L}(z^\star, y)$ (where $y$ is the target), the gradients can be written as

$$\frac{\partial \ell}{\partial \theta} = -\frac{\partial \ell}{\partial z^\star} \, J_{g_\theta}^{-1}\Big|_{z^\star} \, \frac{\partial f_\theta(z^\star; x)}{\partial \theta}, \qquad \frac{\partial \ell}{\partial x} = -\frac{\partial \ell}{\partial z^\star} \, J_{g_\theta}^{-1}\Big|_{z^\star} \, \frac{\partial f_\theta(z^\star; x)}{\partial x} \tag{3}$$

See Bai et al. [5] for the proof, which is based on the implicit function theorem [30]. This means that the forward pass of a DEQ can rely on any black-box root solver, while the backward pass differentiates through only one layer (or block) at the equilibrium (i.e., $\partial f_\theta(z^\star; x)/\partial(\cdot)$). The memory consumption of the entire training process is therefore equivalent to that of just one block rather than L blocks. Since the Jacobian of $g_\theta$ can be expensive to compute and invert, DEQs instead solve the linear equation

$$u^\top \big(J_{g_\theta}\big|_{z^\star}\big) + \frac{\partial \ell}{\partial z^\star} = 0 \tag{4}$$

which involves only much cheaper vector-Jacobian products. The DEQ model therefore solves for the network output at its infinite depth, with each step of the model now implicitly defined to reach an analytical objective (the equilibrium).
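The generic DEQ recipe summarized in Eqs. (1)-(4) can be sketched compactly in PyTorch. The snippet below is an illustrative sketch, not the authors' released implementation: it uses plain fixed-point iteration in place of Broyden's method, and the names (`f_theta`, `z0`, the iteration limits) are assumptions made for the example.

```python
# Minimal sketch of a DEQ forward/backward pass (Eqs. (1)-(4)); illustrative only.
import torch


def fixed_point_solve(f, z0, max_iter=50, tol=1e-4):
    """Naive fixed-point iteration toward z* = f(z*); a quasi-Newton
    root solver would normally be used here instead."""
    z = z0
    for _ in range(max_iter):
        z_next = f(z)
        if (z_next - z).norm() / (z.norm() + 1e-8) < tol:
            return z_next
        z = z_next
    return z


def deq_layer(f_theta, x, z0):
    """Solve for the equilibrium without autograd, then attach the graph of a
    single call to f_theta and correct its incoming gradient via Eq. (4)."""
    with torch.no_grad():
        z_star = fixed_point_solve(lambda z: f_theta(z, x), z0)

    z_star = f_theta(z_star, x)            # one differentiable evaluation at the equilibrium

    # A second, detached evaluation used only for vector-Jacobian products.
    z0_det = z_star.detach().requires_grad_()
    f0 = f_theta(z0_det, x)

    def backward_hook(grad):
        # Solve u = u (df/dz*) + dl/dz*, i.e. the linear system
        # u^T J_{g_theta}|_{z*} + dl/dz* = 0 from Eq. (4), again by fixed-point iteration.
        u = grad
        for _ in range(50):
            u = torch.autograd.grad(f0, z0_det, u, retain_graph=True)[0] + grad
        return u

    z_star.register_hook(backward_hook)
    return z_star
```

Because the hook only corrects the gradient flowing into a single evaluation of $f_\theta$, the memory cost of training is that of one block, regardless of how many solver iterations the forward pass takes.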
Challenges. The construction of Bai et al. [5], which we have just summarized, was primarily aimed at processing sequences. As we transition from sequences to high-resolution images, we note important differences between these domains.

First, unlike typical autoregressive sequence learning problems (e.g., language modeling), where input and output have identical length and dimensionality, general pattern recognition systems (such as those in vision) entail multi-stage modeling via a combination of up- and downsampling in the architecture. The basic DEQ construction does not exhibit such structure. Second, the output of a computer vision task such as image classification (a label) or object localization (a region) may have very different dimensionality from the input (a full image): again a feature that the basic DEQ does not support. Third, state-of-the-art models for tasks such as semantic segmentation are commonly based on backbones that are pretrained for image classification, even though the tasks are structurally different and their outputs have very different dimensionalities (e.g., one label for the whole image versus a label for each pixel). It is not clear how a DEQ construction can support such transfer. Fourth, whereas past work on DEQs could leverage state-of-the-art weight-tied architectures for sequence modeling as the basis for the transformation $f_\theta$ [4, 15], no such counterparts exist in state-of-the-art computer vision modeling.

### 3.2 The MDEQ Model

Notation. Figure 1 illustrates the overall structure of MDEQ. As before, $f_\theta$ denotes the transformation that is (implicitly) iterated to a fixed point, $x$ is the (precomputed) input representation provided to $f_\theta$, and $z$ is the model's internal state. We omit the batch dimension for clarity.

Figure 1: The structure of a multiscale deep equilibrium model (MDEQ). All components of the model are shown in this figure. MDEQ consists of a transformation $f_\theta$ that is driven to equilibrium: features at $n$ resolutions $z^{[i]} = [z_1^{[i]}, \ldots, z_n^{[i]}]$ (with $z_i$ of shape $H_i \times W_i \times C_i$) coexist side by side, pass through per-resolution residual blocks, and are mixed by a multi-resolution fusion step (strided convolutions for downsampling, interpolation for upsampling). The input injection is applied to the highest resolution only, and a multiscale equilibrium solver drives all scales simultaneously to the equilibrium $z^\star = f_\theta(z^\star; x)$.

Transformation $f_\theta$. The central part of MDEQ is the transformation $f_\theta$ that is driven to equilibrium. We use a simple design in which features at each resolution are first taken through a residual block. The blocks are shallow and identical in structure. At resolution $i$, the residual block receives the internal state $z_i$ and outputs a transformed feature tensor $z_i^+$ at the same resolution. Notably, the highest-resolution stream (i.e., $i = 1$) also receives an input injection $x$ that is precomputed directly from the source image and injected into the highest-resolution residual block. (See Eq. (5) and the discussion below.) The internal structure of the residual block is shown in Figure 2. We largely adopt the design of He et al. [26], but use group normalization [59] rather than batch normalization [29], for stability reasons that are discussed in Section 3.3. The residual block at resolution $i$ can be formally expressed as

$$\tilde{z}_i = \mathrm{GroupNorm}(\mathrm{Conv2d}(z_i)), \qquad \hat{z}_i = \mathrm{GroupNorm}(\mathrm{Conv2d}(\mathrm{ReLU}(\tilde{z}_i))) + \mathbb{1}_{\{i=1\}}\, x, \qquad z_i^+ = \mathrm{GroupNorm}(\mathrm{ReLU}(\hat{z}_i + z_i)) \tag{5}$$

Figure 2: The residual block used in MDEQ: two 3×3 convolutions (stride 1), each followed by GroupNorm, with a residual connection and an input injection (only if i = 1). An MDEQ contains only one such layer.

Following these blocks, the second part of $f_\theta$ is a multi-resolution fusion step that mixes the feature maps across the different scales (see Figure 1). The transformed features $z_i^+$ undergo either upsampling or downsampling from the current scale $i$ to each other scale $j \neq i$. In our construction, downsampling is performed by $j - i$ consecutive 2-strided 3×3 Conv2d layers, whereas upsampling is performed by direct bilinear interpolation. The final output at scale $j$ is formed by summing the transformed feature maps provided by all incoming scales $i$ (along with $z_j^+$); i.e., the output feature tensor at each scale is a mixture of transformed features from all scales. This forces the features at all scales to be consistent and drives the whole system to a coordinated equilibrium that harmonizes the representations across scales.
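The residual block in Eq. (5) can be sketched as a small PyTorch module. The channel width, number of GroupNorm groups, and the assumption that the injection $x$ has already been projected to the block's channel count are illustrative choices here, not the authors' exact configuration.

```python
# Sketch of the MDEQ residual block of Eq. (5); hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MDEQResidualBlock(nn.Module):
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False)
        self.gn1 = nn.GroupNorm(groups, channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1, bias=False)
        self.gn2 = nn.GroupNorm(groups, channels)
        self.gn3 = nn.GroupNorm(groups, channels)

    def forward(self, z, x_injection=None):
        z_tilde = self.gn1(self.conv1(z))              # GroupNorm(Conv2d(z_i))
        z_hat = self.gn2(self.conv2(F.relu(z_tilde)))  # GroupNorm(Conv2d(ReLU(~z_i)))
        if x_injection is not None:                    # + 1{i=1} x (highest resolution only)
            z_hat = z_hat + x_injection
        return self.gn3(F.relu(z_hat + z))             # z_i^+ = GroupNorm(ReLU(^z_i + z_i))
```

One such block is instantiated per resolution; the fusion step then resamples and sums the block outputs across scales as described above.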
Input Representation. The raw input first goes through a transformation (e.g., a linear layer that aligns the feature channels) to form $x$, which is provided to $f_\theta$. The existence of such an input injection is vital to implicit models, as it (along with $\theta$) ties the flow of the dynamical system to the input. However, unlike multiscale input representations used by some explicit vision architectures [21, 11], we only inject $x$ into the highest-resolution feature stream (see Eq. (5)). The input is provided to MDEQ at a single (full) resolution. The lower resolutions hence start with no knowledge at all about the input; this information only propagates to them implicitly as all scales are gradually driven to coordinated equilibria $z^\star$ by the (black-box) solver.

(Limited-memory) Multiscale Equilibrium Solver. In the DEQ, the internal state is a single tensor $z$ [5]. The MDEQ state, however, is a collection of tensors at $n$ resolutions: $z = [z_1, \ldots, z_n]$. Note that this is not a concatenation, as the different $z_i$ have different dimensionalities, feature resolutions, and semantics. With this in mind, our equilibrium solver leverages Broyden's method. We initialize the internal states by setting $z_i^{[0]} = 0$ for all scales $i$. $z = [z_1, \ldots, z_n]$ is maintained as a collection of $n$ tensors whose respective equilibrium states (i.e., roots) are solved for and backpropagated through simultaneously (with each resolution potentially inducing its own loss). The original Broyden solver was not efficient enough when applied to computer vision datasets, which have very high dimensionality. For example, in the Cityscapes segmentation task (see Section 4), the Jacobian of a 4-resolution MDEQ at $z^\star$ is well over 2,000 times larger than its single-scale counterpart in word-level language modeling [5]. Note that even with the low-rank approximations of the Jacobian used in quasi-Newton methods, the high dimensionality of images can make storing these updates extremely expensive. To address this, we improve the memory efficiency of the forward and backward passes by modifying Broyden's method. We implement a new solver inspired by limited-memory BFGS (L-BFGS) [39], in which we keep only the latest $m$ low-rank updates at any step and discard earlier ones (see Appendix B.1). A sketch of this limited-memory update scheme is given below.
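The following is a simplified, illustrative reimplementation of such a limited-memory Broyden solver for $g(z) = f_\theta(z; x) - z = 0$, written for a single flattened state vector; it is a sketch in the spirit of Appendix B.1, not the authors' exact code. The approximate inverse Jacobian is stored implicitly as $-I + \sum_k u_k v_k^\top$, and only the latest `mem` rank-one updates are kept.

```python
# Limited-memory "good Broyden" root solver (sketch); state is a 1-D tensor.
import torch


def limited_memory_broyden(g, z0, max_iter=60, tol=1e-4, mem=20, alpha=1.0):
    z = z0.clone()
    gz = g(z)
    Us, Vs = [], []                          # rank-one factors of (J^-1 + I)

    def apply_inv_jac(vec):
        out = -vec                           # (-I + sum_k u_k v_k^T) vec
        for u, v in zip(Us, Vs):
            out = out + u * torch.dot(v, vec)
        return out

    def apply_inv_jac_T(vec):
        out = -vec                           # transpose: (-I + sum_k v_k u_k^T) vec
        for u, v in zip(Us, Vs):
            out = out + v * torch.dot(u, vec)
        return out

    for _ in range(max_iter):
        dz = -alpha * apply_inv_jac(gz)      # quasi-Newton step
        z_new = z + dz
        gz_new = g(z_new)
        if gz_new.norm() / (z_new.norm() + 1e-8) < tol:
            return z_new

        # Rank-one "good Broyden" update of the inverse Jacobian estimate.
        dg = gz_new - gz
        Bdg = apply_inv_jac(dg)
        v = apply_inv_jac_T(dz)              # v^T = dz^T J^-1
        Us.append((dz - Bdg) / (torch.dot(v, dg) + 1e-10))
        Vs.append(v)
        if len(Us) > mem:                    # limited memory: drop the oldest update
            Us.pop(0)
            Vs.pop(0)
        z, gz = z_new, gz_new
    return z
```

In practice the multiscale state $z = [z_1, \ldots, z_n]$ would be flattened into one vector before the solve, and the same routine can be reused in the backward pass on the linear system of Eq. (4).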
Pretraining and Auxiliary Losses. Figure 3 provides a comparison of MDEQ with single-stream implicit models such as the DEQ, and with explicit deep networks in computer vision. These different models expose different interfaces that can be used to define losses for different tasks. Prior implicit models such as Neural ODEs and DEQs typically assume that a loss is defined on a single stream of implicit hidden states with a uniform input and output shape (Figure 3b). It is therefore not clear how such a model can be flexibly transferred across structurally different tasks (e.g., pretraining on image classification and fine-tuning on semantic segmentation). Furthermore, there is no natural way to define auxiliary losses [34], because there are no layers and the forward and backward computation trajectories are decoupled. In comparison, MDEQ exposes convenient interfaces to its states at multiple resolutions. One resolution (the highest) can be the same as the resolution of the input, and can be used to define losses for dense prediction tasks such as semantic segmentation. Another resolution (the lowest) can be a vector in which the spatial dimensions are collapsed, and can be used to define losses for image-level labeling tasks such as image classification. This suggests clean protocols for training the same model for different tasks, either jointly (e.g., multi-task learning in which structurally different supervision flows through multiple heads) or in sequence (e.g., pretraining for image classification through one head and fine-tuning for semantic segmentation through another).

Figure 3: A visual comparison of MDEQ with prior implicit models and with standard explicit models in computer vision: (a) MDEQ exposes multiple interfaces at equilibrium, supporting high-resolution losses (e.g., segmentation), low-resolution losses (e.g., classification), and auxiliary losses; (b) single-stream implicit models (e.g., DEQs and NODEs) expose a single stream of hidden states; (c) explicit deep models in vision rely on a pre-trained backbone (e.g., ResNet-101). Equilibrium states at multiple resolutions enable MDEQ to incorporate supervision in different forms.

### 3.3 Integrating Common DL Techniques with MDEQs

MDEQ simulates an infinitely deep network by implicitly modeling one layer. Such implicitness calls for care when adapting common deep learning practices. We provide an exploration of such adaptations and their impact on the training dynamics of MDEQ. We believe these observations will also be valuable for future research on implicit models.

Normalization. Layer normalization of hidden activations in $f_\theta$ played an important role in constraining the output and stabilizing DEQs on sequences [5]. A natural counterpart in vision is batch normalization (BN) [29]. However, BN is not directly suitable for implicit models: it estimates population statistics on a per-layer basis, whereas layers are implicit in our setting, and it scales the Jacobian of the transformation $f_\theta$ poorly, making the fixed point significantly harder to solve for. We therefore use group normalization (GN) [59], which groups the input channels and performs normalization within each group. GN is independent of batch size and offers more natural support for transfer learning (e.g., pretraining and fine-tuning on structurally different tasks). Unlike in DEQs, we keep the learnable affine parameters of GN.

Dropout. The conventional spatial dropout used by explicit vision models applies a random mask to given layers in the network [52]. A new mask is generated whenever dropout is invoked. Such layer-based stochasticity can significantly hurt the stability of convergence to the equilibrium. In fact, as two adjacent calls to $f_\theta$ will most likely have different Bernoulli dropout masks, it is almost impossible to reach a fixed point where $f_\theta(z^\star; x) = z^\star$. We therefore adopt variational dropout [22] and apply the exact same mask at all invocations of $f_\theta$ within a given training iteration; the mask is reset at each training iteration. A minimal sketch of this masking scheme is shown below.
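The following is a minimal sketch (not the authors' exact implementation) of such a variational dropout module: a single Bernoulli mask is sampled once per training iteration and reused at every invocation of $f_\theta$, so the transformation stays deterministic within one equilibrium solve.

```python
# Variational dropout sketch: one mask per iteration, shared across all calls to f_theta.
import torch
import torch.nn as nn


class VariationalDropout2d(nn.Module):
    def __init__(self, p: float = 0.1):
        super().__init__()
        self.p = p
        self.mask = None

    def reset_mask(self, x):
        """Call once per training iteration, before the equilibrium solve."""
        keep = 1.0 - self.p
        # One mask per (batch, channel), shared across spatial positions and
        # across every evaluation of f_theta within this iteration.
        self.mask = x.new_empty(x.size(0), x.size(1), 1, 1).bernoulli_(keep) / keep

    def forward(self, x):
        if not self.training or self.mask is None:
            return x
        return x * self.mask
```

During training, `reset_mask` would be invoked once per iteration on a tensor shaped like the hidden state, before the solver is launched; every subsequent call to $f_\theta$ in that iteration then applies the same mask.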
Nonlinearities. The multiscale features are initialized to $z_i^{[0]} = 0$ for all resolutions $i$. However, we found that this could induce certain instabilities when training MDEQ (especially in its early phase), most likely due to the drastic change of slope of the ReLU nonlinearity at the origin, where the derivative is undefined [23]. To combat this, we replace the last ReLU in both the residual block and the multiscale fusion with a softplus [23] in the initial phase of training; these are later switched back to ReLU. The softplus provides a smooth approximation to the ReLU, but has slope $\frac{1}{1+\exp(-\beta z)} \approx \frac{1}{2}$ around $z = 0$ (where $\beta$ controls the curvature).

Convolution and Convergence to Equilibrium. Whereas the original DEQ model focused primarily on self-attention transformations [56], where all hidden units communicate globally, MDEQ models face additional challenges due to the nature of typical vision models. Specifically, our MDEQ models employ convolutions with small receptive fields (e.g., the two 3×3 convolutional filters in $f_\theta$'s residual block) on potentially very large images: for instance, we eventually evaluate our semantic segmentation model on megapixel-scale images. In consequence, we typically need a larger number of root-finding iterations to converge to an exact equilibrium. While this does pose a challenge, we find that the aforementioned strategies of 1) simultaneous multiscale up- and downsampling and 2) quasi-Newton root-finding drive the model close to equilibrium within a reasonable number of iterations. We further analyze convergence behavior in Appendix B.

Figure 4: (a) Training dynamics of implicit models: test accuracy as a function of training epochs on CIFAR-10 (with and without data augmentation, 50 epochs). (b) Runtime and memory consumption on CIFAR-10 (benchmarked with input batch size 32) for ResNet-101, DenseNet-121, MDEQ, ANODEs, and MDEQ-Small; MDEQ-Small and ANODEs correspond to the settings and results reported in Table 1. For all metrics in (b), lower is better.

## 4 Experiments

Table 1: Evaluation on CIFAR-10. Standard deviations are calculated over 5 runs.

| Model | Size | Accuracy |
| --- | --- | --- |
| CIFAR-10 (without data augmentation) | | |
| Neural ODEs [18] | 172K | 53.7% ± 0.2% |
| Aug. Neural ODEs [18] | 172K | 60.6% ± 0.4% |
| Single-stream DEQ [5] | 170K | 82.2% ± 0.3% |
| ResNet-18 [26] [Explicit] | 170K | 81.6% ± 0.3% |
| MDEQ-small (ours) | 170K | 87.1% ± 0.4% |
| CIFAR-10 (with data augmentation) | | |
| ResNet-18 [26] [Explicit] | 10M | 92.9% ± 0.2% |
| MDEQ (ours) | 10M | 93.8% ± 0.3% |

In this section, we investigate the empirical performance of MDEQs from two perspectives. First, since prior implicit approaches such as NODEs have mostly been evaluated on smaller-scale benchmarks such as MNIST [33] and CIFAR-10 (32×32 images) [31], we compare MDEQs with these baselines on the same benchmarks, evaluating both training-time stability and inference-time performance. Second, we evaluate MDEQs on large-scale computer vision tasks: ImageNet classification [16] and semantic segmentation on the Cityscapes dataset [13]. These tasks have extremely high-dimensional inputs (e.g., 2048×1024 images for Cityscapes) and are dominated by explicit models. We provide more detailed descriptions of the tasks, hyperparameters, and training settings in Appendix A. Our focus is on the behavior of MDEQs and their competitiveness with prior implicit and explicit models. We are not aiming to set a new state of the art on ImageNet classification or Cityscapes segmentation, as this typically involves substantial additional investment [60]. However, we do note that even with the implicit modeling of the layer $f_\theta$, the mini explicit structure within the design of $f_\theta$ (e.g., the residual block) is still very helpful empirically in improving the equilibrium representations.
All experiments with MDEQs use the limited-memory version of Broyden's method in both the forward and backward passes, and the root solver is stopped whenever 1) the objective value reaches some predetermined threshold ε or 2) the solver's iteration count reaches a limit T. On the large-scale vision benchmarks (ImageNet and Cityscapes), we downsample the input twice with 2-strided convolutions before feeding it into MDEQs, following common practice in explicit models [64, 57]. We use the cosine learning rate schedule for all tasks [41].

### 4.1 Comparing with Prior Implicit Models on CIFAR-10

Following the setting of Dupont et al. [18], we run the experiments on CIFAR-10 classification (without data augmentation) for 50 epochs and compare models with approximately the same number of parameters. However, unlike the ODE-based approaches, we do not downsample the raw images before passing the inputs to the MDEQ solver (so the highest-resolution stream stays at 32×32). When training the MDEQ model, all resolutions are used for the final prediction: higher-resolution streams go through additional downsampling layers and are added to the lowest-resolution output to make a prediction (i.e., a form of auxiliary loss). A sketch of this fused prediction head is shown after Table 3.

The results of MDEQ models on CIFAR-10 image classification are shown in Table 1. Compared to NODEs [12] and Augmented NODEs [18], a small MDEQ with a similar parameter count improves accuracy by more than 20 percentage points: an error reduction of more than a factor of 2. MDEQ also improves over the single-stream DEQ (applied at the highest resolution). The training dynamics of the different models are visualized in Figure 4a. Finally, a larger MDEQ matches and even exceeds the accuracy of a ResNet-18 with the same capacity: the first time such performance has been demonstrated by an implicit model.

Table 2: Evaluation on ImageNet classification with top-1 and top-5 accuracies reported. MDEQs were trained for 100 epochs.

| Model | Size | Top-1 Acc. | Top-5 Acc. |
| --- | --- | --- | --- |
| AlexNet [32] | 238M | 57.0% | 80.3% |
| ResNet-18 [26] | 13M | 70.2% | 89.9% |
| ResNet-34 [26] | 21M | 74.8% | 91.1% |
| Inception-V2 [29] | 12M | 74.8% | 92.2% |
| ResNet-50 [26] | 26M | 75.1% | 92.5% |
| HRNet-W18-C [57] | 21M | 76.8% | 93.4% |
| Single-stream DEQ + global pool [5] | 18M | 72.9% | 91.0% |
| MDEQ-small (ours) [Implicit] | 18M | 75.5% | 92.7% |
| ResNet-101 [26] | 52M | 77.1% | 93.5% |
| W-ResNet-50 [63] | 69M | 78.1% | 93.9% |
| DenseNet-264 [27] | 74M | 79.7% | 94.8% |
| MDEQ-large (ours) [Implicit] | 63M | 77.5% | 93.6% |
| Unrolled 5-layer MDEQ-large | 63M | 75.9% | 93.0% |
| MDEQ-XL (ours) [Implicit] | 81M | 79.2% | 94.5% |

Table 3: Evaluation on Cityscapes semantic segmentation. * marks the current SOTA. Higher mIoU (mean Intersection over Union) is better.

| Model | Backbone | Size | mIoU |
| --- | --- | --- | --- |
| ResNet-18-A [40] | ResNet-18 | 3.8M | 55.4 |
| ResNet-18-B [40] | ResNet-18 | 15.24M | 69.1 |
| MobileNetV2Plus [48] | MobileNetV2 | 8.3M | 74.5 |
| GSCNN [55] | ResNet-50 | - | 73.0 |
| HRNetV2-W18-Small-v2* [57] | HRNet | 4.0M | 76.0 |
| MDEQ-small (ours) [Implicit] | MDEQ | 7.8M | 75.1 |
| U-Net++ [66] | ResNet-101 | 59.5M | 75.5 |
| Dilated-ResNet [62] | D-ResNet-101 | 52.1M | 75.7 |
| PSPNet [64] | D-ResNet-101 | 65.9M | 78.4 |
| DeepLabv3 [9] | D-ResNet-101 | 58.0M | 78.5 |
| PSANet [65] | ResNet-101 | - | 78.6 |
| HRNetV2-W48* [57] | HRNet | 65.9M | 81.1 |
| MDEQ-large (ours) [Implicit] | MDEQ | 53.0M | 77.8 |
| MDEQ-XL (ours) [Implicit] | MDEQ | 70.9M | 80.3 |
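As referenced in Section 4.1, the classification head fuses the multi-resolution equilibrium outputs by downsampling the higher-resolution streams and adding them to the lowest-resolution stream before pooling. The sketch below is an assumed illustration of that scheme; the exact channel counts, layer counts, and ordering in the released model may differ.

```python
# Sketch of a multi-resolution classification head (illustrative, not the released code).
import torch
import torch.nn as nn


class MultiResolutionClassifierHead(nn.Module):
    def __init__(self, channels_per_scale, num_classes):
        super().__init__()
        n = len(channels_per_scale)
        c_low = channels_per_scale[-1]
        # Scale i needs (n - 1 - i) stride-2 convolutions to reach the lowest resolution.
        self.downs = nn.ModuleList()
        for i, c in enumerate(channels_per_scale[:-1]):
            steps, in_c = [], c
            for _ in range(n - 1 - i):
                steps += [nn.Conv2d(in_c, c_low, 3, stride=2, padding=1), nn.ReLU()]
                in_c = c_low
            self.downs.append(nn.Sequential(*steps))
        self.fc = nn.Linear(c_low, num_classes)

    def forward(self, z_list):
        out = z_list[-1]                         # lowest-resolution equilibrium output
        for i, z in enumerate(z_list[:-1]):
            out = out + self.downs[i](z)         # fuse downsampled higher-resolution streams
        return self.fc(out.mean(dim=(2, 3)))     # global average pool, then classify
```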
### 4.2 ImageNet Classification

We now test the ability of MDEQ to scale to a much larger dataset with higher-resolution images: ImageNet [16]. As with CIFAR-10 classification, we add a shallow classification layer after the MDEQ module to fuse the equilibrium outputs from the different scales, and train on a combined loss. We benchmark both a small MDEQ and a large MDEQ to provide appropriate comparisons with a number of reference models, such as ResNet-18, -34, -50, and -101 [26]. Note that MDEQ has only one layer of residual blocks followed by multi-resolution fusion. Therefore, to match the capacity of standard explicit models, we need to increase the feature dimensionality within MDEQ. This is accomplished mainly by adjusting the width of the convolutional filters within the residual block (see Figure 2).

Table 2 shows the accuracy of two MDEQs (of different sizes) in comparison to well-known reference models in computer vision. MDEQs are remarkably competitive with strong explicit models. For example, a small MDEQ with 18M parameters outperforms ResNet-18 (13M parameters), ResNet-34 (21M parameters), and even ResNet-50 (26M parameters). A larger MDEQ (64M parameters) reaches the same level of performance as ResNet-101 (52M parameters). This is far beyond the scale and accuracy levels of prior applications of implicit modeling.

### 4.3 Cityscapes Semantic Segmentation

After training on ImageNet, we train the same MDEQs for semantic segmentation on the Cityscapes dataset [13]. When transferring the models from ImageNet to Cityscapes, we directly use the highest-resolution equilibrium output $z_1^\star$ to train on the highest-resolution loss. Thus MDEQ is "its own backbone". We train on the Cityscapes train set and evaluate on the val set. Following the evaluation protocol of Zhao et al. [65] and Wang et al. [57], we test on a single scale with no flipping.

MDEQs attain remarkably high levels of accuracy (Table 3). They come close to the current state of the art, and match or outperform well-known and carefully architected explicit models that were released in the past two years. A small MDEQ (7.8M parameters) achieves a mean IoU of 75.1. This improves upon a MobileNetV2Plus [48] of the same size and is close to the state of the art for models at this scale. A large MDEQ (53.5M parameters) reaches 77.8 mIoU, which is within 1 percentage point of highly regarded recent semantic segmentation models such as DeepLabv3 [9] and PSPNet [64], whereas a larger version (70.9M parameters) surpasses them. It is surprising that such levels of accuracy can be achieved by a shallow implicit model, based on principles that have not been applied to this domain before. Examples of semantic segmentation results are shown in Appendix C.

### 4.4 Runtime and Memory Consumption

We provide a runtime and memory analysis of MDEQs using CIFAR-10 data, with input batch size 32. Since prior implicit models such as ANODEs [18] are relatively small, we provide results for both MDEQ and MDEQ-small for a fair comparison.
All computation speeds are benchmarked relative to the ResNet-101 model (about 150 ms per batch) on a single RTX 2080 Ti GPU. The results are summarized in Figure 4b. MDEQ saves more than 60% of GPU memory at training time compared to explicit models such as ResNets and DenseNets, while maintaining competitive accuracy. Training a large MDEQ on ImageNet consumes about 6 GB of memory, most of which is used by Broyden's method. This low memory footprint is a direct result of the analytical backward pass. Meanwhile, MDEQs are generally slower than explicit networks. We observe a 2.7× slowdown for MDEQ compared to ResNet-101, a tendency similar to that observed in the sequence domain [5]. A major factor contributing to the slowdown is that MDEQs maintain features at all resolutions throughout, whereas explicit models such as ResNets gradually downsample their activations and thus reduce computation (e.g., 70% of ResNet-101 layers operate on features that are downsampled by 8×8 or more). However, when compared to ANODEs with 172K parameters, an MDEQ of similar size is 3× faster while achieving a 3× error reduction. Additional discussion of runtime and convergence is provided in Appendix B.2.

### 4.5 Equilibrium Convergence on High-resolution Inputs

As we scale MDEQ to higher-resolution inputs, the equilibrium-solving process becomes more challenging. This is illustrated in Figure 5, where we show the equilibrium convergence of MDEQ on CIFAR-10 (low-resolution), ImageNet (medium-resolution), and Cityscapes (high-resolution) images by measuring the change of the residual with respect to the number of function evaluations. We empirically find that (limited-memory) Broyden's method and multiscale fusion both help stabilize convergence on high-resolution data. For example, in all three cases, Broyden's method (blue lines in Figure 5) converges to the fixed point in a more stable and efficient manner than simply iterating $f_\theta$ (yellow lines). Further analysis of the multiscale convergence behavior is provided in Appendix B.2.

Figure 5: MDEQ's convergence to equilibrium, measured by the relative residual $\|z^{[i+1]} - z^{[i]}\| / \|z^{[i]}\|$ as a function of the number of evaluations of $f_\theta$, for MDEQ-Large on (a) CIFAR-10 classification, (b) ImageNet classification, and (c) Cityscapes segmentation. As input image resolution grows (from CIFAR-10 to Cityscapes), MDEQ takes more steps to converge with (limited-memory) Broyden's method. Standard deviations are calculated on 5 randomly selected batches from each dataset.

## 5 Conclusion

We introduced multiscale deep equilibrium models (MDEQs): a new class of implicit architectures for domains characterized by high dimensionality and multiscale structure. Unlike prior implicit models, such as DEQs and Neural ODEs, an MDEQ solves for and backpropagates through synchronized equilibria of multiple feature representations at different resolutions. We show that a single MDEQ can be used for different tasks, such as image classification and semantic segmentation.
Our experiments demonstrate for the first time that shallow implicit models can scale to practical computer vision tasks and achieve competitive performance, matching explicit architectures characterized by sequential processing through deeply stacked layers. The remarkable performance of implicit models in this work raises core questions in machine learning. Are complex stage-wise hierarchical architectures, which have dominated deep learning to date, necessary? MDEQ exemplifies a different approach to differentiable modeling. The most significant message of our work is that this approach may be much more relevant in practice than it previously appeared. We hope that this will contribute to the development of implicit deep learning and will further broaden the agenda in differentiable modeling.

## Broader Impacts

Computer vision techniques themselves, which are the primary application focus of this paper, have numerous applications with both positive and negative societal consequences. They can enable potentially life-saving advances in, e.g., assisted driving and medical diagnosis, but they also have inherent limitations and biases that could lead to problematic applications in these areas (e.g., if the performance of a vision model notably differs when given input images of people of different races or genders); and this says nothing of more genuinely problematic enabled applications, such as facial recognition for surveillance. The question of more relevance to this paper, however, is whether there are any societal-level consequences that are unique to this particular algorithmic approach, i.e., the use of implicit versus explicit models in computer vision domains. This point is genuinely less clear to us. It is possible that the relative memory efficiency of implicit vision models would make them, e.g., more amenable to edge devices, which in turn raises the potential for both beneficial and harmful use cases. However, this is a large leap from current methods, where the improved memory efficiency comes at a cost of increased compute time (and thus such models could arguably be less efficient in their current form on edge devices). Thus, we believe the specific impacts of implicit models are still unclear at this point, and should largely be re-evaluated as the models become more standard or more widely adopted.

## References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, 2016.

[2] L. B. Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial Neural Networks, 1990.

[3] B. Amos and J. Z. Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning (ICML), 2017.

[4] S. Bai, J. Z. Kolter, and V. Koltun. Trellis networks for sequence modeling. In International Conference on Learning Representations (ICLR), 2019.

[5] S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019.

[6] C. G. Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 1965.

[7] P. Burt and E. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4), 1983.

[8] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Computer Vision and Pattern Recognition (CVPR), 2016.
[9] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.

[10] L.-C. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. Searching for efficient multi-scale architectures for dense image prediction. In Advances in Neural Information Processing Systems, 2018.

[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 2018.

[12] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018.

[13] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Computer Vision and Pattern Recognition (CVPR), 2016.

[14] F. de Avila Belbute-Peres, K. Smith, K. Allen, J. Tenenbaum, and J. Z. Kolter. End-to-end differentiable physics for learning and control. In Advances in Neural Information Processing Systems, 2018.

[15] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2019.

[16] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009.

[17] J. Djolonga and A. Krause. Differentiable learning of submodular models. In Advances in Neural Information Processing Systems, 2017.

[18] E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Advances in Neural Information Processing Systems, 2019.

[19] D. Eigen, J. Rolfe, R. Fergus, and Y. LeCun. Understanding deep architectures using a recursive convolutional network. arXiv:1312.1847, 2013.

[20] L. El Ghaoui, F. Gu, B. Travacca, and A. Askari. Implicit deep learning. arXiv:1908.06315, 2019.

[21] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.

[22] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, 2016.

[23] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, 2011.

[24] S. Gould, R. Hartley, and D. Campbell. Deep declarative networks: A new hope. arXiv:1909.04866, 2019.

[25] W. Grathwohl, R. T. Chen, J. Bettencourt, I. Sutskever, and D. Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. In International Conference on Learning Representations (ICLR), 2019.

[26] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

[27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2017.

[28] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Q. Weinberger. Multi-scale dense networks for resource efficient image classification. In International Conference on Learning Representations (ICLR), 2018.
[29] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), 2015.

[30] S. G. Krantz and H. R. Parks. The implicit function theorem: History, theory, and applications. Springer, 2012.

[31] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[32] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

[33] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 1989.

[34] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.

[35] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning (ICML), 2009.

[36] H. Lee, P. Pham, Y. Largman, and A. Y. Ng. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, 2009.

[37] R. Liao, Y. Xiong, E. Fetaya, L. Zhang, K. Yoon, X. Pitkow, R. Urtasun, and R. Zemel. Reviving and improving recurrent back-propagation. In International Conference on Machine Learning (ICML), 2018.

[38] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2017.

[39] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1-3), 1989.

[40] Y. Liu, C. Shu, J. Wang, and C. Shen. Structured knowledge distillation for dense prediction. In Computer Vision and Pattern Recognition (CVPR), 2019.

[41] I. Loshchilov and F. Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR), 2017.

[42] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), 2016.

[43] F. J. Pineda. Generalization of back propagation to recurrent and higher order neural networks. In Advances in Neural Information Processing Systems, 1988.

[44] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In International Conference on Machine Learning (ICML), 2014.

[45] Y.-L. Qiao, J. Liang, V. Koltun, and M. C. Lin. Scalable differentiable physics for learning and control. In International Conference on Machine Learning (ICML), 2020.

[46] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

[47] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, 2016.

[48] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition (CVPR), 2018.

[49] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4), 2017.

[50] J. Sherman and W. J. Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 1950.
[51] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.

[52] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15, 2014.

[53] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. FishNet: A versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, 2018.

[54] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Computer Vision and Pattern Recognition (CVPR), 2015.

[55] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler. Gated-SCNN: Gated shape CNNs for semantic segmentation. In International Conference on Computer Vision (ICCV), 2019.

[56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.

[57] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao. Deep high-resolution representation learning for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[58] P.-W. Wang, P. Donti, B. Wilder, and Z. Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning (ICML), 2019.

[59] Y. Wu and K. He. Group normalization. In European Conference on Computer Vision (ECCV), 2018.

[60] Q. Xie, E. Hovy, M.-T. Luong, and Q. V. Le. Self-training with Noisy Student improves ImageNet classification. In Computer Vision and Pattern Recognition (CVPR), 2020.

[61] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations (ICLR), 2016.

[62] F. Yu, V. Koltun, and T. Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition (CVPR), 2017.

[63] S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv:1605.07146, 2016.

[64] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In Computer Vision and Pattern Recognition (CVPR), 2017.

[65] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia. PSANet: Point-wise spatial attention network for scene parsing. In European Conference on Computer Vision (ECCV), 2018.

[66] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang. UNet++: A nested U-Net architecture for medical image segmentation. In MICCAI, 2018.