# Optimization Inspired Multi-Branch Equilibrium Models

Published as a conference paper at ICLR 2022

Mingjie Li1, Yisen Wang1,2, Xingyu Xie1, Zhouchen Lin1,2,3
1 Key Lab. of Machine Perception (MoE), School of Artificial Intelligence, Peking University
2 Institute for Artificial Intelligence, Peking University
3 Pazhou Lab, Guangzhou, 510330, China

Works have shown strong connections between some implicit models and optimization problems. However, explorations of such relationships remain limited: most works pay attention only to common mathematical properties, such as sparsity. In this work, we propose a new type of implicit model inspired by designing the system's hidden objective function, called the Multi-branch Optimization induced Equilibrium networks (MOptEqs). The model architecture is designed by modelling the hidden objective function for the multi-resolution recognition task. Furthermore, we also propose a new strategy inspired by our understanding of the hidden objective function. In this manner, the proposed model can better utilize hierarchical patterns for recognition tasks and retains the ability to interpret the whole structure as seeking the minima of the problem's goal. Compared with state-of-the-art models, our MOptEqs not only enjoys better explainability but is also superior to MDEQ, with lower parameter consumption and better performance on practical tasks. Furthermore, we conduct various experiments to demonstrate the effectiveness of our new methods and explore the applicability of the model's hidden objective function.

1 Introduction

Whereas Deep Neural Networks (DNNs) have achieved great success in many real-world tasks such as computer vision and natural language processing, the limited interpretability of DNNs greatly hinders their further development.
However, many traditional machine learning methods, e.g., matrix recovery (Zhang et al., 2018b; 2015; Liu & Li, 2016), subspace clustering (You et al., 2016), image deblurring (Liu et al., 2014), and so on, can be interpreted as minimizing the following objective function:

$$\min_{Z,E}\ f(Z) + g(E), \quad \text{s.t.}\ X = AZ + BE,$$

where $A \in \mathbb{R}^{m \times d_1}$, $B \in \mathbb{R}^{m \times d_2}$, $X \in \mathbb{R}^{m \times n}$, and $f(\cdot)$ and $g(\cdot)$ are convex functions designed by modelling the desired properties. On this account, the interpretability of such methods is much better than that of DNNs. Furthermore, these methods can also enjoy state-of-the-art and robust performance on these tasks. We call the interpretability these methods enjoy "mathematical interpretability", i.e., whether the whole network structure can be summarized as a compact mathematical model that can be mathematically analyzed.

Besides these methods, the Optimization Induced Equilibrium Networks (OptEqs) proposed by Xie et al. (2021) recover their whole system to an optimization problem. The forward propagation of OptEqs solves Eqn (1) to get the output $y$, with $z^* \in \mathbb{R}^d$ for input $x \in \mathbb{R}^{d_{in}}$. We call the front part of Eqn (1) OptEqs' equilibrium equation, which is the central part of equilibrium models. Its forward procedure can be regarded as solving the hidden objective function shown in Eqn (2):

$$z^* = \sigma(Wz^* + Ug(x) + b), \quad y = W_c z^*, \qquad (1)$$

$$\min_z G(z; g(x)) = \min_z \Big[ \mathbf{1}^\top f(W^{-1}z) - \big\langle Ug(x) + b,\ W^{-1}z \big\rangle + \frac{1}{2}\|W^{-1}z\|^2 - \frac{1}{2}\|z\|^2 \Big]. \qquad (2)$$

$W$, $U$, $W_c$ are learnable weights, $g$ consists of the convolution layers projecting the input $x$ from the real-world domain to the feature space as in other neural architectures, and $f$ is the output penalty. Although OptEqs' interpretability is satisfying, some weaknesses remain worth exploring. First, although OptEqs performs better than some implicit models, these models only consider a single view (or resolution) of the input in their implicit parts.
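To make the forward pass concrete, the equilibrium equation of Eqn (1) can be run as a plain fixed-point iteration. Below is a minimal NumPy sketch, assuming a ReLU activation (i.e., $f$ is the non-negativity indicator so its proximal operator is ReLU) and a dense $W$ rescaled so that $\|W\|_2 < 1$, which makes the map contractive; all sizes are illustrative, not from the paper.

```python
import numpy as np

def opteqs_forward(W, U, b, gx, n_iters=500, tol=1e-10):
    # Fixed-point iteration for z* = sigma(W z* + U g(x) + b),
    # with sigma = ReLU (the prox of f(x) = I{x >= 0}).
    z = np.zeros(W.shape[0])
    for _ in range(n_iters):
        z_next = np.maximum(W @ z + U @ gx + b, 0.0)  # ReLU step
        if np.linalg.norm(z_next - z) < tol:
            break
        z = z_next
    return z_next

rng = np.random.default_rng(0)
d, d_in = 16, 8
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)      # enforce ||W||_2 < 1 so the map contracts
U = rng.standard_normal((d, d_in))
b = rng.standard_normal(d)
gx = rng.standard_normal(d_in)       # stand-in for the input feature g(x)

z_star = opteqs_forward(W, U, b, gx)
residual = np.linalg.norm(z_star - np.maximum(W @ z_star + U @ gx + b, 0.0))
print(residual)  # ~0: z_star satisfies the equilibrium equation
```

In practice equilibrium models replace this naive iteration with accelerated root-finders, but the fixed point they return is the same.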
We call these models single-view implicit models in the following. However, nearly all state-of-the-art pattern recognition systems (Lee et al., 2009b; Wang et al., 2019a; Huang et al., 2017; He et al., 2016; Burt & Adelson, 1987) benefit from multi-layer or multi-resolution feature extractors in domains like computer vision and audio processing. Secondly, limited new modules are proposed in their work, which mainly considers mathematical properties of features like sparsity, but whether these properties can benefit image recognition tasks is unexplored.

Besides OptEqs, MDEQ (Bai et al., 2020) constructs its equilibrium equation $z^* = F(z^*; x)$ and block $F$ with inspiration from explicit models, especially HRNet (Wang et al., 2019a). It then solves the equilibrium equation with an accelerated algorithm for the output $z^*$. Although it shows state-of-the-art performance with efficient memory cost like other implicit models, such a strategy makes the model lose "interpretability": MDEQ is a black box because its output is solved implicitly (by a root-finding method), which makes analysis of its intermediate-layer features almost impossible. Furthermore, its complicated structure also hinders analyzing it as a mathematical system. More efficient blocks with good interpretability still need exploring.

Motivated by the limitations of the above models, we design a multi-scale DEQ structure with state-of-the-art performance that preserves mathematical interpretability, drawing inspiration from OptEqs and multi-scale models. Our contributions are listed below:

- We propose a multi-branch implicit model, called Multi-branch OptEqs (MOptEqs), which efficiently utilizes inputs at different scales with better performance and a smaller model size. Furthermore, it retains its connection to an optimization problem.
- We propose some properties, modelled as new terms in the model's hidden objective function, from our analysis of the relationships between branches and their hierarchical dependencies. With the new problem formulation, we obtain the fusion module in our MOptEqs, called the Hierarchical Heritage and Diversity Module (HH&D Module).

- Apart from the new module design, we also propose the Perturbation Enhanced strategy (PE) for our MOptEqs from our analysis of the model's hidden objective function. The new strategy can not only enhance the performance of our MOptEqs, but also improve the robustness of our models.

1.1 Related Works

Implicit Models. Nearly all modern deep learning approaches use explicit models, which provide an explicit computation graph for the forward propagation. In contrast, the computation graphs of implicit models are "flexible", or can be regarded as having "infinite" depth. For example, Neural ODEs (Chen et al., 2018; Massaroli et al., 2020) encode their neural architectures through a differential system with learnable parameters; the implicit ODE solver they use is then equivalent to a continuous ResNet taking infinitesimal steps. Furthermore, the training process of Neural ODEs can be depicted as finding a differential system of a certain type (like heat equations) by updating its learnable parameters, which demonstrates that Neural ODEs also enjoy interpretability to a certain extent. Moreover, DEQ (Bai et al., 2019; Winston & Kolter, 2020) is another class of implicit models. The central part of a DEQ is the design of the equilibrium equation $z^* = F(z^*; x)$ and the block $F$. Its forward procedure solves the equilibrium equation with input $x$ to get the equilibrium state $z^*$ as output by accelerated algorithms. Since $z^*$ is fixed given $F$ and $x$, the inference can be regarded as forwarding an explicit network stacked with the same block $F$ infinitely many times.
For example, DEQ chooses its $F$ as one set of Conv+ReLU layers, while MDEQ constructs an HRNet-like block. However, no evidence shows that the best block constructions for explicit models also perform best in the DEQ scheme. The construction of DEQ blocks, or equilibrium equations, is an open question worth exploring. Furthermore, most DEQ blocks do not carry any mathematical insight (like the diffusion process in Neural ODEs) and behave as total black boxes with limited interpretability. Since implicit models usually adopt accelerated root-finding algorithms for their forward outputs and backward gradients, they enjoy constant and more efficient memory costs compared with DNNs. Due to these advantages, the design of implicit models has drawn much attention recently (Ghaoui et al., 2019; Gould et al., 2019). Apart from the models illustrated above, many other kinds of implicit models have been proposed, including differentiable physics engines (Qiao et al., 2020; de Avila Belbute-Peres et al., 2018), logical structure learning (Wang et al., 2019b), and implicit neural blocks (Li et al., 2020).

Model Interpretability. Many researchers nowadays try to make their proposed methods more interpretable. Although there are many approaches to this goal, we roughly divide them into two classes: "empirical" and "mathematical" interpretability. Many works, such as (Zhang et al., 2018a; Zhang & Zhu, 2018; Bau et al., 2018), attempt to empirically disassemble the black box by characterizing statistical or structural information such as the outputs and gradients of neural networks' middle layers. However, these works cannot be directly applied to DEQ models, since implicit models do not have explicit depths like DNNs, which makes analyzing such models by dissecting each layer's behavior almost impossible.
Apart from that, researchers (Djolonga & Krause, 2017; Amos & Kolter, 2017; Xie et al., 2019; Chan et al., 2020) have also managed to understand neural architectures by linking them to mathematical problems. In this way, researchers can analyze the black box through its corresponding mathematical problem. Furthermore, new components can be proposed through the analysis of these problems. Our model aims to be a new DEQ architecture with proper mathematical interpretability. Compared with former works, our model performs better on classification tasks with novel components.

2 Multi-Branch Optimization Induced Equilibrium Models

2.1 The proposed architecture for the Multi-Branch OptEqs

Inspired by OptEqs, we first design an objective function for our tasks and then use its first-order stationarity conditions as MOptEqs' equilibrium equation. The base of our function is formed by summing several objective functions of different scales defined for OptEqs (details are stated in Appendix A.1). However, such a model cannot obtain satisfying results since these branches are independent. For this reason, we need to design terms describing the dependencies between branches in the hidden objective function, as shown in the following paragraphs. We can then obtain the equilibrium equations for our MOptEqs and our MOptEqs block $F$ by analyzing the first-order stationarity condition of the designed optimization problem.

Hierarchical Heritage Modeling. State-of-the-art explicit models for image tasks are explicitly structured into sequential stages and process different resolutions hierarchically (He et al., 2016; Shelhamer et al., 2017; Lee et al., 2009a), which implies that features should be extracted hierarchically. In other words, the posterior branches should inherit from their prior ones, a property we call hierarchical heritage.
In our work, we formulate this correlation of neighbouring branches in the hidden objective function by adding the following inner-product term:

$$H(z_i, z_j) = z_i^\top P_{i:j}\, z_j, \qquad (3)$$

where $z_i \in \mathbb{R}^{d_i}$, $z_j \in \mathbb{R}^{d_j}$, and $P_{i:j} \in \mathbb{R}^{s_i \times s_j}$ denotes the Average Downsample operator (whose transpose can be regarded as a weighted nearest upsample) or the Identity matrix, suited to the shapes of $z_i$ and $z_j$ ($i < j$, or $j = 1, i = L$). This term estimates the total similarity of the $i$-th and $j$-th branches' channels with the same channel index (corresponding channels). When $z_i$ and $z_j$ are similar, the inner product is large. Otherwise, the result becomes small, which implies that the relationship between corresponding channels is weak. In this way, we can encourage the similarity of corresponding channels of neighbouring branches. Beyond the relations between different branches' corresponding channels, the relationships between non-corresponding channels are explored in the following section.

Diversity Modeling. In addition to the hierarchical relations between corresponding channels, works (Pang et al., 2019; Amada et al., 2021) have shown that enhancing the diversity between branches can also improve a model's performance. Therefore, we also make different branches extract various features to improve the representation abilities. To achieve this goal, we add a diversity term to the objective function, Eqn (4), for $i < j$:

$$D(z_i, z_j) = \sum_{k_1=1}^{C} \sum_{k_2 \neq k_1} \Big| \mathrm{vec}^{-1}(z_i)^{(k_1)\top}\, \mathrm{vec}^{-1}(P_{i:j} z_j)^{(k_2)} \Big|, \qquad (4)$$

where $\mathrm{vec}^{-1}$ is the inverse vectorization operator converting the vectorized feature $z_i \in \mathbb{R}^{d_i \times 1}$ to a matrix in $\mathbb{R}^{\frac{d_i}{C} \times C}$ with $C$ channels, and $\mathrm{vec}^{-1}(z_i)^{(k)} \in \mathbb{R}^{\frac{d_i}{C} \times 1}$ denotes the $k$-th channel of the $i$-th branch in practice. Since we have already considered the relationships between corresponding channels in the hierarchical term, this term only estimates the diversity of non-corresponding channels of the $i$-th and $j$-th branches.
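The two terms can be sketched numerically. Below is a minimal NumPy illustration of $H$ (inner products between corresponding channels, after bringing a finer branch to the coarser branch's shape) and $D$ (absolute inner products between non-corresponding channels). Features are stored as (spatial × channel) matrices, the `avg_downsample` helper is a hypothetical 1-D stand-in for $P_{i:j}$, and all shapes are illustrative.

```python
import numpy as np

def avg_downsample(Z, factor):
    # Average-downsample each channel: Z of shape (s, C) -> (s // factor, C).
    # Stand-in for P_{i:j} applied channel-wise, assuming a 1-D spatial flattening.
    s, C = Z.shape
    return Z.reshape(s // factor, factor, C).mean(axis=1)

def heritage_H(Zi, Zj):
    # H term: total inner product between *corresponding* channels.
    return float(np.sum(Zi * Zj))

def diversity_D(Zi, Zj):
    # D term: sum of |inner products| between *non-corresponding* channels.
    G = np.abs(Zi.T @ Zj)             # C x C channel-correlation magnitudes
    return float(G.sum() - np.trace(G))  # drop the diagonal (corresponding pairs)

rng = np.random.default_rng(0)
C = 4
Z_hi = rng.standard_normal((128, C))   # finer neighbouring branch
Z_lo = rng.standard_normal((64, C))    # coarser branch
Z_hi_down = avg_downsample(Z_hi, 2)    # bring the finer branch to the coarse shape

H_val = heritage_H(Z_lo, Z_hi_down)    # large when corresponding channels align
D_val = diversity_D(Z_lo, Z_hi_down)   # small when other channel pairs decorrelate
print(H_val, D_val)
```

The hidden objective then rewards a large $H$ and penalizes a large $D$, so corresponding channels align while the remaining pairs decorrelate.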
As the diversity term becomes smaller, the diversity between the branches becomes stronger.

The architecture of MOptEqs and its hidden problem. With the two terms we proposed, we can reformulate the hidden objective of our problem as follows:

$$\min_{z_1, \dots, z_L} G(z_1, \dots, z_L; g(x)) = \min_{z_1, \dots, z_L} \sum_{i=1}^{L} \Big[ \mathbf{1}^\top f(W_i^{-1} z_i) - \big\langle U_i g(x) + b_i,\ W_i^{-1} z_i \big\rangle + \frac{1}{2}\big( \lambda D(z_i, z_{i+1}) + \|W_i^{-1} z_i\|^2 - \|z_i\|^2 - H(z_i, z_{i+1}) \big) \Big]. \qquad (5)$$

$g(x)$ is the input feature for the raw input $x$, $\lambda > 0$ is a hyperparameter, and the $z_i \in \mathbb{R}^{d_i \times 1}$ are the final outputs of MOptEqs. We can choose $f$ to constrain the $z_i$ on demand, which determines the model's activation function: if we choose $f(x) = \mathbb{I}\{x \geq 0\}$ to ensure the outputs are positive, the activation function is ReLU. We set $D(z_L, z_{L+1}) = D(z_L, z_1)$ and $H(z_L, z_{L+1}) = H(z_L, z_1)$ when $i = L$ to complete the loop. The two added terms make the corresponding channels of different branches correlated by maximizing the $H$ term, and enhance the diversity of different branches by minimizing the $D$ term. We note that $H$ and $D$ do not conflict, since they handle different pairs of channels between branches. As the following proposition shows, we finally obtain the equilibrium equations (Eqn (6)) for our MOptEqs by calculating the first-order stationarity conditions $\nabla_{z_i} G = 0$ for the problem $G$.

Proposition 1. The proposed multi-branch structure induced by Eqn (5) can be depicted as solving for the equilibrium points $\tilde{z}^* := [z_1^{*\top}, \dots, z_L^{*\top}]^\top \in \mathbb{R}^{\sum_{i=1}^{L} d_i}$ of the following equation:

$$\tilde{z}^* = \widetilde{W} \sigma\big( \widetilde{W} h(\tilde{z}^*) + \widetilde{U} g(x) + \tilde{b} \big), \quad \text{where } \widetilde{W} = \mathrm{diag}(W_1, \dots, W_L),\ \widetilde{U} = [U_1^\top, \dots, U_L^\top]^\top,\ \tilde{b} = [b_1^\top, \dots, b_L^\top]^\top. \qquad (6)$$

And $h(\tilde{z}) = [h_1(z_1, z_L, z_2)^\top, \dots, h_i(z_i, z_{i-1}, z_{i+1})^\top, \dots, h_L(z_L, z_{L-1}, z_1)^\top]^\top$, where each $h_i$ is the mapping from $\mathbb{R}^{d_i} \times \mathbb{R}^{d_{i-1}} \times \mathbb{R}^{d_{i+1}}$ to $\mathbb{R}^{d_i}$ defined as

$$h_i(z_i, z_{i-1}, z_{i+1}) = z_i + \frac{1}{2}\big(P_{i:i+1} z_{i+1} + P_{i-1:i}^\top z_{i-1}\big) - \frac{\lambda}{2}\, \mathrm{vec}\big( \mathrm{vec}^{-1}(P_{i:i+1} z_{i+1})\, \mathrm{sign}[M_{i:i+1} - \mathrm{diag}(M_{i:i+1})] \big) - \frac{\lambda}{2}\, \mathrm{vec}\big( \mathrm{vec}^{-1}(P_{i-1:i}^\top z_{i-1})\, \mathrm{sign}[M_{i-1:i}^\top - \mathrm{diag}(M_{i-1:i})] \big),$$

$$M_{i:i+1} = \mathrm{vec}^{-1}(P_{i:i+1} z_{i+1})^\top \mathrm{vec}^{-1}(z_i), \qquad M_{i-1:i} = \mathrm{vec}^{-1}(z_i)^\top \mathrm{vec}^{-1}(P_{i-1:i}^\top z_{i-1}).$$
Figure 1: The structure of MOptEqs (panel (a), stacking the $L$ parallel branches $\sigma(W_i z_i + U_i g(x) + b_i)$) and the HH&D Module (panel (b)). Dotted lines in the figure denote the upsampling, downsampling, or identity operators that suit the size of each branch; the RD operator removes the diagonal of a matrix; the SIGN operator converts each element of a matrix to its sign; and $d_i = C H_i W_i$, where $C$ is the channel number and $H_i$, $W_i$ are the height and width of the feature map.

The equilibrium points of the above formula are also the first-order stationary points of the optimization problem, Eqn (5). The proof of the proposition is given in Appendix A.3. In this manner, we obtain the proposed architecture MOptEqs: its forward propagation is equivalent to finding the equilibrium points of Eqn (6), and this process can also be regarded as solving for the stationary points of the optimization problem (5). The whole structure is shown in Figure 1(a), and we illustrate the practical process of $h_i$ (HH&D) in Figure 1(b). We evaluate the effectiveness of this modelling in the experiments of Section 3.2.

2.2 The proposed Perturbation Enhanced Strategy for MOptEqs

Studies (Kong et al., 2020; Xie et al., 2020; Tang et al., 2021) have discovered that small perturbations can enhance the generalization abilities and the variety of branches of neural architectures. Most efficient works use gradient-based adversarial perturbations (Xie & Yuille, 2020) for this trick. However, such a strategy slows the training process, since it needs at least one more back-propagation in each training step to compute the perturbations, which is time-consuming.
Unlike these methods, we can acquire more computationally efficient perturbations because our model can be formulated as minimizing an objective function we designed. We can then replace the maximization problem for adversarial perturbation (details in Appendix A.2) with maximizing our architecture's hidden objective function $G$, since we assume MOptEqs' performance will worsen if $G(z; g(x))$ changes a lot in $g(x)$'s neighborhood, the forward propagation of MOptEqs being the minimization of $G(z; g(x))$.

Assumption 1. Consider a well-trained OptEqs with its objective function denoted $G(z; g(x))$, a natural sample $x$ that can be correctly classified, and the equilibrium outputs $z^*$ for $g(x)$ and $z'$ for the perturbed feature $g(x) - \delta$. If an input perturbation $\|\delta\|_\infty \leq \epsilon$ can cause one of the following changes:

$$|G(z'; g(x) - \delta) - G(z^*; g(x))| \geq L_1 \epsilon \quad \text{or} \quad \|z' - z^*\|_2 \geq L_2 \epsilon,$$

with $L_1, L_2 > 1$, it may lead the OptEqs to wrong results with high probability.

The second part of the assumption is common sense and widely used in the analysis of adversarial robustness (Zhang et al., 2021; Li et al., 2020). With the above assumption, we can generate perturbations by maximizing the objective function $G(z; g(x))$ of our MOptEqs. Since our MOptEqs can be regarded as a kind of "implicit" ensembling that parallels several equilibrium models, we inject the perturbations acquired for the prior branch into the posterior one, like boosting, to enhance performance.

The hidden objective function of MOptEqs with the perturbations is then:

$$\min_{z_1, \dots, z_L} G(z_1, \dots, z_L; g(x)) = \min_{z_1, \dots, z_L} \sum_{i=1}^{L} \Big[ \mathbf{1}^\top f(W_i^{-1} z_i) - \big\langle U_i (g(x) - \delta_{i-1}) + b_i,\ W_i^{-1} z_i \big\rangle + \frac{1}{2}\big( \lambda D(z_i, z_{i+1}) + \|W_i^{-1} z_i\|^2 - \|z_i\|^2 - H(z_i, z_{i+1}) \big) \Big],$$

$$\text{s.t. } \delta_i = \operatorname*{argmax}_{\|\delta_i\|_\infty \leq \epsilon} \Big[ \mathbf{1}^\top f(W_i^{-1} z_i) - \big\langle U_i (g(x) - \delta_i) + b_i,\ W_i^{-1} z_i \big\rangle \Big] \quad \text{for } i \in [1, L],$$

where $\delta_0 = 0$. The problem thus becomes a bilevel optimization problem. The solution to the lower-level problem is $\delta_i = \epsilon\, \mathrm{sign}(U_i^\top W_i^{-1} z_i)$.
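The closed-form lower-level solution needs only one linear solve and a sign, with no gradient iterations. A minimal NumPy sketch, assuming a dense invertible $W_i$ and illustrative sizes (the helper name is ours, not the paper's):

```python
import numpy as np

def reconstructed_perturbation(U, W, z_star, eps):
    # delta = eps * sign(U^T W^{-1} z*): the closed-form maximizer of the
    # delta-dependent term <U delta, W^{-1} z*> over the ball ||delta||_inf <= eps.
    v = np.linalg.solve(W, z_star)   # W^{-1} z*, a single linear solve
    return eps * np.sign(U.T @ v)

rng = np.random.default_rng(0)
d, d_in, eps = 16, 8, 0.1
W = rng.standard_normal((d, d)) + 2.0 * np.eye(d)   # keep W comfortably invertible
U = rng.standard_normal((d, d_in))
z_star = rng.standard_normal(d)                     # stand-in equilibrium output

delta = reconstructed_perturbation(U, W, z_star, eps)
v = np.linalg.solve(W, z_star)
# The closed form attains eps * ||U^T W^{-1} z*||_1, the maximum of the linear term:
print(np.isclose(delta @ (U.T @ v), eps * np.abs(U.T @ v).sum()))  # True
```

This is why the strategy is cheap: no extra back-propagation is needed per training step, unlike gradient-based adversarial perturbations.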
Since the perturbation can be obtained by feeding the output of the activation layer $\sigma(Wz + Ug(x) + b)$ to the transposed convolution layer with weight $U_i$, we call it the reconstructed perturbation. Compared with gradient-based adversarial perturbations, our perturbations are acquired directly by matrix-vector multiplication instead of being built iteratively from gradients; thus, the process does not require much computation. Following the above steps, we obtain a harmful perturbed direction for the prior branch and then add it to its posterior branch. We note that our perturbation is added to the input feature $g(x)$ instead of the raw input.

Proposition 2. For a natural sample $x$ and a well-trained OptEqs $z = W\sigma(Wz + Ug(x) + b)$ with its hidden objective function denoted $G(z; g(x))$, let $g'(x) = g(x) - \delta$ with $\delta = \epsilon\, \mathrm{sign}(U^\top W^{-1} z^*)$, which obeys the constraint $\|\delta\|_\infty \leq \epsilon$. Define $z'$ and $z^*$ as follows:

$$z' = \operatorname*{argmin}_z G(z; g'(x)), \qquad z^* = \operatorname*{argmin}_z G(z; g(x)).$$

If $\|W\|_2, \|U\|_2 \leq 1$ and $\frac{1}{N}\|U^\top W^{-1} z^*\|_1 \geq \epsilon$, where $N$ is the number of elements of $g(x)$, then

$$\max\Big\{ G(z'; g(x) - \delta) - G(z^*; g(x)),\ N\epsilon\, \|z' - z^*\|_2 \Big\} \geq \frac{1}{2}\Big( \epsilon\, \|U^\top W^{-1} z^*\|_1 - N\epsilon^2 \Big).$$

The proposition demonstrates that if we choose the perturbed direction to be $\epsilon\, \mathrm{sign}(U^\top W^{-1} z^*)$, at least one of the changes of $G$ or $z^*$ will be around $\epsilon\, \|U^\top W^{-1} z^*\|_1$, implying that each branch may perform badly on the perturbed data according to our assumption. With the experiments in Section 3.2, we conclude that our reconstructed perturbation is useful and our assumption is reasonable. As in boosting, we can feed the perturbed data of the prior branch to its posterior branch to enhance performance; we call this method the Perturbation Enhanced strategy (PE). The following experiments also show that PE can indeed enhance the performance of our MOptEqs.

2.3 Model Optimization and Forward Convergence

Forward Propagation and Implicit Differentiation. Like other equilibrium models, the forward propagation procedure solves the given equilibrium equations, Eqn (7).
On this account, our model also enjoys the constant memory cost advantage of other DEQs. In our work, we run root-finding algorithms to find the roots $\tilde{z}^* = [z_1^{*\top}, \dots, z_L^{*\top}]^\top$ of the following problem to reach the equilibrium states, the same as MDEQ (shown in Appendix A.5):

$$\nabla_{\tilde{z}}\, G(z_1, \dots, z_L; g(x)) = 0. \qquad (7)$$

As for the backward propagation, we adopt the implicit differentiation method widely used in (Bai et al., 2020; 2019; Chen et al., 2018). Instead of tracing the gradients during the forward propagation, the implicit differentiation method directly backpropagates through the equilibrium state using the Jacobian of $T_\theta = -\nabla_{\tilde{z}} G + \tilde{z}$ at $\tilde{z}^*$. For a given loss $l = \mathcal{L}(\tilde{z}^*, y)$ (where $y$ is the target), the gradients can be written as

$$\frac{\partial l}{\partial (\cdot)} = \frac{\partial l}{\partial \tilde{z}^*} \big( I - J_{T_\theta}\big|_{\tilde{z}^*} \big)^{-1} \frac{\partial T_\theta}{\partial (\cdot)}. \qquad (8)$$

Table 1: Evaluation on CIFAR-10 for different models. "Parallel-OptEqs" (POptEqs) means our model is built without the HH&D fusion (stated in Appendix A.1), i.e., formed by paralleling several OptEqs with the fusion method stated in brackets. "w/o PE" means MOptEqs is trained without our PE strategy.

(a) Comparison of the models with single scales:

| Model | Model Size | Accuracy |
|---|---|---|
| Neural ODEs | 172K | 53.7% |
| Aug. Neural ODEs | 172K | 60.6% |
| Single-tier MonDEQ | 854K | 82.5% |
| Deep OptEqs | 199K | 87.4% |
| Parallel-OptEqs | 193K | 87.4% |
| POptEqs (sum) | 193K | 88.4% |
| POptEqs (conv) | 276K | 88.9% |
| MOptEqs (w/o PE) | 193K | 89.1% |
| MOptEqs | 193K | 89.5% |

(b) Comparison of the models with multiple scales (MonDEQ here is the multi-tier monotone DEQ):

| Model | Model Size | Accuracy |
|---|---|---|
| ResNet-18 | 10M | 92.9±0.2% |
| MonDEQ | 1M | 89.7% |
| MDEQ | 2.53M | 92.6±0.2% |
| MDEQ | 10M | 93.8±0.3% |
| MOptEqs | 0.48M | 92.9±0.2% |
| MOptEqs | 1.9M | 94.0±0.1% |
| MOptEqs | 8.1M | 94.6% |
The calculation of $\frac{\partial l}{\partial \tilde{z}^*} \big( I - J_{T_\theta}\big|_{\tilde{z}^*} \big)^{-1}$ is equivalent to solving for the root $m$ of the following equation:

$$m \big( I - J_{T_\theta}\big|_{\tilde{z}^*} \big) = \frac{\partial l}{\partial \tilde{z}^*}. \qquad (9)$$

As for the root-finding algorithm, we can use Broyden's method (Broyden, 1965), Anderson acceleration (Anderson, 1965; Bai et al., 2021), or other root-finding methods to solve Eqn (7) for the equilibrium state and Eqn (9) for the backward gradient.

Forward Convergence. Like other implicit models, we place some constraints on the parameters in order to make the whole MOptEqs map $T_\Theta(\tilde{z}; x)$ ($\tilde{z}$ here is $\{z_i\}_{i=1}^{L}$) a contractive mapping. For the MOptEqs without the HH&D terms (Eqn 10), the model easily converges with $\|W_i\|_2 \leq \zeta < 1$. But since we use the HH&D module, we need to choose a proper $\lambda$ to ensure convergence; around or less than $1/C$ is appropriate. We also conduct experiments to verify the convergence in Appendix A.7.

3 Experimental Results

In this section, we conduct experiments on the image classification tasks CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and Imagenette, implemented on the PyTorch (Paszke et al., 2017) platform, to demonstrate the effectiveness of our model. Details are listed in Appendix A.6.

3.1 Comparison with Prior Implicit Models

In this part, we verify the superiority of MOptEqs in two aspects. First, we build a single-scale MOptEqs and compare the empirical results with other single-scale implicit models. Then we compare the experimental results with prior implicit models that utilize multi-scale inputs, such as MDEQ, the state-of-the-art implicit model with multi-resolution.

Implicit Models with a Single View. In this part, we compare our MOptEqs with other single-view implicit models of comparable model size. Like OptEqs, which uses three blocks in its experiments, we first construct our small MOptEqs with three branches whose outputs share the same size, with channel number C = 32.
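The backward linear system for $m$ need not be solved by forming a matrix inverse: because the map is contractive at the equilibrium, the row vector $m$ can be obtained by the fixed-point iteration $m \leftarrow \partial l / \partial \tilde{z}^* + m J$. A minimal NumPy sketch with a random contractive matrix standing in for the Jacobian $J_{T_\theta}|_{\tilde{z}^*}$ (all sizes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12
J = rng.standard_normal((d, d))
J *= 0.8 / np.linalg.norm(J, 2)     # contractive Jacobian at the equilibrium
dl_dz = rng.standard_normal(d)      # incoming gradient dl/dz*

# Direct computation of m = (dl/dz*) (I - J)^{-1}: solve m (I - J) = dl/dz*,
# i.e. the transposed system (I - J)^T m^T = (dl/dz*)^T.
m_direct = np.linalg.solve((np.eye(d) - J).T, dl_dz)

# Matrix-free iterative solve: m_{k+1} = dl/dz* + m_k J, whose fixed point
# satisfies m (I - J) = dl/dz*. Converges since the spectral radius of J < 1.
m = np.zeros(d)
for _ in range(500):
    m = dl_dz + m @ J

print(np.allclose(m, m_direct))  # True
```

In practice each `m @ J` would be a vector-Jacobian product supplied by autodiff, so the Jacobian is never materialized; accelerated root-finders like Broyden's method replace the plain iteration shown here.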
Results in Table 1(a) demonstrate that our MOptEqs structure can even enhance performance on recognition tasks in the single-view case. Compared with the multi-branch model without fusion (POptEqs), our MOptEqs' HH&D can efficiently utilize the relationships between different branches and lead the model to better performance. Our superiority also holds against former multi-branch fusion methods: "sum" (a non-parametric module) and "conv" (a parametric module). Meanwhile, we can also conclude the effectiveness of our perturbation enhanced strategy from the table, since its performance is the best.

Table 2: Evaluation on CIFAR-100 and Imagenette for MDEQ and MOptEqs.

(a) Evaluation on CIFAR-100:

| Model | Model Size | Accuracy |
|---|---|---|
| MDEQ | 2.6M | 70.8±0.2% |
| MDEQ | 11M | 72.4±0.2% |
| MOptEqs | 1.9M | 73.4±0.2% |
| MOptEqs | 8.1M | 74.7% |

(b) Evaluation on Imagenette:

| Model | Model Size | Accuracy |
|---|---|---|
| MDEQ | 2.5M | 90.5±0.2% |
| MDEQ | 10M | 91.3±0.3% |
| MOptEqs | 1.9M | 92.1±0.2% |
| MOptEqs | 10M | 92.4% |

Implicit Models with Multiple Scales. Moreover, we conduct experiments against other models that handle multi-scale inputs. Like MDEQ, we construct our MOptEqs with four branches with resolutions 32, 16, 8, and 4. Other details can be found in Appendix A.6. From Table 1(b) and Table 2(a), one can see that our MOptEqs not only outperforms the widely used explicit model ResNet-18 (He et al., 2016), but also shows better performance than MDEQ, one of the best such models, with fewer parameters on CIFAR. The empirical results for the multi-scale models further verify the superiority of MOptEqs and its strategy. Apart from experiments on small images, we also conduct experiments on Imagenette, a subset of 10 classes from ImageNet. Compared with MDEQ, our MOptEqs consistently performs better, as shown in Table 2(b).
Moreover, as one can see from the results, the gap between MDEQ and our model becomes much bigger on CIFAR-100 and Imagenette with the same training hyper-parameters. This phenomenon demonstrates that the performance of our model is much more stable than MDEQ's. Apart from these experiments, we also conduct ablation studies for our models in the following section.

3.2 A comprehensive understanding of MOptEqs

Figure 2: Visualization of the channel correlations for MOptEqs and POptEqs: (a) Average H and Average D per sample for MOptEqs; (b) the same for POptEqs; (c) the channel-correlation heatmap for MOptEqs; (d) the same for POptEqs.

Visualization of the impact of the HH&D module. In this section, we estimate the effect of our HH&D module. We run the experiments in this part using a three-branch MOptEqs with 16 channels per branch to make the visualization clearer. We verify our HH&D module's effect in two aspects. First, we plot scatter figures of the values $\sum_{i=1}^{L}\sum_{j \in N(i)} D(z_i, z_j) / \big(L(C^2 - C)\big)$ (denoted Average D) and $\sum_{i=1}^{L}\sum_{j \in N(i)} H(z_i, z_j) / (LC)$ (denoted Average H) for 100 randomly chosen samples, where $N(i)$ denotes the neighboring branches of the $i$-th branch. Since the hidden optimization problem maximizes $H$ while minimizing $D$, we find that our MOptEqs (Figure 2(a)) induces the output features to reach this goal, compared with POptEqs (MOptEqs without HH&D, Figure 2(b)). Furthermore, we also plot heatmaps of the first two branches' correlation scores ($|\mathrm{vec}^{-1}(z_2)^\top \mathrm{vec}^{-1}(z_1)| \in \mathbb{R}^{16 \times 16}$) for a randomly chosen sample in Figures 2(c) and 2(d). One can see that the MOptEqs heatmap (Figure 2(c)) is closer to a diagonal matrix, while the POptEqs heatmap (Figure 2(d)) looks random.
The heatmaps show that our HH&D induces the model to behave as we intend: corresponding channels of adjacent branches become more correlated, while the non-corresponding channels remain unrelated, due to our architecture design. The visualizations and the former results on different datasets verify the effectiveness of our modelling and of our HH&D module's design.

Figure 3: (a) Test accuracy with respect to the perturbation size for different perturbed directions (reconstructed, uniform, and binomial). (b) Test accuracy of the small MOptEqs (with and without PE, for ε = 0.1, 0.25, 0.5) under PGD attack with different inner iterations. (c) Accuracies of the small MOptEqs without HH&D trained with the augmented loss, for various γ.

Evaluation of the reconstructed perturbation. We compare our perturbation with randomly generated ones on a one-branch well-trained MOptEqs to validate its effectiveness. We first feed a natural sample $x$ to the model, acquire the output $z^*$, generate our reconstructed perturbation from this output, and then feed the perturbed feature to the model. Furthermore, we add perturbations generated from the Binomial distribution ($P(\delta_i = \epsilon) = P(\delta_i = -\epsilon) = 0.5$) and the Uniform distribution $U[-\epsilon, \epsilon]$ to the input features $g(x)$ for comparison. The result for each distribution is averaged over five trials. Figure 3(a) shows that the perturbation generated by maximizing $G$ is much more effective, which validates the effectiveness of our reconstructed perturbations and the rationality of our assumptions and analysis in Section 2.2.

Robustness of our MOptEqs trained with the PE strategy.
In addition to improving the generalization abilities of models, as shown in Section 3.1, perturbations of proper size may also enhance the robustness of our MOptEqs, as shown in Figure 3(b). We conduct experiments on the small MOptEqs (the same setting as the single-view model in Section 3.1). Our perturbation enhanced strategy grows stronger as $\epsilon$ increases. The figure shows that the accuracy of the naturally trained model drops more quickly than that of the model trained with the PE strategy. Overall, training with appropriately sized reconstructed perturbations can partly improve the robustness of our MOptEqs.

MOptEqs vs. adding regularizers to the training loss. Apart from our HH&D module, adding $D$ and $H$ to the training loss is one of the most popular methods for obtaining outputs with certain properties. To demonstrate the superiority of our module empirically, we conduct experiments on our MOptEqs trained with cross-entropy and on POptEqs (MOptEqs w/o HH&D) trained with the cross-entropy loss plus $\gamma \sum_{i=1}^{L} \big( \lambda D(z_i, z_{i+1}) - H(z_i, z_{i+1}) \big)$ as a regularizer (we call it the augmented loss). We then plot the test accuracy for different $\gamma$ under the augmented loss ($\lambda$ the same as ours) in Figure 3(c). With a proper $\gamma$, POptEqs trained with the additional loss can perform better than with $\gamma = 0$. This phenomenon demonstrates the effectiveness of our HH&D modelling. Furthermore, the figure demonstrates the superiority of our model, since the traditional method is consistently worse than ours. We leave some other explorations of our model to the Appendix.

4 Conclusion

We introduce the multi-branch optimization induced equilibrium models (MOptEqs), a new extension of OptEqs that can utilize multi-scale information for recognition tasks while retaining the ability to recover an optimization problem whose solution is equivalent to the equilibrium states of our model. The model architecture is designed by modelling the hidden objective function for the multi-resolution recognition task.
Furthermore, we also propose a new strategy inspired by our understanding of the hidden optimization problem. The empirical results show the advantages of our proposed methods. The success of our HH&D module and PE strategy demonstrates the deep link between optimization problems and neural architectures and may motivate further explorations.

Acknowledgments

Yisen Wang is partially supported by the National Natural Science Foundation of China under Grant 62006153, Project 2020BD006 supported by PKU-Baidu Fund, and Open Research Projects of Zhejiang Lab (No. 2022RC0AB05). Zhouchen Lin is supported by the NSF China (No. 61731018), NSFC Tianyuan Fund for Mathematics (No. 12026606), Project 2020BD006 supported by PKU-Baidu Fund, and Qualcomm.

References

Takuma Amada, Kazuya Kakizaki, Toshinori Araki, Seng Pei Liew, Joseph Keshet, and Jun Furukawa. Adversarial robustness for face recognition: How to introduce ensemble diversity among feature extractors? 2021.

Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136–145. PMLR, 2017.

Donald G. M. Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM, 12(4):547–560, 1965. doi: 10.1145/321296.321305. URL http://dblp.uni-trier.de/db/journals/jacm/jacm12.html#Anderson65.

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d'Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, pp. 688–699, 2019.

Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Multiscale deep equilibrium models.
In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020.

Shaojie Bai, Vladlen Koltun, and J. Zico Kolter. Stabilizing equilibrium models by Jacobian regularization. CoRR, abs/2106.14342, 2021. URL https://arxiv.org/abs/2106.14342.

David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. GAN dissection: Visualizing and understanding generative adversarial networks. CoRR, abs/1811.10597, 2018. URL http://arxiv.org/abs/1811.10597.

Charles G. Broyden. A class of methods for solving nonlinear simultaneous equations. Mathematics of Computation, 19(92):577–593, 1965.

Peter J. Burt and Edward H. Adelson. The Laplacian pyramid as a compact image code. In Readings in Computer Vision, pp. 671–679. Elsevier, 1987.

Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma. Deep networks from the principle of rate reduction. arXiv preprint arXiv:2010.14765, 2020.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp. 6572–6583, 2018.

Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J. Zico Kolter. End-to-end differentiable physics for learning and control. Advances in Neural Information Processing Systems, 31:7178–7189, 2018.

Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. Advances in Neural Information Processing Systems, 30:1013–1023, 2017.
Zhengyang Geng, Meng-Hao Guo, Hongxu Chen, Xia Li, Ke Wei, and Zhouchen Lin. Is attention better than matrix decomposition? In International Conference on Learning Representations, 2021.

Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, and Armin Askari. Implicit deep learning. CoRR, abs/1908.06315, 2019.

Stephen Gould, Richard Hartley, and Dylan Campbell. Deep declarative networks: A new hope. CoRR, abs/1909.04866, 2019.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016, pp. 770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90.

Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21–26, 2017, pp. 2261–2269. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.243.

Kezhi Kong, Guohao Li, Mucong Ding, Zuxuan Wu, Chen Zhu, Bernard Ghanem, Gavin Taylor, and Tom Goldstein. FLAG: Adversarial data augmentation for graph neural networks. arXiv preprint arXiv:2010.09891, 2020.

Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 609–616, 2009a.

Honglak Lee, Roger B. Grosse, Rajesh Ranganath, and Andrew Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In Andrea Pohoreckyj Danyluk, Léon Bottou, and Michael L.
Littman (eds.), Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, volume 382 of ACM International Conference Proceeding Series, pp. 609–616. ACM, 2009b. doi: 10.1145/1553374.1553453.

Jia Li, Cong Fang, and Zhouchen Lin. Lifted proximal operator machines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4181–4188, 2019.

Mingjie Li, Lingshen He, and Zhouchen Lin. Implicit Euler skip connections: Enhancing adversarial robustness via numerical stability. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 5874–5883. PMLR, 2020.

Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.

Guangcan Liu, Shiyu Chang, and Yi Ma. Blind image deblurring using spectral properties of convolution operators. IEEE Transactions on Image Processing, 23(12):5047–5056, 2014.

Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, and Hajime Asama. Dissecting neural ODEs. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020.

Michael C. Mozer. A focused back-propagation algorithm for temporal pattern recognition. Complex Systems, 3(4):349–381, 1989.

Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity.
In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9–15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 4970–4979. PMLR, 2019.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.

Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Scalable differentiable physics for learning and control. In International Conference on Machine Learning, pp. 7847–7856. PMLR, 2020.

Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, 2017. doi: 10.1109/TPAMI.2016.2572683.

Shiyu Tang, Ruihao Gong, Yan Wang, Aishan Liu, Jiakai Wang, Xinyun Chen, Fengwei Yu, Xianglong Liu, Dawn Song, Alan Yuille, et al. RobustART: Benchmarking robustness on architecture design and training techniques. arXiv preprint arXiv:2109.05211, 2021.

Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, and Bin Xiao. Deep high-resolution representation learning for visual recognition. CoRR, abs/1908.07919, 2019a.

Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, pp. 6545–6554. PMLR, 2019b.

Ezra Winston and J. Zico Kolter. Monotone operator equilibrium networks. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, 2020.

Cihang Xie and Alan L. Yuille.
Intriguing properties of adversarial training at scale. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020. OpenReview.net, 2020.

Cihang Xie, Mingxing Tan, Boqing Gong, Jiang Wang, Alan L. Yuille, and Quoc V. Le. Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828, 2020.

Xingyu Xie, Jianlong Wu, Guangcan Liu, Zhisheng Zhong, and Zhouchen Lin. Differentiable linearized ADMM. In International Conference on Machine Learning, pp. 6902–6911. PMLR, 2019.

Xingyu Xie, Jianlong Wu, Guangcan Liu, Zhisheng Zhong, and Zhouchen Lin. Optimization induced deep equilibrium networks. arXiv preprint arXiv:2105.13228, 2021.

Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3918–3927, 2016.

Bohang Zhang, Tianle Cai, Zhou Lu, Di He, and Liwei Wang. Towards certifying ℓ∞ robustness using neural networks with ℓ∞-dist neurons. CoRR, abs/2102.05363, 2021.

Hongyang Zhang, Zhouchen Lin, Chao Zhang, and Edward Y. Chang. Exact recoverability of robust PCA via outlier pursuit with tight recovery bounds. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

Quanshi Zhang and Song-Chun Zhu. Visual interpretability for deep learning: a survey. arXiv preprint arXiv:1802.00614, 2018.

Quanshi Zhang, Ying Nian Wu, and Song-Chun Zhu. Interpretable convolutional neural networks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18–22, 2018, pp. 8827–8836. Computer Vision Foundation / IEEE Computer Society, 2018a. doi: 10.1109/CVPR.2018.00920. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Zhang_Interpretable_Convolutional_Neural_CVPR_2018_paper.html.
Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In International Conference on Machine Learning, pp. 5862–5871. PMLR, 2018b.

A.1 Paralleling Multi-Branch OptEqs

Experiments show that even a simple one-layer OptEqs module enjoys impressive generalization abilities. If we view a simple one-layer OptEqs module as a powerful feature generator, we can use an "ensemble" scheme on the DEQs by paralleling several DEQ modules. We then obtain the multi-branch implicit structure (formulated as Eqn 10), which is capable of utilizing samples at multiple resolutions while retaining the capability of recovering an optimization problem:

z*_1 = W_1^T σ(W_1 z*_1 + U_1 g(x) + b_1),
......
z*_L = W_L^T σ(W_L z*_L + U_L g(x) + b_L),    (10)

where z*_i denotes the equilibrium output of the i-th branch. We call this structure Parallel OptEqs (POptEqs). As Xie et al. (2021) stated for OptEqs, POptEqs can also be depicted as solving the following problem (Eqn 11) if each W_i is invertible and σ(·) is the activation function:

min_{z_1,...,z_L} G(z_1, ..., z_L; g(x)) = min_{z_1,...,z_L} Σ_{i=1}^{L} [ f(W_i^{-1} z_i) − ⟨U_i g(x) + b_i, W_i^{-1} z_i⟩ + (1/2)∥W_i^{-1} z_i∥² − (1/2)∥z_i∥² ],    (11)

where g(x) is the input feature for raw input x, z_i ∈ R^{d_i}, and f is the penalty corresponding to the activation function. For example, f is the indicator of the positive orthant, f(x) = I{x ≥ 0}, if we use the ReLU activation, since (I + ∂f)^{-1} is then exactly the ReLU function; other activation functions can be found in Li et al. (2019). The calculation of the above structure (Eqn 10) is equivalent to finding the stationary points ∇_{z_i} G = 0.

A.2 Adversarial Perturbations

Adversarial perturbations aim to cheat well-trained neural networks with small or unnoticeable changes to natural images.
The adversarial perturbation can be obtained by solving the following maximization problem for input x_nat and label y:

max_{δ ∈ C} L(f_NN(x_nat + δ); y),

where C is the feasible set of δ, which prevents the change from being too large and is usually chosen to be an ℓ∞ ball. Since the neural network is non-convex in δ, obtaining the perturbation is not easy. The most popular method is the PGD-k method, which solves the above problem by applying the projected gradient method k times. Although the performance of such a method is satisfactory in most cases, it makes the whole training procedure k + 1 times slower than the original one. In our work, we replace the above objective function with our model's hidden objective function. Since the hidden optimization problem is convex with respect to the perturbation, we can obtain the perturbation much more easily. Moreover, such perturbations can also enhance the generalization ability, as shown in the experiments.

A.3 Proofs for Proposition 1

In this section, we prove that the equilibrium equation Eqn 1 also characterizes the first-order stationary points of the optimization problem Eqn 5.

Proof 1 The equilibrium equations (Eqn 1) in the proposition can be separated into different branches as follows:

z_1 = W_1^T σ(W_1 h_1(z_1, z_L, z_2) + U_1 g(x) + b_1),
......
z_i = W_i^T σ(W_i h_i(z_i, z_{i−1}, z_{i+1}) + U_i g(x) + b_i),
......
z_L = W_L^T σ(W_L h_L(z_L, z_{L−1}, z_1) + U_L g(x) + b_L),    (12)

where h_i : R^{d_i} × R^{d_{i+1}} × R^{d_{i−1}} → R^{d_i} can be defined as:

h_i(z_i, z_{i−1}, z_{i+1}) = P_{i:i+1} z_{i+1} + P_{i−1:i}^T z_{i−1}
    − λ vec( vec^{-1}(P_{i:i+1} z_{i+1}) sign[M_{i:i+1} − diag(M_{i:i+1})] )
    − λ vec( vec^{-1}(P_{i−1:i}^T z_{i−1}) sign[M_{i−1:i}^T − diag(M_{i−1:i})] ),    (13)

where M_{i:i+1} = vec^{-1}(P_{i:i+1} z_{i+1})^T vec^{-1}(z_i).
First, we need to prove that

h_i(z_i, z_{i−1}, z_{i+1}) = −∇_{z_i} Σ_{k=1}^{L} [λ D(z_k, z_{k+1}) − H(z_k, z_{k+1})].    (14)

The derivative of the right-hand side can be split into the derivatives of H and of D. For H we have

∇_{z_i} Σ_k H(z_k, z_{k+1}) = ∇_{z_i} Σ_k z_k^T P_{k:k+1} z_{k+1} = ∇_{z_i} (z_i^T P_{i:i+1} z_{i+1} + z_{i−1}^T P_{i−1:i} z_i) = P_{i:i+1} z_{i+1} + P_{i−1:i}^T z_{i−1},

which is the first term of h_i. Next we calculate the derivative of D. Only the terms containing z_i survive:

∇_{z_i} Σ_k D(z_k, z_{k+1}) = ∇_{z_i} Σ_{k1 ≠ k2} |vec^{-1}(z_i)^{(k1)T} vec^{-1}(P_{i:i+1} z_{i+1})^{(k2)}| + ∇_{z_i} Σ_{k1 ≠ k2} |vec^{-1}(z_{i−1})^{(k1)T} vec^{-1}(P_{i−1:i} z_i)^{(k2)}|,

where the superscript (k) denotes the k-th channel. For the sampling matrix P_{i−1:i}, since the sampling is done for each channel independently, the following identity holds:

vec^{-1}(z_{i−1})^{(k1)T} vec^{-1}(P_{i−1:i} z_i)^{(k2)} = vec^{-1}(P_{i−1:i}^T z_{i−1})^{(k1)T} vec^{-1}(z_i)^{(k2)}.

Differentiating the first sum channel by channel gives

∇_{z_i} Σ_{k1 ≠ k2} |vec^{-1}(z_i)^{(k1)T} vec^{-1}(P_{i:i+1} z_{i+1})^{(k2)}| = vec( vec^{-1}(P_{i:i+1} z_{i+1}) sign[M_{i:i+1} − diag(M_{i:i+1})] ),

with M_{i:i+1} = vec^{-1}(P_{i:i+1} z_{i+1})^T vec^{-1}(z_i), and in the same manner,

∇_{z_i} Σ_{k1 ≠ k2} |vec^{-1}(P_{i−1:i}^T z_{i−1})^{(k1)T} vec^{-1}(z_i)^{(k2)}| = vec( vec^{-1}(P_{i−1:i}^T z_{i−1}) sign[M_{i−1:i}^T − diag(M_{i−1:i})] ).

Taking all the terms together, we have proved the correctness of Eqn 14. We can now relate Eqn 12 to the first-order stationary points of Eqn 5. Take z_i for example: setting the gradient of

f(W_i^{-1} z_i) − ⟨U_i g(x) + b_i, W_i^{-1} z_i⟩ + (1/2)∥W_i^{-1} z_i∥² − (1/2)∥z_i∥² + λ D(z_i, z_{i+1}) − H(z_i, z_{i+1})

with respect to z_i to zero yields

0 ∈ W_i^{-T} (I + ∂f)(W_i^{-1} z_i) − W_i^{-T} (U_i g(x) + b_i) − h_i(z_i, z_{i−1}, z_{i+1}).

Then we obtain the equilibrium equation

z_i = W_i^T σ(W_i h_i(z_i, z_{i−1}, z_{i+1}) + U_i g(x) + b_i),

and the argument is the same for every i.
In this manner, we have established the relation between the structure and the optimization problem. The proof of the proposition is complete.

A.4 Proofs for Proposition 2

Proof 2 From the preliminaries above, with δ = ϵ SIGN(U^T W^{-T} z*), the difference between G(z*_δ; g(x) − δ) and G(z*; g(x)) is:

G(z*_δ; g(x) − δ) − G(z*; g(x)) = G(z*_δ; g(x)) − G(z*; g(x)) + ⟨ϵ sign(U^T W^{-T} z*), U^T W^{-T} z*_δ⟩,

where z*_δ denotes the equilibrium point for the perturbed input feature g(x) − δ. Since z* is the minimizer of G at g(x), the above can be bounded as follows:

G(z*_δ; g(x) − δ) − G(z*; g(x)) ≥ ⟨ϵ sign(U^T W^{-T} z*), U^T W^{-T} z*_δ⟩
= ϵ ∥U^T W^{-T} z*∥₁ − ⟨ϵ sign(U^T W^{-T} z*), U^T W^{-T} (z* − z*_δ)⟩.

From the structure of OptEqs, we can get:

⟨ϵ SIGN(U^T W^{-T} z*), U^T W^{-T} (z* − z*_δ)⟩
≤ ϵ ∥U^T W^{-T} (z* − z*_δ)∥₁
≤ N ϵ ∥U^T W^{-T} (z* − z*_δ)∥₂
≤ N ϵ ∥U^T (σ(Wz* + U g(x) + b) − σ(Wz*_δ + U(g(x) − δ) + b))∥₂
≤ N ϵ ∥σ(Wz* + U g(x) + b) − σ(Wz*_δ + U(g(x) − δ) + b)∥₂
≤ N ϵ ∥Wz* + U g(x) + b − Wz*_δ − U(g(x) − δ) − b∥₂
= N ϵ ∥W(z* − z*_δ) + Uδ∥₂
≤ N ϵ ∥W(z* − z*_δ)∥₂ + N ϵ ∥Uδ∥₂
≤ N ϵ ∥z* − z*_δ∥₂ + N ϵ².

The inequalities hold because of the Lipschitzness of common activation functions (ReLU and Leaky ReLU) and the bounds on the weight matrices. With the above results, we get:

G(z*_δ; g(x) − δ) − G(z*; g(x)) ≥ ϵ ∥U^T W^{-T} z*∥₁ − N ϵ ∥z* − z*_δ∥₂ − N ϵ².

Then we obtain the proposition:

max{ G(z*_δ; g(x) − δ) − G(z*; g(x)), N ϵ ∥z* − z*_δ∥₂ } ≥ (1/2) ( ϵ ∥U^T W^{-T} z*∥₁ − N ϵ² ).

The proofs are similar when considering the HH&D modules.

A.5 Forward Process of our MOptEqs Implicit Part

The forward propagation is illustrated as follows:

Algorithm 1: Forward Process of our MOptEqs implicit part.
Require: Input x, initial points {z_i}_{i=1}^{L}, perturbation size ϵ, δ_0 = 0, maximum perturbation size ϵ.
Ensure: Equilibrium points {z*_i}_{i=1}^{L}.
  Calculate the input feature g(x).
  Solving ∇_z̄ G(z̄; g(x)) = 0 is the same as finding the equilibrium points of T_θ = −∇_z̄ G + z̄, i.e. T_θ(z̄) = z̄.
  Use Broyden's method (Broyden, 1965), the Anderson method (Anderson, 1965), or others to find the roots of the following equation, as in MDEQ: T_θ(z̄; g(x)) = z̄.
  return z̄* = {z*_i}_{i=1}^{L}

And the calculation of T_θ is shown as follows:

Algorithm 2: The Calculation of T_θ = −∇_z̄ G + z̄ at (z̄, g(x)).
Require: Input feature g(x), initial points {z_i}_{i=1}^{L}, perturbation size ϵ, δ_0 = 0.
Ensure: Output T_θ = −∇_z̄ G(z̄; g(x)) + z̄.
  for i in RANGE(1, L) do
    k1, k2 ← neighbors of i in the L-loop
    Calculate zp_i by h_i (shown in Figure 1 (b)): zp_i ← h_i(z_i, z_{i−1}, z_{i+1})
    if use PE method then
      y_i ← σ(W_i zp_i + U_i (g(x) − δ_{i−1}) + b_i)
      δ_i ← ϵ SIGN(U_i^T y_i)
    else
      y_i ← σ(W_i zp_i + U_i g(x) + b_i)
    end if
    z_i ← W_i^T y_i
  end for
  return {z_i}_{i=1}^{L}

A.6 Experiment Details of MOptEqs for the Classification Experiments

A.6.1 CIFAR-10

For the classification tasks with single-view models, we set the channel number of each branch to 32, λ = 0.01, and the perturbation size ϵ = 0.1 for the small MOptEqs with three branches, matching the single-view comparison with OptEqs. The batch size is 128 for all the experiments. As for the multi-scale model comparison, we use four branches, with each branch taking inputs at resolutions 32, 16, 8, and 4 like MDEQ; the output channel number of each branch is 256 for the MOptEqs with 1.9M learnable parameters. At the same time, we set the channel number to 128 for the MOptEqs with 0.48M learnable parameters. We set λ = 0.001 and the perturbation size ϵ = 0.1 for these experiments. We use the widely used SGD algorithm for the training procedure. We set the weight decay to 0.001, and the initial learning rate starts from 0.1 and decays by a factor of 0.1 at the 100th, 150th, and 175th epochs, with 200 epochs in total like others. We use the standard data augmentation for all the experiments. Our implementation of the small MDEQ only changes the channel number of MDEQ's 10M model to obtain a size comparable to our model, without changing the training settings and its hyper-parameters.
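The training schedule above (SGD with weight decay 0.001, initial learning rate 0.1 decayed by a factor of 0.1 at epochs 100, 150, and 175 out of 200) can be sketched in PyTorch as follows. The model here is a trivial placeholder, and the momentum value is our assumption, not stated in the paper:

```python
import torch

# Hypothetical stand-in for the MOptEqs classifier; any nn.Module works here.
model = torch.nn.Linear(32 * 32 * 3, 10)

# SGD with the reported hyper-parameters: lr 0.1, weight decay 1e-3.
# The momentum value 0.9 is an assumption (a common default), not from the paper.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-3)

# Decay the learning rate by 0.1 at epochs 100, 150, and 175 (200 epochs total).
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100, 150, 175], gamma=0.1)

lrs = []
for epoch in range(200):
    # ... one training epoch over the dataset would go here ...
    optimizer.step()   # dummy step so the scheduler ordering is respected
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs[99], lrs[149], lrs[199])  # roughly 0.01, 0.001, 0.0001 after each decay
```

The same skeleton covers the CIFAR-100 setting, which uses identical optimizer hyper-parameters.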
A.6.2 CIFAR-100

Besides CIFAR-10, we also conduct experiments on CIFAR-100 classification2 to further verify our model's effectiveness. The model we use is the same as the models used in the multi-scale comparison, only changing the number of outputs from 10 to 100 for classification. The MDEQ with 11M parameters is the same model as Bai et al. (2020) proposed for the CIFAR-10 experiment, again only changing the number of outputs from 10 to 100 for CIFAR-100. Moreover, the hyper-parameter setting is also the same as proposed in Bai et al. (2020). We use the widely used SGD algorithm for the training procedure. We set the weight decay to 0.001, and the initial learning rate starts from 0.1 and decays by a factor of 0.1 at the 100th, 150th, and 175th epochs, with 200 epochs in total like others. We use the standard data augmentation for all the experiments.

2https://www.cs.toronto.edu/~kriz/cifar.html

Figure 4: The convergence of the forward root-finding procedure used for the forward propagation of our MOptEqs. (a) Convergence of the small MOptEqs with respect to different λ (λ = 0.005, 0.01, 0.1). (b) Convergence of the multi-scale MOptEqs (λ = 0.001). The relative error is defined as ∥T^{j+1} − T^{j}∥₂ / ∥T^{j}∥₂.

A.6.3 Imagenette

Besides the classification of small-size images, we also conduct experiments on Imagenette3, a subset of 10 classes from ImageNet4 with 9469 training photos and a test set of 3925 images. Furthermore, we use the full-size version of Imagenette5 to further verify the superiority of our models on large-scale images. The model architecture for our MOptEqs and for MDEQ (used for comparison) is the same as the ones used in the multi-scale model comparison for CIFAR-10, except for adding two downsampling layers at the head of the models to downsample the input size from 256 to 64.
As for the hyper-parameters, we only change the weight decay to 5e−5 and the batch size to 32, and extend the whole training to 200 epochs with learning-rate decay at the 100th, 150th, and 175th epochs for all the models. The hyper-parameters for both MOptEqs and MDEQ are kept almost unchanged across all the datasets.

A.7 Convergence Validation for our MOptEqs

As shown in Figure 4(a), the Anderson method can quickly find the equilibrium points of our small MOptEq used for the single-view comparison. From the figure, one can see that MOptEqs with a proper λ (around or smaller than 1/C, around 0.03 in this circumstance) can keep the whole mapping contractive, but a too large λ will make the model hard to converge or even fail. Apart from plotting the convergence of our forward root-finding procedure for the small MOptEqs (shown in Figure 4(a)), we also plot the convergence of the forward propagation of our multi-scale MOptEqs (λ = 0.001) used in the Imagenette and CIFAR experiments, as shown in Figure 4(b). We use the pre-trained model on Imagenette whose prediction accuracy is reported in the sections above. One can see that our multi-scale MOptEqs also quickly converges to the equilibrium points, as expected.

A.8 Ablation Studies on the Impact of the D and H Parts

In this section, we analyze the influence of our D and H terms for a single-branch MOptEq model trained without the PE method, shown in Table 3. We set λ = 0.02 for this experiment. From the table, one can see that the contribution of the H term is slightly higher than that of the diversity term, but considering both terms significantly improves the results compared with MOptEqs that takes neither into account.
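The relative-error curves discussed in A.7 can be mimicked with a toy contractive fixed-point iteration. The operator below is a hypothetical stand-in for Tθ (a random linear map rescaled to spectral norm 0.5, composed with the 1-Lipschitz tanh), not the actual MOptEqs module; it only illustrates the geometric decay of the relative error that Figure 4 reports:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the implicit map T_theta: contractive because W is
# rescaled to spectral norm 0.5 < 1 and tanh is 1-Lipschitz.
W = rng.standard_normal((64, 64))
W *= 0.5 / np.linalg.norm(W, 2)      # ord=2 gives the largest singular value
u = rng.standard_normal(64)          # plays the role of U g(x) + b

def T(z):
    return np.tanh(W @ z + u)

z = np.zeros(64)
rel_errs = []
for _ in range(40):
    z_next = T(z)
    # Relative error between successive iterates, as tracked in Figure 4.
    rel_errs.append(np.linalg.norm(z_next - z) / np.linalg.norm(z_next))
    z = z_next

# For a contraction the relative error shrinks roughly geometrically.
print(rel_errs[0], rel_errs[-1])
```

A λ chosen too large in the actual model corresponds to the map losing this contraction property, which is why convergence then stalls or fails.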
3https://github.com/fastai/imagenette/
4https://www.image-net.org/
5https://s3.amazonaws.com/fast-ai-imageclas/imagenette2.tgz

Model     Natural Accuracy
MOptEqs   89.1%
− D       88.6%
− H       88.4%
− D, H    87.4%

Table 3: Evaluation of MOptEqs trained with and without the D and H parts.

A.9 Experiments on ImageNet

Besides CIFAR and Imagenette, we also conducted experiments on ImageNet with 13M parameters, shown below:

Model     Model Size   Natural Accuracy
ResNet    13M          70.3%
HRNet     13M          72.3%
MOptEqs   13M          72.5%

Table 4: Evaluation of MOptEqs on ImageNet.

From the results, one can see that our model achieves satisfying performance on ImageNet compared with the state-of-the-art models.

A.10 PE Method and Robustness

From the above experiments, one can see that the PE method may also improve the robustness of our MOptEqs. In this section, we conduct experiments to see whether the PE method can further enhance the robustness of MOptEqs when we use adversarial training methods on these models. We run comparisons of PGD-3 training for the single-view MOptEq with and without PE, and the results are shown as follows:

Model          Natural Accuracy   PGD-20 Accuracy (ϵ = 8/255)
PGD-3 w/o PE   78.60%             37.18%
PGD-3 w. PE    78.40%             38.05%

Table 5: Evaluation of MOptEqs trained by PGD-3 with and without the PE method.

From the table, we can conclude that our PE method can improve the robustness of our MOptEqs model trained by the adversarial training method. This advantage also demonstrates the superiority of our model's mathematical interpretability, since the PE method is derived from the model's hidden optimization problem.

A.11 Other Training Methods: Fixed-Point Method and One-Step Gradients

A.11.1 Methods

Since our model is not so complicated, we can use the fixed-point method with L iterations for the forward propagation.
While for the back propagation, we can use BPTT (Mozer, 1989) or the method proposed by Geng et al. (2021), which backpropagates only through the final forward iteration using the chain rule:

∂L/∂θ = (∂L/∂T_Θ^L) (∂T_Θ^L/∂θ),

where θ ∈ Θ are the learnable parameters of the implicit model and T_Θ^L denotes the output z of the L-th fixed-point iteration. Since the back-propagation of such a strategy is similar to backpropagating through only a single layer of a DNN module, without running a root-finding algorithm for the Jacobian as in implicit differentiation, the computational cost is much lower than that of the implicit method. However, the gradient approximated by such a method is not accurate and may cause a performance drop. Nevertheless, as we illustrate below, such a method can be regarded as a trade-off if the computational resources for training are not enough.

A.11.2 Empirical Results and Computational Efficiency of the One-Step Gradient

We take the Imagenette dataset as an example; the training memory cost, forward time, and test accuracy are shown as follows (each batch contains 32 images, and training is finished on 2 x GTX 1080Ti for 200 epochs):

Model               Memory/Batch   Infer Time/Batch   Total Time   Accuracy
MDEQ                8GB            1.06s              9h           91.42%
MOptEq (one step)   3GB            0.26s              3h           91.21%

Table 6: The comparison of the computational cost of MDEQ and our MOptEqs (trained by the one-step gradient).

The results show that the one-step gradient method can save a lot of computational resources with only a slight drop in performance.
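The one-step scheme described above can be sketched in PyTorch: run the fixed-point iterations with gradient tracking disabled, then differentiate through only one final application of the layer, so the backward pass costs the same as a single explicit layer. The layer below is a hypothetical single-branch stand-in (a contractive tanh map), not the full MOptEqs module:

```python
import torch

torch.manual_seed(0)
d = 16

# Hypothetical single-branch implicit layer: z = tanh(z W^T + x U^T + b).
# W is kept close to 0.5*I so that the map stays contractive.
W = torch.nn.Parameter(0.5 * torch.eye(d) + 0.01 * torch.randn(d, d))
U = torch.nn.Parameter(torch.randn(d, d))
b = torch.nn.Parameter(torch.zeros(d))

def layer(z, x):
    return torch.tanh(z @ W.T + x @ U.T + b)

x = torch.randn(4, d)

# Forward: L fixed-point iterations without building the autograd graph.
with torch.no_grad():
    z = torch.zeros(4, d)
    for _ in range(30):
        z = layer(z, x)

# One-step gradient: differentiate through a single final iteration only.
z = layer(z.detach(), x)
loss = z.pow(2).sum()
loss.backward()

print(W.grad.shape, U.grad.shape)  # gradients exist for all parameters
```

Implicit differentiation would instead solve a linear system involving the Jacobian at the equilibrium; the sketch above trades that exactness for a backward pass through one layer, matching the memory and time savings reported in Table 6.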