# GEQ: Gaussian Kernel Inspired Equilibrium Models

Mingjie Li¹, Yisen Wang¹,², Zhouchen Lin¹,²,³
¹ National Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University
² Institute for Artificial Intelligence, Peking University
³ Peng Cheng Laboratory
lmjat0111@outlook.com, {yisen.wang, zlin}@pku.edu.cn

Despite the connection that optimization-induced deep equilibrium models (OptEqs) establish between their outputs and underlying hidden optimization problems, the performance of OptEqs and related works is still unsatisfactory, especially when compared with deep networks. One key factor behind this limitation is the use of linear kernels to extract features in these models. To address this issue, we propose replacing the linear kernel with a new function that can readily capture nonlinear feature dependencies in the input data. Drawing inspiration from classical machine learning algorithms, we introduce Gaussian kernels as the alternative function and propose our new equilibrium model, which we refer to as GEQ. By leveraging Gaussian kernels, GEQ can effectively extract the nonlinear information embedded in the input features, surpassing the performance of the original OptEqs. Moreover, GEQ can be viewed as a weight-tied neural network with infinite width and depth. GEQ also enjoys better theoretical properties and improved overall performance, and exhibits enhanced stability when confronted with various samples. We further substantiate the effectiveness and stability of GEQ through a series of comprehensive experiments.

1 Introduction

Deep Neural Networks (DNNs) show impressive performance in many real-world tasks on various kinds of data, such as graphs [43], images [41, 20], and sequences [8].
However, most neural network structures are constructed by experience or by searching on surrogate datasets [31, 57]. Such architectures are therefore hard to interpret, and this hinders further development. Apart from current neural network models, traditional machine learning methods such as dictionary learning [46, 33], subspace clustering [54], and others [29, 56, 55, 30] design their whole procedure by formulating optimization problems with specific regularizers customized to their mathematical modeling and requirements. Thus, these models are easy to interpret. However, the procedures of traditional machine learning algorithms do not consider the hidden properties of features and labels, so they usually perform worse on tasks with more data. To link the two types of models, OptEqs [52] recovers the model's hidden optimization problem to make the model mathematically explainable. They claim that the output features z̃ (which we also call the equilibrium state) with respect to input x, obtained by solving the fixed-point equation in Eqn (1), are the optimal solution of the hidden optimization problem defined in Eqn (2):

$$\tilde{z} = W\sigma(-W^\top \tilde{z} + Ux + b), \qquad \hat{y} = W_c \tilde{z} + b_c, \quad (1)$$

$$\min_{\tilde{z}} G(\tilde{z}; x) = \min_{\tilde{z}} \; \mathbf{1}^\top f(W^{-1}\tilde{z}) - \left\langle Ux + b, \, W^{-1}\tilde{z} \right\rangle + \frac{1}{2}\left\|W^{-1}\tilde{z}\right\|_2^2 + \frac{1}{2}\left\|\tilde{z}\right\|_2^2, \quad (2)$$

where σ is the ReLU activation function and U, W, b, W_c, b_c are learnable parameters. They are trained by optimizing loss functions, such as the cross-entropy loss, computed on the final prediction ŷ derived from z̃ as shown in Eqn (1). W⁻¹ denotes an inverse (or pseudo-inverse) of W, and f is the positive indicator function, which outputs 0 when x ≥ 0 and +∞ otherwise. With this choice, the first-order condition for Eqn (2) contains the nonlinear ReLU function, which is also the activation function in our GEQ.

* Corresponding author.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).
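For concreteness, the equilibrium state of such a model can be computed by plain fixed-point iteration whenever the update is a contraction. Below is a minimal NumPy sketch of such a solver, written in the reformulated form z = σ(−WᵀWz + Ux + b) used in Section 3.1; the sizes, weight scales, and contraction assumption are ours for illustration, not the authors' released code:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def opteq_forward(W, U, b, x, tol=1e-8, max_iter=500):
    """Solve z = relu(-W^T W z + U x + b) by naive fixed-point iteration.

    Converges when ||W^T W||_2 < 1 (the contraction condition assumed
    throughout the paper). Hypothetical helper for illustration only."""
    z = np.zeros(W.shape[1])
    for _ in range(max_iter):
        z_new = relu(-W.T @ W @ z + U @ x + b)
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

rng = np.random.default_rng(0)
n, d = 16, 8
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)  # small scale keeps ||W^T W|| < 1
U = rng.standard_normal((n, d)) / np.sqrt(d)
b = 0.1 * rng.standard_normal(n)
x = rng.standard_normal(d)

z_star = opteq_forward(W, U, b, x)
# verify z_star satisfies the equilibrium equation
residual = np.linalg.norm(z_star - relu(-W.T @ W @ z_star + U @ x + b))
print(residual)
```

With the chosen scale, the iteration map is roughly a 0.4-contraction, so a few dozen iterations suffice; the deeper point is that the returned `z_star` is simultaneously the fixed point of the forward pass and the minimizer of the hidden optimization problem.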
The equivalence between z̃ and the optimal solution of the hidden optimization problem (2) enables researchers not only to gain insights into OptEqs' behavior by studying the underlying hidden optimization problem, but also to design new models based on different problem formulations tailored to specific tasks. For instance, OptEqs introduces a module that promotes output sparsity, and Multi-branch OptEqs (MOptEqs) incorporates fusion modules to enhance the diversity among its branches. Despite these advancements and the incorporation of various modules inspired by different vision tasks, the performance of OptEqs and related models still falls short of deep neural networks in image classification. This discrepancy suggests the existence of crucial components that limit the performance of equilibrium models. To identify such a component, we delve into the hidden optimization problem of OptEqs and observe that it can be decomposed into two distinct parts: the regularizer term on the output features and the feature extraction term. The feature extraction term is crucial, as it depends on the input and determines the patterns extracted from the input features, yet its exploration has been largely overlooked, with linear kernel functions being the predominant choice. We therefore believe that the limitations of previous equilibrium models stem from their feature extraction parts, as linear kernels struggle to capture complex features effectively. Building upon these insights, we take a step forward by leveraging the widely adopted Gaussian kernel for feature extraction in the hidden optimization problem. Then, by calculating the stationarity condition of this new hidden optimization problem, we propose our new type of OptEqs, the Gaussian kernel inspired equilibrium model (GEQ).
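The intuition behind choosing the Gaussian kernel is its classical feature-space view: e^{−γ‖a−b‖²} factorizes as e^{−γ‖a‖²} e^{−γ‖b‖²} Σ_{i≥0} (2γ⟨a,b⟩)^i / i!, i.e., a scaled inner product between infinite-dimensional monomial feature maps, which is what Proposition 1 in Section 3.2 formalizes for GEQ. The identity can be spot-checked numerically with a truncated series (our own illustrative snippet, not from the paper):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(5), rng.standard_normal(5)
gamma = 0.5

# Gaussian kernel evaluated directly
lhs = np.exp(-gamma * np.linalg.norm(a - b) ** 2)

# scaling factor times the truncated exponential series in <a, b>
lam = np.exp(-gamma * np.linalg.norm(a) ** 2) * np.exp(-gamma * np.linalg.norm(b) ** 2)
series = sum((2 * gamma * np.dot(a, b)) ** i / math.factorial(i) for i in range(60))
rhs = lam * series

print(abs(lhs - rhs))  # agreement up to truncation/round-off error
```

Each series term (2γ⟨a,b⟩)^i / i! is exactly the inner product of the degree-i monomial feature vectors weighted by (2γ)^i / i!, so the Gaussian kernel is a linear kernel in an infinite-dimensional feature space.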
The model involves a new attentive module induced by its hidden optimization problem and enjoys much better performance on classification tasks, even compared with deep models. Furthermore, we prove that the new model's outputs are equivalent to the outputs of OptEqs with weight-tied infinitely wide mappings. An interesting consequence is that our model can be regarded as a double-infinite model, since the original OptEqs can already be regarded as a weight-tied infinitely deep model. Apart from these findings, the use of Gaussian kernels also gives our proposed model better generalization abilities. Beyond generalization, we also analyze the stability of GEQ and find that it is more stable across various inputs. We summarize our contributions as follows:

- We reformulate OptEqs' hidden optimization problem with Gaussian kernels and propose a new equilibrium model called GEQ. It contains a new attention module induced by its hidden optimization problem and performs better on real-world datasets.
- We show that our GEQ can be regarded as a weight-tied neural network with both infinite width and depth, and demonstrate its better generalization ability through our analysis. Empirical results also confirm the superiority of GEQ.
- We theoretically demonstrate the stability advantages of GEQ over former OptEqs on various inputs, and conduct experiments to validate these advantages.

2 Related Works

2.1 Implicit Models

Most modern deep learning approaches provide explicit computation graphs for forward propagation, and we call such models "explicit models". In contrast, researchers have recently proposed neural architectures with dynamic computation graphs, which we call "implicit models". A notable example is Neural ODEs [7], whose architecture is encoded as a differential system; the implicit ODE solvers they use are equivalent to continuous ResNets taking infinitesimal steps.
By representing the entire structure with differential systems, implicit models open up the black box of traditional neural networks while offering increased flexibility and interpretability. Because of this flexibility and interpretability, the design of implicit models [15, 17] has drawn much attention recently. Many kinds of implicit models have been proposed, including optimization layers [11, 1], differentiable physics engines [40, 9], logical structure learning [50], differential programming [53, 43], and others [25, 49]. Among the various implicit models, OptEqs [52] and its multi-branch version MOptEqs [28] stand out, as they not only exhibit superior performance compared to other implicit models but also explicitly establish the relationship between their structure and a well-defined optimization problem. Therefore, exploring better equilibrium models is a promising direction toward more interpretable neural architectures. However, while OptEqs and its variants have shown promising results, their performance is still not entirely satisfactory, particularly when compared to deep explicit models. Other works [51, 42] also establish connections between their architectures and optimization problems, but their performance is likewise unsatisfying. Besides these general models, many works design equilibrium models from the optimization viewpoint for specific tasks, such as implicit graph models [6, 26], certified robust models [27], and image denoising models [5]. However, these models only work in their specific domains.

2.2 Infinite-Width Models and Kernel Methods in Deep Learning

By employing kernel methods to estimate the outputs of single-layer networks for various samples, researchers discovered that such networks can behave as a Gaussian process (GP) when their parameters are randomly initialized in the large-width limit [35].
Building upon this idea, recent researchers have extended these findings to neural networks with multiple layers [24, 10] and other architectures [37, 13]. These studies primarily focus on weakly-trained models, where the network parameters are randomly initialized and kept fixed throughout training except for the last classification layer [2]. Despite their "weakly-trained" nature, these models still provide valuable insights applicable to current neural networks. For instance, mean-field theory [4, 16, 19] explains phenomena such as gradient vanishing and exploding during back-propagation, which are relevant not only to single-layer networks but also to structures like convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Other researchers explore stationary kernels to enhance the interpretability of neural networks by designing different activation functions [34]. In addition to weakly-trained models, recent studies [22, 2] introduce the Neural Tangent Kernel (NTK) and its variants. These works demonstrate that the sample kernel of infinitely wide networks, with appropriate initialization, converges to a fixed neural tangent kernel when trained by gradient descent with infinitesimal steps (gradient flow). The NTK model is a theoretical construct with strict constraints, and its weights are not learned. It is important to note that although our model can also be seen as an infinitely wide model, there are several key differences between our approach and the aforementioned models. Firstly, our model applies kernel methods to the input and output features, while NTK models apply the kernel method to samples. Secondly, our GEQ can be viewed as employing a "weight-tied infinitely wide" projection that is parameterized by learnable parameters, allowing updates during training.
This contrasts with NTKs and NTK-DEQ [12] (an equilibrium model constructed with vanilla NTK layers), where the weights are fixed and not learned. Therefore, despite the potential overlap in terminology between our paper and NTK-related works, our GEQ differs significantly.

3 Gaussian Kernel Inspired Equilibrium Models

3.1 Formulation and Structure of GEQ

Before starting our analysis, we reformulate the original OptEqs equilibrium equation (1) and hidden optimization problem (2) for convenience. We absorb W into the classifier by replacing W_cW with W_c, and substitute z = W⁻¹z̃. Then the original OptEqs optimization problem can be reformulated as:

$$\min_z G(z; x) = \min_z \; \mathbf{1}^\top f(z) + \frac{1}{2}\|z\|_2^2 + \frac{1}{2}\|Wz\|_2^2 - \langle Ux + b, \, z \rangle. \quad (3)$$

With the new formulation, we can obtain the equilibrium equation for OptEqs with input x by computing Eqn (3)'s first-order stationarity condition ∇G = 0 and reformulating it as:

$$z = \sigma\left(-W^\top W z + Ux + b\right), \quad (4)$$

where σ is the ReLU activation function and U, W, b are learnable parameters trained by optimizing loss functions (such as the cross-entropy loss). From problem (3), one can see that OptEqs' outputs extract features by maximizing the similarity term ⟨Ux + b, z⟩ with the input feature Ux + b through a linear kernel, under the constraints encoded by the regularization terms that prevent trivial outputs. This explanation also extends to other DEQs [3, 51] under symmetric weight constraints. However, as traditional machine learning experience shows, linear kernel functions cannot perform well on complex inputs, and we deem that this term also restricts the performance of equilibrium models. We note that the symmetric constraints do not influence the final performance much, as many works [32, 21] show. A natural question then arises as to whether we can utilize alternative kernel functions to extract input features for the equilibrium state.
However, we find that equilibrium models employing other inner-product kernels, such as the polynomial kernel and the sigmoid kernel, lead to a structure similar to OptEqs under appropriate weight re-parameterization and to similar empirical results. We provide a detailed discussion of these related models in Section 4.6. We therefore choose the Gaussian kernel, and our new hidden optimization problem is formulated as follows:

$$\min_z G(z; x) = \min_z \; \mathbf{1}^\top f(z) + \frac{1}{2}\|z\|_2^2 - \frac{1}{2\gamma} e^{-\gamma \|Ux + b - Wz\|_2^2}, \quad (5)$$

where γ is a hyperparameter denoting the reciprocal of the Gaussian kernel's variance, used for scaling. Calculating ∇G = 0 for the new G, we obtain the Gaussian kernel inspired equilibrium model (GEQ) as the following fixed-point equation:

$$z = \sigma\left[ e^{-\gamma \|Ux + b - Wz\|_2^2} \, W^\top \left(Ux + b - Wz\right) \right]. \quad (6)$$

Compared with linear kernels, Gaussian kernels easily extract nonlinear relations from the input features and show more stable and powerful performance in SVMs and other machine learning methods [21, 44]. We also find that the formulation of GEQ amounts to adding a new attention module to the original equilibrium model. Therefore, GEQ should enjoy more representative ability than the original OptEqs. In the rest of this section, we analyze the theoretical advantages of GEQ over the vanilla OptEqs; we empirically evaluate GEQ's performance in the following sections.

3.2 GEQ Equals OptEqs with Infinite Width

Like other Gaussian-kernel-based models, our GEQ can be regarded as computing similarities by mapping features to an infinite-dimensional space. This allows GEQ to extract input features at the infinite-dimensional level, enabling the capture of nonlinear dependencies in the input space. Essentially, our GEQ can be seen as a specialized version of OptEqs operating in the infinite-dimensional space obtained by mapping the input features x and output embedding z to this expanded domain.

Proposition 1.
The output of our GEQ (Eqn (6)) is the same as the output of a special OptEqs whose hidden optimization problem is defined as follows:

$$\min_z G(z; x) = \min_z \; \mathbf{1}^\top f(z) + \frac{1}{2}\|z\|_2^2 - \lambda \left\langle \Phi_U(x + U^{-1}b), \, \Phi_W(z) \right\rangle, \quad (7)$$

where f is the positive indicator function, whose proximal operator (I + ∂f)⁻¹ is the ReLU activation, λ = e^{−γ‖Ux+b‖₂²} e^{−γ‖Wz‖₂²}, and

$$\Phi_W(z) = \left[1, \; \sqrt{2\gamma}\,\Phi^{(1)}_W(z), \; \ldots, \; \sqrt{(2\gamma)^i/i!}\,\Phi^{(i)}_W(z), \; \ldots\right] \in \mathbb{R}^{\infty}$$

maps its input to an infinite-dimensional space, with Φ^(i)_W : ℝⁿ → ℝ^{nⁱ} collecting all degree-i monomials:

$$\Phi^{(i)}_W(x) = \Big[\underbrace{(Wx)_0(Wx)_0\cdots(Wx)_0}_{i},\ \underbrace{(Wx)_0(Wx)_0\cdots(Wx)_1}_{i},\ \ldots,\ \underbrace{(Wx)_j(Wx)_k\cdots(Wx)_m}_{i},\ \ldots\Big] \in \mathbb{R}^{n^i},$$

where (Wx)_j denotes the j-th element of the vector Wx.

Based on the analysis above, it becomes evident that the hidden optimization problem of our GEQ has the same formulation as a specific OptEqs whose inputs x and outputs z are mapped to an infinite-dimensional space via the weight-tied infinitely wide mappings Φ_U and Φ_W. Given that both GEQ and OptEqs are derived from their respective hidden optimization problems, the equivalence of these problems implies the same equilibrium states for both models. Consequently, our GEQ can be considered an extension of the "infinite-depth" OptEqs to the "infinite-width" domain. Since wider neural networks are generally expected to perform better on classification tasks, we can expect our GEQ to outperform vanilla equilibrium models like OptEqs. We further support this claim with the theoretical analysis in the subsequent sections.

3.3 Generalization Abilities of Our GEQ

Apart from the above empirical intuition, we prove GEQ's generalization advantages over OptEqs using a generalization bound under the PAC-Bayesian framework [36]. For convenience, we use f_geq(x) to denote the equilibrium state z for input x.
Then we use the expected margin loss L_η(f^c_geq) at margin η of our GEQ on the data distribution D for classification, defined as:

$$L_\eta(f^c_{geq}) = \mathbb{P}_{(x,y)\sim \mathcal{D}}\left[f^c_{geq}(x)_y \le \eta + \max_{j\ne y} f^c_{geq}(x)_j\right],$$

where f^c_geq(x) = W_c f_geq(x) + b_c stands for GEQ's final prediction at input x with learnable parameters W_c and b_c, and the indices j, y denote the prediction scores for particular classes. We can then analyze the generalization bound for our GEQ following the settings of former work [38].

Proposition 2. If ‖x‖₂ is bounded by B and µ := max{‖U‖₂, ‖W‖₂, ‖W_c‖₂, ‖b‖₂} < 1, then the following holds for GEQ and OptEqs with ReLU activation. For δ, η > 0, with probability at least 1 − δ over a training set of size M, we have:

$$L_0(f^c_{geq}) \le \hat{L}_\eta(f^c_{geq}) + \sqrt{\frac{16h\ln(24h)\left[\beta_{\max}\mu^4 B + (2\mu\beta_{\max}+1)(1-\beta_{\max}m)\mu B + (1-\beta_{\max}m)^2\right]^2 B_W}{\eta^2 (1-\beta_{\max}m)^4 M} + \frac{\ln(M/\delta)}{M}},$$

$$L_0(f^c_{opteq}) \le \hat{L}_\eta(f^c_{opteq}) + \sqrt{\frac{16h\ln(24h)\left[\mu^3 B + (1-m)\mu B + (1-m)^2\right]^2 B_W}{\eta^2 (1-m)^4 M} + \frac{\ln(M/\delta)}{M}},$$

where L̂_η(f^c_geq) denotes the empirical margin loss on the training set, the maximum scaling number is defined by β_max := max_{x∈D} e^{−γ‖Ux+b−Wz‖₂²}, B_W := ‖WᵀW‖_F² + ‖U‖_F² + ‖b‖₂² + ‖W_c‖_F² + ‖b_c‖₂², and m = ‖WᵀW‖₂ is less than 1 to ensure the convergence of the equilibrium models.

Remark 1. If β_max < 0.8 and µ, m > 0.9, then β_maxµ/(1−β_max m) < 1/(1−m) and (2µβ_max+1)/(1−β_max m) ≤ 1/(1−m); consequently, GEQ's generalization bound is tighter than that of the original OptEqs. In practical experiments, we find the above conditions on β_max, µ, and m satisfied in most cases, and the following experiments also support these theoretical advantages. However, the above bound is not tight, and how to obtain a much tighter bound for equilibrium models still needs exploring.

3.4 GEQ Enjoys More Stable Performance

Apart from better performance, the Gaussian kernel stands out as one of the most extensively employed kernels in machine learning tasks owing to its stability across various input scenarios.
Motivated by this, we investigate whether incorporating Gaussian kernels into our equilibrium models can enhance stability across diverse inputs. First, we estimate the output changes with respect to input perturbations.

Proposition 3. If the norms of the inputs and outputs are bounded by B, and the spectral norms of the weight parameters W, U of equilibrium models with ReLU activation are bounded by µ < 1 to ensure convergence, then:

$$\|f_{geq}(x_1) - f_{geq}(x_2)\|_2 \le L_{geq}\|x_1 - x_2\|_2 = \frac{\beta_{\max}\mu^2 + \gamma B \mu^3}{1 - \beta_{\max}\mu^2 - \gamma B \mu^3}\,\|x_1 - x_2\|_2, \quad (11)$$

$$\|f_{opteq}(x_1) - f_{opteq}(x_2)\|_2 \le L_{opteq}\|x_1 - x_2\|_2 = \frac{\mu}{1 - \mu^2}\,\|x_1 - x_2\|_2, \quad (12)$$

where x₁ and x₂ are input samples, f_geq(x) and f_opteq(x) denote the equilibrium states of GEQ and OptEqs given input x, and β_max := max_{x∈D} e^{−γ‖Ux+b−Wz‖₂²} < 1.

Remark 2. If β_max < 0.8, B < 1, and γ < 0.2, then L_geq < L_opteq. In practical experiments, we choose γ to reach the above condition on β_max, and the condition on the input bound B can be achieved with normalization layers. Although the above bound is not tight, it serves as a rough explanation of GEQ's stability, which is demonstrated in the following experiments; how to derive a much tighter bound for equilibrium models still needs exploring.

Besides having stable outputs under perturbations, a stable model should also show large output differences between different classes to make classification easier. However, the Lipschitz bound above cannot constrain output similarity when samples are far apart, so we need a new metric for analysis. In line with previous works [18, 34, 10], we assume all weight parameters go to infinite dimensions and analyze the expected output similarity κ of a model f for inputs x₁ and x₂, defined as:

$$\kappa(x_1, x_2) = \mathbb{E}\left[f(x_1)^\top f(x_2)\right] = \int f_u(x_1)^\top f_u(x_2)\, p(u)\, du, \quad (13)$$

where p(u) is the distribution of the vectorization of the weight U.
If κ is smaller for samples x₁ and x₂ that belong to different classes (and are therefore far apart), then the classifier can more easily assign them different labels: the classification margin will be larger, making difficult samples easier to classify. The upper bounds of κ for GEQ and OptEqs are listed below.

Proposition 4. Suppose the norms of the inputs and outputs are bounded by B, the spectral norm of the weight parameter W of equilibrium models with ReLU activation is bounded by µ < 1 to ensure convergence, and each row of U obeys a spherical Gaussian distribution N(0, E[U_i²]I). Then the expected output similarities of GEQ and OptEqs with respect to inputs x₁, x₂ satisfy:

$$\kappa_{geq}(x_1,x_2) \le \bar\kappa_{geq} = \frac{\mu^2 D\, e^{-\frac{\gamma}{4}\sigma_{\min}(U)^2\|x_1-x_2\|_2^2}\; \mathbb{E}[U_i^2]\; \|x_1\|_2\|x_2\|_2\,\left(\sin\theta_0 + (\pi-\theta_0)\cos\theta_0\right)}{2\pi\left(1-\beta_{\max}\mu^2\right)^2}, \quad (14)$$

$$\kappa_{opteq}(x_1,x_2) \le \bar\kappa_{opteq} = \frac{\mathbb{E}[U_i^2]\; \|x_1\|_2\|x_2\|_2\,\left(\sin\theta_0 + (\pi-\theta_0)\cos\theta_0\right)}{2\pi\left(1-\mu^2\right)^2}, \quad (15)$$

where x₁ and x₂ are input samples, D = e^{γB‖W‖₂²}, β_max := max_{x∈D} e^{−γ‖Ux+b−Wz‖₂²}, σ_min(U) is the minimal singular value of U, and θ₀ = cos⁻¹(⟨x₁, x₂⟩ / (‖x₁‖₂‖x₂‖₂)) is the angle between the samples.

Remark 3. If ‖x₁ − x₂‖₂ ≥ 2√(log D / γ) / σ_min(U), then κ̄_geq ≤ κ̄_opteq.

Based on the aforementioned analysis, it is evident that our GEQ exhibits smaller output similarity for dissimilar samples. As a result, the predictions made by GEQ are driven primarily by the most similar samples, enabling it to successfully classify challenging instances. This claim is further supported by the results of our carefully designed experiments.

3.5 Patch Splitting in GEQ

Since different parts of an image have different impacts on image classification, computing a single whole-image similarity with the Gaussian kernel is not enough for GEQ.
Inspired by former works [47], we also split the feature map Ux into patches, and then our optimization problem becomes:

$$\min_z G(z; x) = \min_z \; \mathbf{1}^\top f(z) + \frac{1}{2}\|z\|_2^2 - \sum_i \frac{1}{2\gamma}\, e^{-\gamma \left\|(\bar{x}_i - \bar{z}_i) W_h\right\|_2^2},$$

where x̄_i ∈ ℝ^{c_s p²} is the i-th patch of Ux + b, z̄_i ∈ ℝ^{c_s p²} is the i-th patch of Wz, and W_h ∈ ℝ^{c_s p² × c_hid} is a linear layer that projects patches of different sizes to a constant dimension. Here, c_s denotes the channel-splitting number, p the patch size, and c_hid the hidden dimension of patches after projection. We note that patch splitting is a unique feature of GEQ, as incorporating this technique makes no difference in OptEqs due to its linear kernel.

Figure 1: The sketch map of one GEQ layer's n-th fixed-point iteration (with channel multiplication). x is the input, z^(n−1) and z^(n) are the outputs of the (n−1)-th and n-th iterations, and m is a middle state.

Figure 1 provides a sketch of GEQ's n-th fixed-point iteration. From the figure, it is evident that our GEQ can be viewed as a special OptEqs with an additional attention mechanism that captures the most important regions; thereby, it achieves enhanced performance. For a more detailed description of the forward procedure of GEQ, please refer to Appendix A.1.

4 Empirical Results

4.1 Experiment Settings

In our experiments, we employ parallel GEQs with different input scales, as in MOptEqs, and average each branch's output after average pooling or nearest up-sampling to fuse the branches. We use weight normalization to ensure convergence, as in MOptEqs and MDEQ, and set γ to 0.2/M, where M is the minimum ‖(x̄_i − z̄_i)W_h‖₂² among all patches. For the equilibrium computation, we use the Anderson algorithm in the forward procedure, similar to other implicit models [28], and apply Phantom gradients [14] for back-propagation. All models are trained using SGD with a step learning-rate schedule. We implement our experiments on the PyTorch platform [39] using RTX-3090 GPUs. Further details can be found in Appendix A.6.
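As noted above, the forward pass solves the fixed point with the Anderson algorithm. The following is a minimal, illustrative type-II Anderson solver applied to a toy single-branch GEQ update (Eqn (6)); all sizes, weight scales, and the `anderson` helper are our own assumptions for the sketch, and the paper's actual implementation (multi-branch, weight-normalized, with Phantom gradients) is more involved:

```python
import numpy as np

def anderson(g, x0, m=5, max_iter=200, tol=1e-9, lam=1e-12):
    """Type-II Anderson acceleration for the fixed point x = g(x).

    Keeps the last m residuals f_k = g(x_k) - x_k, finds the affine
    combination of past iterates minimizing the combined residual,
    and uses it as the next iterate."""
    xs, gs = [x0], [g(x0)]
    for _ in range(max_iter):
        fs = [gk - xk for gk, xk in zip(gs, xs)]
        if np.linalg.norm(fs[-1]) < tol:
            break
        mk = min(m, len(fs))
        F = np.stack(fs[-mk:], axis=1)              # (n, mk) residual matrix
        # minimize ||F a||_2 subject to sum(a) = 1 (regularized normal equations)
        M = F.T @ F + lam * np.eye(mk)
        alpha = np.linalg.solve(M, np.ones(mk))
        alpha /= alpha.sum()
        x_next = sum(a * gi for a, gi in zip(alpha, gs[-mk:]))
        xs.append(x_next)
        gs.append(g(x_next))
    return xs[-1]

# toy single-branch GEQ update, Eqn (6), with small weights for contraction
rng = np.random.default_rng(0)
n, d, gamma = 16, 8, 0.05
W = 0.3 * rng.standard_normal((n, n)) / np.sqrt(n)
U = 0.3 * rng.standard_normal((n, d)) / np.sqrt(d)
b = 0.1 * rng.standard_normal(n)
x = rng.standard_normal(d)
c = U @ x + b

def g(z):
    v = c - W @ z
    scale = np.exp(-gamma * np.dot(v, v))           # Gaussian "attention" factor
    return np.maximum(scale * (W.T @ v), 0.0)

z_star = anderson(g, np.zeros(n))
print(np.linalg.norm(z_star - g(z_star)))           # residual of the fixed point
```

The history length `m` trades memory for convergence speed; near the solution, the least-squares step extrapolates through the last few iterates, which is why Anderson typically needs far fewer evaluations of the GEQ update than plain iteration.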
To compare the performance of our GEQ, we use MOptEqs and MDEQ as benchmark implicit models, which have demonstrated superior performance over OptEqs on image classification tasks. Additionally, we use ResNet-18 and ResNet-50 as benchmark explicit models for comparison.

4.2 Results for Image Classification

First, we run experiments on CIFAR-10 and CIFAR-100, widely used datasets for classification of small images. In these experiments, we parallel six GEQ branches with input scales 32, 16, 8, 8, 4, 4; the MOptEqs architecture setting is the same. Details can be found in the Appendix. For comparison, we also run the same training procedure for MDEQ, MOptEqs, and ResNet. The results are listed in Table 1. From the results, one can see that our GEQ enjoys clear advantages on the CIFAR datasets, which demonstrates the powerful generalization ability of our model. Besides small datasets, we also conduct experiments on large-scale image datasets, as presented in Table 2. The results clearly demonstrate the consistent superiority of our GEQ over the other models. Particularly noteworthy, GEQ achieves a 2% improvement on ImageNet-100 over the deep model ResNet-50 while consuming approximately half the number of parameters, which emphasizes the effectiveness and efficiency of GEQ on large-scale inputs.

(a) CIFAR-10
Model       Size  Accuracy
ResNet-18   10M   93.5 ± 0.2%
ResNet-50   23M   95.2 ± 0.2%
MDEQ        10M   94.2 ± 0.3%
MOptEqs      8M   94.6 ± 0.2%
GEQ          5M   94.8 ± 0.1%
GEQ          8M   95.6 ± 0.2%

(b) CIFAR-100
Model       Size  Accuracy
ResNet-18   10M   74.5 ± 0.2%
ResNet-50   23M   77.9 ± 0.1%
MDEQ        10M   74.7 ± 0.3%
MOptEqs      8M   75.6 ± 0.2%
GEQ          5M   76.4 ± 0.3%
GEQ          8M   78.2 ± 0.2%

Table 1: Empirical results for image classification on CIFAR-10 and CIFAR-100.

(a) ImageNette
Model       Size  Accuracy
ResNet-18   11M   92.3 ± 0.1%
ResNet-50   23M   93.0 ± 0.2%
MDEQ        10M   91.5 ± 0.2%
MOptEqs     10M   92.4 ± 0.2%
GEQ          6M   92.9 ± 0.2%
GEQ         13M   93.2 ± 0.1%
(b) ImageNet-100
Model       Size  Accuracy
ResNet-18   11M   80.9 ± 0.3%
ResNet-50   23M   81.7 ± 0.2%
MDEQ        10M   81.3 ± 0.2%
MOptEqs     13M   81.5 ± 0.4%
GEQ          6M   82.2 ± 0.2%
GEQ         13M   83.9 ± 0.3%

Table 2: Empirical results for image classification on ImageNette and ImageNet-100.

4.3 Validations on the Models' Stability

Evaluation on unseen difficult samples. To assess the stability of our GEQ model on difficult examples, we conducted experiments on CIFAR-100 super-class classification. CIFAR-100 consists of 20 super-classes, each containing five sub-classes². We trained our GEQ and MOptEqs models to predict the super-classes, using only the first four sub-classes of each super-class for training. We evaluated the models on both the test set covering those first four sub-classes ("Known Accuracy") and a separate set of samples from the unseen sub-classes ("Unknown Accuracy"). Classifying the unseen samples is more difficult, as they differ from the training set. The results for GEQ and MOptEqs are presented in Table 3.

Model     Known Accuracy  Unknown Accuracy
MOptEqs   80.1 ± 0.3%     77.4 ± 0.5%
GEQ       80.9 ± 0.2%     80.1 ± 0.6%

Table 3: Empirical results on CIFAR-100's super-class classification.

The table clearly demonstrates that our GEQ model surpasses MOptEqs on this challenging task, which demonstrates GEQ's stability. The advantage can be attributed to the fact that GEQ exhibits smaller output similarities than OptEqs when input samples are far apart (e.g., samples from different classes). This characteristic leads to larger margins between different classes, so the classifier is easier to optimize during training. Consequently, our GEQ model excels at accurately classifying difficult unseen samples, further highlighting its stability and superiority over former equilibrium models.

Ablation studies on corrupted datasets.
Apart from difficult samples, we compare the robustness of our GEQ, MOptEqs, and ResNet on the CIFAR-10 corruption dataset, which applies 19 common corruptions, including image transformations, image blurring, weather effects, and digital noise, to CIFAR's test set. Average results over the 5 severity levels of the CIFAR-10 corruption dataset are shown in Figure 2. From the results, one can see that our Gaussian-kernel-based GEQ is more robust than MOptEqs and ResNet; in particular, GEQ performs better under structured noise and some image transformations. These results further demonstrate the stability of the GEQ structure.

² For example, the super-class "people" contains five sub-classes: "baby", "boy", "girl", "man", and "woman".

Figure 2: The results for different models under different corruptions (accuracy per corruption type: contrast, elastic transform, JPEG compression, pixelate, brightness, fog, frost, saturate, snow, defocus blur, Gaussian blur, motion blur, zoom blur, Gaussian noise, impulse noise, shot noise, spatter, speckle noise).

4.4 Ablation Studies on Saliency Map

The saliency maps generated by Grad-CAM [45] offer valuable insights into the visual attention of both the MOptEqs and GEQ models. These maps highlight the regions of the image that are crucial for each model's predictions. Figure 3 presents the saliency maps obtained for an image from the ImageNet dataset using both models. Upon observation, it becomes evident that GEQ exhibits a higher degree of focus on the significant regions directly associated with the predicted label "manta", whereas MOptEqs tends to allocate attention to unrelated regions such as the shells. This discrepancy indicates that the attention-like module induced by the Gaussian kernel in GEQ concentrates the model's attention, resulting in improved performance compared to MOptEqs.

(a) Original image. (b) MOptEqs's saliency map. (c) GEQ's saliency map.
Figure 3: Saliency maps for GEQ and MOptEqs on the ImageNet image.

4.5 Ablation Studies on Patch Splitting

The performance of our GEQ model is influenced by the channel-splitting parameter and the patch size. Choosing large values for these parameters makes the kernel focus mainly on global information, while selecting small values makes it concentrate on local features. To understand the impact of these choices on model performance, we conducted experiments whose results are presented in Figure 4, which illustrates the relationship between the channel-splitting parameter, the patch size, and performance. The accuracy first increases and then decreases as the channel split and patch size grow. Based on these empirical results, we select a patch size of 2 and a channel split of 8 for both the CIFAR and ImageNet experiments. These parameter choices optimize the performance of our models on the respective datasets.

Figure 4: The influence of the patch size and the channel-splitting parameter for our GEQ on the CIFAR-100 dataset. (a) Influence of the channel-splitting parameter (4, 8, 16, 32, 64). (b) Influence of the patch-size parameter (1, 2, 4, 8).

4.6 Comparison with Other Kernel Functions

First, we introduce different commonly used kernels, such as the polynomial, sigmoid, and Gaussian kernels, to reformulate the hidden optimization problem for equilibrium models. Table 4 lists the equilibrium models induced by these kernels. It can be observed that equilibrium models with polynomial and sigmoid kernels also incorporate new attentive modules. However, their attentive kernels only constrain the input features Ux + b and do not directly affect the activation of the output z.
As a result, their performance may be inferior to our GEQ model. To validate these claims, we evaluate the performance of the different models on the CIFAR-100 dataset. Since the fusion module is the primary difference between MOptEqs and OptEqs, we can easily modify the structure of MOptEqs to include the different kernel-induced attentive modules using the equations in Table 4, resulting in MOptEqs (Polynomial) and MOptEqs (Sigmoid). The results are presented in Table 5, which clearly demonstrates the superior performance of our GEQ model. For a more in-depth analysis of GEQ, we refer readers to the main paper.

Table 4: The hidden optimization problems and their related equilibrium models. d > 1 is an integer denoting the polynomial order.

- Linear: min_z 1^T f(z) + (1/2)||z||_2^2 - <Ux + b, z>, with equilibrium model z* = σ(W^T W z* + Ux + b).
- Polynomial: min_z 1^T f(z) + (1/2)||z||_2^2 - <Ux + b, z>^d, with equilibrium model z* = σ(W^T W z* + d <Ux + b, z*>^{d-1} (Ux + b)).
- Sigmoid: min_z 1^T f(z) + (1/2)||z||_2^2 - tanh(<Ux + b, z>), with equilibrium model z* = σ(W^T W z* + (1 - tanh^2(<Ux + b, z*>)) (Ux + b)).
- Gaussian: min_z 1^T f(z) + (1/2)||z||_2^2 - (1/(2γ)) e^{-γ||Ux + b - Wz||_2^2}, with equilibrium model z* = σ(e^{-γ||Ux + b - Wz*||_2^2} W^T(-Wz* + Ux + b)).

Table 5: Comparison of equilibrium models with different kernel functions on CIFAR-100.

Model | Size | Accuracy
MOptEqs | 8M | 75.6 ± 0.2%
MOptEqs (Polynomial) | 8M | 75.1 ± 0.4%
MOptEqs (Sigmoid) | 8M | 76.1 ± 0.3%
GEQ | 8M | 78.2 ± 0.2%

5 Conclusions

In this paper, we introduce a novel optimization-induced equilibrium model called GEQ, which utilizes Gaussian kernels in its optimization-induced framework. Our model incorporates a new attentive module that arises from its novel hidden optimization problem formulation. Notably, GEQ achieves significantly improved performance on classification tasks, outperforming deep models as well. Moreover, GEQ can be interpreted as a weight-tied model with infinite width and depth, highlighting its expressive power.
We also provide theoretical analysis demonstrating the superiority of our models in terms of generalization ability and stability compared to the previous OptEqs. Empirical results further validate the effectiveness of our proposed approach.

Acknowledgments

Zhouchen Lin was supported by the National Key R&D Program of China (2022ZD0160302), the major key project of PCL, China (No. PCL2021A12), the NSF China (No. 62276004), and Qualcomm. Yisen Wang was supported by the National Natural Science Foundation of China (62006153, 62376010, 92370129), Open Research Projects of Zhejiang Lab (No. 2022RC0AB05), and the Beijing Nova Program (20230484344).

References

[1] Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In ICML, 2017.
[2] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Russ R. Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In NeurIPS, 2019.
[3] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In NeurIPS, 2019.
[4] Minmin Chen, Jeffrey Pennington, and Samuel Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. In ICML, 2018.
[5] Qi Chen, Yifei Wang, Zhengyang Geng, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Equilibrium image denoising with implicit differentiation. IEEE Transactions on Image Processing, 32:1868–1881, 2023.
[6] Qi Chen, Yifei Wang, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. Optimization-induced graph implicit nonlinear diffusion. In ICML, 2022.
[7] Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In NeurIPS, 2018.
[8] K. R. Chowdhary. Natural language processing. Fundamentals of Artificial Intelligence, pages 603–649, 2020.
[9] Filipe de Avila Belbute-Peres, Kevin Smith, Kelsey Allen, Josh Tenenbaum, and J. Zico Kolter. End-to-end differentiable physics for learning and control. In NeurIPS, 2018.
[10] Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In ICLR, 2018.
[11] Josip Djolonga and Andreas Krause. Differentiable learning of submodular models. In NeurIPS, 2017.
[12] Zhili Feng and J. Zico Kolter. On the neural tangent kernel of equilibrium models, 2021.
[13] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. In ICLR, 2019.
[14] Zhengyang Geng, Xin-Yu Zhang, Shaojie Bai, Yisen Wang, and Zhouchen Lin. On training implicit models. In NeurIPS, 2021.
[15] Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, and Armin Askari. Implicit deep learning. arXiv preprint arXiv:1908.06315, 2019.
[16] Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, and Jeffrey Pennington. Dynamical isometry and a mean field theory of LSTMs and GRUs. arXiv preprint arXiv:1901.08987, 2019.
[17] Stephen Gould, Richard Hartley, and Dylan Campbell. Deep declarative networks: A new hope. arXiv preprint arXiv:1909.04866, 2019.
[18] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, 2017.
[19] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. Mean-field behaviour of neural tangent kernel for deep neural networks. arXiv preprint arXiv:1905.13654, 2019.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Shell Xu Hu, Sergey Zagoruyko, and Nikos Komodakis. Exploring weight symmetry in deep neural networks. Computer Vision and Image Understanding, 187:102786, 2019.
[22] Arthur Jacot, Franck Gabriel, and Clément Hongler.
Freeze and chaos for DNNs: An NTK view of batch normalization, checkerboard and boundary effects. arXiv preprint arXiv:1907.05715, 2019.
[23] Nikhil Ketkar. Stochastic gradient descent. In Deep Learning with Python: A Hands-on Introduction, pages 113–132, 2017.
[24] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
[25] Mingjie Li, Lingshen He, and Zhouchen Lin. Implicit Euler skip connections: Enhancing adversarial robustness via numerical stability. In ICML, 2020.
[26] Mingjie Li, Yifei Wang, Yisen Wang, and Zhouchen Lin. Unbiased stochastic proximal solver for graph neural networks with equilibrium states. In ICLR, 2023.
[27] Mingjie Li, Yisen Wang, and Zhouchen Lin. CerDEQ: Certifiable deep equilibrium model. In ICML, 2022.
[28] Mingjie Li, Yisen Wang, Xingyu Xie, and Zhouchen Lin. Optimization inspired multi-branch equilibrium models. In ICLR, 2021.
[29] Guangcan Liu, Shiyu Chang, and Yi Ma. Blind image deblurring using spectral properties of convolution operators. IEEE Transactions on Image Processing, 23(12):5047–5056, 2014.
[30] Guangcan Liu and Ping Li. Low-rank matrix completion in the presence of high coherence. IEEE Transactions on Signal Processing, 64(21):5623–5633, 2016.
[31] Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. arXiv preprint arXiv:1806.09055, 2018.
[32] Juncheng Liu, Kenji Kawaguchi, Bryan Hooi, Yiwei Wang, and Xiaokui Xiao. EIGNN: Efficient infinite-depth graph neural networks. In NeurIPS, 2021.
[33] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse coding. In ICML, 2009.
[34] Lassi Meronen, Christabella Irwanto, and Arno Solin. Stationary activations for uncertainty calibration in deep learning. In NeurIPS, 2020.
[35] Radford M. Neal. Priors for infinite networks.
In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
[36] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
[37] Roman Novak, Lechao Xiao, Yasaman Bahri, Jaehoon Lee, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. In ICLR, 2019.
[38] Chirag Pabbaraju, Ezra Winston, and J. Zico Kolter. Estimating Lipschitz constants of monotone deep equilibrium models. In ICLR, 2021.
[39] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
[40] Yi-Ling Qiao, Junbang Liang, Vladlen Koltun, and Ming Lin. Scalable differentiable physics for learning and control. In ICML, 2020.
[41] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[42] Max Revay, Ruigang Wang, and Ian R. Manchester. Lipschitz bounded equilibrium networks. arXiv preprint arXiv:2010.01732, 2020.
[43] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
[44] Bernhard Schölkopf, Kah-Kay Sung, Christopher J. C. Burges, Federico Girosi, Partha Niyogi, Tomaso Poggio, and Vladimir Vapnik. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45(11):2758–2765, 1997.
[45] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization.
In ICCV, 2017.
[46] Ivana Tošić and Pascal Frossard. Dictionary learning. IEEE Signal Processing Magazine, 28(2):27–38, 2011.
[47] Asher Trockman and J. Zico Kolter. Patches are all you need? arXiv preprint arXiv:2201.09792, 2022.
[48] Russell Tsuchida, Fred Roosta, and Marcus Gallagher. Invariance of weight distributions in rectified MLPs. In ICML, 2018.
[49] Russell Tsuchida, Suk Yee Yong, Mohammad Ali Armin, Lars Petersson, and Cheng Soon Ong. Declarative nets that are equilibrium models. In ICLR, 2022.
[50] Po-Wei Wang, Priya Donti, Bryan Wilder, and Zico Kolter. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In ICML, 2019.
[51] Ezra Winston and J. Zico Kolter. Monotone operator equilibrium networks. In NeurIPS, 2020.
[52] Xingyu Xie, Qiuhao Wang, Zenan Ling, Xia Li, Guangcan Liu, and Zhouchen Lin. Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3604–3616, 2022.
[53] Xingyu Xie, Jianlong Wu, Guangcan Liu, Zhisheng Zhong, and Zhouchen Lin. Differentiable linearized ADMM. In ICML, 2019.
[54] Chong You, Daniel Robinson, and René Vidal. Scalable sparse subspace clustering by orthogonal matching pursuit. In CVPR, 2016.
[55] Hongyang Zhang, Zhouchen Lin, Chao Zhang, and Edward Y. Chang. Exact recoverability of robust PCA via outlier pursuit with tight recovery bounds. In AAAI, 2015.
[56] Xiao Zhang, Lingxiao Wang, Yaodong Yu, and Quanquan Gu. A primal-dual analysis of global optimality in nonconvex low-rank matrix recovery. In ICML, 2018.
[57] Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.

A.1 Forward Procedure for GEQ

The pseudo-code for our GEQ is listed in Algorithm 1.

Algorithm 1: Calculating one-layer GEQ.
Require: initial state z(0); weight parameters W, U, b; input features Ux in R^{chw}; channel split c_s; patch size p; hidden layer W_h in R^{c_s p^2 x 32}.
Ensure: the equilibrium output z*.

def g(z(i); x, U, W, b):
    rearrange Wz(i) -> z~ and Ux + b -> x~, both in R^{(chw / (c_s p^2)) x (c_s p^2)}
    m = diag(e^{-γ||(x~ - z~) W_h||_2^2}) (x~ - z~) W_h W_h^T
    rearrange m back to R^{c x h x w}
    return z(i+1) = σ(W^T m)

with torch.no_grad():
    use the Anderson acceleration algorithm to solve z* = g(z*; x, U, W, b)
# calculate the gradient via the phantom gradient:
for i in range(5):
    z* = 0.2 z* + 0.8 g(z*; x, U, W, b)
return z*

A.2 Proofs for Proposition 1

Proposition 1. The output of GEQ with Gaussian kernels (Eqn (6)) coincides with the output of an optimization-induced equilibrium model whose output is the optimal solution of the hidden optimization problem:

min_z G(z; x) = 1^T f(z) + (1/2)||z||_2^2 - λ <Φ_U(x + U^{-1}b), Φ_W(z)>,   (17)

where Φ_W(z) = (1, sqrt(2γ) Φ^(1)_W(z), ..., sqrt((2γ)^i / i!) Φ^(i)_W(z), ...) projects the features into an infinite-dimensional space, and Φ^(i)_W : R^n -> R^{n^i} is the i-tuple permutation with repetitions, formulated as:

( (Wx)_1 (Wx)_1 ... (Wx)_1, (Wx)_1 (Wx)_1 ... (Wx)_2, ..., (Wx)_j (Wx)_k ... (Wx)_m, ... ),

where each entry is a product of i (possibly repeated) coordinates, and (Wx)_j denotes the j-th element of the vector Wx. The Gaussian kernel can then be regarded as computing the inner product of the input features after the weight-tied, infinitely wide projections Φ_U and Φ_W.

Proof. We can formulate the OptEqs hidden optimization problem with Gaussian kernels as below:

min_z G(z; x) = 1^T f(z) + (1/2)||z||_2^2 - (1/(2γ)) e^{-γ||Ux + b - Wz||_2^2}.   (19)

For e^{-γ||Ux + b - Wz||_2^2}, we have:

e^{-γ||Ux + b - Wz||_2^2} = e^{-γ||Ux + b||_2^2 - γ||Wz||_2^2 + 2γ<Ux + b, Wz>} = e^{-γ||Ux + b||_2^2} e^{-γ||Wz||_2^2} e^{2γ<Ux + b, Wz>}.   (20)

Letting λ = e^{-γ||Ux + b||_2^2} e^{-γ||Wz||_2^2} and applying the Taylor expansion to e^{2γ<Ux + b, Wz>}, we have:

e^{-γ||Ux + b - Wz||_2^2} = λ Σ_{i=0}^∞ <sqrt(2γ)(Ux + b), sqrt(2γ)(Wz)>^i / i!.

For any i, we have from permutation theory:

<sqrt(2γ)(Ux + b), sqrt(2γ)(Wz)>^i / i! = ((2γ)^i / i!) <Φ^(i)_U(x + U^{-1}b), Φ^(i)_W(z)>,   (22)

where Φ^(i)_W(x) is the i-tuple permutation with repetition for the given entries (Wx)_1, (Wx)_2, ..., (Wx)_n; each element of Φ^(i)_W(x) is one possible permutation.
Since there are n^i such tuples, Φ^(i)_W projects the input features into R^{n^i} as follows:

( (Wx)_1 (Wx)_1 ... (Wx)_1, (Wx)_1 (Wx)_1 ... (Wx)_2, ..., (Wx)_j (Wx)_k ... (Wx)_m, ... ).

Thereby, the hidden optimization problem for our GEQ can be reformulated as:

min_z G(z; x) = 1^T f(z) + (1/2)||z||_2^2 - λ <Φ_U(x + U^{-1}b), Φ_W(z)>,   (24)

and the conclusion follows.

A.3 Proofs for Proposition 2

Proposition 2. If ||x||_2 is bounded by B and µ := max{||U||_2, ||W||_2, ||W_c||_2, ||b||_2} < 1, then the following holds for GEQ and OptEqs with ReLU activations. For any δ, η > 0, with probability at least 1 - δ over a training set of size M, we have:

L_0(f^c_geq) ≤ L̂_η(f^c_geq) + O( sqrt( (16h ln(24h) [β_max µ^4 B + (2µβ_max + 1)(1 - β_max m)µB + (1 - β_max m)^2]^2 B_W / (η^2 (1 - β_max m)^4) + ln(M/δ)) / M ) ),

L_0(f^c_opteq) ≤ L̂_η(f^c_opteq) + O( sqrt( (16h ln(24h) [µ^3 B + (1 - m)µB + (1 - m)^2]^2 B_W / (η^2 (1 - m)^4) + ln(M/δ)) / M ) ),

where L̂_η(f^c_geq) denotes the empirical margin loss on the training set, the maximum scaling number is defined by β_max := max_{x∈D} e^{-γ||Ux + b - Wz*||_2^2}, B_W := ||W^T W||_F^2 + ||U||_F^2 + ||b||_2^2 + ||W_c||_F^2 + ||b_c||_2^2, and m := ||W^T W||_2 is less than 1 to ensure the convergence of the equilibrium models.

Before proving our results, we need a lemma from former work [38] on the perturbation bound for GEQs and the reformulated OptEqs.

Lemma 1. Let ||W||_2 ≤ m and ||W~||_2 ≤ m. Then the change in the output of the DEQ z* = σ(Wz* + Ux + b) upon perturbing the weights and biases from W, U, b to W~, U~, b~ is bounded as follows:

||z~* - z*||_2 ≤ ||W~ - W||_2 ||Ux + b||_2 / (1 - m)^2 + (||(U~ - U)x||_2 + ||b~ - b||_2) / (1 - m).   (26)

Like former works [38], we also introduce another lemma [36] here:

Lemma 2. Let f_w be any predictor with parameters w, and let P denote any distribution on the parameters that is independent of the training data.
Then, for any δ, η > 0, with probability 1 - δ over the training data of size M, for any w and any random perturbation u such that P_u(max_x ||f_{w+u}(x) - f_w(x)||_∞ < η/4) ≥ 1/2, we have:

L_0(f_w) ≤ L̂_η(f_w) + 4 sqrt( (KL(w + u || P) + ln(6M/δ)) / (M - 1) ).   (27)

We can then derive the perturbation bounds for the reformulated OptEqs and our GEQ following the former settings [38]. First, we incorporate a fully connected layer at the end, as mentioned in the paper:

f^c_geq(x) = W_c f_geq(x) + b_c,  f^c_opteq(x) = W_c f_opteq(x) + b_c.   (28)

Since the entries of the perturbations obey the distribution N(0, σ^2), all the weight perturbations are bounded by σ sqrt(2h ln(24h)) := ω with probability larger than 1/2. Since the only difference between OptEqs and monDEQ [51] is the weight parameterization, our reformulated OptEqs is parameterized by W_s = W^T W, while monDEQ's weight parameter is parameterized by a series of weights. Therefore, the perturbation of W_s differs from the former analysis [38]. We have:

||W~_s - W_s||_2 = ||(W~ - W)^T W + W~^T (W~ - W)||_2 ≤ 2ωµ.   (29)

Then, using the above lemma, we have for all x, with probability at least 1/2:

||f~^c_opteq(x) - f^c_opteq(x)||_2 = ||W~_c f~_opteq(x) + b~_c - W_c f_opteq(x) - b_c||_2 ≤ 2µ^2 ω(B + 1) / (1 - m)^2 + 2µω(B + 1) / (1 - m) + ω.   (30)

Setting σ = η(1 - m)^2 / ( sqrt(2h ln(24h)) (2µ^3(B + 1) + 2(1 - m)µ(B + 1) + (1 - m)^2) ) makes the above perturbation less than η/4. Then we have:

KL(w + u || P) ≤ B_W / (2σ^2) = 16h ln(24h) (2µ^3(B + 1) + 2(1 - m)µ(B + 1) + (1 - m)^2)^2 / (η^2 (1 - m)^4) · B_W.   (31)

With the same choice of the bound for β as in former work [38],

2(B + 1)/sqrt(M) ≤ β ≤ η(1 - m) sqrt(M) / (2(B + 1)),   (32)

we can finally obtain the upper bound for OptEqs as stated in the proposition.

The difference for GEQ is that its update can be viewed as multiplying by the scalar β(x) = e^{-γ||Ux + b - Wz*||_2^2}, which depends on x since z* also depends on x. Set β_max = max_{x∈D} β(x) < 1 and β_min = min_{x∈D} β(x) > c.
Assuming β changes little under small perturbations of the weights, we have:

z~* = σ(β W~_s z~* + β W~^T (U~x + b~)),
||z~* - z*||_2 ≤ β_max µ ||W~_s - W_s||_2 ||Ux + b||_2 / (1 - β_max m)^2 + β_max (||(W~^T U~ - W^T U)x||_2 + ||W~^T b~ - W^T b||_2) / (1 - β_max m).   (33)

With the same setting as for OptEqs above, we have:

||(W~^T U~ - W^T U)x||_2 ≤ 2ωµ ||x||_2,  ||W~^T b~ - W^T b||_2 ≤ 2ωµ.   (34)

Then we can obtain:

||f~^c_geq(x) - f^c_geq(x)||_2 ≤ 2β_max µ^3 ω(B + 1) / (1 - β_max m)^2 + 2µ^2 ωβ_max(B + 1) / (1 - β_max m) + µω(B + 1) / (1 - β_max m) + ω
= (β_max µ)(2µ^2 ω(B + 1)) / (1 - β_max m)^2 + (2µβ_max + 1)(µω(B + 1)) / (1 - β_max m) + ω.   (35)

With the same setting as above, we have:

KL(w + u || P) ≤ B_W / (2σ^2) = T B_W,   (36)

where T is defined as follows:

T = 16h ln(24h) (2β_max µ^4 (B + 1) + 2(2µβ_max + 1)(1 - β_max m)µ(B + 1) + (1 - β_max m)^2)^2 / (η^2 (1 - β_max m)^4),   (37)

and we can finally obtain the upper bound for GEQ as stated in the proposition.

A.4 Proofs for Proposition 3

The Lipschitz constant is the minimal constant L such that, for all x and y, f satisfies:

||f(x) - f(y)||_2 ≤ L ||x - y||_2.   (38)

Thereby, our analysis in Proposition 3 can be viewed as an upper bound on this constant for the different models. In this section, we prove the Lipschitz upper bounds for our GEQ and OptEqs. First, we restate the proposition as follows:

Proposition 3. If the norms of the inputs and outputs are bounded by B, and the spectral norms of the weight parameters W, U of the equilibrium models with ReLU activation are bounded by µ < 1 to ensure convergence, then:

||f_geq(x1) - f_geq(x2)||_2 ≤ L_geq ||x1 - x2||_2 = (β_max µ^2 + γBµ^3) / (1 - β_max µ^2 - γBµ^3) ||x1 - x2||_2,   (39)

||f_opteq(x1) - f_opteq(x2)||_2 ≤ L_opteq ||x1 - x2||_2 = µ / (1 - µ^2) ||x1 - x2||_2,   (40)

where x1 and x2 are input samples, f_geq(x) and f_opteq(x) denote the equilibrium states of GEQ and OptEqs given input x, and β_max := max_{x∈D} e^{-γ||Ux - Wz*||_2^2} < 1.

Proof.
We first prove the inequality for OptEqs:

||f_opteq(x1) - f_opteq(x2)||_2 = ||z*_{x1} - z*_{x2}||_2 ≤ ||σ(W^T W z*_{x1} + Ux1) - σ(W^T W z*_{x2} + Ux2)||_2
≤ ||W^T W (z*_{x1} - z*_{x2})||_2 + ||U(x1 - x2)||_2 ≤ µ^2 ||z*_{x1} - z*_{x2}||_2 + µ ||x1 - x2||_2,   (41)

where z*_{x1} denotes the equilibrium state of OptEqs with input x1, which means the following equation is satisfied:

z*_{x1} = σ(W^T W z*_{x1} + Ux1).   (42)

Rearranging, we get:

||f_opteq(x1) - f_opteq(x2)||_2 ≤ µ / (1 - µ^2) ||x1 - x2||_2.   (43)

Then, for GEQ, we have:

||z*_{x1} - z*_{x2}||_2 ≤ ||σ(e^{-γ||Wz*_{x1} - Ux1||_2^2}(W^T W z*_{x1} + W^T Ux1)) - σ(e^{-γ||Wz*_{x2} - Ux2||_2^2}(W^T W z*_{x2} + W^T Ux2))||_2
≤ β_max (||W^T W (z*_{x1} - z*_{x2})||_2 + ||W^T U (x1 - x2)||_2) + Bµ^2 |β_{x1} - β_{x2}|
≤ β_max µ^2 ||z*_{x1} - z*_{x2}||_2 + β_max µ^2 ||x1 - x2||_2 + Bµ^2 |β_{x1} - β_{x2}|,   (44)

where z*_{x1} denotes the equilibrium state of GEQ with input x1, and β_{x1} is defined as follows:

β_{x1} = e^{-γ||Wz*_{x1} - Ux1||_2^2}.   (45)

Then, with the mean value theorem, we have:

|β_{x1} - β_{x2}| = |e^{-γ||Wz*_{x1} - Ux1||_2^2} - e^{-γ||Wz*_{x2} - Ux2||_2^2}|
≤ max_{c between ||Wz*_{x1} - Ux1||_2 and ||Wz*_{x2} - Ux2||_2} 2γc e^{-γc^2} · | ||Wz*_{x1} - Ux1||_2 - ||Wz*_{x2} - Ux2||_2 |
≤ γ (||W(z*_{x1} - z*_{x2})||_2 + ||U(x1 - x2)||_2)
≤ γµ (||z*_{x1} - z*_{x2}||_2 + ||x1 - x2||_2).   (46)

Then, rearranging Eqn (44), we can finally obtain our proposition as below:

||z*_{x1} - z*_{x2}||_2 = ||f_geq(x1) - f_geq(x2)||_2 ≤ (β_max µ^2 + γBµ^3) / (1 - β_max µ^2 - γBµ^3) ||x1 - x2||_2.   (47)

A.5 Proofs for Proposition 4

In this section, we prove the output-similarity bounds for our GEQ and OptEqs. First, we restate the proposition as follows:

Proposition 4. Suppose the norms of the inputs and outputs are bounded by B, the spectral norm of the weight parameter W of the equilibrium models with ReLU activation is bounded by µ < 1 to ensure convergence, and each row of U obeys the spherical Gaussian distribution N(0, E[U_i^2] I).
Then the expected output similarities of GEQ and OptEqs with respect to inputs x1, x2 satisfy:

κ_geq(x1, x2) ≤ κ̄_geq = µ^2 D e^{-(γ/4) σ_min(U)^2 ||x1 - x2||_2^2} E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π (1 - β_max µ^2)^2),
κ_opteq(x1, x2) ≤ κ̄_opteq = E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π (1 - µ^2)^2),   (49)

where x1 and x2 are input samples, D = e^{γB ||W||_2^2}, β_max := max_{x∈D} e^{-γ||Ux - Wz*||_2^2} < 1, and θ0 = cos^{-1}(<x1, x2> / (||x1||_2 ||x2||_2)) is the angle between the samples.

Proof. Before the proof, we first introduce the following lemma:

Lemma 3 ([48]). If the rows of U obey a spherical Gaussian distribution of variance E[U_i^2] and mean 0, then the expected similarity for the one-layer neural network σ(Ux) is:

κ_NN(x1, x2) = E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π),   (50)

where θ0 = cos^{-1}(<x1, x2> / (||x1||_2 ||x2||_2)).

Let m := ||W_s||_2 = ||W^T W||_2 < 1 and µ := ||W||_2 < 1, as our assumptions demand, and neglect the bias b for convenience. Then, for our reformulated OptEqs, we have:

z*_{x1}^T z*_{x2} ≤ σ(W_s z*_{x1})^T z*_{x2} + σ(Ux1)^T z*_{x2} ≤ µ^2 σ(z*_{x1})^T z*_{x2} + σ(Ux1)^T z*_{x2}.   (51)

And we have:

σ(Ux1)^T z*_{x2} ≤ σ(Ux1)^T σ(W_s z*_{x2}) + σ(Ux1)^T σ(Ux2) ≤ µ^2 σ(Ux1)^T z*_{x2} + σ(Ux1)^T σ(Ux2),   (52)

so that:

σ(Ux1)^T z*_{x2} ≤ σ(Ux1)^T σ(Ux2) / (1 - µ^2),  z*_{x1}^T z*_{x2} ≤ σ(Ux1)^T σ(Ux2) / (1 - µ^2)^2.   (53)

Therefore, we can conclude:

κ_opteq(x1, x2) ≤ E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π (1 - µ^2)^2).   (54)

For GEQ, we set β_x = e^{-γ||Ux - Wz*||_2^2} and β_max := max_{x∈D} e^{-γ||Ux - Wz*||_2^2} < 1:

z*_{x1}^T z*_{x2} ≤ β_max σ(W_s z*_{x1})^T z*_{x2} + β_x ||W||_2 σ(Ux1)^T z*_{x2} ≤ β_max µ^2 σ(z*_{x1})^T z*_{x2} + β_x µ σ(Ux1)^T z*_{x2}.   (55)

As for OptEqs, we obtain the following for GEQ:

z*_{x1}^T z*_{x2} ≤ µ^2 β_{x1} β_{x2} σ(Ux1)^T σ(Ux2) / (1 - β_max µ^2)^2,   (56)

κ_geq(x1, x2) ≤ µ^2 β_{x1} β_{x2} E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π (1 - β_max µ^2)^2).   (57)

And we have:

β_{x1} β_{x2} = e^{-γ(||Wz*_{x1} - Ux1||_2^2 + ||Wz*_{x2} - Ux2||_2^2)} ≤ e^{-(γ/2) ||W(z*_{x1} - z*_{x2}) - U(x1 - x2)||_2^2} ≤ D e^{-(γ/4) σ_min(U)^2 ||x1 - x2||_2^2},   (58)

where D = e^{γ ||W||_2^2 B}, and the latter two inequalities follow from Jensen's inequality.
Then we have:

κ_geq(x1, x2) ≤ µ^2 D e^{-(γ/4) σ_min(U)^2 ||x1 - x2||_2^2} E[U_i^2] ||x1||_2 ||x2||_2 (sin θ0 + (π - θ0) cos θ0) / (2π (1 - β_max µ^2)^2).   (59)

A.6 Experiment Settings

A.6.1 Experiments on CIFAR

For our GEQ, we parallel six branches, with the branches taking scales of 32, 16, 8, 8, 4, and 4, and use the average fusion method for branch fusion. The output channels of the six branches are all 256 or 320, while the mid-channel numbers (the output channels of the weights U and W) of the six branches are 64, 128, 128, 128, 256, 256 or 80, 160, 160, 160, 320, 320, with patch size 2 and channel split 8. The inner MLP W_h outputs a 64-dimensional hidden representation for each patch. We use the SGD optimizer [23] with momentum and a step learning rate schedule for all the models. We also use RandAug for all the models for comparison.

A.6.2 Experiments on ImageNette and ImageNet-100

We take the input scale as 256 for all models. For our GEQ, we parallel six branches, with the branches taking scales of 64, 32, 16, 16, 8, and 8 after two downsampling convolutions, and use the average fusion method for branch fusion. The output channels of the six branches are all 256 or 384, while the mid-channel numbers (the output channels of the weights U and W) of the six branches are 32, 64, 128, 128, 256, 256 or 48, 96, 192, 192, 384, 384, with patch size 2 and channel split 8. The inner MLP W_h outputs a 128-dimensional hidden representation for each patch. We use the SGD optimizer with momentum and a step learning rate schedule for all the models. We also use RandAug for all the models for comparison.
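The optimizer setup described above (SGD with momentum plus a step learning rate schedule) can be sketched as follows; the momentum coefficient, base learning rate, and step-decay hyperparameters here are illustrative assumptions, not the paper's tuned values:

```python
import numpy as np

def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Step schedule: decay the learning rate by `gamma` every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)

def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9):
    """One SGD-with-momentum update on parameters `w`."""
    velocity = momentum * velocity + grad
    return w - lr * velocity, velocity

# Toy quadratic objective 0.5 * ||w||^2, whose gradient is w itself.
w = np.ones(4)
v = np.zeros(4)
for epoch in range(60):
    lr = step_lr(0.1, epoch)  # 0.1 for epochs 0-29, then 0.01
    w, v = sgd_momentum_step(w, w, v, lr)
```

The iterate oscillates but its envelope decays toward the minimum, which is the behavior the momentum term trades off against the decaying step size.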