# Positive Concave Deep Equilibrium Models

Mateusz Gabor 1, Tomasz Piotrowski 2, Renato L. G. Cavalcante 3

Abstract. Deep equilibrium (DEQ) models are widely recognized as a memory-efficient alternative to standard neural networks, achieving state-of-the-art performance in language modeling and computer vision tasks. These models solve a fixed point equation instead of explicitly computing the output, which sets them apart from standard neural networks. However, existing DEQ models often lack formal guarantees of the existence and uniqueness of the fixed point, and the convergence of the numerical scheme used for computing the fixed point is not formally established. As a result, DEQ models are potentially unstable in practice. To address these drawbacks, we introduce a novel class of DEQ models called positive concave deep equilibrium (pcDEQ) models. Our approach, which is based on nonlinear Perron-Frobenius theory, enforces nonnegative weights and activation functions that are concave on the positive orthant. By imposing these constraints, we can easily ensure the existence and uniqueness of the fixed point without relying on additional complex assumptions commonly found in the DEQ literature, such as those based on monotone operator theory in convex analysis. Furthermore, the fixed point can be computed with the standard fixed point algorithm, and we provide theoretical guarantees of its geometric convergence, which, in particular, simplifies the training process. Experiments demonstrate the competitiveness of our pcDEQ models against other implicit models.

1 Faculty of Electronics, Photonics, and Microsystems, Wrocław University of Science and Technology, Wrocław, Poland. 2 Faculty of Physics, Astronomy and Informatics, Nicolaus Copernicus University, Toruń, Poland. 3 Fraunhofer Heinrich Hertz Institute, Berlin, Germany. Correspondence to: Mateusz Gabor.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

1. Introduction

Implicit models (Bai et al., 2019; 2020; Chen et al., 2018; El Ghaoui et al., 2021; Baker et al., 2023; Tsuchida & Ong, 2023; Revay et al., 2020; Wei & Kolter, 2021; Geng et al., 2021) are attracting considerable interest owing to their improved memory efficiency compared to standard neural networks. These models solve implicit equations instead of explicitly computing the output of the layers, and they can be divided into two main categories: neural ordinary differential equations (neural ODEs) (Chen et al., 2018) and deep equilibrium (DEQ) models (Bai et al., 2019). Neural ODEs solve differential equations parameterized by the neural network input, with the output representing the solution to these equations. In contrast, the implicit layers of DEQ models solve nonlinear fixed point equations that are not necessarily derived from differential equations. An interesting aspect of DEQ models is that a single implicit DEQ layer emulates a standard neural network with an infinite number of layers and tied weights. While both DEQ models and neural ODEs require constant training memory, DEQ models often outperform neural ODEs, achieving state-of-the-art results in language modeling tasks (Bai et al., 2019) and computer vision tasks (Bai et al., 2020; Xie et al., 2022).
However, a potential limitation of standard DEQ models is that they are based on iterative methods that require careful initialization, tuning of hyperparameters, and special regularization (e.g., recurrent dropout) to ensure convergence to the fixed point, which is hard to guarantee with existing approaches. In general, standard DEQ models operate heuristically, and in many cases they even lack formal guarantees regarding the existence and uniqueness of a solution to the fixed point problem being solved. To overcome the limitations of existing DEQ models, we introduce a new variant called positive concave deep equilibrium (pcDEQ) models. In particular, these pcDEQ models address issues related to the existence and uniqueness of the fixed point. Furthermore, the fixed points of the proposed pcDEQ models can be easily computed with the standard fixed point iteration, and we provide formal guarantees of geometric convergence. The theoretical foundation of pcDEQ models is rooted in nonlinear Perron-Frobenius (NPF) theory (Lemmens & Nussbaum, 2012; Lins, 2023), which is commonly used in the analysis of nonnegative monotonic (order-preserving) and scalable functions. To ensure these properties in pcDEQ models, we enforce nonnegative weights and activation functions that are concave on the nonnegative orthant. An additional advantage of our proposed model is that the standard Jacobian-based backpropagation algorithm (Bai et al., 2019) can be used for training without requiring any changes.

We summarize our contributions as follows:

1. We propose a new class of DEQ models, called pcDEQ models, which are based on nonlinear Perron-Frobenius theory and are equipped with guarantees of the existence and uniqueness of the fixed point. Furthermore, we prove that the standard fixed point iteration for the proposed pcDEQ models converges to the fixed point geometrically fast.

2. We empirically show that, for the proposed pcDEQ architectures, in practice, only a few iterations are needed to achieve numerical convergence, and the number of iterations does not increase over the course of training.

3. We demonstrate competitive improvements of the proposed approach in terms of accuracy and number of parameters over existing alternatives for image classification tasks.

2. Related Work

In the seminal paper (Bai et al., 2019), DEQ models were applied to language modeling tasks, and they were shown to outperform standard neural networks constructed with a similar number of parameters. In subsequent studies, Bai et al. proposed multiscale extensions of DEQ models (Bai et al., 2020), where a single DEQ model is used for image classification and image segmentation. Since these pioneering studies, DEQ models have been successful in many applications, including, to name a few, object detection (Wang et al., 2023), optical flow estimation (Bai et al., 2022), video semantic segmentation (Ertenli et al., 2022), medical image segmentation (Zhang et al., 2022), snapshot compressive imaging (Zhao et al., 2023), image denoising (Chen et al., 2023; Gkillas et al., 2023a), machine translation (Zheng et al., 2023), inverse problems (Gilton et al., 2021; Zou et al., 2023), music source separation (Koyama et al., 2022), federated learning (Gkillas et al., 2023b), and diffusion models (Geng et al., 2023; Pokle et al., 2022). These models rely strongly on the fixed point theory that we summarize below. Let $\mathbb{R}^n$ be the standard Euclidean metric space and let $f : \mathbb{R}^n \to \mathbb{R}^n$.
If $f$ is a Lipschitz contraction, meaning that it is Lipschitz continuous with a Lipschitz constant $L < 1$, then the Banach fixed point theorem guarantees the existence and uniqueness of the fixed point $x^* = f(x^*)$ in $\mathbb{R}^n$. Moreover, the fixed point iteration $x_{k+1} = f(x_k)$ converges linearly to $x^*$ for any initial point $x_1 \in \mathbb{R}^n$. However, if $f$ is not a Lipschitz contraction, questions related to the existence and uniqueness of the fixed point, and also the convergence of the fixed point iteration, become more delicate, and these issues have been the subject of extensive research in the mathematical literature. In modern convex analysis, there has been a significant focus on nonexpansive mappings in Hilbert spaces (Yamada et al., 2011), which are mappings with Lipschitz constant equal to one, and their relation to monotone operator theory (Bauschke & Combettes, 2017). Nonexpansive mappings do not necessarily have a fixed point, and, if the fixed point set is nonempty, it may not be a singleton in general. Furthermore, even if a fixed point exists, the fixed point iteration may fail to converge, in which case one can resort to various iterative methods based on the Krasnosel'skii-Mann iteration to ensure convergence. Particular instances of this iteration include well-known algorithms used in machine learning, such as the projected gradient method, the proximal forward-backward splitting method, the Douglas-Rachford splitting method, the projection onto convex sets method, and many others (Yamada et al., 2011).

Monotone operator theory in DEQ models has been explored in (Winston & Kolter, 2020), a study that introduces the monotone operator deep equilibrium (monDEQ) models. These models ensure both the existence and uniqueness of the fixed point, while also guaranteeing the convergence of the forward-backward splitting algorithm and the Peaceman-Rachford splitting algorithm to the fixed point. However, approaches of this type require restrictions on weights that can only be enforced with complex numerical techniques.

To avoid the above difficulties, in this study we use nonlinear Perron-Frobenius theory (Krause, 2015; Lemmens & Nussbaum, 2012) as an alternative to the traditional theory in Hilbert spaces. More specifically, we consider deep equilibrium layers that belong to the class of standard interference (SI) mappings. These mappings, introduced in the next section, have been widely used in wireless networks (Yates, 1995; Schubert & Boche, 2011; Stanczak et al., 2009; You & Yuan, 2020; Miretti et al., 2023a;b; Shindoh, 2020; 2019; Cavalcante et al., 2016; 2019). They are not necessarily nonexpansive in Hilbert spaces, but they are contractive (though not necessarily Lipschitz contractions) in certain metric spaces defined on cones, such as the cone of positive vectors. Before introducing SI mappings and their applications in DEQ models, we first need to establish the notation and formally introduce the concept of DEQ layers.

3. Preliminaries

The nonnegative cone and its interior (i.e., the positive cone) are denoted, respectively, by
$\mathbb{R}^n_+ := \{(x_1, \ldots, x_n) \in \mathbb{R}^n \mid (\forall k \in \{1, \ldots, n\})\ x_k \ge 0\}$ and
$\operatorname{int}(\mathbb{R}^n_+) := \{(x_1, \ldots, x_n) \in \mathbb{R}^n_+ \mid (\forall k \in \{1, \ldots, n\})\ x_k > 0\}$.
Let $x, y \in \mathbb{R}^n_+$. The partial ordering induced by the nonnegative cone is denoted by $x \le y \Leftrightarrow y - x \in \mathbb{R}^n_+$. Similarly, for $x \neq y$, $x < y \Leftrightarrow y - x \in \mathbb{R}^n_+$, and $x \ll y \Leftrightarrow y - x \in \operatorname{int}(\mathbb{R}^n_+)$. The fixed point set of a function $f : X \to Y$, with $Y$ and $X$ being subsets of a given set $S$, is denoted by $\mathrm{Fix}(f) = \{x^* \in X \mid f(x^*) = x^*\}$.
3.1. Deep Equilibrium Layers

We now have all the necessary notation to introduce generic DEQ models.

Definition 3.1. A DEQ layer maps an input $x \in X \subseteq \mathbb{R}^n$ to an output $z \in \mathrm{Fix}(g_x) \subseteq Y \subseteq X$, where $g_x : X \to Y$ is an explicit function given by
$g_x : X \to Y : z \mapsto \sigma(Wz + x)$; (1)
$W : X \to Y$ is a linear operator (weight matrix); and $\sigma : X \to Y$ is a (vector-valued nonlinear) activation function, composed elementwise from a given scalar activation function.

In the above definition, closed-form expressions for the implicit function $x \mapsto z \in \mathrm{Fix}(g_x)$ are not required, and the numerical scheme used to compute the output $z \in \mathrm{Fix}(g_x)$ from a given input $x$ is not specified. As a result, to ensure that the implicit function $x \mapsto z \in \mathrm{Fix}(g_x)$ is well-defined and independent of the numerical scheme used to compute the output, we require $\mathrm{Fix}(g_x)$ to be a singleton for every $x \in X$. This fixed point formulation allows for direct implicit differentiation, which is crucial for training DEQ models (Bai et al., 2019). In the text that follows, for convenience, we always refer to a DEQ layer using its explicit function $g_x$ because, with the above restriction of $\mathrm{Fix}(g_x)$ being a singleton, the implicit function $x \mapsto z \in \mathrm{Fix}(g_x)$ is well-defined. One of the simplest numerical schemes for computing fixed points, and the one we consider in this study, is the standard fixed point iteration, which, using the notation in Definition 3.1, we can write as
$(\forall k \in \mathbb{N})\ z_{k+1} = g_x(z_k)$ with $z_1 \in X$. (2)

3.2. Standard Interference and Positive Concave Mappings

The DEQ layers that we propose in this study are a proper subclass of standard interference mappings, defined as follows.

Definition 3.2. A mapping $g : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ is said to be a standard interference (SI) mapping if it is
1. monotonic: $(\forall x \in \mathbb{R}^n_+)(\forall \bar{x} \in \mathbb{R}^n_+)\ x \le \bar{x} \Rightarrow g(x) \le g(\bar{x})$, and (3)
2. scalable: $(\forall x \in \mathbb{R}^n_+)(\forall \lambda > 1)\ g(\lambda x) \ll \lambda g(x)$. (4)

Remark 3.3. Monotonic mappings in the sense of Definition 3.2 are also known as order-preserving mappings in the mathematical literature, and they should not be confused with monotone operators used in convex analysis (Ryu & Boyd, 2016; Bauschke et al., 2017), which is a different concept in general.

SI mappings have at most one fixed point (Yates, 1995), and its existence can be established with Proposition A.4 in the appendix, which uses the concepts of asymptotic mappings (Definition A.2) and the nonlinear spectral radius (Definition A.3). A proper subclass of SI mappings is the class of positive concave mappings (see Proposition A.6), defined below.

Definition 3.4. Let $g : \mathbb{R}^n_+ \to \mathbb{R}^n_+$ be concave with respect to the cone order; i.e.,
$(\forall x \in \mathbb{R}^n_+)(\forall y \in \mathbb{R}^n_+)(\forall t \in (0,1))\ g(tx + (1-t)y) \ge t g(x) + (1-t) g(y)$. (5)
Then $g$ is called a nonnegative concave (NC) mapping. Furthermore, if the codomain of $g$ is in the set of positive vectors (i.e., $g : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$), then $g$ is called a positive concave (PC) mapping.

We emphasize that NC mappings are not SI mappings in general (e.g., the identity $x \mapsto x$ on $\mathbb{R}^n_+$ is an NC mapping, but not an SI mapping); only PC mappings are guaranteed to be SI mappings without any further assumptions, see (Cavalcante et al., 2016; 2019). An important property of PC mappings, which we exploit in this study, is that, if a fixed point exists, the fixed point iteration is guaranteed to converge geometrically fast to the fixed point (Proposition A.7).
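For concreteness, the following minimal NumPy sketch, with illustrative dimensions, random nonnegative weights, and the sigmoid taken as a concave activation with positive outputs, iterates a mapping of the form $z \mapsto \sigma(Wz + x)$ as in (2); it is only meant to illustrate numerically the behavior formalized in Proposition A.7.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.uniform(0.0, 1.0, size=(n, n))     # nonnegative weight matrix
x = rng.uniform(0.1, 1.0, size=n)          # positive input vector

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def g(z):
    # g_x(z) = sigma(W z + x): concave and positive on the nonnegative cone,
    # hence a positive concave (PC) mapping
    return sigmoid(W @ z + x)

z = np.zeros(n)                             # any starting point z_1 in the nonnegative cone
for k in range(100):
    z_next = g(z)
    if np.linalg.norm(z_next - z) < 1e-12:
        break
    z = z_next

print(f"fixed point reached after {k + 1} iterations:", z_next)
```

In this toy setting the iteration typically stops after a handful of steps, consistent with the geometric convergence guarantee.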
4. PC Deep Equilibrium Layers

We now proceed to construct the proposed pcDEQ layers $g_x$ based on the theory described in the previous section. In particular, recall that, by restricting DEQ layers to the class of PC mappings, we obtain simple conditions that guarantee the existence and uniqueness of the fixed point $z^*$, and we also have the simple iterative scheme in (2) to compute the fixed point, an algorithm that converges geometrically fast for PC mappings. We start by restricting the activation functions to be nonnegative concave on the nonnegative cone, and, for later reference, the next remark gives examples of such functions.

Remark 4.1. We divide the activation functions allowed by the proposed framework into two classes: nonnegative concave activation functions and positive concave activation functions. Common examples in the neural network literature are shown in the two lists below.

1. Continuous NC activation functions ($\mathbb{R}_+ \to \mathbb{R}_+$):
   (ReLU6) $x \mapsto \min\{x, 6\}$
   (hyperbolic tangent) $x \mapsto \tanh x$
   (softsign) $x \mapsto \frac{x}{1+x}$
2. Continuous PC activation functions ($\mathbb{R}_+ \to \operatorname{int}(\mathbb{R}_+)$):
   (sigmoid) $x \mapsto \frac{1}{1+\exp(-x)}$

We recall that the activation functions from the above lists are applied elementwise to vector arguments; see Definition 3.1. The following lemma provides two simple sufficient conditions to construct positive concave DEQ layers.

Lemma 4.2. Consider a DEQ layer $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ of the form in (1) in Definition 3.1 for an input $x$. Then:
(Assumption 1) $z \mapsto g_x(z) := \sigma(Wz + x)$ is a PC mapping $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ if $z \in \mathbb{R}^n_+$, $W \in \mathbb{R}^{n \times n}_+$, $x \in \operatorname{int}(\mathbb{R}^n_+)$, and $\sigma$ is constructed elementwise from any scalar activation function from List 1 in Remark 4.1;
(Assumption 2) $z \mapsto g_x(z) := \sigma(Wz + x)$ is a PC mapping $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ if $z \in \mathbb{R}^n_+$, $W \in \mathbb{R}^{n \times n}_+$, $x \in \mathbb{R}^n_+$, and $\sigma$ is constructed elementwise from the scalar activation function from List 2 in Remark 4.1.

Proof. Let $f_x : \mathbb{R}^n_+ \to \mathbb{R}^n_+$ be given by $z \mapsto Wz + x$ and let $\sigma : \mathbb{R}^n_+ \to \mathbb{R}^n_+$ be the activation function. Let $f_x$ and $\sigma$ satisfy either Assumption 1 or Assumption 2, so that $g_x = \sigma \circ f_x$. We note that both $f_x$ and $\sigma$ belong to the class of NC mappings. Let $z_1, z_2 \in \mathbb{R}^n_+$ and $t \in (0,1)$. Then, by concavity of $f_x$: $f_x(t z_1 + (1-t) z_2) \ge t f_x(z_1) + (1-t) f_x(z_2)$. Hence, monotonicity of $\sigma$ implies that $\sigma[f_x(t z_1 + (1-t) z_2)] \ge \sigma[t f_x(z_1) + (1-t) f_x(z_2)]$. From the concavity of $\sigma$ we deduce $\sigma[t f_x(z_1) + (1-t) f_x(z_2)] \ge t\,\sigma(f_x(z_1)) + (1-t)\,\sigma(f_x(z_2))$. By combining the above two inequalities, we conclude that $(\sigma \circ f_x)(t z_1 + (1-t) z_2) \ge t(\sigma \circ f_x)(z_1) + (1-t)(\sigma \circ f_x)(z_2)$, implying that $g_x = \sigma \circ f_x$ is an NC mapping under either Assumption 1 or Assumption 2. To show that $g_x$ is a PC mapping, we first note that, by the monotonicity of $f_x$, one has $f_x(z) \ge f_x(0)$ for $z \in \mathbb{R}^n_+$. Under Assumption 1, we have $y_0 := f_x(0) = x \gg 0$, so that $(\forall z \in \mathbb{R}^n_+)\ f_x(z) \gg 0$. We now note that any of the scalar activation functions from List 1 selected for $\sigma$ satisfies $(\forall v \gg 0)\ \sigma(v) \gg 0$, so that $\sigma(y_0) = (\sigma \circ f_x)(0) = g_x(0) \gg 0$. Since $g_x$ is monotonic, one has $(\forall y \in \mathbb{R}^n_+)\ g_x(y) \ge g_x(0) \gg 0$, thus $g_x$ is a PC mapping. On the other hand, if Assumption 2 is satisfied, we note that, in this case, $\sigma(0) \gg 0$, hence $(\forall z \in \mathbb{R}^n_+)\ g_x(z) = (\sigma \circ f_x)(z) \gg 0$, which implies that $g_x$ is also a PC mapping in this case.

For convenience, the DEQ layers satisfying the assumptions in Lemma 4.2 are formally defined below.

Definition 4.3. Let $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ be a DEQ layer of the form in (1) for a given input $x$. If $g_x$ (including the input $x$) satisfies Assumption 1 in Lemma 4.2, then $g_x$ is called a pcDEQ-1 layer. If $g_x$ satisfies Assumption 2 in Lemma 4.2, then $g_x$ is called a pcDEQ-2 layer. By pcDEQ we mean a layer $g_x$ that is either a pcDEQ-1 layer or a pcDEQ-2 layer. pcDEQ-1 and pcDEQ-2 layers are illustrated in Figure 1.
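To make the construction concrete, a minimal PyTorch sketch of a pcDEQ layer of the form (1) is given below; the module name, dimensions, and initialization are illustrative, and the weight normalization and batch normalization used in the experiments (Appendix B) are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCDEQLayer(nn.Module):
    """Sketch of a pcDEQ layer g_x(z) = sigma(W z + x) with nonnegative weights W."""

    def __init__(self, n, variant="pcdeq1"):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.rand(n, n))   # initialized in the nonnegative orthant
        # pcDEQ-1: nonnegative concave activation from List 1 (here ReLU6), positive input x required;
        # pcDEQ-2: positive concave activation from List 2 (sigmoid), nonnegative input x suffices.
        self.act = F.relu6 if variant == "pcdeq1" else torch.sigmoid

    def g(self, z, x):
        # explicit function of the DEQ layer in (1)
        return self.act(F.linear(z, self.W) + x)

    @torch.no_grad()
    def project_weights(self):
        # keep W in the nonnegative orthant, e.g., after every optimizer step
        self.W.clamp_(min=0.0)
```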
Figure 1. Visualization of the possible constructions of pcDEQ layers. The symbols shown in the figure are: $W_+ \in \mathbb{R}^{n \times n}_+$, the nonnegative weights; $z_+ \in \mathbb{R}^n_+$, the nonnegative vector of the fixed point iteration; $x_+ \in \mathbb{R}^n_+$, a nonnegative input to the layer; $x_{++} \in \operatorname{int}(\mathbb{R}^n_+)$, a positive input to the layer; $\sigma_{NC}$, a nonnegative concave activation function (List 1 in Remark 4.1); and $\sigma_{PC}$, a positive concave activation function (List 2 in Remark 4.1).

Lemma 4.2 asserts that pcDEQ layers are PC mappings for any allowed input $x$, which, in view of Proposition A.6, implies that they are also SI mappings. We can now use Definition A.2 to provide the form of the asymptotic mapping associated with a pcDEQ layer. The assertion of Proposition 4.4 follows from Proposition 11 in (Piotrowski et al., 2024), and, for completeness, we include the proof below.

Proposition 4.4. For a given input $x$, let $g_{x,\infty} : \mathbb{R}^n_+ \to \mathbb{R}^n_+$ be the asymptotic mapping (in the sense of Definition A.2) of a pcDEQ layer $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ satisfying Definition 4.3. Then $g_{x,\infty}(z) = 0$ for every $z \in \mathbb{R}^n_+$.

Proof. The asymptotic mapping $g_{x,\infty}$, according to Definition A.2, is given by
$g_{x,\infty}(z) = \lim_{p\to\infty} \frac{1}{p}\, g_x(pz) = \lim_{p\to\infty} \frac{\sigma(Wpz + x)}{p}$. (6)
Applying L'Hôpital's rule to the activation function $\sigma$ with $p$ as the argument, we have
$g_{x,\infty}(z) = \lim_{p\to\infty} \sigma'(Wpz + x)\,Wz$. (7)
For activation functions composed from any of the scalar activation functions in Remark 4.1, we have $\lim_{p\to\infty} \sigma'(Wpz + x) = 0$, hence also
$\lim_{p\to\infty} \sigma'(Wpz + x)\,Wz = 0$. (8)

The following Corollary 4.5 follows directly from Proposition 4.4.

Corollary 4.5. With the notation in Proposition 4.4, let $\rho(g_{x,\infty}) \in \mathbb{R}_+$ be the spectral radius of $g_{x,\infty}$ in the sense of Definition A.3 for a given input $x$. Then $\rho(g_{x,\infty}) = 0$.

Based on the previous results, Proposition 4.6 establishes the existence and uniqueness of the fixed point of pcDEQ layers, in addition to the geometric convergence of the fixed point iteration.

Proposition 4.6. Let $g_x : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ be a pcDEQ layer satisfying Definition 4.3. Then $g_x$ has a unique fixed point $z^*$ for every input $x$ satisfying the conditions in Definition 4.3. Moreover, the fixed point iteration of $g_x$ in (2) converges geometrically, in the sense of Definition A.1, to $z^*$ for any $z_1 \in \mathbb{R}^n_+$ and any input $x$.

Proof. Choose any input $x$ satisfying the conditions in Definition 4.3. From Lemma 4.2, $g_x$ is a PC mapping. Thus, from Proposition A.6, $g_x$ is also an SI mapping. From Corollary 4.5, the nonlinear spectral radius of the asymptotic mapping of $g_x$ is $\rho(g_{x,\infty}) = 0$. It follows from Proposition A.4 that $g_x$ has a unique fixed point $z^*$. Then, from Proposition A.7, it follows that the fixed point iteration of $g_x$ in (2) converges geometrically to $z^*$ with a factor $c \in [0, 1)$ for any starting point $z_1 \in \mathbb{R}^n_+$.

5. Experiments

The experiments were carried out on three commonly known computer vision datasets: MNIST, SVHN, and CIFAR-10. The proposed pcDEQ models were compared with competitive approaches for which the uniqueness of the fixed point is mathematically established, namely: monotone operator deep equilibrium models (monDEQ) (Winston & Kolter, 2020), neural ordinary differential equations (NODE) (Chen et al., 2018), and augmented neural ordinary differential equations (ANODE) (Dupont et al., 2019). The comparison was performed according to the results reported in (Dupont et al., 2019; Winston & Kolter, 2020). For pcDEQ models, the fixed point $z^*$ is computed using the standard fixed point iteration in (2). The stopping criterion of the fixed point iteration is based on the relative error, calculated as $\|z_{k+1} - z_k\| / \|z_{k+1}\| \le \epsilon$, where $\|\cdot\|$ is the Frobenius norm and $\epsilon$ is a tolerance. In our experiments, the tolerance $\epsilon$ was set to $10^{-4}$.
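In code, the forward pass of a pcDEQ layer therefore reduces to iterating $g_x$ until the relative error drops below the tolerance; a minimal sketch, reusing the illustrative layer from Section 4 and the default tolerance of $10^{-4}$, is shown below.

```python
import torch

def solve_fixed_point(layer, x, max_iter=50, tol=1e-4):
    """Standard fixed point iteration (2) with the relative-error stopping criterion."""
    z = torch.zeros_like(x)                       # any z_1 in the nonnegative cone
    for _ in range(max_iter):
        z_next = layer.g(z, x)
        rel_err = torch.norm(z_next - z) / (torch.norm(z_next) + 1e-12)
        z = z_next
        if rel_err < tol:
            break
    return z
```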
The experiments were performed using the Google Colab platform with an NVIDIA Tesla T4 16GB GPU.

5.1. Architecture Setup

According to Definition 4.3, we consider two options to build pcDEQ layers: pcDEQ-1 (Figure 1 (a)) and pcDEQ-2 (Figure 1 (b)). For both options, the nonnegative weights $W_+$ are obtained by projecting the negative values to zeros after backpropagation. To construct pcDEQ-1, it is necessary to provide a positive input $x_{++}$ (see Figure 1 (a)), which is achieved by applying the softplus activation function ($\sigma : \mathbb{R} \to \operatorname{int}(\mathbb{R}_+)$) elementwise before the pcDEQ-1 layer, since the activation functions used in networks with pcDEQ-1 layers are functions from List 1 in Remark 4.1. In the case of the pcDEQ-2 model, the activation function is a PC activation function from List 2 in Remark 4.1. For pcDEQ-2, the nonnegativity of $x_+$ is achieved by applying the ReLU activation function before the DEQ layer. Compared to monDEQ models, the construction of pcDEQ layers is much simpler and does not require a special layer implementation. A monDEQ layer has to be carefully parameterized by a set of two weights to satisfy the assumptions of strong monotonicity, which results in overparameterization and a more complicated implementation. Furthermore, computing convolutions in monDEQ models requires calculating fast Fourier transforms, which, as noted in (Winston & Kolter, 2020), are empirically 2-3 times slower than computing convolutions in the standard manner. On the other hand, in pcDEQ models, as mentioned previously, the only requirement is to use concave activation functions and constrain the weights to be nonnegative. Such a parameterization does not produce extra computational overhead compared to standard DEQ models.

In the experiments, we consider three types of pcDEQ models with the four activation functions from Remark 4.1. The first type of network uses one linear pcDEQ layer. The second type uses one convolutional pcDEQ layer, and the last type uses three convolutional pcDEQ layers, with explicit downsampling layers between them. The architectural details are discussed in Appendix B, and the experimental hyperparameters for each network in Appendix C. Networks using a single linear pcDEQ layer have the suffix L in their name. Similarly, for networks with a single convolutional pcDEQ layer, the suffix SC is added to the network name, and for networks with three convolutional pcDEQ layers, the suffix MC is added. The suffix MT in monDEQs refers to the multi-tier architecture used in (Winston & Kolter, 2020). The PyTorch source code of pcDEQs, along with examples of use, is available at https://github.com/mateuszgabor/pcdeq.
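A hedged sketch of a single training step with the projection of negative weight entries to zero applied after backpropagation is shown below; the model, loss function, optimizer, and data batch are placeholders, and pcdeq_layers denotes the pcDEQ layers of the network (e.g., instances of the illustrative PCDEQLayer above).

```python
def train_step(model, pcdeq_layers, optimizer, loss_fn, batch):
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # projection onto the nonnegative orthant after the gradient update,
    # so that every pcDEQ layer remains a positive concave mapping
    for layer in pcdeq_layers:
        layer.project_weights()
    return loss.item()
```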
5.2. Results

Tables 1, 2, and 3 show the results obtained by the pcDEQ models and the compared methods for MNIST, SVHN, and CIFAR-10, respectively. As the results show, the architectures based on pcDEQ achieve competitive results compared to other implicit models. In each scenario, the pcDEQ models were constructed with a smaller number of parameters than NODE, ANODE, and monDEQ. In the case of the results obtained on the MNIST dataset (Table 1), all pcDEQ configurations outperform the NODE, ANODE, and monDEQ approaches. For the results obtained on the SVHN dataset (Table 2), the highest accuracy is obtained by pcDEQ with ReLU6 activation functions and three convolutional layers. In the case of the results obtained on the CIFAR-10 dataset, Table 3 shows that, with a lower number of parameters, pcDEQ models can achieve similar or better results compared to NODE, ANODE, or monDEQ. Moreover, similarly to previous works, we trained larger pcDEQ models with data augmentation on CIFAR-10. For this setup, pcDEQ with three convolutional pcDEQ layers and softsign activation functions achieves the highest accuracy among all pcDEQ configurations. Figure 2 shows the training curves for the pcDEQ models with a single convolutional layer. For the other cases, the figures are provided in Appendix D.

Table 1. Test accuracies of pcDEQ models averaged over five runs on the MNIST dataset compared with NODE, ANODE, and monDEQ (NODE and ANODE results as reported in (Dupont et al., 2019); monDEQ results as reported in (Winston & Kolter, 2020)).

| Method | #Parameters | Accuracy [%] |
| --- | --- | --- |
| NODE | 84K | 96.4 |
| ANODE | 84K | 98.2 |
| monDEQ-L | 84K | 98.1 |
| monDEQ-SC | 84K | 99.1 |
| monDEQ-MT | 81K | 99.0 |
| pcDEQ-1-L-ReLU6 | 70K | 98.1 |
| pcDEQ-1-L-Tanh | 70K | 98.2 |
| pcDEQ-1-L-Softsign | 70K | 98.1 |
| pcDEQ-2-L-Sigmoid | 70K | 98.1 |
| pcDEQ-1-SC-ReLU6 | 69K | 99.2 |
| pcDEQ-1-SC-Tanh | 69K | 99.2 |
| pcDEQ-1-SC-Softsign | 69K | 99.1 |
| pcDEQ-2-SC-Sigmoid | 69K | 98.9 |
| pcDEQ-1-MC-ReLU6 | 41K | 99.3 |
| pcDEQ-1-MC-Tanh | 41K | 99.2 |
| pcDEQ-1-MC-Softsign | 41K | 99.2 |
| pcDEQ-2-MC-Sigmoid | 41K | 98.7 |

Table 2. Test accuracies of pcDEQ models averaged over five runs on the SVHN dataset compared with NODE, ANODE, and monDEQ (NODE and ANODE results as reported in (Dupont et al., 2019); monDEQ results as reported in (Winston & Kolter, 2020)).

| Method | #Parameters | Accuracy [%] |
| --- | --- | --- |
| NODE | 172K | 81.0 |
| ANODE | 172K | 83.5 |
| monDEQ-SC | 172K | 88.7 |
| monDEQ-MT | 170K | 92.4 |
| pcDEQ-1-SC-ReLU6 | 165K | 88.0 |
| pcDEQ-1-SC-Tanh | 165K | 88.1 |
| pcDEQ-1-SC-Softsign | 165K | 88.4 |
| pcDEQ-2-SC-Sigmoid | 165K | 87.3 |
| pcDEQ-1-MC-ReLU6 | 131K | 93.0 |
| pcDEQ-1-MC-Tanh | 131K | 92.3 |
| pcDEQ-1-MC-Softsign | 131K | 92.3 |
| pcDEQ-2-MC-Sigmoid | 131K | 91.5 |

Figure 2. Test accuracies during training for the pcDEQ model with a single convolutional layer over five experiment runs (curves for the ReLU6, softsign, tanh, and sigmoid variants).

5.3. Convergence Analysis

Figure 3 shows the average number (over all batches in an epoch) of fixed point iterations needed to compute a fixed point per epoch for the pcDEQ model with a single convolutional layer. From this figure, it can be seen that the convergence is very fast and accurate. For pcDEQ with the sigmoid activation function, the fixed point iteration satisfies the stopping criterion in fewer than eight iterations. From Proposition 4.6, we know that the convergence of the fixed point iteration for the pcDEQ models is geometric; the convergence rate could be further improved with vector extrapolation techniques. This behavior is particularly interesting because such fast convergence is achieved without using any acceleration method. Notably, in standard DEQ and monDEQ models the number of iterations of the iterative method used increases during training, whereas in the proposed pcDEQ architectures this effect does not occur. For the other architecture configurations, the situation is similar to that described in this section, and the results are given in Appendix D.
Table 3. Test accuracies of pcDEQ models averaged over five runs on the CIFAR-10 dataset compared with NODE, ANODE, and monDEQ (NODE and ANODE results as reported in (Dupont et al., 2019); monDEQ results as reported in (Winston & Kolter, 2020); * denotes training with data augmentation).

| Method | #Parameters | Accuracy [%] |
| --- | --- | --- |
| NODE | 172K | 53.7 |
| NODE* | 1M | 59.9 |
| ANODE | 172K | 60.6 |
| ANODE* | 1M | 73.4 |
| monDEQ-SC | 172K | 74.0 |
| monDEQ-SC* | 854K | 82.0 |
| monDEQ-MT | 170K | 72.0 |
| monDEQ-MT* | 1M | 89.0 |
| pcDEQ-1-SC-ReLU6 | 165K | 76.3 |
| pcDEQ-1-SC-Tanh | 165K | 76.6 |
| pcDEQ-1-SC-Softsign | 165K | 76.4 |
| pcDEQ-2-SC-Sigmoid | 165K | 75.5 |
| pcDEQ-1-MC-ReLU6 | 131K | 78.2 |
| pcDEQ-1-MC-ReLU6* | 661K | 89.2 |
| pcDEQ-1-MC-Tanh | 131K | 76.5 |
| pcDEQ-1-MC-Tanh* | 661K | 88.5 |
| pcDEQ-1-MC-Softsign | 131K | 77.1 |
| pcDEQ-1-MC-Softsign* | 661K | 89.0 |
| pcDEQ-2-MC-Sigmoid | 131K | 71.0 |
| pcDEQ-2-MC-Sigmoid* | 661K | 85.6 |

5.4. Lipschitz Continuity

In this section, by considering the standard Euclidean metric space, we show empirically that the assumptions used to establish the uniqueness of the fixed point are weaker than the standard assumptions in convex analysis. In general, to establish the uniqueness of the fixed point via the Banach fixed point theorem, we need information about a Lipschitz constant of the function, and, in the machine learning literature, we often consider the standard Euclidean space. In the case of neural networks, if the activation functions used are nonexpansive (such as those in Remark 4.1), then a Lipschitz constant $L$ w.r.t. the standard Euclidean space is upper bounded by the product of the spectral norms of the weight operators (Combettes & Pesquet, 2020) as follows:
$L \le \prod_{i=1}^{m} \|W_i\|$, (9)
where $m$ is the number of weight operators and $\|\cdot\|$ is the spectral norm. For a simple scenario, we investigated the pcDEQ model containing one linear pcDEQ layer ($m = 1$) with the four activation functions. The average largest singular value per epoch for the entire training process is shown in Figure 4. It can be concluded that, for the networks with the softsign, hyperbolic tangent, and sigmoid activation functions, the Lipschitz constant in (9) satisfies $L > 1$ for most of the training time. In such a case, the uniqueness of the fixed point cannot be deduced from the Banach fixed point theorem, because the Lipschitz constant $L$ in (9) does not satisfy $L < 1$. In the case of the network with the ReLU6 activation function, the Lipschitz constant in (9) satisfies $L < 1$ for the entire training process. In such a case, the Banach fixed point theorem can be used to establish the uniqueness of the fixed point and the linear convergence of the fixed point iteration to it. However, there are no guarantees that, even for ReLU6, the Lipschitz constant $L$ will always be less than one, for example, with different learning rates, initializations, or architectures. On the other hand, the necessary and sufficient conditions for the uniqueness of the fixed point of SI mappings are weaker than those of the Banach fixed point theory, because the uniqueness of the fixed point is independent of a Lipschitz constant, and the unique fixed point can still exist even if $L > 1$.

Figure 3. Number of fixed point iterations for computing the fixed point in the forward and backward passes for the pcDEQ model with a single convolutional layer over five experiment runs.

Figure 4. Largest singular value of the pcDEQ linear layer over five experiment runs.
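The bound in (9) can be monitored during training by computing the largest singular values of the weight operators; a minimal sketch for the linear case is given below (the helper name is illustrative, and layer.W denotes the weight matrix of the single linear pcDEQ layer).

```python
import torch

def lipschitz_upper_bound(weight_matrices):
    """Product of spectral norms, i.e., the upper bound on L in (9) for nonexpansive activations."""
    bound = 1.0
    for W in weight_matrices:
        bound *= torch.linalg.matrix_norm(W, ord=2).item()   # largest singular value
    return bound

# single linear pcDEQ layer (m = 1):
# L_bound = lipschitz_upper_bound([layer.W])
```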
6. Limitations

The proposed approach may appear restrictive due to the nonnegativity constraint on the weights and the use of concave activation functions. However, these restrictions are essential to ensure theoretical guarantees, such as the uniqueness of the fixed point and the convergence of the standard fixed point iteration. It is important to note that existing approaches offering similar guarantees, such as monDEQ models, also impose constraints to ensure that the mappings are strongly monotone in the sense used in convex analysis. The advantage of our approach is that the imposed restrictions are simpler to implement in practice than those based on monotone operator theory.

7. Conclusions

In this study, we have proposed a new class of DEQ models with guarantees of the existence of a unique fixed point. The parametrization of pcDEQ is very simple, and it does not require any sophisticated training modifications compared to standard DEQ models. Moreover, the fixed point can be easily computed with the standard fixed point iteration, and the convergence is guaranteed to be geometric. The proposed pcDEQ models are based on the theory of SI mappings, which are widely used in the wireless literature. To the best of our knowledge, this is the first practical application of the theory of SI mappings in the deep learning literature. This study provides the foundation for extending the proposed method to a larger and less restricted class of mappings with guarantees similar to those of SI mappings. As an example, one may aim to further weaken the assumptions used to construct pcDEQ models to obtain more versatile models, capable of incorporating a wider class of weight operators and activation functions in DEQ layers. Such a direction seems feasible in view of the recent results on the existence and shape of the fixed point sets of (subhomogeneous) weakly standard interference (WSI) neural networks provided in (Piotrowski et al., 2024). Another promising future research direction is to provide even stronger guarantees on the convergence rate than the geometric convergence of the fixed point iteration of pcDEQ models established in this paper.
Indeed, the empirical convergence analysis provided in Section 5.3 suggests that the actual rate of convergence may be linear.

Acknowledgments

This work is supported by the National Center of Science under the Preludium 22 project No. UMO-2023/49/N/ST6/02697 (Mateusz Gabor), by the National Centre for Research and Development of Poland (NCBR) under grant EIG CONCERT-JAPAN/04/2021, and by the Federal Ministry of Education and Research of Germany under grant 01DR21009 and the programme "Souverän. Digital. Vernetzt.", joint project 6G-RIC, project identification numbers 16KISK020K and 16KISK030.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.

Bai, S., Koltun, V., and Kolter, J. Z. Multiscale deep equilibrium models. Advances in Neural Information Processing Systems, 33:5238-5250, 2020.

Bai, S., Geng, Z., Savani, Y., and Kolter, J. Z. Deep equilibrium optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 620-630, 2022.

Baker, J., Wang, Q., Hauck, C. D., and Wang, B. Implicit graph neural networks: a monotone operator viewpoint. In International Conference on Machine Learning, pp. 1521-1548. PMLR, 2023.

Bauschke, H. H. and Combettes, P. L. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2nd edition, 2017.

Bauschke, H. H. and Combettes, P. L. Correction to: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2017.

Cavalcante, R. L., Shen, Y., and Stańczak, S. Elementary properties of positive concave mappings with applications to network planning and optimization. IEEE Transactions on Signal Processing, 64(7):1774-1783, 2016.

Cavalcante, R. L., Liao, Q., and Stańczak, S. Connections between spectral properties of asymptotic mappings and solutions to wireless network problems. IEEE Transactions on Signal Processing, 67(10):2747-2760, 2019.

Chen, Q., Wang, Y., Geng, Z., Wang, Y., Yang, J., and Lin, Z. Equilibrium image denoising with implicit differentiation. IEEE Transactions on Image Processing, 32:1868-1881, 2023.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

Combettes, P. L. and Pesquet, J.-C. Lipschitz certificates for layered network structures driven by averaged activation operators. SIAM Journal on Mathematics of Data Science, 2(2):529-557, 2020.

Dupont, E., Doucet, A., and Teh, Y. W. Augmented neural ODEs. Advances in Neural Information Processing Systems, 32, 2019.

El Ghaoui, L., Gu, F., Travacca, B., Askari, A., and Tsai, A. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930-958, 2021.

Ertenli, C. U., Akbas, E., and Cinbis, R. G. Streaming multiscale deep equilibrium models. In European Conference on Computer Vision, pp. 189-205. Springer, 2022.

Geng, Z., Zhang, X.-Y., Bai, S., Wang, Y., and Lin, Z. On training implicit models. Advances in Neural Information Processing Systems, 34:24247-24260, 2021.

Geng, Z., Pokle, A., and Kolter, J. Z. One-step diffusion distillation via deep equilibrium models. 2023.

Gilton, D., Ongie, G., and Willett, R. Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123-1133, 2021.
Gkillas, A., Ampeliotis, D., and Berberidis, K. Connections between deep equilibrium and sparse representation models with application to hyperspectral image denoising. IEEE Transactions on Image Processing, 32:1513-1528, 2023a.

Gkillas, A., Ampeliotis, D., and Berberidis, K. Deep equilibrium models meet federated learning. arXiv preprint arXiv:2305.18646, 2023b.

Kolter, Z., Duvenaud, D., and Johnson, M. Deep implicit layers: neural ODEs, deep equilibrium models, and beyond. URL http://implicit-layers-tutorial.org/ (visited on 2022-03-25), 2020.

Koyama, Y., Murata, N., Uhlich, S., Fabbro, G., Takahashi, S., and Mitsufuji, Y. Music source separation with deep equilibrium models. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 296-300. IEEE, 2022.

Krause, U. Positive Dynamical Systems in Discrete Time: Theory, Models, and Applications, volume 62. Walter de Gruyter GmbH & Co KG, 2015.

Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207-1216, Stanford, CA, 2000. Morgan Kaufmann.

Lemmens, B. and Nussbaum, R. Nonlinear Perron-Frobenius Theory, volume 189. Cambridge University Press, 2012.

Lins, B. A unified approach to nonlinear Perron-Frobenius theory. Linear Algebra and its Applications, 2023.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Miretti, L., Cavalcante, R. L., and Björnson, E. UL-DL duality for cell-free networks under per-AP power and information constraints. In ICC 2023 - IEEE International Conference on Communications, pp. 5017-5023. IEEE, 2023a.

Miretti, L., Cavalcante, R. L., and Stańczak, S. Fixed-point methods for long-term power control and beamforming design in large-scale MIMO. 2023b. Submitted; available as arXiv preprint arXiv:2312.02080.

Ortega, J. M. and Rheinboldt, W. C. Iterative Solution of Nonlinear Equations in Several Variables. SIAM, 2000.

Oshime, Y. Perron-Frobenius problem for weakly sublinear maps in a Euclidean positive orthant. Japan Journal of Industrial and Applied Mathematics, 9:313-350, 1992.

Piotrowski, T. and Cavalcante, R. L. G. The fixed point iteration of positive concave mappings converges geometrically if a fixed point exists: implications to wireless systems. IEEE Transactions on Signal Processing, 70:4697-4710, 2022.

Piotrowski, T. J., Cavalcante, R. L. G., and Gabor, M. Fixed points of nonnegative neural networks. Journal of Machine Learning Research, 25(139):1-40, 2024.

Pokle, A., Geng, Z., and Kolter, J. Z. Deep equilibrium approaches to diffusion models. Advances in Neural Information Processing Systems, 35:37975-37990, 2022.

Revay, M., Wang, R., and Manchester, I. R. Lipschitz bounded equilibrium networks. arXiv preprint arXiv:2010.01732, 2020.

Ryu, E. K. and Boyd, S. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3-43, 2016.

Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. Advances in Neural Information Processing Systems, 29, 2016.

Schubert, M. and Boche, H. Interference Calculus: A General Framework for Interference Management and Network Utility Optimization, volume 7. Springer Science & Business Media, 2011.

Shindoh, S. The structures of SINR regions for standard interference mappings. In 2019 58th Annual Conference of the Society of Instrument and Control Engineers of Japan (SICE), pp. 1280-1285. IEEE, 2019.
Shindoh, S. Some properties of SINR regions for standard interference mappings. SICE Journal of Control, Measurement, and System Integration, 13(3):50-56, 2020.

Stanczak, S., Wiczanowski, M., and Boche, H. Fundamentals of Resource Allocation in Wireless Networks: Theory and Algorithms, volume 3. Springer Science & Business Media, 2009.

Tsuchida, R. and Ong, C. S. Deep equilibrium models as estimators for continuous latent variables. In International Conference on Artificial Intelligence and Statistics, pp. 1646-1671. PMLR, 2023.

Wang, S., Teng, Y., and Wang, L. Deep equilibrium object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6296-6306, 2023.

Wei, C. and Kolter, J. Z. Certified robustness for deep equilibrium models via interval bound propagation. In International Conference on Learning Representations, 2021.

Winston, E. and Kolter, J. Z. Monotone operator equilibrium networks. Advances in Neural Information Processing Systems, 33:10718-10728, 2020.

Xie, X., Wang, Q., Ling, Z., Li, X., Liu, G., and Lin, Z. Optimization induced equilibrium networks: An explicit optimization perspective for understanding equilibrium models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(3):3604-3616, 2022.

Yamada, I., Yukawa, M., and Yamagishi, M. Minimizing the Moreau envelope of nonsmooth convex functions over the fixed point set of certain quasi-nonexpansive mappings. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pp. 345-390, 2011.

Yates, R. D. A framework for uplink power control in cellular radio systems. IEEE Journal on Selected Areas in Communications, 13(7):1341-1347, 1995.

You, L. and Yuan, D. A note on decoding order in user grouping and power optimization for multi-cell NOMA with load coupling. IEEE Transactions on Wireless Communications, 20(1):495-505, 2020.

Zhang, S., Zhu, L., and Gao, Y. An efficient deep equilibrium model for medical image segmentation. Computers in Biology and Medicine, 148:105831, 2022.

Zhao, Y., Zheng, S., and Yuan, X. Deep equilibrium models for snapshot compressive imaging. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 3642-3650, 2023.

Zheng, Z., Zhou, Y., and Zhou, H. Deep equilibrium non-autoregressive sequence learning. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 11763-11781, 2023.

Zou, Z., Liu, J., Wohlberg, B., and Kamilov, U. S. Deep equilibrium learning of explicit regularization functionals for imaging inverse problems. IEEE Open Journal of Signal Processing, 2023.

A. Known Results

Definition A.1. (Ortega & Rheinboldt, 2000, Chapter 9) Let $(x_k : k \in \mathbb{N})$ be a sequence in $\mathbb{R}^n$. Then $(x_k : k \in \mathbb{N})$ converges geometrically to $x^* \in \mathbb{R}^n$ with a rate $c \in [0, 1)$ and a constant $\gamma > 0$ if
$(\forall k \in \mathbb{N})\ \|x_{k+1} - x^*\| \le \gamma c^k$, (10)
where $\|\cdot\|$ is a given norm.

Definition A.2. (Cavalcante et al., 2019; Oshime, 1992) Let $f : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ be an SI mapping in the sense of Definition 3.2. The asymptotic mapping associated with $f$ is the mapping defined by
$f_\infty : \mathbb{R}^n_+ \to \mathbb{R}^n_+ : z \mapsto \lim_{p\to\infty} \frac{1}{p} f(pz)$. (11)
We recall that the above limit always exists and that the resulting asymptotic mapping $f_\infty$ is positively homogeneous; i.e., $(\forall \alpha > 0)(\forall z \in \mathbb{R}^n_+)\ f_\infty(\alpha z) = \alpha f_\infty(z)$.
Definition A.3. The (nonlinear) spectral radius of an SI mapping $f$ is defined as the largest eigenvalue of the corresponding asymptotic mapping (Cavalcante et al., 2019; Oshime, 1992), and it is given by
$\rho(f_\infty) := \max\{\lambda \in \mathbb{R}_+ \mid \exists z \in \mathbb{R}^n_+ \setminus \{0\}\ \text{s.t.}\ f_\infty(z) = \lambda z\} \in \mathbb{R}_+$. (12)

Proposition A.4. (Cavalcante et al., 2019) Let $f : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ be an SI mapping. Then $\mathrm{Fix}(f) \neq \emptyset$ if and only if $\rho(f_\infty) < 1$. Furthermore, if a fixed point exists, then it is positive and unique.

Corollary A.5. (Cavalcante et al., 2016) If $f : \mathbb{R}^n_+ \to \mathbb{R}^n_+$ is an NC mapping or $f : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ is a PC mapping in the sense of Definition 3.4, then $f$ is monotonic.

Proposition A.6. (Cavalcante et al., 2016; 2019) If $f : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ is a PC mapping, then $f$ is an SI mapping.

Proposition A.7. (Piotrowski & Cavalcante, 2022) Let $f : \mathbb{R}^n_+ \to \operatorname{int}(\mathbb{R}^n_+)$ be a PC mapping with a fixed point $x^* \in \operatorname{int}(\mathbb{R}^n_+)$. Then, for any $x_1 \in \mathbb{R}^n_+$, the fixed point iteration of $f$ converges geometrically to $x^*$ with a factor $c \in [0, 1)$ w.r.t. any metric induced by a norm in $\mathbb{R}^n$.

B. Architecture Details

As noted in Section 5, we consider three types of architectures of pcDEQ models. Based on the results of previous DEQ papers (Bai et al., 2019; 2020), using some regularization techniques can prevent overfitting and improve the final results. Therefore, similarly to standard DEQ models (Bai et al., 2019), we apply weight normalization (Salimans & Kingma, 2016) to each pcDEQ layer. Similarly to the official tutorial on implicit models (Kolter et al., 2020), we also use batch normalization before and after the DEQ layer, which improves the final results. In Figures 5, 6, and 7, the meaning of the blocks is as follows: Linear: linear layer; BN: batch normalization layer; pcDEQ-1: pcDEQ-1 layer in Definition 4.3; pcDEQ-2: pcDEQ-2 layer in Definition 4.3; Conv2D: 2D convolutional layer; Conv2D, s=2: 2D convolutional downsampling layer with stride equal to 2; BN + Softplus: batch normalization layer followed by a softplus activation function; BN + ReLU: batch normalization followed by a ReLU activation function; MaxPool + BN: max pooling layer followed by batch normalization; AvgPool: average pooling layer.

B.1. Architecture with a Single Linear pcDEQ Layer

Architectures of pcDEQ models with a single linear pcDEQ layer, for the two variants of pcDEQ layers in the sense of Definition 4.3, are presented in Figure 5. As in standard DEQ models, the first and last linear layers are explicit layers.

Figure 5. Architectures of pcDEQ models with a single linear pcDEQ layer. Subfigure (a) presents the architecture with a pcDEQ-1 layer and (b) with a pcDEQ-2 layer.

B.2. Architecture with a Single Convolutional pcDEQ Layer

Architectures of pcDEQ models with a single convolutional pcDEQ layer are similar to those with linear layers in the previous subsection. Figure 6 shows the architectures used in the experiments. The max pooling layers have a kernel of size $3 \times 3$ with padding p = 1 and stride s = 1. The average pooling layer has a kernel of size $8 \times 8$ with padding p = 0 and stride s = 8.

Figure 6. Architectures of pcDEQ models with a single convolutional pcDEQ layer. Subfigure (a) presents the architecture with a pcDEQ-1 layer and (b) with a pcDEQ-2 layer.
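A rough PyTorch analogue of the single-convolutional pcDEQ-1 architecture in Figure 6 (a) is sketched below; the channel count, image size, and class count are illustrative, weight normalization is omitted, and the convolution kernels are assumed to be re-projected onto the nonnegative orthant after every optimizer step, as described in Section 5.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCDEQConv(nn.Module):
    """Sketch of a convolutional pcDEQ-1 layer: z |-> ReLU6(W * z + x) with nonnegative kernels."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        with torch.no_grad():
            self.conv.weight.abs_()              # start from nonnegative kernels

    def forward(self, x, max_iter=30, tol=1e-4):
        z = torch.zeros_like(x)
        for _ in range(max_iter):
            z_next = F.relu6(self.conv(z) + x)
            if torch.norm(z_next - z) / (torch.norm(z_next) + 1e-12) < tol:
                return z_next
            z = z_next
        return z

class SingleConvPCDEQ(nn.Module):
    """Rough analogue of the single-convolutional pcDEQ-1 network (sizes illustrative)."""

    def __init__(self, in_channels=3, channels=125, num_classes=10, image_size=32):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, channels, kernel_size=3, padding=1)
        self.pre = nn.Sequential(nn.BatchNorm2d(channels), nn.Softplus())   # positive input x_{++}
        self.deq = PCDEQConv(channels)
        self.post = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1), nn.BatchNorm2d(channels))
        self.pool = nn.AvgPool2d(kernel_size=8)                             # stride defaults to 8
        self.head = nn.Linear(channels * (image_size // 8) ** 2, num_classes)

    def forward(self, x):
        x = self.pre(self.stem(x))
        z = self.deq(x)
        z = self.pool(self.post(z))
        return self.head(z.flatten(1))
```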
B.3. Architecture with Multiple Convolutional pcDEQ Layers

In practice, larger DEQ models with multiple convolutional layers are implemented as a single-layer fusion of multiple scales (Bai et al., 2020), between which upsampling and downsampling are performed. The same idea was applied to monDEQ models in the multi-tier architecture. This is an appealing engineering idea, but it is complicated in practice. Moreover, computing the fusion of fixed points (one for each scale) simultaneously is difficult to analyze and can involve an increased computational cost. In this work, we take another approach that combines implicit DEQ layers with explicit downsampling layers between them. A similar approach was used in (Xie et al., 2022), which investigated the idea of having multiple implicit layers instead of one. It should be noted that the explicit downsampling layers are unconstrained, as are the first and last layers. The proposed architecture is shown in Figure 7. The max pooling layers have a kernel of size $3 \times 3$ with padding p = 1 and stride s = 1. The average pooling layer has a kernel of size $4 \times 4$ with padding p = 0 and stride s = 4.

Figure 7. Architecture of pcDEQ models with three convolutional pcDEQ layers. Subfigure (a) presents the architecture with pcDEQ-1 layers and (b) with pcDEQ-2 layers.

C. Experimental Details and Hyperparameters

In our experiments, we used three commonly known computer vision datasets: MNIST, SVHN, and CIFAR-10. The MNIST dataset consists of 70,000 grayscale handwritten digit images. The SVHN dataset consists of 99,289 RGB digit images of house numbers. CIFAR-10 consists of 60,000 RGB images from 10 classes. For the MNIST dataset, the images have dimensions of $28 \times 28$ pixels, and for SVHN and CIFAR-10, $32 \times 32$ pixels. The statistics of the datasets are shown in Table 4.

Table 4. Dataset statistics.

| Dataset | #Train examples | #Test examples |
| --- | --- | --- |
| MNIST | 60,000 | 10,000 |
| SVHN | 73,257 | 26,032 |
| CIFAR-10 | 50,000 | 10,000 |

As mentioned in the paper, to compute the fixed point in pcDEQ layers, the standard fixed point iteration was used. We deliberately do not use more sophisticated iterative methods, such as Anderson acceleration or Broyden's method, because for such methods the guarantees of convergence to the fixed point of PC mappings have not been established in the literature. For all networks, the AdamW optimizer (Loshchilov & Hutter, 2017) was used with a batch size of 64. All other hyperparameters for each dataset and architecture are shown in Tables 5, 6, and 7. The experiments with pcDEQs were run five times.

Table 5. MNIST hyperparameters.

| Method | Number of channels | Epochs | LR | LR decay steps | LR decay factor | WD |
| --- | --- | --- | --- | --- | --- | --- |
| pcDEQ-1-L-ReLU6 | 80 | 40 | 0.001 | 30 | 0.1 | 0.02 |
| pcDEQ-1-L-Tanh | 80 | 40 | 0.001 | 30 | 0.1 | 0.02 |
| pcDEQ-1-L-Softsign | 80 | 40 | 0.001 | 30 | 0.1 | 0.02 |
| pcDEQ-2-L-Sigmoid | 80 | 40 | 0.001 | 30 | 0.1 | 0.02 |
| pcDEQ-1-SC-ReLU6 | 82 | 40 | 0.0007 | 30 | 0.1 | 0.02 |
| pcDEQ-1-SC-Tanh | 82 | 40 | 0.0007 | 30 | 0.1 | 0.02 |
| pcDEQ-1-SC-Softsign | 82 | 40 | 0.0007 | 30 | 0.1 | 0.02 |
| pcDEQ-2-SC-Sigmoid | 82 | 40 | 0.0002 | 30 | 0.1 | 0.02 |
| pcDEQ-1-MC-ReLU6 | 12,24,48 | 40 | 0.0005 | 30 | 0.1 | 0.015 |
| pcDEQ-1-MC-Tanh | 12,24,48 | 40 | 0.0005 | 30 | 0.1 | 0.015 |
| pcDEQ-1-MC-Softsign | 12,24,48 | 40 | 0.0005 | 30 | 0.1 | 0.015 |
| pcDEQ-2-MC-Sigmoid | 12,24,48 | 40 | 0.0002 | 30 | 0.1 | 0.015 |

Table 6. SVHN hyperparameters.

| Method | Number of channels | Epochs | LR | LR decay steps | LR decay factor | WD |
| --- | --- | --- | --- | --- | --- | --- |
| pcDEQ-1-SC-ReLU6 | 125 | 80 | 0.0007 | 70 | 0.1 | 0.02 |
| pcDEQ-1-SC-Tanh | 125 | 80 | 0.0007 | 70 | 0.1 | 0.02 |
| pcDEQ-1-SC-Softsign | 125 | 80 | 0.0007 | 70 | 0.1 | 0.02 |
| pcDEQ-2-SC-Sigmoid | 125 | 80 | 0.0005 | 70 | 0.1 | 0.02 |
| pcDEQ-1-MC-ReLU6 | 20,50,80 | 50 | 0.0005 | 40 | 0.1 | 0.015 |
| pcDEQ-1-MC-Tanh | 20,50,80 | 50 | 0.0005 | 40 | 0.1 | 0.015 |
| pcDEQ-1-MC-Softsign | 20,50,80 | 50 | 0.0005 | 40 | 0.1 | 0.015 |
| pcDEQ-2-MC-Sigmoid | 20,50,80 | 50 | 0.0002 | 40 | 0.1 | 0.015 |

Table 7. CIFAR-10 hyperparameters.

| Method | Number of channels | Epochs | LR | LR decay steps | LR decay factor | WD |
| --- | --- | --- | --- | --- | --- | --- |
| pcDEQ-1-SC-ReLU6 | 125 | 80 | 0.0005 | 70 | 0.1 | 0.02 |
| pcDEQ-1-SC-Tanh | 125 | 80 | 0.0005 | 70 | 0.1 | 0.02 |
| pcDEQ-1-SC-Softsign | 125 | 80 | 0.0005 | 70 | 0.1 | 0.02 |
| pcDEQ-2-SC-Sigmoid | 125 | 80 | 0.0002 | 70 | 0.1 | 0.02 |
| pcDEQ-1-MC-ReLU6 | 20,50,80 | 50 | 0.0007 | 40 | 0.1 | 0.015 |
| pcDEQ-1-MC-Tanh | 20,50,80 | 50 | 0.0007 | 40 | 0.1 | 0.015 |
| pcDEQ-1-MC-Softsign | 20,50,80 | 50 | 0.0007 | 40 | 0.1 | 0.015 |
| pcDEQ-2-MC-Sigmoid | 20,50,80 | 50 | 0.0002 | 40 | 0.1 | 0.015 |
| pcDEQ-1-MC-ReLU6* | 100,120,140 | 120 | 0.0007 | 100 | 0.1 | 0.02 |
| pcDEQ-1-MC-Tanh* | 100,120,140 | 120 | 0.0007 | 100 | 0.1 | 0.02 |
| pcDEQ-1-MC-Softsign* | 100,120,140 | 120 | 0.0007 | 100 | 0.1 | 0.02 |
| pcDEQ-2-MC-Sigmoid* | 100,120,140 | 120 | 0.0002 | 100 | 0.1 | 0.02 |
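As an illustration of how the entries of Tables 5-7 translate into code, the snippet below configures AdamW with a stepwise learning-rate decay for one representative configuration (pcDEQ-1-SC on CIFAR-10: LR 0.0005, decay by a factor of 0.1 after 70 epochs, weight decay 0.02); the model argument stands for any of the pcDEQ networks, and the helper name is illustrative.

```python
import torch

def build_optimizer(model, lr=5e-4, weight_decay=0.02, decay_step=70, decay_factor=0.1):
    # AdamW with the learning rate, weight decay, and step decay schedule from the tables above
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=decay_step, gamma=decay_factor)
    return optimizer, scheduler
```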
D. Additional Figures

D.1. MNIST

Figure 8. Test accuracies and number of fixed point iterations in the forward and backward passes during training for pcDEQ models with a single linear pcDEQ layer and with three convolutional pcDEQ layers (average of the forward and backward passes of the three pcDEQ layers) on the MNIST dataset over five experiment runs.

D.2. SVHN

Figure 9. Test accuracies and number of fixed point iterations in the forward and backward passes (average of the three pcDEQ layers) during training for pcDEQ models with multiple convolutional pcDEQ layers on the SVHN dataset over five experiment runs.

D.3. CIFAR-10

Figure 10. Test accuracies and number of fixed point iterations in the forward and backward passes (average of the three pcDEQ layers) during training for pcDEQ models with three convolutional pcDEQ layers, with and without data augmentation, on the CIFAR-10 dataset over five experiment runs.