# Subhomogeneous Deep Equilibrium Models

Pietro Sittoni 1, Francesco Tudisco 1 2

1 School of Mathematics, Gran Sasso Science Institute, L'Aquila, Italy. 2 School of Mathematics and Maxwell Institute, University of Edinburgh, Edinburgh, UK. Correspondence to: Pietro Sittoni, Francesco Tudisco. Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

Implicit-depth neural networks have grown into powerful alternatives to traditional networks in various applications in recent years. However, these models often lack guarantees of existence and uniqueness, raising stability, performance, and reproducibility issues. In this paper, we present a new analysis of the existence and uniqueness of fixed points for implicit-depth neural networks based on the concept of subhomogeneous operators and nonlinear Perron-Frobenius theory. Compared to previous similar analyses, our theory allows for weaker assumptions on the parameter matrices, thus yielding a more flexible framework for well-defined implicit networks. We illustrate the performance of the resulting subhomogeneous networks on feedforward, convolutional, and graph neural network examples.

1. Introduction

Implicit-depth Neural Networks (NNs) have emerged as powerful tools in deep learning. Rather than using a sequence of nested layers, these models define feature embeddings as the solution to specific nonlinear equations. Two popular examples of implicit-depth networks are Neural ODEs (Chen et al., 2018), whose output is calculated by solving an ordinary differential equation, and Deep Equilibrium (DEQ) Models (Bai et al., 2019), which define the output in terms of a fixed-point equation. These approaches have a number of advantages: (a) they have been shown to match or even exceed the performance of traditional NNs on several tasks, including time series and sequence modeling (Bai et al., 2019; Rusch & Mishra, 2021), and (b) memory-wise, they are more efficient than traditional NNs, as backpropagation is done analytically and does not require storage of internal weights, allowing deep NN architectures to be handled more efficiently.

DEQ models can be viewed as infinite-depth feed-forward neural networks with weight tying, i.e. where the same transformation is used on each layer (Dabre & Fujita, 2019; Dehghani et al., 2019). Indeed, in this type of model, the evaluation of the network is executed by solving a fixed-point equation z = fθ(z; x), which can be thought of as the limit, as the number of layers n grows, of an n-deep network z^(n) = fθ(z^(n-1); x). As the network is now implicitly defined as the solution of the equation fθ(z; x) - z = 0, one can use the Implicit Function Theorem to compute and propagate the gradients, thus reducing the memory cost with respect to standard backpropagation (Bai et al., 2019; 2020).

Despite their potential advantages, not all DEQ models are well-defined. In fact, one of the main open questions for DEQ architectures is whether any fixed point actually exists and whether it is unique. Lack of uniqueness, in particular, can be a problem: for a given input x and fixed pre-trained weights θ, the resulting fixed-point embedding z may change, raising stability, performance, and reproducibility issues. These potential drawbacks are often ignored or addressed only via empirical arguments based on experimental evidence that deep networks work well in practice.
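Before turning to the question of uniqueness, the short PyTorch sketch below makes the fixed-point view of a DEQ forward pass concrete by iterating a weight-tied, input-injected layer until the update becomes negligible. The specific map fθ(z; x) = tanh(Wz + Ux + b) and all sizes are hypothetical choices used only for this illustration, not the architectures studied in the paper.

```python
import torch

# Minimal sketch of a weight-tied, input-injected DEQ forward pass:
# iterate z_{k+1} = f_theta(z_k; x) until the update is negligible.
# The map f_theta(z; x) = tanh(W z + U x + b) and the sizes below are
# hypothetical choices used only for illustration.

torch.manual_seed(0)
d, n = 20, 30                              # input / hidden dimension
W = 0.3 * torch.randn(n, n) / n ** 0.5     # small enough that tanh(W .) is contractive
U = torch.randn(n, d) / d ** 0.5
b = torch.zeros(n)
x = torch.randn(d)

def f_theta(z, x):
    return torch.tanh(W @ z + U @ x + b)

z = torch.zeros(n)                         # z^(0)
for k in range(100):
    z_next = f_theta(z, x)
    if torch.norm(z_next - z) <= 1e-6 * torch.norm(z_next):
        z = z_next
        break
    z = z_next
print(f"equilibrium reached after {k + 1} iterations, "
      f"residual {torch.norm(f_theta(z, x) - z).item():.1e}")
```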
The most prominent line of analysis of uniqueness for deep equilibrium fixed points is based on monotone operator theory (Winston & Kolter, 2020; Ryu & Boyd, 2016). While elegant and efficient, monotone operator DEQs (Mon DEQs) require special parametrizations of the layer weights to guarantee the DEQ model fθ is operator monotone. In this work, we present a new analysis of existence and uniqueness of DEQ fixed points based on positive, subhomogeneous operators. Using the Thomson projective metric and techniques from nonlinear Perron Frobenius theory (Lemmens & Nussbaum, 2012; Gautier et al., 2019), we provide a new theorem showing that a broad class of operators fθ admits unique fixed points. In particular, we show that our uniqueness theorem holds for several example architectures of the form fθ(z; x) = σ(Wz) + MLP(x), provided the activation function is subhomogeneous, which we show is the case for a variety of commonly used activations. This existence and uniqueness result allow us to design stable DEQ models under much weaker assumptions than Subhomogeneous Deep Equilibrium Models available literature, avoiding any restriction on the learnable weight and using a large class of subhomogeneous activation functions. Our theoretical findings are complemented by several experimental evaluations where we compare simple fullyconnected and convolutional DEQ architectures based on monotone operators with the newly introduced subhomogeneous deep equilibrium model (Sub DEQ) on benchmark image classification tasks. 2. Related work The classical approach to layer design in deep learning is based on explicitly defined function compositions and the corresponding computational graph. In contrast, implicitdepth approaches do not explicitly define the computational graph and define the model implicitly as the solution of a specific equation. The computational graph needs then to be extracted from the implicit formulation. Neural ODE is a popular example where the model is defined as the solution of an ordinary differential equation and backpropagation can be implemented via the adjoint method (Chen et al., 2018). Another example is the DEQ, where the model is defined as the fixed point of a nonlinear function and backpropagation can be done using the implicit function theorem (Bai et al., 2019). However, DEQ may not be well-posed, as fixed points of nonlinear mappings may not exist or may not be unique. Several works in the literature propose variations of DEQ architecture with criteria for certified well-posedness using different methods, including mon DEQ (Winston & Kolter, 2020) based on monotone operator theory, applying contraction theory (Jafarpour et al., 2021), using an over-parametrized DEQ with a condition on the initial equilibrium point, or exploiting linear and nonlinear Perron-Frobenius theory on graph neural networks and autoencoders, (Gu et al., 2020; El Ghaoui et al., 2021; Piotrowski et al., 2021). Just like standard explicit networks, DEQ models are vulnerable to adversarial input perturbations (Yang et al., 2022) and well-posed parameterizations may improve the robustness of the model. For example, using semialgebraic representation of mon DEQ, Chen et al. (2021) showed that mon DEQ are more robust to l2 perturbation. Similarly, Wei & Kolter (2022) showed that unlike generic DEQs, mon DEQ can achieve comparable l certified robustness to similarly-sized fully explicit networks, via interval bound propagation. 
Finally, we highlight that models based on DEQ architectures have been successfully employed in a wide range of specific applications. The first competitive performance was shown in the sequence modeling domain (Bai et al., 2019; 2020). Soon after that, efficient DEQ models have been designed for inverse problems in imaging (Gilton et al., 2021; Zhao et al., 2022), image denoising (Chen et al., 2023), optical flow estimation (Bai et al., 2022), landmark detection (Micaelli et al., 2023), semantic segmentation (Bai et al., 2020), and generative modeling using a diffusion-based approach (Pokle et al., 2022). 3. Subhomogeneous operators The fundamental brick of our analysis is the notion of subhomogeneous operator. Recall that an operator F : Rn Rn is (positively) µ-homogeneous, for some µ > 0 if F(αz) = αµF(z) for all positive coefficients α > 0 and all entrywise positive vectors z > 0. When F is differentiable, Euler s theorem for homogeneous functions provides an elegant equivalent characterization of homogeneous operators as the set of F such that the identity F (z)z = µF(z) holds for all z, where F (z) denotes the Jacobian of F in z. The proposed notion of subhomogeneous operators generalizes the concept of homogeneous mappings, starting from this second characterization. Definition 3.1 (Subhomogeneous operator). Let F : Rn Rm be a Lipschitz mapping. For a coefficient µ > 0 and Ω dom+(F) := {z Rn | F(z) 0}, we say that F is µ-subhomogeneus in Ω, briefly F subhomµ(Ω), if |Mz| µ F(z), (1) for all z Ωand all M F(z), where F(z) denotes Clarke s generalized Jacobian of F at the point z, and where all the inequalities, as well as the absolute value, are meant entrywise. Similarly, we say that F is strongly µ-subhomogeneus in Ω, briefly F s-subhomµ(Ω), if |M| |z| µ F(z), (2) for all z Ωand all M F(z). Subhomogeneity generalizes both homogeneity and strongsubhomogeneity. In fact, all µ-homogeneous differentiable operators are µ-subhomogeneous due to Euler s theorem and, as |Mz| |M||z| for all z, we immediately see that every strongly-subhomogeneus operator is subhomogeneus. However, homogeneity does not necessarily imply strongsubhomogeneity. This is shown for instance in Example 3.2 below. On the other hand, we will see that strong subhomogeneity is preserved under composition while subhomogeneity is not, and this additional property will be useful to establish uniqueness results for brother families of deep equilibrium architectures. Example 3.2. Let F : R2 {[0, 0]} R2, be defined as F(z) = F(x, y) = [ x2y x2+y2 , 0]. Clearly F is 1-homogeneus, with Jacobian given by (x2+y2)2 x2(x2 y2) (x2+y2)2 0 0 Subhomogeneous Deep Equilibrium Models Thus, |F (z)||z| calculated at z = [1, 2] is equal to [0.56, 0], while F(z) = [0.25, 0] in z = [1, 2]. This shows that F is not strongly 1-subhomogeneus in Rn ++. Remark 3.3. Note that subhomogeneity and strong subhomogeneity coincide when the mapping has a positive valued subgradient. In fact, if F subhomµ(Rn +) and F(z) Rn + for all z Rn +, then |M z| = M z = |M| |z| for all z Ωand all M F(z). Thus, F s-subhomµ(Rn +). In order to better understand the connection between homogeneity and the two proposed notions of subhomogeneity, we provide below an analogous of Euler s theorem for subhomogeneous operators, directly connecting the notion of subhomogeneous operator with the usual notion of homogeneity F(αz) = αµF(z), see also (Lemmens & Nussbaum, 2012). The proof is deferred to Appendix A. Proposition 3.4. Let F : Rn Rn be differentiable and Lipschitz. 
If F s-subhomµ(Rn ++). Then, F subhomµ(Rn ++) and for all λ 1 we have F(λz) λµF(z). Assume moreover that F (z) > 0, for all z > 0, i.e. the Jacobian F (z) is an entry-wise strictly positive matrix. Then, for µ > 0, it holds F subhomµ(Rn ++) = s-subhomµ(Rn ++) if and only if for all λ 1 we have F(λz) λµF(z). We provide below our main result showing uniqueness of fixed points for subhomogeneous operators, provided the homogeneity coefficient is small enough. Then, in Section 4, we will show that standard feed-forward linear layers of the form σ(Wx) with a variety of broadly used activation functions σ are indeed subhomogeneous, sometimes up to a minor modification. 3.1. Main result: Existence and uniqueness of fixed points In this section, we show that µ-subhomogeneus operators with small enough µ admit a unique fixed point. All proofs are moved to Appendix A. The proof of the existence and uniqueness of the fixed point is based on the Banach fixed-point theorem and the following fundamental completeness result Theorem 3.5 (See e.g. (Lemmens & Nussbaum, 2012)). Let Rn ++ := {z Rn : z > 0, entrywise} denote the interior of the nonnegative orthant. Consider the Thomson distance δ(x, y) = ln(x) ln(y) . Then, the pair (Rn ++, δ) is a complete metric space. Based on the theorem above, if we have an operator F that maps Rn ++ to itself and that is contractive with respect to δ, then F must have a unique fixed point in Rn ++. Our main theorem below shows that this is always the case when F is defined by a positive subhomogeneous operator. We start by proving that if F mapping Rn ++ to Rm ++ is subhomogeneous, then it is Lispchitz continuous with respect to the Thompson distance, with a Lispchitz constant that depends only on its subhomogeneity degree. Theorem 3.6. Let F : Rn Rm be a Lipschitz operator. Assume that F is positive, i.e. F(z) > 0 for all z > 0, and that F subhomµ(Rn ++) for some µ > 0. Then, δ (F(x), F(y)) µ δ(x, y), for all x, y Rn ++. The following uniqueness result is now a direct consequence of Theorem 3.5 and Theorem 3.6. Theorem 3.7. Let F : Rn Rn be such that F subhomµ(Rn ++). If 0 < µ < 1, then there exists a unique fixed point F(z ) = z Rn ++. Moreover, the sequence zk+1 = F(zk) converges to z with linear convergence rate, namely zk z Cµk , for any z0 Rn ++. In some occasions, for example when F is matrix-valued as in the context of graph neural networks, one may need to normalize rows or columns of the hidden embedding to e.g. simulate a random walk. See also Section 6.3. In order to show the uniqueness of fixed points for neural networks with this type of normalization layer, we consider a general scaling function φ : Rn R with the following properties: (1-homogeneous) φ(αz) = αφ(z) for all coefficients α > 0; (positive) φ(z) > 0 for each z > 0; (order-preserving) φ(z) > φ(x) for each z > x > 0. As (Rn ++, δ) is a complete metric space, then also (Rn ++/φ, δ) must be complete, where Rn ++/φ := {z Rn ++ : φ(z) = 1}. Our next result shows how Theorem 3.6 transfer to this setting Theorem 3.8. Let F : Rn Rm be as in Theorem 3.6. For a positive, 1-homogeneous, order-preserving functional φ : Rm R, define the φ-normalization layer map normφ : Rn ++ Rm ++/φ, normφ(z) = z/φ(z) and let G(z) = normφ(F(z)) . Then, δ (G(x), G(y)) 2µ δ(x, y), Subhomogeneous Deep Equilibrium Models for all x, y Rn ++. 
Moreover, if F is differentiable and its Jacobian matrix F (z) is entry-wise positive for all z Rn ++, then δ (G(x), G(y)) µ δ(x, y), for all x, y Rn ++, The following equivalent of Theorem 3.7 is now a relatively direct consequence. We formulate it explicitly for normalization layers acting column-wise, but we underline that its proof (see Appendix A) can be easily adapted to different normalization patterns. Theorem 3.9. Let G : Rn d Rn d be defined as G(x) = [normφ1(F1(x)), . . . , normφd(Fd(x))], where Fi : Rn d Rn are µ-subhomogeneous and φi : Rn R are 1-homogeneous, positive, order-preserving functions. If 0 < µ < 1/2, then there exists a unique fixed point G(z ) = z . Moreover, if all the Fi are differentiable and their Jacobian matrices are entry-wise positive for all z > 0, then a unique fixed point exists provided 0 < µ < 1. In both cases, the sequence zk+1 = G(zk) converges to z with linear convergence rate, namely for any z0 Rn ++. 4. Subhomogeneous deep equilibrium models Consider now the weight-tied, input-injected neural network z(k+1) = σ1( σ2(Wz(k)) + fθ(x) ) (3) in which x Rd denotes the input, z(k) Rn denotes the hidden unit at layer k, σ1, σ2 : R R denote activation functions applied entrywise, W Rn n are the hidden unit weights, fθ : Rd Rn is an input-injection embedding, e.g. defined by an MLP. We use here a fully connected layer for simplicity, but everything transfers unchanged to the convolutional case (i.e. if Wz is replaced by W z, with W being the convolutional kernel). While in practice using both nonlinearities σ1 and σ2 may be redundant, we provide here a theoretical investigation of the model (3) in its generality and we will then study specific architectures where either σ1 = Id or σ2 = Id. In the following, we show that a unique fixed point for Equation (3) exists, when the number of layers k grows to infinity, provided the activation functions are subhomogeneous. Consider the following DEQ architecture z = σ1(σ2(Wz) + fθ(x)) (4) or the corresponding normalized version z = normφ σ1(σ2(Wz) + fθ(x)) (5) where φ is any positive, 1-homogeneous, order-preserving normalizing function (or multiple functions if the normalization layer is applied locally as in Theorem 3.9). Note that φ can be, for example, any standard p-norm. This type of layer normalization step is also used to e.g. reduce the network sensitivity to small perturbations (Zhang et al., 2022; Farnia et al., 2018) and could accelerate training (Ba et al., 2016; Ioffe & Szegedy, 2015). To study (4) and (5) using Theorem 3.7, we now notice that it is enough to study the subhomogeneity of the activation functions σ1 and σ2. In fact, we can think of F(z) := σ1(σ2(Wz) + fθ(x)) as the composition of an activation function σ1 applied entry-wise, with a translation T(u) = u + fθ(x), another activation function σ2, and a linear map L(z) = Wz (or L(z) = W z for convolutional layers), F = σ1 T σ2 L . (6) Linear mappings are particular examples of homogeneous operators while translations are subhomogeneous. In the next lemma, we observe that the subhomogeneity of F coincides with the subhomogeneity of σ1 and σ2, provided the input injection fθ(x) is positive (entrywise). This requirement is not excessively restrictive, as it can be satisfied by using a positive nonlinear activation function, such as Re LU, Soft Plus, or by shifting bounded activations such as tanh. All proofs are moved to Appendix A. Lemma 4.1. 
Let Ωbe a subset of Rn, H be h-homogeneous, P subhomµ(H(Ω)), Ty denote the translation by y, Ty(z) = z + y, and let Q s-subhomλ(Ty(P(H(Ω)))). Then, if the following composition rules hold: P H subhomhµ(Ω) If y 0, then Q Ty P H subhomhµλ(Ω) Thus, studying the subhomogeneity (resp. strong subhomogeneity) of F in (6) boils down to the analysis of the subhomogeneity (resp. strong subhomogeneity) of σ1 and σ2. In particular, note that subhomogeneity is enough for σ2 while strong homogeneity is required on σ1 in order for the whole model to be subhomogeneous. However, we notice that in the typical case of activation functions acting entrywise this additional requirement is redundant. In fact, note that these subhomogeneity properties are inherited from the univariate function defining the activation layer when they act entrywise. Precisely, if σ : R R is a Lipschitz function with σ subhomµ(Ω), for Ω dom+(σ), then we automatically have that the entrywise action of σ on Rn is subhomµ(Ωn), with Ωn = Ω Ω, as the elements of the Clarke s generalized Jacobian of σ in z Rn are simply diagonal matrices whose j-th diagonal element is an Subhomogeneous Deep Equilibrium Models element of the Clarke s generalized Jacobian of σ evaluated on the j-th component of z. With the following remark, we can notice that the definitions of subhomogeneous and strongly subhomogeneous coincide in the univariate case. Remark 4.2. Let σ: R R be σ subhomµ(Ω) for some µ > 0 and some Ω R. Notice that, |σ (z) z| = |σ (z)||z| for each z Ω, this implies that σ s-subhomµ(Ω). In the next Section 5 we will show a variety of examples of commonly used activation functions σ that are subhomogeneous. For all such choices, the DEQ model (4) is well-defined as the existence of a unique equilibrium point z is guaranteed. 5. Subhomogeneous activation functions with corresponding subhomogeneity coefficient We now propose several examples of subhomogeneous activation functions, they are well-known activation functions commonly used in deep learning. All the proofs are moved to Appendix A. As a first example, we show that the sigmoid function is subhomogeneous, on the positive real axis, with µ = 1. Proposition 5.1. Let σ : R R be defined as σ(z) = sigmoid(z) := ez Then σ subhom1(R+). Similarly to the sigmoid, also the Soft Plus is subhomogeneous on R+ with µ = 1. Proposition 5.2. Let σ: R R be defined as σ(z) = softplus(z) := 1 β ln(1 + eβz), where β > 0. Then σ subhom1(R+). Also, the hyperbolic tangent is subhomogeneous in R+. Proposition 5.3. Let σ: R R defined as σ(z) = tanh(z) := ez e z Then, σ subhom1(R+). The examples considered above are activations that are subhomogeneous only on the nonnegative/positive orthant. We provide below an example of subhomogeneous activation function that is subhomogeneous globally. This is obtained as a positive shift of the hyperbolic tangent. In fact, it is not difficult to notice that if σ(z) = tanh(z)+1+ϵ with ϵ > 0, then µ(ϵ) = maxz |z|σ (z)σ(z) 1 is a decreasing function of ϵ, i.e. µ(ϵ1) < µ(ϵ2) for ϵ1 > ϵ2 > 0. Moreover, it holds µ(ϵ) < 1 for all ϵ > 0.199 > max z |z|sech2(z) tanh(z) 1 and µ(ϵ) < 1/2 for all ϵ > 0.602 > max z 2|z|sech2(z) tanh(z) 1 . These computations directly lead to Proposition 5.4. Let σ: R R be defined as σ(z) = tanh(z) + α . If α > 1, then σ subhomµ(R). In particular, if α > 1.2 then µ < 1 and if α > 1.602 then µ < 1/2. Finally, similar to the hyperbolic tangent, we notice below that the non-differentiable piece-wise linear hardtanh activation function is 1-subhomogeneus on R. 
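Before moving to the remaining piecewise-linear examples below, a quick numerical sanity check of these coefficients is possible: for a differentiable univariate activation σ, the smallest admissible subhomogeneity coefficient on Ω is sup over z in Ω of |z σ'(z)| / σ(z), which for entrywise activations is also the strong coefficient (Remark 4.2). The snippet below estimates this supremum on a finite grid; it is a rough illustrative check with ad hoc grids and ranges, not part of the paper's released code.

```python
import torch

# Grid-based estimate of the smallest subhomogeneity coefficient of a
# univariate activation on a domain Omega,
#     mu(sigma) = sup_{z in Omega} |z * sigma'(z)| / sigma(z).
# Rough numerical check only; grids and ranges are ad hoc.

def mu_estimate(sigma, grid):
    z = grid.clone().requires_grad_(True)
    s = sigma(z)
    (ds,) = torch.autograd.grad(s.sum(), z)        # sigma'(z) on the grid
    return (z.detach().abs() * ds.abs() / s.detach()).max().item()

pos = torch.linspace(1e-3, 20.0, 20001)            # Omega = R_+
real = torch.linspace(-20.0, 20.0, 40001)          # Omega = R

print("sigmoid      on R+:", mu_estimate(torch.sigmoid, pos))                     # <= 1 (Prop. 5.1)
print("softplus     on R+:", mu_estimate(torch.nn.functional.softplus, pos))      # <= 1 (Prop. 5.2)
print("tanh         on R+:", mu_estimate(torch.tanh, pos))                        # <= 1 (Prop. 5.3)
print("tanh + 1.2   on R :", mu_estimate(lambda z: torch.tanh(z) + 1.2, real))    # < 1   (Prop. 5.4)
print("tanh + 1.603 on R :", mu_estimate(lambda z: torch.tanh(z) + 1.603, real))  # < 1/2 (Prop. 5.4)
```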
Proposition 5.5. Let σ : R: R be defined as σ(z) = hardtanh(z) := α1 if z < α1 α2 if z > α2 z otherwise with 0 < α1 < α2 < . Then, σ subhom1(R). Other two examples of 1-subhomogeneus operator over the whole real axis are the Re LU and the Leaky Re LU with a slope α 0, both are positive 1-homogenous operator, thus they are also 1-subhomogeneus in R. All activation functions considered above are univariate functions acting entrywise. Thus they are both subhomogeneous and strongly subhomogeneous as observed in Remark 4.2. In the final example below we consider an example of a subhomogeneous function that is also strongly subhomogeneous but is not pointwise. Proposition 5.6. Let σ : Rn R be defined as σ(z) = Approxmax(z) = ln X i ezi max i zi Then, σ subhom1(Rn +) and σ s-subhom1(Rn +). The power scaling trick Theorem 3.7 ensures a unique fixed point for any subhomogeneous operator F exists, provided the subhomogeneity degree is small enough. As we have shown with the examples above, subhomogeneity is not a very stringent requirement, and many operators commonly used in deep learning are actually subhomogeneous without any modification. However, it is often the case that µ 1. To reduce the degree of tanh, we applied a positive shift α. However, applying Subhomogeneous Deep Equilibrium Models Name Sigmoid Soft Plus Tanh Tanh + 1.2 Tanh + 1.603 Hard Tanh Leaky Re LU Approxmax µ 1 1 1 0.99 0.499 1 1 1 Ω Rn + Rn + Rn + Rn Rn Rn Rn Rn + Differentiable Table 1. Examples of subhomogeneous activation functions 0 5 10 15 20 25 30 35 Iteration Sub DEQ (norm 1) Sub DEQ (norm 10) Sub DEQ (norm inf) Sub DEQ Mon DEQ (relu) Mon DEQ (tanh) 0 2 4 6 8 10 12 Iteration Sub DEQ (norm 1) Sub DEQ (norm 10) Sub DEQ (norm inf) Sub DEQ Mon DEQ (relu) Mon DEQ (tanh) Figure 1. Iteration required by the fixed point method for Sub DEQ vs Peaceman-Rachford method for Mon DEQ. Left: linear layer; Right: convolutional layer. a scalar shift might not work all the time. When the subhomogeneity constant is not small enough, a simple power scaling trick can be implemented for any subhomogeneous F. In fact, it follows directly from Lemma 4.1 that if F is µsubhomogeneous, then F α defined as the entrywise power F α(x) = F(x)α, is αµ-subhomogeneous, since the map x 7 xα is strongly α-subhomogeneous. Thus, to ensure uniqueness for (4) when σi are 1-subhomogeneous, we can mildly perturb σi into σi(x) = σi(x)1 ε, for any ε > 0 arbitrary small. With this perturbed activation the DEQ in (4) is guaranteed to have a unique fixed point. 6. Experiments All the models are implemented in Py Torch and the code is available at https://github.com/COMPi LELab/Sub DEQ. We illustrate the performance of DEQ networks as in equations (4) and (5) on benchmark image datasets, as compared to alternative implicit neural networks, precisely Monotone operator-based DEQ (Mon DEQ) (Winston & Kolter, 2020) and neural ordinary differential equations (n ODE) (Chen et al., 2018), as well as standard explicit baseline architectures. As discussed in Section 4, we will use only one σi different from the identity function. To this end, we first consider the case σ1 = Id, that is z = σ(Wz) + fθ(x) and z = normφ( σ(Wz) + fθ(x) ). Thanks to Theorem 3.7, Theorem 3.9, and Lemma 4.1 we can choose a suitable activation function σ to guarantee the subhomogeneity of the architecture and thus the existence Model normφ σ subhomµ(Ω) Conditions µ Ω on W no µ < 1 Ω= Rn None yes µ < 1/2 Ω= Rn None µ < 1 Ω= Rn ++ W 0 σ(W z + y) no µ < 1 Ω= Rn ++ W 0 yes µ < 1 Ω= Rn ++ W 0 Table 2. 
Summary of requirements on weights and activations to guarantee existence and uniqueness of the DEQ fixed point, as well as the convergence of corresponding fixed point iteration. The setting requiring the fewest conditions is highlighted in gray color. and uniqueness of the fixed point z, without any assumption on W, we summarize in 2 all the possible combination of well-posed implicit-layers. Here we experiment with σ(z) = tanh(z) + α, with α chosen accordingly to Proposition 5.4 to ensure a small enough subhomogeneity degree (µ < 1 for the standard model and µ < 1/2 for the normalized one) and thus uniqueness by Theorem 3.7. Precisely, we consider z = tanh(Wz) + fθ(x) + 1.2, (7) z = norm p( tanh(Wz) + fθ(x) + 1.603 ), (8) where 1 p + and fθ(x) is a one-layer MLP with entry-wise positive final activation layer. As the architectures are now subhomogeneous and have a unique fixed point, we call (7) and (8) Sub DEQ models. 6.1. Efficiency of Sub DEQ We now compare the convergence rate of the standard fixed point method to find the equilibrium point of a Sub DEQ, which is globally convergent due to Theorem 3.7, with the convergence rate of the Peaceman-Rachford method to find the equilibrium point of a Mon DEQ (Winston & Kolter, 2020). We analyze the implicit layer of the Sub DEQ in (8) varying p {1, 10, + } and letting fθ(x) = Re LU(Ux + b) be a one-layer MLP with Re LU activation function. We implement two different Mon DEQs: the first one is defined as in (Winston & Kolter, 2020) via the following equation z = Re LU(Wz + Ux + b) . (9) Subhomogeneous Deep Equilibrium Models The second one is defined as z = tanh(Wz + fθ(x)) . (10) to replicate the architecture of the proposed Sub DEQ. In order to guarantee that the Mon DEQ architectures have a unique fixed point, we restrict the weight matrix W in both the Mon DEQ equations to satisfy the operator monotone parametrization W = A A + B B with A, B parameter matrices, c.f. (Winston & Kolter, 2020). For all models, as input, we choose x R128 400 sampled from a uniform distribution on the interval [0, 1) using torch.rand, the first dimension represents the batch size and the second the dimension of the features space. The hidden embedding z has a width of 150. In Figure 1 we plot the relative residual zk+1 zk F / zk+1 F , where F is the Frobenius norm and zk are the iterates of the fixed point methods. From Figure 1(left) we can notice that Sub DEQ systematically requires fewer steps to converge with respect to Mon DEQ. We also compare with the case of convolutional layers, instead of dense DEQ layers. In this setting, the input x R128 1 28 28 is sampled from a uniform distribution on the interval [0, 1) using torch.rand, the first dimension represent the batch size, the second the number of the image channels and the last two the size of the image. The hidden fixed point embedding z has 40 channels and each channel has a size of 28 28. The relative residual of the iterations is shown in Figure 1(right). Also in this case, we can notice that Sub DEQ systematically requires fewer steps to converge to converge than Mon DEQ. 6.2. Sub DEQ on benchmark image datasets We now show the capacity of Sub DEQ in terms of classification tasks. We train them on different image benchmark datasets: CIFAR-10 (Krizhevsky & Hinton, 2009), SVHN (Netzer et al., 2011), and MNIST (Le Cun & Cortes, 2010). We compare Sub DEQ with Mon DEQ as well as neural ode (n ODE) architectures (Chen et al., 2018) and standard explicit network baselines. 
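For reference, the sketch below shows how the forward solve of the normalized layer (8) can be carried out with the plain fixed-point method and the relative-residual stopping rule of Section 6.1. The shapes mirror that experiment (batch 128, 400 input features, hidden width 150, p = 10), but the weights are random placeholders; this is an illustration of the iteration, not the released implementation.

```python
import torch

# Sketch: evaluate the normalized SubDEQ layer of Eq. (8),
#   z = norm_p( tanh(W z) + f_theta(x) + 1.603 ),
# by plain fixed-point iteration with the relative-residual stopping rule
# of Section 6.1.  Illustrative only; weights are random placeholders.

torch.manual_seed(0)
batch, d_in, d_hid, p = 128, 400, 150, 10.0

W = torch.randn(d_hid, d_hid) / d_hid ** 0.5     # no constraint on W is needed (Theorem 3.7)
U = torch.randn(d_hid, d_in) / d_in ** 0.5
b = torch.zeros(d_hid)
x = torch.rand(batch, d_in)                      # uniform input, as in Section 6.1
fx = torch.relu(x @ U.T + b)                     # positive one-layer MLP injection

def layer(z):
    out = torch.tanh(z @ W.T) + fx + 1.603
    return out / out.norm(p=p, dim=1, keepdim=True)   # per-sample p-normalization

z = torch.ones(batch, d_hid)
for k in range(200):
    z_new = layer(z)
    rel_res = (z_new - z).norm() / z_new.norm()
    z = z_new
    if rel_res < 1e-3:                            # stopping rule used in the paper
        break
print(f"fixed point reached in {k + 1} iterations "
      f"(relative residual {rel_res.item():.1e})")
```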
Overall, we consider the following models (for the feedforward implicit layer case): 1. Sub DEQ (Normalized Tanh): norm (tanh(Wz) + fθ(x) + 1.603) 2. Sub DEQ (Tanh): tanh(Wz) + fθ(x) + 1.2 3. Mon DEQ (Re LU): z = Re LU(Wz + Ux + b) 4. Mon DEQ (Tanh): z = tanh(Wz + fθ(x)) 5. n ODE (Re LU): z(t) = Re LU(Wz(t)), z(0) = fθ(x) Model Error % MNIST (Dense) Sub DEQ (Normalized Tanh) 2.088 0.1405 % Sub DEQ (Tanh) 1.92 0.102 % n ODE (Re LU) 2.356 0.0689 % n ODE (Tanh) 3.296 0.1082 % Mon DEQ (Re LU) 2.056 0.0484 % Mon DEQ (Tanh) 2.736 0.7491 % Standard MLP (Tanh) 2.052 0.1452 % MNIST (Convolutional) Sub DEQ (Normalized Tanh) 1.354 0.98 % Sub DEQ (Tanh) 0.706 0.011 % n ODE (Re LU) 1.184 0.3845 % n ODE (Tanh) 0.826 0.0432 % Mon DEQ (Re LU) 0.654 0.0662 % Mon DEQ (Tanh) 1.096 0.0589 % Standard CNN (Tanh) 0.876 0.0739 % CIFAR-10 Sub DEQ (Normalized Tanh) 28.364 0.377 % Sub DEQ (Tanh) 27.946 1.7564 % n ODE (Re LU) 33.58 1.2882 % n ODE (Tanh) 28.792 1.3343 % Mon DEQ (Re LU) 24.414 0.6521 % Mon DEQ (Tanh) 35.618 1.1766 % Standard CNN (Tanh) 27.157 0.4154 % SVHN Sub DEQ (Normalized Tanh) 9.3562 0.2122 % Sub DEQ (Tanh) 10.3987 0.41296 % n ODE (Re LU) 33.8253 11.3008 % n ODE (Tanh) 22.277 2.8740 % Mon DEQ (Re LU) 11.0356 0.319 % Mon DEQ (Tanh) 17.8849 0.9747 % Standard CNN (Tanh) 12.7243 0.1024 % Tiny Image Net Sub DEQ (Normalized Tanh) 70.1791 0.3666 % Sub DEQ (Tanh) 74.6633 0.1511 % Mon DEQ (Re LU) 85.49 0.2406 % Standard CNN (Tanh) 73.76 1.4323 % % Table 3. Mean std of the misclassification error on test set 6. n ODE (Tanh): z(t) = tanh(Wz(t)), z(0) = fθ(x). We also consider a convolutional variant of these implicit layers, the only difference with the above dense layers is in the weights matrices, which we substitute with the standard convolutional kernels. For the standard explicit network baseline, we replace the implicit layer with a standard one using the same hyperparameter and tanh as the activation function, in both, the dense and the convolutional case. For the normalized Sub DEQ as in (8), we decide to normalize each element of the batch for the feedforward, while with the convolutions we normalize along each row. For the Subhomogeneous Deep Equilibrium Models Figure 2. Validation accuracy of the dense architectures during training on MNIST. n ODE, we integrate the ODE over the interval [0, 1], as the output of the implicit layer we took the solution at time t = 1. For all the models, we apply 1d batch normalization for the feedforward layers and 2d batch normalization for the convolutional layers. Moreover, as a last step, we feed the embedding into a softmax classifier, but with the convolutional architectures, before it, we apply an average pooling and then we flatten the tensor. We describe all the details about the hyperparameters in Table 8 in Appendix C. For every experiment, with the Sub DEQ architectures, we use the fixed point method speed up with Anderson acceleration to calculate the fixed point and as the stopping criterion we use the relative residual zk+1 zk F / zk+1 F , and we stop the method when the residual reaches the values of 10 3. We decided to train the dense architecture only on MNIST, instead, we trained the convolutional models on all three data sets. Regarding the data, we use the hold-out approach, dividing the dataset into train validation and test sets. Appendix C describes the proportion of the splittings. 
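Since the equilibria in these experiments are computed with the fixed-point method sped up by Anderson acceleration, we include a generic sketch of that solver. It is a standard type-II Anderson iteration with a small memory and a least-squares mixing step, written against the same relative-residual stopping rule; it illustrates the idea and is not necessarily the authors' exact implementation. Passing the layer map from the previous sketch as g gives an accelerated forward solve.

```python
import torch

def anderson(g, z0, m=5, max_iter=50, tol=1e-3, beta=1.0):
    # Generic type-II Anderson acceleration for z = g(z): keep the last m
    # iterates, find the affine combination of their residuals g(z_i) - z_i
    # with minimal norm (coefficients summing to one, handled by eliminating
    # the last coefficient), and mix the stored g-values accordingly.
    z = z0.flatten()
    Z_hist, G_hist = [], []
    for k in range(max_iter):
        gz = g(z.view_as(z0)).flatten()
        if (gz - z).norm() / (gz.norm() + 1e-12) < tol:    # relative residual
            return gz.view_as(z0), k + 1
        Z_hist.append(z)
        G_hist.append(gz)
        Z_hist, G_hist = Z_hist[-m:], G_hist[-m:]
        if len(Z_hist) == 1:
            z = gz                                         # plain fixed-point step
            continue
        R = torch.stack([gi - zi for gi, zi in zip(G_hist, Z_hist)], dim=1)
        dR = R[:, :-1] - R[:, -1:]                         # columns r_i - r_last
        gamma = torch.linalg.lstsq(dR, -R[:, -1:]).solution.squeeze(1)
        alpha = torch.cat([gamma, (1.0 - gamma.sum()).unsqueeze(0)])
        z = beta * (torch.stack(G_hist, dim=1) @ alpha) \
            + (1.0 - beta) * (torch.stack(Z_hist, dim=1) @ alpha)
    return z.view_as(z0), max_iter

# e.g. anderson(layer, torch.ones(batch, d_hid)) with the layer map from the
# previous sketch returns the equilibrium and the number of iterations used.
```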
All training data is normalized to mean µ = 0, standard deviation σ = 1, and the validation and the test are rescaled using the mean and standard deviation of the training data. On the same data split we run the experiments 5 times and in Table 3 we show the mean the standard deviation of the misclassification error. As we can notice from Table 3, the Sub DEQ (Normalized Tanh), depending on the dataset, achieves performance that exceeds the performance of the other model or reaches similar performance to the best model. We highlight that adding the vector 1.603 to the activation function to ensure the uniqueness and convergence, does not negatively affect the performance, instead ensures higher stability and leads the model to a higher accuracy. 6.3. Deep Equilibrium Graph Neural Network: nonlinear graph propagation We conclude with an experiment on a graph neural network architecture. Graph neural networks are typically relatively Dataset (% labeled) Accuracy Cora citation (5.2 %) APPNP 80.2027 1.9557% APPNP (Normalized Tanh) 73.3905 2.3818 % APPNP (Tanh) 72.2369 2.3013 % Cora author (5.2%) APPNP 70.834 2.1591 % APPNP (Normalized Tanh) 73.3437 1.8252 % APPNP (Tanh) 72.5175 2.4537 % Cite Seer (4.2 %) APPNP 62.8625 1.4477 % APPNP (Normalized Tanh) 62.66078 1.8676 % APPNP (Tanh) 62.1564 1.479 % DBLP (4.0 %) APPNP 88.8095 0.2866 APPNP (Normalized Tanh) 89.4007 0.3619 % APPNP (Tanh) 86.87046 0.3851 % Pub Med (0.8%) APPNP 77.45168 1.4433 % APPNP (Normalized Tanh) 78.5827 0.9741 % APPNP (Tanh) 77.103 1.251 % Table 4. Mean standard deviation accuracy on test set shallow due to oversmoothing and oversquashing phenomena (Nguyen et al., 2023; Giraldo et al., 2023). Thus, a fixed point implicit graph neural architecture may be ineffective. However, it is shown in (Gasteiger et al., 2018) that the Page Rank random walk on the graph may be used to propagate the input injected by a simple MLP and the limit point of the propagation leads to excellent, sometimes state-of-theart, results. This approach, named there APPNP, is effective also because the Page Rank diffusion process converges linearly to a unique Page Rank vector. As the model uses the Page Rank fixed point as the final latent embedding, it can be interpreted as a simple DEQ with linear activations. This is possibly a limitation of APPNP as nonlinear activations may allow for a better modeling power. Using Theorem 3.7 one can propose variations of this approach implementing nonlinear diffusion processes based on subhomogeneous activation functions and still maintain the same fundamental guarantees of uniqueness and linear convergence. We obtain this way a Sub DEQ variation of the APPNP graph neural network model, which we discuss next. First, we review the standard APPNP architecture (Gasteiger et al., 2018). Consider an undirected graph G = (V, E) on n = |V | nodes, let di be the degree of node i, and let A be the graph normalized adjacency matrix, defined as Aij = di if ij E and Aij = 0 otherwise. APPNP defines the node embedding Z via the fixed point equation: ( Z = (1 α) A Z + αfθ(X) Z = softmax( Z) (11) Subhomogeneous Deep Equilibrium Models where A = A + I denotes the shifted adjacency matrix and X Rn f is the input node feature matrix. We consider here the following nonlinear variations of (11) ( Z = tanh (1 α) A Z + αfθ(X) + 1.2 Z = softmax( Z) and ( Z = norm tanh (1 α) A Z + αfθ(X) + 1.2 Z = softmax( Z) where the final l normalization layer is implemented columnwise in the model above, as in Theorem 3.9. 
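As an illustration of this nonlinear propagation, the sketch below iterates the unnormalized variant Z ← tanh((1 − α) Â Z + α fθ(X)) + 1.2 to its fixed point on a small synthetic graph and applies the softmax read-out. The random graph, features, and two-layer MLP are placeholders (not the paper's datasets or trained weights), and the adjacency is symmetrically normalized with self-loops in the way standard APPNP implementations do.

```python
import torch

# Sketch of the nonlinear (SubDEQ) variant of APPNP propagation:
#   Z <- tanh((1 - alpha) * A_hat @ Z + alpha * f_theta(X)) + 1.2,
# iterated to its unique fixed point, followed by a softmax read-out.
# Graph, features, and MLP below are illustrative placeholders.

torch.manual_seed(0)
n, f, c, alpha = 200, 32, 7, 0.1                   # nodes, features, classes, teleport

A = (torch.rand(n, n) < 0.05).float()              # random undirected graph
A = ((A + A.T) > 0).float() + torch.eye(n)         # add self-loops
deg = A.sum(1)
A_hat = A / torch.sqrt(deg).unsqueeze(1) / torch.sqrt(deg).unsqueeze(0)

X = torch.rand(n, f)
mlp = torch.nn.Sequential(torch.nn.Linear(f, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, c), torch.nn.Softplus())
with torch.no_grad():
    H = mlp(X)                                     # positive input injection f_theta(X)
    Z = torch.ones(n, c)
    for k in range(300):
        Z_new = torch.tanh((1 - alpha) * (A_hat @ Z) + alpha * H) + 1.2
        if (Z_new - Z).norm() / Z_new.norm() < 1e-6:
            Z = Z_new
            break
        Z = Z_new
    pred = Z.softmax(dim=1)                        # per-node class probabilities
print(f"nonlinear propagation converged in {k + 1} iterations")
```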
If we add to tanh a translation vector with all entries equal to 1.2, we obtain a subhomogeneous operator with degree µ = 0.99, due to the Proposition 5.4 and Lemma 4.1. Notice also that the Jacobian with respect Z of this transformation is entrywise positive respect all Z > 0 and is differentiable. Thus, both the nonlinear fixed point equations above have a unique fixed point due to Theorem 3.9. We test APPNP and its nonlinear variations on different graph datasets: Cora citation, Cora author, Cite Seer, DBLP, and Pub Med, always in a node classification semi-supervised learning setting. We divide the dataset into training, validation, and test sets. The percentages of observation used for the training set are shown in Table 4, and the remaining observations are equally split between validation and test sets. For both methods, we use similar hyperparameters, as fθ( ) we use 2-layers MLP of width 64 for each layer; we regularize the architectures with dropout p = 0.5, we use Adam as optimizer with constant learning rate equal to 0.01 and a weight decay equal to 0.005. We set α = 0.1 and K = 10. We repeat the splitting and the training 5 times and we report the average std results in Table 4. 7. Conclusions We have presented a new analysis of the existence and uniqueness of fixed points for DEQ models, as well as convergence guarantees for the corresponding fixed point iterations. Unlike previous approaches that require constraints on the weight matrices to guarantee uniqueness, our theoretical framework allows us to use general weight matrices, provided the activation functions of the network are subhomogeneous. We observe that many well-known activation functions are indeed subhomogeneous possibly up to some minor modification, showing the vast applicability of our framework. Thus, we provide several examples of new subhomogeneous deep equilibrium architectures designed for image classification and nonlinear graph propagation. Impact statement This paper present work whose goal is to advance the field of implicit-depth deep learning architectures from a mathematical point of view. There are many pontential societal consequences of our work, none of which we feel must be specifically highlighted here. Ba, J. L., Kiros, J. R., and Hinton, G. E. Layer normalization. stat, 1050:21, 2016. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. Advances in neural information processing systems, 32:688 699, 2019. Bai, S., Koltun, V., and Kolter, J. Z. Multiscale deep equilibrium models. Advances in neural information processing systems, 33:5238 5250, 2020. Bai, S., Geng, Z., Savani, Y., and Kolter, J. Z. Deep equilibrium optical flow estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 620 630, 2022. Chen, Q., Wang, Y., Geng, Z., Wang, Y., Yang, J., and Lin, Z. Equilibrium image denoising with implicit differentiation. IEEE Transactions on Image Processing, 32:1868 1881, 2023. doi: 10.1109/TIP.2023.3255104. Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. Neural ordinary differential equations. Advances in neural information processing systems, 31:6571 6583, 2018. Chen, T., Lasserre, J. B., Magron, V., and Pauwels, E. Semialgebraic representation of monotone deep equilibrium models and applications to certification. Advances in Neural Information Processing Systems, 34:27146 27159, 2021. Clarke, F. H. Optimization and Nonsmooth Analysis. Society for Industrial and Applied Mathematics, 1990. doi: 10.1137/1.9781611971309. 
URL https://epubs.siam.org/doi/abs/10. 1137/1.9781611971309. Dabre, R. and Fujita, A. Recurrent stacking of layers for compact neural machine translation models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6292 6299, 2019. Dehghani, M., Gouws, S., Vinyals, O., Uszkoreit, J., and Kaiser, L. Universal transformers. In International Conference on Learning Representations (ICLR), 2019. Subhomogeneous Deep Equilibrium Models El Ghaoui, L., Gu, F., Travacca, B., Askari, A., and Tsai, A. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930 958, 2021. Farnia, F., Zhang, J., and Tse, D. Generalizable adversarial training via spectral normalization. In International Conference on Learning Representations, 2018. Gasteiger, J., Bojchevski, A., and G unnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations, 2018. Gautier, A., Tudisco, F., and Hein, M. The perron frobenius theorem for multihomogeneous mappings. SIAM Journal on Matrix Analysis and Applications, 40(3):1179 1205, 2019. Gilton, D., Ongie, G., and Willett, R. Deep equilibrium architectures for inverse problems in imaging. IEEE Transactions on Computational Imaging, 7:1123 1133, 2021. Giraldo, J. H., Skianis, K., Bouwmans, T., and Malliaros, F. D. On the trade-off between over-smoothing and oversquashing in deep graph neural networks. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, CIKM 23. ACM, October 2023. doi: 10.1145/3583780.3614997. URL http: //dx.doi.org/10.1145/3583780.3614997. Gu, F., Chang, H., Zhu, W., Sojoudi, S., and El Ghaoui, L. Implicit graph neural networks. Advances in Neural Information Processing Systems, 33:11984 11995, 2020. Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448 456. pmlr, 2015. Jafarpour, S., Davydov, A., Proskurnikov, A., and Bullo, F. Robust implicit networks via non-euclidean contractions. Advances in Neural Information Processing Systems, 34: 9857 9868, 2021. Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/ kriz/learning-features-2009-TR.pdf. Le Cun, Y. and Cortes, C. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/ exdb/mnist/. Lemmens, B. and Nussbaum, R. Nonlinear Perron Frobenius Theory. Cambridge Tracts in Mathematics. Cambridge University Press, 2012. doi: 10.1017/ CBO9781139026079. Micaelli, P., Vahdat, A., Yin, H., Kautz, J., and Molchanov, P. Recurrence without recurrence: Stable video landmark detection with deep equilibrium models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22814 22825, 2023. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http: //ufldl.stanford.edu/housenumbers/ nips2011_housenumbers.pdf. Nguyen, K., Hieu, N. M., Nguyen, V. D., Ho, N., Osher, S., and Nguyen, T. M. Revisiting over-smoothing and over-squashing using ollivier-ricci curvature. In International Conference on Machine Learning, pp. 25956 25979. PMLR, 2023. Piotrowski, T., Cavalcante, R. L., and Gabor, M. 
Fixed points of nonnegative neural networks. ar Xiv e-prints, pp. ar Xiv 2106, 2021. Pokle, A., Geng, Z., and Kolter, J. Z. Deep equilibrium approaches to diffusion models. Advances in Neural Information Processing Systems, 35:37975 37990, 2022. Rusch, T. K. and Mishra, S. Coupled oscillatory recurrent neural network (cornn): An accurate and (gradient) stable architecture for learning long time dependencies. In International Conference on Learning Representations, 2021. Ryu, E. K. and Boyd, S. Primer on monotone operator methods. Appl. comput. math, 15(1):3 43, 2016. Wei, C. and Kolter, J. Z. Certified robustness for deep equilibrium models via interval bound propagation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum? id=y1PXylgr XZ. Winston, E. and Kolter, J. Z. Monotone operator equilibrium networks. Advances in neural information processing systems, 33:10718 10728, 2020. Yang, Z., Pang, T., and Liu, Y. A closer look at the adversarial robustness of deep equilibrium models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 10448 10461. Curran Associates, Inc., 2022. Zhang, B., Jiang, D., He, D., and Wang, L. Rethinking lipschitz neural networks and certified robustness: A boolean function perspective. Advances in neural information processing systems, 35:19398 19413, 2022. Subhomogeneous Deep Equilibrium Models Zhao, Y., Zheng, S., and Yuan, X. Deep equilibrium models for video snapshot compressive imaging. ar Xiv preprint ar Xiv:2201.06931, 2022. Subhomogeneous Deep Equilibrium Models Proof. (Proposition 3.4) We now prove the first part. Let g: [1, + ) Rn, defined as g(λ) = [g1(λ), . . . , gn(λ)] = F(λz) λµF(z). Clearly, g(1) = 0 and g is a differentiable function since F(z) > 0 for each z > 0 and F is differentiable. g (λ) = F (λz)z µλµ 1F(z) λg (λ) = F (λz)λz µλµF(z) |F (λz)|λz µλµF(z). Using the definition of subhomogeneous operator we get λg (λ) µ(F(λz) λµF(z)) = µg(λ). Then g j(λ) µ λgj(λ) for each j = 1, . . . , n, thanks to the Gr onwall s inequality, g j(λ) gj(1) exp = gj(1)λ = 0. Therefore, g j(λ) 0, this shows that gj is a decreasing function and gj(λ) 0, thus g(λ) 0 entry-wise. We now prove the second part of the proposition, the necessary condition is implied by the first part so we will prove only the sufficient condition. Let g: [0, + ) Rn be defined as g(λ) = F(λz) λµF(z), by hypothesis, for each λ 1, g(λ) 0, also note that g(1) = 0. This shows that g (λ) 0 for each λ 1. g (λ) = F (λz)z µλµ 1F(z), we get g (1) = F (z)z µF(z), which implies F (z)z µF(z). Proof. (Theorem 3.6) Let D(z) be Clark s generalized Jacobian of the map z ln(G(ez)). For the Mean Theorem (Clarke, 1990) we have ln F eln(x) ln F eln(y) co (D(Ω(x, y)) (ln(x) ln(y)), where Ω(x, y): = {z Rn | z = t ln(x) + (1 t) ln(y), t [0, 1]} and co (D(Ω(x, y)) (ln(x) ln(y)) denotes the convex hull of all points of the form Z(ln(x) ln(y)) where Z D(u) for some u in Ω(x, y). For the Caratheodory Theorem, we get ln F eln(x) ln F eln(y) = l=0 βjξl(ln(x) ln(y)), where ξl D(u) for a u in Ω(x, y), βl 0 and P l βl = 1. Let v = ln(x) ln(y), using the Chain Rule (Clarke, 1990), we obtain D(u)v Diag(F(eu)) 1 F(eu) Diag(eu) v, F(u) detonates the Clark s Generalized Jacobian of F with respect u. Therefore, we can write the element of D(u)v as Diag(F(eu)) 1Q Diag(eu)v, where Q F(u). 
At this point, we can estimate δ(G(x), G(y)) as follows, δ(G(x), G(y)) = ln(G(x)) ln(G(y)) = j=0 βjξj(ln(x) ln(y)) l=0 βl ξl (ln(x) ln(y)) max l=0,...,n2 ξl (ln(x) ln(y)) . (12) Subhomogeneous Deep Equilibrium Models Moreover, using the definition of subhomogeneous operator we obtain ξl = max i=1,...,n j=1 |ξl|ij = max i=1,...,n j=1 | Diag(F(eu)) 1Q Diag(eu)|ij = = max i=1,...,n 1 F(eu)i Qij eu j = max i=1,...,n 1 F(eu)i Qij eu j max i=1,...,n 1 F(eu)i F(eu)i µ = µ. Proof. (Theorem 3.8) For the Euler s Homogeneous Function Theorem φ(z) = w z z, thus G(z) = F (z) w z F (z). Let D(z) be Clark s generalized Jacobian of the map z ln(G(ez)). For the Mean Theorem (Clarke, 1990) we have ln(G(eln(x))) ln(G(eln(y))) co (D(Ω(x, y)) (ln(x) ln(y)), where Ω(x, y): = {z Rn | z = t ln(x) + (1 t) ln(y), t [0, 1]} and co (D(Ω(x, y)) (ln(x) ln(y)) denote the convex hull of all points of the form Z(ln(x) ln(y)) where Z D(u) for some u in Ω(x, y). For the Caratheodory Theorem, we get ln(G(eln(x))) ln(G(eln(y))) = j=0 βjξj(ln(x) ln(y)). where ξj D(u) for a u in Ω(x, y), βj 0 and P i βi = 1. For simplicity we will pose v = ln(x) ln(y). Using the Chain Rule (Clarke, 1990) we obtain D(u)v Diag(G(eu)) 1 G(eu) Diag(eu) v, G(u) detonates the Clark s Generalized Jacobian of G with respect u. Since G(x) = F (x) w u F (x) and the generalized Jacobian of wu is zero, if we apply the chain rule (Clarke, 1990) several times we obtain G(eu) Diag(eu)v F(eu) w u F(eu) F(eu)w u F(eu) (w u F(eu))2 the right-hand side above denote the set of points Q Diag(eu) v where Q = H w u F (eu) F (eu)w u K (w u F (eu))2 and H, K, F(eu). Therefore we can write the element of D(u)v as Diag(G(eu)) 1Q Diag(eu)v. At this point, we will estimate δ(G(x), G(y)) using the calculations that we have done so far. δ(G(x), G(y)) = ln(G(x)) ln(G(y)) = j=0 βjξj(ln(x) ln(y)) j=0 βj ξj (ln(x) ln(y)) max j=0,...,mn ξj (ln(x) ln(y)) . (13) Now we estimate the infinity norm of a matrix of the form Diag(G(eu)) 1Q Diag(eu). Let us begin by focusing on the first two matrices of multiplication, thus Diag(G(eu)) 1Q = Diag(F(eu)) 1w u F(eu)Q = = Diag(F(eu)) 1H 1w u K w u F(eu), Subhomogeneous Deep Equilibrium Models consequently, the entries are Diag(G(eu)) 1Q ij = Diag(F(eu)) 1H 1 w u K w u P(eu) r wu,r F(eu)r wu,l F(eu)l P r wu,r F(eu)r l=1 γl Klj F(eu)l where γi = wu,i F (eu)i P r wu,r F (eu)r . Notice that γi are positive and P Diag(G(eu)) 1Q Diag(eu) = max i=1,...,m j=1 | Diag(G(eu)) 1Q Diag(eu)|ij = = max i=1,...,m Hijeu j F(eu)i l=1 γl Kljeu j F(eu)i max i=1,...,m + max i=1,...,m µ + µ = 2µ. (14) Finally we obtain maxj=0,...,mn ξj 2µ, that conclude the proof of the first part. We now assume that F is differentiable and its Jacobian is entrywise positive for each z > 0. Until Equation (13) the proof is the same as the first part. Therefore, we can start estimating Diag(G(eu)) 1Q Diag(eu) as following Diag(G(eu)) 1Q ij = Diag(F(eu)) 1JF (eu) 1w u JF (eu) w u F(eu) wu,l JF (eu)lj P r wu,r F(eu)r wu,l F(eu)l P r wu,r F(eu)r where γi = wu,i F (eu)i P r wu,r F (eu)r and Cij = JF (eu)ij F (eu)i . Notice that γi and Cij are positive and P l=1 γl Clj| max l=1,...,m Clj =: Clj. Diag(G(eu)) 1Q Diag(eu) = max i=1,...,m j=1 | Diag(G(eu)) 1Q Diag(eu)|ij = = max i=1,...,m j=1 | Diag(G(eu)) 1Q|ijeu j j=1 Cljeu j = F(eu)l eu j µ. That conclude the proof. Proof. (Theorem 3.9) Let δ(G(x), G(y)) = max i=1,...,n j=1,...,d | ln(xij) ln(yij)| be the Thompson distance on Rn d. Let Gj : Rn d Rn be the j-th column mapping of G. 
δ(G(x), G(y)) = max i=1,...,n j=1,...,d | ln(xij) ln(yij)| = max j=1,...,d max i=1,...,n | ln(xij) ln(yij)| = max j=1,...,d δ(Gj(x), Gj(y)) . Subhomogeneous Deep Equilibrium Models Applying Theorem 3.8 with input space Rn d and output space Rn, we obtain δ (Gj(x), Gj(y)) 2µδ(x, y) for all j = 1, . . . , d. Thus max j=1,...,d δ (Gj(x), Gj(y)) 2µδ(x, y). This concludes the proof of the first part. Applying the same argument one can prove the case of positive Jacobian yielding the same upper bound without 2. Proof. (Lemma 4.1) Now we start proving the first part, hence that P H is hµ-subhomogeneous. Using the chain rule we obtain (P H)(z)z = P(H(z))JH(z)z, where (P H)(z) and P(H(z)) are the Clarke s generalized Jacobians of P and P H, respectively evaluated in z and P(z), and JH(z) is the Jacobian of H in z. Therefore we can write an element of (P H)(z)z as MJH(z)z, with M P(z). Moreover, applying Euler s Homogeneous Function Theorem and the definition of subhomogeneity we get |MJH(z)z| = h|MH(z)| hµP(H(z)) = hµ(P H)(z). Thus, P H subhomhµ. We now prove the second part of the Lemma, hence Q Ty P H is hµλ-subhomogeneous. Let F = Q Ty P H and P = P H, for the first point is hµ-subhomogeneous. let F(z)z co{ Q(P (z) + y) P (z)}z. Thus, an element M F(z) can be written as k=0 βk M Q k M P k z, where βk 0 and P k βk = 1, M Q k Q(P (z) + y) and M P k P (z). We get k=0 βk M Q k M P k z| λ k=0 βk|M Q k |P (z) λ k=0 βk|M Q k |(P (z) + y) λµQ(P (z) + y) = λµF(z). Therefore, Q Ty P H subhomhµλ. Proof. (Proposition 5.1) If z 0, σ(z) > 0, then R+ dom+(σ). Also note that σ (z) = ez(1 + ez) ezez (1 + ez)2 = σ(z)(1 σ(z)). Since σ(z) = ez 1+ez , the inequality that we want to prove is equivalent to z 1+ez. In 0 the inequality is verified, moreover, if we calculate the derivative of both sides we obtain e0 = 1 and ez is a monotone increasing function, thus the inequality is verified. Therefore σ subhom1(R+). Proof. (Proposition 5.2) If z 0, σ(z) > 0, thus R+ dom+(σ). σ (z) = eβz 0 < σ (z) < 1, thus σ is a monotonic increasing function. Therefore what we want to show coincides with σ (z)x σ(z). Moreover, σ (z)z x, since 0 < σ (z) < 1. Clearly is enough to proof βz ln(1 + eβz), this is equivalent to eβz 1 + eβz that holds for all z R+. Subhomogeneous Deep Equilibrium Models Proof. (Proposition 5.3) Note that, if z 0, σ(z) > 0, then R+ dom+(σ). σ (z) = 4 (ez + e z)2 . (15) Since, if z 0 σ (z) > 0, what we want to prove is equivalent to σ (z) z σ(z). If we plug in the last equation the exponential formulation of the hyperbolic tangent and 15 we get 4z ez + e z ez e z. Using the exponential formulation of the hyperbolic sine we obtain 2z sinh(2z). The Taylor expansion of the sinh is sinh(2z) = 2z + (2z)3 Since the first term of the series is 2z and z R+, the inequality is verified. Thus tanh subhom1(R+). Proof. (Proposition 5.5) First, note that, for each z R, σ(z) > 0, thus dom+(σ) = R. The Clarke s generalized Jacobian of H respect to z is: 0 if z < α1 or z > α2, [0, 1] if z = α1 or z = α2 1 otherwise Then we get 0 if z < α1 or z > α2, [0, α1] if z = α1 [0, α2] if z = α2 z otherwise Thus by the definition of σ we obtain the inequality |M z| σ(z) for each z R and for each M σ(z). Proof. (Proposition 5.6) We compute the gradient of σ and we get i ezi [ez1, . . . , ezn] = ez P Thanks to the convexity of the exponential function we obtain max i zi = log(emaxi zi) log(e P i zi) log( X i ezi) = σ(z) from the previous inequality we obtain | σ(z) z| = σ(z) z = P i ezi max i zi σ(z). This shows that σ subhom1(Rn +). 
Thanks to the fact that σ(z) is entrywise positive for each z Rn +, for Remark 3.3, σ s-subhom1(Rn +). Subhomogeneous Deep Equilibrium Models B. Additional Experiment We initiate by considering the general equation for a Sub DEQ z = normφ( σ1(σ2(Wz) + fθ(x)) ) . As we illustrate in Section 4, we can build several categories of Sub DEQ layer by choosing either σ1 or σ2 as the nonlinear activation function. The two Sub DEQs proposed in Section 6 have σ1 = Id and σ2(z) as the activation functions, if we want to swap σ1 and σ2, so σ1 be the nonlinear function and σ2 the identity, we must restrict the hidden weights of our DEQ layers to be positive, since in order to apply Lemma 4.1 σ2(Wz) must be positive in the positive orthant. Two examples of this kind are: norm (tanh(|W| z + Re LU, (U x + b)) + 1.2), and exploiting the power scaling trick norm (tanh(|W| z + Re LU(U x + b))0.99), where |W| is meant as the absolute value applied entrywise to the weight matrix W. Both layers are well-posed, as their unnormalized versions are subhomogeneous with a degree of 0.99 and differentiable with an entrywise positive Jacobian. We also implement a third variant of Sub DEQ without the normalization using the power scaling trick, with the implicit layer defined as follows tanh(W z)0.99 + Re LU(U x + b), We test them on the same dataset used in Section 6.2 using the same training, validation, and test splitting with the same hyperparameter of the Sub DEQ models in Section 6.2, in Appendix C we report the misclassification error on the test set. We also conduct other experiments with the graph neural network architecture. In section Section 6.3 we considered the following architecture ( Z = norm tanh (1 α) A Z + αfθ(X) + 1.2 Z = softmax( Z) (16) Notice that, the map Z 7 (1 α) A Z is 1-subhomogeneous, αfθ(X) is a positive translation vector, and tanh( ) + 1.2 is 0.99-storngly subhomogeneous. Thus, tanh (1 α) A Z + αfθ(X) + 1.2, due to Lemma 4.1, has a subhomogeneity degree of 0.99, since is differentiable with an entrywise positive Jacobian, the following iteration converge ( Z = norm tanh (1 α) A Z + αfθ(X) + 1.2 Z = softmax( Z) (17) We test also this architecture on the same dataset used in Section 6.3 comparing it with APPNP and the subhomogenoeus version of APPNP. Moreover, we tune the α parameter making a grid search on the value [0.05, 0.1, 0.3, 0.5, 0.7, 0.9]. To tune α we fixed the test set as half of the unknown labels, and we trained the models by varying α. For each model, we select the optimal α by choosing the one that achieved the highest accuracy on the validation set, the validation set is the half remaining part of the unknown labels. Subsequently, we trained the model five times using the optimal α, altering the training set by sampling from all observations in the dataset that were not part of the initially fixed test set. After each training session, we measured the accuracy of the test set and reported the mean and standard deviation. The results are reported in the table C, where 17 is APPPNP (Normalized Tanh) 2 and 16 is APPPNP (Normalized Tanh). We tested the analogous architectures for the version without the normalization layer. Subhomogeneous Deep Equilibrium Models C. 
Additional table Model Error % MNIST (Dense) Sub DEQ (Normalized Tanh (1.603)) 2.088 0.1405 % Sub DEQ (Tanh) 1.92 0.102 % Sub DEQ (Normalized Tanh (1.2)) 2.437 0.0788 % Sub DEQ (Normalizedwith Powerscale Tanh) 2.568 0.0495 % Sub DEQ (Powerscale Tanh) 1.964 0.125 % MNIST (Convolutional) Sub DEQ (Normalized Tanh (1.603)) 1.354 0.98 % Sub DEQ (Tanh) 0.706 0.011 % Sub DEQ (Normalized Tanh (1.2)) 1.829 0.0888 % Sub DEQ (Normalizedwith Powerscale Tanh) 1.773 0.1948 % Sub DEQ (Powerscale Tanh) 1.316 0.0459 % CIFAR-10 Sub DEQ (Normalized Tanh (1.603)) 28.364 0.377 % Sub DEQ (Tanh) 27.946 1.7564 % Sub DEQ (Normalized Tanh (1.2)) 35.809 0.5953 % Sub DEQ (Normalizedwith Powerscale Tanh) 33.864 0.7469 % Sub DEQ (Powerscale Tanh) 30.871 0.2268 % SVHN Sub DEQ (Normalized Tanh (1.603)) 9.3562 0.2122 % Sub DEQ (Tanh) 10.3987 0.41296 % Sub DEQ (Normalized Tanh (1.2)) 25.3903 3.7394 % Sub DEQ (Normalizedwith Powerscale Tanh) 23.4688 4.7033 % Sub DEQ (Powerscale Tanh) 19.4077 0.1346 % Table 5. Mean standard deviation of the misclassification error on test set Dataset Train set size Validation set size Test set size MNIST 71 % 14.5 % 14.5 % CIFAR-10 70 % 15 % 15 % SVHN 50.358 % 23.4235 % 26.2184 % Table 6. Hold out split proportion Subhomogeneous Deep Equilibrium Models Dataset (% labeled) Accuracy Cora citation (7.8 %) APPNP 76.3835 0.7716% APPNP (Normalized Tanh) 73.4996 0.952 % APPNP (Normalized Tanh) 2 68.978 0.6919 % APPNP (Tanh) 74.621 3.2222 % APPNP (Tanh) 2 75.3079 3.1284 % Cora author (7.8%) APPNP 69.2128 2.7963 % APPNP (Normalized Tanh) 69.5713 1.3922 % APPNP (Normalized Tanh) 2 68.1995 2.2836 % APPNP (Tanh) 68.6672 2.8515 % APPNP (Tanh) 2 67.8878 2.7859 % Cite Seer (7.8 %) APPNP 60.9079 1.3739 % APPNP (Normalized Tanh) 59.5334 1.1373 % APPNP (Normalized Tanh) 2 59.8739 1.6284 % APPNP (Tanh) 61.2358 1.2669 % APPNP (Tanh) 2 60.4035 1.4294 % DBLP (6.0 %) APPNP 89.1939 0.2812 % APPNP (Normalized Tanh) 89.545 0.0526 % APPNP (Normalized Tanh) 2 86.9683 0.1177 % APPNP (Tanh) 89.4138 0.269 % APPNP (Tanh) 2 89.6610 0.212 % Pub Med (1.2%) APPNP 75.9526 0.76875 % APPNP (Normalized Tanh) 76.9303 0.777 % APPNP (Normalized Tanh) 2 75.0607 0.70495 % APPNP (Tanh) 77.9957 1.0701 % APPNP (Tanh) 2 77.2589 0.8646 % Table 7. Mean standard deviation accuracy on test set Models Hyperparameter MNIST (Dense) MNIST (Conv) Cifar-10 SVHN Number of input channels (x) - 1 3 3 Number of hidden channels (z) - 16 48 48 Size of hidden channels (z) - 28 28 32 32 32 32 Hidden kernel size hidden (z) - 3 3 3 3 3 3 Input kernel size (x) - 3 3 3 3 3 3 Dimension of input weight matrix (x) 784 87 - - - Dimension of hidden weight matrix (z) 87 87 - - - Average pooling - 4 4 8 8 8 8 Epochs 30 40 40 40 Initial learning rate 10 3 10 3 10 3 10 3 Learning rate schedule Cosine Annealing Cosine Annealing Cosine Annealing Cosine Annealing minimum learning rate 10 6 10 5 10 3 10 3 Weight decay 10 5 10 5 10 5 10 5 Batch size 256 256 128 128 Table 8. Models Hyperparameter