# Robust Implicit Networks via Non-Euclidean Contractions

Saber Jafarpour¹*, Alexander Davydov¹*, Anton V. Proskurnikov²,³, and Francesco Bullo¹

¹ Center for Control, Dynamical Systems and Computation, University of California, Santa Barbara, 93106-5070, USA, {saber, davydov, bullo}@ucsb.edu. ² Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy. ³ Institute for Problems in Mechanical Engineering, Russian Academy of Sciences, St. Petersburg, Russia, anton.p.1982@ieee.org. * These authors contributed equally.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Implicit neural networks, a.k.a. deep equilibrium networks, are a class of implicit-depth learning models where function evaluation is performed by solving a fixed-point equation. They generalize classic feedforward models and are equivalent to infinite-depth weight-tied feedforward networks. While implicit models show improved accuracy and significant reduction in memory consumption, they can suffer from ill-posedness and convergence instability. This paper provides a new framework, which we call Non-Euclidean Monotone Operator Network (NEMON), to design well-posed and robust implicit neural networks based upon contraction theory for the non-Euclidean norm ℓ∞. Our framework includes (i) a novel condition for well-posedness based on one-sided Lipschitz constants, (ii) an average iteration for computing fixed points, and (iii) explicit estimates on input-output Lipschitz constants. Additionally, we design a training problem with the well-posedness condition and the average iteration as constraints and, to achieve robust models, with the input-output Lipschitz constant as a regularizer. Our ℓ∞ well-posedness condition leads to a larger polytopic training search space than existing conditions, and our average iteration enjoys accelerated convergence. Finally, we evaluate our framework in image classification through the MNIST and the CIFAR-10 datasets. Our numerical results demonstrate improved accuracy and robustness of the implicit models with smaller input-output Lipschitz bounds. Code is available at https://github.com/davydovalexander/Non-Euclidean_Mon_Op_Net.

## 1 Introduction

Implicit neural networks are infinite-depth learning models with layers defined implicitly through a fixed-point equation. Examples of implicit neural networks include deep equilibrium models [Bai et al., 2019] and implicit deep learning models [El Ghaoui et al., 2021]. Implicit networks can be considered as generalizations of feedforward neural networks with input-injected weight tying, i.e., training parameters are transferable between layers. Indeed, in implicit networks, function evaluation is executed by solving a fixed-point equation, and backpropagation is implemented by computing gradients using implicit differentiation. Due to these unique features, implicit models enjoy more flexibility and improved memory efficiency compared to traditional neural networks. At the same time, implicit networks can suffer from instability in their training due to the nonlinear nature of their fixed-point equations and can show brittle input-output behaviors due to their model flexibility. It is known that implicit neural networks require careful tuning and initialization to avoid ill-posed training procedures.
Indeed, without additional assumptions, their fixed-point equation may not have a unique solution and the numerical algorithms for finding their solutions might not converge. Several recent works in the literature have focused on studying well-posedness and convergence of the fixed-point equations of implicit networks using frameworks such as monotone operator theory [Winston and Kolter, 2020], contraction theory [El Ghaoui et al., 2021], and a mixture of both [Revay et al., 2020]. Despite several insightful results, important questions about conditions for well-posedness of implicit networks and efficient algorithms that converge to their solutions remain open.

One of the key features of implicit neural networks is their flexibility, which might come at the cost of low input-output robustness. As first noted in [Szegedy et al., 2014], the input-output behavior of deep neural networks can be vulnerable to perturbations; close enough input data can lead to completely different outputs. This lack of robustness can lead to unreliable performance of neural networks in safety-critical applications. Among several notions of robustness, the Lipschitz constant of a neural network is a coarse but rigorous measure which can be used to estimate input-output sensitivity of the network [Szegedy et al., 2014]. For this reason, there has been a growing interest in estimating the input-output Lipschitz constant of deep neural networks with respect to the ℓ2-norm [Fazlyab et al., 2019, Combettes and Pesquet, 2020]. However, it turns out that in some applications, input-output Lipschitz constants with respect to non-Euclidean norms are more informative measures for studying robustness. One such application appears in the robustness analysis of neural networks with large-scale inputs under widely-distributed adversarial perturbations (examples of these adversarial perturbations can be found in [Szegedy et al., 2014]). For these examples, the input-output ℓ2-Lipschitz constant does not provide complete information about robustness of the network; a neural network with a small input-output ℓ2-Lipschitz constant can be very sensitive to widespread entrywise-small perturbations of the input signal. On the other hand, the input-output ℓ∞-Lipschitz constant provides a different metric which appears to be well-suited for the analysis of widespread distributed perturbations. Another application is the estimation of input signal confidence intervals from output deviations, where the input-output ℓ∞-Lipschitz constant of the network provides more scalable bounds than its ℓ2 counterpart.

Related works

Implicit learning models. Numerous works in learning theory have shown the power of deep learning models with implicit layers. In these learning models, the notion of layers is replaced by a composition rule, which can be either a fixed-point iteration or a solution to a differential equation. Well-known frameworks for deep learning using implicit infinite-depth layers include deep equilibrium networks [Bai et al., 2019], implicit deep learning [El Ghaoui et al., 2021], and Neural ODEs [Chen et al., 2018]. In [Kag et al., 2020], a class of implicit recurrent neural networks is considered and it is demonstrated that, with this architecture, the models do not suffer from vanishing nor exploding gradients. Implicit layers have also been used to study convex optimization problems [Agrawal et al., 2019] and to design control strategies [Amos et al., 2018].
Convergence to global minima of certain classes of implicit networks is studied in [Kawaguchi, 2021].

Well-posedness and numerical algorithms for fixed-point equations. There has been recent interest in studying well-posedness and numerical stability of implicit-depth learning models. [El Ghaoui et al., 2021] proposes a sufficient spectral condition for well-posedness and for convergence of the Picard iterations associated with the fixed-point equation of implicit networks. In [Winston and Kolter, 2020, Revay et al., 2020], using monotone operator theory, a suitable parametrization of the weight matrix is proposed which guarantees the stable convergence of suitable fixed-point iterations. A recent influential survey on monotone operators is [Ryu and Boyd, 2016]. A recent survey on fixed-point strategies in data science is given by [Combettes and Pesquet, 2021].

Robustness of learning models. It is known that neural networks can be vulnerable to adversarial input perturbations [Szegedy et al., 2014]. A large body of literature is devoted to improving the robustness of neural networks using various defense strategies against adversarial examples [Goodfellow et al., 2015, Papernot et al., 2016]. While these strategies are effective in many scenarios, they do not provide formal guarantees for robustness [Carlini and Wagner, 2017]. However, there has been a recent interest in designing classifiers that are provably robust with respect to adversarial perturbations [Madry et al., 2018, Wong and Kolter, 2018]. The input-output Lipschitz constant of a neural network is a rigorous metric for its worst-case sensitivity with respect to input perturbations. Several recent works have focused on estimating the Lipschitz constant and enforcing its boundedness. For example, [Fazlyab et al., 2019, 2020] propose a convex optimization framework using quadratic constraints and semidefinite programming to obtain upper bounds on Lipschitz constants of deep neural networks. In [Pauli et al., 2021], a training algorithm is designed to ensure boundedness of the Lipschitz constant of the neural network via a semidefinite program. Other methods for estimating the Lipschitz constant of deep neural networks include [Krishnan et al., 2020, Revay et al., 2021, Combettes and Pesquet, 2020]. For implicit neural networks, a sensitivity-based robustness analysis is proposed in [El Ghaoui et al., 2021]. Lipschitz constants of deep equilibrium networks have also been studied in [Pabbaraju et al., 2021, Revay et al., 2020] using monotone operator theory.

Contributions

In this paper, using non-Euclidean contraction theory with respect to the ℓ∞-norm, we propose our novel framework, Non-Euclidean Monotone Operator Network (NEMON), to design implicit neural networks and study their well-posedness, stability, and robustness. First, we develop elements of a novel non-Euclidean monotone operator theory akin to the frameworks in [Bauschke and Combettes, 2017, Ryu and Boyd, 2016]. Using the concept of matrix measure, we introduce the essential notion of the one-sided Lipschitz constant of a map. Based upon this notion, we prove a general fixed-point theorem with weaker requirements than classical results on Picard and Krasnosel'skii-Mann iterations. For maps with one-sided Lipschitz constant less than unity, we show that an average iteration converges for sufficiently small step sizes and we optimize its rate of convergence.
For the special case of the weighted ℓ∞-norm, we show that this average iteration can be accelerated by choosing a larger step size. Additionally, we study perturbed fixed-point equations and establish a bound on the distance between perturbed and nominal equilibrium points as a function of the one-sided Lipschitz constant. Second, for implicit neural networks, we use our new fixed-point theorem to (i) establish ℓ∞-norm conditions for their well-posedness, (ii) design accelerated numerical algorithms for computing their solutions, and (iii) provide upper bounds on their input-output ℓ∞-Lipschitz constants. Third, we propose a parametrization for matrices with an appropriate bound on their one-sided Lipschitz constants and use this parametrization with the average iteration to design a training optimization problem. Finally, we perform several numerical experiments illustrating improved performance of NEMON in image classification on the MNIST and the CIFAR-10 datasets compared to the state-of-the-art models in [El Ghaoui et al., 2021, Winston and Kolter, 2020]. Additionally, by adding the input-output Lipschitz constant as a regularizer in the training problem, we observe improved robustness to some classes of adversarial perturbations. We include all relevant proofs in Appendix C.

## 2 Review material

Matrix measures. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ and also denote its induced norm on $\mathbb{R}^{n\times n}$. The matrix measure of $A \in \mathbb{R}^{n\times n}$ with respect to $\|\cdot\|$ is defined by
$$\mu(A) := \lim_{h\to 0^+} \frac{\|I_n + hA\| - 1}{h},$$
that is, the one-sided directional derivative of the induced norm in the direction of $A$, evaluated at $I_n$. Remarkably, the matrix measure is a tighter upper bound on the spectral abscissa of $A$ than $\|A\|$, and the set of matrices $A \in \mathbb{R}^{n\times n}$ satisfying $\mu(A) \le 1$ is an unbounded subset of $\mathbb{R}^{n\times n}$ strictly containing the compact ball $\|A\| \le 1$. We refer to [Desoer and Haneda, 1972] for a list of properties enjoyed by matrix measures. We will be specifically interested in diagonally weighted ℓ∞-norms defined by
$$\|x\|_{\infty,[\eta]^{-1}} = \max_i \frac{1}{\eta_i}\,|x_i|, \tag{1}$$
where, given a positive vector $\eta \in \mathbb{R}^n_{>0}$, we use $[\eta]$ to denote the diagonal matrix with diagonal entries $\eta$. The corresponding matrix norm and measure are
$$\|A\|_{\infty,[\eta]^{-1}} = \max_{i\in\{1,\dots,n\}} \sum_{j=1}^n \frac{\eta_j}{\eta_i}\,|a_{ij}|, \qquad \mu_{\infty,[\eta]^{-1}}(A) = \max_{i\in\{1,\dots,n\}} \Big( a_{ii} + \sum_{j=1,\, j\neq i}^n \frac{\eta_j}{\eta_i}\,|a_{ij}| \Big). \tag{2}$$

Lipschitz maps. Given a norm $\|\cdot\|$ and its induced matrix norm, a differentiable map $F : \mathbb{R}^n \to \mathbb{R}^n$ is Lipschitz continuous with constant $\mathrm{Lip}(F) \in \mathbb{R}_{\ge 0}$ if
$$\|DF(x)\| \le \mathrm{Lip}(F) \quad \text{for all } x \in \mathbb{R}^n. \tag{3}$$
For example, for an affine map $F(x) = Ax + b$, the (smallest) Lipschitz constant is $\mathrm{Lip}(F) = \|A\|$.

One-sided Lipschitz maps. Given a norm $\|\cdot\|$ with induced matrix measure $\mu(\cdot)$, a differentiable map $F : \mathbb{R}^n \to \mathbb{R}^n$ is one-sided Lipschitz continuous with constant $\mathrm{osL}(F) \in \mathbb{R}$ if
$$\mu(DF(x)) \le \mathrm{osL}(F) \quad \text{for all } x \in \mathbb{R}^n. \tag{4}$$
For example, for an affine map $F(x) = Ax + b$, the (smallest) one-sided Lipschitz constant is $\mathrm{osL}(F) = \mu(A)$. Note that (i) the one-sided Lipschitz constant is upper bounded by the Lipschitz constant, (ii) a Lipschitz continuous map is always one-sided Lipschitz continuous, and (iii) the one-sided Lipschitz constant may be negative. For a more in-depth review we refer to Appendix A. The notion of one-sided Lipschitz continuity unifies several important concepts in dynamical systems and optimization theory. In operator theory, the map $F$ is called a monotone operator if $-F$ is one-sided Lipschitz continuous with respect to the ℓ2-norm with constant $\mathrm{osL}(-F) \le 0$ [Ryu and Boyd, 2016, Bauschke and Combettes, 2017]. In control theory, the vector field $F$ is called strongly infinitesimally contracting if it is one-sided Lipschitz continuous with constant $\mathrm{osL}(F) < 0$ [Desoer and Haneda, 1972, Lohmiller and Slotine, 1998, Pavlov et al., 2004]. In what follows, we let $\mathrm{osL}_{\infty,[\eta]^{-1}}(F) \in \mathbb{R}$ denote the one-sided Lipschitz constant with respect to the weighted ℓ∞-norm.
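To make the weighted ℓ∞ quantities in (1)-(2) concrete, the following NumPy sketch (our own illustration; the function names are not from the paper's repository) evaluates the weighted vector norm, the induced matrix norm, and the matrix measure, and shows that $\mu_{\infty,[\eta]^{-1}}(A)$ can be negative even when $\|A\|_{\infty,[\eta]^{-1}}$ is large.

```python
import numpy as np

def weighted_inf_norm(x, eta):
    """||x||_{inf,[eta]^{-1}} = max_i |x_i| / eta_i, as in equation (1)."""
    return np.max(np.abs(x) / eta)

def weighted_inf_matrix_norm(A, eta):
    """Induced matrix norm: max_i sum_j (eta_j / eta_i) |a_ij|."""
    scaled = np.abs(A) * eta[None, :] / eta[:, None]
    return scaled.sum(axis=1).max()

def weighted_inf_matrix_measure(A, eta):
    """Matrix measure: max_i ( a_ii + sum_{j != i} (eta_j / eta_i) |a_ij| )."""
    scaled = np.abs(A) * eta[None, :] / eta[:, None]
    off_diag = scaled.sum(axis=1) - np.diag(scaled)
    return (np.diag(A) + off_diag).max()

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) - 5.0 * np.eye(4)   # strongly negative diagonal
eta = np.ones(4)
print(weighted_inf_matrix_norm(A, eta))      # typically well above 1
print(weighted_inf_matrix_measure(A, eta))   # can be negative: mu(A) <= ||A||
```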
## 3 Fixed-point equations and one-sided Lipschitz constants

In this section, we show that the notion of one-sided Lipschitz constant can be used to study solvability of the fixed-point equation
$$x = F(x), \tag{5}$$
where $F : \mathbb{R}^n \to \mathbb{R}^n$ is a differentiable map. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$; then, in view of the Banach fixed-point theorem, a simple sufficient condition for existence of a unique solution of the fixed-point equation (5) is $\mathrm{Lip}(F) < 1$. We note that the sufficient condition $\mathrm{Lip}(F) < 1$ depends on the specific form of the fixed-point equation (5) and can be relaxed by a suitable rewriting of this fixed-point equation. Given an averaging parameter $\alpha \in (0, 1]$, we define the average map $F_\alpha : \mathbb{R}^n \to \mathbb{R}^n$ by $F_\alpha := (1-\alpha)I + \alpha F$, where $I$ is the identity map. Using this notion, an equivalent reformulation of the fixed-point equation (5) is
$$x = (1-\alpha)x + \alpha F(x) = F_\alpha(x). \tag{6}$$
For $\alpha = 1$, we have $F_\alpha(x) = F(x)$ and equation (6) coincides with equation (5). For every $\alpha \in (0,1)$, the map $F_\alpha$ is different from $F$ but equations (5) and (6) are equivalent. Hence, if $\mathrm{Lip}(F_\alpha) < 1$, then by the Banach fixed-point theorem, the fixed-point equation (6) (and therefore the fixed-point equation (5)) has a unique solution $x^*$ and the sequence $\{y_k\}_{k=1}^{\infty}$ defined by
$$y_{k+1} = (1-\alpha)y_k + \alpha F(y_k), \quad \text{for all } k \in \mathbb{Z}_{\ge 0}, \tag{7}$$
converges geometrically to $x^*$ with rate $\mathrm{Lip}(F_\alpha)$. As a result of the parametrization (6), the condition $\mathrm{Lip}(F) < 1$ for existence and uniqueness of the fixed point can be relaxed to the sufficient conditions
$$\mathrm{Lip}(F_\alpha) < 1, \tag{8}$$
parametrized by $\alpha \in (0,1]$. Additionally, if condition (8) is satisfied, then algorithm (7) computes the fixed point $x^*$. It can be shown that condition (8) becomes less conservative as $\alpha$ decreases. The next theorem shows that, in the limit as $\alpha \to 0^+$, condition (8) approaches the condition $\mathrm{osL}(F) < 1$.

Theorem 1 (Fixed points via one-sided Lipschitz conditions). Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be differentiable and Lipschitz with constant $\ell > 0$ with respect to a norm $\|\cdot\|$. Define the average map $F_\alpha = (1-\alpha)I + \alpha F$ and, for $c > 0$, the function $\gamma_{\ell,c} : \big]0, \tfrac{c}{(c+\ell+1)(\ell+1)}\big[ \to \mathbb{R}$ by
$$\gamma_{\ell,c}(\alpha) := 1 - \alpha c + \alpha^2(\ell+1)^2.$$
Then the following statements are equivalent:
(i) $\mathrm{osL}(F) < 1 - c$,
(ii) $\mathrm{Lip}(F_\alpha) \le \gamma_{\ell,c}(\alpha)$, for $0 < \alpha < \tfrac{c}{(c+\ell+1)(\ell+1)}$.
Moreover, if the equivalent conditions (i) or (ii) hold, then, for the condition number $\kappa = \tfrac{1+\mathrm{Lip}(F)}{1-\mathrm{osL}(F)}$,
(iii) $F$ has a unique fixed point $x^*$;
(iv) for $0 < \alpha < \tfrac{1}{\kappa(\kappa+1)}$, $F_\alpha$ is a contraction mapping with contraction factor $\gamma_{\ell,c}(\alpha) < 1$;
(v) the minimum contraction factor is $\gamma^*_{\ell,c} = 1 - \tfrac{1}{4\kappa^2} + \tfrac{1}{8\kappa^3} + O\big(\tfrac{1}{\kappa^4}\big)$ and the minimizing averaging parameter $\alpha^*$ of $F_\alpha$ satisfies $\alpha^*\,(1-\mathrm{osL}(F)) = \tfrac{1}{2\kappa^2} - \tfrac{3}{8\kappa^3} + O\big(\tfrac{1}{\kappa^4}\big)$.

The average iteration (6) is often referred to as the Krasnosel'skii-Mann iteration or the damped iteration [Bauschke and Combettes, 2017]. Compared to [Bauschke and Combettes, 2017, Theorem 5.15], Theorem 1(iv) studies convergence of the Krasnosel'skii-Mann iteration for arbitrary norms and proposes a weaker convergence condition of the form $\mathrm{osL}(F) < 1$ (hence, $F$ need not be nonexpansive). However, it ensures convergence only for sufficiently small $\alpha > 0$ and assumes that $F$ is differentiable (as will be shown, however, the latter assumption can be relaxed).
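As an illustration of the average iteration (7), the sketch below (our own toy example: $F(x) = \tanh(Ax + b)$ with $A$ chosen so that $\mu_\infty(A) \le 0.9$, which gives $\mathrm{osL}_\infty(F) < 1$ even though $\mathrm{Lip}(F)$ may exceed 1 and the plain Picard iteration $\alpha = 1$ need not converge) runs $x_{k+1} = (1-\alpha)x_k + \alpha F(x_k)$ with a small step size until a tolerance is met.

```python
import numpy as np

def average_iteration(F, x0, alpha, tol=1e-10, max_iter=100_000):
    """Average (Krasnosel'skii-Mann) iteration x_{k+1} = (1-alpha) x_k + alpha F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_next = (1.0 - alpha) * x + alpha * F(x)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, k
        x = x_next
    return x, max_iter

# Toy map: F(x) = tanh(A x + b) with mu_inf(A) <= 0.9 < 1 by construction,
# even though ||A||_inf (and hence Lip(F)) can be much larger than 1.
rng = np.random.default_rng(1)
n = 10
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.9 * np.eye(n)
b = rng.standard_normal(n)
F = lambda x: np.tanh(A @ x + b)

x_star, iters = average_iteration(F, np.zeros(n), alpha=0.05)
print(iters, np.max(np.abs(x_star - F(x_star))))   # iterations used, fixed-point residual
```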
### 3.1 Accelerated convergence for weighted ℓ∞-norms

For diagonally weighted ℓ∞-norms, one can strengthen Theorem 1(iv) to prove convergence of the average iteration (6) on a larger domain of the parameter $\alpha$.

Theorem 2 (Accelerated fixed-point algorithm for ℓ∞-norms). Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be differentiable and Lipschitz with respect to the weighted non-Euclidean norm $\|\cdot\|_{\infty,[\eta]^{-1}}$. Define the average map $F_\alpha = (1-\alpha)I + \alpha F$ and pick $\mathrm{diagL}(F) \in [-\mathrm{Lip}(F), \mathrm{osL}(F)]$ to satisfy
$$\mathrm{diagL}(F) \le \min_{i\in\{1,\dots,n\}}\; \inf_{x\in\mathbb{R}^n} DF_{ii}(x). \tag{9}$$
If $\mathrm{osL}(F) < 1$, then $F$ has a unique fixed point $x^*$ and
(i) for $0 < \alpha \le \tfrac{1}{1-\mathrm{diagL}(F)}$, $F_\alpha$ is a contraction mapping with contraction factor $1 - \alpha(1-\mathrm{osL}(F)) < 1$;
(ii) the minimum contraction factor and minimizing averaging parameter of $F_\alpha$ are, respectively,
$$\mathrm{Lip}(F_{\alpha^*}) = 1 - \frac{1-\mathrm{osL}(F)}{1-\mathrm{diagL}(F)} = 1 - \frac{1}{\kappa_\infty}, \quad \text{for } \kappa_\infty = \frac{1-\mathrm{diagL}(F)}{1-\mathrm{osL}(F)} \le \frac{1+\mathrm{Lip}(F)}{1-\mathrm{osL}(F)} = \kappa, \qquad \alpha^* = \frac{1}{1-\mathrm{diagL}(F)}.$$

Note that $\mathrm{diagL}(F)$ is well defined because of the Lipschitz continuity assumption. Specifically, the tightest choice in (9) is $\mathrm{diagL}(F) = \min_{i\in\{1,\dots,n\}} \inf_{x\in\mathbb{R}^n} DF_{ii}(x)$, which is determined by the one-sided Lipschitz behavior of the scalar maps $x_i \mapsto F_i(x_i, x_{-i})$ with $x_{-i} = (x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$ held fixed. It is instructive to compare the minimum contraction factor in the general Theorem 1 with the minimum contraction factor for ℓ∞-norms in Theorem 2 and how they depend upon the corresponding condition numbers $\kappa$ and $\kappa_\infty$. We note that (i) the relevant condition number diminishes, $\kappa_\infty \le \kappa$, and (ii) the minimum contraction factor $\mathrm{Lip}(F_{\alpha^*}) = 1 - \tfrac{1}{4\kappa^2} + O(1/\kappa^4)$ improves to $\mathrm{Lip}(F_{\alpha^*}) = 1 - \tfrac{1}{\kappa_\infty}$. This acceleration justifies the title of this section.

### 3.2 Perturbed fixed-point problems

In this subsection, we focus on solvability of the perturbed fixed-point equation
$$x = F(x, u), \tag{10}$$
where $F : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is differentiable in $x$. We define $F_u(x) = F(x,u)$ and $F_x(u) = F(x,u)$. Given a norm $\|\cdot\|_X$ on $\mathbb{R}^n$ and a norm $\|\cdot\|_U$ on $\mathbb{R}^r$, $F$ is Lipschitz in its first argument with constant $\mathrm{Lip}_x(F) \in \mathbb{R}_{\ge 0}$ if
$$\|F(x_1, u) - F(x_2, u)\|_X \le \mathrm{Lip}_x(F)\,\|x_1 - x_2\|_X \quad \text{for all } x_1, x_2 \in \mathbb{R}^n \text{ and } u \in \mathbb{R}^r,$$
it is Lipschitz in its second argument with constant $\mathrm{Lip}_u(F) \in \mathbb{R}_{\ge 0}$ if
$$\|F(x, u_1) - F(x, u_2)\|_X \le \mathrm{Lip}_u(F)\,\|u_1 - u_2\|_U \quad \text{for all } x \in \mathbb{R}^n \text{ and } u_1, u_2 \in \mathbb{R}^r,$$
and it is one-sided Lipschitz in its first argument with constant $\mathrm{osL}_x(F) \in \mathbb{R}$ if $\mu(D_x F(x,u)) \le \mathrm{osL}_x(F)$ for all $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^r$. The following result, which is in the spirit of Lim's lemma [Lim, 1985], provides an upper bound on the distance between fixed points of the perturbed equation (10).

Theorem 3 (Perturbed fixed points). Given a norm $\|\cdot\|_X$ on $\mathbb{R}^n$ and a norm $\|\cdot\|_U$ on $\mathbb{R}^r$, consider a map $F : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ differentiable in the first argument and Lipschitz in both arguments. If $F$ is one-sided Lipschitz with constant $\mathrm{osL}_x(F) < 1$, then
(i) for every $u \in \mathbb{R}^r$, the map $F_u$ has a unique fixed point $x^*_u$;
(ii) for every $u, v \in \mathbb{R}^r$,
$$\|x^*_u - x^*_v\|_X \le \frac{\mathrm{Lip}_u(F)}{1 - \mathrm{osL}_x(F)}\,\|u - v\|_U.$$

Finally, Theorems 1, 2, and 3 are not directly applicable to activation functions that are not differentiable. In Appendix C.3, we show that, for the specific form of the fixed-point equation (5) where $F = \Phi \circ H$, $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is a weakly increasing, non-expansive, diagonal activation function, and $H : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is a differentiable function, all of the conclusions of Theorems 1, 2, and 3 hold by requiring equation (9) to be true almost everywhere.
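The following sketch (again a toy example of ours, with $F(x,u) = \tanh(Ax + Bu)$ and $A$ built so that $\mu_\infty(A) \le 0.9$, anticipating the parametrization (17) of Section 5) illustrates the accelerated step size $\alpha^* = 1/(1-\mathrm{diagL}(F))$ of Theorem 2 and numerically checks the perturbation bound of Theorem 3(ii).

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 10, 3
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.9 * np.eye(n)   # mu_inf(A) <= 0.9 < 1
B = rng.standard_normal((n, r))

def solve(u, alpha, tol=1e-12, max_iter=200_000):
    """Average iteration for the perturbed fixed point x = tanh(A x + B u)."""
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_next = (1 - alpha) * x + alpha * np.tanh(A @ x + B @ u)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, k
        x = x_next
    return x, max_iter

# Accelerated step size of Theorem 2: alpha* = 1 / (1 - diagL(F)).
diagL = min(np.diag(A).min(), 0.0)
alpha_star = 1.0 / (1.0 - diagL)
u = rng.standard_normal(r)
x_u, iters_fast = solve(u, alpha_star)
_, iters_slow = solve(u, 0.1 * alpha_star)
print(iters_fast, iters_slow)          # alpha* needs far fewer iterations

# Theorem 3(ii): the fixed point moves by at most Lip_u / (1 - osL_x) * ||u - v||_inf.
v = u + 0.01 * rng.standard_normal(r)
x_v, _ = solve(v, alpha_star)
mu_plus = max((np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max(), 0.0)
bound = np.abs(B).sum(axis=1).max() / (1.0 - mu_plus) * np.max(np.abs(u - v))
print(np.max(np.abs(x_u - x_v)), "<=", bound)
```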
## 4 Contraction analysis of implicit neural networks

In this section, we use contraction theory to lay the foundation for our Non-Euclidean Monotone Operator Network (NEMON) model of implicit neural networks. Given $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times r}$, $C \in \mathbb{R}^{q\times n}$, and $D \in \mathbb{R}^{q\times r}$, we consider the implicit neural network
$$x = \Phi(Ax + Bu) =: N(x, u), \qquad y = Cx + Du, \tag{11}$$
where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^r$, $y \in \mathbb{R}^q$, and $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is defined by $\Phi(x) = (\phi_1(x_1), \dots, \phi_n(x_n))$. For every $i \in \{1,\dots,n\}$, we assume the activation function $\phi_i : \mathbb{R} \to \mathbb{R}$ is weakly increasing, i.e., $\phi_i(x_i) \le \phi_i(z_i)$ for $x_i \le z_i$, and non-expansive, i.e., $|\phi_i(x_i) - \phi_i(z_i)| \le |x_i - z_i|$ for all $x_i$ and $z_i$; if $\phi_i$ is differentiable, these conditions are equivalent to $0 \le \phi_i'(x_i) \le 1$ for all $x_i \in \mathbb{R}$. We are able to provide the following estimates on all relevant Lipschitz constants.

Theorem 4 (Lipschitz and one-sided Lipschitz constants for the implicit neural network). Consider the implicit neural network in equation (11) with weakly increasing and non-expansive activation functions $\Phi$. With respect to $\|\cdot\|_{\infty,[\eta]^{-1}}$, $\eta \in \mathbb{R}^n_{>0}$, on $\mathbb{R}^n$ and $\|\cdot\|_U$ on the input space $\mathbb{R}^r$, the map $N : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is one-sided Lipschitz continuous in the first variable and Lipschitz continuous in both variables with constants
$$\mathrm{osL}_x(N) = \mu_{\infty,[\eta]^{-1}}(A)_+, \qquad \mathrm{Lip}_x(N) = \|A\|_{\infty,[\eta]^{-1}}, \tag{12}$$
$$\mathrm{Lip}_u(N) = \|B\|_{(\infty,[\eta]^{-1}),U}, \qquad \mathrm{diagL}(N) = \min_{i\in\{1,\dots,n\}} (A_{ii})_-, \tag{13}$$
where $(z)_+ = z$ if $z \ge 0$ and $(z)_+ = 0$ if $z < 0$, and $(z)_- = 0$ if $z \ge 0$ and $(z)_- = z$ if $z < 0$.

We now use these estimates to establish multiple properties of the implicit neural network.

Corollary 5 (Well-posedness, input-state Lipschitz constant, and computation). Consider the model (11), with parameters $(A, B, C, D)$ and with weakly increasing and non-expansive activation functions $\Phi$. Define the average map $N_\alpha := (1-\alpha)I + \alpha N$ and consider the norms $\|\cdot\|_{\infty,[\eta]^{-1}}$, $\eta \in \mathbb{R}^n_{>0}$, on $\mathbb{R}^n$, $\|\cdot\|_U$ on the input space $\mathbb{R}^r$, and $\|\cdot\|_Y$ on the output space $\mathbb{R}^q$. Then
(i) if $\mu_{\infty,[\eta]^{-1}}(A) < 1$, then (11) is well posed, i.e., there exists a unique fixed point;
(ii) the map $N_\alpha$ is a contraction mapping for $0 < \alpha \le \alpha^* := \big(1 - \min_{i\in\{1,\dots,n\}}(A_{ii})_-\big)^{-1}$ with minimum contraction factor
$$\mathrm{Lip}(N_{\alpha^*}) = 1 - \frac{1 - \mu_{\infty,[\eta]^{-1}}(A)_+}{1 - \min_{i\in\{1,\dots,n\}}(A_{ii})_-};$$
(iii) the Lipschitz constants from the input $u$ to the fixed point $x^*_u$ and to the output $y = Cx^*_u + Du$ are
$$\mathrm{Lip}_{u\to x} := \frac{\mathrm{Lip}_u(N)}{1 - \mathrm{osL}_x(N)} = \frac{\|B\|_{(\infty,[\eta]^{-1}),U}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+}, \tag{14}$$
$$\mathrm{Lip}_{u\to y} := \frac{\|B\|_{(\infty,[\eta]^{-1}),U}\,\|C\|_{Y,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{Y,U}. \tag{15}$$

## 5 Training implicit neural networks

Problem setup. Given an input data matrix $U = [u_1, \dots, u_m] \in \mathbb{R}^{r\times m}$ and a corresponding output data matrix $Y = [y_1, \dots, y_m] \in \mathbb{R}^{q\times m}$, we aim to learn matrices $A, B, C, D$ so that the neural network (11) approximates the input-output relationship. We rewrite the model for matrix inputs as $\hat{Y} = CX + DU$, where $X = \Phi(AX + BU)$. From Corollary 5(i), if each $\phi_i$ is weakly increasing and non-expansive, the fixed-point problem is well posed when $\mu_{\infty,[\eta]^{-1}}(A) < 1$ for some $\eta \in \mathbb{R}^n_{>0}$. We consider a training problem of the form
$$\min_{A,B,C,D,X}\; L(Y, CX + DU) + P(A, B, C, D) \quad \text{s.t.} \quad X = \Phi(AX + BU), \quad \mu_{\infty,[\eta]^{-1}}(A) \le \gamma, \tag{16}$$
where $L$ is a loss function, $P$ is a penalty function, and $\gamma < 1$ is a hyperparameter ensuring the fixed-point problem is well posed. For $\eta = \mathbb{1}_n$, we can remove the constraint $\mu_\infty(A) \le \gamma$ in the training optimization problem (16) using the following parametrization of the weight matrix $A$:
$$A = T - \mathrm{diag}(|T|\mathbb{1}_n) + \gamma I_n. \tag{17}$$
In Appendix B, we show that the parametrization (17) characterizes the set of matrices in $\mathbb{R}^{n\times n}$ satisfying $\mu_\infty(A) \le \gamma$. Using the parametrization (17) in the training problem not only improves the computational efficiency of the optimization but also allows for the design of implicit neural networks with additional structure such as convolutions.
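The PyTorch sketch below shows how these pieces fit together for $\eta = \mathbb{1}_n$: the parametrization (17) of $A$ and a forward pass that runs the average iteration (7) with the step size $\alpha^*$ of Corollary 5(ii). It is a simplified illustration under our own choices (class and parameter names, initialization, ReLU activation, fixed iteration budget, and plain backpropagation through the unrolled loop rather than the implicit differentiation used in the paper and its repository).

```python
import torch
import torch.nn as nn

class NEMON(nn.Module):
    """Minimal sketch of the implicit model (11) with the parametrization (17), eta = 1_n."""

    def __init__(self, n, r, q, gamma=0.95):
        super().__init__()
        self.T = nn.Parameter(torch.randn(n, n) / n)
        self.B = nn.Parameter(torch.randn(n, r) / r ** 0.5)
        self.C = nn.Parameter(torch.randn(q, n) / n ** 0.5)
        self.D = nn.Parameter(torch.zeros(q, r))
        self.gamma = gamma

    def A(self):
        # Equation (17): A = T - diag(|T| 1_n) + gamma I_n guarantees mu_inf(A) <= gamma.
        T = self.T
        return T - torch.diag(T.abs().sum(dim=1)) + self.gamma * torch.eye(T.shape[0], device=T.device)

    def forward(self, u, max_iter=50, tol=1e-6):
        A, B = self.A(), self.B
        # Step size alpha* = 1 / (1 - min_i (A_ii)_-) from Corollary 5(ii).
        alpha = 1.0 / (1.0 - torch.clamp(torch.diagonal(A).min(), max=0.0))
        x = torch.zeros(u.shape[0], A.shape[0], device=u.device)
        for _ in range(max_iter):                   # average iteration (7)
            x_new = (1 - alpha) * x + alpha * torch.relu(x @ A.T + u @ B.T)
            if (x_new - x).abs().max() < tol:
                x = x_new
                break
            x = x_new
        return x @ self.C.T + u @ self.D.T          # y = C x + D u

model = NEMON(n=100, r=784, q=10)
y = model(torch.rand(32, 784))                      # batch of 32 flattened MNIST-like inputs
print(y.shape)                                      # torch.Size([32, 10])
```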
Suppose $u \in \mathbb{R}^{rs^2}$ is an $r$-channel input of size $s \times s$ and $x \in \mathbb{R}^{ns^2}$ is an $n$-channel hidden layer. To define our implicit CNN, we select the weight matrix $A \in \mathbb{R}^{ns^2 \times ns^2}$ as the matrix form of a 2D convolutional operator. If we consider a circular convolution operator, then $A$ is a circulant matrix. Using the parametrization (17), $A$ is circulant if and only if $T$ is circulant. Therefore, the training problem for implicit CNNs can be cast as an unconstrained optimization problem using the above parametrization with a circulant $T$.

Improving robustness via Lipschitz regularization. We now focus on learning robust implicit neural networks with bounded Lipschitz constants via a regularization strategy. Setting both $\|\cdot\|_U$ and $\|\cdot\|_Y$ to $\|\cdot\|_\infty$ in the input-output Lipschitz bound (15), we get
$$\mathrm{Lip}_{u\to y} = \frac{\|B\|_{(\infty,[\eta]^{-1}),\infty}\,\|C\|_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty} \le \frac{\|B\|^2_{(\infty,[\eta]^{-1}),\infty} + \|C\|^2_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty},$$
where the inequality provides a convex upper bound for the input-output Lipschitz constant. Therefore, using the hyperparameter $\lambda > 0$, the regularized optimization problem is written as
$$\min_{A,B,C,D,X}\; L(Y, CX + DU) + \lambda\Big(\frac{\|B\|^2_{(\infty,[\eta]^{-1}),\infty} + \|C\|^2_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty}\Big) \quad \text{s.t.} \quad X = \Phi(AX + BU), \quad \mu_{\infty,[\eta]^{-1}}(A) \le \gamma. \tag{18}$$

Certified adversarial robustness via Lipschitz bounds. Given a nominal input $u \in \mathbb{R}^r$, we consider any perturbed input $v$ within an ℓ∞-ball of radius $\varepsilon$ around $u$. In this case, we have
$$\|y_u - y_v\|_\infty \le \mathrm{Lip}_{u\to y}\,\|u - v\|_\infty \le \mathrm{Lip}_{u\to y}\,\varepsilon. \tag{19}$$
We then define $\mathrm{margin}(u) = (y_u)_i - \max_{j\neq i}(y_u)_j$, where $(y_u)_i$ is the logit corresponding to the (correct) label $i$ for the input $u$. Then, provided $\mathrm{Lip}_{u\to y}\,\varepsilon \le \tfrac{1}{2}\mathrm{margin}(u)$, NEMON is certifiably robust to any perturbed input $v$ within an ℓ∞-ball of radius $\varepsilon$ centered at $u$.

Backpropagation of gradients via average iteration. Following [El Ghaoui et al., 2021], we now show how the average iteration can be used to perform backpropagation via the implicit function theorem. For simplicity, we assume that each activation function $\phi_i$ is differentiable and consider mini-batches of size 1, i.e., we have $X = x \in \mathbb{R}^n$, $U = u \in \mathbb{R}^r$, and $\hat{Y} = \hat{y} \in \mathbb{R}^q$. Let $x$ be the unique solution of the fixed-point equation (11). Then the chain rule implies
$$\frac{\partial L}{\partial A} = (\nabla_x L)\,x^\top, \quad \frac{\partial L}{\partial B} = (\nabla_x L)\,u^\top, \quad \frac{\partial L}{\partial C} = (\nabla_{\hat{y}} L)\,x^\top, \quad \frac{\partial L}{\partial D} = (\nabla_{\hat{y}} L)\,u^\top.$$
Since $L$ depends explicitly on $\hat{y}$, computing $\nabla_{\hat{y}} L$ is straightforward. Computing $\nabla_x L$ is more complicated since $x$ is defined only implicitly. However, it can be shown that $\nabla_x L = \big(C(I - D_\Phi A)^{-1} D_\Phi\big)^\top \nabla_{\hat{y}} L$. Since $\mu_{\infty,[\eta]^{-1}}(A) < 1$, by Lemma 8 we get that $\mu_{\infty,[\eta]^{-1}}(D_\Phi A) < 1$. This implies that the matrix $G := (I_n - D_\Phi A)^{-1} D_\Phi \in \mathbb{R}^{n\times n}$ exists and is the solution to the following fixed-point equation [El Ghaoui et al., 2021, Section 6.2]:
$$G = D_\Phi(AG + I_n). \tag{20}$$
Moreover, $\mu_{\infty,[\eta]^{-1}}(D_\Phi A) < 1$ and Theorem 2 together imply that the fixed-point equation (20) has a unique solution $G$ and, for every $0 < \alpha \le \alpha^* := \big(1 - \min_i(A_{ii})_-\big)^{-1}$, the average iterations
$$G_{k+1} = (1-\alpha)G_k + \alpha D_\Phi(AG_k + I_n), \quad \text{for all } k \in \mathbb{Z}_{\ge 0},$$
are contracting, with minimum contraction factor $1 - \alpha^*\big(1 - \mu_{\infty,[\eta]^{-1}}(A)_+\big)$ at step size $\alpha^*$.
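Combining the Lipschitz bound (15) with the margin condition below (19) gives a simple certification routine. The sketch below (NumPy, $\eta = \mathbb{1}_n$; the helper names are ours) is an illustrative reading of this procedure, not the paper's implementation.

```python
import numpy as np

def inf_norm(M):
    """Induced ell_inf operator norm: maximum absolute row sum."""
    return np.abs(M).sum(axis=1).max()

def mu_inf(A):
    """ell_inf matrix measure: max_i ( a_ii + sum_{j != i} |a_ij| )."""
    return (np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max()

def lip_bound(A, B, C, D):
    """Input-output ell_inf Lipschitz bound (15), with eta = 1_n."""
    return inf_norm(B) * inf_norm(C) / (1.0 - max(mu_inf(A), 0.0)) + inf_norm(D)

def certified_radius(logits, label, lip):
    """Largest eps with lip * eps <= margin / 2: no ell_inf perturbation of size
    up to eps can change the predicted class away from `label` (cf. (19))."""
    margin = logits[label] - np.max(np.delete(logits, label))
    return max(margin, 0.0) / (2.0 * lip)

# Usage on hypothetical trained weights (A, B, C, D) and one input's logits y_u:
# eps_cert = certified_radius(y_u, true_label, lip_bound(A, B, C, D))
```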
## 6 Theoretical and numerical comparisons

In this section, we provide a comprehensive comparison of our framework with state-of-the-art implicit neural networks.²

### 6.1 Implicit network models

We start by reviewing the existing models for implicit networks in the literature.

Implicit deep learning model. [El Ghaoui et al., 2021] proposes a class of implicit neural networks with input-output behavior described by (11). It is shown that a sufficient condition for existence and uniqueness of a solution and for convergence of the Picard iterations for the fixed-point equation $x = \Phi(Ax + Bu)$ is $\lambda_{\mathrm{pf}}(|A|) < 1$, where $|A|$ denotes the entrywise absolute value of the matrix $A$ and $\lambda_{\mathrm{pf}}$ denotes the Perron-Frobenius eigenvalue. For training, the optimization problem (16) is used where the constraint $\mu_{\infty,[\eta]^{-1}}(A) \le \gamma$ is replaced by $\|A\|_\infty \le \gamma$ [El Ghaoui et al., 2021, Equation 6.3].³ It is easy to see that our well-posedness condition in Corollary 5(i) is less conservative than $\lambda_{\mathrm{pf}}(|A|) < 1$ and its convex relaxation $\|A\|_\infty < 1$.

Monotone operator deep equilibrium network (MON). [Winston and Kolter, 2020] proposes to use monotone operator theory to guarantee well-posedness of the fixed-point equation as well as convergence to its solutions. The input-output behavior of the network is described by (11). For training, the optimization problem (16) is used where the constraint $\mu_{\infty,[\eta]^{-1}}(A) \le \gamma$ is replaced by $I_n - \tfrac{1}{2}(A + A^\top) \succeq (1-\gamma)I_n$. In order to ensure that this constraint is always satisfied in the training procedure, the weight matrix $A$ is parametrized as $A = \gamma I_n - W^\top W - Z + Z^\top$, for arbitrary $W, Z \in \mathbb{R}^{n\times n}$ [Winston and Kolter, 2020, Appendix D].⁴ In the context of contraction theory, $I_n - \tfrac{1}{2}(A + A^\top) \succeq (1-\gamma)I_n$ is equivalent to $\mu_2(A) \le \gamma$, as shown in Appendix A. Thus, the parametrization $A = \gamma I_n - W^\top W - Z + Z^\top$ can be considered as the ℓ2-norm version of the parametrization described by equation (17). In other words, the monotone operator network formulation is a Euclidean transcription of the framework we propose in this paper.

² All models were trained using Google Colab with a Tesla P100-PCIE-16GB GPU.
³ The implicit deep learning implementation is available at https://github.com/beeperman/idl.
⁴ The MON implementation is available at https://github.com/locuslab/monotone_op_net.
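To illustrate why the ℓ∞ condition admits a larger search space, the snippet below (our own illustration) samples $A$ from the parametrization (17) with $\gamma = 0.95$ and evaluates the competing quantities: $\mu_\infty(A)$ is at most 0.95 by construction, while $\|A\|_\infty$ and $\lambda_{\mathrm{pf}}(|A|)$, used by the implicit deep learning model, typically far exceed 1; $\mu_2(A)$ is the MON-style quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.95 * np.eye(n)    # parametrization (17)

mu_inf = (np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max()   # NEMON condition
norm_inf = np.abs(A).sum(axis=1).max()                        # convex relaxation ||A||_inf
lam_pf = np.abs(np.linalg.eigvals(np.abs(A))).max()           # Perron-Frobenius eigenvalue of |A|
mu_2 = np.linalg.eigvalsh((A + A.T) / 2.0).max()              # MON-style ell_2 measure

print(f"mu_inf(A) = {mu_inf:7.2f}   (<= 0.95 by construction)")
print(f"||A||_inf = {norm_inf:7.2f}   lambda_pf(|A|) = {lam_pf:7.2f}")
print(f"mu_2(A)   = {mu_2:7.2f}")
```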
### 6.2 MNIST experiments

Figure 1 (left panel: accuracy vs. epochs; right panel: test loss vs. epochs, both on MNIST handwritten digits). Performance comparison between the NEMON model with $\mu_\infty(A) \le 0.95$, the implicit deep learning model with $\|A\|_\infty \le 0.95$, and MON with $I_n - \tfrac{1}{2}(A + A^\top) \succeq 0.05 I_n$ on the MNIST dataset. The curves are generated by the mean accuracy and mean loss over 5 different runs, while light envelopes around the curves correspond to the standard deviation over the runs. Average best accuracy for the NEMON model is 0.9772, while it is 0.9721 for the implicit deep learning model and 0.9762 for the MON model.

In the digit classification dataset MNIST, input data are 28 × 28 pixel images of handwritten digits between 0-9. There are 60000 training images and 10000 test images. For training, images are reshaped into 784-dimensional column vectors and entries are scaled into the range [0, 1]. As a loss function, we use the cross-entropy. All models are of order n = 100, use the ReLU activation function $\phi_i(x) = (x)_+$, and are trained with a batch size of 300 over 10 epochs with a learning rate of 1.5 × 10⁻². Curves for accuracy and loss versus epochs for the three models are shown in Figure 1. Regarding training times, using the average iteration, NEMON took, on average, 12 forward iterations, 13 backward iterations, and 9.8 seconds to train per epoch. Using the Peaceman-Rachford iteration, MON took, on average, 17 forward iterations, 16 backward iterations, and 9.5 seconds to train per epoch. Using the Picard iteration, the implicit deep learning model took, on average, 10 forward iterations, 5 backward iterations, and 5.8 seconds to train per epoch. We observe that the NEMON model performs better than the implicit deep learning model and has comparable performance to MON.

Figure 2 (left panel: test error (%) vs. Lipschitz constant; right panel: accuracy vs. ℓ∞ amplitude of perturbation, both on MNIST handwritten digits). On the left is a plot of test error versus Lipschitz constant for the implicit deep learning model with $\|A\|_\infty \le 0.95$ and for NEMON with $\mu_\infty(A) \le 0.95$, parametrized by the regularization hyperparameter λ. We define the test error as 1 minus the accuracy. On the right is a plot of accuracy versus ℓ∞ perturbation of a deterministic adversarial image inversion attack, where we additionally include the MON model with $I_n - \tfrac{1}{2}(A + A^\top) \succeq 0.05 I_n$.

We also study the robustness of the NEMON model compared to the implicit deep learning model and the MON model on the MNIST dataset. We train various models regularized by the input-output Lipschitz constant as in (18). Additionally, to verify robustness of the different models, we consider several adversarial attacks and plot the accuracy versus the perturbation of such an attack. In Figure 2, we consider a continuous image inversion attack [Hosseini et al., 2017], where each pixel is perturbed in the direction of pixel value inversion with amplitude given by the ℓ∞ perturbation. For more details on this and other types of adversarial perturbations, we refer to Appendix D. We observe that for λ = 10⁻⁵, the regularized NEMON model achieves a two order of magnitude decrease in its input-output Lipschitz constant compared to the un-regularized NEMON model. In addition, we see that the implicit deep learning model and the MON model are more sensitive to the continuous image inversion attack than NEMON. Moreover, as the regularization parameter λ increases, the NEMON model becomes increasingly robust to this attack.

### 6.3 CIFAR-10 experiments

In the image classification dataset CIFAR-10, input data are 32 × 32 color images in 10 classes. There are 50000 training images and 10000 test images. We compare our proposed NEMON model with a convolutional structure to a single convolutional layer MON model. Each model used 81 channels. We train both models with a batch size of 256 and a learning rate of 10⁻³ for 40 epochs. For training, using the average iteration, NEMON took, on average, 10 forward iterations, 10 backward iterations, and 75.0 seconds per epoch to train. Using the Peaceman-Rachford iteration, MON took, on average, 5 forward iterations, 5 backward iterations, and 101.8 seconds per epoch to train. We focus primarily on the robustness of NEMON and MON with respect to ℓ∞-norm bounded perturbations on CIFAR-10. To this end, we additionally trained two NEMON models with regularization parameters λ ∈ {10⁻⁴, 10⁻⁵}. In Figure 3, on the left is a plot of the certified robustness of each of the models via their ℓ∞-Lipschitz constants. For MON, we obtained the ℓ∞-Lipschitz bound using the method in [Pabbaraju et al., 2021] for the ℓ2-Lipschitz bound together with the upper bound $\|u\|_2 \le \sqrt{rs^2}\,\|u\|_\infty$. On the right is a plot of the accuracy of the different models with respect to the projected gradient descent attack. We observe that the un-regularized and regularized NEMON models are more robust to ℓ∞-norm bounded perturbations than MON.
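The certified curves in Figure 3 can be reproduced in principle from a Lipschitz bound and per-example margins. The sketch below (our own helper names, assuming arrays of test-set logits and labels are available) converts an ℓ2 bound to an ℓ∞ bound via $\|u\|_2 \le \sqrt{rs^2}\,\|u\|_\infty$ and evaluates certified robust accuracy at a given ε using condition (19).

```python
import numpy as np

def lip_inf_from_lip_2(lip_2, r, s):
    """Convert an ell_2 input-output Lipschitz bound into an ell_inf bound using
    ||u||_2 <= sqrt(r * s^2) * ||u||_inf for u in R^{r s^2}."""
    return np.sqrt(r * s * s) * lip_2

def certified_accuracy(logits, labels, lip, eps):
    """Fraction of inputs whose prediction is both correct and certified at ell_inf
    radius eps, i.e. Lip * eps <= margin / 2 as in (19). logits: (m, q), labels: (m,)."""
    m = len(labels)
    correct_logit = logits[np.arange(m), labels]
    masked = logits.copy()
    masked[np.arange(m), labels] = -np.inf          # exclude the true class
    margins = correct_logit - masked.max(axis=1)
    return np.mean(margins >= 2.0 * lip * eps)

# Example: certified curve values on a hypothetical eps grid like that of Figure 3:
# curve = [certified_accuracy(logits, labels, lip, e) for e in np.linspace(0.0, 0.01, 6)]
```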
Figure 3 (left panel: certified robust test accuracy vs. ℓ∞ amplitude of perturbation, with legend λ = 10⁻⁴, Lip = 116.1; λ = 10⁻⁵, Lip = 249.2; λ = 0, Lip = 6044.9; MON, Lip = 5398.2; right panel: accuracy vs. ℓ∞ amplitude of perturbation, both on CIFAR-10 images). On the left is a plot of certified robustness via the Lipschitz constants of MON with the constraint $I_n - \tfrac{1}{2}(A + A^\top) \succeq I_n$ and NEMON with the constraint $\mu_\infty(A) \le 0$. On the right is a plot of accuracy versus ℓ∞ perturbation of the projected gradient descent attack.

## 7 Conclusion

Using non-Euclidean contraction theory, we propose a framework to study the stability of fixed-point equations. We apply this framework to analyze well-posedness and convergence of implicit neural networks and to design an efficient training algorithm that incorporates robustness guarantees. For future research, we envision that our framework is applicable to the study of stability and robustness of implicit learning models with additional structure, such as graph neural networks.

## 8 Acknowledgments

The authors thank Ian Manchester for stimulating discussions about contraction theory and the anonymous reviewers for their insightful feedback. This work was supported in part by DTRA under grant HDTRA1-19-1-0017, AFOSR under grant FA9550-22-1-0059, and an NSF Graduate Research Fellowship under grant 2139319.

## References

A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1910.12430.

B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1810.13400.

S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1909.01377.

H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2 edition, 2017. ISBN 978-3-319-48310-8.

N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, pages 3-14, 2017. doi:10.1145/3128572.3140444.

R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1806.07366.

P. L. Combettes and J.-C. Pesquet. Lipschitz certificates for layered network structures driven by averaged activation operators. SIAM Journal on Mathematics of Data Science, 2(2):529-557, 2020. doi:10.1137/19M1272780.

P. L. Combettes and J.-C. Pesquet. Fixed point strategies in data science. IEEE Transactions on Signal Processing, 2021. doi:10.1109/TSP.2021.3069677.

A. Davydov, S. Jafarpour, and F. Bullo. Non-Euclidean contraction theory for robust nonlinear stability. 2021. URL https://arxiv.org/abs/2103.12263.

C. A. Desoer and H. Haneda. The measure of a matrix as a tool to analyze computer algorithms for circuit analysis. IEEE Transactions on Circuit Theory, 19(5):480-486, 1972. doi:10.1109/TCT.1972.1083507.

L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930-958, 2021. doi:10.1137/20M1358517.

Y. Fang and T. G. Kincaid. Stability analysis of dynamical neural networks.
IEEE Transactions on Neural Networks, 7(4):996-1006, 1996. doi:10.1109/72.508941.

M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1906.04893.

M. Fazlyab, M. Morari, and G. J. Pappas. Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. IEEE Transactions on Automatic Control, 2020. doi:10.1109/TAC.2020.3046193.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6572.

W. He and J. Cao. Exponential synchronization of chaotic neural networks: a matrix measure approach. Nonlinear Dynamics, 55:55-65, 2009. doi:10.1007/s11071-008-9344-4.

H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran. On the limitation of convolutional neural networks in recognizing negative images. In IEEE International Conference on Machine Learning and Applications, pages 352-358, 2017. doi:10.1109/ICMLA.2017.0-136.

A. Kag, Z. Zhang, and V. Saligrama. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HylpqA4FwS.

K. Kawaguchi. On the theory of implicit deep learning: Global convergence with implicit layers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=p-NZIuwqhI4.

V. Krishnan, A. A. Al Makdah, and F. Pasqualetti. Lipschitz bounds and provably robust training by Laplacian smoothing. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2006.03712.

T. C. Lim. On fixed point stability for set-valued contractive mappings with applications to generalized differential equations. Journal of Mathematical Analysis and Applications, 110(2):436-441, 1985. doi:10.1016/0022-247X(85)90306-3.

W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683-696, 1998. doi:10.1016/S0005-1098(98)00019-3.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Machine Learning, 2018. URL https://arxiv.org/abs/1706.06083.

C. Pabbaraju, E. Winston, and J. Z. Kolter. Estimating Lipschitz constants of monotone deep equilibrium models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=VcB4QkSfyO.

N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pages 582-597, 2016. doi:10.1109/SP.2016.41.

P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer. Training robust neural networks using Lipschitz bounds. IEEE Control Systems Letters, 2021. doi:10.1109/LCSYS.2021.3050444.

A. Pavlov, A. Pogromsky, N. Van de Wouw, and H. Nijmeijer. Convergent dynamics, a tribute to Boris Pavlovich Demidovich. Systems & Control Letters, 52(3-4):257-261, 2004. doi:10.1016/j.sysconle.2004.02.003.

H. Qiao, J. Peng, and Z.-B. Xu. Nonlinear measures: A new approach to exponential stability analysis for Hopfield-type neural networks. IEEE Transactions on Neural Networks, 12(2):360-370, 2001. doi:10.1109/72.914530.

M. Revay, R. Wang, and I. R. Manchester. Lipschitz bounded equilibrium networks. 2020. URL https://arxiv.org/abs/2010.01732.
M. Revay, R. Wang, and I. R. Manchester. A convex parameterization of robust recurrent neural networks. IEEE Control Systems Letters, 5(4):1363-1368, 2021. doi:10.1109/LCSYS.2020.3038221.

E. K. Ryu and S. Boyd. Primer on monotone operator methods. Applied Computational Mathematics, 15(1):3-43, 2016.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6199.

E. Winston and J. Z. Kolter. Monotone operator equilibrium networks. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2006.08591.

E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286-5295, 2018. URL http://proceedings.mlr.press/v80/wong18a.html.