# Robust Implicit Networks via Non-Euclidean Contractions

Saber Jafarpour¹*, Alexander Davydov¹*, Anton V. Proskurnikov²,³, and Francesco Bullo¹

¹ Center for Control, Dynamical Systems and Computation, University of California, Santa Barbara, 93106-5070, USA, {saber, davydov, bullo}@ucsb.edu. ² Department of Electronics and Telecommunications, Politecnico di Torino, Turin, Italy. ³ Institute for Problems in Mechanical Engineering, Russian Academy of Sciences, St. Petersburg, Russia, anton.p.1982@ieee.org. * These authors contributed equally.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Implicit neural networks, a.k.a. deep equilibrium networks, are a class of implicit-depth learning models where function evaluation is performed by solving a fixed-point equation. They generalize classic feedforward models and are equivalent to infinite-depth weight-tied feedforward networks. While implicit models show improved accuracy and significant reduction in memory consumption, they can suffer from ill-posedness and convergence instability. This paper provides a new framework, which we call Non-Euclidean Monotone Operator Network (NEMON), to design well-posed and robust implicit neural networks based upon contraction theory for the non-Euclidean norm ℓ∞. Our framework includes (i) a novel condition for well-posedness based on one-sided Lipschitz constants, (ii) an average iteration for computing fixed points, and (iii) explicit estimates on input-output Lipschitz constants. Additionally, we design a training problem with the well-posedness condition and the average iteration as constraints and, to achieve robust models, with the input-output Lipschitz constant as a regularizer. Our ℓ∞ well-posedness condition leads to a larger polytopic training search space than existing conditions, and our average iteration enjoys accelerated convergence. Finally, we evaluate our framework in image classification through the MNIST and the CIFAR-10 datasets. Our numerical results demonstrate improved accuracy and robustness of the implicit models with smaller input-output Lipschitz bounds. Code is available at https://github.com/davydovalexander/Non-Euclidean_Mon_Op_Net.

## 1 Introduction

Implicit neural networks are infinite-depth learning models with layers defined implicitly through a fixed-point equation. Examples of implicit neural networks include deep equilibrium models [Bai et al., 2019] and implicit deep learning models [El Ghaoui et al., 2021]. Implicit networks can be considered as generalizations of feedforward neural networks with input-injected weight tying, i.e., training parameters are transferable between layers. Indeed, in implicit networks, function evaluation is executed by solving a fixed-point equation, and backpropagation is implemented by computing gradients using implicit differentiation. Due to these unique features, implicit models enjoy more flexibility and improved memory efficiency compared to traditional neural networks. At the same time, implicit networks can suffer from instability in their training due to the nonlinear nature of their fixed-point equations and can show brittle input-output behaviors due to their model flexibility. It is known that implicit neural networks require careful tuning and initialization to avoid ill-posed training procedures.
Indeed, without additional assumptions, their fixed-point equation may not have a unique solution and the numerical algorithms for finding their solutions might not converge. Several recent works in the literature have focused on studying well-posedness and convergence of the fixed-point equations of implicit networks using frameworks such as monotone operator theory [Winston and Kolter, 2020], contraction theory [El Ghaoui et al., 2021], and a mixture of both [Revay et al., 2020]. Despite several insightful results, important questions about conditions for well-posedness of implicit networks and efficient algorithms that converge to their solutions remain open.

One of the key features of implicit neural networks is their flexibility, which might come at the cost of low input-output robustness. As first noted in [Szegedy et al., 2014], the input-output behavior of deep neural networks can be vulnerable to perturbations; close enough input data can lead to completely different outputs. This lack of robustness can lead to unreliable performance of neural networks in safety-critical applications. Among several notions of robustness, the Lipschitz constant of a neural network is a coarse but rigorous measure which can be used to estimate input-output sensitivity of the network [Szegedy et al., 2014]. For this reason, there has been a growing interest in estimating the input-output Lipschitz constant of deep neural networks with respect to the ℓ2-norm [Fazlyab et al., 2019, Combettes and Pesquet, 2020]. However, it turns out that in some applications, input-output Lipschitz constants with respect to non-Euclidean norms are more informative measures for studying robustness. One such application appears in the robustness analysis of neural networks with large-scale inputs under widely-distributed adversarial perturbations (examples of these adversarial perturbations can be found in [Szegedy et al., 2014]). For these examples, the input-output ℓ2-Lipschitz constant does not provide complete information about robustness of the network; a neural network with a small input-output ℓ2-Lipschitz constant can be very sensitive to widespread entrywise-small perturbations of the input signal. On the other hand, the input-output ℓ∞-Lipschitz constant provides a different metric which appears to be well-suited for the analysis of widespread distributed perturbations. Another application is the estimation of input signal confidence intervals from output deviations, where the input-output ℓ∞-Lipschitz constant of the network provides more scalable bounds than its ℓ2 counterpart.

Related works

Implicit learning models. Numerous works in learning theory have shown the power of deep learning models with implicit layers. In these learning models, the notion of layers is replaced by a composition rule, which can be either a fixed-point iteration or a solution to a differential equation. Well-known frameworks for deep learning using implicit infinite-depth layers include deep equilibrium networks [Bai et al., 2019], implicit deep learning [El Ghaoui et al., 2021], and Neural ODEs [Chen et al., 2018]. In [Kag et al., 2020], a class of implicit recurrent neural networks is considered and it is demonstrated that, with this architecture, the models do not suffer from vanishing nor exploding gradients. Implicit layers have also been used to study convex optimization problems [Agrawal et al., 2019] and to design control strategies [Amos et al., 2018].
Convergence to global minima of certain classes of implicit networks is studied in [Kawaguchi, 2021].

Well-posedness and numerical algorithms for fixed-point equations. There has been recent interest in studying well-posedness and numerical stability of implicit-depth learning models. [El Ghaoui et al., 2021] proposes a sufficient spectral condition for well-posedness and for convergence of the Picard iterations associated with the fixed-point equation of implicit networks. In [Winston and Kolter, 2020, Revay et al., 2020], using monotone operator theory, a suitable parametrization of the weight matrix is proposed which guarantees the stable convergence of suitable fixed-point iterations. A recent influential survey on monotone operators is [Ryu and Boyd, 2016]. A recent survey on fixed-point strategies in data science is given by [Combettes and Pesquet, 2021].

Robustness of learning models. It is known that neural networks can be vulnerable to adversarial input perturbations [Szegedy et al., 2014]. A large body of literature is devoted to improving the robustness of neural networks using various defense strategies against adversarial examples [Goodfellow et al., 2015, Papernot et al., 2016]. While these strategies are effective in many scenarios, they do not provide formal guarantees for robustness [Carlini and Wagner, 2017]. However, there has been a recent interest in designing classifiers that are provably robust with respect to adversarial perturbations [Madry et al., 2018, Wong and Kolter, 2018]. The input-output Lipschitz constant of a neural network is a rigorous metric for its worst-case sensitivity with respect to input perturbations. Several recent works have focused on estimating the Lipschitz constant and enforcing its boundedness. For example, [Fazlyab et al., 2019, 2020] propose a convex optimization framework using quadratic constraints and semidefinite programming to obtain upper bounds on Lipschitz constants of deep neural networks. In [Pauli et al., 2021], a training algorithm is designed to ensure boundedness of the Lipschitz constant of the neural network via a semidefinite program. Other methods for estimating the Lipschitz constant of deep neural networks include [Krishnan et al., 2020, Revay et al., 2021, Combettes and Pesquet, 2020]. For implicit neural networks, a sensitivity-based robustness analysis is proposed in [El Ghaoui et al., 2021]. Lipschitz constants of deep equilibrium networks have also been studied in [Pabbaraju et al., 2021, Revay et al., 2020] using monotone operator theory.

Contributions

In this paper, using non-Euclidean contraction theory with respect to the ℓ∞-norm, we propose our novel framework, Non-Euclidean Monotone Operator Network (NEMON), to design implicit neural networks and study their well-posedness, stability, and robustness. First, we develop elements of a novel non-Euclidean monotone operator theory akin to the frameworks in [Bauschke and Combettes, 2017, Ryu and Boyd, 2016]. Using the concept of matrix measure, we introduce the essential notion of the one-sided Lipschitz constant of a map. Based upon this notion, we prove a general fixed-point theorem with weaker requirements than classical results on Picard and Krasnosel'skii-Mann iterations. For maps with one-sided Lipschitz constant less than unity, we show that an average iteration converges for sufficiently small step sizes and we optimize its rate of convergence.
For the special case of the weighted ℓ∞-norm, we show that this average iteration can be accelerated by choosing a larger step size. Additionally, we study perturbed fixed-point equations and establish a bound on the distance between perturbed and nominal equilibrium points as a function of the one-sided Lipschitz constant. Second, for implicit neural networks, we use our new fixed-point theorem to (i) establish ℓ∞-norm conditions for their well-posedness, (ii) design accelerated numerical algorithms for computing their solutions, and (iii) provide upper bounds on their input-output ℓ∞-Lipschitz constants. Third, we propose a parametrization for matrices with an appropriate bound on their one-sided Lipschitz constants and use this parametrization with the average iteration to design a training optimization problem. Finally, we perform several numerical experiments illustrating improved performance of NEMON in image classification on the MNIST and the CIFAR-10 datasets compared to the state-of-the-art models in [El Ghaoui et al., 2021, Winston and Kolter, 2020]. Additionally, by adding the input-output Lipschitz constant as a regularizer in the training problem, we observe improved robustness to some classes of adversarial perturbations. We include all relevant proofs in Appendix C.

## 2 Review material

Matrix measures. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$ and also denote its induced norm on $\mathbb{R}^{n\times n}$. The matrix measure of $A \in \mathbb{R}^{n\times n}$ with respect to $\|\cdot\|$ is defined by
$$\mu(A) := \lim_{h\to 0^+} \frac{\|I_n + hA\| - 1}{h},$$
that is, the one-sided directional derivative of the induced norm in the direction of $A$, evaluated at $I_n$. Remarkably, the matrix measure is a tighter upper bound on the spectral abscissa of $A$ than $\|A\|$, and the set of matrices $A \in \mathbb{R}^{n\times n}$ satisfying $\mu(A) \le 1$ is an unbounded subset of $\mathbb{R}^{n\times n}$ strictly containing the compact ball $\|A\| \le 1$. We refer to [Desoer and Haneda, 1972] for a list of properties enjoyed by matrix measures. We will be specifically interested in diagonally weighted ℓ∞-norms defined by
$$\|x\|_{\infty,[\eta]^{-1}} = \max_i \frac{1}{\eta_i}\,|x_i|, \tag{1}$$
where, given a positive vector $\eta \in \mathbb{R}^n_{>0}$, we use $[\eta]$ to denote the diagonal matrix with diagonal entries $\eta$. The corresponding matrix norm and measure are
$$\|A\|_{\infty,[\eta]^{-1}} = \max_{i\in\{1,\dots,n\}} \sum_{j=1}^n \frac{\eta_j}{\eta_i}\,|a_{ij}|, \qquad \mu_{\infty,[\eta]^{-1}}(A) = \max_{i\in\{1,\dots,n\}} \Big( a_{ii} + \sum_{j=1,\, j\neq i}^n \frac{\eta_j}{\eta_i}\,|a_{ij}| \Big). \tag{2}$$

Lipschitz maps. Given a norm $\|\cdot\|$ and its induced matrix norm, a differentiable map $F : \mathbb{R}^n \to \mathbb{R}^n$ is Lipschitz continuous with constant $\mathrm{Lip}(F) \in \mathbb{R}_{\ge 0}$ if
$$\|DF(x)\| \le \mathrm{Lip}(F) \quad \text{for all } x \in \mathbb{R}^n. \tag{3}$$
For example, for an affine map $F(x) = Ax + b$, the (smallest) Lipschitz constant is $\mathrm{Lip}(F) = \|A\|$.

One-sided Lipschitz maps. Given a norm $\|\cdot\|$ with induced matrix measure $\mu(\cdot)$, a differentiable map $F : \mathbb{R}^n \to \mathbb{R}^n$ is one-sided Lipschitz continuous with constant $\mathrm{osL}(F) \in \mathbb{R}$ if
$$\mu(DF(x)) \le \mathrm{osL}(F) \quad \text{for all } x \in \mathbb{R}^n. \tag{4}$$
For example, for an affine map $F(x) = Ax + b$, the (smallest) one-sided Lipschitz constant is $\mathrm{osL}(F) = \mu(A)$. Note that (i) the one-sided Lipschitz constant is upper bounded by the Lipschitz constant, (ii) a Lipschitz continuous map is always one-sided Lipschitz continuous, and (iii) the one-sided Lipschitz constant may be negative. For a more in-depth review we refer to Appendix A. The notion of one-sided Lipschitz continuity unifies several important concepts in dynamical systems and optimization theory. In operator theory, the map $F$ is called a monotone operator if $-F$ is one-sided Lipschitz continuous with respect to the ℓ2-norm with constant $\mathrm{osL}(-F) \le 0$ [Ryu and Boyd, 2016, Bauschke and Combettes, 2017]. In control theory, the vector field $F$ is called strongly infinitesimally contracting if it is one-sided Lipschitz continuous with constant $\mathrm{osL}(F) < 0$ [Desoer and Haneda, 1972, Lohmiller and Slotine, 1998, Pavlov et al., 2004]. In what follows, we let $\mathrm{osL}_{\infty,[\eta]^{-1}}(F) \in \mathbb{R}$ denote the one-sided Lipschitz constant with respect to the weighted ℓ∞-norm.
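To make the weighted ℓ∞ quantities in (1)-(2) concrete, the following NumPy sketch (our own illustration; the function names are not from the paper's repository) evaluates the weighted vector norm, the induced matrix norm, and the matrix measure, and shows that $\mu_{\infty,[\eta]^{-1}}(A)$ can be negative even when $\|A\|_{\infty,[\eta]^{-1}}$ is large.

```python
import numpy as np

def weighted_inf_norm(x, eta):
    """||x||_{inf,[eta]^{-1}} = max_i |x_i| / eta_i, as in equation (1)."""
    return np.max(np.abs(x) / eta)

def weighted_inf_matrix_norm(A, eta):
    """Induced matrix norm: max_i sum_j (eta_j / eta_i) |a_ij|."""
    scaled = np.abs(A) * eta[None, :] / eta[:, None]
    return scaled.sum(axis=1).max()

def weighted_inf_matrix_measure(A, eta):
    """Matrix measure: max_i ( a_ii + sum_{j != i} (eta_j / eta_i) |a_ij| )."""
    scaled = np.abs(A) * eta[None, :] / eta[:, None]
    off_diag = scaled.sum(axis=1) - np.diag(scaled)
    return (np.diag(A) + off_diag).max()

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) - 5.0 * np.eye(4)   # strongly negative diagonal
eta = np.ones(4)
print(weighted_inf_matrix_norm(A, eta))      # typically well above 1
print(weighted_inf_matrix_measure(A, eta))   # can be negative: mu(A) <= ||A||
```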
## 3 Fixed-point equations and one-sided Lipschitz constants

In this section, we show that the notion of one-sided Lipschitz constant can be used to study solvability of the fixed-point equation
$$x = F(x), \tag{5}$$
where $F : \mathbb{R}^n \to \mathbb{R}^n$ is a differentiable map. Let $\|\cdot\|$ be a norm on $\mathbb{R}^n$; then, in view of the Banach fixed-point theorem, a simple sufficient condition for existence of a unique solution of the fixed-point equation (5) is $\mathrm{Lip}(F) < 1$. We note that the sufficient condition $\mathrm{Lip}(F) < 1$ depends on the specific form of the fixed-point equation (5) and can be relaxed by a suitable rewriting of this fixed-point equation. Given an averaging parameter $\alpha \in (0, 1]$, we define the average map $F_\alpha : \mathbb{R}^n \to \mathbb{R}^n$ by $F_\alpha := (1-\alpha)I + \alpha F$, where $I$ is the identity map. Using this notion, an equivalent reformulation of the fixed-point equation (5) is
$$x = (1-\alpha)x + \alpha F(x) = F_\alpha(x). \tag{6}$$
For $\alpha = 1$, we have $F_\alpha(x) = F(x)$ and equation (6) coincides with equation (5). For every $\alpha \in (0,1)$, the map $F_\alpha$ is different from $F$ but equations (5) and (6) are equivalent. Hence, if $\mathrm{Lip}(F_\alpha) < 1$, then by the Banach fixed-point theorem, the fixed-point equation (6) (and therefore the fixed-point equation (5)) has a unique solution $x^*$ and the sequence $\{y_k\}_{k=1}^{\infty}$ defined by
$$y_{k+1} = (1-\alpha)y_k + \alpha F(y_k), \quad \text{for all } k \in \mathbb{Z}_{\ge 0}, \tag{7}$$
converges geometrically to $x^*$ with rate $\mathrm{Lip}(F_\alpha)$. As a result of the parametrization (6), the condition $\mathrm{Lip}(F) < 1$ for existence and uniqueness of the fixed point can be relaxed to the sufficient conditions
$$\mathrm{Lip}(F_\alpha) < 1, \tag{8}$$
parametrized by $\alpha \in (0,1]$. Additionally, if condition (8) is satisfied, then algorithm (7) computes the fixed point $x^*$. It can be shown that condition (8) becomes less conservative as $\alpha$ decreases. The next theorem shows that, in the limit as $\alpha \to 0^+$, condition (8) approaches the condition $\mathrm{osL}(F) < 1$.

Theorem 1 (Fixed points via one-sided Lipschitz conditions). Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be differentiable and Lipschitz with constant $\ell > 0$ with respect to a norm $\|\cdot\|$. Define the average map $F_\alpha = (1-\alpha)I + \alpha F$ and, for $c > 0$, the function $\gamma_{\ell,c} : \big]0, \tfrac{c}{(c+\ell+1)(\ell+1)}\big[ \to \mathbb{R}$ by
$$\gamma_{\ell,c}(\alpha) := 1 - \alpha c + \alpha^2(\ell+1)^2.$$
Then the following statements are equivalent:
(i) $\mathrm{osL}(F) < 1 - c$,
(ii) $\mathrm{Lip}(F_\alpha) \le \gamma_{\ell,c}(\alpha)$, for $0 < \alpha < \tfrac{c}{(c+\ell+1)(\ell+1)}$.
Moreover, if the equivalent conditions (i) or (ii) hold, then, for the condition number $\kappa = \tfrac{1+\mathrm{Lip}(F)}{1-\mathrm{osL}(F)}$,
(iii) $F$ has a unique fixed point $x^*$;
(iv) for $0 < \alpha < \tfrac{1}{\kappa(\kappa+1)}$, $F_\alpha$ is a contraction mapping with contraction factor $\gamma_{\ell,c}(\alpha) < 1$;
(v) the minimum contraction factor is $\gamma^*_{\ell,c} = 1 - \tfrac{1}{4\kappa^2} + \tfrac{1}{8\kappa^3} + O\big(\tfrac{1}{\kappa^4}\big)$ and the minimizing averaging parameter $\alpha^*$ of $F_\alpha$ satisfies $\alpha^*\,(1-\mathrm{osL}(F)) = \tfrac{1}{2\kappa^2} - \tfrac{3}{8\kappa^3} + O\big(\tfrac{1}{\kappa^4}\big)$.

The average iteration (6) is often referred to as the Krasnosel'skii-Mann iteration or the damped iteration [Bauschke and Combettes, 2017]. Compared to [Bauschke and Combettes, 2017, Theorem 5.15], Theorem 1(iv) studies convergence of the Krasnosel'skii-Mann iteration for arbitrary norms and proposes a weaker convergence condition of the form $\mathrm{osL}(F) < 1$ (hence, $F$ need not be nonexpansive). However, it ensures convergence only for sufficiently small $\alpha > 0$ and assumes that $F$ is differentiable (as will be shown, however, the latter assumption can be relaxed).
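As an illustration of the average iteration (7), the sketch below (our own toy example: $F(x) = \tanh(Ax + b)$ with $A$ chosen so that $\mu_\infty(A) \le 0.9$, which gives $\mathrm{osL}_\infty(F) < 1$ even though $\mathrm{Lip}(F)$ may exceed 1 and the plain Picard iteration $\alpha = 1$ need not converge) runs $x_{k+1} = (1-\alpha)x_k + \alpha F(x_k)$ with a small step size until a tolerance is met.

```python
import numpy as np

def average_iteration(F, x0, alpha, tol=1e-10, max_iter=100_000):
    """Average (Krasnosel'skii-Mann) iteration x_{k+1} = (1-alpha) x_k + alpha F(x_k)."""
    x = np.asarray(x0, dtype=float)
    for k in range(1, max_iter + 1):
        x_next = (1.0 - alpha) * x + alpha * F(x)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, k
        x = x_next
    return x, max_iter

# Toy map: F(x) = tanh(A x + b) with mu_inf(A) <= 0.9 < 1 by construction,
# even though ||A||_inf (and hence Lip(F)) can be much larger than 1.
rng = np.random.default_rng(1)
n = 10
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.9 * np.eye(n)
b = rng.standard_normal(n)
F = lambda x: np.tanh(A @ x + b)

x_star, iters = average_iteration(F, np.zeros(n), alpha=0.05)
print(iters, np.max(np.abs(x_star - F(x_star))))   # iterations used, fixed-point residual
```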
### 3.1 Accelerated convergence for weighted ℓ∞-norms

For diagonally weighted ℓ∞-norms, one can strengthen Theorem 1(iv) to prove convergence of the average iteration (6) on a larger domain of the parameter $\alpha$.

Theorem 2 (Accelerated fixed-point algorithm for ℓ∞-norms). Let $F : \mathbb{R}^n \to \mathbb{R}^n$ be differentiable and Lipschitz with respect to the weighted non-Euclidean norm $\|\cdot\|_{\infty,[\eta]^{-1}}$. Define the average map $F_\alpha = (1-\alpha)I + \alpha F$ and pick $\mathrm{diagL}(F) \in [-\mathrm{Lip}(F), \mathrm{osL}(F)]$ to satisfy
$$\mathrm{diagL}(F) \le \min_{i\in\{1,\dots,n\}}\; \inf_{x\in\mathbb{R}^n} DF_{ii}(x). \tag{9}$$
If $\mathrm{osL}(F) < 1$, then $F$ has a unique fixed point $x^*$ and
(i) for $0 < \alpha \le \tfrac{1}{1-\mathrm{diagL}(F)}$, $F_\alpha$ is a contraction mapping with contraction factor $1 - \alpha(1-\mathrm{osL}(F)) < 1$;
(ii) the minimum contraction factor and minimizing averaging parameter of $F_\alpha$ are, respectively,
$$\mathrm{Lip}(F_{\alpha^*}) = 1 - \frac{1-\mathrm{osL}(F)}{1-\mathrm{diagL}(F)} = 1 - \frac{1}{\kappa_\infty}, \quad \text{for } \kappa_\infty = \frac{1-\mathrm{diagL}(F)}{1-\mathrm{osL}(F)} \le \frac{1+\mathrm{Lip}(F)}{1-\mathrm{osL}(F)} = \kappa, \qquad \alpha^* = \frac{1}{1-\mathrm{diagL}(F)}.$$

Note that $\mathrm{diagL}(F)$ is well defined because of the Lipschitz continuity assumption. Specifically, the tightest choice in (9) is $\mathrm{diagL}(F) = \min_{i\in\{1,\dots,n\}} \inf_{x\in\mathbb{R}^n} DF_{ii}(x)$, which is determined by the one-sided Lipschitz behavior of the scalar maps $x_i \mapsto F_i(x_i, x_{-i})$ with $x_{-i} = (x_1,\dots,x_{i-1},x_{i+1},\dots,x_n)$ held fixed. It is instructive to compare the minimum contraction factor in the general Theorem 1 with the minimum contraction factor for ℓ∞-norms in Theorem 2 and how they depend upon the corresponding condition numbers $\kappa$ and $\kappa_\infty$. We note that (i) the relevant condition number diminishes, $\kappa_\infty \le \kappa$, and (ii) the minimum contraction factor $\mathrm{Lip}(F_{\alpha^*}) = 1 - \tfrac{1}{4\kappa^2} + O(1/\kappa^4)$ improves to $\mathrm{Lip}(F_{\alpha^*}) = 1 - \tfrac{1}{\kappa_\infty}$. This acceleration justifies the title of this section.

### 3.2 Perturbed fixed-point problems

In this subsection, we focus on solvability of the perturbed fixed-point equation
$$x = F(x, u), \tag{10}$$
where $F : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is differentiable in $x$. We define $F_u(x) = F(x,u)$ and $F_x(u) = F(x,u)$. Given a norm $\|\cdot\|_X$ on $\mathbb{R}^n$ and a norm $\|\cdot\|_U$ on $\mathbb{R}^r$, $F$ is Lipschitz in its first argument with constant $\mathrm{Lip}_x(F) \in \mathbb{R}_{\ge 0}$ if
$$\|F(x_1, u) - F(x_2, u)\|_X \le \mathrm{Lip}_x(F)\,\|x_1 - x_2\|_X \quad \text{for all } x_1, x_2 \in \mathbb{R}^n \text{ and } u \in \mathbb{R}^r,$$
it is Lipschitz in its second argument with constant $\mathrm{Lip}_u(F) \in \mathbb{R}_{\ge 0}$ if
$$\|F(x, u_1) - F(x, u_2)\|_X \le \mathrm{Lip}_u(F)\,\|u_1 - u_2\|_U \quad \text{for all } x \in \mathbb{R}^n \text{ and } u_1, u_2 \in \mathbb{R}^r,$$
and it is one-sided Lipschitz in its first argument with constant $\mathrm{osL}_x(F) \in \mathbb{R}$ if $\mu(D_x F(x,u)) \le \mathrm{osL}_x(F)$ for all $x \in \mathbb{R}^n$ and $u \in \mathbb{R}^r$. The following result, which is in the spirit of Lim's lemma [Lim, 1985], provides an upper bound on the distance between fixed points of the perturbed equation (10).

Theorem 3 (Perturbed fixed points). Given a norm $\|\cdot\|_X$ on $\mathbb{R}^n$ and a norm $\|\cdot\|_U$ on $\mathbb{R}^r$, consider a map $F : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ differentiable in the first argument and Lipschitz in both arguments. If $F$ is one-sided Lipschitz with constant $\mathrm{osL}_x(F) < 1$, then
(i) for every $u \in \mathbb{R}^r$, the map $F_u$ has a unique fixed point $x^*_u$;
(ii) for every $u, v \in \mathbb{R}^r$,
$$\|x^*_u - x^*_v\|_X \le \frac{\mathrm{Lip}_u(F)}{1 - \mathrm{osL}_x(F)}\,\|u - v\|_U.$$

Finally, Theorems 1, 2, and 3 are not directly applicable to activation functions that are not differentiable. In Appendix C.3, we show that, for the specific form of the fixed-point equation (5) where $F = \Phi \circ H$, $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is a weakly increasing, non-expansive, diagonal activation function, and $H : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is a differentiable function, all of the conclusions of Theorems 1, 2, and 3 hold by requiring equation (9) to be true almost everywhere.
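The following sketch (again a toy example of ours, with $F(x,u) = \tanh(Ax + Bu)$ and $A$ built so that $\mu_\infty(A) \le 0.9$, anticipating the parametrization (17) of Section 5) illustrates the accelerated step size $\alpha^* = 1/(1-\mathrm{diagL}(F))$ of Theorem 2 and numerically checks the perturbation bound of Theorem 3(ii).

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 10, 3
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.9 * np.eye(n)   # mu_inf(A) <= 0.9 < 1
B = rng.standard_normal((n, r))

def solve(u, alpha, tol=1e-12, max_iter=200_000):
    """Average iteration for the perturbed fixed point x = tanh(A x + B u)."""
    x = np.zeros(n)
    for k in range(1, max_iter + 1):
        x_next = (1 - alpha) * x + alpha * np.tanh(A @ x + B @ u)
        if np.max(np.abs(x_next - x)) < tol:
            return x_next, k
        x = x_next
    return x, max_iter

# Accelerated step size of Theorem 2: alpha* = 1 / (1 - diagL(F)).
diagL = min(np.diag(A).min(), 0.0)
alpha_star = 1.0 / (1.0 - diagL)
u = rng.standard_normal(r)
x_u, iters_fast = solve(u, alpha_star)
_, iters_slow = solve(u, 0.1 * alpha_star)
print(iters_fast, iters_slow)          # alpha* needs far fewer iterations

# Theorem 3(ii): the fixed point moves by at most Lip_u / (1 - osL_x) * ||u - v||_inf.
v = u + 0.01 * rng.standard_normal(r)
x_v, _ = solve(v, alpha_star)
mu_plus = max((np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max(), 0.0)
bound = np.abs(B).sum(axis=1).max() / (1.0 - mu_plus) * np.max(np.abs(u - v))
print(np.max(np.abs(x_u - x_v)), "<=", bound)
```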
## 4 Contraction analysis of implicit neural networks

In this section, we use contraction theory to lay the foundation for our Non-Euclidean Monotone Operator Network (NEMON) model of implicit neural networks. Given $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times r}$, $C \in \mathbb{R}^{q\times n}$, and $D \in \mathbb{R}^{q\times r}$, we consider the implicit neural network
$$x = \Phi(Ax + Bu) =: N(x, u), \qquad y = Cx + Du, \tag{11}$$
where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^r$, $y \in \mathbb{R}^q$, and $\Phi : \mathbb{R}^n \to \mathbb{R}^n$ is defined by $\Phi(x) = (\phi_1(x_1), \dots, \phi_n(x_n))$. For every $i \in \{1,\dots,n\}$, we assume the activation function $\phi_i : \mathbb{R} \to \mathbb{R}$ is weakly increasing, i.e., $\phi_i(x_i) \le \phi_i(z_i)$ for $x_i \le z_i$, and non-expansive, i.e., $|\phi_i(x_i) - \phi_i(z_i)| \le |x_i - z_i|$ for all $x_i$ and $z_i$; if $\phi_i$ is differentiable, these conditions are equivalent to $0 \le \phi_i'(x_i) \le 1$ for all $x_i \in \mathbb{R}$. We are able to provide the following estimates on all relevant Lipschitz constants.

Theorem 4 (Lipschitz and one-sided Lipschitz constants for the implicit neural network). Consider the implicit neural network in equation (11) with weakly increasing and non-expansive activation functions $\Phi$. With respect to $\|\cdot\|_{\infty,[\eta]^{-1}}$, $\eta \in \mathbb{R}^n_{>0}$, on $\mathbb{R}^n$ and $\|\cdot\|_U$ on the input space $\mathbb{R}^r$, the map $N : \mathbb{R}^n \times \mathbb{R}^r \to \mathbb{R}^n$ is one-sided Lipschitz continuous in the first variable and Lipschitz continuous in both variables with constants
$$\mathrm{osL}_x(N) = \mu_{\infty,[\eta]^{-1}}(A)_+, \qquad \mathrm{Lip}_x(N) = \|A\|_{\infty,[\eta]^{-1}}, \tag{12}$$
$$\mathrm{Lip}_u(N) = \|B\|_{(\infty,[\eta]^{-1}),U}, \qquad \mathrm{diagL}(N) = \min_{i\in\{1,\dots,n\}} (A_{ii})_-, \tag{13}$$
where $(z)_+ = z$ if $z \ge 0$ and $(z)_+ = 0$ if $z < 0$, and $(z)_- = 0$ if $z \ge 0$ and $(z)_- = z$ if $z < 0$.

We now use these estimates to establish multiple properties of the implicit neural network.

Corollary 5 (Well-posedness, input-state Lipschitz constant, and computation). Consider the model (11), with parameters $(A, B, C, D)$ and with weakly increasing and non-expansive activation functions $\Phi$. Define the average map $N_\alpha := (1-\alpha)I + \alpha N$ and consider the norms $\|\cdot\|_{\infty,[\eta]^{-1}}$, $\eta \in \mathbb{R}^n_{>0}$, on $\mathbb{R}^n$, $\|\cdot\|_U$ on the input space $\mathbb{R}^r$, and $\|\cdot\|_Y$ on the output space $\mathbb{R}^q$. Then
(i) if $\mu_{\infty,[\eta]^{-1}}(A) < 1$, then (11) is well posed, i.e., there exists a unique fixed point;
(ii) the map $N_\alpha$ is a contraction mapping for $0 < \alpha \le \alpha^* := \big(1 - \min_{i\in\{1,\dots,n\}}(A_{ii})_-\big)^{-1}$ with minimum contraction factor
$$\mathrm{Lip}(N_{\alpha^*}) = 1 - \frac{1 - \mu_{\infty,[\eta]^{-1}}(A)_+}{1 - \min_{i\in\{1,\dots,n\}}(A_{ii})_-};$$
(iii) the Lipschitz constants from the input $u$ to the fixed point $x^*_u$ and to the output $y = Cx^*_u + Du$ are
$$\mathrm{Lip}_{u\to x} := \frac{\mathrm{Lip}_u(N)}{1 - \mathrm{osL}_x(N)} = \frac{\|B\|_{(\infty,[\eta]^{-1}),U}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+}, \tag{14}$$
$$\mathrm{Lip}_{u\to y} := \frac{\|B\|_{(\infty,[\eta]^{-1}),U}\,\|C\|_{Y,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{Y,U}. \tag{15}$$

## 5 Training implicit neural networks

Problem setup. Given an input data matrix $U = [u_1, \dots, u_m] \in \mathbb{R}^{r\times m}$ and a corresponding output data matrix $Y = [y_1, \dots, y_m] \in \mathbb{R}^{q\times m}$, we aim to learn matrices $A, B, C, D$ so that the neural network (11) approximates the input-output relationship. We rewrite the model for matrix inputs as $\hat{Y} = CX + DU$, where $X = \Phi(AX + BU)$. From Corollary 5(i), if each $\phi_i$ is weakly increasing and non-expansive, the fixed-point problem is well posed when $\mu_{\infty,[\eta]^{-1}}(A) < 1$ for some $\eta \in \mathbb{R}^n_{>0}$. We consider a training problem of the form
$$\min_{A,B,C,D,X}\; L(Y, CX + DU) + P(A, B, C, D) \quad \text{s.t.} \quad X = \Phi(AX + BU), \quad \mu_{\infty,[\eta]^{-1}}(A) \le \gamma, \tag{16}$$
where $L$ is a loss function, $P$ is a penalty function, and $\gamma < 1$ is a hyperparameter ensuring the fixed-point problem is well posed. For $\eta = \mathbb{1}_n$, we can remove the constraint $\mu_\infty(A) \le \gamma$ in the training optimization problem (16) using the following parametrization of the weight matrix $A$:
$$A = T - \mathrm{diag}(|T|\mathbb{1}_n) + \gamma I_n. \tag{17}$$
In Appendix B, we show that the parametrization (17) characterizes the set of matrices in $\mathbb{R}^{n\times n}$ satisfying $\mu_\infty(A) \le \gamma$. Using the parametrization (17) in the training problem not only improves the computational efficiency of the optimization but also allows for the design of implicit neural networks with additional structure such as convolutions.
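The PyTorch sketch below shows how these pieces fit together for $\eta = \mathbb{1}_n$: the parametrization (17) of $A$ and a forward pass that runs the average iteration (7) with the step size $\alpha^*$ of Corollary 5(ii). It is a simplified illustration under our own choices (class and parameter names, initialization, ReLU activation, fixed iteration budget, and plain backpropagation through the unrolled loop rather than the implicit differentiation used in the paper and its repository).

```python
import torch
import torch.nn as nn

class NEMON(nn.Module):
    """Minimal sketch of the implicit model (11) with the parametrization (17), eta = 1_n."""

    def __init__(self, n, r, q, gamma=0.95):
        super().__init__()
        self.T = nn.Parameter(torch.randn(n, n) / n)
        self.B = nn.Parameter(torch.randn(n, r) / r ** 0.5)
        self.C = nn.Parameter(torch.randn(q, n) / n ** 0.5)
        self.D = nn.Parameter(torch.zeros(q, r))
        self.gamma = gamma

    def A(self):
        # Equation (17): A = T - diag(|T| 1_n) + gamma I_n guarantees mu_inf(A) <= gamma.
        T = self.T
        return T - torch.diag(T.abs().sum(dim=1)) + self.gamma * torch.eye(T.shape[0], device=T.device)

    def forward(self, u, max_iter=50, tol=1e-6):
        A, B = self.A(), self.B
        # Step size alpha* = 1 / (1 - min_i (A_ii)_-) from Corollary 5(ii).
        alpha = 1.0 / (1.0 - torch.clamp(torch.diagonal(A).min(), max=0.0))
        x = torch.zeros(u.shape[0], A.shape[0], device=u.device)
        for _ in range(max_iter):                   # average iteration (7)
            x_new = (1 - alpha) * x + alpha * torch.relu(x @ A.T + u @ B.T)
            if (x_new - x).abs().max() < tol:
                x = x_new
                break
            x = x_new
        return x @ self.C.T + u @ self.D.T          # y = C x + D u

model = NEMON(n=100, r=784, q=10)
y = model(torch.rand(32, 784))                      # batch of 32 flattened MNIST-like inputs
print(y.shape)                                      # torch.Size([32, 10])
```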
Suppose $u \in \mathbb{R}^{rs^2}$ is an $r$-channel input of size $s \times s$ and $x \in \mathbb{R}^{ns^2}$ is an $n$-channel hidden layer. To define our implicit CNN, we select the weight matrix $A \in \mathbb{R}^{ns^2 \times ns^2}$ as the matrix form of a 2D convolutional operator. If we consider a circular convolution operator, then $A$ is a circulant matrix. Using the parametrization (17), $A$ is circulant if and only if $T$ is circulant. Therefore, the training problem for implicit CNNs can be cast as an unconstrained optimization problem using the above parametrization with a circulant $T$.

Improving robustness via Lipschitz regularization. We now focus on learning robust implicit neural networks with bounded Lipschitz constants via a regularization strategy. Setting both $\|\cdot\|_U$ and $\|\cdot\|_Y$ to $\|\cdot\|_\infty$ in the input-output Lipschitz bound (15), we get
$$\mathrm{Lip}_{u\to y} = \frac{\|B\|_{(\infty,[\eta]^{-1}),\infty}\,\|C\|_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty} \le \frac{\|B\|^2_{(\infty,[\eta]^{-1}),\infty} + \|C\|^2_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty},$$
where the inequality provides a convex upper bound for the input-output Lipschitz constant. Therefore, using the hyperparameter $\lambda > 0$, the regularized optimization problem is written as
$$\min_{A,B,C,D,X}\; L(Y, CX + DU) + \lambda\Big(\frac{\|B\|^2_{(\infty,[\eta]^{-1}),\infty} + \|C\|^2_{\infty,(\infty,[\eta]^{-1})}}{1 - \mu_{\infty,[\eta]^{-1}}(A)_+} + \|D\|_{\infty,\infty}\Big) \quad \text{s.t.} \quad X = \Phi(AX + BU), \quad \mu_{\infty,[\eta]^{-1}}(A) \le \gamma. \tag{18}$$

Certified adversarial robustness via Lipschitz bounds. Given a nominal input $u \in \mathbb{R}^r$, we consider any perturbed input $v$ within an ℓ∞-ball of radius $\varepsilon$ around $u$. In this case, we have
$$\|y_u - y_v\|_\infty \le \mathrm{Lip}_{u\to y}\,\|u - v\|_\infty \le \mathrm{Lip}_{u\to y}\,\varepsilon. \tag{19}$$
We then define $\mathrm{margin}(u) = (y_u)_i - \max_{j\neq i}(y_u)_j$, where $(y_u)_i$ is the logit corresponding to the (correct) label $i$ for the input $u$. Then, provided $\mathrm{Lip}_{u\to y}\,\varepsilon \le \tfrac{1}{2}\mathrm{margin}(u)$, NEMON is certifiably robust to any perturbed input $v$ within an ℓ∞-ball of radius $\varepsilon$ centered at $u$.

Backpropagation of gradients via average iteration. Following [El Ghaoui et al., 2021], we now show how the average iteration can be used to perform backpropagation via the implicit function theorem. For simplicity, we assume that each activation function $\phi_i$ is differentiable and consider mini-batches of size 1, i.e., we have $X = x \in \mathbb{R}^n$, $U = u \in \mathbb{R}^r$, and $\hat{Y} = \hat{y} \in \mathbb{R}^q$. Let $x$ be the unique solution of the fixed-point equation (11). Then the chain rule implies
$$\frac{\partial L}{\partial A} = (\nabla_x L)\,x^\top, \quad \frac{\partial L}{\partial B} = (\nabla_x L)\,u^\top, \quad \frac{\partial L}{\partial C} = (\nabla_{\hat{y}} L)\,x^\top, \quad \frac{\partial L}{\partial D} = (\nabla_{\hat{y}} L)\,u^\top.$$
Since $L$ depends explicitly on $\hat{y}$, computing $\nabla_{\hat{y}} L$ is straightforward. Computing $\nabla_x L$ is more complicated since $x$ is defined only implicitly. However, it can be shown that $\nabla_x L = \big(C(I - D_\Phi A)^{-1} D_\Phi\big)^\top \nabla_{\hat{y}} L$. Since $\mu_{\infty,[\eta]^{-1}}(A) < 1$, by Lemma 8 we get that $\mu_{\infty,[\eta]^{-1}}(D_\Phi A) < 1$. This implies that the matrix $G := (I_n - D_\Phi A)^{-1} D_\Phi \in \mathbb{R}^{n\times n}$ exists and is the solution to the following fixed-point equation [El Ghaoui et al., 2021, Section 6.2]:
$$G = D_\Phi(AG + I_n). \tag{20}$$
Moreover, $\mu_{\infty,[\eta]^{-1}}(D_\Phi A) < 1$ and Theorem 2 together imply that the fixed-point equation (20) has a unique solution $G$ and, for every $0 < \alpha \le \alpha^* := \big(1 - \min_i(A_{ii})_-\big)^{-1}$, the average iterations
$$G_{k+1} = (1-\alpha)G_k + \alpha D_\Phi(AG_k + I_n), \quad \text{for all } k \in \mathbb{Z}_{\ge 0},$$
are contracting, with minimum contraction factor $1 - \alpha^*\big(1 - \mu_{\infty,[\eta]^{-1}}(A)_+\big)$ at step size $\alpha^*$.
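Combining the Lipschitz bound (15) with the margin condition below (19) gives a simple certification routine. The sketch below (NumPy, $\eta = \mathbb{1}_n$; the helper names are ours) is an illustrative reading of this procedure, not the paper's implementation.

```python
import numpy as np

def inf_norm(M):
    """Induced ell_inf operator norm: maximum absolute row sum."""
    return np.abs(M).sum(axis=1).max()

def mu_inf(A):
    """ell_inf matrix measure: max_i ( a_ii + sum_{j != i} |a_ij| )."""
    return (np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max()

def lip_bound(A, B, C, D):
    """Input-output ell_inf Lipschitz bound (15), with eta = 1_n."""
    return inf_norm(B) * inf_norm(C) / (1.0 - max(mu_inf(A), 0.0)) + inf_norm(D)

def certified_radius(logits, label, lip):
    """Largest eps with lip * eps <= margin / 2: no ell_inf perturbation of size
    up to eps can change the predicted class away from `label` (cf. (19))."""
    margin = logits[label] - np.max(np.delete(logits, label))
    return max(margin, 0.0) / (2.0 * lip)

# Usage on hypothetical trained weights (A, B, C, D) and one input's logits y_u:
# eps_cert = certified_radius(y_u, true_label, lip_bound(A, B, C, D))
```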
## 6 Theoretical and numerical comparisons

In this section, we provide a comprehensive comparison of our framework with state-of-the-art implicit neural networks.²

### 6.1 Implicit network models

We start by reviewing the existing models for implicit networks in the literature.

Implicit deep learning model. [El Ghaoui et al., 2021] proposes a class of implicit neural networks with input-output behavior described by (11). It is shown that a sufficient condition for existence and uniqueness of a solution and for convergence of the Picard iterations for the fixed-point equation $x = \Phi(Ax + Bu)$ is $\lambda_{\mathrm{pf}}(|A|) < 1$, where $|A|$ denotes the entrywise absolute value of the matrix $A$ and $\lambda_{\mathrm{pf}}$ denotes the Perron-Frobenius eigenvalue. For training, the optimization problem (16) is used where the constraint $\mu_{\infty,[\eta]^{-1}}(A) \le \gamma$ is replaced by $\|A\|_\infty \le \gamma$ [El Ghaoui et al., 2021, Equation 6.3].³ It is easy to see that our well-posedness condition in Corollary 5(i) is less conservative than $\lambda_{\mathrm{pf}}(|A|) < 1$ and its convex relaxation $\|A\|_\infty < 1$.

Monotone operator deep equilibrium network (MON). [Winston and Kolter, 2020] proposes to use monotone operator theory to guarantee well-posedness of the fixed-point equation as well as convergence to its solutions. The input-output behavior of the network is described by (11). For training, the optimization problem (16) is used where the constraint $\mu_{\infty,[\eta]^{-1}}(A) \le \gamma$ is replaced by $I_n - \tfrac{1}{2}(A + A^\top) \succeq (1-\gamma)I_n$. In order to ensure that this constraint is always satisfied in the training procedure, the weight matrix $A$ is parametrized as $A = \gamma I_n - W^\top W - Z + Z^\top$, for arbitrary $W, Z \in \mathbb{R}^{n\times n}$ [Winston and Kolter, 2020, Appendix D].⁴ In the context of contraction theory, $I_n - \tfrac{1}{2}(A + A^\top) \succeq (1-\gamma)I_n$ is equivalent to $\mu_2(A) \le \gamma$, as shown in Appendix A. Thus, the parametrization $A = \gamma I_n - W^\top W - Z + Z^\top$ can be considered as the ℓ2-norm version of the parametrization described by equation (17). In other words, the monotone operator network formulation is a Euclidean transcription of the framework we propose in this paper.

² All models were trained using Google Colab with a Tesla P100-PCIE-16GB GPU.
³ The implicit deep learning implementation is available at https://github.com/beeperman/idl.
⁴ The MON implementation is available at https://github.com/locuslab/monotone_op_net.
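To illustrate why the ℓ∞ condition admits a larger search space, the snippet below (our own illustration) samples $A$ from the parametrization (17) with $\gamma = 0.95$ and evaluates the competing quantities: $\mu_\infty(A)$ is at most 0.95 by construction, while $\|A\|_\infty$ and $\lambda_{\mathrm{pf}}(|A|)$, used by the implicit deep learning model, typically far exceed 1; $\mu_2(A)$ is the MON-style quantity.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
T = rng.standard_normal((n, n))
A = T - np.diag(np.abs(T).sum(axis=1)) + 0.95 * np.eye(n)    # parametrization (17)

mu_inf = (np.diag(A) - np.abs(np.diag(A)) + np.abs(A).sum(axis=1)).max()   # NEMON condition
norm_inf = np.abs(A).sum(axis=1).max()                        # convex relaxation ||A||_inf
lam_pf = np.abs(np.linalg.eigvals(np.abs(A))).max()           # Perron-Frobenius eigenvalue of |A|
mu_2 = np.linalg.eigvalsh((A + A.T) / 2.0).max()              # MON-style ell_2 measure

print(f"mu_inf(A) = {mu_inf:7.2f}   (<= 0.95 by construction)")
print(f"||A||_inf = {norm_inf:7.2f}   lambda_pf(|A|) = {lam_pf:7.2f}")
print(f"mu_2(A)   = {mu_2:7.2f}")
```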
### 6.2 MNIST experiments

Figure 1 (left panel: accuracy vs. epochs; right panel: test loss vs. epochs, both on MNIST handwritten digits). Performance comparison between the NEMON model with $\mu_\infty(A) \le 0.95$, the implicit deep learning model with $\|A\|_\infty \le 0.95$, and MON with $I_n - \tfrac{1}{2}(A + A^\top) \succeq 0.05 I_n$ on the MNIST dataset. The curves are generated by the mean accuracy and mean loss over 5 different runs, while light envelopes around the curves correspond to the standard deviation over the runs. Average best accuracy for the NEMON model is 0.9772, while it is 0.9721 for the implicit deep learning model and 0.9762 for the MON model.

In the digit classification dataset MNIST, input data are 28 × 28 pixel images of handwritten digits between 0-9. There are 60000 training images and 10000 test images. For training, images are reshaped into 784-dimensional column vectors and entries are scaled into the range [0, 1]. As a loss function, we use the cross-entropy. All models are of order n = 100, use the ReLU activation function $\phi_i(x) = (x)_+$, and are trained with a batch size of 300 over 10 epochs with a learning rate of 1.5 × 10⁻². Curves for accuracy and loss versus epochs for the three models are shown in Figure 1. Regarding training times, using the average iteration, NEMON took, on average, 12 forward iterations, 13 backward iterations, and 9.8 seconds to train per epoch. Using the Peaceman-Rachford iteration, MON took, on average, 17 forward iterations, 16 backward iterations, and 9.5 seconds to train per epoch. Using the Picard iteration, the implicit deep learning model took, on average, 10 forward iterations, 5 backward iterations, and 5.8 seconds to train per epoch. We observe that the NEMON model performs better than the implicit deep learning model and has comparable performance to MON.

Figure 2 (left panel: test error (%) vs. Lipschitz constant; right panel: accuracy vs. ℓ∞ amplitude of perturbation, both on MNIST handwritten digits). On the left is a plot of test error versus Lipschitz constant for the implicit deep learning model with $\|A\|_\infty \le 0.95$ and for NEMON with $\mu_\infty(A) \le 0.95$, parametrized by the regularization hyperparameter λ. We define the test error as 1 minus the accuracy. On the right is a plot of accuracy versus ℓ∞ perturbation of a deterministic adversarial image inversion attack, where we additionally include the MON model with $I_n - \tfrac{1}{2}(A + A^\top) \succeq 0.05 I_n$.

We also study the robustness of the NEMON model compared to the implicit deep learning model and the MON model on the MNIST dataset. We train various models regularized by the input-output Lipschitz constant as in (18). Additionally, to verify robustness of the different models, we consider several adversarial attacks and plot the accuracy versus the perturbation of such an attack. In Figure 2, we consider a continuous image inversion attack [Hosseini et al., 2017], where each pixel is perturbed in the direction of pixel value inversion with amplitude given by the ℓ∞ perturbation. For more details on this and other types of adversarial perturbations, we refer to Appendix D. We observe that for λ = 10⁻⁵, the regularized NEMON model achieves a two order of magnitude decrease in its input-output Lipschitz constant compared to the un-regularized NEMON model. In addition, we see that the implicit deep learning model and the MON model are more sensitive to the continuous image inversion attack than NEMON. Moreover, as the regularization parameter λ increases, the NEMON model becomes increasingly robust to this attack.

### 6.3 CIFAR-10 experiments

In the image classification dataset CIFAR-10, input data are 32 × 32 color images in 10 classes. There are 50000 training images and 10000 test images. We compare our proposed NEMON model with a convolutional structure to a single convolutional layer MON model. Each model used 81 channels. We train both models with a batch size of 256 and a learning rate of 10⁻³ for 40 epochs. For training, using the average iteration, NEMON took, on average, 10 forward iterations, 10 backward iterations, and 75.0 seconds per epoch to train. Using the Peaceman-Rachford iteration, MON took, on average, 5 forward iterations, 5 backward iterations, and 101.8 seconds per epoch to train. We focus primarily on the robustness of NEMON and MON with respect to ℓ∞-norm bounded perturbations on CIFAR-10. To this end, we additionally trained two NEMON models with regularization parameters λ ∈ {10⁻⁴, 10⁻⁵}. In Figure 3, on the left is a plot of the certified robustness of each of the models via their ℓ∞-Lipschitz constants. For MON, we obtained the ℓ∞-Lipschitz bound using the method in [Pabbaraju et al., 2021] for the ℓ2-Lipschitz bound together with the upper bound $\|u\|_2 \le \sqrt{rs^2}\,\|u\|_\infty$. On the right is a plot of the accuracy of the different models with respect to the projected gradient descent attack. We observe that the un-regularized and regularized NEMON models are more robust to ℓ∞-norm bounded perturbations than MON.
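The certified curves in Figure 3 can be reproduced in principle from a Lipschitz bound and per-example margins. The sketch below (our own helper names, assuming arrays of test-set logits and labels are available) converts an ℓ2 bound to an ℓ∞ bound via $\|u\|_2 \le \sqrt{rs^2}\,\|u\|_\infty$ and evaluates certified robust accuracy at a given ε using condition (19).

```python
import numpy as np

def lip_inf_from_lip_2(lip_2, r, s):
    """Convert an ell_2 input-output Lipschitz bound into an ell_inf bound using
    ||u||_2 <= sqrt(r * s^2) * ||u||_inf for u in R^{r s^2}."""
    return np.sqrt(r * s * s) * lip_2

def certified_accuracy(logits, labels, lip, eps):
    """Fraction of inputs whose prediction is both correct and certified at ell_inf
    radius eps, i.e. Lip * eps <= margin / 2 as in (19). logits: (m, q), labels: (m,)."""
    m = len(labels)
    correct_logit = logits[np.arange(m), labels]
    masked = logits.copy()
    masked[np.arange(m), labels] = -np.inf          # exclude the true class
    margins = correct_logit - masked.max(axis=1)
    return np.mean(margins >= 2.0 * lip * eps)

# Example: certified curve values on a hypothetical eps grid like that of Figure 3:
# curve = [certified_accuracy(logits, labels, lip, e) for e in np.linspace(0.0, 0.01, 6)]
```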
Figure 3 (left panel: certified robust test accuracy vs. ℓ∞ amplitude of perturbation, with legend λ = 10⁻⁴, Lip = 116.1; λ = 10⁻⁵, Lip = 249.2; λ = 0, Lip = 6044.9; MON, Lip = 5398.2; right panel: accuracy vs. ℓ∞ amplitude of perturbation, both on CIFAR-10 images). On the left is a plot of certified robustness via the Lipschitz constants of MON with the constraint $I_n - \tfrac{1}{2}(A + A^\top) \succeq I_n$ and NEMON with the constraint $\mu_\infty(A) \le 0$. On the right is a plot of accuracy versus ℓ∞ perturbation of the projected gradient descent attack.

## 7 Conclusion

Using non-Euclidean contraction theory, we propose a framework to study the stability of fixed-point equations. We apply this framework to analyze well-posedness and convergence of implicit neural networks and to design an efficient training algorithm that incorporates robustness guarantees. For future research, we envision that our framework is applicable to the study of stability and robustness of implicit learning models with additional structure, such as graph neural networks.

## 8 Acknowledgments

The authors thank Ian Manchester for stimulating discussions about contraction theory and the anonymous reviewers for their insightful feedback. This work was supported in part by DTRA under grant HDTRA1-19-1-0017, AFOSR under grant FA9550-22-1-0059, and an NSF Graduate Research Fellowship under grant 2139319.

## References

A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond, and J. Z. Kolter. Differentiable convex optimization layers. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1910.12430.

B. Amos, I. Jimenez, J. Sacks, B. Boots, and J. Z. Kolter. Differentiable MPC for end-to-end planning and control. In Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1810.13400.

S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1909.01377.

H. H. Bauschke and P. L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, 2 edition, 2017. ISBN 978-3-319-48310-8.

N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, pages 3-14, 2017. doi:10.1145/3128572.3140444.

R. T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, 2018. URL https://arxiv.org/abs/1806.07366.

P. L. Combettes and J.-C. Pesquet. Lipschitz certificates for layered network structures driven by averaged activation operators. SIAM Journal on Mathematics of Data Science, 2(2):529-557, 2020. doi:10.1137/19M1272780.

P. L. Combettes and J.-C. Pesquet. Fixed point strategies in data science. IEEE Transactions on Signal Processing, 2021. doi:10.1109/TSP.2021.3069677.

A. Davydov, S. Jafarpour, and F. Bullo. Non-Euclidean contraction theory for robust nonlinear stability. 2021. URL https://arxiv.org/abs/2103.12263.

C. A. Desoer and H. Haneda. The measure of a matrix as a tool to analyze computer algorithms for circuit analysis. IEEE Transactions on Circuit Theory, 19(5):480-486, 1972. doi:10.1109/TCT.1972.1083507.

L. El Ghaoui, F. Gu, B. Travacca, A. Askari, and A. Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930-958, 2021. doi:10.1137/20M1358517.

Y. Fang and T. G. Kincaid. Stability analysis of dynamical neural networks.
IEEE Transactions on Neural Networks, 7(4):996-1006, 1996. doi:10.1109/72.508941.

M. Fazlyab, A. Robey, H. Hassani, M. Morari, and G. J. Pappas. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, 2019. URL https://arxiv.org/abs/1906.04893.

M. Fazlyab, M. Morari, and G. J. Pappas. Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. IEEE Transactions on Automatic Control, 2020. doi:10.1109/TAC.2020.3046193.

I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015. URL https://arxiv.org/abs/1412.6572.

W. He and J. Cao. Exponential synchronization of chaotic neural networks: a matrix measure approach. Nonlinear Dynamics, 55:55-65, 2009. doi:10.1007/s11071-008-9344-4.

H. Hosseini, B. Xiao, M. Jaiswal, and R. Poovendran. On the limitation of convolutional neural networks in recognizing negative images. In IEEE International Conference on Machine Learning and Applications, pages 352-358, 2017. doi:10.1109/ICMLA.2017.0-136.

A. Kag, Z. Zhang, and V. Saligrama. RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients? In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HylpqA4FwS.

K. Kawaguchi. On the theory of implicit deep learning: Global convergence with implicit layers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=p-NZIuwqhI4.

V. Krishnan, A. A. Al Makdah, and F. Pasqualetti. Lipschitz bounds and provably robust training by Laplacian smoothing. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2006.03712.

T. C. Lim. On fixed point stability for set-valued contractive mappings with applications to generalized differential equations. Journal of Mathematical Analysis and Applications, 110(2):436-441, 1985. doi:10.1016/0022-247X(85)90306-3.

W. Lohmiller and J.-J. E. Slotine. On contraction analysis for non-linear systems. Automatica, 34(6):683-696, 1998. doi:10.1016/S0005-1098(98)00019-3.

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Machine Learning, 2018. URL https://arxiv.org/abs/1706.06083.

C. Pabbaraju, E. Winston, and J. Z. Kolter. Estimating Lipschitz constants of monotone deep equilibrium models. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=VcB4QkSfyO.

N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pages 582-597, 2016. doi:10.1109/SP.2016.41.

P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer. Training robust neural networks using Lipschitz bounds. IEEE Control Systems Letters, 2021. doi:10.1109/LCSYS.2021.3050444.

A. Pavlov, A. Pogromsky, N. Van de Wouw, and H. Nijmeijer. Convergent dynamics, a tribute to Boris Pavlovich Demidovich. Systems & Control Letters, 52(3-4):257-261, 2004. doi:10.1016/j.sysconle.2004.02.003.

H. Qiao, J. Peng, and Z.-B. Xu. Nonlinear measures: A new approach to exponential stability analysis for Hopfield-type neural networks. IEEE Transactions on Neural Networks, 12(2):360-370, 2001. doi:10.1109/72.914530.

M. Revay, R. Wang, and I. R. Manchester. Lipschitz bounded equilibrium networks. 2020. URL https://arxiv.org/abs/2010.01732.
M. Revay, R. Wang, and I. R. Manchester. A convex parameterization of robust recurrent neural networks. IEEE Control Systems Letters, 5(4):1363-1368, 2021. doi:10.1109/LCSYS.2020.3038221.

E. K. Ryu and S. Boyd. Primer on monotone operator methods. Applied Computational Mathematics, 15(1):3-43, 2016.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014. URL https://arxiv.org/abs/1312.6199.

E. Winston and J. Z. Kolter. Monotone operator equilibrium networks. In Advances in Neural Information Processing Systems, 2020. URL https://arxiv.org/abs/2006.08591.

E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5286-5295, 2018. URL http://proceedings.mlr.press/v80/wong18a.html.