# monotone_bilipschitz_and_polyakłojasiewicz_networks__ad1f61ad.pdf Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks Ruigang Wang 1 Krishnamurthy (Dj) Dvijotham 2 Ian R. Manchester 1 This paper presents a new bi-Lipschitz invertible neural network, the Bi Lip Net, which has the ability to smoothly control both its Lipschitzness (output sensitivity to input perturbations) and inverse Lipschitzness (input distinguishability from different outputs). The second main contribution is a new scalar-output network, the PLNet, which is a composition of a Bi Lip Net and a quadratic potential. We show that PLNet satisfies the PolyakŁojasiewicz condition and can be applied to learn non-convex surrogate losses with a unique and efficiently-computable global minimum. The central technical element in these networks is a novel invertible residual layer with certified strong monotonicity and Lipschitzness, which we compose with orthogonal layers to build the Bi Lip Net. The certification of these properties is based on incremental quadratic constraints, resulting in much tighter bounds than can be achieved with spectral normalization. Moreover, we formulate the calculation of the inverse of a Bi Lip Net and hence the minimum of a PLNet as a series of three-operator splitting problems, for which fast algorithms can be applied. 1. Introduction In many applications, it is desirable to learn neural networks with certified input-output behaviors, i.e., certain properties that are guaranteed by design. For example, Lipschitzbounded networks have proven to be beneficial for stabilizing of generative adversarial network (GAN) training (Arjovsky et al., 2017; Gulrajani et al., 2017), certifying robustness against adversarial attacks (Tsuzuku et al., 2018; Singla & Feizi, 2021; Zhang et al., 2021; Araujo et al., 2023; Wang & Manchester, 2023) and robust reinforcement 1Australian Centre for Robotics, School of Aerospace, Mechanical and Mechatronic Engineering, The University of Sydney, Sydney, NSW 2006, Australia. 2Google Deep Mind. Correspondence to: Ruigang Wang . Proceedings of the 41 st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s). 2 1 0 1 2 3 Target fn. i-Res Net i-Densenet Bi Lip Net Optimal fit Model inv. Lip. ( ) Lip. ( ) loss ( ) i-Res Net 0.80 4.69 0.2090 i-Dense Net 0.82 4.66 0.2091 Bi Lip Net 0.11 9.97 0.0685 Best Possible 0.10 10.0 0.0677 Figure 1. Fitting a step function, which is not Lipschitz, with certified (0.1, 10)-Lipschitz models. Compared to the analyticallycomputed optimum, the proposed Bi Lip Net achieves much tighter bounds than models based on spectral normalization. learning (Russo & Proutiere, 2021; Barbara et al., 2024). Another input-output property invertibility has received much attention in the deep learning literature since the introduction of normalizing flows (Dinh et al., 2015) for probability-density learning. Invertible neural networks have been applied in applications such as generative modeling (Dinh et al., 2017; Kingma & Dhariwal, 2018), probabilistic inference (Bauer & Mnih, 2019; Ward et al., 2019; Louizos & Welling, 2017), solving inverse problems (Ardizzone et al., 2018) and uncertainty estimation (Liu et al., 2020). 
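The empirical bounds reported in Figure 1 are measured by finite differences over test data (see Appendix D.1), while the (0.1, 10) bounds of the fitted models are certified by construction. For orientation, here is a small, hedged sketch of such an empirical estimate; the 1-D map f below is a hand-made stand-in, not one of the trained models.

```python
import numpy as np

def empirical_bilipschitz(f, dim, n_pairs=10000, box=2.0, step=1e-3, seed=0):
    """Estimate empirical inverse-Lipschitz and Lipschitz bounds of f: R^dim -> R^dim by
    finite differences over random pairs (a sampled estimate, not a certificate)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-box, box, size=(n_pairs, dim))
    dx = step * rng.normal(size=(n_pairs, dim))
    ratios = np.linalg.norm(f(x + dx) - f(x), axis=1) / np.linalg.norm(dx, axis=1)
    mu_hat, nu_hat = ratios.min(), ratios.max()
    return mu_hat, nu_hat, nu_hat / mu_hat          # (inverse Lipschitz, Lipschitz, distortion)

# A hand-made 1-D map whose slope lies in (0.1, 10], roughly matching the bounds in Figure 1.
f = lambda x: 0.1 * x + (9.9 / 5.0) * np.tanh(5.0 * x)
print(empirical_bilipschitz(f, dim=1))
```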
A common way to construct invertible networks is to compose invertible affine transformations with more sophisticated invertible layers, including coupling flows (Dinh et al., 2017; Kingma & Dhariwal, 2018), auto-regressive models (Huang et al., 2018; De Cao et al., 2020; Ho et al., 2019), invertible residual layers (Chen et al., 2019; Behrmann et al., 2019), monotone networks (Ahn et al., 2022), and neural ordinary differential equations (Grathwohl et al., 2019), see also in the surveys (Papamakarios et al., 2021; Kobyzev et al., 2020). Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 2 However, (Behrmann et al., 2021) observed that commonlyused invertible networks suffer from exploding inverses and are thus prone to becoming numerically non-invertible. This observation motivates the input-output property of bi Lipschitzness. A layer F : Rn Rn is said to be bi Lipschitz with bound of (µ, ν), or simply (µ, ν)-Lipschitz, if the following inequalities hold for all x, x Rn: µ x x F(x) F(x ) ν x x , where is the Euclidean norm. The bound ν controls the output sensitivity to input perturbations while µ controls the input distinguishability from different outputs (Liu et al., 2020). We call µ as the inverse Lipschitz bound of F as F 1 exists and is 1/µ-Lipschitz. The ratio τ := ν/µ is called distortion (Liang et al., 2023), which is the upper bound of the condition number of the Jacobian matrix of F. A larger distortion implies more expressive flexibility in the model. In this paper we argue that the bi-Lipschitz property is also useful for learning of surrogate loss (or reward) functions. Given some input/output pairs of a loss function, the objective is to learn a function which matches the observed data and is easy to optimize in some sense. This problem appears in many areas, including Q-learning with continuous action spaces, see e.g. (Gu et al., 2016; Amos et al., 2017; Ryu et al., 2019), offline data-driven optimization (Grudzien et al., 2024), learning reward models in inverse reinforcement learning (Arora & Doshi, 2021), and datadriven surrogate losses for engineering process optimization (Cozad et al., 2014; Misener & Biegler, 2023). An important contribution was the input convex neural network (ICNN) (Amos et al., 2017). However, the requirement of input convexity could be too strong in many applications. 1.1. Contributions We propose a novel strongly monotone and Lipschitz residual layer of the form F(x) = µx + H(x). For the nonlinear block H, we introduce a new architecture feed-through network (FTN), which takes a multi-layer perceptron (MLP) as its backbone and adds connections from each hidden layer to the input and output variables. For deep networks, this architecture can improve the model expressivity without suffering from vanishing gradients. We parameterize FTNs with certified strong monotonicity (which implies inverse Lipschitzness) and Lipschitzness for F via the integral quadratic constraint (IQC) framework (Megretski & Rantzer, 1997) and the Cayley transform. By composing strongly-monotone and Lipschitz FTN layers with orthogonal affine layers we obtain the Bi Lip Net, a new network architecture with smoothlyparameterized bi-Lipschitz bounds. We formulate the model inversion F 1 as a threeoperator splitting problem, which admits a numerically efficient solver (Davis & Yin, 2017). 
We introduce a new scalar-output network f : Rn R, which we call a Polyak-Łojasiewicz network (or PLNet) since it satisfies the condition of the same name (Polyak, 1963; Lojasiewicz, 1963). It consists of a bi-Lipschitz network composed with a quadratic potential, and automatically satisfies favourable properties for surrogate loss learning, in particular existence of a unique global optimum which is efficiently computable. 1.2. Related work Bi-Lipschitz invertible layer. In literature, there are two types of invertible layers closely related to our models. The first is the invertible residual layer F(x) = x+H(x) (Chen et al., 2019; Behrmann et al., 2019), where the nonlinear block H is a shallow network with Lipschitz bound of c < 1. In (Perugachi-Diaz et al., 2021), H is further extended to a deep MLP. It is easy to show that F is (1 c)-inverse Lipschitz and (1 + c)-Lipschitz. In both cases, the Lipschitz regularization is via spectral normalization (Miyato et al., 2018), which we observe to be very conservative (see Figure 1). Alternatively, a bi-Lipschitz layer can be defined by an implicit equation (Lu et al., 2021; Ahn et al., 2022). However, these require an iterative solver for both the forward and inverse model inference. In contrast, our model has an explicit forward pass and iterative solution is only required for the inverse. IQC-based Lipschitz estimation and training. In (Fazlyab et al., 2019), the IQC framework of (Megretski & Rantzer, 1997) was first applied to obtain accurate Lipschitz bound estimation of deep networks with slope-restricted activations. It was later pointed out by (Wang et al., 2022) that IQC for Lipschitzness (Fazlyab et al., 2019) is Shor s relaxation of a Rayleigh quotient quadratically constrained quadratic programming (QCQP). Direct (i.e. unconstrained) parameterizations based on IQC were were proposed in (Revay et al., 2020) for deep equilibrium networks, in (Araujo et al., 2023) for residual networks, for deep MLPs and CNNs in (Wang & Manchester, 2023), and recurrent models in (Revay et al., 2023). It was pointed out by (Havens et al., 2023) that many recent Lipschitz model parameterizations (Meunier et al., 2022; Prach & Lampert, 2022; Araujo et al., 2023; Wang & Manchester, 2023) are special cases of (Revay et al., 2020). In a recent work (Pauli et al., 2024), the IQC-based Lipschitz estimation was recently extended to more general activations such as Group Sort and Max Min. All of these are for one-sided (upper) Lipschitzness, whereas our work applies the IQC framework for monotonicity and bi-Lipschitzness. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 3 Bi-Lipschitz networks for learning-based surrogate optimization. (Liang et al., 2023) uses Bi-Lipschitz networks to learn a surrogate constraint set while our work focuses on surrogate loss learning. Both works take distortion bound as an important regularization technique. The difference is that the distortion estimation in (Liang et al., 2023) is based on data samples while our work offers certified and smoothly-parameterized distortion bounds. 2. Preliminaries We give some definitions for a mapping F : Rn Rn. Definition 2.1. F is said to be µ-strongly monotone with µ > 0 if for all x, x Rn we have F(x) F(x ), x x µ x x 2, where , is the Euclidean inner product: a, b = a b. F is monotone if the above condition holds for µ = 0. Definition 2.2. F is said to be ν-Lipschitz with ν > 0 if F(x) F(x ) ν x x , x, x Rn. 
F is said to be µ-inverse Lipschitz with µ > 0 if $\|F(x) - F(x')\| \ge \mu \|x - x'\|$ for all $x, x' \in \mathbb{R}^n$. F is said to be bi-Lipschitz with ν ≥ µ > 0, or simply (µ, ν)-Lipschitz, if it is µ-inverse Lipschitz and ν-Lipschitz.

For any (µ, ν)-Lipschitz mapping F, its inverse F⁻¹ is well-defined and (1/ν, 1/µ)-Lipschitz (Yeh, 2006). By the Cauchy-Schwarz inequality, strong monotonicity implies inverse Lipschitzness, see Figure 2. A notable difference between monotonicity and bi-Lipschitzness is their composition behaviour. Given two bi-Lipschitz mappings F1, F2, their composition F = F2 ∘ F1 is also bi-Lipschitz with bound (µ1µ2, ν1ν2), where (µ1, ν1) and (µ2, ν2) are the bi-Lipschitz bounds of F1 and F2, respectively. In contrast, given two strongly monotone mappings F1, F2 with monotonicity bounds µ1, µ2, the composition F = F2 ∘ F1 need not be strongly monotone; however, it is still µ1µ2-inverse Lipschitz. To quantify the flexibility of bi-Lipschitz maps, we introduce the following:

Definition 2.3. F satisfies a distortion bound τ with τ ≥ 1 if F is (µ, ν)-Lipschitz with τ = ν/µ.

For an invertible affine mapping F(x) = Px + q, the condition number of P is a distortion bound. An orthogonal mapping (i.e., P⊤P = I) has the smallest possible model distortion τ = 1. Distortion bounds satisfy a composition property: if F1, F2 have distortion bounds τ1, τ2, then F2 ∘ F1 satisfies a distortion bound of τ1τ2. Both F and F⁻¹ have the same distortion.

Figure 2. Possible ranges of ∆y = F(x') − F(x) on R² for a given ∆x = x' − x. The ring (blue area) is for a (µ, ν)-Lipschitz F, while the half moon (red area) is for a µ-strongly monotone and ν-Lipschitz F. The largest angle α between ∆x and ∆y satisfies cos α = τ⁻¹ = µ/ν, with τ the distortion.

Surrogate loss learning. Let D be a dataset containing finite samples x_i ∈ R^n and y_i = f(x_i) ∈ R, where f is an unknown loss function. The task is to learn a surrogate loss f̂ from D, i.e.,

$$\hat{f} = \arg\min_{f \in \mathcal{F}} \; \mathbb{E}_{(x,y) \sim \mathcal{D}}\, (f(x) - y)^2,$$

where F is the model set (e.g. neural networks). In many applications it is highly desirable that each f ∈ F have a unique and efficiently-computable global minimum. An important model class is the input convex neural network (ICNN) (Amos et al., 2017). Since each such f is convex w.r.t. x, any local minimum is a global minimum; moreover, there is a rich literature on convex optimization. Although convexity is favourable for downstream optimization problems, it can be a very stringent requirement for fitting the dataset D. In this work we aim to construct a model set F such that every f ∈ F need not be convex but still possesses those favourable properties for optimization. In Section 5, we will show that the construction of such a model set relies on bi-Lipschitz neural networks.

3. Monotone and bi-Lipschitz Networks

In this section we first present the construction of µ-strongly monotone and ν-Lipschitz residual layers of the form F(x) = µx + H(x). We then construct bi-Lipschitz networks by deep composition of the new monotone and Lipschitz layers with orthogonal linear layers.

3.1. Feed-through network

For the nonlinear block H, we introduce a network architecture, called the feed-through network (FTN), which takes an MLP as its backbone and then connects each hidden layer directly to the input and output variables, see Figure 3.
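This wiring can be sketched in a few lines of code. The following is a minimal, hedged NumPy illustration of the residual feed-through structure; the weights are random placeholders, so the sketch carries no monotonicity or Lipschitz certificate on its own, and the certified construction of the weights is given in (1) and (5) below.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def ftn_residual_layer(x, mu, Ws, Us, Ys, bs, by):
    """F(x) = mu*x + H(x), with H a feed-through network (FTN):
       z_k = sigma(W_k z_{k-1} + U_k x + b_k),  z_0 = 0,
       y   = mu*x + sum_k Y_k z_k + b_y.
    Ws[0] is unused since z_0 = 0; every hidden layer reads x and writes to y."""
    y, z_prev = mu * x + by, None
    for k, (W, U, Y, b) in enumerate(zip(Ws, Us, Ys, bs)):
        pre = U @ x + b
        if k > 0:
            pre = pre + W @ z_prev        # long paths x -> z_i -> ... -> z_j -> y
        z_prev = relu(pre)
        y = y + Y @ z_prev                # short paths x -> z_k -> y
    return y

# Toy instantiation: input dimension 3, two hidden layers of width 4, random weights.
rng = np.random.default_rng(0)
n, dims = 3, [4, 4]
Us = [rng.normal(size=(m, n)) for m in dims]
Ws = [None] + [rng.normal(size=(dims[k], dims[k - 1])) for k in range(1, len(dims))]
Ys = [rng.normal(size=(n, m)) for m in dims]
bs = [rng.normal(size=m) for m in dims]
by = rng.normal(size=n)
print(ftn_residual_layer(rng.normal(size=n), 0.5, Ws, Us, Ys, bs, by))
```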
To be specific, the residual layer F(x) = µx + H(x) can be written as

$$z_k = \sigma(W_k z_{k-1} + U_k x + b_k), \quad z_0 = 0, \qquad y = \mu x + \sum_{k=1}^{L} Y_k z_k + b_y, \tag{1}$$

where z_k ∈ R^{m_k} are the hidden variables, and U_k, W_k, Y_k and b_k, b_y are the learnable weights and biases, respectively. Throughout the paper we assume that the activation σ is a scalar nonlinearity with slope restricted in [0, 1], which is satisfied (possibly with rescaling) by common activation functions such as ReLU, tanh, and sigmoid.

Figure 3. The proposed invertible residual network F(x) = µx + H(x), where the nonlinear block H is a feed-through network whose hidden layers are directly connected to the input and output.

Remark 3.1. The FTN contains both short paths x → z_i → y, preventing vanishing gradients, and long paths x → z_i → ··· → z_j → y, improving model expressivity (see Figure 3).

3.2. SDP conditions for monotonicity and Lipschitzness

The first step towards our parameterization is to establish strong monotonicity and Lipschitzness for F via semidefinite programming (SDP) conditions. For this, we rewrite F in a compact form:

$$z = \sigma(Wz + Ux + b), \qquad y = \mu x + Yz + b_y, \tag{2}$$

where $z = \begin{bmatrix} z_1^\top & \cdots & z_L^\top \end{bmatrix}^\top$, $b = \begin{bmatrix} b_1^\top & \cdots & b_L^\top \end{bmatrix}^\top$, and

$$W = \begin{bmatrix} 0 & & & \\ W_2 & 0 & & \\ & \ddots & \ddots & \\ & & W_L & 0 \end{bmatrix}, \qquad U = \begin{bmatrix} U_1 \\ U_2 \\ \vdots \\ U_L \end{bmatrix}, \qquad Y = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_L \end{bmatrix}.$$

Theorem 3.2. F is µ-strongly monotone and ν-Lipschitz if there exists a Λ ∈ D^m_+, where D^m_+ is the set of positive diagonal matrices, such that the following conditions hold:

$$Y = U^\top \Lambda, \qquad 2\Lambda - \Lambda W - W^\top \Lambda \succeq \frac{2}{\gamma}\, Y^\top Y, \tag{3}$$

where γ = ν − µ > 0.

Remark 3.3. The above conditions are obtained by applying the IQC theory (Megretski & Rantzer, 1997) to (2).

3.3. Model parameterization

Let Θ be the set of all θ = {U, W, Y, Λ} such that Condition (3) holds. Since it is generally not scalable to train a model with SDP constraints, we instead construct a direct parameterization, i.e. one that is both unconstrained and complete:

Definition 3.4. A direct parameterization of a constraint set Θ is a surjective differentiable mapping M : R^N → Θ, i.e. for any ϕ ∈ R^N we have M(ϕ) ∈ Θ, and the image of R^N covers Θ, i.e. M(R^N) = Θ.

A direct parameterization allows us to replace a constrained optimization over θ ∈ Θ with an unconstrained optimization over ϕ ∈ R^N without loss of generality. This enables the use of standard first-order optimization algorithms such as SGD or ADAM (Kingma & Ba, 2015).

We now construct a direct parameterization for FTNs satisfying (3). Here we present the main ideas; see Appendix A for full details. First, we introduce the free parameters ϕ = {F^p, F^q} ∪ {d_k, F^a_k, F^b_k}_{1≤k≤L}, where F^p ∈ R^{n×n}, F^q ∈ R^{m×n}, d_k ∈ R^{m_k}, F^a_k ∈ R^{m_k×m_k} and F^b_k ∈ R^{m_{k-1}×m_k} with m_0 = 0. Then, we compute the intermediate variables

$$\Psi_k = \mathrm{diag}\!\left(e^{d_k}\right), \qquad \begin{bmatrix} A_k^\top \\ B_k^\top \end{bmatrix} = \mathrm{Cayley}\!\left(\begin{bmatrix} F^a_k \\ F^b_k \end{bmatrix}\right), \qquad \begin{bmatrix} P \\ Q \end{bmatrix} = \mathrm{Cayley}\!\left(\begin{bmatrix} F^p \\ F^q \end{bmatrix}\right),$$

where Cayley : R^{n×p} → R^{n×p} with n ≥ p is defined by

$$J = \mathrm{Cayley}\!\left(\begin{bmatrix} G \\ H \end{bmatrix}\right) := \begin{bmatrix} (I+Z)^{-1}(I-Z) \\ -2H(I+Z)^{-1} \end{bmatrix} \quad \text{with} \quad Z = G - G^\top + H^\top H. \tag{4}$$

It can be verified that J^⊤J = I for any G ∈ R^{p×p} and H ∈ R^{(n−p)×p}. Note that P will not be used for further weight construction; its purpose is to ensure that Q^⊤Q ⪯ I. Next we set

$$V_k = 2 B_k A_{k-1}^\top, \qquad S_k = A_k Q_k - B_k Q_{k-1},$$

where $Q = \begin{bmatrix} Q_1^\top & \cdots & Q_L^\top \end{bmatrix}^\top$ and B_1 = 0, Q_0 = 0. Finally, we construct the weights in (1) as

$$U_k = \sqrt{2\gamma}\, \Psi_k^{-1} S_k, \qquad W_k = \Psi_k^{-1} V_k \Psi_{k-1}, \qquad Y_k = \sqrt{\gamma/2}\; S_k^\top \Psi_k, \qquad \Lambda_k = \tfrac{1}{2} \Psi_k^2. \tag{5}$$

Proposition 3.5. The model parameterization M defined in (5) is a direct parameterization for the set Θ, i.e. all models (1) satisfying Condition (3).

This means that we can learn the free parameter ϕ using first-order methods without any loss of model expressivity. The construction is now done, but we note that Ψ_k is shared between layers k and k + 1.
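Appendix A gives the full derivation. As a sanity check, the following hedged NumPy sketch re-implements the parameterization (4)-(5) with random free parameters and numerically verifies the certificate (3). Variable names, the layout of the block matrices, and the sign conventions follow the reconstruction above, so treat this as illustrative rather than reference code.

```python
import numpy as np

def cayley(G, H):
    """Cayley map of (4): J = [(I+Z)^{-1}(I-Z); -2H(I+Z)^{-1}], Z = G - G^T + H^T H, J^T J = I."""
    p = G.shape[0]
    Z = G - G.T + H.T @ H
    inv = np.linalg.inv(np.eye(p) + Z)
    return np.vstack([inv @ (np.eye(p) - Z), -2.0 * H @ inv])

def direct_param(n, dims, mu, nu, rng):
    """Hedged re-implementation of (5): map unconstrained parameters to (W, U, Y, Lambda) of (2)."""
    gamma, L, m = nu - mu, len(dims), sum(dims)
    # free parameters phi
    d  = [rng.normal(size=mk) for mk in dims]
    Fa = [rng.normal(size=(mk, mk)) for mk in dims]
    Fb = [rng.normal(size=(dims[k - 1], dims[k])) for k in range(1, L)]
    Fp, Fq = rng.normal(size=(n, n)), rng.normal(size=(m, n))
    # intermediate variables (Psi_k, A_k, B_k, Q_k)
    Psi = [np.diag(np.exp(dk)) for dk in d]
    AB  = [cayley(Fa[0], np.zeros((0, dims[0])))] + \
          [cayley(Fa[k], Fb[k - 1]) for k in range(1, L)]
    A   = [J[:dims[k], :].T for k, J in enumerate(AB)]              # A_k: m_k x m_k
    B   = [None] + [AB[k][dims[k]:, :].T for k in range(1, L)]      # B_k: m_k x m_{k-1}
    Q   = np.split(cayley(Fp, Fq)[n:, :], np.cumsum(dims)[:-1])     # Q_k: m_k x n
    # weights of (5)
    S   = [A[0] @ Q[0]] + [A[k] @ Q[k] - B[k] @ Q[k - 1] for k in range(1, L)]
    W   = np.zeros((m, m)); off = np.cumsum([0] + dims)
    for k in range(1, L):
        Vk = 2.0 * B[k] @ A[k - 1].T
        W[off[k]:off[k + 1], off[k - 1]:off[k]] = np.linalg.inv(Psi[k]) @ Vk @ Psi[k - 1]
    U   = np.vstack([np.sqrt(2 * gamma) * np.linalg.inv(Psi[k]) @ S[k] for k in range(L)])
    Y   = np.hstack([np.sqrt(gamma / 2) * S[k].T @ Psi[k] for k in range(L)])
    Lam = np.diag(np.concatenate([0.5 * np.exp(2 * dk) for dk in d]))
    return W, U, Y, Lam

# Numerically check the certificate (3) of Theorem 3.2 for a random draw.
rng = np.random.default_rng(0)
n, dims, mu, nu = 3, [8, 8], 0.5, 4.0
W, U, Y, Lam = direct_param(n, dims, mu, nu, rng)
gamma = nu - mu
M = 2 * Lam - Lam @ W - W.T @ Lam - (2 / gamma) * Y.T @ Y
print(np.allclose(Y, U.T @ Lam), np.linalg.eigvalsh(M).min() >= -1e-8)
```

For random draws, both checks (the equality Y = U⊤Λ and the LMI in (3)) should pass up to numerical tolerance.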
To have a modular implementation, we introduce new variables ẑ = Ψz and bias b̂ = Ψb with Ψ = diag(Ψ_1, Ψ_2, . . . , Ψ_L). Then, (2) can be rewritten as follows (see Appendix A):

$$\hat{z} = \hat{\sigma}\!\left(V\hat{z} + \sqrt{2\gamma}\, S x + \hat{b}\right), \qquad y = \mu x + \sqrt{\gamma/2}\; S^\top \hat{z} + b_y, \tag{6}$$

where σ̂(x) := Ψσ(Ψ^{-1}x) is an activation with slope restricted in [0, 1] and learnable scaling Ψ, and the weights S, V can be written as

$$S = \begin{bmatrix} S_1 \\ S_2 \\ \vdots \\ S_L \end{bmatrix}, \qquad V = \begin{bmatrix} 0 & & & \\ V_2 & 0 & & \\ & \ddots & \ddots & \\ & & V_L & 0 \end{bmatrix}. \tag{7}$$

Number of free parameters. Consider an L-layer FTN (1) where each layer has the same width, i.e. m_k = d. The bi-Lipschitz network based on spectral normalization (Liu et al., 2020) has 2Ld² free parameters, while our model size is (3L + 1)d² + Ld. Since the Ld² term dominates for deep and wide networks, our model has roughly 1.5 times as many parameters as the model from (Liu et al., 2020).

3.4. Bi-Lipschitz networks

We construct bi-Lipschitz networks (referred to as BiLipNets) by composing strongly monotone and Lipschitz layers with orthogonal layers,

$$G = O_{K+1} \circ F_K \circ O_K \circ F_{K-1} \circ \cdots \circ O_2 \circ F_1 \circ O_1, \tag{8}$$

where O_k(x) = P_k x + q_k with P_k^⊤P_k = I is an orthogonal layer and F_k is a µ_k-strongly monotone and ν_k-Lipschitz layer (6). By the composition rule, the above BiLipNet is (µ, ν)-Lipschitz with $\mu = \prod_{k=1}^{K} \mu_k$ and $\nu = \prod_{k=1}^{K} \nu_k$. The orthogonal matrix P_k can be parameterized via the Cayley transformation (4) or Householder transformations (Singla et al., 2022). Since the distortion of O_k is 1, it can improve network expressivity without increasing model distortion.

In some applications, e.g., normalising flows (Dinh et al., 2015; Papamakarios et al., 2021), we need to compute the inverse of G, which can be done in a backward manner:

$$G^{-1}(y) = O_1^{-1} \circ F_1^{-1} \circ \cdots \circ O_K^{-1} \circ F_K^{-1} \circ O_{K+1}^{-1}(y), \tag{9}$$

where O_k has an explicit inverse O_k^{-1}(y) = P_k^⊤(y − q_k). Computing the inverse F_k^{-1}(y) requires an iterative solver, which will be addressed in Section 4.

Partially bi-Lipschitz networks. A neural network G : R^n × R^l → R^n is said to be partially bi-Lipschitz if, for any fixed value of p ∈ R^l, the mapping y = G(x; p) is (µ, ν)-Lipschitz from x to y. We can construct such mappings via G(x; p) = G_{h(p)}(x), where G_ϕ is a (µ, ν)-Lipschitz network for any free parameter ϕ ∈ R^N and h : p → ϕ is a new learnable function. Since the dimension of ϕ is often very high, a practical approach is to make ϕ only partially depend on p. For instance, we can learn p-dependent biases via an MLP while the weight matrices of G_ϕ are independent of p.

4. Model inverse via operator splitting

In this section we give an efficient algorithm to compute F^{-1}(y), where F is a µ-strongly monotone and ν-Lipschitz layer (6). First, we write its model inverse F^{-1} as

$$\hat{z} = \hat{\sigma}\!\left(\left(V - \tfrac{\gamma}{\mu} SS^\top\right)\hat{z} + b_z\right), \qquad x = \tfrac{1}{\mu}\left(y - b_y - \sqrt{\gamma/2}\; S^\top \hat{z}\right), \tag{10}$$

with $b_z = \tfrac{\sqrt{2\gamma}}{\mu} S(y - b_y) + \hat{b}$. Both F and F^{-1} can be treated as special cases of deep equilibrium networks (Bai et al., 2019; Winston & Kolter, 2020; Revay et al., 2020) or implicit networks (El Ghaoui et al., 2021). The difference is that F has an explicit formula due to the strictly lower-triangular V, while F^{-1} is an implicit equation as SS^⊤ is a full matrix. A natural question for (10) is its well-posedness: for any y ∈ R^n, does there exist a unique ẑ ∈ R^m satisfying (10)?

Proposition 4.1. F^{-1} is well-posed if V, S are given by (7).

Certain classes of equilibrium networks were solved via two-operator splitting problems (Winston & Kolter, 2020; Revay et al., 2020). We follow a similar strategy, but our structure admits a three-operator splitting; see Proposition 4.2, with background in Appendix B. To state the result, we first recall the following fact from (Li et al., 2019).
For the monotone and 1-Lipschitz activation σ̂, there exists a proper convex function f : R^n → R satisfying σ̂(·) = prox^1_f(·), where

$$\mathrm{prox}^{\alpha}_{f}(x) = \arg\min_{z \in \mathbb{R}^n} \; \tfrac{1}{2}\|x - z\|^2 + \alpha f(z).$$

A list of f for popular activations is given in Appendix B.1.

Proposition 4.2. Finding a solution ẑ ∈ R^m to (10) is equivalent to finding a zero of the three-operator splitting problem 0 ∈ A(z) + B(z) + C(z), where A, B, C are monotone operators defined by

$$A(z) = (I - V)z - b_z, \qquad B(z) = \partial f(z), \qquad C(z) = \tfrac{\gamma}{\mu}\, SS^\top z,$$

where f satisfies σ̂(·) = prox^1_f(·).

For three-operator problems, the Davis-Yin splitting algorithm (DYS) (Davis & Yin, 2017) can be applied, obtaining the following fixed-point iteration:

$$\begin{aligned} z^{k+1/2} &= \mathrm{prox}^{\alpha}_{f}(u^k) \\ u^{k+1/2} &= 2z^{k+1/2} - u^k \\ z^{k+1} &= R_A\!\left(u^{k+1/2} - \alpha C(z^{k+1/2})\right) \\ u^{k+1} &= u^k + z^{k+1} - z^{k+1/2} \end{aligned} \tag{11}$$

where R_A(v) = ((1 + α)I − αV)^{-1}(v + αb_z). Since V is strictly lower triangular, we can evaluate R_A(v) using forward substitution. Furthermore, we can show that (11) is guaranteed to converge for α ∈ (0, 1/(τ − 1)), where τ is the model distortion.

5. Polyak-Łojasiewicz Networks

We call a network f : R^n → R a Polyak-Łojasiewicz (PL) network, or PLNet for short, if it satisfies the following PL condition (Polyak, 1963; Lojasiewicz, 1963):

$$\tfrac{1}{2}\|\nabla_x f(x)\|^2 \ge m\left(f(x) - \min_x f(x)\right), \quad \forall x \in \mathbb{R}^n, \tag{12}$$

where m > 0. The PL condition is significant in optimization since it is weaker than convexity but still implies that gradient methods converge to a global minimum at a linear rate (Karimi et al., 2016), making the PLNet a promising candidate for learning surrogate loss models.

Proposition 5.1. If G is µ-inverse Lipschitz, then

$$f(x) = \tfrac{1}{2}\|G(x)\|^2 + c, \quad c \in \mathbb{R}, \tag{13}$$

is a PLNet with m = µ².

Remark 5.2. We can further relax the quadratic assumption: f(x) = h(G(x)) is a PLNet if h : R^n → R is strongly convex (Karimi et al., 2016).

Remark 5.3. For parametric optimization problems, one can learn a surrogate loss via f(x; p) = ½‖G(x; p)‖² + c, where p ∈ R^m is the problem-specific parameter and G is a partially bi-Lipschitz network.

Remark 5.4. Any sub-level set L_α = {x : f(x) < α} with α > c is homeomorphic to a unit ball, making PLNets suitable as neural Lyapunov functions (Wilson, 1967). Applications of PLNets to learning Lyapunov-stable neural dynamics can be found in (Cheng et al., 2024).

Computing the global optimum of a PLNet. If f takes the form (13) and G is a bi-Lipschitz network (8), then f has a unique global optimum x* = G^{-1}(0), with G^{-1} given by (9). This can be efficiently computed by analytical inversion of the orthogonal layers and application of the DYS algorithm (11) to the monotone and Lipschitz layers.

Limitations of gradient descent for finding the global optimum. An alternative way to compute the global optimum x* is the standard gradient descent (GD) method x^{k+1} = x^k − α∇_x f(x^k). If ∇_x f is L-Lipschitz, then the above GD solver with α = 1/L has a linear global convergence rate of 1 − m/L with m = µ² (Karimi et al., 2016). However, this method has two drawbacks. First, the gradient function ∇_x f may not be globally Lipschitz, see Example 5.5. Secondly, even if a global Lipschitz bound exists, it is generally hard to estimate.

Example 5.5. Consider the scalar function f(x) = 0.5g²(x) with g(x) = 2x + sin x, which satisfies the PL condition. Note that ∂f/∂x = (2 + cos x)(2x + sin x) is not globally Lipschitz due to the term 2x cos x.

6. Experiments

Here we present experiments which explore the expressive quality of the proposed models, regularisation via model distortion, and the performance of the DYS solution method. Code is available at https://github.com/acfr/PLNet.

6.1.
Uncertainty quantification via neural Gaussian process It was shown in (Liu et al., 2020) that accurate uncertainty quantification of neural network models depends on a model s ability to quantify the distance of a test example from the training data. This distance-awareness can be achieved with bi-Lipschitz residual layers F(x) = x+H(x) and a Gaussian process output layer. In (Liu et al., 2020) this is achieved by imposing Lipschitz bound of 0 < c < 1 for H via spectral normalization. The resulting model is called Spectral-normalized Neural Gaussian Process (SNGP). In this section we examine the benefits of using the proposed Bi Lip Net in place of spectrally-normalized layers. Toy example. Using the two-moon dataset, we compare our (µ, ν)-Lipschitz network to an SNGP using a 3layer i-Res Net under the same bi-Lipschitz constraints, i.e., µ = (1 c)3 and ν = (1 + c)3, see Figure 4. For the lower-distortion case (i.e., small c = 0.1), SNGP fails to completely separate the train and out-of-distribution (OOD) data due to its loose Lipschitz bound. Our model can distinguish the OOD examples from training dataset and the uncertainty surface is close to the SNGP with much higher distortion (c = 0.9). As the model distortion increases, our model can have an uncertainty surface very close the dataset. The uncertainty surface of SNGP does not change much from c = 0.1 to c = 0.9, see Figure 4 and additional results in Appendix D.2. CIFAR-10/100. For image datasets, the SNGP model in (Liu et al., 2020) contains three bi-Lipschitz components, each with four residual layers of the form x+H(x) where H is constructed to be c-Lipschitz using spectral normalization. To ensure certifiable bi-Lipschitzness, we modify the SNGP model by choosing c (0, 1) and removing batch normalization from H since it may re-scale a layer s spectral norm in unexpected ways (Liu et al., 2023). The results of SNGP with batch normalization can be found in Appendix D.2. Our Bi Lip Net model has a similar architecture as SNGP except replacing the bi-Lipschitz components with the proposed (µ, ν)-Lipschitz network (8). To ensure both models have the same bi-Lipschitz bound, we choose µ = (1 c)4 and ν = (1 + c)4. Table 1 reports the results of SNGP and Bi Lip Net under different bounds c = 0.95, 0.65, 0.35. For CIFAR-10 dataset, our model uniformly outperforms SNGP on both clean Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 7 AUROC 98.1% AUROC 100.0% AUROC 100.0% AUROC 100.0% Figure 4. Predictive uncertainty of different NGPs with the same bi-Lipschitz bound. The points from dark blue and regions are classified as in-domain distribution and OOD data, respectively. Light blue and orange points (different colors indicate different labels) are training samples from the two-moon dataset. The red points are ODD test examples. For the case with small distortion, our model can still distinguish the train and OOD data, achieving similar results of SNGP with large distortion. and corrupted data, i.e., it achieves higher accuracy (about 10 20% improvement), lower expected calibration error (ECE) and negative log liklihood (NLL). Similar conclusion also holds for CIFAR-100 on accuracy and NLL, though our model has sightly higher ECE. As with the previous toy example, our model with small distortion (τ = 18.6 for c = 0.35) achieves better accuracy than SNGP with large distortion (τ = 2.3 106 for c = 0.95). Thus, we observe that our parameterization is much more expressive for a given distortion bound. 6.2. 
Surrogate loss learning We explore the PLNet s performance with the Rosenbrock function r(x, y) = 1/200(x 1)2 + 0.5(y x2)2 and its higher-dimensional generalizations. The Rosenbrock function is a classical test problem for optimization, since it is non-convex but a unique global minimum point (1, 1), at which the Hessian is poorly-conditioned. We also consider the sum of the Rosenbrock function and a 2D sine wave function, which still has a unique global minimum at (1, 1) while having many local minima, see Appendix D.1. We learned models of the form (13) where G is parameterized by MLP, i-Res Net (Behrmann et al., 2019), i-Dense Net (Perugachi-Diaz et al., 2021) and the proposed Bi Lip Net (8). We also trained the ICNN, a scalar-output model which is convex w.r.t. inputs (Amos et al., 2017). From Figure 7, we have the following observations. The unconstrained MLP can achieve small test errors. However, it has many local minima near the valley y = x2. This phenomena is more easily visible for the Rosenbrock+Sine case but also occurs in the plain Rosenbrock case. The ICNN model has a unique global minimum but the fitting error is large as its sub-level sets are convex. For i-Dense Net, the sub-level sets become mildly non-convex but their bi Lipschitz bound is quite conservative, so they do not capture the overall shape. In contrast, our proposed Bi Lip Net is more flexible and captures the non-convex shape while maintaining a unique global minimum. We note that in the Rosenbrock+Sine case, the Bi Lip Net surrogate has errors of similar magnitude to the MLP, but remains easily optimizable , i.e. it satisfies the PL condition and has a unique global minimum. Additional results are in Appendix D.2. Partial PLNet. We also fit a parameterized Rosenbrock function r(x, y; p) using partial PLNet with p-dependent biases (see Remark 5.3). The results in Figure 8 indicate that the approach can be effective even if only bias terms are modified by the external parameter p, and not weights. High-dimensional case. We now turn to scalability of the approach to higher-dimensional problems and analyse convergence of the DYS method for computing the global minimum. We apply the approach to a N=20-dimensional version of the Rosenbrock function: R(x) = 1 N 1 i=1 r(xi, xi+1) (14) which has a global minimum of zero at x = (1, 1, ..., 1) but is non-convex and has spurious local minima (Kok & Sandrock, 2009). We sample 10K training points uniformly over [ 2, 2]20. Note that, in contrast to the 2D example above, this is very sparse sampling of 20-dimensional space. A comparison of train and test error vs model distortion is shown in Figure 5. It can be seen that our proposed Bi Lip Net model achieves far better fits than i Res Net (Behrmann et al., 2021) and i Dense Net (Perugachi-Diaz et al., 2021), which can not achieve small training error for any value of the distortion parameter. Furthermore, for our network, the distortion parameter appears to act as an effective regularizer. Note that the best test error occurs after training error drops to near zero ( 10 8) but distortion is still relatively small. Solver comparison. Given the surrogate loss function learned by Bi Lip Net, we now compare methods to compute the location of its global minimum. In Figure 6 we compare the proposed DYS solver to the forward step method (FSM), see, e.g., (Ryu & Boyd, 2016). 
Specifically, the inverse x = F^{-1}(y), with F a µ-strongly monotone and ν-Lipschitz layer, can be computed via

$$x^{k+1} = x^k - \alpha\left(F(x^k) - y\right), \tag{15}$$

which has a convergence rate of 1 − µ²/ν² if α = µ/ν². We also consider a commonly used gradient-based method, ADAM (Kingma & Ba, 2015), applied directly to the surrogate loss. We take two values of the distortion parameter: τ = 5 (optimal) and τ = 50. In both cases, the proposed DYS method converges significantly faster than the alternatives, and the results illustrate an additional benefit of regularising via distortion, besides improving the test error: the τ = 5 case converges significantly faster than τ = 50.

At the computed point x* = G^{-1}(0) for τ = 5, the true function (14) takes a value of R(x*) = 0.041. This is more than an order of magnitude better than the smallest value of R(x) over the training data, which ranged over [0.475, 6.532], indicating that PLNets have a useful implicit bias and do not simply interpolate the training data.

Table 1. Results for SNGP and BiLipNet on CIFAR-10/100, averaged over 5 seeds. To ensure bi-Lipschitz bounds, batch normalization is removed from SNGP. BiLipNet uniformly and significantly outperforms SNGP in terms of accuracy on both clean and corrupted data.

| Dataset | Method | c | Accuracy (↑) Clean | Accuracy (↑) Corrupted | ECE (↓) Clean | ECE (↓) Corrupted | NLL (↓) Clean | NLL (↓) Corrupted |
|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | SNGP | 0.95 | 76.7 ± 0.629 | 58.7 ± 1.000 | 0.057 ± 0.007 | 0.079 ± 0.006 | 0.682 ± 0.015 | 1.199 ± 0.041 |
| CIFAR-10 | SNGP | 0.65 | 72.5 ± 1.500 | 54.7 ± 1.778 | 0.058 ± 0.006 | 0.078 ± 0.006 | 0.797 ± 0.046 | 1.303 ± 0.057 |
| CIFAR-10 | SNGP | 0.35 | 62.7 ± 0.334 | 52.3 ± 0.721 | 0.069 ± 0.010 | 0.065 ± 0.006 | 1.055 ± 0.010 | 1.356 ± 0.018 |
| CIFAR-10 | BiLipNet | 0.95 | 86.2 ± 0.250 | 70.8 ± 0.469 | 0.020 ± 0.003 | 0.052 ± 0.005 | 0.423 ± 0.006 | 0.895 ± 0.020 |
| CIFAR-10 | BiLipNet | 0.65 | 86.7 ± 0.129 | 72.8 ± 0.592 | 0.015 ± 0.005 | 0.047 ± 0.009 | 0.400 ± 0.006 | 0.830 ± 0.024 |
| CIFAR-10 | BiLipNet | 0.35 | 84.5 ± 0.184 | 72.6 ± 0.216 | 0.010 ± 0.002 | 0.052 ± 0.004 | 0.457 ± 0.002 | 0.827 ± 0.008 |
| CIFAR-100 | SNGP | 0.95 | 36.9 ± 1.656 | 25.5 ± 1.406 | 0.131 ± 0.010 | 0.068 ± 0.005 | 2.493 ± 0.068 | 3.073 ± 0.069 |
| CIFAR-100 | SNGP | 0.65 | 33.0 ± 0.481 | 24.3 ± 0.749 | 0.117 ± 0.006 | 0.068 ± 0.003 | 2.683 ± 0.015 | 3.140 ± 0.048 |
| CIFAR-100 | SNGP | 0.35 | 26.5 ± 1.630 | 19.3 ± 1.296 | 0.101 ± 0.016 | 0.056 ± 0.010 | 3.020 ± 0.062 | 3.406 ± 0.073 |
| CIFAR-100 | BiLipNet | 0.95 | 51.0 ± 0.480 | 35.8 ± 0.397 | 0.230 ± 0.006 | 0.137 ± 0.007 | 2.064 ± 0.024 | 2.718 ± 0.014 |
| CIFAR-100 | BiLipNet | 0.65 | 55.2 ± 0.426 | 39.2 ± 0.495 | 0.225 ± 0.004 | 0.137 ± 0.005 | 1.887 ± 0.021 | 2.576 ± 0.022 |
| CIFAR-100 | BiLipNet | 0.35 | 54.4 ± 0.438 | 41.1 ± 0.200 | 0.194 ± 0.008 | 0.126 ± 0.009 | 1.876 ± 0.031 | 2.447 ± 0.016 |

Figure 5. Surrogate loss learning for the 20-dimensional Rosenbrock function: comparison of training and test error vs. model distortion τ for PLNets with different bi-Lipschitz models (BiLipNet, i-ResNet, i-DenseNet).

Figure 6. Solver comparison for finding the global minimum of a PLNet (distortion τ = 5 and τ = 50). We try a range of learning rates [0.1, 0.5, 1.0, 2.0, 5.0] for ADAM and present the best result. The proposed back-solve method with the DYS algorithm (11) converges much faster than ADAM applied to f or the back-solve method with the FSM algorithm (15).

7. Conclusion

This paper has introduced a new bi-Lipschitz network architecture, the BiLipNet, and a new scalar-output network, the PLNet, which satisfies the Polyak-Łojasiewicz condition and is hence easily "optimizable". The core technical contribution is a new layer type: the feed-through layer, which has certified bounds for strong monotonicity and Lipschitzness. By composing with orthogonal layers we obtain a bi-Lipschitz network structure (BiLipNet) which has much tighter bounds than existing bi-Lipschitz residual networks based on spectral normalization.
The PLNet composes a Bi Lip Net with a quadratic output layer, and guarantees unique global minimum which is efficiently computable. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 9 Rosenbrock Rosenbrock + Sine Error Error ICNN (0.04, 16)-i-Dense Net (0.04, 16)-Bi Lip Net 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 0.000 0.015 0.030 0.045 0.060 0.075 0.090 0.105 0.120 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 0.4 2.0 3.6 5.2 6.8 8.4 10.0 11.6 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 12.8 Figure 7. Learning a surrogate loss for the Rosenbrock and Rosenbrock+Sine functions, which is non-convex and has many local minima. The first row contains the true functions while the remaining rows show learned functions and errors for various surrogate loss models. (a, b) = (1, 1) (a, b) = (0, 0) (a, b) = ( 1, 1) Partially Bi Lip Net 0.0 1.5 3.0 4.5 6.0 7.5 9.0 10.5 12.0 0.0 1.6 3.2 4.8 6.4 8.0 9.6 11.2 0.0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 0.0 0.6 1.2 1.8 2.4 3.0 3.6 4.2 0.0 3.2 6.4 9.6 12.8 16.0 19.2 22.4 0.0 3.2 6.4 9.6 12.8 16.0 19.2 22.4 Figure 8. Learning a parameterized Rosenbrock function r(x, y; a, b) = 1/200(x a)2 + 0.5(y bx2)2 via a partial PLNet. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 10 Impact Statement There are many application domains in which the trustworthiness of machine learning is a live topic of debate and raises important and challenging questions. The goal of this paper is to advance the sub-field of machine learning methods which have mathematically-certified properties. In particular, in this paper one application is uncertainty quantification. We hope that a positive impact of our paper and others like it will be to the development of ML methods that can better satisfy societal expectations of trustworthiness and transparency. We are not aware of any potentially significant negative impacts that are particularly associated with this line of research (models with certified properties). Ahn, B., Kim, C., Hong, Y., and Kim, H. J. Invertible monotone operators for normalizing flows. Advances in Neural Information Processing Systems, 35:16836 16848, 2022. Amos, B., Xu, L., and Kolter, J. Z. Input convex neural networks. In International Conference on Machine Learning (ICML), pp. 146 155. PMLR, 2017. Araujo, A., Havens, A. J., Delattre, B., Allauzen, A., and Hu, B. A unified algebraic perspective on Lipschitz neural networks. In The Eleventh International Conference on Learning Representations (ICLR), 2023. Ardizzone, L., Kruse, J., Rother, C., and K othe, U. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations (ICLR), 2018. Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International conference on machine learning (ICML), pp. 214 223. PMLR, 2017. Arora, S. and Doshi, P. A survey of inverse reinforcement learning: Challenges, methods and progress. Artificial Intelligence, 297:103500, 2021. Bai, S., Kolter, J. Z., and Koltun, V. Deep equilibrium models. In Advances in Neural Information Processing Systems, pp. 690 701, 2019. Barbara, N. H., Wang, R., and Manchester, I. R. On robust reinforcement learning with lipschitz-bounded policy networks. ar Xiv preprint ar Xiv:2405.11432, 2024. Bauer, M. and Mnih, A. Resampled priors for variational autoencoders. 
In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 66 75. PMLR, 2019. Behrmann, J., Grathwohl, W., Chen, R. T., Duvenaud, D., and Jacobsen, J.-H. Invertible residual networks. In International conference on machine learning (ICML), pp. 573 582. PMLR, 2019. Behrmann, J., Vicol, P., Wang, K.-C., Grosse, R., and Jacobsen, J.-H. Understanding and mitigating exploding inverses in invertible neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 1792 1800. PMLR, 2021. Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. Advances in Neural Information Processing Systems, 32, 2019. Cheng, J., Wang, R., and Manchester, I. R. Learning stable and passive neural differential equations. ar Xiv preprint ar Xiv:2404.12554, 2024. Coleman, C., Narayanan, D., Kang, D., Zhao, T., Zhang, J., Nardi, L., Bailis, P., Olukotun, K., R e, C., and Zaharia, M. Dawnbench: An end-to-end deep learning benchmark and competition. Training, 100(101):102, 2017. Cozad, A., Sahinidis, N. V., and Miller, D. C. Learning surrogate models for simulation-based optimization. AICh E Journal, 60(6):2211 2227, 2014. Davis, D. and Yin, W. A three-operator splitting scheme and its optimization applications. Set-valued and variational analysis, 25:829 858, 2017. Davis, T. A. Direct methods for sparse linear systems. SIAM, 2006. De Cao, N., Aziz, W., and Titov, I. Block neural autoregressive flow. In Uncertainty in artificial intelligence, pp. 1263 1273. PMLR, 2020. Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. ICLR Workshop Track, 2015. Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real NVP. In International Conference on Learning Representations (ICLR), 2017. El Ghaoui, L., Gu, F., Travacca, B., Askari, A., and Tsai, A. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930 958, 2021. Fazlyab, M., Robey, A., Hassani, H., Morari, M., and Pappas, G. Efficient and accurate estimation of Lipschitz constants for deep neural networks. In Advances in Neural Information Processing Systems, pp. 11427 11438, 2019. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 11 Grathwohl, W., Chen, R. T., Bettencourt, J., and Duvenaud, D. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations (ICLR), 2019. Grudzien, K., Uehara, M., Levine, S., and Abbeel, P. Functional graphical models: Structure enables offline datadriven optimization. In International Conference on Artificial Intelligence and Statistics, pp. 2449 2457. PMLR, 2024. Gu, S., Lillicrap, T., Sutskever, I., and Levine, S. Continuous deep Q-learning with model-based acceleration. In International conference on machine learning (ICML), pp. 2829 2838. PMLR, 2016. Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. Advances in neural information processing systems, 30, 2017. Havens, A. J., Araujo, A., Garg, S., Khorrami, F., and Hu, B. Exploiting connections between Lipschitz structures for certifiably robust deep equilibrium models. In Thirtyseventh Conference on Neural Information Processing Systems, 2023. Helfrich, K., Willmott, D., and Ye, Q. Orthogonal recurrent neural networks with scaled Cayley transform. In International Conference on Machine Learning (ICML), pp. 1969 1978. PMLR, 2018. 
Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In International Conference on Machine Learning (ICML), pp. 2722 2730. PMLR, 2019. Huang, C.-W., Krueger, D., Lacoste, A., and Courville, A. Neural autoregressive flows. In International Conference on Machine Learning (ICML), pp. 2078 2087. PMLR, 2018. Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795 811. Springer, 2016. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018. Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964 3979, 2020. Kok, S. and Sandrock, C. Locating and Characterizing the Stationary Points of the Extended Rosenbrock Function. Evolutionary Computation, 17(3):437 453, 09 2009. Li, J., Fang, C., and Lin, Z. Lifted proximal operator machines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4181 4188, 2019. Li, J., Li, F., and Todorovic, S. Efficient riemannian optimization on the stiefel manifold via the Cayley transform. In International Conference on Learning Representations (ICLR), 2020. Liang, E., Chen, M., and Low, S. Low complexity homeomorphic projection to ensure neural-network solution feasibility for optimization over (non-) convex set. In International conference on machine learning (ICML). PMLR, 2023. Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., and Lakshminarayanan, B. Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. Advances in Neural Information Processing Systems, 33:7498 7512, 2020. Liu, J. Z., Padhy, S., Ren, J., Lin, Z., Wen, Y., Jerfel, G., Nado, Z., Snoek, J., Tran, D., and Lakshminarayanan, B. A simple approach to improve single-model deep uncertainty via distance-awareness. Journal of Machine Learning Research, 24(42):1 63, 2023. Lojasiewicz, S. A topological property of real analytic subsets. Coll. du CNRS, Les equations aux d eriv ees partielles, 117(87-89):2, 1963. Louizos, C. and Welling, M. Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning (ICML), pp. 2218 2227. PMLR, 2017. Lu, C., Chen, J., Li, C., Wang, Q., and Zhu, J. Implicit normalizing flows. In International Conference on Learning Representations (ICLR), 2021. Megretski, A. and Rantzer, A. System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control, 42(6):819 830, 1997. Meunier, L., Delattre, B. J., Araujo, A., and Allauzen, A. A dynamical system perspective for Lipschitz neural networks. In International Conference on Machine Learning (ICML), pp. 15484 15500. PMLR, 2022. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 12 Misener, R. and Biegler, L. Formulating data-driven surrogate models for process optimization. Computers & Chemical Engineering, 179:108411, 2023. 
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR), 2018. Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617 2680, 2021. Pauli, P., Havens, A., Araujo, A., Garg, S., Khorrami, F., Allg ower, F., and Hu, B. Novel quadratic constraints for extending Lip SDP beyond slope-restricted activations. In International Conference on Learning Representations (ICLR), 2024. Perugachi-Diaz, Y., Tomczak, J., and Bhulai, S. Invertible densenets with concatenated Lipswish. Advances in Neural Information Processing Systems, 34:17246 17257, 2021. Polyak, B. Gradient methods for minimizing functionals (in russian). USSR Computational Mathematics and Mathematical Physics, 3(4):643 653, 1963. Prach, B. and Lampert, C. H. Almost-orthogonal layers for efficient general-purpose Lipschitz networks. In European Conference on Computer Vision, pp. 350 365. Springer, 2022. Rantzer, A. On the Kalman Yakubovich Popov lemma. Systems & control letters, 28(1):7 10, 1996. Revay, M., Wang, R., and Manchester, I. R. Lipschitz bounded equilibrium networks. ar Xiv preprint ar Xiv:2010.01732, 2020. Revay, M., Wang, R., and Manchester, I. R. Recurrent equilibrium networks: Flexible dynamic models with guaranteed stability and robustness. IEEE Transactions on Automatic Control, 2023. Russo, A. and Proutiere, A. Towards optimal attacks on reinforcement learning policies. In 2021 American Control Conference (ACC), pp. 4561 4567. IEEE, 2021. Ryu, E. K. and Boyd, S. Primer on monotone operator methods. Appl. comput. math, 15(1):3 43, 2016. Ryu, M., Chow, Y., Anderson, R., Tjandraatmadja, C., and Boutilier, C. CAQL: Continuous action Q-learning. In International Conference on Learning Representations (ICLR), 2019. Singla, S. and Feizi, S. Skew orthogonal convolutions. In International Conference on Machine Learning (ICML), pp. 9756 9766. PMLR, 2021. Singla, S., Singla, S., and Feizi, S. Improved deterministic l2 robustness on CIFAR-10 and CIFAR-100. In International Conference on Learning Representations (ICLR), 2022. Trockman, A. and Kolter, J. Z. Orthogonalizing convolutional layers with the Cayley transform. In International Conference on Learning Representations (ICLR), 2021. Tsuzuku, Y., Sato, I., and Sugiyama, M. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. In Advances in neural information processing systems, pp. 6541 6550, 2018. Wang, R. and Manchester, I. Direct parameterization of Lipschitz-bounded deep networks. In International Conference on Machine Learning (ICML), pp. 36093 36110. PMLR, 2023. Wang, Z., Prakriya, G., and Jha, S. A quantitative geometric approach to neural-network smoothness. Advances in Neural Information Processing Systems, 35:34201 34215, 2022. Ward, P. N., Smofsky, A., and Bose, A. J. Improving exploration in soft-actor-critic with normalizing flows policies. ICML Workshop on Invertible Neural Networks and Normalizing Flows,, 2019. Wilson, F. W. The structure of the level surfaces of a Lyapunov function. Journal of Differential Equations, 3(3): 323 329, 1967. Winston, E. and Kolter, J. Z. Monotone operator equilibrium networks. Advances in neural information processing systems, 33:10718 10728, 2020. Yeh, J. Real analysis: theory of measure and integration second edition. 
World Scientific Publishing Company, 2006. Zhang, B., Cai, T., Lu, Z., He, D., and Wang, L. Towards certifying l-infinity robustness using neural networks with l-inf-dist neurons. In International Conference on Machine Learning, pp. 12368 12379. PMLR, 2021. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 13 A. Model Parameterization A model parameterization is a mapping M : ϕ θ where ϕ RN is a free learnable parameter while θ includes the model weights U Rm n, W Rm m, Y Rn m and IQC multiplier Λ Dm + with n, m as the dimensions of the input and hidden units, respectively. The aim of this section is to construct a parameterization such that the large-scale SDP constraint (3) holds, i.e., Y = U Λ and H = 2Λ W Λ ΛW = 2Λ1 W 2 Λ2 Λ2W2 2Λ2 W 3 Λ3 ... ... ... ΛL 1WL 1 2ΛL 1 W L ΛL ΛLWL 2ΛL Since H 0 has band structure, it can be represented by H = XX (Davis, 2006). Moreover, from Lemma 3 of (Rantzer, 1996) we have that any U, Y satisfying Y = U Λ and XX 2 γ Y Y can be represent by γ/2Λ 1XQ, Y = p γ/2Q X (17) where Q Rm n with QQ I. The remaining task is to find X such that H = XX has the same sparse structure as (16), which was solved by (Wang & Manchester, 2023). For self-contained purpose, we provide detail construction as follows. First, we further parameterize X = ΨP, where Ψ = diag(Ψ1, . . . , ΨL) with Ψk Dmk + and A1 B2 A2 ... ... BL AL By comparing H = ΨPP Ψ with (16) we have Hkk = Ψk(Bk B k + Ak A k )Ψk = 2Λk, Hk 1,k = Ψk Bk A k 1 = Λk Wk, which further leads to Ψ2 k = 2Λk, Bk B k + Ak A k = I, Wk = 2Ψ 1 k Bk A k 1Ψk 1 k = 1, . . . , L, (18) with B1 = 0. We have converted the large-scale SDP constraint (16) into many simple and small-scale constraints such as Ψ2 k = 2Λk, Rk R k = I, QQ I (19) with Rk = Bk Ak , which further can be easily parameterized via the Cayley transformation (4), see Section 3.3. The Cayley transformation has been applied to construct orthogonal layers (Helfrich et al., 2018; Li et al., 2020; Trockman & Kolter, 2021) and 1-Lipschitz Sandwich layer (Wang & Manchester, 2023). An equivalent model representation. The model weights U, Y, W defined in (5) can be rewritten as U = 2γΨ 1S, Y = p γ/2S Ψ 1 and W = Ψ 1WΨ with S1 S2 ... SL A1Q1 A2Q2 B2Q1 ... ALQL BLQL 1 0 V2 0 ... ... VL 0 0 2B2A 1 0 ... ... 2BLA L 1 0 where Q = Q 1 Q L . Then, the network (2) can be written as z = σ(Ψ 1V Ψz + p 2γΨ 1Sx + b), y = µx + p γ/2S Ψz. (21) Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 14 By introducing the new hidden state ˆz = Ψz and bias ˆb = Ψb, we obtain an equivalent form: ˆz = ˆσ V ˆz + p 2γSx + ˆb , y = µx + p γ/2S ˆz + by. (22) This representation is useful for computing the model inverse via monotone operator splitting, see Appendix B. We now give a lemma which will be used later for proving some propositions. Lemma A.1. For the matrices V, S defined in (20) we have 2I V V 0, 2I SS 0. (23) Proof. First, we have 2I (V + V ) = 2 I A1B 2 B2A 1 I A2B 3 B 3 A2 ... ... ... ... where the inequality is obtained by sequentially applying the fact Ak A k + Bk B k = I and Schur complement to the top diagonal block. For the inequality on S, we have 2I SS = 2I PQQ P 2I PP A1 B2 A2 ... ... BL AL A1 B2 A2 ... ... BL AL I A1B 2 B2A 1 I A2B 3 B 3 A2 ... ... ... ... Similarly, the last inequality can be established by sequentially applying the Schur complement to the top diagonal block. B. Monotone Operator Splitting for Computing Model Inverse Inspired by (Winston & Kolter, 2020; Revay et al., 2020), we try to compute x = F 1(y) via an operator splitting method. 
We first present some background of monotone operator theory based on the survey (Ryu & Boyd, 2016), and then reformulate the model inverse as a three-operator splitting problem. B.1. Monotone operator An operator is a set-valued or single-valued map defined by a subset of the space A Rn Rn; we use the notation A(x) = {y | (x, y) A}. For example, the affine operator is defined by L(x) = {(x, Wx + b) | x Rn}. Another important example is the subdifferential operator f = {(x, f(x))} for a proper function f : Rn R { } with f(z) = for z / dom f, where f(x) = {g Rn | f(y) f(x) + y x, g , y Rn}. An operator A has a Lipschitz bound of L if u v L x y for all (x, u), (y, v) A. It is non-expansive if L = 1 and contractive if L < 1. A is strongly monotone with m > 0 if u v, x y m x y , (x, u), (y, v) A. (24) If the above inequality holds for m = 0, we call A a monotone operator. Similarly, A is said to be inverse monotone with ρ if u v, x y ρ u v , (x, u), (y, v) A. An operator is called maximal monotone if no other monotone operator strictly contains it. The linear operator L is m-strongly monotone if W + W 2m I, and ρ-inverse monotone if W + W 2ρW W. A subdifferential f is maximal monotone if and only if f is a convex closed proper (CCP) function. Here are some basic operations for operators: the operator sum A + B = {(x, y + z) | (x, y) A, (x, z) B}; the composition AB = {(x, z) | y s.t. (x, y) A, (y, z) B} ; Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 15 Table 2. A list of common activation functions and their associated convex proper f(z) whose proximal operator is σ(x) (Revay et al., 2020). For z / dom f, we have f(z) = . In the case of Softplus activation, Lis(z) is the polylogarithm function. Activation σ(x) Convex f(z) dom f Re Lu max(x, 0) 0 [0, ) Leaky Re Lu max(x, 0.01x) 99 2 min(z, 0)2 R Tanh tanh(x) 1 2 h ln(1 z2) + z ln 1+z 1 z z2i Sigmoid 1/(1 + e x) z ln z + (1 z) ln(1 z) z2 Arctan arctan(x) ln(| cos z|) z2 Softplus ln(1 + ex) Li2(ez) iπz z2/2 (0, ) the inverse operator A 1 = {(y, x) | (x, y) A}; the resolvent operator RA = (I + αA) 1 with α > 0; the Cayley operator CA = 2RA I. Note that the resolvent and Cayley operators are non-expansive for any maximal monotone A, and are contractive if A is strongly monotone. For a linear operator L we have RL(x) = (I + αW) 1(x αb). For a subdifferential operator f, its resolvent is R f(x) = proxα f (x) := arg minz 1/2 x z + αf(z), which is also called the proximal operator. Activation as proximal operator. As shown in (Li et al., 2019; Revay et al., 2020), many popular slope-restricted scalar activation functions can also be treated as proximal operators. To be specific, if σ : R R is slope-restricted in [0, 1], then there exists a convex proper function f such that σ( ) = prox1 f( ). For self-contained purpose, we provide a list of common activations and their associated convex proper functions in Table 2, which can also be found in (Revay et al., 2020; Li et al., 2019). B.2. Operator splitting Many optimization problems (e.g. convex optimization) can be formulated as one of finding a zero of an appropriate monotone operator F, i.e., find x Rn such that 0 F(x). Note that x is a solution if and only if it is a fixed point x = T (x) with T = I αF for any nonzero α R. The corresponding fixed point iteration is xk+1 = T (xk) = xk αF(xk). If F is m-strongly monotone and L-Lipschitz, then this iteration converges by choosing α (0, 2m/L2). The optimal convergence rate is 1 (m/L)2, given by α = m/L2. 
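To illustrate the paragraph above, the following hedged sketch runs the forward fixed-point iteration on a hand-made strongly monotone and Lipschitz operator; the operator and constants are ours, chosen only so that m and L are easy to bound.

```python
import numpy as np

# Forward iteration x_{k+1} = x_k - alpha*F(x_k) on a hand-made operator
# F(x) = m*x + skew-linear term + tanh(x) - b: the skew part is monotone
# (but not a gradient) and tanh is slope-restricted in [0, 1].
rng = np.random.default_rng(0)
n = 10
A = rng.normal(size=(n, n)) / np.sqrt(n)
skew = 0.5 * (A - A.T)
b = rng.normal(size=n)
m = 0.5
L = m + 1.0 + np.linalg.norm(skew, 2)     # crude Lipschitz bound for F

def F(x):
    return m * x + skew @ x + np.tanh(x) - b

alpha = m / L**2                          # step size achieving the rate 1 - (m/L)^2
x = rng.normal(size=n)
for _ in range(2000):
    x = x - alpha * F(x)
print(np.linalg.norm(F(x)))               # residual ||F(x)|| decays linearly towards zero
```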
If F contains some non-smooth components, we then split F into two or three maximal operators: two-operator splitting problem: 0 A(x) + B(x) (25) three-operator splitting problem: 0 A(x) + B(x) + C(x) (26) where A, B, and C are maximal monotone. The main benefit of such splitting is that the resolvent or Cayley operators for individual operator are easy to evaluate, which further leads to more computationally efficient algorithms. For two-operator splitting problem, some popular algorithms include forward-backward splitting (FBS) x = RB(I αA)(x) Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 16 forward-backward-forward splitting (FBFS) x = ((I αA)RB(I αA) + αA)(x) Peaceman-Rachford splitting (PRS) z = CACB(z), x = RB(z) Douglas-Rachford splitting (DRS) z = (1/2I + 1/2CACB)(z), x = RB(z) where the corresponding fixed-point iterations, the choices of hyper-parameter α and convergence results can be found in (Ryu & Boyd, 2016). For three-operator splitting problem, the Davis-Yin splitting (DYS) (Davis & Yin, 2017) can be expressed by z = T (z), x = RB(z) where T = CA(CB αCRB) αCRB. B.3. Operator splitting perspective for F 1 As shown in the proof of Proposition 4.2, By applying the forward-backward splitting with parameter α = 1, we can compute the solution z via the following iteration: zk+1 = RB(zk b A zk ) = ˆσ V γ/µSS zk + bz . It is worth pointing out that the above iteration may not converge for the choice of α = 1. In practice we often use more stable and faster two-operator splitting algorithms (e.g., PRS or DRS), see (Winston & Kolter, 2020; Revay et al., 2020). In this work, the motivation for further decomposing the monotone operator b A into two monotone operators A, C is that RA is a large-scale linear equation with nice sparse structure while R b A is dense due to the full weight matrix in C. Fixed-point iteration. We now apply the DYS algorithm from (Davis & Yin, 2017) to 0 A(z) + B(z) + C(z), resulting in the following fixed-point iteration: zk+1/2 = RB(uk) = proxα f (uk) uk+1/2 = 2zk+1/2 uk zk+1 = RA(uk+1/2 αC(zk+1/2)) uk+1 = uk + zk+1 zk+1/2 where the third line is a large-scale sparse linear equation of the form (1 + α)I αV21 (1 + α)I ... ... αVL,L 1 (1 + α)I zk+1 1 zk+1 2... zk+1 L = uk+1/2 + α bz γ µSS zk+1/2 . By introducing vk+1/2 = bz γ/µSS zk+1/2, we have zk+1 0 = 0, zk+1 l = α 1 + α Vl,l 1zk+1 l 1 + vk+1/2 l + 1 1 + αuk+1/2 l , l = 1, . . . , L. (28) Convergence range for the hyper-parameter α. From the previous paragraph, we know that (11) is equivalent to the FPI (27). From Theorem 1.1 of (Davis & Yin, 2017), we have that (27) converges for any α (0, 2β) with β as the inverse-monotone bound of C. From Lemma A.1 we have 2I S S and µS(S S)S = µ γ (γ/µSS )2 = 2β(γ/µSS )2 i.e., C(z) is inverse monotone with β = µ/(2γ). Therefore, (11) converges for any α (0, µ/γ). Since γ = ν µ and τ = ν/µ, we then obtain the convergence range in term of model distortion τ, i.e., α (0, 1/(τ 1)). Larger α often implies faster convergence rate, see Figure 9. Monotone, Bi-Lipschitz, and Polyak-Łojasiewicz Networks 17 0 200 400 600 10 4 x x / x (hid_dim=384) input_dim=8 DYS(0.7µ/γ) DYS(0.5µ/γ) DYS(0.3µ/γ) 0 200 400 600 input_dim=16 0 200 400 600 input_dim=32 0 200 400 600 input_dim=64 0 200 400 600 Steps x x / x (hid_dim=768) 0 200 400 600 Steps 0 200 400 600 Steps 0 200 400 600 Steps Figure 9. Solver comparison for computing the inverse of random µ-monotone and ν-Lipschitz layers with different input and hidden unit dimensions. 
We observe that DYS (27) converges faster for larger α. If α is close to the bound µ/γ with γ = ν − µ, DYS converges at a much faster rate than the FSM (15) with hyper-parameter α = µ/ν², which is the choice that achieves its best convergence rate (Ryu & Boyd, 2016).

C.1. Proof of Theorem 3.2

We consider the neural network H : x ↦ y defined by
$$v = Wz + Ux + b, \quad z = \sigma(v), \quad y = Yz + b_y. \tag{29}$$
Since F(x) = µx + H(x), F is µ-strongly monotone and ν-Lipschitz if H is monotone and γ-Lipschitz with γ = ν − µ. For any pair of solutions s¹ = (x¹, v¹, z¹, y¹) and s² = (x², v², z², y²), their difference ∆s = s¹ − s² satisfies
$$\Delta v = W \Delta z + U \Delta x, \quad \Delta z = J_\sigma(v^1, v^2)\, \Delta v, \quad \Delta y = Y \Delta z, \tag{30}$$
where J_σ is a diagonal matrix with [J_σ]_{ii} ∈ [0, 1] since σ is an elementwise activation with slope restricted to [0, 1]. For any Λ ∈ D_+^m we have
$$\langle \Delta v - \Delta z, \Lambda \Delta z \rangle = \Delta v^\top (I - J_\sigma) \Lambda J_\sigma \Delta v \ge 0, \quad \forall \Delta v \in \mathbb{R}^m. \tag{31}$$
Based on (30), (31) and Condition (3) we have
$$\begin{aligned} \langle \Delta x, \Delta y \rangle - \langle \Delta v - \Delta z, \Lambda \Delta z \rangle &= \langle \Delta x, Y \Delta z \rangle - \langle (W - I)\Delta z + U \Delta x, \Lambda \Delta z \rangle \\ &= \langle \Delta x, Y \Delta z \rangle - \langle \Delta x, U^\top \Lambda \Delta z \rangle + \langle (I - W)\Delta z, \Lambda \Delta z \rangle \\ &= \tfrac{1}{2}\, \Delta z^\top \bigl(\Lambda (I - W) + (I - W)^\top \Lambda\bigr) \Delta z \ge \tfrac{1}{\gamma}\|Y \Delta z\|^2 \ge 0, \end{aligned}$$
where the first two terms cancel due to the equality part Y = U⊤Λ of Condition (3) and the last inequality uses its LMI part 2Λ − ΛW − W⊤Λ ⪰ (2/γ)Y⊤Y. Together with (31) this implies ⟨∆x, ∆y⟩ ≥ ⟨∆v − ∆z, Λ∆z⟩ ≥ 0. Thus H is monotone, and F is µ-strongly monotone since ⟨∆x, ∆F⟩ = µ‖∆x‖² + ⟨∆x, ∆y⟩ ≥ µ‖∆x‖².

We can use a similar technique to derive the Lipschitz bound of H. First we have
$$\begin{aligned} \gamma \|\Delta x\|^2 - \tfrac{1}{\gamma}\|\Delta y\|^2 - 2\langle \Delta v - \Delta z, \Lambda \Delta z \rangle &= \gamma \|\Delta x\|^2 - \tfrac{1}{\gamma}\|\Delta y\|^2 + 2\langle (I - W)\Delta z, \Lambda \Delta z \rangle - 2\langle U \Delta x, \Lambda \Delta z \rangle \\ &= \gamma \|\Delta x\|^2 - 2\langle \Delta x, \Delta y \rangle - \tfrac{1}{\gamma}\|\Delta y\|^2 + \Delta z^\top (2\Lambda - \Lambda W - W^\top \Lambda) \Delta z \\ &\ge \gamma \|\Delta x\|^2 - 2\langle \Delta x, \Delta y \rangle + \tfrac{1}{\gamma}\|Y \Delta z\|^2 = \Bigl\|\sqrt{\gamma}\, \Delta x - \tfrac{1}{\sqrt{\gamma}}\, \Delta y\Bigr\|^2 \ge 0. \end{aligned}$$
Due to (31) we can further obtain γ²‖∆x‖² ≥ ‖∆y‖², i.e., H is γ-Lipschitz.

C.2. Proof of Proposition 3.5

Sufficient part: (5) ⟹ (3). From (17) we have that Y = U⊤Λ. We check the inequality part of (3) as follows:
$$2\Lambda - W^\top \Lambda - \Lambda W = \Psi P P^\top \Psi^\top = X X^\top \succeq X Q Q^\top X^\top = \tfrac{2}{\gamma} Y^\top Y.$$
Necessary part: (3) ⟹ (5). Since H ⪰ 0 has a band structure, it can be decomposed as H = XX⊤ where X has the following block lower-triangular structure (Davis, 2006):
$$X = \begin{bmatrix} X_{11} & & & \\ X_{21} & X_{22} & & \\ & \ddots & \ddots & \\ & & X_{L,L-1} & X_{LL} \end{bmatrix}.$$
For this special case, a way to construct X from Λ, W, and the further computation of the free parameters d, F_k^a, F_k^b, F_q, F can be found in (Wang & Manchester, 2023). Finally, we need to show that XX⊤ ⪰ (2/γ)Y⊤Y is equivalent to Y = √(γ/2) Q⊤X⊤ for some QQ⊤ ⪯ I, which follows directly from Lemma 3 of (Rantzer, 1996).

C.3. Proof of Proposition 4.1

From Lemma A.1 we have
$$2I - \bigl(V - \tfrac{\gamma}{\mu} S S^\top\bigr) - \bigl(V - \tfrac{\gamma}{\mu} S S^\top\bigr)^\top = 2I - V - V^\top + 2 \tfrac{\gamma}{\mu} S S^\top \succeq 2 \tfrac{\gamma}{\mu} S S^\top \succeq 0. \tag{32}$$
Then, the equilibrium network (10) is well-posed by Theorem 1 of (Revay et al., 2020).

C.4. Proof of Proposition 4.2

We first show that 0 ∈ A(z) + B(z) + C(z) is a monotone operator splitting problem. It is obvious that B, C are maximal monotone operators. From Lemma A.1 we have (I − V) + (I − V)⊤ ⪰ 0, i.e., A is also monotone. Next, we show that this operator splitting problem shares the same set of equilibrium points as the model inverse (10). First, we rewrite it as a two-operator splitting problem 0 ∈ Â(z) + B(z) with Â = A + C. By applying forward-backward splitting with parameter α = 1, we can compute the solution z via the iteration
$$z^{k+1} = R_B\bigl(z^k - \hat{A}(z^k)\bigr) = \operatorname{prox}_f^1\Bigl(z^k - \bigl(I - V + \tfrac{\gamma}{\mu} S S^\top\bigr) z^k + b_z\Bigr) = \hat{\sigma}\Bigl(\bigl(V - \tfrac{\gamma}{\mu} S S^\top\bigr) z^k + b_z\Bigr).$$
Thus, any solution z of the equilibrium network (10) is also an equilibrium point of the above iteration.

C.5. Proof of Proposition 5.1

First, we have ∇f(x) = ∇G(x)⊤G(x), where the Jacobian ∇G(x) = ∂G(x)/∂x satisfies ∇G(x)∇G(x)⊤ ⪰ µ²I since G is µ-inverse Lipschitz. Then, the PL inequality holds for f with m = µ², i.e.,
$$\tfrac{1}{2}\|\nabla f(x)\|^2 = \tfrac{1}{2}\, G(x)^\top \nabla G(x) \nabla G(x)^\top G(x) \ge \tfrac{\mu^2}{2}\|G(x)\|^2 = \mu^2 \bigl(f(x) - f^\star\bigr). \tag{33}$$

D. Experiments

D.1. Training details

We choose ReLU as the default activation and use ADAM (Kingma & Ba, 2015) with a one-cycle linear learning rate schedule (Coleman et al., 2017), except for the NGP case, which uses SGD with a piecewise-constant schedule.
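For concreteness, a minimal PyTorch sketch of this optimizer and scheduler combination is shown below; the model, data, and hyper-parameter values are placeholders and do not reproduce any particular experiment in this paper.

```python
import torch

model = torch.nn.Linear(2, 1)             # stand-in for BiLipNet / PLNet
epochs, steps_per_epoch = 200, 100        # illustrative values

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# One-cycle schedule with linear annealing.
sched = torch.optim.lr_scheduler.OneCycleLR(
    opt, max_lr=1e-2, epochs=epochs, steps_per_epoch=steps_per_epoch,
    anneal_strategy="linear")

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(256, 2)           # dummy mini-batch
        loss = ((model(x) - 1.0) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
```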
For the NGP case we use the cross-entropy loss, while the L2 loss is used for the remaining examples. We found that enforcing Q⊤Q = I, which can be done by fixing F_p = 0, can improve model training. The datasets and model architectures are described as follows.

1D step function. The target function is the step function
$$f(x) = \begin{cases} 2, & x > 0, \\ -2, & x < 0, \end{cases}$$
which is monotone and 0-Lipschitz everywhere except at the singular point x = 0. We fit this curve with (0.1, 10)-Lipschitz models. The optimal fit is a continuous piecewise-linear function with slope 10 near x = 0 and slope 0.1 near x = ±2. We take 1000 random samples from [−2, 2] for training. Our model (BiLipNet) is a one-layer residual network F(x) = µx + H(x) where H has 8 hidden layers of width 32, giving the model 15.8K parameters. We compare to i-ResNet (Chen et al., 2019) and i-DenseNet (Perugachi-Diaz et al., 2021), whose nonlinear blocks H have 2 and 4 hidden layers, respectively. For those two models, we test depths from 2 to 8 with the hidden width chosen so that they have a similar number of parameters. The empirical Lipschitz bounds are computed via finite differences over the test data (a minimal sketch of this estimate is given at the end of this appendix). As shown in Figure 1, our model achieves much tighter bounds than the other models.

Neural Gaussian process. We take 1000 two-moons data points as training data and 1000 Gaussian samples with mean (1.3, 1.8) and variance (0.02, 0.01) as OOD data. For all models, we use a fixed input weight to map the 2D input into a 128D hidden space, then perform a hidden-space transformation using bi-Lipschitz models, and finally add a Gaussian process as the output layer. SNGP uses 3 residual layers x + H(x) where the Lipschitz bound of H is c < 1. BiLipNet has one monotone and Lipschitz layer with two orthogonal layers, i.e., K = 1 in (8). The nonlinear block H of our model has 6 hidden layers of width 32. Both models are chosen to have the same number of parameters, roughly 233K.

CIFAR-10/100 datasets. We first adopt the SNGP model from (Liu et al., 2020) and make the following modifications. SNGP contains three bi-Lipschitz components, each consisting of four residual layers of the form x + H(x). It uses a spectral-norm bound of c = 6 for the weights inside H, which means that the bi-Lipschitz property may not hold. To provide a certified guarantee of bi-Lipschitzness we need c ∈ (0, 1). We tried three values of c: 0.35, 0.65 and 0.95. Since the bi-Lipschitz bounds are µ = (1 − c)⁴ and ν = (1 + c)⁴, a larger c implies a more expressive SNGP model. We ran SNGP with and without batch normalization in the bi-Lipschitz components. As pointed out in (Liu et al., 2023), batch normalization may re-scale a layer's spectral norm in unexpected ways, so there is no theoretical guarantee on the bi-Lipschitz property when batch normalization is applied. Training the original SNGP takes about 95% of the GPU memory of an Nvidia RTX 3090. With the same number of parameters, our model needs more GPU memory as it uses the approach of (Trockman & Kolter, 2021) to perform the Cayley transform of convolution operators, which involves an FFT and an inverse FFT. In order to train both models on a single GPU, we reduce the width of SNGP so that it has a similar number of parameters as our model (about 14M). Our model has a similar structure to SNGP except that we replace its bi-Lipschitz components with our proposed bi-Lipschitz networks. Note that there is no batch normalization inside our bi-Lipschitz networks.
All models are trained for 200 epochs using mini-batch stochastic gradient descent (SGD) with a batch size of 256. We adjust the learning rate according to a piecewise-constant schedule.

2D Rosenbrock function. The true function is a Rosenbrock function defined by
$$r(x, y) = \tfrac{1}{200}(x - 1)^2 + \tfrac{1}{2}\bigl(y - x^2\bigr)^2.$$
Note that we use a scaling factor of 1/200 relative to the classic Rosenbrock function. The above function is non-convex but has a unique minimum at (1, 1). We also consider the combination of the above Rosenbrock function with the following 2D sine function:
$$s(x, y) = 0.25\bigl(\sin(8(x - 1) - \pi/2) + \sin(8(y - 1) - \pi/2) + 2\bigr).$$
In this case r(x, y) + s(x, y) still has a unique global minimum at (1, 1), but there are many local minima. We take 5K random training samples from the domain [−2, 2] × [−1, 3]. The proposed BiLipNet contains two monotone and Lipschitz layers (i.e., K = 2 in (8)). The nonlinear block H has 4 hidden layers of width 128, and the model size is roughly 16K parameters. The ICNN model has 8 hidden layers of width 180. The MLP has hidden units of [128, 256, 256, 512]. We trained i-ResNet and i-DenseNet with different depths and widths such that the total number of parameters is comparable with BiLipNet.

Parametric Rosenbrock function. We consider the following parametric Rosenbrock function:
$$r(x, y; p) = \tfrac{1}{200}(x - a)^2 + \tfrac{1}{2}\bigl(y - b x^2\bigr)^2, \quad p = (a, b) \in [-1, 1]^2.$$
We take 10K random training samples. The partial BiLipNet contains 3 orthogonal layers and 2 monotone and Lipschitz layers (the H block of each layer has 4 hidden layers of width 128). The bias term of each orthogonal layer is produced by an MLP with hidden units [64, 128, 2], while the biases for the hidden units inside the H blocks are generated by an MLP with hidden units [64, 128, 256, 512]. The model's bi-Lipschitz bound is chosen as (0.04, 16). The resulting model size is 604K parameters.

ND Rosenbrock function. We also consider the N-dimensional (N = 20) Rosenbrock function
$$R(x) = \frac{1}{N - 1} \sum_{i=1}^{N-1} r(x_i, x_{i+1}),$$
which is non-convex and has a unique global minimum at (1, 1, . . . , 1); it also has many local minima. We take 10K random samples over the domain [−2, 2]^{20} and train with a batch size of 200. Note that the data size is very small compared to the dimension. We then use 500K samples for testing. BiLipNet has two monotone and Lipschitz layers (i.e., K = 2 in (8)), where each layer has a nonlinear block H with 8 hidden layers of width 256 (model size 2.1M). For i-ResNet/i-DenseNet, we try depths from 2 to 10 and observe that a depth of 5 yields slightly better results. The hidden-layer width is chosen so that each model has a similar number of parameters as BiLipNet. (A data-generation sketch for these Rosenbrock targets is given below.)

D.2. Extra results

Some extra results for the bi-Lipschitz models on the two-moons and CIFAR-10/100 datasets are shown in Figure 10 and Table 3, respectively. Figure 11 depicts additional results on surrogate loss learning.
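For reference, the Rosenbrock-type targets and training data described above can be generated as follows. The formulas and sample sizes come from the text; the sampling code itself is our illustrative sketch.

```python
import numpy as np

def r(x, y, a=1.0, b=1.0):
    """Scaled (parametric) Rosenbrock term (1/200)(x-a)^2 + (1/2)(y-b*x^2)^2."""
    return (x - a) ** 2 / 200.0 + 0.5 * (y - b * x ** 2) ** 2

def s(x, y):
    """2D sine term that adds local minima; s(1, 1) = 0."""
    return 0.25 * (np.sin(8 * (x - 1) - np.pi / 2)
                   + np.sin(8 * (y - 1) - np.pi / 2) + 2)

def R(x):
    """N-dimensional Rosenbrock: mean of pairwise 2D terms, minimum at (1,...,1)."""
    return r(x[..., :-1], x[..., 1:]).mean(axis=-1)

rng = np.random.default_rng(0)
xy = rng.uniform([-2.0, -1.0], [2.0, 3.0], size=(5000, 2))   # 2D training inputs
targets_2d = r(xy[:, 0], xy[:, 1]) + s(xy[:, 0], xy[:, 1])   # Rosenbrock + Sine
x20 = rng.uniform(-2.0, 2.0, size=(10000, 20))               # N = 20 inputs
targets_nd = R(x20)
```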
Table 3. Results for SNGP-BN (SNGP with batch normalization) and BiLipNet (without batch normalization) on CIFAR-10/100, averaged over 5 seeds. As pointed out by (Liu et al., 2023), batch normalization may rescale a layer's spectral norm in unexpected ways, so there is no theoretical guarantee on the bi-Lipschitz property for SNGP-BN. This may offer it extra expressive power, leading to performance improvements in both clean and corrupted accuracy for large-distortion models (i.e., c = 0.95). For models with low distortion (i.e., c = 0.35), BiLipNet has better accuracy on the corrupted dataset.

| Method               | c    | Accuracy (↑) clean | Accuracy (↑) corrupted | ECE (↓) clean | ECE (↓) corrupted | NLL (↓) clean | NLL (↓) corrupted |
| SNGP-BN (CIFAR-10)   | 0.95 | 94.7 ± 0.079       | 73.0 ± 0.461           | 0.017 ± 0.002 | 0.127 ± 0.010     | 0.166 ± 0.004 | 0.991 ± 0.054     |
| SNGP-BN (CIFAR-10)   | 0.65 | 94.1 ± 0.159       | 72.3 ± 0.561           | 0.016 ± 0.000 | 0.116 ± 0.005     | 0.182 ± 0.005 | 0.985 ± 0.029     |
| SNGP-BN (CIFAR-10)   | 0.35 | 92.3 ± 0.260       | 70.4 ± 0.800           | 0.008 ± 0.003 | 0.095 ± 0.007     | 0.231 ± 0.006 | 0.995 ± 0.031     |
| BiLipNet (CIFAR-10)  | 0.95 | 86.2 ± 0.250       | 70.8 ± 0.469           | 0.020 ± 0.003 | 0.052 ± 0.005     | 0.423 ± 0.006 | 0.895 ± 0.020     |
| BiLipNet (CIFAR-10)  | 0.65 | 86.7 ± 0.129       | 72.8 ± 0.592           | 0.015 ± 0.005 | 0.047 ± 0.009     | 0.400 ± 0.006 | 0.830 ± 0.024     |
| BiLipNet (CIFAR-10)  | 0.35 | 84.5 ± 0.184       | 72.6 ± 0.216           | 0.010 ± 0.002 | 0.052 ± 0.004     | 0.457 ± 0.002 | 0.827 ± 0.008     |
| SNGP-BN (CIFAR-100)  | 0.95 | 72.3 ± 0.513       | 44.8 ± 0.470           | 0.071 ± 0.006 | 0.091 ± 0.006     | 1.042 ± 0.018 | 2.476 ± 0.025     |
| SNGP-BN (CIFAR-100)  | 0.65 | 67.8 ± 1.006       | 41.5 ± 0.916           | 0.117 ± 0.007 | 0.092 ± 0.002     | 1.231 ± 0.035 | 2.573 ± 0.036     |
| SNGP-BN (CIFAR-100)  | 0.35 | 61.9 ± 0.741       | 37.0 ± 0.660           | 0.158 ± 0.006 | 0.098 ± 0.006     | 1.510 ± 0.029 | 2.760 ± 0.043     |
| BiLipNet (CIFAR-100) | 0.95 | 51.0 ± 0.480       | 35.8 ± 0.397           | 0.230 ± 0.006 | 0.137 ± 0.007     | 2.064 ± 0.024 | 2.718 ± 0.014     |
| BiLipNet (CIFAR-100) | 0.65 | 55.2 ± 0.426       | 39.2 ± 0.495           | 0.225 ± 0.004 | 0.137 ± 0.005     | 1.887 ± 0.021 | 2.576 ± 0.022     |
| BiLipNet (CIFAR-100) | 0.35 | 54.4 ± 0.438       | 41.1 ± 0.200           | 0.194 ± 0.008 | 0.126 ± 0.009     | 1.876 ± 0.031 | 2.447 ± 0.016     |

Figure 10. Uncertainty quantification via the neural Gaussian process with different bi-Lipschitz bound specifications: Lip. [0.343, 2.197] (c = 0.3), Lip. [0.125, 3.375] (c = 0.5), and Lip. [0.027, 4.913] (c = 0.7). Each column shows the predictive uncertainty for the train and OOD data and the corresponding uncertainty surface.

Figure 11. Additional results for learning a surrogate loss for the Rosenbrock and Rosenbrock + Sine functions. The first row contains the true functions, while the remaining rows show the learned functions and errors for various surrogate loss models ((0.04, 16)-i-ResNet, (0.25, 4)-BiLipNet, and (0.5, 2)-BiLipNet). Our model has the flexibility to capture the non-convex sub-level sets, but can also fit smoothed representations by reducing the distortion parameter.
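Finally, the finite-difference estimate of the empirical bi-Lipschitz bounds mentioned in the 1D step-function experiment can be sketched as follows; the helper name and the toy 1D map are ours. Note that the maximum ratio only lower-bounds the true Lipschitz constant ν, and the minimum ratio only upper-bounds the inverse Lipschitz constant µ.

```python
import numpy as np

def empirical_bilipschitz(f, X, num_pairs=100_000, seed=0):
    """Estimate (mu, nu) as the min/max of ||f(x) - f(x')|| / ||x - x'||
    over random pairs of test points."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=num_pairs)
    j = rng.integers(0, len(X), size=num_pairs)
    keep = i != j
    i, j = i[keep], j[keep]
    dx = np.linalg.norm(X[i] - X[j], axis=-1)
    dy = np.linalg.norm(f(X[i]) - f(X[j]), axis=-1)
    ratio = dy / dx
    return ratio.min(), ratio.max()

# Toy 1D map with slopes in (0.1, 3.1]; for a certified BiLipNet the ratios
# would be guaranteed to stay inside [mu, nu].
X = np.linspace(-2.0, 2.0, 1000).reshape(-1, 1)
mu_hat, nu_hat = empirical_bilipschitz(lambda x: np.tanh(3 * x) + 0.1 * x, X)
print(mu_hat, nu_hat)
```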