Published as a conference paper at ICLR 2022

DECLARATIVE NETS THAT ARE EQUILIBRIUM MODELS

Russell Tsuchida, Suk Yee Yong, Mohammad Ali Armin, Lars Petersson, Cheng Soon Ong
Data61, CSIRO, Canberra, Australia; Space & Astronomy, CSIRO, Epping, Australia; Machine Learning & Artificial Intelligence Future Science Platform

ABSTRACT

Implicit layers are computational modules that output the solution to some problem depending on the input and the layer parameters. Deep equilibrium models (DEQs) output a solution to a fixed point equation. Deep declarative networks (DDNs) solve an optimisation problem in their forward pass, an arguably more intuitive and interpretable problem than finding a fixed point. We show that solving a kernelised regularised maximum likelihood estimate as an inner problem in a DDN yields a large class of DEQ architectures. Our proof uses the exponential family in canonical form, and provides a closed-form expression for the DEQ parameters in terms of the kernel. The activation functions have interpretations in terms of the derivative of the log partition function. Building on existing literature, we interpret DEQs as fine-tuned, unrolled classical algorithms, giving an intuitive justification for why DEQ models are sensible. We use our theoretical result to devise an initialisation scheme for DEQs that allows them to solve kernelised generalised linear models (kGLMs) in their forward pass at initialisation. We empirically show that this initialisation scheme improves training stability and performance over random initialisation.

1 INTRODUCTION

Implicit layers (Pineda, 1987; Almeida, 1990) have recently been subject to renewed attention (Kolter et al., 2020). In contrast with explicitly defined layers, implicit layers define a mapping in terms of a solution to some problem depending on the input and problem parameters. For example, deep equilibrium models (DEQs) consist of layers that output fixed points of parameterised functions (Bai et al., 2019). Deep declarative networks (DDNs) use declarative nodes, which output solutions to optimisation problems (Gould et al., 2016; Amos & Kolter, 2017). Neural ordinary differential equations (NODEs) output solutions to ODEs (Chen et al., 2018; Dupont et al., 2019).

Traditional explicit layers can always be represented as implicit layers (for example, see Proposition 4.10 of Gould et al. (2021)). Also, solutions to certain convex optimisation problems may be obtained via an iterative optimisation procedure such as Newton's method or gradient descent, and as such may be represented as fixed points of an iterative scheme. A correspondence between DDNs and DEQs is therefore to be expected (though it has not previously been made explicit), given the fundamental connection between fixed points of iterative maps and critical points of optimisation problems (Ryu & Boyd, 2016). This leads to two natural questions: (1) For a given optimisation problem, what is the corresponding DEQ architecture? (2) Can this correspondence be exploited for theoretical, conceptual, or practical benefit?

Contributions. (1) We prove an equivalence between a DEQ and a DDN with a classical statistical model, a kernelised generalised linear model (kGLM), as the declarative node, as illustrated in Figure 1 and formally stated in Theorem 3 and Corollary 6. The weights of the DEQ layer have closed-form expressions in terms of the kernel. The surprise in our result is that the feature mapping involved in this correspondence is exactly the class of hidden layers most commonly used in practice.
(2) We empirically demonstrate that initialising a DEQ as a DDN using the derived expression for the weights improves performance and stability over random parameter initialisation.

[Figure 1 schematic: an implicit network module (top); a DDN whose declarative node solves $\alpha^\star = \arg\min_\alpha \mathrm{kGLM}(\alpha, X, Y)$ and predicts $f^\star(\widetilde{X}) = k(\widetilde{X}, X)\alpha^\star$ (middle); a DEQ with fixed point layer $z^\star = \sigma\big(W_1 z^\star + W_2 T(Y) + W_3 \rho(z^\star)\big)$ and readout $f^\star(Y) = V_1\big(z^\star + \rho(z^\star)\big) + V_2 T(Y)$ (bottom); both feed an outer loss $L(F_\gamma(\cdot), \widetilde{Y})$.]

Figure 1: Implicit network modules such as DDNs and DEQs output solutions to problems depending on their inputs and parameters (top). We establish a connection between DDNs with an optimisation layer (middle) and DEQs with a fixed point layer (bottom). Under mild conditions, a DDN that solves a kGLM as its inner problem over dataset $(X, Y)$ and forms predictions at $\widetilde{X}$ (middle) is equivalent to a fully connected or convolutional DEQ accepting a datapoint $Y$ with underlying fixed coordinates $X$ and $\widetilde{X}$ (bottom). In an inpainting task, the inner (yellow) and outer regions represent pixel coordinates $X$ and $\widetilde{X}$, and $Y$ and $\widetilde{Y}$ represent the corresponding image values. The DEQ consists of a fixed point/linear layer with known fixed parameters $W_1, W_2, W_3, V_1, V_2$ determined by the kernel $k$ and the coordinates $X, \widetilde{X}$. The activation functions $\sigma, \rho, T$ are determined by the exponential family and kernel regulariser of the kGLM. Red indicates that gradient signals are blocked for exact equivalence; equivalence is exact under our initialisation scheme.

Notation. Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}$ be coordinate and target spaces. Define a $\mathcal{Y}$-valued stochastic process $\{y(x, \omega)\}_{x \in \mathcal{X}}$ indexed by $x \in \mathcal{X}$ with outcome $\omega$ from a sample space $\Omega$. Let $X \in \mathbb{R}^{n \times d_x}$ be a matrix such that the $i$th row of $X$ is some element $x_i \in \mathcal{X}$. Similarly define $\widetilde{X} \in \mathbb{R}^{\widetilde{n} \times d_x}$. Take $n$ evaluations $\{y_i\}_{i=1}^n = \{y(x_i, \omega_0)\}_{i=1}^n$ from a realisation of the stochastic process, one for each row of $X$, and form a corresponding matrix $Y$. Using the same realisation of the stochastic process, form $\widetilde{Y}$ with evaluations $\{y(\widetilde{x}_j, \omega_0)\}_{j=1}^{\widetilde{n}}$, one for each row of $\widetilde{X}$. Call $X$ and $Y$ the inner coordinate and target data, and $\widetilde{X}$ and $\widetilde{Y}$ the outer coordinate and target data.

Example implication of result. Consider image inpainting of a single $32 \times 32$ colour image. Take $\mathcal{X} = \{1, \ldots, 32\}^2 \times \{1, 2, 3\}$, and $X \in \mathbb{Z}^{n \times 3}$ to be a matrix consisting of (non-repeated) triples, where $n < 32^2 \cdot 3$. Take $\widetilde{X} \in \mathbb{Z}^{\widetilde{n} \times 3}$ to be the matrix consisting of the remaining triples, such that $n + \widetilde{n} = 32^2 \cdot 3$. An incomplete image $Y \in \mathbb{R}^n$ together with its missing values $\widetilde{Y} \in \mathbb{R}^{\widetilde{n}}$ are jointly sampled from an image distribution. We may use kernel ridge regression to produce a per-pixel prediction of $\widetilde{Y}$ given $X$, $Y$ and $\widetilde{X}$. This model represents a special case of a regularised kGLM with a closed-form predictor, but more generally an algorithm is required to compute the predictor (see Appendix A). We now move from a single image to a set of images. We wrap the kGLM with an outer minimisation loop over the loss $L$ between the output of a neural network $F_\gamma$ applied to the kGLM predictor and the target $\widetilde{Y}$ to obtain a DDN, as sketched in Figure 1. We view the DDN as producing a per-image prediction. We show that such a DDN is equivalent to a DEQ with a closed-form expression for the parameters depending on the kernel. When the kGLM has fixed hyperparameters, the DEQ has a fixed set of parameters (made precise in Theorem 3 and Corollary 6). Initialising the DEQ with these parameters improves training stability and performance, as shown in Section 4.
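As a minimal illustration of the single-image step just described, the kernel ridge regression predictor has the familiar closed form $f^\star(\widetilde{X}) = k(\widetilde{X}, X)(K + \lambda I)^{-1} Y$. The NumPy sketch below (illustrative sizes, kernel and regularisation constant; not the released code accompanying this paper) inpaints the missing values of a toy image this way.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=2.0):
    """Squared exponential kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(0)

# A toy single-channel 8x8 "image" with a smooth underlying signal and missing pixels.
coords = np.array([(i, j) for i in range(8) for j in range(8)], dtype=float)
image = np.sin(coords[:, 0] / 3.0) * np.cos(coords[:, 1] / 3.0)
observed = rng.random(len(coords)) < 0.6

X, Y = coords[observed], image[observed]        # inner (conditioning) coordinates and values
X_tilde = coords[~observed]                     # outer coordinates to be inpainted

# Kernel ridge regression: alpha = (K + lam I)^{-1} Y, prediction = k(X~, X) alpha.
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), Y)
prediction = rbf_kernel(X_tilde, X) @ alpha

print("mean absolute inpainting error:", np.abs(prediction - image[~observed]).mean())
```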
Another example using kernel logistic regression for image segmentation is given in Appendix G.

Figure 2: (Blue) Layers of the form (1) on the interval $[-1, 1]$ when $Y = 0$; (orange) the identity. (a) $\tanh(0.9z + 2)$ has derivative at most $0.9$, so it admits a unique fixed point. (b) $\tanh(z)$ has derivative $1$ at the origin. It satisfies the conditions of Proposition 2 on $(-1, 0)$ and $(0, 1)$ and so admits a unique fixed point. (c) $\tanh(-3z)$ is not a contraction but admits a unique fixed point; contractions are not necessary. (d, e, f) $\tanh(3z)$, $\mathrm{ReLU}(z + 0.2)$ and $\mathrm{ReLU}(z)$ are not contractions and admit $3$, $0$ and uncountably many fixed points respectively.

2 DEEP EQUILIBRIUM MODELS

Given some $Y \in \mathbb{R}^{n \times d_y}$ containing row elements $y \in \mathcal{Y}$, deep equilibrium models (DEQs) (Bai et al., 2019) compute an embedding $z^\star \in \mathcal{Z} \subseteq \mathbb{R}^{n \times d_z}$ as a fixed point of some function and then output an affine transformation $f^\star(Y)$ of the fixed point. More concretely, letting $g_\theta(\,\cdot\,; Y) : \mathbb{R}^{n \times d_z} \to \mathbb{R}^{n \times d_z}$ denote a parameterised function (e.g. a neural network) with parameters $\theta \in \Theta$,
$$f^\star(Y) = V z^\star, \qquad z^\star = g_\theta(z^\star; Y), \qquad z^\star \in \mathcal{Z} \subseteq \mathbb{R}^{n \times d_z}, \qquad (1)$$
where $V \in \mathbb{R}^{d_u \times (n d_z)}$ are referred to as readout parameters, and $V$ may be thought of as a $d_u \times n d_z$ matrix acting on a flattened vector with $n d_z$ entries. We will find it more natural to introduce a linear skip connection between $Y$ and the output. We will also choose $d_y = d_z = 1$. That is, we consider
$$f^\star(Y) = V_1 z^\star + V_2 Y, \qquad z^\star = g_\theta(z^\star; Y), \qquad z^\star \in \mathcal{Z} \subseteq \mathbb{R}^{n}. \qquad (2)$$
Typically DEQs are trained in a supervised manner by running some iterative algorithm such as gradient descent on a loss that aims to match a neural network involving $f^\star(Y)$ to some target $\widetilde{Y}$, with respect to the DEQ and neural network parameters. Given a dataset $\{(Y_s, \widetilde{Y}_s)\}_{s=1}^N$, this procedure may be realised by running an iterative algorithm over the equality constrained optimisation problem
$$\min_{\gamma, \theta, V_1, V_2} \sum_{s=1}^N L\big(\widetilde{Y}_s, F_\gamma(f^\star(Y_s))\big) \quad \text{subject to} \quad z^\star_s = g_\theta(z^\star_s; Y_s), \quad f^\star(Y_s) = V_1 z^\star_s + V_2 Y_s, \quad s = 1, \ldots, N, \qquad (3)$$
as if the algorithm were to solve the problem, even though it may be highly non-convex. Here $L$ is some loss function and $F_\gamma$ is a neural network with parameters $\gamma$. A large class of iterative algorithms requires the derivative with respect to the network parameters $\theta$ and $V_1, V_2$; these may be computed under mild conditions via implicit differentiation, without backpropagating through the fixed point solution. We refer the reader to Bai et al. (2019) for details.

Dealing with existence and uniqueness of fixed points. In order for (1) and (2) to specify useful computational rules, they have to admit at least one fixed point. To avoid having to choose which fixed point should be returned with additional rules, it might be desirable that they admit exactly one fixed point. We give example configurations of fixed points in Figure 2. The classical Banach fixed point theorem (BFPT) gives sufficient conditions for the existence and uniqueness of fixed points.

Theorem 1 (BFPT). Let $(\mathcal{Z}, d)$ be a complete metric space. A function $H : \mathcal{Z} \to \mathcal{Z}$ is said to be a contraction if there exists some $0 \le q < 1$ such that for all $z, z' \in \mathcal{Z}$, $d(H(z), H(z')) \le q\, d(z, z')$. Every contraction admits a unique fixed point.

The contraction property is suggestive of a simple algorithm (Hasselblatt & Katok, 2003) for finding the fixed point $z^\star$ of a contraction $H$ given an initial guess $z_0$: while some termination condition is not met, update the current estimate $z_r$ of the fixed point to be $H(z_{r-1})$.
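A minimal sketch of this iteration is given below (the map, sizes and tolerance are illustrative; practical DEQ implementations typically use more sophisticated root-finding solvers).

```python
import numpy as np

def fixed_point_iteration(H, z0, tol=1e-8, max_iter=1000):
    """Repeatedly apply z <- H(z) until successive iterates stop moving."""
    z = z0
    for _ in range(max_iter):
        z_next = H(z)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z  # may not have converged if H is not a contraction

# Example: a contractive layer H(z) = tanh(W z + b) with spectral norm of W below 1.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
W = 0.5 * A / np.linalg.norm(A, 2)              # ||W|| = 0.5, so H is a contraction
b = rng.standard_normal(4)

z_star = fixed_point_iteration(lambda z: np.tanh(W @ z + b), np.zeros(4))
print(np.allclose(z_star, np.tanh(W @ z_star + b)))  # True at convergence
```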
This algorithm allows one to interpret the inner problem of (3) as an infinitely deep neural network with weights shared between each layer. More advanced algorithms are available, which we do not discuss here. H is a contraction if the induced matrix norm of the Jacobian of H is strictly less than 1 on Z, providing a useful test for checking whether a unique fixed point of H exists. More generally, Proposition 2. Let Z be an open strictly convex set, Z its closure, H : Z Z differentiable on Z and continuous on Z. If DH λ < 1 on Z, then H is a contraction on Z. Published as a conference paper at ICLR 2022 Figure 3: Grouped in sets of four. (Left to right) Uncorrupted target image e Ys, noisy image Ys, image output by randomly initialised DEQ, image output by k GLM initialised DEQ. k GLM but not random initialisation preserves some qualitative properties of the input. We provide a proof in Appendix B. While elegant, contractions are stronger than necessary and using them limits the set of admissible networks. It is therefore desirable to work with tighter conditions, or otherwise sidestep the issue. A consensus on how to deal with existence and uniqueness does not yet appear to have been reached. One may run the algorithm as if a unique fixed point exists (Bai et al., 2019), constrain or monitor the parameters during learning to ensure that a unique fixed point exists, add a penalisation term on an estimate of the spectral norm of the Jacobian (Roosta-Khorasani & Ascher, 2015) to encourage a contractive map (Bai et al., 2021), or design variants of DEQs (through appropriate restriction) that explicitly ensure that a unique fixed point exists (Winston & Kolter, 2020; Revay et al., 2020; El Ghaoui et al., 2021). We do not address this problem; we assume (in a precise sense, see Assumption 2) verifiable conditions for the BFPT. Initialisation. For the vast majority of practical neural network architectures, learning involves applying an optimisation procedure (e.g. stochastic gradient descent or quasi-Newton s method) to an empirical risk minimisation problem in the overparameterised network s parameters outside of conditions that guarantee convergence to a minimum. Debate continues as to why such procedures lead to good generalisation performance (Belkin et al., 2019; Zhang et al., 2021). However, there are intuitive, empirical and theoretical reasons to suggest that appropriate parameter initialisation plays an important role (Glorot & Bengio, 2010; He et al., 2015; Poole et al., 2016; Hu et al., 2020). For DEQs, initialisations have not yet been studied in great detail. By interpreting DEQs as unrolled k GLMs, we find parameter initialisations as a corollary of our main theoretical result. See Figure 3. 3 EXACT FIXED PARAMETER EQUIVALENCE BETWEEN DDNS AND DEQS We present our main result here and give a special case of a tractable Ising model in Appendix C.2. Our main result is a derivation of a fully connected or convolutional DEQ architecture as a model that is equivalent to a DDN that solves a k GLM as its inner problem. 3.1 SETTING: A KGLM INSIDE A DDN Inner problem. We work with the regular (Wainwright & Jordan, 2008) exponential family in canonical form (Deisenroth et al., 2020, Section 6.6), with log partition function A : R R and sufficient statistic T : Y R. In our setting, A is both infinitely differentiable due to regularity, and strictly convex due to minimality (see Appendix A). 
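For concreteness, the sketch below lists the log partition function $A$ and its derivative $\sigma = A'$ for a few standard canonical-form families; as anticipated by our main result, $A'$ will play the role of the DEQ activation function. The particular families shown are textbook examples chosen purely for illustration.

```python
import numpy as np

# Log-partition functions A(eta) for a few canonical-form exponential families and
# their derivatives sigma = A'. In Theorem 3, sigma plays the role of the DEQ activation.
families = {
    # Gaussian with known unit variance: A(eta) = eta^2/2, so sigma is the identity.
    "gaussian": (lambda eta: 0.5 * eta**2, lambda eta: eta),
    # Bernoulli, T(y) = y in {0,1}: A(eta) = log(1 + e^eta), so sigma is the logistic sigmoid.
    "bernoulli": (lambda eta: np.logaddexp(0.0, eta), lambda eta: 1.0 / (1.0 + np.exp(-eta))),
    # +/-1 spins (non-interacting Ising likelihood): A(eta) = log(2 cosh eta), so sigma = tanh.
    "spin": (lambda eta: np.log(2.0 * np.cosh(eta)), lambda eta: np.tanh(eta)),
}

eta, step = np.linspace(-3.0, 3.0, 7), 1e-5
for name, (A, sigma) in families.items():
    finite_difference = (A(eta + step) - A(eta - step)) / (2.0 * step)
    print(name, np.allclose(sigma(eta), finite_difference, atol=1e-6))  # all True
```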
We choose a prior $p_\lambda$ over the predictor $f$, which generalises the commonly used Gaussian process prior¹ (Schölkopf et al., 2001; Canu & Smola, 2006). For log-concave and differentiable $q$, define
$$p_\lambda(f \mid X) = q\big(f(X)\big)\, \varphi_\lambda(f), \qquad \varphi_\lambda(f) \propto \exp\big(-\lambda \|f\|_{\mathcal{H}_k}^2 / 2\big). \qquad (4)$$
A MAP estimate $f^\star(\cdot\,; X, Y) = k(\cdot, X)\alpha^\star$ with representer coefficients $\alpha^\star$ satisfies
$$\alpha^\star = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^n} \; -\alpha^\top K T(Y) + \mathbf{1}^\top A(K\alpha) - \log q(K\alpha) + \lambda \alpha^\top K \alpha / 2, \qquad (5)$$
where we have used strict convexity of $A$ to ensure at most one minimum exists. For convenience, we define $\rho := (-\log q)' \circ (A')^{-1}$. When $q$ is constant, $p_\lambda$ is a Gaussian process prior and $\rho \equiv 0$.

¹This prior requires delicate treatment, which may be avoided if $\mathcal{X}$ is finite. See (11), Appendix A.

Deep declarative network. We will use (5) as a declarative node (Gould et al., 2021) by wrapping it with an outer optimisation task in order to learn the parameters of a neural network by empirical risk minimisation. Let $L$ denote a loss function (for example, mean squared error), and let $F_\gamma$ denote a generic neural network with parameters $\gamma$. Form a set of $N$ quartets $\{(X_s, Y_s, \widetilde{X}_s, \widetilde{Y}_s)\}_{s=1}^N$, for example a set of $N$ images. Define $K_s = k(X_s, X_s)$. We consider the optimisation problem
$$\min_\gamma \sum_{s=1}^N L\Big(\widetilde{Y}_s, F_\gamma\big(f^\star_s(\widetilde{X}_s; X_s, Y_s)\big)\Big) \quad \text{subject to} \quad \alpha^\star_s = \operatorname*{arg\,min}_{\alpha \in \mathbb{R}^n} -\alpha^\top K_s T(Y_s) + \mathbf{1}^\top A(K_s\alpha) - \log q(K_s\alpha) + \lambda \alpha^\top K_s \alpha / 2, \qquad f^\star_s(\cdot) = k(\cdot, X_s)\alpha^\star_s, \qquad s = 1, \ldots, N. \qquad (6)$$
The outer problem is in general non-convex and highly nonlinear in $\gamma$.

3.2 MAIN RESULT: DERIVING A FIXED-PARAMETER DEQ

Under our construction, the equivalence between the embeddings $z^\star$ of kGLMs and DEQs always holds. Under the additional Assumption 1, the DEQ and the kGLM are equivalent predictive models.

Assumption 1. The kernel matrix $K$ is strictly positive definite.

We now state our main result, the proof of which is given in Appendix D.

Theorem 3. Let $\alpha^\star$ be the minimiser of one of the inner optimisation problems (5), inducing a function $f^\star(\cdot\,; X, Y) = k(\cdot, X)\alpha^\star$. Define $z^\star := A'(K\alpha^\star)$. Then $z^\star$ is a fixed point satisfying
$$z^\star = \sigma\big(W_1 z^\star + W_2 T(Y) + W_3 \rho(z^\star)\big), \qquad (7)$$
where $W_1 = W_3 = -\lambda^{-1} K$ and $W_2 = \lambda^{-1} K$ are parameter matrices, and $\sigma \equiv A'$ and $\rho \equiv (-\log q)' \circ (A')^{-1}$ are monotone non-decreasing. Furthermore, under Assumption 1, for any test index $\widetilde{X} \in \mathbb{R}^{\widetilde{n} \times d_x}$ and for at least one $z^\star$ satisfying (7),
$$f^\star(\widetilde{X}; X, Y) = V_1\big(z^\star + \rho(z^\star)\big) + V_2 T(Y), \qquad (8)$$
where $V_1 = -\lambda^{-1} k(\widetilde{X}, X)$ and $V_2 = \lambda^{-1} k(\widetilde{X}, X)$.

Further conditions (Assumption 2) force the fixed point iteration to have a unique solution.

Assumption 2. Let $\mathcal{Z}$ be an open strictly convex set and $\bar{\mathcal{Z}}$ its closure. Suppose
$$\big\| K/\lambda \big\|\, \sup_{z \in \mathcal{Z}} |A''(z)|\, \big\| I + \operatorname{diag} \rho'(z) \big\| < 1.$$
Note that when $\rho$ is zero, that is when $p_\lambda$ is Gaussian, Assumption 2 reduces to $\| K/\lambda \| \sup_{z \in \mathcal{Z}} |A''(z)| < 1$. The second derivative of the log partition function is equal to 1 for the Gaussian distribution with known variance and is bounded by $r/4$ for the Binomial distribution with $r$ trials, further simplifying Assumption 2. Recall that the spectral radius (the largest absolute eigenvalue) of $K/\lambda$ being less than some constant is sufficient to ensure the existence of an operator norm $\|\cdot\|$ such that $\|K/\lambda\|$ is less than the same constant. However, this guarantee does not identify the operator norm. In our experiments, we instead fix the operator norm as the spectral norm and check Assumption 2 against this norm.

Proposition 4. Under Assumption 2, (7) admits a unique fixed point on $\bar{\mathcal{Z}}$.

The outer problem depends on the solution $\alpha^\star_s$ to the inner problem only through $f^\star(\widetilde{X}_s; X_s, Y_s)$, which under Assumption 1 in turn only depends on $z^\star_s$. We therefore have the following. Corollary 5.
Under Assumptions 1 and 2, the optimisation problem (6) is equivalent to the constrained optimisation problem s=1 L e Ys, Fγ f ( e Xs; Xs, Ys) subject to z s = σ W1sz s + W2s T(Ys) + W3sρ(z s) , s = 1, . . . , N, f ( e Xs; Xs, Ys) = V1s z s + ρ(z s) + V2s T(Ys), s = 1, . . . , N, Published as a conference paper at ICLR 2022 where V1s = λ 1k( e Xs, Xs), V2s = λ 1k( e Xs, Xs), W1s = W3s = λ 1Ks and W2s = λ 1Ks take the role of DEQ parameters and σ = A and ρ are monotone non-decreasing and take the role of DEQ activation functions. The fixed point condition for z s is met by exactly one element of Z. The model in Corollary 5 allows the parameters W1s, W2s, V1s and V2s to vary with s whereas (3) does not. An additional assumption ensures these parameters are constant in s. Corollary 6. In the same setting as Corollary 5, if Xs = X and e Xs = e X are constant in s, the parameters W1, W2, W3, V1, V2 are constant in s and we may write f (Ys) f s ( e Xs; Xs, Ys). Note that in (3), the outer problem is with respect to γ, θ, V1, V2 whereas in (9) the outer problem is only with respect to γ. The parameters V1, V2 and W1, W2 (which take the role of θ), are automatically determined according to X, e X and the k GLM kernel k. Corollary 5 is suggestive of a DEQ model that we are yet to implement whose weights are a dynamic kernel function of the coordinate Xs. One parameterisation of such a model is to have an auxillary network that predicts a finite-dimensional feature mapping of the coordinates Xs. Finally, we note one more special case when Xs = e Xs. In this case, since Ks = k( e Xs, Xs), predictions may be formed as f ( e Xs; Xs, Ys) = Ksα s = σ 1(z s) = W1sz s + W2s T(Ys) + W3sρ(z s), without the need to invert Ks and therefore without the need for Assumption 1. 3.3 REMARKS ON MATCHING DDNS AND DEQS Special cases. Choosing p K/λ to be Gaussian, ρ 0. Additionally, if T is the identity and σ A tanh, we obtain a fully connected layer with tanh activations, as derived in Appendix C.2. Logistic sigmoidal activations are obtained when the k GLM is specialised as kernel logistic regression. When A and T are the identity, the k GLM becomes kernel ridge regression, with Gaussian process regression as the corresponding Bayesian model. Certain infinitely wide (nonlinear) neural networks are Gaussian processes (Neal, 1995). Our result shows that one may also construct Gaussian processes whose posterior predictive mean is an infinitely deep neural network with linear activations. The role of the coordinates X and e X differs between the two models. Symbol matching. Theorem 3 says that an embedding (2) computed by a DEQ given Y is the same as an embedding (7) computed by a k GLM on coordinates X for a training set (X, Y ) if the hidden parameters W1, W2 and W3 are scalar multiples of the kernel matrix k(X, X). Initialisation via optimisation warm-start. In a stronger setting, Corollary 6 shows a connection between the two optimisation tasks (3) and (9). Given a fixed DEQ architecture with a fixed point condition that may be expressed as (7) (and noting that this equation also includes convolutional architectures), we may initialise parameter values that ensure that the prediction is equivalent to a trained k GLM and identify the hyperparameters of the k GLM. This initialisation may be naive (i.e. assume a reasonable kernel function, coordinates and regularisation parameter for a wide range of tasks) or more informed (i.e. 
leverage some information about the task to suit the kernel and regularisation to the task being performed). Note that even without Assumptions 1 and 2, we may initialise the DEQ such that the feature embeddings z are equivalent for at least one of the fixed points. We demonstrate both naive and informed approaches in 4. On augmentation and assumptions. Assumption 2 allows one to compute without concern using (7) since a unique fixed point is guaranteed to exist. DEQs are sometimes implemented without checking that a unique fixed point exists. This assumption may be replaced by any other assumption that is sufficient to ensure a unique fixed point. Using the derivative test on the open set Z instead of Z allows us to handle important edge cases, namely when A tanh (with derivative 1 at the origin) or A is Leaky Re LU with gradients in {0, 1} on both sides of the origin (with undefined derivative at the origin). Assumption 1 allows one to map the k GLM embedding to the prediction. The skip connection present in (2) but not (1) is not a severe limitation from a practical perspective. The augmented model considers a coordinate X absent from the original DEQ. Note that any DEQ may be written as an augmented model by simply ignoring the coordinate X. Under our construction but without Assumptions 1 or 2, the equivalence between the embeddings produced by k GLMs and DEQs is still valid for at least one of their respective fixed points. Published as a conference paper at ICLR 2022 Figure 4: A particular choice of kernel induces a sparse matrix K = k(X, X) with repeated entries. Pre-multiplication by this sparse matrix (left) is equivalent to convolution with reshaping (right). (Left) The kernel matrix applied to a flattened image. (Right) Each row of the kernel matrix is identical (after zero-padding the image), so that each row s action may be thought of as applying one filter centered at a corresponding pixel. See Corollary 11, Appendix E for result on RGB images. Convolutional layers. The derived network architecture involves only matrix multiplication, however as is well-known, we may represent a convolution through a sparse matrix equation (Figure 4). Conveniently, the local structure expected in images may be encoded by a particular informed kernel choice in order to recover a convolutional architecture. We formally discuss how this may be done in 2D RGB images in Corollary 11, Appendix E through a particular choice of kernel k. Other priors. One may replace the strongly log-concave prior p K/λ with, for example, the conjugate prior of the exponential family (which is only log-concave). Here we focus on the strongly log-concave prior to recover a form more similar to DEQs that are implemented in practice. Spectral normalisation. Supposing is the spectral norm, Assumption 2 is suggestive of a particular weight normalisation scheme: if λ is less than the spectral norm of K, then K/λ < 1. The technique of using layers with unit spectral norm is explored by Miyato et al. (2018) in the context of improving training stability of GANs. Our results indicate that such spectral normalisation also leads to improved stability in DEQs at initialisation, in the sense that it guarantees a unique fixed point by Proposition 4. Abusing terminology, we refer to the spectral norm of the linear transformation corresponding with a convolutional layer as the spectral norm of the convolutional layer. 
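Before moving to experiments, the following self-contained sketch illustrates Theorem 3 numerically in the Gaussian-prior special case ($q$ constant, so $\rho \equiv 0$) with a Bernoulli likelihood, for which $\sigma = A'$ is the logistic sigmoid. It solves the inner problem (5) by simple fixed point iteration on $\alpha$ (a contraction here because $\lambda$ is taken large relative to $\|K\|$), then checks the fixed point condition (7) and the prediction formula (8) with $W_1 = -K/\lambda$, $W_2 = K/\lambda$, $V_1 = -k(\widetilde{X}, X)/\lambda$ and $V_2 = k(\widetilde{X}, X)/\lambda$. The sizes, kernel and $\lambda$ below are arbitrary illustrative choices; the paper's own empirical demonstration of Theorem 3 is in Appendix G.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(A, B, ell=1.0):
    return np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / ell**2)

sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))        # sigma = A' for the Bernoulli family
lam = 10.0                                          # chosen so that 0.25 * ||K|| / lam < 1

# Inner data (X, Y) and outer coordinates X~ on a one-dimensional coordinate space.
X = np.linspace(-2.0, 2.0, 30)[:, None]
X_tilde = np.linspace(-2.0, 2.0, 7)[:, None]
Y = (rng.random(30) < sigmoid(np.sin(2.0 * X[:, 0]))).astype(float)   # T(y) = y

K = rbf(X, X)
K_tilde = rbf(X_tilde, X)

# Solve the inner problem (5) with q constant: stationarity gives alpha = (Y - sigmoid(K alpha)) / lam,
# which is a contraction here, so plain fixed point iteration on alpha converges.
alpha = np.zeros(30)
for _ in range(2000):
    alpha = (Y - sigmoid(K @ alpha)) / lam

# Theorem 3 with a Gaussian prior (rho = 0): z* = sigma(W1 z* + W2 Y), W1 = -K/lam, W2 = K/lam.
z = sigmoid(K @ alpha)
print("fixed point residual:", np.abs(z - sigmoid(-K @ z / lam + K @ Y / lam)).max())

# Prediction (8): f*(X~; X, Y) = V1 z* + V2 Y with V1 = -k(X~, X)/lam, V2 = k(X~, X)/lam.
print("prediction residual:", np.abs(K_tilde @ alpha - (-K_tilde @ z / lam + K_tilde @ Y / lam)).max())
```

Both residuals are numerically zero at convergence, matching the closed-form DEQ parameters of Theorem 3.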
An easy to implement method for calculating the spectral norm of convolutional layers is available (Sedghi et al., 2018), which we use in our implementation and experiments.

4 EXPERIMENTS

DEQ parameters may be partitioned into two sets: the parameters of the layer that defines the fixed point iteration ($\theta = (W_1, W_2, W_3, V_1, V_2)$), and the remainder of the parameters. Leveraging our theoretical result, we give DEQs a warm start by initialising $\theta$ such that the DEQ solves a kGLM. We empirically demonstrate that such initialisation is superior to random initialisation. We stress that our comparison is between randomly initialised DEQs and kGLM initialised DEQs only. We are not interested in demonstrating that DEQs can achieve results competitive with state-of-the-art models, as this has already been demonstrated in previous studies (Bai et al., 2019; 2021). Our limited scope allows us to perform a more definitive empirical study with meaningful statistics. We give a detailed explanation of two of the tasks in the main text, and refer the reader to Appendix F for more experiments and Appendix G for an empirical demonstration of Theorem 3. For random initialisation, we initialise weights from a symmetric uniform distribution (the PyTorch default).

Smooth sequence-to-sequence task. Let $\mathcal{X} = [-2\pi^2, 2\pi^2]$ and define a function $h(x) = e^{-x^2/10}\sin(x) + e^{-(x+9)^2}$. Let $X_s = X$ and $\widetilde{X}_s = \widetilde{X}$ be 100 points on a uniform grid in $[-2\pi, 2\pi]$ and $[-2\pi^2, 2\pi^2]$ respectively. Define sequences $Y_s$ and $\widetilde{Y}_s$ to be joint evaluations of the $s$th realisation of a Gaussian process with covariance function $k_{\mathrm{true}}(x, x') = 0.1\, \exp\big(-(x - x')^2/2\big)\exp\big(-\sin^2(x - x')\big)$, having means $h(X)$ and $h(\widetilde{X})$ respectively.

[Figure 5 panels: (a) test error against training epoch (min/mean/max over seeds) for informed, naive and random initialisation; (b) sample sequences.]

Figure 5: (a) Test error on a smooth sequence-to-sequence task for 100 random seeds for each initialisation scheme. kGLM initialisation offers a better starting point and faster training than random initialisation. (b) Sample model outputs on the smooth sequence-to-sequence task. Black curves show the mean $h$ and colours represent different samples. (Top left) Ground-truth test data $\widetilde{Y}_s$. (Top right) Random and (bottom left) naive kGLM initialisation after 20 epochs. (Bottom right) Informed kGLM initialisation without any training (i.e. epoch 0). kGLM initialisation, particularly informed, preserves some of the qualitative properties of the target sequence.

For informed kGLM initialisation, we pretend that we know the kernel and coordinate space. We choose $k \equiv k_{\mathrm{true}}$ and form the required kernel matrices by taking $X_s = X$ and $\widetilde{X}_s = \widetilde{X}$. For naive kGLM initialisation, pretending that we do not know the true underlying kernel or coordinate space, we default to $k_{\mathrm{naive}} = \exp\big(-(x - x')^2/2\big)$ and choose $X_s = X$ to be a uniform grid over the interval $[-2\pi, 2\pi]$. We initialise the hidden parameters $W_1$ and $W_2$ according to Theorem 3 and initialise $V_1$ and $V_2$ according to the same rule as random initialisation. When training using naive kGLM initialisation, we update only the readout parameters for the first 10 epochs. We generate a training and test set of size $N = 20{,}000$ and $2{,}000$. Each sequence is evaluated on an $n = 100$ dimensional uniform grid. We choose $g_\theta(z) = \tanh(W_1 z + W_2 Y + b)$ and train for 400 epochs using Adam with default hyperparameters. We repeat this for 100 trials using seeds 0 to 99; see Figure 5(a) and (b).
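A sketch of how one training pair $(Y_s, \widetilde{Y}_s)$ for this task may be generated is given below. It follows the description above as we have reconstructed it (in particular the form of $k_{\mathrm{true}}$ and the two coordinate grids), and is illustrative rather than the exact data-generation code used in our experiments.

```python
import numpy as np

def h(x):
    # Mean function h(x) = exp(-x^2/10) sin(x) + exp(-(x + 9)^2), as described above.
    return np.exp(-x**2 / 10.0) * np.sin(x) + np.exp(-(x + 9.0)**2)

def k_true(x, xp):
    # k_true(x, x') = 0.1 exp(-(x - x')^2 / 2) exp(-sin^2(x - x')), as described above.
    d = x[:, None] - xp[None, :]
    return 0.1 * np.exp(-d**2 / 2.0) * np.exp(-np.sin(d)**2)

rng = np.random.default_rng(0)
X = np.linspace(-2.0 * np.pi, 2.0 * np.pi, 100)              # inner coordinate grid
X_tilde = np.linspace(-2.0 * np.pi**2, 2.0 * np.pi**2, 100)  # outer coordinate grid

# Jointly sample (Y_s, Y~_s) from a single realisation of the Gaussian process.
coords = np.concatenate([X, X_tilde])
cov = k_true(coords, coords) + 1e-6 * np.eye(200)            # jitter for a stable Cholesky
mean = np.concatenate([h(X), h(X_tilde)])
sample = mean + np.linalg.cholesky(cov) @ rng.standard_normal(200)
Y_s, Y_tilde_s = sample[:100], sample[100:]
```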
Image denoising. We consider a convolutional architecture as described in Appendix E, with $\sigma \equiv \mathrm{ReLU}$, $T$ the identity, and $\rho \equiv 0$, applied to an image denoising task using the CIFAR10 dataset. Technically the ReLU is not admissible, but we are interested in examining its empirical properties. The coordinate space $\mathcal{X}$ is the space of all 2D pixel and channel triples $(i, j, c) \in \{1, \ldots, 32\}^2 \times \{1, 2, 3\}$. Elements $Y_s$ are CIFAR10 images corrupted by i.i.d. additive $N(0, 0.2)$ noise, clipped to take values between 0 and 1. Elements $\widetilde{Y}_s$ are the corresponding uncorrupted CIFAR10 images. For random initialisation, we sample the filter weights i.i.d. $N(0, \mathrm{Var})$. For kGLM initialisation, we let $\lambda = \|K\|/C$, where $\|\cdot\|$ denotes the spectral norm and $C > 0$ (see Assumption 2). Spectral norms are calculated using the fast and exact method described by Sedghi et al. (2018). The kernel is constructed randomly from the squared exponential kernel and described in Appendix F. We are interested in separating the effects of scaling and initialisation scheme. For random initialisation, the scale is chosen through the variance, inducing a convolutional layer with random spectral norm. For kGLM initialisation, the scale $C$ is equal to the spectral norm. We try all variances in $\mathrm{logspace}(10^{-3}, 1, 25)$ and all $C$ in $\mathrm{logspace}(10^{-2}, 10, 25)$, resulting in a grid of spectral norms covering roughly the same space for both schemes. We try 100 random seeds for each configuration, resulting in $2 \times 100 \times 25 = 5000$ models. In Figure 6, we plot individually, after training for $0, \ldots, 5$ epochs, the test MSE against the spectral norm for both initialisation schemes and all random seeds. kGLM initialisation offers improved test error, training stability and reduced variance in test error. The curves for kGLM initialised models adhere to the shape predicted by classical generalisation theory, where the spectral norm measures model complexity. For kGLM initialisation, spectral norms of $10^0$ appear to represent a critical point; smaller values lead to stable training and larger values lead to unstable training. This contrasts with randomly initialised models, most of which have increasing test error with increasing spectral norm in their later epochs. This means it is easy to select an appropriate scale for kGLM initialisation, but not for random initialisation.

[Figure 6: six panels, one per epoch 0 to 5; horizontal axis is the average spectral norm, $10^{-2}$ to $10^{1}$.]

Figure 6: Test MSE against the average spectral norm of each layer for the image denoising task using kGLM initialisation (blue, ours) and random initialisation (red). The vertical axes change between plots. Markers at the edge of the top border indicate that a value greater than the axis limit or NaN was observed. Our initialisation scheme shows superior performance at all epochs.

In Figure 3, we plot sample targets, inputs and outputs for random and kGLM initialised DEQs without any training, showing that samples using kGLM initialisation are visually more similar to the expected output than random initialisation. Figure 3 and epoch 0 of Figure 6 provide a sanity check of Theorem 3. More plots are given in Appendix F.

5 DISCUSSION AND CONCLUSION

Related work. A review on connections between other optimisation-based iterative procedures and neural networks generally is given by Monga et al. (2021).
Repeated applications of weight-tied layers can sometimes be seen as solving model-based problems, if the weights are chosen appropriately. Since the derived network implicitly minimises a conceptually tractable and principled model, it possesses a certain degree of interpretability. However, only the minimisation problems and not the forward pass computations of the network are interpretable, and as such the network does not satisfy the requirements of interpretability in many settings (Rudin, 2019). Monga et al. (2021) also discuss how disabling weight-tying and allowing parameters to be fine-tuned, thereby escaping the optimisation-based iterative procedure, can sometimes lead to improved performance, perhaps at the cost of interpretability, generalisation performance and theoretical guarantees. Notable models analysed under this framework include ISTA (Beck & Teboulle, 2009) for solving LASSO, which when unrolled and fine-tuned yields LISTA (Gregor & Le Cun, 2010), and ADMM for compressive sensing (Boyd et al., 2011) with an unrolled version called ADMM-CSNet (Yang et al., 2018). Our work considers a different class of iterative models to those surveyed by Monga et al. (2021), and recovers exactly the DEQ model. In these works, the architectures, tasks, model classes and activations are specific to a particular setting. More recently, Ramsauer et al. (2020) introduce a class of networks whose predictions minimise a continuous analogue of a Hopfield energy. This can be used to justify transformer architectures. Conclusion. DDNs solve optimisation problems in their forward pass. We considered the problem of regularised maximum likelihood in an RKHS. Using this DDN layer, we derived a DEQ with a convolutional or fully connected implicit layer and a fixed closed-form expression for the weights. Such a result gives intuitive justification for computing with DEQ models: DEQs can solve classical statistical problems in their forward pass. Using this theoretical connection also allowed us to initialise DEQs as DDNs. Such initialisation scheme offered performance benefits over random initialisation both at initialisation and during training. Future extensions of our work include considering other statistical estimators, learning the kernel, and obtaining Bayesian rather than point estimates of the canonical parameter. Published as a conference paper at ICLR 2022 ETHICS STATEMENT Our work is mostly theoretical in nature and is unlikely to directly cause harm, discrimination or privacy violations. However, as our work may be used in future in applications, we encourage users of our work to adhere to ICLR s general ethical principles. We welcome open enquiry and scrutiny of the theoretical results presented in this paper. REPRODUCIBILITY STATEMENT We have included code and the instructions required to reproduce our results. The code is publically available at https://github.com/Russell Tsuchida/deq-glm. ACKNOWLEDGEMENTS Russell, Suk Yee, Lars and Cheng Soon gratefully acknowledge the support of CSIRO s Machine Learning and Artificial Intelligence Future Science Platform. Published as a conference paper at ICLR 2022 Luis B Almeida. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Artificial neural networks: concept learning, pp. 102 111, 1990. Brandon Amos and J Zico Kolter. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pp. 136 145. PMLR, 2017. 
Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in Neural Information Processing Systems, 32:690 701, 2019. Shaojie Bai, Vladlen Koltun, and Zico Kolter. Stabilizing equilibrium models by jacobian regularization. In International Conference on Machine Learning, pp. 554 565. PMLR, 2021. Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM journal on imaging sciences, 2(1):183 202, 2009. Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machinelearning practice and the classical bias variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849 15854, 2019. Stephen Boyd, Neal Parikh, and Eric Chu. Distributed optimization and statistical learning via the alternating direction method of multipliers. Now Publishers Inc, 2011. St ephane Canu and Alex Smola. Kernel methods and the exponential family. Neurocomputing, 69 (7-9):714 720, 2006. Gavin C Cawley, Gareth J Janacek, and Nicola LC Talbot. Generalised kernel machines. In 2007 International Joint Conference on Neural Networks, pp. 1720 1725. IEEE, 2007. Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6572 6583, 2018. Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for Machine Learning. Cambridge University Press, 2020. Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural odes. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 3140 3150, 2019. Laurent El Ghaoui, Fangda Gu, Bertrand Travacca, Armin Askari, and Alicia Tsai. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930 958, 2021. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249 256. JMLR Workshop and Conference Proceedings, 2010. Stephen Gould, Basura Fernando, Anoop Cherian, Peter Anderson, Rodrigo Santa Cruz, and Edison Guo. On differentiating parameterized argmin and argmax problems with application to bi-level optimization. ar Xiv preprint ar Xiv:1607.05447, 2016. Stephen Gould, Richard Hartley, and Dylan John Campbell. Deep declarative networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. Peter J Green and Brian S Yandell. Semi-parametric generalized linear models. In Generalized linear models, pp. 44 55. Springer, 1985. Karol Gregor and Yann Le Cun. Learning fast approximations of sparse coding. In Proceedings of the 27th international conference on international conference on machine learning, pp. 399 406, 2010. Boris Hasselblatt and Anatole Katok. A first course in dynamics: with a panorama of recent developments. Cambridge University Press, 2003. Published as a conference paper at ICLR 2022 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026 1034, 2015. Nicholas J Higham. Computing a nearest symmetric positive semidefinite matrix. Linear algebra and its applications, 103:103 118, 1988. Wei Hu, Lechao Xiao, and Jeffrey Pennington. Provable benefit of orthogonal initialization in optimizing deep linear networks. 
In International Conference on Learning Representations, 2020. Z Kolter, D Duvenaud, and M Johnson. Deep implicit layers - neural odes, deep equilibirum models, and beyond. Neural information processing systems tutorial, 2020. URL http: //implicit-layers-tutorial.org/. Jiarou Lu, Huafeng Liu, Yazhou Yao, Shuyin Tao, Zhenmin Tang, and Jianfeng Lu. Hsi road: A hyper spectral image dataset for road segmentation. In IEEE International Conference on Multimedia and Expo (ICME), 2020. P. Mc Cullagh and J.A. Nelder. Generalized Linear Models, Second Edition. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1989. ISBN 9780412317606. URL https://books.google.com.au/books?id=h9k FH2_Ff Bk C. Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018. Vishal Monga, Yuelong Li, and Yonina C Eldar. Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing. IEEE Signal Processing Magazine, 38(2):18 44, 2021. Radford M Neal. Bayesian learning for neural networks. Ph D thesis, University of Toronto, 1995. H Chau Nguyen, Riccardo Zecchina, and Johannes Berg. Inverse statistical problems: from the inverse ising problem to data science. Advances in Physics, 66(3):197 261, 2017. Finbarr O sullivan, Brian S Yandell, and William J Raynor Jr. Automatic smoothing of regression functions in generalized linear models. Journal of the American Statistical Association, 81(393): 96 103, 1986. Fernando Pineda. Generalization of back propagation to recurrent and higher order neural networks. In Neural information processing systems, pp. 602 611, 1987. Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. Advances in neural information processing systems, 29:3360 3368, 2016. Hubert Ramsauer, Bernhard Sch afl, Johannes Lehner, Philipp Seidl, Michael Widrich, Lukas Gruber, Markus Holzleitner, Thomas Adler, David Kreil, Michael K Kopp, et al. Hopfield networks is all you need. In International Conference on Learning Representations, 2020. Max Revay, Ruigang Wang, and Ian R Manchester. Lipschitz bounded equilibrium networks. ar Xiv preprint ar Xiv:2010.01732, 2020. Farbod Roosta-Khorasani and Uri Ascher. Improved bounds on sample size for implicit matrix trace estimators. Foundations of Computational Mathematics, 15(5):1187 1212, 2015. Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II IV: Invited Lectures, pp. 1576 1602. World Scientific, 2010. Cynthia Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206 215, 2019. Ernest K Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15 (1):3 43, 2016. Published as a conference paper at ICLR 2022 Craig Saunders, Alexander Gammerman, and Volodya Vovk. Ridge regression learning algorithm in dual variables. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 515 521, 1998. Bernhard Sch olkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In International conference on computational learning theory, pp. 416 426. 
Springer, 2001. Bernhard Sch olkopf, Alexander J Smola, et al. Learning with kernels: support vector machines, regularization. Optimization, and Beyond. MIT press, 1(2), 2002. Hanie Sedghi, Vineet Gupta, and Philip M Long. The singular values of convolutional layers. In International Conference on Learning Representations, 2018. M.J. Wainwright and M.I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends in Machine Learning, 1(1-2):1 305, 2008. Ezra Winston and J Zico Kolter. Monotone operator equilibrium networks. Advances in Neural Information Processing Systems, 33:10718 10728, 2020. Yan Yang, Jian Sun, Huibin Li, and Zongben Xu. Admm-csnet: A deep learning approach for image compressive sensing. IEEE transactions on pattern analysis and machine intelligence, 42 (3):521 538, 2018. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107 115, 2021. Ji Zhu and Trevor Hastie. Kernel logistic regression and the import vector machine. Journal of Computational and Graphical Statistics, 14(1):185 205, 2005. Published as a conference paper at ICLR 2022 A KERNELISED GENERALISED LINEAR MODELS A.1 KERNEL TRICK AND REPRESENTER THEOREM The predictions of some classical algorithms depend on the training input data X Rn dx only through the matrix product XX . A widely exploited technique in machine learning, the kernel trick, replaces X with ψ(X) Rn dψ, where dψ may be large or in fact infinite. Here ψ : Rdx Rdψ and with an abuse of notation, when applied to a matrix it is understood to apply individually to each row so that ψ(X) Rn dψ. It is easy to change the resulting predictor since the type of ψ(X)ψ(X) Rn n matches that of XX , and scales with n and not dψ. One example is in kernel ridge regression (Saunders et al., 1998). Following our abuse of notation, we will use k(X, X) Rn n to denote the matrix with ijth entry k(xi, xj) := ψ(xi) ψ(xj). Any function of the form k together with the matrix X characterises the matrix ψ(X)ψ(X) . Such a function is called a kernel function, and is a valid choice if and only if it corresponds with an inner product in feature space or equivalently, if it is positive semi-definite (PSD). The space of functions Hk that the kernel k characterises is called a reproducing kernel Hilbert space (RKHS). Let k be a positive semi-definite (PSD) function inducing a reproducing kernel Hilbert space (RKHS) Hk. We will use K = k(X, X) Rn n to denote the matrix with ijth entry k(xi, xj). We will make use of the representer theorem, which for appropriately regularised empirical risk minimisation problems, allows one to solve optimisation problems in nonparametric function space. Theorem 7 (Representer Theorem (Sch olkopf et al., 2002)). Denote by R : [0, ) R a strictly monotonic increasing function, by X a set and by L : (X R2)n R S{ } an arbitrary loss function. Let Hk be an RKHS with kernel k. Then f arg min f Hk L xi, yi, f(xi) n i=1 + R ( f Hk) = f (x) = i=1 αik(xi, x). A.2 KERNELISED GENERALISED LINEAR MODELS We work with a special case of the exponential family in canonical form (Deisenroth et al., 2020, Section 6.6), as shown in (10). 
As has also been noted by others (O sullivan et al., 1986; Canu & Smola, 2006; Cawley et al., 2007), the representer theorem allows one to extend appropriately regularised GLMs (Mc Cullagh & Nelder, 1989) to k GLMs by replacing the linear predictor with an arbitrary element of an RKHS. Here we demonstrate this fact. Furnish Y with a σ-algebra B, forming a measurable space (Y, B). Let T be a measurable function T : Y R, called the sufficient statistic. Let ν be some reference measure and h : Y R 0 be any function such that R Y d H(y) < , where d H(y) = h(y)dν(y). We call H the base measure and h the base density. H is absolutely continuous with respect to ν, which may be the Lebesgue measure or the counting measure. Define F to be the set of all η R such that R Y exp(T(y)η)d H(y) < . Define the log-partition function A : F R by A(η) = log R Y exp(ηT(y))d H(y). We may define a probability measure P(dy | η) = p(y | η)ν(dy), where p y | η = h(y) exp ηT(y) A η (10) is with an abuse of terminology called a probability density function for both discrete and continuous cases. Finally, we form a joint probability density function as a product of marginals via p(Y | η) = Qn i=1 p(yi | ηi). We will assume that F is an open set, so that following the terminology of Wainwright & Jordan (2008), (10) is called a member of the regular exponential family. Regularity ensures that the logpartition function is infinitely differentiable, and that A η = E[T(y)] (Wainwright & Jordan, 2008, Proposition 3.1). Since T(y) and η are scalars, (10) satisfies a technical definition of a minimal exponential family. Convexity of A is guaranteed for members of any regular exponential family, but minimality ensures that A is strictly convex (Wainwright & Jordan, 2008, Proposition 3.1). Generalised linear models use an exponential family for the likelihood of observing response yi. We work exclusively with univariate responses yi and canonical parameters ηi, but GLMs are constructed more generally. For GLMs with a canonical link function, one chooses ηi = x i β = f(xi) to be a linear function of the predictor variables xi with parameters β Rdx. Published as a conference paper at ICLR 2022 It is possible to extend GLMs to accommodate a non-linear predictor f in an RKHS Hk with a small modification. This modification can be motivated by either a Bayesian or frequentist perspective by placing a Gaussian prior over the function values or regularising respectively, and then solving the resulting functional optimisation problem over an RKHS using the representer theorem. Following the former philosophy, week seek the maximum a posteriori (MAP) estimate f of p f | X, Y with a prior density (Sch olkopf et al., 2001; Canu & Smola, 2006) φλ f = Z 1 exp λ f 2 Hk/2 , (11) where λ is a parameter and Z is a normalisation constant. If f is an infinite dimensional function, such a density is not defined (recall that Gaussian processes are defined in terms of their finite marginal distributions, which are jointly Gaussian). However, if we restrict the underlying coordinate space X to be finite, then f is simply a vector evaluated over all coordinates and such a notation for the prior may be employed without concern. The restriction of X to a finite space is without practical restriction in machine learning, since we only ever condition on and predict at a finite number of data points. 
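As a small illustrative check (not part of the paper's code), the sketch below verifies that for a finite coordinate set the prior term is, up to an additive constant, the negative log-density of a zero-mean Gaussian with covariance $K/\lambda$ evaluated at the function values $f(X) = K\alpha$, since $\|f\|_{\mathcal{H}_k}^2 = \alpha^\top K \alpha$ for $f = k(\cdot, X)\alpha$.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

X = rng.standard_normal((6, 2))                                    # a finite coordinate set
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # K = k(X, X), here an RBF kernel
lam = 2.0
prior = multivariate_normal(mean=np.zeros(6), cov=K / lam, allow_singular=True)

def gap(alpha):
    # Difference between -log N(f(X); 0, K/lam) and the RKHS penalty lam/2 * alpha^T K alpha,
    # where f = k(., X) alpha so that f(X) = K alpha and ||f||^2_{H_k} = alpha^T K alpha.
    return -prior.logpdf(K @ alpha) - 0.5 * lam * alpha @ K @ alpha

# The gap is the same for any alpha: the two expressions differ only by a constant.
print(np.isclose(gap(rng.standard_normal(6)), gap(rng.standard_normal(6))))  # True
```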
Using (11) as the prior and (10) as the likelihood, the MAP satisfies f = arg min f Hk j=1 log h(yi) + A (f(xi)) f(xi)T(yi) + λ so that by the representer theorem, f ( ) = k( , X)α where α = arg min α Rn 1 A (Kα) α KT(Y ) + λ 2 α Kα, K = k(X, X). As an example, in binary kernel logistic regression, using the representer theorem we seek to minimise log p (f(X) | X, Y ) = Y Kα + 1 log 1 + e Kα + λ 2 α Kα + const., where K = K(X, X) Rn n is the kernel matrix with pqth entry Kpq = k(xp, xq). The solution can be found by applying Newton s method, resulting in an iterative fixed-point numerical scheme for α . Details of this method can be found in a number of references (Zhu & Hastie, 2005). A detailed example of the Ising model is given in Appendix C. Published as a conference paper at ICLR 2022 B DERIVATIVE TEST We will require the mean value theorem. Theorem 8 (Mean value theorem). Let S : [a, b] Rn be continuous on [a, b] and differentiable on (a, b). Then there exists some t (a, b) such that S(b) S(a) d dt S(t) (b a). We now restate the result to be proven. We follow the approach in theorem 2.2.16 of Hasselblatt & Katok (2003). Proposition 2. Let Z be an open strictly convex set, Z its closure, H : Z Z differentiable on Z and continuous on Z. If DH λ < 1 on Z, then H is a contraction on Z. Proof. Let z1, z2 Z and define c(t) := z1 + t(z2 z1) for t [0, 1] and S(t) := H S(t) . Then S : (0, 1) Z by strict convexity. By the mean value theorem, there exists some t (0, 1) such that H(z2) H(z1) = S(1) S(0) d dt S(t) (1 0) = d dt S(t) = DH (S(t)) d = DH (S(t)) (z2 z1) DH (S(t)) z2 z1 λ z2 z1 < z2 z1 and therefore H is a contraction mapping. Published as a conference paper at ICLR 2022 C WARM-UP: A TRACTABLE ISING MODEL We demonstrate our main result for a special Ising model case. The general case is given in 3. C.1 AS A KGLM Let x and x be two dimensional indices on the finite lattice, x {1, . . . , a} {1, . . . , b} for integers a, b Z. The Ising model prescribes that p (Y | f(X)) = Z 1 exp pq Jpqyipyiq + X where Z is the partition function, some normalising constant ensuring that p (Y | f(X)) is a valid probability mass function. Here the parameters Jpq determine interactions between the spin states yip and yiq, fxip controls an interaction between the spin state yip and an external magnetic field at lattice position xip and t is a temperature parameter. We let fx = f(x) belong to an RKHS Hk with PSD kernel k. With a standard Gaussian prior over f, we have that log p (f(X) | Y ) = t 1 X p,q Jp,qyipyiq + X p f(xip)yip log Z λ/2 f 2 Hk. Supposing no interactions, Jpq = 0 and Z = 2 cosh (t 1f(xip)) (Nguyen et al., 2017). By Theorem 7, letting K = t 1k(X, X) denote the matrix with pqth entry t 1k(xip, xiq), α = arg min α Y Kα + 1 log (2 cosh (Kα)) + tλ The problem is strictly convex so we may find the solution by applying Newton s method. Using that the derivative of log (2 cosh( )) is tanh( ), stationarity of the solution implies that 0 = KY + K tanh (Kα ) + tλKα := F(α ). (12) The Jacobian F α is given by α = K DK + tλK, where D : = diag sech2(Kα ) , so that the cth Newton iteration requires solving a linear system, J(α (c 1)) α (c) α (c 1) = K Y tanh (Kα (c 1)) tλα (c 1) . We employ the shorthand M(α (c)) = M(c) for any matrix function of α (c). The Newton update is typically rearranged as an Iteratively Re-Weighted Least Squares (IRWLS) form by substituting α (c 1) = J 1 (c 1)J(c 1)α (c 1), yielding J(c 1)α (c) = KD(c 1)z(c 1), where z := Kα + D 1(Y µ) and µ = tanh (Kα ). 
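A minimal NumPy sketch of this Newton-type scheme for the non-interacting Ising kGLM is given below. A crude backtracking safeguard is added for robustness, and the lattice size, kernel, temperature and regularisation are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Non-interacting Ising kGLM on a small one-dimensional lattice.
n, t, lam = 25, 1.0, 2.0
x = np.linspace(0.0, 1.0, n)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.05**2) / t   # K = k(X, X) / t
Y = rng.choice([-1.0, 1.0], size=n)                               # observed spins

def objective(alpha):
    # -Y^T K alpha + sum log(2 cosh(K alpha)) + (t lam / 2) alpha^T K alpha,
    # written with logaddexp for numerical stability.
    u = K @ alpha
    return -Y @ u + np.sum(np.logaddexp(u, -u)) + 0.5 * t * lam * alpha @ K @ alpha

def F(alpha):
    # Stationarity condition (12): F(alpha) = -K Y + K tanh(K alpha) + t lam K alpha.
    return -K @ Y + K @ np.tanh(K @ alpha) + t * lam * K @ alpha

alpha = np.zeros(n)
for _ in range(50):
    D = np.diag(1.0 / np.cosh(K @ alpha) ** 2)                    # D = diag(sech^2(K alpha))
    J = K @ D @ K + t * lam * K + 1e-10 * np.eye(n)               # Jacobian of F, with jitter
    step = np.linalg.solve(J, F(alpha))
    size = 1.0
    while size > 1e-12 and objective(alpha - size * step) > objective(alpha):
        size *= 0.5                                               # crude backtracking safeguard
    alpha = alpha - size * step

print("stationarity residual:", np.abs(F(alpha)).max())
```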
Once α has been approximated to some tolerance using this numerical scheme, the predictor f may be evaluated at points e X Ren dx using the representer theorem, f( e X) = t e Kα Rn , e K := t 1k( e X, X). (13) C.2 AS A DEQ Rearranging (12), and applying tanh to both sides of the equation, we obtain tanh (Kα ) = tanh 1 tλK (Y tanh (Kα )) . Choosing z = tanh (Kα ), W1 = (tλ) 1K, W2 = (tλ) 1K, we have z = tanh (W1z + W2Y ) . (14) Published as a conference paper at ICLR 2022 Note that α being a unique fixed point of (12) implies that z is a fixed point of (14), but not that the fixed point of (14) is necessarily unique. One condition that guarantees the uniqueness of the fixed point of (14) is that the induced matrix norm of K is less than or equal to 1. Without this assumption, (14) is still true but when used as a computational model, the output will in general depend on the fixed-point solver and initial conditions passed to that solver (and even still, might never return unstable fixed points). Finally by assuming that K is invertible and using (13), we obtain f(Y ) = t e Kα = t e KK 1 tanh 1 z = 1 λ e K (Y z ) = V1z + V2Y, where V1 = (λ) 1 e K and V2 = (λ) 1 e K. Published as a conference paper at ICLR 2022 D PROOF OF MAIN RESULT Theorem 3. Let α be the minimiser of one of the inner optimisation problems (5), inducing a function f ( ; X, Y ) = k( , X)α . Define z := A (Kα ). Then z is a fixed point satisfying z = σ W1z + W2T(Y ) + W3ρ(z ) , (7) where W1 = W3 = λ 1K and W2 = λ 1K are parameter matrices, σ A and ρ ( log q) (A ) 1 are monotone non-decreasing. Furthermore, under Assumption 1, for any test index e X Ren dx and for at least one z satisfying (7), f ( e X; X, Y ) = V1 (z + ρ(z )) + V2T(Y ), (8) where V1 = λ 1k( e X, X) and V2 = λ 1k( e X, X). Proof. Since A is strictly convex, A = σ strictly increasing and invertible. Q := log q( ) is a convex function, and its derivative Q is monotone nondecreasing. Also note that Q (A ) 1 is monotone nondecreasing. Let K = k(X, X). From (5), stationarity at the minimum, differentiability of A and Q, and the representer theorem implies 0 = KT(Y ) + KA (Kα ) + KQ (Kα ) + λKα λ (KT(Y ) KA (Kα ) KQ (Kα )) A (Kα ) = A 1 λ (KT(Y ) KA (Kα ) KQ (Kα )) A (Kα ) = A 1 λ KT(Y ) KA (Kα ) KQ (A ) 1 A (Kα ) (15) z = σ W1z + W2T(Y ) + W3ρ(z ) . (16) Note that z is only a (not necessarily unique) fixed point of (16). Suppose Assumption 1. Since K is invertible, and σ is invertible (since A is strictly convex), then again using the representer theorem, new predictions may be formed as e Kα = e KK 1σ 1z = 1 λ e K (T(Y ) z ρ(z )) = V1 (z + ρ(z )) + V2T(Y ), where e K = k( e X, X). Proposition 4. Under Assumption 2, (7) admits a unique fixed point on Z. Proof. Let H be the mapping defined by iteration (16), H(z ) = σ W1z + W2T(Y ) + W3ρ(z ) . H is a contraction since the induced matrix norm of the Jacobian of H satisfies DH = W1 + W3diagρ (z ) diagσ W1z + W2T(Y ) K/λ sup z Z I + diagρ (z) |A (z)| The result follows from Theorem 1. Published as a conference paper at ICLR 2022 DEQ for empirical risk minimisation (3) min γ,θ,V1,V2,b s=1 L e Ys, Fγ f (Ys) s.t. z s = gθ(z s; Ys) f (Ys) = V1z s + V2Ys + b DEQ, fully connected or conv hidden layer min γ,W1,W2, W3,V1,V2,b s=1 L e Ys, Fγ f (Ys) s.t. z s = σ W1z s + W2T(Ys) + W3ρ(z s) f (Ys) = V1z s + V2Ys + b DEQ, fixed hidden parameters s=1 L e Ys, Fγ f (Ys) s.t. 
z s = σ W1z s + W2T(Ys) + W3ρ(z s) f (Ys) = V1z s + V2Ys + b DDN, k GLM regularised negative log likelihood as inner problem (6) s=1 L e Ys, Fγ f s ( e Xs; Xs, Ys) s.t. α s = arg min α k GLM(α, Xs, Ys) f s ( ; Xs, Ys) = k( , Xs)α s DDN for empirical risk minimisation s=1 L eξs, Fγ α s, eξs, ξs s.t. α s arg min α Aθ,ξs Cθ(α, ξs) Feature embedding z s = σ W1z s + W2T(Ys) + W3ρ(z s) Fixed W1, W2, W3, V1, V2, b Corollary 6 Assumptions 1 and 2 Fixed Xs, e Xs = f s ( e Xs; Xs, Ys) f (Ys) W1 = W3 = k(X, X)/λ, W2 = W1 V1 = k( e X, X)/λ, V2 = V1 Dataset {(eξs, ξs)}N s=1 Loss function L Neural network Fγ k GLM as inner problem Theorem 3 z s := σ(Kα s) Figure 7: Connections between implicit layer architectures. Published as a conference paper at ICLR 2022 E CONVOLUTION AS MATRIX MULTIPLICATION Let w, h Z>1 denote the pixel width and height of a space of images with c channels. Let Xwh = {1, . . . , h} {0, . . . , w}, Xwhc = Xwh {1, . . . , c} (17) denote the corresponding pixel and pixel-channel spaces. Let X R(whc) c denote a matrix, where every row is one of the pixel coordinates (i, j, c) Xwhc. Choose Xs = X for some set of N images {Ys}N s=1. The images Ys are of size whc, and contain pixel intensities for the corresponding pixel coordinates in X. Let x1, x2 and x3 denote the first, second and third coordinates of an element x Xwhc. Let h1(x, x ) = 1(|x1 x 1| r1), h2(x, x ) = 1(|x2 x 2| r2), where 1 is the indicator function for some truncating parameters r1, r2 > 0. Let κ0 : Xwhc Xwhc R denote any stationary PSD kernel, for example κ0(x, x) = e 1 2 x x 2 2. Note that these functions each define a PSD kernel. Using the fact that kernels are closed under multiplication, define κ(x, x ) := h1(x, x )h2(x, x )κ0(x, x ). (18) Here h1 and h2 ensure that only close pixels have non-zero kernel values. Note that the matrix κ(X, X) is sparse when r1 or r2 are small and has an off-diagonal structure. More concretely choosing κ(x, x ) = e 1 20 [x1,x2] [x 1,x 2] 2 21( x x 5) where denotes the supremum norm, we obtain a whc whc matrix. Each block of size wh is repeated horizontally. This matrix represents c convolutional filters of filter length 5 2 + 1 = 11 in both directions. We plot the induced 322 322 kernel matrix when w = h = 32 and c = 1 in Figure 4, as well as the corresponding operation as a convolution. When c > 1, this matrix is simply repeated vertically and horizontally c times. For rows in the middle of the matrix, there are 112 non-zero entries, corresponding to the indices included in the convolution operation when centered on the pixel of the corresponding row. The proof is tedious and mostly centres around the notation of appropriate reshaping of arrays and zero-padding. In order to state our result, we first need to define the neural network convolution operation. Definition 9 (Single-channel convolution). Fix X = Xwh (in the sense of (17)) and let the rows of X be equal to the elements of X. Let 1 denote the single-channel image convolution operation (in the sense of convolutional networks). More concretely, C 1 Y applies a filter C R(2r1+1) (2r2+1) to a flattened image Y Rwh as follows. First flatten C to be an element C of R(2r1+1)(2r2+1) and zero-pad Y to be an element e Y of R(2r1+w)(2r2+h). Then define for each j {1, wh} a vector Cj with the same dimensionality as e Y i, such that Cj,[j:j+(2r1+1)(2r2+1)] = C and all other elements of Cj are zero. 
The proof is tedious and mostly centres around the notation of appropriate reshaping of arrays and zero-padding. In order to state our result, we first need to define the neural network convolution operation.

Definition 9 (Single-channel convolution). Fix X = X_wh (in the sense of (17)) and let the rows of X be equal to the elements of X. Let ∗_1 denote the single-channel image convolution operation (in the sense of convolutional networks). More concretely, C ∗_1 Y applies a filter C ∈ R^{(2r_1+1)×(2r_2+1)} to a flattened image Y ∈ R^{wh} as follows. First flatten C to an element C̄ of R^{(2r_1+1)(2r_2+1)} and zero-pad Y to an element Ỹ of R^{(2r_1+w)(2r_2+h)}. Then define for each j ∈ {1, . . . , wh} a vector C̄_j with the same dimensionality as Ỹ, such that C̄_{j, [j : j+(2r_1+1)(2r_2+1)]} = C̄ and all other elements of C̄_j are zero. Then for each j ∈ {1, . . . , wh}, define

(C ∗_1 Y)_j = C̄_j^⊤ Ỹ = (C_mat Ỹ)_j,

so that C ∗_1 Y ∈ R^{wh}, where C_mat ∈ R^{wh × (2r_1+w)(2r_2+h)} is a matrix with jth row equal to C̄_j.

Definition 10 (Multi-channel convolution). Let ∗ denote the multi-channel image convolution operation (in the sense of convolutional networks). More concretely, C ∗ Y applies a set of filters C ∈ R^{c_1 × c_2 × (2r_1+1) × (2r_2+1)} to a flattened image Y ∈ R^{whc_2} as

(C ∗ Y)_j = Σ_{r=1}^{c_2} C_{j,r,:,:} ∗_1 Y_r ∈ R^{wh},

so that C ∗ Y ∈ R^{c_1 wh}, where Y_r ∈ R^{wh} represents the rth channel of Y.

We then have the following corollary of Theorem 3.

Corollary 11. In the same setting as Theorem 3, under Definition 9, let k ≡ κ (in the sense of (18)) and let f*(·; Y, X) = κ(·, X)α* be a global minimiser of (5). Then z* is a fixed point satisfying

z* = σ( C_1 ∗ z* + C_2 ∗ T(Y) + C_3 ∗ ρ(z*) ), (19)

where z* = A'(κ(X, X)α*). C_1 = C_3 and C_2 are convolutional filters with pqth entries −λ^{-1} κ_0(X_{i, [r_1+1, r_2+1]}, X_{i, [p, q]}) and λ^{-1} κ_0(X_{i, [r_1+1, r_2+1]}, X_{i, [p, q]}) respectively. σ ≡ A' and ρ ≡ (−log q)' ∘ (A')^{-1} are monotone non-decreasing. Furthermore,

1. Under Assumption 2, (7) admits a unique fixed point on Z.
2. Under Assumption 1, for any test index X̃ ∈ R^{ñ×d_x} and for at least one z* satisfying (7),

f*(X̃; Y, X) = V_1 ( z* + ρ(z*) ) + V_2 T(Y), (20)

where V_1 = −λ^{-1} κ(X̃, X) and V_2 = λ^{-1} κ(X̃, X).

Proof. We begin with the case c = 1. By Definition 9, we have that C_1 ∗ z* = C_mat z̃* ∈ R^{wh}, where z̃* ∈ R^{(2r_1+w)(2r_2+h)} is a zero-padded z* and C_mat ∈ R^{wh × (2r_1+w)(2r_2+h)}. The jth row of C_mat is equal to C̄_j, where C̄_{j, [j : j+(2r_1+1)(2r_2+1)]} = C̄ and all other elements of C̄_j are zero. Here C̄ is a flattened version of C_1. By removing the zero entries from z̃* and the corresponding columns from C_mat, we may write C_mat z̃* = W_1mat z* for some W_1mat ∈ R^{wh×wh}. The same holds for W_2 and W_3. We may therefore write

σ( C_1 ∗ z* + C_2 ∗ T(Y) + C_3 ∗ ρ(z*) ) = σ( W_1mat z* + W_2mat T(Y) + W_3mat ρ(z*) ).

On the other hand, Theorem 3 says that σ( W_1 z* + W_2 T(Y) + W_3 ρ(z*) ) = z*. Choosing W_1mat = W_1 = −λ^{-1} κ(X, X), and similarly for W_2 and W_3, we obtain (19). This implies that the filter C_1 has pqth entry −λ^{-1} κ_0(X_{i, [r_1+1, r_2+1]}, X_{i, [p, q]}). The rest of the corollary follows from Theorem 3.

Figure 8: Empirically measured cost (in seconds) of initialisation schemes, measured on a Dell laptop (16GB RAM, Intel Core i7-8665U CPU) and averaged over 100 runs. (Left) Fully connected DEQ layer, as a function of width. (Right) Convolutional DEQ layer, as a function of the number of channels. In all cases, initialisation represents a small cost compared with training.

F DETAILED EXPERIMENTS

Our experiments may be reproduced using the code provided in the supplementary material by following the README.md file.

Choice of kernel for initialisation. To initialise a convolutional network, we need to choose r_1, r_2 and κ_0 according to (18). We choose r_1 = r_2 = 3 and

κ_0(x, x') = Σ_{r=1}^{c(c+1)/2} g_r(x_3, x'_3) e^{−(1/(2ℓ_r²)) ‖x_{[1,2]} − x'_{[1,2]}‖²},

where g_r(a, b) = 1 if (a = a_r and b = b_r) or (a = b_r and b = a_r), and 0 otherwise, with (a_r, b_r) ranging over the channel pairs. We sample the squared lengthscales ℓ_r² from a uniform distribution between 0 and 4. The cost of kGLM initialisation is minuscule, and often even faster than random initialisation.
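The following sketch illustrates the filter construction just described (r_1 = r_2 = 3, one squared lengthscale per unordered channel pair drawn from U(0, 4)), together with the circular-padding spectral-norm computation of Sedghi et al. (2018) used for the normalisation. The shapes and the choice λ = 2 × the spectral norm (mirroring Appendix G) are our illustrative assumptions, not the exact released implementation.

```python
import numpy as np

# Sketch of kernel initialisation for a convolutional DEQ layer: filter entries are
# kappa_0 evaluated between the centre pixel and each offset, with one squared
# lengthscale per unordered channel pair, then the filter is scaled by -1/lambda,
# where lambda is set from the layer's spectral norm (Sedghi et al., 2018, circular padding).
rng = np.random.default_rng(0)
c, r, n = 3, 3, 32                               # channels, truncation radius, image width/height

ell2 = np.zeros((c, c))                          # squared lengthscale for each channel pair
for a in range(c):
    for b in range(a, c):
        ell2[a, b] = ell2[b, a] = rng.uniform(0.0, 4.0)

p = np.arange(-r, r + 1)
d2 = p[:, None] ** 2 + p[None, :] ** 2           # squared pixel offsets, shape (7, 7)
filt = np.exp(-d2[None, None] / (2.0 * ell2[:, :, None, None]))   # shape (c, c, 7, 7)

# Exact spectral norm of the induced circular convolution (Sedghi et al., 2018):
# 2-D DFT of the zero-padded filter, then the largest singular value over frequencies.
transform = np.fft.fft2(filt, s=(n, n), axes=(2, 3))              # (c, c, n, n)
svals = np.linalg.svd(transform.transpose(2, 3, 0, 1), compute_uv=False)
spectral_norm = svals.max()

W_filter = -filt / (2.0 * spectral_norm)         # e.g. lambda = 2 * spectral norm, as in Appendix G
print("filter shape:", filt.shape, "spectral norm:", spectral_norm)
```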
For a fully connected layer with hidden and readout weight matrices W and V of size n × n and ñ × n respectively, we require n² + nñ evaluations of the kernel function, followed by a normalisation by the spectral norm (the default option for this calculation in a Python library such as NumPy is the LAPACK divide-and-conquer algorithm, which has a worst case of O(n³), but in practice the computation is very fast). In contrast, random initialisation (the usual method for initialising neural networks) requires n² + nñ random number generations which, depending on the software platform, are usually performed through calls to the inverse transform sampling method, perhaps using a stochastic collocation Monte Carlo sampler. Either way, empirically we find that kGLM initialisation is actually much faster than random Gaussian initialisation for fully connected layers. See Figure 8.

For convolutional layers, which are essentially very large, sparse fully connected layers, the kernel method incurs an additional overhead associated with the various reshaping operations required to put the elements of the kernel matrix in the correct positions in the convolutional filter. The calculation of the spectral norm is done using the method of Sedghi et al. (2018) at a cost of O(w²c²(c + log w)), where c is the number of channels and w is the maximum of the pixel width and height of the image. Our implementation is far from optimised, but we find that even with this overhead, initialisation costs less than 1 second on a laptop PC. This is a tiny fraction of the cost typically associated with training the network, which depends on the problem but will usually be much less than 1%. See Figure 8.

Scaling up parameter counts. The number of rows of the square kernel matrix is equal to the number of rows in X. This means, for example, in the setting of CIFAR-10 where X ∈ R^{(32·32·3)×3}, there are roughly 9.4 × 10^6 elements in the kernel matrix. Using a convolutional representation, the sparse kernel matrix may be represented as a convolutional layer of size (3, 3, 32, 32). Crucially, the parameter count scales quadratically with the number of input channels, which in this case is 3. One way to experiment on larger models is to scale up the size of the input data. To this effect, we take a dataset containing 3799 hyperspectral (25 channel) images (HSI) of roads (Lu et al., 2020), of total size (3799, 25, 192, 384). We consider a sequence of denoising tasks in which the input data is a subset of the 25 hyperspectral channels. The resulting plots are given in Figures 9, 10, 11 and 12, and the results are summarised in Table 1.

Figure 9: First 4 channels of the HSI dataset. Test MSE against the average spectral norm of each layer for the image denoising task, using kGLM initialisation (blue, ours) and random initialisation (red). The vertical axes change between plots. Markers at the edge of the top border indicate that a value greater than the axis limit or a NaN was observed. Our initialisation scheme shows superior performance at all epochs.
Table 1: Summary of experimental results. Shown are average log test losses ± one sample standard deviation of the logarithm, taken at the best performing spectral norm. The spectral norm grid was as described in the main text. If NaNs or Infs were observed, these were replaced by the maximum before the average and variance were recorded. "# failed" indicates the number of NaNs and Infs observed over the full grid. Due to training instability causing NaN and Inf data points, especially in randomly initialised models, the average and sample standard deviation are not true unbiased estimators. kGLM initialisation shows superior performance, both in terms of average performance and number of failed runs.

Model        # Params   # Runs   Epochs   kGLM init: mean ± std   # failed   Random init: mean ± std   # failed
4 channel    1600       70       0        −4.80 ± 0.55            0          −3.37 ± 0.17              220
                                 3        −6.61 ± 0.03            17         −6.51 ± 0.01              222
                                 5        −6.64 ± 0.02            21         −6.57 ± 0.01              222
8 channel    6400       100      0        −4.87 ± 0.37            0          −3.53 ± 0.15              450
                                 3        −6.76 ± 0.01            42         −6.73 ± 0.01              485
                                 5        −6.80 ± 0.01            46         −6.79 ± 0.01              486
10 channel   10,000     65       0        −4.95 ± 0.31            0          −3.63 ± 0.12              256
                                 3        −6.84 ± 0.01            33         −6.80 ± 0.02              268
                                 5        −6.90 ± 0.01            35         −6.87 ± 0.02              268
13 channel   16,900     65       0        −4.60 ± 0.37            0          −3.23 ± 0.15              185
                                 3        −6.82 ± 0.22            28         −6.42 ± 0.40              206
                                 5        −6.87 ± 0.22            28         −6.46 ± 0.42              208

Figure 10: First 8 channels of the HSI dataset. Test MSE against the average spectral norm of each layer for the image denoising task, using kGLM initialisation (blue, ours) and random initialisation (red). The vertical axes change between plots. Markers at the edge of the top border indicate that a value greater than the axis limit or a NaN was observed. Our initialisation scheme shows superior performance at all epochs.

Figure 11: First 10 channels of the HSI dataset. Test MSE against the average spectral norm of each layer for the image denoising task, using kGLM initialisation (blue, ours) and random initialisation (red). The vertical axes change between plots. Markers at the edge of the top border indicate that a value greater than the axis limit or a NaN was observed. Our initialisation scheme shows superior performance at all epochs.

Figure 12: First 11 channels of the HSI dataset. Test MSE against the average spectral norm of each layer for the image denoising task, using kGLM initialisation (blue, ours) and random initialisation (red). The vertical axes change between plots. Markers at the edge of the top border indicate that a value greater than the axis limit or a NaN was observed. Our initialisation scheme shows superior performance at all epochs.

G EXAMPLE EMPIRICAL INSTANTIATION OF EQUIVALENCE

We take binarised MNIST and, for each image, independently corrupt 30% of the bits by randomly flipping them. We are interested in obtaining a predictive model that can segment the original binarised image Ỹ_s from its noisy, corrupted input Y_s, together with the constant coordinate locations X_s = X̃_s = X, which may be represented as 784 × 2 coordinate indices. The index 1 ≤ s ≤ 60,000 runs over different images in the training dataset. We evaluate model performance on an independent testing dataset of size 10,000.
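A minimal sketch of this data setup (ours; the random array below is a stand-in for an actual MNIST loader) is as follows.

```python
import numpy as np

# Sketch of the data setup: binarise, independently flip 30% of the bits, and
# build the shared 784 x 2 coordinate matrix X. Any array of shape (N, 28, 28)
# with values in [0, 1] (e.g. MNIST) can be substituted for the stand-in below.
rng = np.random.default_rng(0)

def corrupt(images, flip_prob=0.3):
    binarised = (images > 0.5).astype(np.float32)          # ground truth, Y~_s
    flips = rng.uniform(size=binarised.shape) < flip_prob  # which bits to flip
    noisy = np.where(flips, 1.0 - binarised, binarised)    # corrupted input, Y_s
    return binarised, noisy

ii, jj = np.meshgrid(np.arange(28), np.arange(28), indexing="ij")
X = np.stack([ii.ravel(), jj.ravel()], axis=1)             # 784 x 2, shared across images

images = rng.uniform(size=(8, 28, 28))                     # stand-in for MNIST images
clean, noisy = corrupt(images)
print(X.shape, clean.shape, noisy.shape)
```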
We empirically compare five related models.

1. Kernel logistic regression (KLR). Since each element of Y_s ∈ R^784 is binary, a natural kGLM to use is kernel logistic regression (KLR) (Green & Yandell, 1985; Zhu & Hastie, 2005). With respect to the exponential family, such a model prescribes that the derivative of the log partition function is the logistic sigmoid, A'(·) ≡ σ(·) = (1 + e^{−(·)})^{-1}, and that the sufficient statistic T is the identity. As is common in the literature, we employ a zero-mean Gaussian prior, so that ρ ≡ 0. The only remaining choices are the kernel function k and the regularisation parameter λ. In order to ensure that Assumption 1 is satisfied, we choose k to be the sum of a squared exponential kernel with lengthscale 1 and a white noise kernel with variance 10^{-8}. This forces the resulting kernel matrix K to be strictly positive definite. In order to satisfy Assumption 2, we need to ensure that ‖K‖/λ ≤ 1 (Assumption A3.1 is automatically satisfied due to the setting of A in KLR). Accordingly, we set λ = 2‖K‖. Note that we do not actually need Assumptions 1 and 2 to be satisfied for Theorem 3 to hold, but we do need these assumptions so that all of our theoretical results hold. The KLR model is trained individually using every datapoint in {Y_s}^{60,000}_{s=1} as a dataset, with a manually implemented iteratively reweighted least squares algorithm consisting of updates according to (exact) Newton's method. We allow Newton's method to run for 5 steps, at which point the norm of the difference between subsequent values of α is typically on the order of 10^{-14}. The output of the KLR model is a kernelised predictor f* that has been trained using data X_s and Y_s, evaluated on X_s, and converted to the maximum a posteriori of the Bernoulli distribution through 1(σ(f*) > 0.5).

2. DEQ with kernel initial parameters (DEQ-kernel). Theorem 3 implies an equivalent DEQ model with fixed parameters. We use this model, namely

z* = σ( W_1 z* + W_2 Y_s ),  f*(Y) = V_1 z* + V_2 Y,

where W_1 = V_1 = −λ^{-1}K and W_2 = V_2 = λ^{-1}K are parameter matrices and σ is the logistic sigmoid function. To convert the kernelised predictor f*(Y) to the mean of the Bernoulli distribution, we must pass it again through the inverse link function,

σ(f*(Y)) = σ( V_1 z* + V_2 Y ),  (greyscale prediction)

and the resulting maximum a posteriori prediction is determined by examining whether the mean of the Bernoulli distribution is greater than 0.5,

1( σ(f*(Y)) > 0.5 ).  (binarised prediction)

3. DEQ with kernel initialisation and fine-tuned parameters (DEQ-kernel-trained). We interpret the DEQ with initial parameters as a neural network, and update the parameters W_1, W_2, V_1 and V_2 using Adam to minimise the squared loss between the greyscale prediction and the binarised ground truth. We also allow biases in both the hidden and readout layers. We train for 5 epochs using a batch size of 100, leading to 5 × (60000/100) = 3000 gradient updates.

4. DEQ with random initialisation (DEQ-random).
We take the same neural network architecture as with kernel initialisation, but randomly initialise the parameters from a Gaussian distribution using the PyTorch default.

5. DEQ with random initialisation and fine-tuned parameters (DEQ-random-trained). We take the same neural network architecture as with kernel initialisation, but randomly initialise the parameters from a Gaussian distribution using the PyTorch default. We also allow biases and train using the same optimiser and loss.

Observations. Remarkably (but not surprisingly, due to Theorem 3), the binarised prediction agrees exactly with KLR (an entirely independent piece of software that uses an entirely different solver) on all but 11 of the 60,000 MNIST datapoints, each containing 784 pixels. For those 11 datapoints where disagreements were observed, only 1 of the 784 pixels differed. The very small differences arise due to the tolerance of the solvers and the way tied predictions (i.e. those whose logistic sigmoid output is exactly 0.5) are handled. In all 11 cases, the DEQ output a value of 0.5 and KLR output some value less than, but very close to, 0.5. See Table 2 for the posterior probabilities, and Figure 13 for the corresponding binarised predictions. Figure 14 shows the test error as a function of training epoch for DEQ-kernel-trained and DEQ-random-trained.

Table 2: KLR and the equivalent DEQ agreed on all 784 pixels of the 60,000 examples in the training set, except for the 11 examples below. In these cases, the predicted posterior probabilities were very close to borderline.

Example index   DEQ posterior probability   KLR posterior probability
3678            0.5                         0.49999999405532286
13774           0.5                         0.4999999622825579
18222           0.5                         0.49999996963143295
19554           0.5                         0.49999999447511095
19857           0.5                         0.49999998512994437
22298           0.5                         0.4999999968452693
23396           0.5                         0.49999997421424147
38328           0.5                         0.4999999917996173
43401           0.5                         0.4999999890555562
48119           0.5                         0.49999998394239925
58961           0.5                         0.4999999743936929

Figure 13: Visualisation of the 11 datapoints where the KLR and DEQ-kernel predictions did not agree exactly: KLR predictions (top), DEQ-kernel predictions (middle), and DEQ-random predictions (bottom). The top and middle rows are very close, in contrast with the bottom row. The red pixel in the middle row shows where it disagrees with the top. On all other 59,989 examples, the top and middle rows agree exactly.

Figure 14: Testing and training error for the DEQ-kernel-trained (green) and DEQ-random-trained (blue) models. The left column shows the first 100 iterations only, and the right column shows the entire training process. The initial models DEQ-kernel and DEQ-random are correspondingly represented by iteration 0. KLR is equivalent to DEQ-kernel (up to tied predictions). The top two plots show the mean squared error of the binarised images on the test set. The bottom two plots show the mean squared error of the greyscale images on the training set. Kernel initialisation outperforms random initialisation. Note that the initial test error of kernel initialisation is smaller than that after training until about 30 training iterations have passed.
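For concreteness, the sketch below (ours, on synthetic stand-in data rather than MNIST) implements both sides of this comparison: the exact-Newton (IRLS) solver for the regularised KLR objective described in model 1, and the DEQ-kernel fixed point of model 2, checking that the recovered coefficients and posterior probabilities agree.

```python
import numpy as np

# Sketch comparing the KLR solver (exact Newton / IRLS on the regularised negative
# log likelihood) with the DEQ-kernel fixed point, on synthetic stand-in data.
rng = np.random.default_rng(0)
n = 100
X = rng.uniform(-3, 3, size=(n, 2))                    # stand-in coordinates
y = (rng.uniform(size=n) < 0.5).astype(float)          # stand-in binary targets

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * d2) + 1e-8 * np.eye(n)               # squared exponential + white noise kernel
lam = 2.0 * np.linalg.norm(K, 2)                       # lambda = 2 ||K||, as above

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# KLR: Newton's method on J(a) = sum_i [log(1 + e^{(K a)_i}) - y_i (K a)_i] + (lam/2) a^T K a.
alpha = np.zeros(n)
for _ in range(5):                                     # 5 Newton steps, as in model 1 above
    eta = K @ alpha
    p = sigmoid(eta)
    grad = K @ (p - y + lam * alpha)
    hess = K @ np.diag(p * (1.0 - p)) @ K + lam * K
    alpha -= np.linalg.solve(hess, grad)

# DEQ-kernel: fixed point of z = sigmoid(W1 z + W2 y) with W1 = -K/lam, W2 = K/lam (model 2).
z = np.zeros(n)
for _ in range(500):
    z = sigmoid(K @ (y - z) / lam)
alpha_deq = (y - z) / lam                              # Theorem 3 with T = identity, rho = 0

print("max |alpha_KLR - alpha_DEQ| :", np.abs(alpha - alpha_deq).max())
print("max |posterior KLR - DEQ|   :", np.abs(sigmoid(K @ alpha) - z).max())
```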
H EXTENSION TO NONCENTERED CANONICAL PARAMETERS

It is straightforward to extend Theorem 3 (and related corollaries) to the case where the canonical parameters η are the sum of evaluations of some deterministic function m and evaluations of the predictor f. We sketch this now and leave the details to the reader to suit their application. Starting from (10) and using prior (4), we obtain a new MAP objective

f* = arg min_{f ∈ H_k} Σ_i [ −log h(y_i) + A( f(x_i) + m(x_i) ) + Q( f(x_i) + m(x_i) ) − ( f(x_i) + m(x_i) ) T(y_i) ] + (λ/2) ‖f‖²_{H_k},

so that by the representer theorem, f*(·) = k(·, X)α* is given by

α* = arg min_{α ∈ R^n} 1^⊤ ( A(Kα + M) + Q(Kα + M) ) − α^⊤ K T(Y) + (λ/2) α^⊤ K α,

where M := m(X). The same reasoning as in the proof of Theorem 3 then still follows if we define z* = A'(Kα* + M). That is, the stationarity condition implies that

0 = −K T(Y) + K A'(Kα* + M) + K Q'(Kα* + M) + λ K α*
⟺ Kα* + M = λ^{-1} ( K T(Y) − K A'(Kα* + M) − K Q'(Kα* + M) ) + M
⟹ A'(Kα* + M) = A'( λ^{-1} ( K T(Y) − K A'(Kα* + M) − K Q'(Kα* + M) ) + M )
⟺ A'(Kα* + M) = A'( λ^{-1} ( K T(Y) − K A'(Kα* + M) − K (Q' ∘ (A')^{-1})( A'(Kα* + M) ) ) + M )
⟺ z* = σ( W_1 z* + W_2 T(Y) + W_3 ρ(z*) + M ),

and so on. A similar result holds when an additional linear scaling of f(X) is introduced.

I SENSITIVITY OF FIXED POINT EQUATIONS TO NON-PSD MATRIX PERTURBATIONS

The DEQ layers of the models we studied in the main text have parameter matrices that are symmetric PSD kernel matrices. This departs from established practice, where one randomly initialises the parameter matrices (say, from a zero-mean Gaussian distribution with small variance) and then trains the parameters using gradient descent. Here we investigate connections between our restricted parameter setting and more general parameters. We begin by bounding the difference between fixed points of DEQ layers with the same activation σ and input Y, but different parameter matrices.

Proposition 12. Let Z be an open strictly convex set and Z̄ its closure. Denote by ‖·‖ some vector norm on Z and also its induced matrix norm. Let g_1 : Z̄ → Z̄ and g_2 : Z̄ → Z̄ be defined by

g_1(z) = σ( A(Y − z) ),  A ∈ R^{n×n},
g_2(z) = σ( B(Y − z) ),  B ∈ R^{n×n},

for some Y ∈ R^n, a function σ with Lipschitz constant L_σ ≤ 1 (with respect to ‖·‖ on Z), and matrices A, B with matrix norms L_A = ‖A‖ < 1 and L_B = ‖B‖ < 1. Define L_1 = L_σ L_A and L_2 = L_σ L_B. Then for the fixed points z*_1 = g_1(z*_1) and z*_2 = g_2(z*_2) (each guaranteed to exist and be unique),

‖z*_1 − z*_2‖ ≤ [ L_σ ‖Y − σ(0)‖ / ( (1 − L_1)(1 − L_2) ) ] ‖A − B‖.

Proof. Existence and uniqueness of the fixed points follow directly from Theorem 1. Exploiting only the triangle inequality and Lipschitz properties, we have

‖z*_1 − z*_2‖ = ‖g_1(z*_1) − g_2(z*_2)‖ = ‖g_1(z*_1) − g_1(z*_2) + g_1(z*_2) − g_2(z*_2)‖
 ≤ ‖g_1(z*_1) − g_1(z*_2)‖ + ‖g_1(z*_2) − g_2(z*_2)‖
 ≤ L_1 ‖z*_1 − z*_2‖ + ‖g_1(z*_2) − g_2(z*_2)‖,

so that

(1 − L_1) ‖z*_1 − z*_2‖ ≤ ‖g_1(z*_2) − g_2(z*_2)‖
⟹ ‖z*_1 − z*_2‖ ≤ [ L_σ / (1 − L_1) ] ‖(A − B)(Y − z*_2)‖.

Using g_2(Y) = σ(0), and again the triangle inequality and Lipschitz properties, note that

‖z*_2 − Y‖ = ‖g_2(z*_2) − g_2(Y) + σ(0) − Y‖ ≤ ‖g_2(z*_2) − g_2(Y)‖ + ‖σ(0) − Y‖ ≤ L_2 ‖z*_2 − Y‖ + ‖Y − σ(0)‖
⟹ ‖z*_2 − Y‖ ≤ ‖Y − σ(0)‖ / (1 − L_2),

so that

‖z*_1 − z*_2‖ ≤ [ L_σ ‖Y − σ(0)‖ / ( (1 − L_1)(1 − L_2) ) ] ‖A − B‖.

We will now use this result to bound the difference between fixed points found by arbitrary parameter matrices and fixed points found by PSD kernel matrices, which exactly solve kGLMs as in the main text. It helps to think of real-valued PSD matrices as being different from general matrices in two respects. Firstly, real-valued PSD matrices are always symmetric (so their eigenvalues are real). Secondly, the eigenvalues of PSD matrices are always nonnegative reals.
The matrix norm of the difference between any matrix and its closest PSD counterpart can be decomposed into these two distinct effects when the matrix norm is the Frobenius norm (Higham, 1988, Theorem 2.1). However, the Frobenius norm is difficult to work with here, as it is not induced by a norm over the vector space Z and only provides loose bounds on the spectral norm. We instead operate on ‖·‖_2 directly, and obtain a bound that contains a similar decomposition into two effects.

Proposition 13. In the same setting as Proposition 12, choose ‖·‖ to be the Euclidean and spectral norms ‖·‖_2. Let A be any n × n matrix. A admits a decomposition into symmetric and skew-symmetric parts A = A_1 + A_2, where A_1 = ½(A + A^⊤) and A_2 = ½(A − A^⊤). Let {B ⪰ 0} denote the space of PSD matrices. Then

min_{B ⪰ 0} ‖z*_1 − z*_2‖_2 ≤ [ L_σ ‖Y − σ(0)‖_2 / (1 − L_1)² ] ( λ̃ + ‖A_2‖_2 ),

where λ̃ is the largest absolute value of the negative eigenvalues of A_1.

Proof. From the symmetric, skew-symmetric decomposition, A = ½QΛQ^⊤ + A_2, where Q is (real) orthogonal and Λ is a real diagonal matrix. Choose B = ½QΛ_+Q^⊤, where Λ_+ is the elementwise maximum of Λ and zero. Then

‖A − B‖_2 = ‖A_2 + ½Q(Λ − Λ_+)Q^⊤‖_2 ≤ ‖A_2‖_2 + ‖½Q(Λ − Λ_+)Q^⊤‖_2.

The first term is the operator norm of A_2. The second term is the largest absolute value of the negative eigenvalues of A_1. Finally, note that L_B ≤ ‖A_1‖_2 ≤ ‖A‖_2, so that 1/(1 − L_2) ≤ 1/(1 − L_1), and apply Proposition 12.

This result helps us understand what happens when parameter matrices that are almost PSD kernel matrices are used. We note that if A is PSD, the bound evaluates to zero. If A is symmetric and real, then the bound is a linear scaling of the most negative eigenvalue of A. If A is any matrix, we must additionally account for the operator norm of the skew-symmetric part of A, which has purely imaginary eigenvalues. A blunter observation can be made when we do not know anything about the norm of the skew-symmetric part or the most negative eigenvalue of the symmetric part of A, by relating both of these quantities to the norm of A. Here, rather than using the relative spectral properties of A, we can use just the scale of A to conclude that if A contains small entries, there is a PSD matrix B that obtains a close fixed point, and we can quantify this closeness. This makes use of the coarse inequality λ̃ + ‖A_2‖_2 ≤ 2‖A‖_2. A natural question to ask is when such a blunt device might be applied in practical scenarios. It is noted in the tutorial material of Kolter et al. (2020) that DEQ parameters are typically initialised with a smaller weight variance than standard feedforward networks. Recall that Xavier Glorot initialisation for standard feedforward networks chooses a parameter standard deviation of 1/√n. If we slightly alter this initialisation scheme to achieve a smaller weight variance (that is still O(n^{-1/2})), we can apply the conclusion of the rough reasoning above using results from random matrix theory.

Proposition 14. In the same setting as Proposition 13, let A be a random matrix with independent Gaussian entries having zero mean and standard deviation ν = (C/L_σ)/(2√n + t). Then with probability at least 1 − 2exp(−t²/2) for t ≥ 0, the smallest sum of squared errors is bounded by

min_{B ⪰ 0} ‖z*_1 − z*_2‖²_2 ≤ 4C² ‖Y − σ(0)‖²_2 / (1 − C)⁴.

In particular, if ‖Y − σ(0)‖²_2 ≤ Dn for some D ≥ 0, the smallest mean squared error is bounded by

min_{B ⪰ 0} (1/n) ‖z*_1 − z*_2‖²_2 ≤ 4DC² / (1 − C)⁴.

Proof. λ̃ is bounded by the largest absolute value of all of the eigenvalues of the symmetric part of A, which is the operator norm of the symmetric part of A.
Thus

λ̃ + ‖A_2‖_2 ≤ ½‖A + A^⊤‖_2 + ½‖A − A^⊤‖_2 ≤ 2‖A‖_2 = 2L_A.

By Proposition 13 we therefore have

min_{B ⪰ 0} ‖z*_1 − z*_2‖_2 ≤ 2L_1 ‖Y − σ(0)‖_2 / (1 − L_1)².

Choosing C such that 0 < C/L_σ < 1 and letting ν = (C/L_σ)/(2√n + t) in Theorem 15 below, we have that ‖A‖_2 = L_A < C/L_σ, and hence L_1 < C, with probability exceeding 1 − 2exp(−t²/2). Squaring then gives the result.

Theorem 15 (Corollary of Theorem 2.6 in Rudelson & Vershynin (2010)). Let A be an n_1 × n_2 matrix with independent zero-mean Gaussian random entries, each having standard deviation ν. Then for all t ≥ 0,

P( ‖A‖_2 < ν(√n_1 + √n_2 + t) ) ≥ 1 − 2e^{−t²/2}.

When C is small, we expect there to exist kernel matrices that perform roughly the same (that is, as poorly) as random Gaussian initialisation. This result is consistent with our empirical results in the figures of Appendix F, where we observe that randomly initialised parameter matrices with a spectral norm of around 10^{-1} or less obtain roughly the same performance as kernel-initialised models with the same norm, and both outperform randomly initialised parameter matrices with large spectral norms. However, this experimental setup is slightly complicated by the use of convolutional layers, which, when converted to parameter matrices, are very large and sparse, and contain shared elements.
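The constructions in Propositions 12 and 13 are easy to check numerically. The sketch below (ours, with σ = tanh so that L_σ = 1, and illustrative sizes) builds the PSD surrogate B = ½QΛ_+Q^⊤ from the symmetric part of a random A and verifies that the distance between the two fixed points respects the bound of Proposition 12.

```python
import numpy as np

# Numerical check of Propositions 12 and 13: PSD surrogate of a random matrix,
# and the resulting fixed-point perturbation bound (sigma = tanh, so L_sigma = 1).
rng = np.random.default_rng(0)
n = 50
Y = rng.standard_normal(n)
A = rng.standard_normal((n, n)) * 0.3 / np.sqrt(n)   # ||A||_2 is roughly 0.6 < 1 at this scale

# PSD surrogate of Proposition 13: eigendecompose A + A^T and clip negative eigenvalues.
evals, Q = np.linalg.eigh(A + A.T)                   # A_1 = (1/2) Q diag(evals) Q^T
B = 0.5 * Q @ np.diag(np.maximum(evals, 0.0)) @ Q.T
A2 = 0.5 * (A - A.T)                                 # skew-symmetric part
lam_tilde = 0.5 * max(0.0, -evals.min())             # largest |negative eigenvalue| of A_1
print("||A - B||_2 =", np.linalg.norm(A - B, 2),
      "<= lam_tilde + ||A_2||_2 =", lam_tilde + np.linalg.norm(A2, 2))

def fixed_point(M, iters=500):
    z = np.zeros(n)
    for _ in range(iters):
        z = np.tanh(M @ (Y - z))                     # g(z) = sigma(M (Y - z))
    return z

z1, z2 = fixed_point(A), fixed_point(B)
L1, L2 = np.linalg.norm(A, 2), np.linalg.norm(B, 2)  # L_1, L_2 with L_sigma = 1
bound = np.linalg.norm(Y) / ((1 - L1) * (1 - L2)) * np.linalg.norm(A - B, 2)  # sigma(0) = 0
print("||z1 - z2|| =", np.linalg.norm(z1 - z2), "<= Proposition 12 bound =", bound)
```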