# Deep kernel processes

Laurence Aitchison¹, Adam X. Yang¹, Sebastian W. Ober²

¹Department of Computer Science, University of Bristol, BS8 1UB, UK. ²Department of Engineering, University of Cambridge, CB2 1PZ, UK. Correspondence to: Laurence Aitchison. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021.

Abstract. We define deep kernel processes, in which positive definite Gram matrices are progressively transformed by nonlinear kernel functions and by sampling from (inverse) Wishart distributions. Remarkably, we find that deep Gaussian processes (DGPs), Bayesian neural networks (BNNs), infinite BNNs, and infinite BNNs with bottlenecks can all be written as deep kernel processes. For DGPs the equivalence arises because the Gram matrix formed by the inner product of features is Wishart distributed, and, as we show, standard isotropic kernels can be written entirely in terms of this Gram matrix: we do not need knowledge of the underlying features. We define a tractable deep kernel process, the deep inverse Wishart process, and give a doubly-stochastic inducing-point variational inference scheme that operates on the Gram matrices, not on the features, as in DGPs. We show that the deep inverse Wishart process gives superior performance to DGPs and infinite BNNs on fully-connected baselines. (Reference implementation at: github.com/LaurenceA/bayesfunc.)

1. Introduction

The deep learning revolution has shown us that effective performance on difficult tasks such as image classification (Krizhevsky et al., 2012) requires deep models with flexible lower layers that learn task-dependent representations. Here, we consider whether these insights from the neural network literature can be applied to purely kernel-based methods. (Note that we do not consider deep Gaussian processes, or DGPs, to be fully kernel-based, as they use a feature-based representation in intermediate layers.) Importantly, deep kernel methods (e.g. Cho & Saul, 2009) already exist. In these methods, which are closely related to infinite Bayesian neural networks (Lee et al., 2017; Matthews et al., 2018; Garriga-Alonso et al., 2018; Novak et al., 2018), we take an initial kernel (usually the dot product of the input features) and perform a series of deterministic, parameter-free transformations to obtain an output kernel that we use in e.g. a support vector machine or Gaussian process. However, the deterministic, parameter-free nature of the transformation from input to output kernel means that these methods lack the capability to learn a top-layer representation, which is believed to be crucial for the effectiveness of deep methods (Aitchison, 2019).

2. Contributions

1. We propose deep kernel processes (DKPs), which combine nonlinear transformations of the kernel, as in Cho & Saul (2009), with a flexible learned representation by exploiting a Wishart or inverse Wishart process (Dawid, 1981; Shah et al., 2014).
2. We show that models ranging from DGPs (Damianou & Lawrence, 2013; Salimbeni & Deisenroth, 2017) to Bayesian neural networks (BNNs; Blundell et al., 2015, App. C.1), infinite BNNs (App. C.2) and infinite BNNs with bottlenecks (App. C.3) can be written as DKPs.
3. We define a specific DKP, the deep inverse Wishart process (DIWP), which offers convenient variational approximate posteriors.
4. We develop a novel doubly-stochastic variational inducing-point inference scheme for DIWPs that operates purely in the kernel domain (as opposed to Salimbeni & Deisenroth, 2017, who described DSVI for standard feature-based DGPs).
5. We demonstrate improved performance of DIWPs on fully-connected benchmark datasets.

DKPs, and specifically DIWPs, offer two key advantages over feature-based methods such as DGPs and BNNs. First, DGPs and BNNs have complex posteriors (Li et al., 2018), due in part to permutation/rotation symmetries in the posterior over weights/features (App. D.1 and D.2; MacKay, 1992; Moore, 2016; Pourzanjani et al., 2017). This complexity means that common variational approximate posteriors can give a very poor approximation to the true posterior. In contrast, the Gram matrices in DKPs are invariant to permutations/rotations of the weights/features and thus have much simpler true posteriors, which are more easily captured by variational approximate posteriors. Second, in DIWPs the width parameter is learnable, and in the limit of infinite width gives a series of deterministic kernel transformations, as in an infinite neural network. This gives DIWPs the ability to learn, on a layer-by-layer basis, where a deterministic kernel transformation is appropriate and where more flexibility in the kernel is needed.

3. Background

We briefly revise Wishart and inverse Wishart distributions. The Wishart distribution is a generalization of the gamma distribution that is defined over positive semidefinite matrices. Suppose that we have a collection of $P$-dimensional random variables $\mathbf{x}_i$ with $i \in \{1, \dots, N\}$ such that

$$\mathbf{x}_i \overset{\text{iid}}{\sim} \mathcal{N}(\mathbf{0}, \mathbf{V}). \quad (1)$$

Then

$$\mathbf{S} = \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \sim \mathcal{W}(\mathbf{V}, N) \quad (2)$$

has Wishart distribution with scale matrix $\mathbf{V}$ and $N$ degrees of freedom. When $N > P - 1$, the density is

$$\mathcal{W}(\mathbf{S}; \mathbf{V}, N) = \frac{|\mathbf{S}|^{(N-P-1)/2}\, e^{-\operatorname{Tr}(\mathbf{V}^{-1}\mathbf{S})/2}}{2^{NP/2}\, |\mathbf{V}|^{N/2}\, \Gamma_P\!\left(\tfrac{N}{2}\right)}, \quad (3)$$

where $\Gamma_P$ is the multivariate gamma function. Further, the inverse, $\mathbf{S}^{-1}$, has inverse Wishart distribution, $\mathcal{W}^{-1}(\mathbf{V}^{-1}, N)$. The inverse Wishart is defined only for $N > P - 1$ and also has a closed-form density. Finally, we note that the Wishart distribution has mean $N\mathbf{V}$, while the inverse Wishart has mean $\mathbf{V}^{-1}/(N - P - 1)$ (for $N > P + 1$).

4. Deep kernel processes

We define a kernel process to be a set of distributions over positive definite matrices of different sizes that are consistent under marginalisation (Dawid, 1981; Shah et al., 2014). The two most common kernel processes are the Wishart process and inverse Wishart process, which we write in a slightly unusual form to ensure their expectation is $\mathbf{K}$. We take $\mathbf{G}$ and $\mathbf{G}^*$ to be finite-dimensional marginals of the underlying Wishart and inverse Wishart process,

$$\mathbf{G} \sim \mathcal{W}(\mathbf{K}/N, N), \quad (4a)$$
$$\mathbf{G}^* \sim \mathcal{W}(\mathbf{K}^*/N, N), \quad (4b)$$
$$\mathbf{G} \sim \mathcal{W}^{-1}(\delta \mathbf{K}, \delta + (P+1)), \quad (4c)$$
$$\mathbf{G}^* \sim \mathcal{W}^{-1}(\delta \mathbf{K}^*, \delta + (P^*+1)), \quad (4d)$$

where we explicitly give the consistent marginal distributions over $\mathbf{K}^*$ and $\mathbf{G}^*$, which are $P^* \times P^*$ principal submatrices of the $P \times P$ matrices $\mathbf{K}$ and $\mathbf{G}$, dropping the same rows and columns. In the inverse-Wishart distribution, $\delta$ is a positive parameter that can be understood as controlling the degree of variability, with larger values of $\delta$ implying smaller variability in $\mathbf{G}$. We define a deep kernel process by analogy with a DGP, as a composition of kernel processes, and show in App. A that under sensible assumptions any such composition is itself a kernel process. (We leave the question of the full Kolmogorov extension theorem (Kolmogorov, 1933) for matrices to future work: for our purposes, it is sufficient to work with very large but ultimately finite input spaces, as in practice the input vectors are represented by elements of the finite set of 32-bit or 64-bit floating-point numbers; Sterbenz, 1974.)
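To make the parameterisation in Eq. (4) concrete, the short NumPy/SciPy sketch below (ours, not part of the paper or its reference implementation; all names are illustrative) checks by Monte Carlo that both the Wishart form (4a) and the inverse Wishart form (4c) have expectation $\mathbf{K}$, using the means quoted above.

```python
# A minimal sanity-check sketch (not the paper's code): verify by Monte Carlo that
# the Wishart and inverse Wishart parameterisations in Eq. (4) both have mean K.
import numpy as np
from scipy.stats import wishart, invwishart

rng = np.random.default_rng(0)
P, N, delta = 4, 10, 10.0                    # matrix size, Wishart d.o.f., inverse-Wishart delta
A = rng.standard_normal((P, P))
K = A @ A.T + P * np.eye(P)                  # an arbitrary positive-definite "kernel" K

n_samples = 20000
# Eq. (4a): G ~ W(K/N, N), so E[G] = N * (K/N) = K.
G_w = wishart(df=N, scale=K / N).rvs(size=n_samples)
# Eq. (4c): G ~ W^{-1}(delta*K, delta+(P+1)), so E[G] = delta*K / (delta + P + 1 - P - 1) = K.
G_iw = invwishart(df=delta + P + 1, scale=delta * K).rvs(size=n_samples)

print(np.abs(G_w.mean(0) - K).max())         # close to zero, up to Monte Carlo error
print(np.abs(G_iw.mean(0) - K).max())        # close to zero, up to Monte Carlo error
```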
4.1. DGPs with isotropic kernels are deep Wishart processes

We consider deep GPs of the form (Fig. 1, top), with $\mathbf{X} \in \mathbb{R}^{P \times N_0}$, where $P$ is the number of input points and $N_0$ is the number of features in the input,

$$\mathbf{K}_\ell = \begin{cases} \tfrac{1}{N_0}\mathbf{X}\mathbf{X}^T & \text{for } \ell = 1, \\ \mathbf{K}(\mathbf{G}_{\ell-1}) & \text{for } \ell \in \{2, \dots, L+1\}, \end{cases} \quad (5a)$$
$$P(\mathbf{F}_\ell \mid \mathbf{K}_\ell) = \prod_{\lambda=1}^{N_\ell} \mathcal{N}\!\left(\mathbf{f}^\ell_\lambda; \mathbf{0}, \mathbf{K}_\ell\right), \quad (5b)$$
$$\mathbf{G}_\ell = \tfrac{1}{N_\ell} \mathbf{F}_\ell \mathbf{F}_\ell^T. \quad (5c)$$

Here, $\mathbf{F}_\ell \in \mathbb{R}^{P \times N_\ell}$ are the $N_\ell$ hidden features in layer $\ell$; $\lambda$ indexes hidden features, so $\mathbf{f}^\ell_\lambda$ is a single column of $\mathbf{F}_\ell$, representing the value of the $\lambda$th feature for all training inputs. Note that $\mathbf{K}(\cdot)$ is a function that takes a Gram matrix and returns a kernel matrix, whereas $\mathbf{K}_\ell$ is a (possibly random) variable representing the kernel matrix at layer $\ell$.

Note that we have restricted ourselves to kernels that can be written as functions of the Gram matrix, $\mathbf{G}_\ell$, and do not require the full set of activations, $\mathbf{F}_\ell$. As we describe later, this is not too restrictive, as it includes, amongst others, all isotropic kernels (i.e. those that can be written as a function of the distance between points; Williams & Rasmussen, 2006).

Note also that we have a number of choices as to how to initialize the kernel in Eq. (5a). The current choice just uses a linear dot-product kernel, rather than immediately applying the kernel function $\mathbf{K}$. This is both to ensure exact equivalence with infinite NNs with bottlenecks (App. C.3) and also to highlight an interesting interpretation of this layer as Bayesian inference over generalised lengthscale hyperparameters in the squared-exponential kernel (App. B; e.g. Lalchand & Rasmussen, 2020).

[Figure 1. Generative models for two-layer (L = 2) deep GPs. (Top) Generative model for a deep GP, with a kernel that depends on the Gram matrix, and with Gaussian-distributed features. (Bottom) Integrating out the features, the Gram matrices become Wishart distributed.]

For DGP regression, the outputs, $\mathbf{Y}$, are most commonly given by a likelihood that can be written in terms of the output features, $\mathbf{F}_{L+1}$. For instance, for regression, the distribution of the $\lambda$th output feature column could be

$$P(\mathbf{y}_\lambda \mid \mathbf{F}_{L+1}) = \mathcal{N}\!\left(\mathbf{y}_\lambda; \mathbf{f}^{L+1}_\lambda, \sigma^2 \mathbf{I}\right); \quad (6)$$

alternatively, we could use a classification likelihood,

$$P(\mathbf{y} \mid \mathbf{F}_{L+1}) = \operatorname{Categorical}\!\left(\mathbf{y}; \operatorname{softmax}\!\left(\mathbf{F}_{L+1}\right)\right). \quad (7)$$

Importantly, our methods can be used with any likelihood with a known probability density function.

The generative process for the Gram matrices, $\mathbf{G}_\ell$, consists of generating samples from a Gaussian distribution (Eq. 5b) and taking their product with themselves transposed (Eq. 5c). This exactly matches the generative process for a Wishart distribution (Eqs. 1–2), so we can write the Gram matrices, $\mathbf{G}_\ell$, directly in terms of the kernel, without needing to sample features (Fig. 1, bottom),

$$P(\mathbf{G}_1 \mid \mathbf{X}) = \mathcal{W}\!\left(\tfrac{1}{N_1}\tfrac{1}{N_0}\mathbf{X}\mathbf{X}^T,\, N_1\right), \quad (8a)$$
$$P(\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1}) = \mathcal{W}\!\left(\mathbf{K}(\mathbf{G}_{\ell-1})/N_\ell,\, N_\ell\right), \quad (8b)$$
$$P(\mathbf{F}_{L+1} \mid \mathbf{G}_L) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(\mathbf{f}^{L+1}_\lambda; \mathbf{0}, \mathbf{K}(\mathbf{G}_L)\right). \quad (8c)$$

Except at the output, the model is phrased entirely in terms of positive-definite kernels and Gram matrices; it is consistent under marginalisation (assuming a valid kernel function) and is thus a DKP. At a high level, the model can be understood as alternately sampling a Gram matrix (introducing flexibility in the representation) and nonlinearly transforming the Gram matrix using a kernel (Fig. 2).
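As a concrete illustration of Eq. (8), the following sketch (ours, with illustrative names such as `sample_deep_wishart_prior` and a stand-in `linear_kernel`) samples the Gram matrices of a deep Wishart process directly, without ever instantiating features. It assumes $N_\ell \geq P$ so that SciPy can sample a non-singular Wishart; for $N_\ell < P$ one would fall back to sampling features as in Eq. (5).

```python
# A minimal sketch (not the paper's implementation) of sampling the deep Wishart
# process prior over Gram matrices, Eqs. (8a)-(8b), without sampling features.
import numpy as np
from scipy.stats import wishart

def linear_kernel(G, jitter=1e-6):
    """A stand-in kernel that is a function of the Gram matrix alone; any
    kernel of the form K(G) from Sec. 4.1 could be substituted here."""
    return G + jitter * np.eye(G.shape[0])

def sample_deep_wishart_prior(X, widths, kernel=linear_kernel):
    """Sample G_1, ..., G_L from Eqs. (8a)-(8b), given inputs X of shape (P, N0)."""
    P, N0 = X.shape
    G = X @ X.T / N0                       # (1/N0) X X^T
    Gs = []
    for N_ell in widths:                   # widths = (N_1, ..., N_L)
        K = G if not Gs else kernel(G)     # first layer uses (1/N0) X X^T directly (Eq. 8a)
        G = wishart(df=N_ell, scale=K / N_ell).rvs()
        Gs.append(G)
    return Gs

X = np.random.default_rng(0).standard_normal((5, 3))   # P=5 points, N0=3 input features
Gs = sample_deep_wishart_prior(X, widths=(8, 8))       # requires N_ell >= P for scipy's sampler
print([G.shape for G in Gs])                           # [(5, 5), (5, 5)]
```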
This highlights a particularly simple interpretation of the DKP as an autoregressive process. In a standard autoregressive process, we might propagate the current vector, $\mathbf{x}_t$, through a deterministic function, $\mathbf{f}(\mathbf{x}_t)$, and add zero-mean Gaussian noise, $\boldsymbol{\xi}$,

$$\mathbf{x}_{t+1} = \mathbf{f}(\mathbf{x}_t) + \sigma^2 \boldsymbol{\xi}, \qquad \text{such that} \qquad \mathbb{E}[\mathbf{x}_{t+1} \mid \mathbf{x}_t] = \mathbf{f}(\mathbf{x}_t). \quad (9)$$

By analogy, the next Gram matrix has expectation centred on a deterministic transformation of the previous Gram matrix,

$$\mathbb{E}[\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1}] = \mathbf{K}(\mathbf{G}_{\ell-1}), \quad (10)$$

so $\mathbf{G}_\ell$ can be written as this expectation plus a zero-mean random variable, $\boldsymbol{\Xi}_\ell$, that can be interpreted as noise,

$$\mathbf{G}_\ell = \mathbf{K}(\mathbf{G}_{\ell-1}) + \boldsymbol{\Xi}_\ell. \quad (11)$$

Note that $\boldsymbol{\Xi}_\ell$ is not in general positive definite, and may not have an analytically tractable distribution. This noise decreases as $N_\ell$ increases,

$$\mathbb{V}\!\left[G^\ell_{ij}\right] = \mathbb{V}\!\left[\Xi^\ell_{ij}\right] = \tfrac{1}{N_\ell}\left(K^2_{ij}(\mathbf{G}_{\ell-1}) + K_{ii}(\mathbf{G}_{\ell-1})\, K_{jj}(\mathbf{G}_{\ell-1})\right). \quad (12)$$

Notably, as $N_\ell$ tends to infinity, the Wishart samples converge on their expectation and the noise disappears, leaving us with a series of deterministic transformations of the Gram matrix. Therefore, we can understand a deep kernel process as alternately adding noise to the kernel by sampling e.g. a Wishart or inverse Wishart distribution ($\mathbf{G}_2$ and $\mathbf{G}_3$ in Fig. 2) and computing a nonlinear transformation of the kernel ($\mathbf{K}(\mathbf{G}_2)$ and $\mathbf{K}(\mathbf{G}_3)$ in Fig. 2).

Remember that we are restricted to kernels that can be written as a function of the Gram matrix,

$$\mathbf{K}_\ell = \mathbf{K}(\mathbf{G}_\ell) = \mathbf{K}_{\text{features}}(\mathbf{F}_\ell), \qquad K^\ell_{ij} = k\!\left(\mathbf{F}^\ell_{i,:}, \mathbf{F}^\ell_{j,:}\right), \quad (13)$$

where $\mathbf{K}_{\text{features}}(\cdot)$ takes a matrix of features, $\mathbf{F}_\ell$, and returns the kernel matrix, $\mathbf{K}_\ell$, and $k$ is the usual kernel function, which takes two feature vectors (rows of $\mathbf{F}_\ell$) and returns an element of the kernel matrix. This does not include all possible kernels, because it is not possible to recover the features from the Gram matrix. In particular, the Gram matrix is invariant to unitary transformations of the features: the Gram matrix is the same for $\mathbf{F}_\ell$ and $\mathbf{F}'_\ell = \mathbf{F}_\ell \mathbf{U}_\ell$, where $\mathbf{U}_\ell$ is a unitary matrix, such that $\mathbf{U}_\ell \mathbf{U}_\ell^T = \mathbf{I}$,

$$\mathbf{G}_\ell = \tfrac{1}{N_\ell}\mathbf{F}_\ell \mathbf{F}_\ell^T = \tfrac{1}{N_\ell}\mathbf{F}_\ell \mathbf{U}_\ell \mathbf{U}_\ell^T \mathbf{F}_\ell^T = \tfrac{1}{N_\ell}\mathbf{F}'_\ell \mathbf{F}'^T_\ell. \quad (14)$$

Superficially, this might seem very limiting, leaving us only with dot-product kernels (Williams & Rasmussen, 2006) such as

$$k(\mathbf{f}, \mathbf{f}') = \mathbf{f} \cdot \mathbf{f}' + \sigma^2. \quad (15)$$

However, in reality, a far broader range of kernels fits within this class. Importantly, isotropic or radial basis function kernels, including the squared exponential and Matérn, depend only on the squared distance between points, $R$ (Williams & Rasmussen, 2006),

$$k(\mathbf{f}, \mathbf{f}') = k(R), \qquad R = |\mathbf{f} - \mathbf{f}'|^2. \quad (16)$$

These kernels can be written as a function of $\mathbf{G}$, because the matrix of squared distances, $\mathbf{R}$, can be computed from $\mathbf{G}$,

$$R^\ell_{ij} = \tfrac{1}{N_\ell}\sum_{\lambda=1}^{N_\ell}\left(F^\ell_{i\lambda} - F^\ell_{j\lambda}\right)^2 = \tfrac{1}{N_\ell}\sum_{\lambda=1}^{N_\ell}\left(\left(F^\ell_{i\lambda}\right)^2 - 2F^\ell_{i\lambda}F^\ell_{j\lambda} + \left(F^\ell_{j\lambda}\right)^2\right) = G^\ell_{ii} - 2G^\ell_{ij} + G^\ell_{jj}. \quad (17)$$

[Figure 2. Visualisations of a single prior sample of the kernels and Gram matrices as they pass through the network. We use 1D, equally spaced inputs with a squared-exponential kernel. As we transition from $\mathbf{K}(\mathbf{G}_{\ell-1})$ to $\mathbf{G}_\ell$, we add noise by sampling from a Wishart (top) or an inverse Wishart (bottom). As we transition from $\mathbf{G}_\ell$ to $\mathbf{K}(\mathbf{G}_\ell)$, we deterministically transform the Gram matrix using a squared-exponential kernel.]
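To illustrate Eqs. (16)–(17), the sketch below (ours; the function name and hyperparameters are illustrative) evaluates a squared-exponential kernel directly from a Gram matrix and checks that it matches the same kernel evaluated from the underlying features.

```python
# A small sketch (our own notation, not the paper's code) of Eqs. (16)-(17): an
# isotropic kernel evaluated directly from a Gram matrix G, with no access to features.
import numpy as np

def squared_exponential_from_gram(G, lengthscale=1.0, variance=1.0):
    """K(G) for a squared-exponential kernel, using R_ij = G_ii - 2 G_ij + G_jj (Eq. 17)."""
    g = np.diag(G)
    R = g[:, None] - 2.0 * G + g[None, :]          # matrix of (feature-averaged) squared distances
    return variance * np.exp(-0.5 * R / lengthscale**2)

# Sanity check: the kernel computed from the Gram matrix matches the kernel
# computed from the underlying features.
rng = np.random.default_rng(0)
P, N = 6, 4
F = rng.standard_normal((P, N))
G = F @ F.T / N
R_features = ((F[:, None, :] - F[None, :, :]) ** 2).mean(-1)
K_features = np.exp(-0.5 * R_features)
K_gram = squared_exponential_from_gram(G)
print(np.allclose(K_features, K_gram))             # True
```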
5. Variational inference in deep kernel processes

A key part of the motivation for developing deep kernel processes was that the posteriors over weights in a BNN, or over features in a deep GP, are extremely complex and multimodal, with a large number of symmetries that are not captured by standard approximate posteriors (MacKay, 1992; Moore, 2016; Pourzanjani et al., 2017). For instance, in the Appendix we show that there are permutation symmetries in the prior and posterior over weights in BNNs (App. D.1) and rotational symmetries in the prior and posterior over features in deep GPs with isotropic kernels (App. D.2). The inability to capture these symmetries in standard variational posteriors may introduce biases in the parameters inferred by variational inference, because the variational bound is not uniformly tight across the state space (Turner & Sahani, 2011). Gram matrices are invariant to permutations or rotations of the features, so we can sidestep these complex posterior symmetries by working with the Gram matrices as the random variables in variational inference. However, variational inference in deep Wishart processes (equivalent to DGPs, Sec. 4.1, and to infinite NNs with bottlenecks, App. C.3) is difficult, because the approximate posterior we would like to use, the non-central Wishart (App. E), has a probability density function that is prohibitively costly and complex to evaluate in the inner loop of a deep learning model (Koev & Edelman, 2006). Instead, we consider an inverse Wishart process prior, for which the inverse Wishart itself makes a good choice of approximate posterior.

5.1. The deep inverse Wishart process

By analogy with Eq. (8), we define a deep inverse Wishart process (DIWP). However, the inverse Wishart process introduces a new difficulty: at the input layer, $\tfrac{1}{N_0}\mathbf{X}\mathbf{X}^T$ may be singular if there are more datapoints than features. Instead of attempting to use a singular Wishart distribution over $\mathbf{G}_1$, which would be complex and difficult to work with (Bodnar & Okhrin, 2008; Bodnar et al., 2016), we instead define an approximate posterior over the full-rank $N_0 \times N_0$ matrix $\mathbf{\Omega}$, and use $\mathbf{G}_1 = \tfrac{1}{N_0}\mathbf{X}\mathbf{\Omega}\mathbf{X}^T \in \mathbb{R}^{P \times P}$:

$$P(\mathbf{\Omega}) = \mathcal{W}^{-1}(\delta_1 \mathbf{I}, \delta_1 + N_0 + 1), \qquad \text{with } \mathbf{G}_1 = \tfrac{1}{N_0}\mathbf{X}\mathbf{\Omega}\mathbf{X}^T,$$
$$P(\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1}) = \mathcal{W}^{-1}(\delta_\ell \mathbf{K}(\mathbf{G}_{\ell-1}), P + 1 + \delta_\ell) \qquad \text{for } \ell \in \{2, \dots, L\},$$
$$P(\mathbf{F}_{L+1} \mid \mathbf{G}_L) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(\mathbf{f}^{L+1}_\lambda; \mathbf{0}, \mathbf{K}(\mathbf{G}_L)\right), \quad (18)$$

remembering that $\mathbf{X} \in \mathbb{R}^{P \times N_0}$, $\mathbf{G}_\ell \in \mathbb{R}^{P \times P}$ and $\mathbf{F}_{L+1} \in \mathbb{R}^{P \times N_{L+1}}$. Critically, the distributions in Eq. (18) are consistent under marginalisation as long as $\delta_\ell$ is held constant (Dawid, 1981), with $P$ taken to be the number of input points, or equivalently the size of $\mathbf{K}_{\ell-1}$. Further, the deep inverse Wishart process retains the interpretation as a deterministic transformation of the kernel plus noise, because the expectation is

$$\mathbb{E}[\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1}] = \frac{\delta_\ell \mathbf{K}(\mathbf{G}_{\ell-1})}{(P + 1 + \delta_\ell) - (P + 1)} = \mathbf{K}(\mathbf{G}_{\ell-1}). \quad (19)$$

The resulting inverse Wishart process does not have a direct interpretation as e.g. a deep GP, but does have more appealing properties for variational inference, as it is always full-rank and allows independent control over the approximate posterior mean and variance. Finally, it is important to note that Wishart and inverse Wishart distributions do not differ as much as one might expect: the standard Wishart and standard inverse Wishart distributions have isotropic distributions over the eigenvectors, so they only differ in terms of their distributions over eigenvalues, and these are often quite similar, especially if we consider a Wishart model with ResNet-like structure (App. H).
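The following sketch (ours, with illustrative names) samples a single hidden-layer transition of the deep inverse Wishart process in Eq. (18) and checks the mean identity of Eq. (19) by Monte Carlo.

```python
# A minimal sketch (not the paper's code) of one hidden-layer transition of the deep
# inverse Wishart process, Eq. (18), checking the mean identity in Eq. (19).
import numpy as np
from scipy.stats import invwishart

def sample_diwp_layer(K_prev, delta, size=1):
    """Sample G_ell ~ W^{-1}(delta * K(G_{ell-1}), P + 1 + delta)."""
    P = K_prev.shape[0]
    return invwishart(df=P + 1 + delta, scale=delta * K_prev).rvs(size=size)

rng = np.random.default_rng(0)
P, delta = 4, 5.0
A = rng.standard_normal((P, P))
K_prev = A @ A.T + P * np.eye(P)                   # stands in for K(G_{ell-1})

G_samples = sample_diwp_layer(K_prev, delta, size=50000)
print(np.abs(G_samples.mean(0) - K_prev).max())    # close to zero: E[G_ell | G_{ell-1}] = K(G_{ell-1})
```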
5.2. An approximate posterior for the deep inverse Wishart process

Choosing an appropriate and effective form for variational approximate posteriors is usually a difficult research problem. Here, we take inspiration from Ober & Aitchison (2020) by exploiting the fact that the inverse Wishart distribution is the conjugate prior for the covariance matrix of a multivariate Gaussian. In particular, if we consider an inverse-Wishart prior over $\mathbf{\Sigma} \in \mathbb{R}^{P \times P}$ with mean $\mathbf{\Sigma}_0$, which forms the covariance of a Gaussian-distributed matrix, $\mathbf{V}$, consisting of columns $\mathbf{v}_\lambda$ with $\lambda \in \{1, \dots, N_V\}$, then the posterior over $\mathbf{\Sigma}$ is also inverse Wishart,

$$P(\mathbf{\Sigma}) = \mathcal{W}^{-1}(\mathbf{\Sigma}; \delta \mathbf{\Sigma}_0, P + 1 + \delta), \quad (20a)$$
$$P(\mathbf{V} \mid \mathbf{\Sigma}) = \prod_{\lambda=1}^{N_V} \mathcal{N}(\mathbf{v}_\lambda; \mathbf{0}, \mathbf{\Sigma}), \quad (20b)$$
$$P(\mathbf{\Sigma} \mid \mathbf{V}) = \mathcal{W}^{-1}\!\left(\delta \mathbf{\Sigma}_0 + \mathbf{V}\mathbf{V}^T,\; P + 1 + \delta + N_V\right). \quad (20c)$$

Inspired by this exact posterior, available in simple models, we choose the approximate posterior in our model to be

$$Q(\mathbf{\Omega}) = \mathcal{W}^{-1}\!\left(\delta_1 \mathbf{I} + \mathbf{V}_1\mathbf{V}_1^T,\; \delta_1 + \gamma_1 + N_0 + 1\right), \qquad \text{with } \mathbf{G}_1 = \tfrac{1}{N_0}\mathbf{X}\mathbf{\Omega}\mathbf{X}^T,$$
$$Q(\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1}) = \mathcal{W}^{-1}\!\left(\delta_\ell \mathbf{K}(\mathbf{G}_{\ell-1}) + \mathbf{V}_\ell\mathbf{V}_\ell^T,\; \delta_\ell + \gamma_\ell + P + 1\right),$$
$$Q(\mathbf{F}_{L+1} \mid \mathbf{G}_L) = \prod_{\lambda=1}^{N_{L+1}} \mathcal{N}\!\left(\mathbf{f}^{L+1}_\lambda; \mathbf{\Sigma}_\lambda \mathbf{\Lambda}_\lambda \mathbf{v}_\lambda, \mathbf{\Sigma}_\lambda\right), \quad \text{where } \mathbf{\Sigma}_\lambda = \left(\mathbf{K}^{-1}(\mathbf{G}_L) + \mathbf{\Lambda}_\lambda\right)^{-1}, \quad (21)$$

and where $\mathbf{V}_1$ is a learned $N_0 \times N_0$ matrix, $\{\mathbf{V}_\ell\}_{\ell=2}^{L}$ are learned $P \times P$ matrices and the $\gamma_\ell$ are learned non-negative real numbers. For more details about the input layer, see App. F. At the output layer, we use the global inducing approximate posterior for DGPs from Ober & Aitchison (2020), with the learned parameters being vectors, $\mathbf{v}_\lambda$, and positive definite matrices, $\mathbf{\Lambda}_\lambda$ (see App. G). In summary, the prior has parameters $\delta_\ell$ (which also appear in the approximate posterior), and the posterior has parameters $\mathbf{V}_\ell$ and $\gamma_\ell$ for the inverse-Wishart hidden layers, and $\{\mathbf{v}_\lambda\}_{\lambda=1}^{N_{L+1}}$ and $\{\mathbf{\Lambda}_\lambda\}_{\lambda=1}^{N_{L+1}}$ at the output. In all our experiments, we optimize all five sets of parameters, $\{\delta_\ell, \mathbf{V}_\ell, \gamma_\ell\}_{\ell=1}^{L}$ and $\{\mathbf{v}_\lambda, \mathbf{\Lambda}_\lambda\}_{\lambda=1}^{N_{L+1}}$, and, in addition, for inducing-point methods we also optimize a single set of global inducing inputs, $\mathbf{X}_i \in \mathbb{R}^{P_i \times N_0}$, which are defined only at the input layer.
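As a concrete illustration of the conjugacy in Eq. (20) that motivates the form of Eq. (21), the sketch below (ours; names are illustrative) computes the posterior inverse-Wishart parameters given Gaussian-distributed columns and inspects the posterior mean.

```python
# A small sketch (ours, illustrative only) of the conjugacy in Eq. (20): given Gaussian
# columns v_lambda with covariance Sigma, the inverse-Wishart posterior adds V V^T to
# the scale and N_V to the degrees of freedom.
import numpy as np

def invwishart_posterior_params(Sigma0, delta, V):
    """Return (scale, df) of P(Sigma | V) in Eq. (20c), for prior W^{-1}(delta*Sigma0, P+1+delta)."""
    P, N_V = V.shape
    scale = delta * Sigma0 + V @ V.T
    df = P + 1 + delta + N_V
    return scale, df

rng = np.random.default_rng(0)
P, N_V, delta = 3, 7, 4.0
Sigma0 = np.eye(P)
Sigma_true = np.cov(rng.standard_normal((P, 50)))                    # some "true" covariance
V = np.linalg.cholesky(Sigma_true) @ rng.standard_normal((P, N_V))   # v_lambda ~ N(0, Sigma_true)

scale, df = invwishart_posterior_params(Sigma0, delta, V)
posterior_mean = scale / (df - P - 1)          # = (delta*Sigma0 + V V^T) / (delta + N_V)
print(posterior_mean)                          # shrinks towards Sigma0 when N_V is small
```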
5.3. Doubly stochastic inducing-point variational inference in deep inverse Wishart processes

For efficient inference in high-dimensional problems, we take inspiration from the DGP literature (Salimbeni & Deisenroth, 2017) by considering doubly-stochastic inducing-point deep inverse Wishart processes. We begin by decomposing all variables into inducing points and training (or test) points, $\mathbf{X}_t \in \mathbb{R}^{P_t \times N_0}$,

$$\mathbf{F}_{L+1} = \begin{pmatrix}\mathbf{F}^{L+1}_i \\ \mathbf{F}^{L+1}_t\end{pmatrix}, \qquad \mathbf{G}_\ell = \begin{pmatrix}\mathbf{G}^\ell_{ii} & \mathbf{G}^\ell_{it} \\ \mathbf{G}^\ell_{ti} & \mathbf{G}^\ell_{tt}\end{pmatrix}, \quad (22)$$

where e.g. $\mathbf{G}^\ell_{ii}$ is $P_i \times P_i$ and $\mathbf{G}^\ell_{it}$ is $P_i \times P_t$, with $P_i$ the number of inducing points and $P_t$ the number of testing/training points. Note that $\mathbf{\Omega}$ does not decompose, as it is $N_0 \times N_0$. The full ELBO, including latent variables for all the inducing and training points, is

$$\mathcal{L} = \mathbb{E}\!\left[\log P(\mathbf{Y} \mid \mathbf{F}_{L+1}) + \log \frac{P\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right)}{Q\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right)}\right], \quad (23)$$

where the expectation is taken over $Q\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right)$. The prior is given by combining all terms in Eq. (18) for both inducing and test/train inputs,

$$P\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right) = P(\mathbf{\Omega}) \left[\textstyle\prod_{\ell=2}^{L} P(\mathbf{G}_\ell \mid \mathbf{G}_{\ell-1})\right] P(\mathbf{F}_{L+1} \mid \mathbf{G}_L), \quad (24)$$

where the $\mathbf{X}$-dependence enters on the right because $\mathbf{G}_1 = \tfrac{1}{N_0}\mathbf{X}\mathbf{\Omega}\mathbf{X}^T$. Taking inspiration from Salimbeni & Deisenroth (2017), the full approximate posterior is the product of an approximate posterior over inducing points and the conditional prior for train/test points,

$$Q\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right) = Q\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right) P\!\left(\{\mathbf{G}^\ell_{it}\}_{\ell=2}^{L}, \{\mathbf{G}^\ell_{tt}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_t \mid \mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i, \mathbf{X}\right), \quad (25)$$

and the prior can be written in the same form,

$$P\!\left(\mathbf{\Omega}, \{\mathbf{G}_\ell\}_{\ell=2}^{L}, \mathbf{F}_{L+1} \mid \mathbf{X}\right) = P\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right) P\!\left(\{\mathbf{G}^\ell_{it}\}_{\ell=2}^{L}, \{\mathbf{G}^\ell_{tt}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_t \mid \mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i, \mathbf{X}\right). \quad (26)$$

To obtain the full ELBO, we substitute Eqs. (25) and (26) into Eq. (23); the conditional prior terms cancel, giving

$$\mathcal{L} = \mathbb{E}\!\left[\log P(\mathbf{Y} \mid \mathbf{F}_{L+1}) + \log \frac{P\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right)}{Q\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right)}\right], \quad (27)$$

where

$$P\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right) = P(\mathbf{\Omega}) \left[\textstyle\prod_{\ell=2}^{L} P\!\left(\mathbf{G}^\ell_{ii} \mid \mathbf{G}^{\ell-1}_{ii}\right)\right] P\!\left(\mathbf{F}^{L+1}_i \mid \mathbf{G}^L_{ii}\right), \quad (28)$$
$$Q\!\left(\mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i \mid \mathbf{X}_i\right) = Q(\mathbf{\Omega}) \left[\textstyle\prod_{\ell=2}^{L} Q\!\left(\mathbf{G}^\ell_{ii} \mid \mathbf{G}^{\ell-1}_{ii}\right)\right] Q\!\left(\mathbf{F}^{L+1}_i \mid \mathbf{G}^L_{ii}\right). \quad (29)$$

Importantly, the first term in the ELBO (Eq. 27) is a summation across test/train datapoints, and the second term depends only on the inducing points, so, as in Salimbeni & Deisenroth (2017), we can compute unbiased estimates of the expectation by taking only a minibatch of datapoints, and we never need to compute the density of the conditional prior (Eq. 30); we only need to be able to sample from it. Finally, to sample the test/training points conditioned on the inducing points, we need to sample

$$P\!\left(\{\mathbf{G}^\ell_{it}, \mathbf{G}^\ell_{tt}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_t \mid \mathbf{\Omega}, \{\mathbf{G}^\ell_{ii}\}_{\ell=2}^{L}, \mathbf{F}^{L+1}_i, \mathbf{X}\right) = P\!\left(\mathbf{F}^{L+1}_t \mid \mathbf{F}^{L+1}_i, \mathbf{G}_L\right) \prod_{\ell=2}^{L} P\!\left(\mathbf{G}^\ell_{it}, \mathbf{G}^\ell_{tt} \mid \mathbf{G}^\ell_{ii}, \mathbf{G}_{\ell-1}\right). \quad (30)$$

The first distribution, $P(\mathbf{F}^{L+1}_t \mid \mathbf{F}^{L+1}_i, \mathbf{G}_L)$, is a multivariate Gaussian and can be evaluated using methods from the GP literature (Williams & Rasmussen, 2006; Salimbeni & Deisenroth, 2017). The difficulties arise for the inverse Wishart terms, $P(\mathbf{G}^\ell_{it}, \mathbf{G}^\ell_{tt} \mid \mathbf{G}^\ell_{ii}, \mathbf{G}_{\ell-1})$. To sample this distribution, note that samples from the joint over inducing and train/test locations can be written

$$\begin{pmatrix}\mathbf{G}^\ell_{ii} & \mathbf{G}^\ell_{it} \\ \mathbf{G}^\ell_{ti} & \mathbf{G}^\ell_{tt}\end{pmatrix} \sim \mathcal{W}^{-1}\!\left(\begin{pmatrix}\mathbf{\Psi}_{ii} & \mathbf{\Psi}_{it} \\ \mathbf{\Psi}_{ti} & \mathbf{\Psi}_{tt}\end{pmatrix},\; \delta_\ell + P_i + P_t + 1\right), \qquad \begin{pmatrix}\mathbf{\Psi}_{ii} & \mathbf{\Psi}_{it} \\ \mathbf{\Psi}_{ti} & \mathbf{\Psi}_{tt}\end{pmatrix} = \delta_\ell \mathbf{K}(\mathbf{G}_{\ell-1}), \quad (31)$$

where $P_i$ is the number of inducing inputs and $P_t$ is the number of train/test inputs. Defining the Schur complements

$$\mathbf{G}^\ell_{tt\cdot i} = \mathbf{G}^\ell_{tt} - \mathbf{G}^\ell_{ti}\left(\mathbf{G}^\ell_{ii}\right)^{-1}\mathbf{G}^\ell_{it}, \quad (32)$$
$$\mathbf{\Psi}_{tt\cdot i} = \mathbf{\Psi}_{tt} - \mathbf{\Psi}_{ti}\mathbf{\Psi}_{ii}^{-1}\mathbf{\Psi}_{it}, \quad (33)$$

we know that $\mathbf{G}^\ell_{tt\cdot i}$ and $\mathbf{G}^\ell_{it}$ have distributions (Eaton, 1983)

$$\mathbf{G}^\ell_{tt\cdot i} \mid \mathbf{G}^\ell_{ii}, \mathbf{G}_{\ell-1} \sim \mathcal{W}^{-1}\!\left(\mathbf{\Psi}_{tt\cdot i},\; \delta_\ell + P_i + P_t + 1\right), \quad (34a)$$
$$\mathbf{G}^\ell_{it} \mid \mathbf{G}^\ell_{tt\cdot i}, \mathbf{G}^\ell_{ii}, \mathbf{G}_{\ell-1} \sim \mathcal{MN}\!\left(\mathbf{G}^\ell_{ii}\mathbf{\Psi}_{ii}^{-1}\mathbf{\Psi}_{it},\; \mathbf{G}^\ell_{ii}\mathbf{\Psi}_{ii}^{-1}\mathbf{G}^\ell_{ii},\; \mathbf{G}^\ell_{tt\cdot i}\right), \quad (34b)$$

where $\mathcal{MN}$ is the matrix normal distribution. Now, $\mathbf{G}^\ell_{it}$ and $\mathbf{G}^\ell_{tt}$ can be recovered by algebraic manipulation. Finally, because of the doubly stochastic form of the objective, we do not need to draw jointly consistent samples across multiple test points; instead (as in DGPs; Salimbeni & Deisenroth, 2017) we can independently sample each test point (App. I), which dramatically reduces computational complexity. The full algorithm is given in Alg. 1, where the P and Q distributions for $\mathbf{\Omega}$ and for the inducing points are given by Eqs. (18) and (21). We optimize using standard reparameterised variational inference (Kingma & Welling, 2013; Rezende et al., 2014); for details on how to reparameterise samples from the Wishart, see Ober & Aitchison (2020).
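To make the conditional sampling step of Eqs. (31)–(34) concrete, the sketch below (ours, not the paper's implementation; helper names are illustrative) draws the train/test blocks of an inverse Wishart sample given the inducing block, using a small matrix-normal sampler. For the purposes of the demo, the inducing block is itself drawn from its marginal, which is consistent under marginalisation as discussed in Sec. 4.

```python
# A sketch (ours, under the stated equations) of the conditional sampling step in
# Eqs. (31)-(34): given the inducing block G_ii and Psi = delta_ell * K(G_{ell-1}),
# draw the train/test blocks G_it and G_tt.
import numpy as np
from scipy.stats import invwishart

def sample_matrix_normal(M, U, V, rng):
    """One sample from MN(M, U, V): M + chol(U) @ Z @ chol(V)^T, with Z ~ N(0, I)."""
    A, B = np.linalg.cholesky(U), np.linalg.cholesky(V)
    Z = rng.standard_normal(M.shape)
    return M + A @ Z @ B.T

def sample_conditional_blocks(G_ii, Psi, P_i, delta, rng):
    """Sample (G_it, G_tt) | G_ii for the inverse Wishart in Eq. (31)."""
    Psi_ii, Psi_it = Psi[:P_i, :P_i], Psi[:P_i, P_i:]
    Psi_ti, Psi_tt = Psi[P_i:, :P_i], Psi[P_i:, P_i:]
    P_t = Psi_tt.shape[0]
    Psi_tt_i = Psi_tt - Psi_ti @ np.linalg.solve(Psi_ii, Psi_it)          # Eq. (33)
    df = delta + P_i + P_t + 1
    G_tt_i = np.atleast_2d(invwishart(df=df, scale=Psi_tt_i).rvs(random_state=rng))  # Eq. (34a)
    mean = G_ii @ np.linalg.solve(Psi_ii, Psi_it)
    row_cov = G_ii @ np.linalg.solve(Psi_ii, G_ii)
    row_cov = 0.5 * (row_cov + row_cov.T)                                 # symmetrise numerically
    G_it = sample_matrix_normal(mean, row_cov, G_tt_i, rng)               # Eq. (34b)
    G_tt = G_tt_i + G_it.T @ np.linalg.solve(G_ii, G_it)                  # invert Eq. (32)
    return G_it, G_tt

rng = np.random.default_rng(0)
P_i, P_t, delta = 3, 2, 5.0
A = rng.standard_normal((P_i + P_t, P_i + P_t))
Psi = delta * (A @ A.T + np.eye(P_i + P_t))                               # stands in for delta*K(G_{ell-1})
G_ii = invwishart(df=delta + P_i + 1, scale=Psi[:P_i, :P_i]).rvs(random_state=rng)
G_it, G_tt = sample_conditional_blocks(G_ii, Psi, P_i, delta, rng)
print(G_it.shape, G_tt.shape)                                             # (3, 2) (2, 2)
```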
6. Computational complexity

As in non-deep GPs, the complexity is $\mathcal{O}(P^3)$ in time and $\mathcal{O}(P^2)$ in space for standard DKPs (the $\mathcal{O}(P^3)$ time dependencies emerge e.g. because of the inverses and determinants required for the inverse Wishart distributions). For DSVI, there is a $P_i^3$ time and $P_i^2$ space term for the inducing points, because the computations for inducing points are exactly the same as in the non-DSVI case. As we can treat each test/train point independently (App. I), the complexity for test/training points must scale linearly with $P_t$, and this term has $P_i^2$ time scaling, e.g. due to the matrix products in Eq. (32). Thus, the overall complexity for DSVI is $\mathcal{O}(P_i^3 + P_i^2 P_t)$ in time and $\mathcal{O}(P_i^2 + P_i P_t)$ in space, which is exactly the same as for non-deep inducing GPs. Thus, and exactly as in non-deep inducing GPs, by using a small number of inducing points we are able to convert a cubic dependence on the number of input points into a linear dependence, which gives considerably better scaling.

Surprisingly, this is substantially better than standard DGPs. In standard DGPs, we allow the approximate posterior covariance for each feature to differ (Salimbeni & Deisenroth, 2017), in which case we are in essence doing standard inducing-GP inference over $N$ hidden features, which gives complexity of $\mathcal{O}(N P_i^3 + N P_i^2 P_t)$ in time and $\mathcal{O}(N P_i^2 + N P_i P_t)$ in space (Salimbeni & Deisenroth, 2017). It is possible to improve this complexity by restricting the approximate posterior to have the same covariance for each point, but this restriction harms performance.

Algorithm 1. Computing predictions/ELBO for one batch.
    P parameters: {δ_ℓ}_{ℓ=1}^L. Q parameters: {V_ℓ, γ_ℓ}_{ℓ=1}^L, {v_λ, Λ_λ}_{λ=1}^{N_{L+1}}, X_i. Inputs: X_t. Targets: Y.
    combine inducing and test/train inputs: X = [X_i; X_t]
    sample first Gram matrix and update ELBO:
        Ω ~ Q(Ω)
        L = log P(Ω) − log Q(Ω)
        G_1 = (1/N_0) X Ω X^T
    for ℓ in {2, ..., L} do
        sample inducing Gram matrix and update ELBO:
            G^ℓ_ii ~ Q(G^ℓ_ii | G^{ℓ−1}_ii)
            L ← L + log P(G^ℓ_ii | G^{ℓ−1}_ii) − log Q(G^ℓ_ii | G^{ℓ−1}_ii)
        sample full Gram matrix from the conditional prior:
            Ψ = δ_ℓ K(G_{ℓ−1})
            Ψ_tt·i = Ψ_tt − Ψ_ti Ψ_ii^{−1} Ψ_it
            G^ℓ_tt·i ~ W^{−1}(Ψ_tt·i, δ_ℓ + P_i + P_t + 1)
            G^ℓ_it ~ MN(G^ℓ_ii Ψ_ii^{−1} Ψ_it, G^ℓ_ii Ψ_ii^{−1} G^ℓ_ii, G^ℓ_tt·i)
            G^ℓ_tt = G^ℓ_tt·i + G^ℓ_ti (G^ℓ_ii)^{−1} G^ℓ_it
            G_ℓ = [[G^ℓ_ii, G^ℓ_it], [G^ℓ_ti, G^ℓ_tt]]
    end for
    sample GP inducing outputs and update ELBO:
        F^{L+1}_i ~ Q(F^{L+1}_i | G^L_ii)
        L ← L + log P(F^{L+1}_i | G^L_ii) − log Q(F^{L+1}_i | G^L_ii)
    sample GP predictions conditioned on inducing points:
        F^{L+1}_t ~ Q(F^{L+1}_t | G_L, F^{L+1}_i)
    add likelihood to ELBO:
        L ← L + log P(Y | F^{L+1}_t)

7. Experiments

We began by comparing the performance of our deep inverse Wishart process (DIWP) against infinite Bayesian neural networks (known as the neural network Gaussian process, or NNGP) and DGPs. To ensure sensible comparisons against the NNGP, we used a ReLU kernel in all models (Cho & Saul, 2009). For all models, we used three layers (two hidden layers and one output layer), with three applications of the kernel. In each case, we used a learned bias and scale for each input feature, and trained for 8000 gradient steps with the Adam optimizer and 100 inducing points, using a learning rate of $10^{-2}$ for the first 4000 steps and $10^{-3}$ for the final 4000 steps. For evaluation, we used 100 approximate posterior samples; for each training step we used 10 approximate posterior samples on the smaller datasets (boston, concrete, energy, wine, yacht) and 1 on the larger datasets. We found that DIWP usually gives better predictive performance (and when it does not, the differences are very small; Table 1). We expected DIWP to be better than (or the same as) the NNGP, as the NNGP is a special case of our DIWP (sending $\delta_\ell \to \infty$ sends the variance of the inverse Wishart to zero, so the model becomes equivalent to the NNGP). We found that the DGP performs poorly in comparison to DIWP and NNGPs, and even to past baselines, on all datasets except protein (which is by far the largest). This is because we used a plain feedforward architecture for all models. In contrast, Salimbeni & Deisenroth (2017) found that good performance (or even convergence) with DGPs on UCI datasets required a complex GP prior inspired by skip connections. Here, we used simple feedforward architectures, both to ensure a fair comparison with the other models and to avoid the need for an architecture search. In addition, the inverse Wishart process is implicitly able to learn the "network width", $\delta_\ell$, whereas in the DGPs the width is fixed to be equal to the number of input features, following standard practice in the literature (e.g. Salimbeni & Deisenroth, 2017).
Next, we considered fully-connected networks for small image classification datasets (MNIST and CIFAR-10; Table 2). We used the same models as in the previous section, with the omission of the learned bias and scaling of the inputs. Note that we do not expect these methods to perform well relative to standard methods (e.g. CNNs) for these datasets, as we are using fully-connected networks with only 100 inducing points (whereas e.g. work in the NNGP literature uses the full 60,000 × 60,000 covariance matrix). Nonetheless, as the architectures are carefully matched, this provides another opportunity to compare the performance of DIWPs, NNGPs and DGPs. Again, we found that DIWP usually gave statistically significant gains in predictive performance (except for CIFAR-10 test log-likelihood, where DIWP lagged by only 0.01). Importantly, DIWP gives very large improvements in the ELBO, with gains of 0.09 against DGPs for MNIST and 0.08 for CIFAR-10 (Table 2). For MNIST, remember that the ELBO must be negative (because both the log-likelihood for classification and the KL-divergence term give negative contributions), so the change from 0.301 to 0.214 represents a dramatic improvement.

8. Related work

Our first contribution was the observation that DGPs with isotropic kernels can be written as deep Wishart processes, as the kernel depends only on the Gram matrix. We then gave similar observations for neural networks (App. C.1), infinite neural networks (App. C.2) and infinite networks with bottlenecks (App. C.3; also see Aitchison, 2019). These observations motivated us to consider the deep inverse Wishart process prior, which is a novel combination of two preexisting elements: nonlinear transformations of the kernel (e.g. Cho & Saul, 2009) and inverse Wishart priors over kernels (e.g. Shah et al., 2014).

Table 1. Performance measured as predictive log-likelihood for three-layer (two hidden layer) DGP, NNGP and DIWP models on UCI benchmark tasks. We consider ReLU and squared-exponential kernels. Errors are quoted as two standard errors in the difference between that method and the best-performing method, as in a paired t-test. This is to account for the shared variability that arises due to the use of different test/train splits of the data (20 splits for all but protein, where 5 splits are used; Gal & Ghahramani, 2015): some splits are harder for all models, and some splits are easier. Because we consider these differences, errors for the best measure are implicitly included in the errors for the other measures, and we cannot provide a comparable error for the best method itself.

| kernel   | dataset  | DGP         | NNGP        | DIWP        |
|----------|----------|-------------|-------------|-------------|
| relu     | boston   | 3.44 ± 0.13 | 2.46 ± 0.03 | 2.40        |
| relu     | concrete | 3.20 ± 0.03 | 3.13 ± 0.03 | 3.04        |
| relu     | energy   | 0.90 ± 0.05 | 0.71        | 0.71 ± 0.01 |
| relu     | kin8nm   | 1.05 ± 0.01 | 1.10        | 1.09 ± 0.00 |
| relu     | naval    | 2.80 ± 0.14 | 5.74 ± 0.14 | 5.95        |
| relu     | power    | 2.85 ± 0.01 | 2.83 ± 0.00 | 2.81        |
| relu     | protein  | 2.80        | 2.88 ± 0.01 | 2.81 ± 0.01 |
| relu     | wine     | 1.18 ± 0.03 | 0.96 ± 0.01 | 0.95        |
| relu     | yacht    | 2.45 ± 0.49 | 0.77 ± 0.07 | 0.64        |
| sq. exp. | boston   | 3.63 ± 0.09 | 2.48 ± 0.04 | 2.40        |
| sq. exp. | concrete | 4.22 ± 0.05 | 3.13 ± 0.03 | 3.08        |
| sq. exp. | energy   | 3.73 ± 0.06 | 0.70        | 0.70 ± 0.01 |
| sq. exp. | kin8nm   | 0.09 ± 0.01 | 1.04        | 1.01 ± 0.02 |
| sq. exp. | naval    | 2.80 ± 0.12 | 5.96        | 5.92 ± 0.27 |
| sq. exp. | power    | 2.84 ± 0.01 | 2.80 ± 0.01 | 2.78        |
| sq. exp. | protein  | 3.23 ± 0.01 | 2.88 ± 0.01 | 2.74        |
| sq. exp. | wine     | 1.22 ± 0.02 | 0.96        | 1.00 ± 0.01 |
| sq. exp. | yacht    | 4.12 ± 0.14 | 0.50 ± 0.13 | 0.39        |

Table 2. Performance in terms of ELBO, test log-likelihood and test accuracy for fully-connected three-layer (two hidden layer) DGP, NNGP and DIWP models on MNIST and CIFAR-10.
| metric    | dataset  | DGP           | NNGP          | DIWP          |
|-----------|----------|---------------|---------------|---------------|
| ELBO      | MNIST    | 0.301 ± 0.001 | 0.268 ± 0.001 | 0.198 ± 0.000 |
| ELBO      | CIFAR-10 | 1.735 ± 0.002 | 1.719 ± 0.001 | 1.606 ± 0.001 |
| test LL   | MNIST    | 0.130 ± 0.001 | 0.134 ± 0.002 | 0.089 ± 0.001 |
| test LL   | CIFAR-10 | 1.516 ± 0.002 | 1.539 ± 0.002 | 1.433 ± 0.004 |
| test acc. | MNIST    | 96.5 ± 0.1%   | 96.5 ± 0.0%   | 97.7 ± 0.0%   |
| test acc. | CIFAR-10 | 46.8 ± 0.1%   | 47.4 ± 0.1%   | 50.5 ± 0.1%   |

Deep nonlinear transformations of the kernel have been used in the infinite neural network literature (Lee et al., 2017; Matthews et al., 2018), where they form deterministic, parameter-free kernels that do not have any flexibility in the lower layers (Aitchison, 2019). Likewise, inverse-Wishart distributions have been suggested as priors over covariance matrices (Shah et al., 2014), but they considered a model without nonlinear transformations of the kernel. Surprisingly, without these nonlinear transformations, the inverse Wishart prior becomes equivalent to simply scaling the covariance with a scalar random variable (App. L; Shah et al., 2014).

In addition, there are generalised Wishart processes (Wilson & Ghahramani, 2010), which contrast with our deep Wishart processes. While the term "generalised Wishart process" is not yet in widespread use, it allows us to make a distinction that is very useful in our context. In particular, a generalised Wishart process is a distribution over infinitely many finite-dimensional, marginally Wishart matrices. For instance, these might represent the noise in a dynamical system, in which case there would in principle be infinitely many covariance matrices, one for each state-space location or time-point (Wilson & Ghahramani, 2010; Heaukulani & van der Wilk, 2019; Jorgensen et al., 2020). In contrast, kernel processes (Dawid, 1981; Bru, 1991) are distributions over a single infinite-dimensional matrix. We stack these kernel processes to form a (non-generalised) deep kernel process. Importantly, generalised Wishart priors are actually quite inflexible. They are not capable of capturing a DKP prior, because in a generalised Wishart process the Wishart matrices are generated from underlying features, and these features are jointly multivariate Gaussian at all locations (Sec. 4 in Wilson & Ghahramani, 2010), and therefore lack the required nonlinearities between layers. In addition, inference is also very different. In particular, inference for the generalised Wishart is generally performed on the underlying multivariate Gaussian feature vectors (Eq. 1; e.g. Eqs. 15-18 in Wilson & Ghahramani 2010, Eq. 12 in Heaukulani & van der Wilk 2019, or Eq. 24 in Jorgensen et al. 2020). Unfortunately, variational approximate posteriors defined over multivariate Gaussian feature vectors fail to capture symmetries in the true posterior (Eq. 14). In contrast, we define approximate posteriors directly over the symmetric positive semi-definite Gram matrices themselves, which required us to develop new, more flexible distributions over these matrices.

Further, linear (inverse) Wishart processes have been used in the financial domain to model how the volatility of asset prices changes over time (Philipov & Glickman, 2006b;a; Asai & McAleer, 2009; Gourieroux & Sufana, 2010; Wilson & Ghahramani, 2010; Heaukulani & van der Wilk, 2019). Importantly, inference in these dynamical (inverse) Wishart processes is often performed by assuming fixed, integer degrees of freedom, and working with underlying Gaussian-distributed features. This approach allows one to leverage standard GP techniques (e.g.
Kandemir & Hamprecht, 2015; Heaukulani & van der Wilk, 2019), but it is not possible to optimize the degrees of freedom, and the posterior over these features usually has rotational symmetries (App. D.2) that are not captured by standard variational posteriors. In contrast, we give a novel doubly-stochastic variational inducing-point inference method that operates purely on Gram matrices and thus avoids the need to capture these symmetries.

9. Conclusions

We proposed deep kernel processes, which combine nonlinear transformations of the Gram matrix with sampling from matrix-variate distributions such as the inverse Wishart. We showed that DGPs, BNNs (App. C.1), infinite BNNs (App. C.2) and infinite BNNs with bottlenecks (App. C.3) are all instances of DKPs. We defined a new family of deep inverse Wishart processes, and gave a novel doubly-stochastic inducing-point variational inference scheme that works purely in the space of Gram matrices. DIWP performed better than fully-connected NNGPs and DGPs on UCI, MNIST and CIFAR-10 benchmarks.

Acknowledgments

SWO acknowledges the support of the Gates Cambridge Trust for funding his doctoral studies. We would like to thank the University of Bristol's Advanced Computing Research Centre (ACRC) for computational resources.

References

Aitchison, L. Why bigger is not always better: on finite and infinite neural networks. arXiv preprint arXiv:1910.08013, 2019.

Asai, M. and McAleer, M. The structure of dynamic correlations in multivariate stochastic volatility models. Journal of Econometrics, 150(2):182–192, 2009.

Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

Bodnar, T. and Okhrin, Y. Properties of the singular, inverse and generalized inverse partitioned Wishart distributions. Journal of Multivariate Analysis, 99(10):2389–2405, 2008.

Bodnar, T., Mazur, S., and Podgórski, K. Singular inverse Wishart distribution and its application to portfolio theory. Journal of Multivariate Analysis, 143:314–326, 2016.

Bru, M.-F. Wishart processes. Journal of Theoretical Probability, 4(4):725–751, 1991.

Cho, Y. and Saul, L. K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.

Damianou, A. and Lawrence, N. Deep Gaussian processes. In Artificial Intelligence and Statistics, pp. 207–215, 2013.

Dawid, A. P. Some matrix-variate distribution theory: notational considerations and a Bayesian application. Biometrika, 68(1):265–274, 1981.

Eaton, M. L. Multivariate Statistics: A Vector Space Approach. Wiley Series in Probability and Mathematical Statistics. Wiley-Interscience, 1983.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015.

Garriga-Alonso, A., Rasmussen, C. E., and Aitchison, L. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.

Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892, 2016.

Gourieroux, C. and Sufana, R. Derivative pricing with Wishart multivariate stochastic volatility. Journal of Business & Economic Statistics, 28(3):438–451, 2010.

Heaukulani, C. and van der Wilk, M. Scalable Bayesian dynamic covariance modeling with variational Wishart and inverse Wishart processes. In Advances in Neural Information Processing Systems, pp.
4582–4592, 2019.

Hofmann, T., Schölkopf, B., and Smola, A. J. Kernel methods in machine learning. The Annals of Statistics, pp. 1171–1220, 2008.

Jorgensen, M., Deisenroth, M., and Salimbeni, H. Stochastic differential equations with variational Wishart diffusions. In International Conference on Machine Learning, pp. 4974–4983. PMLR, 2020.

Kandemir, M. and Hamprecht, F. A. The deep feed-forward Gaussian process: An effective generalization to covariance priors. In Feature Extraction: Modern Questions and Challenges, pp. 145–159, 2015.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Koev, P. and Edelman, A. The efficient evaluation of the hypergeometric function of a matrix argument. Mathematics of Computation, 75(254):833–846, 2006.

Kolmogorov, A. Grundbegriffe der Wahrscheinlichkeitsrechnung. Ergebnisse der Mathematik, 1933.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Lalchand, V. and Rasmussen, C. E. Approximate inference for fully Bayesian Gaussian process regression. In Symposium on Advances in Approximate Bayesian Inference, pp. 1–12. PMLR, 2020.

Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.

Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399, 2018.

MacKay, D. J. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.

Moore, D. A. Symmetrized variational inference. In NIPS Workshop on Advances in Approximate Bayesian Inference, volume 4, pp. 31, 2016.

Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D. A., Pennington, J., and Sohl-Dickstein, J. Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.

Ober, S. W. and Aitchison, L. Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. arXiv preprint arXiv:2005.08140, 2020.

Philipov, A. and Glickman, M. E. Factor multivariate stochastic volatility via Wishart processes. Econometric Reviews, 25(2-3):311–334, 2006a.

Philipov, A. and Glickman, M. E. Multivariate stochastic volatility via Wishart processes. Journal of Business & Economic Statistics, 24(3):313–328, 2006b.

Pourzanjani, A. A., Jiang, R. M., and Petzold, L. R. Improving the identifiability of neural networks for Bayesian inference. In NIPS Workshop on Bayesian Deep Learning, volume 4, pp. 31, 2017.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Rivasplata, O., Tankasali, V. M., and Szepesvari, C. PAC-Bayes with backprop. arXiv preprint arXiv:1908.07380, 2019.

Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pp. 4588–4599, 2017.

Shah, A., Wilson, A., and Ghahramani, Z. Student-t processes as alternatives to Gaussian processes.
In Artificial Intelligence and Statistics, pp. 877–885, 2014.

Srivastava, M. S. et al. Singular Wishart and multivariate beta distributions. The Annals of Statistics, 31(5):1537–1560, 2003.

Sterbenz, P. H. Floating-Point Computation. Prentice-Hall, 1974.

Turner, R. E. and Sahani, M. Two problems with variational expectation maximisation for time-series models. In Barber, D., Cemgil, T., and Chiappa, S. (eds.), Inference and Learning in Dynamic Models. Cambridge University Press, 2011.

Uhlig, H. On singular Wishart and singular multivariate beta distributions. The Annals of Statistics, pp. 395–405, 1994.

Williams, C. K. and Rasmussen, C. E. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

Wilson, A. G. and Ghahramani, Z. Generalised Wishart processes. arXiv preprint arXiv:1101.0240, 2010.