# how_deep_are_deep_gaussian_processes__5f31e66e.pdf Journal of Machine Learning Research 19 (2018) 1-46 Submitted 1/18; Revised 8/18; Published 9/18 How Deep Are Deep Gaussian Processes? Matthew M. Dunlop mdunlop@caltech.edu Computing and Mathematical Sciences Caltech Pasadena, CA 91125, USA Mark A. Girolami m.girolami@imperial.ac.uk Department of Mathematics Imperial College London London, SW7 2AZ, UK and The Alan Turing Institute 96 Euston Road London, NW1 2DB, UK Andrew M. Stuart astuart@caltech.edu Computing and Mathematical Sciences Caltech Pasadena, CA 91125, USA Aretha L. Teckentrup a.teckentrup@ed.ac.uk School of Mathematics University of Edinburgh Edinburgh, EH9 3FD, UK and The Alan Turing Institute 96 Euston Road London, NW1 2DB, UK Editor: Neil Lawrence Recent research has shown the potential utility of deep Gaussian processes. These deep structures are probability distributions, designed through hierarchical construction, which are conditionally Gaussian. In this paper, the current published body of work is placed in a common framework and, through recursion, several classes of deep Gaussian processes are defined. The resulting samples generated from a deep Gaussian process have a Markovian structure with respect to the depth parameter, and the effective depth of the resulting process is interpreted in terms of the ergodicity, or non-ergodicity, of the resulting Markov chain. For the classes of deep Gaussian processes introduced, we provide results concerning their ergodicity and hence their effective depth. We also demonstrate how these processes may be used for inference; in particular we show how a Metropolis-within-Gibbs construction across the levels of the hierarchy can be used to derive sampling tools which are robust to the level of resolution used to represent the functions on a computer. For illustration, we consider the effect of ergodicity in some simple numerical examples. Keywords: deep learning, deep Gaussian processes, deep kernels c 2018 Matthew M. Dunlop, Mark A. Girolami, Andrew M. Stuart and Aretha L. Teckentrup. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/18-015.html. Dunlop, Girolami, Stuart and Teckentrup 1. Introduction This section provides background on the study and application of Gaussian and deep Gaussian processes, outlines the contribution and setup of the paper, and establishes notation to be used in subsequent sections. 1.1 Background Gaussian processes have proved remarkably successful as a tool for various statistical inference and machine learning tasks (Rasmussen and Williams, 2006; Kennedy and O Hagan, 2001; Higdon et al., 2004; Stein, 1999). This success relates in part to the ease with which computations may be performed in the Gaussian framework, and also to the flexible ways in which Gaussian processes may be used, for example when combined with thresholding to perform classification tasks via probit models (Neal, 1997; Rasmussen and Williams, 2006) or to find interfaces in Bayesian inversion (Iglesias et al., 2016). Nonetheless there are limits to the sort of phenomena that are readily expressible via direct use of Gaussian processes, such as in the sparse data scenario, where the constructed probability distribution is far from posterior contraction. Recognizing this fact, there have been a number of interesting research activities which seek to represent new phenomena via the hierarchical cascading of Gaussians. Early work of this type includes the Ph D thesis of Paciorek (2003) (see also Paciorek and Schervish, 2004) in which the aim is to reproduce spatially non-stationary phenomena, and this is achieved by means of a Gaussian process whose covariance function itself depends on another Gaussian process. This idea was recently re-visited by Roininen et al. (2016), using the precision operator viewpoint, rather than covariance function, and building on the explicit link between Gaussian processes and stochastic partial differential equations (SPDEs) (Lindgren et al., 2011). A different approach was adopted by Damianou and Lawrence (2013) where a Gaussian process was directly composed with another Gaussian process; furthermore the idea was implemented recursively, leading to what is referred to as deep Gaussian processes (DGP). These ingenious constructions open up new possibilities for problems in non-parametric inference and machine learning and the purpose of this paper is to establish, and utilize, a common framework for their study. Relevant to our analysis is the early work of Diaconis and Freedman (1999) which studied iterations of random Lipschitz functions and the conditions required for their convergence. 1.2 Our Contribution In the paper we make three main contributions: We demonstrate a unifying perspective on the hierarchical Gaussian processes described in the previous subsection, leading to a wide class of deep Gaussian processes, with a common framework within which new deep Gaussian processes can be constructed. By exploiting the fact that this common framework has a Markovian structure, we interpret the depth of the process in terms of the ergodicity or non-ergodicity of this process; in simple terms ergodic constructions have effective depth given by the mixing time. How Deep Are Deep Gaussian Processes? We demonstrate how these processes may be used for inference; in particular we show how a Metropolis-within-Gibbs construction across the levels of the hierarchy can be used to derive sampling tools which are robust to the level of resolution used to represent the functions on a computer. We also describe numerical experiments which illustrate the theory, and which demonstrate some of the limitations of the framework in the inference context, suggesting the need for further algorithmic innovation and theoretical understanding. We now summarize the results and contributions by direct reference to the main theorems in the paper. Theorem 4 shows that a composition-based deep Gaussian process will, with sufficiently many layers, produce samples that are approximately constant. This pathology can be avoided by, for example, increasing the width of each hidden layer, or allowing each layer to depend on the input layer. Theorem 8 shows the ergodicity of a class of discretized deep Gaussian processes, constructed using non-stationary covariance functions. As a consequence, there is little benefit in adding additional layers after a certain point. This observation elucidates the mechanism underlying the choices of DGPs with a small number of layers for inference in numerous papers (for example Cutajar et al., 2017; Salimbeni and Deisenroth, 2017; Dai et al., 2015). Theorem 14 establishes a similar result as Theorem 8 on function space, for a different class of deep Gaussian processes constructed using non-stationary covariance operators. Theorem 16 establishes the asymptotic properties of a deep Gaussian process formed by iterated convolution of fairly general classes of Gaussian random fields. Specifically it is shown that such processes will either converge weakly to zero or diverge as the number of layers is increased, and so they will provide little flexibility for inference in practice. 1.3 Overview The general framework in which we place the existing literature, and which we employ to analyze deep Gaussian processes, and to construct algorithms for related inference tasks, is as follows. We consider sequences of functions {un} which are conditionally Gaussian: un+1|un N m(un), C(un) ; (Cov Op) here m(un) denotes the mean function and C(un) the covariance operator. We will also sometimes work with the covariance function representation, in which case we will write un+1|un GP m(x; un), c(x, x ; un) . (GP) Note that the covariance function is the kernel of the covariance operator when the latter is represented as an integral operator over the approximate domain D Rd: C(un)φ (x) = Z c(x, x ; un)φ(x )dx . Dunlop, Girolami, Stuart and Teckentrup In most of the paper we consider the centred case where m 0, although the flexibility of allowing for non-zero mean will be important in some applications, as discussed in the conclusions. When the mean is zero, the iterations (Cov Op) and (GP) can be written in the form un+1 = L(un)ξn+1, (Zero Mean) where {ξn} form an i.i.d. Gaussian sequence and, for each u, L(u) is a linear operator. For example if the ξn are white then the covariance operator is C(u) = L(u)L(u) with denoting the adjoint operation and L(u) is a Cholesky factor of C(u). The formulation (Zero Mean) is useful in much of our analysis. For the purpose of this paper, we will refer to any sequence of functions constructed as in (Zero Mean) as a deep Gaussian process. In section 2 we discuss the hierarchical Gaussian constructions referenced above, and place them in the setting of equations (Cov Op), (GP) and (Zero Mean). Section 3 studies the ergodicity of the resulting deep Gaussian processes, using the Markov chain which defines them. In section 4 we provide supporting numerical experiments; we give illustrations of draws from deep Gaussian process priors, and we discuss inference. In the context of inference we describe a methodology for MCMC, using deep Gaussian priors, which is defined in the function space limit and is hence independent of the level of resolution used to represent the functions un; numerical illustrations are given. We conclude in section 5 in which we describe generalizations of the settings considered in this paper, and highlight future directions. 1.4 Notation The structure of the deep Gaussian processes above means that they can be interpreted as Markov chains on a Hilbert space H of functions. Let B(H) denote the Borel σ-algebra on H. We denote by P : H B(H) R the one-step transition probability distribution, P(u, A) = P(un A | un 1 = u), (1) and denote by Pn : H B(H) R the n-step transition probability distribution, Pn(u, A) = P(un A | u0 = u). (2) Thus, for example, in the case of the covariance operator construction (Cov Op) we have P(u, ) = N 0, C(u) , when the mean is zero. This Markovian structure will be exploited when showing ergodicity, or lack of ergodicity, of the chains. 2. Four Constructions This section provides examples of four constructions of deep Gaussian processes, all of which fall into our general framework. The reader will readily design others. How Deep Are Deep Gaussian Processes? 2.1 Composition Let D Rd, D Rl, un : D Rm and F : Rm D . If {ξn} is a collection of i.i.d. centred Gaussian processes taking values in the space of continuous functions C(D ; Rm) then we define the Markov chain un+1(x) = ξn+1 F un(x) . (3) The case m = l, F = id and D = D = Rm was introduced by Damianou and Lawrence (2013) and the generalization here is inspired by the formulation of Duvenaud et al. (2014). The case where two layers are employed could be interpreted as a form of warped Gaussian process: a generalization of Gaussian processes that have been used successfully in a number of inference problems (Snelson et al., 2004; Schmidt and O Hagan, 2003). We note that the mapping ξ 7 ξ F u is linear, and we may thus define L(u) by L(u)ξ = ξ F u; hence the Markov chain may be written in the form (Zero Mean). If ξ1 N(0, Σ) then the Markov chain has the form (Cov Op), with mean zero and C(u) = L(u)ΣL(u) ; if ξ1 GP 0, k(z, z ) then the Markov chain has the form (GP) with mean zero and c(x, x ; u) = k F u(x) , F u(x ) . 2.2 Covariance Function Paciorek (2003) gives a general strategy to construct anisotropic versions of isotropic covariance functions. Let Σ : Rd Rd d be such that Σ(z) is symmetric positive definite for all z Rd, and define the quadratic form Q(x, x ) = (x x )T Σ(x) + Σ(x ) 1 (x x ), x, x Rd. If the isotropic correlation function ρS( ) is positive definite on Rd, for all d N, then the function c(x, x ) = σ2 2 d 2 det(Σ(x)) 1 4 det(Σ(x )) 1 4 det(Σ(x) + Σ(x )) 1 2 ρS( p is positive definite on Rd Rd and may thus be used as a covariance function. We make these statements precise below. If we choose Σ to depend on un then this may be used as the basis of a deep Gaussian process. To be concrete we choose Σ(x) = F u(x) Id where F : R R 0 for u : D Rd R. We then write c(x, x ; u). Now let un : D R and consider the Markov chain (GP) in the mean zero case. Paciorek (2003) considered this iteration over one-step with u0 GP(0, σ2ρS( x x )) and u1 was shown to exhibit interesting non-stationary effects. Here we generalize and consider the deep process that results from this construction for arbitrary n N. By considering the covariance operator C(u)ϕ (x) = Z Rd c(x, x ; u)ϕ(x ) dx Dunlop, Girolami, Stuart and Teckentrup we may write the iteration in the form (Cov Op). The form (Zero Mean) follows with L(u) = C(u) 1 2 and ξn+1 being white noise. Various generalizations of this construction are possible, for example allowing the pointwise variance of the process σ2 to be spatially varying (Heinonen et al., 2016) and to depend on un(x). These may be useful in applications, but we confine our analysis to the simpler setting for expository purposes; however in Remark 12 we discuss this generalization. In order to make the statements made above precise, let ρS : [0, ) R be a stationary covariance kernel, where the covariance between locations x and y depends only on the Euclidean distance x y 2. We make the following assumption on ρS. Assumptions 1 (i) The covariance kernel ρS( x y 2) is positive definite1 on Rd Rd: for any N N, b RN\{0} and pairwise distinct {xi}N i=1 Rd, we have j=1 bibjρS xi xj 2 > 0. (ii) ρS is normalized to be a correlation kernel, i.e. ρS(0) = 1. Using (Wendland, 2004, Theorem 6.11), sufficient conditions for ρS to fulfill Assumptions 1(i) are that ρS, as a function of x y, is continuous, bounded and in L1(Rd), with a Fourier transform that is non-negative and non-vanishing. These sufficient conditions are satisfied, for example, for the family of Matèrn covariance functions and the Gaussian covariance. To satisfy Assumptions 1(ii), any positive definite kernel ρS can simply be rescaled by ρS(0). We now have the following proposition, a slightly weaker version of which is proved by Paciorek (2003), where it is shown that ρ( , ) is positive semi-definite if ρS is positive semi-definite. Our proof, which is in the Appendix, follows closely that of (Paciorek, 2003, Theorem 1), but sharpens the result using a characterization of positive definite kernels proved in (Wendland, 2004, Theorem 7.14). Proposition 1 Let Assumptions 1 hold. Suppose Σ : Rd Rd d is such that Σ(z) is symmetric positive definite for all z Rd, and define the quadratic form Q(x, x ) = (x x )T Σ(x) + Σ(x ) 1 (x x ), x, x Rd. Then the function ρ( , ), defined by ρ(x, x ) = 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4 |Σ(x) + Σ(x )| 1 2 ρS p is positive definite on Rd Rd, for any d N, and is a non-stationary correlation function. 1. If the double sum in this definition is only non-negative, we say that the kernel ρS is positive semidefinite. We are thus adopting the terminology used by Wendland (2004), where the kernel ρS is called positive definite if the double sum in Assumptions 1(i) is positive, and positive semi-definite if the sum is non-negative. For historical reasons, there is an alternative terminology, used by for example Paciorek (2003), where our notion of positive definite is referred to as strictly positive definite, and our notion of positive semi-definite is referred to as positive definite. How Deep Are Deep Gaussian Processes? Non-stationary covariance functions c(x, y), for which c(x, x) = 1, can be obtained from the non-stationary correlation function ρ(x, y) through multiplication by a standard deviation function σ : Rd R, in which case we have c(x, y) = σ(x)σ(y)ρ(x, y). Since the product of two positive definite kernels is also positive definite by (Wendland, 2004, Theorem 6.2), the kernel c(x, y) can be ensured to be positive definite by a proper choice of σ. We discuss generalizations such as this in the conclusions section 5. We are interested in studying the behaviour of Gaussian processes with non-stationary correlation functions ρ(x, y) of the form derived in Proposition 1, in the particular case where the matrices Σ(z) are derived from another Gaussian process. Specifically, we consider the following hierarchy of conditionally Gaussian processes on a bounded domain D Rd defined as follows: u0 GP(0, ρS( )), (4a) un+1|un GP(0, ρ( , ; un)), for n N. (4b) Here, ρ( , ; un) denotes a non-stationary correlation function constructed from ρS( ) as in Proposition 1, with the map Σ defined through un. Typical choices for Σ are Σ(z) = (un(z))2 Id and Σ(z) = exp(un(z)) Id. Choices such as the first of these lead to the possibility of positive semi-definite Σ and, in the worst case, Σ 0. If Σ 0 the resulting correlation function is given by ρS(0) = 1, and ρS(r) = 0 for any r > 0. This does not correspond to any (function valued) Gaussian process on Rd (Kallianpur, 2013): heuristically the resulting process would be a white noise process, but normalized to zero. However, it is possible to sample from any set of finite dimensional distributions when Σ 0: the correlation matrix is then the identity. To allow for the possibility of F( ) taking the value zero, we therefore only study the finite dimensional process defined as follows: u0 N(0, RS), (5a) un+1 | un N(0, R(un)), for n N. (5b) The vector un has entries (un)i = un(xi). Here, RS is the covariance matrix with entries (RS)ij = ρS( xi xj 2), and R(un) is the covariance matrix with entries (R(un))ij = ρ(xi, xj; un). The set {xj} comprises a finite set of points in Rd. We may now generalize Proposition 1 to allow for Σ becoming zero. In order to do this we make the following assumptions: Assumptions 2 (i) We have Σ(z) = G(z)Id, for some non-negative, bounded function G : R R 0. (ii) The correlation function ρS is continuous, with limr ρS(r) = 0. We then have the following result on the positive-definiteness of ρ( , ), Proposition 2 Let Assumptions 1 and 2 hold. Then the kernel ρ( , ) defined in Proposition 1 is positive definite on Rd Rd. Dunlop, Girolami, Stuart and Teckentrup Remark 3 This proposition applies to the process (5) with Σ(z) = F un(z) Id and F : R R 0 locally bounded, by taking G = F un, proving that ρ( , ; un) is positive definite on D D for all bounded functions un on D. Here we generalize the notion of positive-definite in the obvious way to apply on D Rd rather than on the whole of Rd. 2.3 Covariance Operator Here we demonstrate how precision (inverse covariance) operators may be used to make deep Gaussian processes. Because precision operators encode conditional independence and sparsity this can be a very attractive basis for fast computations (Lindgren et al., 2011). Our approach is inspired by the hierarchical Gaussian process introduced by Roininen et al. (2016), where one-step of the Markov chain which we introduce here was considered. Let D Rd, un : D R and X := C(D; R). Assume that F : R R 0 is a bounded function. Let C be a covariance operator associated to a Gaussian process taking values in X and let P be the associated precision operator. Define the multiplication operator Γ(u) by Γ(u)v (x) = F u(x) v(x) and the covariance operator C(u) by C(u) 1 = P + Γ(u) and consider the Markov chain (Cov Op) with mean zero; this defines our deep Gaussian process. We note that formulation (GP) can be obtained by observing that the covariance function c(u) := c(x, x ; u) is the Green s function associated with the precision operator for C(u): C(u) 1c( , x ; u) = δx ( ) where δx is a Dirac delta function centred at point x . Computationally we will typically choose P to be a differential operator, noting that then fast methods may be employed to sample the Gaussian process un+1|un by means of SPDEs (Lindgren et al., 2011; Dashti and Stuart, 2017). If P is chosen as a differential operator, then the order of this operator will be related to the order of regularity of samples, and F will be related to the length scale of the samples. These relations are made explicit in the case of certain Whittle-Matérn distributions when F is constant (Lindgren et al., 2011); some boundary effects may be present when D = Rd, though methodology is available to ameliorate these (Daon and Stadler, 2018). As in the previous subsection, the form (Zero Mean) follows with L(u) = C(u) 1 2 and ξn+1 being white noise. Generalizations of the construction in this subsection are possible, and we highlight these in subsection 5; however for expository purposes we confine our analysis to the setting described in this subsection. For theoretical investigation of the equivalence, as measures, of Gaussians defined by addition of an operator to a given precision operator (see Pinski et al., 2015). 2.4 Convolution We consider the case (Zero Mean) where L(u)ξ := u ξ is a convolution. To be concrete we let D = [0, 1]d and construct a sequence of functions un : D R (or un : D C) defined via the iteration un+1(x) = (un ξn+1)(x) := Z [0,1]d un(x y)ξn+1(y) dy, How Deep Are Deep Gaussian Processes? where {ξn} are a sequence of i.i.d. centred real-valued Gaussian random functions on D. Here we implicitly work with periodic extension of un from D to the whole of Rd in order to define the convolution. 3. The Role of Ergodicity The purpose of this section is to demonstrate that the iteration (Zero Mean) is, in many situations, ergodic. This has the practical implication that the effective depth of the deep Gaussian process is limited by the mixing time of the Markov chain. In some cases the ergodic behaviour may be trivial (convergence to a constant). Furthermore, even if the chain is not ergodic, the large iteration number dynamics may blow-up, prohibiting use of the iteration at significant depth. The take home message is that in many cases the effective depth is not that great. Great care will be needed to design deep Gaussian processes whose depth, and hence approximation power, is substantial. This issue was first identified by Duvenaud et al. (2014), and we here provide a more general analysis of the phenomenon within the broad framework we have introduced for deep Gaussian processes. 3.1 Composition We first consider the case where the iteration is defined by (3), which includes examples considered by Damianou and Lawrence (2013); Duvenaud et al. (2014). Duvenaud et al. (2014) observed that after a number of iterations, sample paths are approximately piecewise constant. We investigate this effect in the context of ergodicity. We first make two observations: (i) if u0 is piecewise constant, then un is piecewise constant for all n N; (ii) if u0 has discontinuity set Z0, and Zn denotes the discontinuity set of the nth iterate, then Zn+1 Zn for all n N. Due to point (ii) above, if the sequence {un} is to be ergodic, then necessarily it must be the case that Zn , or else the process will have retained knowledge of the initial condition. In particular, if the initial condition is piecewise constant, then ergodicity would force the limit to be constant in space. In what follows we assume that the iteration is given by un+1(x) = ξn+1 un(x) , ξj n+1 GP 0, h( x x 2) i.i.d. where h is a stationary covariance function. We therefore make the choice m = l and F = id in (3) so that we are in the same setup as Damianou and Lawrence (2013); Duvenaud et al. (2014); the inclusion of more general maps F is discussed in Remark 5. Then for any x, x R we have uj n+1(x) uj n+1(x ) , h(0) h un(x) un(x ) 2 h un(x) un(x ) 2 h(0) A common choice of covariance function is the squared exponential kernel: h(z) = σ2e z2/2w2 (6) Dunlop, Girolami, Stuart and Teckentrup where σ2, w2 > 0 are scalar parameters. In Duvenaud et al. (2014), in the case m = d = 1, the choice σ2/w2 = π/2 is made above to ensure that the expected magnitude of the derivative remains constant through iterations. We show in the next proposition that if σ2, w2 are chosen such that σ2 < w2/m, then the limiting process is trivial in a sense to be made precise. Theorem 4 Assume that h( ) is given by the squared exponential kernel (6) and that u0 is bounded on bounded sets almost-surely. Then if σ2 < w2/m, P un(x) un(x ) 2 0 for all x, x D = 1 where P denotes the law of the process {un} over the probability space Ω. Proof Since 1 e x x for x 0 it follows that, for all z R, 2h(0) 2h(z) σ2 with equality when z = 0. Then we have E un(x) un(x ) 2 2 un 1 = j=1 E |uj n(x) uj n(x )|2 un 1 2h(0) 2h un 1(x) un 1(x ) 2 w2 un 1(x) un 1(x ) 2 2 and so using induction and the tower property of conditional expectations, E un(x) un(x ) 2 2 mσ2 E un 1(x) un 1(x ) 2 2 n E u0(x) u0(x ) 2 2 for some constant κ(x, x ). By the Markov inequality, we see that for any ε > 0, P un(x) un(x ) 2 ε) 1 n κ(x, x ), (7) and so, since σ2 < w2/m, applying the first Borel-Cantelli lemma we deduce that m=n { um(x) um(x ) 2 ε} How Deep Are Deep Gaussian Processes? We therefore have that P( un(x) un(x ) 2 0) = P m=n { um(x) um(x ) 2 < 1/k} m=n { um(x) um(x ) 2 1/k} ! m=n { um(x) um(x ) 2 1/k} Hence given any x, x D, we can find Ω(x, x ) Ωwith P(Ω(x, x )) = 1 such that for any ω Ω(x, x ), un(x; ω) un(x ; ω) 2 0. Let {qj} be a countable dense subset of D, and define Ω = \ i,j N Ω(qi, qj) , noting that P(Ω ) = 1. Then for any ω Ω and any x, x {qj}, un(x; ω) un(x ; ω) 2 0. Since sample paths are almost-surely continuous, the above can be extended to all x, x D, so that P un(x) un(x ) 2 0 for all x, x D = 1. Remark 5 1. If a more general transformation map F : Rm D is included, then the above result still holds provided we take σ2 < w2/( F m). The convergence to a constant hence occurs when the length scale w is large or F is small (so each Gaussian random field doesn t change too rapidly across the domain), or when the amplitude σ is small (so inputs are not warped too far). 2. The condition of the above theorem is less likely to be satisfied as the width m of each layer is increased, and so this triviality pathology is unlikely to arise for large m; this may be observed in practice numerically. 3. Following Neal (1995); Duvenaud et al. (2014), recent works (such as Dai et al., 2015; Cutajar et al., 2017) connect all layers to the input layer in order to avoid certain pathologies. The Markovian structure of the process is maintained in this case: with the above notation, the process is then defined by un+1(x) = ξn+1(un(x), x), ξj n+1 GP 0, h( x x 2) i.i.d, where now ξn : Rm Rd Rm. Defining β = mσ2/w2 < 1, if σ 1 we may use the same argument as the proof above to deduce that E un(x) un(x ) 2 2 un 1 β un 1(x) un 1(x ) 2 2 + β x x 2 2, Dunlop, Girolami, Stuart and Teckentrup which leads to E un(x) un(x ) 2 2 βn E u0(x) u0(x ) 2 2 + β 1 βn The right hand side does not vanish as n , and so we can no longer use the first Borel-Cantelli lemma to reach the same conclusion as the case where the layers are not connected to the input layer. This could provide some intuition as to why including the connection of each layer to the input layer provides greater stability than not doing so. 3.2 Covariance Function In order to study ergodicity of the deep Gaussian process defined through covariance functions, we will restrict attention in the remainder of this subsection to hierarchies of finitedimensional multivariate Gaussian random variables as in (5). Note that although we have here defined u0 N(0, RS), following e.g. Paciorek (2003), the ergodicity of the deep Gaussian process will be proved for fixed u0 RN (cf. Theorem 8). The following result is immediate from Proposition 2. Corollary 6 Let Assumptions 1 and 2 hold. Then the covariance matrix R(un) is positive definite for all un C, for any compact subset of C RN. Note that, because we have chosen to work with a correlation kernel, we have Tr R(un) = N. (8) We will use this fact explicitly in the ergodicity proof; however it may be relaxed as discussed in the Remark 12 below. We view the sequence of random variables {un} n=0 as a Markov chain, with u0 RN given, and we want to show the existence of a stationary distribution. Recall the one-step transition kernel P of the Markov chain given by (1), and its n fold composition given by (2). In order to prove ergodicity of the Markov chain we will follow the proof technique of Mattingly et al. (2002); Meyn and Tweedie (2012), which establishes geometric ergodicity with the following proposition. Proposition 7 Suppose the Markov chain {un} n=0 satisfies, for some compact set C B(RN), the following: (i) For some y int(C) and for any δ > 0, we have P(u, Bδ(y )) > 0 for all u C. (ii) The transition kernel P(u, ) possesses a density p(u, y) in C, precisely P(u, A) = Z A p(u, y) dy, for all u C, A B(RN) B(C), and p(u, y) is jointly continuous on C C. How Deep Are Deep Gaussian Processes? (iii) There is a function V : RN [1, ), with limu V (u) = , and real numbers α (0, 1) and β [0, ) such that E(V (un+1) | un) αV (un) + β. If we can choose the compact set C such that C = u : V (u) 2β γ α for some γ ( α, 1), then there exists a unique invariant measure π. Furthermore, there is r(γ) (0, 1) and κ(γ) (0, ) such that for all u0 RN and all measurable g with |g(u)| V (u) for all u RN, we have |EPn(u0, )(g) π(g)| κrn V (u0). We may verify the assumptions of Proposition 7 leading to the following theorem concerning the ergodicity of deep Gaussian processes defined via the covariance function: Theorem 8 Suppose Assumptions 1 and 2 hold. Then the Markov chain {un} n=0 satisfies the assumptions of Proposition 7. As a consequence, there exists ε (0, 1) such that for any u0 RN, there is a K(u0) > 0 with Pn(u0, ) π TV K(1 ε)n for all n N, and so the chain is ergodic. The proof rests on the following three lemmas, and is given after stating and proving them. The first lemma shows that, on average, the norm of states of the chain remains constant as the length of the chain is increased. The second shows that, given any current state in RN and any ball around the origin in RN, there is a positive probability that the next state will belong to that ball. The third lemma shows that the probability that the Markov chain moves to a set may be found via integration of a continuous function over that set. Lemma 9 (Boundedness) Suppose Assumptions 1 and 2 hold. For all n N, we have E un+1 2 2 | un = N. Proof Let n 0. Since the random variable un+1|un has zero mean, the linearity of expectation implies (using (8)) that E un+1 2 2 | un = E N X j=1 (un+1)2 j = Tr(R(un)) = N, for all n N. Dunlop, Girolami, Stuart and Teckentrup Lemma 10 (Positive probability of a ball around zero) Suppose Assumptions 1 and 2 hold. For all u RN and δ > 0, we have P u, Bδ(0) > 0. Proof We have the equality un+1|(un = u) = p R(u)ξn+1 in distribution, where p R(u) denotes the Cholesky factor of the correlation matrix R(u) and ξn+1 N(0, IN). Then P u, Bδ(0) = P un 2 δ | un 1 = u R(u)ξn+1 2 δ R(u) 2 ξn+1 2 δ = P ξn+1 2 δ p To show that the latter probability is positive, we need to show that δ p R(u) 1 2 > 0. Since δ > 0 is fixed, we only need to show p R(u) 2 < . Since p R(u) 2 2 = ρ(R(u)), the spectral radius of R(u), we have p R(u) 2 2 = ρ(R(u)) Tr(R(u)) = N. The claim then follows. Lemma 11 (Transition probability has a density) Suppose Assumptions 1 and 2 hold. Then the transition probability P u, has a jointly continuous density p(u, y) for all u C, for any compact set C RN. Proof We have un+1|(un = u) N(0, R(u)), and the existence of a jointly continuous density of the transition probability in C follows if R(u) is positive definite for all u C. The claim then follows by Proposition 2. We may now use the three preceding lemmas to prove the main ergodic theorem for deep Gaussian processes defined through the covariance function. Proof of Theorem 8 Lemma 10 shows that assumption (i) is satisfied, for any C containing y = 0, and Lemma 11 shows that assumption (ii) is satisfied, for any compact set C. It follows from Lemma 9 that assumption (iii) is satisfied, with V (u) = u 2 2 +1, any α (0, 1) and β = N + 1. Now choose α = 1/4 and γ = 3/4 ( α, 1), so that the set C = u : V (u) 2β γ α = u : u 2 2 4N + 4 is compact. Then there is a unique invariant measure π, and there is r(γ) (0, 1) and κ(γ) (0, ) such that for u0 RN and all measurable g with |g(u)| V (u) for all u RN, we have |EPn(u0, )(g) π(g)| κrn V (u0). (9) How Deep Are Deep Gaussian Processes? Since V (u) 1 for all u RN, the above holds in particular for all measurable g with g 1. Taking the supremum over all such g in (9) yields the given total variation bound, with K = κV (u0) and ε = 1 r. Remark 12 (Covariance vs correlation kernels) In this subsection we have restricted our attention to correlation kernels ρS( xi xj 2) and ρ(xi, xj; un), rather than more general covariance kernels c S( xi xj 2) = σ2 SρS( xi xj 2), c(xi, xj; un) = σ(xi; un)σ(xj; un)ρ(xi, xj; un), for stationary and non-stationary marginal standard deviation functions σS (0, ) and σ : Rd (0, ) respectively. This restriction is solely for ease of presentation; the analysis presented readily extends to c(xi, xj; un), under suitable assumptions on σ. In particular the analysis may be adapted to the case of general covariance kernels c S( xi xj 2) and c(xi, xj; un) under the assumption that there exist positive constants σ , σ+ such that σ σ(z) σ+, for all z Rd. When general covariances are used then it is possible to ensure that every multivariate Gaussian random variable in the hierarchy is of the same amplitude by scaling the corresponding covariance matrix C(un) to have constant trace N at each iteration n; the average variance over all points {xi}N i=1 is then 1 for every n. 3.3 Covariance Operator We consider the class of covariance operators introduced in section 2.3 and show that, under precise assumptions detailed below, the iteration (Zero Mean) produces an ergodic Markov chain. Unlike the previous subsection, where we worked on RN, here we will work on the separable Hilbert space H = L2(D; R). To begin with, define the precision operators (densely defined on H: Hairer et al., 2005; Pinski et al., 2015), C 1 + = P + F+I, C(u) 1 = P + Γ(u), u H, and the probability measures µ = N(0, C ), µ+ = N(0, C+), µ( ; u) = N(0, C(u)), u H. Throughout the rest of this section we make the following assumptions on C and F: Assumptions 3 1. The operator C : H H is symmetric and positive, and its eigenvalues {λ2 j} have algebraic decay λ2 j j r for some r > 1. 2. The function F : H R is continuous, and there exists F+ 0 such that 0 F(u) F+ for all u H. Dunlop, Girolami, Stuart and Teckentrup Remark 13 1. The assumption on algebraic decay of the eigenvalues can be relaxed to the operator C being trace-class on H; however the arguments that follow are cleaner when we assume this explicit decay which, of course, implies the trace condition. Note also that, under the stated assumption on algebraic decay, Gaussian measures on L2(D; R) will be supported on X = C(D; R) under mild conditions on the eigenfunctions of C (Dashti and Stuart, 2017) so that F u(x) will be defined for all x D rather than x a.e. in D. Then Γ(u)v makes sense pointwise when v X. 2. The assumed form of the precision operator together with Assumptions 3 mean that the resulting family of measures {µ( ; u)}u H will be mutually equivalent. This allows for the total variation metric between measures to be used, and a concise proof of ergodicity to be obtained. If the measures were singular, a different metric such as the Wasserstein metric would be required to quantify the convergence. We now prove the following ergodic theorem for the deep Gaussian processes constructed through covariance operators. Theorem 14 Let Assumptions 3 hold, and let the Markov chain {un} be given by (Zero Mean) with L(u) = C(u) 1 2 as defined above. Then there exists a unique invariant distribution π, and there exists ε > 0 such that for any u0 H, Pn(u0, ) π TV (1 ε)n for all n N. In particular, the chain is ergodic. The following lemma will be used to show a minorization condition, as well as establish further notation, key to the proof of Theorem 14 which follows it. It essentially shows a stronger form of equivalence of the family of measures {µ( , u)}u H. Lemma 15 Let Assumptions 3 hold. Then there exists ε > 0 such that for any u, v H, Proof The assumptions on F mean that the measures µ( ; u), µ and µ+ are mutually absolutely continuous, with dµ (v) = 1 Z(u) exp 1 2 v, F(u)v , Z(u) = Eµ exp 1 2 v, F(u)v ; dµ+ dµ (v) = 1 Z+ exp 1 Z+ = Eµ exp 1 How Deep Are Deep Gaussian Processes? Observe that we may bound Z(u) 1 uniformly in u H since F 0. Additionally, we have that Z+ Eµ exp 1 2 v, F+v 1 v 2 1 µ v 2 1 =: ε > 0. Note that ε is positive since H is separable, and thus all balls have positive measure (Hairer, 2009). It follows that dµ+ (v) = dµ( ; u) = 1 Z(u) exp 1 2 v, F(u)v Z+ exp 1 2 v, F+ F(u) v since F+ bounds F above uniformly. Proof of Theorem 14 We first establish existence of at least one invariant distribution by showing that chain {un} is (strong) Feller, and that for each u0 H the family {Pn(u0, )} of transition kernels is tight. To see the former, let f : H R be any bounded measurable function. We have that, for any v H, (Pf)(u) := Z H f(v)P(u, dv) H f(v) 1 Z(u) exp 1 2 v, F(u)v µ (dv). Since F(u) F+ it follows that Z(u) is bounded below by a positive constant, uniformly with respect to u. Additionally F is continuous and non-negative, and so the integrand is bounded and continuous with respect to u. Hence given any sequence u(k) u in H, we may apply the dominated convergence theorem to see that (Pf)(u(k)) (Pf)(u). The function Pf is therefore continuous, and so the chain {un} is strong Feller. We now show tightness. The assumptions on the operator C : H H imply that it is trace-class, and so in particular compact. It is also positive and symmetric, and so by the spectral theorem, admits a complete orthonormal system of eigenvectors {ϕj} with corresponding positive eigenvalues {λ2 j} such that λ2 j 0. Given s > 0, define the subspace Hs H by Hs = v H v 2 Hs := j=1 j2s| ϕj, v |2 < . It is standard to show that Hs is compactly embedded in H for any s > 0 (see for example Robinson, 2001, Appendix A.2). By the Karhunen-Loéve theorem, any v µ may be represented as j=1 λjξjϕj, ξj N(0, 1) i.i.d. Dunlop, Girolami, Stuart and Teckentrup Hence, by the orthonormality of the {ϕj} and the assumed decay of the eigenvalues, we have that Eµ v 2 Hs = j=1 j2sλ2 j Eµ v 2 Hs < if and only if s < r Since r > 1 by assumption, we can always choose s > 0 such that this holds; fix such an s in what follows. Observe that, for any n N, E un 2 Hs) = E E un 2 Hs un 1 = E Eµ( ;un 1) v 2 Hs H v 2 Hs 1 Z(un 1) exp 1 2 v, F(un 1)v µ (dv) 1 Z+ Eµ v 2 Hs We have bounded Z(un 1) Z+ using that F(un 1) F+. Applying the Chebychev inequality, we have for each n N and R > 0 P un Hs > R E un 2 Hs and so given any κ > 0, This can be rewritten as Pn(u0, Kκ) 1 κ where Kκ = u H | u Hs p M/κ is compact in H, since Hs is compactly embedded in H; this shows tightness of the sequence of probability measures Pn(u0, ). Since tightness implies boundedness in probability on average, an application of (Meyn and Tweedie, 2012, Theorem 12.0.1) gives existence of an invariant distribution. Lemma 15 shows that {un} satisfies a global minorization condition for the one-step transition probabilities: for any u0 H and any measurable A H, P u0, A = Eµ( ;u0) 1A(v) = Eµ+ dµ( ; u0) dµ+ (v)1A(v) εµ+(A). Combined with the existence of an invariant distribution above, a short coupling argument (Meyn and Tweedie, 2012, Theorem 16.2.4) gives the result with the same ε as above. How Deep Are Deep Gaussian Processes? 3.4 Convolution The convolution iteration has the advantage that, through use of Fourier series and the law of large numbers, its long time behaviour can be completely characterized analytically. We consider the convolution as a random map on H = L2(D; C), D = (0, 1)d. The iteration is given by un+1(x) = (un ξn+1)(x) := Z D un(x y)ξn+1(y) dy, ξn+1 N(0, C) i.i.d. (10) where we implicitly work with periodic extensions to define the convolution. We assume that C is a negative fractional power of a differential operator so that it diagonalizes in Fourier space; such a form of covariance operator is common in applications, as it includes, for example, Whittle-Matérn distributions (Lindgren et al., 2011). For example, we may take C = (I ) α, D( ) = H2 per([0, 1]d) H, in which case the samples ξn+1 N(0, C) will (almost surely) possess s fractional Sobolev and Hölder derivatives for any s < α d/2 (see Dashti and Stuart, 2017, for details). We choose the orthonormal Fourier basis ϕk(x) = e2πik x, k Zd, which are the eigenvectors of C; we denote the corresponding eigenvalues {λ2 k}. Given u H and k Zd, define the Fourier coefficient ˆu(k) C by ˆu(k) := ϕk, u L2 = Z ϕk(x)u(x) dx. Then it can be readily checked that for any u, v H and k Zd, \ (u v)(k) = ˆu(k)ˆv(k). (11) We use this property to establish the following theorem. Theorem 16 Let C : H H be a negative fractional power of a differential operator such that C is positive, symmetric and trace-class, with eigenvectors {ψk} and eigenvalues {λ2 k}. Define the Markov chain {un} by (10). Then for any u0 H, lim n |ˆun(k)|2 = ( 0 |λk|2 < 2eγ |λk|2 > 2eγ almost surely where γ 0.577 is the Euler-Mascheroni constant. In particular, if |λk|2 < 2eγ for all k Zd, then every Fourier coefficient of un tends to zero almost surely and hence un 0 in H almost surely. Proof First observe that by the Karhunen-Loéve theorem, we may express ξn+1 N(0, C) as ξn+1 = X k Zd λkηn,kϕk, ηn,k N(0, 1) i.i.d. Dunlop, Girolami, Stuart and Teckentrup and so, since {ϕk} is orthonormal, ˆξn+1(k) = λkηn,k. Then by the property (11), we see that for each k Zd and n N, ˆun+1(k) = ˆun(k)ˆξn+1(k) = ˆun(k)λkηn,k (12) where the second equality is in distribution. The problem has now been reduced to an independent family of scalar problems. We can write ˆun(k) explicitly as ˆun(k) = ˆu0(k) j=1 λkηj,k. (13) Now observe that |ˆun(k)|2 = |ˆu0(k)|2 n Y j=1 |λk|2|ηj,k|2 = |ˆu0(k)|2 exp j=1 log |λk|2|ηj,k|2 = |ˆu0(k)|2 exp j=1 log |ηj,k|2 + log |λk|2 By the strong law of large numbers, the scaled sum inside the exponential converges almost surely to E(log |η1,k|2). This can be calculated as E(log |η1,k|2) = γ log 2. If the bracketed term inside the exponential in (14) is eventually negative almost surely, then the limit of |un(k)|2 will be zero almost surely. This is guaranteed when γ log 2 + log |λk|2 < 0, i.e. |λk|2 < 2eγ. Similarly we get divergence if the bracketed term is eventually positive, which happens when |λk|2 > 2eγ. Remark 17 It is interesting to note that we may take expectations in (12) to establish that E|ˆun(k)|2 = |ˆu0(k)|2|λk|2n lim n E|ˆun(k)|2 = ( 0 |λk|2 < 1 |λk|2 > 1 . In particular, if |λk|2 (1, 2eγ), then |ˆun(k)|2 converges to zero almost surely, but diverges in mean square. How Deep Are Deep Gaussian Processes? Via a slight modification of the above proof to account for different boundary conditions, we have the following result. Corollary 18 Let D = (0, 1) and let {un} be defined by the iteration (10), where each ξn+1 is a Brownian bridge. Then un 0 almost surely. Proof The Brownian bridge on [0, 1] has covariance operator ( ) 1, where D( ) = {u H2 per([0, 1]) | u(0) = u(1) = 0}. The result of Theorem 16 cannot be applied directly, since the basis functions {ϕk} do not satisfy the boundary conditions. The eigenfunctions with the correct boundary conditions are given by 2 sin(jπx) = 1 ϕj(x) ϕ j(x) , j 1 with corresponding eigenvalues α2 j = (π2j2) 1. A Brownian bridge ξn+1 N(0, ( ) 1) can then be expressed as j=1 αjζn,jψj, ζn,j N(0, 1) i.i.d. by the Karhunen-Loéve theorem. We calculate ˆun+1(k) = ˆun(k)ˆξn(k) j=1 αjζn,j ϕk, ψj j=1 αjζn,j 1 ϕk, ϕj ϕk, ϕ j = ˆun(k)sgn(k)α|k| = ˆun(k)λkηn,k, ηn,k N(0, 1). We can now proceed as in Theorem 16 to deduce that |un(k)|2 0 whenever |λk|2 < 2eγ; note that the correlations between ˆun(k) and ˆun( k) do not affect the argument. Now observe that |λk|2 = (2π2k2) 1 < 1 < 2eγ for all k, and the result follows. Remark 19 The preceding results also holds if we replace the Brownian bridge by a Gaussian process with precision operator the negative Laplacian subject to Neumann boundary conditions and spatial mean zero; the eigenfunctions are then 2 cos(jπx) = 1 ϕj(x) + ϕ j(x) , j 1. The argument is identical, except no sgn(k) term appears in λk. Dunlop, Girolami, Stuart and Teckentrup 4. Numerical Illustrations We now study two of the constructions of deep Gaussian processes numerically. In subsection 4.1 we look at realizations of the deep Gaussian process constructed using the covariance function formulation, and in subsection 4.2 we perform similar experiments for the covariance operator formulations. Finally we consider Bayesian inverse problems, in which we choose deep Gaussian processes as our prior distributions; we introduce a function space MCMC algorithm, which scales well under mesh refinement of the functions to be inferred, for sampling. For the composition construction, numerical experiments are given by, for example, Damianou and Lawrence (2013); Duvenaud et al. (2014). We do not provide numerical experiments for the convolution construction; Theorem 16 tells us that interesting behaviour cannot be expected in this case. 4.1 Covariance Function We start by investigating typical realizations of a deep Gaussian process, constructed through anisotropic covariance kernels as in section 2.2. As the basis of our construction, we choose a stationary Gaussian correlation kernel, given by ρS(r) = exp( r2), r > 0. The function F determining the length scale of the kernel ρ( , ; un) is chosen as F(x) = x2, such that Σ(z) = (un(z))2 Id. Similar results are obtained with other choices of F in terms of the distribution of samples un. The choice of F does, however, influence the conditioning of the correlation matrix R(un), and the choice F(x) = exp(x), for example, can lead to numerical instabilities. As described in section 2.2, we will sample from the finite dimensional distributions obtained by sampling from the Gaussian process at a finite number of points in the domain D. To generate the samples, we use the command mvnrnd in MATLAB, and when plotting the samples, we use linear interpolation. In Figure 1, we show four independent realizations of the first seven layers u0, . . . , u6, where u0 is taken as a sample of the stationary Gaussian process with correlation kernel ρS. The domain D is here chosen as the interval (0, 1), and the sampling points are given by the uniform grid xi = i 1 256 , for i = 1, . . . , 257. Each column in Figure 1 corresponds to one realization, and each row corresponds to a given layer un, the first row showing u0. We can clearly see the non-stationary behaviour in the samples when progressing through the levels. We note that the ergodicity of the chain is also reflected in the samples, with the distribution of the samples un looking similar for larger values of n. Figure 2 shows the same information as Figure 1, in the case where the domain D is (0, 1)2 and the sampling points are the tensor product of the one-dimensional points x1 i = i 1 64 , for i = 1, . . . , 65. 4.2 Covariance Operator We now consider the covariance operator construction of the deep Gaussian process. In order to produce more interesting behaviour in the samples, we move away from the absolutely continuous setting considered in section 3.3 by introducing a rescaling of C(u) that depends How Deep Are Deep Gaussian Processes? 0 0.5 1 0.6 0 0.5 1 -0.5 u0 u1 u2 u3 u4 u5 u6 Sample 1 Sample 2 Sample 3 Sample 4 Figure 1: Four independent realizations of the first seven layers of a deep Gaussian process, in one spatial dimension, using the covariance kernel construction described in subsection 2.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom. Dunlop, Girolami, Stuart and Teckentrup u0 u1 u2 u3 u4 u5 u6 Sample 1 Sample 2 Sample 3 Sample 4 Figure 2: Four independent realizations of the first seven layers of a deep Gaussian process, in two spatial dimensions, using the covariance kernel construction described in subsection 2.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom. How Deep Are Deep Gaussian Processes? on u. This scaling is chosen so that the amplitude of samples is O(1) with respect to u. The rescaled family can be shown to satisfy Assumptions 3, and a minorization condition as in Lemma 15 can also be shown to hold when the state space is finite-dimensional. From this we can deduce that the resulting discretized process will still be ergodic. Assume D Rd and define the negative Laplacian on D( ), D( ) = u H2(D; R) du dν (x) = 0 for x D , where ν is the outward normal to D. Given α > d/2, σ > 0, we define P = and C(u) 1 = σ 2(P + Γ(u))α/2Γ(u)d/2 α(P + Γ(u))α/2 (15) where Γ(u)v (x) = F u(x) v(x). The scaling introduced is inspired by the SPDE representation of Whittle-Matérn distributions (Lindgren et al., 2011); if F(u) = τ 2 is chosen to be constant, then modulo boundary conditions, samples from a centred Gaussian distribution with covariance C(u) are samples from a Whittle-Matérn distribution. In particular, τ corresponds to the inverse length-scale of samples, and samples almost-surely have s Sobolev and Hölder and derivatives for any s < α d/2. For numerical experiments, we take F(u) = min{F + aebu2, F+} for some F+, F , a, b > 0. In particular, in one spatial dimension we take F+ = 1502, F = 200, a = 100 and b = 2. In two dimensions, we take F+ = 1502, F = 50, a = 25 and b = 0.3. We take α = 4 in both cases, and choose σ such that E u(x)2 1. These parameter choices were made empirically to ensure interesting structure of the samples. In order to generate samples at a given level, the negative Laplacian P is constructed using a finite-difference method. Given u, the operator A(u) is then computed, A(u) := σ 1Γ(u)d/4 α/2(P + Γ(u))α/2, so that v N(0, C(u)) solves the SPDE A(u)v = ξ, where ξ is white noise. In Figure 3 we show samples of the deep Gaussian process on domain D = (0, 1), sampled on the uniform grid xi = i 1 1000, for i = 1, . . . , 1001. We show 4 independent realizations of the first seven layers of the process each row corresponds to a given layer un. The anisotropy of the length-scale is evident in levels beyond u0, and the effect of ergodicity is evident, with deeper levels having similar properties. Compared to the covariance function construction, local effects are less prominent, though a greater level of anisotropy could potentially be obtained by making an alternative choice of F( ). Figure 4 shows the same experiments on domain D = (0, 1)2, sampled on the tensor product of the one-dimensional points x1 i = i 1 150 , for i = 1, . . . , 151, and the same effects are observed. Figure 5 shows the trace of the norm of a DGP {un} with d = 1, along with the running mean of these norms; the rapid convergence of the mean reflects the ergodicity of the chain. We emphasize that our perspective on inference includes quite general inverse problems, and is not limited to the problems of regression and classification which dominate much of Dunlop, Girolami, Stuart and Teckentrup 0 0.5 1 -10 0 0.5 1 -10 0 0.5 1 -10 0 0.5 1 -10 0 0.5 1 -10 u0 u1 u2 u3 u4 u5 u6 Sample 1 Sample 2 Sample 3 Sample 4 Figure 3: Four independent realizations of the first seven layers of a deep Gaussian process, in one spatial dimension, using the covariance operator construction described in subsection 4.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom. How Deep Are Deep Gaussian Processes? u0 u1 u2 u3 u4 u5 u6 Sample 1 Sample 2 Sample 3 Sample 4 Figure 4: Four independent realizations of the first seven layers of a deep Gaussian process, in two spatial dimensions, using the covariance operator construction described in subsection 4.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom. Dunlop, Girolami, Stuart and Teckentrup 0 100 200 300 400 500 600 700 800 900 1000 0 Figure 5: The trace of the norm of un versus n for a 1000 layer DGP {un} as in Figure 3. The thick black curve shows the running mean of the norms. classical machine learning; this broad perspective on the potential for the methodology affects the choice of algorithms that we study as we do not exploit any of the special structures that arise in regression and classification. The deep Gaussian processes discussed in the previous sections were introduced with the idea of providing flexible prior distributions for inference, for example in inverse problems. The structure of such problems is as follows. We have data y RJ arising via the model y = G(u) + η (16) where η is a realization of some additive noise, and G : X RJ is a (typically non-linear) forward map. The map G may involve, for example, solution of a partial differential equation which takes function u as input, or point evaluations of a function u, regression. In this paper we will fix X = HN, writing u = (u0, . . . , u N 1) X; our prior beliefs on u will then be characterized by the first N states of a Markov chain of a form considered in the previous sections. Note that the map G could incorporate a projection map if the dependence is only upon a single state u N 1; indeed this is the canonical example the variables (u0, . . . , u N 2) are viewed as hyperparameters in a prior on the parameter u N 1. 4.2.1 Algorithms We now turn to the design of algorithms for the Bayesian inference problems of sampling u|y. As already mentioned above, we are typically only interested in sampling the deepest layer u N 1|y. However, due to the hierarchical definition of u N 1 given all the components of u, our algorithms work with the full set of layers u. Since the components of u are functions, and hence infinite dimensional objects in general, a guiding principle is to design algorithms which are well-defined on function space, an approach to MCMC inference reviewed by Cotter et al. (2013); the value of this approach is that it leads to algorithms whose mixing time is not dependent on the number of mesh points used to represent the function to be inferred. For simplicity of exposition we assume that the observational noise η is distributed How Deep Are Deep Gaussian Processes? as N(0, Γ); this is not central to our developments but makes the exposition concrete. Recalling that the Markov chain defining the prior beliefs is given by (Zero Mean), we can consider the unknowns in the problem to be the variables u = (u0, . . . , u N 1), which are correlated under the prior, or the variables ξ = (ξ0, . . . , ξN 1), where we define ξ0 = u0, which are independent under the prior. These variables are related via u = T(ξ), where the components of the deterministic map T : X X are defined iteratively by T1(ξ0, . . . , ξN 1) = ξ0, Tn+1(ξ0, . . . , ξN 1) = L(Tn(ξ0, . . . , ξN 1))ξn, n = 1, . . . , N 1. The data may then be expressed in terms of ξ rather than u: y = G(ξ) + η = G(T(ξ)) + η (17) where our prior belief on ξ is that its components are i.i.d. Gaussians. To be consistent with the notation introduced by Papaspiliopoulos et al. (2007); Yu and Meng (2011), (16) will be referred to as the centred model and (17) will be referred to as the non-centred model. The space H may be chosen differently in the centred and non-centred cases. Associated with the two data models are two likelihoods: P(y|u) and P(y|ξ). Assuming that the observational noise η N(0, Γ) is Gaussian, where Γ RJ J is a positive definite covariance matrix, the likelihoods are given by P(y|u) = 1 Z(y) exp Φ(u; y) , Φ(u; y) := 1 2 (y G(u)) 2, P(y|ξ) = 1 Z(y) exp Φ(ξ; y) , Φ(ξ; y) := 1 2 (y G(ξ)) 2. We may then apply Bayes theorem to write down the posterior distributions P(u|y) and P(ξ|y): P(u|y) P(y|u)P(u) exp Φ(u; y) P(u), P(ξ|y) P(y|ξ)P(ξ) exp Φ(ξ; y) P(ξ). We know (Cotter et al., 2013) that it is straightforward to design algorithms to sample P(ξ|y) which are well-defined in infinite dimensions, exploiting the fact that P(ξ) is Gaussian. An example of such an algorithm given by Algorithm 1. This algorithm produces a chain {ξ(k)}k N that samples P(ξ|y) in stationarity; and {T(ξ(k))}k N will be samples of P(u|y). By working in non-centred coordinates we have been able to design this algorithm which is well-defined on function space. If we were to work with the centred coordinates u directly, the algorithm would not be well-defined on function space: in infinite dimensions, each family of measures {P(un|un 1)}un 1 H will typically be mutually singular, and so a proposed update u 7 ˆu will almost surely be rejected. To see why this rejection occurs in practice, in high finite dimensions K, notice that the acceptance probability for an update u 7 ˆu will involve the ratios of the Gaussian densities N(un; 0, C(un 1)) and N(un; 0, C(ˆun 1)). These densities will decay to zero as the dimension K is increased, and their ratio will only be well-defined in the limit if the measures are equivalent; consequently, the Markov chain will mix very poorly. Working with the Dunlop, Girolami, Stuart and Teckentrup Algorithm 1 Non-Centred Algorithm 1. Fix β0, . . . , βN 1 (0, 1] and define B = diag(βj). Choose initial state ξ(0) X, and set u(0) = T(ξ(0)) X. Set k = 0. 2. Propose ˆξ(k) = (I B2) 1 2 ξ(k) + Bζ(k) j , ζ(k) N(0, I). 3. Set ξ(k+1) = ˆξ(k) with probability αk = min n 1, exp Φ T(ξ(k)); y Φ T(ˆξ(k)); y o ; otherwise set ξ(k+1) = ξ(k). 4. Set k 7 k + 1 and go to 1. non-centred coordinates ξ, the prior does not appear in the acceptance probability and so this issue is circumvented. Another advantage of using the non-centred coordinates is that there is no need to calculate the (divergent) log determinants which appear in the centred acceptance probability, avoiding potential numerical issues. These issues are discussed in greater depth and generality by Chen et al. (2018). For the reasons set-out in that paper, including those above, we have used only the non-centred algorithm in what follows. When the forward model G(u) = Au is linear, the non-centred algorithm can be combined with standard Gaussian process regression techniques via the identity P(du N|y) = Z X P(du N|u N 1, y)P(du N 1|y). The distribution P(du N|u N 1, y) = N(my(u N 1), Cy(u N 1)) is Gaussian, where expressions for my, Cy are known, and so direct sampling methods are available. On the other hand, we have that P(y|u N 1) = N(0, AC(u N 1)A + Γ), and so we may use the non-centred algorithm to robustly sample the measure P(du N 1|y) = exp( Ψ(u N 1; y))P(du N 1), Ψ(u N 1; y) = 1 2 y 2 AC(u N 1)A +Γ + 1 2 log det(AC(u N 1)A + Γ), after reparametrizing in terms of ξ. This approach can be viable even when the data is particularly informative so that Φ is very singular this singularity does not in general pass to Ψ. It is this approach that we use for the simulations in the following subsections. An alternative approach not based on MCMC would be to use the non-centred parameterization of the Ensemble Kalman Filter (Chada et al., 2018) which we have successfully implemented in the context of the deep Gaussian processes of this paper, but do not show here for reasons of brevity. 4.3 Application to Regression We consider the application of the non-centered algorithm described above to simple regression problems in one and two spatial dimensions. How Deep Are Deep Gaussian Processes? 4.3.1 One-Dimensional Simulations We consider first the case D = (0, 1), where the forward map is given by a number of point evaluations: Gj(u) = u(xj) for some sequence {xj}J j=1 D. We compare the quality of reconstruction versus both the number of point evaluations and the number of levels in the deep Gaussian prior. We use the same parameters for the family of covariance operators as in subsection 4.2. The base layer u0 is taken to be Gaussian with covariance of the form (15), with Γ(u) 202. The true unknown field u is given by the indicator function u = 1(0.3,0.7), shown in Figure 6. It is generated on a mesh of 400 points, and three data sets are created wherein it is observed on uniform grids of J = 25, 50 and 100 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 200 points to avoid an inverse crime (Kaipio and Somersalo, 2006). 106 samples are generated per chain, with the first 2 105 discarded as burn-in when calculating means. The jump parameters βj are adaptively tuned to keep acceptance rates close to 30%. In these experiments the deepest field is labelled as u N, rather than as u N 1 as in the statement of the algorithm; this is purely for notational convenience, of course. In Figure 7 the means of the deepest field u N and of the length-scales associated with each hidden layer are shown, that is, approximations to E u N and E F(uj) 1 2 for each j = 0, . . . , N 1. We see that, in all cases, the reconstructions of u are visually similar when two or more layers are used, and similar length-scale fields E F(u N 1) 1 2 are obtained in these cases. The sharpness of these length-scale fields is related to the amount of data. Additionally, when N = 4 and J = 100 the location of the discontinuities is visible in the estimate for E F(u N 2) 1 2 , suggesting the higher quality data can influence the process more deeply. When J = 50 or J = 25, this layer does not appear to be significantly informed. When a single layer prior is used, the reconstruction fails to accurately capture the discontinuities. Figure 7 also shows bands of quantiles of the values u(x) under the posterior, illustrating their distribution; in particular the lack of symmetry and disagreement of the means and medians show that the posterior is clearly non-Gaussian. Uncertainty increases both as the number of observations J and the layer n in the chain is increased. Note in particular the over-confidence of the shallow Gaussian process posterior: the truth is not contained within 95% credible intervals in all cases. In Table 1 we show the L1-errors between the true field and the posterior means arising from the different setups. The errors decrease as the number of observation points is increased, as would be expected. Additionally, when J = 100 and J = 50, the accuracy of the reconstruction increases with the number of layers, though the most significant increase occurs when increasing from 1 to 2 layers. When J = 25, the error increases beyond 2 layers, suggesting that some balance is required between the quality of the data and the flexibility of the prior. In Figure 8 we replace the uniformly spaced observations with 106 randomly placed observations, to illustrate the effect of very high quality data. With 3 or 4 layers, more anisotropic behavior is observed in the length-scale field. Additionally, the layer u N 2 is much more strongly informed than the cases with fewer observations, though the layer u N 3 in the case N = 4 does not appear to be informed at all, indicating a limitation on how deeply the process can be influenced by data. The corresponding errors are shown in Table 1 as Dunlop, Girolami, Stuart and Teckentrup 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -0.5 Figure 6: The true field used to generate the data for the one-dimensional inverse problem. J 1 layer 2 layers 3 layers 4 layers 100 0.0485 0.0200 0.0198 0.0196 50 0.0568 0.0339 0.0339 0.0337 25 0.0746 0.0658 0.0667 0.0670 106 0.0131 0.000145 0.000133 0.000133 Table 1: The L1-errors u E(u N) L1 between the true field and sample means for the onedimensional simulations shown in Figure 7, for different numbers of data points J and layers N. Also shown are the corresponding errors for the simulations shown in Figure 8 in the cases N = 50, 100, more layers increases the accuracy of the mean, with diminishing returns for each additional layer. Note that higher accuracy could be attained in the single layer case by adjusting the constant length-scale parameter. Finally, in Figure 9, we consider the same experiment as in Figure 7, except observations are limited to the subset (0, 0.5) of the domain. Uncertainty is naturally higher in the unobserved portion of the domain. Uncertainty also increases in the observed layer u N as N is increased; this could suggest that deep Gaussian processes may provide better generalization to unseen data than shallow Gaussian processes note that the truth has much higher probability under the posterior with 4 layers versus just 1. 4.3.2 Two-Dimensional Simulations We now consider the case D = (0, 1)2, again where the forward map is given by a number of point evaluations. We fix the number of point observations J = 210, on a 25 25 uniform grid. We again compare quality of reconstruction versus the number of point evaluations and the number of levels in the deep Gaussian prior, and use the same parameters for the How Deep Are Deep Gaussian Processes? Figure 7: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem. Number of data points taken are J = 100 (top block), J = 50 (middle block), J = 25 (bottom block). From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom within each block, N = 4, 3, 2, 1. Dunlop, Girolami, Stuart and Teckentrup Figure 8: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem, with J = 106 data points. From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom, N = 4, 3, 2, 1. family of covariance operators as in subsection 4.2. The base layer u0 is taken to be Gaussian with covariance of the form (15), with Γ(u) 202. The true unknown field u is constructed as a linear combination of truncated trigonometric functions with different length-scales, and shown in Figure 10 along with its contours. It is given by u (x, y) = cos(2πx) cos(2πy) + sin(4πx) sin(4πy)1(1/4,3/4)2(x, y) + sin(8πx) sin(8πy)1(1/2,3/4)2(x, y) + sin(16πx) sin(16πy)1(1/4,1/2)2(x, y). It is generated on a uniform square mesh of 214 points, and two data sets are created wherein it is observed on uniform square grid of J = 210, 28 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 212 points to again avoid an inverse crime. 4 105 samples are generated per chain, with the first 2 105 discarded as burn-in when calculating means. Again the jump parameters βj are adaptively tuned to keep acceptance rates close to 30%. In Figure 11, analogously to Figure 7, the means of u N and of the length-scales associated with each layer are shown, for N = 1, 2, 3. When J = 210, reconstructions are similar, though quality is generally proportional to the number of layers. In particular the, effect of too short a length-scale is evident in the case N = 1, in the regions where the length-scale should be larger, and conversely the effect of too long a length-scale is evident in the cases N = 1, 2 in the region where the length-scale should be the shortest. In the cases N = 2, 3, the length-scale fields E F(u N 1) 1 2 are similar, though in the case N = 3 more accurately captures the true length-scales. When J = 28 the reconstructions are again similar, though How Deep Are Deep Gaussian Processes? Figure 9: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem. Number of data points taken are J = 100 (top block), J = 25 (bottom block). From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom within each block, N = 4, 3, 2, 1. Dunlop, Girolami, Stuart and Teckentrup 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 0.8 1 0 Figure 10: The true field used to generate the data for the two-dimensional inverse problem. J 1 layer 2 layers 3 layers 210 0.0856 0.0813 0.0681 28 0.1310 0.1260 0.1279 Table 2: The L2-errors u E(u N) L2 between the true field and sample means for the twodimensional simulations shown in Figure 11, for different numbers of data points J and layers N. there is now less accuracy in the shapes of the contours. In particular, the effect of too short a length-scale is especially evident in the case N = 1. The values of the reconstructed fields in the area of shortest length-scale are inaccurate in all cases the positions of the observation points meant that the actual values of the peaks were not reflected in the data. The fields E F(u N 1) 1 2 have similar structure to the case J = 210, though less accurately represent the true length scales. The L2-errors between the means and the truth are shown in Table 2 5. Conclusions, Discussion, and Actionable Advice In this section we provide an overview of the advantages and disadvantages of each of the four different DGP constructions considered in the paper, and summarize actionable advice that can be taken from the theoretical and numerical results that have been presented. We then outline a number of directions that would be interesting for future study. How Deep Are Deep Gaussian Processes? N = 3 N = 2 N = 1 N = 3 N = 2 N = 1 Figure 11: Estimates of posterior means arising from two-dimensional inverse problem. (Top block) J = 210, (Bottom block) J = 28. From left-to right, E u N , E F(u N 1) 1 2 , E F(u N 2) 1 2 . From top-to-bottom within each block, N = 3, 2, 1. Dunlop, Girolami, Stuart and Teckentrup 5.1 Comparison of Deep GP Constructions We have considered four different constructions of deep GPs and we now discuss their relative merits. We also consider the context of variational inference which is popular in machine learning primarily because of its tractability. We emphasize however that it forms an uncontrolled approximation of the true posterior distribution and may fail to adequately represent the posterior distribution, and uncertainty in particular. The composition construction is the classical construction introduced by Damianou and Lawrence (2013), building a hierarchy of layers using a stationary covariance function and composition. It has received the most study, and methods for variational inference have already been established. It has the advantage of scaling well with respect to data dimension d, however accurate sampling methods such as MCMC are intractable for large numbers of data points, due to the requirement to construct and factor dense covariance matrices at every step. The covariance function construction builds the hierarchy using a stationary covariance function, and iteratively modifying its associated length scale. It has the advantage that each layer can be readily interpreted as the anisotropic length-scale field of the following layer. Its scaling properties are similar to those of the composition construction, however variational inference methods for this construction have not yet been studied. The covariance operator construction builds the hierarchy using an SPDE representation of stationary Matèrn fields, and again iteratively modifies their associated length scale. It allows for fast sampling in low data dimension d via the use of PDE solvers, even when the number of data points is large. Accurate sampling via MCMC methods is tractable with this construction, due to the low cost of constructing and storing the inverse covariance (precision) matrix. Inference when d is large appears to be intractable at present, due to the requirement of dense meshes for PDE solvers. Finally, the convolution construction builds the hierarchy via iterative convolution of Gaussian random fields. It has the advantage of being amenable to analysis, however the results of this analysis indicate that it would likely be a poor construction to use for inference due to trivial behaviour for large depth. To summarize the numerical results on illustrative regression problems from the previous section, if the data is high quality, a small number of layers in the DGP will be sufficient as the problem becomes closer to interpolation. Conversely, if the data is low quality the likelihood is not strong enough to inform deeper layers in the DGP, and so a small number of layers is again sufficient. As a consequence, when the data lies between these two cases, and the truth has sufficiently rich structure, the use of deeper processes may be advantageous, but care is required to limit the number of layers employed. 5.2 Summary and Future Work There are a number of interesting ways in which this work may be generalized. Within the context of covariance operators it is of interest to construct covariances C(u) which are defined as L α with L being the divergence form elliptic operator Lu = F(u) u . How Deep Are Deep Gaussian Processes? Such a construction allows for the conditional distributions of the layers to be viewed as stationary on deformed spaces (Lindgren et al., 2011, 3.4), or to incorporate anisotropy in specific directions (Roininen et al., 2014, 3.1). Similar notions of anisotropy in different directions can be incorporated into the covariance function formulation by choosing the length scale Σ(z) different to a multiple of the identity matrix. Additionally, we could consider a non-zero mean in the iteration (GP), as considered by Duvenaud et al. (2014); Salimbeni and Deisenroth (2017), allowing for forcing of the system. For example, with the choice m(un) = un and a rescaling of the covariance, we obtain the Res Net-type iteration un+1 = un + p t L(un)ξn+1. This may be viewed as a discretization of the continuous-time stochastic differential equation du = L(u)L(u) d W, analogously to what has been considered for neural networks (Haber and Ruthotto, 2017). Study of these systems could be insightful, for example deriving conditions to ensure a lack of ergodicity and hence arbitrary depth. As before denotes the adjoint operation. And finally it is possible to consider processes outside the four categories considered here; for example the one-step transition from un to un+1 might be defined via stochastic integration against i.i.d. Brownian motions. We have shown how a number of ideas in the literature may be recursed to produce deep Gaussian processes, different from those introduced by Damianou and Lawrence (2013). We have studied the effective depth of these processes, either through demonstrating ergodicity, or through showing convergence to a trivial solution (such as 0 or ). Together these results demonstrate that, as also shown by Duvenaud et al. (2014) for the original construction of deep Gaussian processes, care is needed in order to design processes with significant depth. Nonetheless, even a few layers can be useful for inference purposes, and we have demonstrated this also. It is an interesting question to ask precisely how the approximation power and effective depth are affected by the number of layers of the process, both in the non-ergodic case, and in the ergodic case before stationarity has been reached. We also emphasize that the analysis in the paper is based solely on the deep Gaussian process un, and not the conditioned process un|y in the inference problem with observed data y. The ergodicity properties of un do not directly carry over to un|y. As we have seen in the numerical experiments, the number of layers required in the inference problem in practice depends on the information content in the observed data y, and the analysis in this paper does not fully answer the question as to how many. The results in this paper do show, however, that in the case of ergodic constructions, the expressive power of the prior distribution in the inference problem does not increase past a certain number of layers. This provides some justification for using only a moderate number of layers in a deep Gaussian process prior in inference problems. There are interesting approximation theory questions around deep processes, such as those identified in the context of neural networks by Pinkus (1999). There are also interesting questions around the use of these deep processes for inversion; in particular it seems hard to get significant value from using depth of more than two or three layers for noisy inverse problems. On the algorithmic side the issue of efficiently sampling these deep processes Dunlop, Girolami, Stuart and Teckentrup (even over only two layers), when conditioned on possibly nonlinear observations remains open. We have used non-centred parameterizations because these may be sampled using function-space MCMC (Cotter et al., 2013; Chen et al., 2018); but centred methods, or mixtures, may be desirable for some applications. Acknowledgments MG is supported by EPSRC grants [EP/R034710/1, EP/R018413/1, EP/R004889/1, EP/P020720/1], an EPSRC Established Career Fellowship EP/J016934/3, a Royal Academy of Engineering Research Chair, and The Lloyds Register Foundation Programme on Data Centric Engineering. AMS is supported by AFOSR Grant FA9550-17-1-0185 and by US National Science Foundation (NSF) grant DMS 1818977. ALT is partially supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. Appendix A. Proofs for Section 2 Proof of Proposition 1 The stationary kernel ρS is positive definite by Assumption 1, and so by (Wendland, 2004, Theorem 7.14), we have 0 exp( r2t) dν(t) for all r [0, ), for a finite, non-negative Borel measure ν on [0, ) that is not concentrated at 0 (i.e. it is not a multiple of the Dirac measure centred at 0). For any x Rd and t [0, ), let us now define the matrix Σt(x) := (4t) 1Σ(x) and the functions Kx,t(z) = 1 (2π)d/2| Σt(x)|1/2 exp 1 2(x z)T Σt(x) 1(x z) . Here | | denotes determinant and so the preceding is simply an expression for a normal density with mean x and covariance matrix Σt(x) when t > 0; at t = 0, we simply have Kx,t(z) = 0, for all x, z Rd. Then ρ(x, x ) is given by 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4 |Σ(x) + Σ(x )| 1 2 ρS p Q(x, x ) = 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4 |Σ(x) + Σ(x )| 1 2 0 exp t Q(x, x ) dν(t) = 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4 |Σ(x) + Σ(x )| 1 2 0 exp t(x x )T Σ(x) + Σ(x ) 1 (x x ) dν(t) | Σt(x)| 1 4 | Σt(x )| 1 4 | Σt(x) + Σt(x )| 1 2 exp 1 2(x x )T Σt(x) + Σt(x ) 1 (x x ) dν(t) = (2π) d 2 2 d 2 Z 0 | Σt(x)| 1 4 | Σt(x )| 1 4 Z Rd Kx,t(z)Kx ,t(z) dz dν(t), where in the last step, we have used the fact that the convolution R Rd Kx,t(z)Kx ,t(z) dz can be calculated explicitly using properties of normal random variables. More precisely, we How Deep Are Deep Gaussian Processes? Rd Kx,t(z)Kx ,t(z) dz = Z Rd p X(z x)p X (z) dz = Z Rd p X,X (z x, z) dz, where p X is the density of X N(0, Σt(x)), p X is the density of X N(x , Σt(x )) and X and X are independent. The change of variable from X, X to W, X , where W = X X, has Jacobian 1, and so Z Rd p X,X (z x, z) dz = Z Rd p W,X (z (z x), z) dz = Z Rd p W,X (x, z) dz = p W (x). Since W = X X N(x , Σt(x) + Σt(x )), we hence have Z Rd Kx,t(z)Kx ,t(z) dz (2π) d 2 | Σt(x) + Σt(x )| 1 2 exp 1 2(x x )T Σt(x) + Σt(x ) 1 (x x ) , as required. Now, for any b RN and pairwise distinct {xi}N i=1, we then have j=1 bibjρ(xi, xj) = (2π) d 2 2 d 2 Rd | Σt(xi)| 1 4 Kxi,t(z)| Σt(xj)| 1 4 Kxj,t(z) dz dν(t) = (2π) d 2 2 d 2 Z i=1 bi| Σt(xi)| 1 4 Kxi,t(z) since the Borel measure ν is finite and non-negative. It remains to show that strict inequality also holds. Firstly, we note that | Σ0(xi)| 1 4 Kxi,0(z) = 0, for all xi, z Rd, which means that the integrand with respect to t is identically equal to zero at t = 0. Secondly, we note that the points {xi}N i=1 are pairwise distinct and the functions {| Σt(xi)| 1 4 Kxi,t( )}N i=1 are hence linearly independent for any t (0, ). It is thus impossible to make the integrand with respect to z identically equal to 0 for a.e. z Rd. As a consequence the integrand with respect to t is positive for all t (0, ). Since we know that the measure ν is not concentrated at 0 this completes the proof that ρ is positive definite on Rd Rd, for any d N. Finally, we note that the kernel ρ is clearly non-stationary, and is a correlation function since ρ(x, x) = 1, for any x Rd. Proof of Proposition 2 We note that the definition of positive definite in Assumptions 1(i) refers only to behaviour of the kernel on a finite set of pairwise distinct points {xi}N i=1. By Assumption 2(i), the function G is non-negative and bounded. If G(z) > 0 for all z Rd, Dunlop, Girolami, Stuart and Teckentrup then the matrix Σ(z) is positive definite for all z Rd, and the fact that ρ( , ) is positive definite follows directly from Proposition 1. It remains to investigate the case where G(z) = 0 for some z Rd. We will prove that ρ( , ) is positive definite by showing that the correlation matrix R, with entries Rij = ρ(xi, xj), is positive definite for any pairwise disjoint points {xi}N i=1. Without loss of generality, we will study the case G(x1) = 0; the proof easily adapts to the case where G(xi) = 0, for i = 1. To define ρ(x1, xj) in this case, we start by assuming G(x1) > 0, G(xj) > 0, and then take limits. With Σ(z) = G(z)Id, we have Q(x1, xj) = (x1 xj)T Σ(x1) + Σ(xj) = 2 x1 xj 2 2 G(x1) + G(xj) 1 , where 2 is the Euclidean norm, and 2 d 2 det(Σ(x1)) 1 4 det(Σ(xj)) 1 4 det(Σ(x1) + Σ(xj)) 1 2 = 4G(x1)G(xj) G(x1) + G((xj) 2 We now study separately three cases: i) xj = x1: we have lim G(x1) 0 4G(x1)G(x1) G(x1) + G(x1) 2 4 = lim G(x1) 0 1 = 1, (18) and so using the algebra of limits, the continuity of ρS, (18) and the fact that ρS(0) = 1, we have lim G(x1) 0 ρ(x1, x1) = lim G(x1) 0 ρS p Q(x1, x1) = ρS(0) = 1. ii) xj = x1 and G(xj) > 0: we have lim G(x1) 0 Q(x1, xj) = 2 x1 xj 2 2 G(xj)) 1 , lim G(x1) 0 4G(x1)G(xj) G(x1) + G(xj) 2 4 = 0. (19) Thus, using the continuity of ρS, together with (19) and the algebra of limits, we have lim G(x1) 0 ρ(x1, xj) = 0. How Deep Are Deep Gaussian Processes? iii) xj = x1, G(xj) = 0: we obtain lim G(x1),G(xj) 0 Q(x1, xj) = , which by Assumptions 2(ii) implies that lim G(x1),G(xj) 0 ρS q Q(x1, xj) = 0. Since (a + b)2 4ab for any positive numbers a and b, we have 4G(x1)G(xj) G(x1) + G((xj) 2 for any G(x1) > 0, G(xj) > 0, and hence lim G(x1),G(xj) 0 ρ(x1, xj) = 0. Hence, when G(xi) > 0, for i = 2, . . . , N, we have lim G(x1) 0 R = R , where the matrix R has the first row and column equal to the first basis vector e1 = (1, 0, 0, . . . , 0) RN, and the remaining submatrix R N 1 RN 1 N 1 with entries ρ(xi, xj), for i, j = 2, . . . , N. The matrix R N 1 is positive definite by Proposition 1, from which we can conclude that R is positive definite also. A similar argument holds when G(xi) = 0 for one or more indices i {2, . . . , N}. Neil K Chada, Marco A Iglesias, Lassi Roininen, and Andrew M Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 34(5):055009, 2018. Victor Chen, Matthew M Dunlop, Omiros Papaspiliopoulos, and Andrew M Stuart. Robust MCMC sampling with non-Gaussian and hierarchical priors in high dimensions. ar Xiv preprint ar Xiv:1803.03344, 2018. Simon L Cotter, Gareth O Roberts, Andrew M Stuart, and David White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3): 424 446, 2013. Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In International Conference on Machine Learning, pages 884 893, 2017. Zhenwen Dai, Andreas Damianou, Javier González, and Neil Lawrence. Variational autoencoded deep Gaussian processes. ar Xiv preprint ar Xiv:1511.06455, 2015. Dunlop, Girolami, Stuart and Teckentrup Andreas C Damianou and Neil D Lawrence. Deep Gaussian Processes. In Artificial Intelligence and Statistics, pages 207 215, 2013. Yair Daon and Georg Stadler. Mitigating the influence of the boundary on PDE-based covariance operators. Inverse Problems and Imaging, 12(5), 2018. Masoumeh Dashti and Andrew M Stuart. The Bayesian approach to inverse problems. Handbook of Uncertainty Quantification, 2017. Persi Diaconis and David Freedman. Iterated random functions. SIAM Review, 41(1):45 76, 1999. David K Duvenaud, Oren Rippel, Ryan P Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, pages 202 210, 2014. Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017. Martin Hairer. An introduction to stochastic PDEs. ar Xiv preprint ar Xiv:0907.4178, 2009. Martin Hairer, Andrew M Stuart, Jochen Voss, and Petter Wiberg. Analysis of SPDEs arising in path sampling. Part I: The Gaussian case. Communications in Mathematical Sciences, 3(4):587 603, 2005. Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Artificial Intelligence and Statistics, pages 732 740, 2016. Dave Higdon, Marc Kennedy, James C Cavendish, John A Cafeo, and Robert D Ryne. Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26(2):448 466, 2004. Marco A Iglesias, Yulong Lu, and Andrew M Stuart. A Bayesian level set method for geometric inverse problems. Interfaces and Free Boundaries, 18(2):181 217, 2016. Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer Science & Business Media, 2006. Gopinath Kallianpur. Stochastic Filtering Theory, volume 13. Springer Science & Business Media, 2013. Marc C Kennedy and Anthony O Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425 464, 2001. Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423 498, 2011. How Deep Are Deep Gaussian Processes? Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise. Stochastic Processes and their Applications, 101(2):185 232, 2002. Sean P Meyn and Richard L Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012. Radford M Neal. Bayesian Learning for Neural Networks. Ph D thesis, University of Toronto, 1995. Radford M Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. ar Xiv preprint physics/9701026, 1997. Christopher J Paciorek. Nonstationary Gaussian processes for regression and spatial modelling. Ph D thesis, Carnegie Mellon University, 2003. Christopher J Paciorek and Mark J Schervish. Nonstationary covariance functions for Gaussian process regression. Advances in Neural Information Processing Systems, 16:273 280, 2004. Omiros Papaspiliopoulos, Gareth O Roberts, and Martin Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59 73, 2007. Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143 195, 1999. Frank J Pinski, Gideon Simpson, Andrew M Stuart, and Hendrik Weber. Kullback Leibler approximation for probability measures on infinite dimensional spaces. SIAM Journal on Mathematical Analysis, 47(6):4091 4122, 2015. Carl E Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. the MIT Press, 2(3):4, 2006. James C Robinson. Infinite-Dimensional Dynamical Systems: An Introduction to Dissipative Parabolic PDEs and the Theory of Global Attractors, volume 28. Cambridge University Press, 2001. Lassi Roininen, Janne M J Huttunen, and Sari Lasanen. Whittle-Matérn priors for Bayesian statistical inversion with applications in electrical impedance tomography. Inverse Problems and Imaging, 8(2):561 586, 2014. Lassi Roininen, Mark Girolami, Sari Lasanen, and Markku Markkanen. Hyperpriors for Matérn fields with applications in Bayesian inversion. ar Xiv preprint ar Xiv:1612.02989, 2016. Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588 4599, 2017. Dunlop, Girolami, Stuart and Teckentrup Alexandra M Schmidt and Anthony O Hagan. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):743 758, 2003. Edward Snelson, Zoubin Ghahramani, and Carl E Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems, pages 337 344, 2004. Michael L Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999. Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004. Yaming Yu and Xiao-Li Meng. To center or not to center: That is not the question an Ancillarity Sufficiency Interweaving Strategy (ASIS) for boosting MCMC efficiency. Journal of Computational and Graphical Statistics, 20(3):531 570, 2011.