# how_deep_are_deep_gaussian_processes__5f31e66e.pdf

Journal of Machine Learning Research 19 (2018) 1-46 Submitted 1/18; Revised 8/18; Published 9/18

How Deep Are Deep Gaussian Processes?

Matthew M. Dunlop mdunlop@caltech.edu Computing and Mathematical Sciences Caltech Pasadena, CA 91125, USA

Mark A. Girolami m.girolami@imperial.ac.uk Department of Mathematics Imperial College London London, SW7 2AZ, UK and The Alan Turing Institute 96 Euston Road London, NW1 2DB, UK

Andrew M. Stuart astuart@caltech.edu Computing and Mathematical Sciences Caltech Pasadena, CA 91125, USA

Aretha L. Teckentrup a.teckentrup@ed.ac.uk School of Mathematics University of Edinburgh Edinburgh, EH9 3FD, UK and The Alan Turing Institute 96 Euston Road London, NW1 2DB, UK

Editor: Neil Lawrence

Recent research has shown the potential utility of deep Gaussian processes. These deep structures are probability distributions, designed through hierarchical construction, which are conditionally Gaussian. In this paper, the current published body of work is placed in a common framework and, through recursion, several classes of deep Gaussian processes are deﬁned. The resulting samples generated from a deep Gaussian process have a Markovian structure with respect to the depth parameter, and the eﬀective depth of the resulting process is interpreted in terms of the ergodicity, or non-ergodicity, of the resulting Markov chain. For the classes of deep Gaussian processes introduced, we provide results concerning their ergodicity and hence their eﬀective depth. We also demonstrate how these processes may be used for inference; in particular we show how a Metropolis-within-Gibbs construction across the levels of the hierarchy can be used to derive sampling tools which are robust to the level of resolution used to represent the functions on a computer. For illustration, we consider the eﬀect of ergodicity in some simple numerical examples.

Keywords: deep learning, deep Gaussian processes, deep kernels

c 2018 Matthew M. Dunlop, Mark A. Girolami, Andrew M. Stuart and Aretha L. Teckentrup.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v19/18-015.html.

Dunlop, Girolami, Stuart and Teckentrup

1. Introduction

This section provides background on the study and application of Gaussian and deep Gaussian processes, outlines the contribution and setup of the paper, and establishes notation to be used in subsequent sections.

1.1 Background

Gaussian processes have proved remarkably successful as a tool for various statistical inference and machine learning tasks (Rasmussen and Williams, 2006; Kennedy and O Hagan, 2001; Higdon et al., 2004; Stein, 1999). This success relates in part to the ease with which computations may be performed in the Gaussian framework, and also to the ﬂexible ways in which Gaussian processes may be used, for example when combined with thresholding to perform classiﬁcation tasks via probit models (Neal, 1997; Rasmussen and Williams, 2006) or to ﬁnd interfaces in Bayesian inversion (Iglesias et al., 2016). Nonetheless there are limits to the sort of phenomena that are readily expressible via direct use of Gaussian processes, such as in the sparse data scenario, where the constructed probability distribution is far from posterior contraction. Recognizing this fact, there have been a number of interesting research activities which seek to represent new phenomena via the hierarchical cascading of Gaussians. Early work of this type includes the Ph D thesis of Paciorek (2003) (see also Paciorek and Schervish, 2004) in which the aim is to reproduce spatially non-stationary phenomena, and this is achieved by means of a Gaussian process whose covariance function itself depends on another Gaussian process. This idea was recently re-visited by Roininen et al. (2016), using the precision operator viewpoint, rather than covariance function, and building on the explicit link between Gaussian processes and stochastic partial diﬀerential equations (SPDEs) (Lindgren et al., 2011). A diﬀerent approach was adopted by Damianou and Lawrence (2013) where a Gaussian process was directly composed with another Gaussian process; furthermore the idea was implemented recursively, leading to what is referred to as deep Gaussian processes (DGP). These ingenious constructions open up new possibilities for problems in non-parametric inference and machine learning and the purpose of this paper is to establish, and utilize, a common framework for their study. Relevant to our analysis is the early work of Diaconis and Freedman (1999) which studied iterations of random Lipschitz functions and the conditions required for their convergence.

1.2 Our Contribution

In the paper we make three main contributions:

We demonstrate a unifying perspective on the hierarchical Gaussian processes described in the previous subsection, leading to a wide class of deep Gaussian processes, with a common framework within which new deep Gaussian processes can be constructed.

By exploiting the fact that this common framework has a Markovian structure, we interpret the depth of the process in terms of the ergodicity or non-ergodicity of this process; in simple terms ergodic constructions have eﬀective depth given by the mixing time.

How Deep Are Deep Gaussian Processes?

We demonstrate how these processes may be used for inference; in particular we show how a Metropolis-within-Gibbs construction across the levels of the hierarchy can be used to derive sampling tools which are robust to the level of resolution used to represent the functions on a computer.

We also describe numerical experiments which illustrate the theory, and which demonstrate some of the limitations of the framework in the inference context, suggesting the need for further algorithmic innovation and theoretical understanding. We now summarize the results and contributions by direct reference to the main theorems in the paper.

Theorem 4 shows that a composition-based deep Gaussian process will, with suﬃciently many layers, produce samples that are approximately constant. This pathology can be avoided by, for example, increasing the width of each hidden layer, or allowing each layer to depend on the input layer.

Theorem 8 shows the ergodicity of a class of discretized deep Gaussian processes, constructed using non-stationary covariance functions. As a consequence, there is little beneﬁt in adding additional layers after a certain point. This observation elucidates the mechanism underlying the choices of DGPs with a small number of layers for inference in numerous papers (for example Cutajar et al., 2017; Salimbeni and Deisenroth, 2017; Dai et al., 2015).

Theorem 14 establishes a similar result as Theorem 8 on function space, for a diﬀerent class of deep Gaussian processes constructed using non-stationary covariance operators.

Theorem 16 establishes the asymptotic properties of a deep Gaussian process formed by iterated convolution of fairly general classes of Gaussian random ﬁelds. Speciﬁcally it is shown that such processes will either converge weakly to zero or diverge as the number of layers is increased, and so they will provide little ﬂexibility for inference in practice.

1.3 Overview

The general framework in which we place the existing literature, and which we employ to analyze deep Gaussian processes, and to construct algorithms for related inference tasks, is as follows. We consider sequences of functions {un} which are conditionally Gaussian:

un+1|un N m(un), C(un) ; (Cov Op)

here m(un) denotes the mean function and C(un) the covariance operator. We will also sometimes work with the covariance function representation, in which case we will write

un+1|un GP m(x; un), c(x, x ; un) . (GP)

Note that the covariance function is the kernel of the covariance operator when the latter is represented as an integral operator over the approximate domain D Rd:

C(un)φ (x) = Z c(x, x ; un)φ(x )dx .

Dunlop, Girolami, Stuart and Teckentrup

In most of the paper we consider the centred case where m 0, although the ﬂexibility of allowing for non-zero mean will be important in some applications, as discussed in the conclusions. When the mean is zero, the iterations (Cov Op) and (GP) can be written in the form

un+1 = L(un)ξn+1, (Zero Mean)

where {ξn} form an i.i.d. Gaussian sequence and, for each u, L(u) is a linear operator. For example if the ξn are white then the covariance operator is C(u) = L(u)L(u) with denoting the adjoint operation and L(u) is a Cholesky factor of C(u). The formulation (Zero Mean) is useful in much of our analysis. For the purpose of this paper, we will refer to any sequence of functions constructed as in (Zero Mean) as a deep Gaussian process. In section 2 we discuss the hierarchical Gaussian constructions referenced above, and place them in the setting of equations (Cov Op), (GP) and (Zero Mean). Section 3 studies the ergodicity of the resulting deep Gaussian processes, using the Markov chain which deﬁnes them. In section 4 we provide supporting numerical experiments; we give illustrations of draws from deep Gaussian process priors, and we discuss inference. In the context of inference we describe a methodology for MCMC, using deep Gaussian priors, which is deﬁned in the function space limit and is hence independent of the level of resolution used to represent the functions un; numerical illustrations are given. We conclude in section 5 in which we describe generalizations of the settings considered in this paper, and highlight future directions.

1.4 Notation

The structure of the deep Gaussian processes above means that they can be interpreted as Markov chains on a Hilbert space H of functions. Let B(H) denote the Borel σ-algebra on H. We denote by P : H B(H) R the one-step transition probability distribution,

P(u, A) = P(un A | un 1 = u), (1)

and denote by Pn : H B(H) R the n-step transition probability distribution,

Pn(u, A) = P(un A | u0 = u). (2)

Thus, for example, in the case of the covariance operator construction (Cov Op) we have

P(u, ) = N 0, C(u) ,

when the mean is zero. This Markovian structure will be exploited when showing ergodicity, or lack of ergodicity, of the chains.

2. Four Constructions

This section provides examples of four constructions of deep Gaussian processes, all of which fall into our general framework. The reader will readily design others.

How Deep Are Deep Gaussian Processes?

2.1 Composition

Let D Rd, D Rl, un : D Rm and F : Rm D . If {ξn} is a collection of i.i.d. centred Gaussian processes taking values in the space of continuous functions C(D ; Rm) then we deﬁne the Markov chain

un+1(x) = ξn+1 F un(x) . (3)

The case m = l, F = id and D = D = Rm was introduced by Damianou and Lawrence (2013) and the generalization here is inspired by the formulation of Duvenaud et al. (2014). The case where two layers are employed could be interpreted as a form of warped Gaussian process: a generalization of Gaussian processes that have been used successfully in a number of inference problems (Snelson et al., 2004; Schmidt and O Hagan, 2003). We note that the mapping ξ 7 ξ F u is linear, and we may thus deﬁne L(u) by L(u)ξ = ξ F u; hence the Markov chain may be written in the form (Zero Mean). If ξ1 N(0, Σ) then the Markov chain has the form (Cov Op), with mean zero and C(u) = L(u)ΣL(u) ; if ξ1 GP 0, k(z, z ) then the Markov chain has the form (GP) with mean

zero and c(x, x ; u) = k F u(x) , F u(x ) .

2.2 Covariance Function

Paciorek (2003) gives a general strategy to construct anisotropic versions of isotropic covariance functions. Let Σ : Rd Rd d be such that Σ(z) is symmetric positive deﬁnite for all z Rd, and deﬁne the quadratic form

Q(x, x ) = (x x )T Σ(x) + Σ(x )

1 (x x ), x, x Rd.

If the isotropic correlation function ρS( ) is positive deﬁnite on Rd, for all d N, then the function

c(x, x ) = σ2 2 d 2 det(Σ(x)) 1 4 det(Σ(x )) 1 4

det(Σ(x) + Σ(x )) 1 2 ρS( p

is positive deﬁnite on Rd Rd and may thus be used as a covariance function. We make these statements precise below. If we choose Σ to depend on un then this may be used as the basis of a deep Gaussian process. To be concrete we choose

Σ(x) = F u(x) Id

where F : R R 0 for u : D Rd R. We then write c(x, x ; u). Now let un : D R and consider the Markov chain (GP) in the mean zero case. Paciorek (2003) considered this iteration over one-step with u0 GP(0, σ2ρS( x x )) and u1 was shown to exhibit interesting non-stationary eﬀects. Here we generalize and consider the deep process that results from this construction for arbitrary n N. By considering the covariance operator

C(u)ϕ (x) = Z

Rd c(x, x ; u)ϕ(x ) dx

Dunlop, Girolami, Stuart and Teckentrup

we may write the iteration in the form (Cov Op). The form (Zero Mean) follows with L(u) = C(u) 1 2 and ξn+1 being white noise. Various generalizations of this construction are possible, for example allowing the pointwise variance of the process σ2 to be spatially varying (Heinonen et al., 2016) and to depend on un(x). These may be useful in applications, but we conﬁne our analysis to the simpler setting for expository purposes; however in Remark 12 we discuss this generalization. In order to make the statements made above precise, let ρS : [0, ) R be a stationary covariance kernel, where the covariance between locations x and y depends only on the Euclidean distance x y 2. We make the following assumption on ρS.

Assumptions 1 (i) The covariance kernel ρS( x y 2) is positive deﬁnite1 on Rd Rd: for any N N, b RN\{0} and pairwise distinct {xi}N i=1 Rd, we have

j=1 bibjρS xi xj 2 > 0.

(ii) ρS is normalized to be a correlation kernel, i.e. ρS(0) = 1.

Using (Wendland, 2004, Theorem 6.11), suﬃcient conditions for ρS to fulﬁll Assumptions 1(i) are that ρS, as a function of x y, is continuous, bounded and in L1(Rd), with a Fourier transform that is non-negative and non-vanishing. These suﬃcient conditions are satisﬁed, for example, for the family of Matèrn covariance functions and the Gaussian covariance. To satisfy Assumptions 1(ii), any positive deﬁnite kernel ρS can simply be rescaled by ρS(0). We now have the following proposition, a slightly weaker version of which is proved by Paciorek (2003), where it is shown that ρ( , ) is positive semi-deﬁnite if ρS is positive semi-deﬁnite. Our proof, which is in the Appendix, follows closely that of (Paciorek, 2003, Theorem 1), but sharpens the result using a characterization of positive deﬁnite kernels proved in (Wendland, 2004, Theorem 7.14).

Proposition 1 Let Assumptions 1 hold. Suppose Σ : Rd Rd d is such that Σ(z) is symmetric positive deﬁnite for all z Rd, and deﬁne the quadratic form

Q(x, x ) = (x x )T Σ(x) + Σ(x )

1 (x x ), x, x Rd.

Then the function ρ( , ), deﬁned by

ρ(x, x ) = 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4

|Σ(x) + Σ(x )| 1 2 ρS p

is positive deﬁnite on Rd Rd, for any d N, and is a non-stationary correlation function.

1. If the double sum in this deﬁnition is only non-negative, we say that the kernel ρS is positive semideﬁnite. We are thus adopting the terminology used by Wendland (2004), where the kernel ρS is called positive deﬁnite if the double sum in Assumptions 1(i) is positive, and positive semi-deﬁnite if the sum is non-negative. For historical reasons, there is an alternative terminology, used by for example Paciorek (2003), where our notion of positive deﬁnite is referred to as strictly positive deﬁnite, and our notion of positive semi-deﬁnite is referred to as positive deﬁnite.

How Deep Are Deep Gaussian Processes?

Non-stationary covariance functions c(x, y), for which c(x, x) = 1, can be obtained from the non-stationary correlation function ρ(x, y) through multiplication by a standard deviation function σ : Rd R, in which case we have c(x, y) = σ(x)σ(y)ρ(x, y). Since the product of two positive deﬁnite kernels is also positive deﬁnite by (Wendland, 2004, Theorem 6.2), the kernel c(x, y) can be ensured to be positive deﬁnite by a proper choice of σ. We discuss generalizations such as this in the conclusions section 5. We are interested in studying the behaviour of Gaussian processes with non-stationary correlation functions ρ(x, y) of the form derived in Proposition 1, in the particular case where the matrices Σ(z) are derived from another Gaussian process. Speciﬁcally, we consider the following hierarchy of conditionally Gaussian processes on a bounded domain D Rd deﬁned as follows:

u0 GP(0, ρS( )), (4a)

un+1|un GP(0, ρ( , ; un)), for n N. (4b)

Here, ρ( , ; un) denotes a non-stationary correlation function constructed from ρS( ) as in Proposition 1, with the map Σ deﬁned through un. Typical choices for Σ are Σ(z) = (un(z))2 Id and Σ(z) = exp(un(z)) Id. Choices such as the ﬁrst of these lead to the possibility of positive semi-deﬁnite Σ and, in the worst case, Σ 0. If Σ 0 the resulting correlation function is given by

ρS(0) = 1, and ρS(r) = 0 for any r > 0.

This does not correspond to any (function valued) Gaussian process on Rd (Kallianpur, 2013): heuristically the resulting process would be a white noise process, but normalized to zero. However, it is possible to sample from any set of ﬁnite dimensional distributions when Σ 0: the correlation matrix is then the identity. To allow for the possibility of F( ) taking the value zero, we therefore only study the ﬁnite dimensional process deﬁned as follows:

u0 N(0, RS), (5a)

un+1 | un N(0, R(un)), for n N. (5b)

The vector un has entries (un)i = un(xi). Here, RS is the covariance matrix with entries (RS)ij = ρS( xi xj 2), and R(un) is the covariance matrix with entries (R(un))ij = ρ(xi, xj; un). The set {xj} comprises a ﬁnite set of points in Rd. We may now generalize Proposition 1 to allow for Σ becoming zero. In order to do this we make the following assumptions:

Assumptions 2 (i) We have Σ(z) = G(z)Id, for some non-negative, bounded function G : R R 0.

(ii) The correlation function ρS is continuous, with limr ρS(r) = 0.

We then have the following result on the positive-deﬁniteness of ρ( , ),

Proposition 2 Let Assumptions 1 and 2 hold. Then the kernel ρ( , ) deﬁned in Proposition 1 is positive deﬁnite on Rd Rd.

Dunlop, Girolami, Stuart and Teckentrup

Remark 3 This proposition applies to the process (5) with Σ(z) = F un(z) Id and F : R R 0 locally bounded, by taking G = F un, proving that ρ( , ; un) is positive deﬁnite on D D for all bounded functions un on D. Here we generalize the notion of positive-deﬁnite in the obvious way to apply on D Rd rather than on the whole of Rd.

2.3 Covariance Operator

Here we demonstrate how precision (inverse covariance) operators may be used to make deep Gaussian processes. Because precision operators encode conditional independence and sparsity this can be a very attractive basis for fast computations (Lindgren et al., 2011). Our approach is inspired by the hierarchical Gaussian process introduced by Roininen et al. (2016), where one-step of the Markov chain which we introduce here was considered. Let D Rd, un : D R and X := C(D; R). Assume that F : R R 0 is a bounded function. Let C be a covariance operator associated to a Gaussian process taking values in X and let P be the associated precision operator. Deﬁne the multiplication operator Γ(u) by Γ(u)v (x) = F u(x) v(x) and the covariance operator C(u) by

C(u) 1 = P + Γ(u)

and consider the Markov chain (Cov Op) with mean zero; this deﬁnes our deep Gaussian process. We note that formulation (GP) can be obtained by observing that the covariance function c(u) := c(x, x ; u) is the Green s function associated with the precision operator for C(u): C(u) 1c( , x ; u) = δx ( )

where δx is a Dirac delta function centred at point x . Computationally we will typically choose P to be a diﬀerential operator, noting that then fast methods may be employed to sample the Gaussian process un+1|un by means of SPDEs (Lindgren et al., 2011; Dashti and Stuart, 2017). If P is chosen as a diﬀerential operator, then the order of this operator will be related to the order of regularity of samples, and F will be related to the length scale of the samples. These relations are made explicit in the case of certain Whittle-Matérn distributions when F is constant (Lindgren et al., 2011); some boundary eﬀects may be present when D = Rd, though methodology is available to ameliorate these (Daon and Stadler, 2018). As in the previous subsection, the form (Zero Mean) follows with L(u) = C(u) 1 2 and ξn+1 being white noise. Generalizations of the construction in this subsection are possible, and we highlight these in subsection 5; however for expository purposes we conﬁne our analysis to the setting described in this subsection. For theoretical investigation of the equivalence, as measures, of Gaussians deﬁned by addition of an operator to a given precision operator (see Pinski et al., 2015).

2.4 Convolution

We consider the case (Zero Mean) where L(u)ξ := u ξ is a convolution. To be concrete we let D = [0, 1]d and construct a sequence of functions un : D R (or un : D C) deﬁned via the iteration

un+1(x) = (un ξn+1)(x) := Z

[0,1]d un(x y)ξn+1(y) dy,

How Deep Are Deep Gaussian Processes?

where {ξn} are a sequence of i.i.d. centred real-valued Gaussian random functions on D. Here we implicitly work with periodic extension of un from D to the whole of Rd in order to deﬁne the convolution.

3. The Role of Ergodicity

The purpose of this section is to demonstrate that the iteration (Zero Mean) is, in many situations, ergodic. This has the practical implication that the eﬀective depth of the deep Gaussian process is limited by the mixing time of the Markov chain. In some cases the ergodic behaviour may be trivial (convergence to a constant). Furthermore, even if the chain is not ergodic, the large iteration number dynamics may blow-up, prohibiting use of the iteration at signiﬁcant depth. The take home message is that in many cases the eﬀective depth is not that great. Great care will be needed to design deep Gaussian processes whose depth, and hence approximation power, is substantial. This issue was ﬁrst identiﬁed by Duvenaud et al. (2014), and we here provide a more general analysis of the phenomenon within the broad framework we have introduced for deep Gaussian processes.

3.1 Composition

We ﬁrst consider the case where the iteration is deﬁned by (3), which includes examples considered by Damianou and Lawrence (2013); Duvenaud et al. (2014). Duvenaud et al. (2014) observed that after a number of iterations, sample paths are approximately piecewise constant. We investigate this eﬀect in the context of ergodicity. We ﬁrst make two observations:

(i) if u0 is piecewise constant, then un is piecewise constant for all n N;

(ii) if u0 has discontinuity set Z0, and Zn denotes the discontinuity set of the nth iterate, then Zn+1 Zn for all n N.

Due to point (ii) above, if the sequence {un} is to be ergodic, then necessarily it must be the case that Zn , or else the process will have retained knowledge of the initial condition. In particular, if the initial condition is piecewise constant, then ergodicity would force the limit to be constant in space. In what follows we assume that the iteration is given by

un+1(x) = ξn+1 un(x) , ξj n+1 GP 0, h( x x 2) i.i.d.

where h is a stationary covariance function. We therefore make the choice m = l and F = id in (3) so that we are in the same setup as Damianou and Lawrence (2013); Duvenaud et al. (2014); the inclusion of more general maps F is discussed in Remark 5. Then for any x, x R we have uj n+1(x) uj n+1(x )

, h(0) h un(x) un(x ) 2

h un(x) un(x ) 2 h(0)

A common choice of covariance function is the squared exponential kernel:

h(z) = σ2e z2/2w2 (6)

Dunlop, Girolami, Stuart and Teckentrup

where σ2, w2 > 0 are scalar parameters. In Duvenaud et al. (2014), in the case m = d = 1, the choice σ2/w2 = π/2 is made above to ensure that the expected magnitude of the derivative remains constant through iterations. We show in the next proposition that if σ2, w2 are chosen such that σ2 < w2/m, then the limiting process is trivial in a sense to be made precise.

Theorem 4 Assume that h( ) is given by the squared exponential kernel (6) and that u0 is bounded on bounded sets almost-surely. Then if σ2 < w2/m,

P un(x) un(x ) 2 0 for all x, x D = 1

where P denotes the law of the process {un} over the probability space Ω.

Proof Since 1 e x x for x 0 it follows that, for all z R,

2h(0) 2h(z) σ2

with equality when z = 0. Then we have

E un(x) un(x ) 2 2 un 1 =

j=1 E |uj n(x) uj n(x )|2 un 1

2h(0) 2h un 1(x) un 1(x ) 2

w2 un 1(x) un 1(x ) 2 2

and so using induction and the tower property of conditional expectations,

E un(x) un(x ) 2 2 mσ2

E un 1(x) un 1(x ) 2 2

n E u0(x) u0(x ) 2 2

for some constant κ(x, x ). By the Markov inequality, we see that for any ε > 0,

P un(x) un(x ) 2 ε) 1

n κ(x, x ), (7)

and so, since σ2 < w2/m, applying the ﬁrst Borel-Cantelli lemma we deduce that

m=n { um(x) um(x ) 2 ε}

How Deep Are Deep Gaussian Processes?

We therefore have that

P( un(x) un(x ) 2 0) = P

m=n { um(x) um(x ) 2 < 1/k}

m=n { um(x) um(x ) 2 1/k} !

m=n { um(x) um(x ) 2 1/k}

Hence given any x, x D, we can ﬁnd Ω(x, x ) Ωwith P(Ω(x, x )) = 1 such that for any ω Ω(x, x ), un(x; ω) un(x ; ω) 2 0. Let {qj} be a countable dense subset of D, and deﬁne Ω = \

i,j N Ω(qi, qj) ,

noting that P(Ω ) = 1. Then for any ω Ω and any x, x {qj}, un(x; ω) un(x ; ω) 2 0. Since sample paths are almost-surely continuous, the above can be extended to all x, x D, so that

P un(x) un(x ) 2 0 for all x, x D = 1.

Remark 5 1. If a more general transformation map F : Rm D is included, then the above result still holds provided we take σ2 < w2/( F m). The convergence to a constant hence occurs when the length scale w is large or F is small (so each Gaussian random ﬁeld doesn t change too rapidly across the domain), or when the amplitude σ is small (so inputs are not warped too far).

2. The condition of the above theorem is less likely to be satisﬁed as the width m of each layer is increased, and so this triviality pathology is unlikely to arise for large m; this may be observed in practice numerically.

3. Following Neal (1995); Duvenaud et al. (2014), recent works (such as Dai et al., 2015; Cutajar et al., 2017) connect all layers to the input layer in order to avoid certain pathologies. The Markovian structure of the process is maintained in this case: with the above notation, the process is then deﬁned by

un+1(x) = ξn+1(un(x), x), ξj n+1 GP 0, h( x x 2) i.i.d,

where now ξn : Rm Rd Rm. Deﬁning β = mσ2/w2 < 1, if σ 1 we may use the same argument as the proof above to deduce that

E un(x) un(x ) 2 2 un 1 β un 1(x) un 1(x ) 2 2 + β x x 2 2,

Dunlop, Girolami, Stuart and Teckentrup

which leads to

E un(x) un(x ) 2 2 βn E u0(x) u0(x ) 2 2 + β 1 βn

The right hand side does not vanish as n , and so we can no longer use the ﬁrst Borel-Cantelli lemma to reach the same conclusion as the case where the layers are not connected to the input layer. This could provide some intuition as to why including the connection of each layer to the input layer provides greater stability than not doing so.

3.2 Covariance Function

In order to study ergodicity of the deep Gaussian process deﬁned through covariance functions, we will restrict attention in the remainder of this subsection to hierarchies of ﬁnitedimensional multivariate Gaussian random variables as in (5). Note that although we have here deﬁned u0 N(0, RS), following e.g. Paciorek (2003), the ergodicity of the deep Gaussian process will be proved for ﬁxed u0 RN (cf. Theorem 8). The following result is immediate from Proposition 2.

Corollary 6 Let Assumptions 1 and 2 hold. Then the covariance matrix R(un) is positive deﬁnite for all un C, for any compact subset of C RN.

Note that, because we have chosen to work with a correlation kernel, we have

Tr R(un) = N. (8)

We will use this fact explicitly in the ergodicity proof; however it may be relaxed as discussed in the Remark 12 below. We view the sequence of random variables {un} n=0 as a Markov chain, with u0 RN

given, and we want to show the existence of a stationary distribution. Recall the one-step transition kernel P of the Markov chain given by (1), and its n fold composition given by (2). In order to prove ergodicity of the Markov chain we will follow the proof technique of Mattingly et al. (2002); Meyn and Tweedie (2012), which establishes geometric ergodicity with the following proposition.

Proposition 7 Suppose the Markov chain {un} n=0 satisﬁes, for some compact set C B(RN), the following:

(i) For some y int(C) and for any δ > 0, we have

P(u, Bδ(y )) > 0 for all u C.

(ii) The transition kernel P(u, ) possesses a density p(u, y) in C, precisely

P(u, A) = Z

A p(u, y) dy, for all u C, A B(RN) B(C),

and p(u, y) is jointly continuous on C C.

How Deep Are Deep Gaussian Processes?

(iii) There is a function V : RN [1, ), with limu V (u) = , and real numbers α (0, 1) and β [0, ) such that

E(V (un+1) | un) αV (un) + β.

If we can choose the compact set C such that

C = u : V (u) 2β γ α

for some γ ( α, 1), then there exists a unique invariant measure π. Furthermore, there is r(γ) (0, 1) and κ(γ) (0, ) such that for all u0 RN and all measurable g with |g(u)| V (u) for all u RN, we have

|EPn(u0, )(g) π(g)| κrn V (u0).

We may verify the assumptions of Proposition 7 leading to the following theorem concerning the ergodicity of deep Gaussian processes deﬁned via the covariance function:

Theorem 8 Suppose Assumptions 1 and 2 hold. Then the Markov chain {un} n=0 satisﬁes the assumptions of Proposition 7. As a consequence, there exists ε (0, 1) such that for any u0 RN, there is a K(u0) > 0 with

Pn(u0, ) π TV K(1 ε)n for all n N,

and so the chain is ergodic.

The proof rests on the following three lemmas, and is given after stating and proving them. The ﬁrst lemma shows that, on average, the norm of states of the chain remains constant as the length of the chain is increased. The second shows that, given any current state in RN and any ball around the origin in RN, there is a positive probability that the next state will belong to that ball. The third lemma shows that the probability that the Markov chain moves to a set may be found via integration of a continuous function over that set.

Lemma 9 (Boundedness) Suppose Assumptions 1 and 2 hold. For all n N, we have

E un+1 2 2 | un = N.

Proof Let n 0. Since the random variable un+1|un has zero mean, the linearity of expectation implies (using (8)) that

E un+1 2 2 | un = E N X

j=1 (un+1)2 j

= Tr(R(un)) = N,

for all n N.

Dunlop, Girolami, Stuart and Teckentrup

Lemma 10 (Positive probability of a ball around zero) Suppose Assumptions 1 and 2 hold. For all u RN and δ > 0, we have

P u, Bδ(0) > 0.

Proof We have the equality un+1|(un = u) = p

R(u)ξn+1 in distribution, where p

R(u) denotes the Cholesky factor of the correlation matrix R(u) and ξn+1 N(0, IN). Then

P u, Bδ(0) = P un 2 δ | un 1 = u

R(u)ξn+1 2 δ

R(u) 2 ξn+1 2 δ

= P ξn+1 2 δ p

To show that the latter probability is positive, we need to show that δ p

R(u) 1 2 > 0.

Since δ > 0 is ﬁxed, we only need to show p

R(u) 2 < . Since p

R(u) 2 2 = ρ(R(u)), the spectral radius of R(u), we have p

R(u) 2 2 = ρ(R(u)) Tr(R(u)) = N.

The claim then follows.

Lemma 11 (Transition probability has a density) Suppose Assumptions 1 and 2 hold. Then the transition probability P u, has a jointly continuous density p(u, y) for all u C, for any compact set C RN.

Proof We have un+1|(un = u) N(0, R(u)), and the existence of a jointly continuous density of the transition probability in C follows if R(u) is positive deﬁnite for all u C. The claim then follows by Proposition 2.

We may now use the three preceding lemmas to prove the main ergodic theorem for deep Gaussian processes deﬁned through the covariance function. Proof of Theorem 8 Lemma 10 shows that assumption (i) is satisﬁed, for any C containing y = 0, and Lemma 11 shows that assumption (ii) is satisﬁed, for any compact set C. It follows from Lemma 9 that assumption (iii) is satisﬁed, with V (u) = u 2 2 +1, any α (0, 1) and β = N + 1. Now choose α = 1/4 and γ = 3/4 ( α, 1), so that the set

C = u : V (u) 2β γ α

= u : u 2 2 4N + 4

is compact. Then there is a unique invariant measure π, and there is r(γ) (0, 1) and κ(γ) (0, ) such that for u0 RN and all measurable g with |g(u)| V (u) for all u RN, we have

|EPn(u0, )(g) π(g)| κrn V (u0). (9)

How Deep Are Deep Gaussian Processes?

Since V (u) 1 for all u RN, the above holds in particular for all measurable g with g 1. Taking the supremum over all such g in (9) yields the given total variation bound, with K = κV (u0) and ε = 1 r.

Remark 12 (Covariance vs correlation kernels) In this subsection we have restricted our attention to correlation kernels ρS( xi xj 2) and ρ(xi, xj; un), rather than more general covariance kernels

c S( xi xj 2) = σ2 SρS( xi xj 2),

c(xi, xj; un) = σ(xi; un)σ(xj; un)ρ(xi, xj; un),

for stationary and non-stationary marginal standard deviation functions σS (0, ) and σ : Rd (0, ) respectively. This restriction is solely for ease of presentation; the analysis presented readily extends to c(xi, xj; un), under suitable assumptions on σ. In particular the analysis may be adapted to the case of general covariance kernels c S( xi xj 2) and c(xi, xj; un) under the assumption that there exist positive constants σ , σ+ such that σ σ(z) σ+, for all z Rd. When general covariances are used then it is possible to ensure that every multivariate Gaussian random variable in the hierarchy is of the same amplitude by scaling the corresponding covariance matrix C(un) to have constant trace N at each iteration n; the average variance over all points {xi}N i=1 is then 1 for every n.

3.3 Covariance Operator

We consider the class of covariance operators introduced in section 2.3 and show that, under precise assumptions detailed below, the iteration (Zero Mean) produces an ergodic Markov chain. Unlike the previous subsection, where we worked on RN, here we will work on the separable Hilbert space H = L2(D; R). To begin with, deﬁne the precision operators (densely deﬁned on H: Hairer et al., 2005; Pinski et al., 2015),

C 1 + = P + F+I,

C(u) 1 = P + Γ(u), u H,

and the probability measures

µ = N(0, C ),

µ+ = N(0, C+),

µ( ; u) = N(0, C(u)), u H.

Throughout the rest of this section we make the following assumptions on C and F:

Assumptions 3 1. The operator C : H H is symmetric and positive, and its eigenvalues {λ2 j} have algebraic decay λ2 j j r for some r > 1.

2. The function F : H R is continuous, and there exists F+ 0 such that 0 F(u) F+ for all u H.

Dunlop, Girolami, Stuart and Teckentrup

Remark 13 1. The assumption on algebraic decay of the eigenvalues can be relaxed to the operator C being trace-class on H; however the arguments that follow are cleaner when we assume this explicit decay which, of course, implies the trace condition. Note also that, under the stated assumption on algebraic decay, Gaussian measures on L2(D; R) will be supported on X = C(D; R) under mild conditions on the eigenfunctions of C (Dashti and Stuart, 2017) so that F u(x) will be deﬁned for all x D rather than x a.e. in D. Then Γ(u)v makes sense pointwise when v X.

2. The assumed form of the precision operator together with Assumptions 3 mean that the resulting family of measures {µ( ; u)}u H will be mutually equivalent. This allows for the total variation metric between measures to be used, and a concise proof of ergodicity to be obtained. If the measures were singular, a diﬀerent metric such as the Wasserstein metric would be required to quantify the convergence.

We now prove the following ergodic theorem for the deep Gaussian processes constructed through covariance operators.

Theorem 14 Let Assumptions 3 hold, and let the Markov chain {un} be given by (Zero Mean) with L(u) = C(u) 1 2 as deﬁned above. Then there exists a unique invariant distribution π, and there exists ε > 0 such that for any u0 H,

Pn(u0, ) π TV (1 ε)n for all n N.

In particular, the chain is ergodic.

The following lemma will be used to show a minorization condition, as well as establish further notation, key to the proof of Theorem 14 which follows it. It essentially shows a stronger form of equivalence of the family of measures {µ( , u)}u H.

Lemma 15 Let Assumptions 3 hold. Then there exists ε > 0 such that for any u, v H,

Proof The assumptions on F mean that the measures µ( ; u), µ and µ+ are mutually absolutely continuous, with

dµ (v) = 1 Z(u) exp 1

2 v, F(u)v ,

Z(u) = Eµ exp 1

2 v, F(u)v ;

dµ+ dµ (v) = 1 Z+ exp 1

Z+ = Eµ exp 1

How Deep Are Deep Gaussian Processes?

Observe that we may bound Z(u) 1 uniformly in u H since F 0. Additionally, we have that

Z+ Eµ exp 1

2 v, F+v 1 v 2 1

µ v 2 1 =: ε > 0.

Note that ε is positive since H is separable, and thus all balls have positive measure (Hairer, 2009). It follows that

dµ+ (v) = dµ( ; u)

= 1 Z(u) exp 1

2 v, F(u)v Z+ exp 1

2 v, F+ F(u) v

since F+ bounds F above uniformly.

Proof of Theorem 14 We ﬁrst establish existence of at least one invariant distribution by showing that chain {un} is (strong) Feller, and that for each u0 H the family {Pn(u0, )} of transition kernels is tight. To see the former, let f : H R be any bounded measurable function. We have that, for any v H,

(Pf)(u) := Z

H f(v)P(u, dv)

H f(v) 1 Z(u) exp 1

2 v, F(u)v µ (dv).

Since F(u) F+ it follows that Z(u) is bounded below by a positive constant, uniformly with respect to u. Additionally F is continuous and non-negative, and so the integrand is bounded and continuous with respect to u. Hence given any sequence u(k) u in H, we may apply the dominated convergence theorem to see that (Pf)(u(k)) (Pf)(u). The function Pf is therefore continuous, and so the chain {un} is strong Feller. We now show tightness. The assumptions on the operator C : H H imply that it is trace-class, and so in particular compact. It is also positive and symmetric, and so by the spectral theorem, admits a complete orthonormal system of eigenvectors {ϕj} with corresponding positive eigenvalues {λ2 j} such that λ2 j 0. Given s > 0, deﬁne the subspace Hs H by

Hs = v H v 2 Hs :=

j=1 j2s| ϕj, v |2 < .

It is standard to show that Hs is compactly embedded in H for any s > 0 (see for example Robinson, 2001, Appendix A.2). By the Karhunen-Loéve theorem, any v µ may be represented as

j=1 λjξjϕj, ξj N(0, 1) i.i.d.

Dunlop, Girolami, Stuart and Teckentrup

Hence, by the orthonormality of the {ϕj} and the assumed decay of the eigenvalues, we have that

Eµ v 2 Hs =

j=1 j2sλ2 j

Eµ v 2 Hs < if and only if s < r

Since r > 1 by assumption, we can always choose s > 0 such that this holds; ﬁx such an s in what follows. Observe that, for any n N,

E un 2 Hs) = E E un 2 Hs un 1

= E Eµ( ;un 1) v 2 Hs

H v 2 Hs 1 Z(un 1) exp 1

2 v, F(un 1)v µ (dv)

1 Z+ Eµ v 2 Hs

We have bounded Z(un 1) Z+ using that F(un 1) F+. Applying the Chebychev inequality, we have for each n N and R > 0

P un Hs > R E un 2 Hs

and so given any κ > 0,

This can be rewritten as Pn(u0, Kκ) 1 κ

where Kκ = u H | u Hs p

M/κ is compact in H, since Hs is compactly embedded in H; this shows tightness of the sequence of probability measures Pn(u0, ). Since tightness implies boundedness in probability on average, an application of (Meyn and Tweedie, 2012, Theorem 12.0.1) gives existence of an invariant distribution. Lemma 15 shows that {un} satisﬁes a global minorization condition for the one-step transition probabilities: for any u0 H and any measurable A H,

P u0, A = Eµ( ;u0) 1A(v) = Eµ+ dµ( ; u0)

dµ+ (v)1A(v) εµ+(A).

Combined with the existence of an invariant distribution above, a short coupling argument (Meyn and Tweedie, 2012, Theorem 16.2.4) gives the result with the same ε as above.

How Deep Are Deep Gaussian Processes?

3.4 Convolution

The convolution iteration has the advantage that, through use of Fourier series and the law of large numbers, its long time behaviour can be completely characterized analytically. We consider the convolution as a random map on H = L2(D; C), D = (0, 1)d. The iteration is given by

un+1(x) = (un ξn+1)(x) := Z

D un(x y)ξn+1(y) dy, ξn+1 N(0, C) i.i.d. (10)

where we implicitly work with periodic extensions to deﬁne the convolution. We assume that C is a negative fractional power of a diﬀerential operator so that it diagonalizes in Fourier space; such a form of covariance operator is common in applications, as it includes, for example, Whittle-Matérn distributions (Lindgren et al., 2011). For example, we may take C = (I ) α, D( ) = H2 per([0, 1]d) H,

in which case the samples ξn+1 N(0, C) will (almost surely) possess s fractional Sobolev and Hölder derivatives for any s < α d/2 (see Dashti and Stuart, 2017, for details). We choose the orthonormal Fourier basis

ϕk(x) = e2πik x, k Zd,

which are the eigenvectors of C; we denote the corresponding eigenvalues {λ2 k}. Given u H and k Zd, deﬁne the Fourier coeﬃcient ˆu(k) C by

ˆu(k) := ϕk, u L2 = Z

ϕk(x)u(x) dx.

Then it can be readily checked that for any u, v H and k Zd,

\ (u v)(k) = ˆu(k)ˆv(k). (11)

We use this property to establish the following theorem.

Theorem 16 Let C : H H be a negative fractional power of a diﬀerential operator such that C is positive, symmetric and trace-class, with eigenvectors {ψk} and eigenvalues {λ2 k}. Deﬁne the Markov chain {un} by (10). Then for any u0 H,

lim n |ˆun(k)|2 =

( 0 |λk|2 < 2eγ

|λk|2 > 2eγ almost surely

where γ 0.577 is the Euler-Mascheroni constant. In particular, if |λk|2 < 2eγ for all k Zd, then every Fourier coeﬃcient of un tends to zero almost surely and hence un 0 in H almost surely.

Proof First observe that by the Karhunen-Loéve theorem, we may express ξn+1 N(0, C) as ξn+1 = X

k Zd λkηn,kϕk, ηn,k N(0, 1) i.i.d.

Dunlop, Girolami, Stuart and Teckentrup

and so, since {ϕk} is orthonormal,

ˆξn+1(k) = λkηn,k.

Then by the property (11), we see that for each k Zd and n N,

ˆun+1(k) = ˆun(k)ˆξn+1(k) = ˆun(k)λkηn,k (12)

where the second equality is in distribution. The problem has now been reduced to an independent family of scalar problems. We can write ˆun(k) explicitly as

ˆun(k) = ˆu0(k)

j=1 λkηj,k. (13)

Now observe that

|ˆun(k)|2 = |ˆu0(k)|2 n Y

j=1 |λk|2|ηj,k|2

= |ˆu0(k)|2 exp

j=1 log |λk|2|ηj,k|2

= |ˆu0(k)|2 exp

j=1 log |ηj,k|2 + log |λk|2

By the strong law of large numbers, the scaled sum inside the exponential converges almost surely to E(log |η1,k|2). This can be calculated as

E(log |η1,k|2) = γ log 2.

If the bracketed term inside the exponential in (14) is eventually negative almost surely, then the limit of |un(k)|2 will be zero almost surely. This is guaranteed when γ log 2 + log |λk|2 < 0, i.e. |λk|2 < 2eγ. Similarly we get divergence if the bracketed term is eventually positive, which happens when |λk|2 > 2eγ.

Remark 17 It is interesting to note that we may take expectations in (12) to establish that

E|ˆun(k)|2 = |ˆu0(k)|2|λk|2n

lim n E|ˆun(k)|2 =

( 0 |λk|2 < 1 |λk|2 > 1 .

In particular, if |λk|2 (1, 2eγ), then |ˆun(k)|2 converges to zero almost surely, but diverges in mean square.

How Deep Are Deep Gaussian Processes?

Via a slight modiﬁcation of the above proof to account for diﬀerent boundary conditions, we have the following result.

Corollary 18 Let D = (0, 1) and let {un} be deﬁned by the iteration (10), where each ξn+1 is a Brownian bridge. Then un 0 almost surely.

Proof The Brownian bridge on [0, 1] has covariance operator ( ) 1, where

D( ) = {u H2 per([0, 1]) | u(0) = u(1) = 0}.

The result of Theorem 16 cannot be applied directly, since the basis functions {ϕk} do not satisfy the boundary conditions. The eigenfunctions with the correct boundary conditions are given by

2 sin(jπx) = 1

ϕj(x) ϕ j(x) , j 1

with corresponding eigenvalues α2 j = (π2j2) 1. A Brownian bridge ξn+1 N(0, ( ) 1) can then be expressed as

j=1 αjζn,jψj, ζn,j N(0, 1) i.i.d.

by the Karhunen-Loéve theorem. We calculate

ˆun+1(k) = ˆun(k)ˆξn(k)

j=1 αjζn,j ϕk, ψj

j=1 αjζn,j 1

ϕk, ϕj ϕk, ϕ j

= ˆun(k)sgn(k)α|k|

= ˆun(k)λkηn,k, ηn,k N(0, 1).

We can now proceed as in Theorem 16 to deduce that |un(k)|2 0 whenever |λk|2 < 2eγ; note that the correlations between ˆun(k) and ˆun( k) do not aﬀect the argument. Now observe that |λk|2 = (2π2k2) 1 < 1 < 2eγ for all k, and the result follows.

Remark 19 The preceding results also holds if we replace the Brownian bridge by a Gaussian process with precision operator the negative Laplacian subject to Neumann boundary conditions and spatial mean zero; the eigenfunctions are then

2 cos(jπx) = 1

ϕj(x) + ϕ j(x) , j 1.

The argument is identical, except no sgn(k) term appears in λk.

Dunlop, Girolami, Stuart and Teckentrup

4. Numerical Illustrations

We now study two of the constructions of deep Gaussian processes numerically. In subsection 4.1 we look at realizations of the deep Gaussian process constructed using the covariance function formulation, and in subsection 4.2 we perform similar experiments for the covariance operator formulations. Finally we consider Bayesian inverse problems, in which we choose deep Gaussian processes as our prior distributions; we introduce a function space MCMC algorithm, which scales well under mesh reﬁnement of the functions to be inferred, for sampling. For the composition construction, numerical experiments are given by, for example, Damianou and Lawrence (2013); Duvenaud et al. (2014). We do not provide numerical experiments for the convolution construction; Theorem 16 tells us that interesting behaviour cannot be expected in this case.

4.1 Covariance Function

We start by investigating typical realizations of a deep Gaussian process, constructed through anisotropic covariance kernels as in section 2.2. As the basis of our construction, we choose a stationary Gaussian correlation kernel, given by

ρS(r) = exp( r2), r > 0.

The function F determining the length scale of the kernel ρ( , ; un) is chosen as F(x) = x2, such that Σ(z) = (un(z))2 Id. Similar results are obtained with other choices of F in terms of the distribution of samples un. The choice of F does, however, inﬂuence the conditioning of the correlation matrix R(un), and the choice F(x) = exp(x), for example, can lead to numerical instabilities. As described in section 2.2, we will sample from the ﬁnite dimensional distributions obtained by sampling from the Gaussian process at a ﬁnite number of points in the domain D. To generate the samples, we use the command mvnrnd in MATLAB, and when plotting the samples, we use linear interpolation. In Figure 1, we show four independent realizations of the ﬁrst seven layers u0, . . . , u6, where u0 is taken as a sample of the stationary Gaussian process with correlation kernel ρS. The domain D is here chosen as the interval (0, 1), and the sampling points are given by the uniform grid xi = i 1

256 , for i = 1, . . . , 257. Each column in Figure 1 corresponds to one realization, and each row corresponds to a given layer un, the ﬁrst row showing u0. We can clearly see the non-stationary behaviour in the samples when progressing through the levels. We note that the ergodicity of the chain is also reﬂected in the samples, with the distribution of the samples un looking similar for larger values of n. Figure 2 shows the same information as Figure 1, in the case where the domain D is (0, 1)2

and the sampling points are the tensor product of the one-dimensional points x1 i = i 1

64 , for i = 1, . . . , 65.

4.2 Covariance Operator

We now consider the covariance operator construction of the deep Gaussian process. In order to produce more interesting behaviour in the samples, we move away from the absolutely continuous setting considered in section 3.3 by introducing a rescaling of C(u) that depends

How Deep Are Deep Gaussian Processes?

0 0.5 1 0.6

0 0.5 1 -0.5

u0 u1 u2 u3 u4 u5 u6

Sample 1 Sample 2 Sample 3 Sample 4

Figure 1: Four independent realizations of the ﬁrst seven layers of a deep Gaussian process, in one spatial dimension, using the covariance kernel construction described in subsection 2.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom.

Dunlop, Girolami, Stuart and Teckentrup

u0 u1 u2 u3 u4 u5 u6

Sample 1 Sample 2 Sample 3 Sample 4

Figure 2: Four independent realizations of the ﬁrst seven layers of a deep Gaussian process, in two spatial dimensions, using the covariance kernel construction described in subsection 2.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom.

How Deep Are Deep Gaussian Processes?

on u. This scaling is chosen so that the amplitude of samples is O(1) with respect to u. The rescaled family can be shown to satisfy Assumptions 3, and a minorization condition as in Lemma 15 can also be shown to hold when the state space is ﬁnite-dimensional. From this we can deduce that the resulting discretized process will still be ergodic. Assume D Rd and deﬁne the negative Laplacian on D( ),

D( ) = u H2(D; R) du dν (x) = 0 for x D ,

where ν is the outward normal to D. Given α > d/2, σ > 0, we deﬁne P = and

C(u) 1 = σ 2(P + Γ(u))α/2Γ(u)d/2 α(P + Γ(u))α/2 (15)

where Γ(u)v (x) = F u(x) v(x). The scaling introduced is inspired by the SPDE representation of Whittle-Matérn distributions (Lindgren et al., 2011); if F(u) = τ 2 is chosen to be constant, then modulo boundary conditions, samples from a centred Gaussian distribution with covariance C(u) are samples from a Whittle-Matérn distribution. In particular, τ corresponds to the inverse length-scale of samples, and samples almost-surely have s Sobolev and Hölder and derivatives for any s < α d/2. For numerical experiments, we take

F(u) = min{F + aebu2, F+}

for some F+, F , a, b > 0. In particular, in one spatial dimension we take F+ = 1502, F = 200, a = 100 and b = 2. In two dimensions, we take F+ = 1502, F = 50, a = 25 and b = 0.3. We take α = 4 in both cases, and choose σ such that E u(x)2 1. These parameter choices were made empirically to ensure interesting structure of the samples. In order to generate samples at a given level, the negative Laplacian P is constructed using a ﬁnite-diﬀerence method. Given u, the operator A(u) is then computed,

A(u) := σ 1Γ(u)d/4 α/2(P + Γ(u))α/2,

so that v N(0, C(u)) solves the SPDE A(u)v = ξ, where ξ is white noise. In Figure 3 we show samples of the deep Gaussian process on domain D = (0, 1), sampled on the uniform grid xi = i 1

1000, for i = 1, . . . , 1001. We show 4 independent realizations of the ﬁrst seven layers of the process each row corresponds to a given layer un. The anisotropy of the length-scale is evident in levels beyond u0, and the eﬀect of ergodicity is evident, with deeper levels having similar properties. Compared to the covariance function construction, local eﬀects are less prominent, though a greater level of anisotropy could potentially be obtained by making an alternative choice of F( ). Figure 4 shows the same experiments on domain D = (0, 1)2, sampled on the tensor product of the one-dimensional points x1 i = i 1

150 , for i = 1, . . . , 151, and the same eﬀects are observed. Figure 5 shows the trace of the norm of a DGP {un} with d = 1, along with the running mean of these norms; the rapid convergence of the mean reﬂects the ergodicity of the chain. We emphasize that our perspective on inference includes quite general inverse problems, and is not limited to the problems of regression and classiﬁcation which dominate much of

Dunlop, Girolami, Stuart and Teckentrup

0 0.5 1 -10

0 0.5 1 -10

0 0.5 1 -10

0 0.5 1 -10

0 0.5 1 -10

u0 u1 u2 u3 u4 u5 u6

Sample 1 Sample 2 Sample 3 Sample 4

Figure 3: Four independent realizations of the ﬁrst seven layers of a deep Gaussian process, in one spatial dimension, using the covariance operator construction described in subsection 4.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom.

How Deep Are Deep Gaussian Processes?

u0 u1 u2 u3 u4 u5 u6

Sample 1 Sample 2 Sample 3 Sample 4

Figure 4: Four independent realizations of the ﬁrst seven layers of a deep Gaussian process, in two spatial dimensions, using the covariance operator construction described in subsection 4.2. Each column corresponds to an independent chain, and layers u0, u1, . . . , u6 are shown from top-to-bottom.

Dunlop, Girolami, Stuart and Teckentrup

0 100 200 300 400 500 600 700 800 900 1000 0

Figure 5: The trace of the norm of un versus n for a 1000 layer DGP {un} as in Figure 3. The thick black curve shows the running mean of the norms.

classical machine learning; this broad perspective on the potential for the methodology affects the choice of algorithms that we study as we do not exploit any of the special structures that arise in regression and classiﬁcation. The deep Gaussian processes discussed in the previous sections were introduced with the idea of providing ﬂexible prior distributions for inference, for example in inverse problems. The structure of such problems is as follows. We have data y RJ arising via the model

y = G(u) + η (16)

where η is a realization of some additive noise, and G : X RJ is a (typically non-linear) forward map. The map G may involve, for example, solution of a partial diﬀerential equation which takes function u as input, or point evaluations of a function u, regression. In this paper we will ﬁx X = HN, writing u = (u0, . . . , u N 1) X; our prior beliefs on u will then be characterized by the ﬁrst N states of a Markov chain of a form considered in the previous sections. Note that the map G could incorporate a projection map if the dependence is only upon a single state u N 1; indeed this is the canonical example the variables (u0, . . . , u N 2) are viewed as hyperparameters in a prior on the parameter u N 1.

4.2.1 Algorithms

We now turn to the design of algorithms for the Bayesian inference problems of sampling u|y. As already mentioned above, we are typically only interested in sampling the deepest layer u N 1|y. However, due to the hierarchical deﬁnition of u N 1 given all the components of u, our algorithms work with the full set of layers u. Since the components of u are functions, and hence inﬁnite dimensional objects in general, a guiding principle is to design algorithms which are well-deﬁned on function space, an approach to MCMC inference reviewed by Cotter et al. (2013); the value of this approach is that it leads to algorithms whose mixing time is not dependent on the number of mesh points used to represent the function to be inferred. For simplicity of exposition we assume that the observational noise η is distributed

How Deep Are Deep Gaussian Processes?

as N(0, Γ); this is not central to our developments but makes the exposition concrete. Recalling that the Markov chain deﬁning the prior beliefs is given by (Zero Mean), we can consider the unknowns in the problem to be the variables u = (u0, . . . , u N 1), which are correlated under the prior, or the variables ξ = (ξ0, . . . , ξN 1), where we deﬁne ξ0 = u0, which are independent under the prior. These variables are related via u = T(ξ), where the components of the deterministic map T : X X are deﬁned iteratively by

T1(ξ0, . . . , ξN 1) = ξ0,

Tn+1(ξ0, . . . , ξN 1) = L(Tn(ξ0, . . . , ξN 1))ξn, n = 1, . . . , N 1.

The data may then be expressed in terms of ξ rather than u:

y = G(ξ) + η = G(T(ξ)) + η (17)

where our prior belief on ξ is that its components are i.i.d. Gaussians. To be consistent with the notation introduced by Papaspiliopoulos et al. (2007); Yu and Meng (2011), (16) will be referred to as the centred model and (17) will be referred to as the non-centred model. The space H may be chosen diﬀerently in the centred and non-centred cases. Associated with the two data models are two likelihoods: P(y|u) and P(y|ξ). Assuming that the observational noise η N(0, Γ) is Gaussian, where Γ RJ J is a positive deﬁnite covariance matrix, the likelihoods are given by

P(y|u) = 1 Z(y) exp Φ(u; y) , Φ(u; y) := 1

2 (y G(u)) 2,

P(y|ξ) = 1 Z(y) exp Φ(ξ; y) , Φ(ξ; y) := 1

2 (y G(ξ)) 2.

We may then apply Bayes theorem to write down the posterior distributions P(u|y) and P(ξ|y):

P(u|y) P(y|u)P(u) exp Φ(u; y) P(u),

P(ξ|y) P(y|ξ)P(ξ) exp Φ(ξ; y) P(ξ).

We know (Cotter et al., 2013) that it is straightforward to design algorithms to sample P(ξ|y) which are well-deﬁned in inﬁnite dimensions, exploiting the fact that P(ξ) is Gaussian. An example of such an algorithm given by Algorithm 1. This algorithm produces a chain {ξ(k)}k N that samples P(ξ|y) in stationarity; and {T(ξ(k))}k N will be samples of P(u|y). By working in non-centred coordinates we have been able to design this algorithm which is well-deﬁned on function space. If we were to work with the centred coordinates u directly, the algorithm would not be well-deﬁned on function space: in inﬁnite dimensions, each family of measures {P(un|un 1)}un 1 H will typically be mutually singular, and so a proposed update u 7 ˆu will almost surely be rejected. To see why this rejection occurs in practice, in high ﬁnite dimensions K, notice that the acceptance probability for an update u 7 ˆu will involve the ratios of the Gaussian densities N(un; 0, C(un 1)) and N(un; 0, C(ˆun 1)). These densities will decay to zero as the dimension K is increased, and their ratio will only be well-deﬁned in the limit if the measures are equivalent; consequently, the Markov chain will mix very poorly. Working with the

Dunlop, Girolami, Stuart and Teckentrup

Algorithm 1 Non-Centred Algorithm

1. Fix β0, . . . , βN 1 (0, 1] and deﬁne B = diag(βj). Choose initial state ξ(0) X, and set u(0) = T(ξ(0)) X. Set k = 0.

2. Propose ˆξ(k) = (I B2) 1 2 ξ(k) + Bζ(k) j , ζ(k) N(0, I).

3. Set ξ(k+1) = ˆξ(k) with probability

αk = min n 1, exp Φ T(ξ(k)); y Φ T(ˆξ(k)); y o ;

otherwise set ξ(k+1) = ξ(k).

4. Set k 7 k + 1 and go to 1.

non-centred coordinates ξ, the prior does not appear in the acceptance probability and so this issue is circumvented. Another advantage of using the non-centred coordinates is that there is no need to calculate the (divergent) log determinants which appear in the centred acceptance probability, avoiding potential numerical issues. These issues are discussed in greater depth and generality by Chen et al. (2018). For the reasons set-out in that paper, including those above, we have used only the non-centred algorithm in what follows. When the forward model G(u) = Au is linear, the non-centred algorithm can be combined with standard Gaussian process regression techniques via the identity

P(du N|y) = Z

X P(du N|u N 1, y)P(du N 1|y).

The distribution P(du N|u N 1, y) = N(my(u N 1), Cy(u N 1)) is Gaussian, where expressions for my, Cy are known, and so direct sampling methods are available. On the other hand, we have that P(y|u N 1) = N(0, AC(u N 1)A + Γ), and so we may use the non-centred algorithm to robustly sample the measure

P(du N 1|y) = exp( Ψ(u N 1; y))P(du N 1),

Ψ(u N 1; y) = 1

2 y 2 AC(u N 1)A +Γ + 1

2 log det(AC(u N 1)A + Γ),

after reparametrizing in terms of ξ. This approach can be viable even when the data is particularly informative so that Φ is very singular this singularity does not in general pass to Ψ. It is this approach that we use for the simulations in the following subsections. An alternative approach not based on MCMC would be to use the non-centred parameterization of the Ensemble Kalman Filter (Chada et al., 2018) which we have successfully implemented in the context of the deep Gaussian processes of this paper, but do not show here for reasons of brevity.

4.3 Application to Regression

We consider the application of the non-centered algorithm described above to simple regression problems in one and two spatial dimensions.

How Deep Are Deep Gaussian Processes?

4.3.1 One-Dimensional Simulations

We consider ﬁrst the case D = (0, 1), where the forward map is given by a number of point evaluations: Gj(u) = u(xj) for some sequence {xj}J j=1 D. We compare the quality of reconstruction versus both the number of point evaluations and the number of levels in the deep Gaussian prior. We use the same parameters for the family of covariance operators as in subsection 4.2. The base layer u0 is taken to be Gaussian with covariance of the form (15), with Γ(u) 202. The true unknown ﬁeld u is given by the indicator function u = 1(0.3,0.7), shown in Figure 6. It is generated on a mesh of 400 points, and three data sets are created wherein it is observed on uniform grids of J = 25, 50 and 100 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 200 points to avoid an inverse crime (Kaipio and Somersalo, 2006). 106 samples are generated per chain, with the ﬁrst 2 105 discarded as burn-in when calculating means. The jump parameters βj are adaptively tuned to keep acceptance rates close to 30%. In these experiments the deepest ﬁeld is labelled as u N, rather than as u N 1 as in the statement of the algorithm; this is purely for notational convenience, of course. In Figure 7 the means of the deepest ﬁeld u N and of the length-scales associated with each hidden layer are shown, that is, approximations to E u N and E F(uj) 1 2 for each j = 0, . . . , N 1. We see that, in all cases, the reconstructions of u are visually similar when two or more layers are used, and similar length-scale ﬁelds E F(u N 1) 1 2 are obtained in these cases. The sharpness of these length-scale ﬁelds is related to the amount of data. Additionally, when N = 4 and J = 100 the location of the discontinuities is visible in the estimate for E F(u N 2) 1 2 , suggesting the higher quality data can inﬂuence the process more deeply. When J = 50 or J = 25, this layer does not appear to be signiﬁcantly informed. When a single layer prior is used, the reconstruction fails to accurately capture the discontinuities. Figure 7 also shows bands of quantiles of the values u(x) under the posterior, illustrating their distribution; in particular the lack of symmetry and disagreement of the means and medians show that the posterior is clearly non-Gaussian. Uncertainty increases both as the number of observations J and the layer n in the chain is increased. Note in particular the over-conﬁdence of the shallow Gaussian process posterior: the truth is not contained within 95% credible intervals in all cases. In Table 1 we show the L1-errors between the true ﬁeld and the posterior means arising from the diﬀerent setups. The errors decrease as the number of observation points is increased, as would be expected. Additionally, when J = 100 and J = 50, the accuracy of the reconstruction increases with the number of layers, though the most signiﬁcant increase occurs when increasing from 1 to 2 layers. When J = 25, the error increases beyond 2 layers, suggesting that some balance is required between the quality of the data and the ﬂexibility of the prior. In Figure 8 we replace the uniformly spaced observations with 106 randomly placed observations, to illustrate the eﬀect of very high quality data. With 3 or 4 layers, more anisotropic behavior is observed in the length-scale ﬁeld. Additionally, the layer u N 2 is much more strongly informed than the cases with fewer observations, though the layer u N 3 in the case N = 4 does not appear to be informed at all, indicating a limitation on how deeply the process can be inﬂuenced by data. The corresponding errors are shown in Table 1 as

Dunlop, Girolami, Stuart and Teckentrup

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 -0.5

Figure 6: The true ﬁeld used to generate the data for the one-dimensional inverse problem.

J 1 layer 2 layers 3 layers 4 layers

100 0.0485 0.0200 0.0198 0.0196

50 0.0568 0.0339 0.0339 0.0337

25 0.0746 0.0658 0.0667 0.0670

106 0.0131 0.000145 0.000133 0.000133

Table 1: The L1-errors u E(u N) L1 between the true ﬁeld and sample means for the onedimensional simulations shown in Figure 7, for diﬀerent numbers of data points J and layers N. Also shown are the corresponding errors for the simulations shown in Figure 8

in the cases N = 50, 100, more layers increases the accuracy of the mean, with diminishing returns for each additional layer. Note that higher accuracy could be attained in the single layer case by adjusting the constant length-scale parameter. Finally, in Figure 9, we consider the same experiment as in Figure 7, except observations are limited to the subset (0, 0.5) of the domain. Uncertainty is naturally higher in the unobserved portion of the domain. Uncertainty also increases in the observed layer u N as N is increased; this could suggest that deep Gaussian processes may provide better generalization to unseen data than shallow Gaussian processes note that the truth has much higher probability under the posterior with 4 layers versus just 1.

4.3.2 Two-Dimensional Simulations

We now consider the case D = (0, 1)2, again where the forward map is given by a number of point evaluations. We ﬁx the number of point observations J = 210, on a 25 25 uniform grid. We again compare quality of reconstruction versus the number of point evaluations and the number of levels in the deep Gaussian prior, and use the same parameters for the

How Deep Are Deep Gaussian Processes?

Figure 7: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem. Number of data points taken are J = 100 (top block), J = 50 (middle block), J = 25 (bottom block). From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom within each block, N = 4, 3, 2, 1.

Dunlop, Girolami, Stuart and Teckentrup

Figure 8: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem, with J = 106 data points. From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom, N = 4, 3, 2, 1.

family of covariance operators as in subsection 4.2. The base layer u0 is taken to be Gaussian with covariance of the form (15), with Γ(u) 202. The true unknown ﬁeld u is constructed as a linear combination of truncated trigonometric functions with diﬀerent length-scales, and shown in Figure 10 along with its contours. It is given by

u (x, y) = cos(2πx) cos(2πy) + sin(4πx) sin(4πy)1(1/4,3/4)2(x, y)

+ sin(8πx) sin(8πy)1(1/2,3/4)2(x, y)

+ sin(16πx) sin(16πy)1(1/4,1/2)2(x, y).

It is generated on a uniform square mesh of 214 points, and two data sets are created wherein it is observed on uniform square grid of J = 210, 28 points, and corrupted by white noise with standard deviation γ = 0.02. Sampling is performed on a mesh of 212 points to again avoid an inverse crime. 4 105 samples are generated per chain, with the ﬁrst 2 105

discarded as burn-in when calculating means. Again the jump parameters βj are adaptively tuned to keep acceptance rates close to 30%. In Figure 11, analogously to Figure 7, the means of u N and of the length-scales associated with each layer are shown, for N = 1, 2, 3. When J = 210, reconstructions are similar, though quality is generally proportional to the number of layers. In particular the, eﬀect of too short a length-scale is evident in the case N = 1, in the regions where the length-scale should be larger, and conversely the eﬀect of too long a length-scale is evident in the cases N = 1, 2 in the region where the length-scale should be the shortest. In the cases N = 2, 3, the length-scale ﬁelds E F(u N 1) 1 2 are similar, though in the case N = 3 more accurately captures the true length-scales. When J = 28 the reconstructions are again similar, though

How Deep Are Deep Gaussian Processes?

Figure 9: Estimates of posterior means (solid curves) and 5 95% quantiles (shaded regions) arising from one-dimensional inverse problem. Number of data points taken are J = 100 (top block), J = 25 (bottom block). From left-to right, results for u N, F(u N 1) 1 2 , F(u N 2) 1 2 , F(u N 3) 1 2 are shown. From top-to-bottom within each block, N = 4, 3, 2, 1.

Dunlop, Girolami, Stuart and Teckentrup

0 0.2 0.4 0.6 0.8 1 0

0 0.2 0.4 0.6 0.8 1 0

Figure 10: The true ﬁeld used to generate the data for the two-dimensional inverse problem.

J 1 layer 2 layers 3 layers

210 0.0856 0.0813 0.0681

28 0.1310 0.1260 0.1279

Table 2: The L2-errors u E(u N) L2 between the true ﬁeld and sample means for the twodimensional simulations shown in Figure 11, for diﬀerent numbers of data points J and layers N.

there is now less accuracy in the shapes of the contours. In particular, the eﬀect of too short a length-scale is especially evident in the case N = 1. The values of the reconstructed ﬁelds in the area of shortest length-scale are inaccurate in all cases the positions of the observation points meant that the actual values of the peaks were not reﬂected in the data. The ﬁelds E F(u N 1) 1 2 have similar structure to the case J = 210, though less accurately represent the true length scales. The L2-errors between the means and the truth are shown in Table 2

5. Conclusions, Discussion, and Actionable Advice

In this section we provide an overview of the advantages and disadvantages of each of the four diﬀerent DGP constructions considered in the paper, and summarize actionable advice that can be taken from the theoretical and numerical results that have been presented. We then outline a number of directions that would be interesting for future study.

How Deep Are Deep Gaussian Processes?

N = 3 N = 2 N = 1

N = 3 N = 2 N = 1

Figure 11: Estimates of posterior means arising from two-dimensional inverse problem. (Top block) J = 210, (Bottom block) J = 28. From left-to right, E u N , E F(u N 1) 1 2 , E F(u N 2) 1 2 . From top-to-bottom within each block, N = 3, 2, 1.

Dunlop, Girolami, Stuart and Teckentrup

5.1 Comparison of Deep GP Constructions

We have considered four diﬀerent constructions of deep GPs and we now discuss their relative merits. We also consider the context of variational inference which is popular in machine learning primarily because of its tractability. We emphasize however that it forms an uncontrolled approximation of the true posterior distribution and may fail to adequately represent the posterior distribution, and uncertainty in particular. The composition construction is the classical construction introduced by Damianou and Lawrence (2013), building a hierarchy of layers using a stationary covariance function and composition. It has received the most study, and methods for variational inference have already been established. It has the advantage of scaling well with respect to data dimension d, however accurate sampling methods such as MCMC are intractable for large numbers of data points, due to the requirement to construct and factor dense covariance matrices at every step. The covariance function construction builds the hierarchy using a stationary covariance function, and iteratively modifying its associated length scale. It has the advantage that each layer can be readily interpreted as the anisotropic length-scale ﬁeld of the following layer. Its scaling properties are similar to those of the composition construction, however variational inference methods for this construction have not yet been studied. The covariance operator construction builds the hierarchy using an SPDE representation of stationary Matèrn ﬁelds, and again iteratively modiﬁes their associated length scale. It allows for fast sampling in low data dimension d via the use of PDE solvers, even when the number of data points is large. Accurate sampling via MCMC methods is tractable with this construction, due to the low cost of constructing and storing the inverse covariance (precision) matrix. Inference when d is large appears to be intractable at present, due to the requirement of dense meshes for PDE solvers. Finally, the convolution construction builds the hierarchy via iterative convolution of Gaussian random ﬁelds. It has the advantage of being amenable to analysis, however the results of this analysis indicate that it would likely be a poor construction to use for inference due to trivial behaviour for large depth. To summarize the numerical results on illustrative regression problems from the previous section, if the data is high quality, a small number of layers in the DGP will be suﬃcient as the problem becomes closer to interpolation. Conversely, if the data is low quality the likelihood is not strong enough to inform deeper layers in the DGP, and so a small number of layers is again suﬃcient. As a consequence, when the data lies between these two cases, and the truth has suﬃciently rich structure, the use of deeper processes may be advantageous, but care is required to limit the number of layers employed.

5.2 Summary and Future Work

There are a number of interesting ways in which this work may be generalized. Within the context of covariance operators it is of interest to construct covariances C(u) which are deﬁned as L α with L being the divergence form elliptic operator

Lu = F(u) u .

How Deep Are Deep Gaussian Processes?

Such a construction allows for the conditional distributions of the layers to be viewed as stationary on deformed spaces (Lindgren et al., 2011, 3.4), or to incorporate anisotropy in speciﬁc directions (Roininen et al., 2014, 3.1). Similar notions of anisotropy in diﬀerent directions can be incorporated into the covariance function formulation by choosing the length scale Σ(z) diﬀerent to a multiple of the identity matrix. Additionally, we could consider a non-zero mean in the iteration (GP), as considered by Duvenaud et al. (2014); Salimbeni and Deisenroth (2017), allowing for forcing of the system. For example, with the choice m(un) = un and a rescaling of the covariance, we obtain the Res Net-type iteration

un+1 = un + p

t L(un)ξn+1.

This may be viewed as a discretization of the continuous-time stochastic diﬀerential equation

du = L(u)L(u) d W,

analogously to what has been considered for neural networks (Haber and Ruthotto, 2017). Study of these systems could be insightful, for example deriving conditions to ensure a lack of ergodicity and hence arbitrary depth. As before denotes the adjoint operation. And ﬁnally it is possible to consider processes outside the four categories considered here; for example the one-step transition from un to un+1 might be deﬁned via stochastic integration against i.i.d. Brownian motions. We have shown how a number of ideas in the literature may be recursed to produce deep Gaussian processes, diﬀerent from those introduced by Damianou and Lawrence (2013). We have studied the eﬀective depth of these processes, either through demonstrating ergodicity, or through showing convergence to a trivial solution (such as 0 or ). Together these results demonstrate that, as also shown by Duvenaud et al. (2014) for the original construction of deep Gaussian processes, care is needed in order to design processes with signiﬁcant depth. Nonetheless, even a few layers can be useful for inference purposes, and we have demonstrated this also. It is an interesting question to ask precisely how the approximation power and eﬀective depth are aﬀected by the number of layers of the process, both in the non-ergodic case, and in the ergodic case before stationarity has been reached. We also emphasize that the analysis in the paper is based solely on the deep Gaussian process un, and not the conditioned process un|y in the inference problem with observed data y. The ergodicity properties of un do not directly carry over to un|y. As we have seen in the numerical experiments, the number of layers required in the inference problem in practice depends on the information content in the observed data y, and the analysis in this paper does not fully answer the question as to how many. The results in this paper do show, however, that in the case of ergodic constructions, the expressive power of the prior distribution in the inference problem does not increase past a certain number of layers. This provides some justiﬁcation for using only a moderate number of layers in a deep Gaussian process prior in inference problems. There are interesting approximation theory questions around deep processes, such as those identiﬁed in the context of neural networks by Pinkus (1999). There are also interesting questions around the use of these deep processes for inversion; in particular it seems hard to get signiﬁcant value from using depth of more than two or three layers for noisy inverse problems. On the algorithmic side the issue of eﬃciently sampling these deep processes

Dunlop, Girolami, Stuart and Teckentrup

(even over only two layers), when conditioned on possibly nonlinear observations remains open. We have used non-centred parameterizations because these may be sampled using function-space MCMC (Cotter et al., 2013; Chen et al., 2018); but centred methods, or mixtures, may be desirable for some applications.

Acknowledgments

MG is supported by EPSRC grants [EP/R034710/1, EP/R018413/1, EP/R004889/1, EP/P020720/1], an EPSRC Established Career Fellowship EP/J016934/3, a Royal Academy of Engineering Research Chair, and The Lloyds Register Foundation Programme on Data Centric Engineering. AMS is supported by AFOSR Grant FA9550-17-1-0185 and by US National Science Foundation (NSF) grant DMS 1818977. ALT is partially supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1.

Appendix A. Proofs for Section 2

Proof of Proposition 1 The stationary kernel ρS is positive deﬁnite by Assumption 1, and so by (Wendland, 2004, Theorem 7.14), we have

0 exp( r2t) dν(t) for all r [0, ),

for a ﬁnite, non-negative Borel measure ν on [0, ) that is not concentrated at 0 (i.e. it is not a multiple of the Dirac measure centred at 0). For any x Rd and t [0, ), let us now deﬁne the matrix Σt(x) := (4t) 1Σ(x) and the functions

Kx,t(z) = 1 (2π)d/2| Σt(x)|1/2 exp 1

2(x z)T Σt(x) 1(x z) .

Here | | denotes determinant and so the preceding is simply an expression for a normal density with mean x and covariance matrix Σt(x) when t > 0; at t = 0, we simply have Kx,t(z) = 0, for all x, z Rd. Then ρ(x, x ) is given by

2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4

|Σ(x) + Σ(x )| 1 2 ρS p

Q(x, x ) = 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4

|Σ(x) + Σ(x )| 1 2

0 exp t Q(x, x ) dν(t)

= 2 d 2 |Σ(x)| 1 4 |Σ(x )| 1 4

|Σ(x) + Σ(x )| 1 2

0 exp t(x x )T Σ(x) + Σ(x )

1 (x x ) dν(t)

| Σt(x)| 1 4 | Σt(x )| 1 4

| Σt(x) + Σt(x )| 1 2 exp 1

2(x x )T Σt(x) + Σt(x ) 1 (x x ) dν(t)

= (2π) d 2 2 d 2 Z

0 | Σt(x)| 1 4 | Σt(x )| 1 4 Z

Rd Kx,t(z)Kx ,t(z) dz dν(t),

where in the last step, we have used the fact that the convolution R

Rd Kx,t(z)Kx ,t(z) dz can be calculated explicitly using properties of normal random variables. More precisely, we

How Deep Are Deep Gaussian Processes?

Rd Kx,t(z)Kx ,t(z) dz = Z

Rd p X(z x)p X (z) dz = Z

Rd p X,X (z x, z) dz,

where p X is the density of X N(0, Σt(x)), p X is the density of X N(x , Σt(x )) and X and X are independent. The change of variable from X, X to W, X , where W = X X, has Jacobian 1, and so Z

Rd p X,X (z x, z) dz = Z

Rd p W,X (z (z x), z) dz = Z

Rd p W,X (x, z) dz = p W (x).

Since W = X X N(x , Σt(x) + Σt(x )), we hence have Z

Rd Kx,t(z)Kx ,t(z) dz

(2π) d 2 | Σt(x) + Σt(x )| 1 2 exp 1

2(x x )T Σt(x) + Σt(x ) 1 (x x ) ,

as required. Now, for any b RN and pairwise distinct {xi}N i=1, we then have

j=1 bibjρ(xi, xj)

= (2π) d 2 2 d 2

Rd | Σt(xi)| 1 4 Kxi,t(z)| Σt(xj)| 1 4 Kxj,t(z) dz dν(t)

= (2π) d 2 2 d 2 Z

i=1 bi| Σt(xi)| 1 4 Kxi,t(z)

since the Borel measure ν is ﬁnite and non-negative. It remains to show that strict inequality also holds. Firstly, we note that | Σ0(xi)| 1 4 Kxi,0(z) = 0, for all xi, z Rd, which means that the integrand with respect to t is identically equal to zero at t = 0. Secondly, we note that the points {xi}N i=1 are pairwise distinct and the functions {| Σt(xi)| 1 4 Kxi,t( )}N i=1 are hence linearly independent for any t (0, ). It is thus impossible to make the integrand with respect to z identically equal to 0 for a.e. z Rd. As a consequence the integrand with respect to t is positive for all t (0, ). Since we know that the measure ν is not concentrated at 0 this completes the proof that ρ is positive deﬁnite on Rd Rd, for any d N. Finally, we note that the kernel ρ is clearly non-stationary, and is a correlation function since ρ(x, x) = 1, for any x Rd.

Proof of Proposition 2 We note that the deﬁnition of positive deﬁnite in Assumptions 1(i) refers only to behaviour of the kernel on a ﬁnite set of pairwise distinct points {xi}N i=1. By Assumption 2(i), the function G is non-negative and bounded. If G(z) > 0 for all z Rd,

Dunlop, Girolami, Stuart and Teckentrup

then the matrix Σ(z) is positive deﬁnite for all z Rd, and the fact that ρ( , ) is positive deﬁnite follows directly from Proposition 1. It remains to investigate the case where G(z) = 0 for some z Rd. We will prove that ρ( , ) is positive deﬁnite by showing that the correlation matrix R, with entries Rij = ρ(xi, xj), is positive deﬁnite for any pairwise disjoint points {xi}N i=1. Without loss of generality, we will study the case G(x1) = 0; the proof easily adapts to the case where G(xi) = 0, for i = 1. To deﬁne ρ(x1, xj) in this case, we start by assuming G(x1) > 0, G(xj) > 0, and then take limits. With Σ(z) = G(z)Id, we have

Q(x1, xj) = (x1 xj)T Σ(x1) + Σ(xj)

= 2 x1 xj 2 2 G(x1) + G(xj) 1 ,

where 2 is the Euclidean norm, and

2 d 2 det(Σ(x1)) 1 4 det(Σ(xj)) 1 4

det(Σ(x1) + Σ(xj)) 1 2 =

4G(x1)G(xj) G(x1) + G((xj) 2

We now study separately three cases:

i) xj = x1: we have

lim G(x1) 0

4G(x1)G(x1) G(x1) + G(x1) 2

4 = lim G(x1) 0 1 = 1, (18)

and so using the algebra of limits, the continuity of ρS, (18) and the fact that ρS(0) = 1, we have lim G(x1) 0 ρ(x1, x1) = lim G(x1) 0 ρS p

Q(x1, x1) = ρS(0) = 1.

ii) xj = x1 and G(xj) > 0: we have

lim G(x1) 0 Q(x1, xj) = 2 x1 xj 2 2 G(xj)) 1 ,

lim G(x1) 0

4G(x1)G(xj) G(x1) + G(xj) 2

4 = 0. (19)

Thus, using the continuity of ρS, together with (19) and the algebra of limits, we have

lim G(x1) 0 ρ(x1, xj) = 0.

How Deep Are Deep Gaussian Processes?

iii) xj = x1, G(xj) = 0: we obtain

lim G(x1),G(xj) 0 Q(x1, xj) = ,

which by Assumptions 2(ii) implies that

lim G(x1),G(xj) 0 ρS q

Q(x1, xj) = 0.

Since (a + b)2 4ab for any positive numbers a and b, we have

4G(x1)G(xj) G(x1) + G((xj) 2

for any G(x1) > 0, G(xj) > 0, and hence

lim G(x1),G(xj) 0 ρ(x1, xj) = 0.

Hence, when G(xi) > 0, for i = 2, . . . , N, we have lim G(x1) 0 R = R , where the matrix R has the ﬁrst row and column equal to the ﬁrst basis vector e1 = (1, 0, 0, . . . , 0) RN, and the remaining submatrix R N 1 RN 1 N 1 with entries ρ(xi, xj), for i, j = 2, . . . , N. The matrix R N 1 is positive deﬁnite by Proposition 1, from which we can conclude that R

is positive deﬁnite also. A similar argument holds when G(xi) = 0 for one or more indices i {2, . . . , N}.

Neil K Chada, Marco A Iglesias, Lassi Roininen, and Andrew M Stuart. Parameterizations for ensemble Kalman inversion. Inverse Problems, 34(5):055009, 2018.

Victor Chen, Matthew M Dunlop, Omiros Papaspiliopoulos, and Andrew M Stuart. Robust MCMC sampling with non-Gaussian and hierarchical priors in high dimensions. ar Xiv preprint ar Xiv:1803.03344, 2018.

Simon L Cotter, Gareth O Roberts, Andrew M Stuart, and David White. MCMC methods for functions: modifying old algorithms to make them faster. Statistical Science, 28(3): 424 446, 2013.

Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In International Conference on Machine Learning, pages 884 893, 2017.

Zhenwen Dai, Andreas Damianou, Javier González, and Neil Lawrence. Variational autoencoded deep Gaussian processes. ar Xiv preprint ar Xiv:1511.06455, 2015.

Dunlop, Girolami, Stuart and Teckentrup

Andreas C Damianou and Neil D Lawrence. Deep Gaussian Processes. In Artiﬁcial Intelligence and Statistics, pages 207 215, 2013.

Yair Daon and Georg Stadler. Mitigating the inﬂuence of the boundary on PDE-based covariance operators. Inverse Problems and Imaging, 12(5), 2018.

Masoumeh Dashti and Andrew M Stuart. The Bayesian approach to inverse problems. Handbook of Uncertainty Quantiﬁcation, 2017.

Persi Diaconis and David Freedman. Iterated random functions. SIAM Review, 41(1):45 76, 1999.

David K Duvenaud, Oren Rippel, Ryan P Adams, and Zoubin Ghahramani. Avoiding pathologies in very deep networks. In Artiﬁcial Intelligence and Statistics, pages 202 210, 2014.

Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

Martin Hairer. An introduction to stochastic PDEs. ar Xiv preprint ar Xiv:0907.4178, 2009.

Martin Hairer, Andrew M Stuart, Jochen Voss, and Petter Wiberg. Analysis of SPDEs arising in path sampling. Part I: The Gaussian case. Communications in Mathematical Sciences, 3(4):587 603, 2005.

Markus Heinonen, Henrik Mannerström, Juho Rousu, Samuel Kaski, and Harri Lähdesmäki. Non-stationary Gaussian process regression with Hamiltonian Monte Carlo. In Artiﬁcial Intelligence and Statistics, pages 732 740, 2016.

Dave Higdon, Marc Kennedy, James C Cavendish, John A Cafeo, and Robert D Ryne. Combining ﬁeld data and computer simulations for calibration and prediction. SIAM Journal on Scientiﬁc Computing, 26(2):448 466, 2004.

Marco A Iglesias, Yulong Lu, and Andrew M Stuart. A Bayesian level set method for geometric inverse problems. Interfaces and Free Boundaries, 18(2):181 217, 2016.

Jari Kaipio and Erkki Somersalo. Statistical and Computational Inverse Problems, volume 160. Springer Science & Business Media, 2006.

Gopinath Kallianpur. Stochastic Filtering Theory, volume 13. Springer Science & Business Media, 2013.

Marc C Kennedy and Anthony O Hagan. Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3):425 464, 2001.

Finn Lindgren, Håvard Rue, and Johan Lindström. An explicit link between Gaussian ﬁelds and Gaussian Markov random ﬁelds: the stochastic partial diﬀerential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423 498, 2011.

How Deep Are Deep Gaussian Processes?

Jonathan C Mattingly, Andrew M Stuart, and Desmond J Higham. Ergodicity for SDEs and approximations: locally Lipschitz vector ﬁelds and degenerate noise. Stochastic Processes and their Applications, 101(2):185 232, 2002.

Sean P Meyn and Richard L Tweedie. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.

Radford M Neal. Bayesian Learning for Neural Networks. Ph D thesis, University of Toronto, 1995.

Radford M Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classiﬁcation. ar Xiv preprint physics/9701026, 1997.

Christopher J Paciorek. Nonstationary Gaussian processes for regression and spatial modelling. Ph D thesis, Carnegie Mellon University, 2003.

Christopher J Paciorek and Mark J Schervish. Nonstationary covariance functions for Gaussian process regression. Advances in Neural Information Processing Systems, 16:273 280, 2004.

Omiros Papaspiliopoulos, Gareth O Roberts, and Martin Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59 73, 2007.

Allan Pinkus. Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143 195, 1999.

Frank J Pinski, Gideon Simpson, Andrew M Stuart, and Hendrik Weber. Kullback Leibler approximation for probability measures on inﬁnite dimensional spaces. SIAM Journal on Mathematical Analysis, 47(6):4091 4122, 2015.

Carl E Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. the MIT Press, 2(3):4, 2006.

James C Robinson. Inﬁnite-Dimensional Dynamical Systems: An Introduction to Dissipative Parabolic PDEs and the Theory of Global Attractors, volume 28. Cambridge University Press, 2001.

Lassi Roininen, Janne M J Huttunen, and Sari Lasanen. Whittle-Matérn priors for Bayesian statistical inversion with applications in electrical impedance tomography. Inverse Problems and Imaging, 8(2):561 586, 2014.

Lassi Roininen, Mark Girolami, Sari Lasanen, and Markku Markkanen. Hyperpriors for Matérn ﬁelds with applications in Bayesian inversion. ar Xiv preprint ar Xiv:1612.02989, 2016.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588 4599, 2017.

Dunlop, Girolami, Stuart and Teckentrup

Alexandra M Schmidt and Anthony O Hagan. Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):743 758, 2003.

Edward Snelson, Zoubin Ghahramani, and Carl E Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems, pages 337 344, 2004.

Michael L Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer, 1999.

Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.

Yaming Yu and Xiao-Li Meng. To center or not to center: That is not the question an Ancillarity Suﬃciency Interweaving Strategy (ASIS) for boosting MCMC eﬃciency. Journal of Computational and Graphical Statistics, 20(3):531 570, 2011.