# Inter-domain Deep Gaussian Processes

Tim G. J. Rudner 1, Dino Sejdinovic 2, Yarin Gal 1

1 Department of Computer Science, University of Oxford, Oxford, United Kingdom. 2 Department of Statistics, University of Oxford, Oxford, United Kingdom. Correspondence to: Tim G. J. Rudner.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

Abstract

Inter-domain Gaussian processes (GPs) allow for high flexibility and low computational cost when performing approximate inference in GP models. They are particularly suitable for modeling data exhibiting global structure but are limited to stationary covariance functions and thus fail to model non-stationary data effectively. We propose Inter-domain Deep Gaussian Processes, an extension of inter-domain shallow GPs that combines the advantages of inter-domain and deep Gaussian processes (DGPs), and demonstrate how to leverage existing approximate inference methods to perform simple and scalable approximate inference using inter-domain features in DGPs. We assess the performance of our method on a range of regression tasks and demonstrate that it outperforms inter-domain shallow GPs and conventional DGPs on challenging large-scale real-world datasets exhibiting both global structure and a high degree of non-stationarity.

1. Introduction

Gaussian processes (GPs) are a powerful tool for function approximation. They are Bayesian non-parametric models and as such are flexible, robust to overfitting, and provide well-calibrated predictive uncertainty estimates (Rasmussen & Williams, 2005; Bui et al., 2016). Deep Gaussian processes (DGPs) are layer-wise compositions of GPs designed to model a larger class of functions than shallow GPs. To scale GP and DGP models to large datasets, a wide array of approximate inference methods has been developed, with inducing points-based variational inference being the most widely used (Snelson & Ghahramani, 2006; Titsias, 2009; Wilson & Nickisch, 2015). However, conventional inducing points-based inference for GPs relies on point evaluations and thus, by construction, creates local approximations to the target function. As a result, the approximate posterior predictive distribution may fail to capture complex global structure in the data, severely limiting the usefulness and computational efficiency of local inducing points-based approximations.

Inter-domain GPs were designed to overcome this limitation. In order to capture global structure in the underlying data-generating process, inter-domain GPs define inducing variables as projections of the target function over the entire input space rather than as mere point evaluations (Lázaro-Gredilla & Figueiras-Vidal, 2009; Rahimi & Recht, 2008; Gal & Turner, 2015). The resulting posterior predictive distribution is able to represent complex data with global structure with higher accuracy than local approximations, at the same computational cost. Unfortunately, the inter-domain projections most suitable for capturing global structure (e.g., spectral transforms) can only be used with stationary covariance functions, making them ill-suited for modeling non-stationary data and limiting their usefulness in practice.
We propose Inter-domain Deep Gaussian Processes to overcome this limitation while retaining the benefits of inter-domain methods.[1] Specifically, we define an augmented DGP model in which we replace local inducing variables with reproducing kernel Hilbert space (RKHS) Fourier features, and exploit the compositional structure of the variational distribution in doubly stochastic variational inference (DSVI) for DGPs. This way, we achieve simple and scalable approximate inference while efficiently capturing global structure in the underlying data-generating process. The resulting inter-domain DGP is a composition of inter-domain GPs, which makes it possible to efficiently model complex, non-stationary data despite each inter-domain GP in the hierarchy being restricted to stationary covariance functions. We establish that our method performs well on several complex real-world datasets exhibiting global structure and non-stationarity and demonstrate that inter-domain DGPs are more computationally efficient than DGPs with local approximations when modeling data with global structure. Figure 1 shows approximate posterior predictive distributions of an inter-domain deep GP (1a), an inter-domain shallow GP (1b), and a deep GP based on local approximations (1c) on a dataset with global structure.

[1] For source code and additional results, see https://bit.ly/inter-domain-dgps.

Figure 1: Approximate posterior predictive distributions for shallow and deep GP models obtained from 20 inducing points. (a) Deep GP with global approximations. (b) Shallow GP with global approximations. (c) Deep GP with local approximations. Training points and test points are indicated in each panel. The blue lines denote the posterior predictive means of the respective models. Each shade of blue corresponds to one posterior standard deviation.

To summarize, our main contributions are as follows:

1. We propose Inter-domain Deep Gaussian Processes and use RKHS Fourier features to incorporate global structure into the DGP posterior predictive distribution;
2. We present a simple approach for performing approximate inference in inter-domain DGPs by exploiting the compositional structure of the variational distribution in DSVI;
3. We show that inter-domain DGPs significantly outperform both inter-domain shallow GPs and state-of-the-art local approximate inference methods for DGPs on complex real-world datasets with global structure;
4. We demonstrate that inter-domain DGPs are more computationally efficient than local approximate inference methods for DGPs when trained on data exhibiting global structure.

2. Background

We begin by reviewing DGPs and inter-domain GPs. We will draw on this exposition in subsequent sections.

2.1. Deep Gaussian Processes

DGPs are layer-wise compositions of GPs in which the output of a previous layer is used as the input to the next layer. Similar to deep neural networks, the hidden layers of a DGP learn representations of the input data, but unlike neural networks, they allow for uncertainty to be propagated through the function compositions. In this way, DGPs define probabilistic predictive distributions over the target variables and, unlike for shallow GPs, a finite collection of random variables distributed according to a DGP posterior predictive distribution need not be jointly Gaussian, allowing DGP models to represent a larger class of distributions over functions than shallow GPs.
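To make the layer-wise composition concrete, here is a minimal sketch (ours, not from the paper) that draws a sample path from a two-layer DGP prior, assuming a squared-exponential kernel and evenly spaced one-dimensional inputs:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.3, variance=1.0):
    # Squared-exponential kernel between two sets of scalar inputs.
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gp(x, rng, lengthscale=0.3, jitter=1e-8):
    # Draw one sample path of a zero-mean GP evaluated at the inputs x.
    K = rbf_kernel(x, x, lengthscale) + jitter * np.eye(len(x))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(x))

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)        # input locations
f1 = sample_gp(x, rng)                # first-layer draw,  f^(1)(x)
f2 = sample_gp(f1, rng)               # second-layer draw, f^(2)(f^(1)(x))
# f2 is a draw from a two-layer DGP prior; marginally over f1 it is
# generally non-Gaussian, unlike a shallow GP draw.
```

The second layer treats the first layer's outputs as its inputs, which is exactly the composition described above.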
Consider a set of $N$ noisy target observations $\mathbf{y} \in \mathbb{R}^N$ at corresponding input points $\mathbf{X} = [\mathbf{x}_1, ..., \mathbf{x}_N]^\top \in \mathbb{R}^{N \times D}$. A DGP is defined by the composition

$$\mathbf{y} = \mathbf{f}^{(L)} + \boldsymbol{\epsilon} \;\stackrel{\text{def}}{=}\; f^{(L)}\big(f^{(L-1)}(\cdots f^{(1)}(\mathbf{X}) \cdots)\big) + \boldsymbol{\epsilon}, \qquad (1)$$

where $L$ is the number of layers and $\mathbf{f}^{(\ell)} = f^{(\ell)}(\mathbf{f}^{(\ell-1)})$ in the composition denotes the $\ell$th-layer GP, $f^{(\ell)}(\cdot)$, evaluated at $\mathbf{f}^{(\ell-1)}$. We follow previous work and absorb the noise between layers, which is assumed to be i.i.d. Gaussian, into the kernel, so that $k_{\text{noisy}}(\mathbf{x}_i, \mathbf{x}_j) = k(\mathbf{x}_i, \mathbf{x}_j) + \sigma^{(\ell)2}\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta and $\sigma^{(\ell)2}$ is the noise variance between layers (Salimbeni & Deisenroth, 2017). A DGP with likelihood $p(\mathbf{y} \,|\, \mathbf{f}^{(L)})$ has the joint distribution

$$p(\mathbf{y}, \{\mathbf{f}^{(\ell)}\}_{\ell=1}^{L}) = \prod_{i=1}^{N} p(y_i \,|\, f_i^{(L)}) \prod_{\ell=1}^{L} p(\mathbf{f}^{(\ell)} \,|\, \mathbf{f}^{(\ell-1)}),$$

with $\mathbf{f}^{(0)} \stackrel{\text{def}}{=} \mathbf{X}$. Unlike for shallow GPs, exact inference in DGPs is not analytically tractable due to the nonlinear transformations at every layer of the composition in Equation (1). To make posterior inference tractable, a number of approximate inference techniques for DGPs have been developed with the aim of improving performance, scalability, stability, and ease of optimization (Dai et al., 2015; Hensman & Lawrence, 2014; Bui et al., 2016; Salimbeni & Deisenroth, 2017; Cutajar et al., 2017; Mattos et al., 2015; Havasi et al., 2018; Salimbeni et al., 2019).

2.2. Inter-domain Gaussian Processes

Inter-domain GPs are centered around the idea of finding a possibly more compact, representative set of input features in a domain different from the input data domain. This way, it is possible to incorporate prior knowledge about relevant characteristics of the data, such as the presence of global structure, into the inducing variables.

Consider a real-valued GP $f(\mathbf{x})$ with $\mathbf{x} \in \mathbb{R}^D$ and some deterministic function $g(\mathbf{x}, \mathbf{Z})$, with $M$ inducing points $\mathbf{Z} \in \mathbb{R}^{M \times H}$. We define the following transformation:

$$u(\mathbf{Z}) = \int_{\mathbb{R}^D} f(\mathbf{x})\, g(\mathbf{x}, \mathbf{Z})\, \mathrm{d}\mathbf{x}. \qquad (2)$$

Since $u(\mathbf{Z})$ is obtained through an affine transformation of $f(\mathbf{x})$, $u(\mathbf{Z})$ is also a GP, but it may lie in a different domain than $f(\mathbf{x})$ (Lázaro-Gredilla & Figueiras-Vidal, 2009). Inter-domain GPs arise when $f(\mathbf{x})$ and $u(\mathbf{Z})$ are considered jointly as a single, augmented GP, as is the case for local inducing points-based approximate inference. The feature-extraction function $g(\mathbf{x}, \mathbf{Z})$ used in the integral then defines the transformed domain in which the inducing dataset lies. The inducing variables obtained this way can be seen as projections of the target function $f(\mathbf{x})$ onto the feature-extraction function over the entire input space (Lázaro-Gredilla & Figueiras-Vidal, 2009). As such, each of the inducing variables is constructed to contain information about the structure of $f(\mathbf{x})$ everywhere in the input space, making them more informative of the stochastic process than local approximations (Hensman et al., 2018; Lázaro-Gredilla & Figueiras-Vidal, 2009).

In general, the usefulness of inducing variables mostly relies on their covariance with the remainder of the process, which, for inducing points-based approximate inference, is encoded in the vector-valued function $\mathbf{k}_u(\mathbf{x}) = [k(\mathbf{z}_1, \mathbf{x}), k(\mathbf{z}_2, \mathbf{x}), ..., k(\mathbf{z}_M, \mathbf{x})]^\top$. The matrix $\mathbf{K}_{uu} \stackrel{\text{def}}{=} K(\mathbf{Z}, \mathbf{Z})$ and the vector-valued function $\mathbf{k}_u(\mathbf{x})$ are central to inducing points-based approximate inference for GPs, where they are used to construct an approximate posterior distribution.
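To illustrate the projection in Equation (2), the following sketch (ours, not from the paper) approximates the cross-covariance $\mathrm{cov}(u_m, f(x)) = \int k(x', x)\, g(x', z_m)\, \mathrm{d}x'$ by simple quadrature, assuming one-dimensional inputs, a squared-exponential kernel, and a hypothetical Gaussian feature-extraction function $g$:

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2, variance=1.0):
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def window(xp, z, width=0.1):
    # Hypothetical Gaussian feature-extraction function g(x', z_m),
    # centered at the inducing location z (an illustrative choice, not
    # the one used in the paper).
    return np.exp(-0.5 * ((xp - z) / width) ** 2) / (width * np.sqrt(2 * np.pi))

def cross_cov(x, z, num_quad=2000, lower=-3.0, upper=3.0):
    # cov(u_m, f(x)) = \int k(x', x) g(x', z_m) dx', approximated on a grid.
    xp = np.linspace(lower, upper, num_quad)
    w = (upper - lower) / num_quad                      # quadrature weight
    return (rbf_kernel(xp, x) * window(xp, z)[:, None]).sum(axis=0) * w

x_test = np.linspace(-1.0, 1.0, 5)
print(cross_cov(x_test, z=0.0))   # one row of the inter-domain k_u(x)
```

Because each $u_m$ integrates $f$ against $g$ over the whole input range, it summarizes the function globally rather than at a single location; the RKHS Fourier features introduced in Section 3.3 are one particular choice of such a global construction, with sinusoidal features.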
3. Inter-domain Deep Gaussian Processes

In this section, we will introduce inter-domain DGPs. First, we will present a general inter-domain DGP framework. Next, we will explain why constructing inter-domain deep GPs is more challenging than constructing inter-domain shallow GPs and how we can leverage the compositional structure of the layer-wise approximate posterior predictive distributions in doubly stochastic variational inference (Salimbeni & Deisenroth, 2017) to obtain simple and scalable inter-domain DGPs. Finally, we will draw on prior work (Hensman et al., 2018) to explicitly incorporate global structure into the inter-domain transformation.

3.1. The Augmented Inter-domain Deep Gaussian Process Model

In inducing points-based approximate inference, the GP model is augmented by a set of inducing variables, $\mathbf{u}(\mathbf{Z})$. Unlike conventional inducing points-based approximations, inter-domain approaches do not constrain inducing points to lie in the same domain as the input data. To distinguish inducing points that lie in the same domain as the input data from inter-domain inducing points, we diverge from the notation in the previous section and from now on denote the inter-domain inducing points across DGP layers by $\{\boldsymbol{\Omega}^{(\ell)}\}_{\ell=0}^{L-1}$, with corresponding inducing variables $\mathbf{u}^{(\ell)} \stackrel{\text{def}}{=} \mathbf{u}(\boldsymbol{\Omega}^{(\ell-1)})$ for $\ell = 1, ..., L$, where $L$ is the number of DGP layers. We can then express the augmented DGP joint distribution as

$$p(\mathbf{y}, \{\mathbf{f}^{(\ell)}, \mathbf{u}^{(\ell)}\}_{\ell=1}^{L}) = \prod_{n=1}^{N} p(y_n \,|\, f_n^{(L)}) \prod_{\ell=1}^{L} p(\mathbf{f}^{(\ell)} \,|\, \mathbf{u}^{(\ell)}; \mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, p(\mathbf{u}^{(\ell)}; \mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)}). \qquad (3)$$

Importantly, each $p(\mathbf{u}^{(\ell)}; \mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})$ is evaluated at a set of inter-domain inducing points $\boldsymbol{\Omega}^{(\ell-1)}$ but also includes information about $\mathbf{f}^{(\ell-1)}$ via the inter-domain projections. For a graphical representation, see Figure 2b.

Figure 2: Graphical model representations of (a) a DGP model with local inducing inputs, inducing variables, and two hidden layers, $\mathbf{f}^{(1)}$ and $\mathbf{f}^{(2)}$, for $n = 1, ..., N$, and (b) an inter-domain DGP model with inducing frequencies, RKHS Fourier feature inducing variables, and two hidden layers, $\mathbf{f}^{(1)}$ and $\mathbf{f}^{(2)}$, for $n = 1, ..., N$. Greyed-out nodes denote observed data and non-greyed-out nodes denote unobserved data.

To avoid overloading notation, we will assume that each GP layer has the same mean and covariance functions, $m(\cdot)$ and $k(\cdot, \cdot)$. For each DGP layer we thus have a transformed-domain instance of the mean function,

$$m^{\phi}(\boldsymbol{\Omega}^{(\ell-1)}) = \mathbb{E}[\mathbf{u}(\boldsymbol{\Omega}^{(\ell-1)})] = \int_{\mathbb{R}^D} \mathbb{E}[f^{(\ell)}(\mathbf{f}^{(\ell-1)})]\, g(\mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, \mathrm{d}\mathbf{f}^{(\ell-1)} = \int_{\mathbb{R}^D} m(\mathbf{f}^{(\ell-1)})\, g(\mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, \mathrm{d}\mathbf{f}^{(\ell-1)},$$

and transformed-domain instances of the covariance function,

$$k^{\phi}(\mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)}) = \mathbb{E}\!\left[f^{(\ell)}(\mathbf{f}^{(\ell-1)}) \int_{\mathbb{R}^D} f^{(\ell)}(\mathbf{f}'^{(\ell-1)})\, g(\mathbf{f}'^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, \mathrm{d}\mathbf{f}'^{(\ell-1)}\right] = \int_{\mathbb{R}^D} k(\mathbf{f}^{(\ell-1)}, \mathbf{f}'^{(\ell-1)})\, g(\mathbf{f}'^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, \mathrm{d}\mathbf{f}'^{(\ell-1)},$$

$$k^{\phi}(\boldsymbol{\Omega}^{(\ell-1)}, \boldsymbol{\Omega}'^{(\ell-1)}) = \mathbb{E}[\mathbf{u}(\boldsymbol{\Omega}^{(\ell-1)})\, \mathbf{u}(\boldsymbol{\Omega}'^{(\ell-1)})] = \int_{\mathbb{R}^D} \int_{\mathbb{R}^D} k(\mathbf{f}^{(\ell-1)}, \mathbf{f}'^{(\ell-1)})\, g(\mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)})\, g(\mathbf{f}'^{(\ell-1)}, \boldsymbol{\Omega}'^{(\ell-1)})\, \mathrm{d}\mathbf{f}^{(\ell-1)}\, \mathrm{d}\mathbf{f}'^{(\ell-1)},$$

where we use the superscript $\phi$ to indicate inter-domain instances of the mean and covariance functions (Lázaro-Gredilla & Figueiras-Vidal, 2009). Mean and covariance functions at each layer are therefore defined both by the values and by the domains of their arguments. We can now express the joint distribution in Equation (3) in terms of the layer-wise covariance functions given above and perform inference across domains.
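As a sanity check (not stated explicitly here, but standard in the inter-domain GP literature), choosing the feature-extraction function to be a Dirac delta collapses these integrals and recovers conventional local inducing points as a special case:

```latex
% Special case: g(\mathbf{x}, \mathbf{z}) = \delta(\mathbf{x} - \mathbf{z}),
% i.e., a point mass at the inducing location, gives
\begin{aligned}
  m^{\phi}(\mathbf{z})
    &= \int m(\mathbf{x})\, \delta(\mathbf{x} - \mathbf{z})\, \mathrm{d}\mathbf{x}
     = m(\mathbf{z}),\\
  k^{\phi}(\mathbf{x}, \mathbf{z})
    &= \int k(\mathbf{x}, \mathbf{x}')\, \delta(\mathbf{x}' - \mathbf{z})\, \mathrm{d}\mathbf{x}'
     = k(\mathbf{x}, \mathbf{z}),\\
  k^{\phi}(\mathbf{z}, \mathbf{z}')
    &= \iint k(\mathbf{x}, \mathbf{x}')\, \delta(\mathbf{x} - \mathbf{z})\,
       \delta(\mathbf{x}' - \mathbf{z}')\, \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{x}'
     = k(\mathbf{z}, \mathbf{z}'),
\end{aligned}
% which are the usual k_u(x) and K_{uu} of local inducing points-based inference.
```

Any other choice of $g$ yields inducing variables that summarize each layer non-locally, which is what the RKHS Fourier features below exploit.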
3.2. Simple and Scalable Approximate Inference in Inter-domain Deep Gaussian Processes

Since exact inference in DGPs is intractable, we need approximate inference methods. Unfortunately, most approximate inference methods for DGPs require computing convolutions between $\mathbf{K}_{\mathbf{u}^{(\ell)}\mathbf{f}^{(\ell)}}$ and the distributions of the latent functions, that is,

$$\int \mathbf{K}_{\mathbf{u}^{(\ell)}\mathbf{f}^{(\ell)}}\, \mathcal{N}(\mathbf{f}^{(\ell)} \,|\, \mathbf{m}_f, \mathbf{S}_f)\, \mathrm{d}\mathbf{f}^{(\ell)}, \qquad (7)$$

where $\mathcal{N}(\mathbf{f}^{(\ell)} \,|\, \mathbf{m}_f, \mathbf{S}_f)$ represents the variational distribution of layer $\ell$ with mean $\mathbf{m}_f$ and variance $\mathbf{S}_f$ (see, for example, pages 50-51 in Damianou (2015), or Damianou & Lawrence (2013), Dai et al. (2014), and Bui et al. (2016)). While these convolutions are easy to compute in closed form for conventional inducing points-based approximations, where the covariance matrix is computed from the DGP's input-domain covariance function, they are non-trivial to compute analytically for inter-domain covariance functions (Hensman et al., 2018).

To perform approximate inference in inter-domain DGPs, we exploit the fact that, in contrast to previous inducing points-based variational inference methods for DGPs, the layer-wise marginalization over each $\mathbf{f}^{(\ell)}$ in doubly stochastic variational inference (DSVI) (Salimbeni & Deisenroth, 2017) does not require computing convolutions that explicitly depend on the specific type of cross-covariance function $\mathbf{k}^{\phi}_{\mathbf{u}^{(\ell)}}(\mathbf{f})$. Instead, the functional form of the posterior predictive distribution $q(\mathbf{f}^{(L)})$ and the use of the reparameterization trick make marginalizing out the latent GP functions across layers straightforward and result in simple, compositional posterior predictive mean and covariance functions at each DGP layer. For further details on DSVI, see Appendix D.

This property allows us to simply use the inter-domain operators $\mathbf{K}^{\phi}_{\mathbf{u}^{(\ell)}\mathbf{f}^{(\ell)}}$ as off-the-shelf replacements for the conventional inducing-point operators $\mathbf{K}_{\mathbf{u}^{(\ell)}\mathbf{f}^{(\ell)}}$ without having to analytically convolve $\mathbf{K}^{\phi}_{\mathbf{u}^{(\ell)}\mathbf{f}^{(\ell)}}$ with the distribution over functions at the $\ell$th layer, yielding the variational distribution

$$q(\mathbf{f}^{(\ell)} \,|\, \boldsymbol{\mu}^{(\ell)}, \boldsymbol{\Sigma}^{(\ell)}; \mathbf{f}^{(\ell-1)}, \boldsymbol{\Omega}^{(\ell-1)}) = \mathcal{N}(\mathbf{f}^{(\ell)} \,|\, \tilde{\mathbf{m}}_f, \tilde{\mathbf{S}}_f), \qquad (8)$$

with

$$\tilde{\mathbf{m}}_f \stackrel{\text{def}}{=} \mathbf{m}_f + \mathbf{K}^{\phi\,\top}_{\mathbf{u}\mathbf{f}}\, \mathbf{K}^{\phi\,-1}_{\mathbf{u}\mathbf{u}}\, (\boldsymbol{\mu}^{(\ell)} - \mathbf{m}_u), \qquad \tilde{\mathbf{S}}_f \stackrel{\text{def}}{=} \mathbf{K}_{\mathbf{f}\mathbf{f}} - \mathbf{K}^{\phi\,\top}_{\mathbf{u}\mathbf{f}}\, \mathbf{K}^{\phi\,-1}_{\mathbf{u}\mathbf{u}}\, (\mathbf{K}^{\phi}_{\mathbf{u}\mathbf{u}} - \boldsymbol{\Sigma}^{(\ell)})\, \mathbf{K}^{\phi\,-1}_{\mathbf{u}\mathbf{u}}\, \mathbf{K}^{\phi}_{\mathbf{u}\mathbf{f}},$$

where $\mathbf{m}_f \stackrel{\text{def}}{=} m(\mathbf{f}^{(\ell-1)})$, $\mathbf{m}_u \stackrel{\text{def}}{=} m(\boldsymbol{\Omega}^{(\ell-1)})$, and $\boldsymbol{\mu}^{(\ell)}$ and $\boldsymbol{\Sigma}^{(\ell)}$ are variational parameters. Since DSVI uses the reparameterization trick to sample functions at each layer, the inter-domain operators can be used directly to compute the posterior mean and variance for each layer, which allows for simple and scalable approximate inference in inter-domain DGPs.

3.3. RKHS Fourier Features for Approximate Inference in Gaussian Processes

In the previous section, we showed how to perform approximate inference in inter-domain DGPs with any inter-domain operators $\mathbf{k}^{\phi}_u(\mathbf{x})$ and $\mathbf{K}^{\phi}_{uu}$. Next, we will introduce RKHS Fourier features (Hensman et al., 2018), an inter-domain approach able to capture global structure in data, and show how to incorporate them into inter-domain DGPs.

RKHS Fourier features use RKHS theory to construct inter-domain alternatives to the covariance matrices $\mathbf{K}_{uu}$ and $\mathbf{k}_u(\mathbf{x})$ used in conventional inducing points-based approximate inference methods. They are constructed by projecting the target function $f$ onto the truncated Fourier basis

$$\boldsymbol{\phi}(x) = [1, \cos(\omega_1(x - a)), ..., \cos(\omega_M(x - a)), \sin(\omega_1(x - a)), ..., \sin(\omega_M(x - a))]^\top, \qquad (9)$$

where $x$ is a single, one-dimensional input and $[\omega_1, ..., \omega_M]$ denote inducing frequencies defined by $\omega_m = \frac{2\pi m}{b - a}$ for some interval $[a, b]$.
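As a minimal illustration (ours, not from the paper), the feature vector in Equation (9) can be evaluated as follows, with hypothetical values for $M$ and $[a, b]$:

```python
import numpy as np

def fourier_basis(x, M=8, a=0.0, b=1.0):
    # Truncated Fourier basis of Equation (9):
    # [1, cos(w_1 (x - a)), ..., cos(w_M (x - a)),
    #     sin(w_1 (x - a)), ..., sin(w_M (x - a))].
    omegas = 2.0 * np.pi * np.arange(1, M + 1) / (b - a)   # inducing frequencies
    arg = omegas[None, :] * (x[:, None] - a)                # shape (N, M)
    ones = np.ones((x.shape[0], 1))
    return np.concatenate([ones, np.cos(arg), np.sin(arg)], axis=1)  # (N, 2M + 1)

x = np.linspace(0.0, 1.0, 5)
print(fourier_basis(x).shape)   # (5, 17): 2 * 8 + 1 features per input
```

Each input is thus represented by $2M + 1$ sinusoidal features on $[a, b]$, which is where the global structure of the resulting approximation comes from.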
From this truncated Fourier basis, we can construct inducing variables as inter-domain projections by defining $u_m \stackrel{\text{def}}{=} P_{\phi_m}(f)$, which can be shown to yield transformed-domain instances of the covariance function given by

$$\mathrm{cov}(u_m, f(x)) = \phi_m(x), \qquad \mathrm{cov}(u_m, u_{m'}) = \langle \phi_m, \phi_{m'} \rangle_{\mathcal{H}},$$

for both of which there are closed-form expressions if the GP prior covariance function is a half-integer member of the Matérn family of kernels (Durrande et al., 2016). For further details, see Hensman et al. (2018). The resulting inter-domain operators,

$$\mathbf{k}^{\phi}_u(x) = \boldsymbol{\phi}(x), \qquad [\mathbf{K}^{\phi}_{uu}]_{mm'} = \langle \phi_m, \phi_{m'} \rangle_{\mathcal{H}},$$

represent inter-domain alternatives to the $\mathbf{k}_u(x)$ and $\mathbf{K}_{uu}$ operators used in local inducing points-based approximations. By constructing linear combinations of the values of the data-generating process as projections instead of simple function evaluations, the resulting inducing variables become more informative of the underlying process and have more capacity to represent complex functions (Hensman et al., 2018; Lázaro-Gredilla & Figueiras-Vidal, 2009). Analogous to the way in which local inducing points-based approaches approximate the DGP posterior distribution through kernel functions, RKHS Fourier features approximate the posterior through sinusoids (Hensman et al., 2018). The structure imposed by the frequency domain makes RKHS Fourier features particularly well-suited to capturing global structure in data. For further details on RKHS Fourier features, see Appendix C.

3.4. Inter-domain Deep Gaussian Processes with RKHS Fourier Features

To construct inter-domain DGPs that leverage global structure in data, we use approximate posterior predictive distributions based on RKHS Fourier features at every layer. For layers $\ell = 1, ..., L$ with input dimensions $D^{(\ell-1)}$, let $\omega_m = \frac{2\pi m}{b - a}$ for $m = 1, ..., M$, and let $\boldsymbol{\Omega}^{(\ell-1)} \stackrel{\text{def}}{=} [\boldsymbol{\omega}^{(\ell-1)}_1, ..., \boldsymbol{\omega}^{(\ell-1)}_M]$ be the matrix of $M \times D^{(\ell-1)}$ inducing frequencies, producing a set of $D^{(\ell-1)}$ truncated Fourier bases $\boldsymbol{\phi}^{(\ell)}(\mathbf{f}^{(\ell-1)})$ as defined in Equation (9). Each $\boldsymbol{\phi}^{(\ell)}(\mathbf{f}^{(\ell-1)})$ then maps $f^{(\ell)}$ into Fourier space by applying the RKHS inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$, given by $u^{(\ell)}_m(f^{(\ell)}) = \langle \phi^{(\ell)}_m, f^{(\ell)} \rangle_{\mathcal{H}}$ for Fourier basis entries $\phi^{(\ell)}_m(\mathbf{f}^{(\ell-1)})$ with $m = 1, ..., M'$ and $M' = 2M + 1$ (as in the shallow GP case), thus creating the $M' \times D^{(\ell)}$-dimensional matrix $\mathbf{u}^{(\ell)} = [P_{\phi^{(\ell)}_1}(f^{(\ell)}), ..., P_{\phi^{(\ell)}_{M'}}(f^{(\ell)})]$.

We thus obtain inter-domain operators $\mathbf{k}^{\phi}_{\mathbf{u}^{(\ell)}}(\mathbf{f}^{(\ell-1)})$ and $\mathbf{K}^{\phi}_{\mathbf{u}^{(\ell)}\mathbf{u}^{(\ell)}}$ for DGP layers $\ell = 1, ..., L$. Using the variational distribution in Equation (8), we then obtain a final-layer posterior predictive distribution

$$q(\mathbf{f}^{(L)}_n) = \int \prod_{\ell=1}^{L} q(\mathbf{f}^{(\ell)}_n \,|\, \boldsymbol{\mu}^{(\ell)}, \boldsymbol{\Sigma}^{(\ell)}; \mathbf{f}^{(\ell-1)}_n, \boldsymbol{\Omega}^{(\ell-1)}) \prod_{\ell=1}^{L-1} \mathrm{d}\mathbf{f}^{(\ell)}_n, \qquad (10)$$

where $\mathbf{f}^{(\ell)}_n$ is the $n$th row of $\mathbf{f}^{(\ell)}$. This quantity is easy to compute using the reparameterization trick, which allows for sampling from the $n$th instance of the variational posteriors across layers by defining and sampling from $\boldsymbol{\epsilon}^{(\ell)}_n \sim \mathcal{N}(\mathbf{0}, \mathbf{I}_{D^{(\ell)}})$ (Kingma & Welling, 2014; Salimbeni & Deisenroth, 2017).

Prediction. To make predictions, we sample from the approximate posterior predictive distribution of the final layer in the same way as in DSVI. For a test input $\mathbf{x}_*$, we draw $S$ samples from the posterior predictive distribution

$$q(\mathbf{f}^{(L)}_*) \approx \frac{1}{S} \sum_{s=1}^{S} q(\mathbf{f}^{(L)}_* \,|\, \boldsymbol{\mu}^{(L)}, \boldsymbol{\Sigma}^{(L)}; \mathbf{f}^{(s)(L-1)}_*, \boldsymbol{\Omega}^{(L-1)}), \qquad (11)$$

where $q(\mathbf{f}^{(L)}_*)$ is the DGP's marginal distribution at $\mathbf{x}_*$ and the $\mathbf{f}^{(s)(L-1)}_*$ are draws from the penultimate layer (and thus indirectly from all previous layers) obtained via reparameterization of each layer as shown in Equation (10).
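The following sketch (ours, not from the paper) illustrates this sampling-based prediction scheme. The per-layer predictive is replaced by an arbitrary stand-in purely to keep the example runnable; in the actual model it would be given by Equation (8) computed from the inter-domain operators:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_predictive(f_prev, mu=1.0, sigma=0.05):
    # Stand-in for Equation (8): returns the layer-wise posterior predictive
    # mean and variance at the previous layer's outputs. In the real model
    # these come from K_uu^phi and k_u^phi; here we use arbitrary smooth
    # functions so the sketch runs on its own.
    return mu * np.sin(f_prev), sigma * np.ones_like(f_prev)

def propagate_samples(x_star, num_layers=2, num_samples=10):
    # DSVI-style prediction: push each input through the layers by sampling
    # f^(l) = mean + sqrt(var) * eps at every hidden layer (reparameterization
    # trick), then collect the final-layer means and variances of all S samples.
    means, variances = [], []
    for _ in range(num_samples):
        f = x_star
        for _ in range(num_layers - 1):
            m, v = layer_predictive(f)
            f = m + np.sqrt(v) * rng.standard_normal(f.shape)   # hidden-layer sample
        m, v = layer_predictive(f)                               # final layer
        means.append(m)
        variances.append(v)
    return np.stack(means), np.stack(variances)                 # shape (S, N*)

x_star = np.linspace(0.0, 1.0, 4)
m, v = propagate_samples(x_star)
# The predictive density at x_star is then the average of the S Gaussian
# densities with these means and variances, as in Equation (11).
print(m.mean(axis=0), v.mean(axis=0))
```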
Evidence Lower Bound. The evidence lower bound (ELBO) is the same as in DSVI, apart from the fact that it is computed from the inter-domain posterior predictive distributions at each DGP layer. It is given by

$$\mathcal{L} = \sum_{n=1}^{N} \mathbb{E}_{q(\mathbf{f}^{(L)}_n)}\big[\log p(y_n \,|\, \mathbf{f}^{(L)}_n)\big] - \sum_{\ell=1}^{L} \mathrm{KL}\big(q(\mathbf{u}^{(\ell)})\,\|\, p(\mathbf{u}^{(\ell)})\big),$$

which can be maximized using gradient-based stochastic optimization. We include a derivation of this bound in Appendix E. To estimate the expected log-likelihood, we generate predictions at the input locations by drawing Monte Carlo samples from $q(\mathbf{f}^{(L)}_n)$ as shown in Equation (11).

3.5. Further Model Details

In our implementation, we let $\omega^{(\ell)}_{m,i} = \omega^{(\ell')}_{m,j}$ for all $i, j \leq D^{(\ell-1)}$, all $\ell, \ell' \in \{1, ..., L\}$, and all $m \in \{1, ..., M'\}$, which means that we use the same inducing frequencies for every input dimension and at every DGP layer, but this assumption can be relaxed easily. Moreover, we use additive kernels to apply RKHS Fourier features to multi-dimensional inputs. For each layer, we define

$$f^{(\ell)}(\mathbf{f}^{(\ell-1)}_n) = \sum_{d=1}^{D^{(\ell-1)}} f^{(\ell)}_d\big(f^{(\ell-1)}_{n,d}\big),$$

where $f^{(\ell-1)}_{n,d}$ is the $d$th element of the multi-dimensional single input $\mathbf{f}^{(\ell-1)}_n$, and each $f^{(\ell)}_d$ has a kernel $k^{(\ell)}_d(\cdot, \cdot)$ defined on a scalar input space (Hensman et al., 2018). This way, we obtain a DGP layer for which we are able to construct a matrix of features with elements $u^{(\ell)}_{m,d} = P_{\phi_m}(f^{(\ell)}_d)$, resulting in a total of $2MD^{(\ell)} + 1$ inducing variables that are independent across dimensions, i.e., $\mathrm{cov}(u^{(\ell)}_{m,d}, u^{(\ell)}_{m',d'}) = 0$ for $d \neq d'$.

With the corresponding variational parameters estimated via gradient-based optimization, the cost per iteration when computing the posterior mean for an additive kernel is $\mathcal{O}(NM^2D)$. Using an additive kernel at each DGP layer then results in a time complexity of $\mathcal{O}(NM^2(D^{(1)} + D^{(2)} + ... + D^{(L)}))$ per iteration, which is identical to that of DSVI. In practice, however, we find that inter-domain DGPs require fewer inducing points and fewer gradient steps to achieve a given level of predictive accuracy compared to DGPs with DSVI, making them more computationally efficient.

Unlike DGPs that use conventional inducing points-based approximate inference, inter-domain DGPs have an additional hyperparameter: the frequency interval $[a, b]$. To avoid undesirable edge effects in the DGP posterior predictive distributions, we normalize all input data dimensions to lie in the interval $[0, 1]$ and define the RKHS over the interval $[a, b] = [-2, 3]$. We repeat this normalization at each DGP layer before feeding the samples into the next GP. To avoid pathologies in DGP models investigated in prior work (Duvenaud et al., 2014), we follow Salimbeni & Deisenroth (2017) and use a linear mean function $m^{(\ell)}(\mathbf{f}^{(\ell-1)}) = \mathbf{f}^{(\ell-1)} \mathbf{w}^{(\ell)}$, where $\mathbf{w}^{(\ell)}$ is a vector of weights, for all but the final-layer GP, for which we use a zero mean function. We used a Matérn-3/2 kernel for all experiments.
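As a small illustration of the additive construction (ours, not from the paper; assuming a shared lengthscale and unit variance), the kernel of a sum of independent one-dimensional GPs is the sum of one-dimensional Matérn-3/2 kernels over input dimensions:

```python
import numpy as np

def matern32_1d(a, b, lengthscale=0.2, variance=1.0):
    # Matern-3/2 kernel between two sets of scalar inputs.
    r = np.abs(a[:, None] - b[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def additive_matern32(X, Y, lengthscale=0.2, variance=1.0):
    # Additive kernel k(x, y) = sum_d k_d(x_d, y_d): one scalar kernel per
    # input dimension, matching the per-dimension RKHS Fourier features.
    K = np.zeros((X.shape[0], Y.shape[0]))
    for d in range(X.shape[1]):
        K += matern32_1d(X[:, d], Y[:, d], lengthscale, variance)
    return K

X = np.random.default_rng(0).uniform(size=(5, 3))   # 5 points, 3 dimensions
print(additive_matern32(X, X).shape)                  # (5, 5)
```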
4. Related Work

Inducing points-based approximate inference has allowed GP models to scale to large numbers of input points (Snelson & Ghahramani, 2006; Titsias, 2009; Hensman et al., 2013; Bui & Turner, 2014; Hensman et al., 2015). Our work directly builds on Hensman et al. (2018) and Salimbeni & Deisenroth (2017) and adds to the literature on sparse spectrum approximations (Lázaro-Gredilla & Figueiras-Vidal, 2009; Lázaro-Gredilla et al., 2010; Gal & Turner, 2015; Wilson & Nickisch, 2015). Specifically, we extend Hensman et al. (2018) to compositions of GP models by leveraging the compositional structure of the approximate posterior of Salimbeni & Deisenroth (2017). In contrast to Wilson & Nickisch (2015), Lázaro-Gredilla & Figueiras-Vidal (2009), and Gal & Turner (2015), Hensman et al. (2018) (and, by extension, our approach) combines inter-domain operators with SVI (Hensman et al., 2013) and is amenable to stochastic optimization on minibatches, which makes it possible to apply it to large datasets without facing memory constraints. Similar to our approach, random feature expansions for DGPs (Cutajar et al., 2017) use projections of each DGP layer's predictive distribution onto the spectral domain to perform approximate inference, but unlike our approach, they are not based on inducing points.

Figure 3: Comparison of posterior predictive distributions of different GP models on synthetic non-stationary data. (a) Inter-domain DGP with DSVI (two layers); top: DGP posterior predictive distribution, bottom: predictive distribution at the intermediate layer. (b) Conventional DGP with DSVI (two layers); top: DGP posterior predictive distribution, bottom: predictive distribution at the intermediate layer. (c) GP with RKHS Fourier features (single layer); posterior predictive distribution. The models are trained using 20 inducing frequencies and 20 inducing points, respectively. In each plot, training points are shown in red. Each shade of blue represents one standard deviation in the posterior predictive distribution. For enlarged plots, see Appendix B.

5. Empirical Evaluation

To demonstrate that inter-domain DGPs improve upon inter-domain shallow GPs in their ability to model complex, non-stationary data, and to show that inter-domain DGPs improve upon local inducing points-based approximate inference methods for DGPs, we present results from several experiments that showcase the types of prediction problems for which inter-domain DGPs are particularly well-suited. We are particularly interested in modeling complex data-generating processes which exhibit global structure as well as non-stationarity, since the former is challenging for DGPs that use local approximations, such as DSVI for DGPs, and the latter is challenging for shallow GPs with stationary covariance functions. To illustrate the advantage of inter-domain deep GPs over inter-domain shallow GPs in modeling non-stationary data, we present a suite of qualitative and quantitative empirical evaluations on datasets that exhibit global structure and non-stationarity.

First, we present a simple, synthetic data experiment designed to demonstrate that our method is well-suited for modeling data from generating processes that exhibit both non-stationarity and global structure. Next, we illustrate that inter-domain deep GPs provide a significant gain in computational efficiency when modeling data that exhibits global structure. In particular, we compare the number of inducing frequencies and inducing points needed to attain a certain predictive accuracy when using inter-domain DGPs and local inducing points-based DGPs on a challenging real-world audio sub-band reconstruction task. Lastly, we demonstrate that our method outperforms existing state-of-the-art shallow GPs with local approximate inference, shallow GPs with global approximate inference, and deep GPs with local approximate inference on a series of challenging real-world benchmark prediction tasks. For additional experiments and more experimental details, see Appendix B.
5.1. Highly Non-Stationary Data with Global Structure

The multi-step function in Figure 3 is designed to exhibit both global structure and non-stationarity, providing an ideal test case to assess the performance of inter-domain DGPs vis-à-vis related methods on a simple and easily interpretable prediction task. The plots show the posterior predictive distributions of inter-domain DGPs, DGPs with DSVI, and inter-domain shallow GPs with RKHS Fourier features. As can be seen in the plots, inter-domain DGPs are the only method able to model the step locations well and to infer the global structure, that is, that the function is constant within certain intervals, with high accuracy and good predictive uncertainty, despite having a stationary covariance function (see Figure 3a). Inter-domain shallow GPs, in contrast, are unable to capture either the step transitions or the global structure, reflecting their limited expressiveness (see Figure 3c). While DGPs benefit from increased expressivity, they, too, fail to fully capture the global structure and the non-stationarity (see Figure 3b). This is due to the inherently local nature of inducing points-based inference, which requires large numbers of inducing inputs to accurately approximate complex posterior distributions. The experiment illustrates that inter-domain DGPs are in fact able to overcome key limitations of shallow inter-domain GP approaches and to outperform state-of-the-art local DGP inference methods.

Figure 3 presents another interesting insight into the differences between conventional and inter-domain DGPs. In particular, the bottom plots in Figures 3a and 3b show the output of the DGP intermediate layers and are markedly different from one another. While it appears that the conventional DGP seeks to model the target function directly in intermediate-layer space, the inter-domain DGP appears to cluster datapoints from the original input space such that the changes of the step function in output space become associated with smooth transitions in intermediate-layer space.

5.2. Modeling Complex Data Efficiently via Global Approximations

Next, we quantitatively assess the predictive accuracy and computational efficiency of inter-domain DGPs. To do so, we use a smoothed sub-band of a speech signal taken from the TIMIT database and previously used in Bui & Turner (2014). The dataset exhibits complex global structure which is difficult to model using local approximation methods.

Figure 4: Comparison of average standardized root mean squared errors for varying numbers of inducing inputs on three datasets of increasing global structure and complexity (panel dataset sizes: 352, 3,526, and 35,267 datapoints; x-axis: number of inducing inputs/frequencies; y-axis: mean standardized RMSE; methods: inter-domain DGP with DSVI and DGP with DSVI). On complex datasets (center and right panels), inter-domain deep GPs with DSVI require fewer inducing inputs than conventional DGPs with DSVI. Standardized root mean squared errors were evaluated on a test set of 40% of datapoints in each subset over 10 random seeds each.
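Figure 4 and Table 1 report standardized root mean squared errors. The exact standardization is not spelled out here; a common convention, which we assume in this small sketch, is to divide the RMSE by the standard deviation of the targets:

```python
import numpy as np

def standardized_rmse(y_true, y_pred):
    # RMSE divided by the standard deviation of the true targets, so that a
    # trivial predictor (the mean of y_true) scores roughly 1.0.
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / np.std(y_true)

y_true = np.array([0.1, 0.4, 0.35, 0.8])
y_pred = np.array([0.12, 0.37, 0.40, 0.75])
print(standardized_rmse(y_true, y_pred))
```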
To assess how well inter-domain DGPs are able to capture global structure in the data, we compare them to doubly stochastic variational inference for DGPs, a state-of-the-art approximate inference method for DGPs based on local inducing points. To assess how well different approximate inference methods are able to capture the complex global structure, we look at three subsets of the data: the first 352 datapoints, the first 3,526 datapoints, and the first 35,267 datapoints.

The smallest subset of only 352 datapoints does not exhibit much global structure and is small enough to be modeled with few (local) inducing points, which is reflected in the left panel of Figure 4, where inter-domain DGPs and conventional DGPs perform equally well and increasing the number of inducing frequencies/points does not lead to an improvement in performance. As we increase the size of the dataset to 3,526 datapoints, however, the global structure, measurable by a high degree of autocorrelation in the data, becomes readily apparent. As can be seen in the top row, inter-domain DGPs require relatively fewer inducing frequencies compared to conventional DGPs to achieve a test error close to zero.

The difference in the number of inducing points required to model the data is most significant for the largest subset, shown in the right panel of Figure 4. As can be seen in the plot, the covariance of the process varies significantly. While this subset of the audio sub-band dataset is highly non-stationary, it does exhibit global structure in the shape of repeating patterns in output space. As a result, inter-domain deep GPs are able to attain a test error close to zero with fewer than half the number of inducing points needed for conventional DGPs to achieve the same level of accuracy. Since the time complexity of DSVI scales quadratically in the number of inducing points, and inter-domain and conventional DGPs have the same time complexity (that is, a single gradient step takes approximately equally long for the same number of inducing frequencies/points), inter-domain DGPs are more computationally efficient in practice when modeling data exhibiting global structure. Additionally, Figure 1 shows that for 20 inducing frequencies/points, inter-domain DGPs have better-calibrated posterior predictive uncertainty estimates than conventional DGPs.

5.3. Global Structure in Real-World Data

To quantitatively assess the predictive performance of inter-domain DGPs, we evaluate them on a range of real-world datasets which exhibit global structure, usually in the form of a temporal component that induces a high autocorrelation.

Figure 5: Average test log-likelihood (higher is better) and standard errors (over 10 random seeds) on a set of real-world datasets with global structure: parking (N = 35,717, D = 4), air (N = 41,757, D = 8), traffic (N = 48,204, D = 7), power (N = 2,049,280, D = 5), appliances (N = 19,735, D = 27), and airline (N = 5,929,413, D = 8). Methods compared: GP with SVI, GP with VFF, DGP with DSVI, and ID-DGP with DSVI. All models were trained with 50 inducing points. The inter-domain DGP with DSVI has two layers and the conventional DGP with DSVI has four layers. The performance of the inter-domain DGP did not increase as additional layers were added.
The experiments include medium-sized datasets (parking, air, traffic), two very large datasets with over two and five million datapoints, respectively (power and airline), and a high-dimensional dataset with 27 input dimensions (appliances). As can be seen in Figure 5, inter-domain DGPs consistently outperform conventional DGPs (DGPs with DSVI) as well as inter-domain shallow GPs (GPs with VFF) and significantly outperform conventional shallow GPs (GPs with SVI), suggesting that combining the increased expressivity of DGP models with the ability of inter-domain approaches to capture global structure leads to the best predictive performance. See Appendix B for a plot of the test standardized RMSEs for the experiments in Figure 5 and for additional results on datasets that do not exhibit global structure (and on which our method performs on par with existing methods).

To assess the predictive performance of inter-domain DGPs on extremely complex, non-stationary data, we test our method on the U.S. flight delay prediction problem, a large-scale regression problem that has reached the status of a standard test in GP regression due to its massive size of 5,929,413 observations and its non-stationary nature, which makes it challenging for GPs with stationary covariance functions (Hensman et al., 2018). The dataset consists of flight arrival and departure times for every commercial flight in the United States for the year 2008. We predict the delay of the aircraft at landing (in minutes) from eight covariates: the age of the aircraft (number of years since deployment), route distance, airtime, departure time, arrival time, day of the week, day of the month, and month. The non-stationarity in the data is likely due to the recurring daily, weekly, and monthly fluctuations in occupancy. In our evaluation, we find that the predictive performance of inter-domain DGPs is superior to that of closely related state-of-the-art shallow and deep GPs, as shown in Table 1 and Figure 5.

Table 1: Average standardized root mean squared errors and standard errors (over 10 random seeds) on the U.S. flight delay prediction task.

| Method | N = 1,000,000: RMSE | SE | N = 5,929,413: RMSE | SE |
|---|---|---|---|---|
| GP with SVI (local) | 0.946 | 0.008 | 0.941 | 0.005 |
| GP with VFF (global) | 0.925 | 0.007 | 0.923 | 0.006 |
| DGP with DSVI (local) | 0.932 | 0.004 | 0.930 | 0.003 |
| DGP with DSVI (global) | 0.906 | 0.006 | 0.903 | 0.002 |

6. Conclusion

We proposed Inter-domain Deep Gaussian Processes as a deep extension of inter-domain GPs that combines the advantages of inter-domain and deep GPs and allows us to model data exhibiting non-stationarity and global structure with high predictive accuracy and low computational overhead. We showed how to leverage the compositional nature of the approximate posterior in DSVI to perform simple and scalable approximate inference and established that inter-domain DGPs can be more computationally efficient than conventional DGPs. Finally, we demonstrated that our method significantly and consistently outperforms inter-domain shallow GPs and conventional DGPs on data exhibiting non-stationarity and global structure.