# Stochastic Deep Gaussian Processes over Graphs

Naiqi Li¹, Wenjie Li², Jifeng Sun¹, Yinghua Gao², Yong Jiang¹,³, Shu-Tao Xia²,³

¹Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
²Shenzhen International Graduate School, Tsinghua University
³PCL Research Center of Networks and Communications, Peng Cheng Laboratory

Equal contributions. Corresponding author: Shu-Tao Xia (xiast@sz.tsinghua.edu.cn).

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

## Abstract

In this paper we propose Stochastic Deep Gaussian Processes over Graphs (DGPG), which are deep Gaussian models that learn the mappings between input and output signals in graph domains. The approximate posterior distributions of the latent variables are derived with variational inference, and the evidence lower bound is evaluated and optimized by the proposed recursive sampling scheme. The Bayesian non-parametric nature of our model allows it to resist overfitting, while the expressive deep structure grants it the potential to learn complex relations. Extensive experiments demonstrate that our method achieves superior performance on both small (< 50) and large (> 35,000) datasets. We show that DGPG outperforms another Gaussian-based approach, and is competitive with a state-of-the-art method on the challenging task of traffic flow prediction. Our model is also capable of capturing uncertainties in a mathematically principled way and of automatically discovering which vertices and features are relevant to the prediction.

## 1 Introduction

Gaussian processes (GPs) [1] are a favourable choice in the machine learning arsenal, due to their distinguishing advantages in modeling uncertainties, the ability to introduce expert knowledge through flexible kernel design, and their data efficiency, which accounts for their success on small and medium datasets. GPs have been successfully applied in a variety of tasks, including computer vision [2], Bayesian optimization [3], active learning [4], multi-task learning [5] and reinforcement learning [6]. However, standard GPs scale poorly as $O(N^3)$, making it a challenge to apply them to large-scale datasets. Their expressiveness is also limited by the specific choice of kernel functions, which are difficult to choose manually for complex problems.

Recently a series of works on deep GPs were presented [7-9], which overcome the aforementioned disadvantages. All these methods can find their roots in the seminal paper [10], which proposes to use a set of inducing points, whose size is manageable, to summarize all the information in the original large dataset; the inducing points and their latent function values can be inferred by variational inference. This sparse approximation technique enables deep GPs to handle extremely large datasets. For instance, [11] reports that their method achieves superior performance on a dataset with over one billion data points. The deep hierarchical structure has greatly improved the expressiveness of these models, resulting in state-of-the-art performance over various challenging tasks.

The last decade has also witnessed the rapid development of machine learning algorithms over graph-structured datasets. A large amount of influential research on graph neural networks (GNNs) has been presented, including [12-17], to name only a few. Hitherto most of the emerging literature on graph data analysis is based on neural networks.
Though demonstrating satisfactory performance over various graph learning tasks, their nature as parametric methods is associated with several inevitable drawbacks: they are insufficient in modeling uncertainties; they are vulnerable to overfitting, especially on small datasets; and their training is difficult, with success relying heavily on the choices of network structures and hyperparameters. A natural question to ask is: can we analyze graph-structured data with non-parametric models such as Gaussian processes? A few recent publications show that the answer is positive [18-20]. However, the investigations so far are far from thorough and complete, and the development of deep Gaussian processes has revealed an alternative approach that has scarcely been explored.

In this paper we propose Stochastic Deep Gaussian Processes over Graphs (DGPG), a method for modeling the relations between input and output signals over graph-structured domains. Our work is closely related to [11]: both methods are based on sparse approximation [10] and the variational inference framework [21]. Though the evidence lower bound is analytically intractable, it can be evaluated and optimized by the stochastic minibatch sampling technique. We summarize the main contributions of this paper as follows:

1) We propose a novel Bayesian non-parametric method called Stochastic Deep Gaussian Processes over Graphs (DGPG), to model the relations of input/output signals over graphs;
2) It is rigorously proved that, under some technical assumptions, the sampling variances of DGPG are strictly less than those of [11], implying that DGPG achieves faster convergence by considering graph information;
3) We performed experiments on a synthetic dataset. Numerical results conform with our theoretical analysis and support the claim that DGPG converges faster and better by utilizing graph information;
4) Experiments on realistic datasets demonstrate that DGPG can be successfully applied to both small and large-scale datasets. We show that our method outperforms a recent GP-based graph learning algorithm, and is competitive with a state-of-the-art DNN method on the challenging task of traffic flow prediction;
5) We show that DGPG possesses several other desirable characteristics: it can model uncertainties with high accuracy, and the automatic relevance determination (ARD) kernel allows it to learn which neighbors and features are of greater importance for the prediction.

## 2 Background and Related Work

### 2.1 Sparse Gaussian Processes

Gaussian processes (GPs) [1] are among the most popular choices in the Bayesian non-parametric machine learning family. Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ with $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$, we denote $X = (x_1, \ldots, x_N)^\top$, $y = (y_1, \ldots, y_N)^\top$, and the latent function values as $f = f(X)$. Gaussian process regression (GPR) models the relation as a function $f: \mathbb{R}^D \to \mathbb{R}$, assuming $f$ has a Gaussian prior. The prediction has a closed form when the likelihood $p(y \mid f)$ is Gaussian. The prior of $f$ is characterized by a kernel function $k: \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ and a mean function $m: \mathbb{R}^D \to \mathbb{R}$. For a test point $x_*$, its prediction $y_*$ also follows a Gaussian distribution and is analytically tractable.

Standard GPs possess various advantages; however, the prohibitive $O(N^3)$ training complexity renders them inapplicable to large-scale datasets. Recently much work has emerged to assuage the complexity burden [22-26], among which sparse Gaussian processes stand out as some of the most successful and popular methods [27-32].
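To make the closed-form prediction and its cubic cost concrete, here is a minimal NumPy sketch of exact GP regression with an RBF kernel and a Gaussian likelihood; the kernel hyperparameters, noise level and toy data are illustrative assumptions rather than settings from the paper.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(a, b) = variance * exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq_dists = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_predict(X, y, X_star, noise_var=1e-2):
    """Exact GP regression posterior at the test inputs X_star.

    The Cholesky factorization of the N x N Gram matrix is the O(N^3) step
    that makes standard GPs impractical for large datasets.
    """
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)                                   # O(N^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = rbf_kernel(X, X_star)
    mean = K_s.T @ alpha                                        # posterior mean
    v = np.linalg.solve(L, K_s)
    cov = rbf_kernel(X_star, X_star) - v.T @ v                  # posterior covariance
    return mean, cov

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
mu, cov = gp_predict(X, y, rng.normal(size=(5, 3)))
print(mu.shape, cov.shape)                                      # (5,) (5, 5)
```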
Sparse Gaussian processes (SGPs) introduce $M$ inducing inputs $Z = (z_1, \ldots, z_M)^\top$, whose corresponding function values are $u = f(Z)$ with $p(u) = \mathcal{N}(u \mid m(Z), k(Z, Z))$. The joint probability density is $p(y, f, u) = p(f \mid u; X, Z)\, p(u; Z) \prod_{i=1}^N p(y_i \mid f_i)$. The posterior $p(f, u \mid y)$ is only tractable when the likelihood $p(y_i \mid f_i)$ is Gaussian, and even in this case it still requires $O(N^3)$ computation. Fortunately, variational inference presents a framework to solve these two problems simultaneously.

Variational inference introduces a distribution $q(f, u)$ to approximate the posterior $p(f, u \mid y)$. It can be shown that minimizing the Kullback-Leibler divergence $\mathrm{KL}(q(f, u) \,\|\, p(f, u \mid y))$ is equivalent to maximizing the evidence lower bound (ELBO), defined as $\mathcal{L}_{\mathrm{SGP}} = \mathbb{E}_{q(f, u)}\big[\log \tfrac{p(y, f, u)}{q(f, u)}\big]$. As in [30], the variational distribution is defined as $q(f, u) = p(f \mid u; X, Z)\, q(u)$, where $q(u) = \mathcal{N}(u \mid m, S)$ is assumed to be Gaussian. Furthermore, the conditional distribution $p(f \mid u; X, Z)$ is also Gaussian: $p(f \mid u; X, Z) = \mathcal{N}(f \mid \mu, \Sigma)$, where $[\mu]_i = m(x_i) + \alpha(x_i)^\top (u - m(Z))$, $[\Sigma]_{ij} = k(x_i, x_j) - \alpha(x_i)^\top k(Z, Z)\, \alpha(x_j)$, and $\alpha(x_i) = k(Z, Z)^{-1} k(Z, x_i)$. Finally, we can show that the marginal is also Gaussian: $q(f \mid m, S; X, Z) = \int p(f \mid u; X, Z)\, q(u)\, du = \mathcal{N}(f \mid \bar\mu, \bar\Sigma)$, where

$$[\bar\mu]_i = \mu_{m,Z}(x_i) = m(x_i) + \alpha(x_i)^\top (m - m(Z)), \qquad (1)$$

$$[\bar\Sigma]_{ij} = \Sigma_{S,Z}(x_i, x_j) = k(x_i, x_j) - \alpha(x_i)^\top \big(k(Z, Z) - S\big)\, \alpha(x_j). \qquad (2)$$

With the analysis above, the ELBO $\mathcal{L}_{\mathrm{SGP}}$ can be further simplified, so that by maximizing it both the variational parameters and the latent variables can be found.

### 2.2 Doubly Stochastic Deep Gaussian Processes

In [11] the authors show that it is possible to stack the above single-layer models to form a hierarchical deep Gaussian process (DGP). For a DGP model with $L$ layers, it introduces $L$ independent latent variables $\{U^l\}_{l=1}^L$ as the function values of the inducing points $\{Z^l\}_{l=1}^L$, so that $p(Y, \{F^l, U^l\}_{l=1}^L) = \prod_{i=1}^N p(y_i \mid f_i^L) \prod_{l=1}^L p(F^l \mid U^l; F^{l-1}, Z^{l-1})\, p(U^l; Z^{l-1})$. With similar reasoning, by assuming $q(U^l) = \mathcal{N}(U^l \mid m^l, S^l)$, we have $q(\{F^l\}_{l=1}^L) = \prod_{l=1}^L \mathcal{N}(F^l \mid \bar\mu^l, \bar\Sigma^l)$, where $[\bar\mu^l]_i = \mu_{m^l, Z^{l-1}}(f_i^{l-1})$ and $[\bar\Sigma^l]_{ij} = \Sigma_{S^l, Z^{l-1}}(f_i^{l-1}, f_j^{l-1})$ ($\mu_{m^l,Z^{l-1}}$ and $\Sigma_{S^l,Z^{l-1}}$ are defined in Eqs. (1)-(2)).

A key finding of [11] is that for a given $x_i$, the marginal of the variational distribution in the last layer, $q(f_i^L)$, depends only on the marginals $\{q(f_i^l)\}_{l=1}^{L-1}$ of the previous layers. Formally, $q(f_i^L) = \int \prod_{l=1}^{L-1} q(f_i^l \mid m^l, S^l; f_i^{l-1}, Z^{l-1})\, df_i^l$. Consequently $f^l$ depends only on $f^{l-1}$, so that the sampling of $f_i^L$ in the last layer can be conducted in a recursive layer-to-layer manner using the reparameterization trick [33, 34]. After introducing the hierarchical structure, it can be shown that the new ELBO is $\mathcal{L}_{\mathrm{DGP}} = \sum_{i=1}^N \mathbb{E}_{q(f_i^L)}\big[\log p(y_i \mid f_i^L)\big] - \sum_{l=1}^L \mathrm{KL}\big(q(U^l) \,\|\, p(U^l; Z^{l-1})\big)$. With the samples of $f_i^L$ in the last layer readily available, [11] finally proposed to evaluate and optimize the ELBO with stochastic sampling. Similar to SGPs, DGPs can be applied to large-scale datasets (e.g. a billion data points), and the deep structure allows them to model more complex relations.

Later we will see that our new method enhances this recursive sampling scheme by utilizing graph information, and that our sampling has strictly smaller variances, so that better and faster convergence can be achieved. All the aforementioned advantages of SGPs and DGPs are naturally inherited by our method.
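The marginal in Eqs. (1)-(2) and the layer-wise reparameterized sampling of [11] can be sketched as follows, assuming a zero mean function, a single output dimension per layer, and randomly initialized variational parameters (in practice these are optimized against the ELBO); the helper names are ours.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def variational_marginal(X, Z, m, S):
    """q(f) = N(mu_bar, Sigma_bar) after integrating out u ~ q(u) = N(m, S), cf. Eqs. (1)-(2),
    with a zero mean function."""
    Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
    alpha = np.linalg.solve(Kzz, rbf(Z, X))          # alpha(x_i) stacked as columns, shape (M, N)
    mu_bar = alpha.T @ m
    Sigma_bar = rbf(X, X) - alpha.T @ (Kzz - S) @ alpha
    return mu_bar, Sigma_bar

def sample_layer(F_prev, Z, m, S, rng):
    """Doubly stochastic sampling of one layer via the reparameterization trick:
    each sample depends only on the corresponding sample from the previous layer."""
    mu_bar, Sigma_bar = variational_marginal(F_prev, Z, m, S)
    eps = rng.standard_normal(len(F_prev))
    return mu_bar + eps * np.sqrt(np.clip(np.diag(Sigma_bar), 1e-12, None))

# propagate a toy minibatch through a two-layer model with a 1-d hidden layer
rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))
Z1, m1, S1 = rng.normal(size=(10, 3)), rng.normal(size=10), 0.1 * np.eye(10)
Z2, m2, S2 = rng.normal(size=(10, 1)), rng.normal(size=10), 0.1 * np.eye(10)
F1 = sample_layer(X, Z1, m1, S1, rng)[:, None]       # samples of the first-layer outputs
F2 = sample_layer(F1, Z2, m2, S2, rng)               # samples of the final-layer outputs
```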
### 2.3 Related Work

To assuage the $O(N^3)$ training complexity of standard GPs, several sparse approximation methods have been proposed [35, 36, 27-29, 31, 32], the idea of which is to construct a manageable set of inducing points that summarizes the information in the original dataset. It is rigorously established in [37] that, under some technical assumptions, sparse approximation methods can produce reliable results with $M = O(\log^D N)$ inducing points, where $D$ is the dimension of the data. Inspired by some of them, [10] introduces a method to select the inducing points by using variational inference. A few years later this work became the foundation of a series of deep Gaussian process models: inspired by [10] and the GP latent variable model (GP-LVM) [38], Damianou and Lawrence proposed deep GPs [7], which are composed of hierarchical Gaussian process mappings; several other deep GP models were later proposed [8, 9]. All these models demonstrate superior performance over a variety of challenging tasks. The works above derive analytically tractable evidence lower bounds; in contrast, in [11], which is the paper most closely related to ours, the authors present a model with milder independence assumptions and utilize a sampling technique. Applying the sampling technique not only allows the model to use both Gaussian (e.g. in regression) and non-Gaussian (e.g. in classification) likelihoods in a uniform manner, but also enables it to take advantage of GPU acceleration. These strengths are also inherited by our proposal.

Recently, applying machine learning methods to graph-structured datasets has drawn much attention from the research community. Among various approaches, graph neural networks (GNNs) [39, 40] are one of the most successful families of methods. GNNs can be divided into two categories, the spectral methods [12, 15, 16] and the non-spectral methods [13, 14, 17]. Despite their great empirical success in many challenging tasks, including network analysis, biochemistry, and traffic prediction, to name only a few, their training is generally difficult, their success relies heavily on the choices of network structures and hyperparameters, and the models are vulnerable to overfitting. It is thus desirable that non-parametric methods like GPs can overcome these drawbacks.

A few works aiming to apply GPs to graph-structured domains exist. The paper [18] proposes a GP model to solve semi-supervised learning problems over graphs. Their method is also based on sparse approximation and the variational inference framework. However, our goals are different and our approaches share little similarity besides the minor points mentioned above. In [19] the authors present a novel model called graph convolutional GPs. Their paper aims to tackle the graph classification problem by designing an additive kernel with a translation-invariant property, which borrows the idea of the convolutional operator on grid-like domains. Gaussian Markov random fields (GMRFs) [41, 42] are elegant models for interpolating signals over graphs; however, they are inadequate for extrapolation across the vertices, which is the challenge our work focuses on. Finally, a method called Gaussian Processes over Graphs (GPG) was proposed in [20]. This paper shares the same goal as ours, i.e. both models try to learn the mappings between input and output signals over graphs. However, our methodologies are totally different.
The key idea of their approach is to incorporate the graph information as prior knowledge in the covariance matrix. Their method still suffers from poor scalability, and results are only reported for datasets with up to about 100 data points. By contrast, DGPG is based on sparse approximation and deep GPs. Consequently, our method can handle datasets with tens of thousands of training instances, and has the potential to model more complex mappings.

## 3 Stochastic Deep Gaussian Processes over Graphs

**Problem Statement and Notations** The dataset is defined as $\mathcal{D} = \{\mathcal{G}, \Psi, \Phi\}$, where $\mathcal{G}$ encodes the graph structure, and $\Psi$ and $\Phi$ represent the input and output signals. To be more precise, $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is comprised of a set of vertices $\mathcal{V}$ and a set of edges $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$. For each $v_i \in \mathcal{V}$, we use $\mathrm{pa}(v_i) = \{v_k \mid (v_k, v_i) \in \mathcal{E}\}$ to denote its parents. Each input $\psi \in \Psi$ is a function $\psi: \mathcal{V} \to \mathbb{R}^{d_{\mathrm{in}}}$ over the vertices of the graph. Similarly, each output $\phi \in \Phi$ is defined as $\phi: \mathcal{V} \to \mathbb{R}^{d_{\mathrm{out}}}$. The goal is to learn a function $h: \Psi \to \Phi$ that takes a signal $\psi$ as input and predicts the output $\phi$.

We introduce several notations that make the presentation more succinct. Let $N$ be the size of the training dataset and $|\mathcal{V}| = K$ the number of vertices. The graph $\mathcal{G}$ is represented by the adjacency matrix $A_{\mathcal{G}} \in \{0, 1\}^{K \times K}$. Note that binary connections do not restrict the expressiveness, since the ARD kernel can capture the relevance of each node (see the later discussion). The input signals $\Psi$ can be represented as a matrix $X = (x_1, \ldots, x_N)^\top$, where $x_i \in \mathbb{R}^{K d_{\mathrm{in}}}$ is a vector constructed by concatenating the features of all the nodes, depicted as the $i$-th row of $X$ in Fig. 1. Similarly, $\Phi$ is represented as $Y = (y_1, \ldots, y_N)^\top$, where $y_i \in \mathbb{R}^{K d_{\mathrm{out}}}$ is depicted as the $i$-th row of $Y$ in Fig. 1.

Figure 1: An example illustrating the notations and factorization. $X$ and $Y$ are the observed input and output signals; $Q_l = K d_l$ is the dimension of the concatenated vector in layer $l$. In this example $v_2$ is connected to $v_1$ and $v_3$; the two yellow boxes denote $F^{1,1}_1$ and $F^{1,3}_1$ (i.e. $F^{1,\mathrm{pa}(2)}_1$); the red box denotes $F^{2,2}_1$; the green box denotes $U^{2,2}$.

**Model Structure** Similar to [11], we define a deep structure in our DGPG. For an $L$-layer model, we denote $F^1, \ldots, F^L$ ($F^l \in \mathbb{R}^{N \times K d_l}$, where $d_l$ is the dimension of each node in layer $l$) as the latent function values. Inspired by the sparse approximation framework, for each layer we introduce $M$ inducing points $Z^1, \ldots, Z^L$ ($Z^l \in \mathbb{R}^{M \times K d_l}$), and the associated function values are $U^1, \ldots, U^L$ ($U^l \in \mathbb{R}^{M \times K d_l}$). For a matrix $M \in \{X, Y, F, Z, U\}$, we use $M^{l,k}$ to denote the submatrix associated with the $l$-th layer and the $k$-th vertex; $M^{l,\mathrm{pa}(k)}$ is the submatrix in the $l$-th layer associated with all the parents of vertex $v_k$ in the next layer; $M^{l,k}_i$ is a vector denoting the $i$-th row of $M^{l,k}$.

The factorization of the joint distribution is depicted in Fig. 1 and given by

$$p\big(Y, \{F^{l,k}, U^{l,k}\}_{l,k}\big) = \underbrace{\prod_{n=1}^{N}\prod_{k=1}^{K} p\big(y^k_n \mid F^{L,k}_n\big)}_{\text{likelihood}} \; \underbrace{\prod_{l=1}^{L}\prod_{k=1}^{K} p\big(F^{l,k} \mid U^{l,k}; F^{l-1,\mathrm{pa}(k)}, Z^{l-1,\mathrm{pa}(k)}\big)\, p\big(U^{l,k}; Z^{l-1,\mathrm{pa}(k)}\big)}_{\text{GP prior}}, \qquad (3)$$

where $F^0 = X$. It is assumed that the prior is Gaussian, and that the $U^{l,k}$'s are factorized across layers and vertices. The likelihood can be Gaussian (e.g. regression) or non-Gaussian (e.g. classification). By applying variational inference, we define the approximate posterior distribution as

$$q\big(\{F^{l,k}, U^{l,k}\}_{l,k}\big) = \prod_{l=1}^{L}\prod_{k=1}^{K} p\big(F^{l,k} \mid U^{l,k}; F^{l-1,\mathrm{pa}(k)}, Z^{l-1,\mathrm{pa}(k)}\big)\, q\big(U^{l,k}\big), \qquad (4)$$

where $q(U^{l,k}) = \mathcal{N}(U^{l,k} \mid m^{l,k}, S^{l,k})$ is Gaussian.
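To make the indexing convention above concrete, the sketch below assumes the signals are stored as an (N, K, d) array and shows how the columns of $F^{l-1,\mathrm{pa}(k)}$ are selected from the adjacency matrix; the helper names are our own.

```python
import numpy as np

def concat_signals(signals):
    """Stack per-vertex signals (shape (N, K, d)) into rows x_i in R^{K*d}, as in the matrix X of Fig. 1."""
    N, K, d = signals.shape
    return signals.reshape(N, K * d)

def parent_columns(A, k, d):
    """Column indices of the submatrix F^{l-1, pa(k)}: the d features of every parent of vertex k.
    A is the binary adjacency matrix A_G, with pa(k) = {j : A[j, k] = 1}."""
    parents = np.flatnonzero(A[:, k])
    return np.concatenate([np.arange(p * d, (p + 1) * d) for p in parents])

# toy example mirroring Fig. 1: vertex v2 (index 1) has parents v1 and v3 (indices 0 and 2);
# a self-loop can be added to A if a vertex should also depend on itself
A = np.zeros((3, 3), dtype=int)
A[0, 1] = A[2, 1] = 1
X = concat_signals(np.random.randn(4, 3, 2))         # N = 4 signals, K = 3 vertices, d_in = 2
F0_pa2 = X[:, parent_columns(A, 1, 2)]               # the columns F^{0, pa(2)} feeding vertex v2
```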
Since both terms on the r.h.s. of Eq. (4) are Gaussian, we can marginalize out $U^{l,k}$ and obtain $q(\{F^{l,k}\}_{l,k}) = \prod_{l=1}^{L}\prod_{k=1}^{K} \mathcal{N}(F^{l,k} \mid \bar\mu^{l,k}, \bar\Sigma^{l,k})$, where $[\bar\mu^{l,k}]_i = \mu_{m^{l,k},\, Z^{l-1,\mathrm{pa}(k)}}\big(F^{l-1,\mathrm{pa}(k)}_i\big)$ and $[\bar\Sigma^{l,k}]_{ij} = \Sigma_{S^{l,k},\, Z^{l-1,\mathrm{pa}(k)}}\big(F^{l-1,\mathrm{pa}(k)}_i, F^{l-1,\mathrm{pa}(k)}_j\big)$. In this equation $[\bar\mu^{l,k}]_i$ denotes the $i$-th row of the matrix $\bar\mu^{l,k} \in \mathbb{R}^{N \times K d_l}$, and $[\bar\Sigma^{l,k}]_{ij}$ denotes the entry $(i, j)$ of $\bar\Sigma^{l,k} \in \mathbb{R}^{N \times N}$. The final analytical results can be obtained from Eqs. (1) and (2). Considering the above mean and covariance, as well as the marginal property of Gaussian processes, we have the following simple method to sample the latent function values recursively.

**Recursive Sampling Scheme** The marginal of the $i$-th row, $k$-th vertex in layer $L$ depends only on the parents of $k$ in layer $L-1$. By a recursive analysis, the distribution $q(F^{L,k}_i)$ depends only on the ancestors of $k$. Consequently, for the final layer we have

$$q\big(F^{L,k}_i\big) = \int \prod_{l',k'} q\big(F^{l',k'}_i \mid m^{l',k'}, S^{l',k'}; F^{l'-1,\mathrm{pa}(k')}_i, Z^{l'-1,\mathrm{pa}(k')}\big)\, d\big\{F^{l'-1,\mathrm{pa}(k')}_i\big\}, \qquad (5)$$

where $(l', k')$ enumerates all ancestors of $(L, k)$ together with $(L, k)$ itself. The significance of this observation is that it provides a method to sample from $q(F^{L,k})$ recursively. Similar to [11], the sampling can be conveniently achieved by applying the reparameterization trick recursively. Specifically, let $\epsilon^{l,k}_i \sim \mathcal{N}(0, I_{d_l})$ be a sampled Gaussian noise; the samples of vertices in different layers can then be obtained recursively as

$$\hat F^{l,k}_i = \mu_{m^{l,k},\, Z^{l-1,\mathrm{pa}(k)}}\big(\hat F^{l-1,\mathrm{pa}(k)}_i\big) + \epsilon^{l,k}_i \sqrt{\Sigma_{S^{l,k},\, Z^{l-1,\mathrm{pa}(k)}}\big(\hat F^{l-1,\mathrm{pa}(k)}_i, \hat F^{l-1,\mathrm{pa}(k)}_i\big)}. \qquad (6)$$

**Evidence Lower Bound** Now we derive the evidence lower bound of DGPG. As the $F^{l,k}$'s and $U^{l,k}$'s represent the latent variables in the model, the evidence lower bound is $\mathcal{L}_{\mathrm{DGPG}} = \mathbb{E}_{q(\{F^{l,k}, U^{l,k}\}_{l,k})}\Big[\log \tfrac{p(Y, \{F^{l,k}, U^{l,k}\}_{l,k})}{q(\{F^{l,k}, U^{l,k}\}_{l,k})}\Big]$. By substituting Eqs. (3) and (4), this can be simplified as

$$\mathcal{L}_{\mathrm{DGPG}} = \sum_{n=1}^{N}\sum_{k=1}^{K} \mathbb{E}_{q(F^{L,k}_n)}\Big[\log p\big(y^k_n \mid F^{L,k}_n\big)\Big] - \sum_{l=1}^{L}\sum_{k=1}^{K} \mathrm{KL}\Big[ q\big(U^{l,k}\big) \,\Big\|\, p\big(U^{l,k}; Z^{l-1,\mathrm{pa}(k)}\big)\Big].$$

There are in total $O(KNM^2(d_1 + \cdots + d_L))$ variational parameters in the model.

**Main Theorem** Next we present the main theoretical result of this paper, which states that by utilizing graph information DGPG reduces sampling variances, implying faster convergence during the optimization of the ELBO. First we list two technical assumptions and their intuitive explanations.

**Assumption 1.** $k(Z, Z) - S \succ 0$ ($\succ 0$ denotes positive definiteness). Intuitively, in Eq. (2) the uncertainty characterized by $\Sigma_{S,Z}(x_i, x_j)$ should be strictly smaller than that of $k(x_i, x_j)$, since more information is available. Similar observations and arguments can be found in [27].

**Assumption 2.** Without the graph, the prediction $y$ depends on all other nodes, i.e. $h(x) = y$ where $x \in \mathbb{R}^D$; with the graph, $y$ depends only on $\tilde{x} \subset x$, i.e. $h(\tilde{x}) = y$ where $\tilde{x} \in \mathbb{R}^d$ and $d < D$. Similarly, $\tilde{Z} \in \mathbb{R}^{M \times d}$ and $\tilde{S} \in \mathbb{R}^{M \times M}$ represent the variables when the graph is considered. We assume that $\|k(\tilde{Z}, \tilde{x})\|^2 / \|k(Z, x)\|^2 > \lambda_{\max} / \tilde{\lambda}_{\min}$, where $\lambda_{\max}$ is the maximal eigenvalue of $k(Z, Z)^{-1}(k(Z, Z) - S)k(Z, Z)^{-1}$ and $\tilde{\lambda}_{\min}$ is the minimal eigenvalue of $k(\tilde{Z}, \tilde{Z})^{-1}(k(\tilde{Z}, \tilde{Z}) - \tilde{S})k(\tilde{Z}, \tilde{Z})^{-1}$. In the proof we show how to justify this assumption when the graph is sparse (i.e. $d \ll D$).

**Theorem 1.** Denote the sampling variance with the graph as $\tilde\Sigma_{ii} = \Sigma_{\tilde{S}, \tilde{Z}}(\tilde{x}_i, \tilde{x}_i)$ (defined in Eq. (2)), and the sampling variance without the graph as $\Sigma_{ii} = \Sigma_{S, Z}(x_i, x_i)$. Under Assumptions 1 and 2, for a large variety of kernel functions (including RBF, Matérn32, etc.) we have $\tilde\Sigma_{ii} < \Sigma_{ii}$.

The proof is given in the supplementary materials. Finally, let $\alpha = \|k(\tilde{Z}, \tilde{x})\|^2 / \|k(Z, x)\|^2$ and $\beta = \lambda_{\max} / \tilde{\lambda}_{\min}$. The assumptions are then equivalent to $\lambda_{\min}, \tilde{\lambda}_{\min} > 0$ and $\alpha > \beta$. Now both assumptions become verifiable assertions, which will be further validated in the experimental study.
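A minimal NumPy sketch of the recursive, per-vertex sampling scheme in the spirit of Eq. (6) is given below, assuming a zero mean function, one output dimension per vertex in every layer, and randomly initialized variational parameters; it is an illustration of the scheme rather than the paper's GPflow implementation.

```python
import numpy as np

def rbf(A, B, ls=1.0, var=1.0):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return var * np.exp(-0.5 * d2 / ls**2)

def sample_dgpg(X, A, layers, rng):
    """Recursive per-vertex sampling following Eq. (6), with a zero mean function.

    X      : (N, K*d0) concatenated input signals (F^0).
    A      : (K, K) binary adjacency matrix; column k marks the parents of vertex k.
    layers : list of dicts {k: (Z_k, m_k, S_k)}, where Z_k has one column per feature of pa(k).
    """
    K = A.shape[0]
    d = X.shape[1] // K
    F = X
    for params in layers:
        F_next = np.empty((F.shape[0], K))
        for k in range(K):
            Z, m, S = params[k]
            cols = np.concatenate([np.arange(p * d, (p + 1) * d)
                                   for p in np.flatnonzero(A[:, k])])
            Fp = F[:, cols]                                    # \hat{F}^{l-1, pa(k)}
            Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))
            alpha = np.linalg.solve(Kzz, rbf(Z, Fp))           # shape (M, N)
            mu = alpha.T @ m                                   # \bar{mu}^{l,k}
            var = np.diag(rbf(Fp, Fp)) - np.einsum('mi,mn,ni->i', alpha, Kzz - S, alpha)
            eps = rng.standard_normal(len(Fp))                 # reparameterization trick
            F_next[:, k] = mu + eps * np.sqrt(np.clip(var, 1e-12, None))
        F, d = F_next, 1                                       # hidden layers carry 1 value per vertex
    return F                                                   # samples \hat{F}^{L,k}_i for every (i, k)

# toy usage with random placeholder variational parameters
rng = np.random.default_rng(0)
K, M, d0 = 4, 5, 2
A = np.eye(K, dtype=int); A[0, 1] = A[2, 3] = 1
layers = [{k: (rng.normal(size=(M, d0 * int(A[:, k].sum()))), rng.normal(size=M), 0.1 * np.eye(M))
           for k in range(K)},
          {k: (rng.normal(size=(M, int(A[:, k].sum()))), rng.normal(size=M), 0.1 * np.eye(M))
           for k in range(K)}]
F_L = sample_dgpg(rng.normal(size=(6, K * d0)), A, layers, rng)   # shape (6, K)
```

In a full implementation, the resulting samples $\hat F^{L,k}_i$ would feed a Monte Carlo estimate of the likelihood term of $\mathcal{L}_{\mathrm{DGPG}}$, and the whole computation would be differentiated end-to-end thanks to the reparameterization trick.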
## 4 Experimental Study

In this section we conduct a thorough experimental study of our method. Our implementation is based on the GPflow framework [43] and the implementation of DGP³. The source code and evaluations of our work are publicly available⁴. All experiments were performed on an Ubuntu server equipped with a GeForce GTX 1080 Ti GPU, which supports our parallelized implementation.

³ https://github.com/ICL-SML/Doubly-Stochastic-DGP
⁴ https://github.com/naiqili/DGPG

### 4.1 Synthetic Dataset

We begin by presenting the experimental results on a synthetic dataset. The goal of this example is to support our theoretical analysis with numerical evidence, e.g. verifying the positive definiteness assumption and the claim that the graph reduces sampling variances and leads to faster convergence. Consider a symmetric graph with 500 nodes, where each node is connected to approximately 7 randomly selected parents (including a self-connecting edge). We sample each input signal $x_i \in \mathbb{R}^{500}$ from a standard multivariate normal distribution with identity covariance matrix, $x \sim \mathcal{N}(0, I)$. The output of each node is defined as the sum of its parents: $y^k_i = \sum_{k' \in \mathrm{pa}(k)} x^{k'}_i$. We generate 500 input-output signals for training and 200 for testing.

In the theoretical analysis we showed that $\alpha > \beta$ implies that the algorithm benefits from graph information and thus converges faster, so we use $r = \alpha/\beta$ to measure the richness of the graph information: larger $r$ suggests a more informative graph structure. We further corrupt the graph by randomly selecting $p = 0, 2, 4, 6$ edges and changing them into erroneous connections (when $p = 0$ the connections are unchanged).

Table 1: Several statistics that are of interest for the theoretical analysis.

| Run | $\lambda_{\min}$ | $\lambda_{\max}$ | $\tilde\lambda_{\min}$ | $\alpha$ | $\beta$ | $\alpha/\beta$ ($p=0$) | $\alpha/\beta$ ($p=2$) | $\alpha/\beta$ ($p=4$) | $\alpha/\beta$ ($p=6$) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4 | 1.1 | 1.1E+00 | 5.0E+18 | 1E+0 | 5.0E+18 | 2.9E+17 | 2.8E+15 | 6.1E+13 |
| 2 | 1.2 | 1.2 | 5.4E+02 | 4.1E+17 | 2E-3 | 2.1E+20 | 2.8E+16 | 3.3E+15 | 4.2E+13 |
| 3 | 1.0 | 1.1 | 4.9E+01 | 2.8E+17 | 2E-2 | 1.4E+19 | 1.1E+16 | 6.8E+14 | 4.5E+13 |
| 4 | 0.6 | 1.0 | 2.8E+00 | 3.7E+20 | 3E-1 | 1.2E+21 | 3.9E+15 | 3.3E+15 | 3.5E+13 |
| 5 | 0.4 | 1.1 | 9.5E-01 | 1.3E+18 | 4E-1 | 3.3E+18 | 6.0E+16 | 1.8E+16 | 1.6E+13 |

We trained a DGPG model with 1 layer and 50 inducing points, repeated over 5 runs. Without loss of generality we report the results of the first node and the first instance. Table 1 presents various numerical values that are of interest to our analysis. As $p$ increases, $r$ becomes smaller (a corrupted graph is less informative), which follows our analysis and expectation. Fig. 2 shows the convergence of the ELBO; DGPG without graph is computed with a fully connected graph, in which case DGPG reduces to DGP.

Figure 2: Convergence of the ELBO.

The results suggest that: 1) the assumption $k(Z, Z) - S \succ 0$ always holds (the $\lambda_{\min}$'s and $\tilde\lambda_{\min}$'s are always positive); 2) $\alpha \gg \beta$, which by Theorem 1 implies that our method reduces the sampling variances by utilizing the graph information; 3) the above claim is further demonstrated in Fig. 2, from which we can see that our method converges much faster when optimizing the ELBO.
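For reference, the synthetic dataset described above can be generated in a few lines; this is our reading of the setup (in particular, the edge-corruption helper is one possible interpretation of "change them into erroneous connections"), not the exact script used for the experiments.

```python
import numpy as np

def make_synthetic(K=500, avg_parents=7, n_train=500, n_test=200, seed=0):
    """Synthetic dataset of Section 4.1: a symmetric graph over K nodes where each node has
    roughly `avg_parents` parents (including a self loop), inputs x ~ N(0, I), and the output
    of node k is the sum of its parents' inputs."""
    rng = np.random.default_rng(seed)
    A = np.eye(K, dtype=int)                               # self-connecting edges
    for k in range(K):
        others = rng.choice(K, size=avg_parents - 1, replace=False)
        A[others, k] = 1
    A = np.maximum(A, A.T)                                 # symmetrize the graph
    X = rng.standard_normal((n_train + n_test, K))         # input signals x ~ N(0, I)
    Y = X @ A                                              # y_i^k = sum_{k' in pa(k)} x_i^{k'}
    return A, (X[:n_train], Y[:n_train]), (X[n_train:], Y[n_train:])

def corrupt(A, p, seed=0):
    """Flip p randomly chosen entries of the adjacency matrix to create erroneous connections."""
    rng = np.random.default_rng(seed)
    A = A.copy()
    idx = rng.integers(0, A.shape[0], size=(p, 2))
    A[idx[:, 0], idx[:, 1]] ^= 1
    return A
```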
### 4.2 Small Datasets

Data efficiency is an appealing property of GPs. In this experiment we test the performance of DGPG on three small datasets: Weather [44], fMRI [45] and ETEX [46]. To ensure that our comparison is fair, we use identical datasets and training/test splits. A description of the datasets is presented in Table 2. We use $M = 5$ inducing points. When building the adjacency matrix we use the Euclidean distance to construct the connections.

Table 2: Dataset description: number of training/test instances, number of nodes, average degree (Avg. D.), and number of iteration steps (#Iter.).

| | Weather | fMRI | ETEX |
|---|---|---|---|
| #Training | 46 | 145 | 30 |
| #Test | 46 | 145 | 30 |
| #Node | 45 | 100 | 168 |
| Avg. D. | 5 | 5 | 10 |
| #Iter. | 2000 | 5000 | 5000 |

Considering the utilization of graph information, we add two graph-aware Gaussian process methods (GPG [20] and GCGP [19]) to the comparison; other DGP methods are excluded as they are incapable of modeling graph datasets. GCGP [19] is originally designed to produce only one output for the whole graph rather than one per node. To make it comparable, we train an independent GCGP instance for each node and recompose their outputs into the requested complete prediction. Besides, the definition of the intrinsic angular distance on general graphs is not provided in [19], so we use the diffusion graph convolution kernel in the experiments.

Table 3 presents the evaluation results. We are interested in three metrics: mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE). Since these metrics were not considered in [20], we evaluated their method locally. We also consider two naive baselines: the historical mean and the historical median. In the preprocessing stage we normalize the data on each node to have zero mean and unit standard deviation. For fMRI and ETEX the input signals on some nodes are absent, and we simply take their values to be Gaussian noise with very small variance ($10^{-4}$). We tested our model with 1 to 4 layers, and tried the RBF and Matérn32 kernels.

From this experiment we can draw the following conclusions: 1) our method outperforms the other recently proposed Gaussian-based method in most cases, sometimes by a large margin; 2) in complex tasks deeper models lead to better performance, showing that DGPG benefits from the deep structure; 3) the RBF and Matérn32 kernels produce similar results, indicating that DGPG is not sensitive to the particular kernel choice; 4) DGPG preserves the appealing data efficiency property of GPs.

Table 3: GP-L and GPG-L are baselines from [20]. SVR denotes support vector regression; for SVR the output is a function of its neighbors' inputs and high-order graph information is lost. Metrics of GCGP [19] are calculated using the recomposed prediction results. We report the results of DGPG using the linear kernel (L), RBF kernel (RBF), Matérn32 kernel (M32), and the optimal number of layers. Results of several other baselines are in the supplementary materials.

| Dataset | Metric | mean | GP-L | GPG-L | SVR | GCGP | DGPG (L / RBF / M32 / layer) |
|---|---|---|---|---|---|---|---|
| Weather | MAE | 1.61 | 1.52 | 1.66 | 1.44 | 1.46 | 1.47 / 1.37 / 1.36 / 1 |
| Weather | RMSE | 2.11 | 1.97 | 2.19 | 1.88 | 1.90 | 1.92 / 1.80 / 1.79 / 1 |
| Weather | MAPE | 17% | 19% | 24% | 15% | 18% | 17% / 16% / 15% / 1 |
| fMRI | MAE | 0.016 | 0.026 | 0.074 | 0.014 | 0.015 | 0.015 / 0.010 / 0.010 / 4 |
| fMRI | RMSE | 0.021 | 0.033 | 0.089 | 0.020 | 0.021 | 0.020 / 0.015 / 0.015 / 4 |
| fMRI | MAPE | 1.6% | 2.6% | 7.4% | 1.4% | 1.5% | 1.5% / 1.0% / 1.0% / 4 |
| ETEX | MAE | 0.40 | 0.25 | 0.30 | 0.18 | 0.22 | 0.30 / 0.27 / 0.21 / 3 |
| ETEX | RMSE | 0.45 | 0.34 | 0.36 | 0.31 | 0.31 | 0.41 / 0.41 / 0.35 / 3 |
| ETEX | MAPE | 40% | 25% | 30% | 18% | 22% | 30% / 27% / 21% / 3 |
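The per-node normalization and the three error metrics reported in Tables 3 and 5 can be written compactly; a small sketch with our own function names:

```python
import numpy as np

def normalize_per_node(train, test, eps=1e-8):
    """Zero-mean / unit-standard-deviation normalization per node, using training-split statistics."""
    mu, sd = train.mean(axis=0), train.std(axis=0) + eps
    return (train - mu) / sd, (test - mu) / sd, (mu, sd)

def evaluate(y_true, y_pred, eps=1e-8):
    """MAE, RMSE and MAPE, the three metrics used in the evaluation tables."""
    err = y_true - y_pred
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAPE": float(np.mean(np.abs(err) / (np.abs(y_true) + eps))),
    }
```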
Table 5: Comparison on the task of traffic flow prediction. Results of the other baselines are obtained from [48]. DGPG* utilizes the validation data during training and is fixed to 3 layers. Results on the BAY dataset are in the supplementary materials.

| T | Metric | VAR | FC-LSTM | DCRNN | DGPG (1 / 2 / 3 / 4 layers) | DGPG* |
|---|---|---|---|---|---|---|
| 15 min | MAE | 4.42 | 3.44 | 2.77 | 3.06 / 3.04 / 3.02 / 3.02 | 3.00 |
| 15 min | RMSE | 7.89 | 6.30 | 5.38 | 5.40 / 5.35 / 5.32 / 5.32 | 5.31 |
| 15 min | MAPE | 10.2% | 10.9% | 7.3% | 6.6% / 6.0% / 6.6% / 6.5% | 6.5% |
| 30 min | MAE | 5.41 | 3.77 | 3.15 | 3.57 / 3.42 / 3.42 / 3.39 | 3.39 |
| 30 min | RMSE | 9.13 | 7.23 | 6.45 | 6.37 / 6.16 / 6.16 / 6.12 | 6.13 |
| 30 min | MAPE | 12.7% | 10.9% | 8.8% | 7.5% / 7.3% / 7.3% / 7.2% | 7.2% |
| 60 min | MAE | 6.52 | 4.37 | 3.6 | 4.02 / 3.83 / 3.00 / 3.80 | 3.8 |
| 60 min | RMSE | 10.11 | 8.69 | 7.59 | 7.12 / 6.93 / 6.94 / 6.94 | 6.85 |
| 60 min | MAPE | 15.8% | 13.2% | 10.5% | 8.4% / 8.1% / 7.9% / 8.0% | 5.0% |

### 4.3 Large Datasets

Next we test DGPG on two challenging large datasets, where the goal is to predict the traffic flow at different forecasting horizons (15, 30 and 60 min) in Los Angeles (LA) [47] and the Bay Area (BAY) [48]. We compare with DCRNN [48], which is one of the state-of-the-art approaches for traffic flow prediction. We use the same data and splits according to their released code⁵. We use $M = 20$ inducing points per node. Records in some instances are missing, and we fill them with the median values. When building the graph, each vertex is connected to about 5 neighbors. The only difference is that we only consider the first 100 nodes. Dataset details can be found in the supplementary materials.

⁵ https://github.com/liyaguang/DCRNN

The results are presented in Table 5. Due to space limits we only show three strong baselines, VAR [49], FC-LSTM [50] and DCRNN (other results are in the supplementary materials). One advantage of DGPG is that there is no strong necessity to use a validation set, so similarly to [18] we utilize the validation dataset for training while fixing the number of layers to 3; these results are shown as DGPG*.

The results in Table 5 show that: 1) DGPG is competitive with the state-of-the-art method DCRNN, and outperforms it in most cases on the more challenging LA dataset; 2) DGPG produces accurate predictions for different datasets and forecasting horizons, showing its stability and consistency; 3) DGPG achieves its best performance with an appropriate number of layers, demonstrating that it benefits from deep structures; 4) the performance can be further improved by utilizing the validation data during training.

**Variance Analysis** A distinguishing advantage of GPs is that they are capable of modeling predictive variances. We will see that DGPG can capture uncertainty with satisfactory precision. Inspired by [51], we examine how many test instances fall within the predictive confidence intervals. For a Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$, the interval $(\mu - k\sigma, \mu + k\sigma)$ covers about 68.3%/95.5%/99.7% of the probability mass for $k = 1, 2, 3$, so we can expect the portion of test instances falling within the predictive intervals to match these ratios. Table 4 shows the experimental results. We can see that the numerical results comply with our analysis reasonably well, particularly within the 2σ intervals.

Table 4: Variance analysis of DGPG: portion of test instances falling within the 1σ/2σ/3σ predictive intervals.

| Horizon | 1σ | 2σ | 3σ |
|---|---|---|---|
| 15 min | 84.4% | 94.7% | 97.7% |
| 30 min | 84.2% | 94.5% | 97.5% |
| 60 min | 83.7% | 94.3% | 97.5% |
| 15 min | 87.4% | 95.6% | 97.9% |
| 30 min | 86.0% | 94.5% | 97.2% |
| 60 min | 85.1% | 94.1% | 97.0% |
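The calibration check behind Table 4 amounts to counting how many test targets fall inside each predictive interval; a minimal sketch, assuming Gaussian predictive marginals with per-point means and variances (the function name is ours):

```python
import numpy as np

def interval_coverage(y_true, pred_mean, pred_var):
    """Fraction of test targets inside the (mu - k*sigma, mu + k*sigma) predictive intervals.
    For well-calibrated Gaussian predictions these should be close to 68.3% / 95.5% / 99.7%."""
    z = np.abs(y_true - pred_mean) / np.sqrt(pred_var)
    return {f"{k} sigma": float(np.mean(z < k)) for k in (1, 2, 3)}
```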
We performed a case study to show DGPG's ability to capture uncertainty. We computed the variances of the test data and took the average across each node. Fig. 3 depicts the two sensors with the least and the largest uncertainties. The sensor with the least uncertainty is located in a sparsely populated area with simple traffic conditions; the sensor with the largest uncertainty is located at a business center where the traffic conditions are very complex.

Figure 3: Locations of the sensors with (top) the least uncertainty and (bottom) the largest uncertainty.

**Automatic Relevance Determination** Fig. 4 depicts the learned lengthscales of node 1. In this particular example, node 1 is connected to 5 nodes (including itself). Each node has 8 features: T represents the time of day (00:00, 00:05, ...); D represents the day of the week (Monday, Tuesday, etc.); features at $t = \{6, \ldots, 1\}$ represent the traffic $5t$ minutes before. The first 8 bars correspond to the lengthscales of the self connection, and the other bars represent the lengthscales w.r.t. the other four nodes. Note that a smaller lengthscale indicates greater relevance. The effectiveness of the ARD kernel can be inspected in Fig. 4: 1) by using ARD, DGPG is capable of discerning which neighbors are more relevant; for instance, the self connection is more important when the forecasting horizon is small (15 min); 2) DGPG can further discover which features are relevant: time plays a significant role in the prediction, while the day on which the record occurred is less relevant; 3) we emphasize again that the relevance is discovered automatically, so using binary adjacency matrices does not reduce the expressiveness of our model.

Figure 4: Lengthscales inferred by the ARD kernel for forecasting horizons of (left) 15 min, (middle) 30 min, and (right) 60 min.

## 5 Conclusion

In this paper we propose a method to learn the mappings between input and output signals over graphs, which we call Stochastic Deep Gaussian Processes over Graphs (DGPG). Our method utilizes the graph information during the construction of the deep GP structure. By applying variational inference, the evidence lower bound is derived and optimized with a recursive sampling scheme. We conducted thorough experiments on both synthetic and realistic datasets. Numerical results on the synthetic dataset validate our theoretical assumptions and analysis, particularly the claim that our method has smaller sampling variances and thus converges faster. On the realistic datasets, we show that our method generally outperforms other baselines on small datasets, and is competitive with the state-of-the-art method on the challenging task of traffic flow prediction. Finally, DGPG also exhibits several appealing characteristics, such as the ability to accurately model uncertainties and to automatically discover which vertices and features are relevant to the prediction.

## Acknowledgments and Disclosure of Funding

This work is supported in part by the National Key Research and Development Program of China under Grant 2018YFB1800204, the National Natural Science Foundation of China under Grant 61771273, the R&D Program of Shenzhen under Grant JCYJ20180508152204044, and the project "PCL Future Greater-Bay Area Network Facilities for Large-scale Experiments and Applications (LZC0019)".

## Broader Impact

This paper proposes a graph-guided deep Gaussian process model. By incorporating the graph information into deep Gaussian processes, our proposed method reduces sampling variances and achieves faster convergence. Graph structure is a very common tool in numerous fields. In our experiments, we also show that the proposed method can facilitate the task of traffic flow prediction. We believe that the exploration of graph structure will largely improve the capacity of current machine learning algorithms.

## References
[1] C. Rasmussen and C. Williams, Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning, MIT Press, Jan. 2006.
[2] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, Privacy preserving crowd monitoring: Counting people without people models or tracking, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2008), 2008.
[3] J. Snoek, H. Larochelle, and R. P. Adams, Practical Bayesian optimization of machine learning algorithms, in Advances in Neural Information Processing Systems (NIPS-2012), pp. 2951–2959, 2012.
[4] H. Liu, Y.-S. Ong, and J. Cai, A survey of adaptive sampling for global metamodeling in support of simulation-based complex engineering design, Structural and Multidisciplinary Optimization, vol. 57, no. 1, pp. 393–416, 2018.
[5] E. V. Bonilla, K. M. Chai, and C. Williams, Multi-task Gaussian process prediction, in Advances in Neural Information Processing Systems (NIPS-2008), pp. 153–160, 2008.
[6] M. Kuss and C. E. Rasmussen, Gaussian processes in reinforcement learning, in Advances in Neural Information Processing Systems (NIPS-2004), pp. 751–758, 2004.
[7] A. C. Damianou and N. D. Lawrence, Deep Gaussian processes, in International Conference on Artificial Intelligence and Statistics (AISTATS-2013), pp. 207–215, 2013.
[8] Z. Dai, A. C. Damianou, J. González, and N. D. Lawrence, Variational auto-encoded deep Gaussian processes, in International Conference on Learning Representations (ICLR-2016), 2016.
[9] C. L. C. Mattos, Z. Dai, A. C. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence, Recurrent Gaussian processes, in International Conference on Learning Representations (ICLR-2016), 2016.
[10] M. Titsias, Variational learning of inducing variables in sparse Gaussian processes, in Artificial Intelligence and Statistics (AISTATS-2009), pp. 567–574, 2009.
[11] H. Salimbeni and M. Deisenroth, Doubly stochastic variational inference for deep Gaussian processes, in Advances in Neural Information Processing Systems (NIPS-2017), pp. 4588–4599, 2017.
[12] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, Spectral networks and locally connected networks on graphs, in International Conference on Learning Representations (ICLR-2014), 2014.
[13] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams, Convolutional networks on graphs for learning molecular fingerprints, in Advances in Neural Information Processing Systems (NIPS-2015), pp. 2224–2232, 2015.
[14] J. Atwood and D. Towsley, Diffusion-convolutional neural networks, in Advances in Neural Information Processing Systems (NIPS-2016), pp. 1993–2001, 2016.
[15] M. Defferrard, X. Bresson, and P. Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering, in Advances in Neural Information Processing Systems (NIPS-2016), pp. 3844–3852, 2016.
[16] T. N. Kipf and M. Welling, Semi-supervised classification with graph convolutional networks, in International Conference on Learning Representations (ICLR-2017), 2017.
[17] W. Hamilton, Z. Ying, and J. Leskovec, Inductive representation learning on large graphs, in Advances in Neural Information Processing Systems (NIPS-2017), pp. 1024–1034, 2017.
[18] Y. C. Ng, N. Colombo, and R. Silva, Bayesian semi-supervised learning with graph Gaussian processes, in Advances in Neural Information Processing Systems (NIPS-2018), pp. 1690–1701, 2018.
[19] I. Walker and B. Glocker, Graph convolutional Gaussian processes, in International Conference on Machine Learning (ICML-2019), pp. 6495–6504, 2019.
[20] A. Venkitaraman, S. Chatterjee, and P. Handel, Gaussian processes over graphs, in International Conference on Acoustics, Speech and Signal Processing (ICASSP-2020), pp. 5640–5644, IEEE, 2020.
[21] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.
[22] V. Tresp, A Bayesian committee machine, Neural Computation, vol. 12, no. 11, pp. 2719–2741, 2000.
[23] Y. Cao and D. J. Fleet, Generalized product of experts for automatic and principled fusion of Gaussian process predictions, arXiv preprint arXiv:1410.7827, 2014.
[24] M. Deisenroth and J. W. Ng, Distributed Gaussian processes, in International Conference on Machine Learning (ICML-2015), pp. 1481–1490, 2015.
[25] H. Liu, J. Cai, Y. Wang, and Y. S. Ong, Generalized robust Bayesian committee machine for large-scale Gaussian process regression, in International Conference on Machine Learning (ICML-2018), 2018.
[26] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, When Gaussian process meets big data: A review of scalable GPs, IEEE Transactions on Neural Networks and Learning Systems, 2020.
[27] J. Quiñonero-Candela and C. E. Rasmussen, A unifying view of sparse approximate Gaussian process regression, Journal of Machine Learning Research, vol. 6, no. Dec, pp. 1939–1959, 2005.
[28] E. Snelson and Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in Advances in Neural Information Processing Systems (NIPS-2006), 2006.
[29] E. Snelson and Z. Ghahramani, Local and global sparse Gaussian process approximations, in Artificial Intelligence and Statistics (AISTATS-2007), pp. 524–531, 2007.
[30] J. Hensman, N. Fusi, and N. D. Lawrence, Gaussian processes for big data, in Conference on Uncertainty in Artificial Intelligence (UAI-2013), pp. 282–290, 2013.
[31] M. K. Titsias, Variational inference for Gaussian and determinantal point processes, in Workshop on Advances in Variational Inference, 2014.
[32] A. G. d. G. Matthews, Scalable Gaussian Process Inference Using Variational Methods. PhD thesis, University of Cambridge, 2017.
[33] D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, in International Conference on Machine Learning (ICML-2014), pp. 1278–1286, 2014.
[34] D. P. Kingma, T. Salimans, and M. Welling, Variational dropout and the local reparameterization trick, in Advances in Neural Information Processing Systems (NIPS-2015), pp. 2575–2583, 2015.
[35] C. K. I. Williams and M. W. Seeger, Using the Nyström method to speed up kernel machines, in Advances in Neural Information Processing Systems (NIPS-2000), pp. 682–688, 2000.
[36] M. W. Seeger, C. K. I. Williams, and N. D. Lawrence, Fast forward selection to speed up sparse Gaussian process regression, in International Workshop on Artificial Intelligence and Statistics (AISTATS-2003), 2003.
[37] D. Burt, C. E. Rasmussen, and M. Van Der Wilk, Rates of convergence for sparse variational Gaussian process regression, in International Conference on Machine Learning, pp. 862–871, 2019.
[38] M. K. Titsias and N. D. Lawrence, Bayesian Gaussian process latent variable model, in International Conference on Artificial Intelligence and Statistics (AISTATS-2010), pp. 844–851, 2010.
[39] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun, Graph neural networks: A review of methods and applications, arXiv preprint arXiv:1812.08434, 2018.
[40] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip, A comprehensive survey on graph neural networks, IEEE Transactions on Neural Networks and Learning Systems, 2020.
[41] H. Rue and L. Held, Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.
[42] P. Sidén and F. Lindsten, Deep Gaussian Markov random fields, arXiv preprint arXiv:2002.07467, 2020.
[43] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, GPflow: A Gaussian process library using TensorFlow, Journal of Machine Learning Research, vol. 18, pp. 1–6, Apr. 2017.
[44] Swedish Meteorological and Hydrological Institute. http://opendata-download-metobs.smhi.se/. Last accessed: 2020-05-27.
[45] H. Behjat, U. Richter, D. Van De Ville, and L. Sörnmo, Signal-adapted tight frames on graphs, IEEE Transactions on Signal Processing, vol. 64, no. 22, pp. 6017–6029, 2016.
[46] H. Van Dop, G. Graziani, and W. Klug, ETEX: A European tracer experiment, in Large Scale Computations in Air Pollution Modelling, pp. 137–150, Springer, 1999.
[47] H. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi, Big data and its technical challenges, Communications of the ACM, vol. 57, no. 7, pp. 86–94, 2014.
[48] Y. Li, R. Yu, C. Shahabi, and Y. Liu, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, in International Conference on Learning Representations (ICLR-2018), 2018.
[49] J. D. Hamilton, Time Series Analysis, vol. 2. Princeton, New Jersey, 1994.
[50] I. Sutskever, O. Vinyals, and Q. V. Le, Sequence to sequence learning with neural networks, in Advances in Neural Information Processing Systems (NIPS-2014), pp. 3104–3112, 2014.
[51] D. Chai, L. Wang, and Q. Yang, Bike flow prediction with multi-graph convolutional networks, in International Conference on Advances in Geographic Information Systems, pp. 397–400, 2018.