# Scalable Gaussian Process Regression Networks

Shibo Li¹, Wei Xing², Robert M. Kirby¹,² and Shandian Zhe¹
¹School of Computing, University of Utah
²Scientific Computing and Imaging Institute, University of Utah
{shibo,kirby,zhe}@cs.utah.edu, wxing@sci.utah.edu

Gaussian process regression networks (GPRN) are powerful Bayesian models for multi-output regression, but their inference is intractable. To address this issue, existing methods use a fully factorized structure (or a mixture of such structures) over all the outputs and latent functions for posterior approximation, which, however, can miss the strong posterior dependencies among the latent variables and hurt the inference quality. In addition, the updates of the variational parameters are inefficient and can be prohibitively expensive for a large number of outputs. To overcome these limitations, we propose a scalable variational inference algorithm for GPRN, which not only captures the abundant posterior dependencies but also is much more efficient for massive outputs. We tensorize the output space and introduce tensor/matrix-normal variational posteriors to capture the posterior correlations and to reduce the number of parameters. We jointly optimize all the parameters and exploit the inherent Kronecker product structure in the variational model evidence lower bound to accelerate the computation. We demonstrate the advantages of our method in several real-world applications.

1 Introduction

Multi-output regression is an important machine learning problem, where the critical challenge is to grasp the complex output correlations so as to enable accurate predictions. Gaussian process regression networks (GPRN) [Wilson et al., 2012] are promising Bayesian models for multi-output regression, which exploit both the structural properties of neural networks and the flexibility of nonparametric function learning with Gaussian processes (GP) [Williams and Rasmussen, 2006]. In GPRNs, the outputs are an adaptive linear projection of a set of latent functions; both the latent functions and the projection weights are sampled from independent GPs. In this way, GPRNs can capture input-dependent, highly nonlinear correlations between the outputs, provide heavy-tailed predictive distributions and resist overfitting.

However, a critical bottleneck of GPRN is the intractability of its inference. To address this issue, existing methods use a fully factorized posterior approximation over the latent functions and projection weights to conduct mean-field variational inference [Wilson et al., 2012]. In light of the alternating updates, the posterior covariance matrix for each function and weight can be further parameterized by N (rather than N²) parameters, where N is the number of training samples [Nguyen and Bonilla, 2013]. In addition, Nguyen and Bonilla (2013) developed a nonparametric variational inference approach that uses a mixture of diagonal Gaussian distributions as the variational posterior of all the latent variables. Despite the success of these approaches, they can struggle in applications with very high-dimensional outputs, which are common in the real world, e.g., MRI imaging prediction and physical simulations. First, the fully factorized posterior (and the mixture of diagonal Gaussians) can miss the strong posterior dependencies within the weights and latent function values, resulting in suboptimal inference quality.
Second, the alternating mean-field updates and the optimization in the nonparametric inference have O(NK²D) and O(QN²KD) time complexity respectively, where D, K and Q are the numbers of outputs, latent functions and mixture components. When D is very large, say, millions, the computational cost can be prohibitively expensive even for moderate N and/or K, say, hundreds.

To overcome these problems, we propose a scalable variational inference algorithm that not only better captures the posterior dependency but also is much more efficient for massive outputs. Specifically, we tensorize the output space so that the projection matrix can be converted into a tensor, based on which we propose a tensor-normal distribution as a joint variational posterior for the projection weights. The tensor-normal posterior not only captures the strong posterior dependencies of the weights, but also requires far fewer covariance parameters, orders of magnitude fewer than the output dimension. Similarly, we introduce a matrix-normal variational posterior for all the latent function values to capture their posterior dependency and save parameters. Finally, we jointly optimize the variational evidence lower bound (ELBO), where we use the Kronecker product properties to decompose the expensive log determinant and matrix inverse to further accelerate the computation. As a result, our algorithm scales linearly in N, D and K, i.e., O(NDK). For evaluation, we first examined our method on three small datasets where the existing GPRN inference approaches are available. Our method shows not only better predictive performance but also a great speed-up. Then we tested our method in two real-world applications with thousands of outputs, where the existing GPRN inference algorithms are not feasible. Compared with several state-of-the-art scalable multi-output regression methods, our method almost always achieves significantly better prediction accuracy. Finally, we applied GPRN in a large-scale physical simulation application to predict one million outputs. Our method often improves upon the competing approaches by a large margin.

2 Gaussian Process Regression Networks

Let us first introduce the notation and background. Suppose we have a set of N multi-output training examples D = {(x_1, y_1), ..., (x_N, y_N)}, where each output y_n (1 ≤ n ≤ N) is D dimensional and D can be much larger than N. To model multiple outputs, Gaussian process regression networks (GPRNs) [Wilson et al., 2012] first introduce a small set of K latent functions, {f_1(·), ..., f_K(·)}. Each latent function f_k(·) is sampled from a GP prior [Williams and Rasmussen, 2006], a nonparametric function prior that can flexibly estimate various complex functions from data by incorporating (nonlinear) covariance or kernel functions. Next, GPRN introduces a D × K projection matrix W, where each element w_ij (1 ≤ i ≤ D, 1 ≤ j ≤ K) is also considered as a function of the input and is sampled from an independent GP prior. Given an input x, the outputs are modelled by

y(x) = W(x)[f(x) + σ_f ε] + σ_y z,    (1)

where f(x) = [f_1(x), ..., f_K(x)]^⊤, and ε and z are random noises sampled from the standard normal distribution. If we view each latent function f_k(x) as an input neuron, GPRN generates the outputs in the same way as neural networks (NNs). Hence GPRN enjoys the structural properties of NNs.
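To make the generative process (1) concrete, the following minimal NumPy sketch (ours, not the authors' implementation; the RBF kernel, length-scales and noise levels are illustrative assumptions) samples the latent functions and weight functions from independent GP priors at the training inputs and then forms the outputs:

```python
import numpy as np

def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix over the rows of X."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def sample_gprn(X, D=3, K=2, sigma_f=0.1, sigma_y=0.05, seed=0):
    """Draw one joint sample of latent functions, weights and outputs as in (1)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    jitter = 1e-6 * np.eye(N)
    Lf = np.linalg.cholesky(rbf_kernel(X) + jitter)                    # GP prior of each f_k
    Lw = np.linalg.cholesky(rbf_kernel(X, lengthscale=2.0) + jitter)   # GP prior of each w_ij

    F = Lf @ rng.standard_normal((N, K))                               # f_k at the N inputs
    W = np.einsum('nm,dkm->dkn', Lw, rng.standard_normal((D, K, N)))   # w_ij(x_n), shape (D, K, N)

    Y = np.empty((N, D))
    for n in range(N):
        h = F[n] + sigma_f * rng.standard_normal(K)       # f(x_n) + sigma_f * eps
        Y[n] = W[:, :, n] @ h + sigma_y * rng.standard_normal(D)
    return F, W, Y

X = np.linspace(0, 1, 50)[:, None]
F, W, Y = sample_gprn(X)
print(Y.shape)   # (50, 3)
```

Because the projection W(x_n) itself is a GP draw evaluated at each input, the mapping from f(x_n) to y_n changes across the input space, which is what gives GPRN its input-dependent output correlations.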
Furthermore, GPRN accommodates input-dependent (i.e., non-stationary) correlations of the outputs. Given W(·), the covariance of any two outputs y_i(x_a) and y_j(x_b) is

k_{y_i, y_j}(x_a, x_b) = ∑_{k=1}^K w_ik(x_a) κ̂_{f_k}(x_a, x_b) w_jk(x_b) + δ_ab σ_y²,

where δ_ab is 1 if a = b and 0 otherwise, κ̂_{f_k}(x_a, x_b) = κ_{f_k}(x_a, x_b) + δ_ab σ_f², and κ_{f_k}(·,·) is the covariance (kernel) function of the latent function f_k(·). The output covariance is determined by the inputs via the projection weights w_ik(x_a) and w_jk(x_b). Therefore, the model is able to adaptively capture complex output correlations that vary across the input space. This is more flexible than many popular multi-output regression models [Alvarez et al., 2012] that only impose stationary correlations invariant to the input locations.

Now we look into the joint probability of GPRN on the aforementioned training dataset D. Following the original paper [Wilson et al., 2012], we assume all the latent functions share the same kernel κ_f(·,·) and parameters θ_f, and all the projection weights share the same kernel κ_w(·,·) and parameters θ_w. For a succinct representation, we consider a noisy version of each latent function, f̂_k(x) = f_k(x) + ε_k σ_f, where ε_k ~ N(0, 1). Since f_k is assigned a GP prior, f̂_k also has a GP prior and its kernel is κ_f̂(x_a, x_b) = κ_f(x_a, x_b) + δ_ab σ_f². We denote the values of f̂_k at the training inputs X = [x_1, ..., x_N] by f̂_k = [f̂_k(x_1), ..., f̂_k(x_N)]^⊤ and those of the projection weight w_ij by w_ij = [w_ij(x_1), ..., w_ij(x_N)]^⊤. Since f̂_k(·) is sampled from a GP prior, its finite projection follows a multivariate Gaussian prior distribution, p(f̂_k | θ_f, σ_f²) = N(f̂_k | 0, K_f̂), where K_f̂ is a kernel matrix with elements [K_f̂]_ij = κ_f̂(x_i, x_j). Similarly, the prior of each w_ij is p(w_ij | θ_w) = N(w_ij | 0, K_w), where [K_w]_ij = κ_w(x_i, x_j). According to (1), given {w_ij}_{1≤i≤D, 1≤j≤K} and {f̂_k}_{1≤k≤K}, the observed outputs Y = [y_1, ..., y_N] are sampled from p(Y | {w_ij}, {f̂_k}) = ∏_{n=1}^N N(y_n | W_n h_n, σ_y² I), where W_n is D × K with [W_n]_ij = w_ij(x_n), and h_n = [f̂_1(x_n), ..., f̂_K(x_n)]^⊤. The joint probability then is

p(Y, {w_ij}, {f̂_k} | X, θ_f, θ_w, σ_f², σ_y²) = ∏_{k=1}^K N(f̂_k | 0, K_f̂) ∏_{i=1}^D ∏_{j=1}^K N(w_ij | 0, K_w) ∏_{n=1}^N N(y_n | W_n h_n, σ_y² I).    (2)

The inference of GPRN, namely, calculating the exact posterior distribution of the latent function values and projection weights, {w_ij} and {f̂_k}, and the other parameters, is infeasible due to the intractable normalization constant. While we could use Markov chain Monte Carlo sampling, it is known to be slow and hard to diagnose for convergence. For more efficient and tractable inference, current approaches [Wilson et al., 2012; Nguyen and Bonilla, 2013] introduce approximate posteriors that are fully factorized over the projection matrix and latent functions and then conduct mean-field variational inference [Wainwright et al., 2008]. Typically, the approximate posterior takes the following form,

q({w_ij}, {f̂_k}) = ∏_{i=1}^D ∏_{j=1}^K q(w_ij) ∏_{k=1}^K q(f̂_k),    (3)

where each marginal posterior is an N-dimensional Gaussian distribution. The mean-field inference enjoys analytical, alternating updates between each q(f̂_k) and q(w_ij). In light of the structure of the updates, the covariance of each posterior can be parameterized by just N rather than N² parameters [Nguyen and Bonilla, 2013]. The hyper-parameters {θ_f, θ_w, σ_f², σ_y²} are then estimated by gradient-based optimization of the variational evidence lower bound (ELBO).
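To see concretely how the factorized family (3) scales with the number of outputs D, the small helper below (ours, purely illustrative) counts its variational parameters under the compact N-parameter covariances of Nguyen and Bonilla (2013):

```python
def meanfield_param_count(N, D, K, compact=True):
    """Variational parameters in the fully factorized posterior (3).

    There are D*K weight factors q(w_ij) and K latent-function factors q(f_k),
    each an N-dimensional Gaussian.  With the compact parameterization of
    Nguyen and Bonilla (2013) every covariance needs N parameters; a full
    Cholesky factor would need N(N+1)/2.
    """
    n_factors = D * K + K
    cov = N if compact else N * (N + 1) // 2
    return n_factors * (N + cov)   # means plus covariance parameters

# Even with the compact covariances, the count grows linearly in D:
for D in (100, 10_000, 1_000_000):
    print(D, meanfield_param_count(N=100, D=D, K=10))
```

With N = 100 and K = 10, the count already exceeds two billion at D = 10⁶, which is exactly the regime our method is designed for.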
In [Nguyen and Bonilla, 2013], a nonparametric variational inference approach is also developed to better capture multimodality, where the variational posterior is a mixture of diagonal Gaussian distributions,

q(u) = (1/Q) ∑_{j=1}^Q N(u | µ_j, v_j I),    (4)

where u is a vector that concatenates all {w_ij} and {f̂_k}. The variational parameters {µ_j, v_j} and the hyper-parameters are jointly optimized by maximizing the variational ELBO.

3 Scalable Variational Inference

Despite the success of the existing GPRN inference methods, their performance can be limited by oversimplified posterior structures, and they can be computationally too costly for high-dimensional outputs. First, the fully factorized posterior (3) essentially assumes the projection weights and latent functions are mutually independent, as in their prior, and completely ignores their strong posterior dependency arising from the coupled data likelihoods. Although the Gaussian mixture approximation (4) alleviates this issue, the diagonal covariance of each component can still miss the abundant posterior correlations. Second, the alternating mean-field updates for (3) and the optimization for (4) in the nonparametric inference take O(NK²D) and O(QN²KD) time respectively in each iteration. When D is very large, say, millions, the computation can be prohibitively expensive even for moderate N or K, e.g., hundreds. To improve both the inference quality and the computational efficiency for massive outputs, we develop a scalable variational inference algorithm for GPRN, presented as follows.

3.1 Matrix and Tensor Normal Posteriors

First, to capture the posterior dependency between the latent functions, we use a matrix normal distribution as the joint variational posterior for the N × K function values F = [f̂_1, ..., f̂_K],

q(F) = MN(F | M, Σ, Ω) = N(vec(F) | vec(M), Σ ⊗ Ω),    (5)

where ⊗ is the Kronecker product, and Σ and Ω are the N × N row covariance and the K × K column covariance matrices, respectively. To ensure positive definiteness, we further parameterize Σ and Ω by their Cholesky decompositions, Σ = UU^⊤ and Ω = VV^⊤. The matrix Gaussian posterior not only captures the dependency of the function values, but also reduces the number of parameters and the computational cost. If we use a factorized posterior over each f̂_k, the number of parameters is NK + KN(N+1)/2, including the mean and the Cholesky factor of the covariance for each k, while our approximate posterior only needs NK + N(N+1)/2 + K(K+1)/2 parameters.

Next, we consider the variational posterior of the projection weights {w_ij}. To obtain a joint posterior with a compact parameterization, we tensorize the D-dimensional output space into an M-mode tensor space, d_1 × ... × d_M, where D = ∏_{m=1}^M d_m. For simplicity, we set d_1 = ... = d_M = d = D^{1/M}. Then we can organize all the weights into an N × K × d_1 × ... × d_M tensor W. To capture the posterior dependencies between all the weights, we introduce a tensor normal distribution, a straightforward extension of the matrix normal distribution, as the variational posterior for W,

q(W) = TN(W | U, Γ_1, ..., Γ_{M+2}) = N(vec(W) | vec(U), Γ_1 ⊗ ... ⊗ Γ_{M+2}),    (6)

where U is the mean tensor and {Γ_1, ..., Γ_{M+2}} are the covariance matrices of the individual modes: Γ_1 is N × N, Γ_2 is K × K, and Γ_3, ..., Γ_{M+2} are d × d. To ensure positive definiteness, we parameterize each covariance matrix by its Cholesky decomposition, Γ_m = L_m L_m^⊤.
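As a concrete illustration of the tensor-normal posterior (6), the sketch below (illustrative sizes and helper names are ours, not the paper's code) stores one Cholesky factor per mode and draws a sample of W by mode-wise multiplications, so the NKD × NKD covariance is never materialized:

```python
import numpy as np

def mode_product(T, A, mode):
    """Multiply tensor T by matrix A along the given mode (A applied to that axis)."""
    T = np.moveaxis(T, mode, 0)
    shape = T.shape
    out = A @ T.reshape(shape[0], -1)
    return np.moveaxis(out.reshape((A.shape[0],) + shape[1:]), 0, mode)

def sample_tensor_normal(U, chol_factors, rng):
    """Sample W ~ TN(U, Gamma_1, ..., Gamma_{M+2}) with Gamma_m = L_m L_m^T,
    using one mode-wise multiplication per factor instead of the full
    Kronecker-product covariance."""
    Z = rng.standard_normal(U.shape)
    for mode, L in enumerate(chol_factors):
        Z = mode_product(Z, L, mode)
    return U + Z

# Illustrative sizes only (not the paper's experiments): N=20, K=3, D = 8 x 8.
N, K, d, M = 20, 3, 8, 2
dims = (N, K) + (d,) * M
rng = np.random.default_rng(0)
chols = [np.linalg.cholesky(np.eye(t) + 0.1 * np.ones((t, t))) for t in dims]
W_sample = sample_tensor_normal(np.zeros(dims), chols, rng)

n_cov_params = sum(t * (t + 1) // 2 for t in dims)
print(W_sample.shape, n_cov_params)   # (20, 3, 8, 8), far fewer parameters than N*K*D
```

Because a Kronecker-structured Gaussian can be sampled and evaluated mode by mode, the small per-mode factors are all that ever needs to be stored or differentiated.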
Note that to represent the entire NKD × NKD covariance matrix of q(W), we only need N(N+1)/2 + K(K+1)/2 + Md(d+1)/2 parameters, which can be far less than NKD. Take D = 10⁶, N = 100 and K = 10 as an example: with M = 3, the number of covariance parameters is only about 2 × 10⁻⁵ of NKD. For the mean-field posterior in (3), the number of covariance parameters for all the projection weights is NKD even with the compact representation. Therefore, our tensor normal posterior not only preserves the strong correlations among all the weights, but also uses far fewer parameters and hence less computation. Finally, we choose our variational posterior as q(F, W) = q(F)q(W). While it still factorizes over F and W, the strong correlations among the many variables within F and within W are captured, and hence it still improves upon the fully factorized posterior.

3.2 Simplified Variational Evidence Lower Bound

Now, we derive the evidence lower bound (ELBO) with our proposed variational posterior,

L = E_q[ log ( p(Y, F, W | θ_f, θ_w, σ_f², σ_y²) / q(F, W) ) ].

We maximize the ELBO to jointly optimize the variational parameters and the hyper-parameters. To accelerate the computation, we further use the properties of the Kronecker product [Stegle et al., 2011] in our variational posteriors ((5) and (6)) to dispose of their full covariances and derive a much simplified bound,

L = −KL(q(W) ‖ p(W)) − KL(q(F) ‖ p(F)) + E_q[log p(Y | X, W, F)],    (7)

where

KL(q(W) ‖ p(W)) = (1/2) [ tr(K_w⁻¹ Γ_1) ∏_{m=2}^{M+2} tr(Γ_m) + tr(K_w⁻¹ U_1 U_1^⊤) + DK log|K_w| − ∑_{m=1}^{M+2} (NKD / t_m) log|Γ_m| − NKD ],

KL(q(F) ‖ p(F)) = (1/2) [ tr(K_f̂⁻¹ Σ) tr(Ω) + tr(K_f̂⁻¹ M M^⊤) + K log|K_f̂| − (K log|Σ| + N log|Ω|) − NK ],

E_q[log p(Y | X, W, F)] = −ND log σ_y − 1/(2σ_y²) ∑_{n=1}^N [ y_n^⊤ y_n − 2 y_n^⊤ E_q[W_n] E_q[h_n] + tr(E_q[W_n^⊤ W_n] E_q[h_n h_n^⊤]) ] + const.

Here U_1 is an N × DK matrix obtained by unfolding the mean tensor U at mode 1, t_m is the dimension of mode m of the tensor W, E_q[W_n] is obtained by taking the n-th slice of U at mode 1 and reorganizing it into a D × K matrix, E_q[h_n] is the n-th row vector of M, and the remaining moments are calculated by

E_q[W_n^⊤ W_n] = [Γ_1]_nn ( ∏_{m=3}^{M+2} tr(Γ_m) ) Γ_2 + E_q[W_n]^⊤ E_q[W_n],
E_q[h_n h_n^⊤] = [Σ]_nn Ω + E_q[h_n] E_q[h_n]^⊤.

As we can see, the computation of the ELBO (7) only involves the covariance matrices of each mode of W and F and the kernel matrices, {Γ_1, ..., Γ_{M+2}, Σ, Ω, K_f̂, K_w}, which are small even for a very large number of outputs. Hence the computation is much simplified. In addition, due to the compact parameterization, optimizing the ELBO is much easier. We then use gradient-based optimization methods to jointly estimate the parameters of the variational posteriors and the hyper-parameters {σ_f², σ_y², θ_f, θ_w}.
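The practical payoff of the Kronecker structure is that the KL terms above reduce to per-mode traces and log-determinants. Below is a minimal NumPy rendering of the W-term as reconstructed above (a hypothetical helper, not the authors' TensorFlow implementation):

```python
import numpy as np

def kl_tensor_normal_vs_gp_prior(U, chol_factors, K_w):
    """KL( TN(U, L_1 L_1^T, ..., L_{M+2} L_{M+2}^T) || prod_{ij} N(w_ij | 0, K_w) ),
    evaluated with per-mode traces and log-determinants only (cf. the W-term of (7))."""
    dims = U.shape                      # (N, K, d, ..., d)
    N = dims[0]
    total = int(np.prod(dims))          # N * K * D
    DK = total // N

    Kw_inv = np.linalg.inv(K_w)
    Gammas = [L @ L.T for L in chol_factors]

    trace_term = np.trace(Kw_inv @ Gammas[0]) * np.prod([np.trace(G) for G in Gammas[1:]])
    U1 = U.reshape(N, DK)               # mode-1 unfolding; column order does not affect U1 @ U1.T
    mean_term = np.trace(Kw_inv @ U1 @ U1.T)

    _, logdet_Kw = np.linalg.slogdet(K_w)
    logdet_q = sum((total // t) * 2.0 * np.log(np.abs(np.diag(L))).sum()
                   for t, L in zip(dims, chol_factors))
    return 0.5 * (trace_term + mean_term + DK * logdet_Kw - logdet_q - total)

# Tiny smoke test with illustrative sizes (N=6, K=2, d=3, M=2).
dims = (6, 2, 3, 3)
rng = np.random.default_rng(1)
chols = [np.linalg.cholesky(np.eye(t) + 0.05 * np.ones((t, t))) for t in dims]
K_w = np.eye(6) + 0.3 * np.ones((6, 6))
print(kl_tensor_normal_vs_gp_prior(rng.standard_normal(dims), chols, K_w))
```

A Cholesky solve could replace the explicit inverse of K_w; the point is only that no NKD × NKD matrix ever appears in the computation.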
3.3 Prediction

Given a new input x*, we aim to use the estimated variational posterior to predict the output y*. The posterior mean of y* is computed by E[y* | x*, D] = E[W(x*)] E[f̂(x*)], where f̂(x*) = [f̂_1(x*), ..., f̂_K(x*)]^⊤. Each [E[W(x*)]]_ij is computed by k_w*^⊤ K_w⁻¹ v_ij, where k_w* = [κ_w(x*, x_1), ..., κ_w(x*, x_N)]^⊤ and v_ij is obtained by first reorganizing U (the posterior mean of W) into an N × K × D tensor Û and then taking the fiber Û(:, i, j). Each E[f̂_k(x*)] = k_f̂*^⊤ K_f̂⁻¹ M(:, k), where k_f̂* = [κ_f̂(x*, x_1), ..., κ_f̂(x*, x_N)]^⊤. The predictive distribution, however, does not have a closed form, because the likelihood is non-Gaussian w.r.t. the projection weights and latent functions. To address this issue, we can use Monte-Carlo approximations. We can generate a set of i.i.d. posterior samples of W(x*) and f̂(x*), denoted by {W̃_t, f̃_t}_{t=1}^T, and then approximate p(y* | x*, D) ≈ (1/T) ∑_{t=1}^T N(y* | W̃_t f̃_t, σ_y² I). Note that when the output dimension is large, the posterior samples of W(x*) can be too costly to generate. We can instead approximate the predictive distribution of each single output y*_j with the same method.

3.4 Algorithm Complexity

The overall time complexity of our inference algorithm is O(N³ + K³ + Md² + NDK). When D ≫ N, K and Md² ≤ D, the complexity is O(NDK). It is trivial to show that when 3 ≤ M ≤ ∛D, we always have Md² ≤ D. The space complexity is O(NKD + ND + N² + K² + Md²), including the storage of the training data, the kernel matrices, the variational posterior parameters and the other parameters.

4 Related Work

Many multi-output regression approaches have been proposed and most of them are based on GPs; see the excellent review in [Alvarez et al., 2012]. A classical method is the linear model of coregionalization (LMC) [Goulard and Voltz, 1992], which projects a set of latent functions to the multi-output space; each latent function is sampled from an independent GP. PCA-GP [Higdon et al., 2008] is a popular LMC that identifies the projection matrix by Singular Value Decomposition (SVD). The variants of PCA-GP include KPCA-GP [Xing et al., 2016], IsoMap-GP [Xing et al., 2015], etc. GPRN [Wilson et al., 2012] is another instance of LMC. However, since the projection weights are also sampled from GPs, the prior of the outputs is no longer a GP. Other approaches include convolved GPs [Higdon, 2002; Boyle and Frean, 2005; Alvarez et al., 2019] and multi-task GPs [Bonilla et al., 2007; 2008; Rakitsch et al., 2013]. Convolved GPs generate each output by convolving a smoothing kernel with a set of latent functions. Multi-task GPs define a product kernel over the input features and task-dependent features (or free-form task correlation matrices). Despite their success, both types of models can be too costly (O((ND)³) or O(N³ + D³) time complexity) for high-dimensional outputs. To mitigate this issue, several sparse approximations have been developed [Alvarez and Lawrence, 2009; Alvarez et al., 2010]. Recently, Zhe et al. (2019) introduced latent coordinate features in the tensorized output space to model complex correlations and to predict tensorized outputs. Their method essentially constructs a product kernel with which to simplify the inference. By contrast, our work introduces tensor/matrix-normal variational posteriors to improve GPRN inference and does not assume any special kernel structure in GPRN.

5 Experiment

5.1 Predicting Small Numbers of Outputs

We first evaluated the proposed GPRN inference on five real-world datasets with relatively small numbers of outputs: (1) Jura¹, heavy-metal concentration measurements at 349 neighbouring locations in the Swiss Jura. Following [Wilson et al., 2012], we predicted 3 correlated concentrations, cadmium, nickel and zinc, given the locations of the measurements. (2) Equity² [Wilson et al., 2012], a financial dataset that includes 643 records of 5 equity indices: NASDAQ, FTSE, TSE, NIKKEI and DJC. The inputs are the 5 indices, and the goal is to predict their 25 pair-wise correlations. (3) PM2.5³, 100 spatial measurements (i.e., outputs) of particulate matter (PM2.5) pollution in Salt Lake City during July 4-7, 2018. The inputs are the time points of the measurements. (4) Cantilever [Andreassen et al., 2011], material structures with the maximum stiffness when bearing forces applied from the right side.
The input of each example is the force, and the output is a 3,200-dimensional vector that represents the stress field, which determines the optimal material layout in an 80 × 40 rectangular domain. (5) Gene Exp⁴, expressions of 4,511 genes (outputs) measured by different microarrays, each of which is described by a 10-dimensional input vector.

¹https://rdrr.io/cran/gstat/man/jura.html
²https://github.com/davidaknowles/gprn
³http://www.aqandu.org/
⁴https://www.synapse.org/#!Synapse:syn2787209/wiki/70350

Competing methods. We compared our inference algorithm, denoted by SGPRN, with the following approaches: (1) MFVB, mean-field variational Bayes inference for GPRN [Wilson et al., 2012], and (2) NPV, nonparametric variational inference for GPRN [Nguyen and Bonilla, 2013]. In addition, we compared with three other multi-output GP models that are scalable to high-dimensional outputs: (3) PCA-GP [Higdon et al., 2008], which uses PCA to find the projection matrix in the LMC framework [Goulard and Voltz, 1992] for multi-output regression, and (4) KPCA-GP [Xing et al., 2016] and (5) IsoMap-GP [Xing et al., 2015], which use KPCA [Schölkopf et al., 1998] and IsoMap [Balasubramanian and Schwartz, 2002], respectively, to identify the projection matrix. We also compared with (6) HOGP [Zhe et al., 2019], a high-order GP for regression that introduces latent coordinate features for tensorized output prediction (see Sec. 4 Related Work).

Parameter settings. We implemented our method SGPRN with TensorFlow [Abadi et al., 2016]. We used the Adam algorithm [Kingma and Ba, 2014] for gradient-based optimization and set the learning rate to 10⁻³. All the competing methods were implemented in MATLAB. For MFVB and NPV, we used the efficient implementation (https://github.com/trungngv/gprn) from [Nguyen and Bonilla, 2013]. We used 2 mixture components for the variational posterior in NPV (see (4)). The competing methods used L-BFGS for optimization and the number of iterations was set to 100. We used the RBF kernel for all the methods. The input features of all the datasets were normalized, and the kernel parameters (i.e., the length-scales) were initialized to 1. We varied the number of latent functions/features/bases over {2, 5, 15, 50}.

Comparison with state-of-the-art GPRN inference. We first compared with MFVB and NPV on the three smallest datasets, Jura, Equity and PM2.5, with 3, 25 and 100 outputs respectively. We tested SGPRN, MFVB and NPV on a workstation with 2 Intel(R) Xeon(R) E5-2697 CPUs, 28 cores and 196 GB memory. Note that MFVB and NPV are not feasible on the other datasets (i.e., Cantilever and Gene Exp); our tests show that they would take an extremely long time (weeks or months) to train with 15 and 50 latent functions. We followed the settings of the original papers [Wilson et al., 2012; Nguyen and Bonilla, 2013] and only used 2 latent functions. On Jura, we randomly split the data into 249 examples for training and 100 for test; on Equity, 200 for training and 200 for test; and on PM2.5, 256 for training and 32 for test.
Table 1: The mean absolute error (MAE) of the three GPRN inference methods (mean ± std, averaged over 5 runs).

Method   Jura              Equity                 PM2.5
SGPRN    0.5127 ± 0.002    2.5759e-05 ± 2e-07     1.07089 ± 0.02
MFVB     0.5237 ± 0.001    4.4530e-05 ± 6e-07     1.3916 ± 0.06
NPV      0.6088 ± 9e-06    4.7267e-05 ± 9e-18     14.7101 ± 0.14

In our algorithm, for Jura and Equity we used the original output space, and for PM2.5 we tensorized the output space to 10 × 10. We ran our algorithm for 2,000 epochs to ensure convergence; both MFVB and NPV converged after 100 iterations. We repeated the experiment 5 times and report the average mean absolute error (MAE) and the standard deviation in Table 1. As we can see, our method SGPRN significantly outperforms (p-value < 0.05) both MFVB and NPV on all three datasets, showing superior inference quality. Note that NPV obtains much bigger MAEs on PM2.5. This might be because the optimization of the variational ELBO converged to inferior local maxima.

Figure 1: Training speed of the three GPRN inference algorithms (per-iteration running time in seconds vs. the number of latent functions, for SGPRN, MFVB and NPV).

Comparison of running time. Next, we compared with MFVB and NPV in terms of speed. We report the per-iteration running time of MFVB and NPV in Fig. 1. Since our algorithm ran 2,000 epochs while MFVB and NPV ran 100 iterations, we used the average running time of 20 epochs of SGPRN as its per-iteration time for a fair comparison. As we can see, the speed of SGPRN is close to that of MFVB and NPV for a small number of latent functions (2 and 5). However, with more latent functions, SGPRN becomes much faster. For example, on PM2.5 with 50 latent functions, SGPRN gains 200x and 785x speed-ups compared with MFVB and NPV. This is consistent with their time complexities: MFVB O(NK²D), NPV O(QN²KD), SGPRN O(NKD).

Figure 2: The normalized root-mean-squared error (NRMSE) of all the methods: (a) Cantilever with 128 training samples, (b) Cantilever with 256 training samples, (c) Gene Exp with 128 training samples, (d) Gene Exp with 256 training samples, (e) Pressure with 64 training samples. The results were averaged over 5 runs. The x-axis represents the number of latent functions of SGPRN, features of HOGP, and bases of IsoMap-GP, KPCA-GP and PCA-GP.

Comparison with other methods. Next, we compared with the other scalable multi-output GP regression models. To this end, we used the Cantilever and Gene Exp datasets. For our algorithm and HOGP, we tensorized the output space of Cantilever to 80 × 40 and that of Gene Exp to 347 × 13. On each dataset, we randomly chose {128, 256} examples for training and 100 from the remaining set for test. We repeated this procedure 5 times and report the average normalized root-mean-square error (NRMSE) and the standard deviation of each method in Fig. 2. As we can see, SGPRN significantly outperforms the competing methods (p-value < 0.05) in all the cases except when training on Cantilever with 128 examples and 50 latent functions, where SGPRN was slightly worse than PCA-GP (Fig. 2a). This demonstrates the advantage of GPRN in predictive performance, which might be due to its capability of capturing non-stationary output dependencies.

5.2 Large-Scale Physical Simulations for Lid-Driven Cavity Flows

Finally, we applied SGPRN in a large-scale physical simulation application. Specifically, we trained GPRN to predict a one-million-dimensional pressure field for lid-driven cavity flows [Bozeman and Dalton, 1973], which include turbulent flow inside the cavity.
The simulation of the field is done by solving the Navier-Stokes equations, which are known to have complicated behaviours under large Reynolds numbers. We used a fine-grained mesh to ensure that the numerical solver converges. The input of each simulation example is a 5-dimensional vector that represents a specific boundary condition. The computation for each simulation is very expensive, so we only collected 96 examples. Hence, this is a typical large-D, small-N problem. We randomly split the dataset into 64 training and 32 test examples, and then ran SGPRN, PCA-GP, KPCA-GP and IsoMap-GP. For SGPRN, we tensorized the one million outputs into a 100 × 100 × 100 tensor. We varied the number of latent functions over {2, 5, 10}. We repeated the training and test procedure 5 times and show the average normalized root-mean-square error (NRMSE) of each method in Fig. 2(e). As we can see, our method significantly outperforms all the competing approaches, by a large margin especially when using 5 and 10 latent functions. The results further demonstrate the advantage of GPRN for large-scale multi-output regression tasks once scalable inference is available.

6 Conclusion

We have proposed a scalable variational inference algorithm for GPRN, a powerful Bayesian multi-output regression model. Our algorithm not only improves the inference quality over the existing GPRN inference methods but also is much more efficient and scalable to a large number of outputs. In future work, we will continue to explore GPRN in large-scale multi-output regression applications.

Acknowledgments

This work has been supported by DARPA TRADES Award HR0011-17-2-0016 and NSF IIS-1910983.

References

[Abadi et al., 2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.

[Alvarez and Lawrence, 2009] Mauricio Alvarez and Neil D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, pages 57–64, 2009.

[Alvarez et al., 2010] Mauricio Alvarez, David Luengo, Michalis Titsias, and Neil Lawrence. Efficient multi-output Gaussian processes through variational inducing kernels. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pages 25–32, 2010.

[Alvarez et al., 2012] Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012.

[Alvarez et al., 2019] Mauricio Alvarez, Wil Ward, and Cristian Guarnizo. Non-linear process convolutions for multi-output Gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1969–1977, 2019.

[Andreassen et al., 2011] Erik Andreassen, Anders Clausen, Mattias Schevenels, Boyan S Lazarov, and Ole Sigmund. Efficient topology optimization in MATLAB using 88 lines of code. Structural and Multidisciplinary Optimization, 43(1):1–16, 2011.

[Balasubramanian and Schwartz, 2002] Mukund Balasubramanian and Eric L Schwartz. The Isomap algorithm and topological stability. Science, 295(5552):7, 2002.

[Bonilla et al., 2007] Edwin V Bonilla, Felix V Agakov, and Christopher KI Williams.
Kernel multi-task learning using task-specific features. In Artificial Intelligence and Statistics, pages 43–50, 2007.

[Bonilla et al., 2008] Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems, pages 153–160, 2008.

[Boyle and Frean, 2005] Phillip Boyle and Marcus Frean. Dependent Gaussian processes. In Advances in Neural Information Processing Systems, pages 217–224, 2005.

[Bozeman and Dalton, 1973] James D Bozeman and Charles Dalton. Numerical study of viscous flow in a cavity. Journal of Computational Physics, 12(3):348–363, 1973.

[Goulard and Voltz, 1992] Michel Goulard and Marc Voltz. Linear coregionalization model: tools for estimation and choice of cross-variogram matrix. Mathematical Geology, 24(3):269–286, 1992.

[Higdon et al., 2008] Dave Higdon, James Gattiker, Brian Williams, and Maria Rightley. Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103(482):570–583, 2008.

[Higdon, 2002] Dave Higdon. Space and space-time modeling using process convolutions. In Quantitative Methods for Current Environmental Issues, pages 37–56. Springer, 2002.

[Kingma and Ba, 2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Nguyen and Bonilla, 2013] Trung Nguyen and Edwin Bonilla. Efficient variational inference for Gaussian process regression networks. In Artificial Intelligence and Statistics, pages 472–480, 2013.

[Rakitsch et al., 2013] Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, and Oliver Stegle. It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals. In Advances in Neural Information Processing Systems, pages 1466–1474, 2013.

[Schölkopf et al., 1998] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[Stegle et al., 2011] Oliver Stegle, Christoph Lippert, Joris M Mooij, Neil D Lawrence, and Karsten Borgwardt. Efficient inference in matrix-variate Gaussian models with iid observation noise. In Advances in Neural Information Processing Systems, pages 630–638, 2011.

[Wainwright et al., 2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[Williams and Rasmussen, 2006] Christopher KI Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.

[Wilson et al., 2012] Andrew Gordon Wilson, David A Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In Proceedings of the 29th International Conference on Machine Learning, pages 1139–1146. Omnipress, 2012.

[Xing et al., 2015] Wei Xing, Akeel A Shah, and Prasanth B Nair. Reduced dimensional Gaussian process emulators of parametrized partial differential equations based on Isomap. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2174):20140697, 2015.

[Xing et al., 2016] WW Xing, V Triantafyllidis, AA Shah, PB Nair, and Nicholas Zabaras. Manifold learning for the emulation of spatial fields from computational models. Journal of Computational Physics, 326:666–690, 2016.

[Zhe et al., 2019] Shandian Zhe, Wei Xing, and Robert M Kirby. Scalable high-order Gaussian process regression.
In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2611–2620, 2019.