# Heterogeneous Multi-output Gaussian Process Prediction

Pablo Moreno-Muñoz¹, Antonio Artés-Rodríguez¹, Mauricio A. Álvarez²
¹Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain
²Dept. of Computer Science, University of Sheffield, UK
{pmoreno,antonio}@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk

Abstract

We present a novel extension of multi-output Gaussian processes for handling heterogeneous outputs. We assume that each output has its own likelihood function and use a vector-valued Gaussian process prior to jointly model the parameters in all likelihoods as latent functions. Our multi-output Gaussian process uses a covariance function with a linear model of coregionalisation form. Assuming conditional independence across the underlying latent functions together with an inducing variable framework, we are able to obtain tractable variational bounds amenable to stochastic variational inference. We illustrate the performance of the model on synthetic data and two real datasets: a human behavioral study and a demographic high-dimensional dataset.

1 Introduction

Multi-output Gaussian processes (MOGP) generalise the powerful Gaussian process (GP) predictive model to the vector-valued random field setup (Alvarez et al., 2012). It has been experimentally shown that by simultaneously exploiting correlations between multiple outputs and across the input space, it is possible to provide better predictions, particularly in scenarios with missing or noisy data (Bonilla et al., 2008; Dai et al., 2017).

The main focus in the literature for MOGP has been on the definition of a suitable cross-covariance function between the multiple outputs that allows for the treatment of the outputs as a single GP with a properly defined covariance function (Alvarez et al., 2012). The two classical alternatives to define such cross-covariance functions are the linear model of coregionalisation (LMC) (Journel and Huijbregts, 1978) and process convolutions (Higdon, 2002). In the former case, each output corresponds to a weighted sum of shared latent random functions. In the latter, each output is modelled as the convolution integral between a smoothing kernel and a latent random function common to all outputs. In both cases, the unknown latent functions follow Gaussian process priors, leading to straightforward expressions for the cross-covariance functions among different outputs. More recent alternatives to build valid covariance functions for MOGP include the work by Ulrich et al. (2015) and Parra and Tobar (2017), which build the cross-covariances in the spectral domain.

Regarding the type of outputs that can be modelled, most alternatives focus on multiple-output regression for continuous variables. Traditionally, each output is assumed to follow a Gaussian likelihood where the mean function is given by one of the outputs of the MOGP and the variance in that distribution is treated as an unknown parameter. Bayesian inference is tractable for these models. In this paper, we are interested in the heterogeneous case for which the outputs are a mix of continuous, categorical, binary or discrete variables with different likelihood functions. There have been few attempts to extend the MOGP to other types of likelihoods.
For example, Skolidis and Sanguinetti (2011) use the outputs of a MOGP for jointly modelling several binary classification problems, each of which uses a probit likelihood. They use an intrinsic coregionalisation model (ICM), a particular case of the LMC. Posterior inference is performed using expectation propagation (EP) and variational mean field. Both Chai (2012) and Dezfouli and Bonilla (2015) have also used the ICM for modelling a single categorical variable with a multinomial logistic likelihood. The outputs of the ICM model are used as replacements for the linear predictors in the softmax function. Chai (2012) derives a particular variational bound for the marginal likelihood and computes Gaussian posterior distributions, while Dezfouli and Bonilla (2015) introduce a scalable inference procedure that uses a mixture of Gaussians to approximate the posterior distribution using automated variational inference (AVI) (Nguyen and Bonilla, 2014a), which requires sampling from univariate Gaussians.

For the single-output GP case, the usual practice for handling non-Gaussian likelihoods has been to replace the parameters or linear predictors of the non-Gaussian likelihood by one or more independent GP priors. Since computing posterior distributions becomes intractable, different alternatives have been offered for approximate inference. Examples include the Gaussian heteroscedastic regression model with variational inference (Lázaro-Gredilla and Titsias, 2011), the Laplace approximation (Vanhatalo et al., 2013), and stochastic variational inference (SVI) (Saul et al., 2016). This last reference uses the same idea for modulating the parameters of a Student-t likelihood, a log-logistic distribution, a beta distribution and a Poisson distribution. The generalised Wishart process (Wilson and Ghahramani, 2011) is another example, where the entries of the scale matrix of a Wishart distribution are modulated by independent GPs.

Our main contribution in this paper is to provide an extension of multiple-output Gaussian processes for prediction in heterogeneous datasets. The key principle in our model is to use the outputs of a MOGP as the latent functions that modulate the parameters of several likelihood functions, one likelihood function per output. We tackle the model's intractability using variational inference. Furthermore, we use the inducing variable formalism for MOGP introduced by Alvarez and Lawrence (2009) and compute a variational bound suitable for stochastic optimisation, as in Hensman et al. (2013). We experimentally provide evidence of the benefits of simultaneously modelling heterogeneous outputs in different applied problems. Our model can be seen as a generalisation of Saul et al. (2016) to multiple correlated output functions of a heterogeneous nature. Our Python implementation follows the spirit of Hadfield et al. (2010), where the user only needs to specify a list of likelihood functions, likelihood_list = [Bernoulli(), Poisson(), HetGaussian()], where HetGaussian refers to the heteroscedastic Gaussian distribution, and the number of latent parameter functions per likelihood is assigned automatically.
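As a rough illustration of this bookkeeping (and not the interface of the released package), the sketch below counts how many latent parameter functions a given list of likelihoods requires; the class names mirror the likelihood_list example above but are treated here as plain labels.

```python
# Illustrative bookkeeping only: given a list of likelihoods, count how many
# latent parameter functions (LPFs) the multi-output prior must provide.
# The names mirror the likelihood_list example in the text; the constructor
# signatures of the actual released code may differ.

LATENT_FUNCTIONS_PER_LIKELIHOOD = {
    "Bernoulli": 1,      # probability of success
    "Poisson": 1,        # rate
    "HetGaussian": 2,    # mean and input-dependent variance
}

likelihood_list = ["Bernoulli", "Poisson", "HetGaussian"]

# J = sum_d J_d latent parameter functions in total
J = sum(LATENT_FUNCTIONS_PER_LIKELIHOOD[name] for name in likelihood_list)
print(f"D = {len(likelihood_list)} outputs require J = {J} latent parameter functions")
```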
2 Heterogeneous Multi-output Gaussian process

Consider a set of output functions $\mathcal{Y} = \{y_d(\mathbf{x})\}_{d=1}^D$, with $\mathbf{x} \in \mathbb{R}^p$, that we want to jointly model using Gaussian processes. Traditionally, the literature has considered the case for which each $y_d(\mathbf{x})$ is continuous and Gaussian distributed. In this paper, we are interested in the heterogeneous case for which the outputs in $\mathcal{Y}$ are a mix of continuous, categorical, binary or discrete variables with several different distributions. In particular, we will assume that the distribution over $y_d(\mathbf{x})$ is completely specified by a set of parameters $\boldsymbol{\theta}_d(\mathbf{x}) \in \mathcal{X}^{J_d}$, where $\mathcal{X}$ is a generic domain for the parameters and $J_d$ is the number of parameters that define the distribution. Each parameter $\theta_{d,j}(\mathbf{x}) \in \boldsymbol{\theta}_d(\mathbf{x})$ is a non-linear transformation of a Gaussian process prior $f_{d,j}(\mathbf{x})$, that is, $\theta_{d,j}(\mathbf{x}) = g_{d,j}(f_{d,j}(\mathbf{x}))$, where $g_{d,j}(\cdot)$ is a deterministic function that maps the GP output to the appropriate domain for the parameter $\theta_{d,j}$.

To make the notation concrete, let us assume a heterogeneous multiple-output problem for which $D = 3$. Assume that output $y_1(\mathbf{x})$ is binary and that it will be modelled using a Bernoulli distribution. The Bernoulli distribution uses a single parameter (the probability of success), $J_1 = 1$, restricted to values in the range $[0, 1]$. This means that $\theta_1(\mathbf{x}) = \theta_{1,1}(\mathbf{x}) = g_{1,1}(f_{1,1}(\mathbf{x}))$, where $g_{1,1}(\cdot)$ could be modelled using the logistic sigmoid function $\sigma(z) = 1/(1 + \exp(-z))$, which maps $\sigma: \mathbb{R} \rightarrow [0, 1]$. Assume now that the second output $y_2(\mathbf{x})$ corresponds to a count variable that can take values $y_2(\mathbf{x}) \in \mathbb{N} \cup \{0\}$. The count variable can be modelled using a Poisson distribution with a single parameter (the rate), $J_2 = 1$, restricted to the positive reals. This means that $\theta_2(\mathbf{x}) = \theta_{2,1}(\mathbf{x}) = g_{2,1}(f_{2,1}(\mathbf{x}))$, where $g_{2,1}(\cdot)$ could be modelled as an exponential function $g_{2,1}(\cdot) = \exp(\cdot)$ to ensure strictly positive values for the parameter. Finally, $y_3(\mathbf{x})$ is a continuous variable with heteroscedastic noise. It can be modelled using a Gaussian distribution where both the mean and the variance are functions of $\mathbf{x}$. This means that $\boldsymbol{\theta}_3(\mathbf{x}) = [\theta_{3,1}(\mathbf{x}), \theta_{3,2}(\mathbf{x})]^\top = [g_{3,1}(f_{3,1}(\mathbf{x})), g_{3,2}(f_{3,2}(\mathbf{x}))]^\top$, where the first function is used to model the mean of the Gaussian and the second function is used to model the variance. Therefore, we can assume that $g_{3,1}(\cdot)$ is the identity function and that $g_{3,2}(\cdot)$ is a function that ensures strictly positive values for the variance, e.g. the exponential function.

Let us define the vector-valued function $\mathbf{y}(\mathbf{x}) = [y_1(\mathbf{x}), y_2(\mathbf{x}), \ldots, y_D(\mathbf{x})]^\top$. We assume that the outputs are conditionally independent given the vector of parameters $\boldsymbol{\theta}(\mathbf{x}) = [\boldsymbol{\theta}_1^\top(\mathbf{x}), \boldsymbol{\theta}_2^\top(\mathbf{x}), \ldots, \boldsymbol{\theta}_D^\top(\mathbf{x})]^\top$, defined by specifying the vector of latent functions $\mathbf{f}(\mathbf{x}) = [f_{1,1}(\mathbf{x}), f_{1,2}(\mathbf{x}), \ldots, f_{1,J_1}(\mathbf{x}), f_{2,1}(\mathbf{x}), f_{2,2}(\mathbf{x}), \ldots, f_{D,J_D}(\mathbf{x})]^\top \in \mathbb{R}^{J \times 1}$, where $J = \sum_{d=1}^{D} J_d$:

$$p(\mathbf{y}(\mathbf{x})\,|\,\boldsymbol{\theta}(\mathbf{x})) = p(\mathbf{y}(\mathbf{x})\,|\,\mathbf{f}(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x})\,|\,\boldsymbol{\theta}_d(\mathbf{x})) = \prod_{d=1}^{D} p(y_d(\mathbf{x})\,|\,\widetilde{\mathbf{f}}_d(\mathbf{x})), \qquad (1)$$

where we have defined $\widetilde{\mathbf{f}}_d(\mathbf{x}) = [f_{d,1}(\mathbf{x}), \ldots, f_{d,J_d}(\mathbf{x})]^\top \in \mathbb{R}^{J_d \times 1}$, the set of latent functions that specify the parameters in $\boldsymbol{\theta}_d(\mathbf{x})$. Notice that $J \geq D$, that is, there is not always a one-to-one map from $\mathbf{f}(\mathbf{x})$ to $\mathbf{y}(\mathbf{x})$. Most previous work has assumed that $D = 1$ and that the corresponding elements in $\boldsymbol{\theta}_1(\mathbf{x})$, that is, the latent functions in $\widetilde{\mathbf{f}}_1(\mathbf{x}) = [f_{1,1}(\mathbf{x}), \ldots, f_{1,J_1}(\mathbf{x})]^\top$, are drawn from independent Gaussian processes. Important exceptions are Chai (2012) and Dezfouli and Bonilla (2015), who assumed a categorical variable $y_1(\mathbf{x})$ where the elements in $\widetilde{\mathbf{f}}_1(\mathbf{x})$ were drawn from an intrinsic coregionalisation model. In what follows, we generalise these models for $D > 1$ and potentially heterogeneous outputs $y_d(\mathbf{x})$. We will use the word output to refer to the elements $y_d(\mathbf{x})$ and latent parameter function (LPF) or parameter function (PF) to refer to $f_{d,j}(\mathbf{x})$.
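To make the mapping $\theta_{d,j}(\mathbf{x}) = g_{d,j}(f_{d,j}(\mathbf{x}))$ concrete, the following sketch applies the link functions of the running $D = 3$ example to hypothetical draws of the latent parameter functions; the random draws below simply stand in for actual GP samples.

```python
import numpy as np

# A minimal sketch (not the released implementation) of the deterministic link
# functions g_{d,j} in the D = 3 example: each latent parameter function
# f_{d,j}(x) is mapped to the natural domain of its likelihood parameter.

def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))   # g_{1,1}: R -> [0, 1], Bernoulli probability

def exp_link(f):
    return np.exp(f)                  # g_{2,1}, g_{3,2}: R -> R_+, Poisson rate / Gaussian variance

def identity(f):
    return f                          # g_{3,1}: mean of the heteroscedastic Gaussian

# Stand-ins for samples of the latent parameter functions at five inputs.
rng = np.random.default_rng(0)
f = {(1, 1): rng.normal(size=5), (2, 1): rng.normal(size=5),
     (3, 1): rng.normal(size=5), (3, 2): rng.normal(size=5)}

theta = {
    "bernoulli_prob": sigmoid(f[(1, 1)]),
    "poisson_rate": exp_link(f[(2, 1)]),
    "gaussian_mean": identity(f[(3, 1)]),
    "gaussian_var": exp_link(f[(3, 2)]),
}
```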
2.1 A multi-parameter GP prior

Our main departure from previous work is in the modelling of $\mathbf{f}(\mathbf{x})$ using a multi-parameter Gaussian process that allows correlations between the parameter functions $f_{d,j}(\mathbf{x})$. We will use a linear model of coregionalisation type of covariance function to express correlations between functions $f_{d,j}(\mathbf{x})$ and $f_{d',j'}(\mathbf{x}')$. The particular construction is as follows. Consider an additional set of independent latent functions $\mathcal{U} = \{u_q(\mathbf{x})\}_{q=1}^Q$ that will be linearly combined to produce the $J$ LPFs $\{f_{d,j}(\mathbf{x})\}_{j=1,d=1}^{J_d,D}$. Each latent function $u_q(\mathbf{x})$ is assumed to be drawn from an independent GP prior such that $u_q(\cdot) \sim \mathcal{GP}(0, k_q(\cdot,\cdot))$, where $k_q$ can be any valid covariance function, and the zero mean is assumed for simplicity. Each latent parameter function $f_{d,j}(\mathbf{x})$ is then given as

$$f_{d,j}(\mathbf{x}) = \sum_{q=1}^{Q} \sum_{i=1}^{R_q} a^{i}_{d,j,q}\, u^{i}_q(\mathbf{x}), \qquad (2)$$

where the $u^{i}_q(\mathbf{x})$ are IID samples from $u_q(\cdot) \sim \mathcal{GP}(0, k_q(\cdot,\cdot))$ and $a^{i}_{d,j,q} \in \mathbb{R}$. The mean function for $f_{d,j}(\mathbf{x})$ is zero and the cross-covariance function $k_{f_{d,j}f_{d',j'}}(\mathbf{x}, \mathbf{x}') = \mathrm{cov}[f_{d,j}(\mathbf{x}), f_{d',j'}(\mathbf{x}')]$ is equal to $\sum_{q=1}^{Q} b^{q}_{(d,j),(d',j')} k_q(\mathbf{x}, \mathbf{x}')$, where $b^{q}_{(d,j),(d',j')} = \sum_{i=1}^{R_q} a^{i}_{d,j,q} a^{i}_{d',j',q}$.

Let us define $\mathbf{X} = \{\mathbf{x}_n\}_{n=1}^{N} \in \mathbb{R}^{N \times p}$ as the set of common input vectors for all outputs $y_d(\mathbf{x})$, although the presentation could be extended to the case of a different set of inputs per output. Let us also define $\mathbf{f}_{d,j} = [f_{d,j}(\mathbf{x}_1), \ldots, f_{d,j}(\mathbf{x}_N)]^\top \in \mathbb{R}^{N \times 1}$, $\widetilde{\mathbf{f}}_d = [\mathbf{f}^\top_{d,1}, \ldots, \mathbf{f}^\top_{d,J_d}]^\top \in \mathbb{R}^{J_d N \times 1}$, and $\mathbf{f} = [\widetilde{\mathbf{f}}^\top_1, \ldots, \widetilde{\mathbf{f}}^\top_D]^\top \in \mathbb{R}^{JN \times 1}$. The generative model for the heterogeneous MOGP is as follows. We sample $\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \mathbf{K})$, where $\mathbf{K}$ is a block-wise matrix with blocks given by $\{\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d',j'}}\}_{d=1,d'=1,j=1,j'=1}^{D,D,J_d,J_{d'}}$. In turn, the elements of $\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d',j'}}$ are given by $k_{f_{d,j}f_{d',j'}}(\mathbf{x}_n, \mathbf{x}_m)$, with $\mathbf{x}_n, \mathbf{x}_m \in \mathbf{X}$. For the particular case of equal inputs $\mathbf{X}$ for all LPFs, $\mathbf{K}$ can also be expressed as a sum of Kronecker products $\mathbf{K} = \sum_{q=1}^{Q} \mathbf{A}_q \mathbf{A}_q^\top \otimes \mathbf{K}_q = \sum_{q=1}^{Q} \mathbf{B}_q \otimes \mathbf{K}_q$, where $\mathbf{A}_q \in \mathbb{R}^{J \times R_q}$ has entries $\{a^{i}_{d,j,q}\}_{d=1,j=1,i=1}^{D,J_d,R_q}$ and $\mathbf{B}_q$ has entries $\{b^{q}_{(d,j),(d',j')}\}_{d=1,d'=1,j=1,j'=1}^{D,D,J_d,J_{d'}}$. The matrix $\mathbf{K}_q \in \mathbb{R}^{N \times N}$ has entries given by $k_q(\mathbf{x}_n, \mathbf{x}_m)$ for $\mathbf{x}_n, \mathbf{x}_m \in \mathbf{X}$. The matrices $\mathbf{B}_q \in \mathbb{R}^{J \times J}$ are known as the coregionalisation matrices. Once we obtain the sample for $\mathbf{f}$, we evaluate the vector of parameters $\boldsymbol{\theta} = [\boldsymbol{\theta}^\top_1, \ldots, \boldsymbol{\theta}^\top_D]^\top$, where $\boldsymbol{\theta}_d$ is obtained by applying the deterministic functions $g_{d,j}(\cdot)$ to the elements of $\widetilde{\mathbf{f}}_d$. Having specified $\boldsymbol{\theta}$, we can generate samples for the output vector $\mathbf{y} = [\mathbf{y}^\top_1, \ldots, \mathbf{y}^\top_D]^\top \in \mathcal{X}^{DN \times 1}$, where the elements in $\mathbf{y}_d$ are obtained by sampling from the conditional distributions $p(y_d(\mathbf{x})|\boldsymbol{\theta}_d(\mathbf{x}))$. To keep the notation uncluttered, we will assume from now on that $R_q = 1$, meaning that $\mathbf{A}_q = \mathbf{a}_q \in \mathbb{R}^{J \times 1}$ and the coregionalisation matrices are rank one. In the literature such a model is known as the semiparametric latent factor model (Teh et al., 2005).
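The construction of the prior covariance $\mathbf{K} = \sum_{q} \mathbf{B}_q \otimes \mathbf{K}_q$ with rank-one coregionalisation matrices can be sketched in a few lines of NumPy. The kernel, its hyperparameters and the problem sizes below are illustrative choices, not those of the released implementation.

```python
import numpy as np

# Sketch of the LMC-type prior covariance over the stacked LPF vector f:
# K = sum_q B_q kron K_q, with rank-one B_q = a_q a_q^T (R_q = 1, the
# semiparametric latent factor model case).

def rbf(X, X2, lengthscale=0.2, variance=1.0):
    d2 = np.sum((X[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(1)
N, J, Q = 30, 4, 3                        # N inputs, J = sum_d J_d LPFs, Q latent functions
X = rng.uniform(0, 1, size=(N, 1))

K = np.zeros((J * N, J * N))
for q in range(Q):
    a_q = rng.normal(size=(J, 1))         # single column of A_q
    B_q = a_q @ a_q.T                     # rank-one coregionalisation matrix
    K += np.kron(B_q, rbf(X, X))          # B_q kron K_q

# One sample from the joint prior over all LPFs evaluated at X.
f = rng.multivariate_normal(np.zeros(J * N), K + 1e-6 * np.eye(J * N))
```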
2.2 Scalable variational inference

Given a heterogeneous dataset $\mathcal{D} = \{\mathbf{X}, \mathbf{y}\}$, we would like to compute the posterior distribution $p(\mathbf{f}|\mathcal{D})$, which is intractable in our model. In what follows, we use similar ideas to Alvarez and Lawrence (2009) and Álvarez et al. (2010), which introduce the inducing variable formalism for computational efficiency in MOGP. However, instead of marginalising the latent functions $\mathcal{U}$ to obtain a variational lower bound, we keep them explicitly in a way that allows us to apply stochastic variational inference as in Hensman et al. (2013) and Saul et al. (2016).

2.2.1 Inducing variables for MOGP

A key idea to reduce computational complexity in Gaussian process models is to introduce auxiliary variables, or inducing variables. These variables have already been used in the context of MOGP (Alvarez and Lawrence, 2009; Álvarez et al., 2010). A subtle difference from the single-output case is that the inducing variables are not taken from the same latent process, say $f_1(\mathbf{x})$, but from the latent processes $\mathcal{U}$ also used to build the model for the multiple outputs. We will follow the same formalism here. We start by defining the set of $M$ inducing variables per latent function $u_q(\mathbf{x})$ as $\mathbf{u}_q = [u_q(\mathbf{z}_1), \ldots, u_q(\mathbf{z}_M)]^\top$, evaluated at a set of inducing inputs $\mathbf{Z} = \{\mathbf{z}_m\}_{m=1}^M \in \mathbb{R}^{M \times p}$. We also define $\mathbf{u} = [\mathbf{u}^\top_1, \ldots, \mathbf{u}^\top_Q]^\top \in \mathbb{R}^{QM \times 1}$. For simplicity of exposition, we have assumed that all the inducing variables, for all $q$, are evaluated at the same set of inputs $\mathbf{Z}$.

Instead of marginalising $\{u_q(\mathbf{x})\}_{q=1}^Q$ from the model in (2), we explicitly use the joint Gaussian prior $p(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u}) p(\mathbf{u})$. Due to the assumed independence between the latent functions $u_q(\mathbf{x})$, the distribution $p(\mathbf{u})$ factorises as $p(\mathbf{u}) = \prod_{q=1}^{Q} p(\mathbf{u}_q)$, with $\mathbf{u}_q \sim \mathcal{N}(\mathbf{0}, \mathbf{K}_q)$, where $\mathbf{K}_q \in \mathbb{R}^{M \times M}$ has entries $k_q(\mathbf{z}_i, \mathbf{z}_j)$ with $\mathbf{z}_i, \mathbf{z}_j \in \mathbf{Z}$. Notice that the dimensions of this $\mathbf{K}_q$ differ from those of $\mathbf{K}_q$ in Section 2.1. The LPFs $\mathbf{f}_{d,j}$ are conditionally independent given $\mathbf{u}$, so we can write the conditional distribution $p(\mathbf{f}|\mathbf{u})$ as

$$p(\mathbf{f}|\mathbf{u}) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} p(\mathbf{f}_{d,j}|\mathbf{u}) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} \mathcal{N}\!\left(\mathbf{f}_{d,j}\,\big|\,\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}^{-1}_{\mathbf{u}\mathbf{u}}\mathbf{u},\; \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d,j}} - \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}^{-1}_{\mathbf{u}\mathbf{u}}\mathbf{K}^\top_{\mathbf{f}_{d,j}\mathbf{u}}\right),$$

where $\mathbf{K}_{\mathbf{u}\mathbf{u}} \in \mathbb{R}^{QM \times QM}$ is a block-diagonal matrix with blocks given by $\mathbf{K}_q$, and $\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}} \in \mathbb{R}^{N \times QM}$ is the cross-covariance matrix computed from the cross-covariances between $f_{d,j}(\mathbf{x})$ and $u_q(\mathbf{z})$. The expression for this cross-covariance function can be obtained from (2), leading to $k_{f_{d,j}u_q}(\mathbf{x}, \mathbf{z}) = a_{d,j,q} k_q(\mathbf{x}, \mathbf{z})$. This form for the cross-covariance between the LPF $f_{d,j}(\mathbf{x})$ and $u_q(\mathbf{z})$ is a key difference between the inducing variable methods for the single-output GP case and the MOGP case.

2.2.2 Variational Bounds

Exact posterior inference is intractable in our model due to the presence of an arbitrary number of non-Gaussian likelihoods. We use variational inference to compute a lower bound $\mathcal{L}$ on the marginal log-likelihood $\log p(\mathbf{y})$ and to approximate the posterior distribution $p(\mathbf{f}, \mathbf{u}|\mathcal{D})$. Following Álvarez et al. (2010), the posterior of the LPFs $\mathbf{f}$ and the latent functions $\mathbf{u}$ can be approximated as

$$p(\mathbf{f}, \mathbf{u}|\mathbf{y}, \mathbf{X}) \approx q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f}|\mathbf{u})\,q(\mathbf{u}) = \left[\prod_{d=1}^{D}\prod_{j=1}^{J_d} p(\mathbf{f}_{d,j}|\mathbf{u})\right]\left[\prod_{q=1}^{Q} q(\mathbf{u}_q)\right],$$

where $q(\mathbf{u}_q) = \mathcal{N}(\mathbf{u}_q|\boldsymbol{\mu}_{\mathbf{u}_q}, \mathbf{S}_{\mathbf{u}_q})$ are Gaussian variational distributions whose parameters $\{\boldsymbol{\mu}_{\mathbf{u}_q}, \mathbf{S}_{\mathbf{u}_q}\}_{q=1}^{Q}$ must be optimised. Building on previous work by Saul et al. (2016) and Hensman et al. (2015), we derive a lower bound that accepts any log-likelihood function that can be modulated by the LPFs $\mathbf{f}$. The lower bound $\mathcal{L}$ on $\log p(\mathbf{y})$ is obtained as follows:

$$\log p(\mathbf{y}) = \log \int p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{u})\,p(\mathbf{u})\,d\mathbf{f}\,d\mathbf{u} \geq \int q(\mathbf{f}, \mathbf{u}) \log \frac{p(\mathbf{y}|\mathbf{f})\,p(\mathbf{f}|\mathbf{u})\,p(\mathbf{u})}{q(\mathbf{f}, \mathbf{u})}\,d\mathbf{f}\,d\mathbf{u} = \mathcal{L}.$$

We can further simplify $\mathcal{L}$ to obtain

$$\mathcal{L} = \int\!\!\int \left[\prod_{d=1}^{D}\prod_{j=1}^{J_d} p(\mathbf{f}_{d,j}|\mathbf{u})\right] q(\mathbf{u}) \log p(\mathbf{y}|\mathbf{f})\,d\mathbf{u}\,d\mathbf{f} - \sum_{q=1}^{Q} \mathrm{KL}\!\left(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\right),$$

where KL is the Kullback-Leibler divergence. Moreover, the approximate marginal posterior for $\mathbf{f}_{d,j}$ is $q(\mathbf{f}_{d,j}) = \int p(\mathbf{f}_{d,j}|\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$, leading to

$$q(\mathbf{f}_{d,j}) = \mathcal{N}\!\left(\mathbf{f}_{d,j}\,\big|\,\mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}^{-1}_{\mathbf{u}\mathbf{u}}\boldsymbol{\mu}_{\mathbf{u}},\; \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{f}_{d,j}} + \mathbf{K}_{\mathbf{f}_{d,j}\mathbf{u}}\mathbf{K}^{-1}_{\mathbf{u}\mathbf{u}}(\mathbf{S}_{\mathbf{u}} - \mathbf{K}_{\mathbf{u}\mathbf{u}})\mathbf{K}^{-1}_{\mathbf{u}\mathbf{u}}\mathbf{K}^\top_{\mathbf{f}_{d,j}\mathbf{u}}\right),$$

where $\boldsymbol{\mu}_{\mathbf{u}} = [\boldsymbol{\mu}^\top_{\mathbf{u}_1}, \ldots, \boldsymbol{\mu}^\top_{\mathbf{u}_Q}]^\top$ and $\mathbf{S}_{\mathbf{u}}$ is a block-diagonal matrix with blocks given by $\mathbf{S}_{\mathbf{u}_q}$. The expression for $\log p(\mathbf{y}|\mathbf{f})$ factorises, according to (1), as $\log p(\mathbf{y}|\mathbf{f}) = \sum_{d=1}^{D} \log p(\mathbf{y}_d|\widetilde{\mathbf{f}}_d) = \sum_{d=1}^{D} \log p(\mathbf{y}_d|\mathbf{f}_{d,1}, \ldots, \mathbf{f}_{d,J_d})$. Using this expression for $\log p(\mathbf{y}|\mathbf{f})$ leads to the following expression for the bound:

$$\mathcal{L} = \sum_{d=1}^{D} \mathbb{E}_{q(\mathbf{f}_{d,1})\cdots q(\mathbf{f}_{d,J_d})}\!\left[\log p(\mathbf{y}_d|\mathbf{f}_{d,1}, \ldots, \mathbf{f}_{d,J_d})\right] - \sum_{q=1}^{Q} \mathrm{KL}\!\left(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\right).$$

When $D = 1$ in the expression above, we recover the bound obtained in Saul et al. (2016). To maximise this lower bound, we need to find the optimal variational parameters $\{\boldsymbol{\mu}_{\mathbf{u}_q}\}_{q=1}^{Q}$ and $\{\mathbf{S}_{\mathbf{u}_q}\}_{q=1}^{Q}$.
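A minimal sketch of the marginal posterior computation for $q(\mathbf{f}_{d,j})$ given above is shown next. The covariance blocks are stand-ins generated from a random joint covariance rather than actual kernel evaluations, and a practical implementation would use Cholesky factorisations instead of explicit inverses.

```python
import numpy as np

# Sketch: mean and covariance of q(f_{d,j}) given the variational distribution
# q(u) = N(mu_u, S_u), following the expression in the text.

def q_f_marginal(Kff, Kfu, Kuu, mu_u, S_u, jitter=1e-6):
    Kuu_inv = np.linalg.inv(Kuu + jitter * np.eye(Kuu.shape[0]))
    A = Kfu @ Kuu_inv                         # K_{f,u} K_{uu}^{-1}
    mean = A @ mu_u
    cov = Kff + A @ (S_u - Kuu) @ A.T
    return mean, cov

# Stand-in covariance blocks taken from a consistent joint covariance over [f; u].
rng = np.random.default_rng(2)
N, M = 10, 6                                  # M plays the role of QM here
G = rng.normal(size=(N + M, N + M))
K_joint = G @ G.T + 1e-3 * np.eye(N + M)
Kff, Kfu, Kuu = K_joint[:N, :N], K_joint[:N, N:], K_joint[N:, N:]

mu_u = rng.normal(size=M)
L = np.tril(rng.normal(size=(M, M)))
S_u = L @ L.T + 1e-3 * np.eye(M)              # S_u parameterised as L L^T

mean, cov = q_f_marginal(Kff, Kfu, Kuu, mu_u, S_u)
```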
We represent each matrix $\mathbf{S}_{\mathbf{u}_q}$ as $\mathbf{S}_{\mathbf{u}_q} = \mathbf{L}_{\mathbf{u}_q}\mathbf{L}^\top_{\mathbf{u}_q}$ and, to ensure positive definiteness of $\mathbf{S}_{\mathbf{u}_q}$, we estimate $\mathbf{L}_{\mathbf{u}_q}$ instead of $\mathbf{S}_{\mathbf{u}_q}$. Computation of the posterior distributions over $\mathbf{f}_{d,j}$ can be done analytically. There is still an intractability issue in the variational expectations over the log-likelihood functions. Since we construct these bounds so that they accept any possible data type, we need a general way to solve these integrals. One obvious solution is to apply Monte Carlo methods; however, sampling thousands of times at each iteration to approximate the expectations would make maximising the lower bound and updating the variational parameters slow. Instead, we address this problem by using Gauss-Hermite quadrature, as in Hensman et al. (2015) and Saul et al. (2016).

Stochastic Variational Inference. The conditional expectations in the bound above also factorise across data observations, so that we can express the bound as

$$\mathcal{L} = \sum_{d=1}^{D}\sum_{n=1}^{N} \mathbb{E}_{q(f_{d,1}(\mathbf{x}_n))\cdots q(f_{d,J_d}(\mathbf{x}_n))}\!\left[\log p(y_d(\mathbf{x}_n)|f_{d,1}(\mathbf{x}_n), \ldots, f_{d,J_d}(\mathbf{x}_n))\right] - \sum_{q=1}^{Q} \mathrm{KL}\!\left(q(\mathbf{u}_q)\,\|\,p(\mathbf{u}_q)\right).$$

This functional form allows the use of mini-batches of smaller sets of training samples, performing the optimisation using noisy estimates of the gradient of the global objective, in a similar fashion to Hoffman et al. (2013), Hensman et al. (2013, 2015) and Saul et al. (2016). This scalable bound makes our multi-output model applicable to large heterogeneous datasets. Notice that the computational complexity is dominated by the inversion of $\mathbf{K}_{\mathbf{u}\mathbf{u}}$, with a cost of $\mathcal{O}(QM^3)$, and by products involving $\mathbf{K}_{\mathbf{f}\mathbf{u}}$, with a cost of $\mathcal{O}(JNQM^2)$.

Hyperparameter learning. Hyperparameters in our model include $\mathbf{Z}$, $\{\mathbf{B}_q\}_{q=1}^{Q}$ and $\{\boldsymbol{\gamma}_q\}_{q=1}^{Q}$, the hyperparameters associated with the covariance functions $\{k_q(\cdot,\cdot)\}_{q=1}^{Q}$. Since the variational distribution $q(\mathbf{u})$ is sensitive to changes in the hyperparameters, we alternate between optimising the variational parameters of $q(\mathbf{u})$ and the hyperparameters using a variational EM algorithm (Beal, 2003) when employing the full dataset, or its stochastic version when using mini-batches (Hoffman et al., 2013).

2.3 Predictive distribution

Consider a set of test inputs $\mathbf{X}_*$. Assuming that $p(\mathbf{u}|\mathbf{y}) \approx q(\mathbf{u})$, the predictive distribution $p(\mathbf{y}_*)$ can be approximated as $p(\mathbf{y}_*|\mathbf{y}) \approx \int p(\mathbf{y}_*|\mathbf{f}_*)\,q(\mathbf{f}_*)\,d\mathbf{f}_*$, where $q(\mathbf{f}_*) = \int p(\mathbf{f}_*|\mathbf{u})\,q(\mathbf{u})\,d\mathbf{u}$. Computing the expression $q(\mathbf{f}_*) = \prod_{d=1}^{D}\prod_{j=1}^{J_d} q(\mathbf{f}_{d,j,*})$ involves evaluating $\mathbf{K}_{\mathbf{f}_{d,j,*}\mathbf{u}}$ at $\mathbf{X}_*$. As in the case of the lower bound, the integral above is intractable for non-Gaussian likelihoods $p(\mathbf{y}_*|\mathbf{f}_*)$. We can once again make use of Monte Carlo integration or quadrature to approximate the integral. Simpler integration problems are obtained if we are only interested in the predictive mean, $\mathbb{E}[\mathbf{y}_*]$, and the predictive variance, $\mathrm{var}[\mathbf{y}_*]$.
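For a single test input and a Bernoulli output with sigmoid link, the one-dimensional Gaussian expectations required above can be approximated with Gauss-Hermite quadrature, as in the sketch below; the likelihood choice and the numbers are illustrative.

```python
import numpy as np

# Sketch: approximate E_{N(f | mean, var)}[h(f)] with Gauss-Hermite quadrature,
# as used both for the variational expectations and for predictive quantities.

def gauss_hermite_expectation(h, mean, var, degree=20):
    x, w = np.polynomial.hermite.hermgauss(degree)     # nodes/weights for weight exp(-x^2)
    f = mean + np.sqrt(2.0 * var) * x                   # change of variables to N(mean, var)
    return np.sum(w * h(f)) / np.sqrt(np.pi)

# Predictive success probability E[y*] for a Bernoulli output with sigmoid link,
# given the approximate marginal q(f*(x*)) = N(m_star, v_star) (toy values).
m_star, v_star = 0.3, 0.5
p_success = gauss_hermite_expectation(lambda f: 1.0 / (1.0 + np.exp(-f)),
                                      m_star, v_star)
print(p_success)
```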
3 Related Work

The most closely related works to ours are Skolidis and Sanguinetti (2011), Chai (2012), Dezfouli and Bonilla (2015) and Saul et al. (2016). We differ from Skolidis and Sanguinetti (2011) because we allow more general heterogeneous outputs beyond the specific case of several binary classification problems. Our inference method also scales to large datasets. The works by Chai (2012) and Dezfouli and Bonilla (2015) do use a MOGP, but they only handle a single categorical variable. Our inference approach scales when compared to the one in Chai (2012), and it is fundamentally different from the one in Dezfouli and Bonilla (2015) since we do not use AVI. Our model is also different from Saul et al. (2016) since we allow for several dependent outputs, $D > 1$, and our scalable approach is more akin to applying SVI to the inducing variable approach of Álvarez et al. (2010). More recently, Vanhatalo et al. (2018) used additive multi-output GP models to account for interdependencies between counting and binary observations. They use the Laplace approximation for approximating the posterior distribution. Similarly, Pourmohamad and Lee (2016) perform combined regression and binary classification with a multi-output GP learned via sequential Monte Carlo. Nguyen and Bonilla (2014b) also use the idea from Álvarez et al. (2010) to provide scalability for multiple-output GP models, conditioning the latent parameter functions $f_{d,j}(\mathbf{x})$ on the inducing variables $\mathbf{u}$, but only consider the multivariate regression case.

It is also important to mention that multi-output Gaussian processes have been considered as alternative models for multi-task learning (Alvarez et al., 2012). Multi-task learning also addresses multiple prediction problems together within a single inference framework. Most previous work in this area has focused on problems where all tasks are exclusively regression or classification problems. When tasks are heterogeneous, the common practice is to introduce a regularizer per data type in a global cost function (Zhang et al., 2012; Han et al., 2017). Usually, these cost functions are composed of additive terms, one per task, while the correlation assumption among heterogeneous likelihoods is addressed by mixing regularizers in a global penalty term (Li et al., 2014) or by forcing different tasks to share a common mean (Ngufor et al., 2015). Another natural way of treating both continuous and discrete tasks is to assume that all of them share a common input set that varies its influence on each output. Then, by sharing a joint sparsity pattern, it is possible to optimize a global cost function with a single regularization parameter on the level of sparsity (Yang et al., 2009). There have also been efforts for modeling heterogeneous data outside the label of multi-task learning, including mixed graphical models (Yang et al., 2014), where varied types of data are assumed to be combinations of exponential families, and latent feature models (Valera et al., 2017), with heterogeneous observations being mappings of a set of Gaussian-distributed variables.

4 Experiments

In this section, we evaluate our model on different heterogeneous scenarios (the code is publicly available in the repository github.com/pmorenoz/HetMOGP/). To demonstrate its performance in terms of multi-output learning, prediction and scalability, we have explored several applications with both synthetic and real data. For all the experiments, we consider an RBF kernel for each covariance function $k_q(\cdot,\cdot)$ and we set $Q = 3$. For standard optimization we used the L-BFGS-B algorithm. When SVI was needed, we used ADADELTA as included in the climin library, with a mini-batch size of 500 samples for every output. All performance metrics are given in terms of the negative log-predictive density (NLPD), computed on a test subset and applicable to any type of likelihood. Further details about the experiments are included in the appendix.
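As a reference for how this metric can be computed, the sketch below evaluates the NLPD of a Bernoulli output by Monte Carlo, averaging the likelihood over samples from the approximate posterior marginals; the likelihood choice and the toy numbers are illustrative, and the released code may compute it differently.

```python
import numpy as np

# Sketch: NLPD for one output, with the per-point predictive density
# approximated by Monte Carlo samples from q(f*). Bernoulli likelihood as example.

def nlpd_bernoulli(y_test, f_mean, f_var, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    # Samples of the latent parameter function at each test input.
    f = rng.normal(f_mean[None, :], np.sqrt(f_var)[None, :],
                   size=(n_samples, len(y_test)))
    p = 1.0 / (1.0 + np.exp(-f))
    lik = np.where(y_test[None, :] == 1, p, 1.0 - p)   # p(y* | f*) per sample
    log_pred = np.log(np.mean(lik, axis=0))            # log of the MC-averaged density
    return -np.mean(log_pred)

y_test = np.array([0, 1, 1, 0, 1])                     # toy test labels
nlpd = nlpd_bernoulli(y_test, f_mean=np.zeros(5), f_var=np.ones(5))
print(nlpd)
```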
Missing Gap Prediction: In our first experiment, we evaluate whether our model is able to predict observations in one output using training information from another one. We set up a toy problem which consists of $D = 2$ heterogeneous outputs, where the first output $y_1(x)$ is real-valued and $y_2(x)$ is binary. Assuming that the heterogeneous outputs do not share a common input set, we observe $N_1 = 600$ and $N_2 = 500$ samples, respectively. All inputs are uniformly distributed in the range $[0, 1]$, and we generate a gap only in the set of binary observations by removing $N_{\text{test}} = 150$ samples in the interval $[0.7, 0.9]$. Using the remaining points from both outputs for training, we fitted our MOGP model.

[Figure 1: three panels titled "Output 1: Gaussian Regression", "Output 2: Binary Classification" and "Single Output: Binary Classification", all over the normalized input range [0, 1].]
Figure 1: Comparison between multi-output and single-output performance for two heterogeneous sets of observations. (a) Fitted function and uncertainty for the first output; it represents the mean parameter $\mu(x)$ of a Gaussian distribution with $\sigma^2 = 1$. (b) Predictive output function for the binary data; the blue curve is the fit to the training data and the red one corresponds to predictions from test inputs (true test binary outputs also in red). (c) Same output as in Figure 1(b), but training an independent Chained GP only on the single binary output (GP binary classification).

In Figures 1(a,b) we can see how the uncertainty in the binary test predictions is reduced by learning from the first output. In contrast, Figure 1(c) shows a wider variance in the predicted parameter when it is trained independently. For the multi-output case we obtained a test NLPD of $32.5 \pm 0.2$ ($\times 10^{-2}$), while in the single-output case the NLPD was $40.51 \pm 0.08$ ($\times 10^{-2}$).
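The synthetic setup of this toy experiment can be reproduced in spirit with the sketch below; the generating functions are assumptions made for illustration, since the exact functions used to produce the toy data are not specified in the text.

```python
import numpy as np

# Hedged sketch of the missing-gap toy data: a real-valued output and a binary
# output on [0, 1], with a gap removed from the binary observations.
# The sinusoidal generating functions below are illustrative assumptions.

rng = np.random.default_rng(0)
N1, N2 = 600, 500

x1 = rng.uniform(0, 1, size=N1)
y1 = np.sin(2 * np.pi * x1) + rng.normal(scale=0.1, size=N1)   # real output (assumed form)

x2 = rng.uniform(0, 1, size=N2)
p2 = 1.0 / (1.0 + np.exp(-6 * np.sin(2 * np.pi * x2)))         # correlated success prob. (assumed)
y2 = rng.binomial(1, p2)                                        # binary output

# Hold out the binary observations falling in the gap [0.7, 0.9] as test data
# (roughly N_test = 150 points in the paper's setup).
gap = (x2 >= 0.7) & (x2 <= 0.9)
x2_train, y2_train = x2[~gap], y2[~gap]
x2_test, y2_test = x2[gap], y2[gap]
```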
Human Behavior Data: In this experiment, we are interested in modeling human behavior in psychiatric patients. Previous work by Soleimani et al. (2018) already explores the application of scalable MOGP models to healthcare for reliable predictions from multivariate time series. Our data come from a medical study in which patients were asked to download a monitoring app (EB2, available at https://www.eb2.tech/) on their smartphones. The system captures information about mobility, communication metadata and interactions in social media. The work has a particular interest in mental health, since shifts or misalignments in the circadian feature of human behavior (24h cycles) can be interpreted as early signs of crisis.

In particular, we obtained a binary indicator variable of presence/absence at home by monitoring latitude-longitude and measuring its distance from the patient's home location within a 50 m radius range. Then, using the already measured distances, we generated a mobility sequence with all log-distance values. Our last output consists of binary samples representing use/non-use of the WhatsApp application on the smartphone; at each monitoring time instant, we used its differential data consumption to determine use or non-use of the application. We considered an entire week in seconds as the input domain, normalized to the range $[0, 1]$.

[Figure 2: three panels titled "Output 1: Binary Presence/Absence at Home", "Output 2: Distance from Home (log-distance, km)" and "Output 3: Binary Use/Non-use of WhatsApp".]
Figure 2: Results for multi-output modeling of human behavior. After training, all output predictions share a common (daily) periodic pattern.

In Figure 2, after training on $N = 750$ samples, we find that the circadian feature is mainly contained in the first output. During the learning process, this periodicity is transferred to the other outputs through the latent functions, improving the performance of the entire model. Experimentally, we verified that this circadian pattern was not captured in the mobility and social data when training the outputs independently. In Table 1 we can see prediction metrics for multi-output and independent prediction.

Table 1: Behavior dataset, test NLPD ($\times 10^{-2}$)

|            | Bernoulli (output 1) | Heteroscedastic (output 2) | Bernoulli (output 3) | Global        |
|------------|----------------------|----------------------------|----------------------|---------------|
| HetMOGP    | 2.24 ± 0.21          | 6.09 ± 0.21                | 5.41 ± 0.05          | 13.74 ± 0.41  |
| Chained GP | 2.43 ± 0.30          | 7.29 ± 0.12                | 5.19 ± 0.81          | 14.91 ± 1.05  |

London House Price Data: Based on the large-scale experiments in Hensman et al. (2013), we obtained the complete register of properties sold in the Greater London County during 2017 (https://www.gov.uk/government/collections/price-paid-data). We preprocessed it to translate all property addresses into latitude-longitude points. For each spatial input, we considered two observations, one binary and one real-valued. The first one indicates whether or not the property is a flat (zero meaning detached, semi-detached, terraced, etc.), and the second one is the sale price of the house. Our goal is to predict features of houses given a certain location in the London area. We used a training set of $N = 20{,}000$ samples, $1{,}000$ samples for test predictions and $M = 100$ inducing points.

[Figure 3: maps over the Greater London area for Property Type, Log-price and Variance.]
Figure 3: Results for spatial modeling of heterogeneous data. (Top row) 10% of the training samples for the Greater London County; binary outputs are the type of property sold in 2017 and real ones are the prices included in sale contracts. (Bottom row) Test prediction curves for $N_{\text{test}} = 2{,}500$ inputs.

Results in Figure 3 show a portion of the entire heterogeneous dataset and its test prediction curves. We obtained a global NLPD score of $16.44 \pm 0.01$ using the MOGP and $17.31 \pm 1.06$ in the independent-outputs setting (both $\times 10^{-2}$). There is an improvement in performance when training our multi-output model even on large-scale datasets. See Table 2 for the scores per output.

Table 2: London dataset, test NLPD ($\times 10^{-2}$)

|            | Bernoulli    | Heteroscedastic | Global        |
|------------|--------------|-----------------|---------------|
| HetMOGP    | 6.38 ± 0.46  | 10.05 ± 0.64    | 16.44 ± 0.01  |
| Chained GP | 6.75 ± 0.25  | 10.56 ± 1.03    | 17.31 ± 1.06  |

High Dimensional Input Data: In our last experiment, we tested our MOGP model on the arrhythmia dataset from the UCI repository (http://archive.ics.uci.edu/ml/). We use a dataset of dimensionality $p = 255$ with 452 samples that we divide into training, validation and test sets (more details are in the appendix). We use our model for predicting a binary output (gender) and a continuous output (logarithmic age), and we compare against independent Chained GPs per output. The binary output is modelled with a Bernoulli distribution and the continuous one with a Gaussian. We obtained an average NLPD value of 0.0191 for both the multi-output and independent-output models, with a slight difference in the standard deviation.

5 Conclusions

In this paper we have introduced a novel extension of multi-output Gaussian processes for handling heterogeneous observations. Our model is able to work on large-scale datasets by using sparse approximations within stochastic variational inference. Experimental results show relevant improvements with respect to independent learning of heterogeneous data in different scenarios. In future work, it would be interesting to employ convolutional processes (CPs) as an alternative way to build the multi-output GP prior.
Also, instead of requiring hand-made definitions of the heterogeneous likelihoods, we may consider automatically discovering them (Valera and Ghahramani, 2017) as an input block in a pipeline setup of our tool.

Acknowledgments

The authors want to thank Wil Ward for his constructive comments and Juan José Giraldo for his useful advice about SVI experiments and simulations. We also thank Alan Saul and David Ramírez for their recommendations about scalable inference and feedback on the equations. We are grateful to Eero Siivola and Marcelo Hartmann for sharing their Python module for heterogeneous likelihoods and to Francisco J. R. Ruiz for his illuminating help about the stochastic version of the VEM algorithm. Also, we would like to thank Juan José Campaña for his assistance on the London House Price dataset. Pablo Moreno-Muñoz acknowledges the support of his doctoral FPI grant BES2016-077626 and was also supported by Ministerio de Economía of Spain under the project Macro-ADOBE (TEC2015-67719-P). Antonio Artés-Rodríguez acknowledges the support of projects ADVENTURE (TEC2015-69868-C2-1-R), AID (TEC2014-62194-EXP) and CASI-CAM-CM (S2013/ICE-2845). Mauricio A. Álvarez has been partially financed by the Engineering and Physical Sciences Research Council (EPSRC) Research Projects EP/N014162/1 and EP/R034303/1.

References

M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In NIPS 21, pages 57-64, 2009.
M. Álvarez, D. Luengo, M. Titsias, and N. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In AISTATS, pages 25-32, 2010.
M. A. Alvarez, L. Rosasco, and N. D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195-266, 2012.
M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, University College London, 2003.
E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In NIPS 20, pages 153-160, 2008.
K. M. A. Chai. Variational multinomial logit Gaussian process. Journal of Machine Learning Research, 13:1745-1808, 2012.
Z. Dai, M. A. Álvarez, and N. Lawrence. Efficient modeling of latent information in supervised learning using Gaussian processes. In NIPS 30, pages 5131-5139, 2017.
A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS 28, pages 1414-1422, 2015.
J. D. Hadfield. MCMC methods for multi-response generalized linear mixed models: the MCMCglmm R package. Journal of Statistical Software, 33(2):1-22, 2010.
H. Han, A. K. Jain, S. Shan, and X. Chen. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pages 282-290, 2013.
J. Hensman, A. G. de G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In AISTATS, pages 351-360, 2015.
D. M. Higdon. Space and space-time modelling using process convolutions. In Quantitative Methods for Current Environmental Issues, pages 37-56, 2002.
M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303-1347, 2013.
A. G. Journel and C. J. Huijbregts. Mining Geostatistics. Academic Press, London, 1978.
M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In ICML, pages 841-848, 2011.
S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network. In CVPR, pages 482-489, 2014.
C. Ngufor, S. Upadhyaya, D. Murphree, N. Madde, D. Kor, and J. Pathak. A heterogeneous multi-task learning for predicting RBC transfusion and perioperative outcomes. In AIME, pages 287-297. Springer, 2015.
T. V. Nguyen and E. V. Bonilla. Automated variational inference for Gaussian process models. In NIPS 27, pages 1404-1412, 2014a.
T. V. Nguyen and E. V. Bonilla. Collaborative multi-output Gaussian processes. In UAI, 2014b.
G. Parra and F. Tobar. Spectral mixture kernels for multi-output Gaussian processes. In NIPS 30, 2017.
T. Pourmohamad and H. K. H. Lee. Multivariate stochastic process models for correlated responses of mixed type. Bayesian Analysis, 11(3):797-820, 2016.
A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence. Chained Gaussian processes. In AISTATS, pages 1431-1440, 2016.
G. Skolidis and G. Sanguinetti. Bayesian multitask classification with Gaussian process priors. IEEE Transactions on Neural Networks, 22(12), 2011.
H. Soleimani, J. Hensman, and S. Saria. Scalable joint models for reliable uncertainty-aware event prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(8):1948-1963, 2018.
Y. Teh, M. Seeger, and M. Jordan. Semiparametric latent factor models. In AISTATS, pages 333-340, 2005.
K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin. GP kernels for cross-spectrum analysis. In NIPS 28, 2015.
I. Valera and Z. Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In ICML, pages 3521-3529, 2017.
I. Valera, M. F. Pradier, M. Lomeli, and Z. Ghahramani. General latent feature models for heterogeneous datasets. arXiv preprint arXiv:1706.03779, 2017.
J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modeling with Gaussian processes. Journal of Machine Learning Research, 14(1):1175-1179, 2013.
J. Vanhatalo, M. Hartmann, and L. Veneranta. Joint species distribution modeling with additive multivariate Gaussian process priors and heterogeneous data. arXiv preprint arXiv:1809.02432, 2018.
A. G. Wilson and Z. Ghahramani. Generalised Wishart processes. In UAI, pages 736-744, 2011.
E. Yang, P. Ravikumar, G. I. Allen, Y. Baker, Y.-W. Wan, and Z. Liu. A general framework for mixed graphical models. arXiv preprint arXiv:1411.0288, 2014.
X. Yang, S. Kim, and E. P. Xing. Heterogeneous multitask learning with joint sparsity constraints. In NIPS 22, pages 2151-2159, 2009.
D. Zhang, D. Shen, and the Alzheimer's Disease Neuroimaging Initiative. Multi-modal multi-task learning for joint prediction of multiple regression and classification variables in Alzheimer's disease. NeuroImage, 59(2):895-907, 2012.