# Modular Gaussian Processes for Transfer Learning

Pablo Moreno-Muñoz, Antonio Artés-Rodríguez, Mauricio A. Álvarez

Section for Cognitive Systems, Technical University of Denmark (DTU); Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain; Evidence-Based Behavior (eB2), Spain; Dept. of Computer Science, University of Sheffield, UK

pabmo@dtu.dk, antonio@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk

**Abstract.** We present a framework for transfer learning based on modular variational Gaussian processes (GPs). We develop a module-based method in which, given a dictionary of well-fitted GPs, one can build ensemble GP models without revisiting any data. Each model is characterised by its hyperparameters, pseudo-inputs and their corresponding posterior densities. Our method avoids undesired data centralisation, reduces rising computational costs and allows the transfer of learned uncertainty metrics after training. We exploit the augmentation of high-dimensional integral operators, based on the Kullback-Leibler divergence between stochastic processes, to introduce an efficient lower bound that covers all the sparse variational GP modules, even when they differ in complexity and likelihood distribution. The method is also valid for multi-output GPs, learning correlations a posteriori between independent modules. Extensive results illustrate the usability of our framework in large-scale and multi-task experiments, also compared with exact inference methods from the literature.

## 1 Introduction

Imagine a supervised learning problem, for instance regression, where $N$ data points are processed to train a model. At a later time, new data are observed, this time corresponding to a binary classification task, that we know are generated by the same phenomenon, e.g. using a different sensor.

*Figure 1: GP modules (A, B, C) are used for training (D) without revisiting data.*

Having kept the regression observations stored, a common approach would be to use them in combination with the classification dataset to fit a new model. However, this practice might be inconvenient because of i) the need to centralise data to train the model, ii) the rising data-dependent computational cost as the number of samples increases, and iii) the obsolescence of fitted models, whose future usability is not guaranteed for the new set of observations. Looking at deployment in large-scale scenarios, by any organisation or use case where data are ever changing, this solution becomes prohibitive. The main challenge is to incorporate unseen tasks as former models are recurrently discarded. Alternatively, we propose a framework based on modules of Gaussian processes (GPs) (Rasmussen and Williams, 2006). Considering the current example, the regression model (or module) is kept intact. Once new data arrive, one fits a meta-GP using the module, but without revisiting any sample. If no new data are observed, combining multiple modules is also allowed; see Figure 1. Under this framework, a new family of module-driven GP models emerges.

**Background.** Much recent attention has been paid to the notion of efficiency, e.g. in the way data are accessed.
The mere limitation of data availability forces learning algorithms to develop new capabilities, such as distributing the data for federated learning (Smith et al., 2017), recurrently observing streaming samples for continual learning (Goodfellow et al., 2014), and limiting data exchange for privately-owned models (Peterson et al., 2019). A common theme is the idea of model memorising and recycling, that is, reusing already fitted parameters for an additional task without revisiting any data. From a probabilistic point of view, uncertainty is harder to repurpose than the parameters of functions. This is where Gaussian processes play their role. The flexible nature of GPs for defining a prior distribution over non-linear function spaces has made them a suitable alternative for probabilistic regression and classification. However, GP models are not immune to settings where we need to adapt to irregular ways of observing data, e.g. non-centralised observations, asynchronous samples or missing inputs. Such settings, together with the well-known computational cost of GPs, typically $\mathcal{O}(N^3)$, have motivated plenty of works focused on parallelising inference. Before the modern era of GP approximations, two seminal works distributed the computational load with local experts (Jacobs et al., 1991; Hinton, 2002). While the Bayesian committee machine (BCM) of Tresp (2000) focused on merging independently trained GP regression models on data partitions, the infinite mixture of GP experts (Rasmussen and Ghahramani, 2002) combined local GPs to gain expressiveness. Moreover, the emergence of large datasets, with size $N > 10^4$, led to the introduction of variational methods (Titsias, 2009a) for scaling up GP inference. Two recent works that combined sparse GPs with distributed regression ideas are Gal et al. (2014) and Deisenroth and Ng (2015), whose approaches focus on exact GP regression and on the distribution of the computational load, respectively.

**Contribution.** We present a framework based on modular Gaussian processes. Each module is considered to be a sparse variational GP, containing only variational parameters and hyperparameters. Our main contribution is to keep such modules intact and build meta-GPs without accessing any past observation. This makes the computational cost minimal, turning the method into a competitor against inference methods for large datasets. The key principle is the augmentation of integrals, which can be understood in terms of the Kullback-Leibler (KL) divergence between stochastic processes (Matthews et al., 2016). We experimentally provide evidence of the benefits in different applied problems, given several GP architectures. The framework is valid for multi-output settings and heterogeneous likelihoods, that is, handling one data type per output. Our idea follows the spirit of no longer being data-driven but module-driven instead, where the Python user specifies a list of models as input: `models = {model1, model2, ..., modelK}`. This will later be called the dictionary of modules.
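To make the module-driven interface concrete, the following minimal sketch shows one way such a dictionary of modules could be represented in Python: each entry stores only the hyperparameters, pseudo-inputs and variational posterior parameters of an already fitted sparse GP, never the training data. The `GPModule` container and its field names (`Z`, `q_mu`, `q_sqrt`) are illustrative assumptions for this example, not an interface defined by the paper.

```python
import numpy as np
from dataclasses import dataclass
from typing import Dict


@dataclass
class GPModule:
    """A fitted sparse variational GP kept as a stand-alone module.

    Only what is needed to rebuild the kernel and the posterior q(u) is stored;
    the observations used for training are never kept.
    """
    kernel_params: Dict[str, float]   # e.g. {"variance": 1.0, "lengthscale": 0.5}
    likelihood: str                   # e.g. "gaussian" or "bernoulli"
    Z: np.ndarray                     # pseudo-inputs (inducing points), shape (M, p)
    q_mu: np.ndarray                  # variational mean of q(u), shape (M, 1)
    q_sqrt: np.ndarray                # Cholesky factor of the q(u) covariance, shape (M, M)


def dummy_module(seed: int, M: int = 20, p: int = 1) -> GPModule:
    """Stand-in for a module fitted elsewhere (hypothetical values)."""
    rng = np.random.default_rng(seed)
    return GPModule(
        kernel_params={"variance": 1.0, "lengthscale": 0.5},
        likelihood="gaussian",
        Z=rng.uniform(-1.0, 1.0, size=(M, p)),
        q_mu=rng.normal(size=(M, 1)),
        q_sqrt=np.eye(M),
    )


# The dictionary of modules: the only input a meta-GP needs.
models = {f"model{k}": dummy_module(seed=k) for k in range(1, 4)}
print(list(models.keys()))  # ['model1', 'model2', 'model3']
```

Keeping modules in this form means they can have different kernels, numbers of pseudo-inputs or even likelihoods, which is the setting the framework targets.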
## 2 Modular Gaussian processes

We consider supervised learning problems, where we have an input-output training dataset $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$ with $\mathbf{x}_n \in \mathbb{R}^p$. We assume i.i.d. outputs $y_n$ that can be either continuous or discrete variables. For convenience, we will refer to the likelihood term $p(y|\theta)$ as $p(y|f)$, where the generative parameters are linked via $\theta = f(\mathbf{x})$. We say that $f(\cdot)$ is a non-linear function drawn from a zero-mean GP prior $f \sim \mathcal{GP}(0, k(\cdot, \cdot))$, where $k(\cdot, \cdot)$ is the covariance function or kernel. Importantly, when non-Gaussian outputs are considered, the GP output function $f(\cdot)$ might need an extra deterministic mapping $\Phi(\cdot)$ to be transformed to the appropriate parametric domain of $\theta$.

**Data modules.** The dataset $\mathcal{D}$ is assumed to be partitioned into an arbitrary number of $K$ subsets or modules that are observed and processed independently, that is, $\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_K\}$. There is no restriction on the number of modules, the subsets $\{\mathcal{D}_k\}_{k=1}^{K}$ do not need to have the same size, and we only restrict them to satisfy $N_k < N$.
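As an illustration of the data modules described above, the sketch below splits a toy regression dataset into $K$ partitions and fits an independent sparse variational GP to each, keeping only the fitted models as the dictionary of modules. GPflow's `SVGP`, the squared exponential kernel and the partition sizes are assumptions made for the example; the paper does not tie the framework to a particular sparse GP implementation.

```python
import numpy as np
import gpflow

# Toy data: a single regression dataset split into K independent modules.
rng = np.random.default_rng(0)
N, K, M = 1500, 3, 15                      # total points, modules, inducing points per module
X = rng.uniform(-3.0, 3.0, size=(N, 1))
Y = np.sin(2.0 * X) + 0.1 * rng.normal(size=(N, 1))
partitions = zip(np.array_split(X, K), np.array_split(Y, K))

modules = {}
for k, (Xk, Yk) in enumerate(partitions, start=1):
    # One sparse variational GP per partition D_k, trained in isolation.
    Zk = Xk[rng.choice(len(Xk), size=M, replace=False)].copy()
    model = gpflow.models.SVGP(
        kernel=gpflow.kernels.SquaredExponential(),
        likelihood=gpflow.likelihoods.Gaussian(),
        inducing_variable=Zk,
        num_data=len(Xk),
    )
    gpflow.optimizers.Scipy().minimize(
        model.training_loss_closure((Xk, Yk)), model.trainable_variables
    )
    modules[f"model{k}"] = model           # keep only the fitted module, not (Xk, Yk)

print(list(modules.keys()))                # ['model1', 'model2', 'model3']
```

After this loop, each module carries its own hyperparameters, pseudo-inputs and variational posterior, and the partitions themselves can be discarded; building a meta-GP from `modules` is the subject of the rest of the paper.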