# Modular Gaussian Processes for Transfer Learning

Pablo Moreno-Muñoz, Antonio Artés-Rodríguez, Mauricio A. Álvarez

Section for Cognitive Systems, Technical University of Denmark (DTU); Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain; Evidence-Based Behavior (eB2), Spain; Dept. of Computer Science, University of Sheffield, UK

pabmo@dtu.dk, antonio@tsc.uc3m.es, mauricio.alvarez@sheffield.ac.uk

**Abstract.** We present a framework for transfer learning based on modular variational Gaussian processes (GPs). We develop a module-based method in which, given a dictionary of well-fitted GPs, one can build ensemble GP models without revisiting any data. Each model is characterised by its hyperparameters, pseudo-inputs and their corresponding posterior densities. Our method avoids undesired data centralisation, reduces rising computational costs and allows the transfer of learned uncertainty metrics after training. We exploit the augmentation of high-dimensional integral operators, based on the Kullback-Leibler divergence between stochastic processes, to introduce an efficient lower bound that covers all the sparse variational GP modules, even when they differ in complexity and likelihood distribution. The method is also valid for multi-output GPs, learning correlations a posteriori between independent modules. Extensive results illustrate the usability of our framework in large-scale and multi-task experiments, also compared with exact inference methods from the literature.

## 1 Introduction

Imagine a supervised learning problem, for instance regression, where $N$ data points are processed to train a model. At a later time, new data are observed, this time corresponding to a binary classification task, that we know are generated by the same phenomenon, e.g. using a different sensor.

*Figure 1: GP modules (A, B, C) are used for training (D) without revisiting data.*

Having kept the regression observations stored, a common approach would be to use them in combination with the classification dataset to fit a new model. However, this practice might be inconvenient because of i) the need to centralise data to train the model, ii) the rising data-dependent computational cost as the number of samples increases, and iii) the obsolescence of fitted models, whose future usability is not guaranteed for the new set of observations. Looking at deployment in large-scale scenarios, by any organisation or use case where data are ever changing, this solution becomes prohibitive. The main challenge is to incorporate unseen tasks as former models are recurrently discarded. Alternatively, we propose a framework based on modules of Gaussian processes (GPs) (Rasmussen and Williams, 2006). Considering the current example, the regression model (or module) is kept intact. Once new data arrive, one fits a meta-GP using the module, but without revisiting any sample. If no new data are observed, combining multiple modules is also allowed; see Figure 1. Under this framework, a new family of module-driven GP models emerges.

**Background.** Much recent attention has been paid to the notion of efficiency, e.g. in the way data are accessed.
The mere limitation of data availability forces learning algorithms to develop new capabilities, such as distributing the data for federated learning (Smith et al., 2017), recurrently observing streaming samples for continual learning (Goodfellow et al., 2014), and limiting data exchange for privately-owned models (Peterson et al., 2019). A common theme is the idea of model memorising and recycling, that is, reusing already fitted parameters for an additional task without revisiting any data. From a probabilistic point of view, uncertainty is harder to repurpose than the parameters of functions. This is where Gaussian processes play their role. The flexible nature of GPs for defining a prior distribution over non-linear function spaces has made them a suitable alternative for probabilistic regression and classification. However, GP models are not immune to settings where we need to adapt to irregular ways of observing data, e.g. non-centralised observations, asynchronous samples or missing inputs. Such settings, together with the well-known computational cost of GPs, typically $\mathcal{O}(N^3)$, have motivated plenty of works focused on parallelising inference. Before the modern era of GP approximations, two seminal works distributed the computational load with local experts (Jacobs et al., 1991; Hinton, 2002). While the Bayesian committee machine (BCM) of Tresp (2000) focused on merging independently trained GP regression models on data partitions, the infinite mixture of GP experts (Rasmussen and Ghahramani, 2002) combined local GPs to gain expressiveness. Moreover, the emergence of large datasets, with size $N > 10^4$, led to the introduction of variational methods (Titsias, 2009a) for scaling up GP inference. Two recent works that combined sparse GPs with distributed regression ideas are Gal et al. (2014) and Deisenroth and Ng (2015), whose approaches focus on exact GP regression and on the distribution of the computational load, respectively.

**Contribution.** We present a framework based on modular Gaussian processes. Each module is considered to be a sparse variational GP, containing only variational parameters and hyperparameters. Our main contribution is to keep such modules intact and build meta-GPs without accessing any past observation. This makes the computational cost minimal, turning the method into a competitor against inference methods for large datasets. The key principle is the augmentation of integrals, which can be understood in terms of the Kullback-Leibler (KL) divergence between stochastic processes (Matthews et al., 2016). We experimentally provide evidence of the benefits in different applied problems, given several GP architectures. The framework is valid for multi-output settings and heterogeneous likelihoods, that is, handling one data type per output. Our idea follows the spirit of no longer being data-driven but module-driven instead, where the Python user specifies a list of models as input: `models = {model1, model2, ..., modelK}`. This will later be called the dictionary of modules.
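To make the module-driven interface concrete, the following minimal sketch shows one way such a dictionary of modules could be represented in Python: each entry stores only the hyperparameters, pseudo-inputs and variational posterior parameters of an already fitted sparse GP, never the training data. The `GPModule` container and its field names (`Z`, `q_mu`, `q_sqrt`) are illustrative assumptions for this example, not an interface defined by the paper.

```python
import numpy as np
from dataclasses import dataclass
from typing import Dict


@dataclass
class GPModule:
    """A fitted sparse variational GP kept as a stand-alone module.

    Only what is needed to rebuild the kernel and the posterior q(u) is stored;
    the observations used for training are never kept.
    """
    kernel_params: Dict[str, float]   # e.g. {"variance": 1.0, "lengthscale": 0.5}
    likelihood: str                   # e.g. "gaussian" or "bernoulli"
    Z: np.ndarray                     # pseudo-inputs (inducing points), shape (M, p)
    q_mu: np.ndarray                  # variational mean of q(u), shape (M, 1)
    q_sqrt: np.ndarray                # Cholesky factor of the q(u) covariance, shape (M, M)


def dummy_module(seed: int, M: int = 20, p: int = 1) -> GPModule:
    """Stand-in for a module fitted elsewhere (hypothetical values)."""
    rng = np.random.default_rng(seed)
    return GPModule(
        kernel_params={"variance": 1.0, "lengthscale": 0.5},
        likelihood="gaussian",
        Z=rng.uniform(-1.0, 1.0, size=(M, p)),
        q_mu=rng.normal(size=(M, 1)),
        q_sqrt=np.eye(M),
    )


# The dictionary of modules: the only input a meta-GP needs.
models = {f"model{k}": dummy_module(seed=k) for k in range(1, 4)}
print(list(models.keys()))  # ['model1', 'model2', 'model3']
```

Keeping modules in this form means they can have different kernels, numbers of pseudo-inputs or even likelihoods, which is the setting the framework targets.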
## 2 Modular Gaussian processes

We consider supervised learning problems, where we have an input-output training dataset $\mathcal{D} = \{\mathbf{x}_n, y_n\}_{n=1}^{N}$ with $\mathbf{x}_n \in \mathbb{R}^p$. We assume i.i.d. outputs $y_n$ that can be either continuous or discrete variables. For convenience, we will refer to the likelihood term $p(y|\theta)$ as $p(y|f)$, where the generative parameters are linked via $\theta = f(\mathbf{x})$. We say that $f(\cdot)$ is a non-linear function drawn from a zero-mean GP prior $f \sim \mathcal{GP}(0, k(\cdot, \cdot))$, where $k(\cdot, \cdot)$ is the covariance function or kernel. Importantly, when non-Gaussian outputs are considered, the GP output function $f(\cdot)$ might need an extra deterministic mapping $\Phi(\cdot)$ to be transformed to the appropriate parametric domain of $\theta$.

**Data modules.** The dataset $\mathcal{D}$ is assumed to be partitioned into an arbitrary number of $K$ subsets or modules that are observed and processed independently, that is, $\{\mathcal{D}_1, \mathcal{D}_2, \dots, \mathcal{D}_K\}$. There is no restriction on the number of modules, the subsets $\{\mathcal{D}_k\}_{k=1}^{K}$ do not need to have the same size, and we only restrict them to satisfy $N_k < N$.
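As an illustration of the data modules described above, the sketch below splits a toy regression dataset into $K$ partitions and fits an independent sparse variational GP to each, keeping only the fitted models as the dictionary of modules. GPflow's `SVGP`, the squared exponential kernel and the partition sizes are assumptions made for the example; the paper does not tie the framework to a particular sparse GP implementation.

```python
import numpy as np
import gpflow

# Toy data: a single regression dataset split into K independent modules.
rng = np.random.default_rng(0)
N, K, M = 1500, 3, 15                      # total points, modules, inducing points per module
X = rng.uniform(-3.0, 3.0, size=(N, 1))
Y = np.sin(2.0 * X) + 0.1 * rng.normal(size=(N, 1))
partitions = zip(np.array_split(X, K), np.array_split(Y, K))

modules = {}
for k, (Xk, Yk) in enumerate(partitions, start=1):
    # One sparse variational GP per partition D_k, trained in isolation.
    Zk = Xk[rng.choice(len(Xk), size=M, replace=False)].copy()
    model = gpflow.models.SVGP(
        kernel=gpflow.kernels.SquaredExponential(),
        likelihood=gpflow.likelihoods.Gaussian(),
        inducing_variable=Zk,
        num_data=len(Xk),
    )
    gpflow.optimizers.Scipy().minimize(
        model.training_loss_closure((Xk, Yk)), model.trainable_variables
    )
    modules[f"model{k}"] = model           # keep only the fitted module, not (Xk, Yk)

print(list(modules.keys()))                # ['model1', 'model2', 'model3']
```

After this loop, each module carries its own hyperparameters, pseudo-inputs and variational posterior, and the partitions themselves can be discarded; building a meta-GP from `modules` is the subject of the rest of the paper.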