# Non-Gaussian Gaussian Processes for Few-Shot Regression

Marcin Sendera (Jagiellonian University), Jacek Tabor (Jagiellonian University), Aleksandra Nowak (Jagiellonian University), Andrzej Bedychaj (Jagiellonian University), Massimiliano Patacchiola (University of Cambridge), Tomasz Trzcinski (Jagiellonian University, Warsaw University of Technology), Przemysław Spurek (Jagiellonian University), Maciej Zieba (Wrocław University of Science and Technology)

**Abstract.** Gaussian Processes (GPs) have been widely used in machine learning to model distributions over functions, with applications including multi-modal regression, time-series prediction, and few-shot learning. GPs are particularly useful in the last application since they rely on Normal distributions and enable closed-form computation of the posterior probability function. Unfortunately, because the resulting posterior is not flexible enough to capture complex distributions, GPs assume high similarity between subsequent tasks, a requirement rarely met in real-world conditions. In this work, we address this limitation by leveraging the flexibility of Normalizing Flows to modulate the posterior predictive distribution of the GP. This makes the GP posterior locally non-Gaussian, therefore we name our method Non-Gaussian Gaussian Processes (NGGPs). We propose an invertible ODE-based mapping that operates on each component of the random variable vectors and shares the parameters across all of them. We empirically tested the flexibility of NGGPs on various few-shot learning regression datasets, showing that the mapping can incorporate context embedding information to model different noise levels for periodic functions. As a result, our method shares the structure of the problem between subsequent tasks, but the contextualization allows for adaptation to dissimilarities. NGGPs outperform the competing state-of-the-art approaches on a diversified set of benchmarks and applications.

## 1 Introduction

Gaussian Processes (GPs) [33, 46] are one of the most important probabilistic methods, and they have been widely used to model distributions over functions in a variety of applications such as multi-modal regression [56], time-series prediction [3, 27], and meta-learning [29, 45]. Recent works propose to use GPs in the few-shot learning scenario [4, 29, 39, 49], where the model is trained to solve a supervised task with only a few labeled samples available. This particular application is well suited to GPs since they can determine the posterior distribution in closed form from a small set of data samples [29].

Corresponding author: marcin.sendera@doctoral.uj.edu.pl

35th Conference on Neural Information Processing Systems (NeurIPS 2021).

Figure 1: Results of Deep Kernels with a classical GP (left) and NGGP (right). The one-dimensional samples were generated randomly from sin(x) and -sin(x) functions with additional noise. NGGP, compared to GP, does not rely on the assumption of a Gaussian prior, which allows for modeling a multi-modal distribution.

However, the generalization capabilities of GPs come at the price of reduced flexibility when the modeled distributions are complex, e.g., when they have high skewness or heavy tails. Furthermore, GPs assume a high similarity between subsequent tasks. This condition is rarely met in real-world applications, where tasks can vary over time, as is the case in heteroscedastic regression.
These limitations of GPs also extend to multi-modal learning or, more generally, to multi-label regression [56].

In this work, we address those drawbacks by modeling the GP posterior predictive distribution with a local non-Gaussian approximation. We do so by introducing a new method that we have named Non-Gaussian Gaussian Processes (NGGPs). In NGGPs, we leverage the flexibility of Continuous Normalizing Flows (CNFs) [16] to model arbitrary probability distributions. In particular, we propose an invertible ODE-based mapping that operates on each component of the random variable vectors. This way, we can compute a set of CNF parameters shared across all vectors, with the resulting mapping incorporating the information of the context to model different noise for periodic functions. Figure 1 shows how NGGPs are able to capture the overall structure of a problem, whereas standard GPs fail. NGGPs are able to reconstruct a multi-modal sine function while adapting to local dissimilarities thanks to the contextualization provided by the ODE-based mapping. We provide empirical evidence that NGGPs outperform competitive state-of-the-art approaches on a diversified set of benchmarks and applications in a few-shot learning scenario; the code is released with an open-source license (https://github.com/gmum/non-gaussian-gaussian-processes).

The contributions of our work can be summarized as follows:

- We introduce Non-Gaussian Gaussian Processes (NGGPs), a new probabilistic method for modeling complex distributions through locally non-Gaussian posteriors.
- We show how invertible ODE-based mappings can be coupled with GPs to process the marginals of multivariate random variables, resulting in more flexible models.
- We extensively test NGGPs on a variety of few-shot learning benchmarks, achieving state-of-the-art performance in most conditions.

## 2 Related Work

The related work section is divided into three parts. First, we present the general few-shot learning problem. Then, we discuss GPs, focusing on models that use flow architectures. Finally, in the third paragraph, we describe existing approaches to few-shot learning that use Gaussian Processes.

**Few-Shot Learning.** Few-shot learning aims at solving problems in which the number of observations is limited. Some of the early methods in this domain have applied a two-phase approach, pre-training on the base set of training tasks and then fine-tuning the parameters to the test tasks [4, 28]. An alternative approach is given by non-parametric metric-learning algorithms, which aim at optimizing a metric that is then used to calculate the distance between the target observations and the support set items [48, 38, 42]. Another popular approach to few-shot learning is Model-Agnostic Meta-Learning (MAML) [9] and its variants [12, 24, 32, 54, 14, 52, 6]. MAML aims at finding a set of joint task parameters that can be easily fine-tuned to new test tasks via a few gradient descent updates. MAML can also be treated as a Bayesian hierarchical model [10, 15, 18]. Bayesian MAML [55] combines efficient gradient-based meta-learning with non-parametric variational inference in a principled probabilistic framework. A few algorithms focus exclusively on regression tasks. An example is given by ALPaCA [17], which uses a dataset of sample functions to learn a domain-specific encoding and prior over weights.
**Gaussian Processes.** GPs have been applied to numerous machine learning problems, such as spatio-temporal density estimation [7], robotic control [53], or dynamics modeling in transcriptional processes in the human cell [21]. The drawback of GPs lies in the computational cost of the training step, which is $O(n^3)$ (where $n$ denotes the number of observations in the training sample). In [41], the authors extend the flexibility of GPs by processing the targets with a learnable monotonic mapping (the warping function). This idea is further extended in [22], which shows that it is possible to place the prior of another GP on the warping function itself. Our method is different from these approaches, since the likelihood transformation is obtained by the use of a learnable CNF mapping. In [26], the authors present Transformed Gaussian Processes (TGP), a new flexible family of function priors that use GPs and flow models. TGPs exploit Bayesian Neural Networks (BNNs) as input-dependent parametric transformations. The method can match the performance of Deep GPs at a fraction of the computational cost. The methods discussed above are trained on a single dataset that is kept unchanged. Therefore, it is not trivial to adapt such methods to the few-shot setting.

**Few-Shot Learning with Gaussian Processes.** When the number of observations is relatively small, GPs represent an interesting alternative to other regression approaches. This makes GPs a good candidate for meta-learning and few-shot learning, as shown by recent publications that have explored this research direction. For instance, Adaptive Deep Kernel Learning (ADKL) [45] proposes a variant of kernel learning for GPs, which aims at finding appropriate kernels for each task during inference by using a meta-learning approach. A similar approach can be used to learn the mean function [11]. In [37], the authors present a theoretically principled PAC-Bayesian framework for meta-learning that can be used with different base learners (e.g., GPs or BNNs). Topics related to kernel tricks and meta-learning have been explored in [47], where the authors propose to use nonparametric kernel regression for the inner-loop update. In [43], the authors introduce an information-theoretic framework for meta-learning by using a variational approximation to the information bottleneck. In their GP-based approach, to account for likelihoods other than Gaussians, they propose approximating the non-Gaussian terms in the posterior with Gaussian distributions (by using amortized functions), while we use CNFs to increase the flexibility of the GPs. In [29], the authors present Deep Kernel Transfer (DKT): a Bayesian treatment for the meta-learning inner loop through the use of deep kernels, which has achieved state-of-the-art results. In DKT, the deep kernel and the parameters of the GP are shared across all tasks and adjusted to maximize the marginal log-likelihood, which is equivalent to Maximum-Likelihood type II (ML-II) learning. DKT is particularly effective in the regression case since it is able to capture prior knowledge about the data through the GP kernel. However, in many settings, prior assumptions can be detrimental if they are not met during the evaluation phase. This is the case in few-shot regression, where there can be a significant difference between the tasks seen at training time and the tasks seen at evaluation time.
For instance, if we are given few-shot tasks consisting of samples from periodic functions but the periodicity is violated at evaluation time, then methods like DKT may suffer in terms of predictive accuracy under this domain shift. In this work, we tackle this problem by exploiting the flexibility of CNFs.

## 3 Background

**Gaussian Processes.** The method proposed in this paper strongly relies on Gaussian Processes (GPs) and their applications in regression problems. GPs are a well-established framework for principled uncertainty quantification and automatic selection of hyperparameters through a marginal likelihood objective [35]. More formally, a GP is a collection of random variables such that the joint distribution of every finite subset of random variables from this collection is a multivariate Gaussian [31]. We denote a Gaussian Process as $f(\cdot) \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$, where $\mu(x)$ and $k(x, x')$ are the mean and covariance functions. When prior information is not available, a common choice for $\mu$ is the zero constant function. The covariance function must induce a valid covariance matrix, which is achieved by restricting $k$ to be a kernel function. Examples of such kernels include the Linear kernel, the Radial Basis Function (RBF) kernel, the Spectral Mixture (Spectral) kernel [50], and the Cosine-Similarity kernel [33]. Kernel functions can also be directly modeled as inner products defined in the feature space imposed by a feature mapping $\varphi: \mathcal{X} \to V$:

$$k(x, x') = \langle \varphi(x), \varphi(x') \rangle_V \quad (1)$$

An advantage of the formulation above is that it can be easily implemented by modeling $\varphi$ through a neural network. Throughout this work, we call this technique the NN Linear kernel (sometimes called Deep Kernel [29]). Since every kernel can be described in terms of Equation (1), such an approach may be desired if no prior information about the structure of the kernel function is available.

Gaussian Processes provide a method for modeling probability distributions over functions. Consider a regression problem:

$$y_i = f(x_i) + \epsilon_i, \quad \text{for } i = 1, \ldots, m, \quad (2)$$

where the $\epsilon_i$ are i.i.d. noise variables with independent $\mathcal{N}(0, \sigma^2)$ distributions. Let $X$ be the matrix composed of all samples $x_i$ and let $y$ be the vector composed of all target values $y_i$. Assuming that $f(\cdot) \sim \mathcal{GP}(0, k(\cdot, \cdot))$, we obtain:

$$y \mid X \sim \mathcal{N}(0, K + \sigma^2 I), \quad (3)$$

where $K_{i,j} = k(x_i, x_j)$. Analogously, inference for unknown test outputs is obtained by conditioning on the training samples. Let $(y, X)$ be the train data and let $(y_*, X_*)$ be the test data. Then the distribution of $y_*$ given $y, X, X_*$ is also a Gaussian distribution [34]:

$$y_* \mid y, X, X_* \sim \mathcal{N}(\mu_*, K_*), \quad (4)$$

where:

$$\mu_* = K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} y,$$
$$K_* = K(X_*, X_*) + \sigma^2 I - K(X_*, X)\left[K(X, X) + \sigma^2 I\right]^{-1} K(X, X_*).$$

**Continuous Normalizing Flows.** Normalizing Flows (NFs) [36] are gaining popularity among generative models thanks to their flexibility and the ease of training via direct negative log-likelihood (NLL) optimization. Flexibility is given by the change-of-variables technique that maps a latent variable $z$ with known prior $p(z)$ to $y$ from some observed space with unknown distribution. This mapping is performed through a series of (parametric) invertible functions: $y = f_n \circ \cdots \circ f_1(z)$. Assuming a known prior $p(z)$ for $z$, the log-likelihood for $y$ is given by:

$$\log p(y) = \log p(z) - \sum_{k=1}^{n} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|, \quad (5)$$

where $z = (f_n \circ \cdots \circ f_1)^{-1}(y)$ is the result of the inverted mapping. The biggest challenge in normalizing flows is the choice of the invertible functions $f_n, \ldots, f_1$.
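To make Equation (5) concrete, the short sketch below (not from the paper; a toy Python example with hypothetical affine bijections, whose log-determinant term is simply $\log|a_k|$) evaluates $\log p(y)$ by inverting a small stack of maps and accumulating the log-determinant corrections under a standard Normal prior.

```python
import numpy as np

# A toy stack of invertible maps f_k(z) = a_k * z + b_k (scalar case),
# for which log|det df_k/dz| is simply log|a_k|.
flows = [(2.0, 0.5), (0.7, -1.0), (1.5, 0.3)]  # hypothetical (a_k, b_k) pairs

def log_prob_y(y, flows):
    """Change-of-variables estimate of log p(y), as in Eq. (5): invert the
    maps in reverse order and subtract the accumulated log-det terms from
    the log-density of a standard Normal prior p(z)."""
    log_det = 0.0
    z = y
    for a, b in reversed(flows):        # z = f_1^{-1} o ... o f_n^{-1}(y)
        z = (z - b) / a
        log_det += np.log(np.abs(a))    # sum_k log|det df_k/dz_{k-1}|
    log_pz = -0.5 * (z ** 2 + np.log(2.0 * np.pi))  # standard Normal prior
    return log_pz - log_det

print(log_prob_y(1.2, flows))
```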
Choosing these functions is difficult because they need to be expressive while guaranteeing an efficient calculation of the Jacobian determinant, which usually has a cubic cost. An alternative approach is given by CNF models [16]. CNFs use continuous, time-dependent transformations instead of a sequence of discrete functions $f_n, \ldots, f_1$. Formally, we introduce a function $g_\beta(z(t), t)$, parametrized by $\beta$, that models the dynamics of $z(t)$: $\frac{\partial z(t)}{\partial t} = g_\beta(z(t), t)$. In the CNF setting, we aim at finding a solution $y := z(t_1)$ of the differential equation, assuming a given initial state $z := z(t_0)$ with a known prior. As a consequence, the transformation function $f_\beta$ is defined as:

$$y = f_\beta(z) = z + \int_{t_0}^{t_1} g_\beta(z(t), t)\, dt. \quad (6)$$

The inverted form of the transformation can be easily computed using the formula:

$$z = f_\beta^{-1}(y) = y - \int_{t_0}^{t_1} g_\beta(z(t), t)\, dt. \quad (7)$$

The log-probability of $y$ can be computed by:

$$\log p(y) = \log p(f_\beta^{-1}(y)) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial g_\beta}{\partial z(t)}\right) dt, \quad \text{where } f_\beta^{-1}(y) = z. \quad (8)$$

Figure 3: The general architecture of our approach. The input data are embedded by the feature extractor h(·) and then used to create a kernel for the GP. Next, the output z of the GP is adjusted using an invertible mapping f(·), which is conditioned on the output of the feature extractor. This allows us to model complex distributions of the target values y.

## 4 Non-Gaussian Gaussian Processes

Figure 2: General idea of NGGP. A complex multi-modal distribution can be modelled by exploiting a continuous invertible transformation to fit the Normal distribution used by the GP. Image inspired by Figure 1 in [16].

In this work, we introduce Non-Gaussian Gaussian Processes (NGGPs) to cope with the significant bottlenecks of Gaussian Processes for few-shot regression tasks: reduced flexibility and the assumption of high similarity between the structures of subsequent tasks. We propose to model the posterior predictive distribution as non-Gaussian on each datapoint. We do so by incorporating the flexibility of CNFs. However, we do not stack a CNF on top of the GP to model the multidimensional distribution over y. Instead, we attack the problem with an invertible ODE-based mapping that operates on each component of the random variable vector and creates a specific mapping for each datapoint (see Figure 2).

The general overview of our method is presented in Figure 3. Consider the data matrix X, which stores the observations $x_i$ for a given task. Each element is processed by a feature extractor h(·) to create the latent embeddings. Next, we model the distribution of the latent variable z with a GP. Further, we use an invertible mapping f(·) in order to model more complex data distributions. Note that the transformation is also conditioned on the output of the feature extractor h(·) to include additional information about the input.

The rest of this section is organized as follows. In Section 4.1, we demonstrate how the marginal likelihood is calculated during training. In Section 4.2, we demonstrate how to perform inference with the model. Finally, in Section 4.3, we show how the model is applied to the few-shot setting.

### 4.1 Training objective

Consider a GP with feature extractor $h_\phi(\cdot)$ parametrized by $\phi$ and any kernel function $k_\theta(\cdot, \cdot)$ parametrized by $\theta$. Assuming the given input data $X$ and corresponding output values $z$, we can define the marginal log-probability for the GP:

$$\log p(z \mid X, \phi, \theta) = -\frac{1}{2} z^T (K + \sigma^2 I)^{-1} z - \frac{1}{2} \log |K + \sigma^2 I| - \frac{D}{2} \log(2\pi), \quad (9)$$

where $D$ is the dimension of $y$, $K$ is the kernel matrix, and $K_{i,j} = k_\theta(h_\phi(x_i), h_\phi(x_j))$.
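For illustration, the following sketch evaluates an objective of the form of Equation (9) in plain NumPy. The two-layer random feature map standing in for $h_\phi$ and the linear ("NN Linear"-style) kernel on top of it are placeholders chosen for this example, not the architecture or implementation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the feature extractor h_phi (a fixed random map).
W1, W2 = rng.normal(size=(16, 1)), rng.normal(size=(8, 16))
def h_phi(X):                      # X: (n, 1) -> features: (n, 8)
    return np.tanh(X @ W1.T) @ W2.T

def gp_marginal_log_likelihood(X, z, sigma=0.1):
    """Eq. (9)-style objective: log N(z | 0, K + sigma^2 I), where the kernel
    k(x, x') = <h_phi(x), h_phi(x')> is evaluated on the embedded inputs."""
    H = h_phi(X)
    K = H @ H.T + sigma**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # stable solve / log-det
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    return (-0.5 * z @ alpha
            - np.sum(np.log(np.diag(L)))             # = -0.5 * log|K + sigma^2 I|
            - 0.5 * len(X) * np.log(2.0 * np.pi))

X = rng.uniform(-5, 5, size=(5, 1))
z = np.sin(X).ravel()
print(gp_marginal_log_likelihood(X, z))
```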
Taking into account Equation (8), we can express the log marginal likelihood as follows:

$$\log p(y \mid X, \phi, \theta, \beta) = \log p(z \mid X, \phi, \theta) - \int_{t_0}^{t_1} \mathrm{Tr}\left(\frac{\partial g_\beta}{\partial z(t)}\right) dt, \quad (10)$$

where $f_\beta^{-1}(y) = z$, $p(z \mid X, \phi, \theta)$ is the marginal defined by Equation (9), and $f_\beta(\cdot)$ is the transformation given by Equation (6). In the next stage of the pipeline, we propose to apply the flow transformation $f_\beta^{-1}(\cdot)$ independently to each one of the marginal elements in $y$, that is $f_\beta^{-1}(y) = [f_\beta^{-1}(y_1), \ldots, f_\beta^{-1}(y_D)]^T$, with $f_\beta^{-1}(\cdot)$ sharing its parameters across all components. In other words, while the GP captures the dependency across the variables, the flow operates independently on the marginal components of $y$. Additionally, the flow is conditioned on the information encoded by the feature extractor, such that it can account for the context information $h_\phi(x_d)$ from the corresponding input value $x_d$:

$$y_d = f_\beta(z_d, h_\phi(x_d)) = z_d + \int_{t_0}^{t_1} g_\beta(z_d(t), t, h_\phi(x_d))\, dt. \quad (11)$$

The inverse transformation can be easily calculated with the following formula:

$$z_d = f_\beta^{-1}(y_d, h_\phi(x_d)) = y_d - \int_{t_0}^{t_1} g_\beta(z_d(t), t, h_\phi(x_d))\, dt. \quad (12)$$

The final marginal log-likelihood can be expressed as:

$$\log p(y \mid X, \phi, \theta, \beta) = \log p(z_h \mid X, \phi, \theta) - \sum_{d=1}^{D} \int_{t_0}^{t_1} \frac{\partial g_\beta}{\partial z_d(t)}\, dt, \quad (13)$$

where $z_h = f_\beta^{-1}(y, h_\phi(X))$ is the vector of inverse transformations $f_\beta^{-1}(y_d, h_\phi(x_d))$ given by Equation (12). The transformation described above can be paired with popular CNF models. Here we choose FFJORD [16], which has been shown to perform better on low-dimensional data than discrete flows like RealNVP [5] or Glow [19]. Note that the CNF is applied independently to the components of the GP outputs and shared across them. Therefore, we do not have any issue with the estimation of the Jacobian, since it corresponds to the first-order derivative of the output w.r.t. the scalar input.

Algorithm 1: NGGP in the few-shot setting, train and test functions.
Require: train dataset D = {T_n}_{n=1}^N and test task T* = (S*, Q*).
Parameters: θ kernel hyperparameters, φ feature extractor parameters, β flow transformation parameters.
Hyperparameters: α, η, γ: step-size hyperparameters for the optimizers.
1: function TRAIN(D, α, η, γ, θ, φ, β)
2:   while not done do
3:     Sample task T = (X, y) ∼ D
4:     L = −log p(y | X, θ, φ, β)   ▷ see Equation (13)
5:     θ ← θ − α ∇_θ L   ▷ update kernel hyperparameters
6:     φ ← φ − η ∇_φ L   ▷ update feature extractor parameters
7:     β ← β − γ ∇_β L   ▷ update flow transformation parameters
8:   end while
9:   return θ, φ, β
10: end function
11: function TEST(T*, θ, φ, β)
12:   Assign support S* = (X_{*,s}, y_{*,s}) and query Q* = (X_{*,q}, y_{*,q})
13:   return p(y_{*,q} | X_{*,q}, y_{*,s}, X_{*,s}, θ, φ, β)   ▷ see Equation (14)
14: end function

### 4.2 Inference with the model

At inference time, we estimate the posterior predictive distribution $p(y_* \mid X_*, y, X, \phi, \theta, \beta)$, where we have access to training data $(y, X)$ and model the probability of the $D_*$ test outputs $y_*$ given the inputs $X_*$. The posterior has a closed-form expression (see Section 3). Since the transformation given by Equation (11) operates independently on the outputs, we are still able to model the posterior in closed form:

$$\log p(y_* \mid X_*, y, X, \phi, \theta, \beta) = \log p(z_{h*} \mid X_*, z_h, X, \phi, \theta) - \sum_{d=1}^{D_*} \int_{t_0}^{t_1} \frac{\partial g_\beta}{\partial z_d(t)}\, dt, \quad (14)$$

where $z_{h*} = f_\beta^{-1}(y_*, h_\phi(X_*))$ and $z_h = f_\beta^{-1}(y, h_\phi(X))$ are the inverted transformations for the test and train data, and $p(z_{h*} \mid X_*, z_h, X, \phi, \theta)$ is the GP posterior described in Equation (4).
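Putting the pieces together, the sketch below illustrates the structure of the per-task objective of Equation (13) as it would be used inside the TRAIN loop of Algorithm 1. It is assumption-heavy: a conditional affine bijection stands in for the paper's FFJORD flow, and a toy random feature map stands in for the learned extractor $h_\phi$; only the structure (per-component inverse transform, its log-derivative correction, and the GP marginal of Equation (9)) mirrors the method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins: feature extractor h_phi and a conditional affine
# bijection y_d = exp(s(h_d)) * z_d + t(h_d), replacing the ODE-based flow
# of the paper for illustration only.
W = rng.normal(size=(8, 1))
ws, wt = rng.normal(size=8) * 0.1, rng.normal(size=8) * 0.1
h_phi = lambda X: np.tanh(X @ W.T)                  # (n, 1) -> (n, 8)
s = lambda H: H @ ws                                # per-datapoint log-scale
t = lambda H: H @ wt                                # per-datapoint shift

def nggp_loss(X, y, sigma=0.1):
    """Negative of an Eq. (13)-style objective: GP marginal log-likelihood of
    the inverse-mapped targets plus the per-component log-derivative of the
    inverse flow (here an affine map conditioned on h_phi(x_d))."""
    H = h_phi(X)
    z = (y - t(H)) * np.exp(-s(H))                  # f_beta^{-1}(y_d, h_phi(x_d))
    log_det_inv = -np.sum(s(H))                     # sum_d log |d f^{-1} / d y_d|
    K = H @ H.T + sigma**2 * np.eye(len(X))         # NN-Linear-style kernel on h_phi(X)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, z))
    log_gp = (-0.5 * z @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(X) * np.log(2.0 * np.pi))
    return -(log_gp + log_det_inv)                  # minimized w.r.t. theta, phi, beta

X = rng.uniform(-5, 5, size=(10, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=10)
print(nggp_loss(X, y))
```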
Figure 4: The results for the sines dataset with mixed noise for the best-performing kernels for NGGP (NN Linear, panel a) and DKT (Spectral, panel b). The top plot in each panel represents the estimated density (blue hue) and the predicted curve (red line), as well as the true test samples (navy blue dots). For three selected input points (denoted by black vertical lines), we plot the obtained marginal densities in the bottom images (red). In addition, for the NGGP method, we also plot the marginal priors (green) for each of these three points. It may be observed that NGGP is more successful in modeling the marginal for varying noise levels.

### 4.3 Adaptation for few-shot regression

In few-shot learning, we are given a meta-dataset of tasks $\mathcal{D} = \{\mathcal{T}_n\}_{n=1}^{N}$, where each task $\mathcal{T}_n$ contains a support set $\mathcal{S}_n$ and a query set $\mathcal{Q}_n$. At training time, both support and query sets contain input-output pairs $(X, y)$, and the model is trained to predict the targets in the query set given the support. At evaluation time, we are given a previously unseen task $\mathcal{T}_* = (\mathcal{S}_*, \mathcal{Q}_*)$, and the model is used to predict the target values of the unlabeled query points. We are interested in few-shot regression, where inputs are vectors and outputs are scalars. We follow the paradigm of Deep Kernel Transfer (DKT) introduced in [29] and propose the following training and testing procedures (see Algorithm 1). During the training stage, we randomly sample a task, calculate the loss defined by Equation (13), and update all the parameters using gradient-based optimization. During testing, we simply identify the query and support sets and calculate the posterior given by Equation (14).

## 5 Experiments

In this section, we provide an extensive evaluation of our approach (NGGP) on a set of challenging few-shot regression tasks. We compare the results with other baseline methods used in this domain. As quantitative measures, we use the standard mean squared error (MSE) and, when applicable, the negative log-likelihood (NLL).

**Sines dataset.** We start by comparing NGGP to other few-shot learning algorithms in a simple regression task defined on sine functions. To this end, we adapt the dataset from [9], in which every task is composed of points sampled from a sine wave with amplitude in the range [0.1, 5.0], phase in the range [0, π], and Gaussian noise N(0, 0.1). The input points are drawn uniformly at random from the range [-5, 5]. We consider 5 support and 5 query points during training and 5 support and 200 query points during inference. In addition, following [29], we also consider an out-of-range scenario, in which the range during inference is extended to [-5, 10]. We also perform a variation of the sines experiment in which we inject input-dependent noise. The target values in this setting are modeled by $A \sin(x + \varphi) + |x + \varphi|\,\epsilon$, where the amplitude, phase, input, and noise points are drawn from the same distributions as in the standard setup described before. We refer to this dataset ablation as mixed-noise sines. For more information about the training regime and architecture, refer to Supplementary Materials A.

Table 1 presents the results of the experiments. We use the DKT method as a reference since it provides state-of-the-art results for the few-shot sines dataset [29]. For a report with more baseline methods, please refer to Supplementary Materials B.

Table 1: The MSE and NLL results for the inference tasks on the sines datasets in the in-range and out-of-range settings. Lowest results in bold (the lower the better).

**sines**

| Method | in-range MSE | in-range NLL | out-of-range MSE | out-of-range NLL |
|---|---|---|---|---|
| DKT + RBF | 1.36 ± 1.64 | -0.76 ± 0.06 | 2.94 ± 2.70 | -0.69 ± 0.06 |
| DKT + Spectral | **0.02 ± 0.01** | **-0.83 ± 0.03** | 0.04 ± 0.03 | -0.70 ± 0.14 |
| DKT + NN Linear | **0.02 ± 0.02** | -0.73 ± 0.11 | 6.61 ± 31.63 | 38.38 ± 40.16 |
| NGGP + RBF | 1.02 ± 1.40 | -0.74 ± 0.07 | 3.02 ± 2.53 | -0.65 ± 0.08 |
| NGGP + Spectral | **0.02 ± 0.01** | **-0.83 ± 0.05** | **0.03 ± 0.02** | **-0.80 ± 0.07** |
| NGGP + NN Linear | 0.04 ± 0.03 | -0.73 ± 0.10 | 7.34 ± 12.85 | 29.86 ± 27.97 |

**mixed-noise sines**

| Method | in-range MSE | in-range NLL | out-of-range MSE | out-of-range NLL |
|---|---|---|---|---|
| DKT + RBF | 1.60 ± 1.63 | 0.48 ± 0.22 | 2.99 ± 2.37 | 2.01 ± 0.59 |
| DKT + Spectral | **0.18 ± 0.12** | 0.37 ± 0.16 | 1.33 ± 1.10 | 1.58 ± 0.40 |
| DKT + NN Linear | **0.18 ± 0.11** | 0.45 ± 0.23 | 5.85 ± 12.10 | 8.64 ± 6.55 |
| NGGP + RBF | 1.30 ± 1.36 | 0.33 ± 0.16 | 3.90 ± 2.60 | 1.83 ± 0.53 |
| NGGP + Spectral | 0.22 ± 0.14 | 0.44 ± 0.19 | **1.14 ± 0.90** | **1.35 ± 0.38** |
| NGGP + NN Linear | 0.20 ± 0.12 | **0.17 ± 0.15** | 4.74 ± 6.29 | 2.92 ± 1.93 |
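For reference, the sine-wave task construction described above fits in a few lines. The sketch below follows the stated ranges; it treats the 0.1 noise level as a standard deviation and the mixed-noise variant as $A \sin(x + \varphi) + |x + \varphi|\,\epsilon$, both of which are assumptions about details not fully spelled out here.

```python
import numpy as np

def sample_sine_task(rng, n_support=5, n_query=5, x_range=(-5.0, 5.0),
                     mixed_noise=False):
    """One few-shot regression task: y = A*sin(x + phi) + noise, with the
    amplitude/phase ranges used in the sines experiments. In the mixed-noise
    variant the noise is scaled by |x + phi| (input-dependent)."""
    A = rng.uniform(0.1, 5.0)
    phi = rng.uniform(0.0, np.pi)
    x = rng.uniform(*x_range, size=n_support + n_query)
    eps = rng.normal(0.0, 0.1, size=x.shape)        # 0.1 taken as std (assumed)
    noise = eps * np.abs(x + phi) if mixed_noise else eps
    y = A * np.sin(x + phi) + noise
    return (x[:n_support], y[:n_support]), (x[n_support:], y[n_support:])

rng = np.random.default_rng(0)
(support_x, support_y), (query_x, query_y) = sample_sine_task(rng)
```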
Both DKT and our NGGP perform very well when paired with the Spectral Mixture kernel, achieving the same performance on in-range data. However, our approach gives superior results in the out-of-range scenario, confirming that NGGP is able to provide a better estimate of the predictive posterior for the unseen portions of the task. It is also worth noting that in all settings, NGGP consistently achieves the best NLL results. This is particularly evident for the in-range mixed-noise sines dataset. We analyze this result in Figure 4, where NGGP successfully models the distribution of the targets, predicting narrow marginals for the more centralized points and using wider distributions for the points with larger noise magnitude. This is in contrast with DKT, which fails to capture different noise levels within the data. These observations confirm our claim that NGGP is able to provide a good estimate in the case of heteroscedastic data.

**Head-pose trajectory.** In this experiment, we use the Queen Mary University of London multiview face dataset [13]. This dataset is composed of grayscale face images of 37 people (32 train, 5 test). There are 133 facial images per person, covering a viewsphere of ±90° in yaw and ±30° in tilt at 10° increments. We follow the evaluation procedure provided in [29]. Each task consists of randomly sampled trajectories taken from this discrete manifold. The in-range scenario includes the full manifold, while the out-of-range scenario includes only the leftmost 10 angles. At evaluation time, the inference is performed over the full manifold with the goal of predicting the tilt. The results are provided in Table 2. In terms of MSE, our NGGP method is competitive with other approaches, but it achieves significantly better NLL results, especially in the out-of-range setting. This suggests that NGGPs are indeed able to adapt to the differences between the tasks seen at training time and the tasks seen at evaluation time by providing a probability distribution that accurately captures the true underlying data.

Table 2: Quantitative results for the Queen Mary University of London dataset in the in-range and out-of-range settings, taking into account the NLL and MSE measures.

| Method | in-range MSE | in-range NLL | out-of-range MSE | out-of-range NLL |
|---|---|---|---|---|
| Feature Transfer/1 | 0.25 ± 0.04 | - | 0.20 ± 0.01 | - |
| Feature Transfer/100 | 0.22 ± 0.03 | - | 0.18 ± 0.01 | - |
| MAML (1 step) | 0.21 ± 0.01 | - | 0.18 ± 0.02 | - |
| DKT + RBF | 0.12 ± 0.04 | 0.13 ± 0.14 | 0.14 ± 0.03 | 0.71 ± 0.48 |
| DKT + Spectral | 0.10 ± 0.01 | 0.03 ± 0.13 | 0.07 ± 0.05 | 0.00 ± 0.09 |
| DKT + NN Linear | 0.04 ± 0.03 | -0.12 ± 0.12 | 0.12 ± 0.05 | 0.30 ± 0.51 |
| NGGP + NN Linear | 0.02 ± 0.02 | -0.47 ± 0.32 | 0.06 ± 0.05 | 0.24 ± 0.91 |
| NGGP + Spectral | 0.03 ± 0.03 | -0.68 ± 0.23 | 0.03 ± 0.03 | -0.62 ± 0.24 |

**Object pose prediction.** We also study the behavior of NGGP on a pose prediction dataset introduced in [54]. Each task in this dataset consists of 30 gray-scale images with resolution 128 × 128, divided evenly into support and query. The tasks are created by selecting an object from the Pascal 3D [51] dataset, rendering it in 100 random orientations, and sampling 30 of these representations.
The goal is to predict the orientation relative to a fixed canonical pose. Note that 50 randomly selected objects are used to create the meta-training dataset, while the remaining 15 are utilized to create a distinct meta-test set. Since the number of objects in meta-training is small, a model could memorize the canonical pose of each object and then use it to predict the target value, completely disregarding the support points during inference. This would lead to poor performance on the unseen objects in the meta-test tasks. This special case of overfitting is known as the memorization problem [54]. We analyze the performance of GP-based models in this setting by evaluating the DKT and NGGP models (information about the architecture and training regime is given in Supplementary Materials A). We compare them against the methods used in [54], namely MAML [9], Conditional Neural Processes (CNP) [12], and their meta-regularized versions devised to address the memorization problem, MR-MAML and MR-CNP [54]. In addition, we also include the fine-tuning (FT) baseline and CNP versions with standard regularization techniques such as Bayes-by-Backprop (BbB) [2] and Weight Decay [20]. The results are presented in Table 3.

Table 3: Quantitative results for the object pose prediction task. We report the mean and standard deviation over 5 trials. The lower the better. Asterisks (*) denote values reported in [54].

| Method | MSE | NLL |
|---|---|---|
| MAML* | 5.39 ± 1.31 | - |
| MR-MAML* | 2.26 ± 0.09 | - |
| CNP* | 8.48 ± 0.12 | - |
| MR-CNP* | 2.89 ± 0.18 | - |
| FT* | 7.33 ± 0.35 | - |
| FT + Weight Decay* | 6.16 ± 0.12 | - |
| CNP + Weight Decay* | 6.86 ± 0.27 | - |
| CNP + BbB* | 7.73 ± 0.82 | - |
| DKT + RBF | 1.82 ± 0.17 | 1.35 ± 0.10 |
| DKT + Spectral | 1.79 ± 0.15 | 1.30 ± 0.06 |
| NGGP + RBF | 1.98 ± 0.27 | 0.22 ± 0.08 |
| NGGP + Spectral | 2.34 ± 0.28 | 0.86 ± 0.45 |

Both GP-related approaches, NGGP and DKT, are on par with or usually outperform the standard and meta-regularized methods, which indicates that they are less prone to memorization and therefore benefit from better generalization. The NLL is significantly lower for NGGP than for DKT, confirming that NGGP is better at inferring complex data distributions.

**Power dataset.** In this series of experiments, we use the Power [1] dataset and define a few-shot experimental setting. We treat each time series composed of 1440 values (60 minutes × 24 hours) that represents the daily power consumption (sub_metering_3) as a single task. We train the model using the tasks from the first 50 days, randomly sampling 10 points per task, while validation tasks are generated by randomly selecting from the following 50 days. Quantitative and qualitative analyses are provided in Figure 5. We use only the NLL to assess the results due to the multi-modal nature of the data, and we analyze the value of the criterion for different numbers of support examples. NGGP better adjusts to the true data distribution, even in the presence of very few support examples during inference. This experiment supports the claim that NGGPs are well suited for modeling multi-modal distributions and step functions.

Figure 5: The results for the Power dataset experiment: (a) the quantitative comparison (NLL) between DKT and NGGP considering different numbers of support examples; (b) the power consumption for a single day randomly selected from the test data, comparing DKT and NGGP (with the RBF kernel) with 10 and 100 support points. NGGP captures multi-modality and thus better adjusts to the data distribution.
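A rough sketch of the Power task construction described above is given below, assuming the sub_metering_3 readings have already been loaded into a flat per-minute array (loading, cleaning, and any normalization are omitted, and the minute-of-day input encoding is an assumption made for this example).

```python
import numpy as np

def make_power_tasks(readings, first_day=0, n_days=50, points_per_task=10,
                     rng=None):
    """Split a flat per-minute power-consumption series into daily tasks of
    1440 values each and subsample a few (x = time of day, y = reading)
    pairs per task, roughly as in the Power experiment."""
    if rng is None:
        rng = np.random.default_rng(0)
    tasks = []
    for day in range(first_day, first_day + n_days):
        series = readings[day * 1440:(day + 1) * 1440]
        idx = rng.choice(1440, size=points_per_task, replace=False)
        x = idx.astype(np.float64) / 1440.0          # normalized minute of day (assumed)
        tasks.append((x, series[idx]))
    return tasks

readings = np.abs(np.random.default_rng(0).normal(size=100 * 1440))  # placeholder data
train_tasks = make_power_tasks(readings, first_day=0, n_days=50)
val_tasks = make_power_tasks(readings, first_day=50, n_days=50)
```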
**NASDAQ and EEG datasets.** In order to test the performance of our methods on real-world time-series prediction, we use two datasets: NASDAQ100 [30] and EEG [8]. For an extensive description of the datasets and the evaluation regime of this experiment, see Supplementary Materials A. Quantitative results are presented in Table 4. Our experiments show that NGGP outperforms the baseline DKT method across all datasets. The improvement is especially visible for the out-of-range NASDAQ100 setting when both methods use the RBF kernel. The results suggest that NGGPs can be successfully used to model real-world datasets, even when the data does not follow a Gaussian distribution.

Table 4: Quantitative results for the NASDAQ100 and EEG datasets.

(a) NASDAQ100

| Method | in-range MSE × 100 | in-range NLL | out-of-range MSE × 100 | out-of-range NLL |
|---|---|---|---|---|
| NGGP + RBF | 0.012 ± 0.014 | -3.092 ± 0.255 | 0.016 ± 0.034 | -2.978 ± 0.571 |
| NGGP + NN Linear | 0.023 ± 0.044 | -2.567 ± 1.235 | 0.003 ± 0.004 | -2.998 ± 0.260 |
| DKT + NN Linear | 0.027 ± 0.032 | -2.429 ± 0.271 | 0.005 ± 0.006 | -2.612 ± 0.059 |
| DKT + RBF | 0.022 ± 0.042 | -2.878 ± 0.706 | 0.181 ± 0.089 | 1.049 ± 2.028 |

(b) EEG

| Method | in-range MSE × 100 | in-range NLL | out-of-range MSE × 100 | out-of-range NLL |
|---|---|---|---|---|
| NGGP + RBF | 0.222 ± 0.181 | -1.715 ± 0.282 | 0.463 ± 0.415 | -1.447 ± 0.221 |
| NGGP + NN Linear | 0.361 ± 0.223 | -1.387 ± 0.273 | 0.452 ± 0.578 | -1.046 ± 0.624 |
| DKT + NN Linear | 0.288 ± 0.169 | -1.443 ± 0.188 | 0.528 ± 0.642 | -1.270 ± 0.622 |
| DKT + RBF | 0.258 ± 0.218 | -1.640 ± 0.237 | 0.941 ± 0.917 | -1.242 ± 0.685 |

## 6 Conclusions

In this work, we introduced NGGP, a generalized probabilistic framework that addresses the main limitation of Gaussian Processes, namely their rigidity in modeling complex distributions. NGGP leverages the flexibility of Normalizing Flows to modulate the posterior predictive distribution of GPs. Our approach offers a robust solution for few-shot regression since it finds a shared set of parameters between consecutive tasks while being adaptable to dissimilarities and domain shifts. We have provided an extensive empirical validation of our method, verifying that it can obtain state-of-the-art performance on a wide range of challenging datasets. In future work, we will focus on few-shot regression applications that require the estimation of an exact probability distribution (e.g., continuous object tracking) and on settings where there is a potential discontinuity in the similarity of subsequent tasks (e.g., continual learning).

**Limitations.** The main limitation of NGGPs is the cost of learning flow-based models, which can be higher than that of a standard DKT when the data come from a simple distribution. In such cases, other methods like DKT could be more efficient. Moreover, GPs are expensive for tasks with a large number of observations, making NGGP a better fit for few-shot learning than for larger settings. Finally, in some cases, it can be more challenging to train and fine-tune NGGP than DKT because the overall number of parameters and hyperparameters is larger (e.g., the parameters of the flow).

**Broader Impact.** Gaussian Processes for regression have already had a huge impact on various real-world applications [7, 53, 21, 25]. NGGPs make it possible to apply a priori knowledge and expertise to even more complex real-world systems, providing fair and human-conscious solutions, e.g., in neuroscience or social studies (see the experiments on individual power consumption, EEG, and NASDAQ data in Section 5). The proposed method is efficient and represents a great tool for better uncertainty quantification.
Possible applications of our method must be carefully considered to minimize any negative societal impact. For instance, the use of NGGP in object tracking could be harmful if deployed with malevolent and unethical intents in applications involving mass surveillance.

## Acknowledgments

This research was funded by the Foundation for Polish Science (grant no. POIR.04.04.00-00-14DE/18-00, carried out within the Team-Net program co-financed by the European Union under the European Regional Development Fund) and the National Science Centre, Poland (grant no. 2020/39/B/ST6/01511). The work of M. Zieba was supported by the National Centre of Science (Poland) Grant No. 2020/37/B/ST6/03463. The work of P. Spurek was supported by the National Centre of Science (Poland) Grant No. 2019/33/B/ST6/00894. This research was funded by the Priority Research Area Digiworld under the program Excellence Initiative - Research University at the Jagiellonian University in Kraków. The authors have applied a CC BY license to any Author Accepted Manuscript (AAM) version arising from this submission, in accordance with the grants' open access conditions.

## References

[1] Individual household electric power consumption data set. https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption. Accessed: 2021-05-25.
[2] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613-1622. PMLR, 2015.
[3] Sofiane Brahim-Belhouari and Amine Bermak. Gaussian process for nonstationary time series prediction. Computational Statistics & Data Analysis, 47(4):705-712, 2004.
[4] Wei-Yu Chen, Yen-Cheng Liu, Zsolt Kira, Yu-Chiang Frank Wang, and Jia-Bin Huang. A closer look at few-shot classification. arXiv preprint arXiv:1904.04232, 2019.
[5] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[6] Yingjun Du, Haoliang Sun, Xiantong Zhen, Jun Xu, Yilong Yin, Ling Shao, and Cees G. M. Snoek. MetaKernel: Learning variational random features with limited labels, 2021.
[7] Vincent Dutordoir, Hugh Salimbeni, Marc Deisenroth, and James Hensman. Gaussian process conditional density estimation, 2018.
[8] S. M. Fernandez-Fraga, M. A. Aceves-Fernandez, J. C. Pedraza-Ortega, and J. M. Ramos-Arreguin. Screen task experiments for EEG signals based on SSVEP brain computer interface. International Journal of Advanced Research, 6(2):1718-1732, 2018. Accessed: 2021-05-25.
[9] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, pages 1126-1135. PMLR, 2017.
[10] Chelsea Finn, Kelvin Xu, and Sergey Levine. Probabilistic model-agnostic meta-learning. arXiv preprint arXiv:1806.02817, 2018.
[11] Vincent Fortuin, Heiko Strathmann, and Gunnar Rätsch. Meta-learning mean functions for Gaussian processes. arXiv preprint arXiv:1901.08098, 2019.
[12] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola, Danilo J. Rezende, S. M. Eslami, and Yee Whye Teh. Neural processes. arXiv preprint arXiv:1807.01622, 2018.
[13] Shaogang Gong, Stephen McKenna, and John J. Collins. An investigation into face pose distributions. In Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pages 265-270. IEEE, 1996.
[14] Jonathan Gordon, John Bronskill, Matthias Bauer, Sebastian Nowozin, and Richard E. Turner. Meta-learning probabilistic inference for prediction. arXiv preprint arXiv:1805.09921, 2018.
[15] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes. arXiv preprint arXiv:1801.08930, 2018.
[16] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
[17] James Harrison, Apoorva Sharma, and Marco Pavone. Meta-learning priors for efficient online Bayesian regression. In International Workshop on the Algorithmic Foundations of Robotics, pages 318-337. Springer, 2018.
[18] Ghassen Jerfel, Erin Grant, Thomas L. Griffiths, and Katherine Heller. Reconciling meta-learning and continual learning with online mixtures of tasks. arXiv preprint arXiv:1812.06080, 2018.
[19] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039, 2018.
[20] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pages 950-957, 1992.
[21] Neil Lawrence, Guido Sanguinetti, and Magnus Rattray. Modelling transcriptional regulation using Gaussian processes. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems, volume 19. MIT Press, 2007.
[22] Miguel Lázaro-Gredilla. Bayesian warped Gaussian processes. Advances in Neural Information Processing Systems, 25:1619-1627, 2012.
[23] Miguel Lázaro-Gredilla. Bayesian warped Gaussian processes. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012.
[24] Yan Li, Ethan X. Fang, Huan Xu, and Tuo Zhao. In International Conference on Learning Representations, 2020.
[25] Zhu Li, Adrian Perez-Suay, Gustau Camps-Valls, and Dino Sejdinovic. Kernel dependence regularizers and Gaussian processes with applications to algorithmic fairness, 2019.
[26] Juan Maroñas, Oliver Hamelijnck, Jeremias Knoblauch, and Theodoros Damoulas. Transforming Gaussian processes with normalizing flows. In International Conference on Artificial Intelligence and Statistics, pages 1081-1089. PMLR, 2021.
[27] Fatemeh Najibi, Dimitra Apostolopoulou, and Eduardo Alonso. Enhanced performance Gaussian process regression for probabilistic short-term solar output forecast. International Journal of Electrical Power & Energy Systems, 130:106916, 2021.
[28] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2009.
[29] Massimiliano Patacchiola, Jack Turner, Elliot J. Crowley, Michael O'Boyle, and Amos J. Storkey. Bayesian meta-learning for the few-shot setting via deep kernels. Advances in Neural Information Processing Systems, 33, 2020.
[30] Yao Qin, Dongjin Song, Haifeng Chen, Wei Cheng, Guofei Jiang, and Garrison Cottrell. A dual-stage attention-based recurrent neural network for time series prediction, 2017. Accessed: 2021-05-25.
[31] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939-1959, 2005.
[32] Aravind Rajeswaran, Chelsea Finn, Sham Kakade, and Sergey Levine. Meta-learning with implicit gradients. arXiv preprint arXiv:1909.04630, 2019.
[33] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63-71. Springer, 2003.
[34] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
[35] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[36] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530-1538. PMLR, 2015.
[37] Jonas Rothfuss, Vincent Fortuin, Martin Josifoski, and Andreas Krause. PACOH: Bayes-optimal meta-learning with PAC-guarantees. In International Conference on Machine Learning, pages 9116-9126. PMLR, 2021.
[38] Jake Snell, Kevin Swersky, and Richard S. Zemel. Prototypical networks for few-shot learning. arXiv preprint arXiv:1703.05175, 2017.
[39] Jake Snell and Richard Zemel. Bayesian few-shot classification with one-vs-each Pólya-gamma augmented Gaussian processes. arXiv preprint arXiv:2007.10417, 2020.
[40] Edward Snelson, Zoubin Ghahramani, and Carl Rasmussen. Warped Gaussian processes. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems, volume 16. MIT Press, 2004.
[41] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. Advances in Neural Information Processing Systems, 16:337-344, 2004.
[42] Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip H. S. Torr, and Timothy M. Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199-1208, 2018.
[43] Michalis K. Titsias, Sotirios Nikoloutsopoulos, and Alexandre Galashov. Information theoretic meta learning with Gaussian processes. arXiv preprint arXiv:2009.03228, 2020.
[44] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026-5033. IEEE, 2012.
[45] Prudencio Tossou, Basile Dura, Francois Laviolette, Mario Marchand, and Alexandre Lacoste. Adaptive deep kernel learning. arXiv preprint arXiv:1905.12131, 2019.
[46] Martin Trapp, Robert Peharz, Franz Pernkopf, and Carl Edward Rasmussen. Deep structured mixtures of Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 2251-2261. PMLR, 2020.
[47] Arun Venkitaraman, Anders Hansson, and Bo Wahlberg. Task-similarity aware meta-learning through nonparametric kernel regression. arXiv preprint arXiv:2006.07212, 2020.
[48] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching networks for one shot learning. arXiv preprint arXiv:1606.04080, 2016.
[49] Yaqing Wang, Quanming Yao, James T. Kwok, and Lionel M. Ni. Generalizing from a few examples: A survey on few-shot learning. ACM Computing Surveys (CSUR), 53(3):1-34, 2020.
[50] Andrew Wilson and Ryan Adams. Gaussian process kernels for pattern discovery and extrapolation. In International Conference on Machine Learning, pages 1067-1075. PMLR, 2013.
[51] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. Beyond PASCAL: A benchmark for 3D object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pages 75-82. IEEE, 2014.
[52] Jin Xu, Jean-Francois Ton, Hyunjik Kim, Adam Kosiorek, and Yee Whye Teh. MetaFun: Meta-learning with iterative functional updates. In International Conference on Machine Learning, pages 10617-10627. PMLR, 2020.
[53] Dit-Yan Yeung and Yu Zhang. Learning inverse dynamics by Gaussian process regression under the multi-task learning framework. In The Path to Autonomous Robots, pages 1-12. Springer, 2009.
[54] Mingzhang Yin, George Tucker, Mingyuan Zhou, Sergey Levine, and Chelsea Finn. Meta-learning without memorization. arXiv preprint arXiv:1912.03820, 2019.
[55] Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, and Sungjin Ahn. Bayesian model-agnostic meta-learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 7343-7353, 2018.
[56] Maciej Zięba, Marcin Przewięźlikowski, Marek Śmieja, Jacek Tabor, Tomasz Trzcinski, and Przemysław Spurek. RegFlow: Probabilistic flow-based regression for future prediction. arXiv preprint arXiv:2011.14620, 2020.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] See Section 6.
   (c) Did you discuss any potential negative societal impacts of your work? [Yes] See Section 6.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [N/A]
   (b) Did you include complete proofs of all theoretical results? [N/A]
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See Section 5 and Supplementary Materials A.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5 and Supplementary Materials A.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] See Section 5 and Supplementary Materials A.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [No]
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] See Section 5 and Supplementary Materials A.
   (b) Did you mention the license of the assets? [Yes] See Supplementary Materials A.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [Yes] See Supplementary Materials A.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No]
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]