# variational_elliptical_processes__4a7252fc.pdf

Published in Transactions on Machine Learning Research (08/2023)

Variational Elliptical Processes

Maria Bånkestad (maria.bankestad@it.uu.se), RISE Research Institutes of Sweden and Uppsala University, Sweden
Jens Sjölund (jens.sjolund@it.uu.se), Department of Information Technology, Uppsala University, Sweden
Jalil Taghia (jalil.taghia@ericsson.com), Ericsson Research, Sweden
Thomas B. Schön (thomas.schon@it.uu.se), Department of Information Technology, Uppsala University, Sweden

Reviewed on OpenReview: https://openreview.net/forum?id=djN3TaqbdA&

We present elliptical processes, a family of non-parametric probabilistic models that subsumes Gaussian processes and Student's t processes. This generalization includes a range of new heavy-tailed behaviors while retaining computational tractability. Elliptical processes are based on a representation of elliptical distributions as a continuous mixture of Gaussian distributions. We parameterize this mixture distribution as a spline normalizing flow, which we train using variational inference. The proposed form of the variational posterior enables a sparse variational elliptical process applicable to large-scale problems. We highlight advantages compared to Gaussian processes through regression and classification experiments. Elliptical processes can supersede Gaussian processes in several settings, including cases where the likelihood is non-Gaussian or when accurate tail modeling is essential.

1 Introduction

Figure 1: Posterior distributions of elliptical and Gaussian processes with equal kernel hyperparameters and covariance. The shaded areas are credible intervals of the posterior processes. The elliptical credible intervals are wider due to the process's heavier tails.

Systems for autonomous decision-making are increasingly dependent on predictive models. To ensure safety and reliability, it is essential that these models capture uncertainties and risks accurately. Gaussian processes (GPs) offer a framework for probabilistic modeling that is widely used partly because it provides uncertainty estimates. However, these estimates are only reliable to the extent that the model is correctly specified, i.e., that the assumptions of Gaussianity hold true. In practice, heavy-tailed data arise in many real-world applications, including finance (Mandelbrot, 1963), signal processing (Zoubir et al., 2012), and geostatistics (Diggle et al., 1998). We use a combination of normalizing flows and modern variational inference techniques to extend the modeling capabilities of GPs to the more general class of elliptical processes (EPs).

Elliptical processes. Elliptical processes subsume Gaussian processes and Student's t processes (Rasmussen & Williams, 2006; Shah et al., 2014). They are based on the elliptical distribution, a scale mixture of Gaussian distributions that is attractive mainly because it can describe heavy-tailed distributions while retaining most of the Gaussian distribution's computational tractability (Fang et al., 1990). We use a normalizing flow (Rezende & Mohamed, 2015; Papamakarios et al., 2021) to model the continuous scale mixture, which provides an added flexibility that can benefit a range of applications. We explore using elliptical processes as both a prior (over functions) and a likelihood, as well as the combination thereof.
We also explore using EPs as a variational posterior that can adapt its shape to match complex posterior distributions. Variational inference. Variational inference is a tool for approximate inference that uses optimization to find a member of a predefined family of distributions that is close to the target distribution (Wainwright et al., 2008; Blei et al., 2017). Significant advances in the last decade have made variational inference the method of choice for scalable approximate inference in complex parametric models (Ranganath et al., 2014; Hoffman et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014). It is thus not surprising that the quest for more expressive and scalable variations of Gaussian processes has gone hand-in-hand with the developments in variational inference. For instance, sparse GPs use variational inference to select inducing points to approximate the prior (Titsias, 2009). Inducing points are a common building block in deep probabilistic models such as deep Gaussian processes (Damianou & Lawrence, 2013; Salimbeni et al., 2019) and can also be applied in Bayesian neural networks (Maroñas et al., 2021; Ober & Aitchison, 2021). Similarly, the combination of inducing points and variational inference enables scalable approximate inference in models with non-Gaussian likelihoods (Hensman et al., 2013), such as when performing GP classification (Hensman et al., 2015; Wilson et al., 2016b). Figure 2: Left: A contour plot of an elliptical twodimensional, correlated distribution with zero means. The name derives from its elliptical level sets. Right: Three examples of one-dimensional elliptical distributions with zero means and varying tail-heaviness. Elliptical distributions are symmetric around the mean E[X] = µ. However, the closeness of the variational distribution to the target distribution is bounded by the flexibility of the variational distribution. Consequently, the success of deep (neural network) models has inspired various suggestions on flexible yet tractable variational distributions, often based on parameterized transformations of a simple base distribution (Tran et al., 2016). In particular, models using a composition of invertible transformations, known as normalizing flows, have been especially popular (Rezende & Mohamed, 2015; Papamakarios et al., 2021). Our contributions. We propose an adaptation of elliptical distributions and processes in the same spirit as modern Gaussian processes. Our EP construction is similar to that of Maume Deschamps et al. (2017) but aims to turn EPs into a practical modeling tool. Constructing elliptical distributions based on a normalizing flow provides a high degree of flexibility without sacrificing computational tractability. This makes it possible to sidestep the curse of Gaussianity , and adapt to heavy-tailed behavior when called for. We thus foresee many synergies between EPs and recently developed GP methods. We make a first exploration of these, and simultaneously demonstrate the versatility of the elliptical process as a model for the prior and/or the likelihood, or as the variational posterior. 
In more detail, our contributions are: a construction of the elliptical process and the elliptical likelihood as a continuous scale mixture of Gaussian processes parameterized by a normalizing flow; a variational approximation that can either learn an elliptical likelihood or handle known non-Gaussian likelihoods, such as in classification problems; a sparse variational approximation for large-scale problems; and extensions to heteroscedastic data.

2 Background

This section provides the necessary background on elliptical distributions, elliptical processes, and normalizing flow models. Throughout, we consider a regression problem where we are given a set of N scalar observations, y = [y_1, . . . , y_N]^T, at the locations X = [x_1, . . . , x_N]^T, where x_i is D-dimensional. The measurement y_i is assumed to be noisy, such that

y_i = f(x_i) + ϵ_i,   (1)

where ϵ_i is zero-mean i.i.d. noise. The task is to infer the underlying function f : R^D → R.

2.1 Elliptical distributions

Elliptical processes are based on elliptical distributions (Figure 2), which include Gaussian distributions as well as more heavy-tailed distributions, such as the Student's t distribution and the Cauchy distribution. The probability density of a random variable Y ∈ R^N that follows the elliptical distribution can be expressed as

p(u; η) = c_{N,η} |Σ|^{-1/2} g_N(u; η),   (2)

where u := (y − µ)^T Σ^{-1} (y − µ) is the squared Mahalanobis distance, µ is the location vector, Σ is the non-negative definite scale matrix, and c_{N,η} is a normalization constant. The density generator g_N(u; η) is a non-negative function with finite integral, parameterized by η, which determines the shape of the distribution. Elliptical distributions are consistent, i.e., closed under marginalization, if and only if p(u; η) is a scale mixture of Gaussian distributions (Kano, 1994). The density can then be expressed as

p(u; η) = ∫_0^∞ (2πξ)^{-N/2} |Σ|^{-1/2} e^{-u/(2ξ)} p(ξ; η_ξ) dξ,   (3)

using a mixing variable ξ ∼ p(ξ; η_ξ). Any mixing distribution p(ξ; η_ξ) that is strictly positive can be used to define a consistent elliptical process. In particular, we recover the Gaussian distribution if the mixing distribution is a Dirac delta function and the Student's t distribution if it is a scaled inverse chi-square distribution. For more information on the elliptical distribution, see Appendix A.

2.2 Elliptical processes

The elliptical process is defined analogously to a Gaussian process as:

Definition 1. An elliptical process (EP) is a collection of random variables such that every finite subset has a consistent elliptical distribution, where the scale matrix is given by a covariance kernel.

This means that an EP is specified by a mean function µ(x), a scale matrix (a kernel) k(x, x′), and the mixing distribution p(ξ; η_ξ). Since the EP is built upon consistent elliptical distributions, it is closed under marginalization. The marginal mean µ is the same as the mean of the corresponding Gaussian distribution, and the covariance is Cov[Y] = E[ξ] Σ, where Y is an elliptical random variable, Σ is the covariance of the corresponding Gaussian distribution, and ξ is the mixing variable.
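The scale-mixture representation in Equation (3) also gives a direct recipe for drawing sample paths from an EP prior: first draw the mixing variable ξ, then draw a Gaussian path whose kernel matrix is scaled by ξ. The following sketch is purely illustrative; the squared exponential kernel and the log-normal mixing distribution are stand-ins chosen for the example, not the learned spline flow introduced later.

```python
import numpy as np

def se_kernel(x, lengthscale=0.5, variance=1.0):
    # Squared exponential kernel matrix for 1-D inputs x (shape [N]).
    d2 = (x[:, None] - x[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sample_ep_paths(x, n_paths=5, rng=np.random.default_rng(0)):
    # Draw sample paths from an elliptical process prior:
    #   xi ~ p(xi)            (mixing distribution; log-normal stand-in here)
    #   f | xi ~ N(0, xi * K)
    K = se_kernel(x) + 1e-8 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    paths = []
    for _ in range(n_paths):
        xi = rng.lognormal(mean=0.0, sigma=0.75)   # one draw of xi per path
        z = rng.standard_normal(len(x))
        paths.append(np.sqrt(xi) * (L @ z))        # f = sqrt(xi) * L z
    return np.stack(paths)

x = np.linspace(-2.0, 2.0, 100)
paths = sample_ep_paths(x)
print(paths.shape)  # (5, 100)
```

Because a single draw of ξ rescales an entire path, a heavy-tailed mixing distribution occasionally produces paths with much larger amplitude than a GP with the same kernel, which is the behavior reflected in the wider credible intervals of Figure 1.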
Formally, a stochastic process {X_t : t ∈ T} on a probability space (Ω, F, P) consists of random maps X_t : Ω → S_t, t ∈ T, for measurable spaces (S_t, 𝒮_t), t ∈ T (Bhattacharya & Waymire, 2007). We focus on the setting where S = R and the index set T is a subset of R^N, in particular, the half-line [0, ∞). Due to Kolmogorov's extension theorem, we may construct the EP from the family of finite-dimensional, consistent, elliptical distributions; this follows from the restriction to S = R (which is a Polish space) and Kano's characterization above. Note that the above definition of the elliptical process is essentially the same as that in Maume-Deschamps et al. (2017) but makes the use of a covariance kernel explicit. (This definition also appears in Bånkestad et al. (2020), which is an early preprint version of the present article.) Confusingly, there are also definitions of elliptical processes that differ in important ways from ours, notably ones based on Lévy processes (Bingham & Kiesel, 2002), which assume independent increments.

Figure 3: Graphical models of (a) the elliptical likelihood, (b) the EP prior, and (c) the EP with independent elliptical noise, where ω is sampled from the likelihood mixing distribution p(ω; η_ω).

Identifiability. When using a GP for regression or classification, it is typically assumed that the data originate from a single sample path, representing a realization from the GP. However, an elliptical process introduces a hierarchical model wherein ξ is first sampled from the distribution p(ξ; η_ξ), followed by sampling f from GP(f; µ, Kξ). In this structure, inferring the mixing distribution p(ξ; η_ξ) from a single path is impossible since we only have one observation from p(ξ; η_ξ). In other words, to infer the mixing distribution p(ξ; η_ξ), we must have multiple paths drawn from the same EP. Therefore, to infer an elliptical prior from data, the dataset would have to contain multiple sample paths, such as multiple time series, all originating from the same prior.

Prediction. To use the EP for predictions, we need the predictive mean and covariance of the corresponding elliptical distribution. The predictive distribution is guaranteed to be a consistent elliptical distribution, but not necessarily the same as the original one; the shape depends on the training samples. (Recall that consistency only concerns the marginal distribution.) The noise-free predictive distribution can be derived analytically (see Appendix B), but to account for additive elliptical noise, we instead approximate the posterior p(ξ | y; η_ξ) with a variational distribution q(ξ; φ_ξ). The approximate inference framework also lets us incorporate (non-Gaussian) noise according to the graphical models in Figure 3. We aim to model mixing distributions that can capture any shape of the elliptical noise in the data. Thus, to improve upon the piecewise constant parameterization in an earlier version of this work (Bånkestad et al., 2020), we use normalizing flows: a class of methods for learning complex probability distributions.

2.3 Flow-based models

Normalizing flows are a family of generative models that map simple distributions to complex ones through a series of learned transformations (Papamakarios et al., 2021). Suppose we have a random variable x that follows an unknown probability distribution p_x(x). The main idea of a normalizing flow is to express x as a transformation T_γ of a variable z with a known, simple probability distribution p_z(z). The transformation T_γ has to be bijective, and it can have learnable parameters γ. Both T_γ and its inverse have to be differentiable.
A change of variables gives the probability density of x:

p_x(x) = p_z(z) |det J_{T_γ}(z)|^{-1}, where z = T_γ^{-1}(x)

and J_{T_γ} denotes the Jacobian of T_γ. We focus on one-dimensional flows since we are interested in modeling the mixing distribution. In particular, we use linear rational spline flows (Dolatabadi et al., 2020; Durkan et al., 2019), wherein the mapping T_γ is an elementwise, monotonic linear rational spline: a piecewise function where each piece is a linear rational function. The parameters are the number of pieces (bins) and the knot locations.

3 The variational elliptical process

We propose the variational EP with elliptical noise, where the variational EP can learn any consistent elliptical process, and the elliptical noise can capture any consistent elliptical noise. The key idea is to model the mixing distributions with a normalizing flow. The joint probability distribution of the model (see Figure 3c) is

p(y, f, ω, ξ; η) = p(f | ξ; η_f) p(ξ; η_ξ) ∏_{i=1}^N p(y_i | f_i, ω_i) p(ω_i; η_ω),

where the first two factors constitute the prior and the product over i the likelihood. Here, p(f | ξ; η_f) = N(f; 0, Kξ) is a regular GP prior with the covariance kernel K containing the parameters η_f, p(ξ; η_ξ) is the process mixing distribution, and p(ω; η_ω) is the noise mixing distribution. To learn the mixing distributions p(ξ; η_ξ) and p(ω; η_ω) by gradient-based optimization, they need to be differentiable with respect to the parameters η_ξ and η_ω, in addition to being flexible and computationally efficient to evaluate and sample from. Based on these criteria, a spline flow (Section 2.3) is a natural fit. We construct the mixing distributions by transforming a sample from a standard normal distribution with a spline flow. The output of the spline flow is then projected onto the positive real axis using a differentiable function such as Softplus or Sigmoid. In the following sections, we detail the construction of the model and show how to train it using variational inference. For clarity, we describe the likelihood first before combining it with the prior and describing a (computationally efficient) sparse approximation.

3.1 Likelihood

By definition, the likelihood (Figure 3a) describes the measurement noise ϵ_i (Equation (1)). The probability distribution of the independent elliptical likelihood is

p(ϵ_i; σ, η_ω) = ∫ N(ϵ_i; 0, σ²ω_i) p(ω_i; η_ω) dω_i,   (6)

where σ > 0 can be set to unity without loss of generality. In other words, the likelihood is a continuous mixture of Gaussian distributions where, e.g., ϵ_i follows a Student's t distribution if ω is scaled inverse chi-squared distributed.

Parameterization. We parameterize p(ω; η_ω) as a spline flow,

p(ω; η_ω) = p(ζ) |∂T(ζ; η_ω)/∂ζ|^{-1},

although it could be, in principle, any positive, finite probability distribution. Here, p(ζ) = N(0, 1) is the base distribution and ω = T(ζ; η_ω) represents the spline flow transformation followed by a Softplus transformation to guarantee positivity of ω. The flexibility of this flow-based construction lets us capture a broad range of elliptical likelihoods, but we could also specify an appropriate likelihood ourselves. For instance, using a categorical likelihood enables EP classification; see Section 4.5.
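To make the flow-based parameterization concrete, the sketch below carries out the change-of-variables bookkeeping for a one-dimensional transformed base variable: a monotone map followed by Softplus gives ω > 0, and the density follows from the derivative of the composed map. A simple affine map stands in for the rational spline purely for brevity; in the actual model the spline's bin widths, heights, and knot locations are the trainable parameters, but the density computation is identical.

```python
import torch

def forward(zeta, a, b):
    # Stand-in transform: affine map followed by Softplus (guarantees omega > 0).
    h = a * zeta + b
    omega = torch.nn.functional.softplus(h)
    # d omega / d zeta = sigmoid(h) * a   (the derivative of Softplus is the sigmoid)
    log_det = torch.log(torch.sigmoid(h)) + torch.log(torch.abs(a))
    return omega, log_det

def log_prob_omega(omega, a, b):
    # Change of variables: log p(omega) = log p(zeta) - log |d omega / d zeta|.
    h = torch.log(torch.expm1(omega))          # inverse of Softplus
    zeta = (h - b) / a
    base = torch.distributions.Normal(0.0, 1.0)
    log_det = torch.log(torch.sigmoid(h)) + torch.log(torch.abs(a))
    return base.log_prob(zeta) - log_det

a = torch.tensor(0.5, requires_grad=True)      # "flow" parameters (trainable)
b = torch.tensor(0.0, requires_grad=True)
zeta = torch.randn(4)
omega, _ = forward(zeta, a, b)
print(omega, log_prob_omega(omega, a, b))
```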
Training objective. Now, suppose that we observe N independent and identically distributed residuals ϵ_i = y_i − f_i between the observations y and some function f. We are primarily interested in estimating the noise for the purpose of denoising the measurements. Hence, we fit an elliptical distribution to the residuals by maximizing the (log) marginal likelihood with respect to the parameters η_ω, that is,

log p(ϵ; η_ω) = ∑_{i=1}^N log ∫ N(ϵ_i; 0, ω_i) p(ω_i; η_ω) dω_i.   (8)

For general mixing distributions, this integral lacks a closed-form expression, but since it is one-dimensional we can approximate it efficiently by numerical integration (for example, using the trapezoidal rule). Ultimately, we arrive at the likelihood

p(y_i | f_i) = ∫ N(y_i; f_i, ω_i) p(ω_i; η_ω) dω_i.   (9)

Recall that our main objective is to infer the latent function f_∗ = f(x_∗) at arbitrary locations x_∗ ∈ R^D given a finite set of noisy observations y. In probabilistic machine learning, the mapping y ↦ f_∗ is often defined by the posterior predictive distribution

p(f_∗ | y) = ∫ p(f_∗ | f) p(f | y) df,   (10)

which turns modeling into a search for suitable choices of p(f_∗ | f) and p(f | y). Accordingly, the noise estimation described in the previous section is only done in pursuit of this higher purpose.

Sparse formulation. For an elliptical process (EP) we can rewrite the posterior predictive distribution as

p(f_∗ | y) = ∫ p(f_∗ | f, ξ) p(f, u, ξ | y) df du dξ,   (11)

where we are marginalizing not only over the mixing variable ξ and the function values f (at the given inputs X), but also over the function values u at the M so-called inducing inputs Z. Introducing inducing points lets us derive a sparse variational EP, a computationally scalable version of the EP similar to the sparse variational GP (Titsias, 2009). The need for approximation arises because of the intractable second factor p(f, u, ξ | y) in Equation (11). (The first factor p(f_∗ | f, ξ) is simply a normal distribution.) We summarize the sparse variational EP below and refer to Appendices D and E for additional details.

Variational approximation. We make the variational ansatz p(f, u, ξ | y) ≈ p(f | u, ξ) q(u | ξ) q(ξ), where u is the latent function at the inducing input locations Z, and parameterize this variational posterior as an elliptical distribution. We do so for two reasons: first, this makes the variational posterior similar to the true posterior, and second, we can then use the conditional distribution to make predictions. In full detail, we factorize the posterior as

q(f, u, ξ; φ) = p(f | u, ξ; η_f) q(u | ξ; φ_u) q(ξ; φ_ξ),   (12)

where φ = (φ_f, φ_u, φ_ξ) are the variational parameters, q(u | ξ; φ_u) = N(m, Sξ) is a Gaussian distribution with variational parameters m and S, and ξ ∼ q(ξ; φ_ξ) is the mixing distribution. Again, q(ξ; φ_ξ) could be any positive finite distribution, but since we only know that the posterior is elliptical with a data-dependent mixing distribution, and since it is impossible to overfit a variational distribution (Bishop & Nasrabadi, 2006), we choose a very flexible yet tractable model, namely a spline flow. Note that, because of the conditioning on ξ, the first two factors in Equation (12) are a Gaussian conjugate pair in u. Thus, marginalization over u results in a Gaussian distribution, for which the marginals of f only depend on the corresponding input x (Salimbeni et al., 2019):

q(f_∗ | ξ; φ) = N(f_∗ | µ_f(x_∗), σ_f(x_∗) ξ),   (13)
µ_f(x_∗) = k_∗^T K_uu^{-1} m,   (14)
σ_f(x_∗) = k_∗∗ − k_∗^T K_uu^{-1} (K_uu − S) K_uu^{-1} k_∗,   (15)

where k_∗∗ = k(x_∗, x_∗), k_∗ = k(Z, x_∗), and K_uu = k(Z, Z). Predictions on unseen data points x_∗ are then computed according to (see Appendix E)

q(f_∗ | y; x_∗) = E_{q(ξ; φ_ξ)} [N(f_∗ | µ_f(x_∗), σ_f(x_∗) ξ)].   (16)
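The predictive computation in Equations (13) to (16) amounts to a few linear solves against the inducing-point kernel matrix, followed by averaging over draws of ξ from the variational mixing distribution. The sketch below assumes a squared exponential kernel and uses a generic sampler as a placeholder for q(ξ; φ_ξ).

```python
import numpy as np

def se_kernel(a, b, lengthscale=0.5, variance=1.0):
    d2 = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def sparse_ep_predict(x_star, Z, m, S, sample_xi, n_samples=200,
                      lengthscale=0.5, variance=1.0, jitter=1e-8):
    """Predictive mean and variance of the sparse variational EP at x_star.

    Implements Eqs. (13)-(16): the Gaussian conditional given xi, averaged over
    draws xi ~ q(xi).  Z are inducing inputs, (m, S) parameterize q(u | xi),
    and sample_xi() draws from the variational mixing distribution.
    """
    Kuu = se_kernel(Z, Z, lengthscale, variance) + jitter * np.eye(len(Z))
    Ks = se_kernel(Z, x_star, lengthscale, variance)          # k_*, shape [M, N*]
    kss = np.full(len(x_star), variance)                      # k(x_*, x_*)
    A = np.linalg.solve(Kuu, Ks)                              # K_uu^{-1} k_*
    mu = A.T @ m                                              # Eq. (14)
    sigma = kss - np.einsum("mn,mk,kn->n", A, Kuu - S, A)     # Eq. (15)
    xi = np.array([sample_xi() for _ in range(n_samples)])
    # Eq. (16): the mean is unchanged by xi; the variance is scaled by E[xi].
    return mu, np.mean(xi) * sigma

# Toy usage with a log-normal stand-in for q(xi).
rng = np.random.default_rng(0)
Z = np.linspace(-2, 2, 10)
m = rng.standard_normal(10)
S = 0.1 * np.eye(10)
x_star = np.linspace(-2, 2, 50)
mu, var = sparse_ep_predict(x_star, Z, m, S, lambda: rng.lognormal(0.0, 0.5))
```

Since the mean in Equation (16) does not depend on ξ, only the predictive variance is affected by the mixing distribution, through the factor E[ξ].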
For training, we use variational inference (VI), i.e., we maximize the evidence lower bound (ELBO) to indirectly maximize the marginal likelihood. We train the model using stochastic gradient descent and black-box variational inference (Bingham et al., 2019; Wingate & Weber, 2013; Ranganath et al., 2014).

VI training. The marginal likelihood is

p(y; η_f, η_u, η_ξ) = ∫ p(y, f, u, ξ; η_f, η_u, η_ξ) df du dξ = ∫ p(y | f) p(f | u, ξ; η_f) p(u, ξ; η_u, η_ξ) df du dξ.   (17)

This integral is intractable since p(ξ; η_ξ) is parameterized by a spline flow. To overcome this, we approximate the marginal likelihood with the ELBO

L_ELBO(η_f, η_u, η_ξ, φ_f, φ_u, φ_ξ) = E_{q(f,ξ; φ)}[log p(y | f)] − D_KL(q(u, ξ; φ) || p(u, ξ; η)) = ∑_{i=1}^N E_{q(f_i,ξ; φ)}[log p(y_i | f_i)] − D_KL(q(u, ξ; φ) || p(u, ξ; η)).   (18)

Had the likelihood been Gaussian, the expectation E_{q(f_i,ξ; φ)}[log p(y_i | f_i; η_f)] could have been computed analytically. In our case, however, it is elliptical, and we use a Monte Carlo estimate instead. Inserting the elliptical likelihood (Equation (9)) from the previous section yields

L_ELBO = ∑_{i=1}^N E_{q(f_i,ξ; φ)}[log ∫ N(y_i; f_i, ω_i) p(ω_i; η_ω) dω_i] − D_KL(q(u, ξ; φ) || p(u, ξ; η)).   (19)

3.3 Extension to heteroscedastic noise

We now extend the elliptical likelihood to capture heteroscedastic (input-dependent) noise. The main idea is to let the parameters η_{ω_i} of the likelihood's mixing distribution depend on the input location x_i. Heteroscedastic elliptical noise can be useful, for example, in a time series where the noise variance and tail-heaviness change over time; examples can be found in statistical finance (Liu et al., 2020) and robotics (Kersting et al., 2007). To model this, we use a neural network g_{γ_ω}(·) with parameters γ_ω to represent the mapping from input location to spline flow parameters, x_i ↦ η_{ω_i} = g_{γ_ω}(x_i). We train the model by maximizing the log-likelihood

∑_{i=1}^N log ∫ p(y_i | f_i, ω_i) p(ω_i; η_{ω_i} = g_{γ_ω}(x_i)) dω_i.   (20)

Additional information, such as time of day or season for time series data, can be incorporated by simply passing it as extra inputs to the neural network g_{γ_ω}(·).

4 Experiments

We examined the variational elliptical processes in four different experiments. In the first experiment, we investigated how well the elliptical likelihood (Section 3.1) recovers known elliptical noise in synthetic data. In the second experiment, we demonstrated the use of the heteroscedastic EP on a synthetic dataset. In the third experiment, we evaluated regression performance on seven standard benchmarks, comparing the sparse and the heteroscedastic EP formulations with both sparse GP (SVGP) (Hensman et al., 2013) and full GP baselines. Finally, in the fourth experiment, we examined whether using an EP is beneficial in classification tasks.

Implementation. The mixing distribution of the variational EP used a quadratic rational spline flow, where we transformed the likelihood flow p(ω) using Softplus and the posterior flow p(ξ) using arctan to ensure that they were positive (remember that a mixing distribution must be positive, see Equation (3)). We used a squared exponential kernel with independent length scales in all experiments. See Appendix F for further implementation details. Accompanying code is found at .
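Tying Equations (9), (18), and (19) together: each per-datapoint term in the ELBO can be estimated with Monte Carlo samples of (f_i, ξ) and one-dimensional numerical integration over ω_i. The sketch below uses a trapezoidal grid for ω and a log-normal stand-in for the learned mixing density; both choices are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def elliptical_loglik(y_i, f_i, log_p_omega, omega_grid):
    # log p(y_i | f_i) = log \int N(y_i; f_i, omega) p(omega) d omega   (Eq. 9),
    # approximated with the trapezoidal rule on a fixed omega grid.
    integrand = norm.pdf(y_i, loc=f_i, scale=np.sqrt(omega_grid)) * np.exp(log_p_omega(omega_grid))
    return np.log(np.trapz(integrand, omega_grid) + 1e-300)

def elbo_likelihood_term(y_i, mu_i, sigma_i, sample_xi, log_p_omega,
                         omega_grid, n_mc=64, rng=np.random.default_rng(0)):
    # Monte Carlo estimate of E_{q(f_i, xi)}[ log p(y_i | f_i) ]   (Eqs. 18-19):
    # draw xi ~ q(xi), then f_i ~ N(mu_i, sigma_i * xi), and average.
    total = 0.0
    for _ in range(n_mc):
        xi = sample_xi()
        f_i = rng.normal(mu_i, np.sqrt(sigma_i * xi))
        total += elliptical_loglik(y_i, f_i, log_p_omega, omega_grid)
    return total / n_mc

# Toy usage: a log-normal mixing density stands in for the learned spline flow.
omega_grid = np.linspace(1e-3, 5.0, 400)
log_p_omega = lambda w: -np.log(w) - 0.5 * np.log(w) ** 2 / 0.5 - 0.5 * np.log(2 * np.pi * 0.5)
rng = np.random.default_rng(1)
value = elbo_likelihood_term(0.3, mu_i=0.1, sigma_i=0.05,
                             sample_xi=lambda: rng.lognormal(0.0, 0.5),
                             log_p_omega=log_p_omega, omega_grid=omega_grid)
print(value)
```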
Figure 4: The posterior predictive distribution when using a GP with elliptical noise modeled by a spline flow. Each row represents a synthetic dataset with different noise. The first row adds Gaussian noise, ω ∼ δ(ω − 0.04), the second row adds Student's t noise, ω ∼ Scale-Inv-χ²(ν = 4), and the third row adds Cauchy noise, ω ∼ Scale-Inv-χ²(ν = 1). The shaded areas show the 90% credibility areas of the latent function posterior f_∗ and the noisy posterior y_∗. The histograms show the learned and the true noise mixing distribution.

4.1 Noise identification

To examine how well the elliptical likelihood, described in Section 3.1, captures different types of elliptical noise, we created three synthetic datasets with the same latent function f_i = sin(3x_i)/2, sampled independently at N = 200 locations x_i ∼ U(−2, 2). Each dataset was then contaminated with independent elliptical noise ϵ_i added to the latent function, y_i = f_i + ϵ_i. The added noise varied between the three datasets. The first was Gaussian distributed, which corresponds to ω being delta distributed, ω ∼ δ(ω − 0.04). The second was Student's t, which means that ω follows the scaled inverse chi-squared distribution, ω ∼ Scale-Inv-χ²(ν = 4). The third was Cauchy distributed, ω ∼ Scale-Inv-χ²(ν = 1). We trained a sparse variational GP with an elliptical likelihood on each dataset.

Figure 4 illustrates the results. The histograms in the right column compare the learned mixing distribution p(ω; η_ω) to the true mixing distribution (the red curve) from which the noise ϵ_i originated. The learned distributions follow the shape of the true mixing distribution reasonably well, considering the small number of samples, indicating that we can learn the correct likelihood regardless of the noise variance. Furthermore, if the noise is actually Gaussian, then so is the learned likelihood. The left column shows the predictive posterior of the final models, demonstrating that the models managed to learn suitable kernel parameters jointly with the likelihood.

Figure 5: The predictive posterior after training on a small synthetic dataset. The shaded areas show the 95% credibility area of the latent function posterior f_∗ (blue) and the noisy posterior y_∗ (magenta) when using (top row) a GP with elliptical noise modeled by a spline flow and (bottom row) a GP with Gaussian noise. The histograms show the learned and the true noise mixing distribution.

Robust regression on synthetic data. An elliptical likelihood is better at handling outliers and non-Gaussian noise than a Gaussian likelihood because it can match the whole distribution of the noise rather than just a single variance. This is shown in Figure 5, where a GP with an elliptical likelihood and a GP with a Gaussian likelihood were trained on a small synthetic dataset with additive Student's t noise (ν = 4). The Gaussian likelihood approximates the mixing distribution with a single variance at approximately ω = 0.4, while the elliptical likelihood fits the entire mixing distribution. As a result, the GP with the Gaussian likelihood needs to use a shorter length scale to compensate for the thin tail of the likelihood. In contrast, the GP with the elliptical likelihood can focus on the slower variations, thus producing a better fit to the data.
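All three noise types above follow the same scale-mixture template: draw ω from the mixing distribution, then draw ϵ ∼ N(0, ω). The sketch below generates the corresponding synthetic datasets; the unit scale parameter in the scaled inverse chi-square draws is an assumption about conventions rather than the exact setup used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def scale_inv_chi2(nu, tau2, size, rng):
    # Scaled inverse chi-square draws: omega = nu * tau2 / chi2_nu.
    return nu * tau2 / rng.chisquare(nu, size=size)

def make_dataset(noise, n=200, rng=rng):
    x = rng.uniform(-2.0, 2.0, size=n)
    f = np.sin(3.0 * x) / 2.0
    if noise == "gaussian":          # omega ~ delta(omega - 0.04)
        omega = np.full(n, 0.04)
    elif noise == "student_t":       # omega ~ Scale-Inv-chi2(nu = 4); tau2 = 1 assumed
        omega = scale_inv_chi2(4.0, 1.0, n, rng)
    elif noise == "cauchy":          # nu = 1 gives Cauchy-distributed noise
        omega = scale_inv_chi2(1.0, 1.0, n, rng)
    y = f + rng.normal(0.0, np.sqrt(omega))
    return x, y

x, y = make_dataset("student_t")
```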
4.2 Elliptical heteroscedastic noise

In this experiment, we aimed to exemplify the benefits of using heteroscedastic elliptical noise as described in Section 3.3. To this end, we created the synthetic dataset shown in Figure 6. It consisted of 150 samples generated by adding heteroscedastic noise to the function f(x_i) = sin(5x_i) + x_i, where x_i ∼ U(0, 4). Specifically, we added Student's t noise, ϵ(x_i) ∼ St(ν(x_i), σ(x_i)), where the degrees of freedom followed ν(x_i) = 25 − 11|x_i + 1|^{0.9} and the scale σ(x_i) = 0.5|x_i + 1|^{1.6} + 0.001. We used a variational sparse GP with heteroscedastic noise as described in Section 3.3.

The experimental results, depicted in Figure 6, show that, qualitatively, despite the rapid change in the noise distribution and the low number of training samples, the model captures the varying noise in terms of both the scale and the increasing heaviness of the tail. Remember that a single spike in the mixing distribution, as at x = −0.7, indicates that the noise is Gaussian, whereas the wider the mixing distribution is, as at x = 0.7, the heavier-tailed the noise is. When the synthetic data has Gaussian noise, so does the learned elliptical noise. Similarly, when the synthetic noise is heavier-tailed, so is the learned mixing distribution. This indicates that the model could be helpful for data with varying elliptical noise.

4.3 Regression

We conducted experiments on seven datasets from the UCI repository (Dua & Graff, 2017) to study the impact of the elliptical noise, the elliptical posterior, and heteroscedastic noise. We used various regression models based on a Gaussian process (GP) prior; see Table 1 for a summary. As baselines, we compare with the sparse variational GP model of Hensman et al. (2013), which we call SVGP, and an exact GP.

Models evaluated. We used a GP model with elliptical noise (EP-GP) to compare its performance to the traditional GP model with Gaussian noise. Theoretically, an elliptical posterior should result from combining a Gaussian prior and an elliptical likelihood, but in this case we approximated the posterior with a Gaussian. We also included a model that used an elliptical posterior (EP-EP) to explore the potential benefits of the theoretically more accurate elliptical posterior. Additionally, we tested a heteroscedastic elliptical noise model (Het-EP) and a heteroscedastic Gaussian noise model (Het-GP) to compare their performance. The difference between these two is that in Het-GP the neural network only predicts the noise variance, whereas the Het-EP model predicts the 26 parameters corresponding to the nine bins of the spline flow.
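The amortization used by the heteroscedastic models is a small regression network from the input location to the parameters of the per-point mixing distribution (Section 3.3). The sketch below maps x_i to 26 unconstrained outputs, matching the nine-bin spline mentioned above; the hidden layer sizes and the interpretation of the output slices are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixingParamNet(nn.Module):
    """Maps an input location x_i to the spline-flow parameters eta_{omega_i}.

    For a nine-bin rational spline, 26 numbers are predicted (e.g., 9 widths,
    9 heights, and 8 interior derivatives); the exact split is an assumption.
    """
    def __init__(self, in_dim=1, n_params=26, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_params),
        )

    def forward(self, x):
        return self.net(x)   # unconstrained outputs; constrained inside the flow

g = MixingParamNet()
x = torch.linspace(0.0, 4.0, 150).unsqueeze(-1)   # one column per input dimension
eta = g(x)                                        # shape [150, 26]
```

Extra covariates, such as time of day, can be appended as additional input columns, as noted at the end of Section 3.3.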
Figure 6: The results from training a GP with heteroscedastic elliptical noise on a synthetic dataset. The top row shows the posterior distribution, where the shaded areas are the 95% credibility areas of the latent posterior f_∗ (magenta) and the noisy posterior y_∗ (blue). The histograms in the middle row show the noise mixing distributions at the different x-values indicated by the vertical dashed lines in the top plot. The bottom row shows the mixing distribution used when creating the synthetic data.

Figure 7: Predictive negative log-likelihood (LL) (top) and mean squared error (MSE) (bottom) on held-out test data from the regression benchmarks (smaller is better): MPG (n_train = 294, d = 7), Concrete (n_train = 778, d = 8), Elevators (n_train = 9960, d = 18), Bike (n_train = 10427, d = 17), California (n_train = 12384, d = 8), Kin40k (n_train = 24000, d = 8), and Protein (n = 27438, d = 9). We show the average of the ten splits as a dot and the standard deviation as a line. The models in bold font are our models. Note that the spread of the error varies between the datasets; for example, the MSE for the Bike dataset is low for all six models. Overall, the EP posterior outperforms the GP posterior in terms of log-likelihood on the five larger datasets.

First, we summarize the results in Figure 7, and then we discuss the results of each method in more detail. The figure displays the mean and standard deviation over ten randomly chosen training, validation, and test data splits. The training procedure for all models optimizes the log-likelihood, directly or indirectly. Therefore, the most relevant figure of merit is the negative test log-likelihood (LL), shown in the top row of Figure 7. We stress that the log-likelihood is more informative than the mean squared error (MSE), because the MSE does not consider the predictive variance. However, we show the MSE on held-out test sets in the bottom row for completeness. Additional details on the experiments can be found in Appendix G.

Table 1: The different types of models we trained on the regression datasets.

NAME      APPROXIMATION   LOSS                  LIKELIHOOD   POSTERIOR
Exact GP  Exact           Marginal likelihood   Gaussian     Gaussian
SVGP      Variational     ELBO                  Gaussian     Gaussian
EP-GP     Variational     ELBO                  Elliptical   Gaussian
EP-EP     Variational     ELBO                  Elliptical   Elliptical
Het-GP    Variational     ELBO                  Gaussian     Gaussian
Het-EP    Variational     ELBO                  Elliptical   Gaussian

GP baseline. To assess the quality of the approximations introduced, we first establish an exact GP baseline that makes predictions without any approximation. We trained the GP hyperparameters using L-BFGS and early stopping on a validation dataset. For this to be feasible on large datasets, we used the black-box matrix-matrix multiplication inference procedure (Gardner et al., 2018; Wang et al., 2019). In the following sections, we discuss each method in detail.

Variational GP approximation. First, we compare the exact GP baseline with its variational approximation, i.e., SVGP. Consider first the results on MPG and Concrete, where SVGP did not make use of inducing points due to the small sample size. Consequently, SVGP's worse performance on Concrete is only due to the change of inference method. For the other datasets, we investigated the effect of the number of inducing points on the predictive log-likelihood; see Figure 11 in Appendix H. The dependence is very similar for all methods. In particular, the performance saturates at roughly 500 inducing points on all datasets except Kin40k, which continues to improve.
However, the relative performance of the different methods on Kin40k is fairly stable.

Elliptical likelihood. Next, we consider whether it is advantageous to use an elliptical likelihood instead of a Gaussian. To this end, we compare the performance of SVGP and EP-GP, which only differ in this respect. The results show that switching to an elliptical likelihood improves the log-likelihood on most datasets, as would be expected theoretically.

Elliptical posterior. We now compare EP-GP to EP-EP to analyze the potential benefit of having an elliptical posterior. On three of the datasets (Bike, Kin40k, and Protein), which are all relatively large, the elliptical posterior produces a clear improvement in log-likelihood. On the other datasets, the performance is similar, possibly because the posterior is well approximated by a Gaussian. Regardless, we conclude that, when using an elliptical likelihood, an elliptical posterior is preferable over a Gaussian one.

Heteroscedastic models. Is there an additional benefit of having heteroscedastic noise? On the two smallest datasets (MPG and Concrete), the answer is clearly no: the heteroscedastic models perform worse than SVGP and EP-GP in terms of both log-likelihood and mean squared error, indicating potential overfitting and that regularization may be warranted. (Note that the most relevant comparisons are Het-GP vs. SVGP and Het-EP vs. EP-GP.) On the remaining datasets, however, the heteroscedastic models clearly outperform SVGP and EP-GP in terms of log-likelihood. On the other hand, they perform poorly in terms of mean squared error; in fact, worse than SVGP on all datasets. Hypothetically, this is because the heteroscedastic models attribute too much variation to the likelihood, thus sacrificing the mean-function prediction. This could potentially be mitigated by decoupling the mean and the covariance models (Salimbeni et al., 2018; Jankowiak et al., 2020). Another option would be to increase the weight of the KL-divergence term in the ELBO (Higgins et al., 2017). We expect such improvements to be more critical for the Het-EP model, which has a more flexible likelihood than Het-GP. Still, Het-EP already performs slightly better than Het-GP on the three largest datasets. Note, however, that the EP-EP model often achieves both a log-likelihood similar to the heteroscedastic models and a mean squared error similar to SVGP.

Figure 9: The predictive posterior after training the elliptical process on the wind power dataset (years 2010-2017 on the horizontal axis, wind power production in GWh on the vertical axis). The shaded areas show the 95% credibility areas of the latent function posterior f_∗ and the noisy posterior y_∗.

Computational considerations. Empirically, we found that replacing the Gaussian likelihood with the elliptical likelihood had a minor impact on the computational demand. Further changing to an elliptical posterior increased the computational time per iteration, but faster convergence partially compensated for this. Finally, modeling heteroscedastic noise with a neural network adds significant complexity, but this was offset by running it on a GPU.

Prediction accuracy. In summary, the results show that an elliptical likelihood gives predictive log-likelihoods that are better than or on par with those of a Gaussian likelihood. However, the advantage is less significant on small datasets.
Similarly, the more flexible elliptical posterior tends to produce better results. However, when looking at the mean squared error (MSE), the exact GP outperforms the other models. Thus, if predictive accuracy is the main objective, an exact GP (or, even better, a neural network) may be the best choice. However, the well-known scalability issues of the exact GP clearly limit its applicability. In such scenarios, our results suggest that EP-EP is a better choice than SVGP.

4.4 Application: Forecasting wind power production

Figure 8: The mixing distribution p(ω) of the elliptical noise.

Wind power accounts for a significant and increasing share of global power production. However, since wind power generation is inherently stochastic and hard to control, it poses severe challenges to power system balance and stability (Impram et al., 2020). In principle, these could be addressed via sophisticated control-theoretic tools such as stochastic model predictive control (Schildbach et al., 2014), but that requires accurate and reliable wind power forecasts. In this section, we illustrate the applicability of the elliptical process to time series forecasting. Specifically, we consider the task of making a probabilistic forecast of German country-wide totals of wind power production at the end of 2016 based on data from the preceding seven years. The data comes from Wiese et al. (2019). Since the data is strictly positive, we model the time series in the log domain, using an elliptical process with elliptical noise (EP-EP). To capture the periodicity in the data, we use a sum of a periodic kernel and a linear kernel. See Appendix I for more details.

Figure 9 presents the predictive distribution from the elliptical process. The model successfully captures the underlying annual periodicity and slowly increasing trend while producing qualitatively reasonable credibility regions. Figure 8 shows the mixing distribution of the trained elliptical likelihood, which is rather broad and hence non-Gaussian. The two modes discernible in the mixing distribution reflect the seasonal changes in the underlying data. For comparison, Appendix I shows the corresponding results when using a full GP and a heteroscedastic EP: the full GP seems more prone to overfit to short-scale trends in the data and produces credible regions that are too wide, while the heteroscedastic EP produces a fit that is overall on par with the EP model but disentangles the seasonal variations.

4.5 Binary classification

Figure 10: The classification AUC (Area Under the Curve), accuracy, and predictive log-likelihood from the ten-fold cross-validation (higher is better), for the three classification datasets (including Mammographic; see Appendix J). We show the average of the ten splits as a dot and the standard deviation as a line.

To evaluate the EP on classification tasks, we perform variational EP and GP classification by simply replacing the likelihood with a binary one. To compute the expectation in Equation (18), we first sample f_i ∼ N(f_i | µ_f(x_i), σ_f(x_i) ξ) and then evaluate the likelihood as Ber(Sigmoid(f_i)). This setting is interesting since we do not have a likelihood that captures the noise in the data; instead, the process itself has to do it.
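A sketch of this Monte Carlo classification likelihood follows: draw ξ from the variational mixing distribution, draw f_i from the Gaussian conditional in Equation (13), and score the binary label under a Bernoulli likelihood with a sigmoid link. The log-normal sampler standing in for q(ξ; φ_ξ) is an assumption for illustration only.

```python
import torch

def classification_loglik(y, mu_f, sigma_f, sample_xi, n_mc=64):
    """Monte Carlo estimate of E_q[log p(y | f)] for binary labels y in {0, 1}.

    mu_f and sigma_f are the per-point predictive mean and scale from
    Eqs. (14)-(15); sample_xi() draws from the variational mixing distribution.
    """
    total = torch.zeros_like(mu_f)
    for _ in range(n_mc):
        xi = sample_xi()
        f = mu_f + torch.sqrt(sigma_f * xi) * torch.randn_like(mu_f)   # f ~ N(mu, sigma * xi)
        total += torch.distributions.Bernoulli(logits=f).log_prob(y)   # Ber(Sigmoid(f))
    return total / n_mc

# Toy usage with a log-normal stand-in for q(xi).
mu_f = torch.tensor([0.3, -1.2, 2.0])
sigma_f = torch.tensor([0.5, 0.2, 0.1])
y = torch.tensor([1.0, 0.0, 1.0])
sample_xi = lambda: torch.distributions.LogNormal(0.0, 0.5).sample()
print(classification_loglik(y, mu_f, sigma_f, sample_xi))
```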
Therefore, we can indicate the value of the elliptical process itself without the elliptical noise. We compare two variational EP models with a variational GP model. The two EPs differ in the prior mixing distributions, where the first model has a GP prior and a EP posterior. For the second model, we replace the GP prior to an elliptical one. We can see the trainable prior mixing distribution as using a continuously scaled mixture of Gaussian processes, which can be more expressive than a single GP. To evaluate the models, we performed a ten-fold cross-validation where we trained the models on three classification datasets, described in Appendix J. Figure 10 presents the results from ten folds. From the area under the curve and the log-likelihood score, we see that the EP separates the two classes better, especially using the sparse models. The variational elliptical distribution is sufficient to get a high log-likelihood while training the mixing distribution of the EP did not further improve the score. 5 Related work In general, attempts at modeling heavy-tailed stochastic processes modify either the likelihood or the stochastic process prior rarely both. Approximate inference is typically needed when going beyond Gaussian likelihoods (Neal, 1997; Jylänki et al., 2011), e.g., for robust regression, but approximations that preserve analytical tractability have been proposed (Shah et al., 2014). Elliptical processes gain flexibility by learning the mixing distribution, which makes training and inference reasonably efficient and the resulting predictions interpretable. It is, however, also possible to construct deep process priors by composing several stochastic process layers (Damianou & Lawrence, 2013) and including non-linear transformations as activation functions (Aitchison et al., 2021). The tractability of such deep processes depends on the specifics of the distributions involved. For instance, the deep inverse Wishart process (Aitchison et al., 2021) uses an inverse Wishart prior over the kernel just as the Student s t process (Shah et al., 2014). This suggests it might be possible to generalize this approach to deep elliptical processes. While intriguing, we leave this to future work. Other attempts at creating more expressive GP priors include Maroñas et al. (2021), who used a GP in combination with a normalizing flow, and Luo & Sun (2017), who used a discrete mixture of Gaussian processes. Similar ideas combining mixtures and normalizing flows have also been proposed to create more expressive likelihoods (Abdelhamed et al., 2019; Daemi et al., 2019; Winkler et al., 2019; Rivero & Dvorkin, 2020) and variational posteriors (Nguyen & Bonilla, 2014). Non-stationary extensions of Gaussian Published in Transactions on Machine Learning Research (08/2023) processes, such as when modeling heteroscedastic noise, are somewhat rare. Examples include Zhao et al. (2021) who propose a hierarchical model in parameter space, the mixture model of Li et al. (2021), and the variational model of Lázaro-Gredilla & Titsias (2011). Deep kernel learning (Calandra et al., 2016; Wilson et al., 2016b;a) is another class of deep GPs that uses a neural network to learn the input features of a GP. A similar approach was taken by Ma et al. 
(2019), who describe a class of stochastic processes where the finite-dimensional distributions are only defined implicitly as a parameterized transformation of some base distribution, thereby generalizing earlier work on warped Gaussian processes (Snelson et al., 2003; Rios & Tobar, 2019). However, the price of this generality is that standard variational inference is no longer possible. Based on an assumption of a Gaussian likelihood, they describe an alternative based on the wake-sleep algorithm by Hinton et al. (1995). In the statistics literature, it is well-known that elliptical processes can be defined as scale-mixtures of Gaussian processes (Huang & Cambanis, 1979; O Hagan, 1991; O Hagan et al., 1999). However, unlike in machine learning, little emphasis is placed on building the models from data (i.e., training). These models have found applications in environmental statistics because of the field s inherent interest in modeling spatial extremes (Davison et al., 2012). Like us, several works take the mixing distribution as the starting point and make localized predictions of quantiles (Maume-Deschamps et al., 2017) or other tail-risk measures (Opitz, 2016). 6 Conclusions The Gaussian distribution is the default choice in statistical modeling for good reasons. Even so, far from everything is Gaussian casually pretending it is, comes at a risk. The elliptical distribution offers a computationally tractable alternative that can capture heavy-tailed distributions. The same reasoning applies when comparing the Gaussian process to the elliptical process. A sensible approach in many applications would be to start from the weaker assumptions of the elliptical process and let the data decide whether the evidence supports Gaussianity. We constructed the elliptical process as a scale mixture of Gaussian distributions. By parameterizing the mixing distribution using a normalizing flow, we showed how a corresponding elliptical process can be trained using variational inference. The variational approximation we propose enables us to capture heavy-tailed posteriors and makes it straightforward to create a sparse variational elliptical process that scales to large datasets. We performed both experiments on regression and classification. In particular, we investigated the benefits of various combinations of elliptical posterior and elliptical likelihood and their heteroscedastic counterparts. We concluded that using an elliptical likelihood and an elliptical posterior often achieves a better loglikelihood and similar mean squared error as the sparse variational GP. The added flexibility of elliptical processes could benefit a range of classical and new applications. However, advanced statistical models are not a cure-all, and one needs to avoid overreliance on such models, especially in safety-critical applications. 7 Acknowledgments We thank Sebastian Mair and Zheng Zhao for providing feedback on the manuscript. We also want to thank the reviewers whose comments significantly improved the quality and clarity of our paper. This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation, the Swedish Foundation for Strategic Research grant SM19-0029, and by Kjell och Märta Beijer Foundation. Published in Transactions on Machine Learning Research (08/2023) Abdelrahman Abdelhamed, Marcus A Brubaker, and Michael S Brown. Noise flow: Noise modeling with conditional normalizing flows. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3165 3173, 2019. Laurence Aitchison, Adam Yang, and Sebastian W Ober. Deep kernel processes. In International Conference on Machine Learning (ICML), pp. 130 140. PMLR, 2021. Jesús Alcalá-Fdez, Alberto Fernández, Julián Luengo, Joaquín Derrac, Salvador García, Luciano Sánchez, and Francisco Herrera. Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing, 17, 2011. Maria Bånkestad, Jens Sjölund, Jalil Taghia, and Thomas Schön. The elliptical processes: a family of fat-tailed stochastic processes. ar Xiv preprint ar Xiv:2003.07201, 2020. Rabindra Nath Bhattacharya and Edward C Waymire. A basic course in probability theory, volume 69. Springer, 2007. Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 20(1):973 978, 2019. Nicholas H Bingham and Rüdiger Kiesel. Semi-parametric modelling in finance: theoretical foundations. Quantitative Finance, 2(4):241, 2002. Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006. David M Blei, Alp Kucukelbir, and Jon D Mc Auliffe. Variational inference: A review for statisticians. Journal of the American statistical Association, 112(518):859 877, 2017. Roberto Calandra, Jan Peters, Carl Edward Rasmussen, and Marc Peter Deisenroth. Manifold Gaussian processes for regression. In 2016 International joint conference on neural networks (IJCNN), pp. 3338 3345. IEEE, 2016. Atefeh Daemi, Hariprasad Kodamana, and Biao Huang. Gaussian process modelling with Gaussian mixture likelihood. Journal of Process Control, 81:209 220, 2019. Andreas Damianou and Neil D Lawrence. Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207 215, 2013. Anthony C Davison, Simone A Padoan, Mathieu Ribatet, et al. Statistical modeling of spatial extremes. Statistical science, 27(2):161 186, 2012. Peter J Diggle, Jonathan A Tawn, and Rana A Moyeed. Model-based geostatistics. Journal of the Royal Statistical Society: Series C (Applied Statistics), 47(3):299 350, 1998. Hadi Mohaghegh Dolatabadi, Sarah Erfani, and Christopher Leckie. Invertible generative modeling using linear rational splines. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 4236 4246, 2020. Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ ml. Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems (Neur IPS), volume 32, 2019. Hadi Fanaee-T and Joao Gama. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2):113 127, 2014. Published in Transactions on Machine Learning Research (08/2023) Kai-Tai Fang, Samuel Kotz, and Kai Wang Ng. Symmetric multivariate and related distributions. Chapman and Hall, 1990. Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. Gpytorch: Blackbox matrix-matrix gaussian process inference with gpu acceleration. 
In Advances in Neural Information Processing Systems (Neur IPS), volume 31, 2018. James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Ann Nicholson and Padhraic Smyth (eds.), Uncertainty in Artificial Intelligence (UAI), volume 29. AUAI Press, 2013. James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 351 360, 2015. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. In International conference on learning representations (ICLR), 2017. Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The wake-sleep algorithm for unsupervised neural networks. Science, 268(5214):1158 1161, 1995. Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 2013. Steel T Huang and Stamatis Cambanis. Spherically invariant processes: Their nonlinear structure, discrimination, and estimation. Journal of Multivariate Analysis, 9(1):59 83, 1979. Semich Impram, Secil Varbak Nese, and Bülent Oral. Challenges of renewable energy penetration on power system flexibility: A survey. Energy Strategy Reviews, 31:100539, 2020. Martin Jankowiak, Geoff Pleiss, and Jacob Gardner. Parametric Gaussian process regressors. In International Conference on Machine Learning (ICML), pp. 4702 4712. PMLR, 2020. Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227 3257, 2011. Yutaka Kano. Consistency property of elliptic probability density functions. Journal of Multivariate Analysis, 51(1):139 147, 1994. Kristian Kersting, Christian Plagemann, Patrick Pfaff, and Wolfram Burgard. Most likely heteroscedastic Gaussian process regression. In International Conference on Machine Learning (ICML), pp. 393 400, 2007. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015. Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014. Miguel Lázaro-Gredilla and Michalis K Titsias. Variational heteroscedastic Gaussian process regression. In International Conference on Machine Learning (ICML), 2011. Tao Li, Di Wu, and Jinwen Ma. Mixture of robust Gaussian processes and its hard-cut EM algorithm with variational bounding approximation. Neurocomputing, 452:224 238, 2021. Bingqing Liu, Ivan Kiskin, and Stephen Roberts. An overview of Gaussian process regression for volatility forecasting. In International Conference on Artificial Intelligence in Information and Communication (ICAIIC), pp. 681 686, 2020. Chen Luo and Shiliang Sun. Variational mixtures of Gaussian processes for classification. In International Joint Conferences on Artificial Intelligence (IJCAI), volume 357, pp. 4603 4609, 2017. Published in Transactions on Machine Learning Research (08/2023) Chao Ma, Yingzhen Li, and José Miguel Hernández-Lobato. Variational implicit processes. In International Conference on Machine Learning (ICML), pp. 4222 4233, 2019. Benoit Mandelbrot. The variation of certain speculative prices. The Journal of Business, 36(4):394 419, 1963. 
Juan Maroñas, Oliver Hamelijnck, Jeremias Knoblauch, and Theodoros Damoulas. Transforming gaussian processes with normalizing flows. In Arindam Banerjee and Kenji Fukumizu (eds.), International Conference on Artificial Intelligence and Statistics (AISTATS), 2021. Véronique Maume-Deschamps, Didier Rullière, and Antoine Usseglio-Carleve. Quantile predictions for elliptical random fields. Journal of Multivariate Analysis, 159:1 17, 2017. Radford M Neal. Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report 9702, Department of Statistics, University of Toronto, 1997. Trung V Nguyen and Edwin V Bonilla. Automated variational inference for Gaussian process models. Advances in Neural Information Processing Systems (Neur IPS), 27, 2014. Sebastian W Ober and Laurence Aitchison. Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. In International Conference on Machine Learning (ICML), pp. 8248 8259, 2021. Anthony O Hagan. Bayes Hermite quadrature. Journal of statistical planning and inference, 29(3):245 260, 1991. Anthony O Hagan, Marc C Kennedy, and Jeremy E Oakley. Uncertainty analysis and other inference tools for complex computer codes. In Bayesian statistics 6, pp. 503 524. Oxford University Press, 1999. Thomas Opitz. Modeling asymptotically independent spatial extremes based on Laplace random fields. Spatial Statistics, 16:1 18, 2016. R Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3): 291 297, 1997. George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1 64, 2021. Rajesh Ranganath, Sean Gerrish, and David Blei. Black Box Variational Inference. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 814 822, 2014. Carl Edward Rasmussen and Christopher K I Williams. Gaussian processes for machine learning. The MIT Press, 2006. p. 194. Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), pp. 1530 1538, 2015. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), pp. 1278 1286, 2014. Gonzalo Rios and Felipe Tobar. Compositionally-warped Gaussian processes. Neural Networks, 118:235 246, 2019. Ana Diaz Rivero and Cora Dvorkin. Flow-based likelihoods for non-Gaussian inference. Physical Review D, 102(10):103507, 2020. Hugh Salimbeni, Ching-An Cheng, Byron Boots, and Marc Deisenroth. Orthogonally decoupled variational Gaussian processes. Advances in neural information processing systems (Neur IPS), 31, 2018. Published in Transactions on Machine Learning Research (08/2023) Hugh Salimbeni, Vincent Dutordoir, James Hensman, and Marc Deisenroth. Deep Gaussian processes with importance-weighted variational inference. In International Conference on Machine Learning (ICML), pp. 5589 5598, 2019. Georg Schildbach, Lorenzo Fagiano, Christoph Frei, and Manfred Morari. The scenario approach for stochastic model predictive control with bounds on closed-loop constraint violations. Automatica, 50(12):3009 3018, 2014. Amar Shah, Andrew Wilson, and Zoubin Ghahramani. Student-t processes as alternatives to Gaussian processes. 
Jack W Smith, James E Everhart, WC Dickson, William C Knowler, and Robert Scott Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Annual Symposium on Computer Application in Medical Care, pp. 261. American Medical Informatics Association, 1988.

Edward Snelson, Zoubin Ghahramani, and Carl Rasmussen. Warped Gaussian processes. In Advances in Neural Information Processing Systems (NeurIPS), volume 16, 2003.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 567–574, 2009.

Dustin Tran, Rajesh Ranganath, and David M Blei. Variational Gaussian process. In International Conference on Learning Representations (ICLR), 2016.

Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. In Advances in Neural Information Processing Systems (NeurIPS), volume 32, 2019.

Frauke Wiese, Ingmar Schlecht, Wolf-Dieter Bunke, Clemens Gerbaulet, Lion Hirth, Martin Jahn, Friedrich Kunz, Casimir Lorenz, Jonathan Mühlenpfordt, Juliane Reimann, et al. Open Power System Data: frictionless data for electricity system modelling. Applied Energy, 236:401–409, 2019.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 370–378. PMLR, 2016a.

Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016b.

David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.

Christina Winkler, Daniel Worrall, Emiel Hoogeboom, and Max Welling. Learning likelihoods with conditional normalizing flows. arXiv preprint arXiv:1912.00042, 2019.

I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998.

Zheng Zhao, Muhammad Emzir, and Simo Särkkä. Deep state-space Gaussian processes. Statistics and Computing, 31:1–26, 2021.

Abdelhak M Zoubir, Visa Koivunen, Yacine Chakhchoukh, and Michael Muma. Robust estimation in signal processing: A tutorial-style treatment of fundamental concepts. IEEE Signal Processing Magazine, 29(4):61–80, 2012.

A The elliptical distribution

The Gaussian distribution, the basic building block of Gaussian processes, has several attractive properties that we wish the elliptical process to inherit, namely (i) closure under marginalization, (ii) closure under conditioning, and (iii) straightforward sampling. This leads us to consider the family of consistent elliptical distributions. Following Kano (1994), we say that a family of elliptical distributions {p(u(y_N); η) | N ∈ ℕ} is consistent if and only if

\int p(u(y_{N+1}); \eta)\, \mathrm{d}y_{N+1} = p(u(y_N); \eta). \quad (21)

In other words, a consistent elliptical distribution is closed under marginalization.
Far from all elliptical distributions are consistent, but a complete characterization of those that are is provided by the following theorem (Kano, 1994).

Theorem 1 An elliptical distribution is consistent if and only if it originates from the integral

p(u; \eta) = \int_0^\infty (2\pi\xi)^{-\frac{N}{2}} |\Sigma|^{-\frac{1}{2}} \exp\!\left(-\frac{u}{2\xi}\right) p(\xi; \eta)\, \mathrm{d}\xi, \quad (22)

where \xi is a mixing variable with the corresponding, strictly positive and finite, mixing distribution p(\xi; \eta) that is independent of N.

This shows that consistent elliptical distributions p(u; η) are scale-mixtures of Gaussian distributions, with a mixing variable ξ ∼ p(ξ; η). Note that any mixing distribution fulfilling Theorem 1 can be used to define a consistent elliptical process. We recover the Gaussian distribution if the mixing distribution is a Dirac delta function, and the Student's t distribution if it is a scaled inverse chi-square distribution. If p(u; η) is a scale-mixture of normal distributions, it has the stochastic representation

Y \mid \xi \sim \mathcal{N}(\mu, \Sigma\xi), \qquad \xi \sim p(\xi; \eta). \quad (23)

Using the following representation of the elliptical distribution,

Y = \mu + \Sigma^{1/2} Z \xi^{1/2}, \quad (24)

where Z follows the standard normal distribution, we get the mean

\mathbb{E}[Y] = \mu + \Sigma^{1/2}\, \mathbb{E}[Z]\, \mathbb{E}[\xi^{1/2}] = \mu \quad (25)

and the covariance

\mathrm{Cov}(Y) = \mathbb{E}\big[(Y - \mu)(Y - \mu)^\top\big] = \mathbb{E}\big[(\Sigma^{1/2} Z \xi^{1/2})(\Sigma^{1/2} Z \xi^{1/2})^\top\big] = \mathbb{E}\big[\xi\, \Sigma^{1/2} Z Z^\top (\Sigma^{1/2})^\top\big] = \mathbb{E}[\xi]\, \Sigma. \quad (26)

The covariance is thus a scaled version of the scale matrix Σ, so to obtain it we need E[ξ]. For instance, if ξ follows the scaled inverse chi-square distribution with ν degrees of freedom, then E[ξ] = ν/(ν − 2), which we recognize from the Student's t distribution, where Cov(Y) = ν/(ν − 2) Σ.
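To make the stochastic representation in Equations (23)–(24) concrete, the following minimal NumPy sketch draws samples from a consistent elliptical distribution and checks the covariance identity in Equation (26). The scaled inverse chi-square mixing distribution used here, which recovers the Student's t, and all variable names are illustrative choices rather than part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_elliptical(mu, Sigma, sample_xi, n_samples):
    """Draw samples via Y = mu + xi^{1/2} * Sigma^{1/2} Z, cf. Eq. (24)."""
    d = len(mu)
    L = np.linalg.cholesky(Sigma)            # a square root Sigma^{1/2}
    Z = rng.standard_normal((n_samples, d))  # standard normal draws
    xi = sample_xi(n_samples)                # mixing variable xi ~ p(xi)
    return mu + np.sqrt(xi)[:, None] * (Z @ L.T)

# Scaled inverse chi-square mixing recovers the Student's t with nu degrees of freedom.
nu = 5.0
scaled_inv_chi2 = lambda n: nu / rng.chisquare(nu, size=n)

mu = np.zeros(2)
Sigma = np.array([[1.0, 0.6], [0.6, 1.0]])
Y = sample_elliptical(mu, Sigma, scaled_inv_chi2, n_samples=200_000)

# The empirical covariance should approach E[xi] * Sigma = nu/(nu-2) * Sigma, cf. Eq. (26).
print(np.cov(Y.T))
print(nu / (nu - 2) * Sigma)
```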
B Conditional distribution

To use the EP for predictions, we need the conditional mean and covariance of the corresponding elliptical distribution, which we derive next. We partition the data as y = [y_1, y_2], where y_1 are the N_1 observed data points, y_2 are the next N_2 data points to predict, and N_1 + N_2 = N. We have the following result:

Proposition 1 If the data y = [y_1, y_2] originate from the consistent elliptical distribution in Equation (3), the conditional distribution originates from the distribution

p_{y_2|y_1}(y_2) = c_{N_1,\eta}\, |\Sigma_{22|1}|^{-\frac{1}{2}} (2\pi)^{-\frac{N_2}{2}} \int_0^\infty \xi^{-\frac{N}{2}} e^{-\frac{u_{2|1} + u_1}{2\xi}}\, p(\xi; \eta)\, \mathrm{d}\xi \quad (27)

with the conditional mean E[y_2 | y_1] = \mu_{2|1} and the conditional covariance

\mathrm{Cov}[y_2 \mid y_1] = \mathbb{E}[\hat{\xi}]\, \Sigma_{22|1}, \qquad \hat{\xi} \sim p(\xi \mid y_1), \quad (28)

where u_1 = (y_1 - \mu_1)^\top \Sigma_{11}^{-1}(y_1 - \mu_1), u_{2|1} = (y_2 - \mu_{2|1})^\top \Sigma_{22|1}^{-1}(y_2 - \mu_{2|1}), and c_{N_1,\eta} is a normalization constant. The conditional scale matrix \Sigma_{22|1} and the conditional mean vector \mu_{2|1} are the same as the covariance matrix and the mean of the conditional Gaussian distribution. The conditional distribution is guaranteed to be a consistent elliptical distribution, but not necessarily the same one as the original; its shape depends on the training samples. (Recall that consistency only concerns the marginal distribution.)

Proof of Proposition 1. The joint distribution of [y_1, y_2] is p(y_1, y_2 | \xi) p(\xi; \eta) and the conditional distribution of y_2 given y_1 is p(y_2 | y_1, \xi) p(\xi | y_1; \eta). For a given \xi, p(y_2 | y_1, \xi) is the conditional normal distribution, so

p(y_2 \mid y_1, \xi) \sim \mathcal{N}(\mu_{2|1}, \Sigma_{22|1}\hat{\xi}), \qquad \hat{\xi} \sim p(\xi \mid y_1; \eta), \quad (29)
\mu_{2|1} = \mu_2 + \Sigma_{21} \Sigma_{11}^{-1}(y_1 - \mu_1), \quad (30)
\Sigma_{22|1} = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{21}^\top, \quad (31)

the same as for the conditional Gaussian distribution. We obtain the conditional distribution p(\xi \mid y_1; \eta) by remembering that

p(y_1 \mid \xi) \sim \mathcal{N}(\mu_1, \Sigma_{11}\xi). \quad (32)

Using Bayes' theorem we get

p(\xi \mid y_1; \eta) \propto p(y_1 \mid \xi)\, p(\xi; \eta) \propto |\Sigma_{11}\xi|^{-1/2} \exp\!\left(-\frac{u_1}{2\xi}\right) p(\xi; \eta) \propto \xi^{-N_1/2} \exp\!\left(-\frac{u_1}{2\xi}\right) p(\xi; \eta). \quad (33)

Recall that u_1 = (y_1 - \mu_1)^\top \Sigma_{11}^{-1}(y_1 - \mu_1). We normalize the distribution by

c_{N_1,\eta}^{-1} = \int_0^\infty \xi^{-N_1/2} \exp\!\left(-\frac{u_1}{2\xi}\right) p(\xi; \eta)\, \mathrm{d}\xi. \quad (34)

The conditional mixing distribution is

p(\xi \mid y_1; \eta) = c_{N_1,\eta}\, \xi^{-N_1/2} \exp\!\left(-\frac{u_1}{2\xi}\right) p(\xi; \eta). \quad (35)

The conditional distribution of y_2 given y_1 is derived by using the consistency formula

p(y_2 \mid y_1) = \frac{1}{|\Sigma_{22|1}|^{1/2} (2\pi)^{N_2/2}} \int_0^\infty \xi^{-N_2/2} \exp\!\left(-\frac{u_{2|1}}{2\xi}\right) p(\xi \mid y_1)\, \mathrm{d}\xi, \quad (36)

where u_{2|1} = (y_2 - \mu_{2|1})^\top \Sigma_{22|1}^{-1}(y_2 - \mu_{2|1}). Using Equation (35) we get

p(y_2 \mid y_1) = \frac{c_{N_1,\eta}}{|\Sigma_{22|1}|^{1/2} (2\pi)^{N_2/2}} \int_0^\infty \xi^{-N/2} e^{-(u_{2|1} + u_1)/(2\xi)}\, p(\xi; \eta)\, \mathrm{d}\xi. \quad (37)

C Derivation of the credible intervals of the elliptical process

We derive the credible interval of the elliptical process by using a Monte Carlo approximation of the integral,

p(-z\sigma < x < z\sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-z\sigma}^{z\sigma} \int_0^\infty \xi^{-1/2} e^{-x^2/(2\xi\sigma^2)}\, p(\xi)\, \mathrm{d}\xi\, \mathrm{d}x \quad (38)
\approx \frac{1}{\sqrt{2\pi}\,\sigma\, m} \sum_{i=1}^{m} \xi_i^{-1/2} \int_{-z\sigma}^{z\sigma} e^{-x^2/(2\xi_i\sigma^2)}\, \mathrm{d}x \quad (39)
= \frac{2}{\sqrt{2\pi}\,\sigma\, m} \sum_{i=1}^{m} \xi_i^{-1/2} \int_{0}^{z\sigma} e^{-x^2/(2\xi_i\sigma^2)}\, \mathrm{d}x \quad (40)
= \frac{2}{\sqrt{\pi}\, m} \sum_{i=1}^{m} \int_{0}^{z/\sqrt{2\xi_i}} e^{-u^2}\, \mathrm{d}u \quad (41)
= \frac{1}{m} \sum_{i=1}^{m} \operatorname{erf}\!\left(\frac{z}{\sqrt{2\xi_i}}\right), \quad (42)

where \xi_1, \ldots, \xi_m are samples from the mixing distribution. For every mixing distribution, we can derive the credibility of the prediction in this way; the number of samples m determines the accuracy of the credible interval.

D Details on the non-sparse variational elliptical process

For a Gaussian process, the posterior of the latent variables f is

p(f \mid y) \propto p(y \mid f)\, p(f). \quad (43)

Here, the prior p(f | X) ∼ N(0, K) is a Gaussian process with kernel matrix K, and the likelihood p(y | f) ∼ N(f, σ²I) is Gaussian. The posterior is

p(f \mid y) = \mathcal{N}\!\left(f \,\middle|\, K (K + \sigma^2 I)^{-1} y,\ (K^{-1} + \sigma^{-2} I)^{-1}\right) \quad (44)

and the predictive distribution at an arbitrary input location x_* is

p(f_* \mid y) = \int p(f_* \mid f)\, p(f \mid y)\, \mathrm{d}f, \quad (45)

where p(f_* | f, x, x_*) is the conditional distribution, which is again Gaussian,

\mathcal{N}\!\left(f_* \,\middle|\, k_*^\top (K + \sigma^2 I)^{-1} y,\ k_{**} - k_*^\top (K + \sigma^2 I)^{-1} k_*\right). \quad (46)

Going back to the elliptical process, we want to derive the predictive distribution. The problem, though, is that the posterior is now intractable. In order to obtain a tractable posterior, we train the model using variational inference, where we approximate the intractable posterior with a tractable one,

p(f, \xi \mid y; \eta_f, \eta_\xi) \approx q(f, \xi; \phi_f, \phi_\xi) = q(f \mid \xi; \phi_f)\, q(\xi; \phi_\xi). \quad (47)

Here \eta_f are the parameters of the GP prior (such as the kernel parameters), \eta_\xi are the parameters of the mixing distribution, and q(f \mid \xi; \phi_f) = \mathcal{N}(m, S\xi), where m and S are variational parameters. The posterior q(\xi; \phi_\xi) is parameterized with any positive distribution, such as a normalizing flow. We use this approximation when we derive the predictive distribution

p(f_* \mid y) = \int p(f_* \mid f, \xi; \eta_f)\, p(f, \xi \mid y; \eta_f, \eta_\xi)\, \mathrm{d}f\, \mathrm{d}\xi \quad (48)
= \int p(f_* \mid f, \xi; \eta_f)\, p(f \mid \xi, y; \eta_f)\, p(\xi \mid y; \eta_\xi)\, \mathrm{d}f\, \mathrm{d}\xi \quad (49)
\approx \int p(f_* \mid f, \xi; \eta_f)\, q(f \mid \xi; \phi_f)\, q(\xi; \phi_\xi)\, \mathrm{d}f\, \mathrm{d}\xi. \quad (50)

By first taking a look at the prior distribution p(f_*, f | \xi) when \xi is held fixed,

\begin{bmatrix} f_* \\ f \end{bmatrix} \Bigg|\, \xi \sim \mathcal{N}\!\left(0,\ \begin{bmatrix} k_{**} & k_*^\top \\ k_* & K \end{bmatrix} \xi\right), \quad (51)

we arrive at the conditional distribution

p(f_* \mid f, \xi; \eta_f) = \mathcal{N}\!\left(k_*^\top K^{-1} f,\ \left(k_{**} - k_*^\top K^{-1} k_*\right)\xi\right). \quad (52)

We use this expression together with the variational approximation to derive the posterior predictive distribution

p(f_* \mid y) \approx \int p(f_* \mid f, \xi; \eta_f)\, q(f \mid \xi; \phi_f)\, q(\xi; \phi_\xi)\, \mathrm{d}f\, \mathrm{d}\xi \quad (53)
= \mathbb{E}_{q(\xi; \phi_\xi)}\!\left[\int p(f_* \mid f, \xi; \eta_f)\, q(f \mid \xi; \phi_f)\, \mathrm{d}f\right] \quad (54)
= \mathbb{E}_{q(\xi; \phi_\xi)}\!\left[\int \mathcal{N}\!\left(f_* \mid k_*^\top K^{-1} f,\ (k_{**} - k_*^\top K^{-1} k_*)\xi\right) \mathcal{N}(f \mid m, S\xi)\, \mathrm{d}f\right] \quad (55)
= \mathbb{E}_{q(\xi; \phi_\xi)}\!\left[\mathcal{N}(f_* \mid \mu_f(x_*), \sigma_f(x_*)\xi)\right], \quad (56)
\mu_f(x_*) = k_*^\top K^{-1} m, \quad (57)
\sigma_f(x_*) = k_{**} - k_*^\top \left(K^{-1} - K^{-1} S K^{-1}\right) k_*. \quad (58)

The predictive variance is \mathbb{E}[\xi]\, \sigma_f(x_*).

Optimizing the ELBO. We train the model by maximizing the evidence lower bound (ELBO), given by

\mathcal{L}_{\mathrm{ELBO}}(\eta_f, \eta_\xi, \phi_f, \phi_\xi) = \mathbb{E}_{q(f, \xi; \phi)}\!\left[\log p(y \mid f)\right] - D_{\mathrm{KL}}\!\left(q(f, \xi; \phi_f, \phi_\xi)\, \|\, p(f, \xi; \eta_f, \eta_\xi)\right). \quad (59)
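The predictive computation in Equations (53)–(58) reduces to a Gaussian predictive whose scale is averaged over samples of the mixing variable. Below is a minimal NumPy sketch of that computation; the function and variable names are ours, and a practical implementation would use Cholesky solves rather than an explicit matrix inverse.

```python
import numpy as np

def ep_predictive(k_star, k_ss, K, m, S, xi_samples):
    """Monte Carlo version of the predictive in Eqs. (53)-(58).

    k_star     : (N,) kernel vector k(X, x_*)
    k_ss       : scalar k(x_*, x_*)
    K          : (N, N) prior kernel matrix
    m, S       : variational mean (N,) and covariance (N, N) of q(f | xi)
    xi_samples : (M,) draws from the variational mixing posterior q(xi)
    """
    K_inv = np.linalg.inv(K)  # for illustration only; prefer Cholesky solves
    mu_f = k_star @ K_inv @ m                                        # Eq. (57)
    sigma_f = k_ss - k_star @ (K_inv - K_inv @ S @ K_inv) @ k_star   # Eq. (58)
    # Predictive mean and variance, with Var[f_*] = E[xi] * sigma_f(x_*).
    return mu_f, xi_samples.mean() * sigma_f
```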
E Details on sparse elliptical processes

Within the variational inference framework, we create a sparse version of the model,

p(f, u; \eta) = \int p(f, u, \xi; \eta)\, \mathrm{d}\xi = \int p(f \mid u, \xi; \eta_f)\, p(u \mid \xi; \eta_u)\, p(\xi; \eta_\xi)\, \mathrm{d}\xi, \quad (60)

where u are outputs of the elliptical process located at the inducing inputs Z. We approximate the posterior with

p(f, u, \xi \mid y; \eta) \approx p(f \mid u, \xi; \eta_f)\, q(u \mid \xi; \phi_u)\, q(\xi; \phi_\xi). \quad (61)

The posterior predictive distribution is then given by

p(f_* \mid y) = \int p(f_* \mid f, u, \xi; \eta)\, p(f, u, \xi \mid y; \eta)\, \mathrm{d}f\, \mathrm{d}u\, \mathrm{d}\xi
\approx \int p(f_* \mid f, u, \xi; \eta)\, p(f \mid u, \xi; \eta_f)\, q(u \mid \xi; \phi_u)\, q(\xi; \phi_\xi)\, \mathrm{d}f\, \mathrm{d}u\, \mathrm{d}\xi
= \int\!\!\int \left[\int p(f_* \mid f, u, \xi; \eta)\, p(f \mid u, \xi; \eta_f)\, \mathrm{d}f\right] q(u \mid \xi; \phi_u)\, q(\xi; \phi_\xi)\, \mathrm{d}u\, \mathrm{d}\xi. \quad (62)

We can simplify the inner expression by using the fact that the elliptical distribution is consistent,

\int p(f_* \mid f, u, \xi; \eta)\, p(f \mid u, \xi; \eta)\, \mathrm{d}f = \int p(f_*, f \mid u, \xi; \eta)\, \mathrm{d}f = p(f_* \mid u, \xi; \eta). \quad (63)

Hence, Equation (62) simplifies to

p(f_* \mid y) \approx \int p(f_* \mid u, \xi; \eta)\, q(u \mid \xi; \phi_u)\, q(\xi; \phi_\xi)\, \mathrm{d}u\, \mathrm{d}\xi, \quad (64)

where q(u \mid \xi; \phi_u) = \mathcal{N}(m_u, S_u\xi) with the variational parameters m_u and S_u, and q(\xi; \phi_\xi) is parameterized, e.g., by a normalizing flow. Finally, we obtain the posterior p(f_* \mid x_*) = \mathbb{E}_{q(\xi; \phi_\xi)}\!\left[\mathcal{N}(f_* \mid \mu_f(x_*), \xi\sigma_f(x_*))\right], where

\mu_f(x_*) = k_*^\top K_{uu}^{-1} m_u \quad (65)
\sigma_f(x_*) = k_{**} - k_*^\top \left(K_{uu}^{-1} - K_{uu}^{-1} S_u K_{uu}^{-1}\right) k_*. \quad (66)

Here, k_* = k(Z, x_*), k_{**} = k(x_*, x_*), and K_{uu} = k(Z, Z).

F Implementation: variational inference

We used the Pyro library (Bingham et al., 2019), a universal probabilistic programming language (PPL) written in Python and backed by PyTorch. In Pyro, we trained the model with variational inference (Kingma & Welling, 2014) by creating "stochastic functions" called the model and the guide, where the model samples from the prior latent distributions p(f, ξ, ω; η) and the observation distribution p(y | f, ω), and the guide samples from the approximate posterior q(f | ξ; φ_f) q(ξ; φ_ξ). We then trained the model by maximizing the evidence lower bound (ELBO), simultaneously optimizing the model parameters η and the variational parameters φ. To implement the model in Pyro, we created the guide and the model (see Algorithm 1) by building upon the already implemented variational Gaussian process. We used the guide and the model to derive the ELBO, which we then optimized with stochastic gradient descent using the Adam optimizer (Kingma & Ba, 2015).

Algorithm 1 PyTorch implementation of the variational sparse elliptical process (VI-EP-EP).
1: procedure MODEL(X, y)
2:   Sample ξ from p(ξ; η_ξ) (normalizing flow)
3:   Sample u from N(0, ξ K_uu)    ▷ take a sample of the latents u and ξ
4:   Derive the variational posterior ∏_{i=1}^{N} q(f_i | ξ; φ) = ∏_{i=1}^{N} N(µ_f(x_i), σ_f(x_i)ξ)    ▷ during training, ξ is sampled from the posterior/guide
5:   Take a Monte Carlo sample f̂_i from each q(f_i | ξ; φ)
6:   For each y_i, approximate ℓ_{y_i} = log ∫ N(y_i | f̂_i, ω) p(ω; η_ω) dω using the trapezoidal rule
7:   Get the log probability of y as ∑_{i=1}^{N} ℓ_{y_i}
8: end procedure
9: procedure GUIDE
10:   Sample ξ from q(ξ; φ_ξ) (normalizing flow)
11:   Sample u from N(m, Sξ)
12: end procedure
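For concreteness, the following is a minimal sketch of the model/guide pattern that Algorithm 1 describes, written with standard Pyro primitives. It is not the paper's implementation: the spline-flow mixing distribution is replaced by a LogNormal stand-in, the elliptical likelihood by a Gaussian one, and names such as rbf, xi_loc, and noise are purely illustrative.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import SVI, Trace_ELBO
from pyro.optim import Adam

# Hypothetical inducing inputs and a simple RBF kernel (illustrative only).
Z = torch.linspace(-2.0, 2.0, 20).unsqueeze(-1)

def rbf(a, b, lengthscale=0.5):
    return torch.exp(-0.5 * torch.cdist(a, b) ** 2 / lengthscale ** 2)

def model(X, y):
    Kuu = rbf(Z, Z) + 1e-4 * torch.eye(len(Z))
    # Mixing variable xi; the paper uses a spline normalizing flow here.
    xi = pyro.sample("xi", dist.LogNormal(0.0, 0.5))
    # Inducing outputs u ~ N(0, xi * Kuu), cf. line 3 of Algorithm 1.
    u = pyro.sample("u", dist.MultivariateNormal(
        torch.zeros(len(Z)), covariance_matrix=xi * Kuu))
    # Conditional mean at the training inputs, cf. Eq. (65).
    f = rbf(X, Z) @ torch.linalg.solve(Kuu, u)
    noise = pyro.param("noise", torch.tensor(0.1),
                       constraint=dist.constraints.positive)
    with pyro.plate("data", len(X)):
        pyro.sample("y", dist.Normal(f, noise * xi.sqrt()), obs=y)

def guide(X, y):
    loc = pyro.param("xi_loc", torch.tensor(0.0))
    scale = pyro.param("xi_scale", torch.tensor(0.5),
                       constraint=dist.constraints.positive)
    xi = pyro.sample("xi", dist.LogNormal(loc, scale))
    m = pyro.param("m", torch.zeros(len(Z)))
    L = pyro.param("L", torch.eye(len(Z)),
                   constraint=dist.constraints.lower_cholesky)
    # q(u | xi) = N(m, S*xi) with S = L L^T, cf. line 11 of Algorithm 1.
    pyro.sample("u", dist.MultivariateNormal(m, scale_tril=xi.sqrt() * L))

svi = SVI(model, guide, Adam({"lr": 0.01}), loss=Trace_ELBO())
# Training loop: repeatedly call svi.step(X_train, y_train) on (N, 1) inputs.
```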
G Regression experiment setup

In the regression experiments in Section 4.3, we ran all experiments using the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 0.01. For the full GP, we used the L-BFGS optimizer to train the hyperparameters. Here, as for the other models, we used early stopping on a validation dataset, saving the model with the best validation log predictive likelihood. For all experiments, we created ten random train/val/test splits with the proportions 0.6/0.2/0.2, except for the two smallest datasets (mpg and concrete), where we omitted the validation set and used train/test proportions of 0.7/0.3. For the test set evaluation, we used the model with the highest predictive probability on the validation set.

For the large datasets (N > 1000), we used 500 inducing points. For the small datasets, we did not use a sparse version of the model but instead set Z = Xtrain and kept the inducing inputs fixed during training. We ran the optimizer for 500 epochs on the large datasets and for 2000 epochs on the small datasets.

Elliptical process setup. The likelihood mixing distribution uses a spline flow with nine bins and a Softplus output transformation. The elliptical posterior mixing distribution uses a spline flow with five bins and a Sigmoid output transformation. These parameters were not tuned and were fixed for all experiments. For the heteroscedastic noise models, we used two-layer neural networks with a hidden dimension of 128. For the elliptical noise, we learned a spline flow with nine bins, which results in 26 hyperparameters to learn, whereas for the heteroscedastic Gaussian likelihood we learned only the variance.
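As an illustration of how such a spline-flow mixing distribution can be assembled, the sketch below builds a one-dimensional nine-bin spline flow with a Softplus output transformation using Pyro's transform classes. It is a sketch rather than the paper's code: the base distribution and the spline bound are assumptions, and SoftplusTransform requires a recent PyTorch.

```python
import torch
import pyro.distributions as dist
from pyro.distributions.transforms import Spline
from torch.distributions.transforms import SoftplusTransform  # PyTorch >= 1.11

# One-dimensional base distribution for the mixing variable xi.
base = dist.Normal(torch.zeros(1), torch.ones(1))

# Nine-bin rational spline followed by a Softplus so that xi stays positive,
# mirroring the likelihood mixing distribution described above.
spline = Spline(1, count_bins=9)
mixing = dist.TransformedDistribution(base, [spline, SoftplusTransform()])

xi = mixing.sample(torch.Size([1000]))  # draws from p(xi; eta_xi)
log_p = mixing.log_prob(xi)             # density values, usable inside the ELBO
# spline.parameters() would be optimized jointly with the kernel hyperparameters.
```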
The regression results from Figure 7 are presented in Tables 2 and 3. Figure 11 presents the outcome for different numbers of inducing points. We see that the results have stabilized for almost all datasets at 500 inducing points. We also notice that the relative log-likelihood between the models stays constant beyond 400-500 inducing points.

Table 2: Predictive mean squared error (MSE) on the hold-out test sets from the experiments. We show the average over the ten random splits, with one standard deviation in parentheses.

          MPG            CONCRETE       ELEVATORS      BIKE           CALIFORNIA     KIN40K         PROTEIN
Het-EP    0.144 (0.018)  0.127 (0.014)  0.149 (0.005)  0.018 (0.002)  0.223 (0.012)  0.141 (0.008)  0.531 (0.009)
Het-GP    0.142 (0.020)  0.178 (0.022)  0.148 (0.005)  0.020 (0.002)  0.230 (0.011)  0.112 (0.007)  0.508 (0.007)
EP-EP     0.122 (0.027)  0.176 (0.022)  0.145 (0.005)  0.007 (0.001)  0.223 (0.012)  0.042 (0.002)  0.433 (0.008)
EP-GP     0.121 (0.026)  0.128 (0.013)  0.145 (0.005)  0.011 (0.001)  0.226 (0.013)  0.056 (0.003)  0.481 (0.007)
SVGP      0.122 (0.018)  0.128 (0.011)  0.143 (0.004)  0.007 (0.001)  0.219 (0.012)  0.047 (0.001)  0.477 (0.007)
Exact GP  0.135 (0.135)  0.103 (0.103)  0.134 (0.134)  0.002 (0.002)  0.134 (0.134)  0.006 (0.000)  0.357 (0.006)

Table 3: Negative log-likelihood (NegLL) on the hold-out test sets from the experiments. We show the average over the ten random splits, with one standard deviation in parentheses.

          MPG            CONCRETE       ELEVATORS      BIKE            CALIFORNIA     KIN40K          PROTEIN
Het-EP    0.463 (0.089)  0.332 (0.033)  0.400 (0.018)  -1.327 (0.020)  0.506 (0.022)  -0.397 (0.012)  0.921 (0.017)
Het-GP    0.530 (0.155)  0.456 (0.074)  0.399 (0.017)  -1.532 (0.047)  0.376 (0.021)  -0.316 (0.010)  0.986 (0.013)
EP-EP     0.266 (0.074)  0.443 (0.083)  0.425 (0.015)  -1.406 (0.028)  0.506 (0.022)  -0.246 (0.020)  0.976 (0.008)
EP-GP     0.268 (0.071)  0.344 (0.031)  0.427 (0.014)  -1.304 (0.023)  0.515 (0.021)  -0.053 (0.049)  1.056 (0.006)
SVGP      0.352 (0.062)  0.382 (0.030)  0.446 (0.013)  -0.865 (0.016)  0.649 (0.022)  -0.028 (0.004)  1.056 (0.006)
Exact GP  0.387 (0.387)  0.206 (0.206)  0.463 (0.463)  -1.103 (-1.103)  0.463 (0.463)  -0.166 (0.097)  0.970 (0.005)

Figure 11: The train and validation negative log-likelihood for the Elevators, Bike, California, Kin40k, and Protein datasets using a varying number of inducing points (200-600).

I Application: Wind power production

In the wind power production experiment, we used an EP with elliptical likelihood, elliptical posterior, and 300 inducing points; a heteroscedastic EP with heteroscedastic elliptical likelihood and elliptical posterior; and a full GP with Gaussian likelihood. All models used a kernel that was a sum of a periodic kernel and a linear kernel. Before training, we transformed the data to log scale and then normalized it. See Figure 12 for a visualization of the data before and after the transformation. This transformation made the data more symmetric around its mean and removed the nonnegativity constraint.

Figure 12: The data used in the wind power production experiments (years 2010-2018). The left plot shows the data before the transformation and the right plot shows the data after the transformation, which is what we used as input to the model.
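The following sketch illustrates the preprocessing and the summed kernel described above on a synthetic stand-in for the monthly wind power series; the series itself, the length scales, and the variances are placeholders, not the values used in the experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(96, dtype=float)                         # months since 2010 (placeholder grid)
power = rng.gamma(shape=4.0, scale=2.0, size=t.shape)  # positive, right-skewed stand-in

# Log transform removes the nonnegativity constraint and reduces the skew,
# then standardize to zero mean and unit variance (cf. Figure 12).
y = np.log(power)
y = (y - y.mean()) / y.std()

# Kernel used by all three models: a periodic kernel plus a linear kernel.
def periodic_kernel(t1, t2, lengthscale=1.0, period=12.0, variance=1.0):
    d = np.abs(t1[:, None] - t2[None, :])
    return variance * np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / lengthscale ** 2)

def linear_kernel(t1, t2, variance=0.1):
    return variance * t1[:, None] * t2[None, :]

K = periodic_kernel(t, t) + linear_kernel(t, t)  # summed kernel, shape (96, 96)
```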
Figure 9 in Section 4.4 shows the results when training the EP with an elliptical likelihood on these data. For comparison, we here plot the results when using a full GP (Figure 13) and a heteroscedastic EP (Figure 14).

The main takeaway is that all three models are able to capture the periodicity in the data. The full GP has a tendency to overfit the data and to produce wide credible regions. This is probably because the data is non-Gaussian, so the thin tails of the GP force the variances to be high. The heteroscedastic EP fits the data well, although the difference from using a non-heteroscedastic likelihood is small. Figure 15 illustrates the mixing distribution during some of the months of the year, where we can clearly see a seasonal change in the noise.

Figure 13: Wind power forecast using an exact GP. The plot shows the predictive posterior, where the shaded areas show the 95% credible regions of the latent function posterior f_* and the noisy posterior y_*.

Figure 14: The predictive posterior after training the elliptical process with heteroscedastic noise on the wind power dataset. The shaded areas show the 95% credible regions of the latent function posterior f_* and the noisy posterior y_*.

Figure 15: The mixing distribution of the elliptical heteroscedastic likelihood sampled during different months of the year.

Bike dataset (Fanaee-T & Gama, 2014) is obtained from a bike sharing system; it contains the hourly and daily counts of rental bikes between the years 2011 and 2012, together with the corresponding weather and seasonal information.

Elevators dataset (Dua & Graff, 2017) is obtained from the task of controlling an F-16 aircraft; the objective is related to an action taken on the elevators of the aircraft according to the status attributes of the airplane.

Physicochemical properties of protein tertiary structure dataset is taken from CASP 5-9. There are 45,730 decoys with sizes varying from 0 to 21 angstroms.

California housing dataset was originally published by Pace & Barry (1997). There are 20,640 samples and 9 feature variables in this dataset. The targets are the prices of houses in the California area.

The Concrete dataset (Yeh, 1998) has 8 input variables and 1,030 observations. The target variable is the concrete compressive strength.

Auto MPG dataset (Alcalá-Fdez et al., 2011) is originally from the StatLib library maintained at Carnegie Mellon University. The data concern city-cycle fuel consumption in miles per gallon and consist of 392 samples with five features each.

Pima Indians Diabetes Database (Smith et al., 1988) is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict diagnostically whether or not a patient has diabetes based on the diagnostic measurements included in the dataset. The dataset consists of 768 samples with eight attributes.

The Cleveland Heart Disease dataset consists of 13 input variables and 270 samples. The target classifies whether a person is suffering from heart disease or not.

The Mammographic Mass dataset predicts the severity (benign or malignant) of a mammographic mass lesion from BI-RADS attributes and the patient's age. This dataset consists of 961 samples with six attributes.