# Extended and Unscented Kitchen Sinks

Edwin V. Bonilla (E.BONILLA@UNSW.EDU.AU), The University of New South Wales
Daniel Steinberg (DANIEL.STEINBERG@NICTA.COM.AU), NICTA
Alistair Reid (ALISTAIR.REID@NICTA.COM.AU), NICTA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

**Abstract.** We propose a scalable multiple-output generalization of unscented and extended Gaussian processes. These algorithms are designed to handle general likelihood models by linearizing them, using either a Taylor series or the unscented transform, within a variational inference framework. We build upon random feature approximations of Gaussian process covariance functions and show that, on small-scale single-task problems, our methods attain performance similar to the original algorithms at a lower computational cost. We also evaluate our methods at a larger scale on MNIST and on a seismic inversion problem, which is inherently a multi-task problem.

## 1. Introduction

Gaussian process (GP) models can be used as nonparametric probabilistic approaches to standard machine learning settings such as regression and classification (Rasmussen & Williams, 2006), where the latent functions modeled by the GP are only important as a means to an end: providing greater flexibility than their parametric counterparts. In other application areas, such as inversion problems, the latent functions are quantities of interest in themselves, and they are passed through a domain-specific forward model in order to generate the observations.

Standard machine learning tasks and inversion problems present three key challenges when placing Gaussian process (GP) priors over the latent functions. The first challenge is scalability, as GPs are notorious for their poor scaling with the number of training points. The second challenge is multi-output and multi-task learning, as required by problems such as multi-output regression, multi-class classification, or inversion over a multi-layer geological structure (where each layer is a task). The third challenge is dealing with nonlinear, non-Gaussian likelihoods, for example in classification, regression with non-Gaussian noise, and seismic inversion, where the posterior over the latent functions is analytically intractable.

To address the latter challenge, Steinberg & Bonilla (2014) have recently shown that it is possible to obtain good posterior estimates in GP models by approximating the nonlinear likelihood via a Taylor series expansion or via the unscented transform (Julier & Uhlmann, 2004). They refer to their methods as the extended Gaussian process (EGP) and the unscented Gaussian process (UGP). One of the fundamental reasons why such linearizations are effective is their locality and adaptivity: they are constructed around the current posterior estimate, which is iteratively updated within a variational inference procedure.

While such methods are an effective way to tackle nonlinearities in the likelihood, they do not address the other two challenges mentioned above, namely multi-task learning and scalability. Indeed, their method is specific to single-output problems and inherits the cubic scaling of standard GPs in the number of training points.

In this paper we propose a scalable multiple-output generalization of the method of Steinberg & Bonilla (2014).
We deal with multiple-output problems by using affine transformations of the latent functions, and we achieve scalability by introducing random feature approximations of the covariance function of the Gaussian processes, in the style of Rahimi & Recht (2008). Inference over all parameters and hyperparameters is carried out within a variational inference framework, so the kernel learning methods introduced by Yang et al. (2015) can also be applied to our methods. Since Rahimi & Recht (2008) refer to their approach as random kitchen sinks, we refer to our methods as the extended and unscented kitchen sinks (EKS, UKS), depending on whether the Taylor series approximation or the unscented transform is used to linearize the conditional likelihood.

Our approach naturally avoids the cubic scaling of the original EGP and UGP methods (Steinberg & Bonilla, 2014) in the number of training points. The complexity of our algorithms is dominated by the inversion of the feature covariance of size $D$, which has a time complexity of $\mathcal{O}(D^3)$, where typically $D \ll N$. Our experiments on small-scale synthetic nonlinear inversion tasks and on a classification task on the USPS dataset show that random feature approximations to the EGP and the UGP can attain performance similar to the original methods, even when using a small number of features, hence reducing the complexity of inference significantly. Furthermore, experiments at a larger scale on MNIST show that our algorithms are competitive with recently developed approaches for inference in GP models, while the application of the EGP and UGP to this task is simply infeasible. Finally, on a multi-task (joint) nonlinear seismic inversion problem, we show that our algorithms can recover accurate representations of the underlying geological structure and rock properties (seismic velocities).

## 2. Gaussian Process Models

We are given $N$ input data points $\{\mathbf{x}_n\} \subset \mathbb{R}^d$ and their corresponding targets $\{\mathbf{y}_n\} \subset \mathbb{R}^P$, which we describe compactly with $\{\mathbf{X}, \mathbf{Y}\}$, where $\mathbf{X} \in \mathbb{R}^{N \times d}$ and $\mathbf{Y} \in \mathbb{R}^{N \times P}$. Our goal is to learn a probabilistic mapping from inputs to outputs, which can be achieved through $Q$ latent functions $\{f_q\}$ and a given nonlinear forward model $g: \mathbb{R}^Q \to \mathbb{R}^P$. Additionally, we are interested in estimating the posterior over the latent functions given the observed data. While the former problem is the standard multi-task supervised learning setting, we refer to the latter as a probabilistic joint inversion problem, as we are given a forward mapping from latent functions to noiseless outputs but not the reverse.

A flexible modeling approach places independent zero-mean Gaussian process (GP) priors over the latent functions $\{f_q\}$ with covariance functions $k_q(\cdot, \cdot)$, and assumes i.i.d. observations given these latent functions. When these functions are realized at the training data, we obtain the following prior and likelihood models:

$$p(\mathbf{F}) = \prod_{q=1}^{Q} \mathcal{N}(\mathbf{f}_{\cdot q};\, \mathbf{0}, \mathbf{K}_q), \tag{1}$$

$$p(\mathbf{Y} \,|\, \mathbf{F}) = \prod_{n=1}^{N} p(\mathbf{y}_n \,|\, g(\mathbf{f}_{n \cdot})), \tag{2}$$

where $\mathbf{K}_q$ is the covariance matrix induced by evaluating the covariance function $k_q$ at all input data $\mathbf{X}$; $\mathbf{f}_{\cdot q}$ are the values of latent function $q$ at all training inputs; and $\mathbf{f}_{n \cdot}$ are the values of all latent functions at input $\mathbf{x}_n$. Having a nonlinear function $g(\mathbf{f}_{n \cdot})$ in the conditional likelihood terms gives us the flexibility to go beyond standard regression with linear Gaussian noise, even when each $p(\mathbf{y}_n \,|\, g(\mathbf{f}_{n \cdot}))$ is Gaussian. For example, we can address problems such as classification, where $g$ is, e.g., a softmax function, and nonlinear inversion problems such as seismic inversion, where $g$ maps depths and seismic velocities of geological layers to sound reflection times.
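To make the generative structure of Equations (1) and (2) concrete, the following sketch (our own, not code from the paper) samples $Q$ latent functions from independent GP priors and pushes them through a softmax forward model with categorical observations. The kernel settings, dimensions, and the categorical observation model are illustrative assumptions only.

```python
import numpy as np

def sq_exp_kernel(X, lengthscale=0.6, signal_var=1.0):
    # Isotropic squared exponential covariance: K[i, j] = s^2 exp(-||x_i - x_j||^2 / (2 l^2)).
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale ** 2)

def softmax(f):
    # One possible forward model g: R^Q -> probability simplex, applied row-wise.
    e = np.exp(f - f.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, d, Q = 100, 2, 3
X = rng.uniform(-2, 2, size=(N, d))
K = sq_exp_kernel(X) + 1e-8 * np.eye(N)          # jitter for numerical stability

# Equation (1): independent zero-mean GP priors over the Q latent functions.
F = np.stack([rng.multivariate_normal(np.zeros(N), K) for _ in range(Q)], axis=1)  # N x Q

# Equation (2): i.i.d. observations given the latent functions, here categorical
# labels drawn from the softmax forward model g(f_n).
probs = softmax(F)
y = np.array([rng.choice(Q, p=p) for p in probs])
```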
From a probabilistic inference perspective, solving the inversion problem (and the subsequent prediction problem) boils down to computing the posterior distribution $p(\mathbf{F} \,|\, \mathbf{X}, \mathbf{Y})$. Unfortunately, this posterior distribution is, in general, intractable due to the nonlinearities in $g$, so one must resort to approximations.

### 2.1. Variational Inference in Linearized GP Models

Steinberg & Bonilla (2014) have recently proposed a variational algorithm that addresses the above problem for single-output observations ($Q = 1$). Their algorithm relies on linearizing the forward model around the posterior mean, which allows an analytic approximation of the variational objective and enables parameter learning within a simple but effective optimization procedure. To build such linearizations they use a Taylor series approximation and the unscented transform, and they refer to their methods as the extended Gaussian process (EGP) and the unscented Gaussian process (UGP), respectively. The main advantage of these algorithms is that the approximation is local and adaptive: it is constructed around the posterior mean for a single data point $n$ and is updated at every iteration of the algorithm as a function of the variational parameters. However, such algorithms suffer from poor scalability, as they inherit the computational cost of traditional GP models, which is $\mathcal{O}(N^3)$. This problem is, of course, exacerbated in multi-task settings or with multiple outputs, which renders their approach impractical for large datasets.

## 3. Random Feature Approximations

Our starting point for scaling up linearized GP models builds upon the work of Rahimi & Recht (2008; 2009), who used Bochner's theorem regarding the relationship between a kernel and the Fourier transform of a non-negative measure. In particular, when such a non-negative measure exists, one obtains the Wiener-Khintchine theorem, which establishes the Fourier duality of the covariance function of a stationary process and its spectral density:

$$k(\boldsymbol{\tau}) = \int S(\mathbf{s})\, e^{2\pi i\, \mathbf{s}^{\top} \boldsymbol{\tau}}\, d\mathbf{s}, \tag{3}$$

$$S(\mathbf{s}) = \int k(\boldsymbol{\tau})\, e^{-2\pi i\, \mathbf{s}^{\top} \boldsymbol{\tau}}\, d\boldsymbol{\tau}. \tag{4}$$

Rahimi & Recht's (2008) main insight is that we can approximate the above kernel by explicitly constructing suitable random features and (Monte Carlo) averaging over samples from $S(\mathbf{s})$:

$$k(\mathbf{x} - \mathbf{x}') = k(\boldsymbol{\tau}) \approx \sum_{i=1}^{D} \phi_i(\mathbf{x})\, \phi_i(\mathbf{x}'), \tag{5}$$

where $\boldsymbol{\phi}(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}^D$ is a feature map and $\phi_i(\mathbf{x})$ corresponds to the $i$th component of that map. An example of a feature vector construction in the above approximation is

$$[\phi_i(\mathbf{x}),\, \phi_{D+i}(\mathbf{x})] = \frac{1}{\sqrt{D}}\left[\cos(2\pi \mathbf{s}_i^{\top}\mathbf{x}),\, \sin(2\pi \mathbf{s}_i^{\top}\mathbf{x})\right], \quad \text{with } \mathbf{s}_i \sim \mathcal{N}(\mathbf{s}_i;\, \mathbf{0},\, \sigma_{\phi}^2 \mathbf{I}_d), \tag{6}$$

for $i = 1, \ldots, D$, which is in fact a mapping into a $2D$-dimensional feature space. Rahimi & Recht (2008) used the above feature map to approximate the commonly used (isotropic) squared exponential kernel, and showed that such an approximation converges in expectation to the true kernel. They refer to algorithms that use such randomized feature expansions as random kitchen sinks (RKS). If we use RKS bases such that $k(\mathbf{x}_i, \mathbf{x}_j) = \mathbb{E}[\boldsymbol{\phi}(\mathbf{x}_i)^{\top} \boldsymbol{\phi}(\mathbf{x}_j)]$, we can approximate GP models that involve nonlinear likelihoods with simple linear-in-the-parameters models.
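The sketch below (our own, not from the paper) implements the feature construction of Equation (6) and checks the Monte Carlo approximation of Equation (5) against the closed-form kernel it converges to. We assume the Gaussian spectral density above, under which the implied kernel is $\exp(-2\pi^2 \sigma_{\phi}^2 \|\mathbf{x} - \mathbf{x}'\|^2)$, i.e., a squared exponential kernel with lengthscale $\ell = 1/(2\pi\sigma_{\phi})$.

```python
import numpy as np

def random_fourier_features(X, D, sigma_phi, rng):
    # Draw spectral frequencies s_i ~ N(0, sigma_phi^2 I_d) and build the
    # 2D-dimensional map [cos(2 pi s_i^T x), sin(2 pi s_i^T x)] / sqrt(D) of Equation (6).
    d = X.shape[1]
    S = rng.normal(0.0, sigma_phi, size=(D, d))
    proj = 2 * np.pi * X @ S.T                       # N x D projections
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
Phi = random_fourier_features(X, D=2000, sigma_phi=0.5, rng=rng)

# Monte Carlo kernel estimate phi(x)^T phi(x') versus the squared exponential
# kernel it converges to in expectation.
approx = Phi @ Phi.T
exact = np.exp(-2 * np.pi ** 2 * 0.5 ** 2 *
               np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
print(np.max(np.abs(approx - exact)))                # small for large D
```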
For our purposes, we are interested in using such random feature approximations within a variational inference framework in order to estimate the posterior distributions of models of the form given in Equations (1) and (2).

## 4. Multi-task GP Models

Our next step is to approximate the prior over latent functions in Equation (1) using the random feature approximation described above, and to specify the multi-output likelihood in Equation (2):

$$p(\mathbf{W}) = \prod_{q=1}^{Q} \mathcal{N}\!\left(\mathbf{w}_q \,\middle|\, \mathbf{0},\, \omega_q^2 \mathbf{I}_D\right), \tag{7}$$

$$p(\mathbf{Y} \,|\, \mathbf{W}, \mathbf{X}) = \prod_{n=1}^{N} \mathcal{N}\!\left(\mathbf{y}_n \,\middle|\, g(\mathbf{W}\boldsymbol{\phi}_n),\, \boldsymbol{\Sigma}\right), \tag{8}$$

where $\boldsymbol{\phi}_n \overset{\text{def}}{=} \boldsymbol{\phi}(\mathbf{x}_n)$ is the $D$-dimensional vector of features corresponding to datapoint $n$; $\mathbf{w}_q \in \mathbb{R}^D$; $\mathbf{W} \in \mathbb{R}^{Q \times D}$; $\omega_q^2$ is the prior variance over the weights; and $\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1^2, \ldots, \sigma_P^2)$ is the noise covariance. Additionally, we note that we are effectively approximating the prior over latent functions as $\mathbf{f}_{\cdot q} = \boldsymbol{\Phi}\mathbf{w}_q$, with $\boldsymbol{\Phi} \overset{\text{def}}{=} \boldsymbol{\phi}(\mathbf{X})$ being the $N \times D$ matrix of features evaluated at all the training data.

The RKS-based approximation allows us to circumvent the inherent scalability problem in GP models. However, the likelihood model in Equation (8) still involves a nonlinear transformation of the corresponding latent functions, which, as before, yields intractable posteriors. To address this problem we build upon the work of Steinberg & Bonilla (2014) and develop a variational inference procedure that exploits linearization methods around the posterior mean.

### 4.1. Posterior Approximation

Let us now make the simplifying assumption that the posterior factorizes over latent functions and has the form

$$q(\mathbf{W}) = \prod_{q=1}^{Q} \mathcal{N}(\mathbf{w}_q \,|\, \mathbf{m}_q, \mathbf{C}_q). \tag{9}$$

We can use variational inference to learn this posterior approximation, thereby allowing us to infer the posterior over the latent tasks,

$$q(\mathbf{F}) = \prod_{q=1}^{Q} \mathcal{N}\!\left(\mathbf{f}_{\cdot q} \,\middle|\, \boldsymbol{\Phi}\mathbf{m}_q,\, \boldsymbol{\Phi}\mathbf{C}_q\boldsymbol{\Phi}^{\top}\right). \tag{10}$$

### 4.2. Evidence Lower Bound

With the prior specified in Equation (7), the likelihood specified in Equation (8), and the approximate posterior defined in Equation (9), we are now ready to write down the variational lower bound that we aim to maximize in order to learn the parameters of our model. Defining $q_{\mathbf{W}} \overset{\text{def}}{=} q(\mathbf{W})$, the variational log-evidence lower bound is

$$\mathcal{L} = \left\langle \log p(\mathbf{Y} \,|\, \mathbf{W}, \mathbf{X}) \right\rangle_{q_{\mathbf{W}}} - \mathrm{KL}\!\left[\, q(\mathbf{W}) \,\|\, p(\mathbf{W}) \,\right]. \tag{11}$$

Here we can straightforwardly arrive at the KL term,

$$\mathrm{KL}\!\left[\, q(\mathbf{W}) \,\|\, p(\mathbf{W}) \,\right] = \frac{1}{2} \sum_{q=1}^{Q} \left[ \frac{1}{\omega_q^2}\left(\operatorname{tr}(\mathbf{C}_q) + \mathbf{m}_q^{\top}\mathbf{m}_q\right) - \log|\mathbf{C}_q| + D \log \omega_q^2 - D \right]. \tag{12}$$

The expected log-likelihood term is less straightforward,

$$\left\langle \log p(\mathbf{Y} \,|\, \mathbf{W}, \mathbf{X}) \right\rangle_{q_{\mathbf{W}}} = -\frac{N}{2}\left[P \log 2\pi + \log|\boldsymbol{\Sigma}|\right] - \frac{1}{2} \sum_{n=1}^{N} \left\langle (\mathbf{y}_n - g(\mathbf{W}\boldsymbol{\phi}_n))^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{y}_n - g(\mathbf{W}\boldsymbol{\phi}_n)) \right\rangle_{q_{\mathbf{W}}}, \tag{13}$$

since this expectation cannot be easily evaluated. We therefore make another approximation,

$$g(\mathbf{W}\boldsymbol{\phi}_n) \approx \mathbf{A}_n \mathbf{W} \boldsymbol{\phi}_n + \mathbf{b}_n, \tag{14}$$

where $\mathbf{A}_n \in \mathbb{R}^{P \times Q}$ is a linearization matrix and $\mathbf{b}_n \in \mathbb{R}^P$ is an intercept term, both of which we will define later. Now we can evaluate the expectation in (13) approximately as

$$\left\langle \log p(\mathbf{Y} \,|\, \mathbf{W}, \mathbf{X}) \right\rangle_{q_{\mathbf{W}}} \approx -\frac{N}{2}\left[P \log 2\pi + \log|\boldsymbol{\Sigma}|\right] - \frac{1}{2} \sum_{n=1}^{N} \left[ \boldsymbol{\epsilon}_n^{\top} \boldsymbol{\Sigma}^{-1} \boldsymbol{\epsilon}_n + \sum_{q=1}^{Q} \boldsymbol{\phi}_n^{\top} \mathbf{C}_q \boldsymbol{\phi}_n\, \mathbf{a}_{nq}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{a}_{nq} \right], \tag{15}$$

where we have defined $\boldsymbol{\epsilon}_n \overset{\text{def}}{=} \mathbf{y}_n - (\mathbf{A}_n \mathbf{M} \boldsymbol{\phi}_n + \mathbf{b}_n)$, with $\mathbf{M} \in \mathbb{R}^{Q \times D}$ the matrix whose rows are the posterior means $\mathbf{m}_q^{\top}$; and we have used $\mathbf{A}_n \mathbf{W} = \sum_q \mathbf{a}_{nq} \mathbf{w}_q^{\top}$, defining $\mathbf{a}_{nq} \in \mathbb{R}^P$ and $\mathbf{A}_n = [\mathbf{a}_{n1}, \ldots, \mathbf{a}_{nQ}]$. This objective factorizes over the data, so it is straightforward to apply parallel or stochastic gradient descent algorithms to learn the posterior parameters.
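As a sanity check on the bound, the KL term in Equation (12) follows directly from the standard closed form for the KL divergence between Gaussians. The sketch below is a minimal illustration (ours, not code from the paper), assuming the $\mathbf{m}_q$, $\mathbf{C}_q$, and $\omega_q^2$ are supplied as numpy arrays.

```python
import numpy as np

def kl_q_p(m_list, C_list, omega2_list):
    # KL[q(W) || p(W)] for q(w_q) = N(m_q, C_q) and p(w_q) = N(0, omega_q^2 I_D),
    # summed over the Q independent factors, as in Equation (12).
    kl = 0.0
    for m, C, w2 in zip(m_list, C_list, omega2_list):
        D = m.shape[0]
        _, logdetC = np.linalg.slogdet(C)
        kl += 0.5 * ((np.trace(C) + m @ m) / w2 - logdetC + D * np.log(w2) - D)
    return kl
```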
### 4.3. Learning the Variational Parameters

As in Steinberg & Bonilla (2014), we can use Newton's method to learn the approximate posterior parameters for each task,

$$\mathbf{m}_q^{(k+1)} = \mathbf{m}_q^{(k)} - \alpha_k \left[ \nabla_{\mathbf{m}_q} \nabla_{\mathbf{m}_q} \mathcal{L} \right]^{-1} \nabla_{\mathbf{m}_q} \mathcal{L} \,\Big|_{\mathbf{m}_q = \mathbf{m}_q^{(k)}}. \tag{16}$$

Here $\alpha_k \in (0, 1]$ is a step length, and the gradient of the variational lower bound with respect to the posterior mean is

$$\nabla_{\mathbf{m}_q} \mathcal{L} = \sum_{n=1}^{N} \boldsymbol{\phi}_n \mathbf{a}_{nq}^{\top} \boldsymbol{\Sigma}^{-1} \left( \mathbf{y}_n - \mathbf{a}_{nq} \mathbf{m}_q^{\top} \boldsymbol{\phi}_n - \mathbf{b}_n \right) - \frac{1}{\omega_q^2} \mathbf{m}_q. \tag{17}$$

Similarly, the Hessian of the variational objective is

$$\nabla_{\mathbf{m}_q} \nabla_{\mathbf{m}_q} \mathcal{L} = -\frac{1}{\omega_q^2} \mathbf{I}_D - \sum_{n=1}^{N} \boldsymbol{\phi}_n \mathbf{a}_{nq}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{a}_{nq} \boldsymbol{\phi}_n^{\top}. \tag{18}$$

When (16) has converged to $\mathbf{m}_q^{+}$ we can calculate the approximate posterior covariance,

$$\mathbf{C}_q = \left[ \frac{1}{\omega_q^2} \mathbf{I}_D + \sum_{n=1}^{N} \boldsymbol{\phi}_n \mathbf{a}_{nq}^{\top} \boldsymbol{\Sigma}^{-1} \mathbf{a}_{nq} \boldsymbol{\phi}_n^{\top} \right]^{-1}. \tag{19}$$

### 4.4. Hyperparameter Learning

Once we have found the optimum $\mathbf{M}$ and $\{\mathbf{C}_q\}$, we can optimize the linearized $\mathcal{L}$ (which becomes an approximation to the lower bound) with respect to the parameters $\boldsymbol{\Sigma}$, $\{\omega_q^2\}$, and $\boldsymbol{\theta}$, assuming the features are parameterized as $\boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n, \boldsymbol{\theta})$. Once we have found the optimum $\boldsymbol{\Sigma}$, $\{\omega_q^2\}$, $\boldsymbol{\theta}$, we then re-optimize for the posterior parameters in a generalized variational-EM-like procedure. When we have determined the optimum posterior parameters, the trace term in Equation (12) cancels with the only term in Equation (15) involving $\mathbf{C}_q$, i.e. $QD - \sum_q \operatorname{tr}(\mathbf{C}_q)/\omega_q^2 = \sum_q \sum_n \boldsymbol{\phi}_n^{\top}\mathbf{C}_q\boldsymbol{\phi}_n\, \mathbf{a}_{nq}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{a}_{nq}$, to give

$$\mathcal{L} = -\frac{N}{2}\left[P \log 2\pi + \log|\boldsymbol{\Sigma}|\right] - \frac{1}{2}\sum_{n=1}^{N} \boldsymbol{\epsilon}_n^{\top}\boldsymbol{\Sigma}^{-1}\boldsymbol{\epsilon}_n - \frac{1}{2}\sum_{q=1}^{Q}\left[\frac{1}{\omega_q^2}\mathbf{m}_q^{\top}\mathbf{m}_q - \log|\mathbf{C}_q| + D\log\omega_q^2\right]. \tag{20}$$

Unfortunately, the optimum mean ($\mathbf{M}^{+}$) is an implicit function of both $\boldsymbol{\Sigma}$ and $\boldsymbol{\theta}$, and so in the case of the EGP we would require second and higher derivatives of $g$ in order to calculate partial derivatives of (20) with respect to these parameters. However, if we assume a small number of tasks, $P \sim \mathcal{O}(10)$, and a lightly parameterized feature function $\boldsymbol{\phi}(\cdot)$, then we can use numerical methods for local derivative-free optimization such as COBYLA (Powell, 1994; 1998) or BOBYQA (Powell, 2009).

## 5. Linearization Methods

So far we have assumed that we are given the linearization parameters $\{\mathbf{A}_n, \mathbf{b}_n\}$ that allow us to approximate the nonlinear forward model locally, as in Equation (14). As $g(\cdot)$ is a function of our latent function approximation $\mathbf{W}\boldsymbol{\phi}_n$, it makes sense to linearize around the current posterior estimates of these latent functions, $\hat{\mathbf{f}}_n = \mathbf{M}\boldsymbol{\phi}_n$. In this section we describe two methods for linearizing the forward model around these posterior estimates. The first uses a Taylor series approximation and the second uses a statistical linearization (Geist & Pietquin, 2010) based on the unscented transform (Julier & Uhlmann, 2004). Because of the relation of our algorithms to the extended Gaussian process (EGP) and the unscented Gaussian process (UGP), we refer to the proposed inference methods as the extended kitchen sinks (EKS) and the unscented kitchen sinks (UKS), respectively.

### 5.1. Taylor Series Linearization

We can use a first-order Taylor series to linearize $g(\cdot)$ at each iteration of (16),

$$g(\mathbf{W}\boldsymbol{\phi}_n) \approx g(\mathbf{M}\boldsymbol{\phi}_n) + \mathbf{J}_n (\mathbf{W} - \mathbf{M}) \boldsymbol{\phi}_n, \tag{21}$$

$$\mathbf{J}_n = \left. \frac{\partial g(\mathbf{f}_{n \cdot})}{\partial \mathbf{f}_{n \cdot}} \right|_{\mathbf{f}_{n \cdot} = \mathbf{M}\boldsymbol{\phi}_n}. \tag{22}$$

Equating coefficients with (14), we have that

$$\mathbf{A}_n = \mathbf{J}_n \quad \text{and} \quad \mathbf{b}_n = g(\mathbf{M}\boldsymbol{\phi}_n) - \mathbf{J}_n \mathbf{M} \boldsymbol{\phi}_n. \tag{23}$$

We will refer to the method that uses this Taylor series linearization within the variational framework described in Section 4 as extended kitchen sinks (EKS).
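A minimal sketch of the Taylor-series linearization of Equations (21)-(23) is given below (ours, not code from the paper). A finite-difference Jacobian stands in for the analytic Jacobian of the forward model; this is an implementation choice of the sketch, not of the paper.

```python
import numpy as np

def numerical_jacobian(g, f, eps=1e-6):
    # Finite-difference stand-in for the P x Q Jacobian J_n = dg/df evaluated at f.
    f = np.asarray(f, dtype=float)
    g0 = np.asarray(g(f))
    J = np.zeros((g0.shape[0], f.shape[0]))
    for q in range(f.shape[0]):
        fp = f.copy()
        fp[q] += eps
        J[:, q] = (np.asarray(g(fp)) - g0) / eps
    return J

def taylor_linearization(g, M, phi_n):
    # Equations (21)-(23): linearize g around the current posterior mean f_hat = M phi_n,
    # returning A_n = J_n and b_n = g(f_hat) - J_n f_hat.
    f_hat = M @ phi_n
    J = numerical_jacobian(g, f_hat)
    b = np.asarray(g(f_hat)) - J @ f_hat
    return J, b
```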
### 5.2. Statistical Linearization

In order to estimate the linearization parameters in Equation (14) statistically, we can use, for example, weighted least squares. The main question is what training data we can use to fit the linear model. Although we could sample from $\mathbf{F}$ using Equation (10) to generate these data, the unscented transform (UT; Julier & Uhlmann, 2004) provides a deterministic and more elegant solution. The main point to notice here is that we are interested in linearizing $g$ as a function of $\mathbf{f}_{n \cdot}$, where $\mathbf{f}_{n \cdot}$ is a $Q$-dimensional random variable corresponding to the $Q$ latent function values at datapoint $n$.

Interestingly, our choice of variational distribution in Equation (9) assumes that the joint posterior factorizes across latent functions, and so does the marginal:

$$q(\mathbf{f}_{n \cdot}) = \mathcal{N}(\mathbf{f}_{n \cdot} \,|\, \boldsymbol{\mu}_n, \mathbf{E}_n), \quad \text{with } \boldsymbol{\mu}_n = \mathbf{M}\boldsymbol{\phi}_n, \text{ and} \tag{24}$$

$$[\mathbf{E}_n]_{qq'} = \boldsymbol{\phi}_n^{\top} \mathbf{C}_q \boldsymbol{\phi}_n \ \text{ for } q = q', \text{ and } 0 \text{ otherwise}. \tag{25}$$

This greatly simplifies the computation of the UT, which involves the definition of $2Q + 1$ so-called sigma points:

$$\mathcal{F}_{0,n} = \boldsymbol{\mu}_n, \tag{26}$$

$$\mathcal{F}_{i,n} = \boldsymbol{\mu}_n + \left[\sqrt{(Q + \kappa)\,\mathbf{E}_n}\right]_{\cdot i} \quad \text{for } 1 \le i \le Q, \tag{27}$$

$$\mathcal{F}_{i,n} = \boldsymbol{\mu}_n - \left[\sqrt{(Q + \kappa)\,\mathbf{E}_n}\right]_{\cdot (i-Q)} \quad \text{for } Q < i \le 2Q, \tag{28}$$

where $[\mathbf{A}]_{\cdot i}$ denotes the $i$th column of matrix $\mathbf{A}$ and $\kappa$ is a free parameter. The corresponding forward model evaluations $\mathcal{Y}_{i,n}$ and weights $u_i$ are

$$\mathcal{Y}_{i,n} = g(\mathcal{F}_{i,n}) \quad \text{for } 0 \le i \le 2Q, \tag{29}$$

$$u_0 = \frac{\kappa}{Q + \kappa}, \qquad u_i = \frac{1}{2(Q + \kappa)} \quad \text{for } 0 < i \le 2Q, \tag{30}$$

where we note that $\kappa = 1/2$ corresponds to uniform weights $u_i = 1/(2Q + 1)$. Solving the weighted linear least squares problem with inputs, outputs, and weights $\{\mathcal{F}_{i,n}, \mathcal{Y}_{i,n}, u_i\}$ yields the solution

$$\mathbf{b}_n = \bar{\mathbf{y}}_n - \mathbf{A}_n \mathbf{M} \boldsymbol{\phi}_n, \tag{31}$$

$$\mathbf{A}_n = \boldsymbol{\Gamma}_n \mathbf{E}_n^{-1}, \tag{32}$$

where $\mathbf{E}_n$ is the diagonal matrix given in Equation (25), and $\bar{\mathbf{y}}_n$ and $\boldsymbol{\Gamma}_n$ are the sufficient statistics

$$\bar{\mathbf{y}}_n = \sum_{i=0}^{2Q} u_i\, \mathcal{Y}_{i,n}, \tag{33}$$

$$\boldsymbol{\Gamma}_n = \sum_{i=0}^{2Q} u_i\, (\mathcal{Y}_{i,n} - \bar{\mathbf{y}}_n)(\mathcal{F}_{i,n} - \mathbf{M}\boldsymbol{\phi}_n)^{\top}. \tag{34}$$

We will refer to the method that uses this statistical linearization within the variational framework described in Section 4 as unscented kitchen sinks (UKS). One of the key advantages of the UKS over the EKS is that the UKS, like the original UGP, is a black-box method: it requires no gradient information about the nonlinearity in the likelihood function in order to learn the parameters and hyperparameters, approximate the posterior, and make predictions.
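The statistical linearization of Equations (26)-(34) is straightforward when $\mathbf{E}_n$ is diagonal, as in Equation (25). The sketch below is a minimal illustration (ours, not code from the paper); `g`, `mu`, and `E_diag` stand for the forward model and the per-point marginal moments, and the default $\kappa = 1/2$ reproduces the uniform-weight case noted above.

```python
import numpy as np

def unscented_linearization(g, mu, E_diag, kappa=0.5):
    # Statistical linearization of g around q(f_n) = N(mu, diag(E_diag)),
    # following Equations (26)-(34). Returns A_n and b_n of Equation (14).
    Q = mu.shape[0]
    scale = np.sqrt((Q + kappa) * E_diag)            # columns of sqrt((Q + kappa) E_n)

    # 2Q + 1 sigma points (Equations (26)-(28)).
    sigma = ([mu]
             + [mu + scale[i] * np.eye(Q)[i] for i in range(Q)]
             + [mu - scale[i] * np.eye(Q)[i] for i in range(Q)])
    weights = np.r_[kappa / (Q + kappa), np.full(2 * Q, 0.5 / (Q + kappa))]

    # Forward-model evaluations and weighted statistics (Equations (29)-(34)).
    Y = np.stack([np.asarray(g(f)) for f in sigma])  # (2Q+1) x P
    y_bar = weights @ Y
    Gamma = sum(w * np.outer(y - y_bar, f - mu) for w, y, f in zip(weights, Y, sigma))
    A = Gamma / E_diag                               # Gamma E_n^{-1}, with E_n diagonal
    b = y_bar - A @ mu
    return A, b
```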
## 6. Prediction

The predictive distribution over the latent function values $\mathbf{f}_*$ can be computed similarly to Equation (24):

$$p(\mathbf{f}_* \,|\, \mathbf{x}_*) = \mathcal{N}(\mathbf{f}_* \,|\, \boldsymbol{\mu}_*, \mathbf{E}_*), \quad \text{with } \boldsymbol{\mu}_* = \mathbf{M}\boldsymbol{\phi}_*, \text{ and} \tag{35}$$

$$[\mathbf{E}_*]_{qq'} = \boldsymbol{\phi}_*^{\top} \mathbf{C}_q \boldsymbol{\phi}_* \ \text{ for } q = q', \text{ and } 0 \text{ otherwise}, \tag{36}$$

where $\boldsymbol{\phi}_*$ is the $D$-dimensional vector of test features resulting from evaluating $\boldsymbol{\phi}(\mathbf{x}_*)$. We can also predict the noiseless observations by evaluating the integral

$$\mathbf{g}_* = \int g(\mathbf{f}_*)\, p(\mathbf{f}_* \,|\, \mathbf{x}_*)\, d\mathbf{f}_*, \tag{37}$$

which we can estimate using Monte Carlo averaging.

## 7. Experiments

In this section we describe the experiments carried out to assess the performance and behavior of our extended kitchen sinks (EKS) and unscented kitchen sinks (UKS) methods. We first look at experiments on small-scale synthetic inversion problems and a binary classification task on the USPS dataset. As these experiments were also carried out by Steinberg & Bonilla (2014) to evaluate the EGP and the UGP, our first goal is to investigate how well random kitchen sinks bases can approximate the original EGP and UGP. Additionally, we are interested in determining whether the complexity of the algorithms can be reduced significantly by using a number of basis functions smaller than the number of training points.

### 7.1. Synthetic Inversion Problem

In this experiment we generate latent function values $\mathbf{f}$ from a GP with an isotropic squared exponential covariance function (signal variance $\sigma_s^2 = 0.82$ and lengthscale $\ell = 0.6$) at 1000 input points, $x \in \mathbb{R}$, uniformly spaced in $[-2\pi, 2\pi]$. We test our algorithms and the baselines (UGP, EGP) with five simple forward models: an identity function (linear), a third-order polynomial with no cross terms (poly3), an exponential function, a sinusoid, and a hyperbolic tangent (tanh). We present the results of 5-fold cross-validation (200 training, 800 testing) in Table 1, and also show the behavior of our algorithms as a function of the number of features in Figure 1. We use the standardized mean square error (SMSE) and the negative log probability density (NLPD) as the performance metrics.

Table 1. Performance of the EKS and UKS methods compared to their GP counterparts (EGP and UGP) on a range of synthetic benchmarks when using D = 200 features. The mean for each measure is shown with the standard deviation in brackets.

| g(f)   | Method | SMSE-f* | NLPD-f* | SMSE-g* |
|--------|--------|---------|---------|---------|
| linear | EKS | 0.03 (0.01) | -0.99 (0.23) | 0.03 (0.01) |
|        | UKS | 0.04 (0.00) | -0.97 (0.11) | 0.04 (0.00) |
|        | EGP | 0.03 (0.01) | -1.01 (0.30) | 0.03 (0.01) |
|        | UGP | 0.03 (0.02) | -0.94 (0.43) | 0.03 (0.02) |
| poly3  | EKS | 0.02 (0.01) | -1.39 (0.20) | 0.01 (0.00) |
|        | UKS | 0.06 (0.04) | 0.41 (1.69)  | 0.02 (0.01) |
|        | EGP | 0.07 (0.02) | -0.36 (0.68) | 0.02 (0.01) |
|        | UGP | 0.06 (0.04) | -0.35 (0.62) | 0.01 (0.01) |
| exp    | EKS | 0.03 (0.01) | -1.29 (0.20) | 0.01 (0.00) |
|        | UKS | 0.02 (0.01) | -1.26 (0.24) | 0.01 (0.00) |
|        | EGP | 0.08 (0.04) | 0.25 (1.52)  | 0.04 (0.02) |
|        | UGP | 0.03 (0.02) | -1.02 (0.58) | 0.02 (0.01) |
| sin    | EKS | 0.03 (0.02) | -1.05 (0.19) | 0.03 (0.01) |
|        | UKS | 0.03 (0.01) | -1.13 (0.10) | 0.03 (0.01) |
|        | EGP | 0.04 (0.02) | -0.94 (0.25) | 0.04 (0.02) |
|        | UGP | 0.06 (0.02) | -0.80 (0.22) | 0.05 (0.03) |
| tanh   | EKS | 0.05 (0.03) | -1.14 (0.24) | 0.02 (0.00) |
|        | UKS | 0.04 (0.03) | -0.85 (1.00) | 0.03 (0.02) |
|        | EGP | 0.09 (0.06) | -0.85 (0.24) | 0.04 (0.02) |
|        | UGP | 0.05 (0.03) | -0.87 (0.22) | 0.03 (0.01) |

We see in Table 1 that, in general, the EKS and UKS perform similarly to the EGP and UGP algorithms, and sometimes better. For example, the EGP with the exponential forward model has larger mean SMSE and mean NLPD, with associated high standard deviations, suggesting the algorithm converged sub-optimally in one or more of the folds. More importantly, this comparable or superior performance of the proposed algorithms is attained at significantly lower computational cost, since $D < N$. Finally, as seen in Figure 1, apart from the UKS with 50 basis functions, both algorithms appear to be relatively robust to using smaller numbers of features, which translates into lower computational cost while maintaining similar performance levels. Analogous results are observed for the NLPD as a function of the number of features; see the supplementary material for details.
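For reference, the SMSE and NLPD metrics reported above can be computed as in the following generic sketch (ours, not code released with the paper); the NLPD here assumes Gaussian predictive marginals.

```python
import numpy as np

def smse(y_true, y_pred):
    # Standardized mean squared error: MSE normalized by the variance of the test targets.
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def gaussian_nlpd(y_true, pred_mean, pred_var):
    # Average negative log predictive density under Gaussian predictive marginals.
    return np.mean(0.5 * np.log(2 * np.pi * pred_var)
                   + 0.5 * (y_true - pred_mean) ** 2 / pred_var)
```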
Figure 1. The performance of the EKS (left) and UKS (right) on the synthetic inversion problems as a function of the number of features (D = 50, 100, 200).

Figure 2. The performance of the EKS and UKS on the binary classification problem for the USPS dataset as a function of the number of bases used (200, 400, 800). EGP and UGP are the original (full) GP models.

### 7.2. Binary Handwritten Digit Classification

This is a binary classification task to distinguish between images of the handwritten digits 3 and 5 in the USPS digits dataset (Rasmussen & Williams, 2006). There are 767 images in the training set and 773 images in the test set. We use a logistic sigmoid as the forward model in this task and the same settings as in the original experiments for the covariance functions and observation variance. The main aim of this experiment is to benchmark the performance of the EKS and UKS against the EGP and UGP for varying numbers of basis functions, as shown in Figure 2.

It is important to emphasize that Steinberg & Bonilla (2014) showed that the EGP and UGP outperform or perform similarly to hard-coded inference methods based on variational inference and expectation propagation. We observe that there is a detriment in performance with a small number of features (D = 200), which is to be expected since the number of training points is N = 767, and so the exact covariance is harder to represent with a low number of random basis functions. However, the performance of the EKS and UKS approaches that of the UGP and EGP with 400 basis functions, which indicates that our random-feature approaches are reasonable approximations of the original GP model. More importantly, for problems with a large number of training points the EGP and the UGP are simply infeasible, and this is the subject of discussion in the next section.

### 7.3. Large-Scale Classification

Here we present results on a larger application, the MNIST dataset, which contains examples of handwritten digits: 50,000 for training, 10,000 for validation, and 10,000 for testing. In our experiments we always train on the 60,000 examples that comprise the training and validation sets, and we tune the parameters of our models via optimization of the variational bound. This is probably a disadvantage when compared to other approaches that use cross-validation, but our goal is only to show that our models can achieve competitive performance at this scale.

**Odd digits vs. even digits.** We first consider the binary classification problem of distinguishing the odd digits from the even digits, a task that has also been investigated by Hensman et al. (2015a). The results, using a logistic sigmoid function as the forward model in our methods (EKS and UKS), are shown in Table 2, together with the methods of Hensman et al. (2015a) and Dezfouli & Bonilla (2015), which we refer to as HMG and DB, respectively. As before, we report the mean negative log probability (NLP) and the error rate on the test set for different numbers of features.

Table 2. The performance of the models on the MNIST dataset for the task of classifying the even digits vs. the odd digits.

| Method | NLP (D = 1000) | NLP (D = 2000) | Error rate (D = 1000) | Error rate (D = 2000) |
|--------|----------------|----------------|-----------------------|-----------------------|
| EKS    | 0.129 | 0.088 | 0.043 | 0.026 |
| UKS    | 0.129 | 0.088 | 0.043 | 0.026 |
| HMG    | 0.069 |       | 0.022 |       |
| DB     | 0.068 |       | 0.022 |       |

We see that our methods achieve performance similar to HMG and DB when using 2000 features. While DB is an inducing-point approach that uses 2000 inducing points fixed via clustering, HMG uses 200 inducing points with the extra overhead of learning their locations. Overall, we conclude that the random feature approximations in our extended and unscented models are competitive with state-of-the-art approaches to sparse GPs.

**Multi-class classification.** One of the contributions of our approach with respect to the original EGP and UGP models is its scalability to a large number of observations with multiple outputs. In this experiment we applied our algorithms to the task of classifying all digits of MNIST using a softmax forward model. Our methods (EKS and UKS) achieved similar performance. For example, the EKS obtained an error rate of 4.75% and an NLP of 0.1887 when using D = 1000 features. When we increased the number of features to D = 2000, it attained an error rate of 3.81% and an NLP of 0.1304. These error rates are lower than the 5.95% reported by Gal et al. (2014), while Dezfouli & Bonilla (2015) reported an error rate of 2.51%. As a reference, linear classifiers achieve around 12% error rate on this task, while the state of the art is below 1%.
### 7.4. Seismic Inversion

In this experiment we perform an inversion of a seismic survey line using a one-dimensional seismic sensor forward model. The goal is to infer both the geometry of the interfaces between subsurface geological layers and the seismic propagation velocity within each layer from noisy surface observations of the sound reflection times, $\mathbf{Y}$. Our dataset is part of a real seismic survey of the Otway Basin region in Victoria, Australia. The survey is interpreted, specifying reflection time estimates rather than raw amplitudes. Our transect contains 113 sites with four interface reflections (layers) per site. The inputs, $\mathbf{X}$, are the surface positions (in meters) of ground seismic sensors, and we use the following forward model:

$$g_p^{\text{time}} = \begin{cases} \dfrac{2\left(f_p^{\text{depth}} - f_{p-1}^{\text{depth}}\right)}{f_p^{\text{vel}}} + g_{p-1}^{\text{time}}, & \text{if } p > 0, \\[1.5ex] \dfrac{2\, f_0^{\text{depth}}}{f_0^{\text{vel}}}, & \text{if } p = 0, \end{cases} \tag{38}$$

where the $P$ output tasks $g_p^{\text{time}}$ are the reflection times from each layer. These outputs depend on two latent tasks per layer: $f_p^{\text{depth}}(\mathbf{X})$, the geological depth of layer $p$ corresponding to each surface location, and $f_p^{\text{vel}}(\mathbf{X})$, the velocity of layer $p$ below each surface input location. The layer depths are also clipped to be at least as deep as the previous layer, $f_p^{\text{depth}} = \max\{f_p^{\text{depth}}, f_{p-1}^{\text{depth}}\}\ \forall p$, to enforce a geologically valid depth structure.

We wish to infer the latent depths and velocities of each of the four layers. Consequently, $Q = 2P$, and the problem is under-constrained, since there is an infinite set of layer depths and velocities that could result in the observed reflection times. Thus, we cannot evaluate the quality of our inference by its ability to predict a single truth, but must instead assess its ability to correctly condition our prior on these constraining observations. For this experiment we define flat priors for each latent function, centered on the average depths and velocities of each layer. A baseline inference using MCMC has been applied to sample the posterior distribution; see the supplementary material for more details.
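For concreteness, the forward model in Equation (38), including the depth clipping described above, can be written for a single surface location as in the following minimal sketch (ours; the function and variable names are illustrative).

```python
import numpy as np

def seismic_forward_model(depths, velocities):
    # Equation (38): map per-layer depths and velocities at one surface location to
    # cumulative two-way reflection times. depths and velocities are length-P arrays.
    P = depths.shape[0]
    times = np.empty(P)
    prev_depth = 0.0
    prev_time = 0.0
    for p in range(P):
        depth = max(depths[p], prev_depth)   # clip to enforce a valid layer ordering
        times[p] = 2.0 * (depth - prev_depth) / velocities[p] + prev_time
        prev_depth, prev_time = depth, times[p]
    return times
```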
The results of both the EKS and UKS methods were very similar. Figure 3 shows the results from the EKS method, inferring the distribution over layer boundary depths (left) and seismic transmission velocities (right), with the MCMC solution overlaid. Clearly the MCMC solution is in agreement with the structure of the layer velocities and depths. However, we note that the MCMC draws are smoother. This may be because the limited number of radial basis functions used in the MCMC has restricted its ability to infer high-frequency details. Additional evaluation of the results can be found in the supplementary material.

Figure 3. Results of the EKS on the seismic inversion problem. The inferred layer boundaries (left) and seismic velocities (right) are shown in blue, indicating the predictive means and standard deviation envelopes. Draws from the MCMC inversion are overlaid in dotted black.

## 8. Related Work

Most previous work in the GP community has focused on addressing the scalability, multi-task learning, or nonlinear likelihood challenges in isolation. For example, with regard to scalability, the seminal work of Quiñonero-Candela & Rasmussen (2005) allowed the community to understand most sparse approximations in GP models from a probabilistic perspective, and the framework of Titsias (2009) has become the underpinning machinery of most modern scalable approaches to GP regression and classification. Nevertheless, the scalability problem in GP models continues to be an intensive area of research, with the recently developed distributed inference framework of Gal et al. (2014), and the variational inference frameworks for scalable GP regression and classification by Hensman et al. (2013) and Hensman et al. (2015a), respectively.

With regard to multi-output and multi-task learning, one of the most notable approaches was developed by Alvarez & Lawrence (2009) using the convolution formalism, and their later work also focuses on developing efficient inference algorithms for such models (Alvarez & Lawrence, 2011). Finally, concerning nonlinear likelihoods, Opper & Archambeau (2009) presented the seemingly surprising (but powerful) result that a full Gaussian posterior for models with GP priors and general i.i.d. likelihoods can be estimated efficiently using variational inference.

Although the work of Rahimi & Recht (2008; 2009) has been highly influential in the area of deterministic kernel machines, it is surprising that their random kitchen sinks (RKS) approximations had not been investigated more thoroughly in probabilistic kernel frameworks such as Gaussian process models. For example, Dai et al. (2014) have used random features for kernel machines, although their main focus is on non-probabilistic approaches underpinned by convex optimization. Only very recently, Yang et al. (2015) have investigated the problem of developing scalable and flexible probabilistic kernel approaches for regression using fast approximations to the original RKS (Le et al., 2013).

Contemporary to our work, Hensman et al. (2015b) and Dezfouli & Bonilla (2015) have proposed scalable approaches to inference in GP models with general likelihoods that build upon the inducing-point formalism of Titsias (2009). We see our extended and unscented kitchen sinks methods as alternative approaches to their work. A thorough comparison and evaluation of both approaches is an interesting area for immediate future work.

## 9. Conclusion & Discussion

We have presented the EKS and the UKS methods for posterior inference in GP models with nonlinear likelihoods. These methods are multi-task and scalable generalizations of the EGP and UGP algorithms of Steinberg & Bonilla (2014). The UKS, like the UGP, is a black-box method in that it requires no gradients of the nonlinearity in the likelihood. We have shown in our experiments that, by using RKS approximations, we can achieve prediction performance similar to the original algorithms, while drastically reducing their computational complexity and increasing their scalability. Furthermore, we have demonstrated the EKS and UKS successfully performing a multi-task Bayesian seismic inversion, which is an ideal use case for these algorithms as a fast and scalable alternative to MCMC.

Because our linearization algorithms are local and adaptive, the EKS (and the original EGP) can be seen as a refined (iterated) version of the Laplace approximation, where the linearization gets updated at every iteration as a function of the variational parameters.
The UKS (and the original UGP) can provide more elaborate and effective ways to propagate the first and second moments through nonlinearities. Since the approximating model in Equations (7) and (8) is no longer a GP model, other inference methods such as MCMC can be used in practice. In fact, as mentioned in Section 7.4, we implemented MCMC algorithms for these problems but found them too slow to be applied to large datasets. Nevertheless, further study and improvement of sampling algorithms such as MCMC is indeed a very promising research direction. Along a different line, we would also like to use a stochastic gradient optimizer for learning all model (hyper)parameters, and to extend the posterior representation to a mixture of Gaussians, in a similar fashion to Nguyen & Bonilla (2014) and Gershman et al. (2012).

## Acknowledgments

DS and AR were supported by the Science Industry Endowment Fund (RP 04-174) Big Data Knowledge Discovery project. NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.

## References

Alvarez, Mauricio and Lawrence, Neil D. Sparse convolved Gaussian processes for multi-output regression. In NIPS, pp. 57-64, 2009.

Alvarez, Mauricio A. and Lawrence, Neil D. Computationally efficient convolved multiple output Gaussian processes. JMLR, 12(5):1459-1500, 2011.

Dai, Bo, Xie, Bo, He, Niao, Liang, Yingyu, Raj, Anant, Balcan, Maria-Florina F., and Song, Le. Scalable kernel methods via doubly stochastic gradients. In NIPS, 2014.

Dezfouli, Amir and Bonilla, Edwin V. Scalable inference for Gaussian process models with black-box likelihoods. In NIPS, 2015.

Gal, Yarin, van der Wilk, Mark, and Rasmussen, Carl. Distributed variational inference in sparse Gaussian process regression and latent variable models. In NIPS, 2014.

Geist, Matthieu and Pietquin, Olivier. Statistically linearized recursive least squares. In IEEE International Workshop on Machine Learning for Signal Processing, 2010.

Gershman, Samuel J., Hoffman, Matthew D., and Blei, David M. Nonparametric variational inference. In ICML, 2012.

Hensman, James, Fusi, Nicolo, and Lawrence, Neil D. Gaussian processes for big data. In UAI, 2013.

Hensman, James, Matthews, Alexander, and Ghahramani, Zoubin. Scalable variational Gaussian process classification. In AISTATS, 2015a.

Hensman, James, Matthews, Alexander G., Filippone, Maurizio, and Ghahramani, Zoubin. MCMC for variationally sparse Gaussian processes. In NIPS, 2015b.

Julier, S. J. and Uhlmann, J. K. Unscented filtering and nonlinear estimation. Proceedings of the IEEE, 92(3):401-422, March 2004.

Le, Quoc, Sarlos, Tamas, and Smola, Alex. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.

Nguyen, Van Trung and Bonilla, Edwin. Collaborative multi-output Gaussian processes. In UAI, 2014.

Opper, Manfred and Archambeau, Cédric. The variational Gaussian approximation revisited. Neural Computation, 21(3):786-792, 2009.

Powell, M. J. D. A direct search optimization method that models the objective and constraint functions by linear interpolation. In Advances in Optimization and Numerical Analysis, 1994.

Powell, M. J. D. Direct search algorithms for optimization calculations. Acta Numerica, 7:287-336, 1998.

Powell, M. J. D. The BOBYQA algorithm for bound constrained optimization without derivatives. Technical report, Department of Applied Mathematics and Theoretical Physics, Cambridge, England, 2009.
Quiñonero-Candela, Joaquin and Rasmussen, Carl Edward. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939-1959, 2005.

Rahimi, Ali and Recht, Benjamin. Random features for large-scale kernel machines. In NIPS, 2008.

Rahimi, Ali and Recht, Benjamin. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, 2009.

Rasmussen, Carl Edward and Williams, Christopher K. I. Gaussian Processes for Machine Learning. The MIT Press, 2006.

Steinberg, Daniel M. and Bonilla, Edwin V. Extended and unscented Gaussian processes. In NIPS, 2014.

Titsias, Michalis. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.

Yang, Zichao, Wilson, Andrew Gordon, Smola, Alexander J., and Song, Le. À la carte - learning fast kernels. In AISTATS, 2015.