# Incremental Variational Sparse Gaussian Process Regression

Ching-An Cheng
Institute for Robotics and Intelligent Machines
Georgia Institute of Technology
Atlanta, GA 30332
cacheng@gatech.edu

Byron Boots
Institute for Robotics and Intelligent Machines
Georgia Institute of Technology
Atlanta, GA 30332
bboots@cc.gatech.edu

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Abstract

Recent work on scaling up Gaussian process regression (GPR) to large datasets has primarily focused on sparse GPR, which leverages a small set of basis functions to approximate the full Gaussian process during inference. However, the majority of these approaches are batch methods that operate on the entire training dataset at once, precluding the use of datasets that are streaming or too large to fit into memory. Although previous work has considered incrementally solving variational sparse GPR, most algorithms fail to update the basis functions and therefore perform suboptimally. We propose a novel incremental learning algorithm for variational sparse GPR based on stochastic mirror ascent of probability densities in reproducing kernel Hilbert space. This new formulation allows our algorithm to update basis functions online in accordance with the manifold structure of probability densities for fast convergence. We conduct several experiments and show that our proposed approach achieves better empirical performance in terms of prediction error than the recent state-of-the-art incremental solutions to variational sparse GPR.

1 Introduction

Gaussian processes (GPs) are nonparametric statistical models widely used for probabilistic reasoning about functions. Gaussian process regression (GPR) can be used to infer the distribution of a latent function $f$ from data. The merit of GPR is that it finds the maximum a posteriori estimate of the function while providing a profile of the remaining uncertainty. However, GPR also has drawbacks: like most nonparametric learning techniques, the time and space complexity of GPR scale polynomially with the amount of training data. Given $N$ observations, inference in GPR involves inverting an $N \times N$ covariance matrix, which requires $O(N^3)$ operations and $O(N^2)$ storage. Therefore, GPR is infeasible in practice for large $N$.

Sparse Gaussian process regression is a pragmatic solution that trades accuracy against computational complexity. Instead of parameterizing the posterior using all $N$ observations, the idea is to approximate the full GP using the statistics of a finite set of $M \ll N$ function values and to leverage the induced low-rank structure to reduce the time complexity to $O(M^2 N + M^3)$ and the memory to $O(M^2)$. Often sparse GPs are expressed in terms of the distribution of $f(\tilde{x}_i)$, where $\tilde{X} = \{\tilde{x}_i \in \mathcal{X}\}_{i=1}^{M}$ are called inducing points or pseudo-inputs [21, 23, 18, 26]. A more general representation leverages the information about the inducing functions $(L_i f)(\tilde{x}_i)$, defined by indirect measurements of $f$ through bounded linear operators $L_i$ (e.g. integrals), to more compactly capture the full GP [27, 8]. In this work, we embrace the general notion of inducing functions, which trivially includes $f(\tilde{x}_i)$ by choosing $L_i$ to be the identity. With abuse of notation, we reuse the term inducing points $\tilde{X}$ to denote the parameters that define the inducing functions.

Learning a sparse GP representation in regression can be summarized as inference of the hyperparameters, the inducing points, and the statistics of the inducing functions.
One approach to learning is to treat all of these parameters as hyperparameters and find the solution that maximizes the marginal likelihood [21, 23, 18]. An alternative approach is to view the inducing points and the statistics of the inducing functions as variational parameters of a class of full GPs that approximate the true posterior of $f$, and to solve the problem via variational inference, which has been shown to be robust to over-fitting [26, 1].

All of the above methods are designed for the batch setting, where all of the data is collected in advance and used at once. However, if the training dataset is extremely large, or the data are streaming and encountered in sequence, we may want to incrementally update the approximate posterior of the latent function $f$. Early work by Csató and Opper [6] proposed an online version of GPR, which greedily performs moment matching of the true posterior given one sample at a time instead of the posterior given all samples. More recently, several attempts have been made to modify variational batch algorithms into incremental algorithms for learning sparse GPs [1, 9, 10]. Most of these methods rely on the fact that variational sparse GPR with fixed inducing points and hyperparameters is equivalent to inference in the conjugate exponential family: Hensman et al. [9] propose a stochastic approximation of the variational sparse GPR problem [26] based on stochastic natural gradient ascent [11]; Hoang et al. [10] generalize this approach to the case of general Gaussian process priors. Unlike the original variational algorithm for sparse GPR [26], which finds the optimal inducing points and hyperparameters, these algorithms only update the statistics of the inducing functions $f_{\tilde{X}}$.

In this paper, we propose an incremental learning algorithm for variational sparse GPR, which we denote iVSGPR. Leveraging the dual formulation of variational sparse GPR in a reproducing kernel Hilbert space (RKHS), iVSGPR performs stochastic mirror ascent in the space of probability densities [17] to update the approximate posterior of $f$, and stochastic gradient ascent to update the hyperparameters. Stochastic mirror ascent, similar to stochastic natural gradient ascent, considers the manifold structure of probability distributions and therefore converges faster than the naive gradient approach. In each iteration, iVSGPR solves a variational sparse GPR problem of the size of a minibatch. As a result, iVSGPR has constant complexity per iteration and can learn all of the hyperparameters, the inducing points, and the associated statistics online.

2 Background

In this section, we provide a brief summary of Gaussian process regression and sparse Gaussian process regression for efficient inference before proceeding to introduce our incremental algorithm for variational sparse Gaussian process regression in Section 3.

2.1 Gaussian Process Regression

Let $\mathcal{F}$ be a family of real-valued continuous functions $f : \mathcal{X} \to \mathbb{R}$. A GP is a distribution over functions $f \in \mathcal{F}$ such that, for any finite set $X \subset \mathcal{X}$, $\{f(x) \,|\, x \in X\}$ is Gaussian distributed $\mathcal{N}(f(x) \,|\, m(x), k(x, x'))$: for any $x, x' \in X$, $m(x)$ and $k(x, x')$ represent the mean of $f(x)$ and the covariance between $f(x)$ and $f(x')$, respectively. In shorthand, we write $f \sim \mathcal{GP}(m, k)$. The mean $m(x)$ and the covariance $k(x, x')$ (the kernel function) are often parametrized by a set of hyperparameters which encode our prior belief about the unknown function $f$.
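As a concrete illustration of this definition (our own sketch, not part of the original paper), the snippet below evaluates a squared-exponential covariance, the SE-ARD form used later in Section 4, on a finite set of inputs and draws joint samples of the corresponding function values, which are exactly the finite-dimensional Gaussian marginals described above.

```python
import numpy as np

def se_ard_kernel(X1, X2, rho=1.0, lengthscales=None):
    """SE-ARD covariance: k(x, x') = rho^2 * prod_d exp(-(x_d - x'_d)^2 / (2 s_d^2))."""
    if lengthscales is None:
        lengthscales = np.ones(X1.shape[1])
    diff = X1[:, None, :] - X2[None, :, :]              # pairwise differences, (N1, N2, D)
    sq = np.sum((diff / lengthscales) ** 2, axis=-1)    # scaled squared distances
    return rho ** 2 * np.exp(-0.5 * sq)

# Any finite set of inputs yields a jointly Gaussian vector of function values.
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
K = se_ard_kernel(X, X) + 1e-8 * np.eye(len(X))         # small jitter for numerical stability
f_samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)  # three draws of f(X) ~ N(0, K)
```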
In this work, for simplicity, we assume that $m(x) = 0$ and that the kernel can be parameterized as $k(x, x') = \rho^2 g_s(x, x')$, where $g_s(x, x')$ is a positive definite kernel, $\rho^2$ is a scaling factor, and $s$ denotes the other hyperparameters [20].

The objective of GPR is to infer the posterior probability of the function $f$ given data $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$. In learning, the function value $f(x_i)$ is treated as a latent variable and the observation $y_i = f(x_i) + \epsilon_i$ is modeled as the function corrupted by i.i.d. noise $\epsilon_i \sim \mathcal{N}(\epsilon \,|\, 0, \sigma^2)$. Let $X = \{x_i\}_{i=1}^N$. The posterior probability distribution $p(f|y)$ can be compactly summarized as $\mathcal{GP}(m_{|\mathcal{D}}, k_{|\mathcal{D}})$:

$$m_{|\mathcal{D}}(x) = k_{x,X} (K_X + \sigma^2 I)^{-1} y \qquad (1)$$

$$k_{|\mathcal{D}}(x, x') = k_{x,x'} - k_{x,X} (K_X + \sigma^2 I)^{-1} k_{X,x'} \qquad (2)$$

where $y = (y_i)_{i=1}^N \in \mathbb{R}^N$, $k_{x,X} \in \mathbb{R}^{1 \times N}$ denotes the vector of cross-covariances between $x$ and $X$, and $K_X \in \mathbb{R}^{N \times N}$ denotes the empirical covariance matrix of the training set. The hyperparameters $\theta := (s, \rho, \sigma)$ in the GP are learned by maximizing the log-likelihood of the observations $y$:

$$\max_{\theta} \log p(y) = \max_{\theta} \log \mathcal{N}(y \,|\, 0, K_X + \sigma^2 I). \qquad (3)$$

2.2 Sparse Gaussian Process Regression

A straightforward approach to sparse GPR is to approximate the GP prior of interest with a degenerate GP [21, 23, 18]. Formally, for any $x_i, x_j \in X$, it assumes that

$$f(x_i) \perp y_i \,|\, f_{\tilde{X}}, \qquad f(x_i) \perp f(x_j) \,|\, f_{\tilde{X}}, \qquad (4)$$

where $f_{\tilde{X}}$ denotes $((L_i f)(\tilde{x}_i))_{i=1}^M$ and $\perp$ denotes probabilistic independence between two random variables. That is, the original empirical covariance matrix $K_X$ is replaced by a rank-$M$ approximation $\hat{K}_X := K_{X,\tilde{X}} K_{\tilde{X}}^{-1} K_{\tilde{X},X}$, where $K_{\tilde{X}}$ is the covariance of $f_{\tilde{X}}$ and $K_{X,\tilde{X}} \in \mathbb{R}^{N \times M}$ is the cross-covariance between $f_X$ and $f_{\tilde{X}}$. Let $\Lambda \in \mathbb{R}^{N \times N}$ be diagonal. The inducing points $\tilde{X}$ are treated as hyperparameters and can be found by jointly maximizing the log-likelihood with $\theta$:

$$\max_{\theta, \tilde{X}} \log \mathcal{N}(y \,|\, 0, \hat{K}_X + \sigma^2 I + \Lambda). \qquad (5)$$

Several approaches to sparse GPR can be viewed as special cases of this problem [18]: the Deterministic Training Conditional (DTC) approximation [21] sets $\Lambda$ to zero. To heal the degeneracy in $p(f_X)$, the Fully Independent Training Conditional (FITC) approximation [23] includes heteroscedastic noise, setting $\Lambda = \mathrm{diag}(K_X - \hat{K}_X)$. As a result, the sum $\Lambda + \hat{K}_X$ matches the true covariance $K_X$ in the diagonal terms. This general maximum likelihood scheme for finding the inducing points is adopted with variations in [24, 27, 8, 2]. A major drawback of all of these approaches is that they can over-fit due to the high degrees of freedom $\tilde{X}$ in the prior parametrization [26].

Variational sparse GPR can alternatively be formulated to approximate the posterior of the latent function by a full GP parameterized by the inducing points and the statistics of the inducing functions [1, 26]. Specifically, Titsias [26] proposes to use

$$q(f_X, f_{\tilde{X}}) = p(f_X | f_{\tilde{X}})\, q(f_{\tilde{X}}) \qquad (6)$$

to approximate $p(f_X, f_{\tilde{X}} | y)$, where $q(f_{\tilde{X}}) = \mathcal{N}(f_{\tilde{X}} \,|\, m, S)$ is the Gaussian approximation of $p(f_{\tilde{X}} | y)$ and $p(f_X | f_{\tilde{X}}) = \mathcal{N}(f_X \,|\, K_{X,\tilde{X}} K_{\tilde{X}}^{-1} f_{\tilde{X}},\, K_X - \hat{K}_X)$ is the conditional probability in the full GP. The novelty here is that $q(f_X, f_{\tilde{X}})$, despite being parametrized by finitely many parameters, is still a full GP, which, unlike its predecessor [21], can be infinite-dimensional. The inference problem of variational sparse GPR is solved by minimizing the KL-divergence $\mathrm{KL}[q(f_X, f_{\tilde{X}}) \,\|\, p(f_X, f_{\tilde{X}} | y)]$. In practice, the minimization problem is transformed into the maximization of a lower bound of the log-likelihood [26]:

$$\max_{\theta} \log p(y) \ge \max_{\theta, \tilde{X}, m, S} \int q(f_X, f_{\tilde{X}}) \log \frac{p(y|f_X)\, p(f_X|f_{\tilde{X}})\, p(f_{\tilde{X}})}{q(f_X, f_{\tilde{X}})}\, df_X\, df_{\tilde{X}}$$
$$= \max_{\theta, \tilde{X}, m, S} \int p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}}) \log \frac{p(y|f_X)\, p(f_{\tilde{X}})}{q(f_{\tilde{X}})}\, df_X\, df_{\tilde{X}}$$
$$= \max_{\theta, \tilde{X}} \log \mathcal{N}(y \,|\, 0, \hat{K}_X + \sigma^2 I) - \frac{1}{2\sigma^2} \mathrm{Tr}(K_X - \hat{K}_X). \qquad (7)$$

The last equality results from exact maximization over $m$ and $S$; for the treatment of non-conjugate likelihoods, see [22]. We note that $q(f_{\tilde{X}})$ is a function of $m$ and $S$, whereas $p(f_{\tilde{X}})$ and $p(f_X|f_{\tilde{X}})$ are functions of $\tilde{X}$. As a result, $\tilde{X}$ becomes a set of variational parameters that can be optimized without over-fitting. Compared with (5), the variational approach in (7) regularizes the learning with the penalty $\mathrm{Tr}(K_X - \hat{K}_X)$ and therefore exhibits better generalization performance.
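To make the collapsed bound in (7) concrete, here is a small NumPy/SciPy sketch (ours, not from the paper) that evaluates it given precomputed covariances; the argument names Kx, Kxz, and Kz stand for $K_X$, $K_{X,\tilde{X}}$, and $K_{\tilde{X}}$ and are our own choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

def collapsed_bound(Kx, Kxz, Kz, y, sigma2):
    """Collapsed variational lower bound of Eq. (7):
       log N(y | 0, Khat + sigma^2 I) - Tr(Kx - Khat) / (2 sigma^2),
       with Khat = Kxz Kz^{-1} Kzx.  Written naively in O(N^3) for clarity;
       a practical implementation exploits the low-rank structure for O(M^2 N + M^3)."""
    N = len(y)
    Khat = Kxz @ np.linalg.solve(Kz, Kxz.T)          # rank-M approximation of Kx
    ll = multivariate_normal.logpdf(y, mean=np.zeros(N), cov=Khat + sigma2 * np.eye(N))
    reg = np.trace(Kx - Khat) / (2.0 * sigma2)       # penalty for the variance the sparse model discards
    return ll - reg
```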
Several subsequent works employ similar strategies: Alvarez et al. [3] adopt the same variational approach in the multi-output regression setting with scaled basis functions, and Abdel-Gawad et al. [1] use expectation propagation to solve for the approximate posterior under the same factorization.

3 Incremental Variational Sparse Gaussian Process Regression

Despite leveraging sparsity, the batch solution to the variational objective in (7) requires $O(M^2 N)$ operations and access to all of the training data during each optimization step [26], which means that learning from large datasets is still infeasible. Recently, several attempts have been made to incrementally solve the variational sparse GPR problem in order to learn better models from large datasets [1, 9, 10]. The key idea is to rewrite (7) explicitly as a sum over individual observations:

$$\max_{\theta, \tilde{X}, m, S} \int p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}}) \log \frac{p(y|f_X)\, p(f_{\tilde{X}})}{q(f_{\tilde{X}})}\, df_X\, df_{\tilde{X}} = \max_{\theta, \tilde{X}, m, S} \mathbb{E}_{q(f_{\tilde{X}})}\!\left[\sum_{i=1}^{N} \mathbb{E}_{p(f_{x_i}|f_{\tilde{X}})}[\log p(y_i|f_{x_i})] + \log \frac{p(f_{\tilde{X}})}{q(f_{\tilde{X}})}\right]. \qquad (8)$$

The objective function in (8), with $\tilde{X}$ fixed, is identical to that of stochastic variational inference [11] for conjugate exponential families. Hensman et al. [9] exploit this idea to incrementally update the statistics $m$ and $S$ via stochastic natural gradient ascent,^1 which, at the $t$-th iteration, takes the direction derived from the limit of maximizing (8) subject to $\mathrm{KL}_{sym}(q_t(f_{\tilde{X}}) \,\|\, q_{t-1}(f_{\tilde{X}})) < \epsilon$ as $\epsilon \to 0$. Natural gradient ascent considers the manifold structure of probability distributions derived from the KL divergence and is known to be Fisher efficient [4].

^1 Although $\tilde{X}$ was fixed in their experiments, it can potentially be updated by stochastic gradient ascent.

Although the optimal inducing points $\tilde{X}$, like the statistics $m$ and $S$, should be updated given new observations, it is difficult to design natural gradient ascent for learning the inducing points $\tilde{X}$ online. Because $p(f_X|f_{\tilde{X}})$ in (8) depends on all the observations, evaluating the divergence with respect to $p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}})$ over iterations becomes infeasible.

We propose a novel approach to incremental variational sparse GPR, iVSGPR, that works by reformulating (7) in its RKHS dual form. This avoids the issue of the posterior approximation $p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}})$ referring to all observations. As a result, we can perform stochastic approximation of (7) while monitoring the KL divergence between the posterior approximations due to the change of $m$, $S$, and $\tilde{X}$ across iterations. Specifically, we use stochastic mirror ascent [17] in the space of probability densities in RKHS, which was recently proven to be as efficient as stochastic natural gradient ascent [19]. In each iteration, iVSGPR solves a subproblem of fractional Bayesian inference, which we show can be formulated as a standard variational sparse GPR problem of the size of a minibatch in $O(M^2 N_m + M^3)$ operations, where $N_m$ is the size of a minibatch.
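Under the Gaussian observation model of Section 2.1, the summands in (8) have closed form, so the objective can be estimated from a minibatch. The sketch below (our illustration; the variable names are our own) shows the quantity that stochastic variational inference ascends when the inducing points are fixed.

```python
import numpy as np

def minibatch_bound(Kz, Kzb, kbb_diag, y_b, m, S, sigma2, N):
    """Minibatch estimate of the uncollapsed bound (Eq. 8) for a Gaussian likelihood,
       in the spirit of Hensman et al. [9].  Kz: M x M covariance of the inducing
       functions; Kzb: M x Nb cross-covariance with the minibatch; kbb_diag: Nb prior
       variances k(x_i, x_i); q(f_Z) = N(m, S); N is the full training-set size."""
    Nb = len(y_b)
    A = np.linalg.solve(Kz, Kzb)                      # Kz^{-1} Kzb, shape (M, Nb)
    mu = A.T @ m                                      # predictive means at the minibatch inputs
    var = kbb_diag - np.sum(Kzb * A, axis=0) + np.sum(A * (S @ A), axis=0)
    # Expected log-likelihood, rescaled by N / Nb to estimate the full-data sum.
    ell = -0.5 * np.sum(np.log(2 * np.pi * sigma2) + ((y_b - mu) ** 2 + var) / sigma2)
    ell *= N / Nb
    # KL[q(f_Z) || p(f_Z)] between the Gaussian approximation and the prior N(0, Kz).
    M = len(m)
    Kz_inv_S = np.linalg.solve(Kz, S)
    kl = 0.5 * (np.trace(Kz_inv_S) + m @ np.linalg.solve(Kz, m) - M
                + np.linalg.slogdet(Kz)[1] - np.linalg.slogdet(S)[1])
    return ell - kl
```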
3.1 Dual Representations of Gaussian Processes in RKHS

An RKHS $\mathcal{H}$ is a Hilbert space of functions satisfying the reproducing property: for every $x$ there exists $k_x \in \mathcal{H}$ such that $f(x) = \langle f, k_x \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$. In general, $\mathcal{H}$ can be infinite-dimensional and can uniformly approximate continuous functions on a compact set [16]. To simplify the notation, we write $k_x^T f$ for $\langle f, k_x \rangle_{\mathcal{H}}$ and $f^T L g$ for $\langle f, Lg \rangle_{\mathcal{H}}$, where $f, g \in \mathcal{H}$ and $L : \mathcal{H} \to \mathcal{H}$, even if $\mathcal{H}$ is infinite-dimensional.

A Gaussian process $\mathcal{GP}(m, k)$ has a dual representation in an RKHS $\mathcal{H}$ [12]: there exist $\mu \in \mathcal{H}$ and a positive semi-definite linear operator $\Sigma : \mathcal{H} \to \mathcal{H}$ such that for any $x, x' \in \mathcal{X}$ with $\phi_x, \phi_{x'} \in \mathcal{H}$,

$$m(x) = \rho\, \phi_x^T \mu, \qquad k(x, x') = \rho^2\, \phi_x^T \Sigma\, \phi_{x'}. \qquad (9)$$

That is, the mean function has a realization $\mu$ in $\mathcal{H}$, the RKHS defined by the reproducing kernel $k(x, x') = \rho^2 \phi_x^T \phi_{x'}$; the covariance function can be equivalently represented by a linear operator $\Sigma$. In shorthand, with abuse of notation, we write $\mathcal{N}(f | \mu, \Sigma)$.^2 Note that we do not assume the samples from $\mathcal{GP}(m, k)$ are in $\mathcal{H}$. In the following, without loss of generality, we assume the GP prior considered in regression has $\mu = 0$ and $\Sigma = I$, that is, $m(x) = 0$ and $k(x, x') = \rho^2 \phi_x^T \phi_{x'}$.

^2 Because a GP can be infinite-dimensional, it cannot define a density but only a Gaussian measure. The notation $\mathcal{N}(f | \mu, \Sigma)$ is used to indicate that the Gaussian measure can be defined, equivalently, by $\mu$ and $\Sigma$.

3.1.1 Subspace Parametrization of the Approximate Posterior

The full GP posterior approximation $p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}})$ in (7) can be written equivalently in a subspace parametrization using $\{\psi_{\tilde{x}_i} \in \mathcal{H} \,|\, \tilde{x}_i \in \tilde{X}\}_{i=1}^{M}$:

$$\tilde{\mu} = \Psi_{\tilde{X}}\, a, \qquad \tilde{\Sigma} = I + \Psi_{\tilde{X}} A \Psi_{\tilde{X}}^T, \qquad (10)$$

where $a \in \mathbb{R}^M$, $A \in \mathbb{R}^{M \times M}$ such that $\tilde{\Sigma} \succ 0$, and $\Psi_{\tilde{X}} : \mathbb{R}^M \to \mathcal{H}$ is defined as $\Psi_{\tilde{X}} a = \sum_{i=1}^M a_i \psi_{\tilde{x}_i}$. Suppose $q(f_{\tilde{X}}) = \mathcal{N}(f_{\tilde{X}} | m, S)$ and define $\psi_{\tilde{x}_i}$ to satisfy $\Psi_{\tilde{X}}^T \tilde{\mu} = m$. By (10), $m = K_{\tilde{X}}\, a$ and $S = K_{\tilde{X}} + K_{\tilde{X}} A K_{\tilde{X}}$, which implies the relationship

$$a = K_{\tilde{X}}^{-1} m, \qquad A = K_{\tilde{X}}^{-1} S K_{\tilde{X}}^{-1} - K_{\tilde{X}}^{-1}, \qquad (11)$$

where the covariances related to the inducing functions are defined as $K_{\tilde{X}} = \Psi_{\tilde{X}}^T \Psi_{\tilde{X}}$ and $K_{X,\tilde{X}} = \rho\, \Phi_X^T \Psi_{\tilde{X}}$. The sparse structure results in $f(x) \sim \mathcal{GP}\big(k_{x,\tilde{X}} K_{\tilde{X}}^{-1} m,\; k_{x,x'} + k_{x,\tilde{X}} (K_{\tilde{X}}^{-1} S K_{\tilde{X}}^{-1} - K_{\tilde{X}}^{-1}) k_{\tilde{X},x'}\big)$, which is the same as $\int p(f(x)|f_{\tilde{X}})\, q(f_{\tilde{X}})\, df_{\tilde{X}}$, the posterior GP found in (7), where $k_{x,x'} = k(x, x')$ and $k_{x,\tilde{X}} = \rho\, \phi_x^T \Psi_{\tilde{X}}$. We note that the scaling factor $\rho$ is associated with the evaluation of $f(x)$, not the inducing functions $f_{\tilde{X}}$. In addition, we distinguish the hyperparameter $s$ (e.g. length scale), which controls the measurement basis $\phi_x$, from the parameters in the inducing points $\tilde{X}$.

A subspace parametrization corresponds to a full GP if $\tilde{\Sigma} \succ 0$. More precisely, because (10) is completely determined by the statistics $m$, $S$, and the inducing points $\tilde{X}$, the family of subspace-parametrized GPs lies on a nonlinear submanifold in the space of all GPs (the degenerate GP in (4) is a special case if we allow $I$ in (10) to be ignored).

3.1.2 Sparse Gaussian Process Regression in RKHS

We now reformulate the variational inference problem (7) in RKHS.^3 Following the previous section, the sparse GP structure of the posterior approximation $q(f_X, f_{\tilde{X}})$ in (6) has a corresponding dual representation in RKHS, $q(f) = \mathcal{N}(f | \tilde{\mu}, \tilde{\Sigma})$. Specifically, $q(f)$ and $q(f_X, f_{\tilde{X}})$ are related as follows:

$$q(f) = p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}})\, |K_{\tilde{X}}|^{1/2}\, |K_X - \hat{K}_X|^{1/2}, \qquad (12)$$

in which the determinant factors are due to the change of measure. The equality (12) allows us to rewrite (7) in terms of $q(f)$ simply as

$$\max_{q(f)} \mathcal{L}(q(f)) = \max_{q(f)} \int q(f) \log \frac{p(y|f)\, p(f)}{q(f)}\, df, \qquad (13)$$

or, equivalently, as $\min_{q(f)} \mathrm{KL}[q(f) \,\|\, p(f|y)]$. That is, the heuristically motivated variational problem (7) is indeed minimizing a proper KL-divergence between two Gaussian measures. A similar justification of (7) is given rigorously in [14] in terms of KL-divergence minimization between Gaussian processes, which can be viewed as a dual of our approach. Due to space limitations, the proofs of (12) and of the equivalence between (7) and (13) can be found in the Appendix.

^3 Here we assume the set $\mathcal{X}$ is finite and countable. This assumption suffices in learning and allows us to restrict $\mathcal{H}$ to be the finite-dimensional span of $\Phi_{\mathcal{X}}$. Rigorously, for infinite-dimensional $\mathcal{H}$, the equivalence can be written in terms of the Radon-Nikodym derivative between $q(f)\, df$ and the normal Gaussian measure, where $q(f)\, df$ denotes a Gaussian measure that has an RKHS representation given as $q(f)$.
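As a small illustration of the subspace parametrization of Section 3.1.1 (our own sketch, with assumed variable names), the conversion (11) and the resulting predictive equations can be written as:

```python
import numpy as np

def subspace_params(Kz, m, S):
    """Convert the statistics (m, S) of q(f_Z) = N(m, S) into the subspace
       parameters (a, A) of Eq. (11): a = Kz^{-1} m,  A = Kz^{-1} S Kz^{-1} - Kz^{-1}."""
    Kz_inv = np.linalg.inv(Kz)                 # acceptable for M in the hundreds
    a = Kz_inv @ m
    A = Kz_inv @ S @ Kz_inv - Kz_inv
    return a, A

def predict(kxz, kxx, a, A):
    """Predictive mean and variance of the subspace-parametrized GP at a test point:
       mean = k_{x,Z} a,  var = k(x, x) + k_{x,Z} A k_{Z,x}."""
    mean = kxz @ a
    var = kxx + kxz @ A @ kxz
    return mean, var
```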
The benefit of the formulation (13) is that, in its sampling form

$$\mathcal{L}(q) = \int q(f) \left[\sum_{i=1}^{N} \log p(y_i|f) + \log \frac{p(f)}{q(f)}\right] df, \qquad (14)$$

the approximate posterior $q(f)$ nicely summarizes all the variational parameters $\tilde{X}$, $m$, and $S$ without referring to the samples, as $p(f_X|f_{\tilde{X}})\, q(f_{\tilde{X}})$ does. Therefore, the KL-divergence of $q(f)$ across iterations can be used to regulate online learning.

3.2 Incremental Learning

Stochastic mirror ascent [17] considers the (non-)Euclidean structure on variables induced by a Bregman divergence (or prox-function) [5] in convex optimization. We apply it to solve the variational inference problem in (14), because (14) is convex in the space of probabilities [17]. Here, we ignore the dependency of $q(f)$ on $f$ for simplicity. At the $t$-th iteration, stochastic mirror ascent solves the subproblem

$$q_{t+1} = \arg\max_{q}\; \gamma_t \int \hat{\nabla} \mathcal{L}(q_t, y_t)\, q(f)\, df - \mathrm{KL}[q \,\|\, q_t], \qquad (15)$$

where $\gamma_t$ is the step size and $\hat{\nabla} \mathcal{L}(q_t, y_t)$ is the sampled subgradient of $\mathcal{L}$ with respect to $q$ when the observation is $(x_t, y_t)$. The algorithm converges at rate $O(t^{-1/2})$ if (15) is solved within numerical error $\epsilon_t$ such that $\sum_t \epsilon_t \in O(\sum_t \gamma_t^2)$ [7].

The subproblem (15) is actually equivalent to sparse variational GP regression with a general Gaussian prior. By the definition of $\mathcal{L}(q)$ in (14), (15) can be derived as

$$q_{t+1} = \arg\max_{q}\; \gamma_t \int q(f) \left[N \log p(y_t|f) + \log \frac{p(f)}{q_t(f)}\right] df - \mathrm{KL}[q \,\|\, q_t] = \arg\max_{q} \int q(f) \log \frac{p(y_t|f)^{N\gamma_t}\, p(f)^{\gamma_t}\, q_t(f)^{1-\gamma_t}}{q(f)}\, df. \qquad (16)$$

This equation is equivalent to (13), but with the prior modified to $p(f)^{\gamma_t} q_t(f)^{1-\gamma_t}$ and the likelihood modified to $p(y_t|f)^{N\gamma_t}$. Because $p(f)$ is an isotropic zero-mean Gaussian, $p(f)^{\gamma_t} q_t(f)^{1-\gamma_t}$ has a subspace parametrization expressed in the same basis functions as $q_t$. Suppose $q_t$ has mean $\tilde{\mu}_t$ and precision $\tilde{\Sigma}_t^{-1}$. Then $p(f)^{\gamma_t} q_t(f)^{1-\gamma_t}$ is equivalent to $\mathcal{N}(f | \hat{\mu}_t, \hat{\Sigma}_t)$ up to a constant factor, where $\hat{\mu}_t = (1-\gamma_t)\, \hat{\Sigma}_t \tilde{\Sigma}_t^{-1} \tilde{\mu}_t$ and $\hat{\Sigma}_t^{-1} = (1-\gamma_t)\, \tilde{\Sigma}_t^{-1} + \gamma_t I$. By (10), $\tilde{\Sigma}_t^{-1} = I - \Psi_{\tilde{X}} (A_t^{-1} + K_{\tilde{X}})^{-1} \Psi_{\tilde{X}}^T$ for some $A_t$, and therefore $\hat{\Sigma}_t^{-1} = I - (1-\gamma_t)\, \Psi_{\tilde{X}} (A_t^{-1} + K_{\tilde{X}})^{-1} \Psi_{\tilde{X}}^T$, which is expressed in the same basis. In addition, by (12), (16) can be further written in the form of (7) and therefore solved by a standard sparse variational GPR program with modified $m$ and $S$ (please see the Appendix for details). Although we derived the equations for a single observation, minibatches can be used, with the same convergence rate and reduced variance, by changing the factor $p(y_t|f)^{N\gamma_t}$ to $\prod_{i=1}^{N_m} p(y_{t_i}|f)^{N\gamma_t / N_m}$. The hyperparameters $\theta = (s, \rho, \sigma)$ in the GP can be updated along with the variational parameters using stochastic gradient ascent along the gradient of the minibatch estimate $\int q_t(f) \log \frac{p(y_t|f)^N\, p(f)}{q_t(f)}\, df$ with respect to $\theta$.
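The overall procedure can be summarized by the following structural sketch (our paraphrase of this section, not the authors' code). The minibatch subproblem solver and the hyperparameter gradient are passed in as user-supplied callables, since in the paper they correspond to a standard batch variational sparse GPR routine applied to (16) and to the stochastic gradient described above.

```python
def ivsgpr(minibatches, state, solve_subproblem, hyper_grad, gamma0=1.0, lr_hyper=1e-4):
    """Structural sketch of the iVSGPR outer loop.

    minibatches:      iterable of (X_b, y_b) pairs streamed from the data.
    state:            current model, holding (m, S, Z, theta): the statistics of the
                      inducing functions, the inducing points, and the hyperparameters.
    solve_subproblem: callable (X_b, y_b, state, gamma_t) -> state that solves the
                      minibatch-sized variational sparse GPR subproblem (16), whose prior
                      is the geometric mixture p(f)^gamma_t q_t(f)^(1 - gamma_t); a few
                      conjugate-gradient line searches are typically enough (Section 3.3).
    hyper_grad:       callable (X_b, y_b, state) -> gradient of the minibatch objective
                      with respect to the hyperparameters theta = (s, rho, sigma).
    """
    for t, (X_b, y_b) in enumerate(minibatches, start=1):
        gamma_t = gamma0 / (1.0 + t)                    # decaying mirror-ascent step size
        state = solve_subproblem(X_b, y_b, state, gamma_t)
        # Hyperparameters follow a plain stochastic gradient step on the minibatch objective.
        state.theta = state.theta + lr_hyper * gamma_t * hyper_grad(X_b, y_b, state)
    return state
```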
3.3 Related Work

The subproblem (16) is equivalent to first performing stochastic natural gradient ascent [11] on $q(f)$ in (14) and then projecting the distribution back onto the low-dimensional manifold specified by the subspace parametrization. At the $t$-th iteration, define $\bar{q}_t(f) := p(y_t|f)^{N\gamma_t}\, p(f)^{\gamma_t}\, q_t(f)^{1-\gamma_t}$. Because a GP can be viewed as a Gaussian measure in an infinite-dimensional RKHS, $\bar{q}_t(f)$ in (16) can be viewed as the result of taking a stochastic natural gradient ascent step of size $\gamma_t$ from $q_t(f)$. Then (16) becomes $\min_q \mathrm{KL}[q \,\|\, \bar{q}_t]$, which projects $\bar{q}_t$ back onto the subspace parametrization specified by the $M$ basis functions. Therefore, (16) can also be viewed as performing stochastic natural gradient ascent with a KL-divergence projection. From this perspective, we can see that if $\tilde{X}$, which controls the inducing functions, is fixed in the subproblem (16), iVSGPR degenerates to the algorithm of Hensman et al. [9].

Recently, several works have considered the manifold structure induced by the KL divergence in Bayesian inference [7, 25, 13]. Theis and Hoffman [25] use trust regions to mitigate the sensitivity of stochastic variational inference to the choice of hyperparameters and initialization. Let $\xi_t$ be the size of the trust region. At the $t$-th iteration, their method solves the objective $\max_q \mathcal{L}(q) - \xi_t\, \mathrm{KL}[q \,\|\, q_t]$, which is the same as subproblem (16) if applied to (14). The difference is that in (16) $\gamma_t$ is a decaying step-size sequence in stochastic mirror ascent, whereas $\xi_t$ is manually selected. A similar formulation also appears in [13], where the part of $\mathcal{L}(q)$ that is non-convex in the variational parameters is linearized. Dai et al. [7] use particles or a kernel density estimator to approximate the posterior of $\tilde{X}$ in the setting of a low-rank GP prior. By contrast, we follow Titsias's variational approach [26] and adopt a full GP as the approximate posterior, and therefore avoid the difficulties of estimating the posterior of $\tilde{X}$ and focus on the approximate posterior $q(f)$ of the function values.

The stochastic mirror ascent framework also sheds light on the convergence conditions of the algorithm. As pointed out by Dai et al. [7], the subproblem (15) can be solved up to $\epsilon_t$ accuracy as long as $\sum_t \epsilon_t$ is of order $O(\sum_t \gamma_t^2)$, where $\gamma_t \in O(1/\sqrt{t})$ [17]. Also, Khan et al. [13] solve a linearized approximation of (15) in each step and report satisfactory empirical results. Although the variational sparse GPR subproblem (16) is a nonconvex optimization in $\tilde{X}$ and is often solved by nonlinear conjugate gradient ascent, empirically the objective function increases most significantly in the first few iterations. Therefore, based on the results in [7], we argue that in online learning (16) can be solved approximately by performing a small fixed number of line searches.

4 Experiments

We compare our method iVSGPR with VSGPR_svi, the state-of-the-art variational sparse GPR method based on stochastic variational inference [9], in which i.i.d. data are sampled from the training dataset to update the models. We consider a zero-mean GP prior generated by a squared-exponential kernel with automatic relevance determination (SE-ARD) [20], $k(x, x') = \rho^2 \prod_{d=1}^{D} \exp\!\left(-\frac{(x_d - x'_d)^2}{2 s_d^2}\right)$, where $s_d > 0$ is the length scale of dimension $d$ and $D$ is the dimensionality of the input. For the inducing functions, we modified the multi-scale kernel in [27] to

$$\prod_{d=1}^{D} \sqrt{\frac{2\, l_{x,d}\, l_{x',d}}{l_{x,d}^2 + l_{x',d}^2}}\, \exp\!\left(-\frac{(x_d - x'_d)^2}{l_{x,d}^2 + l_{x',d}^2}\right), \qquad (17)$$

where $l_{x,d}$ is the length-scale parameter. The definition (17) includes the SE-ARD kernel as a special case, which can be recovered by identifying $\psi_x = \phi_x$ and $(l_{x,d})_{d=1}^D = (s_d)_{d=1}^D$, and hence their cross-covariance can be computed. In the following experiments, we set the number of inducing functions to 512. All models were initialized with the same hyperparameters and inducing points: the hyperparameters were selected as the optimal ones found by batch variational sparse GPR [26] trained on a subset of the training dataset of size 2048; the inducing points were initialized as random samples from the first minibatch.
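For concreteness, here is a small NumPy sketch (ours, assuming the reconstruction of (17) above) of the multi-scale kernel used for the inducing functions; the final check verifies the SE-ARD special case noted in the text.

```python
import numpy as np

def multiscale_kernel(X1, L1, X2, L2):
    """Multi-scale covariance of Eq. (17):
       prod_d sqrt(2 l_{x,d} l_{x',d} / (l_{x,d}^2 + l_{x',d}^2))
              * exp(-(x_d - x'_d)^2 / (l_{x,d}^2 + l_{x',d}^2)).
       X1: (N1, D) inputs with per-point length scales L1: (N1, D); likewise X2, L2."""
    denom = L1[:, None, :] ** 2 + L2[None, :, :] ** 2           # l_{x,d}^2 + l_{x',d}^2
    scale = np.sqrt(2.0 * L1[:, None, :] * L2[None, :, :] / denom)
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return np.prod(scale * np.exp(-diff2 / denom), axis=-1)

# SE-ARD special case: identical length scales on both sides recover g_s(x, x').
D, s = 3, np.array([0.5, 1.0, 2.0])
x = np.random.randn(4, D)
K_ms = multiscale_kernel(x, np.tile(s, (4, 1)), x, np.tile(s, (4, 1)))
K_ref = np.prod(np.exp(-(x[:, None, :] - x[None, :, :]) ** 2 / (2 * s ** 2)), axis=-1)
assert np.allclose(K_ms, K_ref)
```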
We chose the learning rate for stochastic mirror ascent, used to update the posterior approximation, to be $\gamma_t = (1+t)^{-1}$; the learning rate for the stochastic gradient ascent used to update the hyperparameters is set to $10^{-4} \gamma_t$. We evaluate the models in terms of the normalized mean squared error (nMSE) on a held-out test set after 500 iterations.

We performed experiments on three real-world robotic datasets, kin40k,^4 SARCOS,^5 and KUKA,^6 and three variations of iVSGPR: iVSGPR_5, iVSGPR_10, and iVSGPR_ada.^7 For the kin40k and SARCOS datasets, we also implemented VSGPR*_svi, which uses stochastic variational inference to update $m$ and $S$ but fixes the hyperparameters and inducing points to the solution of batch variational sparse GPR [26] trained on all of the training data. Because VSGPR*_svi reflects the ideal scenario of performing stochastic approximation under the selected learning rate, we consider it the optimal goal we want to approach.

The experimental results on kin40k and SARCOS are summarized in Table 1a. In general, the adaptive scheme iVSGPR_ada performs the best, but we observe that even performing a small fixed number of iterations (iVSGPR_5, iVSGPR_10) results in performance that is close to, if not better than, VSGPR*_svi. Possible explanations are that the change of the objective function in gradient-based algorithms is dominant in the first few iterations, and that the inducing points and hyperparameters found in batch optimization have finite numerical resolution. For example, Figure 1a shows the change of test error over iterations when learning joint 2 of the SARCOS dataset. For all methods, the convergence rate improves with a larger minibatch. In addition, from Figure 1b, we observe that the number of steps iVSGPR_ada needs to solve (16) decays with the number of iterations; only a small number of line searches is required after the first few iterations.

Tables 1b and 1c show the experimental results on two larger datasets. In these experiments, we mixed the offline and online partitions of the original KUKA dataset and then split the data 90% for training and 10% for testing in order to create an online i.i.d. streaming scenario. We did not compare to VSGPR*_svi on these datasets, since computing the inducing points and hyperparameters in batch is infeasible. As above, iVSGPR_ada stands out from the other models, closely followed by iVSGPR_10. We found that the difference between VSGPR_svi and the iVSGPR variants is much greater on these larger real-world benchmarks. Auxiliary experimental results illustrating convergence for all experiments summarized in Tables 1a, 1b, and 1c are included in the Appendix.

^4 kin40k: 10000 training data, 30000 testing data, 8 attributes [23].
^5 SARCOS: 44484 training data, 4449 testing data, 28 attributes. http://www.gaussianprocess.org/gpml/data/
^6 KUKA1 & KUKA2: 17560 offline data, 180360 online data, 28 attributes [15].
^7 The number in the subscript denotes the number of function calls allowed in nonlinear conjugate gradient descent [20] to solve subproblem (16); "ada" denotes that (16) is solved until the relative change of the objective is less than 10^{-5}.
Table 1: Testing error (nMSE) after 500 iterations. $N_m = 2048$; J$i$ denotes the $i$-th joint.

(a) kin40k and SARCOS

|           | VSGPR_svi | iVSGPR_5 | iVSGPR_10 | iVSGPR_ada | VSGPR*_svi |
|-----------|-----------|----------|-----------|------------|------------|
| kin40k    | 0.0959    | 0.0648   | 0.0608    | 0.0607     | 0.0535     |
| SARCOS J1 | 0.0247    | 0.0228   | 0.0214    | 0.0210     | 0.0208     |
| SARCOS J2 | 0.0193    | 0.0176   | 0.0159    | 0.0156     | 0.0156     |
| SARCOS J3 | 0.0125    | 0.0112   | 0.0104    | 0.0103     | 0.0104     |
| SARCOS J4 | 0.0048    | 0.0044   | 0.0040    | 0.0038     | 0.0039     |
| SARCOS J5 | 0.0267    | 0.0243   | 0.0229    | 0.0226     | 0.0230     |
| SARCOS J6 | 0.0300    | 0.0259   | 0.0235    | 0.0229     | 0.0230     |
| SARCOS J7 | 0.0101    | 0.0090   | 0.0082    | 0.0081     | 0.0101     |

(b) KUKA1

|    | VSGPR_svi | iVSGPR_5 | iVSGPR_10 | iVSGPR_ada |
|----|-----------|----------|-----------|------------|
| J1 | 0.1699    | 0.1455   | 0.1257    | 0.1176     |
| J2 | 0.1530    | 0.1305   | 0.1221    | 0.1138     |
| J3 | 0.1873    | 0.1554   | 0.1403    | 0.1252     |
| J4 | 0.1376    | 0.1216   | 0.1151    | 0.1108     |
| J5 | 0.1955    | 0.1668   | 0.1487    | 0.1398     |
| J6 | 0.1766    | 0.1645   | 0.1573    | 0.1506     |
| J7 | 0.1374    | 0.1357   | 0.1342    | 0.1333     |

(c) KUKA2

|    | VSGPR_svi | iVSGPR_5 | iVSGPR_10 | iVSGPR_ada |
|----|-----------|----------|-----------|------------|
| J1 | 0.1737    | 0.1452   | 0.1284    | 0.1214     |
| J2 | 0.1517    | 0.1312   | 0.1183    | 0.1081     |
| J3 | 0.2108    | 0.1818   | 0.1652    | 0.1544     |
| J4 | 0.1357    | 0.1171   | 0.1104    | 0.1046     |
| J5 | 0.2082    | 0.1846   | 0.1697    | 0.1598     |
| J6 | 0.1925    | 0.1890   | 0.1855    | 0.1809     |
| J7 | 0.1329    | 0.1309   | 0.1287    | 0.1275     |

[Figure 1: Online learning results for SARCOS joint 2. (a) Test error: nMSE evaluated on the held-out test set; the dashed and solid lines denote the results with $N_m = 512$ and $N_m = 2048$, respectively. (b) Number of function calls used by iVSGPR_ada in solving (16) (a maximum of 100 calls is imposed).]

5 Conclusion

We propose iVSGPR, a stochastic approximation of variational sparse GPR [26]. By reformulating the variational inference problem in RKHS, the updates of the statistics of the inducing functions and of the inducing points can be unified as stochastic mirror ascent on probability densities, which respects the manifold structure. In our experiments, iVSGPR shows better performance than the direct adoption of stochastic variational inference to solve variational sparse GPR. As iVSGPR executes a fixed number of operations for each minibatch, it is suitable for applications where training data are abundant, e.g. sensory data in robotics. In future work, we are interested in applying iVSGPR to extensions of sparse Gaussian processes such as GP-LVMs and dynamical system modeling.

References

[1] Ahmed H. Abdel-Gawad, Thomas P. Minka, et al. Sparse-posterior Gaussian processes for general likelihoods. arXiv preprint arXiv:1203.3507, 2012.
[2] Mauricio Alvarez and Neil D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Advances in Neural Information Processing Systems, pages 57-64, 2009.
[3] Mauricio A. Alvarez, David Luengo, Michalis K. Titsias, and Neil D. Lawrence. Efficient multioutput Gaussian processes through variational inducing kernels. In International Conference on Artificial Intelligence and Statistics, pages 25-32, 2010.
[4] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[5] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering with Bregman divergences. The Journal of Machine Learning Research, 6:1705-1749, 2005.
[6] Lehel Csató and Manfred Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641-668, 2002.
[7] Bo Dai, Niao He, Hanjun Dai, and Le Song. Scalable Bayesian inference via particle mirror descent. arXiv preprint arXiv:1506.03101, 2015.
[8] Anibal Figueiras-Vidal and Miguel Lázaro-Gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087-1095, 2009.
[9] James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.
[10] Trong Nghia Hoang, Quang Minh Hoang, and Kian Hsiang Low. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 569-578, 2015.
[11] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.
[12] Irina Holmes and Ambar N. Sengupta. The Gaussian Radon transform and machine learning. Infinite Dimensional Analysis, Quantum Probability and Related Topics, 18(03):1550019, 2015.
[13] Mohammad E. Khan, Pierre Baqué, François Fleuret, and Pascal Fua. Kullback-Leibler proximal variational inference. In Advances in Neural Information Processing Systems, pages 3384-3392, 2015.
[14] Alexander G. de G. Matthews, James Hensman, Richard E. Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.
[15] Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. In Advances in Neural Information Processing Systems, pages 972-980, 2014.
[16] Charles A. Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. The Journal of Machine Learning Research, 7:2651-2667, 2006.
[17] Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.
[18] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939-1959, 2005.
[19] Garvesh Raskutti and Sayan Mukherjee. The information geometry of mirror descent. IEEE Transactions on Information Theory, 61(3):1451-1457, 2015.
[20] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[21] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, 2003.
[22] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized GP models. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1302-1311, 2015.
[23] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257-1264, 2005.
[24] Edward Snelson and Zoubin Ghahramani. Local and global sparse Gaussian process approximations. In International Conference on Artificial Intelligence and Statistics, pages 524-531, 2007.
[25] Lucas Theis and Matthew D. Hoffman. A trust-region method for stochastic variational inference with applications to streaming data. arXiv preprint arXiv:1505.07649, 2015.
[26] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.
[27] Christian Walder, Kwang In Kim, and Bernhard Schölkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, pages 1112-1119. ACM, 2008.