Dual Parameterization of Sparse Variational Gaussian Processes

Vincent Adam (Aalto University / Secondmind.ai, Espoo, Finland / Cambridge, UK), vincent.adam@aalto.fi
Paul E. Chang (Aalto University, Espoo, Finland), paul.chang@aalto.fi
Mohammad Emtiyaz Khan (RIKEN Center for AI Project, Tokyo, Japan), emtiyaz.khan@riken.jp
Arno Solin (Aalto University, Espoo, Finland), arno.solin@aalto.fi

Abstract

Sparse variational Gaussian process (SVGP) methods are a common choice for non-conjugate Gaussian process inference because of their computational benefits. In this paper, we improve their computational efficiency by using a dual parameterization where each data example is assigned dual parameters, similar to the site parameters used in expectation propagation. Our dual parameterization speeds up inference using natural gradient descent and provides a tighter evidence lower bound for hyperparameter learning. The approach has the same memory cost as current SVGP methods, but it is faster and more accurate.

1 Introduction

Gaussian processes (GPs, [31]) have become ubiquitous models in the probabilistic machine learning toolbox, but their application is challenging due to two issues: poor $\mathcal{O}(n^3)$ scaling in the number of data points $n$, and challenging approximate inference in non-conjugate (non-Gaussian) models. In recent years, variational inference has become the go-to solution to overcome these problems, where sparse variational GP methods [35] tackle both the non-conjugacy and the high computational cost. The computation is reduced to $\mathcal{O}(nm^2)$ by using a small number $m \ll n$ of inducing-input locations. For large problems, the SVGP framework is a popular choice as it enables fast stochastic training and reduces the cost to $\mathcal{O}(m^3 + n_b m^2)$ per step, where $n_b$ is the batch size [10, 11].

It is common practice in SVGP to use the standard mean-covariance parameterization, which requires $\mathcal{O}(m^2)$ memory. Inference is carried out by optimizing an objective $\mathcal{L}(q)$ over a Gaussian distribution $q$ with parameters $\xi = (m, L)$, where $m$ is the mean and $L$ is the Cholesky factor of the covariance matrix. We refer to SVGP methods using this parameterization as q-SVGP methods. The advantage of this formulation is that, for log-concave likelihoods, the objective is convex and gradient-based optimization works well [2]. The optimization can be further improved by natural gradient descent (NGD), which has been shown to be less sensitive to learning rates [33]. The NGD algorithm with the q-SVGP parameterization is currently the state of the art and is available in existing software implementations such as GPflow [26] and GPyTorch [9].

An alternative to the q-SVGP parameterization is one where every likelihood term is assigned two sets of parameters, requiring $\mathcal{O}(n)$ memory. The existence of such parameterizations was initially shown by Csató and Opper [8] for general GP models, and later extended to variational objectives for GPs [28, 29], and to latent Gaussian models via Lagrangian duality [20, 16]. Due to this latter connection, we refer to this parameterization as the dual parameterization: the parameters are Lagrange multipliers ensuring that the marginal mean and variance of each latent function value are consistent with the marginals obtained by the full GP [16]. Expectation propagation (EP, [27]) also naturally employs such a parameterization through its site parameters. Although unrelated to duality, EP-style methods are popular for GP inference, and due to this connection we refer to methods using the dual parameterization as t-SVGP (the letter t refers to the sites). To the best of our knowledge, the dual parameterization for SVGP has so far only been used to speed up computation and inference for the specific case of Markovian GPs [4, 38].

The main contribution of this work is to introduce the dual parameterization for SVGP and to show that it speeds up both learning and inference. For inference, we show that the dual parameters are automatically obtained through a different formulation of NGD, written in terms of the expectation parameters [17, 15, 18]. The formulation is fast since it avoids the use of sluggish automatic differentiation to compute the natural gradients. We also match the typical $\mathcal{O}(m^2)$ memory complexity of other SVGP methods by introducing a tied parameterization. For learning, we show that the dual parameterization results in a tighter lower bound on the marginal likelihood, which speeds up hyperparameter optimization (see Fig. 1). We provide an extensive evaluation on benchmark data sets, which confirms our findings. Our work attempts to revive the dual parameterization, which was popular in the early 2000s but was somehow forgotten and not used in recent SVGP algorithms.

(Both first authors contributed equally. 35th Conference on Neural Information Processing Systems (NeurIPS 2021).)

[Figure 1: left panel, ELBO $\mathcal{L}_\xi(\xi^*(\theta_{\mathrm{old}}), \theta)$ versus kernel magnitude for t-SVGP (ours), q-SVGP, and q-SVGP (whitened), with the ELBO values and gradients matching at $\theta_{\mathrm{old}}$; right panel, optimization traces over $\theta_1$ (length-scale) and $\theta_2$ (magnitude).]
Figure 1: The dual parameterization gives a tighter bound compared to standard SVGP (whitened/unwhitened), as shown on the left for a GP classification task and varying kernel magnitude $\theta$. The tighter bound helps take longer steps, speeding up convergence of the hyperparameters, as shown on the right for the banana classification task with coordinate ascent w.r.t. $\theta$ and the variational parameters $\xi$.
2 Background: Variational Inference for Gaussian Process Models

Gaussian processes (GPs, [31]) are distributions over functions, commonly used in machine learning to endow latent functions in generative models with rich and interpretable priors. These priors can provide strong inductive biases for regression tasks in the small-data regime. GP-based models employ a GP prior over the (latent) functions, $f(\cdot) \sim \mathcal{GP}(\mu(\cdot), \kappa(\cdot, \cdot))$, where the prior is completely characterized by the mean function $\mu(\cdot)$ and the covariance function $\kappa(\cdot, \cdot)$. We denote $f(\cdot)$ as a function but occasionally simplify the notation to just $f$ to reduce clutter.

Given a data set $\mathcal{D} = (X, y) = \{(x_i, y_i)\}_{i=1}^n$ of input–output pairs, we denote by $f$ the vector of function evaluations $\{f(x_i)\}_{i=1}^n$ at the inputs. The function evaluations at the $x_i$ are passed through likelihood functions to model the outputs $y \in \mathbb{R}$, e.g., by specifying $p(y \mid f) := \prod_{i=1}^n p(y_i \mid f_i)$. Prediction at a new test input $x_*$ is obtained by computing the distribution $p(f(x_*) \mid \mathcal{D}, x_*)$. For Gaussian likelihoods $\mathcal{N}(y_i \mid f_i, \sigma^2)$, the predictive distribution is available in closed form as a Gaussian $\mathcal{N}(f(x_*) \mid m_{\mathrm{GPR}}(x_*), v_{\mathrm{GPR}}(x_*))$ with mean and variance defined as

  $m_{\mathrm{GPR}}(x_*) := k_{*f} (K_{ff} + \sigma^2 I_n)^{-1} y$, and $v_{\mathrm{GPR}}(x_*) := \kappa_{**} - k_{*f} (K_{ff} + \sigma^2 I_n)^{-1} k_{f*}$,  (1)

where $k_{*f}$ is the vector with $\kappa(x_*, x_i)$ as its $i$th element for all $x_i \in X$, $K_{ff}$ is the $n \times n$ matrix with $\kappa(x_i, x_j)$ as its $ij$th entry, and $\kappa_{**} = \kappa(x_*, x_*)$.
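To make Eq. (1) concrete, the following is a minimal NumPy sketch of the conjugate GP regression predictive equations. It is an illustration under stated assumptions rather than the paper's implementation: the RBF kernel, its hyperparameters, and all names (`rbf_kernel`, `gpr_predict`) are ours.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, magnitude=1.0):
    # kappa(x, x') = magnitude^2 * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return magnitude**2 * np.exp(-0.5 * sq_dists / lengthscale**2)

def gpr_predict(X, y, X_star, noise_var=0.1):
    """Eq. (1): closed-form GP regression predictive mean and variance."""
    Kff = rbf_kernel(X, X)                     # n x n prior covariance
    Ksf = rbf_kernel(X_star, X)                # rows are k_{*f} per test point
    kss = np.diag(rbf_kernel(X_star, X_star))  # kappa(x_*, x_*)
    # Factor (Kff + sigma^2 I) once via Cholesky for numerical stability.
    L = np.linalg.cholesky(Kff + noise_var * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (Kff + s2 I)^{-1} y
    V = np.linalg.solve(L, Ksf.T)                        # L^{-1} k_{f*}
    mean = Ksf @ alpha                         # m_GPR(x_*)
    var = kss - (V**2).sum(0)                  # v_GPR(x_*)
    return mean, var

# Toy usage on synthetic 1-D data
X = np.linspace(0, 1, 20)[:, None]
y = np.sin(6 * X[:, 0]) + 0.1 * np.random.randn(20)
mu, v = gpr_predict(X, y, np.array([[0.5]]))
```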
2.1 Variational Expectation Maximization for GPs with Non-Conjugate Likelihoods

For non-Gaussian likelihoods, the posterior and predictive distributions are no longer Gaussian, and we need to resort to approximate inference methods. Variational inference is a popular choice because it allows for fast posterior approximation and hyperparameter learning via stochastic training [35, 10]. Denoting the kernel hyperparameters by $\theta$ and the corresponding GP prior by $p_\theta(f)$, the posterior distribution can be written as $p_\theta(f \mid y) = p_\theta(f)\, p(y \mid f) / p_\theta(y)$, where $p_\theta(y) = \int p_\theta(f, y)\, df$ is the marginal likelihood of the observations. We seek to approximate $p_\theta(f \mid y) \approx q_f(f)$ by a Gaussian distribution whose parameters can be obtained by optimizing the following evidence lower bound (ELBO) on the log-marginal likelihood:

  $\log p_\theta(y) \ge \mathcal{L}_q(q_f, \theta) = \sum_{i=1}^n \mathbb{E}_{q_f(f_i)}[\log p(y_i \mid f_i)] - D_{\mathrm{KL}}[q_f(f) \,\|\, p_\theta(f)]$.  (2)

For the variational approximation, it is standard practice to choose the mean-covariance parameterization, denoted by $\xi = (m, S)$. It is also common to use the Cholesky factor $L$ instead of $S$ [3], since it is uniquely determined for a covariance matrix. The multivariate normal distribution is a member of the exponential family [37], i.e., its probability density function takes the form $p(x) = \exp(\langle \eta, T(x) \rangle - a(\eta))$, with natural parameters $\eta = (S^{-1} m, -\tfrac{1}{2} S^{-1})$ and sufficient statistics $T(x) = (x, x x^\top)$. This natural parameterization is also a common choice, along with the associated expectation parameterization $\mu = \mathbb{E}_q[T(x)] = (m, S + m m^\top)$. A final choice, the whitened parameterization [36], uses a variable $v \sim \mathcal{N}(v; m_v, S_v)$ together with the transformation $f = L v$, where $L$ is the Cholesky factor of the prior covariance, so that $m = L m_v$ and $S = L S_v L^\top$. In the following, we consider the ELBO under several parameterizations and, to make the notation clearer, indicate the parameterization used by a subscript on $\mathcal{L}$. For example, $\mathcal{L}_\xi(\xi, \theta) = \mathcal{L}_\eta(\eta, \theta) = \mathcal{L}_\mu(\mu, \theta)$, which are all equal due to the unique mappings between the parameterizations; see [25] for more details on these maps.

The ELBO can be optimized using a variational expectation maximization (VEM) procedure, where we alternate between optimizing the variational parameters, say $\xi$, and the hyperparameters $\theta$:

  E-step: $\xi^*_t \leftarrow \arg\max_\xi \mathcal{L}_\xi(\xi, \theta_t)$,  M-step: $\theta_{t+1} \leftarrow \arg\max_\theta \mathcal{L}_\xi(\xi^*_t, \theta)$,  (3)

where $t$ indexes the iterations, and we explicitly write the optimal parameter $\xi^*(\theta_t)$ as a function of the old hyperparameter $\theta_t$. Both the E- and M-steps can be carried out with gradient descent, for example using an iteration of the form $\xi^{(k+1)}_t \leftarrow \xi^{(k)}_t + \rho_k \nabla_\xi \mathcal{L}_\xi(\xi^{(k)}_t, \theta_t)$ for the E-step, which ultimately converges to $\xi^*_t$. A similar iterative method can be used for the M-step.

2.2 Inference via Natural-Gradient Descent (NGD)

A popular strategy for the E-step is natural gradient descent, where the gradient is replaced by the gradient preconditioned with the Fisher information matrix $F(\xi)$ of $q_f(f)$. Denoting natural gradients by $\widetilde{\nabla}_\xi \mathcal{L}_\xi(\xi, \theta) = F(\xi)^{-1} \nabla_\xi \mathcal{L}_\xi(\xi, \theta)$, we get the update

  $\xi^{(k+1)}_t \leftarrow \xi^{(k)}_t + \rho_k \widetilde{\nabla}_\xi \mathcal{L}_\xi(\xi^{(k)}_t, \theta_t)$.  (4)

Such updates can converge faster than gradient descent [33, 17, 18] and are at times less sensitive to the choice of the learning rate $\rho_k$, due to the scaling by the Fisher information matrix. The implementation simplifies greatly when using the natural parameterization,

  $\eta^{(k+1)}_t \leftarrow \eta^{(k)}_t + \rho_k \nabla_\mu \mathcal{L}_\mu(\mu^{(k)}_t, \theta_t)$,  (5)

because $\widetilde{\nabla}_\eta \mathcal{L}_\eta(\eta, \theta) = \nabla_\mu \mathcal{L}_\mu(\mu, \theta)$; that is, the natural gradients with respect to $\eta$ are in fact the (ordinary) gradients with respect to the expectation parameter $\mu$ [18].
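To make Eq. (5) concrete, here is a minimal NumPy sketch of one NGD step in natural-parameter space. This is a sketch under stated assumptions, not the paper's implementation: we assume gradients `g_m`, `g_S` of the expected log-likelihood w.r.t. the mean and covariance are available, map them to expectation-parameter gradients via the chain rule, and use the identity $\nabla_\mu D_{\mathrm{KL}} = \eta - \eta_0$ (cf. [19, Sec. 2.2]); all function names are hypothetical.

```python
import numpy as np

def mean_cov_to_natural(m, S):
    P = np.linalg.inv(S)              # precision
    return P @ m, -0.5 * P            # eta_1, eta_2

def natural_to_mean_cov(eta1, eta2):
    S = np.linalg.inv(-2.0 * eta2)
    return S @ eta1, S

def ngd_step(eta1, eta2, eta0_1, eta0_2, g_m, g_S, rho):
    """One step of Eq. (5): eta <- eta + rho * grad_mu(ELBO).

    g_m, g_S: gradients of the expected log-likelihood w.r.t. mean and
    covariance (g_S symmetric); grad_mu(KL) = eta - eta0.
    """
    m, S = natural_to_mean_cov(eta1, eta2)
    # Chain rule to expectation parameters mu = (m, S + m m^T):
    g_mu1 = g_m - 2.0 * g_S @ m
    g_mu2 = g_S
    new_eta1 = eta1 + rho * (g_mu1 - (eta1 - eta0_1))
    new_eta2 = eta2 + rho * (g_mu2 - (eta2 - eta0_2))
    return new_eta1, new_eta2

# 1-D toy usage: standard normal prior, one update step
eta0_1, eta0_2 = mean_cov_to_natural(np.zeros(1), np.eye(1))
eta1, eta2 = ngd_step(eta0_1, eta0_2, eta0_1, eta0_2,
                      g_m=np.array([0.5]), g_S=np.array([[-0.1]]), rho=0.8)
```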
In the remainder of the paper we frequently use this property, and refer to the natural gradient with respect to $\eta$ via the gradient with respect to the expectation parameterization $\mu$. The VEM procedure with NGD has recently become a popular choice for sparse variants of GPs, which we explain next.

2.3 Sparse Variational GP Methods and Their Challenges

Inference in GP models, whether conjugate or non-conjugate, suffers from the $\mathcal{O}(n^3)$ computational bottleneck of inverting the posterior covariance matrix. A common approach to reduce the computational complexity is to use a sparse approximation relying on a small number $m \ll n$ of representative inputs, also called inducing inputs, denoted by $Z := (z_1, z_2, \ldots, z_m)$ [34, 7, 30, 39]. Sparse variational GP methods [35, 10, 11, 5, 6, 32] rely on a Gaussian approximation $q_u(u)$ over the function values $u = (f(z_1), f(z_2), \ldots, f(z_m))$ to approximate the posterior at arbitrary locations,

  $q_{u,\theta}(f(\cdot)) = \int p_\theta(f(\cdot) \mid u)\, q_u(u)\, du$,  (6)

where $p_\theta(f(\cdot) \mid u)$ is a conditional of the GP prior. For example, for a Gaussian $q_u(u) = \mathcal{N}(u; m_u, S_u)$, the posterior marginal of $f_i = f(x_i)$ takes the form

  $q_{u,\theta}(f_i) = \mathcal{N}\big(f_i \mid a_i^\top m_u,\; \kappa_{ii} - a_i^\top (K_{uu} - S_u)\, a_i\big)$,  (7)

where $a_i = K_{uu}^{-1} k_{ui}$, with $K_{uu}$ the prior covariance evaluated at $Z$, and $k_{ui}$ the $m$-length vector with entries $\kappa(z_j, x_i)$ for all $j$ (a code sketch of this marginal follows at the end of this subsection). The parameters $\xi_u = (m_u, S_u)$ can be learned via an ELBO similar to Eq. (2):

  $\mathcal{L}_\xi(\xi_u, \theta) := \sum_{i=1}^n \mathbb{E}_{q_{u,\theta}(f_i)}[\log p(y_i \mid f_i)] - D_{\mathrm{KL}}[q_u(u) \,\|\, p_\theta(u)]$.  (8)

The variational objective for such sparse GP posteriors can be evaluated at a cost of $\mathcal{O}(nm^2 + m^3)$, and optimization can be performed at $\mathcal{O}(m^3 + n_b m^2)$ per iteration via stochastic natural-gradient methods with mini-batch size $n_b$ [10]. This formulation also works for general likelihood functions [11]. This, together with the low computational complexity, has led to a wide adoption of the SVGP algorithm. It is currently the state of the art for sparse variants of GPs and is available in existing software implementations such as GPflow [26] and GPyTorch [9].

Despite their popularity, the current implementations are cumbersome and there is plenty of room for improvement. For example, the methods discussed in Salimbeni et al. [33] (see App. D for a summary) rely on the mean-covariance parameterization, and NGD is performed in $\eta$-space. However, the natural gradients are implemented via the chain rule, $(\nabla_\mu \xi)^\top \nabla_\xi \mathcal{L}_\xi(\xi, \theta)$, utilizing the gradients in $\xi$-space, which requires the computation of additional Jacobians and extra matrix multiplications. Since other operations are done in the mean-covariance parameterization, one needs to go back and forth between $\eta$ and $\xi$, which further increases the cost. In addition, many existing implementations compute the natural gradient of the whole ELBO, including the KL term, which is not required; see Khan and Rue [19, Sec. 2.2]. Finally, the M-step depends on the choice of parameterization used in the E-step, which can affect the convergence speed. To the best of our knowledge, this has not been investigated in the literature.

In what follows, we argue for a dual parameterization instead of the usual mean-covariance parameterization, and show that this not only simplifies the computation of natural gradients, but also gives rise to a tighter bound for hyperparameter learning and speeds up the whole VEM procedure.
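As a concrete illustration of Eq. (7), here is a minimal NumPy sketch computing the SVGP posterior marginals at a batch of inputs. It is a sketch under stated assumptions, not the paper's code: `kernel` is any covariance function with the same calling convention as the `rbf_kernel` from the earlier sketch, and all names are ours.

```python
import numpy as np

def svgp_marginals(X, Z, m_u, S_u, kernel):
    """Eq. (7): marginal mean/variance of q(f_i) at inputs X,
    given q(u) = N(m_u, S_u) at inducing inputs Z."""
    Kuu = kernel(Z, Z)                 # m x m
    Kuf = kernel(Z, X)                 # m x n, columns are k_{ui}
    A = np.linalg.solve(Kuu, Kuf)      # a_i = Kuu^{-1} k_{ui}, column-wise
    mean = A.T @ m_u                   # a_i^T m_u for all i
    kii = np.diag(kernel(X, X))
    # kappa_ii - a_i^T (Kuu - S_u) a_i, for all i at once
    var = kii - np.einsum('ui,uv,vi->i', A, Kuu - S_u, A)
    return mean, var
```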
3 The t-VGP Method: Dual-Parameter-Based Learning for GPs

We start with a property of the optimal $q^*_f$ of Eq. (2). Khan and Nielsen [18, Eq. (18)] show that $q^*_f$ can be parameterized by 2-D vectors $\lambda^*_i = (\lambda^*_{1,i}, \lambda^*_{2,i})$ used in site functions $t^*_i(f_i)$:

  $q^*_f(f) \propto p_\theta(f) \prod_{i=1}^n \underbrace{e^{\langle \lambda^*_i, T(f_i) \rangle}}_{t^*_i(f_i)}$, where $\lambda^*_i = \nabla_{\mu_i} \mathbb{E}_{q^*_f(f_i)}[\log p(y_i \mid f_i)]$.  (9)

The vectors $\lambda^*_i$ are equal to the natural gradient of the expected log-likelihood, where the expectation is taken with respect to the posterior marginal $q^*_f(f_i) = \mathcal{N}(f_i; m^*_i, S^*_{ii})$, with $S^*_{ii}$ the $i$th diagonal element of $S^*$, and the gradient is taken with respect to its expectation parameter $\mu_i = (m_i, m_i^2 + S_{ii})$ and evaluated at $\mu^*_i$. The vector $T(f_i) = (f_i, f_i^2)$ contains the sufficient statistics of a Gaussian. The optimal $q^*_f$ uses local unnormalized Gaussian sites $t^*_i(f_i)$, similar to those used in the expectation propagation (EP) algorithm [27]. The difference here is that the site parameters $\lambda^*_i$ are equal to natural gradients of the expected log-likelihood, which are easy to compute using the gradient with respect to $\mu_i$.

The parameters $\lambda^*_i$ can be seen as the optimal dual parameters of a Lagrangian function with moment-matching constraints. We can show this in two steps:

1. For each $p(y_i \mid f_i)$, we introduce a local Gaussian $\tilde{q}_i(f_i; \tilde{\mu}_i)$ with expectation parameters $\tilde{\mu}_i$.
2. We then aim to match $\tilde{\mu}_i$ with the marginal moments $\mu_i$ of the global Gaussian $q_f(f; \mu)$.

This is written below as a Lagrangian, whose middle term decouples the terms involving the local Gaussians from those involving the global Gaussian:

  $\mathcal{L}_{\mathrm{Lagrange}}(\mu, \tilde{\mu}, \lambda) = \sum_{i=1}^n \mathbb{E}_{\tilde{q}_i(f_i; \tilde{\mu}_i)}[\log p(y_i \mid f_i)] - \sum_{i=1}^n \langle \lambda_i, \tilde{\mu}_i - \mu_i \rangle - D_{\mathrm{KL}}[q_f(f; \mu) \,\|\, p_\theta(f)]$.  (10)

The parameter $\lambda_i$ is the Lagrange multiplier of the moment-matching constraint $\tilde{\mu}_i = \mu_i$ (here, $\lambda$ and $\tilde{\mu}$ denote the sets of all $\lambda_i$ and $\tilde{\mu}_i$). The optimal $\lambda_i$ equals the $\lambda^*_i$ shown in Eq. (9). To see this, first set the derivative with respect to $\lambda_i$ to 0, which shows that the constraints $\tilde{\mu}^*_i = \mu^*_i$ are satisfied. Using this, and setting the derivative with respect to $\tilde{\mu}_i$ to 0, we see that the optimal $\lambda^*_i$ is indeed equal to the natural gradient, as in Eq. (9).

The optimal natural parameter of $q^*_f(f)$, denoted by $\eta^*$, has an additive structure that, as we will show, can be exploited to speed up learning. The structure follows by setting the derivative w.r.t. $\mu$ to 0:

  $\nabla_\mu D_{\mathrm{KL}}[q_f(f; \mu^*) \,\|\, p_\theta(f)] = \lambda^* \;\implies\; \eta^* = \eta_0(\theta) + \lambda^*$.  (11)

The second equality is obtained using the result that $\nabla_\mu D_{\mathrm{KL}}[q(f; \mu) \,\|\, p_\theta(f)] = \eta - \eta_0(\theta)$, where $\eta$ and $\eta_0(\theta)$ are the natural parameters of $q(f; \mu)$ and $p_\theta(f)$, respectively [19, Sec. 2.2]. The final result follows by noting that the right-hand side is the natural parameter of $q^*_f(f)$ from Eq. (9). This implies that the global $q_f(f; \mu^*) = q^*_f(f)$ is also equal to the optimal approximation, as desired.

The Lagrangian formulation is closely related to the maximum-entropy principle [13], which forms the foundation of Bayesian inference [14]. Through moment matching, the prior is modified to obtain posterior approximations that explain the data well. Since the $\lambda^*_i$ are optimal Lagrange multipliers, they measure the sensitivity of the optimal $q^*_f(f)$ to perturbations of the constraints, and reveal the relative importance of data examples. Therefore, the structure of the solution $\eta^*$ shown in Eq. (11) is useful for estimating $\theta$: the additive structure measures the relative importance of the prior against the dual parameters $\lambda^*_i$. Our main idea is to use this structure to speed up learning for SVGPs. Eq. (11) can be rewritten in terms of the mean-covariance parameterization to gain further insight into the structure.
To do so, we use Bonnet's and Price's theorems and rearrange to get the following (see Eqs. (10) and (11) in [19] for a similar derivation):

  $m^* = K_{ff}\, \alpha^*$, where $\alpha^*$ is the vector with entries $\alpha^*_i = \mathbb{E}_{q^*_f(f_i)}[\nabla_f \log p(y_i \mid f_i)]$,  (12)
  $(S^*)^{-1} = K_{ff}^{-1} + \mathrm{diag}(\beta^*)$, where $\beta^*$ is the vector with entries $\beta^*_i = -\mathbb{E}_{q^*_f(f_i)}[\nabla^2_{ff} \log p(y_i \mid f_i)]$.  (13)

The variables $\alpha^*_i$ and $\beta^*_i$ are easily obtained from the gradient and Hessian of the log-likelihood, and from them we can recover $\lambda^*_i = (\beta^*_i m^*_i + \alpha^*_i,\; -\tfrac{1}{2}\beta^*_i)$; a code sketch of this computation is given at the end of this subsection.

Several other works have discussed such parameterizations, although ours is the first to connect them to natural gradients as optimal Lagrange multipliers. The representer theorem of Kimeldorf and Wahba [22] is perhaps the most general result, but Csató and Opper [8] were the first to derive such a parameterization for GPs; see Lemma 1 in their paper. Their result is for exact posteriors, which are intractable, while ours is for Gaussian approximations and is easy to compute. A minor difference is that their parameterization uses integrals of likelihoods (instead of log-likelihoods) with respect to the GP prior (instead of the posterior), but we can also express them in the form of Eq. (9), with $q^*_f(f)$ replaced by the true posterior $p_\theta(f \mid y)$. Parameterizations of the variational posterior similar to ours are discussed in [28, 29], but the one by Khan et al. [20] is the most similar. They establish the first connection to duality for cases where the ELBO is convex with respect to the mean-covariance parameterization. Khan [16] extends this to nonconvex ELBOs using a Lagrangian similar to ours, but written in the mean-covariance parameterization, yielding the solutions shown in Eqs. (12) and (13). As shown above, their $(\alpha, \beta)$ parameterization is just a reparameterization of our $\lambda$ parameterization. Here, we argue in favor of our formulation, which enables the reformulation in terms of site functions in Eq. (9) and allows us to exploit the additive structure in Eq. (11) to speed up hyperparameter learning. The mean-covariance parameterization has neither of these features.
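As a concrete illustration of the $(\alpha, \beta)$ construction in Eqs. (12)–(13) and the site parameters built from it, here is a minimal NumPy sketch computing $(\alpha_i, \beta_i)$ and $\lambda_i$ for one marginal via Gauss–Hermite quadrature. The Bernoulli-logistic example and all names are illustrative assumptions, not code from the paper.

```python
import numpy as np

GH_X, GH_W = np.polynomial.hermite.hermgauss(30)  # Gauss-Hermite nodes/weights

def dual_params(m_i, v_i, d1_loglik, d2_loglik):
    """Compute alpha_i, beta_i (Eqs. 12-13) and the site parameters
    lambda_i = (beta_i * m_i + alpha_i, -beta_i / 2) for one marginal
    q(f_i) = N(m_i, v_i), via Gauss-Hermite quadrature.

    d1_loglik, d2_loglik: first/second derivative of log p(y_i | f)
    as functions of f (y_i already bound).
    """
    f = m_i + np.sqrt(2.0 * v_i) * GH_X       # quadrature nodes
    w = GH_W / np.sqrt(np.pi)                 # normalized weights
    alpha = np.sum(w * d1_loglik(f))          # E[ d log p / df ]
    beta = -np.sum(w * d2_loglik(f))          # -E[ d^2 log p / df^2 ]
    return alpha, beta, (beta * m_i + alpha, -0.5 * beta)

# Example: Bernoulli likelihood with logistic link, y in {0, 1}
y = 1.0
sig = lambda f: 1.0 / (1.0 + np.exp(-f))
alpha, beta, lam = dual_params(
    m_i=0.3, v_i=0.5,
    d1_loglik=lambda f: y - sig(f),               # d/df log p(y|f)
    d2_loglik=lambda f: -sig(f) * (1 - sig(f)),   # d^2/df^2 log p(y|f)
)
```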
3.1 Improved Objective for Hyperparameter Learning

We now discuss a method to speed up VEM using the dual parameterization. The key idea is to exploit the form given in Eq. (11) to propose a better objective for the M-step. Standard VEM procedures, such as the one in Eq. (3), iterate pairs of E- and M-steps, which we describe here in the context of the dual parameterization. In the E-step, starting from a hyperparameter $\theta_t$, the optimal variational distribution $q^*_f(f)$ maximizing the ELBO in Eq. (2) is computed. In the dual parameterization, we get the optimal variational parameters $\eta^*_t = \eta_0(\theta_t) + \lambda^*_t$. Here, the subscript $t$ in $\eta^*_t$ and $\lambda^*_t$ indicates the dependence of the E-step iterations on $\theta_t$, while $\eta_0(\theta_t)$ indicates the direct dependence of the prior natural parameter on $\theta_t$. The standard M-step would then use $\eta^*_t$ in the ELBO as in Eq. (3), whereas we propose an alternative procedure in which the prior term $\eta_0(\theta)$ is left free:

  Standard M-step: $\theta_{t+1} = \arg\max_\theta \mathcal{L}_\eta(\eta_0(\theta_t) + \lambda^*_t, \theta)$,  (14)
  Proposed M-step: $\theta_{t+1} = \arg\max_\theta \mathcal{L}_\eta(\eta_0(\theta) + \lambda^*_t, \theta)$.  (15)

The proposed objective is still a lower bound on the marginal likelihood: it corresponds to the ELBO in Eq. (2) with a distribution whose natural parameter is $\hat{\eta}_t(\theta) = \eta_0(\theta) + \lambda^*_t$, and thus depends on $\theta$. We denote this distribution by $q_f(f; \hat{\eta}_t(\theta))$. This ELBO differs from the one using the distribution obtained after the E-step, whose natural parameter $\eta^*_t$ is independent of $\theta$; we denote that distribution by $q_f(f; \eta^*_t)$. Clearly, at $\theta = \theta_t$ both objectives match, and so do their gradients with respect to $\theta$, but they generally differ otherwise. We argue that the proposed M-step can lead to a tighter lower bound; see Fig. 1 for an illustration. In the standard M-step, the bound depends on $\theta$ only through the KL divergence in Eq. (2). In the proposed M-step, this dependence is more intricate, because the expected log-likelihood also depends on $\theta$. Yet, as we show now, it remains simple to implement. The lower bound in the proposed M-step takes a form in which an existing implementation of GP regression can be reused:

  $\mathcal{L}_\eta(\eta_0(\theta) + \lambda^*_t, \theta) = \mathbb{E}_{q_f(f; \hat{\eta}_t(\theta))}\left[\log \frac{\prod_{i=1}^n p(y_i \mid f_i)\, p_\theta(f)}{\frac{1}{Z_t(\theta)} \prod_{i=1}^n t^*_i(f_i)\, p_\theta(f)}\right] = \log Z_t(\theta) + c(\theta)$,  (16)

where $c(\theta) = \sum_{i=1}^n \mathbb{E}_{q_f(f; \hat{\eta}_t(\theta))}\big[\log \frac{p(y_i \mid f_i)}{t^*_i(f_i)}\big]$ and $\log Z_t(\theta)$ is the log-partition function of $q_f(f; \hat{\eta}_t(\theta))$:

  $\log Z_t(\theta) = -\tfrac{n}{2} \log(2\pi) - \tfrac{1}{2} \log\big|\mathrm{diag}(\beta^*_t)^{-1} + K_{ff}(\theta)\big| - \tfrac{1}{2}\, \tilde{y}^\top \big(\mathrm{diag}(\beta^*_t)^{-1} + K_{ff}(\theta)\big)^{-1} \tilde{y}$.  (17)

Here, $\tilde{y}$ is the vector of pseudo-targets $\tilde{y}_i = -\tfrac{1}{2} \lambda^*_{1,i} / \lambda^*_{2,i}$, and we write $K_{ff}(\theta)$ explicitly to show its direct dependence on the hyperparameter $\theta$. The gradients of $\log Z_t(\theta)$ can be obtained using GP regression code (a sketch follows at the end of this subsection), while the gradient of $c(\theta)$ can be obtained using standard Monte Carlo methods. A similar lower bound was originally used in the implementation provided by Khan and Lin [17] (see https://github.com/emtiyaz/cvi/blob/master/gp/infKL_cvi.m), but they did not use it for hyperparameter learning.

For GP regression, we recover the exact log-marginal likelihood $\log p_\theta(y)$ for all values of $\theta$. Indeed, $\lambda^*_i = (y_i / \sigma^2,\, -1/(2\sigma^2))$, which means that the sites exactly match the likelihood terms, so $c(\theta) = 0$. This also gives $\tilde{y}_i = y_i$ and $\beta_i = 1/\sigma^2$, and we get $\log Z_t(\theta) = \log p_\theta(y)$. For non-conjugate problems, we found the proposed objective to be a tighter bound than the standard ELBO (Eq. (14)), which can speed up the procedure. This is illustrated in Fig. 2 (top row), where the proposed ELBO is compared to the two other parameterizations (mean-covariance and whitened) for many values of $\theta_t = \theta_{\mathrm{old}}$. We see that the maximizer (shown with a dot) remains rather stable for the proposed method compared to the other two. This is as expected from Eq. (15): the solutions become less sensitive to $\theta_t$ because we have replaced $\eta_0(\theta_t)$ by $\eta_0(\theta)$. The bottom row of Fig. 2 shows the iterative steps $(\theta_{t+1}, \theta_t)$ for a few iterations; due to the stable solutions of the new ELBO, the iterations quickly converge to the optimum. The exact theoretical reasons behind the speed-ups are currently unknown to us. We believe that the conditioning of the ELBO is improved under the new parameterization, and we provide conditions in App. A under which the new ELBO is provably tighter.
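As an illustration of how Eq. (17) reduces to a GP-regression marginal likelihood, here is a minimal NumPy sketch. It assumes the $\lambda_2 = -\beta/2$ convention used above (so $\beta > 0$ for log-concave likelihoods) and hypothetical names; in practice, one would build $K_{ff}(\theta)$ inside an autodiff framework to get gradients w.r.t. $\theta$, as the paper suggests by reusing GP regression code.

```python
import numpy as np

def log_Z_t(Kff, lam1, lam2):
    """Eq. (17): log-partition of the prior times the Gaussian sites,
    i.e., a GP-regression marginal likelihood with pseudo-targets
    y_tilde = -lam1 / (2 * lam2) and noise diag(beta)^{-1}, beta = -2*lam2."""
    beta = -2.0 * lam2                    # positive for log-concave likelihoods
    y_tilde = lam1 / beta                 # = -lam1 / (2 * lam2)
    C = Kff + np.diag(1.0 / beta)         # diag(beta)^{-1} + Kff(theta)
    L = np.linalg.cholesky(C)
    v = np.linalg.solve(L, y_tilde)
    n = len(lam1)
    return (-0.5 * n * np.log(2 * np.pi)
            - np.log(np.diag(L)).sum()    # -1/2 log|C| via Cholesky
            - 0.5 * v @ v)                # -1/2 y~^T C^{-1} y~
```

As a sanity check, for a Gaussian likelihood ($\lambda_1 = y/\sigma^2$, $\lambda_2 = -1/(2\sigma^2)$) this returns exactly the usual GP regression log-marginal likelihood, matching the statement in the text.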
3.2 Faster Natural Gradients for Inference Using the Dual Parameterization

So far, we have assumed that both the E- and M-steps are run until convergence, but it is more practical to use a stochastic procedure with partial E- and M-steps, for example as in [10, 12]. Fortunately, with natural-gradient descent, we can ensure that the iterates also follow the same structure as the solution in Eqs. (9) and (11). Specifically, we use the method of Khan and Lin [17], expressed in terms of the dual parameters $\lambda$ and the natural gradients of the expected log-likelihoods $g^{(k)}_i = \nabla_{\mu_i} \mathbb{E}_{q^{(k)}_f(f_i)}[\log p(y_i \mid f_i)]$ (see also [19, Sec. 5.4]):

  $q^{(k+1)}_f(f) \propto p_\theta(f) \prod_{i=1}^n \underbrace{e^{\langle \lambda^{(k+1)}_i, T(f_i) \rangle}}_{t^{(k+1)}_i(f_i)}$, where $\lambda^{(k+1)}_i = (1 - \rho_k)\, \lambda^{(k)}_i + \rho_k\, g^{(k)}_i$.  (18)

The convergence of these iterations is guaranteed under mild conditions discussed in [21]. The natural parameter of $q^{(k)}_f(f)$ at iteration $k$ can be written in terms of $\theta$ as

  $\eta^{(k)} = \eta_0(\theta) + \lambda^{(k)}$,  (19)

and the expectation parameters $\mu^{(k)}_i$, required to compute the natural gradients of the expected log-likelihood, can be obtained via the map from the natural parameter $\eta^{(k)}$. The updates hold for any $\theta$ and can conveniently be used with $\theta = \theta_t$ in the E-step of the $t$th EM iteration. We name t-VGP the EM-like algorithm with 1) an E-step consisting of the natural-gradient updates of Eq. (18), and 2) the proposed M-step introduced in Eq. (15).

4 The t-SVGP Method: Dual-Parameter-Based Inference for SVGP

We now extend the dual-parameterization-based stochastic VEM procedure to the SVGP case, and refer to the resulting algorithm as t-SVGP. The optimality property shown in Eq. (9) is shared by the ELBO given in Eq. (8). That is, we can express the optimal $q^*_u(u)$ in terms of $n$ 2-D parameters $\lambda^*_i$:

  $q^*_u(u) \propto p_\theta(u) \prod_{i=1}^n \underbrace{e^{\langle \lambda^*_i, T(a_i^\top u) \rangle}}_{t^*_i(u)}$, where $\lambda^*_i = \nabla_{\mu_{u,i}} \mathbb{E}_{q^*_u(f_i)}[\log p(y_i \mid f_i)]$.  (20)

The difference here is that the site parameters use the sufficient statistics $T(a_i^\top u)$, defined via the projections $a_i = K_{uu}^{-1} k_{ui}$. The natural gradients of the expected log-likelihood are computed using the marginal $q^*_u(f_i)$ defined via Eq. (6) with $\xi^*_u = (m^*_u, S^*_u)$, and evaluated at the expectation parameters $\mu^*_{u,i}$. Note that both $a_i$ and $q^*_u(f_i)$ depend on $\theta$, but we suppress this dependence for notational simplicity. As in the VGP case, the $\lambda^*_i$ are the optimal dual parameters that measure the sensitivity of the solution to perturbations in the moments of the posterior marginals $q^*_u(f_i)$. This suggests that we can design a similar VEM procedure exploiting the structure of the solution in Eq. (20). The structure is shown below in terms of the natural parameterization of $q^*_u(u)$ for sufficient statistics $T(u)$:

  $(S^*_u)^{-1} m^*_u = K_{uu}^{-1} \underbrace{\textstyle\sum_{i=1}^n k_{ui}\, \lambda^*_{1,i}}_{=: \bar{\lambda}_1}$ and $(S^*_u)^{-1} = K_{uu}^{-1} + K_{uu}^{-1} \underbrace{\textstyle\sum_{i=1}^n k_{ui}\, \beta^*_i\, k_{ui}^\top}_{=: \bar{\Lambda}_2}\, K_{uu}^{-1}$, with $\beta^*_i = -2\lambda^*_{2,i}$.  (21)

The quantities $K_{uu}$ and $k_{ui}$ depend directly on $\theta$, and we can express the ELBO as the partition function of a Gaussian distribution, similarly to Eq. (17) (exact expression in App. B). For large data sets, storing all of $\{\lambda^*_i\}_{i=1}^n$ might be problematic, and we can instead store only $\bar{\lambda}_1$, an $m$-length vector, and $\bar{\Lambda}_2$, an $m \times m$ matrix. This tied parameterization is motivated by the site-tying setting in sparse EP [1, 24], where the goal is to reduce storage. The parameterization ignores the dependence of $k_{ui}$ on $\theta$ and may reduce the coupling between $\theta$ and $q^*_u$, but it is suitable for large data sets. An alternative tying method consists in storing the sums together with the flanking $K_{uu}^{-1}$ terms.

We now detail the final stochastic variational procedure, which we refer to as t-SVGP. Given a parameter $\theta_t$, we run a few iterations of the E-step. At each iteration $k$ of the E-step, given $m^{(k)}_u$ and $S^{(k)}_u$, we sample a minibatch $\mathcal{M}$ and compute $\alpha^{(k)}_i = \mathbb{E}_{q^{(k)}_u(f_i)}[\nabla_f \log p(y_i \mid f_i)]$ and $\beta^{(k)}_i = -\mathbb{E}_{q^{(k)}_u(f_i)}[\nabla^2_{ff} \log p(y_i \mid f_i)]$, using the marginals $q^{(k)}_u(f_i)$ from Eq. (6). The natural gradient of the expected log-likelihood for the $i$th site is then given by $g^{(k)}_{1,i} = \beta^{(k)}_i m^{(k)}_i + \alpha^{(k)}_i$ and $g^{(k)}_{2,i} = \beta^{(k)}_i$ (the latter in the scaled convention of Eq. (21)). Using these natural gradients, we can run an iterative procedure similar to Eq. (18), but now on the tied parameters:

  $\bar{\lambda}^{(k+1)}_1 \leftarrow (1 - \rho_k)\, \bar{\lambda}^{(k)}_1 + \rho_k \sum_{i \in \mathcal{M}} k_{ui}\, g^{(k)}_{1,i}$,  (22)
  $\bar{\Lambda}^{(k+1)}_2 \leftarrow (1 - \rho_k)\, \bar{\Lambda}^{(k)}_2 + \rho_k \sum_{i \in \mathcal{M}} k_{ui} k_{ui}^\top\, g^{(k)}_{2,i}$.  (23)

The required natural parameters can then be obtained using Eq. (21):

  $S^{(k)}_u \leftarrow \big(K_{uu}^{-1} + K_{uu}^{-1} \bar{\Lambda}^{(k)}_2 K_{uu}^{-1}\big)^{-1}$ and $m^{(k)}_u \leftarrow S^{(k)}_u K_{uu}^{-1} \bar{\lambda}^{(k)}_1$.  (24)

After a few E-steps, we update the hyperparameters with a gradient-descent step using the gradient of the log-partition function of $q^{(k)}(u)$ with respect to $\theta$. In App. C, we detail how to efficiently make predictions and compute the ELBO under the parameterization of Eq. (21). The full algorithm is given in App. E; a schematic code sketch of the E-step follows below. The convergence of the sequence of stochastic updates in the E-step is guaranteed under the mild conditions discussed in [21] for the untied setting. Site-tying introduces a bias, but does not seem to affect convergence in practice.
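The following is a minimal NumPy sketch of the tied-parameter E-step of Eqs. (22)–(24). It is a sketch under stated assumptions, not the reference implementation: it assumes the scaled convention above (where $\bar{\Lambda}_2$ accumulates the positive curvatures $\beta_i$, so that Eq. (24) yields a valid covariance), omits any minibatch rescaling, and uses names of our own choosing.

```python
import numpy as np

def dual_to_moments(Kuu, lam1, Lam2):
    """Eq. (24): recover (m_u, S_u) from the tied dual parameters."""
    Kinv = np.linalg.inv(Kuu)
    S_u = np.linalg.inv(Kinv + Kinv @ Lam2 @ Kinv)
    m_u = S_u @ (Kinv @ lam1)
    return m_u, S_u

def tsvgp_estep(Kuf_batch, m_f, alpha, beta, lam1, Lam2, rho):
    """One tied natural-gradient update, Eqs. (22)-(23).

    Kuf_batch: (m, n_b) matrix whose columns are k_{ui} for the minibatch;
    m_f: current marginal means of q(f_i) from Eq. (7);
    alpha, beta: expected gradient / negative expected Hessian of
    log p(y_i | f_i) under the current marginals (e.g., via quadrature).
    """
    g1 = beta * m_f + alpha                                 # g_{1,i}
    lam1 = (1 - rho) * lam1 + rho * Kuf_batch @ g1          # Eq. (22)
    Lam2 = (1 - rho) * Lam2 + rho * (Kuf_batch * beta) @ Kuf_batch.T  # Eq. (23)
    return lam1, Lam2
```

A convenient property of this form is that the hyperparameter gradient of the M-step objective can be taken through `dual_to_moments` while holding `lam1` and `Lam2` fixed, mirroring the proposed M-step of Eq. (15).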
[Figure 2: contour plots of the M-step objective $\mathcal{L}_\xi(\xi^*(\theta_{\mathrm{old}}), \theta)$ over $(\theta_{\mathrm{old}}, \theta) \in [0.5, 3]^2$ for (a) t-SVGP (ours) and (b) q-SVGP (whitened) and q-SVGP.]
Figure 2: EM iterations for hyperparameter learning on a classification task on a toy dataset. Top row: for the M-step, the optima of the t-SVGP objective (left) are less sensitive to the initial hyperparameter value $\theta_{\mathrm{old}}$, compared to q-SVGP (middle and right). Bottom row: the EM iterations are shown in white on top of the EM objectives for a range of starting values of $\theta_{\mathrm{old}}$. The starting point $\theta = 2.5$ is marked with a white dot and the optimum with a star. t-SVGP converges much faster (2 iterations in the leftmost plot) than q-SVGP (middle and rightmost plots, which take >5 iterations).

5 Empirical Evaluation

We conduct experiments to highlight the advantages of the dual parameterization. First, we study the effect of the improved objective for hyperparameter learning in t-SVGP versus q-SVGP. We study the objective being optimized in a single M-step, after an E-step run until convergence. We then show a full sequence of EM iterations on small data sets. For large-scale data, where running steps to convergence is expensive, we use partial E- and M-steps and mini-batching. Our improved bound and faster natural-gradient computations show benefits in both settings. It is worth noting that the E-steps of q-SVGP with natural gradients and of t-SVGP are identical up to machine precision, so any differences in performance are attributable to the different parameterization.

The Role of the Learning Objective. In Fig. 1, we learn the kernel hyperparameters $\theta$ in a GP classification task via coordinate ascent on the lower bound $\mathcal{L}_\xi(\xi, \theta)$, where $\xi$ are the variational parameters, i.e., via EM. Starting at hyperparameter $\theta_{\mathrm{old}}$, we denote by $\xi^*(\theta_{\mathrm{old}})$ the associated optimal variational parameters.

[Figure 3: ELBO and NLPD training curves on UCI regression and classification tasks for q-SVGP (whitened), q-SVGP (whitened, with natgrads), and t-SVGP (with natgrads).]
Figure 3: Comparison of convergence in terms of ELBO and negative log-predictive density (NLPD), averaged over 5-fold cross-validation runs, on UCI regression and classification tasks. All methods were trained (incl. hyperparameters and 50 inducing-input locations) with matching learning rates. Natural-gradient-based training is superior, and the t-SVGP parameterization improves stability.
Updating $\theta$ consists in optimizing $\mathcal{L}_\xi(\xi^*(\theta_{\mathrm{old}}), \theta)$, which we show in the left panel of Fig. 1 for the dual parameterization (blue), and the standard whitened (orange) and unwhitened (green) SVGP parameterizations. The dual parameterization leads to a tighter bound, and thus to bigger steps and faster overall convergence, as shown on the right for the illustrative toy classification task, starting at $(\theta_1, \theta_2) = (1, 1)$, in the extreme case of taking both the E- and M-steps to convergence. For the toy data set we use $m = 10$ inducing points (see details in App. F).

In Fig. 2, we use the same toy data and parameters as in Fig. 1, but show how the learning objective changes over iterations. The blue contours show, for each initial $\theta_{\mathrm{old}}$, the objective maximized in the M-step, i.e., $\mathcal{L}_\xi(\xi^*(\theta_{\mathrm{old}}), \theta)$. The orange lines show, for each initial $\theta_{\mathrm{old}}$, the outcome of an E-step followed by an M-step, i.e., $\theta^*(\theta_{\mathrm{old}}) = \arg\max_\theta \mathcal{L}_\xi(\xi^*(\theta_{\mathrm{old}}), \theta)$. The EM iterations converge to the fixed points of $\theta^*$, i.e., its intersections with the diagonal line of the identity function. A flatter line around the optimal value is desirable, as it means the iterations converge faster to the optimum (which in this experiment is just below one), while a line close to the diagonal leads to slow convergence. Here t-SVGP converges fastest; q-SVGP performs poorly, although whitening clearly helps the optimization problem. The dark dashed lines show how optimizing $\theta$ would proceed starting from $\theta_0 = 2.5$ and running 8 iterations for the different models.

Evaluation on UCI Classification and Regression Tasks. We use common small and mid-sized UCI data sets to test the performance of our method against q-SVGP with natural-gradient optimization and plain q-SVGP trained with the Adam optimizer for the variational parameters. All methods use Adam for the hyperparameters. The exact details of the data sets can be found in App. F. Here we again take the approach that, if the computational budget allows, the optimal way to optimize the ELBO is to alternate between E- and M-steps run until convergence. We plot how the different inference schemes perform in terms of ELBO and NLPD on a held-out test set. We perform 5-fold cross-validation, with the results in Fig. 3 showing the mean over folds for ELBO and NLPD. The natural-gradient variants of q-SVGP clearly outperform non-natural-gradient q-SVGP. Our method, t-SVGP, is more stable, specifically in NLPD, on most data sets, and otherwise on par with q-SVGP. For q-SVGP, we used the whitened version, which, as noted above, helps with hyperparameter optimization.

Improved Efficiency in Large-Scale Inference. To highlight the practical benefits, we show the performance of our stochastic and sparse t-SVGP framework on the MNIST ([23], available under CC BY-SA 3.0) multiclass classification task (for details, see App. F). Given that the data set has $n = 70{,}000$ points, minibatching is needed, and we adopt the parameterization of Eq. (21), meaning we match q-SVGP in parameter-storage complexity. We compare against the natural-gradient q-SVGP implementation, but not the non-natural-gradient version, since the latter produces considerably worse performance. In large-scale minibatching experiments, performing full E- and M-steps may not be efficient; instead, we perform partial steps for both. A single E- and M-step corresponds to the approach outlined in [33].
All experiments are performed with a batch size of $n_b = 200$ and $m = 100$ inducing points, and the optimization is run until convergence using the Adam optimizer for the hyperparameters (M-step). Table 1 shows different combinations of learning rates and numbers of E- and M-steps. The results suggest some benefit in running partial EM steps. The t-SVGP formulation performs equally well or better than q-SVGP under all settings.

Table 1: NLPD on MNIST benchmarks for different learning rates and numbers of E- and M-steps (mean ± std).

  NLPD (q-SVGP)    NLPD (t-SVGP)    LR (E)  LR (M)  #E  #M
  0.304 ± 0.015    0.304 ± 0.006    0.040   0.05    1   1
  0.289 ± 0.010    0.283 ± 0.007    0.035   0.10    2   1
  0.293 ± 0.020    0.281 ± 0.010    0.030   0.10    3   1
  0.259 ± 0.010    0.255 ± 0.006    0.025   0.03    4   2
  0.282 ± 0.007    0.283 ± 0.006    0.050   0.03    4   2
  0.243 ± 0.003    0.230 ± 0.009    0.030   0.03    4   1

[Figure 4: ELBO (left) and NLPD (right) on MNIST as a function of wall-clock time (s) for t-SVGP and q-SVGP.]
Figure 4: Comparison of practical inference and learning on MNIST. We compare training time on a laptop between t-SVGP and the q-SVGP model in GPflow, in terms of wall-clock time over 150 training steps, where both methods use natural-gradient updates and share the same learning rates.

In Fig. 4, we show the speed advantage of t-SVGP over q-SVGP due to cheaper natural-gradient updates. We compare against the state-of-the-art implementation of SVGP in GPflow ([26], v2.2.1) and a closely matched implementation of our method in GPflow. We compare the wall-clock time to compute 150 steps of the algorithm for both methods in terms of NLPD and ELBO, taking single E- and M-steps (MacBook Pro, 2 GHz CPU, 16 GB RAM). Our implementation avoids the use of sluggish automatic differentiation to compute the natural gradients and, even though it is not as optimized as the SVGP in GPflow, it is roughly 5 times faster on this standard benchmark.

6 Discussion and Conclusion

Sparse variational GP (SVGP) methods are the current de facto approach for scaling GPs to large problems. In this paper, we introduced an alternative parameterization for variational GPs that leads to an improved loss landscape for learning (cf. Fig. 1). The improvement hinges on writing the variational problem in terms of its dual, similar to the conjugate-computation variational inference (CVI) approach of Khan and Lin [17], using a site parameterization: we assume the approximate posterior decomposes into a prior contribution and a Gaussian approximate-likelihood contribution. Variational inference under this parameterization can conveniently be implemented by mirror descent and corresponds to natural-gradient-based learning, thus improving convergence in variational-parameter optimization (the E-step) while at the same time improving hyperparameter optimization (the M-step), thanks to the tighter evidence lower bound. We further derive the sparse equivalent of this method, which also allows for stochastic training through mini-batching, reducing the computational complexity to $\mathcal{O}(m^3 + n_b m^2)$ per step. Our method matches the asymptotic computational cost of other SVGP methods, while marginally reducing compute due to simpler expressions to backpropagate through (see the discussion in Sec. 5). Our empirical validation across a wide variety of regression and classification tasks confirms the benefits suggested by our theory: the proposed strategy typically improves stability over gold-standard SVGP methods even when the learning rates remain the same.
It allows for higher learning rates and reduces computational cost, leading to improved learning both in terms of fewer steps and in expected wall-clock time. We provide a reference implementation of our method under the GPflow framework at https://github.com/AaltoML/t-SVGP.

Acknowledgments and Disclosure of Funding

AS acknowledges funding from the Academy of Finland (grant numbers 324345 and 339730). We acknowledge the computational resources provided by the Aalto Science-IT project. We thank Stefanos Eleftheriadis, Richard E. Turner, and Hugh Salimbeni for comments on the manuscript.

References

[1] T. D. Bui, J. Yan, and R. E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18(1):3649–3720, 2017.
[2] E. Challis and D. Barber. Concave Gaussian variational approximations for inference in large-scale Bayesian linear models. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of Proceedings of Machine Learning Research, pages 199–207. PMLR, 2011.
[3] E. Challis and D. Barber. Gaussian Kullback–Leibler approximate inference. Journal of Machine Learning Research, 14(32):2239–2286, 2013.
[4] P. E. Chang, W. J. Wilkinson, M. E. Khan, and A. Solin. Fast variational learning in state-space Gaussian process models. In 2020 IEEE 30th International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, 2020.
[5] C.-A. Cheng and B. Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems 29 (NIPS), pages 4410–4418. Curran Associates, Inc., 2016.
[6] C.-A. Cheng and B. Boots. Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems 30 (NIPS), pages 5184–5194. Curran Associates, Inc., 2017.
[7] L. Csató. Gaussian Processes: Iterative Sparse Approximations. PhD thesis, Aston University, Birmingham, UK, 2002.
[8] L. Csató and M. Opper. Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668, 2002.
[9] J. Gardner, G. Pleiss, K. Q. Weinberger, D. Bindel, and A. G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 7576–7586. Curran Associates, Inc., 2018.
[10] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI), pages 282–290. AUAI Press, 2013.
[11] J. Hensman, A. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classification. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 38 of Proceedings of Machine Learning Research, pages 351–360. PMLR, 2015.
[12] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(4):1303–1347, 2013.
[13] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106:620–630, 1957.
[14] E. T. Jaynes. On the rationale of maximum-entropy methods. Proceedings of the IEEE, 70(9):939–952, 1982.
[15] M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in Adam.
In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 2611–2620. PMLR, 2018.
[16] M. E. Khan. Decoupled variational Gaussian inference. In Advances in Neural Information Processing Systems 27 (NIPS), pages 1547–1555. Curran Associates, Inc., 2014.
[17] M. E. Khan and W. Lin. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 878–887. PMLR, 2017.
[18] M. E. Khan and D. Nielsen. Fast yet simple natural-gradient descent for variational inference in complex models. In 2018 International Symposium on Information Theory and Its Applications (ISITA), pages 31–35. IEEE, 2018.
[19] M. E. Khan and H. Rue. Learning-algorithms from Bayesian principles. arXiv preprint, 2021.
[20] M. E. Khan, A. Aravkin, M. Friedlander, and M. Seeger. Fast dual variational inference for non-conjugate latent Gaussian models. In Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28 of Proceedings of Machine Learning Research, pages 951–959. PMLR, 2013.
[21] M. E. Khan, R. Babanezhad, W. Lin, M. Schmidt, and M. Sugiyama. Faster stochastic variational inference using proximal-gradient methods with general divergence functions. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI), pages 319–328. AUAI Press, 2016.
[22] G. Kimeldorf and G. Wahba. Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33(1):82–95, 1971.
[23] Y. LeCun, C. Cortes, and C. J. Burges. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/.
[24] Y. Li, J. M. Hernández-Lobato, and R. E. Turner. Stochastic expectation propagation. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2323–2331. Curran Associates, Inc., 2015.
[25] L. Malagò and G. Pistone. Information geometry of the Gaussian distribution in view of stochastic optimization. In Proceedings of the 2015 ACM Conference on Foundations of Genetic Algorithms XIII, pages 150–162, 2015.
[26] A. G. d. G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, 2017.
[27] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI), volume 17, pages 362–369. AUAI Press, 2001.
[28] H. Nickisch and C. E. Rasmussen. Approximations for binary Gaussian process classification. Journal of Machine Learning Research, 9(Oct):2035–2078, 2008.
[29] M. Opper and C. Archambeau. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.
[30] J. Quiñonero-Candela and C. E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
[31] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[32] H. Salimbeni, C.-A. Cheng, B. Boots, and M. Deisenroth. Orthogonally decoupled variational Gaussian processes.
In Advances in Neural Information Processing Systems 31 (NeurIPS), pages 8711–8720. Curran Associates, Inc., 2018.
[33] H. Salimbeni, S. Eleftheriadis, and J. Hensman. Natural gradients in practice: Non-conjugate variational inference in Gaussian process models. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (AISTATS), volume 84 of Proceedings of Machine Learning Research, pages 689–697. PMLR, 2018.
[34] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds and Sparse Approximations. PhD thesis, University of Edinburgh, Edinburgh, UK, 2003.
[35] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS), volume 5 of Proceedings of Machine Learning Research, pages 567–574. PMLR, 2009.
[36] M. van der Wilk, V. Dutordoir, S. John, A. Artemev, V. Adam, and J. Hensman. A framework for interdomain and multioutput Gaussian processes. arXiv preprint arXiv:2003.01115, 2020.
[37] M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[38] W. Wilkinson, A. Solin, and V. Adam. Sparse algorithms for Markovian Gaussian processes. In Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 130 of Proceedings of Machine Learning Research, pages 1747–1755. PMLR, 2021.
[39] C. K. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13 (NIPS), pages 682–688. MIT Press, 2001.