Published in Transactions on Machine Learning Research (05/2025)

Tighter sparse variational Gaussian processes

Thang D. Bui (thang.bui@anu.edu.au), School of Computing, Australian National University
Matthew Ashman (mca39@cam.ac.uk), Department of Engineering, University of Cambridge
Richard E. Turner (ret26@cam.ac.uk), Department of Engineering, University of Cambridge

Reviewed on OpenReview: https://openreview.net/forum?id=L33DSu3zvq

Abstract

Sparse variational Gaussian process (GP) approximations based on inducing points have become the de facto standard for scaling GPs to large datasets, owing to their theoretical elegance, computational efficiency, and ease of implementation. This paper introduces a provably tighter variational approximation by relaxing the standard assumption that the conditional approximate posterior given the inducing points must match that in the prior. The key innovation is to modify the conditional posterior to have smaller variances than those of the prior at the training points. We derive the collapsed bound for the regression case, describe how to use the proposed approximation in large data settings, and discuss how it can be applied to handle orthogonally structured inducing points and GP latent variable models. Extensive experiments on regression benchmarks, classification, and latent variable models demonstrate that the proposed approximation consistently matches or outperforms standard sparse variational GPs while maintaining the same computational cost.

1 Introduction

Gaussian processes (GPs) (Rasmussen & Williams, 2006) provide a powerful framework for modelling probability distributions over functions, offering principled uncertainty quantification and ease of use. Their flexibility in encoding domain knowledge such as smoothness, periodicity, or domain-specific structure has led to widespread adoption across scientific and engineering applications. Exact inference in GP models nevertheless poses significant computational challenges, requiring $O(N^3)$ time and $O(N^2)$ space for $N$ observations. A suite of approximations has been developed to address these limitations. Most notably, sparse variational Gaussian processes (SVGP; Titsias, 2009; Hensman et al., 2015; Matthews et al., 2016) address the poor computational complexity through the use of an approximate posterior distribution parameterised by a small set of inducing points. The standard SVGP framework employs a structured variational approximation that factorises the posterior distribution over the unknown function $f$ into two components: $q(f) = p(f_{\neq u}|u)\,q(u)$. Here, $p(f_{\neq u}|u)$ is the GP prior distribution over the remaining function values conditioned on the function values at the inducing locations $z$, $u = f(z)$. The second term, $q(u)$, is modelled as a multivariate Gaussian distribution. Improved variational approximations have been developed, such as SOLVE-GP (Shi et al., 2020), which use more sophisticated distributions for $q(u)$.

This paper introduces a novel approach to improving SVGP approximations by modifying the conditional GP prior distribution at observed inputs, rather than focusing solely on the inducing point distribution. For Gaussian likelihoods, our approach yields a new and improved collapsed lower bound on the log marginal likelihood that involves no additional variational parameters.
Furthermore, we show how the uncollapsed form of our bound facilitates the use of stochastic mini-batch optimisation and extends naturally to non-Gaussian likelihoods through a single additional variational parameter. We demonstrate the versatility of our method by integrating it with SOLVE-GP and extending it to sparse variational approximations in the GP latent variable model (GPLVM; Lawrence, 2005; Damianou et al., 2016). Our results demonstrate that, by targeting our improved lower bound, our approach consistently improves the predictive performance and log marginal likelihood estimates across a range of regression, classification, and latent variable modelling tasks. An implementation is made available at https://github.com/thangbui/tighter_sparse_gp. We note that the concurrent work of Titsias (2025) presents an identical approximation but offers a different exposition to this paper.

2 Background

This section provides a concise introduction to pseudo-point based sparse variational Gaussian processes (SVGP; Titsias, 2009; Hensman et al., 2015; Matthews et al., 2016; Bui et al., 2017). Consider GP regression with Gaussian observation noise:

$$p(f|\gamma) = \mathcal{GP}(f; 0, k_\gamma), \quad (1)$$
$$p(y|f, x, \sigma^2) = \mathcal{N}(y; f(x), \sigma^2 I), \quad (2)$$

where $x \in \mathbb{R}^{N \times D}$ and $y \in \mathbb{R}^N$ are the training inputs and corresponding noisy outputs, $f$ denotes the unknown function mapping from input to output, $k_\gamma$ is the covariance function governed by hyperparameters $\gamma$, and $\sigma^2$ is the observation noise. These hyperparameters, denoted collectively as $\theta$, can be found by maximising the log marginal likelihood:

$$\log p(y|\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (K_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|K_{ff} + \sigma^2 I|, \quad (3)$$

where $K_{ff}$ is the covariance between the training function values $f = f(x)$. This objective takes $O(N^3)$ time to compute and is thus computationally prohibitive for large $N$. To sidestep this, we use an approximate posterior judiciously parameterised by a small set of pseudo-points or inducing points as follows:

$$q(f) = p(f_{\neq f,u}|f, u)\, p(f|u)\, q(u), \quad (4)$$

where $u = f(z) \in \mathbb{R}^M$ and $z \in \mathbb{R}^{M \times D}$ are the inducing outputs and inputs, respectively, and $M \ll N$. The conditional $q(f_{\neq u}|u)$ in the approximate posterior is chosen to match that in the prior, leading to the following variational objective,

$$\mathcal{F}_0(q(u), \theta) = \Big\langle \log \frac{p(f)\,p(y|f, x)}{q(f)} \Big\rangle_{q(f)} = \Big\langle \log \frac{p(f_{\neq f,u}|f, u)\, p(f|u)\, p(u)\, p(y|f, x)}{p(f_{\neq f,u}|f, u)\, p(f|u)\, q(u)} \Big\rangle_{q(f)} = -\mathrm{KL}[q(u)\,\|\,p(u)] + \sum_{n=1}^{N} \big\langle \log p(y_n|f(x_n)) \big\rangle_{q(u)\,p(f(x_n)|u)}. \quad (5)$$

Titsias (2009) showed that when the likelihood is Gaussian, an analytic optimal form for $q(u)$ can be found, $q(u) \propto p(u)\,\mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I)$, and that a collapsed bound is also analytically available,

$$\mathcal{F}_1(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2\sigma^2}\mathrm{trace}(D_{ff}), \quad (6)$$

where $Q_{ff} = K_{fu}K_{uu}^{-1}K_{uf}$ and $D_{ff} = K_{ff} - Q_{ff}$. Crucially, the bound above can be computed in $O(NM^2)$. The non-collapsed bound in eq. (5) is amenable to non-Gaussian likelihoods and data mini-batch settings (see e.g., Hensman et al., 2015), further reducing the training computational complexity to $O(BM^2 + M^3)$ where $B$ is the mini-batch size. Due to this small complexity and the ease of implementation, the above variational approach has arguably become the go-to sparse approximation in the GP literature. In this work, we will revisit its core assumption of matching prior and posterior conditionals and show that relaxing this assumption results in a tighter and more performant approximation.
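To make the $O(NM^2)$ computation of eq. (6) concrete, the following is a minimal NumPy sketch of the Titsias collapsed bound for a squared-exponential kernel. It is illustrative only and not taken from the released implementation; the helper names (`rbf_kernel`, `titsias_collapsed_bound`) and the jitter value are our own assumptions.

```python
import numpy as np

def rbf_kernel(A, B, variance=1.0, lengthscale=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def titsias_collapsed_bound(X, y, Z, noise_var, **kern):
    # F_1(theta) in eq. (6): log N(y; 0, Q_ff + sigma^2 I) - trace(D_ff) / (2 sigma^2),
    # evaluated in O(N M^2) by only ever factorising M x M matrices.
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf_kernel(Z, Z, **kern) + 1e-8 * np.eye(M)    # jitter for numerical stability
    Kuf = rbf_kernel(Z, X, **kern)
    L = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(L, Kuf)                          # M x N, so that Q_ff = A^T A
    trace_D = np.sum(np.full(N, kern.get("variance", 1.0)) - np.sum(A**2, axis=0))
    B = np.eye(M) + (A @ A.T) / noise_var                # I + A A^T / sigma^2
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y) / noise_var
    # log|Q_ff + sigma^2 I| = log|B| + N log sigma^2 (matrix determinant lemma)
    logdet = 2 * np.sum(np.log(np.diag(LB))) + N * np.log(noise_var)
    # y^T (Q_ff + sigma^2 I)^{-1} y = y^T y / sigma^2 - c^T c (Woodbury identity)
    quad = y @ y / noise_var - c @ c
    return -0.5 * (N * np.log(2 * np.pi) + logdet + quad) - 0.5 * trace_D / noise_var
```

The Woodbury and matrix-determinant identities are what keep the cost at $O(NM^2)$: only the $M \times M$ matrix $B = I + AA^\top/\sigma^2$ is ever factorised.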
3 A tighter variational approximation

The variational approximation in eq. (4) is chosen such that the conditional $q(f|u)$ identically matches the prior conditional $p(f|u)$. Instead, we propose using the following variational posterior,

$$q(f) = p(f_{\neq f,u}|f, u)\, q(f|u)\, q(u), \quad (7)$$

where $q(f|u) = \mathcal{N}(f; K_{fu}K_{uu}^{-1}u,\ D_{ff}^{1/2} M D_{ff}^{1/2})$, $M$ is a diagonal matrix, $M = \mathrm{diag}([m_1, m_2, \ldots, m_N])$, and $m_n > 0$. Note that the mean of the prior conditional $p(f|u) = \mathcal{N}(f; K_{fu}K_{uu}^{-1}u,\ D_{ff})$ is retained in $q(f|u)$. The resulting variational bound is,

$$\mathcal{F}_2(q(u), \theta, M) = \Big\langle \log \frac{p(f_{\neq f,u}|f, u)\, p(f|u)\, p(u)\, p(y|f, x)}{p(f_{\neq f,u}|f, u)\, q(f|u)\, q(u)} \Big\rangle_{q(f)} = -\mathrm{KL}[q(u)\,\|\,p(u)] - \int_u q(u)\,\mathrm{KL}[q(f|u)\,\|\,p(f|u)] + \sum_{n=1}^{N} \big\langle \log p(y_n|f(x_n)) \big\rangle_{q(u)\,q(f(x_n)|u)}. \quad (8)$$

Due to the structure of the variational distribution, the middle term can be simplified to,

$$\int_u q(u)\,\mathrm{KL}[q(f|u)\,\|\,p(f|u)] = \frac{1}{2}\mathrm{trace}(M) - \frac{1}{2}\log|M| - \frac{N}{2} = -\frac{1}{2}\sum_{n=1}^{N} \left[1 + \log(m_n) - m_n\right].$$

Collapsed bound and optimal M. In the regression case, similar to the Titsias bound, we can obtain the optimal form $q(u) \propto p(u)\,\mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I)$, leading to the following collapsed bound,

$$\mathcal{F}_3(\theta, M) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2\sigma^2}\sum_{n=1}^{N} m_n d_n + \frac{1}{2}\sum_{n=1}^{N}\left[1 + \log(m_n) - m_n\right].$$

Setting the partial derivative of $\mathcal{F}_3(\theta, M)$ with respect to $m_n$ to 0, we arrive at $m_n = \frac{\sigma^2}{d_n + \sigma^2}$ and the following bound,

$$\mathcal{F}_4(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2}\sum_{n=1}^{N} \log\left(1 + \frac{d_n}{\sigma^2}\right), \quad (9)$$

where $d_n$ is the $n$-th element of the diagonal of $D_{ff}$, $d_n = k_{f_nf_n} - k_{f_nu}K_{uu}^{-1}k_{uf_n}$.

Comparison to the Titsias bound. When $M$ is the identity matrix, that is $m_n = 1$ for all $n$, the approximation in eq. (7) becomes the Titsias variational approximation in eq. (4) and the bound $\mathcal{F}_3(\theta, M)$ becomes the Titsias bound $\mathcal{F}_1(\theta)$ in eq. (6). We note that $\mathcal{F}_4(\theta)$ is tighter than $\mathcal{F}_1(\theta)$ due to the inequality $\log(1+x) \le x$ for all $x > -1$. Our solution improves upon the solution to the Titsias bound by allowing the marginals of the conditional approximate posterior, $q(f(x_n)|u)$, to have smaller variances than those of the conditional prior, since the optimal $m_n = \frac{\sigma^2}{d_n + \sigma^2} < 1$. Intuitively, this reduces the strength of the coupling between $q(u)$ and $q(f)$, enabling $q(f)$ at the training points to more freely model the data whilst allowing $q(f)$ to stay close to the prior elsewhere. It is also worth noting that the middle term of our bound is always non-positive. One might think that adding this term to the bound would give a poorer approximation; yet the improvement in the expected log-likelihood term (due to the smaller predictive variances at the training points; see the predictions below) is large enough to counteract it.

Stochastic mini-batch settings. The new bound can also handle data mini-batching, yielding an unbiased estimator of the uncollapsed bound in eq. (8) as follows,

$$\mathcal{F}_2(q(u), \theta, M) \approx -\mathrm{KL}[q(u)\,\|\,p(u)] + \frac{N}{B}\sum_{b=1}^{B} \Big( \frac{1}{2}\left[1 + \log(m_b) - m_b\right] + \big\langle \log p(y_b|f(x_b)) \big\rangle_{q(u)\,q(f(x_b)|u)} \Big). \quad (10)$$

Non-Gaussian likelihoods and $m_n$ parameterisation. One can parameterise the $m_n$'s to satisfy their positivity constraint and optimise them directly, at the cost of having $N$ extra parameters. However, the optimal form for $m_n$ in the Gaussian likelihood setting suggests a more efficient parameterisation, $m_n = \beta/(d_n + \beta)$ with $\beta > 0$ shared across all data points. We will use the latter parameterisation for all of our experiments.
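Continuing the sketch from section 2 (and reusing its `rbf_kernel` helper), the tighter collapsed bound of eq. (9) only changes the correction term: the trace term is replaced by the per-point $\log(1 + d_n/\sigma^2)$ terms, with the optimal $m_n = \sigma^2/(d_n + \sigma^2)$ available as a by-product. Again a hedged, illustrative sketch rather than the released implementation.

```python
import numpy as np

def tighter_collapsed_bound(X, y, Z, noise_var, **kern):
    # F_4(theta) in eq. (9): identical to the Titsias bound except for the correction
    # term, which becomes -0.5 * sum_n log(1 + d_n / sigma^2) with
    # d_n = k(x_n, x_n) - k_{x_n u} K_uu^{-1} k_{u x_n}.
    N, M = X.shape[0], Z.shape[0]
    Kuu = rbf_kernel(Z, Z, **kern) + 1e-8 * np.eye(M)    # rbf_kernel from the section 2 sketch
    Kuf = rbf_kernel(Z, X, **kern)
    L = np.linalg.cholesky(Kuu)
    A = np.linalg.solve(L, Kuf)                          # Q_ff = A^T A
    d = np.full(N, kern.get("variance", 1.0)) - np.sum(A**2, axis=0)   # diag(K_ff - Q_ff)
    m_opt = noise_var / (d + noise_var)                  # optimal m_n, always in (0, 1]
    B = np.eye(M) + (A @ A.T) / noise_var
    LB = np.linalg.cholesky(B)
    c = np.linalg.solve(LB, A @ y) / noise_var
    logdet = 2 * np.sum(np.log(np.diag(LB))) + N * np.log(noise_var)
    quad = y @ y / noise_var - c @ c
    correction = -0.5 * np.sum(np.log1p(d / noise_var))  # equals 0.5 * sum(log m_opt)
    bound = -0.5 * (N * np.log(2 * np.pi) + logdet + quad) + correction
    # m_opt is also what the approximate training-point variance v_n = m_n d_n + ... uses.
    return bound, m_opt
```

Since $\log(1 + d_n/\sigma^2) \le d_n/\sigma^2$ element-wise, this correction is never smaller than the $-\mathrm{trace}(D_{ff})/(2\sigma^2)$ term of eq. (6), which is exactly the tightness claim made above.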
Predictions. The predictive mean and variance of the predictive distribution at a test input $x_*$ are

$$m_* = k_{*u}K_{uu}^{-1}m_u, \quad (11)$$
$$v_* = k_{**} - k_{*u}K_{uu}^{-1}k_{u*} + k_{*u}K_{uu}^{-1}S_uK_{uu}^{-1}k_{u*} - (k_{*f} - Q_{*f})\,V_{ff}\,(k_{f*} - Q_{f*}), \quad (12)$$

where $V_{ff} = D_{ff}^{-1/2}(I - M)D_{ff}^{-1/2}$. Note that (i) we can compute the predictive mean at the same cost as previous sparse approximations, and (ii) the predictive variance at a training point can be approximated by $v_n = m_n d_n + k_{f_nu}K_{uu}^{-1}S_uK_{uu}^{-1}k_{uf_n}$. More generally, the variance at a new input that is not a training or inducing input is expensive due to the presence of $D_{ff}$ in the last term. One path to address this could be to approximate $D_{ff}$ by its diagonal, or to use only a subset of the training points for this computation. However, we find that simply ignoring the last term at test time does not impact the predictive performance while substantially reducing the prediction cost (see section 6.2).

Connections to existing bounds. We can use the log-sum inequality[1] to bound the last term of our collapsed bound:

$$\sum_{n=1}^{N}\log\left(1 + \frac{d_n}{\sigma^2}\right) \le N\log\left(1 + \frac{\sum_{n=1}^{N} d_n}{N\sigma^2}\right) = N\log\left(1 + \frac{\mathrm{trace}(K_{ff} - Q_{ff})}{N\sigma^2}\right).$$

Thus a looser collapsed bound can be obtained:

$$\mathcal{F}_5(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{N}{2}\log\left(1 + \frac{\mathrm{trace}(K_{ff} - Q_{ff})}{N\sigma^2}\right).$$

This bound was derived by Artemev et al. (2021) based on bounds of the quadratic and log-determinant terms in the exact log marginal likelihood. It is also tighter than the Titsias bound, that is, $\mathcal{F}_4(\theta) \ge \mathcal{F}_5(\theta) \ge \mathcal{F}_1(\theta)$. One can also view the proposed variational approximation as an instance of the sparse orthogonal approach of Shi et al. (2020) in which there are two sets of inducing points $u$ and $v$ with $v := f$, $m_v := 0$, and $S_v := D_{ff}^{1/2}MD_{ff}^{1/2}$. However, this view does not suggest new insights or potential improvements. We will next discuss how to use the proposed variational approximation to improve the sparse orthogonal approach and in the latent variable setting.

[1] For non-negative numbers $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_n$, $\sum_{i=1}^{n} a_i \log\frac{a_i}{b_i} \ge \left(\sum_{i=1}^{n} a_i\right)\log\frac{\sum_{i=1}^{n} a_i}{\sum_{i=1}^{n} b_i}$, with equality iff $a_i/b_i$ is constant.

4 Application to sparse orthogonal variational GPs

The sparse orthogonal approach (SOLVEGP) of Shi et al. (2020) can be viewed as a structured approximation with two sets of pseudo-points $u$ and $v$,

$$q(f) = p(f_{\neq f,u,v}|f, u, v)\, p(f|u, v)\, q(u, v),$$
$$q(u, v) = \mathcal{N}(u; m_u, S_u)\,\mathcal{N}(v; K_{vu}K_{uu}^{-1}u + m_v, S_v) = \mathcal{N}\!\left(\begin{bmatrix}u\\ v\end{bmatrix};\ \begin{bmatrix}m_u\\ K_{vu}K_{uu}^{-1}m_u + m_v\end{bmatrix},\ \begin{bmatrix}S_u & S_uK_{uu}^{-1}K_{uv}\\ K_{vu}K_{uu}^{-1}S_u & S_v + K_{vu}K_{uu}^{-1}S_uK_{uu}^{-1}K_{uv}\end{bmatrix}\right),$$

where $(m_u, S_u)$ and $(m_v, S_v)$ are the mean and covariance variational parameters. This approximation brings computational benefits over naively using a single set of pseudo-points with cardinality $M = M_u + M_v$ while matching the latter's performance. We will show that the same trick used for sparse variational GPs, relaxing the conditional matching assumption $q(f|u, v) = p(f|u, v)$, can improve SOLVEGP. In particular, similar to SVGP, we will use $q(f|u, v) = \mathcal{N}(f; K_{f,uv}K_{uv,uv}^{-1}[u^\top, v^\top]^\top,\ D_{ff}^{1/2}MD_{ff}^{1/2})$ where $D_{ff} = K_{ff} - K_{f,uv}K_{uv,uv}^{-1}K_{uv,f}$, $M$ is a diagonal matrix, $M = \mathrm{diag}([m_1, m_2, \ldots, m_N])$, and $m_n > 0$. The resulting variational bound is

$$\mathcal{F}_6(q(u, v), \theta, M) = \Big\langle \log \frac{p(f_{\neq f,u,v}|f, u, v)\, p(f|u, v)\, p(u, v)\, p(y|f, x)}{p(f_{\neq f,u,v}|f, u, v)\, q(f|u, v)\, q(u, v)} \Big\rangle_{q(f)} = -\mathrm{KL}[q(u)\,\|\,p(u)] - \mathrm{KL}[\tilde{q}(v)\,\|\,\tilde{p}(v)] + \frac{1}{2}\sum_{n=1}^{N}\left[1 + \log(m_n) - m_n\right] + \sum_{n=1}^{N}\big\langle \log p(y_n|f(x_n)) \big\rangle_{q(u,v)\,q(f(x_n)|u,v)}, \quad (14)$$

where $\tilde{q}(v) = \mathcal{N}(v; m_v, S_v)$, $\tilde{p}(v) = \mathcal{N}(v; 0, C_{vv})$, and $C_{vv} = K_{vv} - K_{vu}K_{uu}^{-1}K_{uv}$. Note that the predictive distribution at a training point can be approximated efficiently, $q(f(x_n)) \approx \mathcal{N}(f(x_n); m_n^{(f)}, v_n^{(f)})$, with

$$m_n^{(f)} = k_{f_nu}K_{uu}^{-1}m_u + c_{f_nv}C_{vv}^{-1}m_v,$$
$$v_n^{(f)} = m_n\left(c_{f_nf_n} - c_{f_nv}C_{vv}^{-1}c_{vf_n}\right) + k_{f_nu}K_{uu}^{-1}S_uK_{uu}^{-1}k_{uf_n} + c_{f_nv}C_{vv}^{-1}S_vC_{vv}^{-1}c_{vf_n},$$

where $c_{ab} = k_{ab} - k_{au}K_{uu}^{-1}k_{ub}$. Similar to SVGP, the predictive variance at a new test point is expensive due to the dependence on all training points. However, similar to the tighter approximation in section 3, we found that simply ignoring this difficult term works well in practice.

5 Application to Bayesian GP latent variable models

Consider a GP latent variable model (GPLVM; Lawrence, 2005) with Gaussian observation noise:

$$p(x) = \mathcal{N}(x; 0, I), \quad p(f|\gamma) = \mathcal{GP}(f; 0, k_\gamma), \quad p(y|f, x) = \mathcal{N}(y; f(x), \sigma^2 I).$$

Both the posterior $p(f|y)$ and the marginal likelihood $p(y)$ are intractable. Instead, we introduce an approximate posterior of the following form:

$$q(f, x) = q(x)\, p(f_{\neq f,u}|f, u)\, q(f|u, x)\, q(u),$$
$$q(f|u, x) = \mathcal{N}(f; K_{fu}K_{uu}^{-1}u,\ D_{ff}^{1/2}M(x)D_{ff}^{1/2}),$$

where $M(x) = \mathrm{diag}([m_1(x_1), m_2(x_2), \ldots, m_N(x_N)])$ and $m_n(x_n) > 0$. Note that $q(f|u, x)$ depends on $x$ through $K_{fu}$, $D_{ff}$, and $M(x)$, and that when $M$ is the identity matrix, that is $m_n = 1$, we obtain the variational approximation of Damianou et al. (2016). We can bound the log marginal likelihood as

$$\mathcal{F}(q(f, x), \theta) = -\mathrm{KL}[q(x)\,\|\,p(x)] - \mathrm{KL}[q(u)\,\|\,p(u)] + \frac{1}{2}\sum_{n=1}^{N}\big\langle 1 + \log(m_n(x_n)) - m_n(x_n) \big\rangle_{q(x_n)} + \sum_{n=1}^{N}\big\langle \log p(y_n|f(x_n)) \big\rangle_{q(x_n)\,q(u)\,q(f(x_n)|x_n, u)}.$$

We can obtain the collapsed bound by noting that the optimal form for $q(u)$ is given by $q(u) \propto p(u)\exp\big\langle \log \mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I) \big\rangle_{q(x)}$. Note also that

$$\int_{u,f} q(u)\,q(f|u)\,\log p(y|f) = \int_u q(u)\,\log \mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I) - \frac{1}{2\sigma^2}\sum_{n=1}^{N} m_n(x_n)\, d_n(x_n).$$

Together with Jensen's inequality, we arrive at the collapsed bound

$$\mathcal{F}(q(x), \theta) = -\mathrm{KL}[q(x)\,\|\,p(x)] + \frac{1}{2}\sum_{n=1}^{N}\Big\langle 1 + \log m_n(x_n) - m_n(x_n) - \frac{m_n(x_n)\, d_n(x_n)}{\sigma^2} \Big\rangle_{q(x_n)} + \log \int_u e^{\langle \log \mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I)\rangle_{q(x)}}\, p(u)\, du.$$

Setting the derivative with respect to $m_n(x_n)$ to 0 gives a stationarity condition that is satisfied by $m_n(x_n) = \frac{\sigma^2}{d_n(x_n) + \sigma^2}$ or $m_n(x_n) = \big\langle \frac{\sigma^2}{d_n(x_n) + \sigma^2} \big\rangle_{q(x_n)}$. The former is easier to implement as we do not need to (approximately) integrate out $x_n$ to find $m_n$.
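To illustrate the prediction rule of eqs. (11) and (12), and the cheap variant used in the experiments that drops the last, $D_{ff}$-dependent term, here is a hedged NumPy sketch (reusing the `rbf_kernel` helper from section 2). The argument names, including `mu_u` and `Su` for the mean and covariance of $q(u)$ and `M_diag` for the diagonal of $M$, are our own; the full-variance branch is a naive $O(N^3)$ illustration only.

```python
import numpy as np

def predict(Xstar, Z, mu_u, Su, include_dff_term=False, X=None, M_diag=None, **kern):
    # Predictive mean (eq. 11) and marginal variance (eq. 12). By default the last
    # term of eq. (12), which involves D_ff over all N training points, is dropped;
    # this is the cheap variant evaluated in section 6.2.
    Kuu = rbf_kernel(Z, Z, **kern) + 1e-8 * np.eye(Z.shape[0])
    Ksu = rbf_kernel(Xstar, Z, **kern)
    kss = np.full(Xstar.shape[0], kern.get("variance", 1.0))
    KuuinvKus = np.linalg.solve(Kuu, Ksu.T)                         # K_uu^{-1} k_{u*}
    mean = KuuinvKus.T @ mu_u                                       # k_{*u} K_uu^{-1} m_u
    var = (kss - np.sum(Ksu.T * KuuinvKus, axis=0)
           + np.sum(KuuinvKus * (Su @ KuuinvKus), axis=0))
    if include_dff_term:
        # Full eq. (12): subtract (k_{*f} - Q_{*f}) V_ff (k_{f*} - Q_{f*}) with
        # V_ff = D_ff^{-1/2} (I - M) D_ff^{-1/2}; naive O(N^3), small N only.
        Kuf = rbf_kernel(Z, X, **kern)
        Dff = rbf_kernel(X, X, **kern) - Kuf.T @ np.linalg.solve(Kuu, Kuf)
        w, U = np.linalg.eigh(Dff + 1e-8 * np.eye(X.shape[0]))
        Dinv_sqrt = U @ np.diag(1.0 / np.sqrt(w)) @ U.T             # D_ff^{-1/2}
        Vff = Dinv_sqrt @ np.diag(1.0 - M_diag) @ Dinv_sqrt
        R = rbf_kernel(Xstar, X, **kern) - Ksu @ np.linalg.solve(Kuu, Kuf)   # k_{*f} - Q_{*f}
        var = var - np.sum(R * (R @ Vff), axis=1)
    return mean, var
```

Dropping the last term only ever increases the reported variance, since $V_{ff}$ is positive semi-definite whenever $m_n \le 1$.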
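As a small illustration of the pointwise choice $m_n(x_n) = \sigma^2/(d_n(x_n) + \sigma^2)$ inside the latent-variable bound, the sketch below Monte Carlo estimates the $\frac{1}{2}\sum_n \langle 1 + \log m_n(x_n) - m_n(x_n)\rangle_{q(x_n)}$ term for a factorised Gaussian $q(x)$. It reuses the `rbf_kernel` helper from section 2; the function name, the sampling scheme, and the default number of samples are our own assumptions, not the paper's implementation.

```python
import numpy as np

def gplvm_m_term(q_mean, q_std, Z, noise_var, n_samples=64, seed=0, **kern):
    # Monte Carlo estimate of 0.5 * sum_n <1 + log m_n(x_n) - m_n(x_n)>_{q(x_n)}
    # with the pointwise choice m_n(x) = sigma^2 / (d_n(x) + sigma^2), where
    # d_n(x) = k(x, x) - k_{x u} K_uu^{-1} k_{u x}. Here q(x_n) = N(q_mean[n], diag(q_std[n]^2))
    # and Z are inducing inputs living in the latent space.
    rng = np.random.default_rng(seed)
    Kuu = rbf_kernel(Z, Z, **kern) + 1e-8 * np.eye(Z.shape[0])
    total = 0.0
    for _ in range(n_samples):
        Xs = q_mean + q_std * rng.standard_normal(q_mean.shape)     # one sample per latent x_n
        Kuf = rbf_kernel(Z, Xs, **kern)
        d = kern.get("variance", 1.0) - np.sum(Kuf * np.linalg.solve(Kuu, Kuf), axis=0)
        m = noise_var / (np.maximum(d, 0.0) + noise_var)            # clip tiny negative d
        total += 0.5 * np.sum(1.0 + np.log(m) - m)
    return total / n_samples
```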
The final values for both Published in Transactions on Machine Learning Research (05/2025) Test Log-likelihood kin40k N=40000, D=8 protein N=45730, D=9 keggdirected N=48827, D=20 keggundirected N=63608, D=27 Test Log-likelihood 3droad N=434874, D=3 song N=515345, D=90 buzz N=583250, D=77 houseelectric N=2049280, D=11 M=256 M=512 M=1024 Tighter is better Tighter is worse Figure 2: Test log-likelihood for various sparse approximations on eight regression datasets and various numbers of pseudo-points. For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Higher is better. Best viewed in colour. uncollapsed and collapsed versions of the proposed bound appear tighter than that of the Titsias bound in practice. The learned hyperparameters reveal that T-SGPR prefers smaller observation noise (0.115) and larger kernel variance (0.107) compared to that of SGPR (0.126 and 0.087, respectively). 6.2 Efficient predictive variances A key practical consideration is the computational cost of predictive variance in eq. (12). The exact computation requires D 1 ff which scales poorly with the training set size. We evaluate a simplified variant that omits the term that involves Dff, the last term in eq. (12), and compare it to the exact variance calculation across three small benchmark datasets: wine, solar, and pumadyn32nm. Table 1 presents a detailed comparison between the exact and approximate versions. We can see a pattern across all datasets: the simplified variant consistently matches the full model s performance while offering substantial computational savings. For this reason, we will be using the simplified version for all remaining experiments. The improvements in predictive performance in sections 6.3 and 6.4 are therefore solely due to better estimation of the hyperparameters and a different q(u). Dataset N/D eq. (12) RMSE Log-likelihood Time (s) wine 1599/11 w. last term 0.47 0.01 -0.66 0.01 0.15 0.00 wo. last term 0.47 0.01 -0.66 0.01 0.03 0.00 solar 1066/10 w. last term 0.93 0.07 -1.57 0.20 0.07 0.00 wo. last term 0.93 0.07 -1.56 0.20 0.03 0.00 pumadyn32nm 8192/32 w. last term 1.00 0.01 -1.42 0.01 21.12 0.06 wo. last term 1.00 0.01 -1.42 0.01 0.05 0.00 Table 1: RMSE, log-likelihood, and run time for two variants of predictive variance computation. Published in Transactions on Machine Learning Research (05/2025) Train ELBO / N SVGP, M=512 T-SVGP, M=512 SOLVEGP, M=512+512 T-SOLVEGP, M=512+512 Test log-likelihood 0 50 100 150 200 250 300 Epoch Figure 3: Variational bound and test performance for various approximations trained on the kin40k dataset. Best viewed in colour. 6.3 Large-scale regression benchmarks We next compare four methods, SVGP, T-SVGP, SOLVEGP, and the SOLVEGP variant in eq. (14) [TSOLVEGP], across three inducing-point configurations (M = 256, 512, 1024), on eight medium to large regression datasets. The datasets range from 40K to 2M data points with varying input dimensionalities (Yang et al., 2015). We use the Matern-3/2 kernel and repeat each experiment 10 times, each employing a random train/test split. The comparison results are shown in figs. 2, 7 and 8. We note that (i) both T-SVGP and T-SOLVEGP consistently match or slightly outperform (on 5/8 datasets), or significantly outperform (on 3/8 datasets) their base counterparts, (ii) the performance improvement is also consistent across various inducing-point configurations, (iii) the improvements (on 3/8 datasets) are also consistent across training runs and iterations, as shown in fig. 
Figure 3: Variational bound (train ELBO / N) and test log-likelihood over training epochs for SVGP (M=512), T-SVGP (M=512), SOLVEGP (M=512+512), and T-SOLVEGP (M=512+512) trained on the kin40k dataset. Best viewed in colour.

6.3 Large-scale regression benchmarks

We next compare four methods, SVGP, T-SVGP, SOLVEGP, and the SOLVEGP variant in eq. (14) [T-SOLVEGP], across three inducing-point configurations (M = 256, 512, 1024), on eight medium to large regression datasets. The datasets range from 40K to 2M data points with varying input dimensionalities (Yang et al., 2015). We use the Matern-3/2 kernel and repeat each experiment 10 times, each employing a random train/test split. The comparison results are shown in figs. 2, 7, and 8. We note that (i) both T-SVGP and T-SOLVEGP consistently match or slightly outperform (on 5/8 datasets), or significantly outperform (on 3/8 datasets), their base counterparts, (ii) the performance improvement is also consistent across the various inducing-point configurations, (iii) the improvements (on 3/8 datasets) are also consistent across training runs and iterations, as shown in fig. 3 for the kin40k dataset, and (iv) SVGP and T-SVGP (and similarly SOLVEGP and T-SOLVEGP) have almost identical run times, so the improvements here do not come at any cost. We also quantitatively compare the hyperparameters provided by these approximations, including the kernel variance, kernel lengthscale, and observation noise in figs. 9 to 11, respectively. Similar to the 1-D example above, we observe that the proposed approximations prefer smaller observation noises and larger kernel variances.

6.4 MNIST classification

To evaluate the performance of the proposed approximation on non-Gaussian likelihoods, we run an experiment on the MNIST digit classification task with 256, 512, 1024, and 2048 inducing points, using the SVGP, T-SVGP, SOLVEGP, and T-SOLVEGP variational objectives. Figure 4 shows that both of the proposed approximations achieve substantial performance gains in all metrics compared to their base versions.

Figure 4: Log marginal likelihood approximations, test accuracy, and test log-likelihood on the MNIST 10-way classification task for M = 256, 512, 1024, and 2048. Best viewed in colour.

6.5 Comparison to exact GP regression

We further evaluate the proposed approximation by comparing it to exact GP regression on a small dataset, following the set-up in Titsias (2009). We use the Boston housing dataset,[2] vary the number of inducing points, and use the collapsed bounds in eq. (6) [SGPR] and eq. (9) [T-SGPR]. The hyperparameters are fixed to those of the exact GP regression model so that the only variational parameters are the inducing inputs. We use four metrics for this comparison: the variational bound, test error, test log-likelihood, and the KL divergence between the predictive distributions given by exact GP regression and those provided by SGPR or T-SGPR. The average results across five repeats are included in fig. 5, confirming that the proposed tighter approximation consistently gives predictive distributions closer to the exact GPR predictions than those of SGPR. The test metrics that only use the marginal predictive distributions for T-SGPR are also better, consistent with the trend that, for fixed hyperparameters, a tighter variational approximation is better.

[2] https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

Figure 5: Comparing the Titsias collapsed bound and the proposed collapsed bound (SGPR, T-SGPR) to exact GP regression, including the KL (exact pred. || approx. pred.) and the test log-likelihood, with the model hyperparameters fixed to those of exact GP regression. Best viewed in colour.

In addition, we evaluate the impact of the hyperparameters on the bound and predictive performance when the variational approximation is fixed. Specifically, we optimise the uncollapsed SVGP bound on the Boston housing dataset until convergence and then subsequently switch to the tighter bound while keeping $q(u)$ and the inducing inputs fixed. The results included in fig. 6 demonstrate that the hyperparameters given by the improved bound yield marginally better predictive performance.

Figure 6: Comparing the SVGP bound and the proposed bound while keeping the variational distribution fixed. Best viewed in colour.

7 Conclusion

We build upon the standard sparse variational Gaussian process (SVGP) approximate posterior distribution through a simple modification to the conditional GP prior distribution at observed inputs.
Using our proposed posterior approximation, we derive a collapsed bound which improves upon existing SVGP lower bounds on the log marginal likelihood, and an uncollapsed form which facilitates its application with non-Gaussian likelihoods and is compatible with stochastic mini-batch optimisation. Furthermore, we show how our approach can be used to improve non-standard SVGP posterior approximations, such as SOLVE-GP (Shi et al., 2020). Our empirical results demonstrate consistent improvements in both predictive performance and log marginal likelihood estimates across diverse applications, including regression, classification, and latent variable modelling tasks. The proposed posterior approximations can be easily applied to other settings such as deep GPs and convolutional GPs (Van der Wilk et al., 2017; Blomqvist et al., 2020; Sun et al., 2021; Bui et al., 2016; Salimbeni & Deisenroth, 2017).

Acknowledgments

We thank Martin Jankowiak for pointing out a parameterisation issue in an earlier version of this paper. We also thank the TMLR editor and reviewers for the constructive feedback.

References

Artem Artemev, David R. Burt, and Mark van der Wilk. Tighter bounds on the log marginal likelihood of Gaussian process regression using conjugate gradients. In International Conference on Machine Learning, pp. 362-372, 2021.

Christopher M. Bishop and Gwilym D. James. Analysis of multiphase flows using dual-energy gamma densitometry and neural networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 327(2-3):580-593, 1993.

Kenneth Blomqvist, Samuel Kaski, and Markus Heinonen. Deep convolutional Gaussian processes. In European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 582-597, 2020.

Thang Bui, Daniel Hernandez-Lobato, Jose Hernandez-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472-1481, 2016.

Thang D. Bui, Josiah Yan, and Richard E. Turner. A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. Journal of Machine Learning Research, 18(104):1-72, 2017.

Andreas C. Damianou, Michalis K. Titsias, and Neil D. Lawrence. Variational inference for latent variables and uncertain inputs in Gaussian processes. Journal of Machine Learning Research, 17(42):1-62, 2016.

James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, pp. 351-360, 2015.

Vidhi Lalchand, Aditya Ravuri, and Neil D. Lawrence. Generalised GPLVM with stochastic variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 7841-7864, 2022.

Neil D. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. Journal of Machine Learning Research, 6:1783-1816, 2005.

Alexander G. de G. Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In International Conference on Artificial Intelligence and Statistics, pp. 231-239, 2016.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. Advances in Neural Information Processing Systems, 30, 2017.

Jiaxin Shi, Michalis Titsias, and Andriy Mnih. Sparse orthogonal variational inference for Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1932-1942, 2020.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. Advances in Neural Information Processing Systems, 18, 2005.

Shengyang Sun, Jiaxin Shi, Andrew Gordon Gordon Wilson, and Roger B. Grosse. Scalable variational Gaussian processes via harmonic kernel decomposition. In International Conference on Machine Learning, pp. 9955-9965, 2021.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567-574, 2009.

Michalis K. Titsias. New bounds for sparse variational Gaussian processes, 2025.

Mark Van der Wilk, Carl Edward Rasmussen, and James Hensman. Convolutional Gaussian processes. Advances in Neural Information Processing Systems, 30, 2017.

Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la Carte - Learning Fast Kernels. In International Conference on Artificial Intelligence and Statistics, pp. 1098-1106, 2015.

A An even tighter but expensive approximation

We consider a more general form for the conditional covariance of $q(f|u)$ as follows:

$$q(f) = p(f_{\neq f,u}|f, u)\, q(f|u)\, q(u), \qquad q(f|u) = \mathcal{N}(f; K_{fu}K_{uu}^{-1}u, C).$$

Again, we can obtain the optimal form $q(u) \propto p(u)\,\mathcal{N}(y; K_{fu}K_{uu}^{-1}u, \sigma^2 I)$, leading to the following collapsed bound

$$\mathcal{F}(\theta, C) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2}\mathrm{trace}\left[(\sigma^{-2}I + D_{ff}^{-1})C\right] - \frac{1}{2}\log|C^{-1}D_{ff}| + \frac{N}{2}.$$

We can derive the optimal $C$, $C^{-1} = D_{ff}^{-1} + \sigma^{-2}I$, and the bound becomes:

$$\mathcal{F}(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2}\log|I + \sigma^{-2}D_{ff}| = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{1}{2}\sum_{n}\log\left(1 + \sigma^{-2}\lambda_n(D_{ff})\right),$$

where $\lambda_n(X)$ is the $n$-th eigenvalue of $X$. The bound above is as expensive to compute as the original log marginal likelihood.

B Exploring alternative parameterisations for the conditional posterior

Instead of the general $C$ as above or the form considered in the main text, $C = D_{ff}^{1/2}MD_{ff}^{1/2}$, we consider two other parameterisations that might allow efficient collapsed/uncollapsed bounds and predictions. We first rewrite the uncollapsed bound and the predictive mean and variance here for clarity,

$$\mathcal{F}_{\mathrm{uncollapsed}} = -\mathrm{KL}[q(u)\,\|\,p(u)] - \int_u q(u)\,\mathrm{KL}[q(f|u)\,\|\,p(f|u)] + \sum_{n}\big\langle \log p(y_n|f(x_n)) \big\rangle_{q(u)\,q(f(x_n)|u)},$$
$$m_* = k_{*u}K_{uu}^{-1}m_u,$$
$$v_* = k_{**} - k_{*u}K_{uu}^{-1}k_{u*} + k_{*u}K_{uu}^{-1}S_uK_{uu}^{-1}k_{u*} - (k_{*f} - Q_{*f})\,D_{ff}^{-1}(D_{ff} - C)D_{ff}^{-1}\,(k_{f*} - Q_{f*}).$$

We first consider $C = M^{1/2}D_{ff}M^{1/2}$. While this allows efficient exact predictive marginal distributions at training points, the middle term in the bound is costly to compute due to the presence of $D_{ff}$:

$$\int_u q(u)\,\mathrm{KL}[q(f|u)\,\|\,p(f|u)] = \frac{1}{2}\mathrm{trace}(D_{ff}^{-1}M^{1/2}D_{ff}M^{1/2}) - \frac{1}{2}\log|M| - \frac{N}{2}.$$

Another special case of the parameterisation presented in the main text is $C = m\,D_{ff}$, i.e., a single $m$ is shared across all training points. This conveniently leads to tractable exact predictive variances at training points, $v_n = m\,d_n + k_{f_nu}K_{uu}^{-1}S_uK_{uu}^{-1}k_{uf_n}$. The middle term in the bound can be simplified to,

$$\int_u q(u)\,\mathrm{KL}[q(f|u)\,\|\,p(f|u)] = -\frac{N}{2}\left[1 + \log(m) - m\right].$$

In the regression case, this leads to the optimal $m = \sigma^2/(N^{-1}\sum_n d_n + \sigma^2)$ and the following collapsed bound,

$$\mathcal{F}_5(\theta) = -\frac{N}{2}\log(2\pi) - \frac{1}{2}y^\top (Q_{ff} + \sigma^2 I)^{-1} y - \frac{1}{2}\log|Q_{ff} + \sigma^2 I| - \frac{N}{2}\log\left(1 + \frac{\sum_n d_n}{N\sigma^2}\right).$$

This is identical to the bound derived by Artemev et al. (2021). As shown in the main text, this bound is looser than the collapsed bound in eq. (9), due to Jensen's inequality, $\log(1 + \sum_n x_n/N) \ge N^{-1}\sum_n \log(1 + x_n)$.
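The three collapsed bounds discussed in the main text and above differ only through the correction term added to $\log \mathcal{N}(y; 0, Q_{ff} + \sigma^2 I)$. The following small, self-contained check illustrates the ordering $\mathcal{F}_4(\theta) \ge \mathcal{F}_5(\theta) \ge \mathcal{F}_1(\theta)$ on a random stand-in for $\mathrm{diag}(K_{ff} - Q_{ff})$; it is an illustrative numerical check, not part of the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d = rng.gamma(shape=1.0, scale=0.5, size=1000)   # stand-in for diag(K_ff - Q_ff) >= 0
noise_var = 0.1

titsias = -0.5 * np.sum(d) / noise_var                          # F_1 correction, eq. (6)
shared_m = -0.5 * d.size * np.log1p(np.mean(d) / noise_var)     # F_5 / Artemev et al. correction
per_point = -0.5 * np.sum(np.log1p(d / noise_var))              # F_4 correction, eq. (9)

# log(1 + x) <= x and Jensen's inequality give per_point >= shared_m >= titsias.
assert per_point >= shared_m >= titsias
print(per_point, shared_m, titsias)
```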
C Additional experimental results

C.1 Large-scale regression benchmarks

In addition to the test log-likelihood results presented in the main text, we also compare the approximations using the test root mean squared error (RMSE) in fig. 7 and the variational lower bound in fig. 8. The proposed approximations (T-SVGP and T-SOLVEGP) tend to give smaller RMSEs and better log marginal likelihood approximations, indicating that tighter bounds lead to better predictions. We also quantitatively compare the hyperparameters provided by these approximations, including the kernel variance, kernel lengthscale, and observation noise in figs. 9 to 11, respectively. Overall, the proposed approximations prefer smaller observation noises and larger kernel variances.

Figure 7: Test root mean squared errors (RMSE) for various sparse approximations on the eight regression datasets and various numbers of pseudo-points (M = 256, 512, 1024). For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Lower is better. Best viewed in colour.

Figure 8: Log marginal likelihood approximations (ELBO) for various sparse approximations on the eight regression datasets and various numbers of pseudo-points (M = 256, 512, 1024). For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Higher is better. Best viewed in colour.

C.2 $m_n$ parameterisations for non-Gaussian likelihoods

We next assess the difference between two parameterisations of $m_n$, namely $m_n = \beta/(d_n + \beta)$ with $\beta$ shared across all data points, and $\{m_n\}_{n=1}^{N}$ with each $m_n$ specific to a training point, on a simple binary classification dataset. The results are included in fig. 12. We observe that (i) the former parameterisation is tighter than vanilla SVGP and the latter parameterisation is even tighter, and (ii) the difference between the predictive probabilities given by these parameterisations is small. For small and medium datasets, it is thus potentially beneficial to have $N$ additional parameters. For large datasets, the former parameterisation is preferred to keep the number of parameters manageable.

C.3 GPLVM on the oil flow dataset

Finally, we demonstrate the proposed method's applicability to latent variable models through experiments with the Bayesian GPLVM on the oil flow dataset. The multi-phase oil flow dataset consists of 1000 twelve-dimensional data points belonging to three classes which correspond to the different phases of oil flow in a pipeline (Bishop & James, 1993). Figure 13 compares the standard variational BGPLVM (Damianou et al., 2016; Lalchand et al., 2022) [V-BGPLVM] against the proposed approximation in section 5 [TV-BGPLVM], averaged across five repeats. The optimisation trajectories show that TV-BGPLVM achieves a lower final negative ELBO (roughly 5.5 versus 5.2), indicating a more accurate posterior approximation.
Figure 9: Kernel variances for various sparse approximations on the eight regression datasets and various numbers of pseudo-points (M = 256, 512, 1024). For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Best viewed in colour.

Figure 10: Observation noise variances for various sparse approximations on the eight regression datasets and various numbers of pseudo-points (M = 256, 512, 1024). For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Best viewed in colour.

Figure 11: Kernel lengthscales for various sparse approximations on the eight regression datasets and various numbers of pseudo-points (M = 256, 512, 1024). For SOLVEGP and T-SOLVEGP, M is evenly split for u and v. Best viewed in colour.

Figure 12: A comparison of $m_n$ parameterisations on a two-dimensional binary classification task. First: predictions when using the standard SVGP approximation (ELBO: -15.43). Second: predictions when using the proposed tighter SVGP bound with $m_n = \beta/(d_n + \beta)$ and $\beta$ shared across all data points (ELBO: -15.38). Third: predictions when using the proposed tighter bound with $N$ additional parameters $\{m_n\}_{n=1}^{N}$ (ELBO: -15.33). Fourth: the difference between the predictive probabilities given by the two $m_n$ parameterisations.

Figure 13: Optimisation traces (negative ELBO against iterations) for the variational Bayesian GPLVM on the oil flow dataset. Best viewed in colour.