# Projected Stein Variational Gradient Descent

Peng Chen, Omar Ghattas
Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin, Austin, TX 78712. {peng, omar}@oden.utexas.edu

34th Conference on Neural Information Processing Systems (NeurIPS 2020), Vancouver, Canada.

The curse of dimensionality is a longstanding challenge in Bayesian inference in high dimensions. In this work, we propose a projected Stein variational gradient descent (pSVGD) method to overcome this challenge by exploiting the fundamental property of intrinsic low dimensionality of the data-informed subspace stemming from the ill-posedness of such problems. We adaptively construct the subspace using a gradient information matrix of the log-likelihood, and apply pSVGD to the much lower-dimensional coefficients of the parameter projection. The method is demonstrated to be more accurate and efficient than SVGD. It is also shown to be more scalable with respect to the number of parameters, samples, data points, and processor cores via experiments with parameter dimensions ranging from the hundreds to the tens of thousands.

## 1 Introduction

Given observation data for a system with unknown parameters, Bayesian inference provides an optimal probabilistic framework for learning the parameters by updating their prior distribution to a posterior distribution. However, many conventional methods for solving high-dimensional Bayesian inference problems face the curse of dimensionality, i.e., the computational complexity grows rapidly, often exponentially, with respect to (w.r.t.) the number of parameters. To address the curse of dimensionality, intrinsic properties of the posterior distribution, such as its smoothness, sparsity, and intrinsic low-dimensionality, have been exploited to reduce the parameter correlation and to develop efficient methods whose complexity grows slowly or remains the same with increasing dimension.

By exploiting the geometry of the log-likelihood function, accelerated Markov chain Monte Carlo (MCMC) methods have been developed to reduce the sample correlation or increase the effective sample size independently of the dimension [21, 25, 27, 17, 18, 2]. Nevertheless, these random and essentially serial sampling methods remain prohibitive for large-scale inference problems with expensive likelihoods. Deterministic methods using sparse quadratures [32, 30, 10, 11] were shown to converge rapidly with dimension-independent rates for problems with smooth and sparse posteriors. However, for posteriors lacking smoothness or sparsity, the convergence deteriorates significantly, despite the incorporation of Hessian-based transformations [31, 12]. Transport-based variational inference is another type of deterministic method; it seeks a transport map in a function space (represented by, e.g., polynomials, kernels, or neural networks) that pushes the prior to the posterior by minimizing the difference between the transported prior and the posterior, measured in, e.g., Kullback-Leibler divergence [26, 24, 4, 3, 20]. In particular, kernel-based Stein variational methods, using gradient-based (SVGD) [24, 16, 23, 8] and Hessian-based (SVN) [19, 35] optimization methods, are shown to achieve fast convergence in relatively low dimensions. Nonetheless, the convergence and accuracy of these methods deteriorate in high dimensions due to the curse of dimensionality of the kernel representation. This can be partially addressed by a localized
SVGD on Markov blankets, which relies on conditional independence of the target distribution [37, 34], or by parameter projection for dimension reduction as in pSVN [14] and lazy maps [3].

**Contributions:** Here, we propose, analyze, and apply a projected SVGD (pSVGD) method to tackle the curse of dimensionality for high-dimensional nonlinear Bayesian inference problems. The method relies on the fundamental property that the posterior effectively differs from the prior only in a low-dimensional subspace of the high-dimensional parameter space; see [6, 33, 22, 18, 12, 13, 7, 3, 14] and references therein. Specifically, our contributions are: (1) we propose dimension reduction by projecting the parameters to a low-dimensional subspace constructed using the gradient of the log-likelihood, and push the prior samples of the projection coefficients to their posterior by pSVGD; (2) we prove the equivalence of the projected transport in the coefficient space and the transport in the projected parameter space; (3) we propose adaptive and parallel algorithms to efficiently approximate the optimal profile function and the gradient information matrix for the construction of the subspace; and (4) we demonstrate the accuracy (compared to SVGD) and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores on classical high-dimensional Bayesian inference problems.

The major differences of this work compared to pSVN [14] are: (1) pSVGD uses only gradient information of the log-likelihood, which is available for many models, while pSVN requires Hessian information, which is challenging to obtain for complex models and codes in practical applications; (2) the upper bound for the pSVGD projection error w.r.t. the posterior, based on a result from [36], is sharper than that for pSVN; (3) we prove the equivalence of the projected transport for the coefficients and the transport for the projected parameters; (4) we also test new benchmark examples and investigate the convergence of pSVGD w.r.t. the number of parameters and the scalability of pSVGD w.r.t. the number of data points.

## 2 Preliminaries

Let $x \in \mathbb{R}^d$ denote a random parameter of dimension $d \in \mathbb{N}$ with a continuous prior density $p_0 : \mathbb{R}^d \to \mathbb{R}$. Let $y = \{y_i\}_{i=1}^{s}$ denote a set of i.i.d. observation data, and let $f(x) := \prod_{i=1}^{s} p(y_i \mid x)$ denote, up to a multiplicative constant, a continuous likelihood of $y$ at given $x$. Then the posterior density of the parameter $x$ conditioned on the data $y$, denoted $p(\cdot) : \mathbb{R}^d \to \mathbb{R}$, is given by Bayes' rule as

$$
p(x) = \frac{1}{Z} f(x)\, p_0(x), \qquad (1)
$$

where $Z$ is the normalization constant defined as

$$
Z = \int_{\mathbb{R}^d} f(x)\, p_0(x)\, dx, \qquad (2)
$$

whose computation is typically intractable, especially for large $d$. The central task of Bayesian inference is to draw samples of the parameter $x$ from its posterior density $p$ and to compute some statistical quantity of interest, e.g., the mean and variance of the parameter $x$ or of some function of $x$.

SVGD is a variational inference method that seeks an approximation of the posterior density $p$ by a function $q$ in a predefined function set $\mathcal{Q}$, realized by minimizing the Kullback-Leibler (KL) divergence between the two densities, i.e.,

$$
q^* = \arg\min_{q \in \mathcal{Q}} D_{\mathrm{KL}}(q \mid p), \qquad (3)
$$

where $D_{\mathrm{KL}}(q \mid p) = \mathbb{E}_{x \sim q}[\log(q/p)]$, i.e., the average of $\log(q/p)$ with respect to the density $q$, which vanishes when $q = p$. In particular, a transport-based function set is considered, $\mathcal{Q} = \{T_\sharp\, p_0 : T \in \mathcal{T}\}$, where $T_\sharp\, p_0$ is the pushforward of the prior density to a new density $q := T_\sharp\, p_0$ through a transport map $T(\cdot) : \mathbb{R}^d \to \mathbb{R}^d$ in a space $\mathcal{T}$.
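Both SVGD and the projected variant introduced below only require pointwise evaluations of the log-likelihood, the log-prior, and their gradients. The following is a minimal, hedged sketch of these oracles for a hypothetical linear-Gaussian model; the matrix `A`, data `y`, and noise level are illustrative placeholders and not one of the paper's examples. Later sketches reuse these names.

```python
# Hedged sketch of the Bayesian setup (1)-(2): unnormalized log-posterior oracles
# for a toy linear-Gaussian model. All names here are illustrative placeholders.
import numpy as np

d, s = 100, 20                        # parameter and data dimensions (arbitrary)
rng = np.random.default_rng(0)
A = rng.standard_normal((s, d))       # hypothetical linear observation operator
x_true = rng.standard_normal(d)
sigma = 0.1                           # observation noise standard deviation
y = A @ x_true + sigma * rng.standard_normal(s)

def log_prior(x):                     # standard Gaussian prior p0 = N(0, I), up to a constant
    return -0.5 * np.dot(x, x)

def log_likelihood(x):                # log f(x) = sum_i log p(y_i | x), up to a constant
    r = y - A @ x
    return -0.5 * np.dot(r, r) / sigma**2

def grad_log_likelihood(x):           # grad_x log f(x); also used to build H in (10)
    return A.T @ (y - A @ x) / sigma**2

def grad_log_posterior(x):            # grad_x log p(x); the constant log Z drops out
    return grad_log_likelihood(x) - x
```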
Let $T$ be given by

$$
T(x) = x + \epsilon\, \varphi(x), \qquad (4)
$$

where $\varphi : \mathbb{R}^d \to \mathbb{R}^d$ is a differentiable perturbation map w.r.t. $x$, and $\epsilon > 0$ is small enough so that $T$ is invertible. It is shown in [24] that

$$
\nabla_\epsilon D_{\mathrm{KL}}(T_\sharp\, p_0 \mid p)\big|_{\epsilon = 0} = -\mathbb{E}_{x \sim p_0}[\mathrm{trace}(\mathcal{A}_p \varphi(x))], \qquad (5)
$$

where $\mathcal{A}_p$ is the Stein operator given by

$$
\mathcal{A}_p \varphi(x) = \nabla_x \log p(x)\, \varphi(x)^T + \nabla_x \varphi(x). \qquad (6)
$$

Based on this result, a practical SVGD algorithm was developed in [24] by choosing the space $\mathcal{T} = (\mathcal{H}_d)^d = \mathcal{H}_d \times \cdots \times \mathcal{H}_d$, a tensor product of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_d$ with kernel $k(\cdot, \cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. SVGD updates samples $x_1^0, \ldots, x_N^0$ drawn from the prior $p_0$ as

$$
x_m^{\ell+1} = x_m^\ell + \epsilon_\ell\, \hat{\varphi}^*_\ell(x_m^\ell), \quad m = 1, \ldots, N, \quad \ell = 0, 1, \ldots, \qquad (7)
$$

where $\epsilon_\ell$ is a step size or learning rate, and $\hat{\varphi}^*_\ell(x_m^\ell)$ is an approximate steepest direction given by

$$
\hat{\varphi}^*_\ell(x_m^\ell) = \frac{1}{N} \sum_{n=1}^{N} \left[ \nabla_{x_n^\ell} \log p(x_n^\ell)\, k(x_n^\ell, x_m^\ell) + \nabla_{x_n^\ell} k(x_n^\ell, x_m^\ell) \right]. \qquad (8)
$$

The kernel $k$ plays a critical role in pushing the samples to the posterior. One choice is the Gaussian kernel

$$
k(x, x') = \exp\left( -\frac{\|x - x'\|_2^2}{h} \right), \qquad (9)
$$

where $h$ is the bandwidth, e.g., $h = \mathrm{med}^2 / \log(N)$ with $\mathrm{med}$ representing the median of the sample distances [24]. However, it is known that the kernel suffers from the curse of dimensionality for large $d$ [28, 37, 34], i.e., the kernel degenerates and becomes close to a delta function for large $d$, which leads to samples that are not representative of the posterior, as observed in [37, 34] and also demonstrated in Section 4 by the inaccurate posterior sample variance and large testing error.
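The update (7)-(8) with the kernel (9) can be written compactly in a few lines. The following is a minimal NumPy sketch of one SVGD iteration, not the paper's implementation, assuming $N \ge 2$ samples and reusing the hypothetical `grad_log_posterior` oracle from the sketch above; any gradient oracle would do.

```python
# Hedged sketch of one SVGD step (7)-(8) with the Gaussian kernel (9) and the
# median bandwidth heuristic h = med^2 / log(N). Not the paper's code.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def svgd_step(X, grad_log_post, eps=1e-3):
    """One SVGD update of the samples X, shape (N, d), with step size eps."""
    N = X.shape[0]
    dists = pdist(X)                                   # pairwise sample distances
    h = np.median(dists) ** 2 / np.log(N)              # bandwidth heuristic for (9)
    D2 = squareform(dists) ** 2                        # ||x_n - x_m||_2^2
    K = np.exp(-D2 / h)                                # kernel matrix k(x_n, x_m)
    G = np.stack([grad_log_post(x) for x in X])        # grad_x log p at each sample
    # sum_n grad_{x_n} k(x_n, x_m) = (2/h) sum_n k(x_n, x_m) (x_m - x_n)
    grad_K = (2.0 / h) * (K[:, :, None] * (X[None, :, :] - X[:, None, :])).sum(axis=0)
    phi = (K.T @ G + grad_K) / N                       # steepest direction (8)
    return X + eps * phi                               # update (7)
```

For example, `X = svgd_step(X, grad_log_posterior)` applied repeatedly to prior samples `X` of shape `(N, d)`, possibly with a decaying or line-searched step size, implements the iteration (7).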
## 3 Projected Stein Variational Gradient Descent

To tackle the curse of dimensionality of SVGD, we exploit a fundamental property of many high-dimensional Bayesian inference problems: the posterior only effectively differs from the prior in a relatively low-dimensional subspace, due to the ill-posedness or over-parametrization of the inference problems; see many examples and some proofs in, e.g., [1, 5, 6, 33, 22, 18, 12, 36, 13, 7, 3, 14]. In the following, we present the method to find the subspace and project the parameter to it for dimension reduction in Section 3.1, which leads to the development of pSVGD in Section 3.2. We then present practical algorithms for pSVGD and adaptive pSVGD in Section 3.3.

### 3.1 Dimension reduction by projection

We find the subspace in which the posterior evidently differs from the prior using a gradient-based parameter sensitivity of the log-likelihood function [36]. More specifically, by $H \in \mathbb{R}^{d \times d}$ we denote a gradient information matrix, defined as the average of the outer product of the gradient of the log-likelihood w.r.t. the posterior, i.e.,

$$
H = \int_{\mathbb{R}^d} (\nabla_x \log f(x)) (\nabla_x \log f(x))^T\, p(x)\, dx. \qquad (10)
$$

By $(\lambda_i, \psi_i)_{i=1}^{r}$ we denote the dominant eigenpairs of $(H, \Gamma)$, with $\Gamma$ representing the covariance of the parameter $x$ w.r.t. its prior, i.e., $(\lambda_i, \psi_i)_{i=1}^{r}$ correspond to the $r$ largest eigenvalues $\lambda_1 \geq \cdots \geq \lambda_r$ of the generalized eigenvalue problem

$$
H \psi_i = \lambda_i \Gamma \psi_i. \qquad (11)
$$

Given $H$, which is practically computed in Section 3.3, the generalized eigenvalue problem (11) can be efficiently solved by a randomized algorithm [29] that only requires $O(r)$ matrix-vector products. We make the following key observation: the eigenvalue $\lambda_i$ measures the sensitivity of the data w.r.t. the parameters along direction $\psi_i$, i.e., the data mostly inform the parameters in directions $\psi_i$ corresponding to large eigenvalues $\lambda_i$. For $i$ with small $\lambda_i$, close to zero, the variation of the likelihood $f$ in direction $\psi_i$ is negligible, so the posterior is close to the prior in direction $\psi_i$. We define a linear projector of rank $r$, $P_r : \mathbb{R}^d \to \mathbb{R}^d$, as

$$
P_r x := \sum_{i=1}^{r} \psi_i \psi_i^T x = \Psi_r w, \quad x \in \mathbb{R}^d, \qquad (12)
$$

where $\Psi_r := (\psi_1, \ldots, \psi_r) \in \mathbb{R}^{d \times r}$ represents the projection matrix and $w := (w_1, \ldots, w_r)^T \in \mathbb{R}^r$ is the coefficient vector with elements $w_i := \psi_i^T x$ for $i = 1, \ldots, r$. For this projection, we seek a profile function $g : \mathbb{R}^d \to \mathbb{R}$ such that $g(P_r x)$ is a good approximation of the likelihood function $f(x)$. For a given profile function, we define a projected density $p_r : \mathbb{R}^d \to \mathbb{R}$ as

$$
p_r(x) := \frac{1}{Z_r}\, g(P_r x)\, p_0(x), \qquad (13)
$$

where $Z_r := \mathbb{E}_{x \sim p_0}[g(P_r x)]$. It is shown in [36] that an optimal profile function $g^*$ exists such that

$$
D_{\mathrm{KL}}(p \mid p_r^*) \leq D_{\mathrm{KL}}(p \mid p_r), \qquad (14)
$$

where $p_r^*$ is defined as in (13) with the optimal profile function $g^*$. Moreover, under certain mild assumptions on the prior (sub-Gaussian) and the likelihood function (whose gradient has a second moment w.r.t. the prior), the upper bound (sharper than that for pSVN in [14]) is shown to be

$$
D_{\mathrm{KL}}(p \mid p_r^*) \leq \gamma \sum_{i=r+1}^{d} \lambda_i, \qquad (15)
$$

for a constant $\gamma > 0$ independent of $r$, which implies a small projection error for the posterior when the $\lambda_i$ decay fast. The optimal profile function $g^*$ is nothing but the marginal likelihood given by

$$
g^*(P_r x) = \int_{\mathcal{X}^\perp} f(P_r x + \xi)\, p_0^\perp(\xi \mid P_r x)\, d\xi, \qquad (16)
$$

where $\mathcal{X}^\perp$ is the complement of the subspace $\mathcal{X}_r$ spanned by $\psi_1, \ldots, \psi_r$, and $p_0^\perp(\xi \mid P_r x) = p_0(P_r x + \xi)/p_0^r(P_r x)$ with

$$
p_0^r(P_r x) = \int_{\mathcal{X}^\perp} p_0(P_r x + \xi)\, d\xi. \qquad (17)
$$

We defer a practical computation of the optimal profile function to Section 3.3.

### 3.2 Projected Stein Variational Gradient Descent

By the projection (12), we consider a decomposition of the prior for the parameter $x = x^r + x^\perp$ as

$$
p_0(x) = p_0^r(x^r)\, p_0^\perp(x^\perp \mid x^r), \qquad (18)
$$

where the marginals $p_0^r$ and $p_0^\perp$ are defined in (17). Moreover, since $p_0^r$ only depends on $x^r = P_r x = \Psi_r w$, we define a prior density for $w$ as

$$
\pi_0(w) = p_0^r(\Psi_r w). \qquad (19)
$$

Then we define a joint (posterior) density for $w$ at the optimal profile function $g = g^*$ in (13) as

$$
\pi(w) = \frac{1}{Z_w}\, g(\Psi_r w)\, \pi_0(w), \qquad (20)
$$

where the normalization constant is $Z_w = \mathbb{E}_{w \sim \pi_0}[g(\Psi_r w)]$. It is easy to see that $Z_w = Z_r$, where $Z_r$ is defined in (13), and the projected density in (13) can be written as

$$
p_r(x) = \pi(w)\, p_0^\perp(x^\perp \mid \Psi_r w). \qquad (21)
$$

Therefore, to sample $x$ from $p_r$, we only need to sample $w$ from $\pi$ and $x^\perp$ from $p_0^\perp(x^\perp \mid \Psi_r w)$. To sample from the posterior $\pi$ in (20), we employ the SVGD method presented in Section 2 in the coefficient space $\mathbb{R}^r$, with $r < d$. Specifically, we define a projected transport map $T^r : \mathbb{R}^r \to \mathbb{R}^r$ as

$$
T^r(w) = w + \epsilon\, \varphi^r(w), \qquad (22)
$$

with a differentiable perturbation map $\varphi^r : \mathbb{R}^r \to \mathbb{R}^r$ and a small enough $\epsilon > 0$ such that $T^r$ is invertible. Following the argument in [24] for the SVGD result (5), we obtain

$$
\nabla_\epsilon D_{\mathrm{KL}}(T^r_\sharp\, \pi_0 \mid \pi)\big|_{\epsilon = 0} = -\mathbb{E}_{w \sim \pi_0}[\mathrm{trace}(\mathcal{A}_\pi \varphi^r(w))], \qquad (23)
$$

where $\mathcal{A}_\pi$ is the Stein operator for $\pi$ given by

$$
\mathcal{A}_\pi \varphi^r(w) = \nabla_w \log \pi(w)\, \varphi^r(w)^T + \nabla_w \varphi^r(w). \qquad (24)
$$

Using a tensor product $(\mathcal{H}_r)^r = \mathcal{H}_r \times \cdots \times \mathcal{H}_r$ of an RKHS $\mathcal{H}_r$ with kernel $k_r(\cdot, \cdot) : \mathbb{R}^r \times \mathbb{R}^r \to \mathbb{R}$ for the approximation of $\varphi^r$, an SVGD update of the samples $w_1^0, \ldots, w_N^0$ drawn from $\pi_0(w)$ reads

$$
w_m^{\ell+1} = w_m^\ell + \epsilon_\ell\, \hat{\varphi}^{r,*}_\ell(w_m^\ell), \quad m = 1, \ldots, N, \quad \ell = 0, 1, \ldots, \qquad (25)
$$

with a step size $\epsilon_\ell$ and an approximate steepest direction obtained by sample average approximation (SAA),

$$
\hat{\varphi}^{r,*}_\ell(w_m^\ell) = \frac{1}{N} \sum_{n=1}^{N} \left[ \nabla_{w_n^\ell} \log \pi(w_n^\ell)\, k_r(w_n^\ell, w_m^\ell) + \nabla_{w_n^\ell} k_r(w_n^\ell, w_m^\ell) \right]. \qquad (26)
$$

The kernel $k_r$ can be specified as in (9), i.e.,

$$
k_r(w, w') = \exp\left( -\frac{\|w - w'\|_2^2}{h} \right). \qquad (27)
$$

To account for the impact of the data in the different directions $\psi_1, \ldots, \psi_r$, as informed by the eigenvalues of (11), we propose to replace $\|w - w'\|_2^2$ in (27) by $(w - w')^T (\Lambda + I)(w - w')$, with $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_r)$ accounting for the likelihood and $I$ for the prior.
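The following is a minimal sketch of the coefficient-space kernel (27) with the proposed $(\Lambda + I)$-weighted metric, together with the kernel gradients needed in (26). It is illustrative rather than the paper's implementation; it assumes `lam` holds the eigenvalues $\lambda_1, \ldots, \lambda_r$ and that there are at least two distinct samples.

```python
# Hedged sketch of the kernel (27) with (w - w')^T (Lambda + I) (w - w') in place
# of the plain squared distance, plus the gradients sum_n grad_{w_n} k_r(w_n, w_m)
# used in the pSVGD direction (26). Names are illustrative.
import numpy as np

def projected_kernel(W, lam):
    """Kernel matrix and kernel gradients for coefficient samples W, shape (N, r)."""
    N, r = W.shape
    Ws = W * np.sqrt(lam + 1.0)                        # rescale so the metric becomes Euclidean
    diff = Ws[:, None, :] - Ws[None, :, :]             # pairwise differences in the scaled space
    D2 = (diff ** 2).sum(axis=-1)                      # (w_n - w_m)^T (Lambda + I) (w_n - w_m)
    h = np.median(np.sqrt(D2[D2 > 0])) ** 2 / np.log(N)  # median bandwidth heuristic
    K = np.exp(-D2 / h)                                # k_r(w_n, w_m)
    # grad_{w_n} k_r(w_n, w_m) = (2/h) k_r(w_n, w_m) (Lambda + I) (w_m - w_n)
    grad_K = (2.0 / h) * (K[:, :, None] * (W[None, :, :] - W[:, None, :]) * (lam + 1.0)).sum(axis=0)
    return K, grad_K
```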
The following theorem, proved in Appendix A, gives $\nabla_w \log \pi(w)$ and establishes the connection between pSVGD for the coefficient $w$ and SVGD for the projected parameter $P_r x$ under certain conditions.

**Theorem 1.** The gradient of the posterior $\pi$ in (20) is given by

$$
\nabla_w \log \pi(w) = \Psi_r^T \left( \frac{\nabla_x g(P_r x)}{g(P_r x)} + \frac{\nabla_x p_0^r(P_r x)}{p_0^r(P_r x)} \right). \qquad (28)
$$

Moreover, with the kernel $k_r(\cdot, \cdot) : \mathbb{R}^r \times \mathbb{R}^r \to \mathbb{R}$ defined in (27) and $k(\cdot, \cdot) : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ defined in (9), if $p_0^r(P_r x) = p_0(P_r x)$, for example when $p_0$ is Gaussian, we have the equivalence of the projected transport map $T^r$ for the coefficient $w$ and the transport map $T$ for the projected parameter $P_r x$, i.e.,

$$
T^r(w) = \Psi_r^T T(P_r x). \qquad (29)
$$

In particular, we have

$$
\nabla_w \log \pi(w) = \Psi_r^T \nabla_x \log p_r(P_r x). \qquad (30)
$$

**Remark 1.** We can compute the gradient $\nabla_w \log \pi(w)$ either by (28) or by the simplified formula (30) if $\pi_0(w) = p_0(P_r x)$, which is satisfied for $p_0$ whose marginal density for $P_r x$ is the same as the anchored density $p_0(P_r x + \xi)$ at $\xi = 0$. In particular, Gaussian priors satisfy this condition since $p_0(P_r x + \xi) = p_0(P_r x)\, p_0(\xi)$, where both $p_0(P_r x)$ and $p_0(\xi)$ are Gaussian densities.

### 3.3 Practical algorithms

Sampling from the projected density $p_r^*(x)$ defined in (13) with the optimal profile function $g^*$ in (16) involves, by the decomposition (21), sampling $w$ from the posterior $\pi$ by pSVGD and sampling $x^\perp$ from the conditional distribution with density $p_0^\perp(x^\perp \mid \Psi_r w)$. This sampling is impractical because of two challenges: (1) both $p_0^\perp(x^\perp \mid \Psi_r w)$ and $g^*$ in (16) involve high-dimensional integrals; (2) the matrix $H$ defined in (10) for the construction of the basis $\psi_1, \ldots, \psi_r$ involves integration w.r.t. the posterior distribution of the parameter $x$; however, drawing samples from the posterior to evaluate this integral is itself the central task of the Bayesian inference.

The first challenge can be practically addressed by using the property that the posterior distribution only effectively differs from the prior in the dominant subspace $\mathcal{X}_r$, or that the variation of the likelihood $f$ in the complement subspace $\mathcal{X}^\perp$ is negligible. Therefore, for any sample $x_n^0$ drawn from the prior $p_0$, $n = 1, \ldots, N$, we compute $x_n^\perp = x_n^0 - P_r x_n^0$ and freeze it, for given $P_r$, as a sample from $p_0^\perp(x^\perp \mid P_r x)$. Moreover, at sample $x_n^\ell$ we approximate the optimal profile function $g^*$ in (16) as

$$
g^*(P_r x_n^\ell) \approx f(P_r x_n^\ell + x_n^\perp), \qquad (31)
$$

which is equivalent to using the single sample $x_n^\perp$ to approximate the integral (16), justified because the variation of $f$ in the complement subspace $\mathcal{X}^\perp$ is small. This approximation is used in computing $\nabla_w \log \pi(w)$ in (28). Given the projector $P_r$ with basis $\Psi_r$, we summarize the pSVGD transport of samples in Algorithm 1. In particular, by leveraging the property that the samples can be updated in parallel, we implement a parallel version of pSVGD using MPI for communication among $K$ processor cores, each holding $N$ different samples, thus producing $M = NK$ different samples in total.

**Algorithm 1: pSVGD in parallel**

1: Input: samples $\{x_n^0\}_{n=1}^N$ in each of $K$ cores, basis $\Psi_r$, maximum iteration $L_{\max}$, tolerance $w_{\mathrm{tol}}$.
2: Output: posterior samples $\{x_n^*\}_{n=1}^N$ in each core.
3: Set $\ell = 0$, project $w_n^0 = \Psi_r^T x_n^0$, $x_n^\perp = x_n^0 - \Psi_r w_n^0$, and perform MPI_Allgather for $\{w_n^0\}_{n=1}^N$.
4: repeat
5: Compute the gradients $\nabla_{w_n^\ell} \log \pi(w_n^\ell)$ by (28) for $n = 1, \ldots, N$, and perform MPI_Allgather.
6: Compute the kernel values $k_r(w_n^\ell, w_m^\ell)$ and their gradients $\nabla_{w_n^\ell} k_r(w_n^\ell, w_m^\ell)$ for $n = 1, \ldots, NK$, $m = 1, \ldots, N$, and perform MPI_Allgather for them.
7: Update the samples $w_m^{\ell+1}$ from $w_m^\ell$ by (25) and (26) for $m = 1, \ldots, N$, with $NK$ samples used for the SAA in (26), and perform MPI_Allgather for $\{w_m^{\ell+1}\}_{m=1}^N$.
8: Set $\ell \leftarrow \ell + 1$.
9: until $\ell \geq L_{\max}$ or $\mathrm{mean}_m(\|w_m^\ell - w_m^{\ell-1}\|_2) \leq w_{\mathrm{tol}}$.
10: Reconstruct the samples $x_n^* = \Psi_r w_n^\ell + x_n^\perp$.
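Below is a serial, single-core sketch of the pSVGD iteration of Algorithm 1 under simplifying assumptions: a standard Gaussian prior $N(0, I)$ (so $\Gamma = I$, $\Psi_r$ is orthonormal, and the prior contribution to (28), via (30), reduces to $-w$), the profile function approximated by (31), and the weighted kernel from the earlier `projected_kernel` sketch. The MPI communication of the parallel algorithm is omitted; `grad_log_likelihood` is a user-supplied oracle such as the hypothetical one defined earlier.

```python
# Hedged, serial sketch of Algorithm 1 (single core, no MPI) under the stated
# assumptions; it reuses projected_kernel from the sketch above.
import numpy as np

def psvgd(X0, Psi, lam, grad_log_likelihood, eps=1e-3, max_iter=200, wtol=1e-6):
    """Transport prior samples X0, shape (N, d), toward the posterior via pSVGD."""
    W = X0 @ Psi                                        # coefficients w_n = Psi_r^T x_n (step 3)
    X_perp = X0 - W @ Psi.T                             # frozen complements x_n^perp (step 3)
    for _ in range(max_iter):
        X = W @ Psi.T + X_perp                          # P_r x_n + x_n^perp, used in (31)
        # grad_w log pi(w_n) by (28), (30), (31); the prior term is -w for N(0, I)
        G = np.stack([grad_log_likelihood(x) for x in X]) @ Psi - W
        K, grad_K = projected_kernel(W, lam)            # kernel (27) and its gradients
        phi = (K.T @ G + grad_K) / W.shape[0]           # projected steepest direction (26)
        W_new = W + eps * phi                           # coefficient update (25)
        step = np.mean(np.linalg.norm(W_new - W, axis=1))
        W = W_new
        if step < wtol:                                 # stopping test of step 9
            break
    return W @ Psi.T + X_perp                           # reconstruction (step 10)
```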
To construct the projector $P_r$ with basis $\Psi_r$, we approximate $H$ in (10) by the sample average

$$
\hat{H} = \frac{1}{M} \sum_{m=1}^{M} \nabla_x \log f(x_m)\, (\nabla_x \log f(x_m))^T, \qquad (32)
$$

where $x_1, \ldots, x_M$ are supposed to be samples from the posterior, which are, however, not available at the beginning. We therefore propose to adaptively construct the basis $\Psi_r^{\ell_x}$ with samples $x_1^{\ell_x}, \ldots, x_M^{\ell_x}$ transported from the prior samples $x_1^0, \ldots, x_M^0$ by pSVGD. This procedure is summarized in Algorithm 2. We remark that, by this adaptive construction, we push the samples to their posterior in each subspace $\mathcal{X}_r^{\ell_x}$ spanned by a (possibly) different basis $\Psi_r^{\ell_x}$, with a different $r$ for each $\ell_x$, during which the frozen samples $x_n^\perp$ in Algorithm 1 are also updated at each step $\ell_x$ of Algorithm 2.

**Algorithm 2: Adaptive pSVGD in parallel**

1: Input: samples $\{x_n^0\}_{n=1}^N$ in each of $K$ cores, $L_{\max}^x$, $L_{\max}^w$, $x_{\mathrm{tol}}$, $w_{\mathrm{tol}}$.
2: Output: posterior samples $\{x_n^*\}_{n=1}^N$ in each core.
3: Set $\ell_x = 0$.
4: repeat
5: Compute $\nabla_x \log f(x_n^{\ell_x})$ in (32) for $n = 1, \ldots, N$ in each core, and perform MPI_Allgather.
6: Solve (11) with $H$ approximated as in (32), using all $M = NK$ samples, to get the basis $\Psi_r^{\ell_x}$.
7: Apply the pSVGD Algorithm 1, i.e., $\{x_n^*\}_{n=1}^N = \mathrm{pSVGD}(\{x_n^{\ell_x}\}_{n=1}^N, \Psi_r^{\ell_x}, L_{\max}^w, w_{\mathrm{tol}})$.
8: Set $\ell_x \leftarrow \ell_x + 1$ and $x_n^{\ell_x} = x_n^*$, $n = 1, \ldots, N$.
9: until $\ell_x \geq L_{\max}^x$ or $\mathrm{mean}_m(\|x_m^{\ell_x} - x_m^{\ell_x - 1}\|_X) \leq x_{\mathrm{tol}}$.

We remark that different choices can be used for the step size $\epsilon_\ell$. In the numerical experiments for the structured models, we use a backtracking line search with the Armijo-Goldstein condition to find the step size $\epsilon_\ell$, where the line search objective function is taken as the negative log-posterior; see more details in the accompanying code at https://github.com/cpempire/pSVGD.
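Steps 5-6 of Algorithm 2, forming $\hat{H}$ by (32) and solving (11), can be sketched as follows, assuming access to all $M$ gradients on one core and substituting a dense `scipy.linalg.eigh` call for the randomized solver of [29]; `Gamma` is the prior covariance (the identity for a whitened or standard Gaussian prior), and the truncation rule is illustrative.

```python
# Hedged sketch of the basis construction in Algorithm 2: sample-average H by (32)
# and generalized eigenproblem (11). A dense solver stands in for the randomized
# algorithm of [29]; names and the truncation rule lambda_{r+1} <= tol are illustrative.
import numpy as np
from scipy.linalg import eigh

def build_basis(X, grad_log_likelihood, Gamma, tol=1e-2, r_max=50):
    """Return (lam, Psi_r) with lam_1 >= ... >= lam_r, Gamma-orthonormal columns Psi_r."""
    G = np.stack([grad_log_likelihood(x) for x in X])   # (M, d) gradients of log f
    H = G.T @ G / X.shape[0]                            # sample average (32)
    lam, Psi = eigh(H, Gamma)                           # solves H psi = lambda Gamma psi
    lam, Psi = lam[::-1], Psi[:, ::-1]                  # reorder to descending eigenvalues
    r = max(1, min(r_max, int(np.sum(lam > tol))))      # truncate where lambda_{r+1} <= tol
    return lam[:r], Psi[:, :r]
```

An adaptive loop in the spirit of Algorithm 2 would then alternate `lam, Psi = build_basis(X, grad_log_likelihood, np.eye(d))` with `X = psvgd(X, Psi, lam, grad_log_likelihood)` from the previous sketch.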
## 4 Numerical Experiments

In this section, we present three Bayesian inference problems with high-dimensional parameters to demonstrate the accuracy of pSVGD compared to SVGD, and the convergence and scalability of pSVGD w.r.t. the number of parameters, samples, data points, and processor cores. A linear inference example, whose posterior is analytically given, is presented in Appendix B to demonstrate the accuracy of pSVGD. An application to COVID-19, inferring the time-dependent social distancing effect (which is high-dimensional after discretization in time) given hospitalization data, is presented in Appendix C. A more comprehensive COVID-19 application, inferring high-dimensional parameters and optimizing mitigation strategies under uncertainty, is given in [15, 9].

### 4.1 Conditional diffusion process

We consider a high-dimensional model that is often used to test inference algorithms in high dimensions, e.g., Stein variational Newton in [19], which is discretized from the conditional diffusion process

$$
du_t = \frac{10\, u_t (1 - u_t^2)}{1 + u_t^2}\, dt + dx_t, \quad t \in (0, 1], \qquad (33)
$$

with zero initial condition $u_0 = 0$. The forcing term $(x_t)_{t \geq 0}$ is a Brownian motion, whose prior is Gaussian with zero mean and covariance $C(t, t') = \min(t, t')$. We use the Euler-Maruyama scheme with step size $\Delta t = 0.01$ for the discretization, which leads to dimension $d = 100$ for the discrete Brownian path $x$. We generate the data by first solving (33) at a true Brownian path $x_{\mathrm{true}}$ and taking pointwise observations of the solution as $y = (y_1, \ldots, y_{20})$, with $y_i = u_{t_i} + \xi_i$ for equispaced $t_1, \ldots, t_{20}$ in $(0, 1]$ and additive noise $\xi_i \sim N(0, \sigma^2)$ with $\sigma = 0.1$. We run SVGD and the adaptive pSVGD, with line search for the learning rate, using $N = 128$ samples to infer the true parameter $x_{\mathrm{true}}$, where the subspace for pSVGD is updated every 10 iterations. The results are displayed in Figure 1.

From the left of Figure 1 we can see a fast decay of the eigenvalues of (11), especially once the iteration number $\ell$ becomes large and the samples converge to the posterior, which indicates the existence of an intrinsic low-dimensional subspace. From the right we can observe that pSVGD leads to samples at which the solutions are much closer to the noisy data, as well as to the true solution, than those of SVGD. Moreover, the posterior mean of pSVGD is much closer to the true parameter, with a much tighter 90% confidence interval covering $x_{\mathrm{true}}$, than that of SVGD.

Figure 1: Left: Decay of the eigenvalues of (11) at different iteration numbers $\ell$. Right: SVGD (top) and pSVGD (bottom) samples $x$ and solutions $u$ at iteration $\ell = 100$, including the synthetic truth, posterior mean, 90% confidence interval (shaded), and noisy data points.

### 4.2 Bayesian logistic regression

We consider Bayesian logistic regression for binary classification of cancer versus normal patterns in mass-spectrometric data with 10,000 attributes from https://archive.ics.uci.edu/ml/datasets/Arcene, which leads to $d = 10{,}000$ parameters (with an i.i.d. uniform prior for Figure 2; a Gaussian prior was also tested with similar results). We use 100 data samples for training and 100 for testing. We run pSVGD and SVGD with line search and 32 samples, with the projection basis updated every 100 iterations. The results are shown in Figure 2. We can see a dramatic decay of the eigenvalues, which indicates an intrinsic dominant low-dimensional subspace in which pSVGD effectively drives the samples to the posterior and leads to more accurate predictions than SVGD. We remark that 32 samples in the estimate (32) of the gradient information matrix are sufficient to capture the subspace, since using more samples leads to a similar decay of eigenvalues as in Figure 2.

Figure 2: Left: Decay of the eigenvalues of (11) at different iteration numbers $\ell$. Right: SVGD (top, test accuracy 0.75) and pSVGD (bottom, test accuracy 0.85) test data and predictions at iteration $\ell = 1000$, including the posterior mean and 90% confidence interval (shaded). The 85% test accuracy for pSVGD is the same as that reported for SVM in the data source file. The training time for pSVGD is 201 seconds, compared to 477 seconds for SVGD, on a MacBook Pro laptop (2019) with a 2.4 GHz 8-core Intel Core i9 processor.
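For this example, pSVGD only needs the gradient of the log-likelihood of the logistic model to form (32) and to drive the updates (25)-(26). The following is a generic sketch of those oracles based on standard logistic regression identities, not the paper's code; for simplicity it pairs with a Gaussian prior, which the paper reports testing with similar results.

```python
# Hedged sketch of Bayesian logistic regression oracles: for features A in
# R^{s x d} and labels y in {0,1}^s, log f(x) = sum_i [y_i z_i - log(1 + e^{z_i})]
# with z = A x, and grad_x log f(x) = A^T (y - sigmoid(A x)). Illustrative only.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def make_logistic_oracles(A, y):
    """Return log-likelihood and gradient oracles for the data set (A, y)."""
    def log_likelihood(x):
        z = A @ x
        return np.sum(y * z - np.logaddexp(0.0, z))    # numerically stable log(1 + e^z)
    def grad_log_likelihood(x):
        return A.T @ (y - sigmoid(A @ x))
    return log_likelihood, grad_log_likelihood
```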
### 4.3 Partial differential equation

In this example we consider an elliptic partial differential equation model (widely used in various fields, e.g., inference of permeability in groundwater flow, thermal conductivity in materials science, electrical impedance in medical imaging, etc.), in the simplified form

$$
-\nabla \cdot (e^{x} \nabla u) = 0 \quad \text{in } (0, 1)^2, \qquad (34)
$$

with Dirichlet boundary conditions $u = 1$ on the top boundary and $u = 0$ on the bottom boundary, and homogeneous Neumann boundary conditions on the left and right boundaries. Here $\nabla \cdot$ is the divergence operator and $\nabla$ is the gradient operator. The parameter $x$ and the state $u$ are discretized by piecewise linear finite elements on a uniform triangular mesh, leading to dimension $d$; $x \in \mathbb{R}^d$ and $u \in \mathbb{R}^d$ denote the nodal values of $x$ and $u$.

We consider a Gaussian prior for the parameter field, $x \sim N(0, \mathcal{C})$, with covariance $\mathcal{C} = (-0.1\Delta + I)^{-2}$ (here $\Delta$ is the Laplacian), which leads to a Gaussian distribution $x \sim N(0, \Sigma_x)$ for the nodal values, where $\Sigma_x \in \mathbb{R}^{d \times d}$ is discretized from $\mathcal{C}$. We consider a parameter-to-observable map $h(x) = O \circ S(x)$, where $S : x \mapsto u$ is the nonlinear discrete solution map of equation (34) and $O : \mathbb{R}^d \to \mathbb{R}^s$ is a pointwise observation map at $s = 7 \times 7 = 49$ points equally distributed in $(0, 1)^2$. We consider additive 5% noise $\xi \sim N(0, \Sigma_\xi)$, with $\Sigma_\xi = \sigma^2 I$ and $\sigma = \max(|Ou|)/20$, for the data $y = h(x) + \xi$.

We use a DILI-MCMC algorithm [18] to generate 10,000 effective posterior samples and use them to compute a reference sample variance. We run SVGD and the adaptive pSVGD (truncated where $\lambda_{r+1} < 10^{-2}$) using 256 samples and 200 iterations for different dimensions, both using line search to find the step size $\epsilon_\ell$. The comparison of accuracy can be observed in the left of Figure 3 through the root-mean-square error (RMSE) of the pointwise sample variance in the $L^2$-norm. We can see that SVGD samples fail to capture the posterior distribution in high dimensions and become worse with increasing dimension, while pSVGD samples represent the posterior distribution well, as measured by the sample variance, and the approximation remains accurate with increasing dimension.

Figure 3: Left: RMSE of the pointwise sample variance in $L^2$-norm, for dimensions $d = 289$, $1{,}089$, $4{,}225$, $16{,}641$. Middle: Scalability w.r.t. $d$ shown by the decay of the eigenvalues $\lambda_r$ w.r.t. $r$. Right: Decay of the averaged step norm $\mathrm{mean}_m \|w_m^{\ell+1} - w_m^\ell\|_2$ w.r.t. the number of iterations for different dimensions $d$.

The accuracy of pSVGD is further demonstrated by the significant decay (about 7 orders of magnitude) of the eigenvalues for the different dimensions in the middle of Figure 3. Only about 50 dimensions (with a small relative projection error, about $E_r < 10^{-6}$, committed in the posterior by (15)) are preserved out of 289 to 16,641 dimensions, representing a more than 300-fold dimension reduction for the last case. The similar decays of the eigenvalues $\lambda_r$ in the projection rank $r$ and of the averaged step norm $\mathrm{mean}_m \|w_m^{\ell+1} - w_m^\ell\|_2$ in the number of iterations, shown in Figure 3, imply that pSVGD is scalable w.r.t. the parameter dimension. Moreover, the similar decays for the different sample sizes $N = 64, 128, 256, 512$ in the left of Figure 4 demonstrate that pSVGD is scalable w.r.t. the number of samples $N$. Furthermore, as displayed in the middle of Figure 4, with an increasing number of i.i.d. observation data points $s = 7^2, 15^2, 31^2, 63^2$ on a refined mesh of size $d = 17^2, 33^2, 65^2, 129^2$, the eigenvalues decay at almost the same rate with similar relative projection error $E_r$ and lead to a similar reduction $d/r$ for $r$ such that $\lambda_{r+1} < 10^{-2}$, which implies weak scalability of pSVGD w.r.t. the number of data points.

Figure 4: Scalability w.r.t. the number of (1) samples $N$, shown by the decay of the averaged step norm (left); (2) data points $s$, shown by the decay of the eigenvalues (middle); and (3) processor cores $K$, shown by the decay of CPU time (for the gradient including the eigendecomposition (11), the kernel, the sample update, and the total) of pSVGD (right).
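The truncation diagnostics quoted above, the relative projection error $E_r$ and the reduction factor $d/r$, can be computed directly from the eigenvalue spectrum. A hedged sketch follows; the definition of $E_r$ as the relative tail mass of the spectrum (which controls the bound (15) up to the constant $\gamma$) is our assumption, and the names are illustrative.

```python
# Hedged sketch of truncation diagnostics from the eigenvalues of (11): choose r
# by lambda_{r+1} <= tol, report the relative projection error E_r (assumed here
# to be the relative tail mass of the spectrum) and the reduction factor d/r.
import numpy as np

def truncation_report(lam, d, tol=1e-2):
    """lam: eigenvalues of (11) in descending order; d: parameter dimension."""
    r = int(np.sum(lam > tol))                       # smallest r with lambda_{r+1} <= tol
    rel_err = lam[r:].sum() / lam.sum()              # relative projection error E_r (assumed)
    return r, rel_err, d / max(r, 1)                 # rank, E_r, dimension reduction d/r
```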
Lastly, from the right of Figure 4, by the nearly $O(K^{-1})$ decay of CPU time, we can see that pSVGD achieves strong parallel scalability (in computing the gradient, the kernel, and the sample update) w.r.t. the number of processor cores $K$ for the same total work with $KN = 1024$ samples.

## 5 Conclusions

We proposed a new algorithm, pSVGD, for Bayesian inference in high dimensions to tackle the critical challenge of the curse of dimensionality. The projection error committed in the posterior can be bounded by the truncated (fast-decaying) eigenvalues. We proved that pSVGD for the coefficients is equivalent to SVGD for the projected parameters under suitable assumptions. We demonstrated that pSVGD overcomes the curse of dimensionality for several high-dimensional Bayesian inference problems. In particular, we showed that pSVGD is scalable w.r.t. the number of parameters, samples, data points, and processor cores for a benchmark problem widely used in various scientific and engineering fields, which is crucial for solving high-dimensional and large-scale inference problems.

## Broader Impact

The proposed algorithm applies to high-dimensional Bayesian inference problems whose posterior effectively differs from the prior in a low-dimensional subspace discovered by the gradient information matrix, which is generally the case due to the fundamental ill-posedness or over-parametrization of such inference problems. As one example, we applied the algorithm to Bayesian inference of the COVID-19 pandemic, which is expected to have an impact on learning the spread of the virus. However, for intrinsically high-dimensional problems, e.g., wave propagation at high frequency, a direct application of the proposed algorithm may not effectively alleviate the curse of dimensionality. It needs to be further extended by exploiting other properties such as sparsity, conditional independence, low-dimensionality under a map transformation, etc. Impact on ethical aspects and future societal consequences may not be applicable here.

## Funding Disclosure

This work was partially funded by the Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Mathematical Multifaceted Integrated Capability Centers (MMICCs) program under award DE-SC0019303; the Simons Foundation under award 560651; and the National Science Foundation, Division of Mathematical Sciences under award DMS-2012453. The authors have no competing interests with other entities.

## References

[1] O. Bashir, K. Willcox, O. Ghattas, B. van Bloemen Waanders, and J. Hill. Hessian-based model reduction for large-scale systems with initial condition inputs. International Journal for Numerical Methods in Engineering, 73:844-868, 2008.

[2] Alexandros Beskos, Mark Girolami, Shiwei Lan, Patrick E. Farrell, and Andrew M. Stuart. Geometric MCMC for infinite-dimensional inverse problems. Journal of Computational Physics, 335:327-351, 2017.

[3] Daniele Bigoni, Olivier Zahm, Alessio Spantini, and Youssef Marzouk. Greedy inference with layers of lazy maps. arXiv preprint arXiv:1906.00031, 2019.

[4] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

[5] T. Bui-Thanh and O. Ghattas. Analysis of the Hessian for inverse scattering problems: I. Inverse shape scattering of acoustic waves. Inverse Problems, 28(5):055001, 2012.

[6] T. Bui-Thanh, O. Ghattas, J. Martin, and G. Stadler.
A computational framework for infinite-dimensional Bayesian inverse problems, Part I: The linearized case, with application to global seismic inversion. SIAM Journal on Scientific Computing, 35(6):A2494-A2523, 2013.

[7] Peng Chen and Omar Ghattas. Hessian-based sampling for high-dimensional model reduction. International Journal for Uncertainty Quantification, 9(2), 2019.

[8] Peng Chen and Omar Ghattas. Stein variational reduced basis Bayesian inversion. arXiv preprint arXiv:2002.10924, 2020.

[9] Peng Chen and Omar Ghattas. Stochastic optimization of heterogeneous epidemic models: Application to COVID-19 spread accounting for LTC facilities. Preprint, 2020.

[10] Peng Chen and Christoph Schwab. Sparse-grid, reduced-basis Bayesian inversion. Computer Methods in Applied Mechanics and Engineering, 297:84-115, 2015.

[11] Peng Chen and Christoph Schwab. Sparse-grid, reduced-basis Bayesian inversion: Nonaffine-parametric nonlinear equations. Journal of Computational Physics, 316:470-503, 2016.

[12] Peng Chen, Umberto Villa, and Omar Ghattas. Hessian-based adaptive sparse quadrature for infinite-dimensional Bayesian inverse problems. Computer Methods in Applied Mechanics and Engineering, 327:147-172, 2017.

[13] Peng Chen, Umberto Villa, and Omar Ghattas. Taylor approximation and variance reduction for PDE-constrained optimal control under uncertainty. Journal of Computational Physics, 385:163-186, 2019.

[14] Peng Chen, Keyi Wu, Joshua Chen, Tom O'Leary-Roseberry, and Omar Ghattas. Projected Stein variational Newton: A fast and scalable Bayesian inference method in high dimensions. In Advances in Neural Information Processing Systems, pages 15104-15113, 2019.

[15] Peng Chen, Keyi Wu, and Omar Ghattas. Bayesian inference of heterogeneous epidemic models: Application to COVID-19 spread accounting for LTC facilities. Preprint, 2020.

[16] Wilson Ye Chen, Lester Mackey, Jackson Gorham, François-Xavier Briol, and Chris J. Oates. Stein points. arXiv preprint arXiv:1803.10161, 2018.

[17] Paul G. Constantine, Carson Kent, and Tan Bui-Thanh. Accelerating Markov chain Monte Carlo with active subspaces. SIAM Journal on Scientific Computing, 38(5):A2779-A2805, 2016.

[18] Tiangang Cui, Kody J.H. Law, and Youssef M. Marzouk. Dimension-independent likelihood-informed MCMC. Journal of Computational Physics, 304:109-137, 2016.

[19] Gianluca Detommaso, Tiangang Cui, Youssef Marzouk, Alessio Spantini, and Robert Scheichl. A Stein variational Newton method. In Advances in Neural Information Processing Systems, pages 9187-9197, 2018.

[20] Gianluca Detommaso, Jakob Kruse, Lynton Ardizzone, Carsten Rother, Ullrich Köthe, and Robert Scheichl. HINT: Hierarchical invertible neural transport for general and sequential Bayesian inference. arXiv preprint arXiv:1905.10687, 2019.

[21] Mark Girolami and Ben Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

[22] Tobin Isaac, Noemi Petra, Georg Stadler, and Omar Ghattas. Scalable and efficient algorithms for the propagation of uncertainty from data through inference to prediction for large-scale problems, with application to flow of the Antarctic ice sheet. Journal of Computational Physics, 296:348-368, September 2015.

[23] Chang Liu and Jun Zhu. Riemannian Stein variational gradient descent for Bayesian inference. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[24] Qiang Liu and Dilin Wang.
Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378-2386, 2016.

[25] J. Martin, L.C. Wilcox, C. Burstedde, and O. Ghattas. A stochastic Newton MCMC method for large-scale statistical inverse problems with application to seismic inversion. SIAM Journal on Scientific Computing, 34(3):A1460-A1487, 2012.

[26] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via measure transport: An introduction. In Handbook of Uncertainty Quantification, pages 1-41. Springer, 2016.

[27] N. Petra, J. Martin, G. Stadler, and O. Ghattas. A computational framework for infinite-dimensional Bayesian inverse problems, Part II: Stochastic Newton MCMC with application to ice sheet flow inverse problems. SIAM Journal on Scientific Computing, 36(4):A1525-A1555, 2014.

[28] Aaditya Ramdas, Sashank Jakkam Reddi, Barnabás Póczos, Aarti Singh, and Larry Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.

[29] A.K. Saibaba, J. Lee, and P.K. Kitanidis. Randomized algorithms for generalized Hermitian eigenvalue problems with application to computing Karhunen-Loève expansion. Numerical Linear Algebra with Applications, 23(2):314-339, 2016.

[30] Claudia Schillings and Christoph Schwab. Sparse, adaptive Smolyak quadratures for Bayesian inverse problems. Inverse Problems, 29(6):065011, 2013.

[31] Claudia Schillings and Christoph Schwab. Scaling limits in computational Bayesian inversion. ESAIM: Mathematical Modelling and Numerical Analysis, 50(6):1825-1856, 2016.

[32] Ch. Schwab and A.M. Stuart. Sparse deterministic approximation of Bayesian inverse problems. Inverse Problems, 28(4):045003, 2012.

[33] A. Spantini, A. Solonen, T. Cui, J. Martin, L. Tenorio, and Y. Marzouk. Optimal low-rank approximations of Bayesian linear inverse problems. SIAM Journal on Scientific Computing, 37(6):A2451-A2487, 2015.

[34] Dilin Wang, Zhe Zeng, and Qiang Liu. Stein variational message passing for continuous graphical models. In International Conference on Machine Learning, pages 5206-5214, 2018.

[35] Yifei Wang and Wuchen Li. Information Newton's flow: Second-order optimization method in probability space. arXiv preprint arXiv:2001.04341, 2020.

[36] Olivier Zahm, Tiangang Cui, Kody Law, Alessio Spantini, and Youssef Marzouk. Certified dimension reduction in nonlinear Bayesian inverse problems. arXiv preprint arXiv:1807.03712, 2018.

[37] Jingwei Zhuo, Chang Liu, Jiaxin Shi, Jun Zhu, Ning Chen, and Bo Zhang. Message passing Stein variational gradient descent. In International Conference on Machine Learning, pages 6018-6027, 2018.