# pathfinder_parallel_quasinewton_variational_inference__455d8a49.pdf

Journal of Machine Learning Research 23 (2022) 1-49 Submitted 8/21; Revised 12/21; Published 10/22

Pathﬁnder: Parallel quasi-Newton variational inference

Lu Zhang LZHANG63@USC.EDU Division of Biostatistics Department of Population and Public Health Sciences University of Southern California Los Angeles, CA 90032, USA

Bob Carpenter BCARPENTER@FLATIRONINSTITUTE.ORG Center for Computational Mathematics Flatiron Institute 162 5th Ave, New York, NY 10010, USA

Andrew Gelman GELMAN@STAT.COLUMBIA.EDU Departments of Statistics and Political Science Columbia University New York, NY 10027, USA

Aki Vehtari AKI.VEHTARI@AALTO.FI Department of Computer Science Aalto University 00076 Aalto, Finland

Editor: Philipp Hennig

We propose Pathﬁnder, a variational method for approximately sampling from differentiable probability densities. Starting from a random initialization, Pathﬁnder locates normal approximations to the target density along a quasi-Newton optimization path, with local covariance estimated using the inverse Hessian estimates produced by the optimizer. Pathﬁnder returns draws from the approximation with the lowest estimated Kullback-Leibler (KL) divergence to the target distribution. We evaluate Pathﬁnder on a wide range of posterior distributions, demonstrating that its approximate draws are better than those from automatic differentiation variational inference (ADVI) and comparable to those produced by short chains of dynamic Hamiltonian Monte Carlo (HMC), as measured by 1-Wasserstein distance. Compared to ADVI and short dynamic HMC runs, Pathﬁnder requires one to two orders of magnitude fewer log density and gradient evaluations, with greater reductions for more challenging posteriors. Importance resampling over multiple runs of Pathﬁnder improves the diversity of approximate draws, reducing 1-Wasserstein distance further and providing a measure of robustness to optimization failures on plateaus, saddle points, or in minor modes. The Monte Carlo KL divergence estimates are embarrassingly parallelizable in the core Pathﬁnder algorithm, as are multiple runs in the resampling version, further increasing Pathﬁnder s speed advantage with multiple cores.

Keywords: Variational inference, Hamiltonian Monte Carlo, quasi-Newton optimization, Laplace approximation, importance resampling

c 2022 Lu Zhang, Bob Carpenter, Andrew Gelman, and Aki Vehtari.

License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v23/21-0889.html.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

1. Introduction

Obtaining efﬁcient, scalable, and robust posterior inference remains the primary challenge in advanced Bayesian computation. The difﬁculty of this challenge has led researchers to develop a wide range of posterior approximation algorithms. One of the most popular classes of approximate methods is variational inference (VI), which searches for a tractable approximate distribution that minimizes Kullback-Leibler (KL) divergence to the posterior. Although VI is typically faster than Monte Carlo sampling (Blei et al., 2017), popular approaches such as black-box variational inference (Ranganath et al., 2014) and automatic differentiation variational inference (Kucukelbir et al., 2017) can fail to converge due to the high variance of nested gradient estimates (Dhaka et al., 2021). In this paper, we develop Pathﬁnder, an algorithm that locates approximations to the target density along a quasi-Newton optimization path. Starting from a random initialization in the tail of the posterior distribution, the quasi-Newton optimization trajectory can quickly move from the tail, through the body of the distribution, to a mode or pole. By evaluating the ELBO in parallel for the normal approximations along the optimization path generated by L-BFGS, a popular quasi-Newton method, Pathﬁnder can quickly ﬁnd a region of high probability mass from which to draw approximate samples. Novel contributions of this paper include (1) new VI algorithms that use curvature information of the target distribution collected by optimization trajectories to propose approximate distributions; (2) an efﬁcient sampling algorithm for the normal approximations estimated from quasi-Newton inverse Hessian approximations; (3) the design of Pathﬁnder, which allows evaluating the evidence lower bound (ELBO) in parallel for each normal approximation. Hence, Pathﬁnder can be greatly accelerated by parallel computing, which is not possible with existing VI algorithms that directly minimize KL divergence, all of which are sequential. Figure 1 illustrates the evolution of approximate posteriors along the optimization path for a density with a single mode. In cases with posterior modes, Pathﬁnder provides a Laplace-like approximation of the posterior density using the quasi-Newton optimizer s efﬁcient inverse Hessian estimate for covariance. Figure 2 shows how Pathﬁnder behaves for unbounded target densities like the funnel, where it balances the competing goals of high entropy and containment within the target density to stop before heading off to a pole. In both cases, the use of approximate inverse Hessian information allows the quasi-Newton optimizer to move quickly and stably into and through the high probability region of the posterior. Multimodal distributions can be approximated by running several instances of Pathﬁnder in parallel from different initialization points, followed by ﬁltering with importance resampling. In Section 2.2, we describe this multi-path version of Pathﬁnder. We adapt importance sampling using Pareto smoothed importance weights (Yao et al., 2018; Vehtari et al., 2019) to importance resampling to obtain more stable approximate draws. We evaluate the performance of Pathﬁnder experimentally in Section 3. We compare Pathﬁnder to automatic differentiation variational inference (ADVI), a state-of-the-art variational inference algorithm (Kucukelbir et al., 2017).1 We also compare to the approximate draws generated by running many short MCMC chains with dynamic Hamiltonian Monte Carlo in the form of the no-U-turn sampler (Hoffman and Gelman, 2014) as reﬁned by Betancourt (2017).2 The short MCMC

1. We also evaluated a robust version of ADVI from Dhaka et al. (2020), which provided similar results in its mean-ﬁeld form, but the implementation failed to converge in its dense form. 2. We use Stan s implementation of ADVI and dynamic HMC (Stan Development Team, 2021a).

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

0.6 0.5 0.4 0.3 0.2 iteration 3 estimated ELBO: 340.5

0.6 0.5 0.4 0.3 0.2 iteration 4 estimated ELBO: 332.2

0.6 0.5 0.4 0.3 0.2 iteration 5 estimated ELBO: 329.7

0.6 0.5 0.4 0.3 0.2 iteration 6 estimated ELBO: 329.6

0.6 0.5 0.4 0.3 0.2 iteration 7 estimated ELBO: 329.6

0.6 0.5 0.4 0.3 0.2 iteration 8 estimated ELBO: 329.7

Figure 1: A series of normal approximations (black ellipse for 95% central region) along a quasi Newton optimization path (black polyline) for the posterior of a logistic regression model (green to blue contours), whose high probability mass region is indicated by the yellow to orange contours. As the optimization path reaches the mode, the central 95% region of the normal approximation closely matches the high probability mass region of the target density.

chains can be viewed as the ﬁrst stage of warmup for MCMC sampling or as a variational inference algorithm in its own right, following Hoffman and Ma (2020). We evaluate approximations to the target density based on the discrete form of 1-Wasserstein distance, with the target density deﬁned by long runs of dynamic HMC thinned to roughly independent draws. Wasserstein distance measures how much one distribution would need to be distorted to match the other. Unlike the asymmetric KL divergence measure, Wasserstein distance is a proper distance metric obeying symmetry and the triangle inequality. Over a diverse set of 20 models from the posteriordb evaluation set (Magnusson et al., 2021), we found Pathﬁnder s approximations ranged from slightly worse to much better than those of ADVI using diagonal covariance (mean ﬁeld), ADVI with dense covariance (full rank), and dynamic HMC using short chains (75 iterations). Pathﬁnder required one to two orders of magnitude fewer log density and gradient evaluations than these systems without parallelization. We further explore Pathﬁnder s features and limitations based on case studies of high-dimensional models in Section 3. Although we frame Pathﬁnder as a form of variational inference, its development was motivated by the question of how to efﬁciently generate a handful of approximate draws with which to initialize asymptotically exact Markov chain Monte Carlo (MCMC) methods such as dynamic HMC. The best initialization for MCMC is a draw from the posterior, as that leads to a stationary Markov chain.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

15 10 5 0 5 10 15 iteration 3 estimated ELBO: 4.3

15 10 5 0 5 10 15 iteration 4 estimated ELBO: 0.4

15 10 5 0 5 10 15 iteration 5 estimated ELBO: 132.1

15 10 5 0 5 10 15 iteration 6 estimated ELBO: 1.4

15 10 5 0 5 10 15 iteration 9 estimated ELBO: 579.9

15 10 5 0 5 10 15 iteration 13 estimated ELBO: 5.7

Figure 2: A series of normal approximations (black ellipse for 95% central region) along a quasi Newton optimization path (black polyline) for a funnel-like posterior density with no mode (green to blue contours), whose high probability mass region is indicated by the yellow to orange contours. The normal approximations with the highest ELBO value (lower left) occurs at a point on the optimization trajectory before it heads off toward the pole at negative inﬁnity on the vertical axis. The ELBO values rise and fall along the optimization path.

Initializing MCMC with an approximate draw from the posterior allows us to skip this ﬁrst stage of MCMC adaptation (sometimes called burn-in ). In Section 4, we demonstrate the beneﬁts of using Pathﬁnder to initialize MCMC through an analysis of a Gaussian process model.

2. Pathﬁnder

This section describes the Pathﬁnder algorithm for generating approximate draws from a differentiable target density known only up to a normalizing constant. We follow the presentation of the basic Pathﬁnder algorithm in Section 2.1 with a multi-path version in Section 2.2, which runs multiple optimization paths and uses importance resampling to select draws. Resampling from multiple normal approximations better matches non-normal target densities and also mitigates the problem of L-BFGS getting stuck at local optima or in saddle points on plateaus. To discriminate the Pathﬁnder algorithms in Section 2.1 and Section 2.2, we refer to the former one as single-path Pathﬁnder and the latter one as multi-path Pathﬁnder. The remaining sections provide details of the algorithms used in the inner loop of Pathﬁnder. Section 2.3 provides relevant details of L-BFGS optimization, Sections 2.4 and 2.5 present the algorithm for evaluating and sampling from the normal approximations in Pathﬁnder, Section 2.6 explains the Monte Carlo evaluations of the evidence lower bound, and Section 2.7

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

describes the importance resampling algorithm implemented in the multi-path Pathﬁnder algorithm. We review connections to related methods in Section 2.8.

2.1 Pathﬁnder algorithm

The Pathﬁnder algorithm begins by drawing from an initialization distribution π0, from which it follows a quasi-Newton optimization trajectory. We use L-BFGS to generate an optimization trajectory θ(0:L) = (θ(0),. . .,θ(L)) towards a local maximum (or pole) of the log density log p(θ), where θ(0) denotes the initial point and the superscript indicates the iteration. In applications to Bayesian posterior sampling, the target density p(θ) is the posterior p(θ | y), where θ RN

represents the N-dimensional parameter vector and y denotes the observations. The exact posterior is often intractable, and practitioners have to resort to iterative algorithms like MCMC and VI to obtain posterior samples or inferences based on a log probability function, which is the log density of the posterior up to an additive constant. For the ease of explanation, we use log p(θ) to refer to the tractable log probability function. Based on the exploration of the optimization trajectory, Pathﬁnder generates local normal approximations of the target density using the gradient and curvature information collected along the optimization trajectory. To obtain the approximations, we develop a function α-RECOVER that uses the optimization trajectory and the gradient along the optimization path log p(θ(0:L)) to compute a diagonal estimate of the covariance matrix of the approximation for each iteration. The α-RECOVER routine returns the diagonal elements of the covariance estimation for all iterations α(1:L), and provides indicators ξ(1:L) of whether a pair of the updates of the position and gradient along the optimization path should be used in further covariance estimation or not. It also returns the updates of locations s(1:L) and gradients of the negative log joint density z(1:L) along the optimization path, where s(l) = θ(l) θ(l 1) and z(l) = log p(θ(l 1)) log p(θ(l)). We describe the algorithm α-RECOVER and provide pseudocode in Section 2.4. Then for each iteration, we generate a local approximation based on the second-order Taylor series expansion and the quasi-Newton inverse Hessian estimate of covariance. We further develop an evaluation algorithm that generates samples and estimates the evidence lower bound (ELBO) in parallel for all local approximations. In particular, we propose an efﬁcient sampling algorithm BFGS-SAMPLE in Section 2.5, which returns K samples φ(l,1:K) and their log-densities under the approximation log q(φ(l,1:K)) for the local approximation at iteration l. Here we use double superscripts to distinguish samples from different approximations, the ﬁrst superscript indexes the approximation and the second superscript indexes the draws. We calculate a Monte Carlo estimate of the ELBO λ(l) for approximation l with Algorithm 6 in Appendix F.The last step in Pathﬁnder selects the approximation that maximizes the evidence lower bound (equivalently, minimizes Kullback-Leibler divergence to the target density), and then generates M draws from the best normal approximation. Algorithm 1 presents pseudocode for Pathﬁnder. We later show in Section 3 that Pathﬁnder is not particularly sensitive to its tuning parameters, which include an initial distribution, maximum number of L-BFGS iterations (Lmax), L-BFGS convergence tolerance (τrel), size of the history used to approximate the inverse Hessian (J), and the number of Monte Carlo draws used to evaluate the ELBO (K).

The computational cost of single-path Pathﬁnder is dominated by the (1) L-BFGS optimization, (2) sampling from the approximate normal distributions, and (3) evaluating the evidence lower bounds. The cost of L-BFGS optimization is dominated by log density and gradient evaluations, the number of which will be determined by the tuning parameters of L-BFGS, including τrel, Lmax, and

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Algorithm 1 Single-path Pathﬁnder

Input: log p: differentiable log density function of dimension N π0: initial distribution Lmax: maximum number of L-BFGS iterations τrel: relative tolerance for convergence of L-BFGS J: size of the history used to approximate the inverse Hessian K: number of Monte Carlo draws to evaluate ELBO M: number of approximate posterior draws to return Output: ψ(1),. . .,ψ(M): draws from ELBO-maximizing normal approximation log q(ψ(1)),. . .,log q(ψ(M)): log density of draws in ELBO-maximizing normal approximation

1: procedure PATHFINDER(log p, π0, L,τrel, J,K, M)

2: sample θ(0) π0 3: let θ(0:L), log p(θ(0:L)) = L-BFGS log p,θ(0), J,τrel, Lmax

4: let (α(1:L),ξ(1:L), s(1:L), z(1:L)) = α-RECOVER θ(0:L), log p(θ(0:L)), J

5: for l 1 : L in parallel do

6: let φ(l,1:K),log q(φ(l,1:K)) = BFGS-SAMPLE(s(1:l), z(1:l),θ(l), log p(θ(l)),α(l),ξ(1:l),K)

7: for k 1 : K do

8: evaluate and store log p(φ(l,k))

9: let λ(l) = ELBO log p(φ(l,1:K)),log q(φ(l,1:K))

10: let l = arg maxl λ(l)

11: let ψ(1:M),log q(ψ(1:M)) = BFGS-SAMPLE(s(1:l ), z(1:l ),θ(l ), log p(θ(l )),α(l ),ξ(1:l ), M)

J. The cost of sampling from the normal approximations is modest because we efﬁciently rescale and rotate standard normal draws using the factored L-BFGS covariance approximation. Estimating the evidence lower bound requires a number of log density evaluations equal to the number of Monte Carlo draws (K). The accuracy and reliability of the Monte Carlo estimates can be diagnosed without regard to a reference distribution using a diagnostic based on the Pareto k statistic (Vehtari et al., 2019; Dhaka et al., 2021). Sampling from the normal approximation and estimating the evidence lower bound may be done in parallel across the points on the optimization trajectory. This makes L-BFGS the serialization bottleneck for this algorithm. L-BFGS is economical in its calls to gradient and log density functions because of its ability to leverage quasi-Newton estimates of local curvature.

In some cases, the optimization path terminates at the initialization point and in others it can fail to generate a positive deﬁnite inverse Hessian estimate. In both of these settings, Pathﬁnder essentially fails. Rather than worry about coding exceptions or failure return codes, Pathﬁnder returns the last iteration of the optimization path as a single approximating draw with for the approximate normal log density of the draw. This ensures that failed ﬁts get zero importance weights in the multi-path Pathﬁnder algorithm, which we describe in the next section.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

Algorithm 2 Multi-path Pathﬁnder

Input: log p: differentiable log density function of dimension N π0: initial distribution Lmax: maximum number of L-BFGS iterations τrel: relative tolerance for convergence of L-BFGS J: size of the history used to approximate the inverse Hessian K: number of Monte Carlo draws to evaluate ELBO I: number of independent Pathﬁnder runs M: number of draws returned by each Pathﬁnder run R: number of draws returned by importance resampling (R IM) Output: ψ(1),. . .,ψ(R): approximate draws from target density p

1: procedure MULTIPATHPATHFINDER(log p,π0, Lmax,τrel, J,K, I, M, R)

2: for i 1 : I in parallel do

3: let φ(i,1:M),log q(φ(i,1:M)) = Pathﬁnder(log p,π0, Lmax,τrel, J,K, M)

4: compute and store target log densities log p(φ(i,1)),. . .,log p(φ(i,M))

5: let (ψ(1),i(1)),. . .,(ψ(R),i(R)) = PS-IR(φ(1:I,1:M),log q(φ(1:I,1:M)),log p(φ(1:I,1:M)), R),

6: where log q(φ(i,m)) = log 1

I multi-normal(φ(i,m) | µ(i),Σ(i))) = log q(φ(i,m)) log(I)

7: and log p(φ(i,m)) = log 1

I p(φ(i,m)) = log p(φ(i,m)) log(I).

2.2 Multi-path Pathﬁnder algorithm

The multi-path Pathﬁnder algorithm is given in Algorithm 2. It runs Pathﬁnder I times in parallel. For run i of Pathﬁnder, it saves the approximate samples φ(i,1:M). Then it generates R approximate draws based on all of the approximate draws φ(1:I,1:M) by importance resampling. The resulting approximation generalizes the normal distribution of Pathﬁnder to a mixture of I normal distributions, which improves approximations for distributions that are far from normal. Importance resampling from a mixture of normals also reduces the variability arising from single runs using random initial values and stochastic ELBO estimates to choose the best normal approximation along the trajectory. In particular, for more than one run of Pathﬁnder, we augment the parameter space to include the discrete index i 1:I indicating the mixture component that generated the parameters. With the assumption that all runs of Pathﬁnder return the same number of draws and that the mixture components are equally weighted, the resulting proposal distribution of draws and mixture indicator is q(φ,i) = 1

I multi-normal(φ | µ(i),Σ(i)),

where multi-normal(µ(i),Σ(i)) is the normal approximation selected by run i of Pathﬁnder. The corresponding joint target distribution can be extended in the same way, to

With the augmented parameter space, we can importance resample across different runs of Pathﬁnder without recomputing the marginal proposal distribution of φ for all draws. The importance resampling procedure based on several proposal densities from multiple runs of Pathﬁnder is a form of multiple

ZHANG, CARPENTER, GELMAN, AND VEHTARI

importance sampling. More speciﬁcally, we use the scheme labeled N1 by Elvira et al. (2019), which has the least computational burden among the proper alternatives and can be computed in parallel in each individual run of Pathﬁnder. Furthermore, optimization paths stuck in minor modes, at saddle points, or on plateaus can be eliminated through the natural weighting of importance resampling. The second step of the multi-path Pathﬁnder algorithm in Algorithm 2 uses importance resampling, based on the joint (parameter and mixture indicator) log density of the proposal density function q(φ,i) and target density function p(φ,i). The pseudocode for Pareto-smoothed importance resampling (PS-IR) is provided in Algorithm 5. Multi-path Pathﬁnder performs I completely independent runs of Pathﬁnder, so that its expected number of operations is I times as many operations as are expected from Pathﬁnder. Because these independent runs can be executed asynchronously and each takes roughly the same amount of work, wall time for multi-path Pathﬁnder should be only slightly higher than that of Pathﬁnder. The importance resampling step is fast, but it requires all runs of Pathﬁnder to complete before it is executed, making the expected time to run multi-path Pathﬁnder a bit longer than that of the slowest of the independent Pathﬁnder chains. There is a bit of additional parallelizable work to evaluate log densities in the approximation and in the target density. After this evaluation, resampling only requires normalization, random number generation, and selection, all of which are fast.

2.3 L-BFGS optimization

For minimizing the objective function log p(θ), Newton steps move in the direction of the inverse Hessian of log p(θ) times the gradient, with all derivatives being respect to the parameter vector θ,

δ = ( 2 log p(θ)) 1 log p(θ).

Quasi-Newton methods are so called because they use an approximation of the inverse Hessian. The BFGS optimization algorithm is a quasi-Newton method that approximates the inverse Hessian through updates of positions and gradients of the objective function at positions along the optimization path (Broyden, 1970; Fletcher, 1987; Goldfarb, 1970; Shanno, 1970). The limitedmemory BFGS (L-BFGS) algorithm limits the size of the history of ﬁnite differences for greater scalability and to allow the local inverse Hessian estimates to adapt to varying curvature along the optimization path (Nocedal, 1980). There are several prominent implementations of L-BFGS that vary in their details; we use the version introduced by Byrd et al. (1995) and detailed by Zhu et al. (1997).3 The pseudocode for Zhu et al. s version of L-BFGS (without bounds) is listed in Algorithm 7 in Appendix F. The standard L-BFGS algorithm approximates inverse Hessians using the previous J updates of positions and gradients of the objective function along the optimization path. For iteration l, let S = S1 SJ and Z = Z1 ZJ be N J matrices that store the previous J updates of positions and gradients, (i.e., Sj = θ(l J+j) θ(l J+j 1), Zj = log p(θ(l J+j)) log p(θ(l J+j 1)) for j = 1,. . ., J). Let α be an N-vector that stores the diagonal elements of an initial diagonal inverse Hessian estimate. Following Byrd et al. (1994, eq. 2.6), the estimated inverse Hessian at iteration l based on the previous J positions can be formulated as

Σ(l) = diag(α) + β γ β , (1)

3. We use the L-BFGS-B implementation in the R function stats::optim() (R Core Team, 2021).

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

β = diag(α) Z S , γ = 0 E 1

E E diag(η) + Z diag(α) Z E 1

( S i Zj if i j 0 otherwise ,η = S 1 Z1 S J ZJ .

Although L-BFGS never explicitly constructs Σ(l), in Section 2.5 we show how its factored form can be used to derive an efﬁcient algorithm to sample from the normal distributions with covariance Σ(l), which is the key step in the Monte Carlo estimator of the ELBO.

2.4 Local density approximations along the optimization path

The proposed normal approximations in Pathﬁnder are located along the optimization path. The second-order Taylor series expansion of the target log density log p(θ | y) at a point θ(l) on the optimization path is

log p(θ | y)

log p(θ(l) | y) + log p(θ(l) | y) (θ θ(l)) 1

2(θ θ(l)) H(θ(l)) (θ θ(l))

= log ˆp(θ | y),

where H(θ) = 2 log p(θ | y) = 2 log p(θ) is the Hessian function mapping points to the matrix of second derivatives of log p(θ) with respect to θ at that point. The approximate distribution ˆp(θ | y) is a second-order Taylor series expansion around θ(l), which produces a multivariate normal approximation with mean

µ(l) = θ(l) + H 1(θ(l)) log p(θ(l) | y) = θ(l) + H 1(θ(l)) log p(θ(l))

and covariance H 1(θ(l)). At a mode, the ﬁrst-order term drops out and we are left with a standard Laplace approximation. Pathﬁnder reconstructs the factors of the inverse Hessian approximations as needed using the optimization trajectory, as shown in Algorithm 3. We construct covariance estimates rather than caching the estimates from the optimization algorithm for three reasons. First, it gives us the ﬂexibility to use a different inverse Hessian approximation than that used in the optimization algorithm. In our implementation, we use standard L-BFGS to generate an optimization path whose diagonal inverse Hessian estimation diag(α) is a scaled identity matrix, while we use Gilbert and Lemaréchal (1989, eq. 4.9) in the recovery of the diagonal inverse Hessian estimation to allow the elements in α to vary. Second, the separation allows us to set up a more restricted condition to ﬁlter out sharp updates for the inverse Hessian estimation. In Algorithm 3 line 5, the condition of selecting the correct pairs (s(l), z(l)) ﬁlters out candidates whose gradient changes 1012 times more than the position on the direction of the update of gradient. Third, the outputs of the α-RECOVER procedure enable us to recover the factors in the covariance estimation (1) for each iteration in parallel without saving matrices β and γ in (2), which reduces total memory overhead and reduces communication costs.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Algorithm 3 Diagonal inverse Hessian estimation.

Input: θ(0),. . .,θ(L) RN: optimization path log p(θ(0)),. . ., log p(θ(L)) RN: gradients of log density of θ(0:L)

J: size of the history used to approximate the inverse Hessian Output: (α(1),. . .,α(L)): The diagonal elements of the initial inverse Hessian approximation (ξ(1),. . .,ξ(L)): The indicator of whether the updates of position and gradient are included in the inverse-Hessian approximation or not. (s(1),. . ., s(L)): The updates of position of the optimization path (z(1),. . ., z(L)): The updates of the gradient ( log p) of the optimization path

1: procedure α-RECOVER(θ(0:L), log p(θ(0:L)), J)

2: let α(0) = 1N where 1N denotes the vector of 1 s in RN O(N)

3: for l 1 : L do O(LJN)

4: let s(l) = θ(l) θ(l 1) and let z(l) = log p(θ(l 1)) log p(θ(l))

5: if s(l) z(l) < ϵ z(l) 2 with ϵ = 10 12 then

6: let ξ(l) = 1

7: let a = z(l) diag(α(l 1)) z(l); b = z(l) s(l); c = s(l) diag(α(l 1)) 1 s(l) O(N)

8: for n 1 : N do O(N)

9: let α(l) n =

a b α(l 1) n + z(l)2 n

b a s(l)2 n b c α(l 1) n 2

11: let α(l) = α(l 1) and ξ(l) = 0

We store a sequence of indicators to pick out pairs of positions and gradients to use in covariance estimates. Pathﬁnder evaluates the evidence lower bound for all local approximations to select the best normal approximation. By considering all normal approximations from the tail to the mode or pole of the posterior distribution, Pathﬁnder can quickly ﬁnd approximations that generate draws in the high probability region of the target density.

2.5 Sampling from the approximation and evaluating the log density of a draw

To implement Pathﬁnder, we need to be able to sample from the approximating distributions in order to evaluate the evidence lower bound. Furthermore, we need to be able to evaluate the log density of these sampled points for importance resampling. While these operations could be implemented by factoring our inverse Hessian approximations Σ(l), this would require O(N3) operations in N dimensions. Fortunately, the outer product representation may be used along with a thin QR factorization and Cholesky factorization in order to produce draws and their log density in O(NJ2 + J3) operations using only O(NJ + J2) memory, where J is the history size of L-BFGS and N is the number of dimensions. That is, the algorithm is linear in dimensionality, with a constant factor determined by the history size for L-BFGS Hessian approximations.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

To achieve this efﬁciency, the Hessian can be factored as

Σ(l) = diag(α 1 2 ) I + diag(α 1

2 ) β γ β diag(α 1

2 ) diag(α 1 2 ), (3)

where we have simpliﬁed notation by dropping the superscripts on α(l). Next, let

Q R = diag(α 1

be the thin QR-factorization of diag(α 1

2 ) β, so that Q is an N 2J matrix with orthonormal columns and R is a 2J 2J upper triangular matrix. Let P be the N (N 2J) matrix with orthonormal columns that makes Q P an N N orthogonal matrix. We can then factor the inverse Hessian estimate as Σ(l) = T T , (4)

where T = diag(α 1 2 ) Q L P ,

and L is deﬁned through Cholesky decomposition to satisfy L L = I + R γ R . We provide a more detailed derivation of (4) in Appendix A. If we draw a standard normal N-vector v multi-normal(0,I), then we can translate, scale, and rotate it so that it is a draw from the approximate distribution,

µ(l) + T v multi-normal(µ(l),Σ(l)),

where the location vector is computed via

µ(l) = θ(l) + Σ(l) log p(θ(l) | y) = θ(l) + diag(α) log p(θ) + β γ β log p(θ),

which only involves matrix vector multiplication and requires order O(JN + J2) operations and memory. Next, consider generating u multi-normal(0,I) and setting

It follows that v multi-normal(0,I), with the orthogonality between columns of P and Q allowing us to produce draws µ(l) + T v multi-normal(θ(l),Σ(l)) deﬁned by

µ(l) + T v = µ(l) + diag(α 1 2 )(Q L Q u + P P u)

= µ(l) + diag(α 1 2 )(Q L Q u + u Q Q u) . (5)

Using the factorization in (4), we can compute the log determinant required for evaluating the approximate log density as

log Σ(l) = log |diag(α)| + 2 log L .

This provides a means to efﬁciently calculate the log density of the sampled point in the approximating distribution as

log normal(µ(l) + T v | µ(l),Σ(l))

log Σ(l) + (µ(l) + Tv µ(l)) (Σ(l)) 1(µ(l) + Tv µ(l)) + N log(2π)

log Σ(l) + u u + N log(2π) .

ZHANG, CARPENTER, GELMAN, AND VEHTARI

In summary, the sampling and log density evaluation algorithm relies on efﬁciently decomposing the factored form instead of directly Cholesky decomposing Σ(l) to generate draws and compute log densities of draws. Algorithm 4 provides the complete pseudocode for the sampling and density evaluation algorithm.

Algorithm 4 Sample from local approximations

Input: s(1:l): the updates of position upto iteration l z(1:l): the updates of gradient ( log p) upto iteration l θ(l): the position of optimization path at iteration l log p(θ(l)): the gradient of log density at θ(l)

α(l): The diagonal elements of the initial inverse Hessian approximation ξ(1:l): The indicator of whether the update of the position and gradient are included in the inverse-Hessian approximation or not. J: size of the history used to approximate the inverse Hessian M: number of draws to return Output: φ(1),. . ., φ(M): draws from approximate distribution (M N matrix) log q(φ(1)),. . .,log q(φ(M)): log densities of draws in the approximate normal distribution (M-vector)

1: procedure BFGS-SAMPLE(s(1:l), z(1:l),θ(l), log p(θ(l)),α(l),ξ(1:l), M)

2: ﬁnd the indexes χ of the last (at most) J non-zero indicators in ξ(1:l) and record the last J updates of positions S = s(χ) = S1 SJ and gradients Z = z(χ) = Z1 ZJ . The latest update is on the last column

3: generate the upper-triangular matrix E by O(J2N)

( S i Zj if i j 0 otherwise , and save the diagonal elements of E as η

4: generate β,γ by O(J2N + J3)

β = diag(α(l)) Z S , γ = 0 E 1

E E diag(η) + Z diag(α(l)) Z E 1

5: compute the thin QR-factorization Q and R for diag(α 1

2 ) β O(J2N)

6: calculate the Cholesky decomposition L of I + R γ R O(J3)

7: let log |Σ| = log |diag(α)| + 2 log L O(N)

8: let µ = θ + diag(α) log p(θ) + β γ β log p(θ) O(JN + J2)

9: for m 1 : M do O(JNM + J2M)

10: sample u(m) multi-normal(0,I) O(N)

11: let φ(m) = µ + diag(α 1 2 ){Q ( L I) (Q u(m)) + u(m)} O(JN + J2)

12: let log q(φ(m)) = 1

2 log |Σ| + u(m) u(m) + N log(2π) O(N)

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

2.6 Estimating KL divergence from the approximate densities

From the sequence multi-normal(µ(1),Σ(1)),. . .,multi-normal(µ(L),Σ(L)) of normal approximations along the optimization path, Pathﬁnder then selects the approximation from step l that minimizes Kullback-Leibler divergence to the target density,

l = arg minl KL multi-normal(θ | µ(l),Σ(l)) p(θ | y) .

As usual in variational inference, it is more convenient to deﬁne l equivalently as the point that maximizes the evidence lower bound (ELBO) (Wainwright and Jordan, 2008). With draws from the approximating distribution, φ(1),. . ., φ(K) multi-normal(µ(l),Σ(l)), the ELBO is straightforward to evaluate with Monte Carlo,

ELBO multi-normal(µ(l),Σ(l)) p(θ | y)

k=1 log p(φ(k)) log(multi-normal(φ(k) | µ(l),Σ(l))). (7)

The pseudocode for the ELBO estimation algorithm is provided in Algorithm 6 in Appendix F. Given the factorization of the L-BFGS covariance described in the previous section, the ELBO can be approximated using K draws from the approximating distribution in O(NJ2+J3+JNK+J2K) operations using only O(NJ + J2) memory, where J is the history size of L-BFGS, N is the number of dimensions, and K is the number of Monte Carlo draws used to evaluate the ELBO. Both the history size (J) and number of Monte Carlo evaluations (K) will be small and ﬁxed, rendering the overall complexity linear in the dimensionality of the target distribution (N). Using a larger number K of Monte Carlo draws will reduce the variance of the ELBO estimate at the cost of more computation. The ELBO tends to be more stable than other divergence measures when using a ﬁnite sample size K (Dhaka et al., 2021). We have chosen K = 5 in our experiments in Section 3. Furthermore, the K samples can be drawn and evaluated for log density in parallel with no synchronization required until they are averaged to produce a ﬁnal estimate. We evaluate the sensitivity of our results to the choice of K in Section 3.

2.7 Pareto-smoothed importance resampling

In the ﬁnal step of multi-path Pathﬁnder, we employ a Pareto-smoothed importance resampling algorithm to reﬁne the approximation draws based on approximations from I independent runs of Pathﬁnder. Importance resampling is a method for reﬁning a set of draws from an approximating distribution to better approximate draws from a target distribution (Rubin, 1987). Importance resampling works by resampling from the original sample with replacement with probabilities proportional to the importance weights. Importance sampling estimators weight draws based on their importance ratios but can have high or even inﬁnite variance. Ionides (2008) showed that truncating the importance weights improves the efﬁciency of the resulting Monte Carlo estimator by reducing its variance. Vehtari et al. (2019) introduced a continuous generalization of truncation that ﬁts the importance weights to a generalized Pareto distribution, whose cumulative distribution function is then used to provide an evenly spaced set of importance weights. Rather than directly using the smoothed weights to calculate expectations, we instead use them as importance resampling weights, leading to Pareto smoothed importance resampling (PS-IR), as listed in Algorithm 5. As far as we know, Pareto smoothing has not been previously applied to importance resampling, but

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Algorithm 5 Pareto-smoothed importance resampling (PS-IR)

Input: φ(1),. . ., φ(J): draws from proposal distribution q log q(φ(1)),. . .,log q(φ(J)): proposal log densities log p(φ(1)),. . .,log p(φ(J)): target log densities R: number of draws resampled (R J) Output: ψ(1),. . .,ψ(R): importance resampled draws

1: procedure PS-IR(φ(1:J),log q(φ(1:J)),log p(φ(1:J)), R)

2: let w1,. . .,w J = PSIS(log q(φ(1:J)),log p(φ(1:J))) be the Pareto-smoothed importance sampling weights O(J)

3: sample ψ(1),. . .,ψ(R) from φ(1),. . ., φ(J) with replacement, with probabilities proportional to wj O(RJ)

only to importance sampling. We use resampling in order to make it easy to use as an initialization algorithm for MCMC and to simplify expectation and quantile estimation by returning draws rather than weighted draws.

2.8 Related methods

Automatic differentiation variational inference (ADVI) is a method for black-box variational inference with differentiable densities (Kucukelbir et al., 2017). ADVI s variational objective is identical to Pathﬁnder s, namely Kullback-Leibler (KL) divergence from the approximating distribution to the target distribution. The difference is that ADVI directly optimizes the variational objective using stochastic gradient descent, whereas Pathﬁnder optimizes the target density using quasi-Newton optimization and then chooses the point along the optimization path based on the variational objective. Like Pathﬁnder, ADVI uses a multivariate normal approximating distribution on an unconstrained parameter space. Any constrained variables such as scales or covariance matrices or simplexes are transformed to an unconstrained representation in RN, with appropriate change of variables adjustments. ADVI s covariance matrix may be taken to be dense or it may be constrained to be diagonal; Kucukelbir et al. (2017) call the former full rank and the latter mean ﬁeld, though both are technically required to have rank N. The Kullback-Leibler (KL) divergence from the normal approximation to the target distribution is evaluated using Monte Carlo methods by taking an average of the log density of draws from the approximating distribution. This results in a stochastic gradient algorithm, with gradients calculated using automatic differentiation (Mohamed et al., 2019; Carpenter et al., 2015). Dhaka et al. (2021) show that Monte Carlo estimates of this KL divergence and its gradient are in general stable, although they may have high variance. Compared to Pathﬁnder s direct quasi-Newton optimization of the log density, ADVI is restricted to small step sizes because of the stochastic nature of the gradient calculation and the lack of curvature information. In evaluations below, we show that ADVI requires one to two orders of magnitude more function evaluations than Pathﬁnder to ﬁnd the high probability mass region. In addition, we show that Pathﬁnder produces approximations that range from slightly worse to much better than ADVI (with limited computation time) in complex problems as measured by the 1-Wasserstein

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

metric. Finally, ADVI is intrinsically serial in its evaluation of the KL divergence at each iteration of optimization, whereas Pathﬁnder is embarrassingly parallel after the relatively fast L-BFGS optimization.

Early stopping optimization divides the data in two parts, with one part used for optimization and other part is used to compute out-of-sample performance criterion and the optimization is stopped when the out-of-sample performance starts to decrease (Vehtari et al., 2000). The downside of the approach is that it requires factorizing the likelihood and additional data manipulation to make the data divisions. Pathﬁnder works also for non-factorized likelihoods as the stopping is decided by the ELBO estimate (in our implementation the optimization is not stopped early but run to the termination and then ELBO is estimated for each optimization trajectory point, potentially in parallel). Furthermore Pathﬁnder returns normal approximations instead of just points along the trajectory.

Early stopping variational inference generates approximate posterior draws by taking random draws from an initialization distribution, then following an optimization path for a ﬁxed number of steps and taking the result as an approximate posterior draw (Duvenaud et al., 2016). The number of steps is selected to minimize the KL divergence from the approximating distribution to the target distribution. Early stopping variational inference can be viewed as a normalizing ﬂow (Rezende and Mohamed, 2015), which generates an approximate draw from a target distribution by generating a draw from a simple distribution, such as uniform or standard normal, then transforming it. Pathﬁnder is similar, but it stops at a normal approximation from which we draw a sample.

Early stopping VI differs from Pathﬁnder in several substantive ways. Most importantly, early stopping VI determines a number of optimization steps as the variational parameter, generating a range of values based on the random initialization. Pathﬁnder, in contrast, evaluates a variational approximation centered at each point on the optimization path as a variational approximation. Secondarily, early stopping VI chooses a point on the optimization path, whereas Pathﬁnder generates a point from a normal approximation located at a point on the optimization path. Computationally, early stopping variational inference requires an optimization run for each approximate posterior draw, whereas Pathﬁnder uses a single optimization run (though multi-path Pathﬁnder importance resamples among several such paths).

Invertible ﬂow non-equilibrium sampling (In Fi NE) is another method based on selecting points on a deterministic optimization path with importance resampling (Thin et al., 2021). Unlike Pathﬁnder and the other systems mentioned so far, In Fi NE is asymptotically exact. It was motivated by the need for efﬁcient estimators for normalizing constants of intractable densities known only up to a constant factor, such as most Bayesian posteriors. The other motivation mentioned in the paper is in accurately evaluating KL divergence using the evidence lower bound (ELBO). To achieve these goals, In Fi NE uses an iterative importance resampling scheme (Andrieu et al., 2010).

As with normalizing ﬂows, In Fi NE keeps the Jacobian tractable over the optimization path by using a Hamiltonian ﬂow with a friction term that will cause the ﬂow to come to rest at a local mode (a well in potential energy, which is negative log density) or just keep falling.

Short parallel MCMC chains. Hoffman and Ma (2020) demonstrate that HMC can quickly reach the high probability region, which they exploit by generating many short MCMC chains in parallel and using only the small part from the end. In experiments not reported here, we tested starting many parallel chains from the L-BFGS trajectory points and testing for a trend in the log density. A lack of trend in the log density is consistent with the chains having been initialized in the high probability region. Although we have veriﬁed this approach works by evaluating short-chain dynamic HMC, the

ZHANG, CARPENTER, GELMAN, AND VEHTARI

number of log density evaluations required is several orders of magnitude larger than needed by the ELBO estimate used in Pathﬁnder.

3. Experiments

This section provides experimental evaluations of Pathﬁnder. In Section 3.1, we compare Pathﬁnder to two popular posterior approximation algorithms, ADVI and an ensemble of short adaptive HMC chains. ADVI (Kucukelbir et al., 2017) is the industry standard black-box variational inference algorithm, implemented in Stan (Stan Development Team, 2021a), Py MC3 (Salvatier et al., 2016), Pyro (Bingham et al., 2019), Tensor Flow Probability (Dillon et al., 2017), JAX (Bradbury et al., 2018), Turing.jl (Ge et al., 2018), and other differentiable programming languages. Like Pathﬁnder, ADVI provides normal approximations of posteriors. ADVI can be conﬁgured to use a dense covariance or restricted to a diagonal covariance matrix. We evaluate both alternatives in this section. We treat Stan s no-U-turn sampler (Hoffman and Gelman, 2014; Betancourt, 2017) as a nonparametric posterior approximation algorithm, and, following the conclusions in Hoffman and Ma (2020), we ran many parallel MCMC chains and took samples from the last iteration as approximate draws. We evaluate Pathﬁnder s sensitivity to tuning parameters in Section 3.2. In Section 3.3, we investigate the behavior of Pathﬁnder for difﬁcult posteriors. We evaluate the results of using Pathﬁnder versus short chains of adaptive HMC for initializing a Gaussian process model in Section 4. The code for simulations is available at https://github.com/Lu Zhangstat/Pathfinder. We use 1-Wasserstein distance (Craig, 2016; Villani, 2009; Mc Cann, 1995) between the empirical distribution of the approximate samples and the target posterior distribution to evaluate how well we are taking independent draws from the posterior. We provide an introduction to 1-Wasserstein distance and its computation in our simulation studies in Appendix B.In all of the evaluations presented in this section, the model parameters are transformed to the unconstrained scale, with corresponding change of variables adjustments. The resulting support on all of RN matches the support of the multivariate normal approximations used by ADVI and Pathﬁnder and allows the algorithms to avoid dealing with boundaries. Moreover, we assume that the posterior distribution is closer to normal in the unconstrained space. Hence, for both ADVI and Pathﬁnder, the Gaussian approximation is made in the unconstrained space. And in all of our experiments, we compare the approximation performance through samples in the unconstrained space. In practice, these samples will be (inverse) transformed back to the constrained space. Details of the transformations, inverse transforms, and their log absolute Jacobian determinants are provided by Stan Development Team (2021a, Chapter 10).

3.1 Evaluating Pathﬁnder as variational inference

In this section, we compare Pathﬁnder with ADVI and Stan s default phase I warmup (dynamic HMC in the form of the no-U-turn sampler) through experiments. We evaluate Pathﬁnder using the 20 models and data sets from posteriordb (Magnusson et al., 2021), each of which is supplied with reference posteriors in the form of 10,000 roughly independent draws. The set of models evaluated includes

generalized linear models: nes, earnings, dogs, diamonds sblrc,

hierarchical meta-analysis models: eight_schools (centered and non-centered),

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

Gaussian processes: gp_pois,

mixtures: low_dim_gauss_mix,

differential equation dynamics models: hudson_lynx_hare, one_comp_MM_elim,

hidden Markov models: bball_drive, and

time-series models: arma, ar K, and garch.

For each model in posteriordb, we run single-path Pathﬁnder with 100 different random initializations, using our proposed default settings:

maximum L-BFGS iterations (Lmax = 1000),

relative tolerance for L-BFGS convergence (τrel = 10 13),

size of L-BFGS history to approximate inverse Hessian (J = 6),

number of Monte Carlo draws to evaluate ELBO (K = 5), and

number of draws per run (M = 100).

For multi-path Pathﬁnder, we again take 100 approximate draws, but use a larger number of intermediate runs,

number of single-path Pathﬁnder runs (I = 20),

number of draws returned by each single-path Pathﬁnder run (M = 100), and

number of draws per run (R = 100).

For both single-path and multi-path Pathﬁnder, we repeat the entire process 100 times. We use Wasserstein distance from the approximate draws in each run to the reference posterior draws to determine how well Pathﬁnder achieves its goal of producing approximate posterior samples. In addition to single-path and multi-path Pathﬁnder, we also evaluate 100 runs by

Stan phase I adaptation: adaptive Hamiltonian Monte Carlo with Stan s no-U-turn sampler (unit metric, step size adaptation, and a maximum tree depth of 10, keeping the last of 75 iterations),

dense ADVI: automatic differentiation variational inference with a dense covariance matrix (Stan default settings, return 100 approximate draws), and

mean-ﬁeld ADVI: automatic differentiation variational inference with a diagonal covariance matrix (Stan default settings, return 100 approximate draws).

Each of these procedures uses random initialization values, generated from a uniform( 2,2) distribution, which is the default for Stan. The expressive power of Pathﬁnder s low-rank plus diagonal covariance approximation is between that of ADVI s diagonal and dense choices for covariance. In summary, we generated 100 runs of 100 draws for each of our candidate approximate posterior distribution algorithms, including Pathﬁnder, multi-path Pathﬁnder, mean-ﬁeld ADVI and dense ADVI, and we generated 100 approximate samples using Stan s phase I sampler.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/32 1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 256 >512 scaled 1 Wasserstein distance

Stan Phase I pathfinder mean field ADVI dense ADVI

Figure 3: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from single-path Pathﬁnder and ADVI for the 20 models in posteriordb. Each box plot displays 1-Wasserstein distances of 100 independent runs of (single-path) Pathﬁnder, mean-ﬁeld ADVI, and dense ADVI. We calculate the 1-Wasserstein distance with 100 approximate draws from the last iteration of 100 runs of Stan s phase I warmup (adaptive HMC). Distances for each model are scaled by the median of the 1-Wasserstein distances for single-path Pathﬁnder.

Assessing the quality of approximations. We provide a comparison of single-path Pathﬁnder, ADVI and Stan s phase I sampler through 1-Wasserstein distances for all 20 models from posteriordb in Figure 3. A comparison of multi-path Pathﬁnder, single-path Pathﬁnder and Stan s phase I sampler is in Figure 4. To adjust for the varying scale of the 1-Wasserstein distances across target densities, we scaled results relative to the median of the 100 1-Wasserstein distances for single-path Pathﬁnder for each model. This allows us to compare ratios of the 1-Wasserstein distance between Pathﬁnder s draws and the target posterior and the 1-Wasserstein distance between another system s draws and the target posterior.

It is clear that single-path and multi-path Pathﬁnder outperform the ADVI variants for most of the models in posteriordb. The bar and whisker plots show that over 100 independent runs, multi-path Pathﬁnder is the most stable, followed by single-path Pathﬁnder, and then mean-ﬁeld

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/32 1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 scaled 1 Wasserstein distance

Stan Phase I multi path pathfinder pathfinder

Figure 4: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from single-path Pathﬁnder and multi-path Pathﬁnder for the 20 models in posteriordb. Each box plot displays 1-Wasserstein distances of 100 independent runs of singlepath or multi-path Pathﬁnder. We calculate 1-Wasserstein distance with 100 approximate draws from the last iteration of 100 runs of Stan s phase I warmup (NUTS). Distances for each model are scaled by the median of the 1-Wasserstein distances for single-path Pathﬁnder.

ADVI, with dense ADVI providing the most variability in 1-Wasserstein distance to the true posterior. The median 1-Wasserstein distance for mean-ﬁeld ADVI is more than double that of single-path Pathﬁnder for 8 (of 20) test models. Dense ADVI is worse, with 9 (of 20) test models having double the 1-Wasserstein distance of single-path Pathﬁnder.

There is only one model where the 1-Wasserstein distance is much smaller for mean-ﬁeld ADVI, the hidden Markov model bball_drive_event_0-hmm_drive_0, with a median 1Wasserstein distance that less than one tenth of that for single-path Pathﬁnder. This particular model has multiple meaningful posterior modes. The noise inherent in the stochastic gradient descent approach used by ADVI allows it to escape minor modes than can trap the L-BFGS optimizer used by Pathﬁnder. It might be possible to resolve this problem for single-path Pathﬁnder with a more robust optimization algorithm. Until we ﬁnd such an algorithm, we note that multi-path Pathﬁnder

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Multiple of Pathﬁnder s evaluations Stan phase I mean-ﬁeld ADVI dense ADVI log density 7.9 24 28 gradient 34 48 54

Table 1: The values in the table indicate how much more work is required for Stan s phase I warmup and ADVI compared to Pathﬁnder averaged over all of the test models in posteriordb. The values are ratios of evaluations, so that, for example, mean-ﬁeld ADVI required 48 times as many gradient evaluations as Pathﬁnder. Dense ADVI does additional work beyond gradient evaluations in log density and simulation which are not included. The actual implementation of candidate algorithms can be aborted due to various errors. In this experiment, we count all the log density and its gradient evaluation in the failed trials until success.

eliminates this problem for the bball_drive_event_0-hmm_drive_0 model, resulting in 1-Wasserstein distances that are comparable to those for mean-ﬁeld ADVI. In addition to working better on most posteriors, single-path and multi-path Pathﬁnder are more stable than Stan s HMC-based phase I adaptation for more challenging posteriors. For 7 (of 20) test models, Stan s phase I warmup produced 1-Wasserstein distances more than double the median distance of single-path Pathﬁnder. Except for the bball_drive_event_0-hmm_drive_0 example, the 1-Wasserstein distances for single-path Pathﬁnder are at most double that of Stan s phase I warmup.

Assessing the computational cost. Pathﬁnder, Stan s phase I sampler, and automatic differentiation variational inference are all dominated computationally by log density and gradient calculations. Using the number of these operations as a measure of computation conveys the further advantage of being implementation agnostic. We summarize the cost for 100 repetitions of (single-path) Pathﬁnder, Stan phase I warmup, and ADVI for each model through box plots in Figure 5 (the repetitions are not part of the algorithm, but merely to provide a sense of execution cost variability from run to run). Table 1 summarizes average costs in terms of log density and gradient operations. Although we summarize both log density and gradient costs for completeness, gradients typically consume closer to 80% of overall compute cost for Stan phase I and ADVI when calculated with automatic differentiation (Carpenter et al., 2015). Because gradients are more expensive with automatic differentiation than log density evaluations, the results indicate that other methods require around 30 to 50 times more operations than Pathﬁnder. These results only consider serial execution. Multi-path Pathﬁnder requires 20 (the default number used in the experiments) times as many evaluations as single-path Pathﬁnder plus the importance resampling step. Importance resampling is fast, but it does require evaluating a few log densities of each candidate in both the approximating and target density. Thus the wall time for running multi-path Pathﬁnder could be nearly as fast as the slowest of the runs of single-path Pathﬁnder. Compared to short chains of adaptive HMC, there is less variability across runs for Pathﬁnder, as can be gleaned from the plots in Figure 5. These are timings for single runs (averaged over 100 runs). In practice, we typically initialize multiple Markov chains, so the gap in number of evaluations becomes even wider, though these can be parallelized for all systems.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

bball_drive_event_0 hmm_drive_0 eight_schools_centered eight_schools_noncentered garch garch11 mcycle_gp accel_gp dogs dogs_log gp_pois_regr gp_pois_regr hmm_example ar K ar K nes2000 nes one_comp_mm_elim_abs dogs dogs gp_pois_regr gp_regr low_dim_gauss_mix hudson_lynx_hare lotka_volterra bball_drive_event_1 hmm_drive_1 diamonds diamonds earnings logearn_interaction sblrc blr arma arma11

100 1000 10,000 50,000 count of log density evaluation

Pathfinder mean field ADVI dense ADVI Stan Phase I

(a) Log density evaluations

bball_drive_event_0 hmm_drive_0 eight_schools_centered eight_schools_noncentered garch garch11 mcycle_gp accel_gp dogs dogs_log gp_pois_regr gp_pois_regr hmm_example ar K ar K nes2000 nes one_comp_mm_elim_abs dogs dogs gp_pois_regr gp_regr low_dim_gauss_mix hudson_lynx_hare lotka_volterra bball_drive_event_1 hmm_drive_1 diamonds diamonds earnings logearn_interaction sblrc blr arma arma11

10 100 1000 10,000 50,000 count of log density gradient evaluation

Pathfinder mean field ADVI dense ADVI Stan Phase I

(b) Gradient evaluations

Figure 5: Box plots of the number of (a) log density evaluations and (b) gradient evaluations required by the candidate algorithms. For each algorithm, we performed 100 independent runs and summarize the results with box plots. Candidate algorithms can abort due to various errors. We count the log density and gradient evaluations in failed runs. We do not plot multi-path Pathﬁnder because its number of evaluations is just a multiple of the single-path results plus a small number of extra approximate and log density evaluations for importance resampling.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 scaled 1 Wasserstein distance

Lmax = 100, tol = 1e 3, K = 5, J = 6 Lmax = 1000, tol = 1e 13, K = 5, J = 6

Figure 6: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from Pathﬁnder for examples in posteriordb. Each box plot summarizes 1-Wasserstein distances for 100 independent runs of Pathﬁnder. The two results are for Pathﬁnder with default settings (orange), and with the number of iterations reduced to Lmax = 100 and the convergence threshold increased to 10 3 (black). We scaled the 1-Wasserstein distances for the same model by the median distances using the default settings.

3.2 Sensitivity to tuning parameters

In this section, we provide an analysis of Pathﬁnder s sensitivity to tuning parameters to evaluate whether we just got lucky with our suggested default settings for Pathﬁnder. For both adaptive Hamiltonian Monte Carlo and automatic differentiation variational inference, we use the default settings in Stan, which have proven successful for a broad range of applications (Stan Development Team, 2021a).

Optimization tolerance and maximum number of iterations. To test the performance of Pathﬁnder under different relative tolerances for convergence τrel and maximum iteration for L-BFGS Lmax, we reproduced 100 runs of Pathﬁnder with Lmax reduced to 100 and τrel increased to 10 3 for each model in posteriordb. The lower cap on number of iterations and more relaxed tolerances should lead to less computation and perhaps less stability in Pathﬁnder. Following the simulation

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 scaled 1 Wasserstein distance

Lmax = 1000, tol = 1e 13, K = 30, J = 6 Lmax = 1000, tol = 1e 13, K = 5, J = 6

Figure 7: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from Pathﬁnder for examples in posteriordb when using K = 5 (default, in orange) and K = 30 (black) Monte Carlo draws to evaluate the evidence lower bound. The results are scaled by the median of 1-Wasserstein distances for Pathﬁnder in the default setting.

design in Section 3.1, we estimate the 1-Wasserstein distance for 100 approximate draws from each run of Pathﬁnder. Figure 6 illustrates reduced ﬁdelity and increase in uncertainty with lower Lmax

and higher τrel for most of the tested models. For model earnings-logearn_interation, diamonds-diamond and bball_drive_event_1-hmm_drive_1 1-Wasserstein distances increased more than a factor of 8. None of the models show improved performance under these alternative settings. We thus prefer to keep a larger L and a smaller τrel, as they do not add much computation for simpler models, for which optimization terminates before the maximum iteration threshold.

The number of Monte Carlo draws to estimate ELBO. Figure 7 reports our sensitivity test for the number of Monte Carlo draws used to evaluate the evidence lower bound in Pathﬁnder (tuning parameter K). In particular, it compares 1-Wasserstein distances using the default K = 5 draws for evaluating the ELBO with the result of K = 30 draws. This increases the number of log density evaluations and the number of random numbers generated, but not the number of gradient

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/16 1/8 1/4 1/2 1 2 4 8 16 32 64 128 scaled 1 Wasserstein distance

Lmax = 1000, tol = 1e 13, K = 5, J = 60 Lmax = 1000, tol = 1e 13, K = 5, J = 6

Figure 8: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from Pathﬁnder for examples in posteriordb with the default of J = 6 (orange) gradients used to approximate the inverse Hessian and a much larger value of J = 60 (black) history elements. The distances are scaled by the median of the default behavior (J = 6).

evaluations, and reduces the scale of Monte Carlo error by around (1

30), or 60%. Increasing K consistently improves the performance of Pathﬁnder, but not by much. The median 1-Wasserstein distances for all models were reduced by only about 3.2% on average, with a maximum reduction of only 15.1%. The cost is about 4.9 times as many log-density evaluations on average. Based on the tradeoff between the quality of approximation and cost, a smaller K is a better choice when using Pathﬁnder, especially multi-path Pathﬁnder, to quickly ﬁnd a handful of draws close to high probability mass region. On the other hand, when using Pathﬁnder to generate an approximate posterior, a larger K may be warranted. Using K = 30 also reduces the variability in 1-Wasserstein distance by a little bit, as can be seen in the narrow 50% intervals and less extreme tail behavior.

History size of L-BFGS. Figure 8 provides a plot evaluating the difference between the default history size of J = 6 for L-BFGS to estimate an inverse Hessian with the much longer history size of J = 60. As before, we report on a comparison of 100 different runs of each system. Sensitivity to L-BFGS history size (J) varies across models. For the majority of models, Pathﬁnder works

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

better with longer histories. There is substantial improvement for diamonds-diamonds, whereas performance for sblrc-blr declines with larger J. This is because in some cases we need more history to get a higher-rank estimate of covariance, whereas in other cases, it can hurt local adaptation to keep a longer history. The example diamonds-diamonds has 27 parameters and the posterior distribution exhibits high correlations among 25 parameters. Hence we observed a great improvement of Pathﬁnder with a larger J. We encourage a larger J for approximation when the target distribution is expected to have high dependencies among a large number of parameters. Meanwhile, a smaller J requires less memory and computational cost, which is more efﬁcient for ﬁnding draws from the high probability mass region. For the other 19 models in posteriordb, median 1-Wasserstein distances were reduced an average of 0.7% when moving from J = 6 to J = 60. Among these 19 models, the maximum reduction was 16.3% and the maximum increase was 9.8%. In summary, a good choice of J depends on the speciﬁc problem. We found that, in general, J = 6 works well for the models in posteriordb. Our default history length is in line with defaults used for J in the range of 5 10 used as the default for most software distributions of L-BFGS. For example, R s stats::optim() function defaults to 5, whereas Sci Py uses 10 as the default for its scipy.optimize.fmin_l_bfgs_b function (Virtanen et al., 2020).

Number of parallel runs. To evaluate how sensitive multi-path Pathﬁnder is to the number of single-path Pathﬁnder runs used (I), we reproduce 100 runs of multi-path Pathﬁnder with I {5,20,40}, generating R = 100 approximate draws from each run (20 is our suggested default setting). We left all other tuning parameters at their proposed default values. We estimate the 1-Wasserstein distance for each run of multi-path Pathﬁnder to compare the performance of multipath Pathﬁnder under different values of I. Figure 9 shows that the change of I does not have much impact on the approximate performance of multi-path Pathﬁnder. When decreasing I from 20 to 5, the median 1-Wasserstein distances only increase 5.4% on average, with a minimum increase of 0.1% and maximum increase of 12.5%. When increasing I from 20 to 40, the median 1-Wasserstein distances are reduced by only 1.1% on average, with a maximum reduction of 5.5% and maximum increase of 6.4%. Increasing I from 5 to over 20 eliminates the extreme outcomes for the bball_drive_event_0-hmm_drive_0 example, which we further investigate in the next section.

3.3 Pathﬁnder for posteriors with challenging geometry

In this section, we consider four problems that challenge optimizers and MCMC samplers and thus might be expected to challenge Pathﬁnder.

Neal s funnel. Neal (2003) presents a model with funnel-like posterior geometry, deﬁned by

p(τ, β1,. . ., βN) = normal(τ | 0,3)

n=1 normal(βn | 0,exp(τ/2)). (8)

For values of τ > 0, the βn are relatively free (the mouth of the funnel), whereas for τ < 0, they are constrained to be near 0 (the neck of the funnel). This density is problematic for at least two reasons. First, the density grows without bound (i.e., has no maximum) as τ and βn 0. Second, the condition number of the inverse Hessian (ratio of largest to smallest eigenvalue) grows quickly as τ moves away from 0, which bounds the efﬁcacy of gradient-based updates. Furthermore, there is no way to globally precondition this distribution, as the curvature changes direction as τ

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/4 1/2 1 2 4 8 16 scaled 1 Wasserstein distance

I = 40 I = 20 I = 5

Figure 9: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from multi-path Pathﬁnder for examples in posteriordb when using I = 5 (green), 20 (orange, default), and 40 (black) independent Pathﬁnder runs. The results are scaled by the median of 1-Wasserstein distances for multi-path Pathﬁnder with I = 20.

moves away from 0 to the positive and negative side. The poor conditioning is endemic to hierarchical models, even with data (Betancourt and Girolami, 2015). The posteriordb package includes the eight schools model of Rubin (1981), in both a centered and non-centered parameterization; see Papaspiliopoulos et al. (2003); Stan Development Team (2021b); Betancourt and Girolami (2015) for more information on these parameterizations.

In Figure 10, we plot the behavior of the centered parameterization of the eight schools model as an example of funnel-like behavior, where the optimization paths and approximate draws by multi-path Pathﬁnder when initials of Pathﬁnder are randomly generated from uniform( 2,2) (our default) or uniform( 15,15) distributions in the unconstrained parameter space. We observe that even though the optimization paths correctly follow an optimization trajectory down the neck of the funnel, Pathﬁnder successfully identiﬁes points in the high probability mass (not high density) region, which is evenly split above and below τ = 0. The ﬁgure also shows that Pathﬁnder is sensitive to the choice of initial distribution π0. When the initial distribution is concentrated in a region of high target

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

-10 0 10 20

(a) optimization paths, uniform( 2,2) inits

-10 0 10 20

(b) approximate draws from multi-path Pathﬁnder

-10 0 10 20

(c) optimization paths, uniform( 15,15) inits

-10 0 10 20

(d) approximate draws from multi-path Pathﬁnder

Figure 10: Results of multi-path Pathﬁnder with varying initial distribution for the eight-schools model with a centered parameterization. Optimization paths (left column) are displayed next to the points points selected by pathﬁnder (right column), with uniform( 2,2) initialization (top row) and uniform( 15,15) initialization (bottom row). Multi-path Pathﬁnder used all 20 optimization paths. The density plots use lighter color for higher density. The optimization paths are plotted in the left panels using white for initial points and orange for later points in the optimization trajectories; later points dive into the neck of the funnel where density increases without bound but volume and hence probability mass is low.

density, the optimization paths may not pass through all regions of high probability mass, resulting in approximate draws clustered in a relatively small region as shown in the right-hand side plot in Figure 10. Meanwhile, ADVI tends to be less sensitive to the initials in this example, outperforming Pathﬁnder in both 1-Wasserstein distance and ELBO, as shown in Appendix D. We will next consider a medium-dimensional instance of Neal s funnel. Letting N in (8) be 99, we ﬁt the 100-dimensional Neal s funnel model with multi-path Pathﬁnder using our proposed default settings. As in the eight-schools example, Pathﬁnder successfully locates the high probability mass region along the optimization paths as illustrated in Figure 11. With Pathﬁnder s default initialization, the approximate draws are underdispersed when Pathﬁnder is started at the region of high posterior density, and the approximation can be greatly improved with a more diffuse initialization distribution.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

-10 -5 0 5 10

(a) optimization paths with uniform( 2,2) initialization

-10 -5 0 5 10 (b) approximate draws from multi-path Pathﬁnder

-10 -5 0 5 10

(c) optimization paths with uniform( 10,10) initialization

-10 -5 0 5 10 (d) approximate draws from multi-path Pathﬁnder

Figure 11: Results of multi-path Pathﬁnder with varying initial distribution for the 100-dimensional Neal s funnel model. The optimization paths (left column) are displayed next to the points points selected by pathﬁnder (right column), with uniform( 2,2) initialization (top row) and uniform( 10,10) initialization (bottom row). Multi-path Pathﬁnder used all 20 optimization paths. The density plots illustrate region with higher probability with lighter color. The optimization paths are plotted in the left panels using white for initial points and orange for later points in the optimization trajectories. The reference samples for density plots are generated by 4 MCMC chains ﬁtted with cmdstanr, with an adaptation period of 100,000 iterations, 850,000 saved iterations, and a thinning rate of 300.

Non-normal posteriors. Given that Pathﬁnder and ADVI rely on normal approximations to generate approximate draws, we are interested in the behavior of Pathﬁnder for posteriors that are far from normal. We focus here on the Gaussian process Poisson regression model gp_pois_regr-gp_pois_regr to explore the performance of Pathﬁnder in approximating non-Gaussian posteriors. As shown in Figure 12, the high probability mass region for this model projected down to two selected dimensions has the shape of a waning moon. The approximate draws generated by Pathﬁnder concentrate in a relatively small region within the high probability mass region. This phenomenon is expected, because Pathﬁnder selects an approximate normal distribution based on minimizing KL divergence to the target density, which favors more concentrated approximations that fall within the bulk of

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

0.0 0.5 1.0 1.5 2.0

(a) optimization paths

0.0 0.5 1.0 1.5 2.0

(b) draws from multi-path Pathﬁnder

Figure 12: (a) 20 optimization paths and (b) the 100 approximate draws of multi-path Pathﬁnder, overlaid on the density plot of two selected parameters for the gp_pois_regr-gp_pois_regr example. Multi-path Pathﬁnder used all 20 optimization paths. Lighter color in the density plot indicates higher density. The optimization paths are shaded from white (initial) to orange (ﬁnal).

the target probability mass. Therefore, Pathﬁnder tends to make a conservative guess on the high probability region when the posterior cannot be well approximated by a normal distribution.

Multimodality. Multimodal posteriors have more than one local optimum. In some cases, such as high-dimensional mixture models or neural networks, the multimodality can be so extreme that it defeats Monte Carlo methods. In other cases, the posterior has one, or maybe a few major modes, with other modes having negligible probability mass. Off-the-shelf MCMC sampling can work in these cases if it s possible to move among the modes, but this is usually too difﬁcult, and specialized samplers for multimodal problems need to be employed, such as bridge sampling (Meng and Wong, 1996). If there is only one major mode, MCMC will succeed if it s initialized near that mode or if chains initialized near minor modes can escape. We observed severely biased MCMC sampling due to the existence of minor modes for 4 of the 49 test models in posteriordb. In Figure 13 we illustrate 100 approximate draws generated by Stan phase I sampler, optimization paths for 20 runs of Pathﬁnder, and 100 approximate draws by multipath Pathﬁnder using all 20 runs of Pathﬁnder for the bball_drive_event_0-hmm_drive_0 example in posteriordb. This posterior has multiple modes as shown in Figure 13b. Figure 13a reveals how Stan s phase I adaptation can get stuck near this minor mode. The importance resampling step of multi-path Pathﬁnder is able to ﬁlter out points around minor modes as shown in Figure 13c. Of course, when all optimization paths are trapped in minor modes, importance resampling cannot recover. As a result, multi-path Pathﬁnder is more robust with more single-path Pathﬁnder runs from which to importance resample. We now turn to ovarian-logistic_regression_rhs, a challenging high-dimensional example from posteriordb. The example works on ovarian cancer data containing gene expression measurements of tissues. The example ﬁts a hierarchical logistic regression for the purpose of discriminating between tissues from different classes (e.g., tumor and normal samples). The model has 3075 parameters and a horseshoe prior that performs soft variable selection (Piironen and Vehtari, 2017). The prior induces multiple major modes, one of which corresponds to all unconstrained coefﬁcients being very close to zero. Piironen and Vehtari (2017, Section 4.2) provide a detailed

ZHANG, CARPENTER, GELMAN, AND VEHTARI

(a) 100 draws from Stan phase I warmup

(b) 20 optimization paths

(c) 100 approximate draws from multi-path Pathﬁnder

Figure 13: (a) 100 initializations generated by Stan phase I warmup, (b) 20 optimization paths, and, and (c) the 100 approximate draws of multi-path Pathﬁnder, overlaid on the density plot of two selected parameters for the bball_drive_event_0-hmm_drive_0 example. The density plot illustrates the region with higher probability with lighter color. In (b), the optimization paths are plotted from white (initialization) to orange (later iterations).

Figure 14: Illustration of multimodality for the ovarian-logistic_regression_rhs example. The 3-d scatterplot illustrates the reference posterior samples for three selected parameters. We can see at least four major modes in this 3-d marginal posterior. Overall, we estimate the joint posterior to have hundreds of modes with non-negligible masses.

discussion about the challenging features in this example. This interesting example is not included in the tests in Sections 3.1 and 3.2 because the reference samples are not available in posteriordb. We produce a reference posterior sample by running 4 adaptive HMC chains in cmdstanr with 10,000 warmup iterations, 25,000 saved iterations, and a thinning rate of 10. Figure 14 provides a three dimensional scatterplot illustrating the multimodality of the reference draws.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

-25 0 25 50

(a) uniform( 2,2) initialization of multi-path Pathﬁnder

-25 0 25 50

(b) random initialization from reference draws of multi-path Pathﬁnder

0.00 0.25 0.50 0.75 1.00 Probability (reference)

Probability (Pathﬁnder)

(c) predictive performance with uniform( 2,2) initialization

0.00 0.25 0.50 0.75 1.00 Probability (reference)

Probability (Pathﬁnder)

(d) predictive performance with reference initialization

Figure 15: Results of multi-path Pathﬁnder with varying initial distribution for the ovarian-logistic_regression_rhs example. With uniform( 2,2) initialization (top left), multi-path Pathﬁnder does not explore both modes and predictive performance is poor (bottom left). When initialized with draws from the reference sample (top right), multi-path Pathﬁnder explores both modes and performance improves (bottom right), but is still poor compared to the reference draws. The density plots illustrate region with higher probability with lighter color. The optimization paths are plotted from white (initialization) to orange (later iterations). The approximate draws of Pathﬁnder are indicated by red dots. The density plots show two major modes for the two selected parameters (top row, background). Predictive performance is the estimated probability of a tumor, with the x-axis deﬁned by the reference draws (bottom row).

We ﬁt multi-path Pathﬁnder with varying initial distributions and illustrate the results in Figure 15. The starting points of Pathﬁnder determine the optimization paths and, therefore, the performance of Pathﬁnder. Meanwhile, we observe a small number of distinct samples returned by multi-path Pathﬁnder even when the optimization paths succeed in ﬁnding the high probability mass region. The small number of distinct samples shows that importance resampling fails to select appropriately diffuse samples, which provides evidence that the approximation found by Pathﬁnder is not good enough.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Due to the high dimensionality and the shape of the posterior, the optimization from random initial values tends to ﬁnd the major mode around the origin, which is consistent with all coefﬁcients being close to zero. This concentration around the origin leads to poor predictive performance as shown in Figure 15c. On the other hand, if Pathﬁnder could somehow be initialized within or near the minor modes, optimization will ﬁnd them as shown in Figure 15b. Thus, even though Pathﬁnder is ﬁnding a single region of high probability, in high dimensions it can fail to ﬁnd other relevant high probability modes.

Weak identiﬁability. Unlike multimodal priors, which have multiple modes, non-identiﬁable posteriors do not have any modes. For example, a uniform(0,1) distribution is ﬂat and does not have a mode. Flat regions in densities must be compact, because ﬂatness in unbounded spaces leads to unnormalizable distributions. In Stan and this paper, we work with unconstrained distributions, where a variable α with a uniform(0,1) distribution is transformed to logit(α) = log α 1 α, which has a logistic(0,1) distribution. Although we cannot have properly ﬂat unbounded posteriors, they can be nearly ﬂat over large regions, which can cause the same computational problems. For example, we see such behavior with collinear predictors in regressions, which produces high posterior variability and high correlation between their regression coefﬁcients. Such ﬂat regions correspond to plateaus in the density and cause difﬁculties with convergence for adaptive HMC, ADVI, and Pathﬁnder. We observe pathological ﬂatness in the posterior of model mcycle_gp-accel_gp from posteriordb, which is one of the most expensive models in posteriordb according to the cost evaluation presented in Figure 5. This example models measurements of head acceleration in a simulated motorcycle accident. Speciﬁcally, the observation y is modeled with a normal distribution having Gaussian process prior on mean function f and log standard deviation function g,

y normal( f,exp(g)), f GP(µf,K1), g GP(µg,K2),

where K1 and K2 denote covariance functions, and µf and µg model the mean of the priors of f and g. Both Gaussian processes f and g are modeled with Hilbert-space approximate basis functions (Solin and Särkkä, 2020; Riutort-Mayol et al., 2020). There are in total 66 parameters. In Figure 16, we illustrate 100 draws from multi-path Pathﬁnder, which are based on resampling 20 optimization paths, for two selected dimensions of mcycle_gp-accel_gp. Figures 16a and 16b show that the optimization paths are almost parallel to each other when passing the high probability region. The gradients of the log density along the x-axis fail to guide optimization paths toward the high probability region of the posterior. Figure 16c illustrates that with specialized initials, the positions of the maxima found by L-BFGS span from 40 to 20 on the x-axis, a strong sign of weak posterior identiﬁability. Since Pathﬁnder relies on estimating curvature along optimization paths to estimate the covariance of normal approximations, Pathﬁnder fails to provide good covariance estimates for normal approximations centered near the high probability mass region in this example. Apart from weakly identiﬁed parameters, we ﬁnd that several parameters exhibit correlation in the posterior. Therefore, L-BFGS with a small history size (J), which uses a low-rank plus diagonal matrix to approximate local Hessian, cannot precisely capture local curvature information. Figure 16e shows how a large history size can improve the performance of Pathﬁnder in this example. Pathﬁnder successfully identiﬁes the high probability mass region among the optimization paths, though the approximate draws are concentrated in a small region due to the weak identiﬁability problem. Figure 16f compares the expectation of f for 133 observations estimated by reference samples and approximate samples from multi-path Pathﬁnder with J = 100. We conclude that

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

-40 -20 0 20

(a) default behavior of multi-path Pathﬁnder

-4 -3 -2 -1 0 1 2

(b) central region of plot (a)

-40 -20 0 20

(c) result of multi-path Pathﬁnder with specialized initials

-40 -20 0 20

(d) Stan phase I sampler draws

-4 -3 -2 -1 0 1 2

(e) reproduce (b) with J = 100

-100 -50 0 Expected mean (reference)

Expected mean (Pathﬁnder)

(f) compare expectation of f of 133 observations

Figure 16: Illustration of weakly identiﬁed posteriors for two parameters in the mcycle_gp-accel_gp example. In (a), we see the result of Pathﬁnder with default initializations in uniform( 2,2); in (b), we zoom in the x-axis of plot (a). The optimization paths are plotted from initialization in white to later iterations in orange. In (c), we initialize parameters on the x axis between 40 and 20, resulting in L-BFGS declaring convergence at various points along the x axis as a result of the weak identiﬁcation. Due to the poor approximations from Pathﬁnder, the importance resampling step in multi-path Pathﬁnder only picks 4 distinct draws (recall that sampling is with replacement). Short runs of adaptive HMC perform better than Pathﬁnder (d), but are expensive in terms of density and gradient evaluations. Figure (e) shows that a large history size (J = 100 as opposed to our default J = 5) improves the local Hessian approximations in optimization. Figure (f) compares posterior expectations of parameters µf , the mean of the Gaussian process for f , in the reference draws (horizontal axis) and draws from multi-path Pathﬁnder with history length of J = 100 (vertical axis).

approximate inference from Pathﬁnder is reasonable in this case because expectations are close between Pathﬁnder and the reference draws as indicated by the 45-degree line where y = x.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

(a) 20 optimization paths

(b) 100 approximate draws from multi-path Pathﬁnder

Figure 17: Illustration of non-normality in two additional dimensions of the mcycle_gp-accel_gp example. The 20 optimization paths are shown in (a) with later iterations in orange. The 100 approximate draws selected by multi-path Pathﬁnder are shown in (b) as red dots overlaid on the target density. The target density is clearly not normal in these two dimensions. Due to the poor approximations from Pathﬁnder, importance resampling with replacement in multi-path Pathﬁnder only generates 4 distinct draws among the 100 approximate draws it chooses because of the high variance of the importance weights in this example.

In addition to weakly identiﬁed and correlated parameters, mcycle_gp-accel_gp has other features that frustrate Pathﬁnder. In Figure 17, we see that the high probability region in the other two dimensions is shaped like a boomerang. Because the posterior distribution is far from normal, none of the parametric VI algorithms ﬁnd good approximations. The approximate draws by multi-path Pathﬁnder on the two selected dimensions are shown in Figure 17b. Although Pathﬁnder fails to provide a good normal approximation, the draws from Pathﬁnder are close to the target region for most of the parameters in this example. For difﬁcult high-dimensional examples such as the Gaussian process mcycle_gp-accel_gp and the hidden Markov model bball_drive_event_0-hmm_drive_0, the approximate draws are concentrated in a small region, but they can be useful starting points for more elaborate and more costly algorithms.

4. Using Pathﬁnder to initialize MCMC

Markov chain Monte Carlo methods, including adaptive Hamiltonian Monte Carlo, are typically initialized randomly and then allowed to evolve until multiple chains have converged. This informal notion of convergence can mean several things. In the strictest sense, it requires running the chains long enough to fully forget their initial positions, as measured, for example, by coupling (Jacob et al., 2020). Such forgetting is necessary to avoid estimation bias due to the initialization. A less strict criterion only requires adequate mixing of the chains, which may be measured through withinand between-chain variances (Gelman and Rubin, 1992; Vehtari et al., 2021). The most generous notion of convergence provides the modest goal of using Pathﬁnder for initialization, namely ﬁnding a point, or several points, that look as if they might have been drawn from the posterior. This is the point at

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

which the transient bias due to not initializing with a draw from the target is mostly eliminated and estimation error starts to become dominated by sampling error (Angelino et al., 2016, Section 2.2.4). Initialization to remove most of the transient bias of MCMC can succeed by producing draws within the high probability volume of the posterior without covering that posterior it only requires draws to look reasonable. Besides, even if initial points are underdispersed, if they are close enough to the high probability mass region, the total variance would still likely increase faster than the within chain variance. As there will be additional warmup after Pathﬁnder, the probability of having high between-chain variance, would be negligible. Therefore, the convergence diagnostic based on withinand between-chain variances is still reliable with underdispersed initials in high probability mass region. We can diagnose problems with Pathﬁnder s distributional approximation by using the Pareto-k statistic (Yao et al., 2018). In such cases, Pathﬁnder can still be valuable as a starting distribution for the adaptation phase of MCMC, improving current practice whereby adaptation can be slow when starting points are not chosen well.

The birthday problem Gaussian process. We consider the challenging problem of modeling the time series of the number of births by day, where the default no-U-turn Hamiltonian Monte Carlo sampler in Stan tends to get stuck in local minor modes. This time series shows periodic trends at weekly and annual scales, as well as longer-term trends and dateand holiday-based behavior. This is an attractive example for Bayesian inference and computation because national-level data are publicly available, sample sizes are large enough that ﬁne-grained patterns can be detected with careful analysis, the underlying scientiﬁc questions are accessible and of general interest, and the repetition at different time scales is not exact (for example, there is a general pattern of more births in the summer than in the winter, but that contrast changes over time and varies by country). Together, these features motivate the use of Gaussian processes, a class of models that support ﬂexible patterns of ad hoc, trend-based, and periodic patterns at different time scales. We follow Gelman et al. (2013) in analyzing the number of births per day in the United States from 1969 through 1988 with Gaussian processes (GP). Applying GPs at this scale is challenging because of the need to work with an N N covariance matrix, where N is the number of days in the series. The generic form of the likelihood is

yn normal( f (xn),σ), (9)

where y is the time series of logarithm of number of births by day. (The counts are large enough that we do not need to worry about their discreteness.) The function f is assigned a Gaussian process prior, which takes into account a global trend f1, yearly seasonal trend f2, the effect of weekdays βday of week, and the day of year effect βday of year,

f = α + f1 + f2 + βday of week + βday of year, (10)

α normal(0,1), f1 GP(0,K1), f2 GP(0,K2),

βday of week = 0 if day of week is Monday,

βday of week normal(0,1) if day of week is not Monday, and

βday of year normal(0,0.1),

where the ﬁrst GP uses the exponentiated quadratic covariance function K1 and the second a periodic covariance function K2. Most years have 365 calendar days and every four years (during the data range) there are 366 day, and thus we simplify and use period of 365.25 for the periodic component.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

-1 0 1 2 parameter 1

parameter 2

(a) draws from Stan phase I (1-W = 9.65)

-1 0 1 2 parameter 1

parameter 2

(b) draws from multi-path Pathﬁnder (1-W = 5.45)

-1 0 1 2 parameter 1

parameter 2

(c) draws from Pathﬁnder + Stan phase I (1-W = 5.32)

20 40 60 iteration

log-density

(d) Stan phase I traceplots with random inits from uniform( 2,2)

20 40 60 iteration

log-density

(e) Stan phase I traceplots with Pathﬁnder inits

6000 8000 10000 12000 number of births (reference)

number of births (NUTS)

(f) expected births, reference vs. ﬁnal Stan phase I

6000 8000 10000 12000 number of births (reference)

number of births (Pathﬁnder)

(g) expected births, reference vs. multi-path Pathﬁnder

6000 8000 10000 12000 number of births (reference)

number of births (pathﬁnder + NUTS)

(h) expected births, reference vs. Pathﬁnder + Stan phase I

Figure 18: Comparison of 75 iterations of adaptive HMC (Stan Phase I adaptation), multi-path Pathﬁnder, and HMC initialized by multi-path Pathﬁnder for the birthday Gaussian process. Top row: draws in orange against the target density of two selected parameters for (a) draws from 20 adaptive HMC chains initialized with uniform( 2,2), (b) 20 approximate draws from multi-path Pathﬁnder, and (c) draws from 20 adaptive HMC chains initialized by multi-path Pathﬁnder. Draws from multi-path Pathﬁnder are closer to the target than 75 iterations of adaptive HMC. Middle row: traceplots of 75 adaptive HMC iterations initialized randomly (d) and with Pathﬁnder (e). Bottom row: (f) expected number of births by day estimated by reference samples and draws from adaptive HMC, (g) multi-path Pathﬁnder, and (h) adaptive HMC initialized by multi-path Pathﬁnder. Pathﬁnder-initialized HMC provides inference closest to the reference.

The Gaussian process priors for f1 and f2 are modeled by the Hilbert space approximate basis function approximation of Gaussian processes (Solin and Särkkä, 2020; Riutort-Mayol et al., 2020). This model includes a total of 429 parameters.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

To evaluate the performance of Pathﬁnder, we compare the approximate draws obtained by adaptive Hamiltonian Monte Caro (Stan s phase I sampler), multi-path Pathﬁnder, and adaptive HMC initialized with draws from multi-path Pathﬁnder. For multi-path Pathﬁnder, we generate 20 approximate draws using 20 optimization paths and the default setting of tuning parameters. For Stan s phase I sampler, we run 20 adaptive HMC chains with 75 iterations for each, and take the samples at the last iteration as the approximate draws. The trace plots of log density plotted in Figure 18 indicate multiple modes in the posterior distribution. Initializing an MCMC algorithm with Pathﬁnder avoids wasting computation time sampling within minor modes with negligible posterior mass. To compute a reference for comparison, we ran 4 MCMC chains with an adaptation period of 50,000 iterations, 300,000 saved iterations, and a thinning rate of 100 using cmdstanr. In Figure 18, we illustrate reference posterior draws and approximate draws of three competitors on two selected dimensions. The reference posterior has the shape of an hourglass, which is found by all draws from Pathﬁnder-initialized adaptive HMC. The 1-Wasserstein distance between the reference posterior draws and draws after 75 iterations of adaptive HMC as well as the comparison of the estimated number of births highlight the beneﬁts of initializing MCMC with draws from Pathﬁnder.

Figure 18a show that 5 of the 20 randomly initialized adaptive HMC chains failed to reach the high probability region, while the remaining chains provided reasonable approximate draws. Figure 18d shows the 5 non-converging MCMC chains were trapped in minor modes. Convergence diagnostic results are consistent with multimodality. Therefore, initializing MCMC chains with Pathﬁnder helps to avoid the local minor modes in the subsequent MCMC sampling. On average, the number of gradient evaluations of an adaptive HMC chain in this example is around 10 times that of single-path Pathﬁnder. Pathﬁnder further requires around 5 times more log-density evaluations than gradient evaluations to estimate ELBOs and generate approximate draws. With our experimental Pathﬁnder implementation in cmdstanr, the average running time of a single-path Pathﬁnder is around one-third of that of a single 75 iteration HMC chain (3s vs 9s). In summary, initializing MCMC with Pathﬁnder can improve the overall sampling efﬁciency.

5. Discussion

The Pathﬁnder algorithm we present in this paper is similar to automatic differentiation variational inference (Kucukelbir et al., 2017), with two key differences. First, instead of performing stochastic gradient descent directly on the evidence lower bound, we use quasi-Newton methods to directly optimize the objective function. Second, rather than a diagonal or dense covariance approximation, Pathﬁnder uses a low-rank plus diagonal factorization that is more expressive than a simple diagonal covariance matrix, but nearly as cheap to employ. Direct optimization is much more stable because we can use cheap L-BFGS estimates of inverse Hessians to locally condition our optimization steps (Nocedal, 1980). Pathﬁnder can evaluate the ELBO of normal approximations built upon L-BFGS estimates of inverse Hessians along this path (in parallel), and we know from the intermediate value theorem that if starting from the tail of the distribution, the optimization trajectory will move from the tail, through the body of the distribution, to the mode or pole. In our examples, Pathﬁnder requires one to two orders of magnitude fewer log density and gradient evaluations than using automatic differentiation variational inference or using dynamic Hamiltonian Monte Carlo to warm up. Pathﬁnder provides approximations slightly inferior to that of adaptive step-size Hamiltonian Monte Carlo, but better than mean-ﬁeld or dense ADVI as measured by 1-Wasserstein distance.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Multi-path Pathﬁnder improves on the robustness of single-path Pathﬁnder by running multiple single-path instances, then importance resampling. We have shown that this can both ﬁlter out minor modes where the algorithm might otherwise get stuck, and improve ﬁdelity in representing irregular posteriors. In this usage for black-box variational inference, Pathﬁnder is similar to using multiple short MCMC paths for inference (Hoffman and Ma, 2020). In most cases, the approximation provided by multi-path Pathﬁnder is comparable to or better than that of short chains of Hamiltonian Monte Carlo. Although multi-path Pathﬁnder requires many log density and gradient evaluations, these can be parallelized, unlike the necessarily serial evaluation of Markov chain Monte Carlo.

For posteriors with high computational cost, it might be useful to use Bayesian optimization (Shahriari et al., 2015) to ﬁnd the ELBO maximizing point along the L-BFGS optimization trajectory. Such optimization could also consider parameter values between L-BFGS iterations, as L-BFGS might sometimes take a big step over the optimal point. Multi-path Pathﬁnder is more robust than single-path Pathﬁnder to the variability due to the small number of Monte Carlo draws used to evaluate the ELBO, because importance resampling will favor the best approximate distributions from multiple single-path runs.

The performance of Pathﬁnder can be sensitive to the initial values if they happen to be underdispersed and close to the mode. To make Pathﬁnder more robust to initial values, it would be possible to create the ﬁrst approximation with any initial values and then use that approximation to generate additional initial values from a normal distribution with much larger scale than the scale of the ﬁrst approximation (e.g., 10 times larger scale). For example, in the ovarian cancer example in Section 3.2, this approach would be likely to ﬁnd additional modes beyond the mode at the origin.

Pathﬁnder can be part of a more effective computational workﬂow, starting with fast multivariate optimization, moving to Pathﬁnder s distributional approximation, and then if necessary moving to fully stochastic MCMC algorithms. Even biased approximate inference can be useful if it can produce one reasonable draw quickly or several in parallel. If such draws are unreasonable, there is a good reason to believe the model is misspeciﬁed or has an error in its code. This allows us to fail fast during model development, a perspective from software engineering (Shore, 2004) that we recommend applying to statistical workﬂow (Gelman et al., 2020; Gabry et al., 2019).

Perhaps the most obvious extension to consider is using optimizers other than L-BFGS. For example, it might be possible to improve scalability by subsampling long data sets and using stochastic gradient descent (Robbins and Monro, 1951) or one of its modern adaptive variants such as Adam (Kingma and Ba, 2014) on the objective (not on the evidence lower bound). The overall algorithm should be tolerant to data subsampling because we do not need high tolerance for mode ﬁnding. Subsampling may also be more robust in the face of minor modes (i.e., local optima).

One of our goals in developing Pathﬁnder was to ﬁnd a drop-in replacement to initialize the warmup phase of MCMC sampling. For Gibbs samplers, that would only involve ﬁnding a reasonable starting point. In contrast, adaptive Metropolis and Hamiltonian Monte Carlo samplers require us to generate an approximate covariance matrix for either a proposal distribution or to pre-condition Hamiltonians. Thus ﬁnding initial values is only the ﬁrst phase of warmup and adaptation. In Stan, the second phase of warmup involves exploring the posterior to estimate the posterior covariance in either diagonal or dense form. Following Bales et al. (2019), we have veriﬁed that the estimated inverse Hessian from L-BFGS is a reasonable initalization for such adaptation, and may in fact be accurate enough to bypass Stan s phase II adaptation, at least for low-dimensional models where it is easier to estimate covariance. This would remove the remaining serial processing bottleneck for

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

efﬁcient parallel MCMC. In more difﬁcult problems, it should be possible to use the information provided by Pathﬁnder to perform warmup adaptation more efﬁciently. With a few dozen test models of fairly low dimension and a handful of higher-dimensional examples, we have only scratched the surface of potential applications. We believe the use of local curvature information in L-BFGS should help with ﬁtting in both higher dimensions and in situations with poor conditioning locally due to varying curvature, compared to either optimizing the ELBO directly using stochastic gradient methods or subsampling short chains of Hamiltonian Monte Carlo.

Acknowledgments

We thank Dan Simpson, Ben Bales, Colin Carroll, and Philip Greengard for helpful comments and Matt Hoffman for motivation, discussion, and clariﬁcation. Andrew Gelman and Lu Zhang thank the U.S. National Science Foundation (grant 2055251), National Institutes of Health (grant 1R01AG06714901), Ofﬁce of Naval Research (grant N000141912204), Institute for Education Sciences (grant R305D190048), Alfred P. Sloan Foundation (grant G201912491), and Schmidt Futures for partial support of this work. Aki Vehtari thanks the Academy of Finland Flagship programme: Finnish Center for Artiﬁcial Intelligence, FCAI, for partial support of this work.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Appendix A. Derivation of (4)

From (3), we have

Σ(l) = diag(α 1 2 )

I + diag(α 1

2 ) β | {z }

γ β diag(α 1

diag(α 1 2 )

= diag(α 1 2 ) I + Q R γ R Q diag(α 1 2 )

= diag(α 1 2 ) I + Q P R γ R 0 0 0

diag(α 1 2 )

= diag(α 1 2 )

Q P I 0 0 I

+ Q P R γ R 0 0 0

diag(α 1 2 )

= diag(α 1 2 ) Q P I + R γ R 0 0 I

diag(α 1 2 ) [because I + R γ R = L L ]

= diag(α 1 2 ) Q P L L 0 0 I

diag(α 1 2 )

= diag(α 1 2 ) Q P L 0 0 I

diag(α 1 2 )

= diag(α 1 2 ) Q L P

diag(α 1 2 ).

Appendix B. 1-Wasserstein distance

The 1-Wasserstein distance between two probability measures is based on the amount of effort it would take to rearrange one probability measure to look like the other, where effort is measured according to the distance mass is moved (Craig, 2016; Villani, 2009; Mc Cann, 1995). If each distribution is viewed as a unit volume of earth piled according to the density, then the 1-Wasserstein distance is the minimum cost of turning one pile into the other, with cost deﬁned as the amount of earth that needs to be moved times the mean distance it has to be moved. Unlike KL divergence, Wasserstein distance is a proper distance metric in that it is symmetric, all distances are non-negative, it obeys the triangle inequality because mass can be moved in two steps, and distance between a distribution and itself is zero, because no probability mass needs to be adjusted. Both multi-path Pathﬁnder and Stan s current no-U-turn sampler generate nonparametric approximations. In simulation studies in Section 3, we use the empirical distribution of 100 approximate draws to measure the Wasserstein distance for approximations generated by multi-path Pathﬁnder and Stan s no-U-turn sampler. We use the empirical distribution based on the 10,000 reference draws from posteriordb to represent the target posterior distribution. In order to compare multi-path Pathﬁnder and Stan s no-U-turn sampler to single-path Pathﬁnder and ADVIs, a discrete form of Wasserstein distance is applied to 100 samples drawn from the approximate distributions produced

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

-0.45 -0.40 -0.35 -0.30 -0.25

(a) 1-W distance = 0.034

-0.45 -0.40 -0.35 -0.30 -0.25

(b) 1-W distance = 0.017

-0.45 -0.40 -0.35 -0.30 -0.25

(c) 1-W distance = 0.009

Figure 19: Scatterplot of the 100 approximate draws for the dogs-dogs_log example from posteriordb. Higher target densities are shaded lighter. The subtitles provide 1-Wasserstein (1-W) distances to the target density.

by single-path Pathﬁnder and ADVIs. In summary, we estimate distances between reference draws from the target posterior and the approximate draws from Pathﬁnder (single and multi-path), ADVI (with diagonal and dense covariance), and Stan s current no-U-turn sampler. We provide an example in Figure 19 to demonstrate the 1-Wasserstein distances between reference samples and approximate draws with different patterns.4

Appendix C. Importance sampling

Importance sampling is a method for adjusting expectation estimates based on an approximate proposal distribution so that they more closely resemble a target distribution. Rather than weighting the draws equally, importance sampling reweights them according to the ratio of the target density to the proposal density. If the proposal distribution is q(θ) and the target p(θ), the importance weight of a draw θ q(θ) is w(θ) = p(θ)/q(θ). The importance sampling estimate of an expectation with respect to importance weighted expectation estimate over a target density p given a sequence of proposal draws θ(1),. . .,θ(M) q(θ) is

Ep(θ)[ f (θ)] 1

m=1 w(θ(m)) f (θ(m)).

Appendix D. ELBO comparison for simulation studies in Section 3.1

In the body of the paper, we used Wasserstein-1 distance to evaluate the quality of approximate draws from a target distribution. In Figure 3, we show the corresponding values of the ELBO derived by single-path Pathﬁnder and ADVI in both its diagonal ( mean-ﬁeld ) and dense ( full rank ) versions for the collection of posteriordb models we considered in Section 3.1. In all cases, we use 100 approximate draws to compute a Monte Carlo estimate of ELBO by (7). The ELBO comparison is

4. We use the function wasserstein() from the R package transport (Schuhmacher et al., 2020) to calculate the 1-Wasserstein distance between two sets of draws.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

< 10000 1000 100 10 0 10 >100 shifted ELBO

dense ADVI mean field ADVI pathfinder

Figure 20: Box plots of estimated ELBO values (higher is better) for single-path Pathﬁnder and ADVI for the 20 examples in posteriordb. Each box plot displays estimated ELBOs of 100 independent runs of (single-path) Pathﬁnder, mean-ﬁeld ADVI, and dense ADVI. We estimate the ELBO with 100 approximate draws from each run. All values are shifted by the median of the estimated ELBO values for single-path Pathﬁnder.

consistent with the 1-Wasserstein distance comparison in Section 3.1. Pathﬁnder results in better (higher) ELBO values than ADVI for most of the examples from posteriordb. Meanwhile, for model mcycle_gp-accel_gp, eight_schools-eight_schools_noncentered and eight_schools-eight_schools_centered, the advantages of ADVI over Pathﬁnder are more prominent in ELBO comparison than in Wasserstein distance comparison. Based on the case studies in Section 3.3, the ELBO more heavily penalizes underdispersed approximations than Wasserstein distance when the approximate samples are close to the high probability mass region.

Appendix E. Laplace approximation for simulation studies in Section 3.1

In this section, we present an experiment to compare Pathﬁnder with standard Laplace approximation. The comparison is based on the same examples in the simulation study in Section 3.1. We deﬁne the standard Laplace approximation as the Laplace approximation at the mode found by optimization.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

To generate standard Laplace approximations, we ﬁrst use L-BFGS to generate an optimization trajectory until it converges. Then we compute the Hessian of the negative log density at the end of the optimization path by function jacobian() from the R package num Deriv. The L-BFGS for ﬁnding mode and the L-BFGS in Pathﬁnder share the same tuning parameters except the maximum iteration. The Laplace is the Normal approximation with mean at the mode and covariance matrix equal to the inverse of the computed Hessian. We generate 100 approximate draws from the Laplace approximation and compute the 1-Wasserstein distance between the empirical distribution of the approximate draws and the target distribution. In Figure 21 we compare multi-path Pathﬁnder with Laplace approximation based on 1-Wasserstein distances. The distances are scaled by the median of the 1-Wasserstein distance for multi-path Pathﬁnder for each example. Normal approximation at the mode (aka Laplace) and Pathﬁnder are both based on optimization. If the mode exists, Pathﬁnder s path will also end at the mode. In case of Laplace, we trust that the mode is a good center of the normal approximation. In case of Pathﬁnder we use ELBO estimates to check if some other place along the path would be better. If the ELBO could be estimated exactly, in case of low-dimensional normal posterior, Pathﬁnder s approximation would match the Laplace approximation. As ELBO is estimated using a small number of Monte Carlo draws, there is additional variability and Pathﬁnder s approximation can be little off compared to the Laplace. In Figure 21 for nearly normal posteriors like example nes2000-nes, due to the error of the ELBO estimates, Pathﬁnder might pick approximations slightly inferior to the Laplace approximation. In Section 5 we discuss ways to improve the accuracy of ELBO estimation near the optimal ELBO. In the experiments we intentionally used low J for the low-rank part of Pathﬁnder s approximation. Thus for close to normal posteriors with posterior dependencies, Laplace with dense covariance matrix beats the low-rank plus diagonal approximation even if that approximation would also be at the mode. The effect of this is clearly seen in case of diamonds-diamonds posterior that has close to normal posterior with strong dependencies in 25 dimensions. If such strong dependencies would be expected, Pathﬁnder s accuracy can be trivially improved by increasing J which naturally then increases also memory and computation costs. As shown in Figure 8, the Wasserstein distances for single-path Pathﬁnder can be reduced by half with a larger J. On the other hand, the implementation of standard Laplace quickly become infeasible as N grows. Standard Laplace not only requires O(N3) ﬂops and O(N2) memory for obtaining the Cholesky decomposition of the inverse Hessian, the cost of autodiff for Hessian computation is also expensive, especially for high-dimensional problems. If the mode doesn t exist or the second derivatives don t exist at the mode (e.g., for eight_schools_centered and mcycle_gp-accel_go ), then the Laplace approximation fails but Pathﬁnder is still likely to be able to provide an approximation. Even if the mode and second derivatives at the mode exist, Pathﬁnder is safer for posteriors that are far from normal or multimodal. Taking sblrc-blr for example, the posterior is skewed due to a small sample size, and we observe that Pathﬁnder works better than Laplace. We are thus trading some accuracy (due to the stochastic ELBO estimates) in case of close to normal posteriors to faster computation (low rank part) and better accuracy in case of no-normal distributions (use of ELBO).

ZHANG, CARPENTER, GELMAN, AND VEHTARI

bball_drive_event_0 hmm_drive_0

eight_schools_centered

eight_schools_noncentered

garch garch11

mcycle_gp accel_gp

dogs dogs_log

gp_pois_regr gp_pois_regr

hmm_example

nes2000 nes

one_comp_mm_elim_abs

gp_pois_regr gp_regr

low_dim_gauss_mix

hudson_lynx_hare lotka_volterra

bball_drive_event_1 hmm_drive_1

diamonds diamonds

earnings logearn_interaction

arma arma11

1/4 1/2 1 2 4 8 16 scaled 1 Wasserstein distance

Laplace multi path pathfinder

Figure 21: Box plots of 1-Wasserstein distances between the reference posterior samples and approximate draws from multi-path Pathﬁnder and Laplace approximation for the 20 models in posteriordb. Each box plot displays 1-Wasserstein distances of 100 independent runs of (multipath) Pathﬁnder and Laplace approximation. Distances for each model are scaled by the median of the 1-Wasserstein distances for multi-path Pathﬁnder. The Laplace approximation is not available for examples eight_schools-eight_schools_centered and mcycle_gp-accel_gp, since the Hessian at their modes are singular.

Appendix F. Pseudocode for ELBO estimation and L-BFGS

Algorithm 6 ELBO estimation

Input: log p(φ(1)),. . .,log p(φ(K)): target log densities log q(φ(1)),. . .,log q(φ(K)): log densities of approximation distribution Output: ELBO: Monte Carlo estimate of the evidence lower bound

1: procedure ELBO(log p(φ(1:K)),log q(φ(1:K)))

2: let ELBO = 1

K ÍK k=1 log p(φ(k)) log q(φ(k)) O(K)

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

Algorithm 7 L-BFGS.

Input: log p: differentiable log density function of dimension N θinit: initial value in support with log p(θinit) ﬁnite ( uniform( 2,2)N) J: size of the history used to approximate the inverse Hessian (6) τrel: relative tolerance of log density change for convergence (10 13) L: maximum number of iterations (1000) c: pair of bounds for Wolfe condition on line search, with 0 < c1 c2 < 1 (10 4,0.9) ϵ: positivity threshold for updating covariance estimate (2.2 10 16) Output: θ(1),. . .,θ(L ): optimization path with L L and log p(θ(l)) < log p(θ(l+1)) log p(θ(1)),. . ., log p(θ(L )): gradient log density of points on optimization path

1: procedure L-BFGS(log p,θinit, J,τrel, L,c,ϵ)

2: let θ(0) = θinit, S = , Z = , and α = 1N, where 1N denotes the N-vector of 1s. O(N)

3: for l 0 : L 1 do O(LJ2N)

4: generate β and γ by (2) O(J2N)

5: let δ = (diag(α) + β γ β ) log p(θ(l)) be the search direction O(JN)

6: for λ 1, 1

7: let θ(l+1) = θ(l) + λ δ O(N)

8: break if the Wolfe conditions are satisﬁed, O(N)

log p(θ(l+1)) log p(θ(l)) + c1 log p(θ(l)) (λ δ)

log p(θ(l+1)) δ c2 log p(θ(l)) δ

9: return if log p(θ(l+1)) log p(θ(l)) log p(θ(l)) < τrel

10: let Sl+1 = θ(l+1) θ(l) and Zl+1 = log p(θ(l)) log p(θ(l+1)),

11: if S l+1Zl+1 > ϵ Zl+1 2 (Condition 3.9 of Byrd et al. (1995)) then

12: if more than J 1 updates are stored, delete the ﬁrst columns of S and Z

13: let S = S Sl+1 and Z = Z Zl+1

14: let α = Zl+1 2

S l+1 Zl+1 1N O(N)

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269 342, 2010.

Elaine Angelino, Matthew James Johnson, and Ryan P. Adams. Patterns of scalable Bayesian inference. Foundations and Trends in Machine Learning, 9(2-3):119 247, 2016.

Ben Bales, Arya Pourzanjani, Aki Vehtari, and Linda Petzold. Selecting the metric in Hamiltonian Monte Carlo. ar Xiv preprint ar Xiv:1905.11916, 2019.

Michael Betancourt. A conceptual introduction to Hamiltonian Monte Carlo. ar Xiv preprint ar Xiv:1701.02434, 2017.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Michael Betancourt and Mark Girolami. Hamiltonian Monte Carlo for hierarchical models. Current Trends in Bayesian Methodology with Applications, pages 79 101, 2015.

Eli Bingham, Jonathan P Chen, Martin Jankowiak, Fritz Obermeyer, Neeraj Pradhan, Theofanis Karaletsos, Rohit Singh, Paul Szerlip, Paul Horsfall, and Noah D Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973 978, 2019.

David M. Blei, Alp Kucukelbir, and Jon D. Mc Auliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859 877, 2017.

James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+Num Py programs, 2018. URL http://github.com/google/jax.

Charles George Broyden. The convergence of a class of double-rank minimization algorithms 1. General considerations. IMA Journal of Applied Mathematics, 6(1):76 90, 1970.

Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129 156, 1994.

Richard H. Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientiﬁc Computing, 16(5):1190 1208, 1995.

Bob Carpenter, Matthew D. Hoffman, Marcus Brubaker, Daniel Lee, Peter Li, and Michael Betancourt. The Stan math library: Reverse-mode automatic differentiation in C++. ar Xiv preprint ar Xiv:1509.07164, 2015.

Katy Craig. The exponential formula for the Wasserstein metric. ESAIM: Control, Optimisation and Calculus of Variations, 22(1):169 187, 2016.

Akash Kumar Dhaka, Alejandro Catalina, Michael Riis Andersen, Måns Magnusson, Jonathan H. Huggins, and Aki Vehtari. Robust, accurate stochastic optimization for variational inference. ar Xiv preprint ar Xiv:2009.00666, 2020.

Akash Kumar Dhaka, Alejandro Catalina, Manushi Welandawe, Michael Riis Andersen, Jonathan Huggins, and Aki Vehtari. Challenges and opportunities in high-dimensional variational inference. ar Xiv preprint ar Xiv:2103.01085, 2021.

Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, and Rif A. Saurous. Tensorﬂow distributions. ar Xiv preprint ar Xiv:1711.10604, 2017.

David Duvenaud, Dougal Maclaurin, and Ryan Adams. Early stopping nonparametric variational inference. Proceedings of Machine Learning Research, 51:1070 1077, 2016.

Víctor Elvira, Luca Martino, David Luengo, and Mónica F. Bugallo. Generalized multiple importance sampling. Statistical Science, 34(1):129 155, 2019.

Roger Fletcher. Practical Methods of Optimization (Second Edition). Wiley, 1987.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

Jonah Gabry, Daniel Simpson, Aki Vehtari, Michael Betancourt, and Andrew Gelman. Visualization in Bayesian workﬂow (with discussion). Journal of the Royal Statistical Society: Series A, 182(2): 389 402, 2019.

Hong Ge, Kai Xu, and Zoubin Ghahramani. Turing: A language for ﬂexible probabilistic inference. In International Conference on Artiﬁcial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1682 1690, 2018.

Andrew Gelman and Donald B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457 472, 1992.

Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis (Third Edition). CRC Press, 2013.

Andrew Gelman, Aki Vehtari, Daniel Simpson, Charles C. Margossian, Bob Carpenter, Yuling Yao, Lauren Kennedy, Jonah Gabry, Paul-Christian Bürkner, and Martin Modrák. Bayesian workﬂow. ar Xiv preprint ar Xiv:2011.01808, 2020.

Jean Charles Gilbert and Claude Lemaréchal. Some numerical experiments with variable-storage quasi-Newton algorithms. Mathematical Programming, 45:407 435, 1989.

Donald Goldfarb. A family of variable-metric methods derived by variational means. Mathematics of Computation, 24(109):23 26, 1970.

Matthew Hoffman and Yian Ma. Black-box variational inference as a parametric approximation to Langevin dynamics. Proceedings of Machine Learning Research, 119:4324 4341, 2020.

Matthew D. Hoffman and Andrew Gelman. The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593 1623, 2014.

Edward L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295 311, 2008.

Pierre E. Jacob, John O Leary, and Yves F. Atchadé. Unbiased Markov chain Monte Carlo methods with couplings. Journal of the Royal Statistical Society: Series B, 82(3):543 600, 2020.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ar Xiv preprint ar Xiv:1412.6980, 2014.

Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(1):430 474, 2017.

Måns Magnusson, Paul Bürkner, and Aki Vehtari. posteriordb: A database of Bayesian posterior inference, 2021. URL https://github.com/stan-dev/posteriordb.

Robert J. Mc Cann. Existence and uniqueness of monotone measure-preserving maps. Duke Mathematical Journal, 80(2):309 323, 1995.

Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, pages 831 860, 1996.

ZHANG, CARPENTER, GELMAN, AND VEHTARI

Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. ar Xiv preprint ar Xiv:1906.10652, 2019.

Radford M. Neal. Slice sampling. Annals of Statistics, 31:705 741, 2003.

Jorge Nocedal. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773 782, 1980.

Omiros Papaspiliopoulos, Gareth O. Roberts, and Martin Sköld. Non-centered parameterisations for hierarchical models and data augmentation. Bayesian Statistics, 7:307 326, 2003.

Juho Piironen and Aki Vehtari. Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2):5018 5051, 2017.

R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2021. URL https://www.R-project.org/.

Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artiﬁcial Intelligence and Statistics, pages 814 822. PMLR, 2014.

Danilo Rezende and Shakir Mohamed. Variational inference with normalizing ﬂows. Proceedings of Machine Learning Research, 37:1530 1538, 2015.

Gabriel Riutort-Mayol, Paul-Christian Bürkner, Michael R. Andersen, Arno Solin, and Aki Vehtari. Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. ar Xiv preprint ar Xiv:2004.11408, 2020.

Herbert Robbins and Sutton Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400 407, 1951.

Donald B. Rubin. Estimation in parallel randomized experiments. Journal of Educational Statistics, 6(4):377 401, 1981.

Donald B. Rubin. A noniterative sampling/importance resampling alternative to the data augmentation algorithm for creating a few imputations when fractions of missing information are modest: The SIR algorithm. Journal of the American Statistical Association, 82(398):543 546, 1987.

John Salvatier, Thomas V. Wiecki, and Christopher Fonnesbeck. Probabilistic programming in Python using Py MC3. Peer J Computer Science, 2:e55, 2016.

Dominic Schuhmacher, Björn Bähre, Carsten Gottschlich, Valentin Hartmann, Florian Heinemann, and Bernhard Schmitzer. transport: Computation of Optimal Transport Plans and Wasserstein Distances, 2020. URL https://cran.r-project.org/package=transport. R package version 0.12-2.

Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148 175, 2015.

David F. Shanno. Conditioning of quasi-Newton methods for function minimization. Mathematics of Computation, 24(111):647 656, 1970.

PATHFINDER: QUASI-NEWTON POSTERIOR APPROXIMATION

Jim Shore. Fail fast [software debugging]. IEEE Software, 21(5):21 25, 2004.

Arno Solin and Simo Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, 30(2):419 446, 2020.

Stan Development Team. Stan Reference Manual, 2021a. URL https://mc-stan.org/docs/ 2_26/reference-manual/index.html.

Stan Development Team. Stan User s Guide, 2021b. URL https://mc-stan.org/docs/2_ 26/stan-users-guide/index.html.

Achille Thin, Yazid Janati, Sylvain Le Corff, Charles Ollion, Arnaud Doucet, Alain Durmus, Eric Moulines, and Christian Robert. Invertible ﬂow non equilibrium sampling. ar Xiv preprint ar Xiv:2103.10943, 2021.

Aki Vehtari, Simo Sarkka, and Jouko Lampinen. On MCMC sampling in Bayesian MLP neural networks. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, volume 1, pages 317 322. IEEE, 2000.

Aki Vehtari, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. Pareto smoothed importance sampling. ar Xiv preprint ar Xiv:1507.02646, 2019.

Aki Vehtari, Andrew Gelman, Daniel Simpson, Bob Carpenter, and Paul-Christian Bürkner. Ranknormalization, folding, and localization: An improved b R for assessing convergence of MCMC. Bayesian Analysis, 2021.

Cédric Villani. Optimal Transport. Springer, 2009.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, et al. Sci Py 1.0: Fundamental algorithms for scientiﬁc computing in Python. Nature Methods, 17(3):261 272, 2020.

Martin J. Wainwright and Michael I. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., 2008.

Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work?: Evaluating variational inference. Proceedings of Machine Learning Research, 80:5581 5590, 2018.

Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550 560, 1997.