# Bayesian Inference via Sparse Hamiltonian Flows

Naitong Chen, Zuheng Xu, Trevor Campbell
Department of Statistics, University of British Columbia
[naitong.chen | zuheng.xu | trevor]@stat.ubc.ca

## Abstract

A Bayesian coreset is a small, weighted subset of data that replaces the full dataset during Bayesian inference, with the goal of reducing computational cost. Although past work has shown empirically that there often exists a coreset with low inferential error, efficiently constructing such a coreset remains a challenge. Current methods tend to be slow, require a secondary inference step after coreset construction, and do not provide bounds on the data marginal evidence. In this work, we introduce a new method, sparse Hamiltonian flows, that addresses all three of these challenges. The method involves first subsampling the data uniformly, and then optimizing a Hamiltonian flow parametrized by coreset weights and including periodic momentum quasi-refreshment steps. Theoretical results show that the method enables an exponential compression of the dataset in a representative model, and that the quasi-refreshment steps reduce the KL divergence to the target. Real and synthetic experiments demonstrate that sparse Hamiltonian flows provide accurate posterior approximations with significantly reduced runtime compared with competing dynamical-system-based inference methods.

## 1 Introduction

Bayesian inference provides a coherent approach to learning from data and uncertainty assessment in a wide variety of complex statistical models. Two standard methodologies for performing Bayesian inference in practice are Markov chain Monte Carlo (MCMC) [1; 2; 3, Ch. 11, 12] and variational inference (VI) [4, 5]. MCMC simulates a Markov chain that targets the posterior distribution. In the increasingly common setting of large-scale data, most exact MCMC methods are intractable. This is essentially because simulating each MCMC step requires an (expensive) computation involving each data point, and many steps are required to obtain inferential results of a reasonable quality. To reduce cost, a typical approach is to perform the computation for a random subsample of the data, rather than the full dataset, at each step [6-10] (see [11] for a recent survey). However, recent work shows that the speed benefits are outweighed by the drawbacks; uniformly subsampling at each step causes MCMC to either mix slowly or provide poor inferential approximation quality [11-15].

VI, on the other hand, posits a family of approximations to the posterior and uses optimization to find the closest member, enabling the use of scalable stochastic optimization algorithms [16, 17]. While past work involved simple parametric families, recent work has developed flow families based on Markov chains [18, 19] and, in particular, those based on Langevin and Hamiltonian dynamics [20-25]. However, because these Markov chains are typically designed to target the posterior distribution, each step again requires a computation involving all the data, making KL minimization and sampling slow. Repeated subsampling to reduce cost has the same issues that it does in MCMC.
Although repeated subsampling in each step of a Markov chain (for MCMC or VI) is not generally helpful, recent work on Bayesian coresets [26] has provided empirical evidence that there often exists a fixed small, weighted subset of the data (a coreset) that one can use to replace the full dataset in a standard MCMC or VI inference method [27]. In order for the Bayesian coreset approach to be practically useful, one must (1) find a suitable coreset that provides a good posterior approximation; and (2) do so quickly enough that the speed-up of inference is worth the time it takes to find the coreset. There is currently no option that satisfies these two desiderata. Importance weighting methods [26] are fast, but do not provide adequate approximations in practice. Sparse linear regression methods [28-30] are fast and sometimes provide high-quality approximations, but are very difficult to tune well. And sparse variational methods [27, 31] find very high-quality coreset approximations without undue tuning effort, but are too slow to be practical.

This work introduces three key insights. First, we can uniformly subsample the dataset once to pick the points in the coreset (the weights still need to be optimized). This selection is not only significantly simpler than past algorithms; we show that it enables constructing an exact coreset, with KL divergence 0 to the posterior, of size O(log₂ N) for N data points in a representative model (Proposition 3.1). Second, we can then construct a normalizing flow family based on Hamiltonian dynamics [21, 22, 32] that targets the coreset posterior (parametrized by coreset weights) rather than the expensive full posterior. This method addresses all of the current challenges with coresets: it enables tractable i.i.d. sampling, provides a known density and normalization constant, and is tuned using straightforward KL minimization with stochastic gradients. It also addresses the inefficiency of Markov-chain-based VI families, as the Markov chain steps are computed using the inexpensive coreset posterior density rather than the full posterior density. The final insight is that past momentum tempering methods [21] do not provide sufficient flexibility for arbitrary approximation of the posterior, even in a simple setting (Proposition 3.2). Thus, we introduce novel periodic momentum quasi-refreshment steps that provably reduce the KL objective (Propositions 3.3 and A.2). The paper concludes with real and synthetic experiments, demonstrating that sparse Hamiltonian flows compare favourably to both current coreset compression methods and variational flow-based families. Proofs of all theoretical results may be found in the appendix.

It is worth noting that Hamiltonian flow posterior approximations based on a weighted data subsample were also developed in concurrent work in the context of variational annealed importance sampling [33], and subsampling prior to weight optimization was developed in concurrent work on MCMC [34]. In this work, we focus on incorporating Bayesian coresets into Hamiltonian-based normalizing flows to obtain fast and accurate posterior approximations.

## 2 Background

### 2.1 Bayesian coresets

We are given a target probability density π(θ) for variables θ ∈ R^d that takes the following form:

$$\pi(\theta) = \frac{1}{Z}\exp\left(\sum_{n=1}^N f_n(\theta)\right)\pi_0(\theta).$$

In a Bayesian inference problem with i.i.d. data, π₀ is the prior density, the f_n are the log-likelihood terms for the N data points, and the normalization constant Z is in general not known.
The goal is to take samples from the distribution corresponding to the density π(θ). In order to avoid the Θ(N) cost of evaluating log π(θ) or ∇log π(θ) (at least one of which must be computed numerous times in most standard inference algorithms), Bayesian coresets [26] replace the target with a surrogate density of the form

$$\pi_w(\theta) = \frac{1}{Z(w)}\exp\left(\sum_{n=1}^N w_n f_n(\theta)\right)\pi_0(\theta),$$

where w ∈ R^N, w ≥ 0 is a set of weights. If w has at most M ≪ N nonzeros, the O(M) cost of evaluating log π_w(θ) or ∇log π_w(θ) is a significant improvement upon the original Θ(N) cost.

The baseline method to construct a coreset is to draw a uniformly random subsample of M data points and give each a weight of N/M; although this method is fast in practice, it typically generates poor posterior approximations. More advanced techniques generally involve significant user tuning effort [26, 28-30]. The current state-of-the-art black-box approach formulates the problem as variational inference [27, 31] and provides a stochastic gradient scheme using samples from π_w:

$$w^\star = \arg\min_{w \in \mathbb{R}^N_+} D_{KL}\left(\pi_w \,\|\, \pi\right) \quad \text{s.t.} \quad \|w\|_0 \leq M.$$

Empirically, this method tends to produce very high-quality coresets [27]. However, to estimate the gradient at each iteration of the optimization, we require MCMC samples from the weighted coreset posterior at that iteration. While generating MCMC samples from a sparse coreset posterior is not expensive, it is difficult to tune the algorithm to ensure the quality of these MCMC samples across iterations (as the weights, and thus the coreset posterior, change each iteration). The amount of tuning effort required makes the application of this method too slow to be practical. Once the coreset is constructed, all of the aforementioned methods require a secondary inference algorithm to take draws from π_w. Further, since Z(w) is not known in general, it is not tractable to use these methods to bound the marginal evidence Z.
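To make the O(M) versus Θ(N) cost comparison concrete, the following is a minimal sketch of evaluating the unnormalized coreset log-density; it is not the paper's released implementation, and the names (`log_pi_w_unnorm`, `potentials`) are illustrative.

```python
import numpy as np

def log_pi_w_unnorm(theta, log_prior, potentials, w):
    """Unnormalized coreset log-density: log pi0(theta) + sum_m w_m f_m(theta).

    `potentials` holds only the M retained log-likelihood terms f_m, so the
    cost is O(M) rather than the Theta(N) cost of the full log-posterior.
    """
    return log_prior(theta) + sum(wm * fm(theta) for wm, fm in zip(w, potentials))

# Example with Gaussian-location potentials f_n(theta) = log N(x_n | theta, I) (up to a constant):
x_sub = np.random.default_rng(0).normal(size=(20, 2))  # M = 20 subsampled points
potentials = [lambda th, x=x: -0.5 * np.sum((x - th) ** 2) for x in x_sub]
log_prior = lambda th: -0.5 * np.sum(th ** 2)          # log N(theta | 0, I), up to a constant
w = np.full(20, 500.0)                                 # uniform weights N/M with N = 10,000
print(log_pi_w_unnorm(np.zeros(2), log_prior, potentials, w))
```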
### 2.2 Hamiltonian dynamics

In this section we provide a very brief overview of some important aspects of a special case of Hamiltonian dynamics and its use in statistics; see [35] for a more comprehensive overview. The differential equation in Eq. (1) describes how a (deterministic) Hamiltonian system with position θ_t ∈ R^d, momentum ρ_t ∈ R^d, differentiable negative potential energy log π(θ_t), and kinetic energy ½ρ_t^T ρ_t evolves over time t ∈ R:

$$\frac{d\rho_t}{dt} = \nabla\log\pi(\theta_t), \qquad \frac{d\theta_t}{dt} = \rho_t. \tag{1}$$

For t ∈ R, define the mappings H_t : R^{2d} → R^{2d} that take (θ_s, ρ_s) ↦ (θ_{s+t}, ρ_{s+t}) under the dynamics in Eq. (1). These mappings have two key properties that make Hamiltonian dynamics useful in statistics. First, they are invertible and preserve volume in the sense that |det ∇H_t| = 1. In other words, they provide tractable density transformations: for any density q on R^{2d} and pushforward q_t on R^{2d} under the mapping H_t, we have that q_t(·, ·) = q(H_t^{-1}(·, ·)). Second, the augmented target density π̄(θ, ρ) on R^{2d} corresponding to independent draws from π and N(0, I),

$$\bar\pi(\theta, \rho) \propto \pi(\theta)\exp\left(-\tfrac{1}{2}\rho^T\rho\right),$$

is invariant under the mappings H_t, i.e., π̄(H_t(·, ·)) = π̄(·, ·). Given these properties, Hamiltonian Monte Carlo [35, 36] constructs a Gibbs sampler for π̄ that interleaves Hamiltonian dynamics with periodic stochastic momentum refreshments ρ ∼ N(0, I). Upon completion, the ρ component of the samples can be dropped to obtain samples from the desired target π. In practice, one approximately simulates the dynamics in Eq. (1) using the leapfrog method, which involves interleaving three discrete transformations with step size ϵ > 0:

$$\hat\rho_{k+1} = \rho_k + \frac{\epsilon}{2}\nabla\log\pi(\theta_k), \qquad \theta_{k+1} = \theta_k + \epsilon\,\hat\rho_{k+1}, \qquad \rho_{k+1} = \hat\rho_{k+1} + \frac{\epsilon}{2}\nabla\log\pi(\theta_{k+1}). \tag{2}$$

Denote by T_ϵ : R^{2d} → R^{2d} the map constructed by applying these three steps in sequence. As the transformations in Eq. (2) are all shear, T_ϵ is also volume-preserving, and for a small enough step size ϵ it nearly maintains the target invariance. Note also that evaluating a single application of T_ϵ has O(Nd) complexity, which is generally expensive in the large-data (large-N) regime.

### 2.3 VI via Hamiltonian dynamics

Since the mapping T_ϵ is invertible and volume-preserving, it is possible to tractably compute the density of the pushforward of a reference distribution q(·, ·) under repeated applications of it. In addition, this repeated application of T_ϵ resembles the steps of Hamiltonian Monte Carlo (HMC) [35], which we know converges in distribution to the target posterior distribution. [21, 22] use these facts to construct a normalizing flow [32] VI family. However, there are two issues with this methodology. First, the O(Nd) complexity of evaluating each step T_ϵ makes training and simulating from this flow computationally expensive. Second, Hamiltonian dynamics on its own creates a flow with insufficient flexibility to match a target π of interest. In particular, given a density q(·, ·) and pushforward q_t(·, ·) under H_t, we have

$$\forall t \in \mathbb{R}, \quad D_{KL}\left(q_t \,\|\, \bar\pi\right) = D_{KL}\left(q \,\|\, \bar\pi\right). \tag{3}$$

In other words, Hamiltonian dynamics itself cannot reduce the KL divergence to π̄; it simply interchanges potential and kinetic energy. [21] address this issue by instead deriving their flow from tempered Hamiltonian dynamics: for an integrable tempering function γ : R → R,

$$\frac{d\rho_t}{dt} = \nabla\log\pi(\theta_t) - \gamma(t)\rho_t, \qquad \frac{d\theta_t}{dt} = \rho_t. \tag{4}$$

The discretized version of the dynamics in Eq. (4) corresponds to multiplying the momentum by a tempering value α_k > 0 after the kth application of T_ϵ. By scaling the momentum, one provides the normalizing flow with the flexibility to change the kinetic energy at each step. However, we show later in Proposition 3.2 that just tempering the momentum does not provide the required flow flexibility, even for a simple representative Gaussian target.

A related line of work uses the mapping T_ϵ for variational annealed importance sampling [23-25]. The major difference between these methods and the normalizing-flow-based methods is that the auxiliary variable is (partially) stochastically refreshed via ρ ∼ N(0, I) after applications of T_ϵ. One is then forced to minimize the KL divergence between the joint distributions of θ and all of the auxiliary momentum variables under the variational and augmented target distributions.
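As a concrete reference for the constructions that follow, here is a minimal sketch of the leapfrog map T_ϵ in Eq. (2); the function and argument names are ours, not the paper's.

```python
import numpy as np

def leapfrog(grad_log_pi, theta, rho, eps, steps=1):
    """Leapfrog map T_eps of Eq. (2): three interleaved shear maps per step.

    Each step is invertible and volume-preserving, so repeated application
    contributes zero log-Jacobian-determinant to a flow density.
    """
    for _ in range(steps):
        rho = rho + 0.5 * eps * grad_log_pi(theta)   # half-step on momentum
        theta = theta + eps * rho                    # full-step on position
        rho = rho + 0.5 * eps * grad_log_pi(theta)   # half-step on momentum
    return theta, rho

# Example: a standard normal target, where grad log pi(theta) = -theta.
theta, rho = leapfrog(lambda th: -th, np.array([1.0]), np.array([0.5]), eps=0.1, steps=10)
```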
## 3 Sparse Hamiltonian flows

In this section we present sparse Hamiltonian flows, a new method to construct and draw samples from Bayesian coreset posterior approximations. We first present a method and supporting theory for selecting the data points to be included in the coreset, then discuss building a sparse flow with these points, and finally introduce quasi-refreshment steps to give the flow family enough flexibility to match the target distribution. Sparse Hamiltonian flows enable tractable i.i.d. sampling, provide a tractable density and normalization constant, and are constructed by minimizing the KL divergence to the posterior with simple stochastic gradient estimates.

### 3.1 Selection via subsampling

The first step in our algorithm is to choose a uniformly random subsample of M points from the full dataset; these will be the data points that comprise the coreset. Without loss of generality, we assume these are the first M points. The key insight in this work is that while subsampling with importance weighting does not typically provide good coreset approximations [26], a uniformly random subset {f₁, …, f_M} of the N log-likelihood potential functions still provides a good basis for approximation with high probability. Proposition 3.1 provides the precise statement of this result for the representative model in Eq. (5). In particular, Proposition 3.1 asserts that as long as we set our coreset size M to be proportional to d log₂ N, the optimal coreset posterior approximation will be exact (i.e., have 0 KL divergence to the true posterior) with probability at least 1 − N^{−d/2}(log₂ N)^{d/2}. Thus we achieve an exponential compression of the dataset, N → log₂ N, without losing any fidelity. Note that we will still need a method to choose the weights w₁, …, w_M for the M points, but the use of uniform selection rather than a one-at-a-time approach [27-29] substantially simplifies the construction. In Proposition 3.1, C is the universal constant from [37, Corollary 1.2], which provides an upper bound on the number of spherical balls of some fixed radius needed to cover a d-dimensional unit sphere.

**Proposition 3.1.** Consider a Bayesian Gaussian location model:

$$\theta \sim N(0, I) \quad \text{and} \quad \forall n \in [N], \; X_n \overset{i.i.d.}{\sim} N(\theta, I), \tag{5}$$

where θ, X_n ∈ R^d for d ≤ N. Suppose the true data-generating parameter is θ = 0, and set

$$M = \log_2\left(A_d N^d (\log N)^{d/2}\right) + C, \quad \text{where} \quad A_d = e^{d/2} d^{3/2}\log(1+d).$$

Then the optimal coreset π_{w⋆} for the model in Eq. (5) built using a uniform subsample of data of size M satisfies

$$P\left(D_{KL}(\pi_{w^\star} \,\|\, \pi) = 0\right) \geq 1 - N^{-d/2}(\log_2 N)^{d/2}.$$

### 3.2 Sparse flows

Upon taking a uniform subsample of M data points from the full dataset, we consider the sparsified Hamiltonian dynamics initialized at θ₀, ρ₀ ∼ q(·, ·) for a reference density¹ q(·, ·):

$$\frac{d\rho_t}{dt} = \nabla\log\pi_w(\theta_t), \qquad \frac{d\theta_t}{dt} = \rho_t. \tag{6}$$

¹ The reference q can also have its own variational parameters to optimize, but in this paper we leave it fixed.

Much like the original Hamiltonian dynamics for the full target density, the sparsified Hamiltonian dynamics in Eq. (6) target the augmented coreset posterior with density π̄_w(θ, ρ) on R^{2d},

$$\bar\pi_w(\theta, \rho) \propto \pi_w(\theta)\exp\left(-\tfrac{1}{2}\rho^T\rho\right).$$

Discretizing these dynamics yields a leapfrog method similar to Eq. (2) with three interleaved steps:

$$\hat\rho_{k+1} = \rho_k + \frac{\epsilon}{2}\nabla\log\pi_w(\theta_k), \qquad \theta_{k+1} = \theta_k + \epsilon\,\hat\rho_{k+1}, \qquad \rho_{k+1} = \hat\rho_{k+1} + \frac{\epsilon}{2}\nabla\log\pi_w(\theta_{k+1}). \tag{7}$$

Denote by T_{w,ϵ} : R^{2d} → R^{2d} the map constructed by applying these three steps in sequence. Like the original leapfrog method, these transformations are both invertible and shear, and thus preserve volume; and for a small enough step size ϵ, they approximately maintain the invariance of π̄_w(θ, ρ). However, since only the first M entries of w are nonzero,

$$\nabla\log\pi_w(\theta_k) = \nabla\log\pi_0(\theta_k) + \sum_{m=1}^M w_m \nabla f_m(\theta_k),$$

and thus a coreset leapfrog step can be taken in O(Md) time, as opposed to the O(Nd) time of the original approach. Given that Proposition 3.1 recommends setting M ∝ d log₂ N, we have achieved an exponential reduction in the computational cost of running the flow.
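The display above translates directly into code; a minimal sketch (our names, not the paper's implementation) builds the O(Md) coreset gradient as a closure that can be passed to the `leapfrog` helper sketched in Section 2.2 to obtain T_{w,ϵ}.

```python
def make_grad_log_pi_w(grad_log_prior, grad_potentials, w):
    """Gradient of the coreset log-density:
    grad log pi_w = grad log pi0 + sum_m w_m grad f_m.

    `grad_potentials` holds only the M retained gradients grad f_m, so each
    call costs O(Md) rather than O(Nd).
    """
    def grad_log_pi_w(theta):
        g = grad_log_prior(theta)
        for wm, grad_fm in zip(w, grad_potentials):
            g = g + wm * grad_fm(theta)
        return g
    return grad_log_pi_w
```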
However, as before, the weighted sparse leapfrog flow is not sufficient on its own to provide a flexible variational family. In particular, we know that T_{w,ϵ} nearly maintains the distribution π̄_w as invariant; we therefore need a way to modify the distribution of the momentum variable ρ. One option is to include a tempering of the form in Eq. (4) in the sparse flow. However, Proposition 3.2 shows that even optimal tempering does not provide the flexibility to match a simple Gaussian target π.

**Proposition 3.2.** Let θ_t, ρ_t ∈ R follow the tempered Hamiltonian dynamics in Eq. (4) targeting π = N(0, σ²), σ > 0, with initial distribution θ₀ ∼ N(µ, 1), ρ₀ ∼ N(0, β²) for initial center µ ∈ R and momentum scale β > 0. Let q_t be the distribution of (θ_t, ρ_t). Then

$$\inf_{t>0,\,\beta>0,\,\gamma:\mathbb{R}_+\to\mathbb{R}} D_{KL}\left(q_t \,\|\, \bar\pi\right) \geq \frac{1}{2}\log\left(1 + \mu^2\right).$$

Note that if γ(t) = 0 identically, then ∀t ≥ 0, D_KL(q_t ‖ π̄) = D_KL(q₀ ‖ π̄). The intuition behind Proposition 3.2 is that while adding a tempering γ(t) enables one to change the total energy by scaling the momentum, it does not allow fine enough control over the distribution of the momentum. For example, if E[ρ] ≠ 0 under the current flow approximation, we cannot scale the momentum to force E[ρ] = 0 as it should be under the augmented target; intuitively, we also need the ability to shift or recenter the momentum as well.

### 3.3 Quasi-refreshment

Rather than resampling the momentum variable from its target marginal, which removes the ability to evaluate the density of (θ_t, ρ_t), in this work we introduce deterministic quasi-refreshment moves that enable the flow to strategically update the momentum without losing the ability to compute the density and normalization constant of (θ_t, ρ_t) (i.e., we construct a normalizing flow [32]). Here we introduce the notion of marginal quasi-refreshment, which tries to make the marginal distribution of ρ_t match the corresponding marginal distribution of the augmented target π̄_w. Proposition 3.3 shows that marginal quasi-refreshment is guaranteed to reduce the KL divergence.

**Proposition 3.3.** Consider the state θ_t, ρ_t ∈ R^d of the flow at step t, and the augmented target distribution θ, ρ ∼ π̄. Suppose that we have a bijection R : R^d → R^d such that R(ρ_t) =_d ρ. Then

$$D_{KL}\left(\theta_t, R(\rho_t) \,\|\, \theta, \rho\right) = D_{KL}\left(\theta_t, \rho_t \,\|\, \theta, \rho\right) - D_{KL}\left(\rho_t \,\|\, \rho\right).$$

See Appendix A for the proof of Proposition 3.3 and a general treatment of quasi-refreshment; for simplicity, we focus here on the type of quasi-refreshment that we use in the experiments. In particular, if we are willing to make an assumption about the marginal distribution of ρ_t at step t of the flow, we can introduce a tunable family of functions R_λ with parameters λ that is flexible enough to set R_λ(ρ_t) =_d ρ for some λ, and include optimization of λ along with the coreset weights. It is important to note that this assumption on the distribution of ρ_t is not related to the posterior π. As an example, in this work we assume that ρ_t ∼ N(µ, Λ⁻¹) for some unknown mean µ and diagonal precision Λ, which enables us to simply set

$$R_\lambda(x) = \Lambda^{1/2}(x - \mu). \tag{8}$$

We then include λ = (µ, Λ) as parameters to be optimized along with the coreset weights w (each quasi-refreshment step has its own set of parameters µ, Λ). Even when this assumption does not hold exactly, the resulting form of Eq. (8) enables the refreshment step to both shift and scale (i.e., standardize) the momentum as desired, and is natural to implement as part of a single optimization routine.
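A minimal sketch of the shift-and-scale quasi-refreshment in Eq. (8), assuming a diagonal precision as in the text; the function name and the convention of returning the log-Jacobian determinant alongside the new momentum are ours.

```python
import numpy as np

def quasi_refresh(rho, mu, lam_diag):
    """Shift-and-scale quasi-refreshment of Eq. (8): R_lambda(x) = Lambda^{1/2}(x - mu).

    `lam_diag` holds the diagonal of the precision Lambda, so the
    log-Jacobian-determinant needed for the flow density is
    0.5 * sum(log lam_diag).
    """
    return np.sqrt(lam_diag) * (rho - mu), 0.5 * np.sum(np.log(lam_diag))
```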
Fig. 1 provides an example of the effect of quasi-refreshment in a synthetic Gaussian location model (see Section 4.1 for details). In particular, it shows the evidence lower bound (ELBO) as a function of leapfrog step number in a trained sparse Hamiltonian flow with the quasi-refreshment scheme in Eq. (8). While the estimated ELBO values stay relatively stable across leapfrog steps in between quasi-refreshments (as expected by Eq. (3)), the quasi-refreshment steps (colored red) cause the ELBO to increase drastically. We note that the ELBO does not stay exactly constant between refreshments because the Hamiltonian dynamics targets the coreset posterior instead of the true posterior, and is simulated approximately using leapfrog steps. As the series of transformations brings the approximate density closer to the target, the quasi-refreshment steps no longer change the ELBO much, signalling the convergence of the flow's approximation to the target. It is thus clear that the marginal quasi-refreshments indeed decrease the KL divergence, as shown in Proposition 3.3.

Figure 1: ELBO across leapfrog steps.

### 3.4 Algorithm

In this section, we describe the procedure for training and generating samples from a sparse Hamiltonian flow; the complete procedure is given in Algorithms 1 and 2 below. As a normalizing flow, a sparse Hamiltonian flow can be trained by maximizing the augmented ELBO using standard stochastic gradient methods (e.g., as in [32]), where the transformations follow Eq. (7) with periodic quasi-refreshments. Here and in the experiments we focus on the shift-and-scale quasi-refreshment in Eq. (8).

We begin by selecting a subset of M data points chosen uniformly randomly from the full data. Next we select a total number R of quasi-refreshment steps, and a number L of leapfrog steps between each quasi-refreshment. The flow parameters to be optimized consist of the quasi-refreshment parameters λ = (λ_r)_{r=1}^R, the M coreset weights w = (w_m)_{m=1}^M, and the leapfrog step sizes ϵ = (ϵ_i)_{i=1}^d; note that we use a separate step size ϵ_i per latent variable dimension i in Eq. (7) [35, Sec. 4.2]. This modification enables the flow to fit nonisotropic target distributions. We initialize the weights to N/M (i.e., a uniform coreset), and select an initial step size for all dimensions. We use a warm start to initialize the parameters λ_r = (µ_r, Λ_r) of the quasi-refreshments. Specifically, using the initial leapfrog step sizes and coreset weights, we pass a batch of samples from the reference density q(·, ·) through the flow up to the first quasi-refreshment step. We initialize µ₁, Λ₁ to the empirical mean and diagonal precision of the samples at that point. We then apply the initialized first quasi-refreshment to the momentum, proceed with the second sequence of leapfrog steps, and repeat until we have initialized all quasi-refreshments r = 1, …, R.

Once the parameters are initialized, we log-transform the step sizes, weights, and quasi-refreshment diagonal scaling matrices to make them unconstrained during optimization. We obtain an unbiased estimate of the augmented ELBO gradient by applying automatic differentiation [38, 39] to the ELBO estimation function (Algorithm 2), and optimize all parameters jointly using a gradient-based stochastic optimization technique such as SGD [40, 41] or ADAM [42]. Once trained, we can obtain samples from the flow via Algorithm 1.

Algorithm 1 SparseHamFlow
Require: θ₀, ρ₀, w, ϵ, λ, L, R

    J ← 0, (θ, ρ) ← (θ₀, ρ₀)
    for r = 1, …, R do
        for ℓ = 1, …, L do
            (θ, ρ) ← T_{w,ϵ}(θ, ρ)                          ▷ sparse flow leapfrog
        end for
        J ← J + log |det ∇_ρ R_{λ_r}(ρ, θ)|                 ▷ accumulate log-Jacobian determinant
        ρ ← R_{λ_r}(ρ, θ)                                   ▷ quasi-refreshment
    end for
    return θ, ρ, J

Algorithm 2 Estimate_ELBO
Require: q, π₀, w, ϵ, λ, L, R, S

    (θ₀, ρ₀) ∼ q(·, ·)
    (θ, ρ, J) ← SparseHamFlow(θ₀, ρ₀, w, ϵ, λ, L, R)        ▷ forward pass
    (n₁, …, n_S) i.i.d. ∼ Unif({1, 2, …, N})                ▷ unbiased ELBO estimate
    log p̂ ← log π₀(θ) + (N/S) Σ_{s=1}^S f_{n_s}(θ) + log N(ρ | 0, I)
    log q̂ ← log q(θ₀, ρ₀) − J
    return log p̂ − log q̂
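The following is a hedged Python rendering of Algorithms 1 and 2, reusing the `leapfrog`, `make_grad_log_pi_w`, and `quasi_refresh` helpers sketched earlier; all names are ours, and optimization (automatic differentiation plus SGD/ADAM) would be layered on top of `estimate_elbo` rather than shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_ham_flow(theta, rho, grad_log_pi_w, eps, refresh_params, L):
    """Forward pass of Algorithm 1: R blocks of L sparse leapfrog steps, each
    followed by a shift-and-scale quasi-refreshment; J accumulates the
    refreshments' log-Jacobian determinants (the leapfrog steps contribute 0)."""
    J = 0.0
    for mu, lam_diag in refresh_params:      # one (mu, Lambda) pair per refreshment
        theta, rho = leapfrog(grad_log_pi_w, theta, rho, eps, steps=L)
        rho, ldj = quasi_refresh(rho, mu, lam_diag)
        J += ldj
    return theta, rho, J

def estimate_elbo(q_sample, log_q, log_prior, f, N, S,
                  grad_log_pi_w, eps, refresh_params, L):
    """Unbiased augmented-ELBO estimate in the spirit of Algorithm 2;
    f(theta, n) returns the log-likelihood potential of data point n,
    of which S are uniformly subsampled."""
    theta0, rho0 = q_sample()
    theta, rho, J = sparse_ham_flow(theta0, rho0, grad_log_pi_w,
                                    eps, refresh_params, L)
    d = rho.shape[0]
    log_p = (log_prior(theta)
             + (N / S) * sum(f(theta, n) for n in rng.integers(0, N, size=S))
             - 0.5 * rho @ rho - 0.5 * d * np.log(2 * np.pi))  # log N(rho | 0, I)
    log_q_flow = log_q(theta0, rho0) - J                       # pushforward density
    return log_p - log_q_flow
```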
## 4 Experiments

In this section, we compare our method against other Hamiltonian-based VI methods and Bayesian coreset construction methods. Specifically, we compare the quality of posterior approximation, as well as the training and sampling times, of sparse Hamiltonian flows (SHF), Hamiltonian importance sampling (HIS) [21], and uncorrected Hamiltonian annealing (UHA) [23] using real and synthetic datasets. We compare with two variants of HIS and UHA: -Full, in which we train using in-flow minibatching as suggested by [21, 23], but compute evaluation metrics using the full-data flow; and -Coreset, in which we base the flow on a uniformly subsampled coreset. We also include sampling times of adaptive HMC and NUTS [43, Alg. 5 and 6] using the full dataset. Finally, we compare the quality of coresets constructed by SHF to those obtained using uniform subsampling (UNI) and Hilbert coresets with orthogonal matching pursuit (Hilbert-OMP) [28, 44]. All experiments are performed on a machine with an Intel Core i7-12700H processor and 32GB memory. Code is available at https://github.com/NaitongChen/Sparse-Hamiltonian-Flows. Details of the experiments are in Appendix B.

### 4.1 Synthetic Gaussian

We first demonstrate the performance of SHF on a synthetic Gaussian location model,

$$\theta \sim N(0, I) \quad \text{and} \quad \forall n \in [N], \; X_n \overset{i.i.d.}{\sim} N(\theta, cI),$$

where θ, X_n ∈ R^d. We set c = 100, d = 10, N = 10,000. This model has a closed-form posterior distribution

$$\pi = N\left(\frac{\sum_{n=1}^N X_n}{c + N}, \; \frac{c}{c + N}I\right);$$

a short numerical check of this formula is sketched at the end of this discussion. More details may be found in Appendix B.1.

Fig. 2a compares the ELBO values of SHF, HIS, and UHA across all optimization iterations. We can see that SHF and UHA-Full result in the highest ELBO, and hence the tightest bound on the log normalization constant of the target. In this problem, since we have access to the exact posterior distribution in closed form, we can also estimate the θ-marginal KL divergence directly, as shown in Fig. 2b. Here we see that the posterior approximation produced by SHF provides a significantly lower KL divergence than the other competing methods. Figs. 2c to 2f show, through a number of other metrics, that SHF provides a higher quality posterior approximation than the others. It is worth noting that while the relative covariance error for SHF takes long to converge, we observe a monotonic downward trend in both the relative mean error and the KL divergence of SHF. This means that for this particular problem, our method finds the centre of the target before fine-tuning the covariance. We also note that a number of metrics go up for UHA-Full because it operates on the augmented space based on a sequence of distributions that bridge a simple initial distribution and the target distribution. Therefore, it is not guaranteed that every step of optimization improves the quality of approximation on the marginal space of the latent variables of interest, which is what the plots in Fig. 2 show.

Figure 2: ELBO (Fig. 2a), KL divergence (Fig. 2b), relative 2-norm mean error (Fig. 2c), relative Frobenius norm covariance error (Fig. 2d), energy distance (Fig. 2e), and IMQ KSD [45] (Fig. 2f) for the synthetic Gaussian. The lines indicate the median, and error regions indicate the 25th to 75th percentile from 5 runs.

Figure 3: Density evaluation (Fig. 3a) and sample generation (Fig. 3b) time (100 samples), and ELBO versus time during training (Fig. 3c) for the synthetic Gaussian. The lines indicate the median, and error regions indicate the 25th to 75th percentile from 5 runs.
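As promised above, a small numerical check of the closed-form posterior under simulated data (the variable names and the simulation itself are ours, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, N, c = 10, 10_000, 100.0
X = rng.normal(0.0, np.sqrt(c), size=(N, d))   # X_n ~ N(theta, c I) with true theta = 0

# Conjugacy: prior precision I plus likelihood precision (N/c) I gives posterior
# covariance c/(c+N) I and mean sum_n X_n / (c + N), matching the formula above.
post_mean = X.sum(axis=0) / (c + N)
post_cov_diag = c / (c + N)
print(post_mean.round(3), post_cov_diag)
```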
Figs. 3a and 3b show the time required for each method to evaluate the density of the joint distribution of (θ, ρ) and to generate samples. It is clear that the use of a coreset improves the density evaluation and sample generation time by more than an order of magnitude. Fig. 3c compares the training times of SHF, HIS-Coreset, and UHA-Coreset (recall that due to the use of subsampled minibatch flow dynamics, HIS-Full and UHA-Full share the same training time as their -Coreset versions). The relative training speeds generally match those of sample generation from the target posterior.

Finally, Fig. 4 compares the quality of coresets constructed via SHF, uniform subsampling (UNI), and Hilbert coresets with orthogonal matching pursuit (Hilbert-OMP). Note that in this problem, the Laplace approximation is exact (the true posterior is Gaussian), and hence Hilbert-OMP constructs a coreset using samples from the true posterior. Despite this, SHF provides coresets of comparable quality, in addition to enabling tractable i.i.d. sampling, density evaluation, normalization constant bounds, and straightforward construction via stochastic optimization.

Figure 4: Estimated KL divergence (Fig. 4a), relative 2-norm mean error (Fig. 4b), relative Frobenius norm covariance error (Fig. 4c), energy distance (Fig. 4d), and IMQ KSD (Fig. 4e) versus coreset size. The lines indicate the median, and error regions indicate the 25th to 75th percentile from 5 runs.

### 4.2 Bayesian linear regression

In the setting of Bayesian linear regression, we are given a set of data points (x_n, y_n)_{n=1}^N, each consisting of features x_n ∈ R^p and response y_n ∈ R, and a model of the form

$$\begin{bmatrix}\beta \\ \log\sigma^2\end{bmatrix} \sim N(0, I), \qquad \forall n \in [N], \; y_n \mid x_n, \beta, \sigma^2 \overset{indep}{\sim} N\left([1\; x_n^T]\beta, \; \sigma^2\right),$$

where β ∈ R^{p+1} is a vector of regression coefficients and σ² ∈ R₊ is the noise variance (the log-likelihood of a single observation is sketched after this subsection). The dataset² that we use consists of N = 100,000 flights, each containing p = 10 features (e.g., distance of the flight, weather conditions, departure time, etc.), and the response variable is the difference, in minutes, between the scheduled and actual departure times. More details can be found in Appendix B.2.

² This dataset consists of airport data from https://www.transtats.bts.gov/DL_SelectFields.asp?gnoyr_VQ=FGJ and weather data from https://wunderground.com.

Since we no longer have the posterior distribution in closed form, we estimate the mean and covariance using 5,000 samples from Stan [46] and treat them as the true posterior mean and covariance. Figs. 5a to 5c show the marginal KL divergence, relative mean error, and relative covariance error of SHF, HIS, and UHA, where the marginal KL divergence is estimated using the Gaussian approximation of the posterior with the estimated mean and covariance. Here we also include the posterior approximation obtained using the Laplace approximation as a baseline. We see that SHF provides the highest quality posterior approximation. Furthermore, Fig. 5d shows that SHF provides a significant improvement in the marginal KL divergence compared with the competing coreset constructions UNI and Hilbert-OMP. This is due to the true posterior no longer being Gaussian; the Laplace approximation required by Hilbert-OMP fails to capture the shape of the posterior. Additional plots comparing the quality of posterior approximations using various other metrics can be found in Appendix B.2.

Figure 5: Linear (top) and logistic (bottom) regression results: Gaussian approximated KL divergence versus training time (Figs. 5a and 5e), relative 2-norm mean error (Figs. 5b and 5f), relative Frobenius norm covariance error (Figs. 5c and 5g), and Gaussian approximated KL divergence versus coreset size (Figs. 5d and 5h). The lines indicate the median, and error regions indicate the 25th to 75th percentile from 5 runs.
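For concreteness, the per-datapoint log-likelihood potential f_n for the linear regression model above can be written as follows (a minimal sketch with our own naming; the paper's code may parametrize this differently):

```python
import numpy as np

def linreg_potential(theta, x_n, y_n):
    """Log-likelihood potential f_n(theta) for the linear regression model,
    with theta = (beta, log sigma^2) and an intercept column prepended."""
    beta, log_s2 = theta[:-1], theta[-1]
    mean = beta[0] + x_n @ beta[1:]                       # [1, x_n^T] beta
    return -0.5 * (np.log(2 * np.pi) + log_s2
                   + (y_n - mean) ** 2 / np.exp(log_s2))
```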
### 4.3 Bayesian logistic regression

In the setting of Bayesian logistic regression, we are given a set of data points (x_n, y_n)_{n=1}^N, each consisting of features x_n ∈ R^p and label y_n ∈ {0, 1}, and a model of the form

$$\forall i \in [p+1], \; \beta_i \overset{i.i.d.}{\sim} \text{Cauchy}(0, 1), \qquad \forall n \in [N], \; y_n \overset{indep}{\sim} \text{Bern}\left(\left(1 + \exp\left(-[1\; x_n^T]\beta\right)\right)^{-1}\right),$$

where β ∈ R^{p+1}. The same airline dataset is used, with the labels indicating whether a flight is cancelled. Of the flights included, 1.384% were cancelled. More details can be found in Appendix B.3. The same procedures as in the Bayesian linear regression example are followed to generate the results in Figs. 5e to 5h. To account for the class imbalance present in the dataset, we construct all subsampled coresets with half the data having label 1 and the other half label 0. The results in Figs. 5e to 5h are similar to those from the Bayesian linear regression example; SHF provides high-quality variational approximations to the posterior. Additional plots comparing the quality of posterior approximations using various other metrics can be found in Appendix B.3.

## 5 Conclusion

This paper introduced sparse Hamiltonian flows, a novel coreset-based variational family that enables tractable i.i.d. sampling and evaluation of its density and normalization constant. The method randomly subsamples a small set of data, and uses the subsample to construct a flow from sparse Hamiltonian dynamics. Novel quasi-refreshment steps provide the flow with the flexibility to match target posteriors without introducing additional auxiliary variables. Theoretical results show that, in a representative model, the method can recover the exact posterior using a subsampled dataset whose size is logarithmic in the original dataset size, and that quasi-refreshments are guaranteed to reduce the KL divergence to the target. Experiments demonstrate that the method provides high-quality coreset posterior approximations.

One main limitation of our methodology is that the data must be "compressible" in the sense that the log-likelihood functions of a subset can be used to represent the full log-likelihood well. If the data do not live on some underlying low-dimensional manifold, this may not be the case. Additionally, while our quasi-refreshment is simple and works well in practice, more work is required to develop a wider variety of general-purpose quasi-refreshment moves. We leave this for future work.

## Acknowledgments and Disclosure of Funding

All authors were supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant, an NSERC Discovery Launch Supplement, and a gift from Google LLC.

## References

[1] Christian Robert and George Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

[2] Christian Robert and George Casella. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 26(1):102-115, 2011.

[3] Andrew Gelman, John Carlin, Hal Stern, David Dunson, Aki Vehtari, and Donald Rubin. Bayesian Data Analysis. CRC Press, 3rd edition, 2013.

[4] Michael Jordan, Zoubin Ghahramani, Tommi Jaakkola, and Lawrence Saul. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

[5] Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
[6] Rémi Bardenet, Arnaud Doucet, and Chris Holmes. On Markov chain Monte Carlo methods for tall data. Journal of Machine Learning Research, 18:1-43, 2017.

[7] Anoop Korattikara, Yutian Chen, and Max Welling. Austerity in MCMC land: cutting the Metropolis-Hastings budget. In International Conference on Machine Learning, 2014.

[8] Dougal Maclaurin and Ryan Adams. Firefly Monte Carlo: exact MCMC with subsets of data. In Conference on Uncertainty in Artificial Intelligence, 2014.

[9] Max Welling and Yee Whye Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, 2011.

[10] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, 2012.

[11] Matias Quiroz, Robert Kohn, and Khue-Dung Dang. Subsampling MCMC: an introduction for the survey statistician. Sankhya: The Indian Journal of Statistics, 80-A:S33-S69, 2018.

[12] James Johndrow, Natesh Pillai, and Aaron Smith. No free lunch for approximate MCMC. arXiv:2010.12514, 2020.

[13] Tigran Nagapetyan, Andrew Duncan, Leonard Hasenclever, Sebastian Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv:1706.02692, 2017.

[14] Michael Betancourt. The fundamental incompatibility of Hamiltonian Monte Carlo and data subsampling. In International Conference on Machine Learning, 2015.

[15] Matias Quiroz, Robert Kohn, Mattias Villani, and Minh-Ngoc Tran. Speeding up MCMC by efficient data subsampling. Journal of the American Statistical Association, 114(526):831-843, 2019.

[16] Matthew Hoffman, David Blei, Chong Wang, and John Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.

[17] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

[18] Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: bridging the gap. In International Conference on Machine Learning, 2015.

[19] Raza Habib and David Barber. Auxiliary variational MCMC. In International Conference on Learning Representations, 2018.

[20] Christopher Wolf, Maximilian Karl, and Patrick van der Smagt. Variational inference with Hamiltonian Monte Carlo. arXiv:1609.08203, 2016.

[21] Anthony Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, 2018.

[22] Radford Neal. Hamiltonian importance sampling. Banff International Research Station (BIRS) Workshop on Mathematical Issues in Molecular Dynamics, 2005.

[23] Tomas Geffner and Justin Domke. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021.

[24] Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, and Roger Grosse. Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021.

[25] Achille Thin, Nikita Kotelevskii, Arnaud Doucet, Alain Durmus, Eric Moulines, and Maxim Panov. Monte Carlo variational auto-encoders. In International Conference on Machine Learning, 2021.

[26] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic regression. In Advances in Neural Information Processing Systems, 2016.
[27] Trevor Campbell and Boyan Beronov. Sparse variational inference: Bayesian coresets from scratch. In Advances in Neural Information Processing Systems, 2019.

[28] Trevor Campbell and Tamara Broderick. Automated scalable Bayesian inference via Hilbert coresets. Journal of Machine Learning Research, 20(15), 2019.

[29] Trevor Campbell and Tamara Broderick. Bayesian coreset construction via greedy iterative geodesic ascent. In International Conference on Machine Learning, 2018.

[30] Jacky Zhang, Rajiv Khanna, Anastasios Kyrillidis, and Oluwasanmi Koyejo. Bayesian coresets: revisiting the nonconvex optimization perspective. In Artificial Intelligence and Statistics, 2021.

[31] Dionysis Manousakas, Zuheng Xu, Cecilia Mascolo, and Trevor Campbell. Bayesian pseudocoresets. In Advances in Neural Information Processing Systems, 2020.

[32] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

[33] Martin Jankowiak and Du Phan. Surrogate likelihoods for variational annealed importance sampling. arXiv:2112.12194, 2021.

[34] Cian Naik, Judith Rousseau, and Trevor Campbell. Fast Bayesian coresets via subsampling and quasi-Newton refinement. arXiv:2203.09675, 2022.

[35] Radford Neal. MCMC using Hamiltonian dynamics. In Steve Brooks, Andrew Gelman, Galin Jones, and Xiao-Li Meng, editors, Handbook of Markov Chain Monte Carlo, chapter 5. CRC Press, 2011.

[36] Radford Neal. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, No. 118. Springer-Verlag, 1996.

[37] Károly Böröczky and Gergely Wintsche. Covering the sphere by equal spherical balls. In Boris Aronov, Saugata Basu, János Pach, and Micha Sharir, editors, Discrete and Computational Geometry, volume 25 of Algorithms and Combinatorics, pages 235-251. Springer, 2003.

[38] Atılım Güneş Baydin, Barak Pearlmutter, Alexey Radul, and Jeffrey Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1-43, 2018.

[39] Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David Blei. Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14), 2017.

[40] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400-407, 1951.

[41] Léon Bottou. Stochastic learning. In Olivier Bousquet, Ulrike von Luxburg, and Gunnar Rätsch, editors, Advanced Lectures on Machine Learning: ML Summer Schools 2003, pages 146-168. Springer Berlin Heidelberg, 2004.

[42] Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In International Conference on Learning Representations, 2014.

[43] Matthew Hoffman and Andrew Gelman. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

[44] Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, pages 40-44. IEEE, 1993.

[45] Jackson Gorham and Lester Mackey. Measuring sample quality with kernels. In International Conference on Machine Learning, pages 1292-1301. PMLR, 2017.

[46] Bob Carpenter, Andrew Gelman, Matthew Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1), 2017.
[47] Luke Tierney and Joseph Kadane. Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82-86, 1986.

## Checklist

1. For all authors...
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? [Yes]
   (b) Did you describe the limitations of your work? [Yes] In Section 3.3, we assume that the momentum distribution is Gaussian. In Section 5, we mention that a direction of future work is to develop more general quasi-refreshment moves that do not require such assumptions.
   (c) Did you discuss any potential negative societal impacts of your work? [No] This paper presents an algorithm for sampling from a Bayesian posterior distribution in the large-scale data regime. There are possible negative societal impacts of downstream applications of this method (e.g., inference for a particular model and dataset), but we prefer not to speculate about these here.
   (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
2. If you are including theoretical results...
   (a) Did you state the full set of assumptions of all theoretical results? [Yes]
   (b) Did you include complete proofs of all theoretical results? [Yes] All proofs are provided in the appendix.
3. If you ran experiments...
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We have included anonymized code in the supplement.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] All details are provided in the main paper and in Appendix B.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] All results come with error bars.
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] Hardware is listed at the beginning of the experiments.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets...
   (a) If your work uses existing assets, did you cite the creators? [Yes] Footnotes to the source URL and bibliography citations are provided.
   (b) Did you mention the license of the assets? [No] None of the real datasets we use specify a particular license that the data were released under.
   (c) Did you include any new assets either in the supplemental material or as a URL? [N/A] We do not produce any new assets.
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? [No] Our data does not pertain to individuals.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [No] Our data does not pertain to individuals.
5. If you used crowdsourcing or conducted research with human subjects...
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]