# MixFlows: principled variational inference via mixed flows

Zuheng Xu 1, Naitong Chen 1, Trevor Campbell 1

Abstract

This work presents mixed variational flows (MixFlows), a new variational family that consists of a mixture of repeated applications of a map to an initial reference distribution. First, we provide efficient algorithms for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. We then show that MixFlows have MCMC-like convergence guarantees when the flow map is ergodic and measure-preserving, and provide bounds on the accumulation of error for practical implementations where the flow map is approximated. Finally, we develop an implementation of MixFlows based on uncorrected discretized Hamiltonian dynamics combined with deterministic momentum refreshment. Simulated and real data experiments show that MixFlows can provide more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.

1. Introduction

Bayesian statistical modelling and inference provides a principled approach to learning from data. However, for all but the simplest models, exact inference is not possible and computational approximations are required. A standard methodology for Bayesian inference is Markov chain Monte Carlo (MCMC) (Robert & Casella, 2004; 2011; Gelman et al., 2013, Ch. 11, 12), which involves simulating a Markov chain whose stationary distribution is the Bayesian posterior distribution, and then treating the sequence of states as draws from the posterior. MCMC methods are supported by theory that guarantees that if one simulates the chain for long enough, Monte Carlo averages based on the sequence of states will converge to the exact posterior expectation of interest (e.g., Roberts & Rosenthal, 2004).

1 University of British Columbia, Department of Statistics, Vancouver, Canada. Correspondence to: Trevor Campbell.

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).

This exactness property is quite compelling: regardless of how well one is able to tune the Markov chain, the method is guaranteed to eventually produce an accurate result given enough computation time. Nevertheless, it remains a challenge to assess and optimize the performance of MCMC in practice with a finite computational budget. One option is to use a Stein discrepancy (Gorham & Mackey, 2015; Liu et al., 2016; Chwialkowski et al., 2016; Gorham & Mackey, 2017; Anastasiou et al., 2021), which quantifies how well a set of MCMC samples approximates the posterior distribution. But standard Stein discrepancies are not reliable in the presence of multimodality (Wenliang & Kanagawa, 2020), are computationally expensive to estimate, and suffer from the curse of dimensionality, although recent work addresses the latter two issues to an extent (Huggins & Mackey, 2018; Gong et al., 2021). The other option is to use a traditional diagnostic, e.g., Gelman-Rubin (Gelman & Rubin, 1992; Brooks & Gelman, 1998), effective sample size (Gelman et al., 2013, p. 286), Geweke (1992), or others (Cowles & Carlin, 1996). These diagnostics detect mixing issues, but do not comprehensively quantify how well the MCMC samples approximate the posterior.
Variational inference (VI) (Jordan et al., 1999; Wainwright & Jordan, 2008; Blei et al., 2017) is an alternative to MCMC that does provide a straightforward quantification of posterior approximation error. In particular, VI involves approximating the posterior with a probability distribution typically selected from a parametric family that enables both i.i.d. sampling and density evaluation (Wainwright & Jordan, 2008; Rezende & Mohamed, 2015a; Ranganath et al., 2016; Papamakarios et al., 2021). Because one can both obtain i.i.d. draws and evaluate the density, one can estimate the ELBO (Blei et al., 2017), i.e., the negative Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) to the posterior up to a constant. The ability to estimate this quantity, in turn, enables scalable tuning via straightforward stochastic gradient descent algorithms (Hoffman et al., 2013; Ranganath et al., 2014), optimally mixed approximations (Jaakkola & Jordan, 1998; Gershman et al., 2012; Zobay, 2014; Guo et al., 2016; Wang, 2016; Miller et al., 2017; Locatello et al., 2018a;b; Campbell & Li, 2019), model selection (Corduneau & Bishop, 2001; Masa-aki, 2001; Constantinopoulos et al., 2006; Ormerod et al., 2017; Chérief-Abdellatif & Alquier, 2018; Tao et al., 2018), and more. However, VI typically does not possess the same exactness regardless of tuning that MCMC does. The optimal variational distribution is not usually equal to the posterior due to the use of a limited parametric variational family; and even if it were, one could not reliably find it due to nonconvexity of the KL objective (Xu & Campbell, 2022).

Recent work addresses this problem by constructing variational families from parametrized Markov chains targeting the posterior. Many are related to annealed importance sampling (Salimans et al., 2015; Wolf et al., 2016; Geffner & Domke, 2021; Zhang et al., 2021; Thin et al., 2021b; Jankowiak & Phan, 2021); these methods introduce numerous auxiliary variables and only have convergence guarantees in the limit of increasing dimension of the joint distribution. Those based on flows (Neal, 2005; Caterini et al., 2018; Chen et al., 2022) avoid the increase in dimension with flow length, but typically do not have guaranteed convergence to the target. Methods involving the final-state marginal of finite simulations of standard MCMC methods (e.g., Zhang & Hernández-Lobato, 2020) do not enable density evaluation.

The first contribution of this work is a new family of mixed variational flows (MixFlows), constructed via averaging over repeated applications of a pushforward map to an initial reference distribution. We develop efficient methods for i.i.d. sampling, density evaluation, and unbiased ELBO estimation. The second contribution is a theoretical analysis of MixFlows. We show (Theorems 4.1 and 4.2) that when the map family is ergodic and measure-preserving, MixFlows converge to the target distribution for any value of the variational parameter, and hence have guarantees regardless of tuning as in MCMC. We then extend these results (Theorem 4.3 and Corollary 4.4) to MixFlows based on approximated maps, which are typically necessary in practice, with bounds on the accumulation of error as the flow length increases. The third contribution of the work is an implementation of MixFlows via uncorrected discretized Hamiltonian dynamics.
Simulated and real data experiments compare performance to the No-U-Turn Sampler (NUTS) (Hoffman & Gelman, 2014), standard Hamiltonian Monte Carlo (HMC) (Neal, 2011), and several black-box normalizing flows (Rezende & Mohamed, 2015b; Dinh et al., 2017). Our results demonstrate a comparable sample quality to NUTS, similar computational efficiency to HMC, and more reliable posterior approximations than standard normalizing flows.

Related work. Averages of Markov chain state distributions in general were studied in early work on shift-coupling (Aldous & Thorisson, 1993), with convergence guarantees established by Roberts & Rosenthal (1997). However, these guarantees involve minorization and drift conditions that are designed for stochastic Markov chains, and are not applicable to MixFlows. Averages of deterministic pushforwards specifically, to enable density evaluation, have also appeared in more recent work. Rotskoff & Vanden-Eijnden (2019) and Thin et al. (2021a) use an average of pushforwards generated by simulating nonequilibrium dynamics as an importance sampling proposal. The proposal distribution itself does not come with any convergence guarantees or tuning guidance, due to the use of non-measure-preserving, non-ergodic damped Hamiltonian dynamics. Our work provides comprehensive convergence theory and establishes a convenient means of optimizing hyperparameters. MixFlows are also related to past work on deterministic MCMC (Murray & Elliott, 2012; Neal, 2012; ver Steeg & Galstyan, 2021; Neklyudov et al., 2021). Murray & Elliott (2012) developed a Markov chain Monte Carlo procedure based on an arbitrarily dependent random stream via augmentation and measure-preserving bijections. ver Steeg & Galstyan (2021) designed a specialized momentum distribution that generates valid Monte Carlo samples solely through the simulation of deterministic Hamiltonian dynamics. Neklyudov et al. (2021) proposed a general form of measure-preserving dynamics that can be used to construct deterministic Gibbs samplers. These works all involve only deterministic updates, but do not construct variational approximations, provide total variation convergence guarantees, or provide guidance on hyperparameter tuning. Finally, some of these works involve discretization of dynamical systems, but do not characterize the resulting error (ver Steeg & Galstyan, 2021; Neklyudov et al., 2021). In contrast, our work provides a comprehensive convergence theory, with error bounds for when approximate maps are used.

2. Background

2.1. Variational inference with flows

Consider a set $X \subseteq \mathbb{R}^d$ and a target probability distribution $\pi$ on $X$ whose density with respect to the Lebesgue measure we denote $\pi(x)$ for $x \in X$. In the setting of Bayesian inference, $\pi$ is the posterior distribution that we aim to approximate, and we are only able to evaluate a function $p(x)$ such that $p(x) = Z\pi(x)$ for some unknown normalization constant $Z > 0$. Throughout, we will assume all distributions have densities with respect to the Lebesgue measure on $X$, and will use the same symbol to denote a distribution and its density; it will be clear from context what is meant. Variational inference involves approximating the target distribution $\pi$ by minimizing the Kullback-Leibler (KL) divergence to members of a parametric family $\{q_\lambda : \lambda \in \Lambda\}$, $\Lambda \subseteq \mathbb{R}^p$, i.e.,

$$\lambda^\star = \operatorname*{arg\,min}_{\lambda \in \Lambda} D_{\mathrm{KL}}(q_\lambda \,\|\, \pi) = \operatorname*{arg\,min}_{\lambda \in \Lambda} \int q_\lambda(x) \log \frac{q_\lambda(x)}{p(x)}\, dx. \quad (1)$$

The two objective functions in Equation (1) differ only by the constant $\log Z$.
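To make Equation (1) concrete, the following minimal Python sketch (ours, not from the paper; the Gaussian family and unnormalized target are illustrative assumptions) computes the Monte Carlo ELBO estimate $\mathbb{E}_q[\log p(X) - \log q(X)] = \log Z - D_{\mathrm{KL}}(q \| \pi)$ for a family that supports i.i.d. sampling and density evaluation:

```python
import numpy as np

def log_p(x):
    # Unnormalized target log-density: N(1, 0.5^2) up to a constant.
    return -0.5 * ((x - 1.0) / 0.5) ** 2

def elbo(mu, sigma, n_samples=10_000, rng=np.random.default_rng(0)):
    # Monte Carlo estimate of E_q[log p(X) - log q(X)] = log Z - KL(q || pi).
    x = rng.normal(mu, sigma, size=n_samples)          # i.i.d. draws from q
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    return np.mean(log_p(x) - log_q)

# Maximizing the ELBO over (mu, sigma) is equivalent to minimizing KL(q || pi):
print(elbo(0.0, 1.0), elbo(1.0, 0.5))  # the latter should be larger
```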
In order to optimize $\lambda$ using standard techniques, the variational family $q_\lambda$, $\lambda \in \Lambda$, must enable both i.i.d. sampling and density evaluation. A common approach to constructing such a family is to pass draws from a simple reference distribution $q_0$ through a measurable function $T_\lambda : X \to X$; $T_\lambda$ is often referred to as a flow when comprised of repeated composed functions (Tabak & Turner, 2013; Rezende & Mohamed, 2015a; Kobyzev et al., 2021). If $T_\lambda$ is a diffeomorphism, i.e., differentiable with a differentiable inverse, then we can express the density of $X = T_\lambda(Y)$, $Y \sim q_0$, as

$$\forall x \in X, \quad q_\lambda(x) = \frac{q_0(T_\lambda^{-1}(x))}{J_\lambda(T_\lambda^{-1}(x))}, \qquad J_\lambda(x) = \left|\det \nabla_x T_\lambda(x)\right|.$$

In this case the optimization in Equation (1) can be rewritten using a transformation of variables as

$$\lambda^\star = \operatorname*{arg\,min}_{\lambda \in \Lambda} \int q_0(x) \log \frac{q_0(x)}{J_\lambda(x)\, p(T_\lambda(x))}\, dx. \quad (2)$$

One can solve the optimization problem in Equation (2) using unbiased stochastic estimates of the gradient¹ with respect to $\lambda$ based on draws from $q_0$ (Salimans & Knowles, 2013; Kingma & Welling, 2014),

$$\nabla_\lambda D_{\mathrm{KL}}(q_\lambda \,\|\, \pi) \approx \nabla_\lambda \log \frac{q_0(X)}{J_\lambda(X)\, p(T_\lambda(X))}, \qquad X \sim q_0.$$

¹ We assume throughout that differentiation and integration can be swapped wherever necessary.

2.2. Measure-preserving and ergodic maps

Variational flows are often constructed from a flexible, general-purpose parametrized family $\{T_\lambda : \lambda \in \Lambda\}$ that is not specialized for any particular target distribution (Papamakarios et al., 2021); it is the job of the KL divergence minimization in Equation (2) to adapt the parameter $\lambda$ such that $q_\lambda$ becomes a good approximation of the target $\pi$. However, there are certain functions, in particular those that are both measure-preserving and ergodic for $\pi$, that naturally provide a means to approximate expectations of interest under $\pi$ without the need for tuning. Intuitively, a measure-preserving map $T$ will not change the distribution of draws from $\pi$: if $X \sim \pi$, then $T(X) \sim \pi$. And an ergodic map $T$, when applied repeatedly, will not get stuck in a subset of $X$ unless that subset has probability either 0 or 1 under $\pi$. The precise definitions are given in Definitions 2.1 and 2.2.

Definition 2.1 (Measure-preserving map (Eisner et al., 2015, pp. 73, 105)). $T : X \to X$ is measure-preserving for $\pi$ if $T\pi = \pi$, where $T\pi$ is the pushforward measure given by $(T\pi)(A) = \pi(T^{-1}(A))$ for each measurable set $A \subseteq X$.

Definition 2.2 (Ergodic map (Eisner et al., 2015, pp. 73, 105)). $T : X \to X$ is ergodic for $\pi$ if for all measurable sets $A \subseteq X$, $T(A) = A$ implies that $\pi(A) \in \{0, 1\}$.

If a map $T$ satisfies both Definitions 2.1 and 2.2, then long-run averages resulting from repeated applications of $T$ will converge to expectations under $\pi$, as shown by Theorem 2.3. When $X$ is compact, this result shows that the discrete measure $\frac{1}{N}\sum_{n=0}^{N-1}\delta_{T^n x}$ converges weakly to $\pi$ (Dajani & Dirksin, 2008, Theorem 6.1.7).

Theorem 2.3 (Ergodic Theorem (Birkhoff, 1931; Eisner et al., 2015, p. 212)). Suppose $T : X \to X$ is measure-preserving and ergodic for $\pi$, and $f \in L^1(\pi)$. Then

$$\lim_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} f(T^n x) = \int f\, d\pi \quad \text{for } \pi\text{-a.e. } x \in X.$$

Based on this result, one might reasonably consider building a measure-preserving variational flow, i.e., $X = T(X_0)$ where $X_0 \sim q_0$. However, it is straightforward to show that measure-preserving bijections $T$ do not decrease the KL divergence (or any other $f$-divergence, e.g., total variation and Hellinger (Qiao & Minematsu, 2010, Theorem 1)):

$$D_{\mathrm{KL}}(Tq \,\|\, \pi) = D_{\mathrm{KL}}(Tq \,\|\, T\pi) = D_{\mathrm{KL}}(q \,\|\, \pi).$$
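As a quick illustration of Theorem 2.3 (ours, not from the paper), the rotation $T(x) = x + \xi \bmod 1$ with irrational $\xi$ is a classical map that is measure-preserving and ergodic for $\mathrm{Unif}[0,1]$; the sketch below checks that its long-run averages converge to the uniform expectation:

```python
import numpy as np

# Irrational rotation on [0, 1): measure-preserving and ergodic for Unif[0, 1].
xi = np.pi / 16.0                            # irrational shift
f = lambda x: np.cos(2 * np.pi * x) ** 2     # test function with E_pi[f] = 0.5

x, avg, N = 0.3, 0.0, 100_000
for _ in range(N):
    avg += f(x) / N                          # running long-run average
    x = (x + xi) % 1.0                       # apply the map T

print(avg)   # approaches 0.5 = integral of f under Unif[0, 1] as N grows
```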
3. Mixed variational flows (MixFlows)

In this section, we develop a general family of mixed variational flows (MixFlows), as well as algorithms for tractable i.i.d. sampling, density evaluation, and ELBO estimation. MixFlows consist of a mixture of normalizing flows obtained via repeated application of a pushforward map. This section introduces general MixFlows; later, in Sections 4 and 5, we will provide convergence guarantees and examples based on Hamiltonian dynamics.

3.1. Variational family

Define a reference distribution $q_0$ on $X$ for which i.i.d. sampling and density evaluation are tractable, and a collection of measurable functions $T_\lambda : X \to X$ parametrized by $\lambda \in \Lambda$. Then the MixFlow family generated by $q_0$ and $T_\lambda$ is

$$q_{\lambda,N} = \frac{1}{N}\sum_{n=0}^{N-1} T_\lambda^n q_0, \qquad \lambda \in \Lambda,\ N \in \mathbb{N},$$

where $T_\lambda^n q_0$ denotes the pushforward of the distribution $q_0$ under $n$ repeated applications of $T_\lambda$.

3.2. Density evaluation and sampling

If $T_\lambda : X \to X$ is a diffeomorphism with Jacobian $J_\lambda : X \to \mathbb{R}$, we can express the density of $q_{\lambda,N}$ by using a transformation of variables formula on each component in the mixture:

$$q_{\lambda,N}(x) = \frac{1}{N}\sum_{n=0}^{N-1} \frac{q_0(T_\lambda^{-n}x)}{\prod_{j=1}^{n} J_\lambda(T_\lambda^{-j}x)}.$$

This density can be computed efficiently using $N-1$ evaluations of $T_\lambda^{-1}$ and $J_\lambda$ each (Algorithm 2). For sampling, we can obtain an independent draw $X \sim q_{\lambda,N}$ by treating $q_{\lambda,N}$ as a mixture of $N$ distributions:

$$K \sim \mathrm{Unif}\{0, 1, \dots, N-1\}, \qquad X_0 \sim q_0, \qquad X = T_\lambda^K(X_0).$$

The procedure is given in Algorithm 1. On average, this computation requires $\mathbb{E}[K] = \frac{N-1}{2}$ applications of $T_\lambda$, and at most it requires $N-1$ applications. However, one often takes samples from $q_{\lambda,N}$ to estimate the expectation of some test function $f : X \to \mathbb{R}$. In this case, one can use all intermediate states over a single pass of a trajectory rather than individual i.i.d. draws. In particular, we obtain an unbiased estimate of $\int f(x)\, q_{\lambda,N}(dx)$ via

$$\frac{1}{N}\sum_{n=0}^{N-1} f(T_\lambda^n X_0), \qquad X_0 \sim q_0. \quad (3)$$

We refer to this estimate as the trajectory-averaged estimate of $f$. The trajectory-averaged estimate is preferred over the naïve estimate based on a single draw $f(X)$, $X \sim q_{\lambda,N}$, as its cost is of the same order ($N-1$ applications of $T_\lambda$) and its variance is bounded above by that of the naïve estimate, as shown in Proposition 3.1. See Appendix E.2, and Figure 6 in particular, for empirical verification of Proposition 3.1.

Proposition 3.1. Suppose $X_0 \sim q_0$ and $X \sim q_{\lambda,N}$. Then

$$\mathrm{Var}\left[\frac{1}{N}\sum_{n=0}^{N-1} f(T_\lambda^n X_0)\right] \le \mathrm{Var}\left[f(X)\right].$$

3.3. ELBO estimation

We can minimize the KL divergence from $q_{\lambda,N}$ to $\pi$ by maximizing the ELBO (Blei et al., 2017), given by

$$\mathrm{ELBO}(\lambda, N) = \int q_{\lambda,N}(x)\log\frac{p(x)}{q_{\lambda,N}(x)}\,dx = \int q_0(x)\, \frac{1}{N}\sum_{n=0}^{N-1}\log\frac{p(T_\lambda^n x)}{q_{\lambda,N}(T_\lambda^n x)}\,dx.$$

The trajectory-averaged ELBO estimate is thus

$$X_0 \sim q_0, \qquad \widehat{\mathrm{ELBO}}(\lambda, N) = \frac{1}{N}\sum_{n=0}^{N-1}\log\frac{p(T_\lambda^n X_0)}{q_{\lambda,N}(T_\lambda^n X_0)}. \quad (4)$$

The naïve method to compute this estimate (sampling $X_0$ and then computing the log density ratio for each term) requires $O(N^2)$ computation, because each evaluation of $q_{\lambda,N}(x)$ is $O(N)$. Algorithm 3 provides an efficient way of computing $\widehat{\mathrm{ELBO}}(\lambda, N)$ in $O(N)$ operations, which is on par with taking a single draw from $q_{\lambda,N}$ or evaluating $q_{\lambda,N}(x)$ once.

Algorithm 1 Sample(qλ,N): take a draw from qλ,N
Input: reference distribution q0, flow map Tλ, number of steps N
  K ← Sample(Unif{0, 1, …, N−1})
  x0 ← Sample(q0)
Return: $T_\lambda^K(x_0)$

Algorithm 2 log qλ,N(x): evaluate the log-density of qλ,N
Input: location x, reference distribution q0, flow map Tλ, Jacobian Jλ, number of steps N
  L ← 0
  w0 ← log q0(x)
  for n = 1, …, N−1 do
    x ← $T_\lambda^{-1}(x)$
    L ← L + log Jλ(x)
    wn ← log q0(x) − L
  end for
Return: LogSumExp(w0, …, w_{N−1}) − log N
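The following Python sketch mirrors Algorithms 1 and 2 for a toy one-dimensional flow (ours, not from the paper; the affine stand-in map and Gaussian reference are illustrative assumptions, not the Hamiltonian map of Section 5):

```python
import numpy as np
from scipy.special import logsumexp

# A stand-in diffeomorphism T(x) = x/2 + 1 (hypothetical; any bijective
# flow map with a tractable inverse and log-Jacobian works here).
T      = lambda x: 0.5 * x + 1.0
T_inv  = lambda x: 2.0 * (x - 1.0)
log_J  = lambda x: np.log(0.5)            # log |T'(x)|, constant for this map

log_q0 = lambda x: -0.5 * x**2 - 0.5 * np.log(2 * np.pi)   # N(0, 1) reference

def sample_mixflow(N, rng):
    """Algorithm 1: draw X = T^K(X0) with K ~ Unif{0,...,N-1}, X0 ~ q0."""
    K, x = rng.integers(N), rng.normal()
    for _ in range(K):
        x = T(x)
    return x

def log_density_mixflow(x, N):
    """Algorithm 2: log q_{lambda,N}(x) via N-1 inverse/Jacobian evaluations."""
    L, ws = 0.0, [log_q0(x)]
    for _ in range(N - 1):
        x = T_inv(x)
        L += log_J(x)                     # running log-Jacobian product
        ws.append(log_q0(x) - L)          # log of the n-th mixture component
    return logsumexp(ws) - np.log(N)

rng = np.random.default_rng(1)
xs = [sample_mixflow(5, rng) for _ in range(5)]
print(xs, log_density_mixflow(0.0, 5))
```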
The key insight in Algorithm 3 is that we can evaluate the collection of values $\{q_{\lambda,N}(X_0), q_{\lambda,N}(T_\lambda X_0), \dots, q_{\lambda,N}(T_\lambda^{N-1}X_0)\}$ incrementally, starting from $q_{\lambda,N}(X_0)$ and iteratively computing each $q_{\lambda,N}(T_\lambda^n X_0)$ for increasing $n$ in constant time. Specifically, one computes $q_{\lambda,N}(x)$ once at $O(N)$ cost, and then iteratively updates $q_{\lambda,N}(T_\lambda^k x)$ and the running Jacobian product for $k = 1, 2, \dots, N-1$. Each update requires only constant cost if one precomputes and stores $T_\lambda^{-N+1}x, \dots, T_\lambda^{N-1}x$ and $J_\lambda(T_\lambda^{-N+1}x), \dots, J_\lambda(T_\lambda^{N-1}x)$. This implies that Algorithm 3 requires $O(N)$ memory; Algorithm 5 in Appendix B provides a slightly more complicated $O(N)$-time, $O(1)$-memory implementation of Equation (4).

Algorithm 3 EstELBO(λ, N): obtain an unbiased estimate of the ELBO for qλ,N
Input: reference q0, unnormalized target p, flow map Tλ, Jacobian Jλ, number of flow steps N
  x0 ← Sample(q0), J0 ← Jλ(x0)
  for n = 1, …, N−1 do
    xn ← Tλ(x_{n−1}), Jn ← Jλ(xn)
    x_{−n} ← $T_\lambda^{-1}$(x_{−n+1}), J_{−n} ← Jλ(x_{−n})
  end for
  z0 ← qλ,N(x0) (via Algorithm 2)
  J ← $\prod_{j=1}^{N-1}$ J_{−j}
  for n = 1, …, N−1 do
    qn ← (1/N) q0(x_{−N+n}) / J
    zn ← (z_{n−1} − qn) / J_{n−1} + (1/N) q0(xn)
    if n < N−1 then
      J ← J · J_{n−1} / J_{−N+n}
    end if
  end for
  $\widehat{\mathrm{ELBO}}(\lambda, N)$ ← (1/N) $\sum_{n=0}^{N-1}$ (log p(xn) − log zn)
Return: $\widehat{\mathrm{ELBO}}(\lambda, N)$

4. Guarantees for MixFlows

In this section, we show that when $T_\lambda$ is measure-preserving and ergodic for $\pi$ (or approximately so), MixFlows come with convergence guarantees and bounds on their approximation error as a function of flow length. Proofs for all results may be found in Appendix A.

4.1. Ergodic MixFlows

Ergodic MixFlow families are those where, for all $\lambda \in \Lambda$, $T_\lambda$ is measure-preserving and ergodic for $\pi$. In this setting, Theorems 4.1 and 4.2 show that MixFlows converge to the target weakly and in total variation as $N \to \infty$ regardless of the choice of $\lambda$ (i.e., regardless of parameter tuning effort). Thus, ergodic MixFlow families provide the same compelling convergence result as MCMC, but with the added benefit of unbiased ELBO estimates. We first demonstrate setwise and weak convergence. Recall that a sequence of distributions $q_n$ converges weakly to $\pi$ if for all bounded, continuous $f : X \to \mathbb{R}$, $\lim_n \int f(x)\, q_n(dx) = \int f(x)\, \pi(dx)$, and converges setwise to $\pi$ if for all measurable $A \subseteq X$, $\lim_n q_n(A) = \pi(A)$.

Theorem 4.1. Suppose $q_0 \ll \pi$ and $T_\lambda$ is measure-preserving and ergodic for $\pi$. Then $q_{\lambda,N}$ converges both setwise and weakly to $\pi$ as $N \to \infty$.

Using Theorem 4.1 as a stepping stone, we can obtain convergence in total variation. Recall that a sequence of distributions $q_n$ converges in total variation to $\pi$ if $D_{\mathrm{TV}}(q_n, \pi) = \sup_A |q_n(A) - \pi(A)| \to 0$ as $n \to \infty$. Note that similar nonasymptotic results exist for the ergodic average law of Markov chains (Roberts & Rosenthal, 1997), but as previously mentioned, these results do not apply to deterministic Markov kernels.

Theorem 4.2. Suppose $q_0 \ll \pi$ and $T_\lambda$ is measure-preserving and ergodic for $\pi$. Then $q_{\lambda,N}$ converges in total variation to $\pi$ as $N \to \infty$.

4.2. Approximated MixFlows

In practice, it is rare to be able to evaluate a measure-preserving map exactly; but approximations are commonly available. For example, in Section 5 we will create a measure-preserving map using Hamiltonian dynamics, and then approximate that map with a discretized leapfrog integrator. We therefore need to extend the result in Section 4.1 to apply to approximated maps. Suppose we have two maps, $T$ and $\hat T$, with corresponding MixFlow families $q_n$ and $\hat q_n$ (suppressing $\lambda$ for brevity). Theorem 4.3 shows that the error of the MixFlow family $\hat q_N$ reflects an accumulation of the difference between the pushforward of each $q_n$ under $\hat T$ and $T$.

Theorem 4.3. Suppose $\hat T$ is a bijection. Then

$$D_{\mathrm{TV}}(\hat q_N, \pi) \le D_{\mathrm{TV}}(q_N, \pi) + \sum_{n<N} \frac{n}{N}\, D_{\mathrm{TV}}\!\left(Tq_n, \hat Tq_n\right).$$
Suppose $T$ is measure-preserving and ergodic for $\pi$, and $q_0 \ll \pi$. Then Theorem 4.2 implies that $D_{\mathrm{TV}}(q_N, \pi) \to 0$; and for the second term, we (very loosely) expect that $D_{\mathrm{TV}}(Tq_n, \hat Tq_n) \approx D_{\mathrm{TV}}(\pi, \hat T\pi)$ for large $n$. Therefore, the second term should behave like $O(N\, D_{\mathrm{TV}}(\pi, \hat T\pi))$, i.e., increase linearly in $N$ in proportion to how non-measure-preserving $\hat T$ is. This presents a trade-off between flow length and approximation quality: better approximations enable longer flows and lower minimal error bounds. Our empirical findings in Section 6 generally confirm this behavior. Corollary 4.4 further provides a more explicit characterization of this trade-off when $\hat T$ and its log-Jacobian $\log \hat J$ are uniformly close to their exact counterparts $T$ and $\log J$, and $\log q_n$ and $\log J$ are uniformly (over $n \le N$) Lipschitz continuous. The latter assumption is reasonable in practical settings, where we observe that the log-density of $q_n$ closely approximates that of $\pi$ (see the experimental results in Section 6).

Corollary 4.4. Suppose that for all $x \in X$,

$$\max\left\{ \left\|\hat T^{-1}x - T^{-1}x\right\|,\ \left|\log \hat J(x) - \log J(x)\right| \right\} \le \epsilon,$$

that for all $n \in [N-1]$, $\log q_n$ and $\log J$ are $\ell$-Lipschitz continuous, and that $T$, $\hat T$ are diffeomorphisms. Then

$$D_{\mathrm{TV}}(\hat q_N, \pi) \le D_{\mathrm{TV}}(q_N, \pi) + N\epsilon(\ell+1)e^{(2\ell+1)\epsilon}.$$

This result states that the overall map-approximation error is $O(N\epsilon)$ when $\epsilon$ is small. Theorem 1 of Butzer & Westphal (1971) also suggests that $D_{\mathrm{TV}}(q_N, \pi) = O(1/N)$ in many cases, which hints that the bound should decrease roughly until $N = O(1/\sqrt{\epsilon})$, with error bound $O(\sqrt{\epsilon})$. We leave a more careful investigation of this trade-off for future work.

4.3. Discussion

Our main convergence results (Theorems 4.1 and 4.2) require that $\pi$ dominates the reference $q_0$, and that $T$ is both measure-preserving and ergodic. It is often straightforward to design a dominated reference $q_0$. Designing a measure-preserving map $T$ with an implementable approximation $\hat T \approx T$ is also often feasible. However, conclusively verifying the ergodicity of $T$ is typically a very challenging task. As a consequence, most past work involving measure-preserving transformations simply asserts the ergodic hypothesis without proof; see the discussions in ver Steeg & Galstyan (2021, p. 4) and Tupper (2005, pp. 2-3). But because MixFlows provide the ability to estimate the Kullback-Leibler divergence up to a constant, it is not critical to prove convergence a priori (as it is in the case of MCMC, for example). Instead, we suggest using the results of Theorems 4.1 to 4.3 and Corollary 4.4 as a guiding recipe for constructing principled variational families. First, design a family of measure-preserving maps $T_\lambda$, $\lambda \in \Lambda$. Next, approximate $T_\lambda$ with some tractable $T_{\lambda,\epsilon}$ including a tunable fidelity parameter $\epsilon \ge 0$, such that as $\epsilon \to 0$, $T_{\lambda,\epsilon}$ becomes closer to $T_\lambda$. Finally, build a MixFlow from $T_{\lambda,\epsilon}$, and tune both $\lambda$ and $\epsilon$ by maximizing the ELBO; a minimal sketch of this tuning loop follows. We follow this recipe in Section 5 and verify it empirically in Section 6.
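As an illustration of the recipe (a sketch under our own assumptions, not the paper's tuning code), one can sweep the fidelity parameter and keep the ELBO maximizer; `estimate_elbo` is a hypothetical stand-in for Algorithm 3, faked with a toy curve purely so the snippet runs:

```python
import numpy as np

def estimate_elbo(eps, n_steps=200, n_trajectories=100):
    # Hypothetical stand-in for Algorithm 3: would return an unbiased ELBO
    # estimate for the MixFlow built from the eps-approximate map. Here it is
    # faked with a noisy curve peaking at eps = 0.05 to make the sketch run.
    rng = np.random.default_rng(0)
    return -np.log(eps / 0.05) ** 2 + rng.normal(0.0, 0.01)

# Sweep the step size (fidelity parameter) and pick the ELBO maximizer.
candidate_eps = [0.001, 0.005, 0.01, 0.05, 0.1]
elbos = {eps: estimate_elbo(eps) for eps in candidate_eps}
print(elbos, "->", max(elbos, key=elbos.get))
```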
5. Uncorrected Hamiltonian MixFlows

In this section, we provide an example of how to design a MixFlow by starting from an exactly measure-preserving map and then creating a tunable approximation to it. The construction is inspired by Hamiltonian Monte Carlo (HMC) (Neal, 1996; 2011), in which each iteration involves simulating Hamiltonian dynamics followed by a stochastic momentum refreshment; our method replaces the stochastic refreshment with a deterministic transformation. In particular, consider the augmented target density on $X \times \mathbb{R}^d \times [0,1]$,

$$\bar\pi(x, \rho, u) = \pi(x)\, m(\rho)\, \mathbb{1}[0 \le u \le 1], \qquad m(\rho) = \prod_{i=1}^{d} r(\rho_i),$$

with auxiliary variables $\rho \in \mathbb{R}^d$, $u \in [0,1]$, and some almost-everywhere differentiable univariate probability density $r : \mathbb{R} \to \mathbb{R}_+$. The $x$-marginal distribution of $\bar\pi$ is the original target distribution $\pi$.

5.1. Measure-preserving map via Hamiltonian dynamics

We construct $T_\lambda$ by composing the following three steps, which are all measure-preserving for $\bar\pi$:

(1) Hamiltonian dynamics. We first apply $(x', \rho') \leftarrow H_L(x, \rho)$, where $H_L : X \times \mathbb{R}^d \to X \times \mathbb{R}^d$ is the map of Hamiltonian dynamics with position $x$ and momentum $\rho$ simulated for a time interval of length $L \in \mathbb{R}_+$,

$$\frac{d\rho}{dt} = \nabla \log \pi(x), \qquad \frac{dx}{dt} = -\nabla \log m(\rho). \quad (5)$$

One can show that this map preserves the density (and hence is measure-preserving) and also has unit Jacobian (Neal, 2011).

(2) Pseudotime shift. Second, we apply a constant shift to the pseudotime variable $u$,

$$u' \leftarrow u + \xi \ \bmod\ 1,$$

where $\xi \in \mathbb{R}$ is a fixed irrational number (say, $\xi = \pi/16$). As this is a constant shift, it has unit Jacobian and is density-preserving (and hence measure-preserving). The $u$ component acts as a notion of time for the flow, and ensures that the refreshment of $\rho$ in step (3) below will take a different form even if $x$ visits the same location again.

(3) Momentum refreshment. Finally, we refresh each of the momentum variables via

$$i = 1, \dots, d, \qquad \rho''_i \leftarrow R^{-1}\!\left(R(\rho'_i) + z(x'_i, u') \ \bmod\ 1\right),$$

where $R$ is the cumulative distribution function (CDF) of the density $r$, and $z : X \times [0,1] \to \mathbb{R}$ is any differentiable function; this generalizes the map from Neal (2012) and Murray & Elliott (2012) to enable the shift to depend on the state $x$ and pseudotime $u$. Step (3) is an attempt to replicate the independent resampling $\rho \sim m$ from HMC using only deterministic maps. The map is measure-preserving, as it involves mapping the momentum to a $\mathrm{Unif}[0,1]$ random variable via the CDF, shifting by an amount that depends only on $x$ and $u$, and then mapping back using the inverse CDF. Its Jacobian is the momentum density ratio $m(\rho')/m(\rho'')$.

5.2. Approximation via uncorrected leapfrog integration

In practice, we cannot simulate the dynamics in step (1) perfectly. Instead, we approximate the dynamics in Equation (5) by running $L$ steps of the leapfrog method, where each leapfrog map $\hat H_\epsilon : \mathbb{R}^{2d} \to \mathbb{R}^{2d}$ interleaves three discrete transformations with step size $\epsilon > 0$:

$$\hat\rho_{k+1} = \rho_k + \frac{\epsilon}{2}\nabla\log\pi(x_k), \qquad x_{k+1} = x_k - \epsilon\,\nabla\log m(\hat\rho_{k+1}), \qquad \rho_{k+1} = \hat\rho_{k+1} + \frac{\epsilon}{2}\nabla\log\pi(x_{k+1}).$$

Denote by $T_{\lambda,\epsilon}$ the composition of the three steps with the Hamiltonian dynamics replaced by the leapfrog integrator; Algorithm 6 in Appendix C provides the pseudocode. The final variational tuning parameters for the MixFlow are the step size $\epsilon$ (which controls how close $T_{\lambda,\epsilon}$ is to being measure-preserving for $\bar\pi$), the Hamiltonian simulation length $L \in \mathbb{N}$, and the flow length $N$. In our experiments we tune the number of leapfrog steps $L$ and the step size $\epsilon$ by maximizing the ELBO. We also tune the number of refreshments $N$ to achieve a desirable computation-quality trade-off by visually inspecting the convergence of the ELBO.

5.3. Numerical stability

Density evaluation (Algorithm 2) and ELBO estimation (Algorithm 3) both involve repeated applications of $T_\lambda$ and $T_\lambda^{-1}$. This poses no issue in theory, but with floating point computations one should code the map $T_\lambda$ and its inverse $T_\lambda^{-1}$ in a numerically precise way, such that $(T_\lambda^{-K}) \circ (T_\lambda^{K})$ is the identity map for large $K \in \mathbb{N}$. Figure 8 in Appendix E.2 displays the severity of the numerical error when $T_\lambda$ and $T_\lambda^{-1}$ are not carefully implemented. In practice, we check the stability limits of our flow by taking draws from $q_0$ and evaluating $T_\lambda^{K}$ followed by $T_\lambda^{-K}$ (and vice versa) for increasing $K$ until we cannot reliably invert the flow; see Figure 7 in the appendix for an example usage of this diagnostic, and the sketch below for the construction. Note that for sample generation specifically (Algorithm 1), numerical stability is not a concern, as it only requires forward evaluation of the map $T_\lambda$.
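The following Python sketch assembles the three steps of Section 5.1 with the leapfrog approximation of Section 5.2 for a one-dimensional target (ours; the standard normal target is an assumption, while the Laplace momentum, $\xi = \pi/16$, and $z(x, u) = 0.5\sin(2x + u) + 0.5$ match the settings reported in Appendix E). The Jacobian, the momentum density ratio $m(\rho')/m(\rho'')$ from Section 5.1, is omitted for brevity; the final loop is the round-trip stability diagnostic described above:

```python
import numpy as np

# Stand-in 1-d target: standard normal (hypothetical; any differentiable
# unnormalized log-density works). grad_log_pi is its score function.
grad_log_pi = lambda x: -x

# Standard Laplace momentum: CDF R, inverse CDF, and -grad log m(rho) = sign(rho).
R     = lambda r: np.where(r < 0, 0.5 * np.exp(r), 1 - 0.5 * np.exp(-r))
R_inv = lambda p: np.where(p < 0.5, np.log(2 * p), -np.log(2 * (1 - p)))

xi = np.pi / 16                                   # irrational pseudotime shift
z  = lambda x, u: 0.5 * np.sin(2 * x + u) + 0.5   # deterministic shift function

def T(x, rho, u, eps=0.05, L=50):
    """Forward map: L leapfrog steps, pseudotime shift, momentum refreshment."""
    for _ in range(L):
        rho = rho + 0.5 * eps * grad_log_pi(x)
        x   = x + eps * np.sign(rho)              # -grad log m(rho) = sign(rho)
        rho = rho + 0.5 * eps * grad_log_pi(x)
    u   = (u + xi) % 1.0
    rho = R_inv((R(rho) + z(x, u)) % 1.0)         # CDF-shift momentum refresh
    return x, rho, u

def T_inv(x, rho, u, eps=0.05, L=50):
    """Exact inverse of T (up to floating point error)."""
    rho = R_inv((R(rho) - z(x, u)) % 1.0)         # undo the refreshment
    u   = (u - xi) % 1.0                          # undo the pseudotime shift
    for _ in range(L):                            # reverse the leapfrog steps
        rho = rho - 0.5 * eps * grad_log_pi(x)
        x   = x - eps * np.sign(rho)
        rho = rho - 0.5 * eps * grad_log_pi(x)
    return x, rho, u

# Stability diagnostic from Section 5.3: push a draw forward K steps, pull it
# back K steps, and measure how far we end up from where we started.
state0 = state = (0.7, -0.3, 0.0)
K = 100
for _ in range(K):
    state = T(*state)
for _ in range(K):
    state = T_inv(*state)
print([abs(a - b) for a, b in zip(state0, state)])  # near 0 while stable
```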
Figure 1. Marginal samples (first row) and pairs of exact and approximate joint log density (second row) for Gaussian (1a), Gaussian mixture (1b), and Cauchy (1c) targets; marginal samples (third row) and pairs of sliced exact and approximate joint log density (fourth row) for banana (1d), funnel (1e), cross (1f), and warped Gaussian (1g) targets.

6. Experiments

In this section, we demonstrate the performance of our method (MixFlow) on 7 synthetic targets and 7 real data targets. See Appendix E for the details of each target. Both our synthetic and real data examples are designed to cover a range of challenging features such as heavy tails, high dimensions, multimodality, and weak identifiability. We compare posterior approximation quality to several black-box normalizing flow methods (NFs): Planar Flow, Radial Flow, and Real NVP with various architectures (Papamakarios et al., 2021). To make the methods comparable via the ELBO, we train all NFs on the same joint space as MixFlow. We also compare the marginal sample quality of MixFlow against 5,000 samples from NUTS and the NFs. Finally, we compare sampling time with all competitors, and effective sample size (ESS) per second with HMC. For all experiments, we use the standard Laplace distribution as the momentum distribution due to its numerical stability (see Figure 7 in Appendix E). Additional comparisons to variational inference based on uncorrected Hamiltonian annealing (UHA) (Geffner & Domke, 2021) and nonequilibrium orbit sampling (NEO) (Thin et al., 2021a, Algorithm 2) may be found in Appendix E. All experiments were conducted on a machine with an AMD Ryzen 9 3900X and 32GB of RAM. Code is available at https://github.com/zuhengxu/Ergodic-variational-flow-code.

6.1. Qualitative assessment

We begin with a qualitative examination of the i.i.d. samples and the approximated targets produced by MixFlow initialized at $q_0 = N(0, 1)$ for three one-dimensional synthetic distributions: a Gaussian, a mixture of Gaussians, and the standard Cauchy. We excluded the pseudotime variable $u$ here in order to visualize the full joint density of $(x, \rho)$ in 2 dimensions. More details can be found in Appendix E.1. Figures 1a to 1c show histograms of 10,000 i.i.d. $x$-marginal samples generated by MixFlow for each of the three targets, which nearly perfectly match the true target marginals. Figures 1a to 1c also show that $\log q_{\lambda,N}$ is generally a good approximation of the log target density. We then present similar visualizations on four more challenging synthetic target distributions: the banana (Haario et al., 2001), Neal's funnel (Neal, 2003), a cross-shaped Gaussian mixture, and a warped Gaussian. All four examples have a 2-dimensional state $x \in \mathbb{R}^2$, and hence $(x, \rho, u) \in \mathbb{R}^5$. In each example we set the initial distribution $q_0$ to be the mean-field Gaussian approximation. More details can be found in Appendix E.2.
Figures 1d to 1g show scatter plots of 1,000 i.i.d. $x$-marginal samples drawn from MixFlow, as well as the approximated MixFlow log density and exact log density sliced as a function of $x \in \mathbb{R}^2$ for a single value of $(\rho, u)$ chosen randomly via $(\rho, u) \sim \mathrm{Lap}(0, I) \times \mathrm{Unif}[0,1]$ (slicing is required for visualization, as $(x, \rho, u) \in \mathbb{R}^5$). We see that, qualitatively, both the samples and approximated densities from MixFlow closely match the target. Finally, Figure 9 in Appendix E.2 provides a more comprehensive set of sample histograms (showing the $x$-, $\rho$-, and $u$-marginals). These visualizations support our earlier theoretical analysis.

6.2. Posterior approximation quality

Next, we provide a quantitative comparison of MixFlow, NFs, and NUTS on 7 real data experiments outlined in Appendix E.5. We tune each NF method under various settings (Tables 1 and 2 and Appendix E.5.5), and present the best one for each example. ELBOs of MixFlow are estimated with Algorithm 3, averaging over 1,000 independent trajectories. ELBOs of NFs are based on 2,000 samples. To obtain an assessment of the target marginal distribution itself (not the augmented target), we also compare methods using the kernel Stein discrepancy (KSD) with an inverse multiquadric (IMQ) kernel (Gorham & Mackey, 2017). NUTS and the NFs use 5,000 samples for KSD estimation, while MixFlow is based on 2,000 i.i.d. draws (Algorithm 1). For KSD comparisons, all variational methods are tuned by maximizing the ELBO (Figures 2 and 17).

Figure 2. ELBO and KSD comparison for real data examples: (a) linear regression, (b) linear regression (heavy), (c) logistic regression, (d) Poisson regression, (e) student-t regression, (f) sparse regression, (g) sparse regression (high dim). Figure 2a displays the effect of step size for the linear regression problem: $\epsilon_1 = 0.0001$, $\epsilon_2 = 0.001$, and tuned $\epsilon = 0.0005$; see Figure 17 for step sizes for all other experiments. Lines indicate the median, and error regions indicate the 25th to 75th percentile from 5 runs. Figure 2b does not include ELBOs of Planar Flow, as its values are significantly worse than those of all other methods and are hard to visualize (see its ELBOs in Figure 17b).

Augmented target distribution. Figure 2 displays the ELBO comparison. First, Figure 2a shows how different step sizes $\epsilon$ and numbers of refreshments $N$ affect approximation quality. Overly large step sizes typically result in errors due to the use of discretized Hamiltonian dynamics, while overly small step sizes result in a flow that makes slow progress towards the target. MixFlow with a tuned step size generally shows a comparable joint ELBO value to the best NF method, yielding a competitive target approximation. Similar comparisons and assessments of the effect of step size for the synthetic examples are presented in Figures 5, 10 and 16. Note that in three examples (Figures 2a, 2d and 2g), the tuned Real NVP ELBO exceeds that of MixFlow by a small amount; but this required expensive architecture search and parameter tuning, and as we will describe next, MixFlow is actually more reliable in terms of target marginal sample quality and density estimation.

Original target distribution. The second row of Figure 2 displays a comparison of KSD for the target distribution itself (instead of the augmented target). In particular, MixFlow produces sample quality comparable to that of NUTS, an exact MCMC method, and clearly outperforms all of the NF competitors. The scatter plots of samples in Figure 20 confirm the improvement in sample quality of MixFlow over the variational competitors.
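For reference, here is a sketch (ours, under the stated assumptions, not the paper's implementation) of a KSD estimator with the IMQ base kernel $k(x, y) = (c^2 + \|x - y\|_2^2)^\beta$, $\beta = -1/2$, $c = 1$, computed from samples and the target score $\nabla \log p$:

```python
import numpy as np

def ksd_imq(samples, score, c=1.0, beta=-0.5):
    """KSD: square root of the V-statistic estimate of KSD^2 with an
    IMQ base kernel k(x, y) = (c^2 + ||x - y||^2)^beta."""
    X = np.atleast_2d(samples)            # (n, d) samples
    S = np.apply_along_axis(score, 1, X)  # (n, d) scores grad log p(x)
    n, d = X.shape
    diff = X[:, None, :] - X[None, :, :]  # pairwise x_i - x_j
    r2 = np.sum(diff**2, axis=-1)
    base = c**2 + r2
    k = base**beta
    gkx = 2 * beta * base[..., None]**(beta - 1) * diff   # grad_x k
    # Stein kernel: k0(x,y) = tr[grad_x grad_y k] + grad_x k . s(y)
    #               + grad_y k . s(x) + k * s(x).s(y), with grad_y k = -grad_x k
    trace = -4 * beta * (beta - 1) * base**(beta - 2) * r2 \
            - 2 * beta * d * base**(beta - 1)
    k0 = (trace
          + np.einsum('ijk,jk->ij', gkx, S)
          - np.einsum('ijk,ik->ij', gkx, S)
          + k * (S @ S.T))
    return np.sqrt(np.mean(k0))

# Example: samples from N(0, I) scored against the N(0, I) target.
rng = np.random.default_rng(0)
score = lambda x: -x                      # grad log p for a standard normal
print(ksd_imq(rng.normal(size=(200, 2)), score))
```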
Figure 3 further shows the (sliced) densities on two difficult real data examples: Bayesian student-t regression (with a heavy-tailed posterior) and a high-dimensional sparse regression (parameter dimension 84). This result demonstrates that the densities provided by MixFlow more closely match those of the original target distribution than those of the best NF. Notice that MixFlow accurately captures the skew and heavy tails of the exact target, while the NF density fails to do so and contains many spurious modes.

Figure 3. Sliced log conditional densities for student-t regression (3a) and high-dimensional sparse regression (3b). We visualize the log conditional density of the first coordinate by fixing all other coordinates to 0. The NF methods shown are the best-performing ones from Figures 2e and 2g. Since we only know the log posterior density up to an unknown constant, we shift all lines to have maximal value 0 for visualization.

6.3. Ease of tuning

In order to tune MixFlow, we simply run a 1-dimensional parameter sweep for the step size $\epsilon$, and use a visual inspection of the ELBO to set an appropriate number of flow steps $N$. Tuning an NF requires optimizing its architecture, number of layers, and its (typically many) parameters. Not only is this time consuming (in our experiments, tuning took 10 minutes to roughly 1 hour; see Figure 4), but the optimization can also behave in unintuitive ways. For example, performance can be heavily dependent on the number of flow layers, and adding more layers does not necessarily improve quality. Figure 16, Tables 1 and 2, and Appendix E.5.5 show that using more layers does not necessarily help, and slows tuning considerably. In the case of Real NVP specifically, tuning can be unstable, especially for more complex models; the optimizer often returns NaN values for flow parameters during training (see Table 1). This instability has been noted in earlier work (Dinh et al., 2017, Sec. 3.7).

6.4. Time efficiency

Finally, Figure 4 presents timing results for two of the real data experiments (additional comparisons in Figures 14 and 19). In this figure, we use "MixFlow iid" to refer to i.i.d. sampling from MixFlow, and "MixFlow single" to refer to collecting all intermediate states on a single trajectory. This result shows that the per-sample time of MixFlow single is similar to NUTS and HMC, as one would expect. MixFlow iid is the slowest because each sample is generated by passing through the entire flow. The NF generates the fastest draws, but recall that this comes at the cost of significant initial tuning time; in the time it takes the NF to generate its first sample, MixFlow single has generated millions of samples in Figure 4. See Appendix E.4 for a detailed discussion of this trade-off.

Figure 4. Timing results (100 trials) for sparse regression (high dim) (4a) and student-t regression (4b), showing sampling time (first row) and ESS per second (second row).

Figures 4 and 19 further show the computational efficiency in terms of ESS per second on the real data examples, which reflects the autocorrelation between drawn samples. Results show that MixFlow produces ESS per second comparable to HMC. MixFlow single behaves similarly to HMC, as expected, since the pseudo-momentum refreshment we proposed (steps (2) and (3) of Section 5) resembles the momentum resampling step of HMC.
The ESS efficiency of MixFlow iid depends on the trade-off between its slower sampling time and the i.i.d. nature of its draws. In these real data examples, MixFlow iid typically produces a high ESS per second; in the synthetic examples (Figure 14b), MixFlow iid is similar to the others.

7. Conclusion

This work presented MixFlows, a new variational family constructed from mixtures of pushforward maps. Experiments demonstrate sample quality comparable to NUTS and more reliable posterior approximations than standard normalizing flows. A main limitation of our methodology is numerical stability: reversing the flow for long trajectories can be unstable in practice. Future work includes developing more stable momentum refreshment schemes, and extensions via involutive MCMC (Neklyudov et al., 2020; Spanbauer et al., 2020; Neklyudov & Welling, 2022).

Acknowledgements

The authors gratefully acknowledge the support of a National Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant and a UBC Four Year Doctoral Fellowship.

References

Aldous, D. and Thorisson, H. Shift-coupling. Stochastic Processes and their Applications, 44:1-14, 1993.

Anastasiou, A., Barp, A., Briol, F.-X., Ebner, B., Gaunt, R., Ghaderinezhad, F., Gorham, J., Gretton, A., Ley, C., Liu, Q., Mackey, L., Oates, C., Reinert, G., and Swan, Y. Stein's method meets statistics: a review of some recent developments. arXiv:2105.03481, 2021.

Birkhoff, G. Proof of the ergodic theorem. Proceedings of the National Academy of Sciences, 17(12):656-660, 1931.

Blei, D., Kucukelbir, A., and McAuliffe, J. Variational inference: a review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

Brooks, S. and Gelman, A. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434-455, 1998.

Butzer, P. and Westphal, U. The mean ergodic theorem and saturation. Indiana University Mathematics Journal, 20(12):1163-1174, 1971.

Campbell, T. and Li, X. Universal boosting variational inference. In Advances in Neural Information Processing Systems, 2019.

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., and Riddell, A. Stan: a probabilistic programming language. Journal of Statistical Software, 76(1), 2017.

Caterini, A., Doucet, A., and Sejdinovic, D. Hamiltonian variational auto-encoder. In Advances in Neural Information Processing Systems, 2018.

Chen, N., Xu, Z., and Campbell, T. Bayesian inference via sparse Hamiltonian flows. In Advances in Neural Information Processing Systems, 2022.

Chérief-Abdellatif, B.-E. and Alquier, P. Consistency of variational Bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995-3035, 2018.

Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In International Conference on Machine Learning, 2016.

Constantinopoulos, C., Titsias, M., and Likas, A. Bayesian feature and model selection for Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(6):1013-1018, 2006.

Corduneau, A. and Bishop, C. Variational Bayesian model selection for mixture distributions. In Artificial Intelligence and Statistics, 2001.

Cowles, M. K. and Carlin, B. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91(434):883-904, 1996.
Dajani, K. and Dirksin, S. A simple introduction to ergodic theory, 2008. URL: https://webspace.science.uu.nl/~kraai101/lecturenotes2009.pdf.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In International Conference on Learning Representations, 2017.

Eisner, T., Farkas, B., Haase, M., and Nagel, R. Operator Theoretic Aspects of Ergodic Theory. Graduate Texts in Mathematics. Springer, 2015.

Fjelde, T. E., Xu, K., Tarek, M., Yalburgi, S., and Ge, H. Bijectors.jl: flexible transformations for probability distributions. In Symposium on Advances in Approximate Bayesian Inference, pp. 1-17, 2020.

Geffner, T. and Domke, J. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021.

Gelman, A. and Rubin, D. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457-472, 1992.

Gelman, A., Carlin, J., Stern, H., Dunson, D., Vehtari, A., and Rubin, D. Bayesian Data Analysis. CRC Press, 3rd edition, 2013.

Gershman, S., Hoffman, M., and Blei, D. Nonparametric variational inference. In International Conference on Machine Learning, 2012.

Geweke, J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In Bernardo, J. M., Berger, J. O., and Dawid, A. P. (eds.), Bayesian Statistics, volume 4, pp. 169-193, 1992.

Gong, W., Li, Y., and Hernández-Lobato, J. M. Sliced kernelized Stein discrepancy. In International Conference on Learning Representations, 2021.

Gorham, J. and Mackey, L. Measuring sample quality with Stein's method. In Advances in Neural Information Processing Systems, 2015.

Gorham, J. and Mackey, L. Measuring sample quality with kernels. In International Conference on Machine Learning, 2017.

Guo, F., Wang, X., Fan, K., Broderick, T., and Dunson, D. Boosting variational inference. In Advances in Neural Information Processing Systems, 2016.

Haario, H., Saksman, E., and Tamminen, J. An adaptive Metropolis algorithm. Bernoulli, pp. 223-242, 2001.

Hamidieh, K. A data-driven statistical model for predicting the critical temperature of a superconductor. Computational Materials Science, 154:346-354, 2018.

Harrison Jr., D. and Rubinfeld, D. Hedonic housing prices and the demand for clean air. Journal of Environmental Economics and Management, 5(1):81-102, 1978.

Hoffman, M. and Gelman, A. The No-U-Turn Sampler: adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1):1593-1623, 2014.

Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. Stochastic variational inference. Journal of Machine Learning Research, 14:1303-1347, 2013.

Huggins, J. and Mackey, L. Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, 2018.

Jaakkola, T. and Jordan, M. Improving the mean field approximation via the use of mixture distributions. In Learning in Graphical Models, pp. 163-173. Springer, 1998.

Jankowiak, M. and Phan, D. Surrogate likelihoods for variational annealed importance sampling. arXiv:2112.12194, 2021.

Jordan, M., Ghahramani, Z., Jaakkola, T., and Saul, L. An introduction to variational methods for graphical models. Machine Learning, 37:183-233, 1999.

Kakutani, S. Iteration of linear operations in complex Banach spaces. Proceedings of the Imperial Academy, 14(8):295-300, 1938.

Kingma, D. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Kobyzev, I., Prince, S., and Brubaker, M. Normalizing flows: an introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979, 2021.

Kullback, S. and Leibler, R. On information and sufficiency. The Annals of Mathematical Statistics, 22(1):79-86, 1951.

Liu, C. and Rubin, D. ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, pp. 19-39, 1995.

Liu, Q., Lee, J., and Jordan, M. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In International Conference on Machine Learning, 2016.

Locatello, F., Dresdner, G., Khanna, R., Valera, I., and Rätsch, G. Boosting black box variational inference. In Advances in Neural Information Processing Systems, 2018a.

Locatello, F., Khanna, R., Ghosh, J., and Rätsch, G. Boosting variational inference: an optimization perspective. In International Conference on Artificial Intelligence and Statistics, 2018b.

Masa-aki, S. Online model selection based on the variational Bayes. Neural Computation, 13(7):1649-1681, 2001.

Miller, A., Foti, N., and Adams, R. Variational boosting: iteratively refining posterior approximations. In International Conference on Machine Learning, 2017.

Moro, S., Cortez, P., and Rita, P. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22-31, 2014.

Murray, I. and Elliott, L. Driving Markov chain Monte Carlo with a dependent random stream. arXiv:1204.3187, 2012.

Neal, R. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, No. 118. Springer-Verlag, 1996.

Neal, R. Slice sampling. The Annals of Statistics, 31(3):705-767, 2003.

Neal, R. Hamiltonian importance sampling. Banff International Research Station (BIRS) Workshop on Mathematical Issues in Molecular Dynamics, 2005.

Neal, R. MCMC using Hamiltonian dynamics. In Brooks, S., Gelman, A., Jones, G., and Meng, X.-L. (eds.), Handbook of Markov Chain Monte Carlo, chapter 5. CRC Press, 2011.

Neal, R. How to view an MCMC simulation as a permutation, with applications to parallel simulation and improved importance sampling. arXiv:1205.0070, 2012.

Neklyudov, K. and Welling, M. Orbital MCMC. In Artificial Intelligence and Statistics, 2022.

Neklyudov, K., Welling, M., Egorov, E., and Vetrov, D. Involutive MCMC: a unifying framework. In International Conference on Machine Learning, 2020.

Neklyudov, K., Bondesan, R., and Welling, M. Deterministic Gibbs sampling via ordinary differential equations. arXiv:2106.10188, 2021.

Ormerod, J., You, C., and Müller, S. A variational Bayes approach to variable selection. Electronic Journal of Statistics, 11(2):3549-3594, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22:1-64, 2021.

Qiao, Y. and Minematsu, N. A study on invariance of f-divergence and its application to speech recognition. IEEE Transactions on Signal Processing, 58(7):3884-3890, 2010.

Ranganath, R., Gerrish, S., and Blei, D. Black box variational inference. In International Conference on Artificial Intelligence and Statistics, 2014.

Ranganath, R., Tran, D., and Blei, D. Hierarchical variational models. In International Conference on Machine Learning, 2016.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015a.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530-1538. PMLR, 2015b.
Riesz, F. Some mean ergodic theorems. Journal of the London Mathematical Society, 13(4):274-278, 1938.

Robert, C. and Casella, G. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.

Robert, C. and Casella, G. A short history of Markov chain Monte Carlo: subjective recollections from incomplete data. Statistical Science, 26(1):102-115, 2011.

Roberts, G. and Rosenthal, J. Shift-coupling and convergence rates of ergodic averages. Stochastic Models, 13(1):147-165, 1997.

Roberts, G. and Rosenthal, J. General state space Markov chains and MCMC algorithms. Probability Surveys, 1:20-71, 2004.

Rotskoff, G. and Vanden-Eijnden, E. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical Review Letters, 122(15):150602, 2019.

Salimans, T. and Knowles, D. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837-882, 2013.

Salimans, T., Kingma, D., and Welling, M. Markov chain Monte Carlo and variational inference: bridging the gap. In International Conference on Machine Learning, 2015.

Spanbauer, S., Freer, C., and Mansinghka, V. Deep involutive generative models for neural MCMC. arXiv:2006.15167, 2020.

Tabak, E. and Turner, C. A family of non-parametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145-164, 2013.

Tao, C., Chen, L., Zhang, R., Henao, R., and Carin, L. Variational inference and model selection with generalized evidence bounds. In International Conference on Machine Learning, 2018.

Thin, A., Janati, Y., Le Corff, S., Ollion, C., Doucet, A., Durmus, A., Moulines, E., and Robert, C. NEO: non-equilibrium sampling on the orbit of a deterministic transform. In Advances in Neural Information Processing Systems, 2021a.

Thin, A., Kotelevskii, N., Durmus, A., Panov, M., Moulines, E., and Doucet, A. Monte Carlo variational auto-encoders. In International Conference on Machine Learning, 2021b.

Tupper, P. Ergodicity and the numerical simulation of Hamiltonian systems. SIAM Journal on Applied Dynamical Systems, 4(3):563-587, 2005.

U.S. Department of Justice, Federal Bureau of Investigation. Crime in the United States, 1995. URL: https://ucr.fbi.gov/crime-in-the-u.s/1995.

ver Steeg, G. and Galstyan, A. Hamiltonian dynamics with non-Newtonian momentum for rapid sampling. In Advances in Neural Information Processing Systems, 2021.

Wainwright, M. and Jordan, M. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.

Wang, X. Boosting variational inference: theory and examples. Master's thesis, Duke University, 2016.

Wenliang, L. and Kanagawa, H. Blindness of score-based methods to isolated components and mixing proportions. arXiv:2008.10087, 2020.

Wolf, C., Karl, M., and van der Smagt, P. Variational inference with Hamiltonian Monte Carlo. arXiv:1609.08203, 2016.

Xu, K., Ge, H., Tebbutt, W., Tarek, M., Trapp, M., and Ghahramani, Z. AdvancedHMC.jl: a robust, modular and efficient implementation of advanced HMC algorithms. In Symposium on Advances in Approximate Bayesian Inference, 2020.

Xu, Z. and Campbell, T. The computational asymptotics of variational inference and the Laplace approximation. Statistics and Computing, 32(4):1-37, 2022.

Yosida, K. Mean ergodic theorem in Banach spaces. Proceedings of the Imperial Academy, 14(8):292-294, 1938.
Zhang, G., Hsu, K., Li, J., Finn, C., and Grosse, R. Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021.

Zhang, Y. and Hernández-Lobato, J. M. Ergodic inference: accelerate convergence by optimisation. arXiv:1805.10377, 2020.

Zobay, O. Variational Bayesian inference with Gaussian-mixture approximations. Electronic Journal of Statistics, 8:355-389, 2014.

A. Proofs

Proof of Proposition 3.1. Because both estimates are unbiased, it suffices to show that

$$\mathbb{E}\left[f^2(X)\right] \ge \mathbb{E}\left[\left(\frac{1}{N}\sum_{n=0}^{N-1} f(T^n X_0)\right)^{2}\right],$$

which itself follows by Jensen's inequality:

$$\mathbb{E}\left[\left(\frac{1}{N}\sum_{n=0}^{N-1} f(T^n X_0)\right)^{2}\right] \le \mathbb{E}\left[\frac{1}{N}\sum_{n=0}^{N-1} f^2(T^n X_0)\right] = \mathbb{E}\left[f^2(X)\right].$$

Proof of Theorem 4.1. Since setwise convergence implies weak convergence, we focus on proving setwise convergence. $q_{\lambda,N}$ converges setwise to $\pi$ if and only if for all measurable bounded $f : X \to \mathbb{R}$, $\mathbb{E} f(X_N) \to \mathbb{E} f(X)$ with $X_N \sim q_{\lambda,N}$, $X \sim \pi$. The proof proceeds by directly analyzing $\mathbb{E} f(X_N)$:

$$\mathbb{E} f(X_N) = \int f(x)\, q_{\lambda,N}(dx) = \frac{1}{N}\sum_{n=0}^{N-1} \int f(x)\, (T_\lambda^n q_0)(dx) = \frac{1}{N}\sum_{n=0}^{N-1} \int f(T_\lambda^n x)\, q_0(dx) = \int \frac{1}{N}\sum_{n=0}^{N-1} f(T_\lambda^n x)\, q_0(dx).$$

Since $q_0 \ll \pi$, by the Radon-Nikodym theorem there exists a density $\frac{dq_0}{d\pi}$ of $q_0$ with respect to $\pi$, so

$$\mathbb{E} f(X_N) = \int \frac{1}{N}\sum_{n=0}^{N-1} f(T_\lambda^n x)\, \frac{dq_0}{d\pi}(x)\, \pi(dx).$$

By the pointwise ergodic theorem (Theorem 2.3), $\bar f_N(x) = \frac{1}{N}\sum_{n=0}^{N-1} f(T_\lambda^n x)$ converges pointwise $\pi$-a.e. to $\int f\, d\pi$; and because $f$ is bounded, $\bar f_N$ is uniformly bounded over $N \in \mathbb{N}$. Hence by the Lebesgue dominated convergence theorem,

$$\lim_{N\to\infty} \mathbb{E} f(X_N) = \int \left(\int f\, d\pi\right) \frac{dq_0}{d\pi}(x)\, \pi(dx) = \left(\int f\, d\pi\right)\cdot 1 = \mathbb{E} f(X).$$

Theorem A.1 (Mean ergodic theorem in Banach spaces (Yosida, 1938; Kakutani, 1938; Riesz, 1938; Eisner et al., 2015, Theorem 8.5)). Let $T$ be a bounded linear operator on a Banach space $E$, and define

$$\mathrm{fix}(T) = \{v \in E : Tv = v\} = \ker(I - T), \qquad A_N = \frac{1}{N}\sum_{n=0}^{N-1} T^n.$$

Suppose that $\sup_{N\in\mathbb{N}} \|A_N\| < \infty$ and that $\frac{1}{N}\|T^N v\| \to 0$ for all $v \in E$. Then the subspace

$$V = \left\{ v \in E : \lim_{N\to\infty} A_N v \text{ exists} \right\}$$

is closed, $T$-invariant, and decomposes into a direct sum of closed subspaces $V = \mathrm{fix}(T) \oplus \overline{\mathrm{range}(I - T)}$. The operator $T|_V$ on $V$ is mean ergodic. Furthermore, the operator

$$A : V \to \mathrm{fix}(T), \qquad Av = \lim_{N\to\infty} A_N v,$$

is a bounded projection with kernel $\ker(A) = \overline{\mathrm{range}(I - T)}$ and $AT = A = TA$.

Proof of Theorem 4.2. We suppress $\lambda$ subscripts for brevity. Consider the Banach space $M(\pi)$ of signed finite measures dominated by $\pi$, endowed with the total variation norm

$$\|m\| = \sup_{A\subseteq X} m(A) - \inf_{A\subseteq X} m(A).$$

Then the pushforward $Tm$ of $m \in M(\pi)$ under $T$ is dominated by $\pi$, since

$$\pi(A) = 0 \implies \pi(T^{-1}(A)) = 0 \implies m(T^{-1}(A)) = 0.$$

(Note the slight abuse of notation involving the same symbol for $T : X \to X$ and the associated pushforward operator.) Hence $T : M(\pi) \to M(\pi)$ is a linear operator on $M(\pi)$. Further,

$$\|Tm\| = \sup_{A\subseteq X} m(T^{-1}(A)) - \inf_{A\subseteq X} m(T^{-1}(A)) \le \sup_{A\subseteq X} m(A) - \inf_{A\subseteq X} m(A) = \|m\|,$$

and so $T$ is bounded with $\|T\| \le 1$. Hence $\frac{1}{N}\|T^N m\| \to 0$ for all $m \in M(\pi)$, and if we define the operators

$$A_N = \frac{1}{N}\sum_{n=0}^{N-1} T^n, \qquad N \in \mathbb{N},$$

we have that $\sup_{N\in\mathbb{N}} \|A_N\| \le 1$. Therefore by the mean ergodic theorem in Banach spaces (Yosida, 1938; Kakutani, 1938; Riesz, 1938; Eisner et al., 2015, Theorem 8.5), we have that

$$A : V \to \mathrm{fix}(T), \qquad Am = \lim_{N\to\infty} A_N m,$$

where $V = \{m \in M(\pi) : \lim_{N\to\infty} A_N m \text{ exists}\}$. Eisner et al. (2015, Theorem 8.20) guarantees that $V = M(\pi)$ as long as the weak limit of $A_N m$ exists for each $m \in M(\pi)$. Since $M(\pi)$ and $L^1(\pi)$ are isometric (via the map $m \mapsto \frac{dm}{d\pi}$) and $\pi$ is $\sigma$-finite, the dual of $M(\pi)$ is the set of linear functionals

$$m \mapsto \int \varphi(x)\, m(dx), \qquad \varphi \in L^\infty(\pi).$$
Therefore we have that $V = M(\pi)$ if for each $m \in M(\pi)$ there exists a $g \in M(\pi)$ such that

$$\forall \varphi \in L^\infty(\pi), \qquad \int \varphi(x)\, (A_N m)(dx) \to \int \varphi(x)\, g(dx).$$

The same technique using a transformation of variables and the pointwise ergodic theorem from the proof of Theorem 4.1 provides the desired weak convergence. Therefore $A_N m$ converges in total variation to an element of $\mathrm{fix}(T)$ for all $m \in M(\pi)$, and hence $A_N q_0$ converges in total variation to an element of $\mathrm{fix}(T)$. Furthermore, Theorem 4.1 guarantees weak convergence of $A_N q_0$ to $\pi$, and $\pi \in \mathrm{fix}(T)$, and thus $D_{\mathrm{TV}}(q_{\lambda,N}, \pi) \to 0$.

Proof of Theorem 4.3. We suppress $\lambda$ subscripts for brevity. By the triangle inequality,

$$D_{\mathrm{TV}}(\hat q_N, \pi) \le D_{\mathrm{TV}}(\hat q_N, q_N) + D_{\mathrm{TV}}(q_N, \pi).$$

Focusing on the error term, we have that

$$D_{\mathrm{TV}}(\hat q_N, q_N) = D_{\mathrm{TV}}\!\left(\frac{1}{N}q_0 + \frac{N-1}{N}\hat T\hat q_{N-1},\ \frac{1}{N}q_0 + \frac{N-1}{N}Tq_{N-1}\right) \le \frac{N-1}{N}\, D_{\mathrm{TV}}\!\left(\hat T\hat q_{N-1}, Tq_{N-1}\right) = \frac{N-1}{N}\, D_{\mathrm{TV}}\!\left(\hat q_{N-1}, \hat T^{-1}Tq_{N-1}\right),$$

where the last equality is due to the fact that $\hat T$ is a bijection. The triangle inequality yields

$$D_{\mathrm{TV}}(\hat q_N, q_N) \le \frac{N-1}{N}\left[D_{\mathrm{TV}}(\hat q_{N-1}, q_{N-1}) + D_{\mathrm{TV}}\!\left(q_{N-1}, \hat T^{-1}Tq_{N-1}\right)\right] = \frac{N-1}{N}\left[D_{\mathrm{TV}}(\hat q_{N-1}, q_{N-1}) + D_{\mathrm{TV}}\!\left(\hat Tq_{N-1}, Tq_{N-1}\right)\right].$$

Iterating this technique yields

$$D_{\mathrm{TV}}(\hat q_N, q_N) \le \sum_{n<N}\frac{n}{N}\, D_{\mathrm{TV}}\!\left(\hat Tq_n, Tq_n\right), \quad (6)$$

which completes the proof.

Proof of Corollary 4.4. Examining the total variation distance in its $L^1$ formulation yields that for $n = 1, \dots, N-1$,

$$D_{\mathrm{TV}}\!\left(\hat Tq_n, Tq_n\right) = \int \left|\frac{\hat Tq_n(x)}{Tq_n(x)} - 1\right| Tq_n(dx) = \int \left|\exp\left\{\log q_n(\hat T^{-1}x) - \log q_n(T^{-1}x) + \log J(T^{-1}x) - \log\hat J(\hat T^{-1}x)\right\} - 1\right| Tq_n(dx).$$

By assumption,

$$\left|\log q_n(\hat T^{-1}x) - \log q_n(T^{-1}x)\right| \le \ell\,\|\hat T^{-1}x - T^{-1}x\| \le \ell\epsilon,$$

and

$$\left|\log J(T^{-1}x) - \log\hat J(\hat T^{-1}x)\right| \le \left|\log J(T^{-1}x) - \log J(\hat T^{-1}x)\right| + \left|\log J(\hat T^{-1}x) - \log\hat J(\hat T^{-1}x)\right| \le \ell\,\|\hat T^{-1}x - T^{-1}x\| + \epsilon \le (\ell+1)\epsilon.$$

Combining these two bounds yields

$$D_{\mathrm{TV}}\!\left(\hat Tq_n, Tq_n\right) \le \exp((2\ell+1)\epsilon) - 1 \le (2\ell+1)\epsilon\exp((2\ell+1)\epsilon).$$

Therefore, by Equation (6), we obtain

$$D_{\mathrm{TV}}(\hat q_N, q_N) \le \sum_{n<N}\frac{n}{N}\, D_{\mathrm{TV}}\!\left(\hat Tq_n, Tq_n\right) \le \frac{N-1}{2}(2\ell+1)\epsilon\, e^{(2\ell+1)\epsilon} \le N\epsilon(\ell+1)e^{(2\ell+1)\epsilon}.$$

Finally, combining the above with Theorem 4.3 yields the desired result.

B. Memory-efficient ELBO estimation

Algorithms 4 and 5 give an $O(N)$-time, $O(1)$-memory implementation of the ELBO estimate in Equation (4).

Algorithm 4 DensityTriple(x): evaluate $T_\lambda^{-N+1}(x)$, $q_{\lambda,N}(x)$, and $\prod_{j=1}^{N-1} J_\lambda(T_\lambda^{-j}x)$ in O(N) time, O(1) memory
Input: location x, reference distribution q0, flow map Tλ, Jacobian Jλ, number of steps N
  L ← 0
  w ← log q0(x)
  for n = 1, …, N−1 do
    x ← $T_\lambda^{-1}(x)$
    L ← L + log Jλ(x)
    w ← LogSumExp(w, log q0(x) − L)
  end for
  w ← w − log N
Return: x, exp(w), exp(L)

Algorithm 5 EstELBO(λ, N): estimate the ELBO for qλ,N in O(N) time, O(1) memory
Input: reference q0, unnormalized target p, flow map Tλ, Jacobian Jλ, number of flow steps N
  x ← Sample(q0)
  x′, z, J ← DensityTriple(x)
  f ← log p(x)
  g ← log z
  for n = 1, …, N−1 do
    q ← (1/N) q0(x′) / J
    J_{n−1} ← Jλ(x)
    z ← (z − q) / J_{n−1}
    x ← Tλ(x)
    z ← z + (1/N) q0(x)
    f ← f + log p(x)
    g ← g + log z
    if n < N−1 then
      J ← J · J_{n−1} / Jλ(x′)
    end if
    x′ ← Tλ(x′)
  end for
Return: $\widehat{\mathrm{ELBO}}(\lambda, N)$ ← (1/N)(f − g)

C. Hamiltonian flow pseudocode

Algorithm 6 Compute Tλ,ϵ and its Jacobian Jλ,ϵ for the Hamiltonian flow with leapfrog integrator
Input: initial state x ∈ X, ρ ∈ $\mathbb{R}^d$, u ∈ [0, 1]; step size ϵ; shift ξ ∈ $\mathbb{R}$; pseudorandom shift z(·, ·)
  x0, ρ0 ← x, ρ
  for ℓ = 1, …, L do
    xℓ, ρℓ ← $\hat H_\epsilon$(x_{ℓ−1}, ρ_{ℓ−1})
  end for
  x′, ρ′ ← x_L, ρ_L
  u′ ← u + ξ mod 1
  for i = 1, …, d do
    ρ′′_i ← $R^{-1}$(R(ρ′_i) + z(x′_i, u′) mod 1)
  end for
  J ← m(ρ′)/m(ρ′′)
Return: (x′, ρ′′, u′), J

D. Extensions

Tunable reference. So far we have assumed that the reference distribution $q_0$ for the flow is fixed. Given that $q_0$ is often quite far from the target $\pi$, this forces the variational flow to spend some of its steps just moving the bulk of the mass toward $\pi$. But this can be accomplished much more easily with, say, a simple linear transformation that efficiently allows large global moves of mass. For example, if $q_0 = N(0, I)$, we can include a map $M_\theta$ and use the family

$$q_{\lambda,N} = \frac{1}{N}\sum_{n=0}^{N-1} T_\lambda^n M_\theta q_0, \qquad M_\theta(x) = \theta_1 x + \theta_2,$$

where $\theta_1 \in \mathbb{R}^{d\times d}$ and $\theta_2 \in \mathbb{R}^d$. Note that it is possible to optimize the reference and flow jointly, or to optimize the reference distribution by itself first and then use that fixed reference in the flow optimization.
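As a concrete illustration of the tunable reference (a sketch, not the paper's implementation), $M_\theta$ and its log-Jacobian can be implemented as follows; composing $M_\theta$ before the flow adds only one affine term to the density computation in Algorithm 2:

```python
import numpy as np

class AffineMap:
    """Affine reference map M_theta(x) = theta1 @ x + theta2 with inverse
    and log-Jacobian, for pushing a N(0, I) reference to a better
    location/scale before the flow runs."""
    def __init__(self, theta1, theta2):
        self.theta1, self.theta2 = np.asarray(theta1), np.asarray(theta2)

    def forward(self, x):
        return self.theta1 @ x + self.theta2

    def inverse(self, x):
        return np.linalg.solve(self.theta1, x - self.theta2)

    def log_jac(self):
        # log |det theta1|, constant in x for an affine map
        return np.linalg.slogdet(self.theta1)[1]

M = AffineMap(np.diag([2.0, 0.5]), np.array([1.0, -1.0]))
x = M.forward(np.zeros(2))
print(x, M.inverse(x), M.log_jac())
```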
C. Hamiltonian flow pseudocode

Algorithm 6 Compute $T_{\lambda,\epsilon}$ and its Jacobian $J_{\lambda,\epsilon}$ for the Hamiltonian flow with leapfrog integrator
Input: initial state $x\in\mathcal{X}$, momentum $\rho\in\mathbb{R}^d$, pseudotime $u\in[0,1]$, step size $\epsilon$, shift $\xi\in\mathbb{R}$, pseudorandom shift $z(\cdot,\cdot)$
  $x_0, \rho_0 \leftarrow x, \rho$
  for $\ell = 1,\dots,L$ do
    $x_\ell, \rho_\ell \leftarrow \hat H_\epsilon(x_{\ell-1}, \rho_{\ell-1})$
  end for
  $x', \rho' \leftarrow x_L, \rho_L$
  $u' \leftarrow u + \xi \bmod 1$
  for $i = 1,\dots,d$ do
    $\rho_i' \leftarrow R^{-1}\left(R(\rho_i') + z(x_i', u') \bmod 1\right)$
  end for
  $J \leftarrow m(\rho_L)/m(\rho')$
Return: $(x', \rho', u')$, $J$

D. Extensions

Tunable reference. So far we have assumed that the reference distribution $q_0$ for the flow is fixed. Given that $q_0$ is often quite far from the target $\pi$, this forces the variational flow to spend some of its steps just moving the bulk of the mass toward $\pi$. But this can be accomplished much more easily with, say, a simple linear transformation that efficiently allows large global moves of mass. For example, if $q_0 = \mathcal{N}(0, I)$, we can include a map $M_\theta$ in the flow,
$$q_{\lambda,\theta,N} = \frac{1}{N}\sum_{n=0}^{N-1} T_\lambda^n M_\theta q_0,\qquad M_\theta(x) = \theta_1 x + \theta_2,$$
where $\theta_1\in\mathbb{R}^{d\times d}$ and $\theta_2\in\mathbb{R}^d$. Note that it is possible to optimize the reference and flow jointly, or to optimize the reference distribution by itself first and then use that fixed reference in the flow optimization.

Automated burn-in. A common practice in MCMC is to throw away a first fraction of the states in the sequence to ensure that the starting sample is in a high-probability region of the target distribution, thus reducing the bias from initialization ("burn-in"). Usually one needs a diagnostic to check when burn-in is complete. In the case of MixFlow, we can monitor the burn-in phase in a principled way by evaluating the ELBO. Once the flow is trained, the variational distribution with $M$ burn-in samples is simply
$$q_{\lambda,M,N} = \frac{1}{N-M}\sum_{n=M}^{N-1} T_\lambda^n q_0,$$
and we can easily optimize over $M$ by estimating the ELBO for $M = 1,\dots,N$.

Mixtures of MixFlows. One can build multiple MixFlows starting from multiple different initial reference distributions; when the posterior is multimodal, it may be the case that some of these MixFlows converge to different modes but do not mix across modes. In this scenario, it can be helpful to average several MixFlows, i.e., build an approximation of the form
$$\sum_{k=1}^{K} w_k\,(q_{\lambda,N})_k,\qquad \sum_{k=1}^{K} w_k = 1.$$
Because each component flow provides access to i.i.d. samples and density evaluation, MixFlow provides the ability to optimize the weights by maximizing the ELBO (i.e., minimizing the KL divergence).

E. Additional experimental details

In this section, we provide details for each experiment presented in the main text, as well as additional results regarding numerical stability and a high-dimensional synthetic experiment. Aside from the univariate synthetic example, all examples include a pseudotime shift step with $\xi = \pi/16$ and a momentum refreshment with $z(x, u) = 0.5\sin(2x + u) + 0.5$. For the kernel Stein discrepancy (KSD), we use an IMQ kernel $k(x,y) = (c^2 + \|x - y\|_2^2)^\beta$ with $\beta = -0.5$ and $c = 1$, the same setting as in Gorham & Mackey (2017). For all experiments, unless otherwise stated, NUTS uses 20,000 steps for adaptation, targeting an average acceptance ratio of 0.7, and generates 5,000 samples for KSD estimation. The KSD for MixFlow is estimated using 2,000 samples. The KSD estimation for NEO is based on 5,000 samples generated by a tuned NEO after 20,000 burn-in steps. We adopted two tuning strategies for NEO: (1) choosing among combinations of several fixed settings of the discretization step ($\epsilon = 0.2, 0.5, 1.0$), the friction parameter $\gamma$ of the nonequilibrium Hamiltonian dynamics ($\gamma = 0.2, 0.5, 1.0$), the number of integration steps ($K = 10, 20$), and the mass matrix of the momentum distribution ($M = I$); and (2) fixing the number of integration steps ($K = 10, 20$) and the friction parameter ($\gamma = 0.2, 0.5, 1.0$), and using windowed adaptation (Carpenter et al., 2017) to adapt the mass matrix and integration step size, targeting an average acceptance ratio of 0.7. The optimal setting of NEO is considered to be the one that produces the lowest average marginal KSD over 3 runs with no NaN values encountered. The performance of NEO across the various settings for each example is summarized in Table 4. In each NEO-MCMC transition, we run 10 deterministic orbits and compute the corresponding normalizing constant estimates in parallel.
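For concreteness, the following is a minimal Julia sketch of the KSD estimate with the IMQ kernel settings above. The V-statistic form and all function names here are our own assumptions for illustration, not the paper's implementation.

```julia
using LinearAlgebra

const BETA = -0.5   # IMQ exponent β
const C2 = 1.0      # c^2

# Langevin Stein kernel k0(x, y) built from the IMQ base kernel and the
# scores sx = ∇log p(x), sy = ∇log p(y); note ∇_y k = -∇_x k for this kernel.
function stein_kernel(x, y, sx, sy)
    u = x .- y
    r2 = dot(u, u)
    base = C2 + r2
    k = base^BETA
    gk = (2BETA * base^(BETA - 1)) .* u   # ∇_x k(x, y)
    # trace of the mixed second derivative ∇_x · ∇_y k(x, y)
    tr = -2BETA * (length(x) * base^(BETA - 1) + 2(BETA - 1) * r2 * base^(BETA - 2))
    return tr + dot(gk, sy) - dot(gk, sx) + k * dot(sx, sy)
end

# V-statistic KSD estimate from a d×n sample matrix and a score function
function ksd(xs::AbstractMatrix, score)
    n = size(xs, 2)
    s = [score(xs[:, i]) for i in 1:n]
    total = sum(stein_kernel(xs[:, i], xs[:, j], s[i], s[j]) for i in 1:n, j in 1:n)
    return sqrt(total) / n
end

# sanity check: i.i.d. N(0, I) samples with score s(x) = -x give a small KSD
ksd(randn(2, 500), x -> -x)
```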
Aside from NEO, all other methods are deployed on a single processor. For NUTS, we use the Julia package AdvancedHMC.jl (Xu et al., 2020) with all default settings. NEO adaptation is also conducted using AdvancedHMC.jl, with the number of simulation steps set to the number of integration steps of NEO. The ESS is computed using the R package mcmcse, which is based on batch mean estimation. We implement all NFs using the Julia package Bijectors.jl (Fjelde et al., 2020). The flow layers of Planar Flow and Radial Flow, as well as the coupling layers of RealNVP, are implemented in Bijectors.jl. The affine coupling functions of RealNVP (a scaling function and a shifting function) are both parameterized using fully connected neural networks with three layers, whose activation function is LeakyReLU and whose number of hidden units defaults to the dimension of the target distribution, unless otherwise stated.

E.1. Univariate synthetic examples

The three target distributions tested in this experiment were a normal, $\mathcal{N}(2, 2^2)$; a synthetic Gaussian mixture, $0.5\,\mathcal{N}(-3, 1.5^2) + 0.3\,\mathcal{N}(0, 0.8^2) + 0.2\,\mathcal{N}(3, 0.8^2)$; and a Cauchy, $\mathrm{Cauchy}(0, 1)$. For all three examples, we use a momentum refreshment without introducing the pseudotime variable; this enables us to plot the full joint density of $(x,\rho)\in\mathbb{R}^2$. The refreshment takes the form
$$\rho' \leftarrow R_{\mathrm{Lap}}^{-1}\left(R_{\mathrm{Lap}}(\rho') + (\sin(2x') + 1)/2 \bmod 1\right).$$
For the three examples, we used the leapfrog step size $\epsilon = 0.05$ and ran $L = 50$ leapfrog steps between refreshments. For both the Gaussian and Gaussian mixture targets, we used 100 refreshments; for the Cauchy, we used 1,000 refreshments.

E.2. Multivariate synthetic examples

The four target distributions tested in this experiment were the banana distribution (Haario et al., 2001),
$$y\sim\mathcal{N}\left(0, \begin{pmatrix}100 & 0\\ 0 & 1\end{pmatrix}\right),\qquad x = \begin{pmatrix}y_1\\ y_2 + by_1^2 - 100b\end{pmatrix};$$
Neal's funnel (Neal, 2003),
$$x_1\sim\mathcal{N}(0, \sigma^2),\qquad x_2\mid x_1\sim\mathcal{N}(0, \exp(x_1));$$
a cross-shaped distribution, namely a Gaussian mixture whose four components have covariance matrices
$$\begin{pmatrix}0.15^2 & 0\\ 0 & 1\end{pmatrix},\quad \begin{pmatrix}1 & 0\\ 0 & 0.15^2\end{pmatrix},\quad \begin{pmatrix}1 & 0\\ 0 & 0.15^2\end{pmatrix},\quad \begin{pmatrix}0.15^2 & 0\\ 0 & 1\end{pmatrix};$$
and a warped Gaussian distribution,
$$y\sim\mathcal{N}\left(0, \begin{pmatrix}1 & 0\\ 0 & 0.12^2\end{pmatrix}\right),\qquad x = \begin{pmatrix}\|y\|_2\cos\left(\operatorname{atan2}(y_2, y_1) - \|y\|_2\right)\\ \|y\|_2\sin\left(\operatorname{atan2}(y_2, y_1) - \|y\|_2\right)\end{pmatrix},$$
where $\operatorname{atan2}(y, x)$ is the angle, in radians, between the positive $x$-axis and the ray to the point $(x, y)$.

We used flows with 500 and 2,000 refreshments for the banana distribution and Neal's funnel, respectively, and a flow with 1,000 refreshments for both the cross distribution and the warped Gaussian. Between refreshments, we used 200 leapfrog steps for the banana distribution, 60 for the cross distribution, and 80 for the funnel and warped Gaussian. Note that in all four examples, we individually tuned the step size $\epsilon$ by maximizing the estimated ELBO, as shown in Figure 5a. Figure 5b also demonstrates how the ELBO varies with the number of refreshments $N$. For small step sizes $\epsilon$, the discretized Hamiltonian dynamics approximate the continuous dynamics, and the ELBO generally increases with $N$, indicating convergence to the target. For larger step sizes $\epsilon$, the ELBO increases to a peak and then decreases, indicating that the discretized dynamics do not exactly target the desired distribution.

Figure 6 presents a comparison of the uncertainty involved in estimating the expectation of a test function $f$ for both MixFlow iid and MixFlow single. Specifically, it examines the streaming estimation of $\mathbb{E}[f(X)]$, $X\sim q_N$, where $f(x) = \|x\|_1$, based on samples generated from MixFlow iid and MixFlow single under 50,000 flow map evaluations over 10 independent trials.
Note that we assess computational cost via the number of flow map evaluations, not the number of draws, because the cost per draw is random in MixFlow iid (due to $K\sim\mathrm{Unif}\{0,\dots,N-1\}$), while the cost per draw is fixed at $N$ evaluations for the trajectory estimate in MixFlow single. The results indicate that, given equivalent computational resources, the trajectory-averaged estimates generally exhibit lower variance across trials than the naive i.i.d. Monte Carlo estimate. This observation validates Proposition 3.1.

Figure 7 shows why we opted to use a Laplace momentum distribution as opposed to a Gaussian momentum. In particular, the numerical errors of the round-trip compositions of $T_\lambda^{K}$ and $T_\lambda^{-K}$ in both orders (Bwd and Fwd in the legend) indicate that the flow using the Laplace momentum is more reliably invertible for larger numbers of refreshments than the flow with a Gaussian momentum.

Figure 8 uses a high-precision (256-bit) floating point representation to further illustrate the rapid escalation of numerical error when evaluating forward and backward trajectories of Gaussian Hamiltonian MixFlows. For all four synthetic examples, after approximately 100 flow transformations, MixFlows with Gaussian momentum exhibit an error on the scale of the target distribution itself (per the contour plots (d)–(g) in Figure 1). This may be due to the fact that the normal distribution has very light tails; for large momentum values in the right tail, the CDF is nearly 1, and it gets rounded to exactly 1 in floating point representation. Our implementation of the CDF and inverse CDF was fairly naive, so it is possible that stability could be improved with more care taken to prevent rounding error. We leave a more careful examination of this error to future work; for this work, using a Laplace momentum distribution sufficed to maintain low numerical error.

Finally, Figure 9 provides a more comprehensive set of sample histograms for these experiments, showing the $x$-, $\rho$-, and $u$-marginals. It is clear that MixFlow generates samples for each variable reliably from the target marginal.

E.3. Higher-dimensional synthetic experiment

We also tested two higher-dimensional Neal's funnel target distributions of $x\in\mathbb{R}^d$, where $d = 5, 20$. In particular, we used the target distribution
$$x_1\sim\mathcal{N}(0, \sigma^2),\qquad x_i\mid x_1\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0, \exp(x_1)),\quad i\in\{2,\dots,d\},$$
with $\sigma^2 = 36$. We used 80 leapfrog steps between refreshments and $\epsilon = 0.0009$ when $d = 5$, and 100 leapfrog steps between refreshments and $\epsilon = 0.001$ when $d = 20$ (Figures 10a and 10c show the ELBO comparison used to tune the step size $\epsilon$). Figures 10b and 10d confirm via the KSD that the method performs as well as NUTS in higher-dimensional cases.

E.4. Additional experiments for synthetic examples

In this section, we provide additional comparisons of MixFlow against NUTS, HMC, NEO, and a generic normalizing flow method, the planar flow (NF) (Rezende & Mohamed, 2015b), on all four synthetic examples in Appendix E.2. For this set of experiments, all of the settings for MixFlow are the same as outlined in Appendix E.2. HMC uses the same leapfrog step size and number of leapfrog steps (between refreshments) as MixFlow. For NF, 5 planar layers (Rezende & Mohamed, 2015b) containing 45 parameters to be optimized (9 parameters for each planar layer of a 4-dimensional planar flow) are used unless otherwise stated. We train NF using ADAM until convergence (100,000 iterations except where otherwise noted) with the initial step size set to 0.001, and at each step, 10 samples are used for gradient estimation.
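For reference, a single planar layer and its log-Jacobian (Rezende & Mohamed, 2015b) can be sketched in a few lines of Julia. The experiments above use the implementation shipped with Bijectors.jl; this standalone version, with names of our choosing, is only illustrative.

```julia
using LinearAlgebra

# One planar layer y = x + u * tanh(w'x + b).
struct Planar{T<:Real}
    w::Vector{T}
    u::Vector{T}
    b::T
end

# Forward pass returning the transformed point and log|det Jacobian|; by the
# matrix determinant lemma, det(I + h'(w'x + b) u w') = 1 + h'(w'x + b) u'w.
function forward(p::Planar, x::AbstractVector)
    a = tanh(dot(p.w, x) + p.b)
    y = x .+ p.u .* a
    logdetjac = log(abs(1 + (1 - a^2) * dot(p.u, p.w)))  # tanh'(z) = 1 - tanh(z)^2
    return y, logdetjac
end

# A d-dimensional layer carries 2d + 1 parameters (w, u, b); with d = 4 this is
# the 9 parameters per layer, i.e., 45 parameters for the 5-layer flow above.
layer = Planar(randn(4), randn(4), randn())
y, ld = forward(layer, randn(4))
```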
The initial distribution for MixFlow, NF, and NUTS is set to be the mean-field Gaussian approximation, and HMC and NUTS are initialized at the learned mean of the mean-field Gaussian approximation. All parameters of NF are initialized using random samples from the standard Gaussian distribution.

Figure 5. MixFlow tuning for the four multivariate synthetic examples. (a): ELBO versus step size $\epsilon$. (b): ELBO of MixFlow versus number of refreshments $N$ over different step sizes, and comparison to a tuned planar flow.

Figure 6. Comparison of Monte Carlo estimates of $\mathbb{E}[f(X)]$, $X\sim q_N$, $f(x) = \|x\|_1$, based on individual i.i.d. draws (blue) and the trajectory-averaged estimates in Equation (3) (orange) on the four synthetic examples. The vertical axis indicates the estimate of $\mathbb{E}[f]$, and the horizontal axis indicates the total number of flow transformations evaluated, i.e., the total computational cost. Note that in each example, $N$ is fixed; the number of refreshments on the horizontal axis increases because we average over increasingly many draws $X\overset{\text{i.i.d.}}{\sim} q_N$ (blue) or increasingly many trajectory averages (orange). We run 10 trials to assess the quality of each estimate: lines indicate the mean, and error regions indicate the standard deviation.

Figure 7. Stability of composing $T_\lambda^{-K}\circ T_\lambda^{K}$ (Fwd) and $T_\lambda^{K}\circ T_\lambda^{-K}$ (Bwd) for the four multivariate experiments, with flows constructed using Gaussian (Gauss) and Laplace (Lap) momentum distributions. The vertical axis shows the 2-norm error of reconstructing $(x,\rho,u)$ sampled from $q_0$; the horizontal axis shows increasing numbers of refreshments $K$. The lines indicate the median, and error regions indicate the 25th to 75th percentile over 100 independent samples.

Figure 11 compares the sliced log density of tuned MixFlow and the importance sampling proposal of tuned NEO, visualized in a similar fashion to Figure 1 in Section 6.1. It is clear that NEO does not provide a high-quality density approximation, while the log density of MixFlow visually matches the target. Indeed, unlike our method, due to its use of nonequilibrium dynamics, NEO does not provide a comparable convergence guarantee for its proposal distribution as the simulation length increases.

Figure 12 presents comparisons of the marginal sample quality of MixFlow to NUTS and NEO, both of which are exact MCMC methods. Results show that MixFlow produces comparable, if not slightly better, marginal sample quality than both MCMC methods. Note that achieving the displayed performance of NEO involves a nontrivial tuning process; concrete tuning strategies are explained at the beginning of Appendix E.
In fact, finding a good combination of discretization step, friction parameter, simulation length, and momentum mass matrix is necessary for NEO to behave reasonably; Table 4 summarizes the performance of NEO under various settings for both the synthetic and real data experiments. The standard deviation of the marginal sample KSD can change drastically across settings, meaning that the marginal sample quality can be sensitive to the choice of hyperparameters. Moreover, due to the use of unstable Hamiltonian dynamics (with friction), one must choose hyperparameters carefully to avoid NaN values. The last column of Table 4 shows the number of hyperparameter combinations that led to NaN values during sampling.

Figure 5b compares the joint ELBO values across various leapfrog step sizes for MixFlow against that of NF. We see that in all four examples, NF produces a smaller ELBO than MixFlow with a reasonable step size, which implies a lower-quality target approximation. Indeed, Figure 13 shows that the trained NFs fail to capture the shape of the target distributions. Although one may expect the performance of NF to improve if it were given more layers, we show below that this is not the case.

Figure 14a compares the sampling efficiency of MixFlow against NUTS, HMC, and NF. We see that MixFlow iid is the slowest, because each sample is generated by passing through the entire flow. However, by taking all intermediate samples as in MixFlow single, we can generate samples just as fast as NUTS and HMC. On the other hand, while NF is the fastest for sampling, it requires roughly 2 minutes for training, which alone allows MixFlow single, NUTS, and HMC to generate over 1 million samples in these examples. A more detailed discussion of this trade-off for NF is presented later.

Figure 14b further shows the computational efficiency in terms of effective sample size (ESS) per second. The smaller per-second ESS of MixFlow iid is due to its slower sampling time; however, we emphasize that these samples are i.i.d. NUTS overall achieves a higher per-second ESS. NUTS is performant because of the much longer trajectories it produces (it only terminates once it detects a "U-turn"). This is actually an illustration of a limitation of ESS per unit time as a measure of performance: because NUTS generates longer trajectories, it has lower sample autocorrelation and higher ESS, but Figure 15 shows that the actual estimation performance of NUTS is comparable to the other methods. Note that it is also possible to incorporate the techniques used in NUTS into our method, which we leave for future work.

As mentioned above, ESS mainly serves as a practical index for the relative efficiency of using a sequence of dependent samples, as opposed to independent samples, to estimate certain summary statistics of the target distribution. In this setting, MixFlow single can be very useful. Figure 15 demonstrates the performance of MixFlow single, NUTS, HMC, and NF when estimating the coordinate-wise means and standard deviations (SD) of the target distributions. We see that MixFlow single, NUTS, and HMC generally show similar performance in terms of convergence speed and estimation precision. While NF converges very quickly due to i.i.d. sampling, it struggles more at identifying the correct statistics, particularly the standard deviation, given its limited approximation quality.
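The two MixFlow sampling schemes compared throughout this section can be sketched as follows; here `T` (one full flow map application) and `sample_q0` are hypothetical stand-ins for the Hamiltonian flow and reference sampler used in the experiments.

```julia
# MixFlow iid: a single draw from q_{λ,N} applies the map K ~ Unif{0,…,N−1} times,
# so the cost per draw is random (at most N − 1 flow evaluations).
function draw_iid(T, sample_q0, N)
    x = sample_q0()
    for _ in 1:rand(0:N-1)
        x = T(x)
    end
    return x
end

# MixFlow single: one length-N trajectory yields a trajectory-averaged estimate
# of E[f(X)], X ~ q_{λ,N} (Equation (3)), at a fixed cost of N − 1 map evaluations.
function trajectory_estimate(f, T, sample_q0, N)
    x = sample_q0()
    total = f(x)
    for _ in 1:N-1
        x = T(x)
        total += f(x)
    end
    return total / N
end
```

By Proposition 3.1, the trajectory-averaged estimate has variance no larger than the naive estimate based on a single i.i.d. draw, which matches the empirical behavior in Figure 6.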
It is worth noting that, unlike MixFlow and general MCMC methods, sample estimates of target summaries obtained from NF are typically not asymptotically exact, as the sample quality is fundamentally limited by the choice of variational family and how well the flow is trained.

Finally, we provide an additional set of results for NF (Figure 16), examining its performance as we increase the number of planar layers (5, 10, 20, 50). All settings for NF are identical to the above, except that we increase the number of optimization iterations to 500,000 to ensure the convergence of flows with more layers. As demonstrated in the second and third columns of Figure 16, both training time and sampling time scale roughly linearly with the number of layers. Although a trained NF is still generally faster at sample generation (see also Figure 14a), for these synthetic examples with 4-dimensional joint target distributions, training time can take up to 30 minutes. More importantly, the corresponding target approximations of NF are still not as good as those of MixFlow in all four examples, even when we increase the number of layers to 50, which corresponds to optimizing 450 flow parameters. One may also notice from Figure 16 that a more complex normalizing flow does not necessarily lead to better performance. This is essentially because the (usually non-convex) KL optimization problem of a standard NF becomes harder to solve as the flow becomes more flexible. As a result, even though NF theoretically becomes more expressive with more layers, there is no guarantee on how well it approximates the target distribution. In contrast, MixFlow is optimization-free and asymptotically exact: with a proper choice of hyperparameters, more computation typically leads to better performance (Figure 5b). Within the 30-minute training time of NF, MixFlow iid can generate 18,000 i.i.d. samples and MixFlow single can generate over 10 million samples, both of which are more than sufficient for most estimation tasks under these target distributions.

Figure 9. Scatter plots and histograms for the $x$-, $\rho$-, and $u$-marginals in the four synthetic experiments: (a) banana distribution, (b) Neal's funnel, (c) cross distribution, (d) warped Gaussian.

Figure 10. ELBO versus step size (10a, 10c) and KSD comparison with NUTS (10b, 10d) for the 5- and 20-dimensional Neal's funnel examples.

Figure 11. Visualization of the sliced exact (left) and MixFlow-approximated (middle) joint log densities, and the sliced joint log density of the tuned NEO importance sampling proposal (right), for the banana (11a), funnel (11b), cross (11c), and warped Gaussian (11d) targets; in each panel, from left to right: True, MixFlow, and NEO.

Figure 12. Marginal KSD of tuned MixFlow versus number of refreshments $N$, and comparison to NUTS and NEO.

Figure 13. Scatter plots of 5,000 i.i.d. samples generated from the trained NFs.
Figure 14. Time results for the four multivariate synthetic examples; from left to right, the columns correspond to the banana, Neal's funnel, cross, and warped Gaussian targets, respectively. (a): Sampling time comparison to NF, NUTS, and HMC (100 trials). (b): Per-second ESS comparison to NUTS and HMC (100 trials).

Figure 15. Streaming mean and standard deviation (SD) estimation (50,000 samples) for the four 2-dimensional synthetic distributions. For each distribution, the two plots in the top row correspond to the two marginal means, and the two plots in the bottom row correspond to the two marginal SDs. Each plot shows the evolution of coordinate-wise mean/SD estimates using samples from MixFlow single (green), NUTS (blue), HMC (red), and NF (purple); the black dashed line indicates the true target mean/SD, estimated using 50,000 samples from the actual synthetic target distribution. The lines indicate the median, and error regions indicate the 25th to 75th percentile over 10 runs.

E.5. Real data examples

All settings for MixFlow, NUTS, NEO, and HMC are identical to those in the synthetic examples. All NFs are trained using ADAM for 200,000 iterations with the initial step size set to 0.001. We remark that NEO is not included in the comparison for the high-dimensional sparse regression example (the most challenging example) due to its high RAM consumption, which exceeded our computational resources. Overall, we observe similar phenomena as for the synthetic examples in Appendix E.2.

Additionally, we include a comparison to UHA on the real data examples. The tuning procedure of UHA involves an architectural search (i.e., over the number of leapfrog steps between refreshments and the number of refreshments) and hyperparameter optimization (i.e., over the leapfrog step size and the tempering path). We choose among combinations of several fixed architectural settings of leapfrog steps between refreshments ($L = 10, 20, 50$) and number of refreshments ($N = 5, 10$), and optimize the hyperparameters using ADAM for 20,000 iterations (each gradient estimate is based on 5 independent samples); each setting is repeated 5 times.

Figure 16. Additional NF results with increasing numbers of planar layers (5, 10, 20, 50): (a) banana distribution, (b) Neal's funnel, (c) cross distribution, (d) warped Gaussian. The first column shows the ELBO value; each ELBO is estimated using 2,000 i.i.d. samples from the trained flow. The second and third columns correspond to the training time and per-sample time (100 trials) of NF with increasing numbers of planar layers.

Table 5 presents the median and IQR of the resulting ELBOs of UHA under the different architectural settings. We pick the best settings for UHA based on ELBO performance and compare UHA under the selected settings with our method in terms of target approximation quality and marginal sample quality. Both the ELBO and KSD of UHA are estimated using 5,000 samples.

Figure 17 offers a comparison of the ELBO performance of UHA and MixFlow. While MixFlow demonstrates similar or superior ELBO performance on the linear regression, logistic regression, Student-t regression, and sparse regression problems, it is outperformed by UHA on the remaining three real data examples. The higher ELBO of UHA on these examples might be due to its incorporation of annealing, which facilitates exploration of complex target distributions; we leave the study of annealing in MixFlows to future work. However, it is essential to note that for variational methods that augment the original target space, a higher ELBO does not necessarily equate to improved marginal sample quality.
As demonstrated in Figure 18, the marginal sample quality of UHA, as evaluated by the marginal KSD, is worse than that of MixFlow and the two Monte Carlo methods (NUTS and NEO), although it surpasses that of parametric flows (e.g., RealNVP). It is also worth noting that tuning augmentation methods like UHA can be very expensive. For instance, on the Student-t regression problem, UHA required over 20 minutes (approximately 1223.8 seconds) to run 20,000 optimization steps, with each gradient estimated using 5 Monte Carlo samples. This is approximately 40 times slower than RealNVP, which required around 10 minutes for 200,000 optimization steps with stochastic gradients based on 10 Monte Carlo samples. Conversely, evaluating one ELBO curve (averaged over 1,000 independent trajectories) for MixFlow across six different flow lengths (100, 200, 500, 1000, 1500, 2000) took only 8 minutes, while adapting NUTS with 20,000 iterations completed within a few seconds.

E.5.1. BAYESIAN LINEAR REGRESSION

We consider two Bayesian linear regression problems on two real datasets, one with a standard normal prior and one with a heavy-tailed prior. The two statistical models take the following form:
$$\log\sigma^2\sim\mathcal{N}(0, 1),\qquad y_j\mid\beta,\sigma^2\overset{\text{indep}}{\sim}\mathcal{N}\left(x_j^\top\beta,\,\sigma^2\right),$$
with coefficient prior
$$\text{Normal: }\ \beta_i\overset{\text{i.i.d.}}{\sim}\mathcal{N}(0, 1)\qquad\text{or}\qquad\text{Cauchy: }\ \beta_i\overset{\text{i.i.d.}}{\sim}\mathrm{Cauchy}(0, 1),$$
where $y_j$ is the response and $x_j\in\mathbb{R}^p$ is the feature vector for data point $j$.

For the linear regression problem with a normal prior, we use the Boston housing prices dataset (Harrison Jr. & Rubinfeld, 1978), available in the MLDatasets Julia package at https://github.com/JuliaML/MLDatasets.jl, containing $J = 506$ suburbs/towns in the Boston area; the goal is to use suburb information to predict the median house price. We standardize all features and responses. For linear regression with a heavy-tailed prior, we use the communities and crime dataset (U.S. Department of Justice Federal Bureau of Investigation, 1995), available at http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. The original dataset contains 122 attributes that potentially relate to crime; the goal is to predict per-capita violent crime using information about the community, such as the median family income and the per-capita number of police officers. During preprocessing, we drop observations with missing values and use principal component analysis for feature dimension reduction, selecting the 50 principal components with the leading eigenvalues. The posterior dimensions of the two linear regression inference problems are 15 and 52, respectively.

E.5.2. BAYESIAN GENERALIZED LINEAR REGRESSION

We then consider two Bayesian generalized linear regression problems, a hierarchical logistic regression and a Poisson regression:
$$\text{Logistic regression: }\ \alpha\sim\mathrm{Gam}(1, 0.01),\quad \beta\mid\alpha\sim\mathcal{N}(0, \alpha^{-1}I),\quad y_j\mid\beta\overset{\text{indep}}{\sim}\mathrm{Bern}\left(\frac{1}{1 + e^{-x_j^\top\beta}}\right),$$
$$\text{Poisson regression: }\ \beta\sim\mathcal{N}(0, I),\quad y_j\mid\beta\overset{\text{indep}}{\sim}\mathrm{Poiss}\left(\log\left(1 + e^{x_j^\top\beta}\right)\right).$$
For logistic regression, we use a bank marketing dataset (Moro et al., 2014) downsampled to $J = 400$ data points; the original dataset is available at https://archive.ics.uci.edu/ml/datasets/bank+marketing. The goal is to use client information to predict whether the client subscribes to a term deposit. We include 8 features from the bank marketing dataset: client age, marital status, balance, housing loan status, duration of last contact, number of contacts during the campaign, number of days since last contact, and number of contacts before the current campaign.
For each of the binary variables (marital status and housing loan status), all unknown entries are removed. All features of the dataset are also standardized. Hence the posterior dimension of the logistic regression problem is 9, and the overall $(x,\rho,u)$ state dimension of the logistic regression inference problem is 19.

For the Poisson regression problem, we use an airport delays dataset with 15 features and $J = 500$ data points (subsampled), resulting in a 16-dimensional posterior distribution. The airport delays dataset was constructed using flight delay data from http://stat-computing.org/dataexpo/2009/the-data.html and historical weather information from https://www.wunderground.com/history/; it relates daily weather information to the number of flights leaving an airport with a delay of more than 15 minutes. All features are standardized as well.

E.5.3. BAYESIAN STUDENT-T REGRESSION

We also consider a Bayesian Student-t regression problem, for which the posterior distribution is heavy-tailed. The Student-t regression model is as follows:
$$y_i\mid X_i,\beta\sim\mathcal{T}_5\left(X_i^\top\beta,\, 1\right),\qquad \beta_i\overset{\text{i.i.d.}}{\sim}\mathrm{Cauchy}(0, 1).$$
In this example, we use the creatinine dataset (Liu & Rubin, 1995), containing a clinical trial on 34 male patients with 3 covariates; the original dataset is available at https://github.com/faosorios/heavy/blob/master/data/creatinine.rda. The 3 covariates consist of body weight in kg (WT), serum creatinine concentration (SC), and age in years. The goal is to predict the endogenous creatinine clearance (CR) using these covariates. We apply the data transformation recommended by Liu & Rubin (1995), transforming the response into $\log(\mathrm{CR})$ and the covariates into $\log(\mathrm{WT})$, $\log(\mathrm{SC})$, and $\log(140 - \mathrm{age})$.

E.5.4. BAYESIAN SPARSE REGRESSION

Finally, we compare the methods on a Bayesian sparse regression problem applied to two datasets: a prostate cancer dataset containing 9 covariates and 97 observations, and a superconductivity dataset (Hamidieh, 2018) containing 83 features and 100 observations (subsampled). The prostate cancer dataset is available at https://hastie.su.domains/ElemStatLearn/datasets/prostate.data, and the superconductivity dataset is available at https://archive.ics.uci.edu/ml/datasets/superconductivty+data. The model is as follows:
$$\log\sigma^2\sim\mathcal{N}(0, 1),\qquad \beta_i\overset{\text{i.i.d.}}{\sim}\tfrac{1}{2}\mathcal{N}(0, \tau_1^2) + \tfrac{1}{2}\mathcal{N}(0, \tau_2^2),\qquad y_j\mid\beta,\sigma^2\overset{\text{indep}}{\sim}\mathcal{N}\left(x_j^\top\beta,\,\sigma^2\right).$$
For both datasets, we set $\tau_1 = 0.1$ and $\tau_2 = 10$; the resulting posterior dimensions for the two datasets are 10 and 84, respectively. When the information in the data is weak, the posterior distribution in this model typically contains multiple modes (Xu & Campbell, 2022). We standardize the covariates during preprocessing for both datasets.

E.5.5. ADDITIONAL EXPERIMENTS FOR REAL DATA EXAMPLES

Table 1. Comparison of ELBO between MixFlow and RealNVP with different numbers of layers on the real data examples. Each setting of RealNVP is run for 5 trials. ELBOs of MixFlow are estimated using 1,000 independent trajectories, and ELBOs of RealNVP are estimated using 2,000 independent samples.
MixFlow:
                      STEPSIZE    #LEAPFROG   #REFRESHMENT   ELBO       #SAMPLE
Lin Reg               0.0005      30          2000           -429.98    1000
Lin Reg Heavy         0.000025    50          2500           117.1      1000
Log Reg               0.002       50          1500           -250.5     1000
Poiss Reg             0.0001      50          2000           82576.9    1000
Student t Reg         0.0008      80          2000           -145.41    1000
Sparse Reg            0.001       30          600            -125.9     1000
Sparse Reg High Dim   0.0012      50          2000           -538.36    2000

RealNVP:
                      #LAYER   #HIDDEN   MEDIAN     IQR                    #NAN
Lin Reg               5        15        -429.43    (-429.56, -429.42)     0
Lin Reg               10       15        -429.41    (-429.43, -429.36)     1
Lin Reg Heavy         5        52        116.47     (116.34, 116.49)       0
Lin Reg Heavy         8        52        116.65     (116.56, 116.73)       3
Lin Reg Heavy         10       52        116.26     (116.11, 116.41)       3
Log Reg               5        9         -250.76    (-250.76, -250.75)     3
Log Reg               8        9         -250.65    (-250.75, -250.64)     2
Poiss Reg             3        16        82576.87   (82575.57, 82577.53)   1
Poiss Reg             5        16        82580.77   (82580.13, 82583.72)   2
Poiss Reg             8        16        N/A        N/A                    5
Student t Reg         5        4         -145.52    (-145.53, -145.51)     0
Student t Reg         10       4         -145.52    (-145.52, -145.49)     0
Sparse Reg            5        10        -126.33    (-126.36, -126.32)     0
Sparse Reg            8        10        -126.35    (-126.36, -126.33)     1
Sparse Reg High Dim   5        20        -531.75    (-533.41, -522.60)     0
Sparse Reg High Dim   8        20        N/A        N/A                    5

Table 2. Comparison of ELBO between MixFlow and Planar Flow with different numbers of layers on the real data examples. Each setting of Planar Flow is run for 5 trials. ELBOs of MixFlow are estimated using 1,000 independent trajectories, and ELBOs of Planar Flow are estimated using 2,000 independent samples.

MixFlow:
                      STEPSIZE    #LEAPFROG   #REFRESHMENT   ELBO       #SAMPLE
Lin Reg               0.0005      30          2000           -429.98    1000
Lin Reg Heavy         0.000025    50          2500           117.1      1000
Log Reg               0.002       50          1500           -250.5     1000
Poiss Reg             0.0001      50          2000           82576.9    1000
Student t Reg         0.0008      80          2000           -145.41    1000
Sparse Reg            0.001       30          600            -125.9     1000
Sparse Reg High Dim   0.0012      50          2000           -538.36    2000

Planar Flow:
                      #LAYER   MEDIAN     IQR                    #NAN
Lin Reg               5        -434.73    (-435.01, -434.65)     0
Lin Reg               10       -438.79    (-439.25, -436.77)     0
Lin Reg               20       -442.44    (-444.64, -440.93)     0
Lin Reg Heavy         5        79.65      (-75.92, 99.10)        0
Lin Reg Heavy         10       23.13      (21.29, 100.10)        0
Lin Reg Heavy         20       -879.19    (-1641.19, -392.80)    0
Log Reg               5        -251.71    (-251.85, -251.41)     0
Log Reg               10       -251.80    (-252.03, -251.80)     0
Log Reg               20       -252.09    (-252.20, -251.93)     0
Poiss Reg             5        82569.58   (82569.33, 82569.68)   0
Poiss Reg             10       82568.49   (82568.12, 82569.57)   0
Poiss Reg             20       82566.40   (82552.57, 82568.05)   2
Student t Reg         5        -145.69    (-145.71, -145.64)     0
Student t Reg         10       -145.72    (-145.73, -145.69)     0
Student t Reg         20       -145.76    (-145.77, -145.72)     0
Sparse Reg            5        -127.58    (-127.60, -127.42)     0
Sparse Reg            10       -127.75    (-127.87, -127.68)     0
Sparse Reg            20       -127.67    (-128.48, -127.65)     0
Sparse Reg High Dim   5        -571.97    (-572.09, -565.10)     0
Sparse Reg High Dim   10       -571.70    (-573.22, -565.55)     0
Sparse Reg High Dim   20       -577.13    (-580.30, -569.39)     0

Table 3. Comparison of ELBO between MixFlow and Radial Flow with different numbers of layers on the real data examples. Each setting of Radial Flow is run for 5 trials. ELBOs of MixFlow are estimated using 1,000 independent trajectories, and ELBOs of Radial Flow are estimated using 2,000 independent samples.
MixFlow:
                      STEPSIZE    #LEAPFROG   #REFRESHMENT   ELBO       #SAMPLE
Lin Reg               0.0005      30          2000           -429.98    1000
Lin Reg Heavy         0.000025    50          2500           117.1      1000
Log Reg               0.002       50          1500           -250.5     1000
Poiss Reg             0.0001      50          2000           82576.9    1000
Student t Reg         0.0008      80          2000           -145.41    1000
Sparse Reg            0.001       30          600            -125.9     1000
Sparse Reg High Dim   0.0012      50          2000           -538.36    2000

Radial Flow:
                      #LAYER   MEDIAN     IQR                    #NAN
Lin Reg               5        -434.14    (-434.36, -434.12)     0
Lin Reg               10       -434.12    (-434.19, -434.12)     0
Lin Reg               20       -433.82    (-434.21, -433.73)     0
Lin Reg Heavy         5        111.99     (111.91, 112.01)       0
Lin Reg Heavy         10       111.99     (111.89, 112.00)       0
Lin Reg Heavy         20       111.70     (111.61, 111.71)       0
Log Reg               5        -251.33    (-251.49, -251.31)     0
Log Reg               10       -251.17    (-251.21, -251.06)     0
Log Reg               20       -250.96    (-250.97, -250.95)     0
Poiss Reg             5        82570.50   (82570.19, 82570.57)   0
Poiss Reg             10       82571.02   (82570.76, 82571.05)   0
Poiss Reg             20       82571.08   (82570.90, 82571.12)   0
Student t Reg         5        -145.61    (-145.62, -145.61)     0
Student t Reg         10       -145.59    (-145.60, -145.59)     0
Student t Reg         20       -145.58    (-145.60, -145.57)     0
Sparse Reg            5        -127.28    (-127.37, -127.27)     0
Sparse Reg            10       -127.00    (-127.03, -126.88)     0
Sparse Reg            20       -126.68    (-126.69, -126.61)     0
Sparse Reg High Dim   5        -548.05    (-548.54, -547.98)     0
Sparse Reg High Dim   10       -547.20    (-547.74, -547.16)     0
Sparse Reg High Dim   20       -547.03    (-547.78, -546.76)     0

Figure 17. ELBO comparison against NFs for the real data examples (except linear regression, which is displayed in Figure 2a): (a) linear regression, (b) linear regression (heavy tail), (c) logistic regression, (d) Poisson regression, (e) Student-t regression, (f) sparse regression, (g) sparse regression (high dim). Each NF method is tuned under various settings, and only the best is presented for each example. Each figure also shows the effect of step size on MixFlow; overly large or small step sizes negatively influence the performance of MixFlow. Lines indicate the median, and error regions indicate the 25th to 75th percentile over 5 runs.

Figure 18. KSD comparison for the real data examples: (a) linear regression, (b) linear regression (heavy tail), (c) logistic regression, (d) Poisson regression, (e) Student-t regression, (f) sparse regression, (g) sparse regression (high dim).

Figure 19. Time results (sampling time comparison to NF, NUTS, NEO, and HMC, and per-second ESS comparison to HMC; each repeated for 100 trials) for the two linear regression problems (19a, 19b), the two generalized linear regression problems (19c, 19d), and one sparse regression problem (19e).

Figure 20. Sample quality visualization of 2,000 i.i.d. samples drawn from each of MixFlow (green scatter) and NF (orange scatter) on 4 real data examples: (a) linear regression (first 5 dimensions), (b) linear regression with heavy-tailed Cauchy prior (first 10 dimensions), (c) Poisson regression (all 8 dimensions), and (d) high-dimensional sparse regression (first 10 dimensions). Pairwise kernel density estimates (KDE) are based on 20,000 NUTS samples, where NUTS is initialized at the learned mean of the mean-field Gaussian approximation to the posterior and uses 20,000 steps for adaptation (targeting an average acceptance rate of 0.7).
The NF shown is the same one compared in Figures 4 and 19.

Table 4. Minimum, average, and standard deviation of the KSD obtained from NEO for each of the examples across all tuning settings. #FAIL indicates the number of tuning settings that resulted in NaN outputs. The final KSDs obtained from MixFlow are also included for comparison.

                                 MIXFLOW KSD   MIN. KSD   AVG. KSD   SD. KSD   #FAIL
banana                           0.06          0.06       1.30       1.69      7/24
cross                            0.13          0.11       2.78       3.09      12/24
Neal's funnel                    0.04          0.04       0.76       1.84      6/24
warped Gaussian                  0.15          0.23       0.55       0.25      0/14
linear regression                1.64          9.37       143.56     84.16     18/24
linear regression (heavy tail)   51.43         396.97     1589.71    599.47    18/24
logistic regression              0.34          0.72       19.28      11.70     18/24
Poisson regression               10.57         639.08     1642.81    670.80    18/24
t regression                     0.11          0.15       4.79       5.03      12/24
sparse regression                0.44          1.26       43.83      25.58     18/24

Table 5. ELBO results of UHA with different numbers of leapfrog steps (#LFRG) and numbers of refreshments (#REF) on the real data examples. Each setting of UHA is run for 5 trials, and each ELBO is estimated using 5,000 independent samples.

                      #LFRG   #REF   MEDIAN      IQR
Lin Reg               10      5      -430.438    (-430.488, -430.411)
Lin Reg               20      5      -430.589    (-430.596, -430.541)
Lin Reg               50      5      -432.215    (-432.268, -432.146)
Lin Reg               10      10     -430.468    (-430.511, -430.457)
Lin Reg               20      10     -431.404    (-431.441, -431.202)
Lin Reg               50      10     -432.462    (-432.56, -432.327)
Lin Reg Heavy         10      5      120.188     (120.08, 120.401)
Lin Reg Heavy         20      5      119.126     (118.806, 119.796)
Lin Reg Heavy         50      5      119.134     (118.994, 119.247)
Lin Reg Heavy         10      10     118.222     (118.125, 118.245)
Lin Reg Heavy         20      10     118.191     (118.151, 118.202)
Lin Reg Heavy         50      10     118.059     (117.718, 118.175)
Log Reg               10      5      -250.588    (-250.607, -250.553)
Log Reg               20      5      -250.555    (-250.565, -250.551)
Log Reg               50      5      -250.601    (-250.64, -250.567)
Log Reg               10      10     -250.643    (-250.662, -250.584)
Log Reg               20      10     -250.578    (-250.615, -250.563)
Log Reg               50      10     -250.913    (-251.006, -250.839)
Poiss Reg             10      5      82578.439   (82578.43, 82578.459)
Poiss Reg             20      5      82578.453   (82578.372, 82578.522)
Poiss Reg             50      5      82578.421   (82578.342, 82578.465)
Poiss Reg             10      10     82578.45    (82578.435, 82578.488)
Poiss Reg             20      10     82578.449   (82578.368, 82578.458)
Poiss Reg             50      10     82578.369   (82578.322, 82578.414)
Student t Reg         10      5      -145.46     (-145.461, -145.438)
Student t Reg         20      5      -145.424    (-145.451, -145.423)
Student t Reg         50      5      -145.427    (-145.43, -145.421)
Student t Reg         10      10     -145.441    (-145.454, -145.404)
Student t Reg         20      10     -145.439    (-145.466, -145.411)
Student t Reg         50      10     -145.41     (-145.417, -145.407)
Sparse Reg            10      5      -125.936    (-125.953, -125.931)
Sparse Reg            20      5      -125.954    (-125.981, -125.944)
Sparse Reg            50      5      -126.279    (-126.28, -126.226)
Sparse Reg            10      10     -125.978    (-125.988, -125.931)
Sparse Reg            20      10     -126.032    (-126.057, -125.961)
Sparse Reg            50      10     -126.557    (-126.559, -126.525)
Sparse Reg High Dim   10      5      -501.201    (-501.201, -500.778)
Sparse Reg High Dim   20      5      -499.467    (-499.471, -499.387)
Sparse Reg High Dim   50      5      -498.968    (-499.001, -498.656)
Sparse Reg High Dim   10      10     -488.7      (-488.799, -488.561)
Sparse Reg High Dim   20      10     -487.216    (-487.662, -487.042)
Sparse Reg High Dim   50      10     -486.423    (-486.477, -485.83)