# On Sampling with Approximate Transport Maps

Louis Grenioux¹, Alain Oliviero Durmus¹, Eric Moulines¹, Marylou Gabrié¹

¹ École Polytechnique. Correspondence to: Louis Grenioux, Alain Oliviero Durmus, Eric Moulines, Marylou Gabrié.

*Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).*

**Abstract.** Transport maps can ease the sampling of distributions with non-trivial geometries by transforming them into distributions that are easier to handle. The potential of this approach has risen with the development of Normalizing Flows (NF), which are maps parameterized with deep neural networks trained to push a reference distribution towards a target. Recently proposed NF-enhanced samplers blend (Markov chain) Monte Carlo methods with either (i) proposal draws from the flow or (ii) a flow-based reparametrization. In both cases, the quality of the learned transport conditions performance. The present work clarifies for the first time the relative strengths and weaknesses of these two approaches. Our study concludes that multimodal targets can be reliably handled with flow-based proposals up to moderately high dimensions. In contrast, methods relying on reparametrization struggle with multimodality but are more robust otherwise in high-dimensional settings and under poor training. To further illustrate the influence of target-proposal adequacy, we also derive a new quantitative bound for the mixing time of the Independent Metropolis-Hastings sampler.

## 1. Introduction

Creating a transport map between an intractable distribution of interest and a tractable reference distribution can be a powerful strategy for facilitating inference. Namely, if a bijective map $T : \mathbb{R}^d \to \mathbb{R}^d$ transports a tractable distribution $\rho$ on $\mathbb{R}^d$ to a target $\pi$ on the same space, then the expectation of any test function $f : \mathbb{R}^d \to \mathbb{R}$ under the target distribution can also be written as an expectation under the reference distribution,
$$\int_{\mathbb{R}^d} f(x)\,\mathrm{d}\pi(x) \;=\; \int_{\mathbb{R}^d} f(T(x))\,\pi(T(x))\,|J_T(x)|\,\mathrm{d}x \;=\; \int_{\mathbb{R}^d} f(T(x))\,\mathrm{d}\rho(x)\,,$$
where $J_T$ is the Jacobian determinant of $T$. However, the intractability of the target is traded here for the difficult task of finding the map $T$.

According to Brenier's theorem (Brenier, 1991), if $\rho$ is absolutely continuous, then an exact mapping $T$ between $\rho$ and $\pi$ always exists. Such a map may be known analytically, as in some field theories in physics (Lüscher, 2009). Otherwise, it can be approximated by optimizing a parameterized version of the map. While learned maps typically suffer from approximation and estimation errors, a sufficiently accurate approximate map is still valuable when combined with a reweighting scheme such as Markov chain Monte Carlo (MCMC) or Importance Sampling (IS). The choice of the parametrization must ensure that the mapping is invertible and that the Jacobian determinant remains easy to compute. Among the first works combining approximate transport and Monte Carlo methods, (Parno & Marzouk, 2018) proposed using triangular maps. Over the years, the term Normalizing Flow, introduced initially to refer to a Gaussianizing map (Tabak & Vanden-Eijnden, 2010), has become a common name for highly flexible transport maps, usually parameterized with neural networks, that allow efficient computations of inverses and Jacobians (Papamakarios et al., 2021; Kobyzev et al., 2021).
NFs were developed in particular for generative modelling and are now also a central tool for Monte Carlo algorithms based on transport maps. While the issue of estimating the map is of great interest in the context of sampling, it is not the focus of this paper; see e.g. (Rezende & Mohamed, 2015; Parno & Marzouk, 2018; Müller et al., 2019; Noé et al., 2019). Instead, we focus on comparing the performance of the algorithmic trends among NF-enhanced samplers that were developed in parallel. On the one hand, flows have been used as reparametrization maps that improve the geometry of the target before running traditional local samplers such as Hamiltonian Monte Carlo (HMC) (Parno & Marzouk, 2018; Hoffman et al., 2019; Noé et al., 2019; Cabezas & Nemeth, 2022). We refer to these strategies as neutra-MCMC methods. On the other hand, the push-forward of the NF base distribution through the map has also been used as an independent proposal in IS (Müller et al., 2019; Noé et al., 2019), an approach coined neural-IS, and in MCMC updates (Albergo et al., 2019; Gabrié et al., 2022; Samsonov et al., 2022) among others. We refer to the latter as flow-MCMC methods.

Despite the growing number of applications of NF-enhanced samplers, such as in Bayesian inference (Karamanis et al., 2022; Wong et al., 2022), Energy Based Model learning (Nijkamp et al., 2021), statistical mechanics (McNaughton et al., 2020), lattice QCD (Abbott et al., 2022b) or chemistry (Mahmoud et al., 2022), a comparative study of the neutra-MCMC, flow-MCMC and neural-IS methods is lacking. The present work fills this gap:

- We systematically compare the robustness of algorithms with respect to key performance factors: imperfect flow learning, poor conditioning and complex geometries of the target distribution, multimodality, and high dimensions (Section 3). We show that flow-MCMC and neural-IS can handle multimodal distributions up to moderately high dimensions, while neutra-MCMC is hindered in mixing between modes by the approximate nature of learned flows. For unimodal targets, we find that neutra-MCMC is more reliable than flow-MCMC and neural-IS given low-quality flows.
- We provide a new theoretical result on the mixing time of the independent Metropolis-Hastings (IMH) sampler by leveraging, for the first time to the best of our knowledge, a local approximation condition on the importance weights (Section 4).
- Intuitions formed on synthetic controlled cases are confirmed in real-world applications (Section 6).

The code to reproduce the experiments is available at https://github.com/h2o64/flow_mcmc.

## 2. Background

### 2.1. Normalizing flows

Normalizing flows are a class of probabilistic models combining a $C^1$-diffeomorphism $T : \mathbb{R}^d \to \mathbb{R}^d$ and a fully tractable probability distribution $\rho$ on $\mathbb{R}^d$. The push-forward of $\rho$ by $T$ is defined as the distribution of $X = T(Z)$ for $Z \sim \rho$ and has a density given by the change of variables as
$$\lambda^{\rho}_{T}(x) = \rho(T^{-1}(x))\,|J_{T^{-1}}(x)| \,, \qquad (1)$$
where $|J_T|$ denotes the Jacobian determinant of $T$. Similarly, given a probability density $\pi$ on $\mathbb{R}^d$, the push-backward of $\pi$ through $T$ is defined as the distribution of $Z = T^{-1}(X)$ for $X \sim \pi$ and has a density given by $\lambda^{\pi}_{T^{-1}}$. A parameterized family of $C^1$-diffeomorphisms $\{T_\alpha\}_{\alpha \in A}$ then defines a family of distributions $\{\lambda^{\rho}_{T_\alpha}\}_{\alpha \in A}$. This construction has recently been popularised by the use of neural networks (Kobyzev et al., 2021; Papamakarios et al., 2021) for generative modelling and sampling applications.
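As a concrete reading of (1), the short sketch below (our own illustration, not code from the paper's repository) uses a toy affine map $T(z) = \mu + s \odot z$ as a stand-in for a trained flow and checks numerically that the push-forward density computed through the change of variables matches the known density of $T(Z)$ for a standard Gaussian base.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy affine "flow" T(z) = mu + s * z (element-wise); a stand-in for a trained NF.
mu = np.array([1.0, -2.0])
s = np.array([0.5, 3.0])

def T_inv(x):
    return (x - mu) / s

def log_det_jac_T_inv(x):
    # log |J_{T^{-1}}(x)| = -sum_i log s_i for this affine map (constant in x)
    return -np.sum(np.log(s))

base = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))   # rho

def log_pushforward(x):
    # Change of variables (1): log lambda^rho_T(x) = log rho(T^{-1}(x)) + log |J_{T^{-1}}(x)|
    return base.logpdf(T_inv(x)) + log_det_jac_T_inv(x)

# The push-forward of N(0, I) by T is N(mu, diag(s^2)): both log-densities must agree.
pushforward = multivariate_normal(mean=mu, cov=np.diag(s**2))
x = np.array([0.3, -1.0])
assert np.isclose(log_pushforward(x), pushforward.logpdf(x))
```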
In the context of sampling, the flow is usually trained so that the push-forward distribution approximates the target distribution, i.e., $\lambda^{\rho}_{T_\alpha} \approx \pi$. As will be discussed in more detail below, the flow can then be used as a reparametrization map or for drawing proposals. Note that we are interested in situations where samples from $\pi$ are not available a priori when training flows for sampling applications. In that case, the reverse Kullback–Leibler (KL) divergence
$$\mathrm{KL}(\lambda^{\rho}_{T_\alpha} \,\|\, \pi) = \int \log\!\big(\lambda^{\rho}_{T_\alpha}(x)/\pi(x)\big)\,\lambda^{\rho}_{T_\alpha}(x)\,\mathrm{d}x \qquad (2)$$
can serve as a training target, since it can be efficiently estimated with i.i.d. samples from $\rho$. Minimizing the reverse KL amounts to variational inference with an NF candidate model (Rezende & Mohamed, 2015). This objective is also referred to in the literature as self-training or training by energy; it is notoriously prone to mode collapse (Noé et al., 2019; Jerfel et al., 2021; Hackett et al., 2021). On the other hand, the forward KL
$$\mathrm{KL}(\pi \,\|\, \lambda^{\rho}_{T_\alpha}) = \int \log\!\big(\pi(x)/\lambda^{\rho}_{T_\alpha}(x)\big)\,\pi(x)\,\mathrm{d}x \qquad (3)$$
is better behaved in this respect but more difficult to estimate, because it is an expectation over $\pi$. Remedies include importance reweighting (Müller et al., 2019), adaptive MCMC training (Parno & Marzouk, 2018; Gabrié et al., 2022), and sequential approaches to the target distribution (McNaughton et al., 2020; Arbel et al., 2021; Karamanis et al., 2022; Midgley et al., 2022).

Regardless of which training strategy is used, the learned model $\lambda^{\rho}_{T_\alpha}$ always suffers from approximation and estimation errors with respect to the target $\pi$. However, the approximate transport map $T_\alpha$ can still be used to produce high-quality Monte Carlo estimators using the strategies described in the next section.

### 2.2. Sampling with transport maps

Since NFs can be efficiently sampled from, they can easily be integrated into Monte Carlo methods relying on proposal distributions, such as IS and certain MCMCs.

**neural-IS** Importance Sampling uses a tractable proposal distribution, here denoted by $\lambda$, to calculate expected values with respect to $\pi$ (Tokdar & Kass, 2010). Assuming that the support of $\pi$ is included in the support of $\lambda$, we denote by
$$w(x) = \pi(x)/\lambda(x) \qquad (4)$$
the importance weight function, and define the self-normalized importance sampling (SNIS) estimator (Robert & Casella, 2005) of the expectation of $f$ under $\pi$ as $\sum_{i=1}^{N} w^i_N\, f(X_i)$, where $X_{1:N}$ are i.i.d. samples from $\lambda$ and
$$w^i_N = \frac{w(X_i)}{\sum_{j=1}^{N} w(X_j)} \qquad (5)$$
are the self-normalized importance weights. For IS to be effective, the proposal $\lambda$ must be close enough to $\pi$ in $\chi^2$-divergence (see (Agapiou et al., 2017, Theorem 1)), which makes IS also notably affected by the curse of dimensionality (e.g., (Agapiou et al., 2017, Section 2.4.1)). Adaptive IS methods propose to consider parametric proposals adjusted to match the target as closely as possible. NFs are well suited to this goal: they define a tractable push-forward density, can be easily sampled, and are very expressive. IS using flows as proposals is known as neural-IS (Müller et al., 2019) or Boltzmann Generator (Noé et al., 2019). NFs were also used in methods building on IS to specifically estimate normalization constants (Jia & Seljak, 2019; Ding & Zhang, 2021; Wirnsberger et al., 2020).
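To fix ideas, here is a minimal SNIS sketch in the spirit of neural-IS. The generic `sample_proposal`/`log_prob_proposal` interface and the Gaussian toy proposal and target below are our own illustrative choices, not the paper's setup; any flow exposing sampling and log-density could be plugged in their place.

```python
import numpy as np

def snis_estimate(f, sample_proposal, log_prob_proposal, log_prob_target, n=10_000, rng=None):
    """Self-normalized importance sampling estimate of E_pi[f] with proposal lambda.
    log_prob_target may be unnormalized: constants cancel in the weights (5)."""
    rng = np.random.default_rng() if rng is None else rng
    x = sample_proposal(n, rng)                            # X_1, ..., X_N ~ lambda
    log_w = log_prob_target(x) - log_prob_proposal(x)      # log w(X_i), cf. (4)
    w = np.exp(log_w - log_w.max())                        # stabilize before exponentiating
    w_norm = w / w.sum()                                   # self-normalized weights (5)
    participation_ratio = 1.0 / np.sum(w_norm ** 2)        # in [1, N], cf. Section 3.1
    return np.sum(w_norm * f(x)), participation_ratio

# Toy usage: N(0, 4 I) proposal for an unnormalized N(1, I) target in 2d; E_pi[x_0] = 1.
d = 2
sample_proposal = lambda n, rng: 2.0 * rng.standard_normal((n, d))
log_prob_proposal = lambda x: -0.5 * np.sum((x / 2.0) ** 2, axis=-1)   # up to a constant
log_prob_target = lambda x: -0.5 * np.sum((x - 1.0) ** 2, axis=-1)     # up to a constant
estimate, pr = snis_estimate(lambda x: x[:, 0], sample_proposal, log_prob_proposal, log_prob_target)
print(f"estimate of E[x_0]: {estimate:.3f}, participation ratio: {pr:.0f}")
```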
**Flow-based independent proposal MCMCs** Another way to leverage a tractable $\lambda^{\rho}_{T_\alpha} \approx \pi$ is to use it as a proposal in an MCMC with invariant distribution $\pi$. In particular, the flow can be used as a proposal for IMH. Metropolis-Hastings is a two-step iterative algorithm relying on a proposal Markov kernel, here denoted by $Q(x^{(k)}, \mathrm{d}x') = q(x^{(k)}, x')\,\mathrm{d}x'$. At iteration $k+1$, a candidate $x'$ is sampled from $Q$ conditionally on the previous state $x^{(k)}$, and the next state is set according to the rule
$$x^{(k+1)} = \begin{cases} x' & \text{with probability } \mathrm{acc}(x^{(k)}, x') \\ x^{(k)} & \text{with probability } 1 - \mathrm{acc}(x^{(k)}, x') \end{cases} \qquad (6)$$
where, given a target $\pi$, the acceptance probability is
$$\mathrm{acc}(x^{(k)}, x') = \min\left(1, \frac{q(x', x^{(k)})\,\pi(x')}{q(x^{(k)}, x')\,\pi(x^{(k)})}\right).$$
To avoid vanishing acceptance probabilities, the Markov kernel is usually chosen to propose local updates, as in the Metropolis-adjusted Langevin algorithm (MALA) (Roberts & Tweedie, 1996) or Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Neal et al., 2011), which exploit the local geometry of the target. These local MCMCs exhibit a trade-off between update size and acceptance, leading in many cases to a long decorrelation time for the resulting chain.

Conversely, NFs can serve as non-local proposals in the independent Metropolis-Hastings setting $q(x, x') = \lambda^{\rho}_{T_\alpha}(x')$. Since modern computer hardware allows a high degree of parallelism, it may also be advantageous to consider Markov chains with multiple proposals at each iteration, such as multiple-try Metropolis (Liu et al., 2000; Craiu & Lemieux, 2007) or iterative sampling-importance resampling (i-SIR) (Tjelmeland, 2004; Andrieu et al., 2010). The latter has been combined with NFs in (Samsonov et al., 2022). Setting the number of parallel trials to $N > 1$ in i-SIR, $N - 1$ proposals are drawn at iteration $k+1$: $x_l \sim \lambda^{\rho}_{T_\alpha}$ for $l = 2, \dots, N$, while $x_1$ is set equal to the previous state $x^{(k)}$. The next state $x^{(k+1)}$ is drawn from the set $\{x_l\}_{l=1}^{N}$ according to the self-normalized weights $w^l_N$ computed as in (5). If $x^{(k+1)}$ is not equal to $x^{(k)}$, the MCMC state has been fully refreshed.

In what follows, we refer to MCMC methods that use NFs as independent proposals, in IMH or in i-SIR, as flow-MCMC. These methods suffer from similar pitfalls as IS. If the proposal is not close enough to the target, the importance function $w(x) = \pi(x)/\lambda^{\rho}_{T_\alpha}(x)$ fluctuates widely, leading to long rejection streaks that become all the more difficult to avoid as the dimension increases.
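The IMH variant of flow-MCMC can be sketched as follows. The affine "flow" and the Gaussian target below are toy stand-ins introduced for the example (not the flows or targets used in the paper), chosen so that proposal and target are deliberately mismatched; a multiple-try (i-SIR) counterpart is sketched later in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2

# Toy stand-in for a trained flow: push-forward is N(0, 1.5^2 I_d).
flow_sample = lambda n: 1.5 * rng.standard_normal((n, d))
flow_log_prob = lambda x: -0.5 * np.sum((x / 1.5) ** 2, axis=-1)       # up to a constant
log_target = lambda x: -0.5 * np.sum((x - 1.0) ** 2, axis=-1)          # unnormalized N(1, I_d)

def log_w(x):
    # log importance function log w(x) = log pi(x) - log lambda(x), up to a constant
    return log_target(x[None])[0] - flow_log_prob(x[None])[0]

def flow_imh(n_steps, x0):
    """Independent Metropolis-Hastings with the flow as proposal, following rule (6).
    Since the proposal does not depend on the current state, acc = min(1, w(x')/w(x))."""
    x = np.array(x0, float)
    lw_x = log_w(x)
    chain = []
    for _ in range(n_steps):
        x_prop = flow_sample(1)[0]
        lw_prop = log_w(x_prop)
        if np.log(rng.uniform()) < lw_prop - lw_x:       # accept the independent proposal
            x, lw_x = x_prop, lw_prop
        chain.append(x.copy())
    return np.stack(chain)

chain = flow_imh(5_000, x0=np.zeros(d))
print("mean estimate:", chain.mean(axis=0))              # should be close to (1, 1)
```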
**Flow-reparametrized local MCMCs** A third strategy of NF-enhanced samplers is to use the reverse map $T_\alpha^{-1}$ defined by the flow to transport $\pi$ into a distribution which, it is hoped, will be easier to capture with standard local samplers such as MALA or HMC. This strategy was discussed by (Parno & Marzouk, 2018) in the context of triangular flows and by (Noé et al., 2019) and (Hoffman et al., 2019) with modern normalizing flows; we keep the denomination neutra-MCMC from the latter. neutra-MCMC amounts to sampling the push-backward of the target $\lambda^{\pi}_{T_\alpha^{-1}}$ with local MCMCs before mapping the samples back through $T_\alpha$. It can be viewed as a reparametrization of the space or a spatially-dependent preconditioning step that has similarities to Riemannian MCMC (Girolami & Calderhead, 2011; Hoffman et al., 2019). Indeed, local MCMCs notoriously suffer from poor conditioning. For example, one can show that MALA has a mixing time[^1] scaling as $O(\kappa \sqrt{d})$ - where $\kappa$ is the condition number of the target distribution - provided the target distribution is log-concave (Wu et al., 2022; Chewi et al., 2021; Dwivedi et al., 2018). Nevertheless, neutra-MCMC does not benefit from the fast decorrelation of flow-MCMC methods, since updates remain local. Still, local updates might be precisely the ingredient making neutra-MCMC escape the curse of dimensionality in some scenarios. Indeed, if the distribution targeted is log-concave, the mixing time of MALA mentioned above depends only sub-linearly on the dimension. The question becomes: when does the scaling ability that locality provides to neutra-MCMC allow it to beat neural-IS and flow-MCMC?

In a recent work, (Cabezas & Nemeth, 2022) also used a flow reparametrization for the Elliptical Slice Sampler (ESS) (Murray et al., 2010), which is gradient-free and parameter-free[^2]. Notably, ESS is able to cross energy barriers to tackle multimodal targets more efficiently than MALA or HMC (Natarovskii et al., 2021). In our experiments we include different versions of neutra-MCMC and indicate in parentheses the sampler used on the push-backward: either MALA, HMC or ESS.

[^1]: See Section 4 for a definition of the mixing time.
[^2]: (Cabezas & Nemeth, 2022) uses ESS with a fixed covariance parameter $\Sigma = I_d$. This choice is justified by the fact that using the neutra-MCMC trick amounts to sampling something close to the base of the flow, which is $\mathcal{N}(0, I_d)$.
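A minimal neutra-MCMC (MALA) loop, for illustration only: the diagonal affine "flow", the ill-conditioned Gaussian target and the step size below are our own toy choices, not the flows or tunings used in the experiments. The chain runs MALA on the pull-back density (the target composed with $T$; the constant log-Jacobian of this affine map is dropped) and maps each state back through $T$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ill-conditioned Gaussian target pi = N(0, diag([1, 100])).
target_var = np.array([1.0, 100.0])
log_target = lambda x: -0.5 * np.sum(x ** 2 / target_var, axis=-1)
grad_log_target = lambda x: -x / target_var

# Toy "flow" T(z) = s * z, with scales only roughly matching the target (imperfect map).
s = np.array([1.0, 9.0])
T = lambda z: s * z
log_latent = lambda z: log_target(T(z))                  # pull-back, up to the constant log|J_T|
grad_log_latent = lambda z: s * grad_log_target(T(z))    # chain rule

def neutra_mala(n_steps, step, z0):
    z = np.array(z0, float)
    samples = []
    for _ in range(n_steps):
        fwd_mean = z + step * grad_log_latent(z)
        z_prop = fwd_mean + np.sqrt(2 * step) * rng.standard_normal(z.shape)
        bwd_mean = z_prop + step * grad_log_latent(z_prop)
        log_acc = (log_latent(z_prop) - log_latent(z)
                   - np.sum((z - bwd_mean) ** 2) / (4 * step)
                   + np.sum((z_prop - fwd_mean) ** 2) / (4 * step))
        if np.log(rng.uniform()) < log_acc:
            z = z_prop
        samples.append(T(z))                             # map the latent state back to data space
    return np.stack(samples)

samples = neutra_mala(20_000, step=0.5, z0=np.zeros(2))
print("empirical variances:", samples.var(axis=0))       # should be in the vicinity of (1, 100)
```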
## 3. Synthetic case studies

neural-IS, neutra-MCMC and flow-MCMC build on well-studied Monte Carlo schemes with known strengths and weaknesses (see (Rubinstein & Kroese, 2017) for a textbook). Most of their limitations would be lifted if an exact transport between the base distribution and the target were available, as sampling from the flow would then directly amount to sampling from the target distribution. However, learned maps are imperfect, which leaves open a number of questions about the expected performance of NF-enhanced samplers: Which of the methods is most sensitive to the quality of the transport approximation? How do they fare on challenging multimodal targets? And how do they scale with dimension? In this section, we present systematic synthetic case studies answering the questions above.

In all of our experiments, we selected the samplers' hyperparameters by optimizing case-specific performance metrics. The length of chains was chosen to be twice the number of steps required to satisfy the $\hat{R}$ diagnostic (Gelman & Rubin, 1992) at the 1.1 threshold for the fastest converging algorithm. We used MALA as local sampler as it is suited for the log-concave distributions considered, and is faster and easier to tune than HMC. Evaluation metrics are fully described in App. D.1.

### 3.1. neutra-MCMCs are robust to imperfect training

Provided an exact transport is available, drawing independent samples from the base and pushing them through the map generates i.i.d. samples from the target. This is equivalent to running neural-IS and finding that the importance weights (4) are all equal. With an approximate transport map, it is not clear which sampling strategy to prefer. In practice, the quality of the learned flow depends on many parameters: expressiveness, training objective, optimization procedure, etc. In the first experiments that we now present, we design a framework in which the quality of the flow can be adjusted manually.

Our first target distribution is a multivariate Gaussian ($d = 128$) with an ill-conditioned covariance. We define an analytical flow with a scalar quality parameter $t \in [0, 1]$ leading to a perfect transport at $t = 0.5$ and an over-concentrated (respectively under-concentrated) push-forward at $t = 0$ (resp. $t = 1$)[^3], as shown on Fig. 1. All the experimental details are reported in App. D.2. For the perfect flow, neural-IS yields the most accurate samples as expected, closely followed by flow-MCMCs.

More interestingly, neutra-MCMC (MALA) quickly outperforms flow-MCMC as the flow shifts away from the exact transport, both towards over-spreading and over-concentrating the mass (Fig. 1). A low-quality flow rapidly leads to zero acceptance of NF proposals or very poor participation ratios[^4] for IS, which translates into neural-IS and flow-MCMC being even less efficient than MALA (see Fig. 12 in App. D.4). Conversely, neutra-MCMCs are found to be robust, as imperfect pre-conditioning is still an improvement over a simple MALA. These findings are confirmed by repeating the experiment on Neal's funnel distribution in App. D.3 (Fig. 11), for which the condition number of the target distribution fluctuates strongly over its support.

Finally, note that flow-MCMC methods using a multiple-try scheme, here i-SIR, remain more efficient than neutra-MCMC over a wider range of flow imperfections than the single-try IMH scheme. This advantage is understandable: an acceptable proposal is more likely to be available among multiple tries (see Fig. 12 in App. D.2). While multiple-try schemes are more costly per iteration, wall-clock times may be comparable thanks to parallelization (see Fig. 17 in App. F.1).

Figure 1. (Left) Push-backwards $\lambda^{\pi}_{T_t^{-1}}$ and push-forwards $\lambda^{\rho}_{T_t}$ as a function of the flow imperfection parameter $t$. (Right) Sliced TV distances of different samplers depending on the quality of the flow $t$, using 256 chains of length 1400 initialized with draws from the NF with $T_t$. neural-IS was evaluated with 14000 samples. Results were qualitatively unchanged for $d = 16, 32, 64, 256$.

[^3]: This mimics the behavior when fitting the closest Gaussian of type $\mathcal{N}(0, \sigma I_d)$ with the forward KL if $t = 0$ and with the backward KL if $t = 1$.
[^4]: The participation ratio of a sample of $N$ IS proposals, $1/\sum_{i=1}^{N} (w^i_N)^2 \in [1, N]$, tracks the number of samples contributing in practice to the computation of an SNIS estimator.

### 3.2. neutra-MCMC may not mix between modes

MCMCs with global updates or IS can effectively capture multiple modes, provided the proposal distribution is well matched to the target. MCMCs with local updates, on the other hand, usually cannot properly sample multimodal targets due to a prohibitive mixing time[^5]. Therefore, the performance in multimodal environments of neutra-MCMC methods coupled with local samplers depends on the ability of the flow to lower energy barriers while keeping a simple geometry in the push-backward of the target.

We first examined the push-backward of common flow architectures trained by likelihood maximization on 2d target distributions. Both for a mixture of 4 isotropic Gaussians (Fig. 2 left) and for the Two-moons distribution (Fig. 8 in App. C), the approximate transport map is unable to erase energy barriers and creates an intricate push-backward landscape. Visualizing chains in latent and direct spaces makes it clear that neutra-MCMC (MALA) mixing is hindered by the complex geometry, the gradient-free neutra-MCMC (ESS) mixes more successfully, while flow-MCMC (i-SIR) is even more efficient.

[^5]: Indeed, the Eyring-Kramers law shows that the exit time of a basin of attraction of the Langevin diffusion has an expectation which scales exponentially in the depth of the mode; see e.g. (Bovier et al., 2005) and the references therein.
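The i-SIR update used by flow-MCMC in these comparisons can be sketched as follows, again with a toy Gaussian "flow" and target of our own choosing rather than the Real NVP of the experiment: the previous state is kept as one of the $N$ candidates and the next state is resampled from the self-normalized weights of Section 2.2.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 2

flow_sample = lambda n: 2.0 * rng.standard_normal((n, d))              # toy flow: N(0, 4 I_d)
flow_log_prob = lambda x: -0.5 * np.sum((x / 2.0) ** 2, axis=-1)       # up to a constant
log_target = lambda x: -0.5 * np.sum((x - 1.0) ** 2, axis=-1)          # unnormalized N(1, I_d)

def isir_step(x_prev, n_trials):
    """One i-SIR update: candidate 1 is the previous state, candidates 2..N are fresh
    flow proposals; the next state is drawn according to the self-normalized weights (5)."""
    candidates = np.concatenate([x_prev[None], flow_sample(n_trials - 1)], axis=0)
    log_w = log_target(candidates) - flow_log_prob(candidates)
    w = np.exp(log_w - log_w.max())
    idx = rng.choice(n_trials, p=w / w.sum())
    return candidates[idx]

x, chain = np.zeros(d), []
for _ in range(2_000):
    x = isir_step(x, n_trials=32)
    chain.append(x)
print("mean estimate:", np.mean(chain, axis=0))          # should be close to (1, 1)
```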
We systematically extended the experiment on 4-Gaussians by training a Real NVP (Dinh et al., 2016) with maximum likelihood for a range of increasing dimensions (see App. D.5 for all experimental details). We tested the relative ability of the neural-IS, neutra-MCMC and flow-MCMC algorithms to represent the relative weights of each Gaussian component by building histograms of the visited modes within a chain and comparing them with the perfect uniform histogram using a median squared error (Fig. 2 middle). As dimension increases and the quality of the transport map presumably decreases, the ability of neutra-MCMC (MALA) to change basin worsens. Using neutra-MCMC with ESS enables an approximate mixing between modes, but only flow-MCMCs recover the exact histograms robustly up to $d = 256$. In other words, dimension only heightens the performance gap between NF-enhanced samplers observed in small dimension. Note however that the acceptance of independent proposals drops with dimension, such that flow-MCMC is also eventually limited by dimension (Fig. 13 in App. D.5). Not included in the plots for readability, neural-IS behaves similarly to flow-MCMC.

In fact, it is expected that exact flows mapping a unimodal base distribution to a multimodal target distribution are difficult to learn (Cornish et al., 2020). More precisely, Theorem 2.1 of that reference shows that the bi-Lipschitz constant[^6] of a transport map between two distributions with different supports is infinite. Here we provide a complementary result on the illustrative uni-dimensional case of a Gaussian mixture target $\pi$ and standard normal base $\rho$:

**Proposition 3.1.** Let $\pi = \mathcal{N}(-a, \sigma^2)/2 + \mathcal{N}(a, \sigma^2)/2$ with $a > 0$, $\sigma > 0$ and $\rho = \mathcal{N}(0, 1)$. The unique increasing flow mapping $\pi$ to $\rho$, denoted $T_{\pi,\rho}$, verifies
$$\mathrm{BiLip}(T_{\pi,\rho}) \;\geq\; \frac{\mathrm{d}T^{-1}_{\pi,\rho}}{\mathrm{d}z}(0) \;=\; \sigma \exp\!\left(\frac{a^2}{2\sigma^2}\right).$$

The proof of Proposition 3.1, showing the exponential scaling of the bi-Lipschitz constant in the distance between modes, is postponed to Appendix A. Overall, these results show that neural-IS and flow-MCMC are typically more effective for multimodal targets than neutra-MCMC. Note further that it has also been proposed to use mixture base distributions (Izmailov et al., 2020) or mixtures of NFs (Noé et al., 2019; Gabrié et al., 2022; Hackett et al., 2021) to accommodate multimodal targets, provided that prior knowledge of the modes' structure is available. These mixture models can easily be plugged into neural-IS and flow-MCMC; however, it is unclear how to combine them with neutra-MCMC.

Finally, mixing neutra-MCMC and flow-MCMC schemes by alternating between global updates (crossing energy barriers) and flow-preconditioned local steps (robust within a mode) seems promising, in particular when properties of the target distribution are not known a priori. It will be referred to below as neutra-flow-MCMC. A proposition along these lines was also made in (Grumitt et al., 2022).

[^6]: The bi-Lipschitz constant $\mathrm{BiLip}(f)$ of $f$ is defined as the infimum over $M \in [1, \infty]$ such that $M^{-1}\,\|z - z'\| \leq \|f(z) - f(z')\| \leq M\,\|z - z'\|$ for all $z \neq z'$.
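For concreteness, a minimal version of the mode-occupancy metric described above could look like the sketch below; the nearest-center assignment and the toy chains are assumptions of ours, the exact protocol being the one of App. D.5.

```python
import numpy as np

def mode_histogram_error(chain, mode_centers):
    """Assign each chain state to the nearest mode center, histogram the visits and
    return the squared error to the ideal uniform histogram (the figure reports the
    median of this quantity over many chains)."""
    dists = np.linalg.norm(chain[:, None, :] - mode_centers[None, :, :], axis=-1)
    visited = np.argmin(dists, axis=1)
    hist = np.bincount(visited, minlength=len(mode_centers)) / len(chain)
    return np.sum((hist - 1.0 / len(mode_centers)) ** 2)

# Toy check: a chain stuck in one of 4 modes scores far worse than a well-mixed chain.
rng = np.random.default_rng(3)
centers = np.array([[5.0, 5.0], [5.0, -5.0], [-5.0, 5.0], [-5.0, -5.0]])
stuck = centers[0] + rng.standard_normal((1000, 2))
mixed = centers[rng.integers(4, size=1000)] + rng.standard_normal((1000, 2))
print(mode_histogram_error(stuck, centers), mode_histogram_error(mixed, centers))
```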
### 3.3. flow-MCMCs are the most impacted by dimension

To investigate the effect of dimension, we ran a systematic experiment on the Banana distribution, a unimodal distribution with complicated geometry (details in App. D.6). We compared NF-enhanced samplers powered by Real NVPs trained to optimize the backward KL and found that a crossover occurs in moderate dimensions: neural-IS and flow-MCMC algorithms are more efficient in small dimensions but are more affected by the increase in dimension than neutra-MCMC algorithms (Fig. 2 right).

Figure 2. (Left) Example chains of NF-enhanced walkers with a 2d target mixture of 4 Gaussians. The 128-step MCMC chain is colored according to the closest mode in the data space (bottom row) with corresponding locations in the latent space (top row). The complex geometry of the push-backward $\lambda^{\pi}_{T_\alpha^{-1}}$ hinders the mixing of local-update algorithms. MALA's step-size was chosen to reach 75% acceptance. (Middle) Median squared error of the histograms of visited modes of the 4 Gaussians per chain against the perfect uniform histogram, as a function of dimension. 512 chains of 1000 steps on average were used. (Right) Sliced total variation in sampling the Banana distribution in increasing dimension using a Real NVP. 128 chains of 1024 steps were used.

## 4. New mixing rates for IMH

As illustrated by the previous experiment, learning and sampling are expected to be more challenging when dimension increases. To better understand the interplay between flow quality and dimension, we examined the case of the IMH sampler applied to a strongly log-concave target $\pi$ and proposal $\mathcal{N}(0, \sigma^2 I_d)$[^7], $\sigma > 0$, with density denoted by $q_\sigma$. To this end, we consider the following assumption on the importance weight function $w_\sigma(x) = \pi(x)/q_\sigma(x)$:

**Assumption 4.1.** For any $R \geq 0$, there exists $C_R \geq 0$ such that for any $x, y \in \mathrm{B}(0, R) = \{z : \|z\| < R\}$:
$$|\log w_\sigma(x) - \log w_\sigma(y)| \leq C_R\, \|x - y\| \,. \qquad (9)$$

The constant $C_R$ appearing in Assumption 4.1, for $R \geq 0$, represents the quality of the proposal with respect to the target $\pi$ locally on $\mathrm{B}(0, R)$. Indeed, if $q_\sigma = \pi$ on $\mathrm{B}(0, R)$, this constant is zero. In particular, $q_\sigma \approx \pi$ with $q_\sigma$ and $\pi$ smooth and $\nabla \log w_\sigma(x) \approx 0$ on $\mathrm{B}(0, R)$ would result in Assumption 4.1 holding with a small constant $C_R$. In contrast to existing analyses of IMH, which assume $w_\sigma$ to be uniformly bounded to obtain explicit convergence rates, here we only assume a smooth local condition on $\log w_\sigma$, namely that it is locally Lipschitz. Note that this latter condition is milder than the former. To the best of our knowledge, it is the first time that such a condition is considered for IMH; a thorough comparison of our contribution with the literature is postponed to Section 5. While we relax existing conditions on $w_\sigma$, we restrict our study to the following particular class of targets:

**Assumption 4.2.** The target $\pi$ is positive, and $-\log \pi$ is $m$-strongly convex on $\mathbb{R}^d$ and attains its minimum at $0$.

Denote by $P_\sigma$ the IMH Markov kernel with target $\pi$ and proposal $\mathcal{N}(0, \sigma^2 I_d)$. In our next result, we analyze the mixing time of $P_\sigma$, defined for an accuracy $\epsilon > 0$ and an initial distribution $\mu$ as
$$\tau_{\mathrm{mix}}(\mu, \epsilon) = \inf\{n \in \mathbb{N} : \|\mu P^n_\sigma - \pi\|_{\mathrm{TV}} \leq \epsilon\} \,. \qquad (10)$$
$\tau_{\mathrm{mix}}$ quantifies the number of MCMC steps needed to bring the total variation distance[^8] between the Markov chain and its invariant distribution below $\epsilon$.

**Theorem 4.3** (Explicit mixing time bounds for IMH). Assume Assumptions 4.1 and 4.2 hold. Let $0 < \epsilon < 1$ and $\mu$ a $\beta$-warm distribution with respect to $\pi$[^9]. Suppose in addition that $C_R \leq \log 2\, \sqrt{m}/32$ with
$$R \geq C\, d\, \max\{\sigma, 1/\sqrt{m}\}\,\big(1 + |\log^{\alpha}(\epsilon/\beta)|\,/\,d^{\alpha/2}\big) \,, \qquad (11)$$
for some explicit numerical constant $C \geq 0$ and exponent $\alpha > 0$. Then the mixing time of IMH is bounded as
$$\tau_{\mathrm{mix}}(\mu, \epsilon) \;\leq\; 128\, \log\!\left(\frac{2\beta}{\epsilon}\right)\, \max\!\left\{1, \frac{128^2\, C_R^2}{\log(2)^2\, m}\right\} .$$

The proof of Theorem 4.3 is postponed to App. E. It shows that if $C_R$ is bounded by a constant independent of the dimension for $R$ of order at least $d$, then the mixing time is also independent of the dimension, which recovers easy consequences of existing analyses (Roberts & Rosenthal, 2011; Wang, 2022). In contrast to these works, Theorem 4.3 can be applied to the illustrative case where $\pi = \mathcal{N}(0, I_d)$ and $\sigma = 1 + \lambda$, considering the error term $\lambda$ either positive or negative (for which $w_\sigma$ is not uniformly bounded). In that case, Theorem 4.3 shows that reaching a precision $\epsilon$ with a fixed number of MCMC steps $n$ requires $\lambda$ to decrease as $O(1/d)$ (the detailed derivation is postponed to App. E).

[^7]: Nevertheless, we develop a theory for a generic proposal in Appendix E through Theorem E.4.
[^8]: For two distributions $\mu, \nu$ on $\mathbb{R}^d$, $\|\mu - \nu\|_{\mathrm{TV}} = \sup_{A \in \mathcal{B}(\mathbb{R}^d)} |\mu(A) - \nu(A)|$.
[^9]: For any Borel set $E$, $\mu(E) \leq \beta\, \pi(E)$.
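To make Assumption 4.1 concrete in this illustrative Gaussian case, the following back-of-the-envelope computation (ours, not the derivation of App. E) gives the local Lipschitz constant of the log-weights explicitly.

```latex
% Local Lipschitz constant of log w_sigma for pi = N(0, I_d) and q_sigma = N(0, sigma^2 I_d)
\[
  \log w_\sigma(x) = \log \pi(x) - \log q_\sigma(x)
                   = d \log \sigma + \tfrac{1}{2}\!\left(\tfrac{1}{\sigma^{2}} - 1\right)\|x\|^{2},
  \qquad
  \nabla \log w_\sigma(x) = \left(\tfrac{1}{\sigma^{2}} - 1\right) x .
\]
\[
  \text{Hence, on } \mathrm{B}(0,R): \quad
  |\log w_\sigma(x) - \log w_\sigma(y)| \le \bigl|1 - \sigma^{-2}\bigr|\, R\, \|x - y\| ,
  \quad \text{i.e. } C_R = \bigl|1 - \sigma^{-2}\bigr| R \approx 2 |\lambda| R
  \ \text{ for } \sigma = 1 + \lambda,\ |\lambda| \ll 1 .
\]
```

If $R$ grows proportionally to $d$, as in our reading of condition (11), keeping $C_R$ below the $\sqrt{m}$-threshold of Theorem 4.3 then forces $|\lambda| = O(1/d)$, consistent with the scaling quoted above.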
Finally, note that we do not assume that $\pi$ is $L$-smooth, i.e., that $\nabla \log \pi$ is Lipschitz, in Theorem 4.3. This condition is generally considered in existing results on MALA and HMC for strongly log-concave target distributions; see (Dwivedi et al., 2018; Chen et al., 2020).

## 5. Related Works

**Comparison of NF-enhanced samplers** Several papers have investigated the difficulty of flow-MCMC algorithms in scaling with dimension (Del Debbio et al., 2021; Abbott et al., 2022a). Hurdles arising from multimodality were also discussed in (Hackett et al., 2021; Nicoli et al., 2022) in the context of flow-MCMC methods. Meanwhile, the authors of (Hoffman et al., 2019) argued that the success of their neutra-MCMC was tied to the quality of the flow but did not provide experiments in this direction. To the best of our knowledge, no thorough comparative study of the different NF-enhanced samplers was performed prior to this work.

As previously mentioned, (Grumitt et al., 2022) proposed to mix local NF-preconditioned steps with NF Metropolis-Hastings steps, i.e., to combine neutra-MCMC and flow-MCMC methods. However, the focus of these authors was on the aspect of performing deterministic local updates using an instantaneous estimate of the density of walkers provided by the flow. More related to the present work, they present a brief ablation study in their Appendix D. Enhancing Sequential Monte Carlo (Del Moral et al., 2006) with NFs has also been investigated by (Arbel et al., 2021; Karamanis et al., 2022). These methods are more involved and require the definition of a collection of target distributions approaching the final distribution of interest. They could not be directly compared to neural-IS, neutra-MCMC and flow-MCMC. We also note that (Invernizzi et al., 2022) recently proposed another promising method to assist sampling with flows, in the context of replica exchange MCMCs.

**IMH analysis** Most analyses establishing quantitative convergence bounds rely on the condition that the ratio $\pi/q$ be uniformly bounded (Yang & Liu, 2021; Brown & Jones, 2021; Wang, 2022). In these works, it is shown that IMH is uniformly geometric in total variation or Wasserstein distances. Our contribution relaxes the uniform boundedness condition on $\pi/q$ by restricting our study to the class of strongly log-concave targets.
The analysis of local MCMC samplers, such as MALA or HMC, for sampling from a strongly log-concave target is now well developed; see e.g. (Dwivedi et al., 2018; Chen et al., 2020; Chewi et al., 2021; Wu et al., 2022). These works rely on the notion of $s$-conductance for a reversible Markov kernel and on the results developed in (Lovász & Simonovits, 1993) connecting this notion to the kernel's mixing time. This strategy has since been applied successively to numerous MCMC algorithms; e.g., (Lovász, 1999; Vempala, 2005; Lovász & Vempala, 2007; Chandrasekaran et al.; Mou et al., 2019; Cousins & Vempala; Laddha & Vempala, 2020; Narayanan & Srivastava, 2022). We follow the same path in the proof of Theorem 4.3. Finally, while (Roberts & Rosenthal, 2011) establish a general convergence result for IMH under mild assumptions, exploiting this result turns out to be difficult. In particular, we believe it cannot be made quantitative if $\pi/q$ is unbounded, since their main convergence result involves an intractable expectation with respect to the target distribution.

## 6. Benchmarks on real tasks

In this section we compare NF-enhanced samplers beyond the previously discussed synthetic examples. Our main findings carry over to real-world use cases. An additional experiment in high dimension with an image dataset is available in App. G.

### 6.1. Molecular system

Our first experiment is the alanine dipeptide molecular system, which consists of 22 atoms in an implicit solvent. Our goal is to capture the Boltzmann distribution at temperature $T = 300\,\mathrm{K}$ of the atomic 3D coordinates, which is known to be multimodal. We have used the flow trained in (Midgley et al., 2022) to drive the samplers and generated 2d projections of the outputs in Figure 3. neutra-MCMC methods are not perfectly mixing between modes, while flow-MCMC properly explores the weaker modes. For more details, see App. F.3.

Figure 3. Sampled configurations of alanine dipeptide projected from 66 Cartesian coordinates to 2 dihedral angles $\phi$ and $\psi$ (see App. F.3). (Top) Samples from the flow (left) and samples from a single MCMC chain of the different NF-samplers are shown as bright-colored points on a colored background displaying the log-histogram of exact samples at $T = 300\,\mathrm{K}$ obtained by a Replica Exchange Molecular Dynamics simulation of (Stimper et al., 2022). (Bottom) Log-histograms of samples from the flow (left) and from 256 MCMC chains started at the same location.

### 6.2. Sparse logistic regression

Our second experiment is a sparse Bayesian hierarchical logistic regression on the German credit dataset (Dua & Graff, 2017), which has been used as a benchmark in recent papers (Hoffman et al., 2019; Grumitt et al., 2022; Cabezas & Nemeth, 2022). We trained an Inverse Autoregressive Flow (IAF) (Papamakarios et al., 2017) using the procedure described in (Hoffman et al., 2019). More details about the sampled distribution and the construction of the flow are given in App. F.2. We sampled the posterior predictive distribution on a test dataset and reported the log-posterior predictive density values for these samples in Table 1. neutra-MCMC methods achieve higher posterior predictive values compared to flow-MCMC methods, which differ little from HMC.

Table 1. Predictive posterior distribution for Bayesian sparse logistic regression on the German credit dataset.

| Sampler | Average predictive log-posterior |
| --- | --- |
| neutra-MCMC (HMC) | -191.1 ± 0.1 |
| neutra-flow-MCMC (i-SIR + HMC) | -194.1 ± 1.6 |
| flow-MCMC (i-SIR) | -208.5 ± 2.1 |
| HMC | -209.7 ± 1.0 |
Note that neutra-flow-MCMC, alternating between flow-MCMC and neutra-MCMC, does not improve upon neutra-MCMC.

### 6.3. Field system

In our last experiment we investigate the 1-d $\phi^4$ model used as a benchmark in (Gabrié et al., 2022). This field system has two well-separated modes at the chosen temperature. Defined at the continuous level, the field can be discretized with different grid sizes, leading to practical implementations in different dimensions. We trained a Real NVP in 64, 128, and 256 dimensions by minimizing an approximated forward KL (more details on this procedure in App. F.4). Consistent with the results of Section 3.2, neutra-MCMC (MALA) chains remain in the modes in which they were initialized, neutra-MCMC (ESS) rarely crosses over to the other mode, while flow-MCMC is able to mix properly (see Fig. 4 left). To further examine performance as a function of dimension, we considered the distribution restricted to the initial mode only and calculated the sliced total variation of the samplers' chains compared to exact samples (Fig. 4 right). neutra-MCMC methods appear to be less accurate here than flow-MCMC. Even within a mode, the global updates appear to allow for more effective exploration. Both approaches suffer as dimension grows.

Figure 4. (Left) Sampled $\phi^4$ configurations in dimension 128. (Right) Within-mode sliced TV as a function of dimension.

### 6.4. Run time considerations

In all experiments, algorithms were compared with a fixed sample size, yet wall-clock time and computational costs per iteration vary between samplers: neural-IS and single-try flow-MCMC require two passes through the flow per iteration, while neutra-MCMCs typically require more. Multiple-try flow-MCMC computational cost scales linearly with the number of trials, yet can be parallelized. In App. F.1 we report the run time per iteration for the experiments of this section. Results show that neutra-MCMCs are usually significantly slower per iteration than other methods. Nevertheless, expensive target evaluations, such as in Molecular Dynamics, impact in particular multiple-try flow-MCMC.

## 7. Conclusion

As a conclusion, we gather our findings in an informal summary (Table 2) of heuristics depending on the type of target and on the quality of the map and the dimension of the problem, which typically go together (the lower the dimension, the better the learned map, and vice versa). In synthetic and real experiments, we show that NFs provide a significant advantage in sampling, provided the method is chosen according to the properties of the target. However, high-dimensional multimodal targets remain a challenge for NF-assisted samplers.

Table 2. Informal summary of findings, with rows flow-MCMC, neural-IS and neutra-MCMC, and columns covering unimodal and multimodal targets in low, moderate and high dimension (which typically correspond to good, fair and poor maps respectively). The symbol ✓ indicates where algorithms can be expected to produce reliable samples; it is starred when faster than another suited algorithm for the task. The symbol ∼ is used when the algorithm reaches its limits and may fail. A cross indicates failure to sample the target.

## Acknowledgements

We thank the anonymous reviewers for their valuable feedback on the paper. L.G. and M.G. acknowledge funding from Hi! Paris. The work was partly supported by ANR-19-CHIA-0002-01 SCAI.
Part of this research has been carried out under the auspice of the Lagrange Center for Mathematics and Computing. A.O.D. would like to thank the Isaac Newton Institute for Mathematical Sciences for support and hospitality during the programme "The mathematical and statistical foundation of future data-driven engineering" when work on this paper was undertaken. This work was supported by EPSRC grant number EP/R014604/1.

## References

Abbott, R., Albergo, M. S., Botev, A., Boyda, D., Cranmer, K., Hackett, D. C., Matthews, A. G. D. G., Racanière, S., Razavi, A., Rezende, D. J., Romero-López, F., Shanahan, P. E., and Urban, J. M. Aspects of scaling and scalability for flow-based sampling of lattice QCD. arXiv:2211.07541, 2022a.

Abbott, R., Albergo, M. S., Boyda, D., Cranmer, K., Hackett, D. C., Kanwar, G., Racanière, S., Rezende, D. J., Romero-López, F., Shanahan, P. E., Tian, B., and Urban, J. M. Gauge-equivariant flow models for sampling in lattice field theories with pseudofermions. Physical Review D, 106(7):074506, 2022b.

Agapiou, S., Papaspiliopoulos, O., Sanz-Alonso, D., and Stuart, A. M. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405-431, 2017.

Albergo, M. S., Kanwar, G., and Shanahan, P. E. Flow-based generative models for Markov chain Monte Carlo in lattice field theory. Physical Review D, 100(3):034515, 2019.

Andrieu, C., Doucet, A., and Holenstein, R. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269-342, 2010.

Arbel, M., Matthews, A., and Doucet, A. Annealed Flow Transport Monte Carlo. In Proceedings of the 38th International Conference on Machine Learning, pp. 318-330. PMLR, 2021.

Bingham, E., Chen, J. P., Jankowiak, M., Obermeyer, F., Pradhan, N., Karaletsos, T., Singh, R., Szerlip, P. A., Horsfall, P., and Goodman, N. D. Pyro: Deep universal probabilistic programming. Journal of Machine Learning Research, 20:28:1-28:6, 2019.

Bobkov, S. G. Isoperimetric and analytic inequalities for log-concave probability measures. The Annals of Probability, 27(4):1903-1921, 1999.

Bogachev, V. I., Kolesnikov, A. V., and Medvedev, K. V. Triangular transformations of measures. Sbornik: Mathematics, 196(3):309, 2005.

Bovier, A., Klein, M., and Gayrard, V. Metastability in reversible diffusion processes II: Precise asymptotics for small eigenvalues. Journal of the European Mathematical Society, 7:69-99, 2005.

Brenier, Y. Polar factorization and monotone rearrangement of vector-valued functions. Communications on Pure and Applied Mathematics, 44(4):375-417, 1991.

Brown, A. and Jones, G. L. Exact convergence analysis for Metropolis-Hastings independence samplers in Wasserstein distances, 2021.
Cabezas, A. and Nemeth, C. Transport elliptical slice sampling. arXiv:2210.10644, 2022.

Carvalho, C. M., Polson, N. G., and Scott, J. G. Handling sparsity via the horseshoe. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pp. 73-80. PMLR, 2009.

Chandrasekaran, K., Dadush, D., and Vempala, S. Thin partitions: Isoperimetric inequalities and a sampling algorithm for star shaped bodies, pp. 1630-1645. doi: 10.1137/1.9781611973075.133.

Che, T., Zhang, R., Sohl-Dickstein, J., Larochelle, H., Paull, L., Cao, Y., and Bengio, Y. Your GAN is secretly an energy-based model and you should use discriminator driven latent sampling. In Advances in Neural Information Processing Systems, volume 33, pp. 12275-12287, 2020.

Chen, R. T. Q., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, volume 32, 2019.

Chen, Y., Dwivedi, R., Wainwright, M. J., and Yu, B. Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients. Journal of Machine Learning Research, 21(92):1-72, 2020.

Chewi, S., Lu, C., Ahn, K., Cheng, X., Gouic, T. L., and Rigollet, P. Optimal dimension dependence of the Metropolis-adjusted Langevin algorithm. In Proceedings of the Thirty-Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pp. 1260-1300. PMLR, 2021.

Cornish, R., Caterini, A., Deligiannidis, G., and Doucet, A. Relaxing bijectivity constraints with continuously indexed normalising flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2133-2143. PMLR, 2020.

Cousins, B. and Vempala, S. A cubic algorithm for computing Gaussian volume, pp. 1215-1228. doi: 10.1137/1.9781611973402.90.

Craiu, R. V. and Lemieux, C. Acceleration of the multiple-try Metropolis algorithm using antithetic and stratified sampling. Statistics and Computing, 17(2):109-120, 2007.

Del Debbio, L., Rossney, J. M., and Wilson, M. Efficient modelling of trivializing maps for lattice $\phi^4$ theory using normalizing flows: A first look at scalability. Physical Review D, 104(9):094507, 2021.
Del Moral, P., Doucet, A., and Jasra, A. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411-436, 2006.

Ding, X. and Zhang, B. DeepBAR: A fast and exact method for binding free energy computation. The Journal of Physical Chemistry Letters, 12(10):2509-2515, 2021.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv:1605.08803, 2016.

Dua, D. and Graff, C. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. Hybrid Monte Carlo. Physics Letters B, 195(2):216-222, 1987.

Dwivedi, R., Chen, Y., Wainwright, M. J., and Yu, B. Log-concave sampling: Metropolis-Hastings algorithms are fast! In Proceedings of the 31st Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp. 793-797. PMLR, 2018.

Gabrié, M., Rotskoff, G. M., and Vanden-Eijnden, E. Adaptive Monte Carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022.

Gelman, A. and Rubin, D. B. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457-472, 1992.

Girolami, M. and Calderhead, B. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123-214, 2011.

Grumitt, R. D. P., Dai, B., and Seljak, U. Deterministic Langevin Monte Carlo with normalizing flows for Bayesian inference. arXiv:2205.14240, 2022.

Hackett, D. C., Hsieh, C.-C., Albergo, M. S., Boyda, D., Chen, J.-W., Chen, K.-F., Cranmer, K., Kanwar, G., and Shanahan, P. E. Flow-based sampling for multimodal distributions in lattice field theory. arXiv:2107.00734, 2021.

Hoffman, M. D., Sountsov, P., Dillon, J. V., Langmore, I., Tran, D., and Vasudevan, S. NeuTra-lizing bad geometry in Hamiltonian Monte Carlo using neural transport. In 1st Symposium on Advances in Approximate Bayesian Inference, pp. 1-5, 2019.

Invernizzi, M., Krämer, A., Clementi, C., and Noé, F. Skipping the replica exchange ladder with normalizing flows. The Journal of Physical Chemistry Letters, 13(50):11643-11649, 2022.

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. Semi-supervised learning with normalizing flows. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research. PMLR, 2020.
Jerfel, G., Wang, S., Wong-Fannjiang, C., Heller, K. A., Ma, Y., and Jordan, M. I. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. In Uncertainty in Artificial Intelligence, pp. 1819-1829. PMLR, 2021.

Jia, H. and Seljak, U. Normalizing constant estimation with Gaussianized bridge sampling. arXiv:1912.06073, 2019.

Karamanis, M., Beutler, F., Peacock, J. A., Nabergoj, D., and Seljak, U. Accelerating astronomical and cosmological inference with preconditioned Monte Carlo. Monthly Notices of the Royal Astronomical Society, 516(2):1644-1653, 2022.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014.

Kobyzev, I., Prince, S. J., and Brubaker, M. A. Normalizing flows: An introduction and review of current methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(11):3964-3979, 2021.

Laddha, A. and Vempala, S. Convergence of Gibbs sampling: Coordinate hit-and-run mixes fast. arXiv:2009.11338, 2020.

Liu, J. S., Liang, F., and Wong, W. H. The multiple-try method and local optimization in Metropolis sampling. Journal of the American Statistical Association, 95(449):121-134, 2000.

Lovász, L. Hit-and-run mixes fast. Mathematical Programming, 86(3):443-461, 1999.

Lovász, L. and Simonovits, M. Random walks in a convex body and an improved volume algorithm. Random Structures & Algorithms, 4(4):359-412, 1993.

Lovász, L. and Vempala, S. The geometry of logconcave functions and sampling algorithms. Random Structures & Algorithms, 30(3):307-358, 2007.

Lüscher, M. Trivializing maps, the Wilson flow and the HMC algorithm. Communications in Mathematical Physics, 293(3):899, 2009.

Mahmoud, A. H., Masters, M., Lee, S. J., and Lill, M. A. Accurate sampling of macromolecular conformations using adaptive deep learning and coarse grained representation. Journal of Chemical Information and Modeling, 62(7):1602-1617, 2022.

McNaughton, B., Milošević, M. V., Perali, A., and Pilati, S. Boosting Monte Carlo simulations of spin glasses using autoregressive neural networks. Physical Review E, 101:053312, 2020.

Midgley, L. I., Stimper, V., Simm, G. N. C., Schölkopf, B., and Hernández-Lobato, J. M. Flow annealed importance sampling bootstrap. arXiv:2208.01893, 2022.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
Mou, W., Ho, N., Wainwright, M. J., Bartlett, P. L., and Jordan, M. I. Sampling for Bayesian mixture models: MCMC with polynomial-time mixing. arXiv:1912.05153, 2019.

Murray, I., Adams, R., and MacKay, D. Elliptical slice sampling. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 541-548. PMLR, 2010.

Müller, T., McWilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling. ACM Transactions on Graphics, 38(5):1-19, 2019.

Narayanan, H. and Srivastava, P. On the mixing time of coordinate hit-and-run. Combinatorics, Probability and Computing, 31(2):320-332, 2022.

Natarovskii, V., Rudolf, D., and Sprungk, B. Geometric convergence of elliptical slice sampling. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 7969-7978. PMLR, 2021.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Nicoli, K. A., Anders, C., Funcke, L., Hartung, T., Jansen, K., Kessel, P., Nakajima, S., and Stornati, P. Machine learning of thermodynamic observables in the presence of mode collapse. In Proceedings of the 38th International Symposium on Lattice Field Theory, PoS(LATTICE2021), pp. 338, 2022.

Nijkamp, E., Gao, R., Sountsov, P., Vasudevan, S., Pang, B., Zhu, S.-C., and Wu, Y. N. MCMC should mix: Learning energy-based model with neural transport latent space MCMC. In International Conference on Learning Representations, 2021.

Noé, F., Olsson, S., Köhler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457), 2019.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, volume 30, 2017.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1-64, 2021.

Parno, M. D. and Marzouk, Y. M. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645-682, 2018.

Rezende, D. and Mohamed, S. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1530-1538. PMLR, 2015.
Robert, C. P. and Casella, G. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer-Verlag, Berlin, Heidelberg, 2005.

Roberts, G. O. and Rosenthal, J. S. Quantitative non-geometric convergence bounds for independence samplers. Methodology and Computing in Applied Probability, 13(2):391-403, 2011.

Roberts, G. O. and Tweedie, R. L. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341-363, 1996.

Rubinstein, R. Y. and Kroese, D. P. Simulation and the Monte Carlo Method. Wiley Series in Probability and Statistics. John Wiley & Sons, Hoboken, New Jersey, third edition, 2017.

Samsonov, S., Lagutin, E., Gabrié, M., Durmus, A., Naumov, A., and Moulines, E. Local-global MCMC kernels: the best of both worlds. In Advances in Neural Information Processing Systems, 2022.

Stimper, V., Midgley, L. I., Simm, G. N. C., Schölkopf, B., and Hernández-Lobato, J. M. Alanine dipeptide in an implicit solvent at 300K, 2022. URL https://doi.org/10.5281/zenodo.6993124.

Tabak, E. G. and Vanden-Eijnden, E. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217-233, 2010.

Tjelmeland, H. Using all Metropolis-Hastings proposals to estimate mean values. Technical report, 2004.

Tokdar, S. T. and Kass, R. E. Importance sampling: a review. WIREs Computational Statistics, 2(1):54-60, 2010.

Vempala, S. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52(573-612):2, 2005.

Wang, G. Exact convergence analysis of the independent Metropolis-Hastings algorithms, 2022. doi: 10.3150/21-BEJ1409.

Wirnsberger, P., Ballard, A. J., Papamakarios, G., Abercrombie, S., Racanière, S., Pritzel, A., Jimenez Rezende, D., and Blundell, C. Targeted free energy estimation via learned mappings. The Journal of Chemical Physics, 153(14):144112, 2020.

Wong, K. W. K., Gabrié, M., and Foreman-Mackey, D. flowMC: Normalizing-flow enhanced sampling package for probabilistic inference in Jax. arXiv:2211.06397, 2022.

Wu, K., Schmidler, S., and Chen, Y. Minimax mixing time of the Metropolis-adjusted Langevin algorithm for log-concave sampling. Journal of Machine Learning Research, 23(270):1-63, 2022.

Yang, X. and Liu, J. S. Convergence rate of multiple-try Metropolis independent sampler. arXiv:2111.15084, 2021.
Each pair of dotted lines highlights a mode in the latent space, while the horizontal lines show the mode in the data space. (Middle) Kernel density estimate of the push-forward of µ through the flow Tµ,ν. (Right) Kernel density estimate of the push-forward of the base N(0, 1) through a smooth cubic spline approximation of Tµ,ν.

A. Perfect flows between multimodal distributions and unimodal distributions

In (Bogachev et al., 2005), the authors provide a recipe to build a triangular mapping^10 from one distribution to another: apply the inverse CDF method to the first coordinate and then iterate on the conditional distributions of the remaining coordinates. (Bogachev et al., 2005) shows that this bijection is the unique increasing one. We illustrate this method by building a triangular map between a mixture of two Gaussians and a single Gaussian in 1D and 2D.

^10 A function T : R^n -> R^n is triangular if T_i(x) only depends on x_1, . . . , x_i for x in R^n and 1 <= i <= n.

Unidimensional example Consider µ = N(-a, σ²)/2 + N(a, σ²)/2 with a > 0 and σ > 0, and ν = N(0, σ²). We can build a bijective mapping Tµ,ν between the two distributions by taking Tµ,ν = F_ν^{-1} ∘ F_µ, where F_µ and F_ν are the cumulative distribution functions of µ and ν respectively. In our case, we have

F_µ(z) = (1/4) [ 2 + erf((z + a)/(σ√2)) + erf((z - a)/(σ√2)) ] ,
F_ν(z) = (1/2) [ 1 + erf(z/(σ√2)) ] ,
F_ν^{-1}(y) = σ√2 erf^{-1}(2y - 1) ,
Tµ,ν(z) = F_ν^{-1}(F_µ(z)) .

Figure 5 shows Tµ,ν: it takes the mass from the left mode and pushes it towards the middle, and performs the symmetric operation on the other side. Note that at the edges of the modes the curve is very sharp. This sharpness increases as the modes move further apart (i.e., a -> ∞), which makes smooth approximations (like the spline approximation shown here) more and more difficult. Those small errors at the edges of the modes lead to multimodality in the push-forward.

Bidimensional example We take a similar setup in 2 dimensions. We consider µ = N(-a 1_2, σ² I_2)/2 + N(a 1_2, σ² I_2)/2 with a > 0 and σ > 0, and ν = N(0, σ² I_2). Following the protocol of (Bogachev et al., 2005), we define T^(1)_µ,ν as the previous one-dimensional flow,

T^(1)_µ,ν(z) = σ√2 erf^{-1}( (1/2) [ erf((z + a)/(σ√2)) + erf((z - a)/(σ√2)) ] ) ,

because it is the canonical mapping between µ_1 and ν_1, the projections of µ and ν on the first axis. Let µ^{x1} and ν^{x1} denote the conditional distributions of the second coordinate given the first one under µ and ν respectively. We first compute ρ^{x1}_µ, the density of µ^{x1}:

ρ^{x1}_µ(x_2) = ρ_µ(x_1, x_2) / ∫_R ρ_µ(x_1, x_2) dx_2
= [ N(x_1; -a, σ) N(x_2; -a, σ)/2 + N(x_1; a, σ) N(x_2; a, σ)/2 ] / [ (N(x_1; -a, σ) + N(x_1; a, σ))/2 ]
= w_{x1,a,σ} N(x_2; -a, σ) + (1 - w_{x1,a,σ}) N(x_2; a, σ) ,

where w_{x1,a,σ} = N(x_1; -a, σ)/(N(x_1; -a, σ) + N(x_1; a, σ)). So µ^{x1} is a mixture of the two Gaussians N(-a, σ²) and N(a, σ²) with weight w_{x1,a,σ}. On the other hand, ν^{x1} is simply N(0, σ²) because ν is isotropic. Following the same recipe as in the unidimensional example, we can build T^(2)_µ,ν = T_{µ^{x1}, ν^{x1}} to map µ^{x1} onto ν^{x1}. Finally, we define the bidimensional map Tµ,ν(x, y) = (T^(1)_µ,ν(x), T^(2)_µ,ν(y)). Figure 6 shows how samples from µ are transported to ν. Again, we find a very sharp border between the two modes, which would lead to multimodality if not well approximated.

Figure 6. Flow Tµ,ν in 2D - (Left) Samples from µ in the latent space, colored according to the closest mode. (Right) The same samples pushed through the flow Tµ,ν; the color of each sample corresponds to its origin mode in the latent space.
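The 1D construction above can be reproduced in a few lines. The sketch below (Python with NumPy/SciPy) builds the exact increasing map Tµ,ν = F_ν^{-1} ∘ F_µ of this appendix; the values a = 3 and σ = 1 are illustrative choices of ours, not prescribed by the text at this point.

```python
import numpy as np
from scipy.special import erf, erfinv

# Exact increasing transport map T = F_nu^{-1} o F_mu between
# mu = N(-a, s^2)/2 + N(a, s^2)/2 and nu = N(0, s^2), as in App. A.
# a = 3.0 and s = 1.0 are illustrative values, not taken from the paper.

def F_mu(z, a=3.0, s=1.0):
    """CDF of the two-component Gaussian mixture mu."""
    return 0.25 * (2.0 + erf((z + a) / (s * np.sqrt(2.0)))
                       + erf((z - a) / (s * np.sqrt(2.0))))

def F_nu_inv(y, s=1.0):
    """Inverse CDF of nu = N(0, s^2)."""
    return s * np.sqrt(2.0) * erfinv(2.0 * y - 1.0)

def T_mu_nu(z, a=3.0, s=1.0):
    return F_nu_inv(F_mu(z, a, s), s)

z = np.linspace(-6.0, 6.0, 2001)
x = T_mu_nu(z)                 # pushes mixture samples onto Gaussian ones
slope = np.gradient(x, z)      # numerical derivative of the map
print(slope[len(z) // 2])      # slope at z = 0 is ~ exp(-a^2 / (2 s^2))
```

Increasing a makes the printed central slope collapse towards exp(-a²/(2σ²)), so the inverse map steepens accordingly; this is exactly the behaviour exploited in the proof of Proposition 3.1 below.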
Theoretical explanation In (Cornish et al., 2020, Theorem 2.1), the authors show that if the supports of the base and of the target of a flow differ, then for any sequence of homeomorphisms T_{α_n} such that λ^ρ_{T_{α_n}} converges in distribution to π (with the notations of Sec. 2.1), one necessarily has lim_{n -> ∞} BiLip(T_{α_n}) = ∞, where BiLip(f), the bi-Lipschitz constant of f, is defined as the infimum over M in [1, ∞] such that

M^{-1} ‖z - z'‖ <= ‖f(z) - f(z')‖ <= M ‖z - z'‖   for all z, z'.

Because classical deep normalizing flows use 1-Lipschitz activation functions such as ReLU, their bi-Lipschitz constant is necessarily bounded, thus forbidding a perfect approximation of π. As mentioned by (Cornish et al., 2020), this is especially true for ResFlows (Chen et al., 2019), which are built upon Lipschitz transformations. The intuition behind this theorem can be seen in the previous example, as the sharp edges at the border of the modes would require unbounded derivatives.

Proof of Proposition 3.1 Because d = 1, all real functions are triangular. Using (Bogachev et al., 2005, Lemma 2.1), we deduce that the transport map Tµ,ν built in the previous paragraph is the unique increasing flow. We now compute its bi-Lipschitz constant. We have that

dF_µ/dz(z) = (1/(2σ√(2π))) [ exp(-(z + a)²/(2σ²)) + exp(-(z - a)²/(2σ²)) ] ,   (13)
dF_ν^{-1}/dy(y) = σ√(2π) exp( [erf^{-1}(2y - 1)]² ) ,   (14)
dTµ,ν/dz(z) = dF_ν^{-1}/dy(F_µ(z)) · dF_µ/dz(z) .   (15)

Using equations (13)-(15) with z = 0, and since F_µ(0) = 1/2, we have that

dTµ,ν/dz(0) = exp(-a²/(2σ²)) .

Using this result, we can derive a lower bound on BiLip(Tµ,ν):

BiLip(Tµ,ν) >= Lip(T^{-1}_µ,ν) >= [ dTµ,ν/dz(0) ]^{-1} = exp(a²/(2σ²)) .

This shows that lim_{a -> ∞} BiLip(Tµ,ν) = +∞. Proposition 3.1 can be recovered by taking σ = 1.

Figure 7. Toy experiment with a mixture of two uni-dimensional Gaussians - (Left) Target distribution π and proposal Q = N(0, σ² + a²). The distance L is the depth of the gap between the two modes. (Middle) Total variation metric of 256 MCMC chains sampling π with different values of L after 8192 steps. (Right) Number of steps needed to bring the total variation distance of the chain below ϵ = 8 · 10^{-2} for different values of L. Note that the curves eventually plateau because we cannot sample infinitely long MCMC chains.

B. Local samplers can't cross energy barriers

We can illustrate that local samplers cannot cross energy barriers by taking a simple uni-dimensional mixture of Gaussians, as in Fig. 7. Figure 7 shows that, as the energy barrier increases, it gets more and more difficult for a local sampler to sample the mixture. Independent proposal methods are able to overcome this issue despite a poorly chosen proposal; a minimal numerical illustration is given below.
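The experiment behind Fig. 7 can be sketched with a few lines of NumPy. In the snippet below, a random-walk Metropolis chain stands in for the local sampler (the figure does not commit to a specific local kernel in the portion reproduced here), and the independent proposal is Q = N(0, σ² + a²) as in the figure; the values a = 3, σ = 0.5, the step size and the chain length are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
a, s = 3.0, 0.5          # illustrative mode location and width

def log_pi(x):
    # two-component Gaussian mixture N(-a, s^2)/2 + N(a, s^2)/2 (unnormalized)
    return np.logaddexp(-0.5 * ((x + a) / s) ** 2, -0.5 * ((x - a) / s) ** 2)

def log_q(x):
    # independent proposal Q = N(0, s^2 + a^2), as in Fig. 7 (unnormalized)
    return -0.5 * x ** 2 / (s ** 2 + a ** 2)

def rwm(n, step=0.5, x0=-a):
    # local sampler: random-walk Metropolis started in the left mode
    x, out = x0, np.empty(n)
    for i in range(n):
        y = x + step * rng.standard_normal()
        if np.log(rng.random()) < log_pi(y) - log_pi(x):
            x = y
        out[i] = x
    return out

def imh(n, x0=-a):
    # independent Metropolis-Hastings with the broad Gaussian proposal
    x, out = x0, np.empty(n)
    for i in range(n):
        y = np.sqrt(s ** 2 + a ** 2) * rng.standard_normal()
        if np.log(rng.random()) < (log_pi(y) - log_q(y)) - (log_pi(x) - log_q(x)):
            x = y
        out[i] = x
    return out

for name, chain in [("RWM", rwm(5000)), ("IMH", imh(5000))]:
    print(name, "fraction of samples in the right mode:", np.mean(chain > 0.0))
```

With these settings the local chain typically never leaves the mode it was started in, while the IMH chain recovers roughly equal mode weights, mirroring the behaviour reported in Fig. 7.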
Figure 8. Push-backward and push-forward of popular deep normalizing flows targeting the two-moons distribution.

C. Flows typically do not erase potential barriers in the latent space

The toy experiment from App. B illustrated the difficulty of sampling multimodal distributions directly in the data space, but could the flow remove this multimodality in the latent space, in the same way that it removed the bad conditioning before? We trained popular deep normalizing flows on the two-moons multimodal target and inspected the push-forward and push-backward distributions. Figure 8 shows that the flows do not erase energy barriers and can even worsen the conditioning of the modes in the latent space compared to the data space. However, they successfully put all the mass under the support of the base of the flow, which should improve the quality of flow-proposal-based methods. The intuition behind this difficulty is hinted at in App. A.

D. Additional details on synthetic examples

D.1. Computation of sliced distribution metrics

In Section 3, we used two different distances: the Total Variation (TV) distance and the Kolmogorov-Smirnov (KS) distance. They are computed between distributions µ and ν as follows:

D_TV(µ, ν) = sup_{A in F} |µ(A) - ν(A)| ,   (16)
D_KS(µ, ν) = sup_x |F_µ(x) - F_ν(x)| ,   (17)

where F_µ and F_ν are the cumulative distribution functions of µ and ν respectively. We consider sliced versions of these metrics in Sec. 3 because neither can be computed in high dimension: (16) requires the computation of an intractable integral and (17) requires a multidimensional generalisation of the cumulative distribution function, which is also intractable. Moreover, since µ is always the (unknown) distribution of the MCMC samples, computing (16) requires density estimation, which is notoriously hard in high dimension. To solve this problem, we project the samples in 1D before computing the metric. Let (X_i)_{i=1}^N and (Y_i)_{i=1}^M be samples from µ and ν respectively. For every random normal projection P : R^d -> R (i.e., P(x) = p^T x where p ~ N(0, I_d)), we compute X̃_i = P(X_i) and Ỹ_i = P(Y_i) and then:

Sliced Total Variation. We perform density estimation on the projected samples, leading to densities µ̃ and ν̃, and compute D_TV(µ̃, ν̃) = (1/2) ∫ |µ̃(x) - ν̃(x)| dx.

Sliced Kolmogorov-Smirnov. We compute the empirical cumulative distribution functions of the projected samples, F̂_{µ_proj}(x) = (1/N) Σ_{i=1}^N 1_{(-∞, x]}(X̃_i), where x ranges over the union of the supports of µ_proj and ν_proj (and similarly for F̂_{ν_proj}), and compute the supremum of |F̂_{µ_proj} - F̂_{ν_proj}|.

To be accurate, this procedure requires many random projections; in this work, we always use 128 random projections. Moreover, the density estimation needed for the sliced total variation is performed with kernel density estimation (with a Gaussian kernel), where the bandwidth is selected with Scott's method for all sampling algorithms except IMH and IS, which use the Sheather-Jones algorithm.^11 This density estimation task (just like the estimation of the cumulative distribution function) requires M >> N when ν represents the true distribution and µ an approximate one; in practice, we use M = 10N. Finally, we chose the Kolmogorov-Smirnov distance for the experiments involving Neal's funnel because we wanted to highlight the sampling behaviour in the tails of the distribution.

^11 This is because when the flow is very far from the target, those samples tend to stagnate, breaking the Scott or Silverman rules.

To circumvent the need for slicing, a reviewer suggested using the Maximum Mean Discrepancy (MMD). Given a feature map ϕ : R^d -> F and two distributions P and Q, the MMD is defined as MMD²(P, Q) = ‖µ_P - µ_Q‖²_F, where µ_P = E_P[ϕ(X)] is the mean embedding of P. Using the kernel trick, one can find a kernel k : R^d x R^d -> R that satisfies <µ_P, µ_Q>_F = E_{P,Q}[k(X, Y)]. In the following, we empirically compute the MMD between two datasets X = (x_i)_{i=1}^m and Y = (y_j)_{j=1}^m (as feature matrices) with

MMD²(X, Y) = 1/(m(m-1)) Σ_i Σ_{j≠i} k(x_i, x_j) - 2/m² Σ_i Σ_j k(x_i, y_j) + 1/(m(m-1)) Σ_i Σ_{j≠i} k(y_i, y_j) ,

with a Gaussian kernel k(x, y) = exp(-‖x - y‖²/(2σ²)). In practice, we choose the bandwidth σ as the mean of the pairwise distances between the samples of X and Y. A numerical sketch of this estimator is given below.
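The MMD estimator described above takes only a few lines of NumPy. The sketch below follows the formula of this appendix (Gaussian kernel, diagonal-excluded within-sample sums, bandwidth equal to the mean pairwise distance between the samples of X and Y); the function and variable names are ours.

```python
import numpy as np

def mmd2(X, Y):
    """Empirical MMD^2 between samples X (m, d) and Y (m, d), Gaussian kernel."""
    Z = np.concatenate([X, Y], axis=0)
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)  # all squared distances
    m = X.shape[0]
    # bandwidth: mean pairwise distance between samples of X and samples of Y
    sigma = np.sqrt(d2[:m, m:]).mean()
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    Kxx, Kyy, Kxy = K[:m, :m], K[m:, m:], K[:m, m:]
    sum_xx = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))  # exclude the diagonal
    sum_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return sum_xx + sum_yy - 2.0 * Kxy.mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X2 = rng.normal(size=(500, 4))             # second sample from the same distribution
Y = rng.normal(loc=0.5, size=(500, 4))     # shifted distribution
print(mmd2(X, X2), mmd2(X, Y))             # near zero vs. clearly positive
```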
D.2. Experimental details on the interpolated Gaussians

Design of the target π The target π is a unimodal centered Gaussian N(0, Σ) with a badly conditioned covariance matrix Σ. We take Σ = R_{π/4} diag(σ²_1, . . . , σ²_d) R^T_{π/4}, where R_{π/4} is the rotation of angle π/4 in the plane (x_1, x_d) (first and last axes) and the σ_i are logarithmically evenly spaced between 10^{-1} and 10^1.

Design of T_t The imperfect flows are built as a piecewise linear interpolation T_t between the canonical linear transformations associated with different multivariate Gaussians. If √A denotes the Cholesky factor of a matrix A (i.e., √A √A^T = A), then the flow indexed by t in [0, 1] can be expressed as

T_t(z) = ((1 - 2t) √(σ²_1 I_d) + 2t √Σ) z   if t < 1/2 ,
T_t(z) = √Σ z   if t = 1/2 ,
T_t(z) = ((2t - 1) √(σ²_d I_d) + 2(1 - t) √Σ) z   if t > 1/2 ,

T_t^{-1}(x) = [(1 - 2t) √(σ²_1 I_d) + 2t √Σ]^{-1} x   if t < 1/2 ,
T_t^{-1}(x) = √Σ^{-1} x   if t = 1/2 ,
T_t^{-1}(x) = [(2t - 1) √(σ²_d I_d) + 2(1 - t) √Σ]^{-1} x   if t > 1/2 .

The flow T_t is a linear map and its Jacobian determinant is the determinant of the corresponding scaling matrix. This flow is built so that the conditioning number of the push-backward is neutralized when the flow is perfect (t = 1/2) and can be as high as that of the target otherwise (Fig. 9).

Figure 9. Conditioning number of the push-backward of π through T_t as a function of t, compared to the conditioning number of the target.

Sampling details Sampling hyperparameters were chosen by doing a grid search for each algorithm and maximizing the median (over the different flow qualities) of the sliced TV. The number of MCMC steps n was also chosen so that n is twice the number of steps needed to bring the R̂ diagnostic of the fastest-converging algorithm below 1.1. The number of particles used in importance sampling is a quarter of the total number of particles used by i-SIR; this division was chosen to avoid memory problems when using non-sequential importance sampling. All details are available in Table 3. 256 chains were sampled in parallel to compute the metrics and were initialized with samples from the flow. 128 random projections are used to compute the sliced total variation (more details about this metric in App. D.1). The step size of the local steps was chosen to maintain the acceptance rate at 75%.

Table 3. Sampling hyper-parameters for the interpolated Gaussians experiment - n is the number of MCMC steps in the chain, nlocal is the number of interleaved local steps in global/local samplers and N is the number of particles used in importance sampling.

Dimension | n | N (flow-MCMC) | N (neutra-flow-MCMC) | N (neural-IS) | nlocal (flow-MCMC) | nlocal (neutra-flow-MCMC)
16 | 1100 | 80 | 80 | 22000 | 50 | 50
32 | 1200 | 80 | 80 | 24000 | 50 | 50
64 | 1300 | 80 | 80 | 26000 | 50 | 50
128 | 1400 | 80 | 80 | 28000 | 50 | 50
256 | 1500 | 80 | 80 | 30000 | 50 | 50

Maximum Mean Discrepancy distance As suggested by one of the reviewers, we computed the MMD distance (see the end of App. D.1) instead of the sliced TV of Fig. 1 (Left). Fig. 10 shows no apparent difference.

D.3. Experimental details on Neal's funnel

Design of the target π Neal's funnel is a distribution that closely resembles the posterior of hierarchical Bayesian models. It depends on two positive parameters a, b and is defined as follows:

X_1 ~ N(0, a) ,   X_i ~ N(0, e^{b X_1}) for i in {2, . . . , d} .
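For concreteness, here is a minimal sketch of the funnel's unnormalized log-density. It reads N(0, a) as "variance a", which is our interpretation of the notation; it is consistent with the √a scaling of the first coordinate in the flow defined next, and the defaults a = 3, b = 1 match the values a', b' used later in this appendix.

```python
import numpy as np

def log_funnel(x, a=3.0, b=1.0):
    """Unnormalized log-density of Neal's funnel as defined in App. D.3.

    x[0] plays the role of X_1 (variance a) and x[1:] the roles of X_2, ..., X_d,
    each with conditional variance exp(b * x[0]).
    """
    d = x.shape[0]
    log_first = -0.5 * x[0] ** 2 / a
    log_rest = (-0.5 * np.exp(-b * x[0]) * np.sum(x[1:] ** 2)
                - 0.5 * (d - 1) * b * x[0])   # -(1/2) log of each conditional variance
    return log_first + log_rest

# exact samples follow directly from the generative definition
rng = np.random.default_rng(0)
d = 10
z1 = np.sqrt(3.0) * rng.standard_normal()
x = np.concatenate([[z1], np.exp(0.5 * z1) * rng.standard_normal(d - 1)])
print(log_funnel(x))
```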
Figure 10. Replicate of Fig. 1 (Left) on the interpolated Gaussians with the MMD distance.

Figure 11. (Left) Push-backwards λ^π_{T_t^{-1}} and push-forwards λ^ρ_{T_t} as a function of the flow imperfection parameter t. The bottom row compares the ill-conditioned Gaussian target (blue levels) and the flow push-forward (black lines) projected on the smallest and largest variance axes. The top row compares the base distribution N(0, I_d) (pink levels) with the target push-backward (black lines). (Right) Kolmogorov-Smirnov (KS) distances between target and empirical samples depending on the quality of the flow β, using 512 chains of length 600 initialized with the flow Tβ. neural-IS was evaluated with 9750 samples.

Design of the flow Tβ We define an analytical parametric flow T_{a,b,α} as follows:

T_{a,b,α}(z) = ( √(a/α) z_1 , α^{-1/2} exp((b/2) √(a/α) z_1) z_2 , . . . , α^{-1/2} exp((b/2) √(a/α) z_1) z_d ) ,
T^{-1}_{a,b,α}(x) = ( √(α/a) x_1 , √α exp(-(b/2) x_1) x_2 , . . . , √α exp(-(b/2) x_1) x_d ) ,
log det J_{T_{a,b,α}}(z) = log(√a) - (d/2) log(α) + ((d - 1) b/2) √(a/α) z_1 ,
log det J_{T^{-1}_{a,b,α}}(x) = -log(√a) + (d/2) log(α) - ((d - 1) b/2) x_1 .

T_{a,b,α=1} is the natural flow which transports N(0, I_d) to π(a, b). The parameter α was introduced for two purposes: to control the variance of the push-backward and to calibrate the mode of the push-forward. Indeed, it is easy to see that

T^{-1}_{a,b,α}(T_{a,b,1}(z)) = T^{-1}_{a,b,α}( √a z_1, exp((b/2) √a z_1) z_2, . . . ) = ( √α z_1, √α z_2, . . . , √α z_d ) ,

so if X ~ π(a, b), then T^{-1}_{a,b,α}(X) ~ N(0, α I_d). Moreover, we can compute the mode of the push-forward of N(0, I_d) through T_{a,b,α} using the change of variables formula,

log λ^{N(0,I_d)}_{T_{a,b,α}}(x) = log N(T^{-1}_{a,b,α}(x); 0, I_d) + log det J_{T^{-1}_{a,b,α}}(x)
= K - (α/(2a)) x_1² - (α/2) exp(-b x_1) Σ_{i=2}^d x_i² - ((d - 1) b/2) x_1 ,

where K is a constant. Cancelling the gradient of log λ^{N(0,I_d)}_{T_{a,b,α}} leads to

-α exp(-b x_1) x_j = 0 for j ≠ 1 , i.e., x_j = 0 ,
-(α/a) x_1 + (b α/2) exp(-b x_1) Σ_{i=2}^d x_i² - (d - 1) b/2 = 0 ,

so that the mode is located at ( -a b (d - 1)/(2α), 0, . . . , 0 ). Let β > 0, a' = 3 and b' = 1. If a = βa', b = b' and α = β, then the mode of the push-forward of N(0, I_d) through T_{a,b,α} and the mode of π(a', b') coincide for any β > 0, both being located at ( -a' b' (d - 1)/2, 0, . . . , 0 ). We use this principle to create the flow Tβ = T_{βa', b', β}, which is a perfect mapping from N(0, I_d) to π(a', b') when β = 1.

Sampling details The sampling procedure is the same as in Sec. D.2 with the Kolmogorov-Smirnov distance as the target metric; only some hyper-parameters change (see Table 4).

D.4. High acceptance rates of pure flow-MCMC algorithms

Figure 12 shows how global flow-MCMC samplers - IMH and i-SIR - behave depending on the quality of the flow. The middle and right columns show that the acceptance rate of IMH/i-SIR and the participation ratio of IS crash as soon as π and ρ are not aligned perfectly. Note that the acceptance of i-SIR degrades more slowly than that of IMH. Fig. 12 also highlights that flow-MCMC algorithms, as well as neutra-MCMC with ESS, produce samples with very different likelihood levels compared to the other neutra-MCMC methods, which are known to perform better here (see Fig. 1 and Fig. 11 of the main paper).

Table 4. Sampling hyper-parameters for Neal's funnel experiment.
Dimension n N (flow-MCMC) N (neutra-flow-MCMC) nlocal (flow-MCMC) nlocal (neutra-flow-MCMC) 16 500 60 60 7500 10 40 32 550 60 100 8250 10 40 64 600 60 100 9000 10 40 128 650 60 100 9750 10 40 256 700 60 120 10500 10 40 Table 5. Real NVP training hyperparameters for the Gaussian mixture experiment Dimension # Iteration Patience of learning rate scheduler Learning rate Size of hidden layers # Real NVP blocks 16 2500 100 10 2 64 4 32 2640 106 7.36 10 3 76 4 64 2900 120 3.98 10 3 102 4 128 3440 146 1.17 10 3 153 5 256 11250 200 10 4 256 8 D.5. Experimental details for the high dimensional Gaussian mixture Design of π π is a mixture of 4 isotropic Gaussians i.e., π = 1/4 P4 i=1 N(µi, Id) where the µi are defined as µ1 = a (1, 1, 1, . . . , 1, 1, 1), µ2 = a ( 1, 1, 1, . . . , 1, 1, 1), µ2 = µ2 and µ4 = µ1 and a = 0.5919. This specific value of a guarantees that if X N(µi, Id) then j = i, P( X µj < X µi ) 10 10 for any dimension d. Training the flow The flows used here are Real NVPs. The base of the flow is ρ = N(0, σ2Id) where σ2 is the maximum variance of π along each dimension 12. The flow was trained using Adam (Kingma & Ba, 2014) optimizer at a progressive learning rate. All the coupling layers had 3 hidden layers initialized with very small weights ( 10 6). The batch size was 8192 and the decay rate was 0.99. The other hyper-parameters can be found in table 5. An indicator of the flows qualities can be found in Fig. 13. More neutra-MCMC algorithms for the 2D mixture Figure 14 extends Fig. 2 with more neutra-MCMC methods. It shows that changing the reparametrized sampler could allow switching modes in the latent space. For instance, ESS is able to do so. Our conclusion is that using local samplers in this pathological latent space make mixing between modes much longer and less accurate compared to using flow-MCMC methods. The case of mode collapse As recalled in the introduction, normalizing flows trained with the backward KL objective notoriously suffer from mode collapse. Based on this observation, we extended the conclusions of Fig. 2 Left using a flow which suffered from mode collapse. Fig. 15 shows that flow-MCMC methods sometimes reach the uncovered mode while neutra-MCMC never do. One can also think of the following thought experiment: consider a target as a two-component mixture of Gaussian and a Gaussian base distribution. In a mode-collapsed situation, we can imagine that an affine flow sends the base distribution perfectly to one of the modes. All types of NF-enhanced samplers will fail. On the one hand, flow-MCMC will never manage to propose in the other mode. On the other hand, the push-backward of the target sampled by neutra-MCMC will be an affine transformation of the original Gaussian mixture, which a local sampler cannot properly sample. flow-MCMC methods exploit modes If figure 2 shows the exploration capability that flow-MCMC methods have and neutra-MCMC methods lack, it doesn t show how much the mass within the modes is exploited. By cutting the Markov chains depending on the closest mode and computing the mean and covariance of each part, we can compute the forward KL between each mode and the Gaussian approximated in each part. Averaging those metrics leads to Fig. 13 which shows 12Using the total variation formula i {1, . . . 
, d}, σ² = σ²_i = 1 + Σ_{j=1}^4 (1/4) (µ_j)²_i.

that in moderate dimensions the flow is good enough for purely global methods to exploit the modes. Here, we can see the dimension dependence of all methods.

Sampling details The sampling procedure is the same as in Sec. D.2 and targets the mean score between the average KL and the mode-mixing metric. Note that, unlike Sec. D.2, 1024 chains were used. Only some hyper-parameters change; they can be found in Table 6.

Figure 12. (Left) Likelihood levels / (Middle) Global acceptance of pure flow-MCMC methods / (Right) Participation ratio 1/(1 + V[w]) of importance sampling, depending on the quality of the flow - (Top) Interpolated Gaussians flow (Bottom) Neal's funnel flow.

Figure 13. (Left) Global acceptance of the Real NVP flow obtained after training on the mixture of Gaussians in increasing dimension (Right) Averaged forward Kullback-Leibler divergence for the Gaussian mixture.

Figure 14. Extended version of Fig. 2.

D.6. Experimental details on the banana distribution

Design of π The banana distribution depends on two positive parameters a and b and is expressed as follows: X ~ π(a, b) if and only if, for i in {0, . . . , d/2},

X_{2i} = a Z_{2i} ,   X_{2i+1} = Z_{2i+1} + b a² Z²_{2i} - a² b ,

where Z ~ N(0, I_d) and d is even. This leads to a natural bijection T_{a,b} between N(0, I_d) and π(a, b):

T_{a,b} : z -> ( . . . , a z_{2i} , z_{2i+1} + b a² z²_{2i} - a² b , . . . ) ,
T^{-1}_{a,b} : x -> ( . . . , x_{2i}/a , x_{2i+1} - b x²_{2i} + a² b , . . . ) ,
log |det(J_{T_{a,b}}(z))| = (d/2) log(a) ,   log |det(J_{T^{-1}_{a,b}}(x))| = -(d/2) log(a) .

In the following, we take a = 10 and b = 0.02 (see Fig. 16).

Flows, training and sampling The normalizing flows used are Real NVPs (Dinh et al., 2016) with 3-layer deep neural networks initialized with small weights (~10^{-6}). The specific architecture of the neural networks and the training hyperparameters for each problem dimension can be found in Table 7 (the learning rate scaling is 0.98). The sampling algorithms with i-SIR use N = 60 particles and nlocal = 10, and the target acceptance rate of the local steps is 75%. 128 random projections were used to compute the sliced total variation. There are 128 MCMC chains of length 1024 sampled in parallel.

E. Mixing time of IMH for log-concave distributions

E.1. General Theorem

Here we consider Independent Metropolis-Hastings (IMH) with a target π on R^d and a proposal Q with density q with respect to the Lebesgue measure. We denote by P the resulting Markov kernel given, for any x in R^d and Borel set A, by

P(x, A) = ∫_A acc(x, y) Q(dy) + ( 1 - ∫ acc(x, y) Q(dy) ) δ_x(A) .
(18) We define the mixing time of P with respect to an initial distribution µ and a precision target ϵ > 0 as τmix(µ, ϵ) = inf{n N : µP n π TV ϵ} , (19) On Sampling with Approximate Transport Maps Figure 15. An illustration of sampling with a mode-collapsed flow. (Left) Real NVP trained with backward KL torwards a 2d target mixture of 4 Gaussians. (Middle) neutra-MCMC and (Right) flow-MCMC Samplers using the flow on the left. The 128-step MCMC chain is colored according to the closest mode in the data space (bottom row) with corresponding location in the latent space (top row). MALA s step-size was chosen to reach 75% of acceptance. Table 6. Sampling hyper-parameters for the Gaussian mixture experiment. Dimension n N (flow-MCMC) N (neutra-flow-MCMC) nlocal (flow-MCMC) nlocal (neutra-flow-MCMC) 16 700 120 120 21000 5 5 32 900 140 140 31500 5 5 64 1100 160 160 44000 5 5 128 1300 180 180 58500 5 5 256 1500 200 200 75000 5 5 where the total variation distance between two distributions P1 and P2 is defined as P1 P2 TV = sup A B(Rd) |P1(A) P2(1)| . (20) The quantity τmix(µ, ϵ) provides the minimum number of MCMC steps needed to bring the total variation distance between the chain and the target π below a precision ϵ when the chain has initial distribution µ. Now we introduce the s-conductance which will be later use to bound the mixing time. Let s (0, 1/2], we define the s-conductance as ϕs = inf S,π(S) (s,1/2] S P(x, Sc)π(dx) π(S) s , (21) where Sc denotes the complement of S. The s-conductance quantifies the probability of transitioning from a set S charged with a probability s to its complementary. This quantity is of interest since it can be used for bounding the mixing time very useful because of Theorem E.1 and its corollary taken from (Lov asz & Simonovits, 1993). τmix(µ, ϵ) = inf{n N : P n(µ, .) π TV ϵ} , where ϵ > 0 . (22) τmix(µ, ϵ) provides the minimum number of MCMC steps needed to bring the total variation distance between the chain and the target π bellow an accuracy ϵ when the chain started with distribution µ. Theorem E.1 (Corollary 1.5 (Lov asz & Simonovits, 1993)). A Markov chain with π as unique invariant distribution and µ a β-warm start with respect to π (i.e., for any Borel set E, µ(E) βπ(E)) verifies that for all n N, P n(µ, .) π TV βs + β 1 ϕ2 s 2 n βs + β exp n 2 ϕ2 s . (23) In particular, for 0 < ϵ < 1, then τmix(µ, ϵ) 2 Here we follow now a common strategy to derive quantitative mixing times for Metropolis-Hastings algorithms applied to log-concave target by applying Theorem E.1 (Dwivedi et al., 2018; Mou et al., 2019; Chen et al., 2020; Chewi et al., 2021; Wu et al., 2022; Narayanan & Srivastava, 2022). On Sampling with Approximate Transport Maps Figure 16. Banana distribution with different parameters Table 7. Real NVP hyper-parameters for the Banana experiment. Dimension # Iteration Patience of learning rate scheduler Learning rate Size of hidden layers # Real NVP blocks 16 1000 75 10 2 64 3 32 1193 82 8 10 3 74 3 64 1580 96 5 10 3 96 3 128 2354 125 2 10 3 139 5 256 3903 183 3 10 4 226 7 512 7000 300 1 10 5 400 12 We denote by w the weight function w = π/q and w = log w the log-weight function. We establish first conditions on the proposal Q and the log-weight function which allows us to get lower bounds on the conductance for P and therefore its mixing time. Then, we specialize our result to the case Q = N(0, σ2 Id) and π positive and strongly log-concave, see. Assumption E.7. Without loss of generality, we assume that the unique mode of π is 0. Assumption E.2. 
0 is a mode for π, i.e., arg maxx Rd π(x) = 0. π satisfies an isoperimetric inequality with isoperimetric constant ψ(π) i.e., for any partition {S1, S2, S3} of Rd 13, we have π(S3) ψ(π)dist(S1, S2)π(S1)π(S2) , (25) where dist(S1, S2) = inf{ x y : x S1, y S2} The log-weight function is locally Lipschitz, i.e., for all R 0 there exist CR such that for all (x, y) B(0, R)2, | w(x) w(y)| CR x y . (26) Note that if π is log-concave, then (25) holds; see (Bobkov, 1999). Our assumption on Q is the following. Let α (1/2, 1) and s (0, 1/2]. Assumption E.3 (α, s). There exist (0, 1), δ > 0 and R 0 satisfying Z B π(dx)Q(dy) 1 δ , (27) B = (x, y) B(0, R)2 : x y R , where R = log( ) and 1 1 αδ min s 64CR ψ(π)s . (29) Theorem E.4 (Conductance lower bound for IMH). Let 1/2 < α < 1 and 0 < s < 1/2. Assume assumption E.2 and E.3(α, s) hold. Then, it holds 2 , (1 α)(2α 1) 128CR ψ(π) . (30) 13i.e., S1 S2 S3 = Rd but for any i = j, Si Sj = On Sampling with Approximate Transport Maps Proof. Define the set E ,α by E ,α = x B(0, R) : Z 1(x,y) B Q(dy) α . Using (27), we have by definition that π(E ,α) 1 1 1 αδ (31) Indeed, it would not hold, using the law of total probability, we would have 1 δ < [1 δ /(1 α)]1+[δ /(1 α)]α = 1 δ . Let S be a measurable subset of Rd with s π(S) 1/2 and define S1 = {x S : P(x, Sc) 1 α} , S2 = {x Sc : P(x, S) 1 α} , S3 = (S1 S2)c . If π(S1) < π(S)/2 or π(S2) < π(Sc)/2, then we may conclude from the reversibility of P that S P(x, Sc)π(dx) = 1 S P(x, Sc)π(dx) + 1 Sc P(x, S)π(dx) > 1 2 (1 α) = 1 α 4 π(S) . (32) In the following, we assume that π(S1) π(S)/2 or π(S2) π(Sc)/2. Then it follows from the definition of the total variation distance that if x E ,α S1 and y E ,α S2 such that P(x, .) P(y, .) TV 2α 1 . Moreover using the fact that x E ,α S1 and y E ,α S2 and the expression of the Markov kernel, we have that P(x, .) P(y, .) TV = Z |acc(x, z) acc(y, z)| Q(dz) = Z min 1, π(z)q(x) min 1, π(z)q(y) = Z min 1, w(z) min 1, w(z) Z | w(x) w(y)| Q(dz) | w(x) w(y)| CR x y , where we used the Lipschitz behavior of t 7 min(1, et) and property (26) in the last lines. This leads to 2α 1 CR x y and more generally to dist(E ,α S1, E ,α S2) 2α 1 Using the isoperimetric inequality (25), we have that π(Ec ,α S3) = π((E ,α S1)c (E ,α S2)c) CR ψ(π)π(E ,α S1)π(E ,α S2) . Using (29)-(31), we have with the union bound, s (0, 1/2] and the condition π(S1) π(S)/2 or π(S2) π(Sc)/2 On Sampling with Approximate Transport Maps π(S3) + 1 1 αδ π(S3) + π(Ec ,α) CR ψ(π)π(E ,α S1)π(E ,α S2) CR ψ(π)(π(S1) π(Ec ,α))(π(S2) π(Ec ,α)) CR ψ(π) π(S1) 1 1 αδ π(S2) 1 1 αδ CR ψ(π) π(S) CR ψ(π)π(S) Using (29) again, we get 64CR ψ(π)π(S) . Consequently, using that P is reversible, we get Z S P(x, Sc)π(dx) 1 S S3 P(x, Sc)π(dx) + Z Sc S3 P(x, S)π(dx) S S3 (1 α)π(dx) + Z Sc S3 (1 α)π(dx) (1 α)(2α 1) 128CR ψ(π)π(S) , along with (32) and the definition of the s-conductance, we get ϕs min (1 α) 2 , (1 α)(2α 1) 128CR ψ(π) . Corollary E.5 (Mixing time upper bound for IMH). Let 1/2 < α < 1, 0 < ϵ < 1 and µ a β-warm initial distribution with respect to π. Assume assumption E.2 and E.3 hold, then we have the following upper bound on the mixing time τmix(µ, ϵ) 8 (1 α)2 log 2β max 1, 642C2 R ψ(π)2(2α 1)2 Proof. Apply Theorem E.1 taking s = ϵ/(2β). Theorem E.6. Take Q = N(0, σ2Id) where σ > 0 and assume E.2. Let 0 < ϵ < 1 and µ a β-warm distribution with respect to π. 
Assume that there exist R1 > 0 and 0 < c1 < 1/16 such that π(B(0, R1)) 1 c1ϵ On Sampling with Approximate Transport Maps with c2 = 1/16 c1, then τmix(µ, ϵ) 128 log 2β 1, 1282C2 Rϵ,β ψ(π)2 Proof. Let s = ϵ/(2β). (Dwivedi et al., 2018, Lemma 1) states that for any c > 0, there exists Rs such that ϕ(B(0, Rs)) 1 cs where ϕ is a m-strongly log-concave distribution and Applying this result to Q which is 1/σ2-strongly log-concave leads to Q(B(0, R2,s)) 1 c2s where c2 > 0 and Let R = R1 + R2,s and R = max(R1, R2,s) in assumption E.3, this leads to B(0, R1) B(0, R2,s) B so Z B π(dx)Q(dy) Z B(0,R1) B(0,R2,s) π(dx)Q(dy) (1 c1s)(1 c2s) . Because c1 > 0 and c2 > 0, we have that R B π(dx)Q(dy) 1 (c1 + c2)s = 1 δ thus checking condition (27) of assumption E.3. Checking (29) of assumption E.3 amounts to check 1 1 αδ min 1 4 s δ (1 α)(2α 1) ( (c1 + c2) 1 α 4 CR (1 α)(2α 1) 64(c1+c2) ψ(π) Maximising (1 α)(2α 1) with α = 3/4 leads to ( (c1 + c2) 1 16 CR 1 512(c1+c2)ψ(π) , and taking c2 = 1/16 c1, we get CR ψ(π)/32. Assumption E.7. The target π is positive and log π is m-strongly convex on Rd: for any x, y Rd and t [0, 1], log π(tx + (1 t)y) t log π(x) (1 t) log π(y) mt(1 t) Corollary E.8. Let π be a m-strongly log-concave distribution (Assumption E.7) with a mode at 0 and Q = N(0, σ2Id) with σ > 0. Let 0 < ϵ < 1 and µ a β-warm distribution with respect to π. Assume that the log-weight function checks the local Lipschitz condition (26) from assumption E.2 then provided that CRϵ,β log 2 m 272, β, d , On Sampling with Approximate Transport Maps r(ϵ, β, d) = 2 τmix(µ, ϵ) 128 log 2β 1, 1282C2 Rϵ,β log(2)2m Proof. We use the fact that π is a m-strongly log-concave distribution to deduce that ψ(π) = log 2 m using (Cousins & Vempala, Theorem 4.4). Moreover, as the mode of π is 0, we apply (Dwivedi et al., 2018, Lemma 1) again to obtain R1 and c1 for Theorem E.6 with c1 = 1/17 < 1/16. Proposition 4.3 in the main text corresponds to the asymptotic version of corollary E.8. Illustration with a Gaussian target Take π = N(0, Id) and Q = N(0, (1 + λ)2Id) with λ > 0. λ is the error term of the proposal. π is m-strongly log-concave with m = 1. The importance weight function w can be computed w(x) = π(x) q(x) = σd/2 exp 1 2x T Id 1 (1 + λ)2 Id so the log-weight function is 2x T Id 1 (1 + λ)2 Id Let R > 0 and (x, y) B(0, R)2, | w(x) w(y)| = 1 1 1 (1 + λ)2 k=1 (x2 k y2 k) 1 1 (1 + λ)2 R 1 1 (1 + λ)2 k=1 |xk yk| R 1 1 (1 + λ)2 where in the last line we used the Cauchy-Schwartz inequality Pd k=1 |xk| = |x| , 1 | |x| , 1 | x d. This gives us CR = R d 1 1/(λ + 1)2 . Using corollary (E.8), we find that asymptotically Rϵ,β d 2 d(λ + 1) which leads to CRϵ,β d 2d(λ + 1) 1 1 (1 + λ)2 While checking the hypothesis of corollary (E.8), the error of Q need to scale as an inverse law of the dimension d K 2d + 1 1 d 1 d where K = log 2 in order to keep the mixing time bellow a constant despite the increase in dimension. On Sampling with Approximate Transport Maps Illustration with a non-isotropic Gaussian target Take π = N(0, Σ) with Σ = diag(c2 1, . . . , c2 d) with ci > 0 for all i {1, . . . , d} such m c1 . . . L with m, L > 0. Moreover, we set Q = N(0, σ2Id) with σ > 0. For sake of simplicity, we assume that L = m 1 and m < 1. Note that m is the strong log-concave constant of π. With the same reasoning as the previous example, we have that w(x) = π(x) q(x) = σd/2 exp 1 Let R > 0 and (x, y) B(0, R)2, | w(x) w(y)| = 1 (x2 k y2 k) R max i {1,...,d} k=1 |xk yk| where cσ = maxi {1,...,d} 1 c2 i 1 σ2 . 
This leads to Using, corollary E.8 we find that asymptotically CR d 2d max σ, 1 m We now choose σ to minimize either the forward KL or the backward KL between N(0, Σ) and N(0, σ2Id). DKL(N(0, Σ), N(0, σ2Id)) = 1 i=1 c2 i d + d log σ2 i=1 log c2 i d DKL(N(0, Σ), N(0, σ2Id)) dσ2 (σ2) = 1 2σ2 DKL(N(0, σ2Id), N(0, Σ)) = 1 i=1 c2 i d log σ2 ! d DKL(N(0, Id), N(0, Σ)) dσ2 (σ2) = 1 leading to σ2 f = Pd i=1 c2 i /d being the minimizer of the forward KL and σ2 b = d/ Pd i=1(1/c2 i ) being the minimizer of the backward KL. As mentionned in Sec. 3, σ2 f corresponds to an over-spread Gaussian while σ2 b corresponds to an overconcentrated Gaussian. Using that m c2 i L for all i {1, . . . , d}, we can compute max (σf, 1/ m) = 1/ m and max (σb, 1/ m) = 1/ m. Using the same property, we have that cσf 1/m m and cσb 1/m m. Using the upper bound for the mixing time from corollary E.8 and ignoring all the constants, we get an upper bound in O d2(1 m)2/m4 for both σf and σb which is much more dimension sensitive than MALA s bound O( d/m2) (Dwivedi et al., 2018). F. Additional details on real world examples F.1. Computational considerations neutra-MCMC methods should have a much higher computational cost compared to other methods. Figure 17 shows that in two of our real world experiments, neutra-MCMC methods were significantly more expensive than flow-MCMC methods. On Sampling with Approximate Transport Maps Alanine Dipeptide Sparse Logisitic Regression Speedup factor over neutra-MCMC (per iteration) flow-MCMC (i-SIR) flow-MCMC (IMH) local MCMC (MALA/HMC) neutra-MCMC (TESS) neutra-MCMC (MALA) reference Figure 17. Speedup factor compared to neutra-MCMC on real world experiments However, in the molecular system experiment (Sec. 6.1) flow-MCMC methods were more costly : that s because the distribution of this molecular system is very expensive to evaluate and breaks the parallel capabilities of flow-MCMC methods such as i-SIR. Note that using ESS always led to high computational costs. F.2. Experimental details on logistic regression Logistic regression Consider a training dataset D = {(xk, yk)}M j=1 where xk Rd and yk { 1, 1}. In our case D is the german credit dataset (M = 750) where each xk represent 24 details about a person who takes a credit (age, sex, job, ... ) and yk is 1 if this person is considered having bad credits risks and 1 otherwise. We standardised this dataset so that each component of xk is between -1 and 1 and also added a constant 1 component. We consider that the likelihood of a pair (x, y) is given by p(y|x, w) = Bernoulli(y; σ(x T w)) where w Rd is a weight vector and σ is the sigmoid function. Given a prior distribution p(w), we sample the posterior distribution p(w|D) and compute the posterior predictive distribution p(y|x, D) = R p(y|x, w, D)p(w|D)dw 1/n Pn i=1 p(y|x, wi, D)p(wi|D) for (x, y) Dtest. Here, we consider a sparse prior suggested by (Carvalho et al., 2009) which can be written as follows, w = τβ λ and p(w) = p(τ, β, λ) = Gamma(τ; α = 0.5, β = 0.5) i=1 Gamma(λi; α = 0.5, β = 0.5) Normal(βi; 0, 1) . We log-transform the Gamma distributions by replacing Gamma with Gammalog to ensure the positivity of the global scale τ and the local scale λ. Gammalog(y; α, β) = Gamma(exp y; α, β) exp y . 
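To make the posterior concrete, the sketch below evaluates the unnormalized log-posterior in the log-transformed parameterization described above, on synthetic data. The rate parameterization of the Gamma prior, the stacking order of the parameters and all helper names are our assumptions; they are not taken from the paper's code.

```python
import numpy as np
from scipy.special import gammaln

def log_gamma_log(t, alpha=0.5, beta=0.5):
    """log-density of a log-transformed Gamma variable: if T = exp(t),
    T ~ Gamma(alpha, rate=beta); the exp(t) Jacobian term is included."""
    return alpha * np.log(beta) - gammaln(alpha) + alpha * t - beta * np.exp(t)

def log_posterior(params, X, y):
    """Unnormalized log-posterior of the sparse logistic regression of App. F.2.

    params stacks (log tau, log lambda (d,), beta (d,)); X has shape (M, d)
    and y takes values in {-1, +1}.
    """
    d = X.shape[1]
    t_tau, t_lam, beta = params[0], params[1:1 + d], params[1 + d:]
    w = np.exp(t_tau) * beta * np.exp(t_lam)          # w = tau * beta .* lambda
    # Bernoulli likelihood with sigmoid link: log sigma(y_k x_k^T w)
    log_lik = -np.sum(np.logaddexp(0.0, -y * (X @ w)))
    log_prior = (log_gamma_log(t_tau)
                 + np.sum(log_gamma_log(t_lam))
                 - 0.5 * np.sum(beta ** 2))           # standard normal prior on beta
    return log_lik + log_prior

# toy usage with synthetic data (the paper uses the German credit dataset)
rng = np.random.default_rng(0)
M, d = 100, 5
X = rng.normal(size=(M, d))
y = np.where(rng.random(M) < 0.5, -1.0, 1.0)
params = rng.normal(size=1 + 2 * d)
print(log_posterior(params, X, y))
```

An HMC or NF-enhanced sampler would then target this density over the (1 + 2d)-dimensional unconstrained vector (log τ, log λ, β).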
Flow and training The normalizing flow at stake is an Inverse Autoregressive Flow (IAF) (Papamakarios et al., 2017) trained with the procedure described in (Hoffman et al., 2019) on a train dataset : it is a 3 layers deep flow using a residual neural networks wide of 51 neurons and deep of 2 layers with elu activation function. The flow was trained by optimizing the backward Kullback-Leiber with a learning rate of 10 2 (scaled by 10% every 1000 optimization steps) using Adam optimizer for 5000 steps with a batch size of size 4096. Sampling details We used Pyro s (Bingham et al., 2019) implementation of Hamilotonian Monte Carlo (HMC) to sample the previously described model. The warmup phase 14 lasted 64 steps and the MCMC chains were 256 steps long - this number was chosen as it guarantees a ˆR close to 1. We automatically adapted both the step size and the mass matrix while the length of the trajectory (for the Leapfrog integrator) was frozen to 8 . Samplers involving i-SIR used N = 100 particles and importance sampling leveraged N 256 = 25600 particles. The global/local samplers interleaved 20 local steps between global steps. 14In the following, warmup samples will be always discarded On Sampling with Approximate Transport Maps Figure 18. Visualization of alanine dipeptide and the dihedral angles ϕ and ψ for the Ramachandran plot (taken from (Midgley et al., 2022)) F.3. Experimental details on alanine dipeptide The alanine dipeptide experiment aimed at sampling from the Boltzmann distribution on the 3D coordinates of the 22 atoms alanine dipeptide molecule (shown in Fig. 18). The details of the distribution can be found in (Midgley et al., 2022), we used their neural spline normalizing flow trained using FAB as the common flow in all flow-based samplers. The ground truth was obtained through parallel tempering MD simulations from the same paper. We used 512 chains of length 600 using the same hyperparameters as the mixture of Gaussians : N = 128, nlocal = 5 and a target acceptance of 75%. F.4. Experimental details on ϕ4 Target distribution Following (Gabri e et al., 2022) we consider a 1d ϕ4 on a grid of d points with Dirichlet boundary conditions at both extremeties of the field; that is ϕ0 = ϕd+1 = 0 where one configuration is the d-dimensional vector (ϕi)d i=1. The negative logarithm of the density is given by ln π(ϕ) = β i=1 (ϕi ϕi 1)2 + 1 4ad i=1 (1 ϕ2 i )2 ! with a flow a colored normal distribution as base built by keeping the quadratic terms from above: ln ρ(ϕ) = β i=1 (ϕi ϕi 1)2 + 1 2ad also with Dirichlet boundary conditions at 0. We chose parameter values for which the system is bimodal: a = 0.1 and inverse temperature β = 20 (see Fig. 4 in the main paper). Training of the normalizing flow The flows at stake are again Real NVPs and their hyper-parameters are given in table 8. We used Adam optimizer to minimize the forward KL. This forward KL was approximated by taken exactly 50% of samples in each mode by running MALA locally. The size of those chains is batch size # of parallel MCMC chains. Sampling parameters The sampling parameters were selected using a grid-search which optimized the quality of the mode weight estimation. The starting sample of each chain was taken to be in a single mode 15. We used 256 MCMC chains of length 512 where the first half of the chain was discarded for warmup purposes. The sampling hyperparameters are detailed in table 9. Table 8. Real NVP hyperparameters for the ϕ4 experiment. 
Dimension Depth of the flow # of layers in NN # of neuron per layer # of training steps Learning rate Learning rate decay factor Batch size # MCMC chains 64 5 3 128 104 10 2 0.98 4096 64 128 6 256 3 2 104 10 2 0.98 4096 64 256 10 512 4 3.5 104 5 10 3 0.98 8192 64 512 15 512 5 6 104 5 10 4 0.99 8192 32 15This is achieved by running a gradient descent starting from a constant unit configuration On Sampling with Approximate Transport Maps Figure 19. MCMC chains sampled from a SN-GAN as an energy-based model Table 9. Sampling hyperparameters for the ϕ4 experiment. Dimension n N (flow-MCMC) N (neutra-flow-MCMC) nlocal (flow-MCMC) nlocal (neutra-flow-MCMC) 64 256 ( 2) 60 60 3840 25 25 128 256 ( 2) 80 80 5120 25 25 256 256 ( 2) 100 100 6400 25 25 512 256 ( 2) 120 120 7680 25 25 G. Sampling image distributions Design of the target distribution We trained a SN-GAN (Miyato et al., 2018) on the CIFAR10 dataset for 100.000 epochs (implementation from https://github.com/kwotsin/mimicry). The latent space of the SNGAN is of dimension 128. We use a multivariate centered standard Gaussian as a base distribution. Following (Che et al., 2020), we define a probability measure as an energy-based model on the latent space of the GAN p(z) = exp( EGAN(z))/Zθ with EGAN(z) = log p0(z) logit(D(G(z))), where p0 is the base distribution, G is the generator, D is the discriminator and logit(y) = log(y/(1 y)) is the inverse sigmoid function. Slices of EGAN can be found on Fig. 20 : p(z) is multimodal and has many bad geometries. The samples of p(z) can be transformed in images (which belong to R3 32 32) by pushing them through G. Sampling procedure We sampled p(z) using 4 different MCMC algorithms. Fig. 19 depicts 8 chains for each sampler started in the same state. The flow based samplers use a Real NVP normalizing flow which has 8 layers each one having a MLP wide of 128 neurons and deep of 3 layers. This normalizing flow was trained using the backward KL loss for 512 On Sampling with Approximate Transport Maps Figure 20. Random 2D slices of EGAN. The purple colors reperensent the level lines of EGAN(z) and the grey lines are the level lines of z 7 log p0(z) iterations using batches of size 1024. The chains are 128 samples long (on Fig. 19, the chains are subsampled every 8 steps) and have the same hyper-parameters as in the four Gaussians experiment in dimension 128. Fig. 19 shows that flow-MCMC algorithm mixes between the modes of this moderately-high dimensional distribution (d = 128) more often than its neutra-MCMC counterparts. This observation is consistent with the results of Sec. 3 and Sec. 6. H. Amount of computation and type of resources We used a single type of GPU, the Nvidia A100. Each experiment of Section 3 took about 2 hours to run when distributed on 4 GPUs. Moreover, the training of the 5 Real NVPs for Section 3.2 took 6 hours on 4 GPUs. Regarding the real world experiments (see Section 6), training the flow for the molecular experiment took 24 hours on a single GPU and 6 hours to perform the sampling part on 4 GPUs. Training the flows for the ϕ4 experiments took 12 hours on 2 GPUs and 1 hours on 4 GPUs for sampling. Finally, training the flow for the bayesian logistic regression took less than a minute but sampling with HMC lasted 6 hours on 4 GPUs. In total, this work required 55 hours of parallel GPU run-time.