# nets_a_nonequilibrium_transport_sampler__da235785.pdf

NETS: A Non-Equilibrium Transport Sampler

Michael S. Albergo 1 2 Eric Vanden-Eijnden 3 4

We introduce the Non-Equilibrium Transport Sampler (NETS), an algorithm for sampling from unnormalized probability distributions. NETS builds on non-equilibrium sampling strategies that transport a simple base distribution into the target distribution in finite time, as pioneered in Neal s annealed importance sampling (AIS). In the continuous-time setting, this transport is accomplished by evolving walkers using Langevin dynamics with a time-dependent potential, while simultaneously evolving importance weights to debias their solutions following Jarzynski s equality. The key innovation of NETS is to add to the dynamics a learned drift term that offsets the need for these corrective weights by minimizing their variance through an objective that can be estimated without backpropagation and provably bounds the Kullback-Leibler divergence between the estimated and target distributions. NETS provides unbiased samples and features a tunable diffusion coefficient that can be adjusted after training to maximize the effective sample size. In experiments on standard benchmarks, highdimensional Gaussian mixtures, and statistical lattice field theory models, NETS shows compelling performances.

1. Introduction

The aim of this paper is to sample probability distributions on Rd known only up to normalization. This problem, central to applications from statistical physics (Faulkner & Livingstone, 2023; Wilson, 1974; H enin et al., 2022)

1Society of Fellows, Harvard University, Cambridge, MA, USA 2NSF AI Institute for Artificial Intelligence and Fundamental Interactions, Cambridge, MA, USA 3Machine Learning Lab, Capital Fund Management, Paris, France 4Courant Institute of Mathematical Sciences, New York, NY, USA. Correspondence to: Michael S. Albergo <malbergo@fas.harvard.edu>, Eric Vanden-Eijnden <eve2@cims.nyu.edu>.

Proceedings of the 42 nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

to Bayesian inference (Neal, 1993), becomes particularly challenging for non-log-concave targets. In such cases, traditional ergodic sampling methods based on Markov Chain Monte Carlo (MCMC) or Langevin dynamics often exhibit prohibitively slow convergence rates.

An alternative approach offered by non-equilibrium sampling strategies is to transport samples on non-stationary distributions that evolve from a simple base distribution (e.g., a normal distribution) to the target in finite time. Methods like Annealed Importance Sampling (AIS) (Neal, 2001), Sequential Monte Carlo (SMC) (Del Moral, 1997; Doucet et al., 2001), and their continuous-time variants based on Jarzynski s equality (Hartmann et al., 2018) achieve this transport through a dynamical quench, using importance weights to correct for the bias arising when the walkers distribution lags behind their evolving target. However, these methods can fail when the lag is too strong, leading to high-variance weights.

We address this limitation by modifying the dynamics via some additional transport learned to guide toward the target. In the context of quenches performed via Langevin dynamics on an evolving potential, we show that an optimal drift can be added to eliminate the need for importance weights entirely, and can be characterized as the minimizer of objective functions amenable to empirical estimation. This leads to an unbiased sampling strategy where importance weights, while still available for exact correction, have substantially reduced variance due to the improved transport.

In sum, our work makes the following main contributions:

We introduce the Non-Equilibrium Transport Sampler (NETS), which augments Langevin dynamics on an evolving potential with learned transport. We show that the method remains unbiased through a generalization of Jarzynski s equality, and that it can be used in concert with sequential Monte-Carlo (SMC) strategies.

We demonstrate that the optimal drift for this additional transport minimizes an objective function, based on physics-informed neural networks (PINN), whose gradient can be estimated without backpropagating through the sampling equations.

We show that the PINN objective has two key properties:

NETS: A Non-Equilibrium Transport Sampler

it is off-policy (requiring no samples from the target density) and it bounds the KL-divergence between the sampler and target distributions.

We discuss other possible objectives to make connections with Action Matching (AM) as well as Controlled Monte Carlo Diffusions (CMCD).

We establish that NETS s performance can be optimized post-training by tuning both the integration time-step and diffusion coefficient, which we demonstrate through high-dimensional numerical experiments.

1.1. Related work

Dynamical Measure Transport. Modern generative models for continuous data often employ dynamical measure transport, where samples from a base density are transformed into samples from a target density by solving ordinary or stochastic differential equations (ODE/SDE) with learnable drift coefficients. This approach, pioneered in (Chen et al., 2018; Grathwohl et al., 2019), has developed in several directions, notably with score-based diffusion models (Ho et al., 2020; Song et al., 2020) that frame drift estimation as quadratic regression. This formulation has led to more general frameworks (Albergo & Vanden-Eijnden, 2022; Lipman et al., 2022; Albergo et al., 2023; Liu et al., 2022; De Bortoli et al., 2021; Neklyudov et al., 2023). A key requirement of these methods is access to samples from both base and target distributions. Our work shows that similar models building on dynamical transport can be learned without prior access to data from the target.

Augmenting sampling with learning. Augmenting MCMC and importance sampling proceures with transport has been an active area of research for the past decade. Early work makes use of the independence Metropolis algorithm (Hastings, 1970; Liu, 1996), in which proposals come from a transport map (Parno & Marzouk, 2018; No e et al., 2019; Albergo et al., 2019; Gabri e et al., 2022) that are accepted or rejected based off their likelihood ratio with the target. These methods were further improved by combining them with AIS and SMC perspectives, learning incremental maps that connect a sequence of interpolating densities between the base and target (Arbel et al., 2021; Matthews et al., 2022; Midgley et al., 2023). Similar works in the highenergy physics community posit that interleaving stochastic updates within a sequence of maps can be seen as a form of non-equilibrium sampling (Caselle et al., 2022; Bonanno et al., 2024).

Following the success of generative models based on dynamical transport, several approaches have emerged to apply these ideas to sampling. Some translate concepts from diffusion models to minimize the KL divergence between model and target (Vargas et al., 2023; Berner et al., 2024),

while others reformulate sampling as a stochastic optimal control (SOC) problem (Zhang & Chen, 2021; Behjoo & Chertkov, 2024; Hua et al., 2024). However, these methods face a limitation: learning the optimal drift is a complicated optimization problem which becomes computationally prohibitive in high dimensions, and simplified approaches such as Akhound-Sadegh et al. (2024) introduce bias into their objective function. Alternative perspectives include using denoising oracles to simplify the sampling problem (Bruna & Han, 2024) and adapting graph-based distribution modeling techniques for sampling, including off-policy training (Malkin et al., 2023; Sendera et al., 2024).

Vargas et al. (2024) establish Controlled Monte Carlo Diffusions (CMCD), another unbiased sampler that, like ours, augments the dynamics with learned transport. The methods differ in how this drift is learned: CMCD derives a gradientform objective through path integrals and Girsanov s theorem, requiring either backpropagation through the SDE or computation with a numerically unstable reference measure on a fixed grid. Our approach, based on Fokker-Planck equation manipulations, yields new objectives for the additional drift that avoid backpropagation entirely. Moreover, our optimize-then-discretize framework allows for post-training adaptation of both step size and time-dependent diffusion, providing tunable parameters to enhance performance. One of our objectives is a Physics-Informed Neural Network (PINN) loss, which has appeared elsewhere for sampling (M at e & Fleuret, 2023; Tian et al., 2024; Fan et al., 2024). We establish two key results: this objective is valid for annealed Langevin dynamics, and it directly controls both the KL divergence and the importance weights arising from Jarzynski s equality.

2.1. Setup and Notations

We assume that the target distribution is absolutely continuous with respect to the Lebesgue measure on Rd, with probability density function (PDF) ρ1(x) = Z 1 1 e U1(x): here x Rd, U : Rd R is a known energy potential, assumed twice differentiable and bounded below, and Z1 = R

Rd e U1(x)dx < is an unknown normalization constant, referred to as the partition function in physics and the evidence in statistics. Our aim is to generate samples from ρ1(x) so as to be able to estimate expectations with respect to this density. Additionally we wish to estimate Z1.

To this end, we introduce a family of time-dependent potentials Ut(x) that interpolate between a simple initial potential U0(x) (e.g., U0(x) = 1

2|x|2) at t = 0 and the target potential U1(x) at t = 1. While linear interpolation, Ut(x) = (1 t)U0(x)+t U1(x), provides a straightforward choice, more sophisticated interpolation schemes are often

NETS: A Non-Equilibrium Transport Sampler

preferable. Our only requirements are that Ut=0 = U0, Ut=1 = U1, and Ut(x) is twice differentiable in (t, x) [0, 1] Rd. We assume that the time-dependent PDF associated with this potential Ut(x) is normalizable for all t [0, 1] and denote it as

ρt(x) = Z 1 t e Ut(x), Zt = Z

Rd e Ut(x)dx < , (1)

so that ρt=0(x) = ρ0(x) and ρt=1(x) = ρ1(x); we also assume that ρ0(x) is simple to sample (either directly or via MCMC or Langevin dynamics) and that its partition function Z0 is known. To simplify the notations we introduce the free energy Ft = log Zt, (2)

and note the useful identity

Rd t Ut(x)ρt(x)dx. (3)

since t log R

Rd e Ut = R

Rd t Ute Ut/ R

2.2. Non-equilibrium Sampling with Importance Weights

Annealed importance sampling (AIS) uses a finite sequence of MCMC moves that satisfy detailed-balance locally in time but not globally, thereby introducing a bias that can be corrected with weights. Here we use a time-continuous variant of AIS based on Jarzynski equality.

By definition of the PDF in (1), ρt(x) = Ut(x)ρt(x) and hence, for any ϵt 0, we have

0 = ϵt ( Utρt + ρt). (4)

Since we also have

tρt = ( t Ut t Ft)ρt, (5)

we can combine these last two equations to deduce that

tρt = ϵt ( Utρt + ρt) ( t Ut t Ft)ρt. (6)

The effect of the last term at the right hand-side of this equation can be accounted for by using weights. To see how, extend the phase space to (x, a) Rd+1 and introduce the PDF ft(x, a) solution to the Fokker-Planck equation (FPE)

tft = ϵt ( Utft + ft) + t Ut aft, (7)

with initial condition ft=0(x, a) = δ(a)ρ0(x). A direct calculation using (3) (for details see Appendix A) shows that

R eaft(x, a)da R

Rd+1 eaft(y, a)dady . (8)

Therefore we can use the solution to the SDE associated with the FPE (7) in the extended space to estimate expectations with respect to ρt(x):

Proposition 2.1 (Jarzynski equality). Let (Xt, At) solve the coupled system of SDE/ODE

d Xt = ϵt Ut(Xt)dt +

2ϵtd Wt, X0 ρ0, (9)

d At = t Ut(Xt)dt, A0 = 0, (10)

where ϵt 0 is a time-dependent diffusion coefficient and Wt Rd is the Wiener process. Then for all t [0, 1] and any test function h : Rd R, we have Z

Rd h(x)ρt(x)dx = E[e Ath(Xt)]

Zt/Z0 = e Ft+F0 = E[e At] (11)

where the expectations are taken over the law of (Xt, At).

The proof of this proposition is given in Appendix A and it relies on the identity R

Rd h(x)ρt(x) = R

R+1d eah(x)ft(x, a)dadx/ R

Rd+1 ft(x, a)dadx which follows from (8). The second equation in (11) for the free energy Ft is what is referred to as Jarzynski s equality, and was originally surmised in the context of non-equilibrium thermodynamics (Jarzynski, 1997).

Remark 2.2. We stress that it is key to use the weights e At in (11) because ρt(x) is not the PDF of Xt in general. Indeed, if we denote by ρt(x) the PDF of Xt, it satisfies

t ρt = ϵt ( Ut ρt + ρt), ρt=0 = ρ. (12)

This FPE misses the term ( t Ut t Ft)ρt at the right hand-side of (6), and as a result ρt(x) = ρt(x) in general intuitively, ρt(x) lags behind ρt(x) when the potential Ut(x) evolves and this lag is what the weights in (11) correct for.

It is important to realize that, while (11) is exact, estimators based on this relation can be high variance if the lag between the PDF ρt(x) of Xt and ρt(x) is too pronounced. This issue can be alleviated by using resampling methods as is done in SMC (Doucet et al., 2001). Here we will address it by adding some additional drift in (9) that will reduce the lag and as a result lower the variance of the weights.

2.3. Non-equilibrium Sampling with Perfect Transport

We can add a transport term to eliminate the need of the weights. To see how, let bt(x) Rd be a velocity field which at all times t [0, 1] satisfies

(btρt) = tρt. (13)

We stress that this is an equation for bt(x) in which ρt(x) is prescribed and given by (1): In Appendix C we show how to express the solution to (13) via Feynman-Kac formula. If (13) is satisfied, then we can combine this equation with (4) and (5) to arrive at

tρt = ϵt ( Utρt + ρt) (btρt), (14)

NETS: A Non-Equilibrium Transport Sampler

which is a standard FPE. Therefore the solution to the SDE associated with (14) allows us to sample ρt(x) directly (without weights). We phrase this result as:

Proposition 2.3 (Sampling with perfect additional transport.). Let bt(x) be a solution to (13) and let Xb t satisfy the SDE

d Xb t = ϵt Ut(Xb t )dt + bt(Xb t )dt +

2ϵtd Wt, (15)

with Xb 0 ρ0 and where ϵt 0 is a time-dependent diffusion coefficient. Then ρt(x) is the PDF of Xb T , i.e. for all t [0, 1] and, given any test function h : Rd R, we have Z

Rd h(x)ρt(x)dx = E[h(Xb t )], (16)

where the expectation at the right-hand side is taken over the law of (Xb t ).

This proposition is proven in Appendix A and it shows that we can in principle get rid of the weights altogether by adding the drift bt(x) in the Langevin SDE. This possibility was first noted in Vaikuntanathan & Jarzynski (2008) and is also exploited in Tian et al. (2024) for deterministic dynamics (i.e. setting ϵt = 0 in (15)) and in Vargas et al. (2024) using the SDE (15). Of course, in practice we need to estimate bt(x), and also correct for sampling errors if this drift is imperfectly learned. Let us discuss this second question first, and defer the derivation of objectives to learn bt(x) to Secs. 2.5 and E. In Appendix C we show how to express the solution to (13) via Feynman-Kac formula.

2.4. Non-Equilibrium Transport Sampler

Let us now combine the approaches discussed in Secs. 2.2 and 2.3 to design samplers in which we use an added transport, possibly imperfect, and importance weights.

To this end, suppose that we wish to add an additional transport term (ˆbtρt) in (6), where ˆbt(x) Rd is some given velocity that does not necessarily solve (13). Using the expression in (1) for ρt(x), we have the identity

(ˆbtρt) = ˆbtρt + Ut ˆbtρt (17)

Therefore we can rewrite (4) equivalently as

tρt = ϵt ( Utρt + ρt) (ˆbtρt)

+ ( ˆbt Ut ˆbt t Ut + t Ft)ρt (18)

We can now proceed as we did with (4) and extend state space to account for the effect of the terms ( ˆbt Ut ˆbt t Ut + t Ft)ρt in this equation via weights, while having the term (ˆbt(x)ρt(x)) contribute to some additional transport. This leads us to a result originally obtained in Vaikuntanathan & Jarzynski (2008) and recently re-derived in Vargas et al. (2024):

Proposition 2.4 (Nonequilibrium Transport Sampler (NETS)). Let (Xˆb t , Aˆb t) solve the coupled system of SDE/ODE

d X ˆb t = ϵt Ut(X ˆb t )dt + ˆbt(X ˆb t )dt +

2ϵtd Wt, (19)

d A ˆb t = ˆbt(X ˆb t )dt Ut(X ˆb t ) ˆbt(X ˆb t )dt

t Ut(X ˆb t )dt, (20)

with Xˆb 0 ρ0 and Aˆb 0 = 0, and where ϵt 0 is a timedependent diffusion coefficient. Then for all t [0, 1] and any test function h : Rd R, we have Z

Rd h(x)ρt(x)dx = E[e A ˆb th(Xˆb t )]

E[e Aˆb t] ,

Zt/Z0 = e Ft+F0 = E[e A ˆb t].

where the expectations are taken over the law of (Xˆb t , Aˆb t).

A proof of this proposition using manipulations of the FPE is given in Appendix A, which will allow us to write down a variety of new loss functions for learning ˆbt; for an alternative proof using Girsanov theorem, see Vargas et al. (2024). For completeness, in Appendix B we also give a time-discretized version of Proposition 2.4, and in Appendix D we generalize it in two ways: to include inertia and to turn t into a vector coordinate for multimarginal sampling. We also discuss the connection between NETS and the method of Vargas et al. (2024) in Appendix F.

Notice that, if ˆbt(x) = 0, the equations in Proposition 2.4 simply reduce to those in Proposition 2.1, whereas if ˆbt(x) = bt(x) solves (13) we can show that

Ab t = Ft + F0, (22)

i.e. the weights have zero variance and give the free energy difference. Indeed, by expanding both sides of (13) and dividing them by ρt(x) > 0, this equation can equivalently be written as

bt Ut bt = t Ut t Ft. (23)

As a result, when ˆbt(x) = bt(x), (20) reduces to

d Ab t = t Ftdt, Ab 0 = 0, (24)

and the solution to this equation is (22). In practice, achieving zero variance of the weights by estimating bt(x) exactly is not generally possible, but having a good approximation of bt(x) can help reducing this variance dramatically, as we will illustrate below via experiments.

2.5. Estimating the Drift bt(x) via a PINN Objective

Equation (23) can be used to derive an objective for both bt(x) and Ft. The reason is that in this equation the unknown t Ft can be viewed as factor that guarantees solvability: indeed, integrating both sides of (13) gives 0 =

NETS: A Non-Equilibrium Transport Sampler

Rd Ut(x)ρt(x)dx + t Ft, which, by (3), is satisfied if and only if Ft is (up to a constant fixed by F0 = log Z0) the exact free energy (2). This offers the possibility to learn both bt(x) and Ft variationally using an objective fitting the framework of physics informed neural networks (PINNs):

Proposition 2.5 (PINN objective). Given any T (0, 1] and any PDF ˆρt(x) > 0 consider the objective for (ˆb, ˆF) given by:

LT PINN[ˆb, ˆF] = Z T

0 E qt(xt) 2 dt, (25)

where xt ˆρt and we denote

qt(x) = ˆbt(x) Ut(x) ˆbt(x) t Ut(x)+ t ˆFt. (26)

Then minˆb, ˆ F LT PINN[ˆb, ˆF] = 0, and all minimizers (b, F) are such that and bt(x) solves (23) and Ft is the free energy (2) for all t [0, T].

This result is proven in Appendix A: in practice, we will use T (0, 1] for annealing but ultimately we are interested in the result when T = 1. Note that since the expectation over an arbitrary ˆρt(x) in (25), it can be used as an off-policy objective. It is however natural to use ˆρt(x) = ρt(x) since it allows us to put statistical weight in the objective precisely in the regions where we need bt(x) to transport probability mass. In either case, there is no need to backpropagate through simulation of the SDE used to produce data. We show in Section 2.7 below how the expectation over ρt(x) can be estimated without bias to arrive at an empirical estimator for (25) when ˆρt(x) = ρt(x). Note also that, while minimization of the objective (25) gives an estimate ˆFt of the free energy, it is not needed at sampling time when solving (19). An objective similar to (25) was also recently posited in M at e & Fleuret (2023); Tian et al. (2024) for use with deterministic flows. Here, we devise it in the context of augmenting annealed Langevin dynamics.

In addition to the above results, in Appendix D we supply extensions to the setup to the case where there are multiple marginals and where the stochastic dynamics have inertia. We also discuss an alternative objective, relying on the action matching formalism (Neklyudov et al., 2023), in Appendix E.

2.6. Control of the Kullback-Leibler Divergence

One advantage of the PINN objective (25) is that we know that its minimum is zero, and hence we can track its value to monitor convergence when minimizing (25) by gradient descent, as we do below. Another advantage of the loss (25) is that it controls the quality of the transport as measured by the Kullback-Leibler divergence:

Proposition 2.6 (KL control). Let ˆρt be the solution to the

Fokker-Planck equation

tˆρt + (ˆbtˆρt) = ϵt ( Utˆρt + ˆρt), (27)

with ˆρt=0 = ρ0 and where ˆbt(x) is some predefined velocity field and ϵt 0. Then, we have

DKL(ˆρt=1||ρ1) q

LT =1 PINN(ˆb, F). (28)

where Ft is the free energy. In addition, given any estimate ˆFt such that R 1 0 | t ˆFt Ft|2dt δ for some δ 0, we have

DKL(ˆρt=1||ρ1) q

2LT =1 PINN(ˆb, ˆF) + 2δ. (29)

This proposition is proven in Appendix A. Notice that the bound (28) can be estimated by using t Ft = E[e A ˆb t t Ut(Xˆb t )]/E[e A ˆb t] in the PINN loss (25).

2.7. Implementation

If we minimize (25) off-policy, i.e. with samples xt from some ˆρt = ρt, this is perfectly valid, but may be inefficient for learning ˆbt over the support necessary for the problem. If we decide instead to set ˆρt(x) = ρt(x), since the SDEs in (19) and (20) can be used with any ˆbt(x) to estimate expectation over ρt(x) via (21), we can write the PINN objective on-policy as

LT PINN[ˆb, ˆF] = Z T

E e A ˆb t|qt(Xˆb t )|2

E e Aˆb t dt (30)

These expectations can be estimated empirically over a population of solutions to (19) and (20). Crucially, since we can switch from off-policy to on-policy after taking the gradient of the PINN objective, when computing the gradient of (30) over ˆbt(x), (Xˆb t , Aˆb t) can be considered independent of ˆbt(x) and do not need to be differentiated over. In other words, the method does not require backpropagation through the simulation even if used on-policy, i.e. even though it uses the current value of ˆbt to estimate the loss and its gradient. Finally note that we can use the ODE (20) for Aˆb t to write (30) as

LT PINN[ˆb, ˆF] = Z T

E e A ˆb t t Aˆb t + t ˆFt 2

E e Aˆb t dt (31)

Since E[e A ˆb t t Aˆb t]/E[e A ˆb t] = t log E[e A ˆb t] = t Ft, (31) clearly shows that this loss controls the variance of t Aˆb t, which directly connects the Jarzynski weights to the PINN objective.

The computation of the divergence bt(x) in the PINN objective given in (25) can be avoided by using Hutchinson s trace estimator, see Appendix G.

NETS: A Non-Equilibrium Transport Sampler

Algorithm 1 Training: Note that the resultant set of walkers across time slices {xi k} are detached from the computational graph when taking a gradient step (off-policy learning).

1: Initialize: n walkers, x0 ρ0, A0 = 0, K time steps, model parameters for {ˆbt, ˆFt}, diffusion coefficient ϵt, learning rate η 2: repeat 3: Randomize time grid: t0, t1, . . . , t K Uniform(0, T), sort such that t0 < t1 < < t K 4: for k = 0, . . . , K do 5: tk = tk+1 tk, 6: for each walker i = 1, . . . , n do 7: xi k+1 = xi k + [ˆbtk(xi k) ϵtk Utk(xi k)] tk + 2ϵtk(W i tk+1 W i tk)

8: Ai k+1 = Ai k + [ ˆbtk(xi k) t Utk(xi k) ˆbtk(xi k) Utk(xi k)] tk 9: Estimate (30) by replacing the expectation by an empirical average over the n walkers and the time integral by an empirical average over t0, . . . , t K. 10: Take gradient descent step to update the model parameters. 11: until converged

Learning bt(x) and Ft for t [0, 1] from the start can be challenging if the initial ˆbt(x) is far from exact and the weights gets large variance as t increases. This problem can be alleviated by estimating bt(x) sequentially. In practice, this amounts to annealing T from a small initial value to T = 1, in such a way that bt(x) is learned sufficiently accurately so that variance of the weights remains small. This variance can be estimated on the fly, which also give us an estimate of the effective sample size (ESS) of the population at all times t [0, 1].

Note that we can also employ resampling strategies of the type used in SMC to keep the variance of the weights low (Doucet et al., 2001; Boli c et al., 2004).

Details for the numerical implementation of the minimization of the objective (25) is summarized in Algorithm 1.

2.8. Learning the Ut of Stochastic Interpolants

The choice of the potential Ut used in the annealing can have a significant impact on both the learnability of bt and the numerical stability of solving (19). A desirable characteristic is that Ut gives a density ρt that is geometrically smooth in its evolution between ρ0 and ρ1, so that transport via vector fields is simple. One approach toward achieving this is to use the drift associated with a stochastic interpolant (Albergo & Vanden-Eijnden, 2022; Lipman et al., 2022), i.e., the stochastic process defined as

It = αtx0 + βtx1, x0 ρ0, x1 ρ1, (32)

where the coefficients αt, βt satisfy α0 = β1 = 1, α1 = β0 = 0 and αt < 0, βt > 0. The PDF ρt of It satisfies (13) with a drift bt given by

bt(x) = E[ αtx0 + βtx1 | It = x], (33)

where the expectation is taken over the law of x0, x1 conditional on It = x. When ρ0 := N(0, Id), an application of Stein s lemma indicates that the gradient of the potential Ut associated with ρt is given by

Ut(x) = α 1 t E[x0 | It = x]. (34)

While theoretically appealing, these relations are not immediately useful for two reasons. First, this velocity field can only be regressed when samples from ρ1 are readily available, which is not the case in our setting. Second, the potential Ut associated with this ρt is not analytically known. However, by combining (33) with (34) and using x = E[It | It = x], we notice that bt can be written as

bt(x) = ( αtαt α2 t βtβ 1 t ) Ut(x) + βtβ 1 t x, (35)

which we can exploit to directly learn the Ut associated with the interpolant and, in the process, solve the transport problem. A convenient choice is to set, for example, αt =

1 t2 and βt = t, so that (35) reduces to

tbt(x) = x Ut(x). (36)

If we parameterize the potential as Ut = ˆU f t with

ˆU f t (x) = (1 t) 1

2|x|2 + t U1(x) + t(1 t) ˆft(x), (37)

where ˆft(x) : [0, 1] Rd R is a neural network, (36) gives us an expression for the drift also in terms of ˆft:

ˆbf t (x) = t 1(x ˆU f t (x))

= x U1(x) (1 t) ˆft(x). (38)

This allows us to write the PINN objective (25) as an objective for ( ˆf, ˆF):

LPINN[ ˆf, ˆF] = Z 1

0 E |qf t (xt)|2 dt, (39)

where qf t (xt) is obtained from (26) by replacing Ut with (37) and bt with (38):

qf t(x) = ˆbf t(x) ˆU f t (x) ˆbf t(x) t ˆU f t (x)+ t ˆFt. (40)

When minimized to zero, the objective (39) yields both a drift ˆbf t and a potential ˆU f t such that the resulting density ρt = e ˆU f t + ˆ Ft matches the PDF of the interpolant It and possesses favorable transport properties.

NETS: A Non-Equilibrium Transport Sampler

GMM (d = 2)

Algorithm ESS W2

FAB 0.653 0.017 12.0 5.73 PIS 0.295 0.018 7.64 0.92 DDS 0.687 0.208 9.31 0.82 p DEM 0.634 0.084 12.20 0.14 i DEM 0.734 0.092 7.42 3.44 CMCD-KL 0.268 0.069 9.32 0.71 CMCD-LV 0.655 0.023 4.01 0.25 NETS-AM ϵt = 5 (ours) 0.808 0.031 3.89 0.22 NETS-PINN ϵt = 0 (ours) 0.954 0.003 3.55 0.57 NETS-PINN ϵt = 4 (ours) 0.979 0.002 3.14 0.46 NETS-PINN-resample (ours) 0.993 0.004 3.27 0.31

Table 1. Performance of NETS in terms of ESS and W2 metrics for 40-mode GMM (d = 2) with comparative results quoted from Akhound-Sadegh et al. (2024) for reproducibility.

3. Numerical Experiments

In what follows, we test the NETS method, for both the PINN objective (25) and the action matching objective (119), on standard challenging sampling benchmarks. We then study how the method scales in comparison to baselines, particularly AIS on its own, by testing it on an increasingly high dimensional Gaussian Mixture Models (GMM). Following that, we show that it has orders of magnitude better statistical efficiency as compared to AIS on its own when applied to the study of lattice field theories, even past the phase transition of these theories and in 400 dimensions (an L = 20 L = 20 lattice).

3.1. 40-Mode Gaussian Mixture

A common benchmark for machine learning augmented samplers that originally appeared in the paper introducing Flow Annealed Importance Sampling Bootstrap (FAB) (Midgley et al., 2023) is a 40-mode GMM in 2-dimensions for which the means of the mixture components span from 40 to 40. The high variance and many wells make this problem challenging for re-weighting or locally updating MCMC processes. We choose as a time dependent potential Ut(x) that linearly interpolates the means of the GMM with U0(x) the potential for a standard multivariate Gaussian with standard deviation scale σ = 2.

We train a simple feed-forward neural network of width 256 against both the PINN objective (25), parameterizing (ˆb, ˆF), or the action matching objective (119), parameterizing ˆϕ. We compare the learned model from both objectives to recent related literature: FAB, Path Integral Sampler (PIS) (Zhang & Chen, 2021), Denoising Diffusion Sampler (DDS) (Vargas et al., 2023), and Denoising Energy Matching (p DEM, i DEM) (Akhound-Sadegh et al., 2024). For reproducibility with the benchmarks provided in the latter method, we compute the effective sample size (ESS) estimated from 2000 generated samples as well as the 2 Wasserstein (W2) distance between the model and the target. As noted in Table 1, all proposed variants of

NETS outperform existing methods. In addition, because our method can be turned into an SMC method by including resampling during the generation, we can push the acceptance rate of the same learned PINN model to nearly 100% by using a single resampling step when the ESS of the walkers dropped below 98%. NETS uses 100 sampling steps and an ϵt = 0.0, 4.0 in the SDE. Note that NETS and CMCD as published use different interpolating potentials on this example based on code conventions, but they could be the same.

3.2. Funnel and Student-t Mixture

We next test NETS on Neal s funnel, a challenging synthetic target distribution which exhibits correlations at different scales across its 10 dimensions, as well as the 50dimensional Mixture of Student-T (Mo S) distribution used in (Blessing et al., 2024). The definitions of the target densities and the interpolating potentials are given in Appendix H. Heuristically, the first dimension is Gaussian with variance σ2 = 9, and the other 9 dimensions are conditionally Gaussian with variance exp(x0), creating the funnel.

We again parameterize (ˆb, ˆF) or ˆϕ using simple feed forward neural networks, this time of hidden size 512. We use 100 sampling steps for both, with diffusion coefficients given in the caption of Table 2. Following (Blessing et al., 2024), we compute the maximum mean discrepancy (MMD) and W2 distance between 2000 samples from the model and 2000 samples from the target and compare to related methods in Table 2. NETS outperforms other methods with both losses on the high dimensional Mo S target in both metrics. In addition this can be improved using SMC-style resampling in the interpolation when the ESS drops below 70%. NETS matches the best performance in MMD for the Funnel distribution, but it is slightly worse in W2.

Algorithm Funnel (d = 10) Mo S (d = 50)

MMD W2 MMD W2

FAB (Midgley et al., 2023) 0.032 0.000 153.894 3.916 0.093 0.014 1204.160 147.7 GMMVI (Arenz et al., 2023) 0.031 0.000 105.620 3.472 0.135 0.017 1255.216 296.9 PIS (Zhang & Chen, 2022) 0.218 0.007 2113.172 31.17 DDS (Vargas et al., 2023) 0.172 0.031 142.890 9.552 0.131 0.001 2154.884 3.861 AFT (Arbel et al., 2021) 0.159 0.010 145.138 6.061 0.395 0.082 2648.410 301.3 CRAFT (Arbel et al., 2021) 0.115 0.003 134.335 0.663 0.257 0.024 1893.926 117.3 CMCD-KL (Vargas et al., 2024) 0.095 0.003 513.339 192.4 NETS-AM (ours) 0.041 0.001 435.793 96.17 0.0396 0.001 407.827 69.64 NETS-PINN (ours) 0.033 0.002 388.91 141.5 0.032 0.001 482.393 174.6 NETS-PINN-resample (ours) 0.027 0.003 343.78 65.25 0.030 0.000 400.076 59.31

Table 2. Performance of NETS on Neal s Funnel and Mixture of Student-T distributions, measured in MMD and W2 distances from the true distribution. Benchmarking is in accordance with the setup of (Blessing et al., 2024). Diffusion coefficient ϵt = 5, 4 was used for NETS-AM on the Funnel and Mo S, respectively. Equivalently, ϵt = 5, 5 were used by NETS-PINN. Bold numbers are within standard deviation the best performing. Note that NETS still has perfect sample in the ϵt limit, but would require finer time discretization than the 100 sampling steps used here (see Figure 4).

NETS: A Non-Equilibrium Transport Sampler

40 20 0 20 40

AIS, no transport, ϵ = 4.0

40 20 0 20 40

Only transport, ϵ = 0.0

40 20 0 20 40

AIS and transport, ϵ = 4.0 Figure 1. Comparison of the performance of annealed Langevin dynamics alone, transport alone, and annealed Langevin coupled with transport when sampling the 40mode GMM from (Midgley et al., 2023). Left: Annealed Langevin run for 250 steps with ϵt = 4.0, failing to capture the modes with 0% ESS. Center: Learning using the PINN loss and sampling with 100 steps and ϵt = 0 achieves an ESS of 95%. Right: Same learning and now sampling with ϵt = 4.0 achieves an ESS of 98%.

0 10 20 30 40 50 60 70 80 Diﬀusion coeﬃcient ϵ

Eﬀective Sample Size

8-mode GMM, dim=[36, 64, 128, 200]

dim=36, transport dim=36, no transport dim=64, transport dim=64, no transport dim=128, transport dim=128, no transport dim=200, transport dim=200, no transport

8 Cross section of x27 vs x3 Cross section of x10 vs x24 Cross section of x13 vs x27

-8 0 8 -8 0 8

Figure 2. Demonstration of high-dimensional sampling with our method using the PINN loss in (25) and a study of how diffusivity impacts performance, with and without transport. Left: NETS can achieve high ESS through transport alone, and the effect of increased diffusivity has more of a positive effect on performance with sampling than without. AIS cannot achieve ESS above 0 in high dimension. Right: Kernel density estimates of 2-d cross sections of the high-dimensional, multimodal distribution arising from the model and ground truth.

3.3. Scaling on High-dimensional GMMs

In order to demonstrate that the method generalizes to high dimension, we study sampling from multimodal GMMs in higher and higher dimensions and observe how the performance scales. In addition, we are curious to understand how the factor in the sampling SDE coming from annealed Langevin dynamics, U, interacts with the learned drift ˆb or ˆϕ as we change the diffusivity. We construct 8-mode target GMMs in d = 36, 64, 128, 200 dimensions and learn ˆb with the PINN loss in each scenario. We use the same feed forward neural network of width 512 and depth 4 to parameterize both ˆb and ˆF for all dimensions tested and train for 4000 training iterations. Figure 2 summarizes the results. On the left plot, we note that AIS on its own cannot produce any effective samples, while even in 200 dimensions, NETS works with transport alone with 60% ESS. As we increase the diffusivity ϵt and therefore the effect of the Langevin term coming from the gradient of the potential, we note that all methods converge to nearly independent sampling, and the discrepancy in performance across dimensions is diminished. Note that the caveat to achieve this is that the step size in the SDE integrator must be taken smaller to accommodate the increased diffusivity, especially for the ϵt = 80 data point. The number of sampling steps used to discretize the SDEs in these experiments ranged from

K = 100 for ϵt = 0 up to K = 2000 for ϵt = 80. Nonetheless, it suggests that diffusion can be more helpful when there is already some successful transport than without.

3.4. Lattice φ4 Theory

We next apply NETS to the simulation of a statistical lattice field theory at and past the phase transition from which the lattice goes from disordered, to semi-ordered, to fully ordered (neighboring sites are highly correlated to be of the same sign and magnitude). We study the lattice φ4 theory in D = 2 spacetime dimensions (Vierhaus, 2010; Albergo et al., 2019). The random variables in this circumstance are field configurations φ RL L, where L is the extent of space and time. The interpolating energy function under which we seek to sample is defined as:

x y |φx φy|2 + X

h m2 tφ2 x + λtφ4 x i , (41)

where the sums are taken over the lattices sites x, or adjacent sites x y, and λt are time-dependent parameters of the theory that define the phase of the lattice (ranging from disordered to ordered, otherwise known as magnetized). A derivation of this energy function is given in Appendix I. Importantly, sampling the lattice configurations becomes challenging when approaching the phase transition between the disordered and ordered phases. As an example, we identify the phase transition on L = 16 (d = 256) and L = 20 (d = 400) lattices and run NETS with the action matching loss, with ˆϕt a simple feed forward neural network. We use the free theory λ0 = 0 as the base distribution under which we initially draw samples. The definition of the target parameter values m2 1, λ1 both at the phase transition and in the ordered phase are given in the Appendix I. In Figure 3, the top row shows samples from NETS for L = 20 at the phase transition, where correlations begin to appear in the lattice configurations. NETS is almost 2 orders of magnitude more statistically efficient than AIS (the same setup without the transport) in sampling at the critical point, as seen in the plot showing ESS over time. Note also that NETS can produce unbiased estimates of the magnetization as compared to a Hybrid Monte Carlo (HMC) ground truth. The bottom row shows samples past the phase transition and into the ordered phase, where the lattices begin to take on

NETS: A Non-Equilibrium Transport Sampler

100 ESS over Time

M[φ(x)] = P x φ(x)

0.0 0.2 0.4 0.6 0.8 1.0 t

0.50 0.25 0.00 0.25 0.50 Average Magnetization

φ U0 AIS NETS Reweighted, NETS HMC

Figure 3. Comparison of the performance of NETS to AIS on two different settings for the study of φ4 theory. Top row, left: 10 example generative lattice configurations with parameters L = 20, m2 = 1.0, λ = 0.9, which demarcates the phase transition to the antiferromagnetic phase. Top row, right: Performance of AIS (purple curve) vs. NETS (red curve) in terms of effective sample size over time of integration t, and a histogram of the average magnetization of 4000 lattice configurations, sampled with AIS, NETS, and HMC (superposed in this order). Note that NETS is closer to the HMC target and re-weights correctly. Re-weighted AIS was not plotted because the weights were too high variance. Bottom row: Equivalent setup for L = 16, m2 = 1.0, λ = 0.8, past the phase transition and into the ordered phase. Note that the field configurations generated by NETS are either all positive across lattice sites or all negative. AIS fails to sample the correct distribution, and its weights are too high variance to be used on the histogram.

either all positive or all negative values. Again in this regime, NETS is nearly 2 orders of magnitude more statistically efficient.

While NETS performs significantly better than conventional annealed samplers on the challenging field theory problem, algorithms built out of dynamical transport still experience slowdowns near phase transitions because of the difficulty of resolving the dynamics of the integrators near these critical points. As such, we need to use 1500-2000 steps in the integrator to properly resolve the dynamics of the SDE.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

Akhound-Sadegh, T., Rector-Brooks, J., Bose, J., Mittal, S., Lemos, P., Liu, C.-H., Sendera, M., Ravanbakhsh, S., Gidel, G., Bengio, Y., Malkin, N., and Tong, A. Iterated denoising energy matching for sampling from boltzmann densities. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of

Machine Learning Research, pp. 760 786. PMLR, 21 27 Jul 2024. URL https://proceedings.mlr. press/v235/akhound-sadegh24a.html.

Albergo, M. S. and Vanden-Eijnden, E. Building normalizing flows with stochastic interpolants. In The Eleventh International Conference on Learning Representations, 2022.

Albergo, M. S., Kanwar, G., and Shanahan, P. E. Flowbased generative models for markov chain monte carlo in lattice field theory. Phys. Rev. D, 100: 034515, Aug 2019. doi: 10.1103/Phys Rev D.100. 034515. URL https://link.aps.org/doi/10. 1103/Phys Rev D.100.034515.

Albergo, M. S., Boffi, N. M., and Vanden-Eijnden, E. Stochastic interpolants: A unifying framework for flows and diffusions. ar Xiv preprint ar Xiv:2303.08797, 2023.

Arbel, M., Matthews, A. G. D. G., and Doucet, A. Annealed flow transport monte carlo. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research, 18 24 Jul 2021.

Arenz, O., Dahlinger, P., Ye, Z., Volpp, M., and Neumann, G. A unified perspective on natural gradient variational inference with gaussian mixture models. Transactions on Machine Learning Research, 2023. ISSN 2835-

NETS: A Non-Equilibrium Transport Sampler

8856. URL https://openreview.net/forum? id=t LBjs X4tjs.

Behjoo, H. and Chertkov, M. Harmonic path integral diffusion, 2024. URL https://arxiv.org/abs/2409. 15166.

Berner, J., Richter, L., and Ullrich, K. An optimal control perspective on diffusion-based generative modeling, 2024. URL https://arxiv.org/abs/2211.01364.

Blessing, D., Jia, X., Esslinger, J., Vargas, F., and Neumann, G. Beyond ELBOs: A large-scale evaluation of variational methods for sampling. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 4205 4229. PMLR, 21 27 Jul 2024. URL https://proceedings.mlr.press/ v235/blessing24a.html.

Boli c, M., Djuri c, P. M., and Hong, S. Resampling algorithms for particle filters: A computational complexity perspective. EURASIP Journal on Advances in Signal Processing, 2004(15):403686, 2004. doi: 10.1155/S1110865704405149. URL https://doi. org/10.1155/S1110865704405149.

Bonanno, C., Nada, A., and Vadacchino, D. Mitigating topological freezing using out-of-equilibrium simulations. Journal of High Energy Physics, 2024(4):126, 2024. doi: 10.1007/JHEP04(2024)126. URL https: //doi.org/10.1007/JHEP04(2024)126.

Bruna, J. and Han, J. Posterior sampling with denoising oracles via tilted transport, 2024. URL https: //arxiv.org/abs/2407.00745.

Caselle, M., Cellini, E., Nada, A., and Panero, M. Stochastic normalizing flows as non-equilibrium transformations. Journal of High Energy Physics, 2022(7):15, 2022. doi: 10.1007/JHEP07(2022)015. URL https: //doi.org/10.1007/JHEP07(2022)015.

Chen, R. T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 69386f6bb1dfed68692a24c8686939b9-Paper. pdf.

De Bortoli, V., Thornton, J., Heng, J., and Doucet, A. Diffusion schr odinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, volume 34, pp. 17695 17709, 2021.

Del Moral, P. Nonlinear filtering: Interacting particle resolution. Comptes Rendus de l Acad emie des Sciences - Series I - Mathematics, 325(6):653 658, 1997. ISSN 0764-4442. doi: https://doi.org/10.1016/S0764-4442(97)84778-7. URL https://www.sciencedirect.com/ science/article/pii/S0764444297847787.

Doucet, A., de Freitas, N., and Gordon, N. J. (eds.). Sequential Monte Carlo Methods in Practice. Statistics for Engineering and Information Science. Springer, 2001. ISBN 978-1-4419-2887-0. doi: 10.1007/ 978-1-4757-3437-9. URL https://doi.org/10. 1007/978-1-4757-3437-9.

Fan, M., Zhou, R., Tian, C., and Qian, X. Path-guided particle-based sampling. In Forty-first International Conference on Machine Learning, 2024. URL https: //openreview.net/forum?id=Kt4fwiu Kqf.

Faulkner, M. F. and Livingstone, S. Sampling algorithms in statistical physics: a guide for statistics and machine learning, 2023. URL https://arxiv.org/abs/ 2208.04751.

Gabri e, M., Rotskoff, G. M., and Vanden-Eijnden, E. Adaptive monte carlo augmented with normalizing flows. Proceedings of the National Academy of Sciences, 119(10):e2109420119, 2022. doi: 10.1073/pnas. 2109420119. URL https://www.pnas.org/doi/ abs/10.1073/pnas.2109420119.

Grathwohl, W., Chen, R. T. Q., Bettencourt, J., and Duvenaud, D. Scalable reversible generative models with free-form continuous dynamics. In International Conference on Learning Representations, 2019. URL https: //openreview.net/forum?id=r Jxgkn Cc K7.

Hartmann, C., Sch utte, C., and Zhang, W. Jarzynski s equality, fluctuation theorems, and variance reduction: Mathematical analysis and numerical algorithms. Journal of Statistical Physics, 175:1214 1261, 2018. URL https://api.semanticscholar. org/Corpus ID:59394118.

Hastings, W. K. Monte carlo sampling methods using markov chains and their applications. Biometrika, 57 (1):97 109, 1970. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2334940.

Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in neural information processing systems, volume 33, pp. 6840 6851, 2020.

NETS: A Non-Equilibrium Transport Sampler

Hua, M., Lauri ere, M., and Vanden-Eijnden, E. An efficient on-policy deep learning framework for stochastic optimal control. ar Xiv preprint ar Xiv:2410.05163, 2024.

H enin, J., Leli evre, T., Shirts, M. R., Valsson, O., and Delemotte, L. Enhanced sampling methods for molecular dynamics simulations [article v1.0]. Living Journal of Computational Molecular Science, 4(1): 1583, Dec. 2022. doi: 10.33011/livecoms.4.1.1583. URL https://livecomsjournal.org/index. php/livecoms/article/view/v4i1e1583.

Jarzynski, C. Nonequilibrium equality for free energy differences. Phys. Rev. Lett., 78:2690 2693, Apr 1997. doi: 10.1103/Phys Rev Lett.78.2690. URL https://link.aps.org/doi/10.1103/ Phys Rev Lett.78.2690.

Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2022.

Liu, J. S. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113 119, 1996. doi: 10.1007/BF00162521. URL https://doi.org/10. 1007/BF00162521.

Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2022.

Malkin, N., Lahlou, S., Deleu, T., Ji, X., Hu, E., Everett, K., Zhang, D., and Bengio, Y. Gflownets and variational inference, 2023. URL https://arxiv.org/abs/ 2210.00580.

Matthews, A. G. D. G., Arbel, M., Rezende, D. J., and Doucet, A. Continual repeated annealed flow transport monte carlo. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Jul 2022.

Midgley, L. I., Stimper, V., Simm, G. N. C., Sch olkopf, B., and Hern andez-Lobato, J. M. Flow annealed importance sampling bootstrap. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=XCTVFJw S9LJ.

M at e, B. and Fleuret, F. Learning interpolations between boltzmann densities, 2023. URL https://arxiv. org/abs/2301.07388.

Neal, R. M. Probabilistic inference using markov chain monte carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto, September 1993.

Neal, R. M. Annealed importance sampling. Statistics and Computing, 11(2):125 139, 2001. doi: 10.1023/ A:1008923215028. URL https://doi.org/10. 1023/A:1008923215028.

Neklyudov, K., Brekelmans, R., Severo, D., and Makhzani, A. Action matching: Learning stochastic dynamics from samples. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 25858 25889. PMLR, 23 29 Jul 2023. URL https://proceedings.mlr.press/ v202/neklyudov23a.html.

No e, F., Olsson, S., K ohler, J., and Wu, H. Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning. Science, 365(6457):eaaw1147, 2019. doi: 10.1126/science. aaw1147. URL https://www.science.org/ doi/abs/10.1126/science.aaw1147.

Parno, M. D. and Marzouk, Y. M. Transport map accelerated markov chain monte carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645 682, 2018. doi: 10.1137/17M1134640. URL https://doi.org/10. 1137/17M1134640.

Sendera, M., Kim, M., Mittal, S., Lemos, P., Scimeca, L., Rector-Brooks, J., Adam, A., Bengio, Y., and Malkin, N. Improved off-policy training of diffusion samplers, 2024. URL https://arxiv.org/abs/2402.05098.

Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. ar Xiv preprint ar Xiv:2011.13456, 2020.

Tian, Y., Panda, N., and Lin, Y. T. Liouville flow importance sampler. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp. 48186 48210. PMLR, 21 27 Jul 2024. URL https://proceedings.mlr. press/v235/tian24c.html.

Vaikuntanathan, S. and Jarzynski, C. Escorted free energy simulations: Improving convergence by reducing dissipation. Phys. Rev. Lett., 100:190601, May 2008. doi: 10.1103/Phys Rev Lett.100.190601. URL https://link.aps.org/doi/10.1103/ Phys Rev Lett.100.190601.

Vargas, F., Grathwohl, W. S., and Doucet, A. Denoising diffusion samplers. In The Eleventh International Conference on Learning Representations, 2023. URL https: //openreview.net/forum?id=8pvnf TAbu1f.

NETS: A Non-Equilibrium Transport Sampler

Vargas, F., Padhy, S., Blessing, D., and N usken, N. Transport meets variational inference: Controlled monte carlo diffusions. In The Twelfth International Conference on Learning Representations, 2024. URL https:// openreview.net/forum?id=PP1rudnxi W.

Vierhaus, I. Simulation of ϕ4 theory in the strong coupling expansion beyond the Ising Limit. Ph D thesis, Humboldt University of Berlin, 07 2010.

Wilson, K. G. Confinement of quarks. Phys. Rev. D, 10:2445 2459, Oct 1974. doi: 10.1103/Phys Rev D. 10.2445. URL https://link.aps.org/doi/10. 1103/Phys Rev D.10.2445.

Zhang, Q. and Chen, Y. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2021.

Zhang, Q. and Chen, Y. Path integral sampler: A stochastic control approach for sampling. In International Conference on Learning Representations, 2022. URL https: //openreview.net/forum?id=_u Cb2yn Ru7Y.

NETS: A Non-Equilibrium Transport Sampler

A. Proofs of Section 2

Here we provide the proofs of the statements made in Sec. 2 which, for the reader convenience, we recall.

Proof of Proposition 2.1. Let ft(x, a) with (x, a) Rd+1 be the PDF of the joint process (Xt, At) defined by the SDE (9) and (10). This PDF solves the FPE

tft = ϵt ( Utft + ft) + t Ut aft, ft=0(x, a) = δ(a)ρ0(x). (42)

Define gt(x) = Z

R eaft(x, a)da. (43)

We can derive an equation for gt(x) by multiplying both sides of the FPE (42) by ea and integrating over a R. Using Z

R ea tft(x, a)da = t

R eaft(x, a)da = tgt, Z

R eaϵt ( Utft + ft)da = ϵt Ut

R eaft(x, a)da + Z

R eaft(x, a)da

= ϵt ( Utgt + gt), Z

R ea t Ut aftda = t Ut

R eaftda = t Utgt,

where we arrived at the second equality in the third equation by integration by parts, we deduce that

tgt = ϵt ( Utgt + gt) t Utgt, gt=0(x) = ρ0(x) = e U0(x)+F0. (45)

The solution to this parabolic PDE is unique and it can be checked by direct substitution that it is given by

gt(x) = e Ut(x)+F0. (46)

Note that this solution is not normalized since it contains F0 rather than Ft. In fact it is easy to see that Z

Rd gt(x)dx = Z

Rd+1 eaft(x, a)dxda = e Ft+F0, (47)

where the first equality follows from the definition of gt and the second from its explicit expression and the definition of the free energy that implies R

Rd e Ut(x)dx = e Ft. Equation (46) is the second equation in (11). From (46) we also deduce that, given any test function h : Rd R, we have R

Rd+1 eah(x)ft(x, a)dxda R

Rd+1 eaft(x, a)dxda =

Rd h(x)gt(x)dx R

Rd h(x)e Ut(x)+F0dx R

Rd e Ut(x)+F0

Rd h(x)e Ut(x)dx R

Rd e Ut(x)dx

Rd h(x)e Ut(x)dx

Rd h(x)ρt(x)dx.

Since by definition of ft(x, a) the left hand-side of this equation can be expressed as the ratio of expectations over (Xt, At) in the first equation in (11) we are done.

NETS: A Non-Equilibrium Transport Sampler

Proof of Proposition 2.3. If bt satisfies (13), then ρt satisfies the FPE (14). Since (15) is the SDE associated with this FPE, (16) holds.

Proof of Proposition 2.4. We can follow the same steps as in the proof of Proposition 2.1 by considering the PDF fˆb t (x, a) of (Xˆb t , Aˆb t). This PDF solves the FPE

tf ˆb t = ϵt ( Utf ˆb t + f ˆb t ) (ˆbtf ˆb t ) ( ˆbt Ut t Ut) af ˆb t ,

f ˆb t=0(x, a) = δ(a)ρ0(x). (49)

g ˆb t(x) = Z

R eaf ˆb t (x, a)da. (50)

We can derive an equation for gˆb t(x) by multiplying both sides of the FPE (49) by ea and integrating over a R. Using

R ea tf ˆb t da = t

R eaf ˆb t da = tg ˆb t, Z

R eaϵt ( Utf ˆb t + f ˆb t )da = ϵt Ut

R eaf ˆb t da + Z

R eaf ˆb t da ,

R ea (ˆbtf ˆb t )da = ˆbt

R eaf ˆb t da

= (ˆbtg ˆb t),

R ea( ˆbt Ut t Ut) af ˆb t da = ( ˆbt Ut t Ut) Z

R ea af ˆb t da

= ( ˆbt Ut t Ut) Z

R eaf ˆb t da

= ( ˆbt Ut t Ut)g ˆb t,

where we arrived at the second equality in the fourth equation by integration by parts, we deduce that

tg ˆb t = ϵt ( Utg ˆb t + g ˆb t) (ˆbtg ˆb t) + ( ˆbt Ut t Ut)g ˆb t,

g ˆb t=0(x) = ρ0(x) = e U0(x)+F0. (52)

The solution to this parabolic PDE is unique and it can be checked by direct substitution that it is given by

g ˆb t(x) = e Ut(x)+F0. (53)

This solution is not normalized since it contains F0 rather than Ft, and it is easy to see that

Rd g ˆb t(x)dx = Z

Rd+1 eaf ˆb t (x, a)dxda = e Ft+F0. (54)

where the first equality follows from the definition of gˆb t and the second from its explicit expression and the definition of the free energy that implies R

Rd e Ut(x)dx = e Ft. Equation (54) is the second equation in (21). From (53) we also deduce

NETS: A Non-Equilibrium Transport Sampler

that, given any test function h : Rd R, we have

Rd+1 eah(x)fˆb t (x, a)dxda R

Rd+1 eafˆb t (x, a)dxda =

Rd h(x)gˆb t(x)dx R

Rd gˆb t(x)dx

Rd h(x)e Ut(x)+F0dx R

Rd e Ut(x)+F0

Rd h(x)e Ut(x)dx R

Rd e Ut(x)dx

Rd h(x)e Ut(x)dx

Rd h(x)ρt(x)dx.

Since by definition of fˆb t (x, a) the left hand-side of this equation can be expressed as the ratio of expectations over (Xˆb t , Aˆb t) in the first equation in (21) we are done.

Proof of Proposition 2.5. Clearly the minimum value of (25) is zero and the minimizing pair (ˆb, ˆF) must satisfy

ˆbt Ut ˆbt t Ut + t ˆFt = 0 (56)

By multiplying both sides of this equation by ρt is can be written as

(ˆbtρt) t Utρt + t ˆFtρt = 0 (57)

This equation requires a solvability condition obtained by integrating it over Rd. This gives

Rd t Ut(x)ρt(x)dx + t ˆFt = 0, (58)

which, by (3), implies that t ˆFt = t Ft. In turn, this implies that (57) is equivalent to (13), i.e. ˆbt solves (13).

Proof of Proposition 2.6. Consider

DKL(ˆρt||ρt) = Z

Rd log ˆρt(x)

ˆρt(x)dx (59)

where ˆρt satisfies (27). Taking the time-derivative of this expression we deduce that (using (27), ρt(x) = e Ut(x)+Ft, the

NETS: A Non-Equilibrium Transport Sampler

identity ϵt ( Utˆρt + ˆρt) = ϵt (ρt (ˆρt/ρt)), and multiple integrations by parts)

t DKL(ˆρt||ρt) = Z

tˆρt(x) tρt(x)

ρt(x) ˆρt(x) dx

Rd log ˆρt(x)

(ˆbt(x)ˆρt(x)) + ϵt ( Ut(x)ˆρt(x) + ˆρt(x)) dx

Rd( t Ut(x) t Ft)ˆρt(x)dx

ˆbt(x) log ˆρt(x)

+ t Ut t Ft

Rd log ˆρt(x)

h ˆbt(x) ˆρt(x) + ˆbt(x) Ut(x) + t Ut t Ft ˆρt(x) i dx

h ˆbt(x) + ˆbt(x) Ut(x) + t Ut t Ft i ˆρt(x)dx

2 ˆρt(x)dx.

DKL(ˆρt=1||ρ1) = Z 1

h ˆbt(x) + ˆbt(x) Ut(x) + t Ut t Ft i ˆρt(x)dxdt

2 ˆρt(x)dxdt

h ˆbt(x) + ˆbt(x) Ut(x) + t Ut t Ft i ˆρt(x)dxdt

ˆbt(x) + ˆbt(x) Ut(x) + t Ut t Ft 2 ˆρt(x)dxdt 1/2

LT =1 PINN(ˆb, F)

which gives (28). To establish (29) observe that

L1 PINN(ˆb, F)

ˆbt(x) ˆbt(x) Ut(x) t Ut + t Ft 2 ˆρt(x)dxdt

ˆbt(x) ˆbt(x) Ut(x) t Ut + t ˆFt 2 + t Ft t ˆFt 2 ˆρt(x)dxdt

= 2L1 PINN(ˆb, ˆF) + 2 Z 1

0 | t Ft t ˆFt|2dt

Therefore, if R 1 0 | t ˆFt t Ft|2dt δ, we have

L1 PINN(ˆb, F) 2L1 PINN(ˆb, ˆF) + 2δ (63)

Combining this bound with (28) gives (29).

B. Time-discretized Version of Proposition 2.4

Here we show how to generalize the result in Proposition 2.4 if we time discretize the SDE in (19) using Euler-Marayuma scheme and use some suitable time-discretized version of the ODE (20).

NETS: A Non-Equilibrium Transport Sampler

Proposition B.1. Let 0 = t0 < t1 < < t K = 1 be a time grid on [0, 1], denote tk = tk+1 tk for k = 0, . . . , K 1, set Xˆb 0 ρ0 and Aˆb 0 = 0, and for k = 0, . . . , K 1 define Xˆb tk+1, Aˆb tk+1 recursively via

X ˆb tk+1 = X ˆb tk ϵtk Utk( X ˆb tk) tk + ˆbtk( X ˆb tk) tk + p

2ϵtk(Wtk+1 Wtk), (64)

A ˆb tk+1 = A ˆb tk + Utk( ˆX ˆb tk) Utk+1( ˆX ˆb tk+1) + R+ k ( X ˆb tk, X ˆb tk+1) R k ( X ˆb tk+1, X ˆb tk), (65)

where we defined

R k (x, y) = 1 4ϵtk tk

y x + tk(ϵtk Utk(x) btk(x)) 2 (66)

Then for all k = 0, . . . , K and any test function h : Rd R, we have

Rd h(x)ρtk(x)dx = E[e A ˆb tk h( Xˆb tk)]

E[e Aˆb tk ] , Ztk = e Ftk = E[e A ˆb tk ], (67)

where the expectations are taken over the law of ( Xˆb tk, Aˆb tk)

Note that the weights in (67) correct for the bias coming for both the time evolution of Ut(x) and the fact that the Euler Maruyama update in (64) does not satisfy the detailed-balance condition locally. It cab be checked by direct calculation that (65) is a consistent time-discretization of the ODE (20).

Proof. For simplicity of notations we will prove (67) for k = K: the argument for all the other k = 1, . . . , K 1 is similar. The update rule in (65) implies that

Utk( ˆX ˆb tk) Utk+1( ˆX ˆb tk+1) + R+ k ( X ˆb tk, X ˆb tk+1) R k ( X ˆb tk+1, X ˆb tk)

= U0( X ˆb t0) UK( X ˆb t K) +

R+ k ( X ˆb tk, X ˆb tk+1) R k ( X ˆb tk+1, X ˆb tk) ,

Now, given the test function h : Rd R, consider

I[h] E e A ˆb t K h( X ˆb t K) (69)

Since the transition probability density function of the Euler-Maruyama update in (64) reads

ρ+ tk(xk+1|xk) = (4πϵtk tk) d/2 exp R+ k (xk, xk+1) , (70)

the joint probability density function of the path ( Xˆb t0, Xˆb t1, . . . , Xˆb t K) is given by

ρ(x0, . . . , x K) = exp ( U0(x0) + F0)

k=0 ρ+ tk, tk(xk+1|xk)

U0(x0) + F0

k=0 R+ k (xk, xk+1)

where C = QK 1 k=0 (4πϵtk tk) d/2. We can use this density along with the explicit expression for Aˆb TK in (68) to express the expectation (69) as an integral over ρ(x0, x1, . . . , x K)

Rd(K+1 dx0 dx K exp

U0(x0) + F0

k=0 R+ k (xk, xk+1)

U0(x0) UK(x K) +

R+ k (xk, xk+1) R k (xk+1, xk) !

NETS: A Non-Equilibrium Transport Sampler

where the second exponential comes from the factor e A ˆb K. (72) simplifies into

Rd(K+1 dx0 dx K exp

UK(x K) + F0

k=0 R k (xk+1, xk) !

h(x K) (73)

In this expression we recognize a product of factors involving

ρ tk(xk|xk+1) = (4πϵtk) d/2 exp R k (xk+1, xk) , (74)

which is the transition probability density function of the time-reversed update

Y ˆb tk = Y ˆb tk+1 ϵtk Utk( Y ˆb tk+1) tk ˆbtk( Y ˆb tk+1) tk + p

2ϵtk(Wtk+1 Wtk). (75)

This implies in particular that we can perform the integrals in (73) sequentially over x0, x1, .., x K 1 to be left with

Rd exp ( UK(x K) + F0) h(x K)dx K (76)

Therefore I[1] = Z

Rd exp ( UK(x K) + F0) dx K = e FK+F0, (77)

which is the second equation in (67), and

I[h] I[1] = e FK F0 Z

Rd exp ( UK(x K) + F0) h(x K)dx K = Z

Rd h(x)ρt K(x)dx (78)

which is the first equation in (67).

C. Solving for the Optimal Drift via Feynman-Kac Formula

Without loss of generality, we can always look for a solution to (23) in the form of bt(x) = ϕt(x), so that this equation becomes the Poisson equation ϕt Ut ϕt = t Ut t Ft. (79)

The solution to this equation can be expressed via Feynman-Kac formula: Proposition C.1. Let Xt,x τ satisfy the following SDE

d Xt,x τ = Ut(Xt,x τ )dτ +

2d Wτ, Xt,x τ=0 = x (80)

where Ut is evaluated fixed at t [0, 1] fixed. Assume geometric ergodicity of the semi-group associated with (80), i.e. the probability distribution of the solutions to this SDE converges exponentially fast towards their unique equilibrium distribution with density ρt(x). Then for all (t, x) [0, 1] Rd we have

0 E t Ft t Ut(Xt,x τ ) dτ (81)

where the expectation is taken over the law of Xt,x τ .

Proof. By Ito formula,

dϕt(Xt,x τ ) = ϕt(Xt,x τ ) Ut(Xt,x τ ) ϕt(Xt,x τ ) dτ +

2 ϕt(Xt,x τ ) d Wτ

= t Ut(Xt,x τ ) t Ft dτ +

2 ϕt(Xt,x τ ) d Wτ (82)

where the differential is taken with respect to τ at t fixed, and we used (79) to get the second equality. If we integrate this relation on τ [0, T] and take expectation, we deduce that

E ϕt(Xt,x T ) ϕt(x) = Z T

0 E t Ut(Xt,x τ ) t Ft dτ (83)

where we use Ito isometry to zero the expectation of the martingale term involving

2 ϕt(Xt,x τ ) d Wτ. If we let T , by ergodicty the first term at the left hand side converges towards a constant independent of (t, x) which we can neglect this fixes the gauge of the solution to (79) which is unique only up to a constant. What remains in this limit is the expression (81). Note that the integral in this expression converges since E t Ut(Xt,x τ ) t Ft exponentially fast as τ by assumption of geometric ergodicity.

NETS: A Non-Equilibrium Transport Sampler

Example: moving Gaussian distribution. Let us consider the case where

2(x bt)T At(x bt), (84)

where bt Rd is a time-dependent vector field and At = AT t Rd Rd is a time-dependent positive-definite matrix: we assume that both bt and At are C1 in time, and also that At At = At At. The free energy in this example is

Ft = log Zt, Zt = (2π)d/2| det At| 1/2, (85)

so that t Ut(x) = b T t At(x bt) + 1

2(x bt)T At(x bt), t Ft = 1

2tr(A 1 t At). (86)

In this case, the SDE (80) reads

d Xt,x τ = At(Xt,x τ bt)dτ +

2d Wτ, Xt,x τ=0 = x, (87)

and its solution is

Xt,x τ = e Atτx + 1 e Atτ bt +

0 e At(τ τ )d Wτ . (88)

This implies that (using Ito isometry)

E t Ut(Xt,x τ ) = b T t Ate Atτ(x bt) + 1

2(x bt)T e Atτ Ate Atτ(x bt)

0 tr e Atτ Ate Atτ dτ

= b T t Ate Atτ(x bt) + 1

2(x bt)T e Atτ Ate Atτ(x bt)

2tr A 1 t At) 1

2tr(A 1 t Ate 2Atτ).

Therefore, from (81), we have (using also (85))

b T t Ate Atτ(x bt) 1

2(x bt)T e Atτ Ate Atτ(x bt)

2tr(A 1 t Ate 2Atτ) dτ

= bt (x bt) 1

4(x bt)T At A 1 t (x bt) + 1

4tr(A 1 t At A 1 t ).

This solution checks out since it implies that

Ut(x) ϕt(x) + ϕt(x) = b T t At(x bt) + 1

2(x bt)T At(x bt) 1

2tr(A 1 t At), (91)

which is t Ut(x) t Ft as it should.

D. Extensions and Generalizations

D.1. Inertial NETS

It is straightforward to generalize Proposition D.1 so that the stochastic dynamics involves some memory/inertia:

Proposition D.1. Let (X ˆb,µ t , R ˆb, t , A ˆb,µ t ) solve the coupled system of SDE/ODE

d X ˆb,µ t = ˆbt(X ˆb,µ t )dt + R ˆb,µ t dt, X ˆb,µ 0 ρ0, (92)

d R ˆb,µ t = µ Ut(X ˆb,µ t )dt µϵ 1 t R ˆb,µ t dt + µ q

2ϵ 1 t d Wt, R ˆb,µ 0 N(0, µId), (93)

d A ˆb,µ t = ˆbt(X ˆb,µ t )dt Ut(X ˆb,µ t ) ˆbt(X ˆb,µ t )dt t Ut(X ˆb,µ t )dt, A ˆb,µ 0 = 0, (94)

where ϵt > 0 is a time-dependent diffusion coefficient, µ 0 is a mobility coefficient, and Wt Rd is the Wiener process. Then for all t [0, 1] and any test function h : Rd R, we have Z

Rd h(x)ρt(x)dx = E[e A ˆb,µ t h(X ˆb,µ t )]

E[e A ˆb,µ t ] , Zt/Z0 = e Ft+F0 = E[e A ˆb,µ t ], (95)

where the expectations are taken over the law of (X ˆb,µ t , A ˆb,µ t ).

NETS: A Non-Equilibrium Transport Sampler

The proof of this proposition can be found at the end of this subsection. Note that when ˆb = b, the solution to (23), (94) is simply

Ab,γ t = Ft + F0, (96)

i.e. the weights are again deterministic with zero variance. In general, ˆb will not be the optimal one, in which case using the SDE in (92)-(94) gives us the extra parameter µ to play with post-training to improve the ESS. Below we show that (92)-(94) reduce to (19)-(20) in the limit as µ . It is also easy to see that, if we set µ = 0 in (92)-(94), we simply get that R ˆb,µ t = 0 and hence (97) reduces to the ODE d X ˆb,µ t = ˆbt(X ˆb,µ t )dt. Finally it is worth noting that (92)-(93) can be cast into Langevin equations with some extra forces. Indeed, if we introduce the velocity V ˆb,µ t = ˆbt(X ˆb,µ t ) + R ˆb,µ t , (92)-(93) can be written as

d X ˆb,µ t = V ˆb,µ t dt X ˆb,µ 0 ρ0, (97)

d V ˆb,µ t = µ Ut(X ˆb,µ t )dt + µϵ 1 t ˆbt(X ˆb,µ t )dt tˆbt(X ˆb,µ t )dt

+ bt(X ˆb,µ t )V ˆb,µ t dt µϵ 1 t V ˆb,µ t dt + µ q

2ϵ 1 t d Wt, V ˆb,µ 0 N(ˆb0(X ˆb,µ 0 ), µId) (98)

In these equations, the terms µϵ 1 t ˆbt tˆbt can be interpreted as non-conservative forces added to µ Ut, and the term bt V ˆb,µ t as an extra friction term added to µϵ 1 t V ˆb,µ t .

Proof of Proposition D.1. Denote by f ˆb,µ t (x, r, a) the joint PDF of (X ˆb,µ t , R ˆb,µ t , A ˆb,µ t ). This PDF satisfies the FPE

tf ˆb,µ t = x ([ˆbt + r]f ˆb,µ t ) + µ Ut rf + µϵ 1 t r (rf ˆb,µ t + µ rf ˆb,µ t )

( ˆbt Ut ˆbt t Ut) af ˆb,µ t ,

f ˆb,µ 0 (x, r, a) = ρ0(x)(2πµ) d/2e |r|2/(2µ)δ(a).

g ˆb,µ t (x, r) = Z

R eaf ˆb,µ t (x, r, a)da. (100)

We can derive an equation for g ˆb,µ t (x) by multiplying both sides of the FPE (99) by ea and integrating over a R. Using equations similar to (51), we arrive at

tg ˆb,µ t = x ([ˆbt + r]g ˆb,µ t ) + µ Ut rf + µϵ 1 t r (rg ˆb,µ t + µ rg ˆb,µ t )

+ ( ˆbt Ut ˆbt t Ut)g ˆb,γ, t

g ˆb,µ 0 (x, r) = ρ0(x)(2πµ) d/2e |r|2/(2µ).

Since ρ0(x) = e U0(x)+F0, it can be checked by direct substitution that the solution to this equation is

g ˆb,µ t (x, r) = e Ut(x)+F0(2πµ) d/2e |r|2/(2µ). (102)

Therefore Z

R2d g ˆb,µ t (x, r)dxdr = Z

R2d+1 eaf ˆb,µ t (x, r, a)dxdrda = e Ft+F0, (103)

where the first equality follows from the definition of g ˆb,µ t and the second from its explicit expression and the definition of the free energy that implies R

Rd e Ut(x)dx = e Ft. Equation (103) is the second equation in (95). From (102) we also

NETS: A Non-Equilibrium Transport Sampler

deduce that, given any test function h : Rd R, we have

R2d+1 eah(x)f ˆb,µ t (x, r, a)dxdrda R

R2d+1 eaf ˆb,µ t (x, r, a)dxdrda =

R2d h(x)g ˆb,µ t (x, r)dxdr R

R2d g ˆb,µ t (x, r)dxdr

Rd h(x)e Ut(x)+F0dx R

Rd e Ut(x)+F0

Rd h(x)e Ut(x)dx R

Rd e Ut(x)dx

Rd h(x)e Ut(x)dx

Rd h(x)ρt(x)dx.

Since by definition of f ˆb,µ t (x, r, a) the left hand-side of this equation can be expressed as the ratio of expectations over

(X ˆb,µ t , A ˆb,µ t ) in the first equation in (95) we are done.

To see what happens when µ , let us assume that ϵt = ϵ (time-independent) and integrate (93) using Duhamel principle as

R ˆb,µ t = e µϵ 1t R ˆb,µ 0 µ Z t

0 e µϵ 1(t s) Us(X ˆb,µ s )ds + µ

0 e µϵ 1(t s)d Ws, (105)

Letting µ , we see that the first term at the right hand side of (105) tends to zero, whereas the second one gives

lim µ µ Z t

0 e µϵ 1(t s) Us(X ˆb,µ s )ds = ϵ Ut(X ˆb,µ t ) (106)

Finally, the third term at the right hand side of (105) is a Gaussian process with covariance

Cµ t,t = 2µ2ϵ 1 Z min(t,t )

0 e µϵ 1(t s) µϵ 1(t s)ds = 2µ e µϵ 1|t t | e µϵ 1(t+t ) (107)

As a result, given any test function ϕt, we have

[0,1]2 ϕt Cµ t,t ϕt dtdt = 2ϵ Z 1

0 ϕ2 tdt (108)

which indicates that Cµ t,t converges weakly towards the Dirac distribution ϵδ(t t ). Putting these results together shows

that in the limit as µ , R ˆb,µ t dt converges weakly towards ϵ Ut(X ˆb,µ t )dt +

2ϵd Wt, which, if inserted in (92), reduces this equation to (19). The case where ϵt depends on time can be treated similarly.

D.2. Multimarginal NETS

Let U(α, x) be a potential depending on α D RN with N N as well as x Rd, and assumed to be continuously differentiable in both arguments. Assume that e U(α,x) is integrable in x for all α D, and define the family of PDF

ϱ(α, x) = e U(α,x)+F(α), F(α) = log Z

Rd e U(α,x)dx. (109)

Finally, define the family of matrix-valued ˆB(α, x) : D Rd RN Rd, assumed to be continuously differentiable in both arguments. These quantities allow us to give a generalization of Proposition 2.4 in which we can sample the PDF ϱ(α, x) along any differential path αt D:

NETS: A Non-Equilibrium Transport Sampler

Proposition D.2. Let α : [0, 1] D be a differentiable path in D and define the vector field b : [0, 1] Rd Rd as

ˆbα t (x) = αT t ˆB(αt, x) (110)

as well as U α t (x) = U(αt, x), F α t = F(αt), ρα t = ϱ(α, x) = e U α t (x)+F α t (111)

Let (X ˆb,α t , A ˆb,α t ) solve the coupled system of SDE/ODE

d X ˆb,α t = ˆbα t (X ˆb,α t )dt ϵt U α t (X ˆb,α t )dt +

2ϵtd Wt X ˆb,α 0 ρα 0 , (112)

d A ˆb,α t = ˆbα t (X ˆb,α t )dt U α t (X ˆb,α t ) ˆbα t (X ˆb,α t )dt t U α t (X ˆb,α t )dt, A ˆb,α 0 = 0, (113)

where ϵt > 0 is a time-dependent diffusion coefficient and Wt Rd is the Wiener process. Then for all t [0, 1] and any test function h : Rd R, we have

Rd h(x)ρα t (x)dx = E[e A ˆb,α t h(X ˆb,α t )]

E[e A ˆb,α t ] , e F α t +F α 0 = E[e A ˆb,α t ], (114)

where the expectations are taken over the law of (X ˆb,α t , A ˆb,α t ).

We will omit to give the proof of this proposition since it is a simple consequence of Proposition 2.4. The interest in formulating the problem in this new way is that is it easy to see that the right hand side of (113) (with t F α t added for convenience) can be written as

ˆbα t (x) U α t (x) ˆbα t (x) t U α t (x) + t F α t

= αT t x ˆB(αt, x) ˆB(αt, x) x U(αt, x) αU(αt, x) + αF(αt) . (115)

Therefore if we zero this term for all (α, x) D Rd by picking the right ˆB(α, x) we will obtain that (113) reduces to

d A ˆb,α t = F α t dt, i.e. A ˆb,α t = F α t F α 0 . Finding this optimal B(α, x) can be obtained using the following result:

Proposition D.3 (Multimarginal PINN objective). Consider the objective for ( ˆB, ˆF) given by:

Lα PINN[ ˆB, ˆF]

x ˆB(α, x) ˆB(α, x) x U(α, x) αU(α, x) + α ˆF(α) 2 ˆϱ(α, x)f(α)dxdα (116)

where ˆϱ(α, x) > 0 is a PDF in x for all α D, and f(α) is a PDF in α. Then min B,F Lα PINN[ ˆB, ˆF] = 0, and all minimizers (B, F) are such that and B(α, x) solves

(α, x) D Rd : 0 = x ˆB(α, x) ˆB(α, x) x U(α, x) αU(α, x) + αF(α), (117)

and F(α) is the free energy (109) for all α D.

We will omit to give the proof of this proposition since it is a simple generalization of the proof of Proposition 2.5.

E. Estimating the Drift bt(x) = ϕt(x) via Action Matching (AM)

In general (13) is solved by many bt(x). One way to get a unique (up to a constant in space) solution to this equation is to impose that the velocity be in gradient form, i.e. set bt(x) = ϕt(x) for some scalar-valued potential ϕt(x). If we do so, (13) can be written as ( ϕt(x)ρt) = tρt, and it is easy to see that at all times t [0, 1] the solution to this equation minimizes over ˆϕt the objective Z

2| ˆϕt(x)|2ρt(x) ˆϕt(x) tρt(x) dx

2| ˆϕt(x)|2 + ( t Ut(x) t Ft)ˆϕt(x) ρt(x)dx. (118)

NETS: A Non-Equilibrium Transport Sampler

If we use (3) to set t Ft = R

Rd t Ut(x)ρt(x)dx we can use the objective at the right hand-side of (118) to learn ϕt(x) locally in time (or globally if we integrate this objective on t [0, 1]). Alternatively, we can integrate the objective at the left hand-side of (118) over t [0, T] and use integration by parts for the term involving tρt(x) to arrive at:

Proposition E.1 (Action Matching objective). Given any T (0, 1] consider the objective for ˆϕt(x):

LT AM[ˆϕ] = Z T

2| ˆϕt(x)|2 + t ˆϕt(x) ρt(x)dxdt

ˆϕ0(x)ρ0(x) ˆϕT (x)ρT (x) dx. (119)

Then the minimizer ϕt(x) of (25) is unique (up to a constant) and bt(x) = ϕt(x) satisfies (13) for all t [0, T].

This objective is analogous to the loss presented in Neklyudov et al. (2023), but adapted to the sampling problem. In practice, we will use again T (0, 1] for annealing, but ultimately we are interested in the result at T = 1. Note that, unlike with the PINN objective (25), it is crucial that we use the correct ρt(x) in the AM objective (119): that is, unlike (25), (119) cannot be turned into an off-policy objective.

Proof. By integrating by parts in time the term involving tϕt in the AM objective (120), we can express is as

LT AM[ˆϕ] = Z T

2| ˆϕt(x)|2ρt(x) ϕt(x) tρt(x) dxdt. (120)

This is a convex objective in ˆϕ whose minimizers satisfy

( ˆϕtρt) = tρt. (121)

This is (13) written in terms of bt(x) = ϕt(x). The solution of this equation is unique up to a constant by the Fredholm alternative since its right hand-side satisfies the solvability condition R

Rd tρt(x)dx = 0.

If we use ˆbt(x) = ˆϕt(x) in the SDEs in (19) and (20), we need to compute ˆbt(x) = ˆϕt(x), which is computationally costly. Fortunately, when ϵt > 0, the calculation of this Laplacian can be avoided by using the following alternative equation for Aˆb t:

ϵt [ˆϕt(X ˆb t ) ˆϕ0(X ˆb 0)] Bt, (122)

where d Bt = t Ut(X ˆb t )dt + 1

ϵt t ˆϕt(X ˆb t ) + 1

ϵt | ˆϕt(X ˆb t )|2dt

2 ϵt ˆϕt(X ˆb t ) d Wt. (123)

To derive these equations, notice f ˆbt(x) = ˆϕt(x), the SDEs (19) and (20) reduce to

d X ˆb t = ϵt Ut(X ˆb t )dt + ˆ ϕt(X ˆb t )dt +

2ϵtd Wt, ˆX ˆb 0 ρ0, (124)

d A ˆb t = ˆϕt(X ˆb t )dt Ut(X ˆb t ) ˆϕ(X ˆb t )dt t Ut(X ˆb t )dt, A ˆb 0 = 0, (125)

Since by Itˆo formula we have

dˆϕt(X ˆb t ) = t ˆϕt(X ˆb t )dt ϵt ˆϕt(X ˆb t ) Ut(X ˆb t )dt + | ˆϕt(X ˆb t )|2dt

2ϵt ˆϕt(X ˆb t ) d Wt + ϵt ˆϕt(X ˆb t )dt, (126)

we can express

ˆϕt(X ˆb t )dt = 1

ϵt dˆϕt(X ˆb t )dt 1

ϵt t ˆϕt(X ˆb t )dt + ˆϕt(X ˆb t ) Ut(X ˆb t )dt

ϵt | ˆϕt(X ˆb t )|2dt r

2 ϵt ˆϕt(X ˆb t ) d Wt. (127)

NETS: A Non-Equilibrium Transport Sampler

If we insert this expression in the SDE (125), we can write it as

d A ˆb t = 1

ϵt dˆϕt(X ˆb t )dt + d Bt. (128)

where d Bt is given by (123). Integrating (128) gives (122).

In terms of implementation, we can write the AM loss (119) as

LT AM[ˆϕ] = Z T

E e A ˆb t 1

2| ˆϕt(Xˆb t )|2 + tϕt(Xˆb t )

E[e Aˆb t] dt + E e A ˆb 0ϕ0(Xˆb 0)

E[e Aˆb 0] E e A ˆb T ϕT (Xˆb T )

E[e Aˆb T ] . (129)

These expectations can be estimated empirically over solutions to (19) and (20) with ˆbt(x) = ˆϕt(x). The above implementation is detailed in Algorithm 1.

F. Link with CMCD (Vargas et al., 2024)

Consider the process Y ˆb t solution to the SDE

d Y ˆb t = ϵt Ut(Y ˆb t )dt + 2ϵt log ρ ˆb t(Y ˆb t )dt + ˆbt(Y ˆb t )dt +

2ϵtd Wt, Y ˆb 0 ρ0. (130)

where ρˆb t denotes the PDF of the process Xˆb t defined by the SDE (19), i.e. the solution to the FPE

tρ ˆb t = ϵt ( Utρ ˆb t + ρ ˆb t) (ˆbtρ ˆb t), ρ ˆb 0 = ρ0 (131)

The process Y ˆb t has a simple interpretation: it is the time-reversed of the process run using the time-reversed potential U1 t and ˆb1 t: that is, if the additional drift ˆbt was the perfect one solution to (23), the law of Xˆb = (Xˆb t )t [0,1] and Y b = (Y ˆb t )t [0,1] should coincide. This suggests to learn b using as objective a divergence of the path measure of Xˆb from

that of Y ˆb. This is essentially what is suggested in (Vargas et al., 2024), and for the reader convenience let us re-derive some of their results in our notations.

The Kullback-Leibler divergence (or relative entropy) of the path measure of Xˆb from that of Y ˆb reads

KL(X ˆb Y ˆb) = 1

0 E | Ut(X ˆb t ) + log ρ ˆb t(X ˆb t )|2 dt (132)

This objective is akin to the one used in score-based diffusion modeling (SBDM) and simply says that one way to adjust ˆb is by matching the score of ρˆb t to that of ρt. As written (132) is not explicit since we do not know log ρˆb t. We can however make it explicit after a few manipulations similar to those used in SBDM. To this end, notice first that, by Ito formula, we have d log ρ ˆb t(X ˆb t ) = log ρ ˆb t(X ˆb t ) ( ϵt Ut(X ˆb t ) + ˆbt(X ˆb t ))dt + ϵt log ρ ˆb t(X ˆb t )dt

2ϵt log ρ ˆb t(X ˆb t ) d Wt (133)

which implies that

ϵt log ρ ˆb t(X ˆb t ) Ut((X ˆb t )dt = d log ρ ˆb t(X ˆb t ) + log ρ ˆb t(X ˆb t ) ˆbt(X ˆb t )dt

+ ϵt log ρ ˆb t(X ˆb t )dt +

2ϵt log ρ ˆb t(X ˆb t ) d Wt (134)

Inserting this expression in (132) after expanding the square, and noticing that the martingale term involving d Wt disappears by Ito isometry and that the term d log ρˆb t(Xˆb t ) can be integrated in time we arrive at

KL(X ˆb Y ˆb) = 1

0 E | Ut(X ˆb t )|2 + | log ρ ˆb t(X ˆb t )|2 + 2 Ut(X ˆb t ) log ρ ˆb t(X ˆb t ) dt

0 E ϵt| Ut(X ˆb t )|2 + ϵt| log ρ ˆb t(X ˆb t )|2 + ϵt Ut(X ˆb t ) log ρ ˆb t(X ˆb t ) dt

0 E log ρ ˆb t(X ˆb t ) ˆbt(X ˆb t ) + ϵt log ρ ˆb t(X ˆb t ) dt

4E[log ρ0(X0)] 1

4E[log ρ ˆb 1(X1)]

NETS: A Non-Equilibrium Transport Sampler

where we used ρˆb 0 = ρ0. We can now use the following identities, each obtained using ρˆb t log ρˆb t = ρˆb t and one integration by parts:

E | log ρ ˆb t(X ˆb t )|2 = Z

Rd | log ρ ˆb t(x)|2ρ ˆb t(x)dx

Rd log ρ ˆb t(x) ρ ˆb t(x)dx

Rd log ρ ˆb t(x)ρ ˆb t(x)dx

= E log ρ ˆb t(X ˆb t ) ,

E Ut(X ˆb t ) log ρ ˆb t(X ˆb t ) = Z

Rd Ut(x) log ρ ˆb t(x)ρ ˆb t(x)dx

Rd Ut(x) ρ ˆb t(x)dx

Rd Ut(x)ρ ˆb t(x)dx

= E Ut(X ˆb t )

and E log ρ ˆb t(X ˆb t ) ˆbt(X ˆb t ) = Z

Rd log ρ ˆb t(x) ˆbt(x)ρ ˆb t(x)dx

Rd ˆbt(x)ρ ˆb t(x)dx

= E ˆbt(X ˆb t )

Inserting these expressions in (135), it reduces to

KL(X ˆb Y ˆb) = 1

0 E ϵt| Ut(X ˆb t )|2 ϵt Ut(X ˆb t ) bt(X ˆb t ) dt

4E[log ρ0(X0)] 1

4E[log ρ ˆb 1(X1)]

This objective is still not practical because it involves log ρˆb 1, which is unknown. There is however a simple way to fix this, by adding a term in the Kullback-Leibler divergence (132)

KL (X ˆb Y ˆb) = KL(X ˆb Y ˆb) + 1

4E[log(ρ ˆb 1(X ˆb 1)/ρ1(X ˆb 1)]) (140)

This additional term is proportional to the Kullback-Leibler divergence of ρˆb 1 from the target PDF ρ1. Using (139) as well as ρ1(x) = e U1(x)+F1, we can now express (140) as

KL (X ˆb Y ˆb) = 1

0 E ϵt| Ut(X ˆb t )|2 ϵt Ut(X ˆb t ) bt(X ˆb t ) dt

4E[log ρ0(X0)] + 1

4E[U1(X1)] 1

This is Equation (24) in (Vargas et al., 2024) in which we set ˆϕt(x) = ˆbt(x) and we used that, for any ct : Rd Rd, we have

0 ct(X ˆb t ) d Wt =

0 E ct(X ˆb t ) dt. (142)

Note that we can neglect the term 1

4E[log ρ0(X0)] in (139) since it does not depend on ˆb, so that the minimization of (139) can be cast into the minimization of (after multiplication by 4) Z 1

0 E ϵt| Ut(X ˆb t )|2 ϵt Ut(X ˆb t ) bt(X ˆb t ) + E[U1(X1) F1]

ϵt| Ut(x)|2 ϵt Ut(x) bt(x)) ρ ˆb t(x)dx + Z

Rd(U1(x) F1)ρ ˆb 1(x)dx

NETS: A Non-Equilibrium Transport Sampler

where ρˆb t solves (131).

Let us check that the minimizer of (143) is ˆbt = bt, the solution to (23), so that we also have ρˆb t = ρb t = ρt. To this end, notice that the minimization of (143) can be performed with the method Lagrange multiplier, using the extended objective Z 1

ϵt| Ut(x)|2 ϵt Ut(x) ˆbt(x) ρ ˆb t(x)dxdt + Z

Rd(U1(x) F1)ρ ˆb 1(x)dx

Rd λt(x) tρ ˆb t ϵt ( Utρ ˆb t + ρ ˆb t) + (ˆbtρ ˆb t) dxdt

where λt(x) is the Lagrange multiplier to be determined. Taking the first variation of this objective over λ, ρˆb, and ˆb, we arrive at the Euler-Lagrange equations

0 = tρ ˆb t ϵt ( Utρ ˆb t + ρ ˆb t) + (ˆbtρ ˆb t), ρ ˆb 0 = ρ0,

0 = ϵt| Ut|2 ϵt Ut ˆbt

tλt + ϵt Ut λt ϵt λt ˆbt λt λ1 = U1 + F1

0 = ρ ˆb t ρ ˆb t λt

We can check that ˆbt(x) = bt(x), ρˆb t(x) = ρt(x) = e Ut(x)+Ft, and λt(x) = Ut(x) + Ft is a solution: indeed this solves the first and the last equations in (145) and reduces the second to

0 = ϵt| Ut|2 ϵt Ut bt

+ t Ut t Ft ϵt| Ut|2 + Ut + bt Ut = bt + t Ut t Ft + bt Ut

which is satisfied since bt solves (23).

G. Hutchinson s Trace Estimator for the Evaluation of ˆbt(x)

It is well-known that, if bt(x) is bounded,

2δ E η ˆbt(x + δη) ˆbt(x δη) + O(δ2), (147)

where 0 < δ 1 is an adjustable parameter and η N(0, Id). Indeed we have

1 2δ η ˆbt(x + δη) ˆbt(x δη) = ηT bt(x)η + O(δ2), (148)

which implies (147) after taking the expectation over η.

We can use this formula to estimate the PINN loss via

LT,δ PINN[ˆb, ˆF] = Z T

0 E h Rδ t(xt, η)Rδ t(xt, η ) i dt (149)

where the expectation is now taken independent over xt ˆρt, η N(0, Id), and η N(0, Id), and we defined

Rδ t(x, η) = 1

2δ η ˆbt(x + δη) ˆbt(x δη) Ut(x) ˆbt(x) t Ut(x) + t ˆFt (150)

The expectation in (149) is unbiased since η η , and its accuracy can be controlled by lowering δ.

H. Details on Numerical Experiments

In the following we include details for reproducing the experiments presented in Section 3. An overview of the training procedure is given in Algorithm 1. Note that the SDE for the weights can replaced with (122) when learning with ˆϕt, as one would do with the action matching loss (119).

NETS: A Non-Equilibrium Transport Sampler

H.1. Performance metrics

Effective sample size. We can compute the self-normalized ESS as

N 1 PN i=1 exp Ai t 2

N 1 PN i=1 exp 2Ai t (151)

at time t along the SDE trajectory. We can use the ESS both as a quality metric and as a trigger for when to perform resampling of the walkers based on the weights, using, e.g. systematic resampling (Doucet et al., 2001; Boli c et al., 2004). Systematic resampling is one of many resampling techniques from particle filtering wherein some walkers are killed and some are duplicated based on their importance weights.

2-Wasserstein distance. The 2-Wasserstein distance reported in Table 1 were computed with 2000 samples from the model and the target density using the Python Optimal Transport library.

Maximum Mean Discrepancy (MMD). We use the MMD code from (Blessing et al., 2024) to benchmark the performance of NETS on Neal s funnel. We use the definition of the MMD as

MMD2 (ˆρ, ρ) 1 n(n 1)

i,j k (ˆxi, ˆxj) + 1 m(m 1)

i,j k (xi, xj) 2 nm

j k (ˆxi, xj) (152)

where ˆx ˆρ is from the model distribution and x ρ is from the target and k : Rd Rd R is chosen to be the radial basis kernel with unit bandwidth.

H.2. 40-mode GMM

The 40-mode GMM is defined with the mean vectors given as:

µ1 = ( 0.2995, 21.4577) , µ2 = ( 32.9218, 29.4376) ,

µ3 = ( 15.4062, 10.7263) , µ4 = ( 0.7925, 31.7156) ,

µ5 = ( 3.5498, 10.5845) , µ6 = ( 12.0885, 7.8626) ,

µ7 = ( 38.2139, 26.4913) , µ8 = ( 16.4889, 1.4817) ,

µ9 = (15.8134, 24.0009) , µ10 = ( 27.1176, 17.4185) ,

µ11 = (14.5287, 33.2155) , µ12 = ( 8.2320, 29.9325) ,

µ13 = ( 6.4473, 4.2326) , µ14 = (36.2190, 37.1068) ,

µ15 = ( 25.1815, 10.1266) , µ16 = ( 15.5920, 34.5600) ,

µ17 = ( 25.9272, 18.4133) , µ18 = ( 27.9456, 37.4624) ,

µ19 = ( 23.3496, 34.3839) , µ20 = (17.8487, 19.3869) ,

µ21 = (2.1037, 20.5073) , µ22 = (6.7674, 37.3478) ,

µ23 = ( 28.9026, 20.6212) , µ24 = (25.2375, 23.4529) ,

µ25 = ( 17.7398, 1.4433) , µ26 = (25.5824, 39.7653) ,

µ27 = (15.8753, 5.4037) , µ28 = (26.8195, 23.5521) ,

µ29 = (7.4538, 31.0122) , µ30 = ( 27.7234, 20.6633) ,

µ31 = (18.0989, 16.0864) , µ32 = ( 23.6941, 12.0843) ,

µ33 = (21.9589, 5.0487) , µ34 = (1.5273, 9.2682) ,

µ35 = (24.8151, 38.4078) , µ36 = ( 30.8249, 14.6588) ,

µ37 = (15.7204, 33.1420) , µ38 = (34.8083, 35.2943) ,

µ39 = (7.9606, 34.7833) , µ40 = (3.6797, 25.0242)

NETS: A Non-Equilibrium Transport Sampler

These means follow the definition given in the FAB (Midgley et al., 2023) code base that has been subsequently used in recent papers. The time-dependent potential Ut(x) is given by the interpolation of means. To directly compare CMCD and NETS on this task, i.e. compare the log-variance path loss versus the PINN, we test the same mean-interpolation potential for CMCD using their code. The results are summarized in Table 3.

Method nstep = 100 nstep = 256 nstep = 100 nstep = 256

NETS-PINN 0.979 0.002 0.988 0.001 3.14 0.46 3.01 0.49

CMCD-LV 0.162 0.030 0.51 0.04 9.13 1.17 3.62 0.55

Table 3. Comparison of NETS and CMCD with the log-variance loss on GMM-40 using Effective Sample Size (ESS) and Wasserstein-2 distance for the mean-interpolation potential Ut.

H.3. Neal s 10-d funnel

The Neal s Funnel distribution is a 10-d probability distribution defined as

x0 N(0, σ2), x1:9 N(0, ex0) (153)

where σ = 3 and we use subscripts here as a dimensional index and not as a time index like in the rest of the paper. Following this, we use as a definition of the interpolating potential:

2x2 0(1 t + t

i=1 e tx0x2 i + (d 1)tx0 (154)

so that at t = 0, we have U0(x) = 1

2 Pd 1 i=1 x2 i and at time t = 1 we have the funnel potential given as U1(x) = 1 2σ2 x0 + 1

2 Pd 1 i=1 e x0x2 i + (d 1)x0.

H.4. 50-d Mixture of Student-T distributions

Following (Blessing et al., 2024), we use their mixture of 10 student-T distributions in 50 dimensions. We construct Ut via interpolation of means from a single standard student-T distribution (mean 0). We use the same neural network as used in the GMM experiments.

To further drive home the fact that our annealed Langevin dynamics with transport can be taken post-training to the ϵ limit to approach perfect sampling, we provide the following ablation from our model learned with the action matching loss given in Figure 4.

I. Details of the φ4 model

We consider the Euclidean scalar ϕ4 theory given by the action

SEuc[φ] = Z µφ(x) µφ(x) + m2φ2(x) + λφ4(x) d Dx (155)

where we use Einstein summation to denote the dot product with respect to the Euclidean metric and D is the spacetime dimension. We are interested in acquiring a variant of this expression that provides a fast computational realization when put onto the lattice. Using Green s identity (integrating by parts) we note that Z ( µφ(x) µφ(x)ddx = Z µφ µφ ddx = Z φ(x) µ µφ(x)ddx + vanishing surface term (156)

SEuc[φ] = Z φ(x) µ µφ(x) + m2φ2(x) + λφ4(x) ddx. (157)

NETS: A Non-Equilibrium Transport Sampler

Diﬀusion coeﬃcient ϵ

Mo S (d = 50) Target, NETS-AM

Figure 4. Reduction in W2 distance from taking the ϵ limit in sampling with NETS. Note that the resolution of the SDE integration must increase to accommodate the higher stochasticity. Average taken over 3 sampling runs of 2000 walkers each.

Discretizing SEuc onto the lattice

Λ = {a(n0, . . . , nd 1) | ni {0, 1, 2, . . . , L}, i = 0, 1, . . . , d, a R+},

where a is the lattice spacing used to define the physical point x = an, we use the forward difference operator to define

a[φ(x + µ) φ(x)] µ µφ(x) 1 a2 [φ(x + µ) 2φ(x) + φ(x µ)]. (158)

Using these expressions, we write the discretized lattice action as

x Λ a D " D X

a2 φx+µφx 2φ2 x + φx µφx] + m2φ2 x + λφ4 x

2Da 2φ2 x a 2 X

φx+µφx + φx µφx] + m2φ2 x + λφ4 x

2Da 2φ2 x 2a 2 X

µ [φxφx+µ] + m2φ2 x + λφ4 x

µ φxφx+µ + (2a 2D + m2)φ2 x + λφ4 x

where we have used the fact that on the lattice P

x φxφx+ˆµ = P

x φx ˆµφx to get the third equality. It is useful to put the action in a form that is independent of the lattice spacing a. To do so, we introduce the re-scaled lattice field as

φx a D/2 1φx, m2 a2m2, and λ a4 Dλ. (163)

Plugging these rescalings into (162) gives us the final expression

+ (2D + m2)φ2 x + λφ4 x, (164)

which we are to use in simulation.

NETS: A Non-Equilibrium Transport Sampler

I.1. Free theory λ = 0

Turning off the interaction makes it possible to analytically solve the theory. To do this, introduce the discrete Fourier transform relations

x φxe ik x (165)

k φkeik x (166)

for discrete wavenumbers k = 2lπ

L with l = 0, , L 1. Plugging in (166) into the first part of (162), we get the expanded sum

µ ˆφx ˆφx+µ

k φkφk ei(k+k ) xeik µ (167)

k δk, k φkφk eik µ (168)

k φkφ ke ikµ (169)

k φkφk e ikµ (170)

k |φk|2[cos kµ + i sin kµ] = X

k |φk|2 cos kµ (171)

where ϕ indicates conjugation, and we got the first equality by the orthogonality of the Fourier modes, the second by the Kronecker delta, and the third by the reality of the scalar field. Proceeding similarly for the terms proportional to φ2 gives us the expression

m2 + 2D 2 X

The above equation can be written in quadratic form to highlight that the field may be sampled analytically

k φk Mk, kφ k (173)

where Mk, k =

m2 + 2D 2 X

δk, k (174)

Note that this free theory can be sampled for any m2 > 0.

I.2. φ4 numerical details

We numerically realize the above lattice theory in D=2 spacetime dimensions. We use an interpolating potential with time dependent m2 t = (1 t)m2 0 + tm2 1, λt = (1 t)λ0 + tλ1 where λ0 is always chosen to be 0 (though we note that you could run this sampler for any U0 that you could sample from easily, not just analytically but also with existing MCMC methods). For the L = 20 (d = L L = 400 dimensional) experiments, we identify the critical point of the theory (where the lattices go from ordered to disordered) using HMC by studying the distribution of the magnetization of the field configurations as M[φi(x)] = P

x φi(x), where summation is taken over all lattice sites on the ith lattice configuration. We identify this at m2 1 = 1.0, λ1 = 0.9 and use these as the target theory parameters on which to perform the sampling. For the L = 16 test (d = 256), we go past this phase transition into the ordered phase of the theory, which we identify via HMC simulations at m2 1 = 1.0, λ1 = 0.8.