# underdamped_diffusion_bridges_with_applications_to_sampling__c57eedf1.pdf Published as a conference paper at ICLR 2025 UNDERDAMPED DIFFUSION BRIDGES WITH APPLICATIONS TO SAMPLING Denis Blessing ,1, Julius Berner ,2, , Lorenz Richter ,3,4, Gerhard Neumann1,5 1Karlsruhe Institute of Technology, 2NVIDIA, 3Zuse Institute Berlin, 4dida Datenschmiede Gmb H, 5FZI Research Center for Information Technology We provide a general framework for learning diffusion bridges that transport prior to target distributions. It includes existing diffusion models for generative modeling, but also underdamped versions with degenerate diffusion matrices, where the noise only acts in certain dimensions. Extending previous findings, our framework allows to rigorously show that score matching in the underdamped case is indeed equivalent to maximizing a lower bound on the likelihood. Motivated by superior convergence properties and compatibility with sophisticated numerical integration schemes of underdamped stochastic processes, we propose underdamped diffusion bridges, where a general density evolution is learned rather than prescribed by a fixed noising process. We apply our method to the challenging task of sampling from unnormalized densities without access to samples from the target distribution. Across a diverse range of sampling problems, our approach demonstrates state-of-the-art performance, notably outperforming alternative methods, while requiring significantly fewer discretization steps and no hyperparameter tuning. 1 INTRODUCTION In this paper we propose a general diffusion-based framework for sampling from a density ptarget = ρtarget Rd ρtarget(x) dx, (1) where ρtarget C(Rd, R 0) can be evaluated pointwise, but the normalization constant Z is typically intractable. This task is of great practical relevance in the natural sciences, e.g., in fields such as molecular dynamics and statistical physics (Stoltz et al., 2010; Schopmans & Friederich, 2025), but also in Bayesian statistics (Gelman et al., 2013). Recently, multiple approaches based on diffusion processes have been proposed, where the overall idea is to learn a stochastic process in such a way that it transports an easy prior distribution to the potentially complicated target over an artificial time. Typically, the process is defined as an ordinary Itˆo diffusion, in particular, demanding non-degenerate noise. In this work, we aim to generalize this setting to diffusion processes with degenerate noise. This is motivated by the following model from statistical physics. Classical sampling approaches based on stochastic processes have been extensively conducted using some version of the overdamped Langevin dynamics d Xs = log ptarget(Xs) ds + 2 d Ws, X0 pprior, (2) whose stationary distribution is given by ptarget (under some rather mild technical assumptions on the target and on the prior pprior). Furthermore, we can define an extended dynamics by introducing an additional variable, bringing the so-called underdamped Langevin dynamics d Xs = Ys ds, X0 pprior, (3a) d Ys = ( log ptarget(Xs) Ys) ds + 2 d Ws, Y0 N(0, Id), (3b) Equal contribution. Work partially done at Caltech. Published as a conference paper at ICLR 2025 uncontrolled controlled overdamped underdamped Figure 1: Illustration of uncontrolled (see (2) and (3)) and controlled (see (4) and (15)) diffusion processes in the overdamped and underdamped regime, transporting the Gaussian prior distribution to the target. For the underdamped case, we show both the positional coordinate (left/blue) as well as the velocity (right/black). While the underdamped version enjoys better convergence guarantees, both uncontrolled diffusions only converge asymptotically. Learning the control, we can achieve convergence in finite time. where now the stationary distribution is given by τ(x, y) := ptarget(x)N(y; 0, Id) (and π(x, y) := pprior(x)N(y; 0, Id) can be defined as an extended prior distribution). Intuitively, the y-variable can be interpreted as a velocity, which is coupled to the space variable x via Hamiltonian dynamics. While both (2) and (3) converge to the desired (extended) target distribution after infinite time, their convergence speed can be exceedingly slow, in particular for multimodal targets (Eberle et al., 2019). At the same time, it has been observed numerically that the underdamped version can be significantly faster (Stoltz et al., 2010). This might be attributed to the fact that the Brownian motion is only indirectly coupled to the space variable, leading to smoother paths of X and lower discretization error in numerical integrators (since log ptarget only depends on X, but not on Y ). In particular, for smooth and strongly log-concave1 targets, the number of steps to obtain KL divergence ε can be reduced from e O(d/ε2) to e O( d/ε) (Ma et al., 2021). The idea of learned diffusion-based sampling is to reach convergence to multimodal targets after finite time. In particular, for overdamped diffusion models, the convergence rate can be shown to match the one of Langevin dynamics without the need for log-concavity assumptions as long as the learned model exhibits sufficiently small approximation error (Chen et al., 2022). In the overdamped setting, this can be readily formulated as adding a control function to the dynamics (2), d Xs = ( log ptarget(Xs) + u(Xs, s)) ds + 2 d Ws, (4) where the task is to learn u C(Rd [0, T], Rd) as to reach XT ptarget (Richter & Berner, 2024; Vargas et al., 2024); see Figure 1 for an illustration. It is now natural to ask the question whether we can use the same control ideas to the (typically better behaved) underdamped dynamics (3). Motivated by this guiding question this paper includes the following: Controlled diffusions with degenerate noise: Building on previous work based on path space measures, we generalize diffusion-based sampling to processes with degenerate noise, in particular including controlled underdamped Langevin equations (Section 2). Underdamped methods in generative modeling: This framework can be used to derive and analyze underdamped methods in generative modeling. In particular, we derive the ELBO and variational gap for diffusion bridges where both forward and reverse-time processes are learned. Novel underdamped samplers: Moreover, our framework culminates in underdamped versions of existing sampling methods and in particular in the novel underdamped diffusion bridge sampler (Section 3). In extensive numerical experiments, we can demonstrate significantly improved performance of our method. Numerical integrators and ablation studies: We provide careful ablation studies of our improvements, including the benefits of the novel integrators for controlled diffusion bridges as well as end-to-end training of hyperparameters (Section 4). We note that the latter eliminates the need for tuning and also significantly improves existing methods in the overdamped regime. 1Or, more general, log-concave outside of a region. Published as a conference paper at ICLR 2025 1.1 RELATED WORK Many approaches to sampling problems build an augmented target by using a sequence of densities bridging the prior and target distribution and defining forward and backward kernels to approximately transition between the densities, often referred to as annealed importance sampling (AIS) (Neal, 2001). For instance, taking uncorrected overdamped Langevin kernels leads to Unadjusted Langevin Annealing (ULA) (Thin et al., 2021; Wu et al., 2020). Moreover, Monte Carlo Diffusion (MCD) optimizes the extended target distribution in order to minimize the variance of the marginal likelihood estimate (Doucet et al., 2022b). Going one step further, Controlled Monte Carlo Diffusion (CMCD) (Vargas et al., 2024) proposes an objective for directly optimizing the transition kernels to match the annealed density and Sequential Controlled Langevin Diffusion (SCLD) (Chen et al., 2025) connects this attempt with resampling strategies. Furthermore, methods relying on diffusion processes that prescribe the backward transition kernel have recently been suggested. For instance, those include the Path Integral Sampler (PIS) (Zhang & Chen, 2022; Vargas et al., 2023b; Richter, 2021), Time-Reversed Diffusion Sampler (DIS) (Berner et al., 2024), Diffusion generative flow samplers (DGFS) (Zhang et al., 2024), Denoising Diffusion Sampler (DDS) (Vargas et al., 2023a), as well as the Particle Denoising Diffusion Sampler (Phillips et al., 2024). For those diffusion-based samplers, the optimal forward transition corresponds to the score of the current density, which can also be learned via its associated Fokker-Planck equation (Sun et al., 2024) or its representation via the Feynman-Kac formula (Akhound-Sadegh et al., 2024). Finally, there are methods learning both kernels separately, e.g., the (Diffusion) Bridge2 Sampler (DBS) (Richter & Berner, 2024). For some of the above methods improved convergence has been observed when using underdamped versions or Hamiltonian dynamics, which can be viewed as a form of momentum. In particular, ULA has been extended to Uncorrected Hamiltonian Annealing (UHA) (Geffner & Domke, 2021; Zhang et al., 2021), MCD has been extended to Langevin Diffusion Variational Inference (LDVI) (Geffner & Domke, 2023), and the works on DDS and CMCD also proposed underdamped versions. Our proposed framework in principle encompasses all these works as special cases (see Table 2 in the appendix), noting, however, that each of the previously existing method brings some respective additional details. Moreover, we can easily derive novel algorithms using our framework, ranging from an underdamped version of DIS to an underdamped version of the Diffusion Bridge Sampler (Appendix A.9). Further, our unifying framework allows us to easily share integrators and training techniques for the different methods. First, we improve tuning for all considered methods by learning hyperparameters end-to-end, overall resulting in better performance (Figure 5). Second, we advance underdamped methods with our novel integrator (Figure 4 and Figure 9). Third, we show how to scale DBS to more complex targets by using a suitable parametrization (Table 4 & Figure 11) and divergence-free training objective (Proposition 2.3 vs. Proposition A.7). This makes our underdamped version of DBS a state-of-the-art method across a wide range of tasks (Table 1, Figure 3, & Table 3). 2 DIFFUSION BRIDGES WITH DEGENERATE NOISE In this section, we lay the theoretical foundations for diffusion bridges with degenerate noise, extending the frameworks suggested in Richter & Berner (2024) and Vargas et al. (2024). Relating to the example from the introduction, we note that this includes cases where the noise only appears in certain dimensions of the stochastic process and in particular includes underdamped dynamics. We refer to Appendices A.1 and A.2 for a summary of our notation and assumptions. The general idea of diffusion bridges is to learn a stochastic process that transports a given prior density to the prescribed target. This can be achieved via the concept of time-reversal (see, e.g., Figure 2). To this end, let us define the forward and reverse-time SDEs d Zs = (f + η u) (Zs, s) ds + η(s) d Ws, Z0 π, (5) d Zs = (f + η v) (Zs, s) ds + η(s) d Ws, ZT τ, (6) on the state space RD, where d Ws and d Ws denote forward and backward Brownian motion increments (see Appendix A.1 for details), respectively, both living in dimension d D. The function 2We clarify the connection to Schr odinger bridges and other diffusion bridges in Remark A.1. Published as a conference paper at ICLR 2025 L(u, v) = D( Pu,π| u , v arg minu,v L(u, v) Figure 2: Illustration of our general framework for learning diffusion bridges with degenerate noise. Left: We consider forward and reverse-time SDEs (see (5) and (6)), starting at the (extended) prior π and target τ and being controlled by u and v, respectively. Middle: We learn optimal controls u and v of the corresponding SDEs (5) and (6), respectively, by minimizing a suitable divergence D between the associated path measures Pu,π and Pv,τ on the SDE trajectories (Problem 2.1). Right: In general, the optimal controls are not unique and we depict an alternative solution (transparent). However, every solution leads to a perfect time-reversal and, in particular, represents a diffusion bridge with the correct marginals π at time s = 0 and τ at time s = T. We note that the trajectories can be smooth since we allow for degenerate diffusion coefficients, where the Brownian motion noise only acts in certain dimensions. f C(RD [0, T], RD) is typically fixed and maps to the full space, whereas the control functions u, v C(RD [0, T], Rd) will be learned as to approach the desired bridge. In our setting, the noise coefficient η C([0, T], RD d) may be degenerate in the sense that it has the shape η = (0, σ) , where 0 RD d d and σ C([0, T], Rd d) is assumed to be invertible for each t [0, T]. Importantly, the (scaled) control functions and the (scaled) Brownian motions operate in the same dimensions. For simplicity, we assume that f = ( bf, ef) with ef C(RD [0, T], Rd) and bf C(RD [0, T], RD d), where bf only depends on the last d coordinates of its D-dimensional inputs. Referring to the underdamped Langevin equation (3), we may think of Z = (X, Y ) . The general idea is to learn the control functions u and v such that the two processes defined in (5) and (6) are time reversals of each other. This task can be approached via measures on the space of continuous trajectories C([0, T], RD), also called path space (see Appendix A.1 for details). To this end, let us denote by Pu,π the measure of the forward process (5) and by Pv,τ the measure of the backward process (6). We may consider the following optimization problem; see also Figure 2. Problem 2.1 (Time-reversal). Let D : P P R 0 be a divergence and let U C(RD [0, T], Rd) be the set of admissible controls3. We aim to identify optimal controls u , v such that u , v arg min u,v U U D Pu,π| Clearly, if we can drive the divergence in (7) to zero, we have solved the time-reversal task and it readily follows for the time marginals4 that Pu ,π T = τ and Pv ,τ 0 = π. We note that optimality in Problem 2.1 can be expressed by a local condition on the level of time marginals for any time in between prior and target. Lemma 2.2 (Nelson s relation). The following statements are equivalent: (ii) u( , t) v( , t) = η (t) log Pu,π t for all t (0, T] and Pu,π T = τ. (iii) u( , t) v( , t) = η (t) log Pv,τ t for all t [0, T) and Pv,τ 0 = π. Proof. The equivalence follows from the classical Nelson relation (Nelson, 1967; Anderson, 1982; F ollmer, 1986), which also holds for degenerate η; cf. Haussmann & Pardoux (1986); Millet et al. (1989); Chen et al. (2022). However, we cannot directly use Lemma 2.2 to approach Problem 2.1 since the marginals Pv,τ t and Pu,π t are typically intractable. Instead, to turn (7) into a feasible optimization problem, we need a 3We refer to Appendix A.2 for assumptions on U. We note that a divergence D is zero if and only if both its arguments coincide (in the space of probability measures P on C([0, T], RD); see Appendix A.1). 4We denote the marginal of a path space measure P at time t [0, T] by Pt. Similarly, we denote by Ps|t the conditional distribution of Ps given Pt; see Appendix A.1. Published as a conference paper at ICLR 2025 way to explicitly compute divergences between path measures, which (in analogy to a likelihood ratio) typically involves the Radon-Nikodym derivative between those measures. A key observation is that this can indeed be achieved for forward and reverse-time processes, as stated in the following proposition. Proposition 2.3 (Likelihood of path measures). Let η+(s) be the pseudoinverse of η(s) for each s [0, T]. Then for Pu,π-almost every Z C([0, T], RD) it holds that Pv,τ (Z) = log π(Z0) 0 η+f + u 2(Zs, s) ds + 1 0 η+f + v 2(Zs, s) ds 0 (η+f + u)(Zs, s) η+(s) d Zs Z T 0 (η+f + v)(Zs, s) η+(s) Proof. Following Vargas et al. (2024, proof of Proposition 2.2), the proof applies the Girsanov theorem to the forward and reverse-time processes; see Appendix A.5. We refer to Proposition A.7 in the appendix for an alternative version of Proposition 2.3, which, for non-degenerate noise, has been used to define previous diffusion bridge samplers (Richter & Berner, 2024). However, this version relies on a divergence instead of backward stochastic processes, which renders it prohibitive for high dimensions and does not guarantee an ELBO after discretization; see also Remark 3.2. It is important to highlight that the optimization task in Problem 2.1 allows for infinitely many solutions. For numerical applications one may either accept this non-uniqueness (cf. Richter & Berner (2024)) or add additional constraints, such as regularizers (leading to, e.g., the so-called Schr odinger bridge (Vargas et al., 2021; De Bortoli et al., 2021)), a prescribed density evolution (Vargas et al., 2024) or a fixed noising process (Berner et al., 2024). Those different choices lead to different algorithms, for which we can now readily state corresponding degenerate (and thus underdamped) versions using our framework, see Appendix A.9. Divergences and loss functions for sampling. In order to solve Problem 2.1, we need to choose a divergence D, in turn leading to a loss function L : U U R 0 via L(u, v) := D( Pu,π| Pv,τ). A common choice is the Kullback-Leibler (KL) divergence, which brings the loss LKL(u, v) := DKL Pu,π| Pv,τ = EZ Pu,π While we will focus on the KL divergence in our experiments, we mention that our framework can be applied to arbitrary divergences. In particular, one can use divergences that allow for off-policy training and improved mode exploration, such as the log-variance divergence (N usken & Richter, 2021; Richter et al., 2020), see Remark A.2. 2.1 IMPLICATIONS FOR GENERATIVE MODELING: THE EVIDENCE LOWER BOUND Contrary to the sampling setting described above, generative modeling typically assumes that one has access to samples X ptarget, but cannot evaluate the (unnormalized) density. In this section we show how our general setup from the previous section can also be applied in this scenario. For instance, it readily brings an underdamped version of stochastic bridges (Chen et al., 2021) and serves as a theoretical foundation for underdamped diffusion models stated in Dockhorn et al. (2022). To this end, we may approach Problem 2.1 with the forward5 KL divergence Pu,π) = EZ Pv,τ For the sake of notation, we have reversed time, which can be viewed as interchanging τ and π. Since the process corresponding to Pv,τ starts at the target measure τ, we indeed require samples from this measure to compute the divergence in (9). At the same time, looking at Proposition 2.3, we 5While we optimize the measures in both arguments of the KL divergence, the measure Pu,π, corresponding to the generative process, is in the second component, which is typically referred to as forward KL divergence. Published as a conference paper at ICLR 2025 realize that the divergence cannot be computed directly, since τ cannot be evaluated. A workaround is to instead consider an evidence lower bound (ELBO) (or, equivalently, a lower bound on the log-likelihood). In our setting, we have the following decomposition. Lemma 2.4 (ELBO for generative modeling). It holds that Pu,π 0 (Z0)] | {z } evidence / log-likelihood = DKL( Pv,τ| Pev,τ) | {z } variational gap + EZ0 τ [log τ(Z0)] DKL( Pv,τ| Pu,π), | {z } ELBO where ev( , t) u( , t) = η (t) log Proof. This follows from Lemma 2.2 and the chain rule for KL divergences; see Appendix A.5. Crucially, we observe that the ELBO in Lemma 2.4 does not depend on the target τ anymore as the dependency cancels between the two terms (cf. Proposition 2.3). Moreover, the variational gap is zero if and only if v = ev almost everywhere, i.e., the path measures are time-reversals conditioned on the same terminal condition due to Lemma 2.2. The ELBO is maximized when additionally Pu,π 0 equals the target measure τ, i.e., if and only if we found a minimizer (u , v ) of Problem 2.1. In consequence, it provides a viable objective to learn stochastic bridges in an underdamped setting (or, more generally, with degenerate noise coefficients η) using samples from the target distribution τ. We note that for non-degenerate coefficients η, the ELBO from Lemma 2.4 has already been derived in Chen et al. (2021); see also Richter & Berner (2024); Vargas et al. (2024). For diffusion models, i.e., v = 0 and f such that P0,τ T π, this ELBO reduces to the one derived by Berner et al. (2024); Huang et al. (2021). In particular, it has been shown that maximizing the ELBO is equivalent to minimizing the denoising score matching objective (with a specific weighting of noise scales) typically used in practice. For general forward and backward processes, allowing for degenerate noise, as stated in (5) and (6), the derivation of the ELBO is less explored. For (underdamped) diffusion models with degenerate η, a corresponding (hybrid) score matching loss has been suggested and connected to likelihood optimization by Dockhorn et al. (2022, Appendix B.3). In the following proposition, we show that this also follows as a special case from Lemma 2.4. Proposition 2.5 (Underdamped score matching maximizes the likelihood). For the ELBO defined in (10) (setting v = 0) it holds ELBO(u) = T 2 EZ P0,τ , s Unif([0,T ]) h u(Zs, s) + η (s) log P0,τ s|0(Zs|Z0) 2i + const., where the constant does not depend on u. Proof. Following Huang et al. (2021, Appendix A), the proof combines Proposition 2.3 with Stokes theorem; see Appendix A.5. Note that in our notation u learns the negative and scaled score. 3 UNDERDAMPED DIFFUSION BRIDGES In order to approach Problem 2.1 and minimize divergences (such as the KL divergence) in practice, we need to numerically approximate the Radon-Nikodym derivative in Proposition 2.3. Analogously to Vargas et al. (2024, Proposition E.1), we can discretize the appearing integrals to show that Pv,τ (Z) π( b Z0) QN 1 n=0 pn+1|n( b Zn+1 b Zn) τ( b ZN) QN 1 n=0 pn|n+1( b Zn b Zn+1) , (11) where the expressions for the forward and backward transition kernels p and p depend on the choice of the integrator for Z. Since we have degenerate diffusion matrices, the backward kernel p can exhibit vanishing values, which requires careful choice of the integrators for Z. In particular, naively using an Euler Maruyama scheme as an integrator is typically not well-suited (Leimkuhler & Reich, 2004; Neal, 2012; Doucet et al., 2022b); see also Figure 4. We therefore consider alternative integration methods, specifically splitting schemes (Bou-Rabee & Owhadi, 2010; Melchionna, 2007), which divide the SDE into simpler parts that can be integrated individually before combining them. Such methods are particularly useful when certain parts can be Published as a conference paper at ICLR 2025 solved exactly. To formalize splitting schemes, we leverage the Fokker-Planck operator framework, proposing a decomposition of the generator L for diffusion processes Z of the form (5). We can define L via the (kinetic) Fokker-Planck equation6 tp = Lp with Lp = (f + ηu)p) + 1 2 Tr(ηη 2p), (12) governing the evolution of the density p( , t) = Pu,π t of the solution to the SDE in (5). In order to approximate the generator L, we want to assume a suitable structure for f and η, such that we decompose L into simpler pieces. For this, we come back to the setting of the underdamped Langevin equation stated in the introduction in equation (3). We can readily see that its controlled counterpart can be incorporated in the framework presented in Section 2 by making the choices D = 2d, Z = (X, Y ) , and f(x, y, s) = (y, ef(x, s) 1 2σσ (s)y) , η = (0, σ) (13) in (5) and (6), where 0 Rd d and ef C(Rd [0, T], Rd) is chosen appropriately. Following Monmarch e (2021); Geffner & Domke (2023) we split the generator as L = LA + LB + LO (sometimes referred to as free transport, acceleration, and damping) with LAp = y xp, LBp = ef yp, LOp = y gp) + 1 2 Tr(σσ 2 yp), (14) where g(x, y, s) := 1 2σσ (s)y + σ(s)u(x, y, s), resulting in d Xs d Ys ds + 0 ef(Xs, s) 2σσ (s)Ys + σu(Zs, s) ds + σ(s) d Ws where we use a standard normal for the last d components of the initial and terminal distributions following Geffner & Domke (2023), i.e., π(x, y) = pprior(x)N(y; 0, Id) and τ(x, y) = ptarget(x)N(y; 0, Id). (16) According to the Trotter theorem (Trotter, 1959) and the Strang splitting formula (Strang, 1968), the time evolution of the system can be approximated as: e(LA+LB+LO)t e LA e LB e LO N + O(N 3), (17) where a finite number of time steps of length approximates the system dynamics. For a higher accuracy, symmetric splitting can be used: e(LA+LB+LO)t e LO 2 e LA e LB 2 N + O(N 2), (18) which reduces the approximation error (Yoshida, 1990). The optimal composition of terms is generally problem-dependent and has been extensively studied for uncontrolled Langevin dynamics (Monmarch e, 2021). For the controlled setting, prior works often use the OBAB ordering (Geffner & Domke, 2023; Doucet et al., 2022a). In this work, we additionally consider OBABO and BAOAB, which show improved performance (cf. Section 4). Further details on the integrators for forward and backward kernels p and p corresponding to these splitting schemes can be found in Appendix A.8. We refer to Algorithm 1 in the appendix for an overview of our method and to Appendix A.10 for further details. A few remarks are in order (see also Remark A.3 for a note on higher-order Langevin equations). Remark 3.1 (Mass matrix). Previous works, such as Geffner & Domke (2021) and Doucet et al. (2022b), consider incorporating a mass matrix M C([0, T], Rd d) into the SDE formulation in (15) and terminal conditions. For simplicity, we have omitted this consideration in the current section. However, additional details on its inclusion and effects can be found in Appendix A.7. Furthermore, we conduct experiments where we learn the mass matrix, as discussed in Section 4. Remark 3.2 (Discrete Radon-Nikodym derivative). We note that our discretization of the Radon Nikodym derivative in (11) corresponds to a (discrete-time) Radon-Nikodym derivative between the joint distributions of the discretized forward and backward processes. In particular, we can analogously define a KL divergence which allows us to obtain a (guaranteed) lower bound for the 6We denote by Tr the trace and by the derivative operator w.r.t. spatial variable z; see Appendix A.1. Published as a conference paper at ICLR 2025 Table 1: Results for benchmark problems of various dimensions d, averaged across four runs. Evaluation criteria include importance-weighted errors for estimating the log-normalizing constant, log Z, effective sample size ESS, Sinkhorn distance Wγ 2 , and a lower bound (LB) on log Z; see Appendix A.10.4 for details on the metrics. The best results are highlighted in bold. Arrows ( , ) indicate whether higher or lower values are preferable. Blue shading indicates that the method uses the underdamped Langevin equation. Funnel (d = 10) Many Well (d = 50) LGCP (d = 1600) Method log Z ESS Wγ 2 log Z ESS log Z (LB) ESS 10 ULA 0.310 0.020 0.140 0.003 169.859 0.195 0.016 0.003 0.179 0.008 482.024 0.009 0.029 0.003 0.130 0.021 0.151 0.016 159.212 0.093 0.009 0.002 0.418 0.002 484.087 0.063 0.030 0.004 MCD 0.173 0.046 0.206 0.026 164.967 0.334 0.005 0.002 0.737 0.002 483.137 0.368 0.031 0.004 0.088 0.008 0.375 0.016 144.753 0.153 0.005 0.000 0.866 0.012 484.933 0.298 0.032 0.006 CMCD 0.023 0.003 0.567 0.023 104.644 0.710 0.004 0.002 0.859 0.001 483.875 0.275 0.032 0.004 0.268 0.198 0.369 0.186 148.990 19.81 0.008 0.003 0.585 0.034 483.535 0.232 0.028 0.004 DIS 0.047 0.003 0.498 0.021 107.458 0.826 0.006 0.002 0.798 0.002 405.686 4.019 0.015 0.003 0.048 0.009 0.550 0.039 114.580 0.457 0.005 0.000 0.856 0.002 diverged diverged DBS 0.021 0.003 0.603 0.014 102.653 0.586 0.005 0.001 0.887 0.004 486.376 1.020 0.032 0.002 0.010 0.001 0.779 0.009 101.418 0.425 0.005 0.000 0.898 0.002 497.545 0.183 0.174 0.017 log-normalization constant log Z in discrete time. On the other hand, this is not the case if we discretize the divergence-based Radon-Nikodym derivative in Proposition A.7 as done in previous work (Berner et al., 2024; Richter & Berner, 2024). Moreover, we can still optimize the divergences between the corresponding discrete path measures as presented in (8) and Appendix A.10.5. Finally, we note that the discretized Radon-Nikodym derivative does not depend on ef for the integrators considered in Appendix A.8. We thus choose ef to have a good initialization for the process Z, see Appendix A.10. Remark 3.3 (Properties of the score). Since the target density ptarget in (16) only appears in the coordinates where η vanishes, Nelson s identity in Lemma 2.2 shows that u (x, y, T) v (x, y, T) = σ (T) y log N(y; 0, Id), (19) i.e., the optimal controls u and v do not depend on the score of the target distribution, x log ptarget, at terminal time T, as in the case of corresponding overdamped versions. This can lead to numerical benefits in cases where this score would attain large values, e.g., when ptarget is essentially supported on a lower dimensional manifold (Dockhorn et al., 2022; Chen et al., 2022). 4 NUMERICAL EXPERIMENTS In this section, we present a comparative analysis of underdamped approaches against their overdamped counterparts. We consider five diffusion-based sampling methods, specifically, Unadjusted Langevin Annealing (ULA) (Thin et al., 2021; Geffner & Domke, 2021), Monte Carlo Diffusions (MCD) (Doucet et al., 2022b; Geffner & Domke, 2023), Controlled Monte Carlo Diffusions (CMCD) (Vargas et al., 2024), Time-Reversed Diffusion Sampler (DIS)7 (Berner et al., 2024), and Diffusion Bridge Sampler (DBS) (Richter & Berner, 2024). We stress that the underdamped versions of DIS and DBS have not been considered before. To ensure a fair comparison, all experiments are conducted under identical settings. Our evaluation methodology adheres to the protocol suggested in Blessing et al. (2024). For a comprehensive overview of the experimental setup and additional details, we refer to Appendix A.10. Moreover, we provide further numerical results in Appendix A.10.5, including the comparison to competing state-of-the-art methods. The code is publicly available8. 4.1 BENCHMARK PROBLEMS We evaluate the different methods on various real-world and synthetic benchmark examples. 7It is worth noting that we do not separately consider the Denoising Diffusion Sampler (DDS) (Vargas et al., 2023a), as it can be viewed as a special case of DIS (see Appendix A.10.1 in Berner et al. (2024)). 8https://github.com/Denis Bless/Underdamped Diffusion Bridges Published as a conference paper at ICLR 2025 8 16 32 64 128 Funnel (d = 10) 8 16 32 64 128 Credit (d = 25) 8 16 32 64 128 0 Seeds (d = 26) 8 16 32 64 128 Cancer (d = 31) 8 16 32 64 128 0 Brownian (d = 32) 8 16 32 64 128 Ionosphere (d = 35) 8 16 32 64 128 Many Well (d = 50) 8 16 32 64 128 Sonar (d = 61) ULA MCD CMCD DIS DBS OD UD Figure 3: Effective sample size (ESS) for real-world benchmark problems of various dimensions d, averaged across four seeds. Here, N refers to the number of discretization steps. Dashed and solid lines indicate the usage of the overdamped (OD) and underdamped (UD) Langevin dynamics, respectively. EM (OD) EM (UD) OBAB BAOAB OBABO 8 16 32 64 128 0 8 16 32 64 128 0 wallclock time [ s 104 ] Figure 4: Effective sample size (ESS) and wallclock time (in seconds) of the diffusion bridge sampler (DBS) for different integration schemes, averaged across multiple benchmark problems and four seeds. Integration schemes include Euler-Maruyama (EM) for over- (OD) and underdamped (UD) Langevin dynamics and various splitting schemes (OBAB, BAOAB, OBABO). Real-world benchmark problems. We consider seven real-world benchmark problems: Four Bayesian inference tasks, namely Credit (d = 25), Cancer (d = 31), Ionosphere (d = 35), and Sonar (d = 61). Additionally, we choose Seeds (d = 26) and Brownian (d = 32), where the goal is to perform inference over the parameters of a random effect regression model, and the time discretization of a Brownian motion, respectively. Lastly, we consider LGCP (d = 1600), a highdimensional Log Gaussian Cox process (Møller et al., 1998). Synthetic benchmark problems. We consider two synthetic benchmark problems in this work: The challenging Funnel distribution (d = 10) introduced by Neal (2003), whose shape resembles a funnel, where one part is tight and highly concentrated, while the other is spread out over a wide region. Moreover, we choose the Many Well (d = 50) target, a highly multi-modal distribution with 25 = 32 modes. 4.2 RESULTS Underdamped vs. overdamped. Our analysis of both real-world and synthetic benchmark problems reveals consistent improvements when using underdamped Langevin dynamics compared to its overdamped counterpart, as illustrated in Table 1 and Figure 3. The underdamped diffusion bridge sampler (DBS) demonstrates particularly impressive performance, consistently outperforming alternative methods. Remarkably, even with as few as N = 8 discretization steps, it often surpasses competing methods that utilize significantly more steps. Numerical integration schemes. We further examine various numerical schemes for the diffusion bridge sampler (DBS) introduced in Section 3. Results and a discussion for other methods can be found in Appendix A.10.5. To provide a concise overview, we present the average effective sample Published as a conference paper at ICLR 2025 σ M T π Mσ M T σπ Mπ Tπ Mσ T Figure 5: Effective sample size (ESS) of the underdamped diffusion bridge sampler (DBS) for various combinations of learned parameters, averaged across multiple benchmark problems and four seeds using N = 64 discretization steps. Hyperparameters include mass matrix M, diffusion matrix σ, terminal time T, and extended prior distribution π. See Figure 10 for the results with N = 8 discretization steps. size (ESS) and wallclock time across all tasks, excluding LGCP, in Figure 4. Detailed results for individual benchmarks can be found in Appendix A.10.5. While is is known that classical Euler methods are not well-suited for underdamped dynamics (Leimkuhler & Reich, 2004), our findings indicate that both OBAB and BAOAB schemes offer significant improvements without incurring additional computational costs. The OBABO scheme yields the best results overall, albeit at the expense of increased computational demands due to the need for double evaluation of the control per discretization step. However, it is worth noting that in many real-world applications, target evaluations often constitute the primary computational bottleneck. In such scenarios, OBABO may be the preferred choice despite its higher computational requirements. End-to-end hyperparameter learning. Finally, we examine the impact of end-to-end learning of various hyperparameters on the performance of the underdamped diffusion bridge sampler. Our investigation focuses on optimizing the (diagonal) mass matrix M (cf. Appendix A.7), diffusion matrix σ, terminal time T, and prior distribution π. Figures 5 and 10 illustrate the effective sample size, averaged across all tasks (excluding LGCP) for N = 64 and N = 8 diffusion steps, respectively. The results reveal that learning these parameters, particularly the terminal time and prior distribution, leads to substantial performance gains. We note that this feature improves the method s usability and accessibility by minimizing or eliminating the need for manual hyperparameter tuning. 5 CONCLUSION AND OUTLOOK In this work we have formulated a general framework for diffusion bridges including degenerate stochastic processes. In particular, we propose the novel underdamped diffusion bridge sampler, which achieves state-of-the-art results on multiple sampling tasks without hyperparameter tuning and only a few discretization steps. We provide careful ablation studies showing that our improvements are due to the combination of underdamped dynamics, our novel numerical integrators, learning both the forward and backward processes as well as end-to-end learned hyperparameters. Our results also suggest to extend the method by Chen et al. (2021) and benchmark underdamped diffusion bridges for generative modeling using the ELBO derived in Lemma 2.4. Finally, our favorable findings encourage further investigation of the theoretical convergence rate of underdamped diffusion samplers. Similar to what has already been observed in generative modeling by Dockhorn et al. (2022), we find significant and consistent improvements over overdamped versions, in particular also for high-dimensional targets with only a few discretization steps N. However, previous results showed that (for the case v = 0), the improved convergence rates of underdamped Langevin dynamics do not carry over to the learned setting, since (different from the score log ptarget in Langevin dynamics) the control u depends not only on the smooth process X, but also on Y (Chen et al., 2022). Specifically, it can be shown that a small KL divergence between the path measures generally requires the step size to scale at least linearly in d (instead of d). While the tightness of our lower bounds on log Z corresponds to such KL divergences, we believe that the results can still can be reconciled with our empirical findings due to the following reasons: (1) our samplers are initialized as Langevin dynamics (see Appendix A.10) such that theoretical benefits of the underdamped case hold at least initially (2) the learning problem becomes numerically better behaved (see (19)), leading to better approximation of the optimal parameters, (3) learning both u and v as well as the prior π, diffusion coefficient σ, and terminal time T (see Figure 5) can reduce the discretization error. Published as a conference paper at ICLR 2025 ACKNOWLEDGEMENTS J.B. acknowledges support from the Wally Baer and Jeri Weiss Postdoctoral Fellowship. The research of L.R. was partially funded by Deutsche Forschungsgemeinschaft (DFG) through the grant CRC 1114 Scaling Cascades in Complex Systems (project A05, project number 235221301). D.B. acknowledges support by funding from the pilot program Core Informatics of the Helmholtz Association (HGF) and the state of Baden-W urttemberg through bw HPC, as well as the Hore Ka supercomputer funded by the Ministry of Science, Research and the Arts Baden-W urttemberg and by the German Federal Ministry of Education and Research. Tara Akhound-Sadegh, Jarrid Rector-Brooks, Avishek Joey Bose, Sarthak Mittal, Pablo Lemos, Cheng-Hao Liu, Marcin Sendera, Siamak Ravanbakhsh, Gauthier Gidel, Yoshua Bengio, Nikolay Malkin, and Alexander Tong. Iterated denoising energy matching for sampling from Boltzmann densities. In International Conference on Machine Learning, pp. 760 786, 2024. Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. Michael Arbel, Alex Matthews, and Arnaud Doucet. Annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 318 330. PMLR, 2021. Oleg Arenz, Philipp Dahlinger, Zihan Ye, Michael Volpp, and Gerhard Neumann. A unified perspective on natural gradient variational inference with Gaussian mixture models. ar Xiv preprint ar Xiv:2209.11533, 2022. Julius Berner, Lorenz Richter, and Karen Ullrich. An optimal control perspective on diffusion-based generative modeling. Transactions on Machine Learning Research, 2024. Denis Blessing, Xiaogang Jia, Johannes Esslinger, Francisco Vargas, and Gerhard Neumann. Beyond ELBOs: A large-scale evaluation of variational methods for sampling. In International Conference on Machine Learning, pp. 4205 4229. PMLR, 2024. Vladimir I Bogachev, Nicolai V Krylov, Michael R ockner, and Stanislav V Shaposhnikov. Fokker Planck Kolmogorov Equations, volume 207. American Mathematical Society, 2022. Nawaf Bou-Rabee and Houman Owhadi. Long-run accuracy of variational integrators in the stochastic context. SIAM Journal on Numerical Analysis, 48(1):278 297, 2010. James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake Vander Plas, Skye Wanderman-Milne, et al. Jax: Autograd and xla. Astrophysics Source Code Library, pp. ascl 2111, 2021. Junhua Chen, Lorenz Richter, Julius Berner, Denis Blessing, Gerhard Neumann, and Anima Anandkumar. Sequential controlled Langevin diffusions. In International Conference on Learning Representations, 2025. Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. ar Xiv preprint ar Xiv:2209.11215, 2022. Tianrong Chen, Guan-Horng Liu, and Evangelos A Theodorou. Likelihood training of Schr odinger bridge using forward-backward SDEs theory. ar Xiv preprint ar Xiv:2110.11291, 2021. Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances in neural information processing systems, 26, 2013. Marco Cuturi, Laetitia Meng-Papaxanthos, Yingtao Tian, Charlotte Bunne, Geoff Davis, and Olivier Teboul. Optimal transport tools (OTT): A jax toolbox for all things Wasserstein. ar Xiv preprint ar Xiv:2201.12324, 2022. Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schr odinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695 17709, 2021. Published as a conference paper at ICLR 2025 Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411 436, 2006. Bernard Delyon and Ying Hu. Simulation of conditioned diffusion and application to parameter estimation. Stochastic Processes and their Applications, 116(11):1660 1675, 2006. Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with criticallydamped Langevin diffusion. In International Conference on Learning Representations, 2022. Arnaud Doucet, Will Grathwohl, Alexander G Matthews, and Heiko Strathmann. Score-based diffusion meets annealed importance sampling. Advances in Neural Information Processing Systems, 35:21482 21494, 2022a. Arnaud Doucet, Will Grathwohl, Alexander G de G Matthews, and Heiko Strathmann. Score-based diffusion meets annealed importance sampling. In Advances in Neural Information Processing Systems, 2022b. R. Durrett. Brownian Motion and Martingales in Analysis. Wadsworth & Brooks/Cole Mathematics Series. Wadsworth Advanced Books & Software, 1984. Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 2019. H F ollmer. Time reversal on Wiener space. Stochastic Processes Mathematics and Physics, pp. 119 129, 1986. A Friedman. Partial differential equations of parabolic type. Prentice-Hall, 1964. Avner Friedman. Stochastic differential equations and applications. In Stochastic differential equations, pp. 75 148. Springer, 1975. Tomas Geffner and Justin Domke. MCMC variational inference via uncorrected Hamiltonian annealing. In Advances in Neural Information Processing Systems, 2021. Tomas Geffner and Justin Domke. Langevin diffusion variational inference. In International Conference on Artificial Intelligence and Statistics, pp. 576 593. PMLR, 2023. A. Gelman, J.B. Carlin, H.S. Stern, D.B. Dunson, A. Vehtari, and D.B. Rubin. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis, 2013. Ulrich G Haussmann and Etienne Pardoux. Time reversal of diffusions. The Annals of Probability, pp. 1188 1205, 1986. Jeremy Heng, Valentin De Bortoli, Arnaud Doucet, and James Thornton. Simulating diffusion bridges with score matching. ar Xiv preprint ar Xiv:2111.07243, 2021. Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusionbased generative models and score matching. Advances in Neural Information Processing Systems, 34:22863 22876, 2021. Aapo Hyv arinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005. Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2015. Peter E Kloeden and Eckhard Platen. Stochastic differential equations. In Numerical Solution of Stochastic Differential Equations, pp. 103 160. Springer, 1992. Andrei Kolmogoroff. Uber die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104:415 458, 1931. Takeshi Koshizuka and Issei Sato. Neural Lagrangian Schr odinger bridge: Diffusion modeling for population dynamics. In International Conference on Learning Representations, 2023. Published as a conference paper at ICLR 2025 Hiroshi Kunita. Stochastic flows and jump-diffusions. Springer, 2019. Jean-Franc ois Le Gall. Brownian motion, martingales, and stochastic calculus. Springer, 2016. Ben Leimkuhler and Charles Matthews. Molecular dynamics. Interdisciplinary applied mathematics, 39(1), 2015. Benedict Leimkuhler and Sebastian Reich. Simulating Hamiltonian dynamics. Number 14. Cambridge university press, 2004. Christian L eonard. A survey of the schr odinger problem and some of its connections with optimal transport. Discrete and Continuous Dynamical Systems, 34(4):1533 1574, 2013. Christian L eonard. Some properties of path measures. S eminaire de Probabilit es XLVI, pp. 207 230, 2014. Guan-Horng Liu, Tianrong Chen, Oswin So, and Evangelos Theodorou. Deep generalized Schr odinger bridge. Advances in Neural Information Processing Systems, 35:9374 9388, 2022. Guan-Horng Liu, Yaron Lipman, Maximilian Nickel, Brian Karrer, Evangelos A Theodorou, and Ricky TQ Chen. Generalized Schr odinger bridge matching. The Twelfth International Conference on Learning Representations, 2024. Yi-An Ma, Niladri S Chatterji, Xiang Cheng, Nicolas Flammarion, Peter L Bartlett, and Michael I Jordan. Is there an analog of nesterov acceleration for gradient-based mcmc? Bernoulli, 27(3), 2021. Alex Matthews, Michael Arbel, Danilo Jimenez Rezende, and Arnaud Doucet. Continual repeated annealed flow transport Monte Carlo. In International Conference on Machine Learning, pp. 15196 15219. PMLR, 2022. Simone Melchionna. Design of quasisymplectic propagators for Langevin dynamics. The Journal of chemical physics, 127(4), 2007. Laurence Illing Midgley, Vincent Stimper, Gregor NC Simm, Bernhard Sch olkopf, and Jos e Miguel Hern andez-Lobato. Flow annealed importance sampling bootstrap. The Eleventh International Conference on Learning Representations, 2023. Annie Millet, David Nualart, and Marta Sanz. Integration by parts and time reversal for diffusion processes. The Annals of Probability, pp. 208 238, 1989. Jesper Møller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen. Log Gaussian Cox processes. Scandinavian journal of statistics, 25(3):451 482, 1998. Pierre Monmarch e. High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion. Electronic Journal of Statistics, 15(2):4117 4166, 2021. Wenlong Mou, Yi-An Ma, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. Highorder Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1 41, 2021. Radford M Neal. Annealed importance sampling. Statistics and Computing, 11(2):125 139, 2001. Radford M Neal. Slice sampling. The annals of statistics, 31(3):705 767, 2003. Radford M Neal. MCMC using Hamiltonian dynamics. ar Xiv preprint ar Xiv:1206.1901, 2012. Kirill Neklyudov, Rob Brekelmans, Daniel Severo, and Alireza Makhzani. Action matching: Learning stochastic dynamics from samples. In International conference on machine learning, pp. 25858 25889. PMLR, 2023. E Nelson. Dynamical theories of Brownian motion. Press, Princeton, NJ, 1967. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pp. 8162 8171. PMLR, 2021. Published as a conference paper at ICLR 2025 Nikolas N usken and Lorenz Richter. Solving high-dimensional Hamilton Jacobi Bellman PDEs using neural networks: perspectives from the theory of controlled diffusions and measures on path space. Partial differential equations and applications, 2(4):48, 2021. Grigorios A Pavliotis. Stochastic processes and applications. Texts in applied mathematics, 60, 2014. Angus Phillips, Hai-Dang Dau, Michael John Hutchinson, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. Particle denoising diffusion sampler. International Conference on Machine Learning, 2024. Lorenz Richter. Solving high-dimensional PDEs, approximation of path space measures and importance sampling of diffusions. Ph D thesis, BTU Cottbus-Senftenberg, 2021. Lorenz Richter and Julius Berner. Improved sampling via learned diffusions. In International Conference on Learning Representations, 2024. Lorenz Richter, Ayman Boustati, Nikolas N usken, Francisco Ruiz, and Omer Deniz Akyildiz. Var Grad: a low-variance gradient estimator for variational inference. Advances in Neural Information Processing Systems, 33:13481 13492, 2020. Simo S arkk a and Tommi Sottinen. Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. ar Xiv preprint ar Xiv:0705.1598, 2007. Moritz Schauer, Frank van der Meulen, and Harry van Zanten. Guided proposals for simulating multi-dimensional diffusion bridges. ar Xiv preprint ar Xiv:1311.3606, 2013. Ren e L Schilling and Lothar Partzsch. Brownian motion: an introduction to stochastic processes. Walter de Gruyter Gmb H & Co KG, 2014. Henrik Schopmans and Pascal Friederich. Temperature-annealed Boltzmann generators. ar Xiv preprint ar Xiv:2501.19077, 2025. Ziqiang Shi and Rujie Liu. Generative modelling with high-order Langevin dynamics. ar Xiv preprint ar Xiv:2404.12814, 2024. Gabriel Stoltz, Mathias Rousset, et al. Free energy computations: A mathematical perspective. World Scientific, 2010. Gilbert Strang. On the construction and comparison of difference schemes. SIAM journal on numerical analysis, 5(3):506 517, 1968. Jingtong Sun, Julius Berner, Lorenz Richter, Marius Zeinhofer, Johannes M uller, Kamyar Azizzadenesheli, and Anima Anandkumar. Dynamical measure transport and neural PDE solvers for sampling. ar Xiv preprint ar Xiv:2407.07873, 2024. Achille Thin, Nikita Kotelevskii, Alain Durmus, Eric Moulines, Maxim Panov, and Arnaud Doucet. Monte Carlo variational auto-encoders. In International Conference on Machine Learning, 2021. Hale F Trotter. On the product of semi-groups of operators. Proceedings of the American Mathematical Society, 10(4):545 551, 1959. A S uleyman Ust unel and Moshe Zakai. Transformation of measure on Wiener space. Springer Science & Business Media, 2013. Francisco Vargas, Pierre Thodoroff, Austen Lamacraft, and Neil Lawrence. Solving Schr odinger bridges via maximum likelihood. Entropy, 23(9):1134, 2021. Francisco Vargas, Will Grathwohl, and Arnaud Doucet. Denoising diffusion samplers. The Eleventh International Conference on Learning Representations, 2023a. Francisco Vargas, Andrius Ovsianas, David Fernandes, Mark Girolami, Neil D Lawrence, and Nikolas N usken. Bayesian learning via neural Schr odinger F ollmer flows. Statistics and Computing, 33(1):3, 2023b. Published as a conference paper at ICLR 2025 Francisco Vargas, Shreyas Padhy, Denis Blessing, and Nikolas N usken. Transport meets variational inference: Controlled Monte Carlo diffusions. In International Conference on Learning Representations, 2024. Hao Wu, Jonas K ohler, and Frank No e. Stochastic normalizing flows. Advances in Neural Information Processing Systems, 33:5933 5944, 2020. Haruo Yoshida. Construction of higher order symplectic integrators. Physics letters A, 150(5-7): 262 268, 1990. Dinghuai Zhang, Ricky Tian Qi Chen, Cheng-Hao Liu, Aaron Courville, and Yoshua Bengio. Diffusion generative flow samplers: Improving learning signals through partial trajectory optimization. The Twelfth International Conference on Learning Representations, 2024. Guodong Zhang, Kyle Hsu, Jianing Li, Chelsea Finn, and Roger Grosse. Differentiable annealed importance sampling and the perils of gradient noise. In Advances in Neural Information Processing Systems, 2021. Qinsheng Zhang and Yongxin Chen. Path integral sampler: a stochastic control approach for sampling. International Conference on Learning Representations, 2022. Published as a conference paper at ICLR 2025 A.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 A.2 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.3 Further remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 A.4 Auxiliary results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.5 Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 A.6 Additional statements on diffusion models . . . . . . . . . . . . . . . . . . . . . . 21 A.7 Including a mass matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 A.8 Numerical discretization schemes . . . . . . . . . . . . . . . . . . . . . . . . . . 22 A.8.1 Euler-Maruyama . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A.8.2 OBAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 A.8.3 BAOAB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 A.8.4 OBABO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 A.9 Underdamped version of previous diffusion-based sampling methods . . . . . . . . 25 A.10 Further computational details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.10.1 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.10.2 SDE preconditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 A.10.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 A.10.4 Evaluation criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 A.10.5 Further experiments and comparisons . . . . . . . . . . . . . . . . . . . . 30 A.1 NOTATION We denote by Tr(Σ) and Σ+ the trace and the (Moore-Penrose) pseudoinverse of a real-valued matrix Σ, by µ the Euclidean norm of a vector µ, and by µ1 µ2 the Euclidean inner product between vectors µ1 and µ2. For a function p C(RD [0, T], R), depending on the variables z = (x, y) Rd RD d RD and t [0, T], we denote by tp it partial derivative w.r.t. the time coordinate t and by xp and yp its gradients w.r.t. the spatial variables x and y, respectively. Here, C(A, B) denotes the set of all continuous functions mapping from the set A to the set B. Moreover, we denote by the gradient w.r.t. both spatial variables z = (x, y). We analogously denote by 2p the Hessian of p w.r.t. the spatial variables. Similarly, we define f = PD i=1 xifi to be the divergence of a (time-dependent) vector field f = (fi)D i=1 C(RD [0, T], RD) w.r.t. the spatial variables. We denote by N(µ, Σ) a multivariate normal distribution with mean µ Rd and (positive semidefinite matrix) covariance matrix Σ Rd d and write N(x; µ, Σ) for the evaluation of its density (w.r.t. the Lebesgue measure) at x Rd. Moreover, we denote by Unif([0, T]) the uniform distribution on [0, T]. For an Rd-valued random variable X with law P and a function f (Rd, R), we denote by EX P[f(X)] = Z f d P (21) the expected value of the random variable f(X). Published as a conference paper at ICLR 2025 For suitable continuous stochastic processes Z = (Zt)t [0,T ] and Y = (Yt)t [0,T ], we define forward and backward Itˆo integrals via the limits Z t Xs d Ys = lim n i=0 Xtn i (Ytn i+1 Ytn i ), (22) d Ys = lim n i=0 Xtn i+1 (Ytn i+1 Ytn i ), (23) where t < tn 0 < < tn kn = t is an increasing sequence of subdivisions of t, t with mesh size tending to zero; see Vargas et al. (2024) for details. The relation between forward and backward integrals is given in Lemma A.6. We denote by P the set of probability measures on C([0, T], RD), equipped with the Borel σ-field associated with the topology of uniform convergence on compact sets. For suitable vector fields u, v and distributions π, τ, we denote by Pu,π P and Pv,τ P the forward and reverse-time path measures, i.e., the laws or pushforwards on C([0, T], RD), of the solutions Z = (Zt)t [0,T ] to the SDEs Zt = Z0 + Z t 0 (f + η u) (Zs, s) ds + Z t 0 η(s) d Ws, Z0 π, (24) Zt = ZT Z T t (f + η v) (Zs, s) ds Z T d Ws, ZT τ, (25) respectively. In the above, W denotes a standard d-dimensional Brownian motion satisfying the usual conditions, see, e.g., Kunita (2019). Note that we consider degenerate diffusion coefficients η of the form η = (0, σ) .n Moreover, we assume that the first components of f = ( bf, ef) , namely bf, satisfies x bf(z, ) = 0 for every z = (x, y) RD, i.e., bf only depends on y but not on x. Finally, we denote the marginal of a path space measure P at time t [0, T] by Pt, which can be interpreted as the pushforward under the evaluation Z 7 Zt. Moreover, we denote by Ps|t the conditional distribution of Ps given Pt. A.2 ASSUMPTIONS Throughout the paper, we assume that all vector fields are smooth, i.e., for a vector field g it holds g C (RD [0, T], Rd), and satisfy a global Lipschitz condition (uniformly in time), i.e., there exists a constant C such that for all z1, z2 RD and t [0, T] it holds that g(z1, t) g(z2, t) C z1 z2 . (26) These assumptions also define the set of admissible controls U C (RD [0, T], Rd). Moreover, we assume that the diffusion coefficients appearing in the dimensions with the control, σ, are invertible for all t [0, T] and satisfy that σ C ([0, T], Rd d). Our continuity assumptions on the SDE coefficient functions and the global Lipschitz condition in (26) guarantee strong solutions with pathwise uniqueness (see, e.g., Le Gall (2016, Section 8.2)) and are sufficient for Girsanov s theorem in Theorem A.4 to hold (see, e.g., Delyon & Hu (2006)). Moreover, our conditions allow the definition of the forward and backward Itˆo integrals via limits of time discretizations as in (22) and (23) that are independent of the specific sequence of refinements (Vargas et al., 2024). Finally, we assume that all SDEs admit densities of their time marginals (w.r.t. the Lebesgue measure) that are sufficiently smooth9 such that we have strong solutions to the corresponding Fokker- 9Sufficient conditions for the existence of densities can be found in Millet et al. (1989, Proposition 4.1) and Haussmann & Pardoux (1986, Theorem 3.1). For time-independent SDE coefficient functions, a result by Kolmogoroff (1931) guarantees that the Fokker-Planck equation is satisfied if the density is in C2,1(Rd [0, T], R); see also Pavliotis (2014, Proposition 3.8). and Schilling & Partzsch (2014, 19.6 Proposition). However, we note that popular results by Friedman (1964, Section 1.6) (see also Friedman (1975, Section 5) and Durrett (1984, Section 9.7)) for showing existence and uniqueness of solutions to Fokker-Planck equations require uniform ellipticity assumptions, which are not satisfied for our degenerate diffusion coefficients. We refer to Bogachev et al. (2022, Sections 6.7(ii) and 9.8(i)-(iii)) for existence and uniqueness in the degenerate case and note that we only make use of the Fokker-Planck equation for motivating our splitting schemes in Section 3. Published as a conference paper at ICLR 2025 Planck equations. The existence of continuously differentiable densities and our assumptions on the SDE coefficient functions are sufficient for Nelson s relation in Lemma 2.2 to hold; see, e.g., Millet et al. (1989). While we use the above assumptions to simplify the presentation, we note they can be significantly relaxed. A.3 FURTHER REMARKS Remark A.1 (Stochastic bridges and bridge sampling). By stochastic bridge or diffusion bridge (also referred to as general bridge by Richter & Berner (2024)), we refer to an SDE that satisfies the marginals pprior and ptarget at times t = 0 and t = T, respectively. For a given diffusion coefficient of the SDE, there exist infinitely many drifts satisfying these constraints. In particular, for every sufficiently regular density evolution between the prior and target, we can find a drift (given by a unique gradient field) that establishes a corresponding stochastic bridge; see, e.g., Vargas et al. (2024, Proposition 3.4) and Neklyudov et al. (2023, Appendix B.3). However, any stochastic bridge solves our problem of sampling from ptarget and the non-uniqueness can even lead to better performance in gradient-based optimization (Sun et al., 2024; Blessing et al., 2024). Other previous methods have obtained unique objectives by prescribing the density evolution, e.g., as diffusion process in DIS (Berner et al., 2024) or geometric annealing between prior and target in CMCD (Vargas et al., 2024). Another popular approach for obtaining uniqueness consists of minimizing the distance10 to a reference process (additionally to satisfying the marginals). In case the distance is measured via a Kullback-Leibler divergence between the path measures of the bridge and reference process, this setting is often referred to as (dynamical) Schr odinger bridge problem. In the context of samplers, reference processes have been chosen as scaled Brownian motions in PIS (Zhang & Chen, 2022) and ergodic processes in DDS (Vargas et al., 2023a); see also Richter & Berner (2024) for an overview. A special case of such a Schr odinger bridge problem is given if the marginals pprior and ptarget are Dirac measures. Sampling from the solution to such a problem is equivalent to sampling from the reference SDE conditioned on the start and end point at the times t = 0 and t = T (specified by the Dirac measures). For instance, if the reference measure is a Brownian motion, solutions are commonly referred to as Brownian bridges. As special cases of our considered bridges, solutions to such problems are also sometimes called diffusion bridges and we refer to Schauer et al. (2013); Heng et al. (2021) for further details and numerical approaches. However, our sampling problem is in some form orthogonal to such tasks: in case of a Dirac target distribution, sampling is trivial and one is interested in the conditional trajectories. For the sampling problem, the trajectories are not (directly) relevant and one is interested in samples from a general target distribution. Remark A.2 (Log-variance loss). As an alternative to the KL divergence in (8), we can consider the log-variance (LV) loss defined as Lw LV(u, v) := Dw LV Pu,π, Pv,τ = Var Z Pw,π where the expectation is taken with respect to a path space measure corresponding to a forward process of the form (5), but with the control replaced by an arbitrary control w U. This allows for off-policy training and avoids the need to differentiate through the simulation of the SDE. Moreover, the estimator achieves zero variance at the optimum (u , v ), see Richter et al. (2020); N usken & Richter (2021); Richter & Berner (2024). Remark A.3 (Higher-order Langevin equations). We note that our general framework from Section 2 can readily be used for higher-order dynamics and in particular higher-order Langevin equations, where next to a position and velocity variable one considers acceleration. As argued by Shi & Liu (2024), corresponding trajectories become smoother the higher the order, which can lead to improved performance of (uncontrolled) Langevin dynamics. Also, Mou et al. (2021) observed improved convergence of third-order Langevin dynamics for convex potentials. We leave related extensions to diffusion bridges for future work. 10In the context of generative modeling, also more general settings, referred to as mean-field games or generalized Schr odinger bridges, have been explored; see, e.g., Liu et al. (2022); Koshizuka & Sato (2023); Liu et al. (2024). Published as a conference paper at ICLR 2025 A.4 AUXILIARY RESULTS Theorem A.4 (Girsanov theorem). For Pu,π-almost every Z C([0, T], RD) it holds that d Pw,π (Z) = Z T 2 u w 2 + (η+f + w) (u w) (Zs, s) ds + S (28a) η+f + w 2 η+f + u 2 (Zs, s) ds + S, (28b) 0 (u w)(Zs, s) η+(s) d Zs. (29) In particular, for Z Pu,π we obtain that d Pw,π (Z) = 1 0 u w 2(Zs, s) ds + Z T 0 (u w)(Zs, s) d Bs. (30) Proof. See S arkk a & Sottinen (2007); Ust unel & Zakai (2013); Chen et al. (2022). Theorem A.5 (Reverse-time Girsanov theorem). For Pu,π-almost every Z C([0, T], RD) it holds that Pw,π (Z) = log d Pu,π d Pw,π (Z) Z T 0 (u w)(Zs, s) η+(s) d Zs (31) 0 (u w)(Zs, s) η+(s) Proof. Using Theorem A.4 and the definitions in (22) and (23), we observe that d Pw,π (Z) equals the Radon-Nikodym derivative between the path spaces measures corresponding to forward SDEs as in (5) with initial conditions π and all functions f, u, w, and η reversed in time, evaluated at t 7 ZT t. We can now substitute t 7 T t to proof the claim; see also Vargas et al. (2024, Proof of Proposition 2.2). Lemma A.6 (Conversion formula). For Z Pw,π and suitable g C(RD [0, T], RD) it holds that Z t g(Zs, s) d Zs + Z t (ηη g)(Zs, s) ds. (33) Proof. Similar to the conversion formula in Vargas et al. (2024, Remark 3), the result follows from combining (22) and (23). First, we rewrite the problem by observing that Z t g(Zs, s) d Zs + Z t eg(Zs, s) d Ws, where eg = η g. Then we can compute Z t d Ws = lim n i=0 (eg(Ztn i+1, tn i+1) + eg(Ztn i , tn i )) (Wtn i+1 Wtn i ) Z t eg(Zs, s) d Ws eg(Zs, s) d Ws Z t eg(Zs, s) d Ws, where denotes Stratonovich integration. The result now follows from the relationship between Itˆo and Stratonovich stochastic integrals, i.e., Z t eg(Zs, s) d Ws = Z t eg(Zs, s) d Ws + 1 (ηeg)(Zs, s) ds, (34) Published as a conference paper at ICLR 2025 see, e.g., Kloeden & Platen (1992, Section 4.9). Proof of Proposition 2.3. The proof follows the one by Vargas et al. (2024, proof of Proposition 2.2). Using disintegration (L eonard, 2014), we first observe that11 d d Pw,π (Z) = τ(ZT ) π(Z0) for w = η+f. Thus, it holds that Pv,τ (Z) = log d Pu,π d Pw,π (Z) + log d Pv,τ (Z) + log π(Z0) τ(ZT ). (35) The result now follows by applying the Girsanov theorem; see Theorem A.4 and Theorem A.5. Proof of Lemma 2.4. Using Lemma 2.2 and the chain rule for the KL divergence, we observe that Pu,π) = DKL( Pv,τ| Pev,eτ) = DKL( Pv,τ| Pev,τ) + DKL(τ| Pu,π 0 ), (36) Pu,π 0 . We note that the Girsanov theorem (see Theorem A.4) implies that the variational gap can equivalently be written as DKL( Pv,τ| Pev,τ) = EZ Pv,τ v(Zs, s) u(Zs, s) + η (s) log Pu,π s (Zs) 2 ds see also Vargas et al. (2024, Appendix C). Proof of Proposition 2.5. The proof extends the ones by Huang et al. (2021, Appendix A), Berner et al. (2024, Lemma A.11), and (Vargas et al., 2024, Appendix C.2) to the case of degenerate diffusion coefficients η. Using Proposition A.7 and a Monte Carlo approximation, we first observe that, for the case v = 0, the ELBO can be represented as ELBO = EZ P0,τ log π(ZT ) Z T 2 u 2 (ηu + f) (Zs, s) ds = T EZ P0,τ , s Unif([0,T ]) 2 u 2 (ηu) (Zs, s) + const., (38) where the last expression can be viewed as an extension of implicit score matching (Hyv arinen & Dayan, 2005) to degenerate η. Completing the square and using the tower property in (37), it remains to show that E[r(Zs)|Z0] = E [ (ηu)(Zs, s)|Z0] (39) for fixed s [0, T], where we used the abbreviations p(z) := P0,τ s|0(z|Z0) and r(z) = u(z, s) η (s) log p(z) = η(s)u(z, s) p(z) p(z) . (40) Under suitable assumptions, the statement in (39) follows from the computation E[r(Zs)|Z0] = Z Rd r(z)p(z) dz = Z Rd (ηup)(z, s) dz | {z } =0 Rd ηu (z, s)p(z) dz (41) = E [ (ηu)(Zs, s)|Z0] , (42) where we used identities for divergences and Stokes theorem. 11Considering the (kinetic) Fokker-Planck equation in (12), the Lebesgue measure is an invariant measure of the SDE in (5) with control w = η+f if and only if bf merely depends on the last d coordinates. Published as a conference paper at ICLR 2025 A.6 ADDITIONAL STATEMENTS ON DIFFUSION MODELS The following proposition is an alternative version of Proposition 2.3, which, instead of backward integrations, depends on the divergence operation and does not rely on computing the pseudoinverse of η. Proposition A.7 (Radon-Nikodym derivative). For a process Z Pw,π as defined in (5) it holds Pv,τ (Z) = log π(Z0) τ(ZT ) + Z T (u v) w u + v (f + ηv) (Zw s , s) ds 0 (u v)(Zs, s) d Ws. Proof. This follows from combining Proposition 2.3 with Lemma A.6. In particular, note that for Z Pw,π it holds that Pv,τ (Z) = log π(Z0) 0 (η+f + u) 2(Zs, s) ds + 1 0 (η+f + v) 2(Zs, s) ds 0 (η+f + u)(Zs, s) η+(s) d Zs Z T 0 (η+f + v)(Zs, s) η+(s) = log π(Z0) 0 (η+f + u) 2(Zs, s) ds + 1 0 (η+f + v) 2(Zs, s) ds 0 (u v)(Zs, s) η+(s) d Zs Z T 0 (ηη+f + ηv)(Zs, s) ds = log π(Z0) τ(ZT ) + Z T (u v) w u + v (ηη+f + ηv) (Zw s , s) ds 0 (u v)(Zs, s) d Ws. We note that ηη+ = 0 0 0 Idd d . Together with our assumption that the first component bf of f = ( bf, ef) only depends on the last d coordinates, we obtain that (ηη+f) = f, which proves the claim. Remark A.8 (PDE perspective). Similar to Berner et al. (2024); Sun et al. (2024), we can also derive the expression in Proposition A.7 using the underlying PDEs. To this end, we recall that the density p of the solution Z to the SDE in stated (5) is governed by the (kinetic) Fokker-Planck equation in (12). Using the Hopf-Cole transformation V := log p, we get the Hamilton Jacobi Bellman equation t V = div(f + ηu) V (f + ηu) + 1 2 η V 2 + 1 2 Tr(ηη 2V ). Moreover, Itˆo s formula implies that V (ZT , T) V (Z0, 0) = Z T 2 Tr(ηη 2V ) + (f + ηw) V (Zs, s) ds 0 V (Zs, s) η(s) d Ws, where Z Pw,π. Following the same computations as in Proposition 3.1 in Sun et al. (2024) and minimizing the squared residual of the above Itˆo formula, we obtain the loss Lw(u, v) = EZ Pw,π where the Radon-Nikodym derivative is equivalent to the one given in Proposition A.7. Published as a conference paper at ICLR 2025 A.7 INCLUDING A MASS MATRIX In Section 3, we omitted the mass matrix M for simplicity. Here, we give further details on the SDEs when the mass matrix is incorporated. It can be incorporated in the framework presented in Section 2 by making the choices D = 2d, Z = (X, Y ) and f(x, y, s) = (y, ef(x, y, s) 1 2σσ (s)y) , η = (0d, σM 1/2) (43) in (5) and (6), where 0d Rd d, ef C(Rd [0, T], Rd) is chosen appropriately and σ, M C([0, T], Rd d). For the terminal conditions, the standard normal for the last d components of the initial and terminal distributions is replaced by a Gaussian whose covariance matrix is given by the mass, i.e., π(x, y) = pprior(x)N(y; 0, M) and τ(x, y) = ptarget(x)N(y; 0, M). (44) We, therefore, get the forward and reverse-time processes d Xs = M 1Ys ds, X0 pprior, (45a) d Ys = ef(Zs, s) 1 2σσ (s)Ys + σM 1/2u(Zs, s) ds + σ(s)M 1/2 d Ws, Y0 N(0, M), d Xs = M 1Ys ds, XT ptarget, d Ys = ef(Zs, s) 1 2σσ (s)Ys + σM 1/2v(Zs, s) ds + σ(s)M 1/2 d Ws, YT N(0, M). In a similar spirit to the diffusion matrix σ, one can also learn the mass matrix. However, our experiments (Section 4) showed little improvements when doing so. A.8 NUMERICAL DISCRETIZATION SCHEMES In this section we provide details on the numerical integration schemes discussed in this work, called OBAB, BAOAB, and OBABO. In particular, we derive the transition kernels p and p for computing the discrete-time approximation of the Radon-Nikodym derivative as Pv,τ (Z) π( b Z0) QN 1 n=0 pn+1|n( b Zn+1 b Zn) τ( b ZN) QN 1 n=0 pn|n+1( b Zn b Zn+1) . (47) We note that such splitting schemes are well-studied in the uncontrolled setting, see, e.g., Section 7 in Leimkuhler & Matthews (2015) or Section 2.2.3.2 in Stoltz et al. (2010). The controlled setting, and in particular the approximation of the Radon-Nikodym derivative between path space measures, has to the best of our knowledge only been considered for OBAB yet Geffner & Domke (2023); Doucet et al. (2022b). For convenience, let us recall the following splitting for the forward SDE that is used throughout this section, i.e., d Xs d Ys ds + 0 ef(Xs, s) 2σσ (s)Ys + σu(Zs, s) ds + σ(s) d Ws and let us use the following splitting for the reverse SDE d Xs d Ys ds + 0 ef(Xs, s) 2σσ (s)Ys + σv(Zs, s) ds + σ(s) Here, we use arrows to indicate whether the corresponding splitting belongs to the generative or inference SDE. To simplify the notation, we define σn := σ(n ), efn := ef(Xn , n ), pn+1|n := pn+1|n( b Zn+1| b Zn) and analogously for the backward transition Published as a conference paper at ICLR 2025 A.8.1 EULER-MARUYAMA We follow Geffner & Domke (2023) and leverage a semi-implicit Euler-Maruyama (EM) scheme, where the velocity update is computed first and then used to move the position, i.e., b Yn+1 = b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) + efn + σn ξn, (50) b Xn+1 = b Xn + b Yn+1 , (51) with ξn N (0, I) and step size > 0. We further define b Zn+1 = Φ( b Xn, b Yn+1) := ( b Xn + b Yn+1 , b Yn+1) which helps to simplify the discrete-time approximation of the Radon-Nikodym derivative as shown in the following. We obtain the forward transition density pn+1|n =N b Yn+1 b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) + efn , σnσ n δΦ( b Xn,b Yn+1)( b Zn+1), where δ is the Dirac delta distribution. Integrating the reverse process (49) using the semi-implicit EM scheme, we obtain b Xn = b Xn+1 b Yn+1 (52) b Yn = b Yn+1(1 + 1 2σn+1σ n+1 ) σn+1v( b Zn+1, (n + 1) ) efn+1 + σn+1 with corresponding transition density pn|n+1 =N b Yn b Yn+1(1 + 1 2σn+1σ n+1 ) σn+1v( b Zn+1, (n + 1) ) efn+1 , σn+1σ n+1 δΦ 1( b Zn+1)( b Xn, b Yn+1), (54) resulting in the following ratio between forward and backward transitions pn|n+1 = N b Yn+1|b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) + efn , σnσ n N b Yn|b Yn+1(1 + 1 2σn+1σ n+1 ) σn+1v( b Zn+1, (n + 1) ) efn+1 , σn+1σ n+1 , as the ratio between the two Dirac delta distribution cancel. Composing the splitting terms as O B A B yields the integrator b Y n = b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) + σn ξn, ξn N (0, I) (56a) b Y n = b Y n + efn 2 b Xn+1 = b Xn + b Y n b Yn+1 = b Y n + efn+1 with b Zn+1 = Φ( b Xn, b Y n). The resulting forward transition is given by pn+1|n = δΦ( b Xn,b Y n)( b Zn+1)N b Y n b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) , σnσ n . The inference SDE, i.e., B, is integrated as b Y n = b Yn+1 efn+1 2 b Xn = b Xn+1 b Y n b Y n = b Y n efn b Yn = b Y n(1 + 1 2σnσ n ) σnv( b Z n, n ) + σn ξn, ξn N (0, I) , (58) with ( b Xn, b Y n) = Φ 1( b Zn+1), giving the backward transition pn|n+1 = δΦ 1( b Zn+1)( b Xn, b Y n)N b Yn b Y n 1 + 1 2σnσ n σnv( b Z n, n ) , σnσ n . Published as a conference paper at ICLR 2025 This results in the following ratio between forward and backward transitions pn|n+1 = N b Y n b Yn(1 1 2σnσ n ) + σnu( b Zn, n ) , σnσ n N b Yn b Y n 1 + 1 2σnσ n σnv( b Z n, n ) , σnσ n . (59) A.8.3 BAOAB Composing the splitting terms as B A O A B yields the integrator b Y n = b Yn + efn 2 b X n = b Xn + b Y n b Y n = b Y n(1 1 2σnσ n ) + σnu( b X n, b Y n, n ) + σn ξn (61) b Xn+1 = b X n + b Y n 2 b Yn+1 = b Y n + efn+1 with ξn N (0, I), ( b X n, b Y n) = Φ1( b Zn), and b Zn+1 = Φ2( b X n, b Y n ). Hence, we obtain the forward transition density pn+1|n =δΦ2( b X n,b Y n )( b Zn+1) δΦ1( b Zn)( b X n, b Y n) N b Y n b Y n(1 1 2σnσ n ) + σnu( b X n, b Y n, n ) , σnσ n . (63) B we obtain b Y n = b Yn+1 efn+1 2 b X n = b Xn+1 b Y n b Y n = b Y n (1 + 1 2σnσ n ) σnv( b X n, b Y n , n ) + σn ξn (65) b Xn = b X n b Y n 2 b Yn = b Y n efn with ( b X n, b Y n ) = Φ 1 2 ( b Zn+1) and b Zn = Φ 1 1 ( b X n, b Y n). Moreover, we have pn|n+1 =δΦ 1 1 ( b X n,b Y n)( b Zn) δΦ 1 2 ( b Zn+1)( b X n, b Y n ) N b Y n b Y n (1 + 1 2σnσ n ) σnv( b X n, b Y n , n ) , σnσ n . (67) We therefore obtain the following ratio between forward and backward transitions as pn|n+1 = N b Y n b Y n(1 1 2σnσ n ) + σnu( b X n, b Y n, n ) , σnσ n N b Y n b Y n (1 + 1 2σnσ n ) σnv( b X n, b Y n , n ) , σnσ n . (68) A.8.4 OBABO Composing the splitting terms as O B A B O yields the integrator b Y n = b Yn(1 1 4σnσ n ) + σnu( b Zn, n ) 2 ξ(1) n (69) b Y n = b Y n + efn 2 b Xn+1 = b Xn + b Y n b Y n = b Y n + efn+1 b Yn+1 = b Y n (1 1 4σnσ n ) + σnu( b Xn+1, b Y n , (n + 1 2 ξ(2) n (71) Published as a conference paper at ICLR 2025 with ξ(1) n , ξ(2) n N (0, I) and ( b Xn+1, b Y n ) = Φ( b Xn, b Y n). The resulting forward transition density is given by pn+1|n =N b Yn+1 b Y n (1 1 4σnσ n ) + σnu( b Xn+1, b Y n , (n + 1 δΦ( b Xn,b Y n)( b Xn+1, b Y n ) N b Y n b Yn(1 1 4σnσ n ) + σnu( b Zn, n ) The inference SDE, i.e., O, is integrated as b Y n = b Yn+1(1 + 1 4σnσ n ) σnv( b Zn+1, (n + 1) ) 2 ξ(2) n (73) b Y n = b Y n efn+1 2 b Xn = b Xn+1 b Y n b Y n = b Y n efn b Yn = b Y n(1 + 1 4σnσ n ) σnv( b Xn, b Y n, (n + 1 2 ξ(1) n , (75) with ( b Xn, b Y n) = Φ 1( b Xn+1, b Y n ), and gives the following backward transition densties pn|n+1 =N b Yn b Y n(1 + 1 4σnσ n ) σnv( b Xn, b Y n, (n + 1 δΦ 1( b Xn+1,b Y n )( b Xn, b Y n) N b Y n b Yn+1(1 + 1 4σnσ n ) σnv( b Zn+1, (n + 1) ) resulting in the following ratio between forward and backward transitions pn|n+1 = N b Yn+1 b Y n (1 1 4σnσ n ) + σnu( b Xn+1, b Y n , (n + 1 N b Y n b Yn+1(1 + 1 4σnσ n ) σnv( b Zn+1, (n + 1) ) N b Y n b Yn(1 1 4σnσ n ) + σnu( b Zn, n ) N b Yn b Y n(1 + 1 4σnσ n ) σnv( b Xn, b Y n, (n + 1 A.9 UNDERDAMPED VERSION OF PREVIOUS DIFFUSION-BASED SAMPLING METHODS In this section we outline how our framework in Section 2 includes previous diffusion-based sampling methods. First, we note that setting the drift ef and controls u and v in (15) to specific values recovers underdamped methods of ULA, MCD, and CMCD, see Table 2. Moreover, we can also introduce reference processes with controls eu and ev that satisfy Pev,eτ 1, (77) where eπ and eτ are known reference distributions. In other words, this assumes having knowledge of a perfect time-reversal for specific controls eu, ev and marginals eπ, eτ (cf. Section 3.3 in Richter & Berner (2024)). We remark that these processes take a role similar to the Brownian motion used in Published as a conference paper at ICLR 2025 the proof of Proposition 2.3. In particular, by applying Proposition 2.3 twice, we obtain that Pv,τ (Z) = log d Pu,π Pv,τ (Z) log d Peu,eπ = log π(Z0) eπ(Z0) log τ(ZT ) (v ev) (2η+f + v + ev) (u eu) (2η+f + u + eu) (Zs, s) ds 0 (u eu)(Zs, s) η+(s) d Zs Z T 0 (v ev)(Zs, s) η+(s) Several previous methods, such as versions of PIS and DDS, can be recovered by fixing v and using the choices ev = v as well as eπ = π, which significantly simplifies the above expression (Vargas et al., 2024; Richter & Berner, 2024). A.10 FURTHER COMPUTATIONAL DETAILS In this section we provide additional computational details. A.10.1 ALGORITHM In Algorithm 1 we display the training process of our underdamped diffusion bridge sampler. A.10.2 SDE PRECONDITIONING Here, we introduce a preconditioning technique that modifies the SDEs to improve numerical behavior while preserving the optimal control for both overdamped and underdamped diffusions. Overdamped diffusions. Consider the controlled forward and backward pair of SDEs given by d Xs = (f + σ u) (Xs, s) ds + σ(s) d Ws, X0 pprior, (78) d Xs = (f + σ v) (Xs, s) ds + σ(s) d Ws, XT ptarget, (79) with forward and backward Brownian motion d Ws and d Ws , f C(Rd [0, T], Rd), diffusion coefficient σ C([0, T], Rd d), control functions u, v C(Rd [0, T], Rd) and path measures Pu,p0 and Pv,p T corresponding to (78) and (79), respectively, with p0 := pprior and p T := ptarget. Note that for denoising diffusion sampler such as DIS, we have12 f(x, ) = 2σσ x and v = 0, i.e., an Ornstein-Uhlenbeck (OU) process. While such a choice for f ensures for a suitable σ that Pv,p T 0 N(0, I), it may lead to a poor initialization for the path measure Pu,p0. Similarly, for DBS it helps significantly to choose f = log ptarget or f = log ν where ν(x, s) p1 β(s) prior (x)pβ(s) target(x); see Appendix A.10.5 to obtain a good initialization for Pu,p0. How- ever, this may again lead to a poor initialization for Pv,p T . We can alleviate these problems by preconditioning of the SDEs: For DIS, we set σu = σeu 2f with f(x, ) = 2σσ x, resulting in the forward SDE d Xs = ( f + σ eu) (Xs, s) ds + σ(s) d Ws, X0 pprior. (80) Note that when initializing eu = 0, we have Peu,p0 t = N(0, I) for all t [0, T] which may lead to more stable training. Similarly, for ULA, MCD, and DBS, we set σv = σev 2f to obtain the backward SDE d Xs = ( f + σ ev) (Xs, s) ds + σ(s) d Ws, XT ptarget, (81) where f = log ν for ULA and MCD. Moreover, DBS uses preconditioning for all choices of f discussed in Appendix A.10.5. The impact of preconditioning is illustrated in Figure 6. A few remarks are in order. Remark A.9 (Preconditioning for CMCD). The controlled Monte Carlo diffusion sampler (CMCD, Vargas et al. (2024)) prescribes a path of densities (ν( , t))t [0,T ] and uses f = 1 2σσ log ν. 12Please note that the sign differs from most existing literature as the process initialized at the target starts at t = T instead of t = 0. Published as a conference paper at ICLR 2025 DIS (No preconditioning) DIS (With preconditioning) DBS (No preconditioning) DBS (With preconditioning) DBS-UD (No preconditioning) DBS-UD (With preconditioning) Figure 6: Illustration of the effect of preconditioning as explained in Appendix A.10.2 at initialization, i.e. with u, v = 0. The red trajectories start from the prior distribution (uni-modal Gaussian distribution) and are integrated forward in time (from left to right), whereas the blue trajectories start at the target (bi-modal distribution) and move backward in time (from right to left). The non-transparent trajectories correspond to the SDE that is affected by preconditioning. Therefore, Nelsons relation (see Lemma 2.2) yields u v = σ log ν resulting in the pair of SDEs 2σσ log ν + σ u (Xs, s) ds + σ(s) d Ws, X0 pprior, (82) 2σσ log ν + σ u (Xs, s) ds + σ(s) d Ws, XT ptarget, (83) by replacing v = u σ log ν. As a consequence, CMCD does not require preconditioning as the sign flip of f naturally occurs due to Nelsons relation. Remark A.10 (Preconditioning for DIS). The time-reversed diffusion sampler (DIS, Berner et al. (2024)) and related methods (Zhang & Chen, 2022; Vargas et al., 2023a) incorporate log ptarget (or log ν) into the SDE by treating it as part of the control, i.e., u(x, s) = u1(x, s) + u2(s) log ρtarget(x). (84) However, we found that this approach can lead to worse numerical behavior. As a result, we opted not to use this form of preconditioning. Underdamped diffusions. We consider the pair of SDEs given by (5) and (6) where the drift is chosen as in (13), i.e., f(x, y, s) = y, ef(x, s) 1 2σσ (s)y . (85) In a similar fashion to the overdamped case, we can precondition the underdamped SDEs: Assume for now that ef, u = 0. Then, (85) results in an OU process for Y which is desirable as both initial and terminal density of the velocity are Gaussian, see (16). However, the sign flip when performing the time-reversal may again lead to a bad initialization for the path measure Pv,τ. We therefore transform the control in the backward process (6) as ηv( , y, ) = ηev( , y, )+σσ y, resulting in the Published as a conference paper at ICLR 2025 Algorithm 1 Training of underdamped diffusion bridge sampler Require: See Appendix A.10 for details model: neural networks uθ, vγ with initial parameters θ(0), γ(0) fixed hyperparameters: number of gradient steps K, number of discretization steps N, batch size m, optimizer method step, integrator method integrate learned hyperparameters: prior distribution pprior = N(ζµ, diag(softplus(ηΣ))), diffusion and mass matrices σ = diag(softplus(ησ)) and M = diag(softplus(ηM)), and terminal time T = N softplus(η ) with initial parameters η(0) µ , η(0) Σ , η(0) σ , η(0) M , η(0) Θ(0) = {θ(0), γ(0), η(0) µ , η(0) Σ , η(0) σ , η(0) M , η(0) }, π = pprior N(0, M), eτ = ρtarget N(0, M) for k 0, . . . , K 1 do for i 1, . . . , m do Approximate cost (batched in practice) b Z0 π See (16) rndi,0 log π( b Z0) for n 0, . . . , N 1 do b Zn+1 integrate( b Zn, Θ(k)) See Appendix A.8 rndi,n+1 rndi,n + log pn+1|n( b Zn+1 b Zn) log pn|n+1( b Zn b Zn+1) See (11) rndi,N rndi,N log eτ( b ZN) b L 1 m Pm i=1 rndi,N Compute loss Θ(k+1) step Θ(k), Θ b L Gradient descent return optimized parameters Θ(K) backward SDE d Ys = ef + η ev (Ys, s) ds + 1 2σ(s)σ (s)Ys ds + η(s) d Ws, YT N(0, Id). (86) The impact of preconditioning for the underdamped setting is illustrated in Figure 6. A.10.3 EXPERIMENTAL SETUP Here, we provide further details on our experimental setup. Moreover, we provide an algorithmic description of the training of an underdamped diffusion sampler in Algorithm 1. General setting. All experiments are conducted using the Jax library (Bradbury et al., 2021). Our default experimental setup, unless specified otherwise, is as follows: We use a batch size of 2000 (halved if memory-constrained) and train for 140k gradient steps to ensure approximate convergence. We use the Adam optimizer (Kingma & Ba, 2015), gradient clipping with a value of 1, and a learning rate scheduler that starts at 5 10 3 and uses a cosine decay starting at 60k gradient steps. We utilized 128 discretization steps and the EM and OBABO schemes to integrate the overdamped and underdamped Langevin equations, respectively. The control functions uθ and vγ with parameters θ and γ, respectively, were parameterized as two-layer neural networks with 128 neurons. Unlike Zhang & Chen (2022), we did not include the score of the target density as part of the parameterized control functions uθ and vγ. Inspired by Nichol & Dhariwal (2021), we applied a cosine-square scheduler for the discretization step size, i.e., = a cos2 π 2 n N at step n, where a : [0, ) (0, ) is learned. Furthermore, we use preconditioning as explained in Appendix A.10.2. The diffusion matrix σ and the mass matrix M were parameterized as diagonal matrices, and we learned the parameters µ and Σ for the prior distribution pprior = N(µ, Σ), with Σ also set as a diagonal matrix. We enforced non-negativity of a and made σ, M, and Σ positive semidefinite via an element-wise softplus transformation. For the methods that use geometric annealing (see Table 2), that is, ν(x, s) p1 β(s) prior (x)pβ(s) target(x), where β : [0, T] [0, 1] is a monotonically increasing function satisfying β(0) = 0 and β(T) = 1, we additionally learn the annealing schedule β. Similar to prior works (Doucet et al., 2022b), we parameterize an increasing sequence of N steps using unconstrained parameters b(s). We map these to our annealing schedule with n n softplus(b(n )) PN n=1 softplus(b(n )) , (87) where softplus ensures non-negativity. Further, we fix β(0) = 0 and β(T) = 1, ensuring β(n ) β(n ) when n n. We initialized b such that β is a linear interpolation between 0 and 1. Published as a conference paper at ICLR 2025 Table 2: Comparison of different diffusion-based sampling methods based on ef, u, v, ν as defined in the text. Method ef u v ULA σσ x log ν 0 0 MCD σσ x log ν 0 learned CMCD 1 2σσ x log ν learned σ x log ν u DIS σσ x log pprior learned 0 DBS arbitrary learned learned Moreover, we initialized σ = M = Σ = Id and µ = 0 for all experiments. In the case of the Brownian, LGCP, and Many Well tasks, we set a = 0.1, while for the remaining benchmark problems, we chose a = 0.01 to avoid numerical instabilities encountered with a = 0.1. Evaluation protocol and model selection. We follow the evaluation protocol of prior work (Blessing et al., 2024) and evaluate all performance criteria 100 times during training, using 2000 samples for each evaluation. To smooth out short-term fluctuations and to obtain more robust results within a single run, we apply a running average with a window of 5 evaluations. We conduct each experiment using four different random seeds and average the best results of each run. Benchmark problem details. All benchmark problems, with the exception of Many Well, were taken from the benchmark suite of Blessing et al. (2024). In their work, the authors used an uninformative prior for the parameters in the Bayesian logistic regression models for the Credit and Cancer tasks, which frequently caused numerical instabilities. To maintain the challenge of the tasks while ensuring stability, we opted for a Gaussian prior with zero mean and variance of 100. For more detailed descriptions of the tasks, we refer readers to Blessing et al. (2024). The Many Well target involves a d-dimensional double well potential, corresponding to the (unnormalized) density ρtarget(x) = exp i=1 (x2 i δ)2 1 with m N representing the number of combined double wells (resulting in 2m modes), and a separation parameter δ (0, ) (see also Wu et al. (2020)). In our experiments, we set d = 50, m = 5 and δ = 2. Since ρtarget factorizes across dimensions, we can compute a reference solution for log Z via numerical integration, as described in Midgley et al. (2023). A.10.4 EVALUATION CRITERIA Here, we provide further information on how our evaluation criteria are computed. To evaluate our metrics, we consider n = 2 103 samples (x(i))n i=1. Effective sample size (ESS). We compute the (normalized) ESS as Pn i=1 w(i) 2 n Pn i=1 w(i) 2 , (88) where (w(i))n i=1 are the unnormalized importance weights of the samples (Z(i))n i=1 in path space given as w(i) := Z d d Pu,π (Z(i)). (89) Note that the dependence on Z vanishes when replacing d d Pu,π with the expression in Proposition 2.3. Published as a conference paper at ICLR 2025 Table 3: Results for lower bounds on log Z for various real-world benchmark problems. Higher values indicate better performance. The best results are highlighted in bold. Blue shading indicates that the method uses underdamped Langevin dynamics. Red shading indicate competing state-of-the-art methods. Method Credit Seeds Cancer Brownian Ionosphere Sonar DBS 585.524 0.414 73.437 0.001 83.395 4.184 1.081 0.004 111.673 0.002 108.595 0.006 585.112 0.001 73.423 0.001 77.881 0.014 1.136 0.001 111.636 0.001 108.458 0.004 GMMVI 585.098 0.000 73.415 0.002 77.988 0.054 1.092 0.006 111.832 0.007 108.726 0.007 SMC 698.403 4.146 74.699 0.100 194.059 0.613 1.874 0.622 114.751 0.238 111.355 1.177 CRAFT 594.795 0.411 73.793 0.015 95.737 1.067 0.886 0.053 112.386 0.182 115.618 1.316 FAB 585.102 0.001 73.418 0.002 78.287 0.835 1.031 0.010 111.678 0.003 108.593 0.008 Sinkhorn distance. We estimate the Sinkhorn distance W2 γ (Cuturi, 2013), i.e., an entropy regularized optimal transport distance between a set of samples from the model and target using the Jax ott library (Cuturi et al., 2022). Log-normalizing constant. For the computation of the log-normalizing constant log Z in the general diffusion bridge setting, we note that for any u, v U it holds that Together with Proposition 2.3, this shows that log Z = EZ Pu,π Pv,τ |T d Pu,π |0 (Z) + eτ(ZT ) where eτ(ZT ) = ρtarget(XT )N(0, Id) and Pu,π |0 denotes the path space measure of the process Z with initial condition Z0 = b Z0 R2d (analogously for Pv,τ |T ), see e.g. L eonard (2013). If u = u and v = v , the expression in the expectation is almost surely constant, which implies log Z = log d Pv ,τ |T d Pu ,π |0 (Z) + eτ(ZT ) If we only have approximations of u and v , Jensen s inequality shows that the right-hand side in (92) yields a lower bound to log Z. For other methods, the log-normalizing constants can be computed analogously, by replacing u, v accordingly, see e.g. Berner et al. (2024) for DIS. Our experiments use the lower bound as an estimator for log Z when labeled with LB . A.10.5 FURTHER EXPERIMENTS AND COMPARISONS Comparison with competing methods. We extend our evaluation by comparing DBS against several state-of-the-art techniques, including Gaussian Mixture Model Variational Inference (GMMVI) (Arenz et al., 2022), Sequential Monte Carlo (SMC) (Del Moral et al., 2006), Continual Repeated Annealed Flow Transport (CRAFT) (Arbel et al., 2021; Matthews et al., 2022), and Flow Annealed Importance Sampling Bootstrap (FAB) (Midgley et al., 2023). The results, presented in Table 3, are primarily drawn from Blessing et al. (2024), where hyperparameters were carefully optimized. Since our experimental setup differs for the Credit and Cancer tasks (detailed in Section A.10), we adhered to the tuning recommendations provided by Blessing et al. (2024). Across most tasks, we observe that the underdamped variants of DBS and CMCD consistently yield similar or tighter bounds on log Z compared to the competing methods, without the necessity for hyperparameter tuning. Notably, the underdamped version of DBS consistently performs well across all tasks and demonstrates robustness, as evidenced by the low variance between different random seeds. Choice of integrator. To complement the results from Section 4, we conducted an ablation study evaluating the performance and runtime of different integrators for ULA, MCD, CMCD, and DIS. The results are presented in Figures 7 and 8. Consistent with previous findings, the OBABO integrator delivers the best overall performance, with the exception of ULA. We hypothesize that in the Published as a conference paper at ICLR 2025 8 16 32 64 128 0 8 16 32 64 128 0 8 16 32 64 128 0 8 16 32 64 128 0 EM (OD) EM (UD) OBAB BAOAB OBABO Figure 7: Effective sample size (ESS) for various methods (ULA, MCD, CMCD, DIS) and different integration schemes, averaged across multiple benchmark problems and four seeds. Integration schemes include Euler Maruyama (EM) for over- (OD) and underdamped (OD) Langevin dynamics and various splitting schemes (OBAB, BAOAB, OBABO). case of ULA, simulating the controlled part (O) twice offers little advantage, as both the forward and backward processes are uncontrolled for this method. wallclock time [s] Funnel (d = 10) wallclock time [s] Credit (d = 25) 0 1 2 3 4 5 wallclock time [s] Seeds (d = 26) wallclock time [s] Cancer (d = 31) wallclock time [s] Ionosphere (d = 35) wallclock time [s] Sonar (d = 61) EM (OD) EM (UD) OBAB BAOAB OBABO Figure 8: Effective sample size (ESS) over wallclock time of the diffusion bridge sampler (DBS) with 128 diffusion steps for different integration schemes, multiple benchmark problems, and four seeds. Integration schemes include Euler-Maruyama (EM) for over- (OD) and underdamped (UD) Langevin dynamics and various splitting schemes (OBAB, BAOAB, OBABO). Additional results for DBS. We present further details regarding the results discussed in Section 4. Specifically, we provide a breakdown of the performance of different integration schemes Published as a conference paper at ICLR 2025 8 16 32 64 128 Funnel (d = 10) 8 16 32 64 128 Credit (d = 25) 8 16 32 64 128 Seeds (d = 26) 8 16 32 64 128 Cancer (d = 31) 8 16 32 64 128 Brownian (d = 32) 8 16 32 64 128 Ionosphere (d = 35) 8 16 32 64 128 0.2 Many Well (d = 50) 8 16 32 64 128 Sonar (d = 61) EM (OD) EM (UD) OBAB BAOAB OBABO Figure 9: Effective sample size (ESS) of the diffusion bridge sampler (DBS) for different integration schemes, multiple benchmark problems, and four seeds. Integration schemes include Euler-Maruyama (EM) for over- (OD) and underdamped (UD) Langevin dynamics and various splitting schemes (OBAB, BAOAB, OBABO). σ π M T σπ Mσ Figure 10: Effective sample size (ESS) of the underdamped diffusion bridge sampler (DBS) for various combinations of learned parameters, averaged across multiple benchmark problems and four seeds with N = 8 discretization steps. Parameters include mass matrix M, diffusion matrix σ, terminal time T, and extended prior distribution π. across all tasks in Figure 9 (ESS values) and Table 5 (log Z (LB) values). Overall, we observe a notable improvement in performance with (symmetric) splitting schemes compared to Euler Maruyama discretization. However, as the number of discretization steps increases, the performance differences between OBAB, BAOAB, and OBABO become less pronounced. Interestingly, OBABO tends to yield substantial performance gains when the number of discretization steps is small. Furthermore, we examine the impact of parameter learning for N = 8 discretization steps, with the results shown in Figure 10. Surprisingly, while learning either the terminal time T or the parameters of the prior distribution yields modest improvements, learning both leads to a remarkable 5 performance increase. Choice of drift for DBS. The drift term ef in the diffusion bridge sampler (DBS) can be freely chosen. To explore the impact of different drift choices, we conducted an ablation study. We tested several options: no drift, x log pprior, x log ptarget, and a geometric annealing path, represented by x log ν, where ν(x, s) p1 β(s) prior (x)pβ(s) target(x). We also tested using a learned function for β. The results of these experiments are presented in Table 4 and Figure 11. The findings suggest that the most consistent performance is achieved when using the learned geometric annealing path as the drift ef. Interestingly, using the score of the target distribution, x log ptarget, resulted in worse performance compared to no drift for overdamped DBS and only marginal improvements for underdamped DBS. Published as a conference paper at ICLR 2025 x log pprior x log ptarget x log ν (learned) x log pprior x log ptarget x log ν (learned) wallclock time [ s 104 ] overdamped underdamped Figure 11: Effective sample size (ESS) and wallclock time for various drifts ef of the underdamped and overdamped diffusion bridge sampler, averaged across multiple benchmark problems and four seeds. Here, ν(x, s) p1 β(s) prior (x)pβ(s) target(x), where learned indicates that β is learned (end-to-end). Table 4: Lower bounds on log Z for different drift function ef for DBS on various benchmark problems. Higher values indicate better performance. The best results are highlighted in bold. Here, ν(x, s) p1 β(s) prior (x)pβ(s) target(x), where learned indicates that β is learned (end-to-end). Blue shading indicates that the method uses underdamped Langevin dynamics. ef Funnel Credit Seeds Cancer Brownian Ionosphere Many Well Sonar 0 0.212 0.001 585.208 0.008 73.501 0.001 81.712 0.151 0.466 0.096 111.778 0.005 38.609 0.829 108.936 0.014 0.155 0.004 585.155 0.007 73.505 0.009 81.307 0.114 0.449 0.042 111.845 0.007 42.771 0.002 109.718 0.013 x log pprior 0.216 0.001 585.173 0.005 73.483 0.001 81.792 0.142 0.787 0.011 111.741 0.002 42.772 0.000 108.893 0.050 0.145 0.005 585.146 0.004 73.460 0.000 81.080 0.520 0.972 0.004 111.760 0.003 42.787 0.001 109.035 0.025 x log ptarget 0.186 0.001 685.852 2.400 73.467 0.000 126.194 15.528 0.901 0.004 111.979 0.018 N/A 109.463 0.010 0.096 0.004 585.271 0.022 73.445 0.006 81.250 0.219 1.061 0.004 111.826 0.005 42.782 0.000 109.264 0.025 x log ν 0.183 0.002 4990.364 4405.152 73.442 0.000 83.981 2.105 1.055 0.010 111.678 0.000 42.772 0.003 108.616 0.005 0.110 0.000 585.127 0.000 73.432 0.000 78.086 0.015 1.106 0.001 111.661 0.001 42.756 0.013 108.530 0.002 x log ν 0.175 0.003 585.166 0.017 73.438 0.000 78.853 0.168 1.074 0.005 111.673 0.001 42.769 0.002 108.593 0.008 (learned) 0.102 0.003 585.112 0.000 73.422 0.001 77.866 0.007 1.137 0.001 111.636 0.000 42.765 0.005 108.454 0.003 Table 5: Results for lower bounds on log Z for various benchmark problems, integration methods, and discretization steps N for DBS. Higher values indicate better performance. The best results are highlighted in bold. Blue shading indicates that the method uses underdamped Langevin dynamics. Integrator Funnel (d = 10) Credit (d = 25) Seeds (d = 26) Cancer (d = 31) Brownian (d = 32) Ionosphere (d = 35) Many Well (d = 50) Sonar (d = 61) EM (OD) 0.860 0.010 585.400 0.054 73.643 0.003 80.960 0.169 0.198 0.074 111.858 0.003 42.162 0.002 109.046 0.017 EM (UD) 0.725 0.001 590.417 0.859 73.852 0.036 96.286 2.349 3.185 1.159 123.426 0.022 42.002 0.018 137.601 0.025 OBAB 0.670 0.003 585.168 0.004 73.607 0.005 79.167 0.176 0.801 0.004 111.822 0.005 42.502 0.008 108.937 0.015 BAOAB 0.674 0.008 585.164 0.010 73.603 0.011 79.252 0.183 0.807 0.003 111.811 0.007 42.497 0.004 108.906 0.019 OBABO 0.557 0.004 585.179 0.005 73.560 0.008 78.951 0.050 0.835 0.017 111.818 0.009 42.625 0.002 108.933 0.007 EM (OD) 0.645 0.002 585.792 0.243 73.561 0.003 81.628 0.345 0.683 0.010 111.780 0.005 42.460 0.004 108.902 0.009 EM (UD) 0.568 0.007 587.429 0.323 73.752 0.018 82.696 2.850 0.421 0.036 123.426 0.022 42.354 0.013 137.601 0.025 OBAB 0.491 0.004 585.153 0.004 73.520 0.004 79.118 0.723 0.943 0.004 111.735 0.006 42.546 0.048 108.766 0.010 BAOAB 0.490 0.003 585.149 0.003 73.516 0.003 78.685 0.261 0.944 0.004 111.726 0.007 42.548 0.037 108.754 0.009 OBABO 0.381 0.007 585.149 0.002 73.491 0.003 78.454 0.027 0.977 0.003 111.725 0.002 42.685 0.009 108.760 0.010 EM (OD) 0.452 0.002 585.941 0.778 73.503 0.001 84.032 2.197 0.898 0.008 111.730 0.005 42.626 0.004 108.758 0.005 EM (UD) 0.425 0.006 585.388 0.120 73.627 0.003 80.207 0.338 0.595 0.014 111.973 0.009 42.552 0.009 109.378 0.026 OBAB 0.346 0.003 585.126 0.002 73.465 0.005 78.224 0.014 1.024 0.004 111.680 0.002 42.665 0.005 108.612 0.006 BAOAB 0.347 0.003 585.127 0.002 73.463 0.003 78.206 0.008 1.035 0.004 111.677 0.003 42.661 0.006 108.602 0.006 OBABO 0.249 0.003 585.129 0.004 73.448 0.002 78.189 0.069 1.048 0.005 111.673 0.004 42.729 0.002 108.601 0.008 EM (OD) 0.295 0.002 586.567 1.871 73.463 0.002 80.890 1.226 1.027 0.001 111.692 0.004 42.718 0.004 108.661 0.005 EM (UD) 0.328 0.009 585.231 0.012 73.554 0.003 79.747 0.382 0.702 0.017 111.837 0.009 42.661 0.006 109.410 0.019 OBAB 0.228 0.002 585.116 0.001 73.441 0.002 77.968 0.005 1.082 0.002 111.652 0.002 42.683 0.003 108.517 0.005 BAOAB 0.606 0.643 585.116 0.001 73.441 0.001 77.979 0.011 1.091 0.002 111.650 0.002 42.684 0.004 108.509 0.003 OBABO 0.164 0.005 585.113 0.002 73.431 0.003 77.945 0.010 1.104 0.003 111.648 0.002 42.730 0.004 108.501 0.003 EM (OD) 0.187 0.003 585.524 0.414 73.437 0.001 83.395 4.184 1.081 0.004 111.673 0.002 42.760 0.003 108.595 0.006 EM (UD) 0.249 0.003 585.235 0.009 73.508 0.005 79.704 0.177 0.684 0.038 111.786 0.006 42.731 0.002 109.351 0.075 OBAB 0.151 0.003 585.112 0.001 73.428 0.001 77.856 0.007 1.121 0.004 111.637 0.001 42.731 0.002 108.459 0.001 BAOAB 0.159 0.005 585.112 0.001 73.428 0.001 77.874 0.010 1.131 0.002 111.637 0.002 42.733 0.006 108.457 0.004 OBABO 0.103 0.003 585.112 0.001 73.423 0.001 77.881 0.014 1.136 0.001 111.636 0.001 42.763 0.002 108.458 0.004