# Piecewise deterministic generative models

Andrea Bertazzi¹, Dario Shariatian², Umut Simsekli², Eric Moulines¹,³, Alain Durmus¹
¹ École Polytechnique, Institut Polytechnique de Paris; ² INRIA, CNRS, École Normale Supérieure, PSL Research University; ³ MBZUAI
andrea.bertazzi@polytechnique.edu

We introduce a novel class of generative models based on piecewise deterministic Markov processes (PDMPs), a family of non-diffusive stochastic processes consisting of deterministic motion and random jumps at random times. Similarly to diffusions, such Markov processes admit time reversals that turn out to be PDMPs as well. We apply this observation to three PDMPs considered in the literature: the Zig-Zag process, the Bouncy Particle Sampler, and Randomised Hamiltonian Monte Carlo. For these three particular instances, we show that the jump rates and kernels of the corresponding time reversals admit explicit expressions depending on certain conditional densities of the PDMP under consideration before and after a jump. Based on these results, we propose efficient training procedures to learn these characteristics and consider methods to approximately simulate the reverse process. Finally, we provide bounds in the total variation distance between the data distribution and the resulting distribution of our model in the case where the base distribution is the standard d-dimensional Gaussian distribution. Promising numerical simulations support further investigations into this class of models.

1 Introduction

Diffusion-based generative models [Ho et al., 2020, Song et al., 2021] have recently achieved state-of-the-art performance in various fields of application [Dhariwal and Nichol, 2021, Croitoru et al., 2023, Jeong et al., 2021, Kong et al., 2021]. In their continuous-time interpretation [Song et al., 2021], these models leverage the idea that a diffusion process can bridge the data distribution $\mu$ to a base distribution $\pi$, and its time reversal can transform samples from $\pi$ into synthetic data from $\mu$. Anderson [1982] showed that the time reversal of a diffusion process, i.e., the backward process, is itself a diffusion with explicit drift and covariance functions that are related to the score functions of the time-marginal densities of the original, forward diffusion. Consequently, the key element of these generative models is learning these score functions using techniques such as (denoising) score matching [Hyvärinen, 2005, Vincent, 2011].

In this work we propose a new family of generative models which use piecewise deterministic Markov processes (PDMPs) as noising processes instead of diffusions. PDMPs were introduced around forty years ago [Davis, 1984, 1993] and have since been successfully applied in various fields, including communication networks [Dumas et al., 2002], biology [Berg and Brown, 1972, Cloez et al., 2017], risk theory [Embrechts and Schmidli, 1994], and the reliability of complex systems [Zhang et al., 2008]. More recently, PDMPs have been intensively studied in the context of Monte Carlo algorithms [Fearnhead et al., 2018] as alternatives to Langevin diffusion-based methods and Metropolis-Hastings mechanisms. This renewed interest in PDMPs has led to the development of novel processes, such as the Zig-Zag process (ZZP) [Bierkens et al., 2019a], the Bouncy Particle Sampler (BPS) [Bouchard-Côté et al., 2018], and Randomised Hamiltonian
Monte Carlo (RHMC) [Bou-Rabee and Sanz-Serna, 2017]. PDMPs offer several advantages compared to Langevin-based methods, such as better scalability and reduced computational complexity in high-dimensional settings [Bierkens et al., 2019a].

In the context of generative modelling, PDMPs offer several potential advantages over diffusion processes. A key strength is their ability to effectively model data distributions supported on constrained or restricted domains. By adjusting their deterministic dynamics, PDMPs can easily incorporate boundary behaviour, making them straightforward to implement in such settings [Bierkens et al., 2018, Davis, 1993]. Similarly, PDMPs can model data on Riemannian manifolds by employing flows that respect the manifold's geometry (see, e.g., Yang et al. [2022] for a PDMP on the sphere). Moreover, PDMPs are well suited to modelling data distributions that combine a continuous density with a positive mass on a lower-dimensional manifold [Bierkens et al., 2022].

Our contributions are the following:
1) Leveraging the existing literature on time reversals of Markov jump processes [Conforti and Léonard, 2022], we characterise the time reversal of any PDMP under appropriate conditions. It turns out that this time reversal is itself a PDMP with characteristics related to the original PDMP; see Proposition 1.
2) We further specify the characteristics of the time-reversal processes associated with the three aforementioned PDMPs: ZZP, BPS, and RHMC. For these processes, Proposition 2 shows that the corresponding time reversals are PDMPs with simple reversed deterministic motion and with jump rates and kernels that depend on (ratios of) conditional densities of the velocity of the forward process before and after a jump. In contrast to common diffusion models, the emphasis is on distributions of the velocity, similarly to the underdamped Langevin diffusion [Dockhorn et al., 2022], which includes an additional velocity vector akin to the PDMPs we consider. Moreover, the structure of the backward jump rates and kernels closely connects to the case of continuous-time jump processes on discrete state spaces [Sun et al., 2023, Lou et al., 2024].
3) We define our piecewise deterministic generative models employing either ZZP, BPS, or RHMC as forward process, transforming data points into a noise distribution of choice, and develop methodologies to estimate the backward rates and kernels. We then define the corresponding backward process based on approximations of the time-reversed ZZP, BPS, and RHMC obtained with the estimated rates and kernels. In Section 4 we test our models on simple toy distributions.
4) We obtain a bound on the total variation distance between the data distribution and the distribution of our generative models, taking into account two sources of error: first, the approximation of the characteristics of the backward PDMP, and second, its initialisation from the limiting distribution of the forward process; see Theorem 1.

2 PDMP based generative models

2.1 Piecewise deterministic Markov processes

Informally, a PDMP [Davis, 1984, 1993] on the measurable space $(\mathbb{R}^D, \mathcal{B}(\mathbb{R}^D))$ is a stochastic process that follows deterministic dynamics between random times, while at these times the process can evolve stochastically on the basis of a Markov kernel.
In order to define a PDMP precisely, we need three components, called the characteristics of the PDMP: a vector field $\Phi : \mathbb{R}_+ \times \mathbb{R}^D \to \mathbb{R}^D$, which governs the deterministic motion; a jump rate $\lambda : \mathbb{R}_+ \times \mathbb{R}^D \to \mathbb{R}_+$, which defines the law of the random event times; and finally a jump kernel $Q : \mathbb{R}_+ \times \mathbb{R}^D \times \mathcal{B}(\mathbb{R}^D) \to [0, 1]$, which is applied at event times and defines the new state of the process.

Let us give an informal description of the evolution of a PDMP $Z_t$, clarifying the role of the three characteristics. Suppose at time $T \in \mathbb{R}_+$ the PDMP is at state $z \in \mathbb{R}^D$, that is $Z_T = z$. The deterministic motion of the PDMP is described by the ODE $\mathrm{d}Z_{T+s} = \Phi(T+s, Z_{T+s})\,\mathrm{d}s$ for $s \geq 0$, with initial condition $Z_T = z$. We introduce the differential flow $\varphi : (t, s, z) \mapsto \varphi_{t,t+s}(z)$, which solves the ODE in the sense that $\mathrm{d}\varphi_{t,t+s}(z) = \Phi(t+s, \varphi_{t,t+s}(z))\,\mathrm{d}s$ for $s \geq 0$. The process evolves deterministically according to $\varphi$ until the next event time $T + \tau$, where $\tau$ is a random variable with law $\mathbb{P}(\tau > s \mid Z_T = z) = \exp\big(-\int_0^s \lambda(T+u, \varphi_{T,T+u}(z))\,\mathrm{d}u\big)$, i.e. the exponential distribution with non-homogeneous rate $s \mapsto \lambda(T+s, \varphi_{T,T+s}(z))$. We can, at least in principle, simulate $\tau$ by solving
$$\tau = \inf\Big\{ t > 0 : \int_0^t \lambda(T+u, \varphi_{T,T+u}(z))\,\mathrm{d}u \geq E \Big\}, \qquad (1)$$
where $E \sim \mathrm{Exp}(1)$. The process is then defined on $[T, T+\tau)$ by $Z_{T+t} = \varphi_{T,T+t}(Z_T)$ for $t \in [0, \tau)$. At time $T+\tau$ the process jumps to a new state drawn from the Markov kernel $Q$, hence we set $Z_{T+\tau} \sim Q(T+\tau, \varphi_{T,T+\tau}(z), \cdot)$. A realisation of the path of a PDMP over a given time horizon can then be obtained following this procedure (see also Algorithm 1 in Appendix C.1 for pseudo-code). The formal construction of a PDMP can be found in Appendix A.1.

Typically a PDMP has several types of jumps, which can be represented by a family of jump rates and kernels $(\lambda_i, Q_i)_{i\in\{1,\ldots,\ell\}}$. A PDMP of this type can be obtained with the construction we have described by setting
$$\lambda(t, z) = \sum_{i=1}^{\ell} \lambda_i(t, z), \qquad Q(t, z, \mathrm{d}z') = \sum_{i=1}^{\ell} \frac{\lambda_i(t, z)}{\lambda(t, z)}\, Q_i(t, z, \mathrm{d}z'). \qquad (2)$$
An alternative, equivalent construction of a PDMP with $\lambda, Q$ satisfying (2) is given in Appendix A.2. Finally, we say a PDMP is homogeneous (as opposed to the non-homogeneous case we have described) when the characteristics do not depend on time, that is $\Phi : \mathbb{R}^D \to \mathbb{R}^D$, $\lambda : \mathbb{R}^D \to \mathbb{R}_+$, and $Q : \mathbb{R}^D \times \mathcal{B}(\mathbb{R}^D) \to [0, 1]$. Throughout this work, we suppose that the PDMPs we consider are non-explosive in the sense of Davis [1993], that is, the time of the $n$-th random event goes to $+\infty$ as $n \to +\infty$ almost surely (see Durmus et al. [2021] for conditions ensuring this).

We now introduce the three PDMPs we consider throughout the paper. All these PDMPs are time-homogeneous and live on a state space of the form $E = \mathbb{R}^d \times \mathcal{V}$, for $\mathcal{V} \subset \mathbb{R}^d$, assuming $V_0 \in \mathcal{V}$. Then $Z_t$ can be decomposed as $Z_t = (X_t, V_t)$, where $X_t \in \mathbb{R}^d$ is the component of interest and has the interpretation of the position of a particle, whereas $V_t \in \mathcal{V}$ is an auxiliary vector playing the role of the particle's velocity. In the sequel, if there is no risk of confusion, we take the convention that any $z \in \mathbb{R}^d \times \mathcal{V}$ is written $z = (x, v)$ with $x \in \mathbb{R}^d$ and $v \in \mathcal{V}$. All the PDMPs below have a stationary distribution of the form $\pi(\mathrm{d}x)\,\nu(\mathrm{d}v)$, where $\pi$ has density proportional to $x \mapsto e^{-\psi(x)}$ for a continuously differentiable potential $\psi : \mathbb{R}^d \to \mathbb{R}$, and $\nu$ is a simple distribution on $\mathcal{V}$ for the velocity vector. In our experiments we take $\pi$ to be the standard normal distribution, while $\nu$ is the standard normal when $\mathcal{V} = \mathbb{R}^d$ or the uniform distribution when $\mathcal{V}$ is a compact set. Figure 1 shows sample paths for the position vector of the three PDMPs introduced below.

Figure 1: Trace plots for ZZP (left), BPS (centre), RHMC (right). In all cases $\lambda_r = 1$ and $T_f = 10$.
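The procedure just described is easy to turn into code. The following is a minimal sketch (not the authors' implementation): it assumes the user supplies the flow map $\varphi$, a routine that solves (1) for the next event time (in closed form or by thinning), and a sampler for the jump kernel $Q$; all names are illustrative.

```python
import numpy as np

def simulate_pdmp(z0, T, flow, event_time, jump_kernel, rng=None):
    """Simulate one PDMP path on [0, T] following the recipe of Section 2.1.

    flow(t, s, z)       -- deterministic flow phi_{t, t+s}(z)
    event_time(t, z, e) -- smallest tau with int_0^tau lambda(t+u, phi_{t,t+u}(z)) du >= e, cf. (1)
    jump_kernel(t, z)   -- draws a new state from Q(t, z, .)
    """
    rng = np.random.default_rng() if rng is None else rng
    t, z = 0.0, np.asarray(z0, dtype=float)
    while True:
        e = rng.exponential(1.0)        # E ~ Exp(1)
        tau = event_time(t, z, e)       # waiting time until the next jump
        if t + tau >= T:                # horizon reached before the next event
            return flow(t, T - t, z)
        z = flow(t, tau, z)             # deterministic motion up to the event
        z = jump_kernel(t + tau, z)     # jump drawn from the kernel Q
        t = t + tau
```

For the homogeneous processes of this section the time arguments can simply be ignored; for a constant rate $\lambda_r$, for instance, `event_time` reduces to `e / lam_r`.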
The Zig-Zag process. The Zig-Zag process (ZZP) [Bierkens et al., 2019a] is a PDMP with state space $E_{\mathrm{Z}} = \mathbb{R}^d \times \{-1, 1\}^d$. The deterministic motion is determined by the homogeneous vector field $\Phi_{\mathrm{Z}}(x, v) = (v, 0)^T$, i.e. the particle moves with constant velocity $v$. For $i \in \{1, \ldots, d\}$ we define the jump rates $\lambda_i^{\mathrm{Z}}(x, v) := (v_i\,\partial_i\psi(x))_+ + \lambda_r$, where $(a)_+ = \max(0, a)$, $\partial_i$ denotes the $i$-th partial derivative, and $\lambda_r \geq 0$ is a user-chosen refreshment rate. The corresponding (deterministic) jump kernels are given by $Q_i^{\mathrm{Z}}((x, v), (\mathrm{d}y, \mathrm{d}w)) = \delta_{(x, R_i^{\mathrm{Z}} v)}(\mathrm{d}y, \mathrm{d}w)$, where $\delta_z$ denotes the Dirac measure at $z \in E$. Here, $R_i^{\mathrm{Z}}$ is the operator that reverses the sign of the $i$-th component of the vector to which it is applied, i.e. $R_i^{\mathrm{Z}} v = (v_1, \ldots, v_{i-1}, -v_i, v_{i+1}, \ldots, v_d)$. The ZZP falls within our definition of a PDMP by taking $\lambda, Q$ as in (2). As shown in Bierkens et al. [2019a], the ZZP has invariant distribution $\pi \otimes \nu$, where $\nu$ is the uniform distribution over $\{-1, 1\}^d$. Moreover, Bierkens et al. [2019b] show that for any $\lambda_r \geq 0$ the law of the ZZP converges exponentially fast to its invariant distribution, e.g. when $\pi$ is a standard normal distribution.
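For the ZZP with standard Gaussian target ($\partial_i\psi(x) = x_i$) the forward process can be simulated exactly, since the coordinates evolve independently and the event times admit a closed-form inversion (both facts are detailed in Appendix C.1). The sketch below assumes this Gaussian setting and handles the refreshment rate $\lambda_r$ by superposing an independent exponential clock; it is an illustration, not the paper's reference code.

```python
import numpy as np

def zz_event_time(vx, e):
    """Solve int_0^tau (vx + u)_+ du = e for tau (closed form, Appendix C.1)."""
    return -vx + np.sqrt(max(vx, 0.0) ** 2 + 2.0 * e)

def simulate_zzp_gaussian(x0, v0, T, lam_r=1.0, rng=None):
    """Exact forward ZZP path for pi = N(0, I_d); v0 has entries in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    x, v = np.array(x0, dtype=float), np.array(v0, dtype=float)
    for i in range(len(x)):                      # coordinates evolve independently
        t = 0.0
        while True:
            tau_refl = zz_event_time(v[i] * x[i], rng.exponential(1.0))
            tau_ref = rng.exponential(1.0 / lam_r) if lam_r > 0 else np.inf
            tau = min(tau_refl, tau_ref)         # first of the two competing clocks
            if t + tau >= T:
                x[i] += v[i] * (T - t)           # drift until the horizon, no more flips
                break
            x[i] += v[i] * tau
            v[i] = -v[i]                         # both event types flip the i-th velocity
            t += tau
    return x, v
```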
The Bouncy Particle Sampler. The Bouncy Particle Sampler (BPS) [Bouchard-Côté et al., 2018] is a PDMP with state space $E_{\mathrm{B}} = \mathbb{R}^d \times \mathcal{V}_{\mathrm{B}}$, where $\mathcal{V}_{\mathrm{B}} = \mathbb{R}^d$ or $\mathcal{V}_{\mathrm{B}} = \mathbb{S}^{d-1} := \{v \in \mathbb{R}^d : \|v\| = 1\}$. The deterministic motion is governed, as for the ZZP, by the homogeneous vector field defined for $z = (x, v) \in E$ by $\Phi_{\mathrm{B}}(x, v) = (v, 0)^T$. We now introduce two jump rates which correspond to two types of random events: reflections and refreshments. Reflections enforce that $\mu(x, v) = \pi(x)\nu(v)$ is the invariant density of the process, where $\pi(\mathrm{d}x) \propto \exp(-\psi(x))\,\mathrm{Leb}(\mathrm{d}x)$ is a given distribution and $\nu$ is either a standard normal distribution when $\mathcal{V}_{\mathrm{B}} = \mathbb{R}^d$ or the uniform distribution on $\mathbb{S}^{d-1}$ when $\mathcal{V}_{\mathrm{B}} = \mathbb{S}^{d-1}$. Reflections are associated to the homogeneous jump rate $(x, v) \mapsto \lambda_1^{\mathrm{B}}(x, v) = \langle v, \nabla\psi(x)\rangle_+$, while refreshments are associated to $(x, v) \mapsto \lambda_2^{\mathrm{B}}(x, v) = \lambda_r$ for $\lambda_r > 0$. The corresponding jump kernels are
$$Q_1^{\mathrm{B}}((x, v), (\mathrm{d}y, \mathrm{d}w)) = \delta_{(x, R_x^{\mathrm{B}} v)}(\mathrm{d}y, \mathrm{d}w), \qquad Q_2^{\mathrm{B}}((x, v), (\mathrm{d}y, \mathrm{d}w)) = \delta_x(\mathrm{d}y)\,\nu(\mathrm{d}w),$$
where $R_x^{\mathrm{B}} v = v - 2\,\big(\langle v, \nabla\psi(x)\rangle / \|\nabla\psi(x)\|^2\big)\,\nabla\psi(x)$. The operator $R_x^{\mathrm{B}}$ reflects the velocity $v$ off the hyperplane tangent to the contour line of $\psi$ passing through the point $x$. The norm of the velocity is unchanged by the application of $R_x^{\mathrm{B}}$, which gives the interpretation of an elastic collision of the particle off this hyperplane. As observed in Bouchard-Côté et al. [2018], the BPS requires a strictly positive $\lambda_r$ to avoid being reducible, that is, to make sure the process can reach any region of the state space. Exponential convergence of the BPS to its invariant distribution was shown by Deligiannidis et al. [2019], Durmus et al. [2020].

Randomised Hamiltonian Monte Carlo. Randomised Hamiltonian Monte Carlo (RHMC) [Bou-Rabee and Sanz-Serna, 2017] refers to the PDMP with state space $E_{\mathrm{H}} = \mathbb{R}^d \times \mathbb{R}^d$ which is characterised by Hamiltonian deterministic flow and refreshments of the velocity vector from the standard normal distribution. The flow is governed by the homogeneous vector field $(x, v) \mapsto \Phi_{\mathrm{H}}(x, v) = (v, -\nabla\psi(x))^T$, where $\psi$ is the potential of $\pi$. The jump rate coincides with the refreshment part of the BPS, i.e., it is the constant function $\lambda_{\mathrm{H}} : (x, v) \mapsto \lambda_r > 0$, with jump kernel $Q_{\mathrm{H}}((x, v), (\mathrm{d}y, \mathrm{d}w)) = \delta_x(\mathrm{d}y)\,\nu(\mathrm{d}w)$.

When the stationary distribution $\pi$ is a standard Gaussian, the deterministic dynamics $(x_t, v_t)_{t\geq 0}$ satisfy $\mathrm{d}x_t = v_t\,\mathrm{d}t$, $\mathrm{d}v_t = -x_t\,\mathrm{d}t$, which for $t \geq 0$ has solution $x_t = x_0\cos(t) + v_0\sin(t)$ and $v_t = -x_0\sin(t) + v_0\cos(t)$, where $(x_0, v_0)$ is the initial condition. It is well known that Hamiltonian dynamics preserve the density $\mu(x, v) = \pi(x)\nu(v)$ [Neal, 2010], where $\nu$ is the standard normal distribution, while velocity refreshments are necessary to ensure the process is irreducible. Exponential convergence of the law of this PDMP to $\mu$ was shown in Bou-Rabee and Sanz-Serna [2017].

Remark 1 (Noise schedule). Similarly to diffusion models, we can introduce a noise schedule $\beta(t)$ that regulates the amount of randomness injected at time $t$. This can be achieved via a time change of a given PDMP with characteristics $(\Phi, \lambda, Q)$ as forward process, resulting in the PDMP with characteristics $(\Phi_\beta, \lambda_\beta, Q)$, where $\Phi_\beta(t, z) = \beta(t)\Phi(t, z)$ and $\lambda_\beta(t, z) = \beta(t)\lambda(t, z)$.

2.2 Time reversal of PDMPs

In this section we characterise the time reversal of a PDMP. This key result, stated in Proposition 1, is essential in order to use PDMPs for generative modelling. The time reversal of a PDMP $(Z_t)_{t\in[0,T_f]}$ with initial distribution $\mu_0$ is the process that at time $t \in [0, T_f]$ has distribution $\mu_0 P_{T_f - t}$, where $\mu_0 P_s$ denotes the law of $Z_s$. It follows that the law of the time reversal at time $T_f$ is $\mu_0$, which is the key observation in the context of generative modelling. Characterisations of the law of time-reversed Markov processes with jumps were obtained in Conforti and Léonard [2022], and in the following statement we adapt their Theorem 5.7 to our setting, showing that the time reversal of a PDMP with characteristics $(\Phi, \lambda, Q)$ is a PDMP with reversed deterministic motion and jump rates and kernels satisfying (3).

Proposition 1. Consider a non-explosive PDMP $(Z_t)_{t\geq 0}$ with characteristics $(\Phi, \lambda, Q)$ and initial distribution $\mu_0$ on $\mathbb{R}^D$. In addition, let $T_f$ be a time horizon. Suppose that $\Phi$ is locally bounded, $(t, z) \mapsto \lambda(t, z)$ is continuous in both its variables, and $\int_0^{T_f} \mathbb{E}[\lambda(t, Z_t)]\,\mathrm{d}t < \infty$. Assume the technical conditions H3, H4, postponed to the appendix. Then the corresponding time reversal process is a PDMP with characteristics $(\overleftarrow{\Phi}, \overleftarrow{\lambda}, \overleftarrow{Q})$, where $\overleftarrow{\Phi}(t, z) = -\Phi(T_f - t, z)$ and $\overleftarrow{\lambda}, \overleftarrow{Q}$ are the unique solutions to the following balance equation: for almost all $t \in [0, T_f]$,
$$\mu_0 P_{T_f - t}(\mathrm{d}y)\,\overleftarrow{\lambda}(t, y)\,\overleftarrow{Q}(t, y, \mathrm{d}z) = \mu_0 P_{T_f - t}(\mathrm{d}z)\,\lambda(T_f - t, z)\,Q(T_f - t, z, \mathrm{d}y), \qquad (3)$$
where $\mu_0 P_t$ stands for the distribution of $Z_t$ starting from $\mu_0$.

The proof is postponed to Appendix A.4. The condition H3 is standard in the literature on PDMPs [Davis, 1993] and is verified for ZZP, BPS, and RHMC. H4 is a technical assumption on the domain of the generator of the forward PDMP and has been shown to hold e.g. for the ZZP.

In the next proposition we derive expressions for the backward jump rate and kernel satisfying (3) corresponding to a forward PDMP with characteristics of the same structure as those of ZZP, BPS, and RHMC. We state the result assuming the PDMP has only one jump type, but the generalisation to the case of $\ell > 1$ jump mechanisms of the form (2) is immediately obtained by applying Proposition 2 to each pair $(\lambda_i, Q_i)$ for $i \in \{1, \ldots, \ell\}$. We refer to Appendix A.6 for the details.

Proposition 2. Consider a non-explosive PDMP $(X_t, V_t)_{t\geq 0}$ with characteristics $(\Phi, \lambda, Q)$ and initial distribution $\mu_0^X \otimes \mu_0^V$ on $\mathbb{R}^{2d}$. In addition, let $T_f$ be a time horizon.
Suppose that $\Phi$ and $\lambda$ satisfy the same conditions as in Proposition 1, in particular the technical conditions H3, H4 postponed to the appendix. Suppose in addition that for any $t \in (0, T_f]$, the conditional distribution of $V_t$ given $X_t$ has a transition density $(x, v) \mapsto p_t(v \mid x)$ with respect to some reference measure $\mu^V_{\mathrm{ref}}$ on $\mathbb{R}^d$.

(1) (Deterministic jumps). Suppose $Q((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\delta_{R_y w}(\mathrm{d}v)$, where for any $y \in \mathbb{R}^d$, $R_y : \mathbb{R}^d \to \mathbb{R}^d$ is an involution which preserves $\mu^V_{\mathrm{ref}}$, i.e., $R_y^{-1} = R_y$ and $\mu^V_{\mathrm{ref}}(\mathrm{d}R_y w) = \mu^V_{\mathrm{ref}}(\mathrm{d}w)$. Then for almost all $t \in [0, T_f]$ and any $(y, w) \in \mathbb{R}^{2d}$ such that $p_{T_f - t}(w \mid y) > 0$ it holds that
$$\overleftarrow{\lambda}(t, (y, w)) = \frac{p_{T_f - t}(R_y w \mid y)}{p_{T_f - t}(w \mid y)}\,\lambda(T_f - t, (y, R_y w)), \qquad \overleftarrow{Q}((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\delta_{R_y w}(\mathrm{d}v).$$

(2) (Refreshments). Suppose $Q((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\nu(\mathrm{d}v \mid y)$, where $\nu$ is a transition kernel on $\mathbb{R}^d \times \mathcal{B}(\mathbb{R}^d)$, and $\lambda(t, (y, w)) = \lambda(t, y)$. Suppose also that for any $y \in \mathbb{R}^d$, $\nu(\cdot \mid y)$ is absolutely continuous with respect to $\mu^V_{\mathrm{ref}}$. Then for almost all $t \in [0, T_f]$ and any $(y, w) \in \mathbb{R}^{2d}$ such that $p_{T_f - t}(w \mid y) > 0$ it holds that
$$\overleftarrow{\lambda}(t, (y, w)) = \frac{(\mathrm{d}\nu/\mathrm{d}\mu^V_{\mathrm{ref}})(w \mid y)}{p_{T_f - t}(w \mid y)}\,\lambda(T_f - t, y), \qquad \overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,p_{T_f - t}(v \mid x)\,\mu^V_{\mathrm{ref}}(\mathrm{d}v).$$

The proof is postponed to Appendix A.5. We remark that we regard $\mu_0^V$ as a distribution on $\mathbb{R}^d$ also when $\mu_0^V(\mathcal{V}) = 1$ for some $\mathcal{V} \subset \mathbb{R}^d$, in which case the reference measure can simply be chosen such that $\mu^V_{\mathrm{ref}}(\mathcal{V}) = 1$.

Applying Proposition 2, we are able to derive explicit expressions for the characteristics of the time reversals of ZZP, RHMC, and BPS. The rigorous statements and their proofs can be found in Appendix A.7. For ZZP and BPS we assume the following condition on $\pi$, the limiting distribution of the position vector of the forward process.

H1. Recall $\pi(x) \propto e^{-\psi(x)}$. It holds that $\psi \in C^2(\mathbb{R}^d)$ and $\sup_{x \in \mathbb{R}^d} \|\nabla^2\psi(x)\| < +\infty$.

This assumption is satisfied e.g. by any multivariate normal distribution. For BPS and RHMC we suppose that for any $t \in (0, T_f]$, the conditional distribution of $V_t$ given $X_t$ has a transition density $(x, v) \mapsto p_t(v \mid x)$ with respect to the Lebesgue measure. Moreover, for all samplers we assume H4.

Time reversal of ZZP. In order to apply Proposition 2 we additionally assume that $\int |\partial_i\psi(x)|\,\mathrm{d}\mu(x) < \infty$ for all $i = 1, \ldots, d$. We find that the deterministic motion is defined by $\overleftarrow{\Phi}_{\mathrm{Z}}(y, w) = (-w, 0)^T$ for any $(y, w) \in \mathbb{R}^{2d}$, while the backward rates and kernels are, for $i = 1, \ldots, d$ and all $(y, w) \in \mathbb{R}^{2d}$ such that $p_{T_f - t}(w \mid y) > 0$,
$$\overleftarrow{\lambda}_i^{\mathrm{Z}}(t, (y, w)) = \frac{p_{T_f - t}(R_i^{\mathrm{Z}} w \mid y)}{p_{T_f - t}(w \mid y)}\,\lambda_i^{\mathrm{Z}}(y, R_i^{\mathrm{Z}} w), \qquad \overleftarrow{Q}_i^{\mathrm{Z}}((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R_i^{\mathrm{Z}} w)}(\mathrm{d}x, \mathrm{d}v). \qquad (4)$$

Time reversal of BPS. Whereas in Appendix A.7 we consider the case where the velocity of the BPS is initialised on $\mathbb{S}^{d-1}$, we can formally apply Proposition 2 to the case where $\nu$ is the standard $d$-dimensional Gaussian distribution, assuming that $\int \|\nabla\psi(x)\|\,\mathrm{d}\mu(x) < \infty$. The drift of the backward BPS is clearly the same as for the backward ZZP, while the jump rates and kernels are, for all $t \in [0, T_f]$ and $(y, w) \in \mathbb{R}^{2d}$ such that $p_{T_f - t}(w \mid y) > 0$,
$$\overleftarrow{\lambda}_1^{\mathrm{B}}(t, (y, w)) = \frac{p_{T_f - t}(R_y^{\mathrm{B}} w \mid y)}{p_{T_f - t}(w \mid y)}\,\lambda_1^{\mathrm{B}}(y, R_y^{\mathrm{B}} w), \qquad \overleftarrow{Q}_1^{\mathrm{B}}((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R_y^{\mathrm{B}} w)}(\mathrm{d}x, \mathrm{d}v),$$
$$\overleftarrow{\lambda}_2^{\mathrm{B}}(t, (y, w)) = \lambda_r\,\frac{\nu(w)}{p_{T_f - t}(w \mid y)}, \qquad \overleftarrow{Q}_2^{\mathrm{B}}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = p_{T_f - t}(v \mid y)\,\delta_y(\mathrm{d}x)\,\mathrm{d}v. \qquad (5)$$

Time reversal of RHMC. The deterministic motion of the backward RHMC follows the vector field $\overleftarrow{\Phi}_{\mathrm{H}}(x, v) = (-v, \nabla\psi(x))^T$, which, when the limiting distribution $\pi$ is Gaussian, has solution $x_t = x_0\cos(t) - v_0\sin(t)$ and $v_t = x_0\sin(t) + v_0\cos(t)$. The backward refreshment rate and kernel coincide with those of the BPS as given in (5).
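To make the role of the learned quantities explicit, the sketch below assembles the approximate backward ZZP rates of (4) from a generic ratio estimator. Here `s_theta` stands for a trained network returning the $d$ ratios and `grad_psi` for the known $\nabla\psi$ (the identity map for a standard Gaussian target); both are placeholders, not objects defined in the paper.

```python
import numpy as np

def backward_zzp_rates(y, w, t, Tf, s_theta, grad_psi, lam_r=1.0):
    """Approximate backward ZZP rates per (4): ratio_i * lambda_i^Z(y, R_i^Z w).

    s_theta(y, w, s) -> array of shape (d,) estimating p_s(R_i^Z w | y) / p_s(w | y).
    The i-th backward event flips the i-th component of the velocity w.
    """
    ratios = s_theta(y, w, Tf - t)
    # forward rate evaluated at the flipped velocity: ((R_i^Z w)_i d_i psi(y))_+ + lam_r
    forward_rates = np.maximum(-w * grad_psi(y), 0.0) + lam_r
    return ratios * forward_rates
```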
Remark 2 (Variance exploding PDMPs). Similarly to the case of diffusion models [Song et al., 2021], we can define variance exploding PDMPs by choosing $\psi(x) = 0$ for all $x \in \mathbb{R}^d$, that is, when $\pi(\mathrm{d}x)$ is the Lebesgue measure. In this case, the deterministic motion of RHMC coincides with that of ZZP and BPS, and all three processes have only velocity refreshment events.

2.3 Approximating the characteristics of time reversals of PDMPs

In Section 2.2 we showed that the backward jump rates and kernels of ZZP, BPS, and RHMC involve the conditional densities of the velocity vector of the forward process given its position vector at all times $t \in [0, T_f]$. These conditional densities are unavailable in analytic form, hence in this section we develop methods to learn the jump rates and kernels of our time-reversed PDMPs. In Appendix D we give pseudo-code and more detailed descriptions of the training procedure for our models, together with a comparison with diffusion models.

Approximating the jump rates of the backward ZZP via ratio matching. In the case of the ZZP, we need to approximate, for any $i \in \{1, \ldots, d\}$, the rates in (4). Since the terms $\lambda_i^{\mathrm{Z}}(x, R_i^{\mathrm{Z}} v)$ are known, it is sufficient to estimate the density ratios $r_i^{\mathrm{Z}}(x, v, t) := p_t(R_i^{\mathrm{Z}} v \mid x)/p_t(v \mid x)$ for all states $(x, v)$ such that $p_t(v \mid x) > 0$. To this end, we introduce a class of functions $\{s^\theta : \mathbb{R}^d \times \{-1, 1\}^d \times [0, T_f] \to \mathbb{R}_+^d : \theta \in \Theta\}$ for some parameter set $\Theta \subset \mathbb{R}^{d_\theta}$, and aim to find a parameter $\theta^\star \in \Theta$ such that for any $i \in \{1, \ldots, d\}$, the $i$-th component of $s^{\theta^\star}$, denoted by $s_i^{\theta^\star}(\cdot)$, is an approximation of $r_i^{\mathrm{Z}}$. We then approximate the backward ZZP by using the rates $\hat\lambda_i^{\mathrm{Z}}(t, (x, v)) = s_i^{\theta^\star}(x, v, T_f - t)\,\lambda_i^{\mathrm{Z}}(x, R_i^{\mathrm{Z}} v)$.

To address the problem of fitting $\theta$, we consider different loss functions inspired by the ratio matching (RM) problem considered in Hyvärinen [2007]. For a discrete probability density $p_\star$ on $\{-1, 1\}^d$, RM consists in learning the $d$ ratios $v \mapsto p_\star(R_i v)/p_\star(v)$ for $i \in \{1, \ldots, d\}$. This problem was motivated in Hyvärinen [2007] as a means to estimate $p_\star$ without requiring its normalising constant, similarly to score matching applied to estimate continuous probability densities [Hyvärinen, 2005]. In our context we are interested only in the ratios, hence, as opposed to Hyvärinen [2007], we do not model the conditional distributions $(x, v) \mapsto p_t(v \mid x)$ but directly the ratios $r_i^{\mathrm{Z}}$. Adapting the ideas of Hyvärinen [2007] to our context, we introduce the function $G : r \mapsto (1 + r)^{-1}$ and define the Explicit Ratio Matching objective function
$$\ell_E(\theta) = \int_0^{T_f} \omega(t) \sum_{i=1}^{d} \mathbb{E}\Big[ \big\{G(s_i^\theta(X_t, V_t, t)) - G(r_i(X_t, V_t, t))\big\}^2 + \big\{G(s_i^\theta(X_t, R_i^{\mathrm{Z}} V_t, t)) - G(r_i(X_t, R_i^{\mathrm{Z}} V_t, t))\big\}^2 \Big]\,\mathrm{d}t, \qquad (6)$$
where $\omega : [0, T_f] \to \mathbb{R}_+$ is a probability density and $(X_t, V_t)_{t\geq 0}$ is a ZZP initialised from $\mu \otimes \mathrm{Unif}(\{-1, 1\}^d)$. This objective function considers simultaneously the squared error in the estimation of both $(x, v, t) \mapsto r_i(x, v, t)$ and $(x, v, t) \mapsto r_i(x, R_i^{\mathrm{Z}} v, t)$, where the function $G$ improves numerical stability, particularly when one of the two ratios is very small. Clearly $\ell_E(\theta) = 0$ if and only if $s_i^\theta(x, v, t) = r_i(x, v, t)$ for almost all $x, v, t$ and all $i$. Moreover, the choice of $G$ allows us to optimise without knowledge of the true ratios, as shown in the following result.

Proposition 3. It holds that $\arg\min_\theta \ell_E(\theta) = \arg\min_\theta \ell_I(\theta)$ for
$$\ell_I(\theta) = \int_0^{T_f} \omega(t) \sum_{i=1}^{d} \mathbb{E}\Big[ G^2(s_i^\theta(X_t, V_t, t)) + G^2(s_i^\theta(X_t, R_i^{\mathrm{Z}} V_t, t)) - 2\,G(s_i^\theta(X_t, V_t, t)) \Big]\,\mathrm{d}t, \qquad (7)$$
where $(X_t, V_t)_{t\in\mathbb{R}_+}$ is a ZZP starting from $\mu \otimes \mathrm{Unif}(\{-1, 1\}^d)$.
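In practice (7) is estimated on minibatches of forward ZZP states drawn at random times $t \sim \omega$; its formal empirical counterpart is given next. A possible PyTorch-style rendering is sketched below, where `model` is a placeholder network returning the $d$ positive ratio estimates $s_i^\theta(x, v, t)$.

```python
import torch

def G(r):
    """Stabilising transform G(r) = 1 / (1 + r)."""
    return 1.0 / (1.0 + r)

def implicit_ratio_matching_loss(model, x, v, t):
    """Minibatch estimate of the implicit ratio-matching loss (7).

    model(x, v, t) -> tensor of shape (batch, d) with positive entries s_i^theta(x, v, t).
    """
    d = v.shape[1]
    s = model(x, v, t)
    loss = G(s).pow(2).sum(dim=1) - 2.0 * G(s).sum(dim=1)
    for i in range(d):                       # d extra forward passes: cost linear in d
        v_flip = v.clone()
        v_flip[:, i] = -v_flip[:, i]         # R_i^Z v
        loss = loss + G(model(x, v_flip, t)[:, i]).pow(2)
    return loss.mean()
```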
Therefore we aim to solve the minimisation problem associated with $\ell_I$, which has the empirical counterpart
$$\hat\ell_I(\theta) = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{d} \Big[ G^2(s_i^\theta(X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)) + G^2(s_i^\theta(X^n_{\tau^n}, R_i^{\mathrm{Z}} V^n_{\tau^n}, \tau^n)) - 2\,G(s_i^\theta(X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)) \Big],$$
where $\{\tau^n\}_{n=1}^N$ are i.i.d. samples from $\omega$, independent of $\{(X^n_t, V^n_t)_{t\geq 0}\}_{n=1}^N$, which are $N$ i.i.d. realisations of the ZZP starting respectively at the $n$-th training data point with velocity $V^n_0$, where $\{V^n_0\}_{n=1}^N$ are i.i.d. observations from $\mathrm{Unif}(\{-1, 1\}^d)$. Notice that the loss above has computational cost increasing linearly in $d$ because $d + 1$ evaluations of the model are needed for each datum. This can be improved by considering an estimate of the ratio which does not take as input the whole velocity vector (see Appendix D.1 for the details). This variation has computational cost that is constant in the dimension, but may have lower accuracy.

Approximating the characteristics of BPS and RHMC. For BPS and RHMC, Proposition 2 shows that, if we aim to sample from the backward process, we have to estimate ratios of the conditional density of the velocity of the forward PDMP given its position at any time $t \in [0, T_f]$, and we must also be able to sample from these densities as prescribed by the backward jump kernel (5). In order to address both requirements, we introduce a parametric family of conditional probability distributions $\{p_\theta : \theta \in \Theta\}$ of the form $(x, v, t) \mapsto p_\theta(v \mid x, t)$, where $\Theta \subset \mathbb{R}^{d_\theta}$, which we model within the framework of normalising flows (NFs) [Papamakarios et al., 2021]. The advantage of NFs lies in the fact that, once the network is learned, it is possible both to evaluate an estimate of the density at a given state and time and to generate samples which are approximately distributed according to $(x, v, t) \mapsto p_t(v \mid x)$. However, training conditional NFs can be challenging in high dimensions.

Focusing on the BPS, we now illustrate how we can use NFs to learn the backward jump rates and kernels. We aim to find a parameter $\theta_B^\star$ such that $p_{\theta_B^\star}(v \mid x, t)$ is a good approximation of $p_t(v \mid x)$, the conditional density of the forward BPS with respect to the Lebesgue measure. We choose to optimise $\theta$ following the maximum likelihood approach, which gives the theoretical loss
$$\ell_{ML}(\theta) = -\int_0^{T_f} \omega(t)\,\mathbb{E}\big[\log p_\theta(V_t \mid X_t, t)\big]\,\mathrm{d}t, \qquad (8)$$
where $\omega : [0, T_f] \to \mathbb{R}_+$ is a probability density and $(X_t, V_t)_{t\geq 0}$ is a BPS initialised from $\mu \otimes \nu$, with $\nu$ denoting the density of the $d$-dimensional standard normal distribution. The optimal parameter $\theta_B^\star$ can then be found by minimising the empirical counterpart of $\ell_{ML}(\theta)$:
$$\theta_B^\star = \arg\min_\theta\; -\frac{1}{N}\sum_{n=1}^{N} \log p_\theta(V^n_{\tau^n} \mid X^n_{\tau^n}, \tau^n), \qquad (9)$$
where $\{\tau^n\}_{n=1}^N$ are i.i.d. samples from $\omega$, independent of $\{(X^n_t, V^n_t)_{t\geq 0}\}_{n=1}^N$, which are $N$ i.i.d. realisations of the BPS starting respectively at the $n$-th training data point with velocity $V^n_0$, where $\{V^n_0\}_{n=1}^N$ are i.i.d. observations from the multivariate standard normal distribution. Once we have obtained the optimal parameter $\theta_B^\star$, we can define our approximation of the backward refreshment mechanism of the BPS by taking the rate $\hat\lambda_2^{\mathrm{B}}(t, (x, v)) = \lambda_r\,\nu(v)/p_{\theta_B^\star}(v \mid x, T_f - t)$ and the kernel $\hat Q_2^{\mathrm{B}}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = p_{\theta_B^\star}(v \mid y, T_f - t)\,\delta_y(\mathrm{d}x)\,\mathrm{d}v$. Similarly, we estimate the backward reflection rate of the BPS as $\hat\lambda_1^{\mathrm{B}}(t, (x, v)) = \lambda_1^{\mathrm{B}}(x, R_x^{\mathrm{B}} v)\, p_{\theta_B^\star}(R_x^{\mathrm{B}} v \mid x, T_f - t)/p_{\theta_B^\star}(v \mid x, T_f - t)$.
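A possible instantiation of (8)-(9) models $p_\theta(v \mid x, t)$ with a conditional neural spline flow, e.g. via the Zuko library cited in the experiments. The sketch below assumes Zuko's interface, in which calling a flow on a context returns a torch distribution; the architecture settings and variable names are illustrative assumptions, not the paper's configuration.

```python
import torch
import zuko

d = 2  # toy data dimension
# conditional NF p_theta(v | x, t): the context is the concatenation (x, t)
flow = zuko.flows.NSF(features=d, context=d + 1, transforms=3)
optimiser = torch.optim.Adam(flow.parameters(), lr=1e-3)

def nll_loss(flow, x, v, t):
    """Minibatch negative log-likelihood, the empirical counterpart of (8)."""
    context = torch.cat([x, t.unsqueeze(-1)], dim=-1)
    return -flow(context).log_prob(v).mean()

# one optimisation step on a batch of forward BPS/RHMC states (x, v) at times t ~ omega:
# loss = nll_loss(flow, x, v, t); optimiser.zero_grad(); loss.backward(); optimiser.step()
```

At generation time the same object provides both ingredients of (5): sampling from `flow(context)` plays the role of the backward refreshment kernel, while its `log_prob` gives the conditional densities entering the backward rates.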
2.4 Simulating the backward process

We now discuss how we can simulate a backward PDMP with exact backward flow map $(t, x, v) \mapsto \overleftarrow{\varphi}_t(x, v)$ and jump characteristics $\hat\lambda$ and $\hat Q$ that are approximations of the jump rates and kernels of the time-reversed PDMPs, obtained as discussed in Section 2.3.

We recall that the backward rates have the general form $\hat\lambda(t, (x, v)) = s^\theta(x, v, T_f - t)\,\lambda(x, Rv)$, where $s^\theta$ is an estimate of a density ratio and $R$ is a suitable involution. In principle such a PDMP can be simulated following the procedure described in Section 2.1, but the generation of the random jump times via (1) requires the integration of $\hat\lambda(t, \overleftarrow{\varphi}_t(x, v))$ with respect to $t$. This cannot be done exactly since $\hat\lambda$ is defined through a neural network. A standard approach in the literature (see e.g. Bertazzi et al. [2022, 2023]) is to discretise time and (informally) approximate the integral in (1) with a finite sum. Here we focus on approximations based on the splitting schemes discussed in Bertazzi et al. [2023], adapting their ideas to the non-homogeneous case. Such splitting schemes approximate a PDMP with a Markov chain defined on a time grid $\{t_n\}_{n\in\{0,\ldots,N\}}$, with $t_0 = 0$ and $t_N = T_f$. The key idea is that the deterministic motion and the jump part of the PDMP are simulated separately, in a suitable order, obtaining second order accuracy under suitable conditions (see Theorem 2.6 in Bertazzi et al. [2023]).

We now give an informal description of the splitting scheme that we use for RHMC, which is based on the splitting DJD of Bertazzi et al. [2023], where D stands for deterministic motion and J for jumps. We define our Markov chain based on the step sizes $\{\delta_j\}_{j\in\{1,\ldots,N\}}$, where $\delta_j = t_j - t_{j-1}$. Suppose we have defined the Markov chain on $\{t_k\}_{k\in\{0,\ldots,n\}}$ for $n < N$ and that the state at time $t_n$ is $(x_{t_n}, v_{t_n})$. The next state is obtained in three steps. First, the particle moves according to its deterministic motion for a half step, that is, we define an intermediate state $(\bar x_{t_n}, \bar v_{t_n}) = \overleftarrow{\varphi}_{\delta_{n+1}/2}(x_{t_n}, v_{t_n})$. Second, we turn our attention to the jump part of the process. In this phase, the particle is only allowed to move through jumps and there is no deterministic motion. This means that the rate is frozen at the value $\hat\lambda(t_n + \delta_{n+1}/2, (\bar x_{t_n}, \bar v_{t_n}))$ and thus the integral in (1) can be computed trivially. The proposal for the next event time is then given by $\tau_{n+1} \sim \mathrm{Exp}\big(\hat\lambda(t_n + \delta_{n+1}/2, (\bar x_{t_n}, \bar v_{t_n}))\big)$. If $\tau_{n+1} \leq \delta_{n+1}$, we draw $w \sim \hat Q(t_n + \delta_{n+1}/2, (\bar x_{t_n}, \bar v_{t_n}), \cdot)$ and update $\bar v_{t_n} = w$; otherwise we leave $\bar v_{t_n}$ unchanged. Finally, we conclude with an additional half step of deterministic motion, letting $(x_{t_{n+1}}, v_{t_{n+1}}) = \overleftarrow{\varphi}_{\delta_{n+1}/2}(\bar x_{t_n}, \bar v_{t_n})$. We refer to Appendix C.2 for a detailed description of the schemes used for each process together with the pseudo-code.
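One step of the DJD scheme described above can be sketched as follows for the backward RHMC with Gaussian $\pi$, whose backward flow is the explicit rotation of Section 2.2. The callables `rate_fn` and `kernel_fn` stand for the learned backward refreshment rate and kernel; they are assumptions of this sketch.

```python
import numpy as np

def backward_rhmc_flow(x, v, s):
    """Backward Hamiltonian flow for Gaussian pi: x' = -v, v' = x."""
    c, si = np.cos(s), np.sin(s)
    return x * c - v * si, x * si + v * c

def djd_step(x, v, t, delta, rate_fn, kernel_fn, rng):
    """One DJD splitting step: half flow, frozen-rate jump, half flow."""
    x, v = backward_rhmc_flow(x, v, delta / 2)          # D: first half step
    lam = rate_fn(t + delta / 2, x, v)                  # J: rate frozen at the midpoint
    if lam > 0 and rng.exponential(1.0) / lam <= delta: # event proposed within the step
        v = kernel_fn(t + delta / 2, x, v)              # refresh the velocity
    x, v = backward_rhmc_flow(x, v, delta / 2)          # D: second half step
    return x, v
```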
3 Error bound in total variation distance

In this section, we give a bound on the total variation distance between the data distribution $\mu$ and the law of the synthetic data generated by a PDMP with initial distribution $\pi \otimes \nu$ and approximate characteristics obtained, e.g., with the methods described in Section 2.3. We obtain our result by comparing the law of such a PDMP to the law of the exact time reversal obtained in Section 2.2, that is, the PDMP with the analytic characteristics of Proposition 2 and with initial distribution $\mathcal{L}(X_{T_f}, V_{T_f})$, i.e. the law of the forward PDMP at time $T_f$ when initialised from $\mu \otimes \nu$. In Theorem 1 below we take into account two of the three sources of error in our models: (i) the error introduced by initialising the backward PDMP from the limiting distribution of the forward process, and (ii) the error due to the approximation of the backward rates and kernels. For simplicity we neglect the discretisation error caused by the methods discussed in Section 2.4.

We shall assume the following condition, which deals with the error introduced by initialising the backward PDMP from $\pi \otimes \nu$.

H2. The forward PDMP with semigroup $(P_t)_{t\geq 0}$ is such that there exist $\gamma, C > 0$ for which $\|\pi \otimes \nu - (\mu \otimes \nu)P_t\|_{TV} \leq C e^{-\gamma t}$.

Informally, H2 is verified for some $C < \infty$ when $\pi$ is a multivariate standard Gaussian distribution for ZZP and BPS if the tails of $\mu$ are at least as light as those of $\pi$, while for RHMC it is enough that $\mu$ has finite second moments. We refer to Appendix E.1 for a more detailed discussion of this aspect. We are now ready to state our result.

Theorem 1. Consider a non-explosive PDMP $(X_t, V_t)_{t\geq 0}$ with initial distribution $\mu \otimes \nu$, stationary distribution $\pi \otimes \nu$, and characteristics $(\Phi, \lambda, Q)$. Let $T_f$ be a time horizon. Suppose the assumptions of Proposition 1 as well as H2 hold. Let $(\hat X_t, \hat V_t)_{t\in[0,T_f]}$ be a non-explosive PDMP with initial distribution $\pi \otimes \nu$ and characteristics $(\overleftarrow{\Phi}, \hat\lambda, \hat Q)$, where $\overleftarrow{\Phi}(t, (x, v)) = -\Phi(T_f - t, (x, v))$ for all $t \in [0, T_f]$ and $(x, v) \in \mathbb{R}^{2d}$. Then it holds that
$$\|\mu - \mathcal{L}(\hat X_{T_f})\|_{TV} \leq C e^{-\gamma T_f} + 2\,\mathbb{E}\Big[\int_0^{T_f} g_{T_f - t}(X_t, V_t)\,\mathrm{d}t\Big], \qquad (10)$$
where
$$g_t(x, v) = \frac{(\hat\lambda \wedge \overleftarrow{\lambda})(t, (x, v))}{2}\,\big\|\hat Q(t, (x, v), \cdot) - \overleftarrow{Q}(t, (x, v), \cdot)\big\|_{TV} + \big|\overleftarrow{\lambda}(t, (x, v)) - \hat\lambda(t, (x, v))\big|, \qquad (11)$$
and $\overleftarrow{\lambda}, \overleftarrow{Q}$ are as given by Proposition 1.

The proof is postponed to Appendix E.2. The first term in (10) is caused by initialising the process from $\pi \otimes \nu$, while the second term represents the error introduced by the approximate jump rate $\hat\lambda$ and kernel $\hat Q$. For the sake of illustration we derive a simple upper bound on (10) in the case of the ZZP (for the details see Appendix E.3). We assume that the conditions of Theorem 1 are satisfied and also that the expected error of the learned rates $\hat\lambda_i^{\mathrm{Z}}$ is bounded by a constant, in the sense that $\mathbb{E}\big[|r_i^{\mathrm{Z}}(X_t, V_t, T_f - t) - s_i^\theta(X_t, V_t, T_f - t)|\,\lambda_i^{\mathrm{Z}}(X_t, R_i^{\mathrm{Z}} V_t)\big] \leq M$ for all $t \in [0, T_f]$ and $i \in \{1, \ldots, d\}$. The latter condition is similar to the standard assumption imposed on the approximation of the score in diffusion models [Chen et al., 2023]. Under these conditions we obtain the bound
$$\|\mu - \mathcal{L}(\hat X_{T_f})\|_{TV} \leq C e^{-\gamma T_f} + 4 M T_f d. \qquad (12)$$

4 Numerical simulations

In this section, we test our piecewise deterministic generative models on simple synthetic datasets.

Design. We compare the generative models based on ZZP, BPS, and RHMC with the improved denoising diffusion probabilistic model (i-DDPM) given in Nichol and Dhariwal [2021]. For all of our models, we choose the standard normal distribution as target distribution for the position vector, as well as for the velocity vector in the cases of BPS and RHMC. The accuracy of the trained generative models is evaluated by the kernel maximum mean discrepancy (MMD). We refer to Appendix F for a detailed description of the parameter and network choices.

Sample quality. In Table 1 we report the MMD score for five two-dimensional toy distributions. We observe that the PDMP-based generative models perform well compared to i-DDPM on all five datasets. In particular, ZZP and i-DDPM are implemented with the same neural network architecture, hence ZZP appears to compare favourably to i-DDPM with the same model expressivity.

| Dataset | i-DDPM | BPS | RHMC | ZZP |
|---|---|---|---|---|
| Checkerboard | 2.49 ± 0.98 | 1.96 ± 1.51 | 4.27 ± 3.36 | 0.81 ± 0.19 |
| Fractal tree | 8.04 ± 5.58 | 2.25 ± 1.70 | 4.41 ± 4.35 | 1.12 ± 0.58 |
| Gaussian grid | 23.19 ± 9.72 | 4.59 ± 4.03 | 4.01 ± 3.32 | 4.43 ± 4.05 |
| Olympic rings | 2.03 ± 1.60 | 2.07 ± 1.19 | 2.41 ± 2.24 | 1.43 ± 0.86 |
| Rose | 6.77 ± 5.81 | 1.92 ± 1.57 | 2.16 ± 1.59 | 0.90 ± 0.35 |

Table 1: MMD (in units of 1e-3), averaged over 6 runs, with the corresponding standard deviations.

Figure 2: Comparative results on two-dimensional generation of synthetic datasets (Fractal tree and Olympic rings).
The results of Table 1 are supported by the plots of generated data shown in Figure 2, illustrating how ZZP and BPS are able to generate more detailed edges compared to i-DDPM. In Figure 4, we compare the output of RHMC and i-DDPM for a very small number of reverse steps. We observe that in this setting the data generated by RHMC are noticeably closer to the true data distribution than those of i-DDPM. This phenomenon is observed also for BPS, as shown in Table 2, and is intuitively caused by the refreshment kernel, which is able to generate velocities that correct wrong positions. In line with this intuition, ZZP does not perform as well as BPS and RHMC for a small number of reverse steps, since its velocities are constrained to $\{-1, 1\}^d$. Nonetheless, ZZP generates the most accurate results in our experiments given a large enough number of reverse steps. Table 2 and Figure 3 show that PDMP-based models require less computational time to generate samples of a given quality compared to i-DDPM. This is the case because PDMP models require considerably fewer backward steps than i-DDPM, although each step is more expensive (see Table 2). Additional results can be found in Appendix F, including promising results applying ZZP to the MNIST dataset.

| steps | i-DDPM | BPS | RHMC | ZZP |
|---|---|---|---|---|
| 2 | 696.28 | 165.09 | 26.48 | 358.25 |
| 5 | 192.17 | 22.18 | 3.00 | 89.49 |
| 10 | 45.08 | 5.48 | 1.75 | 11.31 |
| 25 | 12.34 | 1.58 | 0.60 | 1.20 |
| 100 | 8.72 | 3.66 | 1.72 | 1.04 |
| time/step (ms) | 3.94 | 45.8 | 15.1 | 11.2 |

Table 2: MMD for various numbers of backward steps, Rose dataset.

Figure 3: MMD against runtime (ms) per method, Rose dataset.

Figure 4: Comparing RHMC and i-DDPM for a small number of reverse steps (2 and 10) on the Gaussian grid and Rose datasets.

5 Discussion and conclusions

We have introduced new generative models based on piecewise deterministic Markov processes, developing a theoretically sound framework with a specific focus on three PDMPs from the sampling literature. While this work lays the foundations of this class of methods, it also opens several directions worth investigating in the future. Similarly to other generative models, our PDMP-based algorithms are sensitive to the choice of the network architecture used to approximate the backward characteristics. Therefore, it is crucial to investigate which architectures are best suited to our algorithms in order to achieve state-of-the-art performance in real-world scenarios. For instance, in the case of BPS and RHMC it could be beneficial to separate the estimation of the density ratios from the generation of draws of the velocity conditioned on the position and time. For the case of ZZP, efficient techniques to learn the network in a high-dimensional setting need to be investigated, while network architectures that resemble those used to approximate the score function appear to adapt well to the case of density ratios. Moreover, there are several alternative PDMPs that could be used as generative models and that we did not consider in detail in this paper, for instance variance exploding alternatives.

Acknowledgments and Disclosure of Funding

AB, EM, AD are funded by the European Union (ERC-2022-SyG, 101071601). US and DS are funded by the European Union (ERC, Dynasty, 101039676). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency.
Neither the European Union nor the granting authority can be held responsible for them. US is additionally funded by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute). The authors are grateful to the CLEPS infrastructure from the Inria of Paris for providing resources and support. Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313 326, 1982. ISSN 0304-4149. doi: https://doi.org/10.1016/0304-4149(82) 90051-5. Howard C Berg and Douglas A Brown. Chemotaxis in escherichia coli analysed by threedimensional tracking. Nature, 239(5374):500 504, 1972. Andrea Bertazzi, Joris Bierkens, and Paul Dobson. Approximations of Piecewise Deterministic Markov Processes and their convergence properties. Stochastic Processes and their Applications, 154:91 153, 2022. Andrea Bertazzi, Paul Dobson, and Pierre Monmarché. Piecewise deterministic sampling with splitting schemes. ar Xiv.2301.02537, 2023. Joris Bierkens, Alexandre Bouchard-Côté, Arnaud Doucet, Andrew B. Duncan, Paul Fearnhead, Thibaut Lienart, Gareth Roberts, and Sebastian J. Vollmer. Piecewise deterministic markov processes for scalable monte carlo on restricted domains. Statistics & Probability Letters, 136:148 154, 2018. The role of Statistics in the era of big data. Joris Bierkens, Paul Fearnhead, and Gareth Roberts. The zig-zag process and super-efficient sampling for bayesian analysis of big data. Annals of Statistics, 47, 2019a. Joris Bierkens, Gareth O Roberts, and Pierre-André Zitt. Ergodicity of the zigzag process. The Annals of Applied Probability, 29(4):2266 2301, 2019b. Joris Bierkens, Sebastiano Grazzi, Frank van der Meulen, and Moritz Schauer. Sticky PDMP samplers for sparse and local inference problems. Statistics and Computing, 33(1):8, 2022. Nawaf Bou-Rabee and Jesús María Sanz-Serna. Randomized hamiltonian monte carlo. The Annals of Applied Probability, 27(4):2159 2194, 2017. Alexandre Bouchard-Côté, Sebastian J. Vollmer, and Arnaud Doucet. The Bouncy Particle Sampler: A Nonreversible Rejection-Free Markov Chain Monte Carlo Method. Journal of the American Statistical Association, 113(522):855 867, 2018. Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. Cloez, Bertrand, Dessalles, Renaud, Genadot, Alexandre, Malrieu, Florent, Marguet, Aline, and Yvinec, Romain. Probabilistic and Piecewise Deterministic models in Biology. ESAIM: Procs, 60:225 245, 2017. Giovanni Conforti and Christian Léonard. Time reversal of markov processes with jumps under a finite entropy condition. Stochastic Processes and their Applications, 144:85 124, 2022. ISSN 0304-4149. doi: https://doi.org/10.1016/j.spa.2021.10.002. Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023. M. H. A. Davis. Piecewise-Deterministic Markov Processes: A General Class of Non-Diffusion Stochastic Models. Journal of the Royal Statistical Society. Series B (Methodological), 46(3): 353 388, 1984. M.H.A. Davis. Markov Models & Optimization. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis, 1993. ISBN 9780412314100. 
George Deligiannidis, Alexandre Bouchard-Côté, and Arnaud Doucet. Exponential ergodicity of the bouncy particle sampler. The Annals of Statistics, 47(3):1268 1287, 06 2019. Prafulla Dhariwal and Alexander Quinn Nichol. Diffusion models beat GANs on image synthesis. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. Tim Dockhorn, Arash Vahdat, and Karsten Kreis. Score-based generative modeling with criticallydamped langevin diffusion. In International Conference on Learning Representations, 2022. Vincent Dumas, Fabrice Guillemin, and Philippe Robert. A Markovian analysis of additive-increase multiplicative-decrease algorithms. Advances in Applied Probability, 34(1):85 111, 2002. doi: 10.1239/aap/1019160951. Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. Alain Durmus, Arnaud Guillin, and Pierre Monmarché. Geometric ergodicity of the Bouncy Particle Sampler. The Annals of Applied Probability, 30(5):2069 2098, 10 2020. Alain Durmus, Arnaud Guillin, and Pierre Monmarché. Piecewise deterministic Markov processes and their invariant measures. Annales de l Institut Henri Poincaré, Probabilités et Statistiques, 57(3):1442 1475, 2021. Paul Embrechts and Hanspeter Schmidli. Ruin estimation for a general insurance risk model. Advances in Applied Probability, 26(2):404 422, 1994. Paul Fearnhead, Joris Bierkens, Murray Pollock, and Gareth O. Roberts. Piecewise deterministic markov processes for continuous-time monte carlo. Statistical Science, 33(3):386 412, 08 2018. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840 6851. Curran Associates, Inc., 2020. Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695 709, 2005. Aapo Hyvärinen. Some extensions of score matching. Comput. Stat. Data Anal., 51:2499 2512, 2007. M. Jacobsen. Point Process Theory and Applications: Marked Point and Piecewise Deterministic Processes. Probability and Its Applications. Birkhäuser Boston, 2005. ISBN 9780817642150. Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, and Nam Soo Kim. Diff TTS: A Denoising Diffusion Model for Text-to-Speech. In Proc. Interspeech 2021, pages 3605 3609, 2021. doi: 10.21437/Interspeech.2021-469. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), San Diega, CA, USA, 2015. Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. In International Conference on Learning Representations, 2021. Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. ar Xiv:2310.16834, 2024. Radford M. Neal. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113 162, 2010. Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. 
In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8162 8171. PMLR, 18 24 Jul 2021. George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1 64, 2021. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach De Vito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library, 2019. Franccois Rozet et al. Zuko: Normalizing flows in pytorch, 2022. Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio matching under the bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64, 10 2011. doi: 10.1007/s10463-011-0343-8. Haoran Sun, Lijun Yu, Bo Dai, Dale Schuurmans, and Hanjun Dai. Score-based continuous-time discrete diffusion models. In The Eleventh International Conference on Learning Representations, 2023. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661 1674, 2011. Jun Yang, Krzysztof Łatuszy nski, and Gareth O Roberts. Stereographic Markov chain Monte Carlo. ar Xiv preprint ar Xiv:2205.12112, 2022. Huilong Zhang, Franccois Dufour, Y. Dutuit, and Karine Gonzalez. Piecewise deterministic markov processes and dynamic reliability. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability, 222:545 551, 12 2008. doi: 10.1243/1748006XJRR181. The Appendix is organised as follows. Appendix A includes the details and the proofs regarding Section 2.1 and Section 2.2. Appendix B contains details and proofs regarding the framework of density ratio matching. Appendix C gives details and pseudo-codes for the exact simulation of our forward processes (see Appendix C.1), and also for the splitting schemes that are used to approximate the backward processes (see Appendix C.2). Appendix D details the algorithms used to train our generative models and outlines the computational differences with the popular framework of denoising diffusion models. Appendix E contains the proof for Theorem 1 and some related details Finally, Appendix F contains the details on the numerical simulations, as well as additional results. A PDMPs and their time reversals A.1 Construction of a PDMP Here we describe the formal construction of a PDMP with the characteristics (Φ, λ, Q). To this end, consider the differential flow φ : (t, s, z) 7 φt,t+s(z), which solves the ODE, dzt+s = Φ(t + s, zt+s)ds for s 0, i.e. zt+s = φt,t+s(zt). We define by recursion on n N the process on (Zt)t [0,Tn] on [0, Tn] and the increasing sequence of jump times (Tn)n N starting from an initial state Z0 and setting T0 = 0. Assume that (Ti)i {0,...,n} and (Zt)t [0,Tn] are defined for some n N. We now define (Zt)t [Tn,Tn+1]. 
First, we define
$$\tau_{n+1} = \inf\Big\{ t > 0 : \int_0^t \lambda(T_n + u, \varphi_{T_n, T_n+u}(Z_{T_n}))\,\mathrm{d}u \geq E_{n+1} \Big\}, \qquad (13)$$
where $E_{n+1} \sim \mathrm{Exp}(1)$, and set the $(n+1)$-th jump time $T_{n+1} = T_n + \tau_{n+1}$. The process is then defined on $[T_n, T_{n+1})$ by $Z_{T_n+t} = \varphi_{T_n, T_n+t}(Z_{T_n})$ for $t \in [0, \tau_{n+1})$. Finally, we set $Z_{T_{n+1}} \sim Q(T_{n+1}, \varphi_{T_n, T_n+\tau_{n+1}}(Z_{T_n}), \cdot)$. The process $(Z_t)_{t\geq 0}$ is a Markov process by [Jacobsen, 2005, Theorem 7.3.1].

A.2 Construction of a PDMP with multiple jump types

In this section we describe the formal construction of a non-homogeneous PDMP with characteristics $(\Phi, \lambda, Q)$, where $\lambda, Q$ are of the form
$$\lambda(t, z) = \sum_{i=1}^{\ell} \lambda_i(t, z), \qquad Q(t, z, \mathrm{d}z') = \sum_{i=1}^{\ell} \frac{\lambda_i(t, z)}{\lambda(t, z)}\,Q_i(t, z, \mathrm{d}z'). \qquad (14)$$
Recall the differential flow $\varphi : (t, s, z) \mapsto \varphi_{t,t+s}(z)$, which solves the ODE $\mathrm{d}z_{t+s} = \Phi(t+s, z_{t+s})\,\mathrm{d}s$ for $s \geq 0$, i.e. $z_{t+s} = \varphi_{t,t+s}(z_t)$. Similarly to the case of a single jump type, we start the PDMP from an initial state $Z_0$, assume it is defined as $(Z_t)_{t\in[0,T_n]}$ on $[0, T_n]$ for some $n \in \mathbb{N}$, and we now define $(Z_t)_{t\in[T_n, T_{n+1}]}$. First, we define the proposals $(\tau^i_{n+1})_{i\in\{1,\ldots,\ell\}}$ for the next event time as
$$\tau^i_{n+1} = \inf\Big\{ t > 0 : \int_0^t \lambda_i(T_n + u, \varphi_{T_n, T_n+u}(Z_{T_n}))\,\mathrm{d}u \geq E^i_{n+1} \Big\},$$
where $E^i_{n+1} \sim \mathrm{Exp}(1)$ for $i \in \{1, \ldots, \ell\}$. Then define $i^\star = \arg\min_{i\in\{1,\ldots,\ell\}} \tau^i_{n+1}$ and set the next jump time to $T_{n+1} = T_n + \tau^{i^\star}_{n+1}$. The process is then defined on $[T_n, T_{n+1})$ by $Z_{T_n+t} = \varphi_{T_n, T_n+t}(Z_{T_n})$ for $t \in [0, \tau^{i^\star}_{n+1})$. Finally, we set $Z_{T_{n+1}} \sim Q_{i^\star}(T_{n+1}, \varphi_{T_n, T_n+\tau^{i^\star}_{n+1}}(Z_{T_n}), \cdot)$.

A.3 Extended generator

In order to obtain the generator of a PDMP, Theorem 26.14 of Davis [1993] requires "standard conditions" on the characteristics (see conditions (24.8) in Davis [1993]). We state these conditions for a non-homogeneous PDMP in the next assumption.

H3. The non-homogeneous characteristics $(\Phi, \lambda, Q)$ satisfy the following conditions:
1. $\Phi$ is locally Lipschitz and the associated flow map $\varphi$ has infinite explosion time;
2. $\lambda$ is such that $u \mapsto \lambda(t+u, \varphi_{t,t+u}(x))$ is integrable on $[0, \varepsilon(x, t))$ for some $\varepsilon(x, t) > 0$ and all $(t, x) \in \mathbb{R}_+ \times E$;
3. $Q$ is measurable and such that $Q(t, x, \{x\}) = 0$ for all $(t, x) \in \mathbb{R}_+ \times E$;
4. let $(T_n)_{n\in\{0,1,\ldots\}}$ be the random sequence of event times of the PDMP and define $N_t = \sum_{k=0}^{\infty} \mathbb{1}_{\{t \geq T_k\}}$; it holds that $\mathbb{E}_x[N_t] < \infty$ for all $(t, x) \in \mathbb{R}_+ \times E$.

Notably, the PDMP is required to be non-explosive in the sense that the expected number of random events up to any time $t$, starting the PDMP from any state, is finite. These conditions are verified for all three PDMPs we consider as forward processes. Assuming H3, we can apply Theorem 26.14 in Davis [1993] to the homogeneous PDMP obtained by including the time variable, which gives that the extended generator of the non-homogeneous PDMP with characteristics $(\Phi, \lambda, Q)$ is given by
$$L_tf(z) = \langle \Phi(t, z), \nabla_z f(z)\rangle + \lambda(t, z)\int_{\mathbb{R}^d} (f(y) - f(z))\,Q(t, z, \mathrm{d}y), \qquad (15)$$
for all functions $f \in \mathrm{dom}(L_t)$, that is, the space of measurable functions such that
$$M^f_t = f(Z_t) - f(Z_0) - \int_0^t L_sf(Z_s)\,\mathrm{d}s$$
is a local martingale. We also introduce the carré du champ $\Gamma_t(f, g) := L_t(fg) - f\,L_tg - g\,L_tf$, with domain $\mathrm{dom}(\Gamma_t) := \{f, g : f, g, fg \in \mathrm{dom}(L_t)\}$, which in the case of a PDMP with generator (15) takes the form
$$\Gamma_t(f, g)(z) = \lambda(t, z)\int_{\mathbb{R}^d} (f(y) - f(z))(g(y) - g(z))\,Q(t, z, \mathrm{d}y).$$

A.4 Proof of Proposition 1

In order to prove Proposition 1 we apply Conforti and Léonard [2022, Theorem 5.7], and hence in this section we verify the required assumptions. Before starting, we state the following technical condition, which we omitted in Proposition 1 and which is assumed in Conforti and Léonard [2022, Theorem 5.7].

H4. It holds that $C_c^2(\mathbb{R}^d) \subset \mathrm{dom}(L_t)$ for any $t \in \mathbb{R}_+$.

We now turn to verifying the remaining assumptions in Conforti and Léonard [2022, Theorem 5.7].
The General Hypotheses of Conforti and Léonard [2022] are satisfied since we assume the vector field $\Phi$ is locally bounded, the switching rate $(t, z) \mapsto \lambda(t, z)$ is a continuous function, and the jump kernel $Q$ is such that $Q(t, x, \cdot)$ is a probability distribution. In particular these assumptions imply that
$$\sup_{t\in[0,T_f],\,|z|\leq\rho} \int_{\mathbb{R}^d} (1 \wedge |z - y|^2)\,\lambda(t, z)\,Q(t, z, \mathrm{d}y) \leq \sup_{t\in[0,T_f],\,|z|\leq\rho} \lambda(t, z) < \infty \quad \text{for all } \rho \geq 0.$$
Then, Conforti and Léonard [2022, Theorem 5.7] requires a further integrability condition, which is satisfied when
$$\int_{[0,T_f]\times\mathbb{R}^d\times\mathbb{R}^d} (1 \wedge |z - y|^2)\,\mu_0P_t(\mathrm{d}z)\,\lambda(t, z)\,Q(t, z, \mathrm{d}y)\,\mathrm{d}t < \infty.$$
It is then sufficient to have that
$$\int_0^{T_f} \mathbb{E}[\lambda(t, Z_t)]\,\mathrm{d}t < \infty. \qquad (16)$$
Finally, Conforti and Léonard [2022, Theorem 5.7] requires some technical assumptions which we now discuss. Introduce the class of functions that are twice continuously differentiable and compactly supported, denoted by $C_c^2(\mathbb{R}^d)$, and for $f \in C_c^2(\mathbb{R}^d)$ consider the following two conditions:
$$\int_0^{T_f}\!\!\int_{\mathbb{R}^d} \mu_0P_t(\mathrm{d}z)\,|L_tf(z)|\,\mathrm{d}t < \infty, \qquad (17)$$
$$\int_0^{T_f}\!\!\int_{\mathbb{R}^d} \mu_0P_t(\mathrm{d}z)\,|\Gamma_t(f, g)(z)|\,\mathrm{d}t < \infty \quad \text{for all } g \in C_c^2(\mathbb{R}^d). \qquad (18)$$
We define $\mathcal{F} := \{f \in C_c^2(\mathbb{R}^d) : (17), (18) \text{ hold}\}$. We need to verify that $\mathcal{F} = C_c^2(\mathbb{R}^d)$. Let us start by considering (17): we find
$$\int_0^{T_f}\!\!\int \mu_0P_t(\mathrm{d}z)\,|L_tf(z)|\,\mathrm{d}t \leq \int_0^{T_f}\!\!\int \mu_0P_t(\mathrm{d}z)\Big( |\langle \Phi(t, z), \nabla f(z)\rangle| + \lambda(t, z)\int |f(y) - f(z)|\,Q(t, z, \mathrm{d}y) \Big)\mathrm{d}t.$$
Since $f \in C_c^2(\mathbb{R}^d)$, the term $|\langle \Phi(t, z), \nabla f(z)\rangle|$ is compactly supported and hence integrable, while the second term is finite assuming $\int_0^{T_f} \mathbb{E}[\lambda(t, Z_t)]\,\mathrm{d}t < \infty$. Under the latter assumption, (18) can be easily verified.

A.5 Proof of Proposition 2

Let us denote the initial condition of the forward PDMP by $\mu_0 = \mu_0^X \otimes \mu_0^V$. First of all, notice that, for a PDMP with position-velocity decomposition and homogeneous jump kernel, the flux equation (3) becomes
$$\mu_0P_{\bar t}(\mathrm{d}y, \mathrm{d}w)\,\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}x, \mathrm{d}v)\,\lambda(\bar t, (x, v))\,Q((x, v), (\mathrm{d}y, \mathrm{d}w)),$$
where $\bar t = T_f - t$. Moreover, since the jump kernel leaves the position vector unchanged, this is equivalent to
$$\mu_0P_{\bar t}(\mathrm{d}w \mid y)\,\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}v \mid y)\,\lambda(\bar t, (y, v))\,Q((x, v), (\mathrm{d}y, \mathrm{d}w)),$$
where $\mu_0P_t(\mathrm{d}w \mid y)$ is the conditional law of the velocity vector given the position vector at time $t$ with initial distribution $\mu_0$.

Suppose first that $Q((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\delta_{R_yw}(\mathrm{d}v)$ for an involution $R_y$. Then we find
$$\mu_0P_{\bar t}(\mathrm{d}w \mid y)\,\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}R_yw \mid y)\,\lambda(\bar t, (y, R_yw))\,\delta_y(\mathrm{d}x)\,\delta_{R_yw}(\mathrm{d}v),$$
where we used that $\delta_{R_yw}(\mathrm{d}v) = \delta_{R_yv}(\mathrm{d}w)$ since $R_y$ is an involution. Under our assumptions we have $\mu_0P_{\bar t}(\mathrm{d}R_yw \mid y) = p_{\bar t}(R_yw \mid y)\,\mu^V_{\mathrm{ref}}(\mathrm{d}w)$ and $\mu_0P_{\bar t}(\mathrm{d}w \mid y) = p_{\bar t}(w \mid y)\,\mu^V_{\mathrm{ref}}(\mathrm{d}w)$, since we assumed $\mu^V_{\mathrm{ref}}(\mathrm{d}w) = \mu^V_{\mathrm{ref}}(\mathrm{d}R_yw)$. Hence we find, for any $(y, w) \in \mathbb{R}^{2d}$ such that $p_{\bar t}(w \mid y) > 0$,
$$\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \frac{p_{\bar t}(R_yw \mid y)}{p_{\bar t}(w \mid y)}\,\lambda(\bar t, (y, R_yw))\,\delta_y(\mathrm{d}x)\,\delta_{R_yw}(\mathrm{d}v).$$
This can only be satisfied if
$$\overleftarrow{\lambda}(t, (y, w)) = \frac{p_{\bar t}(R_yw \mid y)}{p_{\bar t}(w \mid y)}\,\lambda(\bar t, (y, R_yw)), \qquad \overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\delta_{R_yw}(\mathrm{d}v).$$

Consider now the second case, that is $Q((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_y(\mathrm{d}x)\,\nu(\mathrm{d}v \mid y)$ and $\lambda(t, (y, w)) = \lambda(t, y)$. The flux equation (3) can be rewritten as
$$\mu_0P_{\bar t}(\mathrm{d}w \mid y)\,\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}v \mid y)\,\lambda(\bar t, y)\,\delta_y(\mathrm{d}x)\,\nu(\mathrm{d}w \mid y).$$
Under our assumptions we have $\nu(\mathrm{d}w \mid y) = (\mathrm{d}\nu/\mathrm{d}\mu^V_{\mathrm{ref}})(w \mid y)\,\mu^V_{\mathrm{ref}}(\mathrm{d}w)$ and $\mu_0P_{\bar t}(\mathrm{d}w \mid y) = p_{\bar t}(w \mid y)\,\mu^V_{\mathrm{ref}}(\mathrm{d}w)$. Hence, for any $(y, w) \in \mathbb{R}^{2d}$ such that $p_{\bar t}(w \mid y) > 0$, we obtain
$$\overleftarrow{\lambda}(t, (y, w))\,\overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \frac{(\mathrm{d}\nu/\mathrm{d}\mu^V_{\mathrm{ref}})(w \mid y)}{p_{\bar t}(w \mid y)}\,\lambda(\bar t, y)\,\mu_0P_{\bar t}(\mathrm{d}v \mid y)\,\delta_y(\mathrm{d}x).$$
This is satisfied when
$$\overleftarrow{\lambda}(t, (y, w)) = \frac{(\mathrm{d}\nu/\mathrm{d}\mu^V_{\mathrm{ref}})(w \mid y)}{p_{\bar t}(w \mid y)}\,\lambda(\bar t, y), \qquad \overleftarrow{Q}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}v \mid y)\,\delta_y(\mathrm{d}x).$$
A.6 Extension of Proposition 2 to multiple jump types

Proposition 2 considers PDMPs with one type of jump, while here we discuss the case of characteristics of the form (14), which is e.g. the case of ZZP and BPS. In this setting we can assume the backward jump rate and kernel have a similar structure, that is,
$$\overleftarrow{\lambda}(t, z) = \sum_{i=1}^{\ell} \overleftarrow{\lambda}_i(t, z), \qquad \overleftarrow{Q}(t, z, \mathrm{d}z') = \sum_{i=1}^{\ell} \frac{\overleftarrow{\lambda}_i(t, z)}{\overleftarrow{\lambda}(t, z)}\,\overleftarrow{Q}_i(t, z, \mathrm{d}z'),$$
in which case the balance condition (3) can be rewritten as
$$\mu_0P_{T_f-t}(\mathrm{d}y)\sum_{i=1}^{\ell}\overleftarrow{\lambda}_i(t, y)\,\overleftarrow{Q}_i(t, y, \mathrm{d}z) = \mu_0P_{T_f-t}(\mathrm{d}z)\sum_{i=1}^{\ell}\lambda_i(T_f - t, z)\,Q_i(T_f - t, z, \mathrm{d}y).$$
It is then enough that
$$\mu_0P_{T_f-t}(\mathrm{d}y)\,\overleftarrow{\lambda}_i(t, y)\,\overleftarrow{Q}_i(t, y, \mathrm{d}z) = \mu_0P_{T_f-t}(\mathrm{d}z)\,\lambda_i(T_f - t, z)\,Q_i(T_f - t, z, \mathrm{d}y)$$
holds for all $i \in \{1, \ldots, \ell\}$. It follows that it is sufficient to apply Proposition 2 to each pair $(\lambda_i, Q_i)$ to obtain $(\overleftarrow{\lambda}_i, \overleftarrow{Q}_i)$ such that (3) holds.

A.7 Time reversals of ZZP, BPS, and RHMC

In this section we give rigorous statements regarding the time reversals of ZZP, BPS, and RHMC. For all samplers we rely on Proposition 2 and hence we focus on verifying its assumptions. In the cases of ZZP and RHMC we assume the technical condition H4, since proving it rigorously is out of the scope of the present paper. We remark that this can be proved with techniques as in Durmus et al. [2021], which show H4 in the case of BPS.

Proposition 4 (Time reversal of ZZP). Consider a ZZP $(X_t, V_t)_{t\in[0,T_f]}$ with initial distribution $\mu \otimes \nu$, where $\nu = \mathrm{Unif}(\{-1, 1\}^d)$, and invariant distribution $\pi \otimes \nu$, where $\pi$ has potential $\psi$ satisfying H1. Assume that H4 holds and that $\int \mu(\mathrm{d}x)\,|\partial_i\psi(x)| < \infty$ for all $i = 1, \ldots, d$. Then the time reversal of the ZZP has vector field $\overleftarrow{\Phi}_{\mathrm{Z}}(x, v) = (-v, 0)^T$, and the jump rates and kernels are given, for all $(y, w) \in \mathbb{R}^{2d}$ such that $p_{T_f-t}(w \mid y) > 0$, by
$$\overleftarrow{\lambda}_i^{\mathrm{Z}}(t, (y, w)) = \frac{p_{T_f-t}(R_i^{\mathrm{Z}} w \mid y)}{p_{T_f-t}(w \mid y)}\,\lambda_i^{\mathrm{Z}}(y, R_i^{\mathrm{Z}} w), \qquad \overleftarrow{Q}_i^{\mathrm{Z}}((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R_i^{\mathrm{Z}} w)}(\mathrm{d}x, \mathrm{d}v)$$
for $i = 1, \ldots, d$.

Proof. We verify the conditions of Proposition 2 corresponding to deterministic transitions and rely on Appendix A.6 to apply the proposition to each pair $(\lambda_i^{\mathrm{Z}}, Q_i^{\mathrm{Z}})$. First notice that the vector field $\Phi(x, v) = (v, 0)^T$ is clearly locally bounded and $(t, x) \mapsto \lambda(x, v)$ is continuous since $\psi$ is continuously differentiable. Moreover, the ZZP can be shown to be non-explosive applying Durmus et al. [2021, Proposition 9]. Then, we need to verify (16). First, observe that $\mathbb{E}[\lambda_i(X_t, V_t)] \leq \mathbb{E}[|\partial_i\psi(X_t)|]$. Then
$$\mathbb{E}[|\partial_i\psi(X_t)|] = \mathbb{E}\Big[\Big|\partial_i\psi(X_0) + \int_0^1 \big\langle X_t - X_0, \nabla\partial_i\psi(X_0 + s(X_t - X_0))\big\rangle\,\mathrm{d}s\Big|\Big] \leq \mathbb{E}[|\partial_i\psi(X_0)|] + \mathbb{E}\Big[\int_0^1 \big|\big\langle X_t - X_0, \nabla^2\psi(X_0 + s(X_t - X_0))\,e_i\big\rangle\big|\,\mathrm{d}s\Big],$$
where $e_i$ is the $i$-th vector of the canonical basis. Notice that $|X_t - X_0| \leq t\sqrt{d}$. Thus we find
$$\mathbb{E}[|\partial_i\psi(X_t)|] \leq \mathbb{E}[|\partial_i\psi(X_0)|] + t\sqrt{d}\,\sup_{x\in\mathbb{R}^d}\|\nabla^2\psi(x)\|$$
and therefore
$$\int_0^{T_f} \mathbb{E}[\lambda_i(X_t, V_t)]\,\mathrm{d}t \leq T_f\,\mathbb{E}[|\partial_i\psi(X_0)|] + T_f^2\sqrt{d}\,\sup_{x\in\mathbb{R}^d}\|\nabla^2\psi(x)\|.$$
Since $\mathbb{E}_\mu[|\partial_i\psi(X)|] < \infty$ and because we are assuming H1, we obtain (16). Finally, notice that $P_t(\mathrm{d}v \mid x)$ is absolutely continuous with respect to the counting measure on $\{-1, 1\}^d$, which is clearly invariant with respect to $R_i^{\mathrm{Z}}$.

Proposition 5 (Time reversal of BPS). Consider a BPS $(X_t, V_t)_{t\in[0,T_f]}$ with initial distribution $\mu \otimes \nu$, where $\nu = \mathrm{Unif}(\mathbb{S}^{d-1})$, and invariant distribution $\pi \otimes \nu$, where $\pi$ has potential $\psi$ satisfying H1. Assume that $\mathbb{E}_\mu[\|\nabla\psi(X)\|] < \infty$. Then there exists a density $p_t(w \mid y) := \mathrm{d}(\mu_0P_t)(\mathrm{d}w \mid y)/\nu(\mathrm{d}w)$. Moreover, the time reversal of the BPS has vector field $\overleftarrow{\Phi}_{\mathrm{B}}(x, v) = (-v, 0)^T$, while the jump rates and kernels are given, for all $(t, y, w) \in [0, T_f] \times \mathbb{R}^{2d}$ such that $p_{T_f-t}(w \mid y) > 0$, by
$$\overleftarrow{\lambda}_1^{\mathrm{B}}(t, (y, w)) = \frac{p_{\bar t}(R_y^{\mathrm{B}} w \mid y)}{p_{\bar t}(w \mid y)}\,\lambda_1^{\mathrm{B}}(y, R_y^{\mathrm{B}} w), \qquad \overleftarrow{Q}_1^{\mathrm{B}}((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R_y^{\mathrm{B}} w)}(\mathrm{d}x, \mathrm{d}v),$$
$$\overleftarrow{\lambda}_2^{\mathrm{B}}(t, (y, w)) = \lambda_r\,\frac{1}{p_{\bar t}(w \mid y)}, \qquad \overleftarrow{Q}_2^{\mathrm{B}}(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = \mu_0P_{\bar t}(\mathrm{d}v \mid y)\,\delta_y(\mathrm{d}x), \qquad (19)$$
where $\bar t = T_f - t$.
Remark 3 Under the assumption that ν is the uniform distribution on the sphere, it is natural to take µV ref = ν, which gives that dν/dµV ref = 1 and hence the backward refreshment rate is as in (19). When ν is the d-dimensional Gaussian distribution, the natural choice is to let µV ref be the Lebesgue measure and hence we obtain a rate as given in (5). Proof We verify the general conditions of Proposition 2, then focusing on the deterministic jumps and the refreshments relying on Appendix A.6. The BPS was shown to be non-explosive for any initial distribution in Durmus et al. [2021, Proposition 10]. Since λ(t, (x, v)) = λ(x, v) = v, ψ(x) +, with a similar reasoning of the proof of Proposition 4 we have E[λ(Xt, Vt)] = E " Vt, ψ(X0) + Z 1 0 Vt, 2ψ(X0 + s(Xt X0))(Xt X0) ds Taking advantage of |Vt| = 1 we have |Xt X0| t and thus we find E[λ(Xt, Vt)] E | ψ(X0)| + Z 1 0 | 2ψ(X0 + s(Xt X0))(Xt X0)|ds E [| ψ(X0)|] + t sup x Rd 2ψ(x) . This is sufficient to obtain (16) since E[| ψ(X0)|] < and we assume H1. Moreover, H4 holds by Durmus et al. [2021, Proposition 23]. Finally notice that Pt(dv|x) is absolutely continuous with respect to µV ref = Unif(Sd 1), which satisfies µB ref(RB(x)v) = µB ref(v) for all x, v Rd Sd 1. All the required assumptions in Proposition 2 are thus satisfied. Proposition 6 (Time reversal of RHMC) Consider a RHMC (Xt, Vt)t [0,Tf ] with initial distribution µ ν, where ν is the d-dimensional standard normal distribution, and invariant distribution π ν, where π has potential ψ C1(Rd). Suppose that H4 holds and that for any y Rd, Pt(dw|y) is absolutely continuous with respect to Lebesgue measure, with density pt(w|y). Then the time reversal of the RHMC has vector field Φ H(x, v) = ( v, ψ(x))T , while the jump rates and kernels are given for all (y, w) R2d such that p Tf t(w|y) > 0 by λ H 2 (t, (y, w)) = λr ν(w) p Tf t(w|y), QH 2 (t, (y, w), (dx, dv)) = p Tf t(v|y)δy(dx)dv. Proof First of all, RHMC is non-explosive by Durmus et al. [2021, Proposition 8]. Then Φ is locally bounded and (16) is trivially satisfied. Finally, we can take µV ref to be the Lebesgue measure. B Density ratio matching B.1 Ratio matching with Bregman divergences We now describe a general approach to approximate ratios of densities based on the minimisation of Bregman divergences [Sugiyama et al., 2011], which as we discuss is closely connected to the loss of Hyvärinen [2007]. For a differentiable, strictly convex function f we define the Bregman divergence Bf(r, s) := f(r) f(s) f (s)(r s). Given two time-dependent probability density functions on R2d, p, q, we wish to approximate their ratio r(x, v, t) = pt(x,v)/qt(x,v) for t [0, Tf] with a parametric function sθ : Rd Rd [0, Tf] R+ by solving the minimisation problem 0 ω(t) E h Bf(r(Xt, Vt, t), sθ(Xt, Vt, t)) i dt, where the expectation is with respect to the joint density qt(x, v), that is (Xt, Vt) qt, while ω is a probability density function for the time variable. Well studied choices of the function f include e.g. f(r) = r log r r, that is related to a KL divergence, or f(r) = (r 1)2, related to the square loss, or f(r) = r log r (1 + r) log(1 + r), which corresponding to solving a logistic regression task. Ignoring terms that do not depend on θ we can rewrite the minimisation as 0 ω(t) Ept f (sθ(Xt, Vt, t))sθ(Xt, Vt, t) f(sθ(Xt, Vt, t)) Eqt f (sθ(Xt, Vt, t)) dt. Notably this is independent of the true density ratio and thus it is a formulation with similar spirit to implicit score matching. 
Naturally, in practice the loss can be approximated empirically with a Monte Carlo average. B.2 Details and proofs regarding Hyvärinen s ratio matching B.2.1 Connection to Bregman divergences In the next statement, we show that the loss ℓI defined in (6), or equivalently its explicit counterpart ℓE (see Proposition 3), can be put in the framework of Bregman divergences. Corollary 1 Recall G(r) = (1 + r) 1 and let f(r) = (r 1)2/2. The task minθ ℓE(θ) is equivalent to i=1 Ept h Bf(G(sθ i (Xt, Vt, t)), G(ri(Xt, Vt, t))) + Bf(G(sθ i (Xt, RZ i Vt, t)), G(ri(Xt, RZ i Vt, t))) i Proof The result follows by straightforward computations. B.2.2 Proof of Proposition 3 The proof follows the same lines as Hyvärinen [2007, Theorem 1]. We find ℓE(θ) = C + Z Tf i=1 Ept h G2(sθ i (Xt, Vt, t)) + G2(sθ i (Xt, RZ i Vt, t)) 2G(ri(Xt, Vt, t))G(sθ i (Xt, Vt, t)) 2G(sθ i (Xt, RZ i Vt, t))G(ri(Xt, RZ i Vt, t)) i dt, Algorithm 1: Pseudo-code for the simulation of a homogeneous PDMP Input : Time horizon T, initial condition (x, v). Set t = 0, (X0, V0) = (x, v); while t < T do draw E Exp(1); compute the next event time, that is τ R+ such that R τ 0 λ(φu(Xt, Vt))du = E ; if t + τ > T then set ZT = φT t(Zt); else simulate Zt+τ Q(φs(Zt), ); end set t = t + τ; end where C is a constant independent of θ. Then plugging in the expression of G we can rewrite the last term as Ept h G(sθ i (Xt, RZ i Vt, t))G(ri(Xt, RZ i Vt, t)) i v { 1}d pt(x, v)G(sθ i (x, RZ i v, t)) pt(x, RZ i v) pt(x, v) + pt(x, RZ i v)dx = Ept h G(sθ i (Xt, Vt, t)) pt(Xt, RZ i Vt) pt(Xt, Vt) + pt(Xt, RZ i Vt) Therefore we find ℓE(θ) = C + Z Tf G2(sθ i (Xt, Vt, t)) + G2(sθ i (Xt, RZ i Vt, t)) 2G(sθ i (Xt, Vt, t)) pt(Xt, Vt) pt(Xt, Vt) + pt(Xt, RZ i Vt) 2G(sθ i (Xt, Vt, t)) pt(Xt, RZ i Vt) pt(Xt, Vt) + pt(Xt, RZ i Vt) i=1 Ept h G2(sθ i (Xt, Vt, t)) + G2(sθ i (Xt, RZ i Vt, t)) 2G(sθ i (Xt, Vt, t)) i dt. C Simulation of forward and backward PDMPs C.1 Simulation of the forward PDMPs The forward PDMPs that we consider can all be simulated exactly by solving the integral (13), at least when the limiting distribution is the multivariate standard normal. This is possible because for each process we can easily simulate the random event times as well as their deterministic dynamics. The general procedure for the simulation of a time-homogeneous PDMP up to a fixed time horizon T is given in Algorithm 1. In the remainder of the section we give additional details on the simulation of each process. RHMC: The case of RHMC is trivial, since the random events have the exponential distribution with constant parameter λr and the deterministic dynamics are given by xt = x0 cos(t) + v0 sin(t) and vt = x0 sin(t) + v0 cos(t), where (x0, v0) is the initial condition. Hence all the steps in Algorithm 1 can be performed and the state (XT, VT) can be easily obtained. ZZP: Notice the event rates of ZZP are of the form λi(x, v) = (vixi)+ when the stationary distribution is the standard normal. In this case we find that each coordinate of the ZZP is independent, that is the evolution of ((Xt)i, (Vt)i) is not affected by ((Xt)j, (Vt)j) for i, j = 1, . . . , d with i = j. Therefore we can simulate each coordinate of the ZZP in parallel following the procedure in Algorithm 1. Let us illustrate how one can obtain the next event time τ of the i-th coordinate when the process is at (xi, vi). We have that τ solves R τ 0 (vixi + u)+du E = 0 for E Exp(1). This gives the following quadratic equation for τ: 2 + vixiτ vixi( vixi)+ 1 2( vixi)2 + E = 0, which has solution τ = vixi + q (vixi)2 + + 2E. 
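The per-coordinate recipe above can be turned into a few lines of code. The following is a minimal sketch (not the implementation used in the paper), assuming the standard Gaussian target so that the rates are (v_i x_i)_+, and using the event-time formula just derived; the function names are hypothetical.

```python
import numpy as np

def zzp_coordinate(x0, v0, T, rng):
    """Exact simulation of one ZZP coordinate up to time T, assuming the
    standard Gaussian target so the rate along the flow is (v*x + u)_+.
    Returns (x_T, v_T)."""
    t, x, v = 0.0, x0, v0
    while True:
        E = rng.exponential(1.0)
        # Event time solves \int_0^tau (v*x + u)_+ du = E (quadratic equation above).
        tau = -v * x + np.sqrt(max(v * x, 0.0) ** 2 + 2.0 * E)
        if t + tau > T:
            return x + v * (T - t), v     # drift until the horizon, no further event
        x += v * tau                       # drift until the event ...
        v = -v                             # ... then flip the velocity
        t += tau

def zzp_forward(x0, T, rng=None):
    """Runs d independent ZZP coordinates from x0 with velocities drawn from {-1,+1}."""
    rng = np.random.default_rng() if rng is None else rng
    v0 = rng.choice([-1.0, 1.0], size=len(x0))
    out = [zzp_coordinate(xi, vi, T, rng) for xi, vi in zip(x0, v0)]
    x_T, v_T = map(np.array, zip(*out))
    return x_T, v_T
```

In the training loops described below, each pair (X^n_{τ^n}, V^n_{τ^n}) can then be obtained with a single call of this kind with horizon T = τ^n.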
BPS: In the case of BPS one has to simulate a proposal for both the next reflection event, τb and for the next refreshment event, τr, then accepting the smallest of the two as event time. Obtaining the proposal for the next refreshment time is straightforward since the rate is constant. The proposal for the following reflection time can be obtained similarly to the case of ZZP, but noticing that in this case the event rate is λ(x, v) = v, x +. Then τb solves R τb 0 ( v, x + |v|2u)+du E = 0 for E Exp(1). Noticing that it must be τb > v,x /|v|2, this gives the following quadratic equation for τb: |v|2 2 τ 2 b + v, x τb + 1 2( v, x )2 + E = 0, which has solution τb = v, x + q v, x 2 + + 2|v|2E C.2 Simulation of time reversed PDMPs with splitting schemes Here we discuss the splitting schemes we use to discretise the backward PDMPs. For further details on this class of approximations we refer the reader to Bertazzi et al. [2023]. RHMC: We have already discussed the splitting scheme DJD for RHMC in Section 2.4, and we give the pseudo-code in Algorithm 2. Algorithm 2: Splitting scheme DJD for the time reversed RHMC Initialise either from (X0, V 0) π ν or (X0, V 0) π(dx)pθ (dv|x, Tf); for n = 0, . . . , N 1 do ( e X, e V ) = φH δn+1/2(Xtn, V tn) ; et = Tf tn δn+1 2 ; Estimate ratio: s = ν(e V )/pθ ( e V | e X, et ) ; Draw proposal τn+1 Exp(sλr) ; if τn+1 δn+1 then Draw e V pθ ( | e X, et ); end (Xtn+1, V tn+1) = φH δn+1/2( e X, e V ); end ZZP: For ZZP we apply the splitting scheme DJD discussed in Section 2.4, with the only difference that we allow multiple velocity flips during the jump step similarly to Bertazzi et al. [2023]. Algorithm 3 gives a pseudo-code. BPS: In the case of BPS, we follow the recommendations of Bertazzi et al. [2023] and adapt their splitting scheme RDBDR, where R stands for refreshments, D for deterministic motion, and B for bounces. We give a pseudo-code in Algorithm 4. We remark that an alternative is to use the scheme DJD for BPS, simulating reflections and refreshments in the J part of the splitting. This choice has the advantage of reducing the number of model evaluations. Algorithm 3: Splitting scheme DJD for the time reversed ZZP Initialise (X0, V 0) π ν; for n = 0, . . . , N 1 do e X = Xtn δn+1 2 V tn ; e V = V tn ; et = Tf tn δn+1 2 ; Estimate density ratios: sθ ( e X, e V , et ) ; for i = 1 . . . , d do With probability (1 exp( δn+1 sθ i ( e X, e V , et ) λi( e X, RZ i e V ))) set e V = RZ i e V ; end Xtn+1 = e X δn+1 2 e V ; V tn+1 = e V ; end Algorithm 4: Splitting scheme RDBDR for the time reversed BPS Initialise either from (X0, V 0) π ν or (X0, V 0) π(dx)pθ ( |x, Tf) ; for n = 0, . . . , N 1 do e V = V tn ; Estimate density ratio: s2 = ν(e V )/pθ ( e V | Xtn, Tf tn) ; With probability (1 exp( λrs2 δn+1 2 )) draw e V pθ ( |Xtn, Tf tn) ; e X = Xtn δn+1 2 V tn ; et = Tf tn δn+1 2 ; Estimate density ratio: s1 = pθ (RB f X e V | e X, et )/pθ ( e V | e X, et ) ; With probability (1 exp( δn+1s1λ1( e X, RB e X e V ))) set e V = RB e X e V ; Xtn+1 = e X δn+1 2 e V ; Estimate density ratio: s2 = ν(e V )/pθ ( e V | Xtn+1, Tf tn+1) ; With probability (1 exp( λrs2 δn+1 2 )) draw e V pθ ( |Xtn, Tf tn+1) ; V tn+1 = e V ; end D Training the generative models In this section, we present the algorithmic procedures used to train our generative models, and outline computational differences with the popular framework of denoising diffusion models. In Table 3 we list the backward rates and kernels of the time reversal associated with each forward PDMP introduced in Section 2.1. 
In Appendix D.1 we give the procedure used for ZZP, while in Appendix D.2 we focus on RHMC and BPS, which can be trained with the same approach. In Appendix D.3 we compare the training phase of PDMP-based and diffusion-based generative models.

D.1 Fitting the ZZP-based generative model

In the case of ZZP, we only need to approximate the jump rates $\overleftarrow{\lambda}^Z_i$ of the backward process, since we have access to the true backward kernel. To this end, we introduce a class of functions $\{s^\theta : \mathbb{R}^d \times \{-1, 1\}^d \times [0, T_f] \to \mathbb{R}^d_+ : \theta \in \Theta\}$ for some parameter set $\Theta \subset \mathbb{R}^{d_\theta}$ such that for any $i \in \{1, \dots, d\}$, the $i$-th component of $s^\theta(y, w, t)$, denoted by $s^\theta_i(y, w, t)$, is an approximation of
$$r^Z_i(y, w, t) = \frac{p_{T_f - t}(R^Z_i w \,|\, y)}{p_{T_f - t}(w \,|\, y)}.$$

  ZZP (for $i = 1, \dots, d$):
    backward rates    $\overleftarrow{\lambda}^Z_i(t, (y, w)) = \dfrac{p_{T_f - t}(R^Z_i w \,|\, y)}{p_{T_f - t}(w \,|\, y)}\, \lambda^Z_i(y, R^Z_i w)$
    backward kernels  $\overleftarrow{Q}^Z_i((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R^Z_i w)}(\mathrm{d}x, \mathrm{d}v)$
  RHMC:
    backward rate     $\overleftarrow{\lambda}^H(t, (y, w)) = \lambda_r\, \dfrac{\nu(w)}{p_{T_f - t}(w \,|\, y)}$
    backward kernel   $\overleftarrow{Q}^H(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = p_{T_f - t}(v \,|\, y)\, \delta_y(\mathrm{d}x)\, \mathrm{d}v$
  BPS (reflections):
    backward rate     $\overleftarrow{\lambda}^B_1(t, (y, w)) = \dfrac{p_{T_f - t}(R^B_y w \,|\, y)}{p_{T_f - t}(w \,|\, y)}\, \lambda^B_1(y, R^B_y w)$
    backward kernel   $\overleftarrow{Q}^B_1((y, w), (\mathrm{d}x, \mathrm{d}v)) = \delta_{(y, R^B_y w)}(\mathrm{d}x, \mathrm{d}v)$
  BPS (refreshments):
    backward rate     $\overleftarrow{\lambda}^B_2(t, (y, w)) = \lambda_r\, \dfrac{\nu(w)}{p_{T_f - t}(w \,|\, y)}$
    backward kernel   $\overleftarrow{Q}^B_2(t, (y, w), (\mathrm{d}x, \mathrm{d}v)) = p_{T_f - t}(v \,|\, y)\, \delta_y(\mathrm{d}x)\, \mathrm{d}v$
Table 3: Time reversal characteristics of ZZP, RHMC, and BPS.

We learn $s^\theta$ by minimising the empirical counterpart of the implicit ratio matching loss $\ell_I$ given in (7). We define this empirical loss $\hat{\ell}_I$ as the function $\hat{\ell}_I : \theta \mapsto \hat{L}_I(s^\theta, (X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)_{n=1}^N)$ for
$$\hat{L}_I\big(s^\theta, (X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)_{n=1}^N\big) = \frac{1}{N} \sum_{n=1}^N \sum_{i=1}^d \Big[ G^2\big(s^\theta_i(X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)\big) + G^2\big(s^\theta_i(X^n_{\tau^n}, R^Z_i V^n_{\tau^n}, \tau^n)\big) - 2\, G\big(s^\theta_i(X^n_{\tau^n}, V^n_{\tau^n}, \tau^n)\big) \Big], \tag{20}$$
where $\{\tau^n\}_{n=1}^N$ are i.i.d. samples from $\omega$, independent of $\{(X^n_t, V^n_t)_{t \geq 0}\}_{n=1}^N$, which are $N$ i.i.d. realisations of the ZZP respectively starting at the $n$-th training data point $X^n_0$ with velocity $V^n_0$, where $\{V^n_0\}_{n=1}^N$ are i.i.d. observations of $\mathrm{Unif}(\{-1, 1\}^d)$. Algorithm 5 shows the pseudo-code for our training algorithm. We remark that the simulation of the forward ZZP follows the guidelines explained in Appendix C.1.

Algorithm 5: Training loop for ZZP-based generative models
  Input: Time distribution ω on [0, Tf], model sθ
  while θ has not converged do
    Get random data batch {X^n_0}_{n=1}^B;
    Sample {τ^n}_{n=1}^B ∼ ω^{⊗B};
    Sample {V^n_0}_{n=1}^B ∼ Unif({−1, 1}^d)^{⊗B};
    for n = 1 to B do
      Simulate (X^n_{τ^n}, V^n_{τ^n}) by running a ZZP from (X^n_0, V^n_0);
    end
    Compute loss ˆℓI(θ) ← ˆLI(sθ, (X^n_{τ^n}, V^n_{τ^n}, τ^n)_{n=1}^B);
    Perform optimisation step on ˆℓI(θ);
  end
  Output: trained model sθ

A computationally efficient model for the ZZP. The computation of $\hat{L}_I$ in each iteration of Algorithm 5 requires evaluating the model $s^\theta$ $d + 1$ times for each data point. Indeed, the loss function (7) requires evaluating $s^\theta$ for each component flip of the velocity vector. Hence the computational cost of the algorithm scales linearly in the data dimension $d$. In order to overcome this computational burden we build an alternative model leveraging the following observation:
$$p_t(v_i \,|\, x, v_{-i}) \approx p_t(v_i \,|\, x), \tag{21}$$
where $v_{-i}$ denotes the vector containing all components of $v$ other than the $i$-th. The approximation in (21) is motivated by the fact that we expect the components of the velocity to be nearly independent conditional on the position vector. This is because the position of the process holds most of the information regarding the velocity of each coordinate.
Under (21) we find that a good approximation for the ratio r Z i (x, v, t) is given by r Z i (x, vi, t) := pt( vi|x) Hence it is reasonable to estimate the simplified ratios r Z i rather than the true ratios r Z i (x, v, t). In particular this can be achieved with a model sθ that requires the current position and time, and only the i-th velocity component rather than the full vector v. Following the reasoning above we build an alternative model which takes as input the position vector and the time variable and outputs a 2d-dimensional vector, that is {sθ : Rd [0, Tf] R2d + : θ Θ}. We introduce the following notation for the output of the neural network sθ: sθ : (x, t) 7 sθ +(x, t), sθ (x, t) R2d + , (22) where sθ + and sθ are both vectors in Rd and denote the two, d-dimensional blocks in the output of sθ. The model will be trained in such a way that sθ +,i(x, t), that is the i-th component of the vector sθ +(x, t), approximates r Z i (x, +1, t), i.e. in the case vi = +1. Similarly, we estimate r Z i (x, 1, t) with sθ ,i(x, t). Let us now describe how we can train this model in such a way that the computational cost remains constant in the dimensionality of the data. For any w {1, 1}d, we introduce the projection operator Πi w defined for any w { 1}d, i {1, . . . , d}, y Rd, t [0, Tf] as Πi wsθ(y, t) = sθ sign(wi),i(y, t), (23) where we have sign(wi) = + when wi = +1 and sign(wi) = when wi = 1. In words, Πi wsθ(y, t) selects either sθ + or sθ in (22) based on sign(wi) and returns the estimate of the i-th ratio at (y, t) corresponding to the velocity wi. Using the operator Πi w we can re-formulate the loss (7) as ℓI(θ) = Z Tf i=1 E h G2(Πi Vtsθ(Xt, t)) + G2(Πi Vtsθ(Xt, t)) 2G(Πi Vtsθ(Xt, t)) i . The associated empirical loss is ˆℓI : θ 7 ˆLSimple I (sθ, (Xn τ n, V n τ n, τ n)N n=1) were ˆLSimple I (sθ, (xn, vn, tn)N n=1) = G2(Πi vnsθ(xn, tn)) + G2(Πi vnsθ(xn, tn)) 2G(Πi vnsθ(xn, tn)) , (24) where {τ n}N n=1 and {(Xn t , V n t )t 0}N n=1 are obtained as in (20). This simplified model can then be trained using the same procedure shown in Algorithm 5, but where the loss above is used instead of the loss (20). The computational cost for the training of this model is clearly constant in the data dimension d, since only a single evaluation of the model sθ is needed for each data-point. D.2 Fitting the RHMC and BPS-based generative models In the case of RHMC and BPS both the backward rate and backward jump kernel are characterised by pt(v|x). We introduce a parametric family of conditional probability distributions {pθ : θ Θ} of the form (x, v, t) 7 pθ(v|x, t), where Θ Rdθ, which we model with normalising flows (NFs) [Papamakarios et al., 2021], as it permits both obtaining an estimate of the density, at a given state and time, and also to generate samples according to this distribution. To train our model, we rely on the maximum likelihood population loss method, as derived in (8), and use its empirical counterpart (9). The final BPS and RHMC training algorithm is given in Algorithm 6. The simulation of the forward BPS and RHMC follows the methods listed in Appendix C.1. 
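To make the maximum-likelihood step of Algorithm 6 concrete, the following is a minimal sketch assuming the neural spline flow interface of the zuko package (see Appendix F.1); simulate_forward is a hypothetical placeholder for a forward RHMC/BPS simulator following Appendix C.1, the conditioning is simplified, and this is not the authors' implementation.

```python
import torch
import zuko

d = 2                                    # data dimension (e.g. the 2D toy datasets)
# Conditional model p_theta(v | x, t): a neural spline flow conditioned on (x, t).
flow = zuko.flows.NSF(features=d, context=d + 1, transforms=8, hidden_features=[256] * 8)
optim = torch.optim.Adam(flow.parameters(), lr=5e-4)

def train_step(x0, Tf, simulate_forward):
    """One optimisation step in the spirit of Algorithm 6.
    simulate_forward(x0, v0, tau) returns (X_tau, V_tau) for the forward RHMC/BPS."""
    B = x0.shape[0]
    tau = torch.rand(B) * Tf                        # tau^n ~ omega = Uniform([0, Tf])
    v0 = torch.randn(B, d)                          # V^n_0 ~ N(0, I_d)
    x_tau, v_tau = simulate_forward(x0, v0, tau)
    context = torch.cat([x_tau, tau[:, None]], dim=1)
    # Empirical negative log-likelihood of V_tau given (X_tau, tau), as in (9).
    loss = -flow(context).log_prob(v_tau).mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

At sampling time the same object provides both ingredients needed by the backward process: flow(context).log_prob evaluates the backward rates, and flow(context).sample draws from the backward refreshment kernel.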
D.3 Computational comparison with diffusion models We provide a short comparison of the computational complexity of our PDMP generative methods with diffusion models, as each generative method admits the same design: a fixed forward process {Xt}0 t Tf ran from time 0 to Tf, with XTf approximately distributed as a standard Gaussian N(0, Id) (using the Variance-Preserving process in the case of diffusion models [Song et al., 2021]), a corresponding backward process { X t}0 t Tf , and a generative process { X θ t }0 t Tf being an approximation to the true backward process, initialized from X θ 0 N(0, Id). Algorithm 6: Training loop for RHMC and BPS based generative models Input: Time distribution ω on [0, Tf], model pθ while θ has not converged do Get random data batch {Xn 0 }B n=1; Sample {τ n}B n=1 ω B; Sample {V n 0 }B n=1 N(0, Id) B; for n = 1 to B do Simulate (Xn τ n, V n τ n) running a RHMC/BPS from (Xn 0 , V n 0 ); B PB n=1 log pθ(V n τ n|Xτ n, τ n); Perform optimisation step on Lθ; Output: trained model pθ Using conventional notations for diffusion models [Song et al., 2021] we denote by ( αt)0 t Tf the variance-preserving noise schedule, such that the distribution of the forward process at any time t is given by Xt d= αt X0 + 1 αt Gt , (25) where Gt N(0, Id). The generative process { X θ t}0 t Tf is typically defined by a denoiser neural network ϵθ, θ Rdθ, trained with the denoiser loss ℓθ diffusion = E ϵθ(Xt, t) Xt αt X0 1 αt where t ω is the time parameter. We display the training algorithm for diffusion models in Algorithm 8. For the sake of comparison we give in Algorithm 7 the general procedure used for PDMP-based generative models. Algorithm 7: Training loop for PDMP-based generative models Input: Time distribution ω on [0, Tf], model sθ while θ has not converged do Get random data batch {Xn 0 }B n=1; Sample {τ n}B n=1 ω B; Sample {V n 0 }B n=1 Unif({ 1}d) B; for n = 1 to B do Simulate (Xn τ n, V n τ n) by running PDMP from (Xn 0 , V n 0 ); Compute loss Lθ ; Perform optimization step on Lθ; Output: trained model sθ Algorithm 8: Training loop for diffusionbased generative models Input: Time distribution ω on [0, Tf], noise schedule { αt}Tf t=0, model ϵθ while θ has not converged do Get random data batch {Xn 0 }B n=1; Sample {τ n}B n=1 ω B; Sample {ϵn}B n=1 N(0, Id) B; for n = 1 to B do Compute Xn τ n = ατ n Xn 0 + 1 ατ n ϵn; Compute loss Lθ 1 B PB n=1 ϵθ(Xn τ n, τ n) ϵn 2; Perform optimization step on Lθ; Output: trained model ϵθ Computational differences in training the models We compare the training algorithms: The simulation of the forward process {Xt}0 t Tf is more efficient in the case of diffusion models, since the distribution of Xt given X0 is explicitly characterizable as given in (25), whereas in the case of PDMPs the complexity scales with the expected number of random jumps in [0, Tf]. Computing the loss function has the same computational cost (number of evaluations of the network relative to the dimension of the data) for PDMP and diffusion models (assuming we are using the simplified model for ZZP, as introduced in Appendix D.1). Computational differences in the generative processes Since we are using the splitting scheme for our PDMPs, simulating the backward processes admits a computational complexity growing linearly with the chosen number of backward steps N, alike diffusion models [Song et al., 2021]. The latter models require only a single network inference per backward step. 
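To make the comparison concrete, the training step of Algorithm 8 can be summarised in a few lines, since (25) gives the forward marginal in closed form. This is a hedged sketch (the noise schedule alpha_bar and the denoiser eps_model are placeholders), not the i-DDPM implementation used in our experiments.

```python
import torch

def diffusion_train_step(x0, eps_model, alpha_bar, optim, Tf):
    """One step of Algorithm 8: sample t ~ omega, form X_t via (25), regress the noise."""
    B = x0.shape[0]
    t = torch.rand(B) * Tf                            # t^n ~ omega
    eps = torch.randn_like(x0)                        # eps^n ~ N(0, I_d)
    a = alpha_bar(t).reshape(B, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps      # X_t as in (25)
    loss = ((eps_model(x_t, t) - eps) ** 2).mean()    # denoiser loss
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```

The closed-form draw of X_t is precisely the step that the PDMP training loop in Algorithm 7 replaces by simulating a forward trajectory up to time τ.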
However, as measured in Table 2, each backward step is costlier in the case of our PDMP samplers. We provide an explanation for each sampler: ZZP Using the simplified model introduced in Appendix D.1 and Algorithm 3, only one network inference is required per backward step, to approximate the backward rate. However, this approach requires a slight modification to the neural network architecture, involving a doubled channel output and adjustments for the element selection mechanisms (projection operator (23)), resulting in a higher cost per step than diffusion models. RHMC Using Algorithm 2, one likelihood computation from the normalizing flow is needed, to approximate the backward rate. Additionally, at most one random variable must be drawn from the normalizing flow, which approximates the backward kernel. BPS Using Algorithm 4, three likelihood computations are required from the normalizing flow, to approximate backward rates. Moreover, at most two random variables need to be sampled from the normalizing flow, which approximates the backward kernel. The final cost per step depends on the chosen normalizing flow architecture, but is higher for BPS than for all other samplers. It should be noted that, in our experiments, each generative PDMP model requires less backward steps than the diffusion model, leading to a smaller overall computational time for equal generation quality, as showed in Figure 3. E Discussion and proof for Theorem 1 E.1 Discussion on H2 In this section we discuss H2 in the case of ZZP, BPS, and RHMC. For all three of these samplers, existing theory shows convergence of the form δ(x,v)Pt π ν V C e γt V (x, v), (26) where V : R2d [1, ) is a positive function and µ V := sup|g| V |µ(g)| is the V -norm. When the initial condition of the process is µ ν, we obtain the bound µ νPt π ν V C e γtµ ν(V ), which translates to a bound in TV distance, since we assume V 1. Conditions on π ensuring (26) can be found for ZZP in Bierkens et al. [2019b], for BPS in Deligiannidis et al. [2019], Durmus et al. [2020], and for RHMC in Bou-Rabee and Sanz-Serna [2017]. Observe that we can set the constant C in H2 to C = C µ ν(V ). Clearly, C is finite whenever µ ν(V ) < . Since V is such that lim|z| V (z) = + , showing C is finite requires suitable tail conditions on the initial distribution µ ν. Inspecting the results of the papers mentioned above, one can verify that µ ν(V ) < when π is a multivariate standard Gaussian distribution as long as: (i) the tails of µ are at least as light as those of π for ZZP and BPS, (ii) µ has finite second moments for RHMC. E.2 Proof of Theorem 1 First notice that µ L(XTf ) TV µ ν L(XTf , V Tf ) TV , (27) hence we focus on bounding the right hand side. Under our assumptions, the forward PDMP (Xt, Vt)t [0,Tf ] admits a time reversal that is a PDMP ( X t, V t)t [0,Tf ] with characteristics ( Φ, λ , Q) satisfying the conditions in Proposition 1. Therefore, it holds µ ν = L( X Tf , V Tf ) and so (27) can be written as µ L(XTf ) TV L( X Tf , V Tf ) L(XTf , V Tf ) TV , We introduce the intermediate PDMP ( e Xt, e Vt)t [0,Tf ] with initial distribution L(XTf , VTf ) and characteristics ( Φ, λ, Q). In particular, ( e Xt, e Vt)t [0,Tf ] has the same characteristics as (Xt, V t)t [0,Tf ], but different initial condition By the triangle inequality for the TV distance we find µ L(XTf ) TV L( X Tf , V Tf ) L( e XTf , e VTf ) TV + L( e XT , e VTf ) L(XTf , V Tf ) TV. 
Applying the data processing inequality to the second term, we find the bound µ L(XTf ) TV L( X Tf , V Tf ) L( e XTf , e VTf ) TV + L(XTf , VTf ) π ν TV. (28) The second term in (28) can be bounded applying H2, hence it is left to bound the first term. We introduce the Markov semigroups Pt, P t, P t : R+ R2d B(R2d) [0, 1] defined respectively as Pt((x, v), ) := P(x,v)((Xt, Vt) ), P t((x, v), ) := P(x,v)(( X t, V t) ), and e Pt((x, v), ) := P(x,v)(( e Xt, e Vt) ). Recall that for any probability distribution η on (R2d, B(R2d)), ηPt( ) = R R2d η(dx, dv)Pt((x, v), ), and similarly for η e Pt( ) and η P t( ). Finally, to ease the notation we denote QTf := L(XTf , VTf ) = (µ ν)PTf . Then we can rewrite the first term in (28) as L( X Tf , V Tf ) L( e XTf , e VTf ) TV = QTf P Tf QTf e PTf TV Z QTf (dx, dv) δ(x,v) P Tf δ(x,v) e PTf TV. (29) Therefore we wish to bound δ(x,v) P Tf δ(x,v) e PTf TV. A bound for the TV distance between two PDMPs with same initial condition and deterministic motion, but different jump rate and kernel was obtained in Durmus et al. [2021, Theorem 11] using the coupling inequality δ(x,v) P Tf δ(x,v) e PTf TV 2P(x,v) ( X Tf , V Tf ) = ( e XTf , e VTf ) , and then bounding the right hand side. Following the proof of Durmus et al. [2021, Theorem 11] we have that a synchronous coupling of the two PDMPs satisfies P(x,v) ( X Tf , V Tf ) = ( e XTf , e VTf ) 2E(x,v) 0 gt( X t, V t)dt gt(x, v) = 1 λ (t, (x, v)) λ(t, (x, v)) Q(t, (x, v), ) Q(t, (x, v), ) TV + λ (t, (x, v)) λ(t, (x, v)) . Since L( X t, V t) = L(XTf t, VTf t) for t [0, Tf], we can rewrite this bound as P(x,v) ( X Tf , V Tf ) = ( e XTf , e VTf ) 2E(x,v) 0 g Tf t(Xt, Vt)dt Plugging this bound in (29) we obtain L( X Tf , V Tf ) L( e XTf , e VTf ) TV 2E 0 g Tf t(Xt, Vt)dt This concludes the proof. E.3 Application to the ZZP Here we give the details on the bound (12), which considers the case of ZZP. First, we upper bound the function gt defined in (11). We focus on the first term in (11), that is g1 t (x, v) = ( λ Z λZ )(t, (x, v)) 2 QZ(t, (x, v), ) QZ(t, (x, v), ) TV. QZ(t, (x, v), ) QZ(t, (x, v), ) TV = sup A i=1 1(x,RZ i v) A λ Z i (t, (x, v)) λ Z(t, (x, v)) λZ i (t, (x, v)) λZ(t, (x, v)) i=1 1(x,RZ i v) A λ Z i (t, (x, v)) λZ i (t, (x, v)) λ Z(t, (x, v)) + λZ i (t, (x, v)) λ Z(t, (x, v)) λZ i (t, (x, v)) λZ(t, (x, v)) | λ Z i (t, (x, v)) λZ i (t, (x, v))| λ Z(t, (x, v)) + | λZ(t, (x, v)) λ Z(t, (x, v))| λ Z(t, (x, v)) In the last inequality we used that λZ is non-negative. Therefore we find g1 t (x, v) 1 i=1 | λ Z i (t, (x, v)) λZ i (t, (x, v))| 2| λZ(t, (x, v)) λ Z(t, (x, v))| i=1 | λ Z i (t, (x, v)) λZ i (t, (x, v))|. Noticing that | λ Z i (t, (x, v)) λZ i (t, (x, v))| = |r Z i (x, v, t) sθ i (x, v, t)| λZ i ((x, RZ i v). i=1 |r Z i (x, v, t) sθ i (x, v, t)| λZ i ((x, RZ i v) Finally, we use the inequality 1 e z z, which holds for z 0, to conclude that 0 g Tf t(Xt, Vt)dt 0 |r Z i (Xt, Vt, Tf t) sθ i (Xt, Vt, Tf t)| λZ i (Xt, RZ i Vt)dt F Experimental details We run our experiments on 50 Cascade Lake Intel Xeon 5218 16 cores, 2.4GHz. Each experiment is ran on a single CPU and takes between 1 and 5 hours to complete, depending on the dataset and the sampler at hand. For each forward PDMP, we take a time horizon Tf equal to 5, and set the refreshment rate λr to 1. For training, we choose the uniform distribution Uniform([0, Tf]) as the time distribution ω. 
For the simulation of backward PDMPs with splitting schemes, we use a quadratic schedule for the time steps, that is $(\delta_n)_{n \in \{1, \dots, N\}}$ given by $\delta_n = T_f \big( (n/N)^2 - ((n-1)/N)^2 \big)$. For i-DDPM, we follow the design choices introduced in Nichol and Dhariwal [2021]; in particular we use the variance preserving (VP) process, the cosine noise schedule, and linear time steps.

F.1 Continuation of Section 4

2D datasets. In our experiments we consider the five datasets displayed in Figure 5. The Gaussian grid consists of a mixture of nine Gaussian distributions with imbalanced mixture weights {.01, .02, .02, .05, .05, .1, .1, .15, .2, .3}. We load 100000 training samples for each dataset, and use 10000 test samples to compute the evaluation metrics. We use a batch size of 4096 and train our models for 25000 steps.

refreshment rate    0.0     0.1     0.5     1.0     2.0     5.0     10.0
BPS                 0.286   0.069   0.048   0.052   0.045   0.048   0.072
RHMC                0.324   0.040   0.041   0.033   0.040   0.047   0.072
ZZP                 0.040   0.045   0.040   0.036   0.038   0.035   0.057
Table 4: Mean 2-Wasserstein distance W2 on the Gaussian grid dataset, averaged over 10 runs.

time horizon Tf     2       5       10      15
BPS                 0.031   0.040   0.040   0.041
RHMC                0.025   0.036   0.036   0.033
ZZP                 0.031   0.033   0.030   0.046
Table 5: Mean 2-Wasserstein distance W2 for different time horizons, averaged over 10 runs.

Detailed setup. For ZZP and i-DDPM we use a neural network consisting of eight time-conditioned multi-layer perceptron (MLP) blocks with skip connections, each consisting of two fully connected layers of width 256. The time variable t passes through two fully connected layers of size 1 x 32 and 32 x 32 and is fed to each time-conditioned block, where it passes through an additional 32 x 64 fully connected layer before being added element-wise to the middle layer. The model size is 6.5 million parameters. For ZZP, we apply the softplus activation function $x \mapsto \beta^{-1} \log(1 + \exp(\beta x))$ to the output of the network, with β = 1, to constrain it to be positive and to stabilise behaviour for outputs close to 1. In the case of RHMC and BPS, we use neural spline flows [Durkan et al., 2019] to model the conditional densities of the forward processes, as they show good performance among available architectures. We leverage the implementation from the zuko package [Rozet et al., 2022]. We set the number of transforms to 8, the hidden depth of the network to 8, and the hidden width to 256. To condition on x and t, we feed them to three fully connected layers of size d x 8, 8 x 8 and 8 x 8, where d is either the dimension of Xt, or d = 1 in the case of the time variable. The resulting vectors are then concatenated and fed to the conditioning mechanism of zuko. The resulting model has 3.8 million parameters. We take advantage of the approaches described in Section 2.3 to learn the characteristics of the backward processes. All experiments are conducted using PyTorch [Paszke et al., 2019]. The optimiser is Adam [Kingma and Ba, 2015] with learning rate 5e-4 for all neural networks.

Additional results. In Table 4 we show the accuracy as a function of the refreshment rate, while in Table 5 we consider different choices of the time horizon. In both cases, we use the Gaussian mixture data and the 2-Wasserstein metric to characterise the quality of the generated data. Figure 5 shows the data generated by the best model for each process.

F.2 MNIST digits

We consider the task of generating handwritten digits by training the ZZP on the MNIST dataset. We use the simplified model and the simplified loss function given in Equation (24) in Appendix D.1.
In Figure 6 we show promising results obtained with the design choices described previously in Appendix F.1, apart from the following differences. The optimiser is Adam [Kingma and Ba, 2015] with learning rate 2e-4. We use a U-Net following the implementation of Nichol and Dhariwal [2021], where the hidden layers are set to [128, 256, 256, 256], we fix the number of residual blocks to 2 at each level, and we add a self-attention block at resolution 16 x 16 with 4 heads. We duplicate the channels of the penultimate layer and make each copy go through separate MLPs to obtain the two vectors $(s^\theta_+(\cdot, \cdot), s^\theta_-(\cdot, \cdot)) \in \mathbb{R}^{2d}_+$ (as in (22)). We use an exponential moving average with a rate of 0.99. At every layer we use the SiLU activation function, while we apply the softplus to the output of the network, with β = 0.2. We train the model for 40000 steps with batch size 128.

Figure 5: Generation for the various datasets (panels include Checkerboard, Fractal tree, Olympic rings, and Gaussian mixture).
Figure 6: Generation for the ZZP trained on MNIST.

NeurIPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: We believe the abstract and introduction reflect the contributions of the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We believe we made the limitations of our work clear.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [Yes]
Justification: To the best of our abilities, we made sure all theoretical statements have sound proofs and precise sets of assumptions.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We have indicated all the design choices that we used to obtain the results of our experiments.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide all the code necessary to reproduce our experiments.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: All the design choices are specified in the paper or in the supplementary material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes]
Justification: We report standard deviations in our tables, or state when the results are based on a single run.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We provide information on the type of compute workers that were used.

9. Code of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The research in this paper respects the NeurIPS Code of Ethics.

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [NA]
Justification: Our work does not have immediate societal impacts.

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We cite all the packages that are used in the paper.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: The code used to obtain the numerical simulations is provided and accessible.

14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve research with human subjects.