# Trans-Dimensional Generative Modeling via Jump Diffusion Models

Andrew Campbell¹, William Harvey², Christian Weilbach², Valentin De Bortoli³, Tom Rainforth¹, Arnaud Doucet¹
¹Department of Statistics, University of Oxford, UK
²Department of Computer Science, University of British Columbia, Vancouver, Canada
³CNRS, ENS Ulm, Paris, France
{campbell, rainforth, doucet}@stats.ox.ac.uk, {wsgh, weilbach}@cs.ubc.ca, valentin.debortoli@gmail.com

We propose a new class of generative models that naturally handle data of varying dimensionality by jointly modeling the state and dimension of each datapoint. The generative process is formulated as a jump diffusion process that makes jumps between different dimensional spaces. We first define a dimension destroying forward noising process, before deriving the dimension creating time-reversed generative process along with a novel evidence lower bound training objective for learning to approximate it. Simulating our learned approximation to the time-reversed generative process then provides an effective way of sampling data of varying dimensionality by jointly generating state values and dimensions. We demonstrate our approach on molecular and video datasets of varying dimensionality, reporting better compatibility with test-time diffusion guidance imputation tasks and improved interpolation capabilities versus fixed dimensional models that generate state values and dimensions separately.

1 Introduction

Generative models based on diffusion processes [1-3] have become widely used in solving a range of problems including text-to-image generation [4, 5], audio synthesis [6] and protein design [7]. These models define a forward noising diffusion process that corrupts data to noise and then learn the corresponding time-reversed backward generative process that generates novel datapoints from noise.

In many applications, for example generating novel molecules [8] or videos [9, 10], the dimension of the data can also vary. For example, a molecule can contain a varying number of atoms and a video can contain a varying number of frames. When defining a generative model over these data types, it is therefore necessary to model the number of dimensions along with the raw values of each of its dimensions (the state). Previous approaches to modeling such data have relied on first sampling the number of dimensions from the empirical distribution obtained from the training data, and then sampling data using a fixed dimension diffusion model (FDDM) conditioned on this number of dimensions [8]. For conditional modeling, where the number of dimensions may depend on the observations, this approach does not apply and we are forced to first train an auxiliary model that predicts the number of dimensions given the observations [11].

Figure 1: Illustration of the jump diffusion generative process on videos and molecules. The generative process consists of two parts: a diffusion part which denoises the current set of frames/atoms and a jump part which adds on a suitable number of new frames/atoms such that the final generation is a clean synthetic datapoint of an appropriate size.

This approach to trans-dimensional generative modeling is fundamentally limited due to the complete separation of dimension generation and state value generation. This is exemplified in the common use case of conditional diffusion guidance.
Here, an unconditional diffusion model is trained that end-users can then easily and cheaply condition on their task of interest by guiding the generative diffusion process [3, 12-14], without needing to perform any further training or fine-tuning of the model. Since the diffusion occurs in a fixed dimensional space, there is no way for the guidance to appropriately guide the dimension of the generated datapoint. This can lead to incorrect generations for datasets where the dimension greatly affects the nature of the datapoint created, e.g. small molecules have completely different properties to large molecules.

To generate data of varying dimensionality, we propose a jump diffusion based generative model that jointly generates both the dimension and the state. Our model can be seen as a unification of diffusion models, which generate all dimensions in parallel, with autoregressive type models, which generate dimensions sequentially. We derive the model by constructing a forward noising process that adds noise and removes dimensions and a backward generative process that denoises and adds dimensions, see Figure 1. We derive the optimum backward generative process as the time-reversal of the forward noising process and derive a novel learning objective to learn this backward process from data. We demonstrate the advantages of our method on molecular and video datasets, finding our method achieves superior guided generation performance and produces more representative data interpolations across dimensions.

2 Background

Standard continuous-time diffusion models [3, 15-17] define a forward diffusion process through a stochastic differential equation (SDE) where $x_0 \sim p_{\text{data}}$ and, for $t > 0$,
$$\mathrm{d}x_t = \vec{b}_t(x_t)\,\mathrm{d}t + g_t\,\mathrm{d}w_t, \qquad (1)$$
where $x_t \in \mathbb{R}^d$ is the current state, $\vec{b}_t : \mathbb{R}^d \to \mathbb{R}^d$ is the drift, $g_t \in \mathbb{R}$ is the diffusion coefficient and $\mathrm{d}w_t$ is a Brownian motion increment on $\mathbb{R}^d$. This SDE can be understood intuitively by noting that in each infinitesimal timestep, we move slightly in the direction of the drift $\vec{b}_t$ and inject a small amount of Gaussian noise governed by $g_t$. Let $p_t(x_t)$ denote the distribution of $x_t$ for the forward diffusion process (1) so that $p_0(x_0) = p_{\text{data}}(x_0)$. $\vec{b}_t$ and $g_t$ are set such that at time $t = T$, $p_T(x_T)$ is close to $p_{\text{ref}}(x_T) = \mathcal{N}(x_T; 0, I_d)$; e.g. $\vec{b}_t(x_t) = -\frac{1}{2}\beta_t x_t$, $g_t = \sqrt{\beta_t}$ for $\beta_t > 0$ [2, 3].

The time-reversal of the forward diffusion (1) is also a diffusion [18, 19] which runs backwards in time from $p_T(x_T)$ to $p_0(x_0)$ and satisfies the following reverse time SDE
$$\mathrm{d}x_t = \overleftarrow{b}_t(x_t)\,\mathrm{d}t + g_t\,\mathrm{d}\hat{w}_t, \qquad \text{where } \overleftarrow{b}_t(x_t) = \vec{b}_t(x_t) - g_t^2\,\nabla_{x_t}\log p_t(x_t),$$
$\mathrm{d}t$ is a negative infinitesimal time step and $\mathrm{d}\hat{w}_t$ is a Brownian motion increment when time flows backwards. Unfortunately, both the terminal distribution, $p_T(x_T)$, and the score, $\nabla_{x_t}\log p_t(x_t)$, are unknown in practice. A generative model is obtained by approximating $p_T$ with $p_{\text{ref}}$ and learning an approximation $s^\theta_t(x_t)$ to $\nabla_{x_t}\log p_t(x_t)$, typically using denoising score matching [20], i.e.
$$\min_\theta \; \mathbb{E}_{\mathcal{U}(t;0,T)\,p_{0,t}(x_0,x_t)}\big[\,\|s^\theta_t(x_t) - \nabla_{x_t}\log p_{t|0}(x_t|x_0)\|^2\,\big]. \qquad (2)$$
For a flexible model class $s^\theta$, we get $s^\theta_t(x_t) = \nabla_{x_t}\log p_t(x_t)$ at the minimizing parameter.

3 Trans-Dimensional Generative Model

Rather than working with fixed dimension datapoints, we assume our datapoints consist of a variable number of components. A datapoint $\mathbf{X}$ consists of $n$ components, each of dimension $d$. For ease of notation, each datapoint will explicitly store both the number of components, $n$, and the state values, $x$, giving $\mathbf{X} = (n, x)$.
Since each datapoint can have a variable number of components, from $n = 1$ to $n = N$, the overall space our datapoints live in is the union of all these possibilities, $\mathbf{X} \in \mathcal{X} = \bigcup_{n=1}^{N} \{n\} \times \mathbb{R}^{nd}$. For example, for a varying size point cloud dataset, components would refer to points in the cloud, each containing $(x, y, z)$ coordinates giving $d = 3$, and the maximum possible number of points in the cloud is $N$.

Broadly speaking, our approach will follow the same framework as previous diffusion generative models. We will first define a forward noising process that both corrupts state values with Gaussian noise and progressively deletes dimensions. We then learn an approximation to the time-reversal, giving a backward generative process that simultaneously denoises whilst also progressively adding dimensions back until a synthetic datapoint of appropriate dimensionality has been constructed.

3.1 Forward Process

Our forward and backward processes will be defined through jump diffusions. A jump diffusion process has two components, the diffusion part and the jump part. Between jumps, the process evolves according to a standard SDE. When a jump occurs, the process transitions to a different dimensional space, with the new value for the process being drawn from a transition kernel $K_t(\mathbf{Y}|\mathbf{X}) : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_{\geq 0}$. Letting $\mathbf{Y} = (m, y)$, the transition kernel satisfies $\sum_m \int_y K_t(m, y|\mathbf{X})\,\mathrm{d}y = 1$ and $\int_y K_t(m = n, y|\mathbf{X})\,\mathrm{d}y = 0$. The rate at which jumps occur (jumps per unit time) is given by a rate function $\lambda_t(\mathbf{X}) : \mathcal{X} \to \mathbb{R}_{\geq 0}$. For an infinitesimal timestep $\mathrm{d}t$, the jump diffusion can be written as
$$\text{Jump:}\quad \mathbf{X}'_t = \begin{cases} \mathbf{X}_t & \text{with probability } 1 - \lambda_t(\mathbf{X}_t)\,\mathrm{d}t, \\ \mathbf{Y} \sim K_t(\mathbf{Y}|\mathbf{X}_t) & \text{with probability } \lambda_t(\mathbf{X}_t)\,\mathrm{d}t, \end{cases}$$
$$\text{Diffusion:}\quad x_{t+\mathrm{d}t} = x'_t + b_t(\mathbf{X}'_t)\,\mathrm{d}t + g_t\,\mathrm{d}w_t, \qquad n_{t+\mathrm{d}t} = n'_t,$$
with $\mathbf{X}'_t \triangleq (n'_t, x'_t)$, $\mathbf{X}_{t+\mathrm{d}t} \triangleq (n_{t+\mathrm{d}t}, x_{t+\mathrm{d}t})$ and $\mathrm{d}w_t$ being a Brownian motion increment on $\mathbb{R}^{n'_t d}$. We provide a more formal definition in Appendix A.

With the jump diffusion formalism in hand, we can now construct our forward noising process. We will use the diffusion part to corrupt existing state values with Gaussian noise and the jump part to destroy dimensions. For the diffusion part, we use the VP-SDE introduced in [2, 3] with $\vec{b}_t(\mathbf{X}) = -\frac{1}{2}\beta_t x$ and $\vec{g}_t = \sqrt{\beta_t}$ with $\beta_t \geq 0$. When a jump occurs in the forward process, one component of the current state will be deleted. For example, one point in a point cloud or a single frame in a video is deleted. The rate at which these deletions occur is set by a user-defined forward rate $\vec{\lambda}_t(\mathbf{X})$.

To formalize the deletion, we need to introduce some more notation. We let $K_{\text{del}}(i|n)$ be a user-defined distribution over which component of the current state to delete. We also define $\text{del} : \mathcal{X} \times \mathbb{N} \to \mathcal{X}$ to be the deletion operator that deletes a specified component. Specifically, $(n-1, y) = \text{del}((n, x), i)$ where $y \in \mathbb{R}^{(n-1)d}$ has the same values as $x \in \mathbb{R}^{nd}$ except for the $d$ values corresponding to the $i$th component, which have been removed. We can now define the forward jump transition kernel as $\vec{K}_t(\mathbf{Y}|\mathbf{X}) = \sum_{i=1}^{n} K_{\text{del}}(i|n)\,\delta_{\text{del}(\mathbf{X},i)}(\mathbf{Y})$. We note that only one component is ever deleted at a time, meaning $\vec{K}_t(m, y|\mathbf{X}) = 0$ for $m \neq n-1$. Further, the choice of $K_{\text{del}}(i|n)$ will dictate the behaviour of the reverse generative process. If we set $K_{\text{del}}(i|n) = \mathbb{I}\{i = n\}$ then we only ever delete the final component, and so in the reverse generative direction datapoints are created additively, appending components onto the end of the current state. Alternatively, if we set $K_{\text{del}}(i|n) = 1/n$ then components are deleted uniformly at random during forward corruption, and in the reverse generative process the model will need to pick the most suitable location for a new component from all possible positions.
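To make these operators concrete, the following minimal Python sketch (not from the paper; NumPy-based, with all function names chosen purely for illustration) shows one way the deletion operator del, its inverse insertion operator ins used later by the backward process, and sampling from the forward kernel under the two choices of $K_{\text{del}}(i|n)$ above could be implemented, storing a datapoint $\mathbf{X} = (n, x)$ as an $(n, d)$ array.

```python
import numpy as np

# A datapoint X = (n, x): n components, each of dimension d,
# stored as an (n, d) array so that n is implicit in the shape.

def delete_component(x: np.ndarray, i: int) -> np.ndarray:
    """del((n, x), i): remove the i-th component (0-indexed here)."""
    return np.delete(x, i, axis=0)

def insert_component(x: np.ndarray, y_add: np.ndarray, i: int) -> np.ndarray:
    """ins((n, x), y_add, i): insert component y_add at position i,
    so that delete_component(insert_component(x, y_add, i), i) recovers x."""
    return np.insert(x, i, y_add, axis=0)

def sample_k_del(n: int, mode: str = "uniform") -> int:
    """Sample which component to delete under K_del(i | n)."""
    if mode == "last":                 # K_del(i|n) = 1{i = n}: always delete the final component
        return n - 1
    return np.random.randint(n)        # K_del(i|n) = 1/n: delete uniformly at random

def forward_jump(x: np.ndarray, mode: str = "uniform") -> np.ndarray:
    """One forward jump: apply K_t(Y|X) = sum_i K_del(i|n) delta_{del(X, i)}(Y)."""
    n = x.shape[0]
    if n <= 1:                         # lambda_t(n = 1) = 0: never delete the last component
        return x
    return delete_component(x, sample_k_del(n, mode))
```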
The forward noising process is simulated from $t = 0$ to $t = T$ and should be such that at time $t = T$ the marginal probability $p_t(\mathbf{X})$ is close to a reference measure $p_{\text{ref}}(\mathbf{X})$ that can be sampled from. We set $p_{\text{ref}}(\mathbf{X}) = \mathbb{I}\{n = 1\}\,\mathcal{N}(x; 0, I_d)$, where $\mathbb{I}\{n = 1\}$ is 1 when $n = 1$ and 0 otherwise. To be close to $p_{\text{ref}}$, for the jump part, we set $\vec{\lambda}_t$ high enough such that at time $t = T$ there is a high probability that all but one of the components in the original datapoint have been deleted. For simplicity, we also set $\vec{\lambda}_t$ to depend only on the current dimension, $\vec{\lambda}_t(\mathbf{X}) = \vec{\lambda}_t(n)$, with $\vec{\lambda}_t(n = 1) = 0$ so that the forward process stops deleting components when there is only 1 left. In our experiments, we demonstrate the trade-offs between different rate schedules in time. For the diffusion part, we use the standard diffusion $\beta_t$ schedule [2, 3] so that we are close to $\mathcal{N}(x; 0, I_d)$.

3.2 Backward Process

The backward generative process will simultaneously denoise and add dimensions back in order to construct the final datapoint. It will consist of a backward drift $\overleftarrow{b}_t(\mathbf{X})$, diffusion coefficient $\overleftarrow{g}_t$, rate $\overleftarrow{\lambda}_t(\mathbf{X})$ and transition kernel $\overleftarrow{K}_t(\mathbf{Y}|\mathbf{X})$. We would like these quantities to be such that the backward process is the time-reversal of the forward process.

In order to find the time-reversal of the forward process, we must first introduce some notation to describe $\overleftarrow{K}_t(\mathbf{Y}|\mathbf{X})$. $\overleftarrow{K}_t(\mathbf{Y}|\mathbf{X})$ should undo the forward deletion operation. Since $\vec{K}_t(\mathbf{Y}|\mathbf{X})$ chooses a component and then deletes it, $\overleftarrow{K}_t(\mathbf{Y}|\mathbf{X})$ will need to generate the state values for a new component, decide where the component should be placed and then insert it at this location. Our new component will be denoted $y^{\text{add}} \in \mathbb{R}^d$. The insertion operator is defined as $\text{ins} : \mathcal{X} \times \mathbb{R}^d \times \mathbb{N} \to \mathcal{X}$. It takes in the current value $\mathbf{X}$, the new component $y^{\text{add}}$ and an index $i \in \{1, \dots, n+1\}$, and inserts $y^{\text{add}}$ into $\mathbf{X}$ at location $i$ such that the resulting value $\mathbf{Y} = \text{ins}(\mathbf{X}, y^{\text{add}}, i)$ has $\text{del}(\mathbf{Y}, i) = \mathbf{X}$. We denote the joint conditional distribution over the newly added component and the index at which it is inserted as $A_t(y^{\text{add}}, i|\mathbf{X})$. We therefore have $\overleftarrow{K}_t(\mathbf{Y}|\mathbf{X}) = \int_{y^{\text{add}}} \sum_{i=1}^{n+1} A_t(y^{\text{add}}, i|\mathbf{X})\,\delta_{\text{ins}(\mathbf{X}, y^{\text{add}}, i)}(\mathbf{Y})\,\mathrm{d}y^{\text{add}}$. Noting that only one component is ever added at a time, we have $\overleftarrow{K}_t(m, y|\mathbf{X}) = 0$ for $m \neq n + 1$.

This backward process formalism can be seen as a unification of diffusion models with autoregressive models. The diffusion part $\overleftarrow{b}_t$ denoises the current set of components in parallel, whilst the autoregressive part $A_t(y^{\text{add}}, i|\mathbf{X})$ predicts a new component and its location. $\overleftarrow{\lambda}_t(\mathbf{X})$ is the glue between these parts, controlling when and how many new components are added during generation. We now give the optimum values for $\overleftarrow{b}_t(\mathbf{X})$, $\overleftarrow{g}_t$, $\overleftarrow{\lambda}_t(\mathbf{X})$ and $A_t(y^{\text{add}}, i|\mathbf{X})$ such that the backward process is the time-reversal of the forward process.
Proposition 1. The time reversal of a forward jump diffusion process given by drift $\vec{b}_t$, diffusion coefficient $\vec{g}_t$, rate $\vec{\lambda}_t(n)$ and transition kernel $\sum_{i=1}^{n} K_{\text{del}}(i|n)\,\delta_{\text{del}(\mathbf{X},i)}(\mathbf{Y})$ is given by a jump diffusion process with drift $\overleftarrow{b}^*_t(\mathbf{X})$, diffusion coefficient $\overleftarrow{g}^*_t$, rate $\overleftarrow{\lambda}^*_t(\mathbf{X})$ and transition kernel $\int_{y^{\text{add}}} \sum_{i=1}^{n+1} A^*_t(y^{\text{add}}, i|\mathbf{X})\,\delta_{\text{ins}(\mathbf{X}, y^{\text{add}}, i)}(\mathbf{Y})\,\mathrm{d}y^{\text{add}}$, as defined below:
$$\overleftarrow{b}^*_t(\mathbf{X}) = \vec{b}_t(\mathbf{X}) - \vec{g}_t^{\,2}\,\nabla_x \log p_t(\mathbf{X}), \qquad \overleftarrow{g}^*_t = \vec{g}_t,$$
$$\overleftarrow{\lambda}^*_t(\mathbf{X}) = \vec{\lambda}_t(n+1)\,\frac{\sum_{i=1}^{n+1} K_{\text{del}}(i|n+1)\int_{y^{\text{add}}} p_t(\text{ins}(\mathbf{X}, y^{\text{add}}, i))\,\mathrm{d}y^{\text{add}}}{p_t(\mathbf{X})},$$
$$A^*_t(y^{\text{add}}, i|\mathbf{X}) \propto p_t(\text{ins}(\mathbf{X}, y^{\text{add}}, i))\,K_{\text{del}}(i|n+1).$$

All proofs are given in Appendix A. The expressions for $\overleftarrow{b}^*_t$ and $\overleftarrow{g}^*_t$ are the same as for a standard diffusion except for replacing $\nabla_x \log p_t(x)$ with $\nabla_x \log p_t(\mathbf{X}) = \nabla_x \log p_t(x|n)$, which is simply the score in the current dimension. The expression for $\overleftarrow{\lambda}^*_t$ can be understood intuitively by noting that the numerator in the probability ratio is the probability that at time $t$, given a deletion occurs, the forward process will arrive at $\mathbf{X}$. If this is higher than the raw probability at time $t$ that the forward process is at $\mathbf{X}$ (the denominator) then we should have high $\overleftarrow{\lambda}^*_t$, because $\mathbf{X}$ is likely the result of a deletion from a larger datapoint. Finally, the optimum $A^*_t(y^{\text{add}}, i|\mathbf{X})$ is simply the conditional distribution of $y^{\text{add}}$ and $i$ given $\mathbf{X}$ when the joint distribution over $y^{\text{add}}, i, \mathbf{X}$ is given by $p_t(\text{ins}(\mathbf{X}, y^{\text{add}}, i))\,K_{\text{del}}(i|n+1)$.

3.3 Objective for Learning the Backward Process

The true $\overleftarrow{b}^*_t$, $\overleftarrow{\lambda}^*_t$ and $A^*_t$ are unknown, so we need to learn approximations to them, $\overleftarrow{b}^\theta_t$, $\overleftarrow{\lambda}^\theta_t$ and $A^\theta_t$. Following Proposition 1, we set $\overleftarrow{b}^\theta_t(\mathbf{X}) = \vec{b}_t(\mathbf{X}) - \vec{g}_t^{\,2}\,s^\theta_t(\mathbf{X})$, where $s^\theta_t(\mathbf{X})$ approximates $\nabla_x \log p_t(\mathbf{X})$. The forward and parameterized backward processes are summarized in Table 1.

Table 1: Summary of forward and parameterized backward processes.

| Direction | $b_t$ | $g_t$ | $\lambda_t(\mathbf{X})$ | $K_t(\mathbf{Y}\mid\mathbf{X})$ |
|---|---|---|---|---|
| Forward | $-\frac{1}{2}\beta_t x$ | $\sqrt{\beta_t}$ | $\vec{\lambda}_t(n)$ | $\sum_{i=1}^{n} K_{\text{del}}(i|n)\,\delta_{\text{del}(\mathbf{X},i)}(\mathbf{Y})$ |
| Backward | $-\frac{1}{2}\beta_t x - \beta_t s^\theta_t(\mathbf{X})$ | $\sqrt{\beta_t}$ | $\overleftarrow{\lambda}^\theta_t(\mathbf{X})$ | $\int_{y^{\text{add}}} \sum_{i=1}^{n+1} A^\theta_t(y^{\text{add}}, i|\mathbf{X})\,\delta_{\text{ins}(\mathbf{X}, y^{\text{add}}, i)}(\mathbf{Y})\,\mathrm{d}y^{\text{add}}$ |

Standard diffusion models are trained using a denoising score matching loss which can be derived by maximizing an evidence lower bound on the model probability $\mathbb{E}_{p_{\text{data}}(x_0)}[\log p^\theta_0(x_0)]$ [21]. We derive here an equivalent loss to learn $s^\theta_t$, $\overleftarrow{\lambda}^\theta_t$ and $A^\theta_t$ for our jump diffusion process by leveraging the results of [17] and [22]. Before presenting this loss, we first introduce some notation. Our objective for $s^\theta_t(\mathbf{X}_t)$ will resemble denoising score matching (2) but instead involve the conditional score $\nabla_{x_t}\log p_{t|0}(\mathbf{X}_t|\mathbf{X}_0) = \nabla_{x_t}\log p_{t|0}(x_t|\mathbf{X}_0, n_t)$. This is difficult to calculate directly due to a combinatorial sum over the different ways the components of $\mathbf{X}_0$ can be deleted to get to $\mathbf{X}_t$. We avoid this problem by equivalently conditioning on a mask variable $M_t \in \{0, 1\}^{n_0}$ that is 0 for components of $\mathbf{X}_0$ that have been deleted to get to $\mathbf{X}_t$ and 1 for components that remain in $\mathbf{X}_t$. This makes our denoising score matching target easy to calculate:
$$\nabla_{x_t}\log p_{t|0}(x_t|\mathbf{X}_0, n_t, M_t) = \frac{\sqrt{\bar{\alpha}_t}\,M_t(x_0) - x_t}{1 - \bar{\alpha}_t}, \qquad \bar{\alpha}_t = \exp\Big(-\int_0^t \beta(s)\,\mathrm{d}s\Big) \text{ [3]}.$$
Here $M_t(x_0)$ is the vector removing any components in $x_0$ for which $M_t$ is 0, thus $M_t(x_0)$ and $x_t$ have the same dimensionality. We now state our full objective.

Proposition 2. For the backward generative jump diffusion process starting at $p_{\text{ref}}(\mathbf{X}_T)$ and finishing at $p^\theta_0(\mathbf{X}_0)$, an evidence lower bound $\mathcal{L}(\theta)$ on the model log-likelihood $\mathbb{E}_{x_0 \sim p_{\text{data}}}[\log p^\theta_0(x_0)]$ is given by
$$-\frac{T}{2}\,\mathbb{E}\Big[\,g_t^2\,\big\| s^\theta_t(\mathbf{X}_t) - \nabla_{x_t}\log p_{t|0}(x_t|\mathbf{X}_0, n_t, M_t)\big\|^2\Big] \qquad (3)$$
$$+\; T\,\mathbb{E}\Big[-\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t) + \vec{\lambda}_t(n_t)\log \overleftarrow{\lambda}^\theta_t(\mathbf{Y}) + \vec{\lambda}_t(n_t)\log A^\theta_t(x^{\text{add}}_t, i|\mathbf{Y})\Big] + C, \qquad (4)$$
where expectations are with respect to $\mathcal{U}(t; 0, T)\,p_{0,t}(\mathbf{X}_0, \mathbf{X}_t, M_t)\,K_{\text{del}}(i|n_t)\,\delta_{\text{del}(\mathbf{X}_t, i)}(\mathbf{Y})$, $C$ is a constant term independent of $\theta$, and $\mathbf{X}_t = \text{ins}(\mathbf{Y}, x^{\text{add}}_t, i)$. This evidence lower bound is equal to the log-likelihood when $\overleftarrow{b}^\theta_t = \overleftarrow{b}^*_t$, $\overleftarrow{\lambda}^\theta_t = \overleftarrow{\lambda}^*_t$ and $A^\theta_t = A^*_t$.
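As a rough illustration of how a single-sample Monte Carlo estimate of terms (3) and (4) might be assembled, the sketch below treats the noisy sample, mask, deleted component, forward rate and learned network heads as given callables and arrays; all function and argument names are hypothetical, and this is a sketch under the notation above rather than the authors' implementation.

```python
import numpy as np

def elbo_terms(t, T, g_t, lam_fwd_nt,            # g_t = sqrt(beta_t); lam_fwd_nt = forward rate at n_t
               s_theta, lam_theta, log_A_theta,  # learned heads (hypothetical callables)
               X_t, score_target,                # noisy sample and grad log p_{t|0}(x_t | X_0, n_t, M_t)
               Y, x_add, i):                     # Y = del(X_t, i); x_add is the deleted component
    """Single-sample estimate of the two ELBO terms of Proposition 2 (up to the constant C)."""
    # Term (3): trans-dimensional denoising score matching regression.
    score_term = -0.5 * T * g_t**2 * np.sum((s_theta(t, X_t) - score_target) ** 2)

    # Term (4): learning signal for the backward rate and the new-component distribution.
    jump_term = T * (-lam_theta(t, X_t)
                     + lam_fwd_nt * np.log(lam_theta(t, Y))
                     + lam_fwd_nt * log_A_theta(t, x_add, i, Y))
    return score_term + jump_term                 # maximized by stochastic gradient ascent
```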
We now examine the objective to gain an intuition into the learning signal. Our first term (3) is an L2 regression to a target that, as we have seen, is a scaled vector between $x_t$ and $\sqrt{\bar{\alpha}_t}\,M_t(x_0)$. As the solution to an L2 regression problem is the conditional expectation of the target, $s^\theta_t(\mathbf{X}_t)$ will learn to predict vectors pointing towards $x_0$, averaged over the possible correspondences between dimensions of $x_t$ and dimensions of $x_0$. Thus, during sampling, $s^\theta_t(\mathbf{X}_t)$ provides a suitable direction in which to adjust the current value $\mathbf{X}_t$, taking into account the fact that $\mathbf{X}_t$ represents only a noisy subpart of a clean whole $\mathbf{X}_0$.

The second term (4) gives a learning signal for $\overleftarrow{\lambda}^\theta_t$ and $A^\theta_t$. For $A^\theta_t$, we simply have a maximum likelihood objective, predicting the missing part of $\mathbf{X}_t$ (i.e. $x^{\text{add}}_t$) given the observed part of $\mathbf{X}_t$ (i.e. $\mathbf{Y}$). The signal for $\overleftarrow{\lambda}^\theta_t$ comes from balancing two terms, $-\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$ and $\vec{\lambda}_t(n_t)\log \overleftarrow{\lambda}^\theta_t(\mathbf{Y})$, which encourage the value of $\overleftarrow{\lambda}^\theta_t$ to move in opposite directions. For a new test input $\mathbf{Z}$, the value of $\overleftarrow{\lambda}^\theta_t(\mathbf{Z})$ needs to trade off between the two terms by learning the relative probability between $\mathbf{Z}$ being the entirety of a genuine sample from the forward process, corresponding to the $-\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$ term in (4), or $\mathbf{Z}$ being a substructure of a genuine sample, corresponding to the $\overleftarrow{\lambda}^\theta_t(\mathbf{Y})$ term in (4). The optimum trade-off is found exactly at the time reversal $\overleftarrow{\lambda}^*_t$, as we show in Appendix A.5.

We optimize $\mathcal{L}(\theta)$ using stochastic gradient ascent, generating minibatches by first sampling $t \sim \mathcal{U}(0, T)$, $\mathbf{X}_0 \sim p_{\text{data}}$ and then computing $\mathbf{X}_t$ from the forward process. This can be done analytically for the $\vec{\lambda}_t(n)$ functions used in our experiments. We first sample $n_t$ by analytic integration of the dimension deletion Poisson process with time inhomogeneous rate $\vec{\lambda}_t(n)$. We then add Gaussian noise independently to each dimension under $p_{t|0}(x_t|\mathbf{X}_0, n_t, M_t)$ using a randomly drawn mask variable $M_t$. See Appendix B for further details on the efficient evaluation of our objective.

3.4 Parameterization

$s^\theta_t(\mathbf{X}_t)$, $A^\theta_t(y^{\text{add}}, i|\mathbf{X}_t)$ and $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$ will all be parameterized by neural networks. In practice, we have a single backbone network suited to the problem of interest, e.g. a Transformer [23], an EGNN [24] or a UNet [25], onto which we add prediction heads for $s^\theta_t(\mathbf{X}_t)$, $A^\theta_t(y^{\text{add}}, i|\mathbf{X}_t)$ and $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$. $s^\theta_t(\mathbf{X}_t)$ outputs a vector in $\mathbb{R}^{n_t d}$. $A^\theta_t(y^{\text{add}}, i|\mathbf{X}_t)$ outputs a distribution over $i$ and mean and standard deviation statistics for a Gaussian distribution over $y^{\text{add}}$. Finally, having $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t) \in \mathbb{R}_{\geq 0}$ be the raw output of a neural network can cause optimization issues, because the optimum $\overleftarrow{\lambda}^*_t$ includes a probability ratio which can take on very large values. Instead, we learn a component prediction network $p^\theta_{0|t}(n_0|\mathbf{X}_t)$ that predicts the number of components in $\mathbf{X}_0$ given $\mathbf{X}_t$. To convert this into $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$, we show in Proposition 3 how the optimum $\overleftarrow{\lambda}^*_t(\mathbf{X}_t)$ is an analytic function of the true $p_{0|t}(n_0|\mathbf{X}_t)$. We then plug $p^\theta_{0|t}(n_0|\mathbf{X}_t)$ into Proposition 3 to obtain an approximation of $\overleftarrow{\lambda}^*_t(\mathbf{X}_t)$.

Proposition 3. We have
$$\overleftarrow{\lambda}^*_t(\mathbf{X}_t) = \vec{\lambda}_t(n_t + 1) \sum_{n_0} \frac{p_{t|0}(n_t + 1|n_0)}{p_{t|0}(n_t|n_0)}\, p_{0|t}(n_0|\mathbf{X}_t),$$
where $\mathbf{X}_t = (n_t, x_t)$ and $p_{t|0}(n_t + 1|n_0)$ and $p_{t|0}(n_t|n_0)$ are both easily calculable distributions from the forward dimension deletion process.

3.5 Sampling

Algorithm 1: Sampling the Generative Process
  $t \leftarrow T$
  $\mathbf{X} \sim p_{\text{ref}}(\mathbf{X}) = \mathbb{I}\{n = 1\}\,\mathcal{N}(x; 0, I_d)$
  while $t > 0$ do
    if $u < \overleftarrow{\lambda}^\theta_t(\mathbf{X})\,\delta t$ with $u \sim \mathcal{U}(0, 1)$ then
      Sample $x^{\text{add}}, i \sim A^\theta_t(x^{\text{add}}, i|\mathbf{X})$
      $\mathbf{X} \leftarrow \text{ins}(\mathbf{X}, x^{\text{add}}, i)$
    end
    $x \leftarrow x - \overleftarrow{b}^\theta_t(\mathbf{X})\,\delta t + g_t\sqrt{\delta t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I_{nd})$
    $\mathbf{X} \leftarrow (n, x)$, $t \leftarrow t - \delta t$
  end

To sample the generative process, we numerically integrate the learned backward jump diffusion process using time-step $\delta t$.
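For concreteness, a minimal NumPy sketch of this sampling loop follows; the callables b_theta, lam_theta and sample_A_theta stand in for the learned prediction heads, beta is the chosen noise schedule, and insert_component is the insertion helper from the earlier sketch. All names are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def sample_generative_process(T, n_steps, d, beta,                 # beta(t): assumed noise schedule
                              b_theta, lam_theta, sample_A_theta,  # hypothetical learned model heads
                              insert_component):
    """Euler-Maruyama integration of the learned backward jump diffusion (cf. Algorithm 1)."""
    dt = T / n_steps
    t = T
    x = np.random.randn(1, d)                   # X ~ p_ref: n = 1 component with standard Gaussian state
    while t > 0:
        # Jump part: with probability lam_theta(t, X) * dt, insert a new component.
        if np.random.rand() < lam_theta(t, x) * dt:
            y_add, i = sample_A_theta(t, x)     # sample (y_add, i) ~ A^theta_t(. | X)
            x = insert_component(x, y_add, i)
        # Diffusion part: reverse-time Euler-Maruyama step on the current state values.
        g_t = np.sqrt(beta(t))
        eps = np.random.randn(*x.shape)
        x = x - b_theta(t, x) * dt + g_t * np.sqrt(dt) * eps
        t -= dt
    return x                                    # final (n, d)-shaped clean sample
```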
Intuitively, it is simply the standard continuous time diffusion sampling scheme [3] but at each timestep we check whether a jump has occurred and if it has, sample the new component and insert it at the chosen index as explained by Algorithm 1. 4 Related Work Our method jointly generates both dimensions and state values during the generative process whereas prior approaches [8, 11] are forced to first sample the number of dimensions and then run the diffusion process in this fixed dimension. When diffusion guidance is applied to these unconditional models [14, 26], users need to pick by hand the number of dimensions independent of the conditioning information even though the number of dimensions can be correlated with the conditioning parameter. Instead of automatically learning when and how many dimensions to add during the generative process, previous work focusing on images [27, 28] hand pick dimension jump points such that the resolution of images is increased during sampling and reaches a certain pre-defined desired resolution at the end of the generative process. Further, rather than using any equivalent of Aθ t, the values for new dimensions are simply filled in with Gaussian noise. These approaches mainly focus on efficiency rather than flexible generation as we do here. The first term in our learning objective in Proposition 2 corresponds to learning the continuous part of our process (the diffusion) and the second corresponds to learning the discrete part of our process (the jumps). The first term can be seen as a trans-dimensional extension of standard denoising score matching [20] whilst the second bears similarity to the discrete space ELBO derived in [29]. Finally, jump diffusions also have a long history of use in Bayesian inference, where one aims to draw samples from a trans-dimensional target posterior distribution based on an unnormalized version of its density [30]: an ergodic jump diffusion is designed which admits the target as the invariant distribution [30 32]. The invariant distribution is not preserved when time-discretizing the process. However, it was shown in [33, 34] how general jump proposals could be built and how this process could be Metropolized to obtain a discrete-time Markov process admitting the correct invariant distribution, yielding the popular Reversible Jump Markov Chain Monte Carlo algorithm. Our setup differs significantly as we only have access to samples in the form of data, not an unnormalized target. 5 Experiments 5.1 Molecules We now show how our model provides significant benefits for diffusion guidance and interpolation tasks. We model the QM9 dataset [35, 36] of 100K varying size molecules. Following [8], we consider each molecule as a 3-dimensional point cloud of atoms, each atom having the features: (x, y, z) coordinates, a one-hot encoded atom type, and an integer charge value. Bonds are inferred from inter-atomic distances. We use an EGNN [24] backbone with three heads to predict sθ t , pθ 0|t(n0|Xt), and Aθ t. We uniformly delete dimensions, Kdel(i|n) = 1/n, and since a point cloud is permutation invariant, Aθ t(yadd|Xt) need only predict new dimension values. We set λ t to a constant except for t < 0.1T, where we set λ t<0.1T = 0. This ensures that all dimensions are added with enough generation time remaining for the diffusion process to finalize all state values. 
We visualize sampling from our learned generative process in Figure 2; note how the process jointly creates a suitable number of atoms whilst adjusting their positions and identities. Before moving on to apply diffusion guidance, which is the focus of our experiments, we first verify our unconditional sample quality in Table 2 and find we perform comparably to the results reported in [8], which use an FDDM. We ablate our choice of $\vec{\lambda}_t$ by comparing with setting $\vec{\lambda}_t$ to a constant for all $t$ and with setting $\vec{\lambda}_t = 0$ for $t < 0.9T$ (rather than just for $t < 0.1T$). We find that the constant $\vec{\lambda}_t$ performs worse due to the occasional component being added late in the generation process without enough time for the diffusion process to finalize its value. We find the $\vec{\lambda}_{t<0.9T} = 0$ setting to have satisfactory sample quality, however this choice of $\vec{\lambda}_t$ introduces issues during diffusion guided generation, as we see next. Finally, we ablate the parameterization of Proposition 3 by learning $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t) \in \mathbb{R}$ directly as the output of a neural network head. We find that this reduces sample quality, reflecting the more well-behaved nature of the target $p^\theta_{0|t}(n_0|\mathbf{X}_t)$ when using Proposition 3. We note pure autoregressive models perform significantly worse than diffusion based models, as found in [8].

Figure 2: Visualization of the jump-diffusion backward generative process on molecules.

Table 2: Sample quality metrics for unconditional molecule generation. An atom is stable if it has the correct valency, whilst a molecule is considered stable if all of its atoms are stable. Molecular validity is measured using RDKit [37]. All methods use 1000 simulation steps and draw 10000 samples.

| Method | % Atom Stable (↑) | % Molecule Stable (↑) | % Valid (↑) |
|---|---|---|---|
| FDDM [8] | 98.7 | 82.0 | 91.9 |
| TDDM (ours) | 98.3 | 87.2 | 92.3 |
| TDDM, const $\vec{\lambda}_t$ | 96.7 | 79.1 | 86.7 |
| TDDM, $\vec{\lambda}_{t<0.9T} = 0$ | 97.7 | 82.6 | 89.4 |
| TDDM w/o Prop. 3 | 97.0 | 66.9 | 87.1 |

Table 3: Conditional molecule generation for 10 conditioning tasks that each result in a different dimension distribution. We report dimension error as the average Hellinger distance between the generated and ground truth dimension distributions for that property, as well as average sample quality metrics. Standard deviations are given across the 10 conditioning tasks. We report in bold values that are statistically indistinguishable from the best result at the 5% level using a two-sided Wilcoxon signed rank test across the 10 conditioning tasks.

| Method | Dimension Error (↓) | % Atom Stable (↑) | % Molecule Stable (↑) | % Valid (↑) |
|---|---|---|---|---|
| FDDM | 0.511 ± 0.19 | 93.5 ± 1.1 | 31.3 ± 6.3 | 65.2 ± 10.3 |
| TDDM | 0.134 ± 0.076 | 93.5 ± 2.6 | 59.1 ± 11 | 74.8 ± 9.3 |
| TDDM, const $\vec{\lambda}_t$ | 0.226 ± 0.17 | 88.9 ± 4.8 | 43.6 ± 15 | 63.4 ± 14 |
| TDDM, $\vec{\lambda}_{t<0.9T} = 0$ | 0.390 ± 0.38 | 95.0 ± 2.1 | 61.7 ± 17 | 77.8 ± 13 |
| TDDM w/o Prop. 3 | 0.219 ± 0.12 | 93.8 ± 3.2 | 55.0 ± 19 | 73.8 ± 13 |

5.1.1 Trans-Dimensional Diffusion Guidance

We now apply diffusion guidance to our unconditional model in order to generate molecules that contain a certain number of desired atom types, e.g. 3 carbons, or 1 oxygen and 2 nitrogens. The distribution of molecule sizes changes depending on these conditions. We generate molecules conditioned on these properties by using the reconstruction guided sampling approach introduced in [9]. This method augments the score $s^\theta_t(\mathbf{X}_t)$ such that it approximates $\nabla_{x_t}\log p_t(\mathbf{X}_t|y)$ rather than $\nabla_{x_t}\log p_t(\mathbf{X}_t)$ (where $y$ is the conditioning information) by adding on a term approximating $\nabla_{x_t}\log p_t(y|\mathbf{X}_t)$ with $p_t(y|\mathbf{X}_t) = \int p(y|\mathbf{X}_0)\,p_{0|t}(\mathbf{X}_0|\mathbf{X}_t)\,\mathrm{d}\mathbf{X}_0$. This guides $x_t$ such that it is consistent with $y$.
Since $\overleftarrow{\lambda}^\theta_t(\mathbf{X}_t)$ has access to $x_t$, it will cause $n_t$ to automatically also be consistent with $y$, without the user needing to input any information on how the conditioning information relates to the size of the datapoints. We give further details on diffusion guidance in Appendix C.

We show our results in Table 3. In order to perform guidance on the FDDM baseline, we implement the model from [8] in continuous time and initialize the dimension from the empirically observed dimension distribution in the dataset. This accounts for the case of an end user attempting to guide an unconditional model with access to no further information. We find that TDDM produces samples whose dimensions much more accurately reflect the true conditional distribution of dimensions given the conditioning information. The $\vec{\lambda}_{t<0.9T} = 0$ ablation, on the other hand, only marginally improves the dimension error over FDDM because all dimensions are added in the generative process at a time when $\mathbf{X}_t$ is noisy and has little relation to the conditioning information. This highlights the necessity of allowing dimensions to be added throughout the generative process to gain the trans-dimensional diffusion guidance ability. The ablation with constant $\vec{\lambda}_t$ has increased dimension error over TDDM: we find that when $\vec{\lambda}_t > 0$ for all $t$, $\overleftarrow{\lambda}^\theta_t$ can become very large when $t$ is close to 0 if the model has perceived a lack of dimensions. This occasionally results in too many dimensions being added, hence an increased dimension error. Not using the Proposition 3 parameterization also increases dimension error, due to the increased difficulty in learning $\overleftarrow{\lambda}^\theta_t$.

5.1.2 Trans-Dimensional Interpolation

Figure 3: Number of atoms versus polarizability for 3 interpolations with fixed random noise. The dataset mean and standard deviation for the number of atoms is also shown. FDDM interpolates entirely in a fixed dimensional space, hence the number of atoms is fixed for all polarizabilities.

Interpolations are a unique way of gaining insights into the effect of some conditioning parameter on a dataset of interest. To create an interpolation, a conditional generative model is first trained and then sampled with a sweep of the conditioning parameter but using fixed random noise [8]. The resulting series of synthetic datapoints share similar features due to the fixed random noise but vary in ways that are very informative as to the effect of the conditioning parameter. Attempting to interpolate with an FDDM is fundamentally limited because the entire interpolation occurs in the same dimension, which is unrealistic when the conditioning parameter is heavily correlated with the dimension of the datapoint. We demonstrate this by following the setup of [8], who train a conditional FDDM conditioned on polarizability. Polarizability is the ability of a molecule's electron cloud to distort in response to an external electric field [38], with larger molecules tending to have higher polarizability. To enable us to perform a trans-dimensional interpolation, we also train a conditional version of our model conditioned on polarizability. An example interpolation with this model is shown in Figure 4. We find that indeed the size of the molecule increases with increasing polarizability, with some molecular substructures, e.g. rings, being maintained across dimensions. We show how the dimension changes with polarizability during 3 interpolations in Figure 3.
We find that these match the true dataset statistics much more accurately than interpolations using FDDM, which first picks a dimension and carries out the entire interpolation in that fixed dimension.

Figure 4: Sequence of generations for linearly increasing polarizability from 39 Bohr³ to 66 Bohr³ with fixed random noise. Note how molecular size generally increases with polarizability and how some molecular substructures are maintained between sequential generations of differing dimension. For example, between molecules 6 and 7, the single change is a nitrogen (blue) to a carbon (gray), and an extra hydrogen (white) is added to maintain the correct valency.

We finally demonstrate our model on a video modeling task. Specifically, we model the RoboDesk dataset [39], a video benchmark to measure the applicability of video models for planning and control problems. The videos are renderings of a robotic arm [40] performing a variety of different tasks, including opening drawers and moving objects. We first train an unconditional model on videos of varying length and then perform planning by applying diffusion guidance to generate videos conditioned on an initial starting frame and a final goal frame [41]. The planning problem is then reduced to filling in the frames in between. Our trans-dimensional model automatically varies the number of in-filled frames during generation so that the final length of video matches the length of time the task should take, whereas the fixed dimension model relies on the unrealistic assumption that the length of time the task should take is known before generation.

We model videos at 32 × 32 resolution and with varying length from 2 to 35 frames. For the network backbone, we use a UNet adapted for video [42]. In contrast to molecular point clouds, our data is no longer permutation invariant, hence $A^\theta_t(y^{\text{add}}, i|\mathbf{X}_t)$ includes a prediction over the location at which to insert the new frame. Full experimental details are provided in Appendix D. We evaluate our approach on three planning tasks: holding stationary, sliding a door and pushing an object. An example generation conditioned on the first and last frame for the slide door task is shown in Figure 5, with the model in-filling a plausible trajectory. We quantify our model's ability to generate videos of a length appropriate to the task in Table 4, finding that on all three tasks we generate a more accurate length of video than FDDM, which is forced to sample video lengths from the unconditional empirically observed length distribution in the training dataset.

Figure 5: A sample for the slide door task conditioned on the first and last frame (highlighted).

Table 4: Dimension prediction mean absolute error for three planning tasks, with standard deviations estimated over 45 samples.

| Method | Stationary (↓) | Slide Door (↓) | Push Object (↓) | Average (↓) |
|---|---|---|---|---|
| FDDM | 14.16 ± 1.41 | 13.39 ± 1.34 | 17.06 ± 1.47 | 14.87 |
| TDDM | 9.70 ± 0.99 | 11.47 ± 0.74 | 15.43 ± 0.90 | 12.2 |

6 Discussion

In this work, we highlighted the pitfalls of performing generative modeling on varying dimensional datasets when treating state values and dimensions completely separately. We instead proposed a trans-dimensional generative model that generates both state values and dimensions jointly during the generative process. We detailed how this process can be formalized with the time-reversal of a jump diffusion and derived a novel evidence lower bound training objective for learning the generative process from data.
In our experiments, we found our trans-dimensional model to provide significantly better dimension generation performance for diffusion guidance and interpolations when conditioning on properties that are heavily correlated with the dimension of a datapoint. We believe our approach can further enable generative models to be applied in a wider variety of domains where previous restrictive fixed dimension assumptions have been unsuitable. 7 Acknowledgements The authors are grateful to Martin Buttenschoen for helpful discussions. AC acknowledges support from the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1). AD acknowledges support of the UK Dstl and EPSRC grant EP/R013616/1. This is part of the collaboration between US DOD, UK MOD and UK EPSRC under the Multidisciplinary University Research Initiative. He also acknowledges support from the EPSRC grants Co Sines (EP/R034710/1) and Bayes4Health (EP/R018561/1). WH and CW acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program. This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) under the Defense Advanced Research Projects Agency (DARPA) Data Driven Discovery Models (D3M) program (Contract No. FA8750-19-2-0222) and Learning with Less Labels (Lw LL) program (Contract No.FA8750-19-C-0515). Additional support was provided by UBC s Composites Research Network (CRN), Data Science Institute (DSI) and Support for Teams to Advance Interdisciplinary Research (STAIR) Grants. This research was enabled in part by technical support and computational resources provided by West Grid (https://www.westgrid.ca/) and Compute Canada (www.computecanada.ca). The authors would like to acknowledge the use of the University of Oxford Advanced Research Computing (ARC) facility in carrying out this work. http://dx.doi.org/10.5281/zenodo.22558 [1] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning, 2015. [2] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 2020. [3] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021. [4] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, and Mohammad Fleet, David J aand Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022. [5] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. ar Xiv preprint ar Xiv:2204.06125, 2022. [6] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. International Conference on Learning Representations, 2021. [7] Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. 
Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. bio Rxiv, 2022. [8] Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3d. International Conference on Machine Learning, 2022. [9] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 2022. [10] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, and Tim Saliman. Imagen video: High definition video generation with diffusion models. ar Xiv preprint ar Xiv:2210.02303, 2022. [11] Ilia Igashov, Hannes Stärk, Clément Vignac, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, and Bruno Correia. Equivariant 3d-conditional diffusion models for molecular linker design. ar Xiv preprint ar Xiv:2210.05274, 2022. [12] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 2021. [13] Katherine Crowson. Clip guided diffusion. Web Demo, https://huggingface.co/spaces/Eleuther AI/clip-guided-diffusion, 2021. [14] Hengtong Zhang and Tingyang Xu. Towards controllable diffusion models via reward-guided exploration. ar Xiv preprint ar Xiv:2304.07132, 2023. [15] Chin-Wei Huang, Jae Hyun Lim, and Aaron C Courville. A variational perspective on diffusionbased generative models and score matching. Advances in Neural Information Processing Systems, 2021. [16] Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems, 2022. [17] Joe Benton, Yuyang Shi, Valentin De Bortoli, George Deligiannidis, and Arnaud Doucet. From denoising diffusions to denoising Markov models. ar Xiv preprint ar Xiv:2211.03595, 2022. [18] Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 1982. [19] Ulrich G Haussmann and Etienne Pardoux. Time reversal of diffusions. The Annals of Probability, 1986. [20] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 2011. [21] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. Advances in Neural Information Processing Systems, 2021. [22] Patrick Cheridito, Damir Filipovi c, and Marc Yor. Equivalent and absolutely continuous measure changes for jump-diffusion processes. Annals of applied probability, 2005. [23] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017. [24] Vıctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E (n) equivariant graph neural networks. International Conference on Machine Learning, 2021. [25] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention, 2015. [26] Tomer Weiss, Luca Cosmo, Eduardo Mayo Yanes, Sabyasachi Chakraborty, Alex M Bronstein, and Renana Gershoni-Poranne. Guided diffusion for inverse molecular design. Chemr Xiv, 2023. [27] Bowen Jing, Gabriele Corso, Renato Berlinghieri, and Tommi Jaakkola. 
Subspace diffusion generative models. European Conference on Computer Vision, 2022. [28] Han Zhang, Ruili Feng, Zhantao Yang, Lianghua Huang, Yu Liu, Yifei Zhang, Yujun Shen, Deli Zhao, Jingren Zhou, and Fan Cheng. Dimensionality-varying diffusion process. ar Xiv preprint ar Xiv:2211.16032, 2022. [29] Andrew Campbell, Joe Benton, Valentin De Bortoli, Tom Rainforth, George Deligiannidis, and Arnaud Doucet. A continuous time framework for discrete denoising models. Advances in Neural Information Processing Systems, 2022. [30] Ulf Grenander and Michael I Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 1994. [31] David B Phillips and Adrian F M Smith. Bayesian model comparison via jump diffusions. Markov Chain Monte Carlo in Practice, 1995. [32] Michael I Miller, Ulf Grenander, Joseph A O Sullivan, and Donald L Snyder. Automatic target recognition organized via jump-diffusion algorithms. IEEE Transactions on Image Processing, 1997. [33] Peter J Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 1995. [34] Peter J Green. Trans-dimensional Markov chain Monte Carlo. Highly Structured Stochastic Systems, 2003. [35] Lars Ruddigkeit, Ruud Van Deursen, Lorenz C Blum, and Jean-Louis Reymond. Enumeration of 166 billion organic small molecules in the chemical universe database gdb-17. Journal of chemical information and modeling, 2012. [36] Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 2014. [37] Rdkit: Open-source cheminformatics; http://www.rdkit.org. Accessed 2023. [38] Eric V Anslyn and Dennis A Dougherty. Modern physical organic chemistry. University Science Books, 2005. [39] Stephen Tian, Chelsea Finn, and Jiajun Wu. A control-centric benchmark for video prediction. ar Xiv preprint ar Xiv:2304.13723, 2023. [40] Harini Kannan, Danijar Hafner, Chelsea Finn, and Dumitru Erhan. Robodesk: A multi-task reinforcement learning benchmark. https: // github. com/ google-research/ robodesk , 2021. [41] Michael Janner, Yilun Du, Joshua Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis. International Conference on Machine Learning, 2022. [42] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. Advances in Neural Information Processing Systems, 2022. [43] John L Kelley. General Topology. Courier Dover Publications, 2017. [44] Stewart N Ethier and Thomas G Kurtz. Markov processes: Characterization and convergence. John Wiley & Sons, 2009. [45] Giovanni Conforti and Christian Léonard. Time reversal of Markov processes with jumps under a finite entropy condition. Stochastic Processes and their Applications, 2022. [46] Clement Vignac and Pascal Frossard. Top-n: Equivariant set and graph generation without exchangeability. International Conference on Learning Representations, 2022. [47] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 2019. This appendix is organized as follows. In Section A, we present proofs for all of our propositions. Section A.1 presents a rigorous definition of our forward process using a more specific notation. 
This is then used in Section A.2.1 to prove the time reversal for our jump diffusions. We also present an intuitive proof of the time reversal using notation from the main text in Section A.2.2. In Section A.3 we prove Proposition 2 using the notation from the main text. We prove Proposition 3 in Section A.4 and we analyse the optimum of our objective directly without using stochastic process theory in Section A.5. In Section B we give more details on our objective and in Section C we detail how we apply diffusion guidance to our model. We give the full details for our experiments in Section D and finally, in Section E, we discuss the broader impacts of our work. A.1 Notation and Setup We here introduce a more rigorous notation for defining our trans-dimensional notation that will be used in a rigorous proof for the time-reversal of our jump diffusion. First, while it makes sense from a methodological and experimental point of view to present our setting as a transdimensional one, we slightly change the point of view in order to derive our theoretical results. We extend the space Rd to ˆRd = Rd { } using the one-point compactification of the space. We refer to [43] for details on this space. The point will be understood as a mask. For instance, let x1, x2, x3 Rd. Then X = (x1, x2, x3) (ˆRd)N with N = 3 corresponds to a vector for which all components are observed whereas X = (x1, , x3) (ˆRd)N corresponds to a vector for which only the components on the first and third dimension are observed. The second dimension is masked in that case. Doing so, we will consider diffusion models on the space X = (ˆRd)N with d, N N. In the case of a video diffusion model, N can be seen as the max number of frames. We will always consider that this space is equipped with its Borelian sigma-field X and all probability measures will be defined on X. We denote dim : X {0, 1}N which is given for any X = {xi}N i=1 X by dim(X) = {δRd(xi)}N i=1. In other words, dim(X) is a binary vector identifying the dimension of the vector X, i.e. which frames are observed. Going back to our example X = (x1, x2, x3) (ˆRd)N and X = (x1, , x3) (ˆRd)N, we have that dim(X) = {1, 1, 1} and dim(X ) = {1, 0, 1}. For any vector u {0, 1}N we denote |u| = PN i=1 ui, i.e. the active dimensions of u (or equivalently the nonmasked frames). For any X X and D {0, 1}N, we denote XD = {X i}N i=1 with X i = Xi if Di = 1 and X i = if Di = 0. We denote Ck b(Rd, R) the set of functions which are k differentiable and bounded. Similarly, we denote Ck b(Rd, R) the set of functions which are k differentiable and compactly supported. The set Ck 0(Rd, R) denotes the functions which are k differentiable and vanish when x + . We note that f C(ˆRd), if f C(Rd) and f f( ) C0(Rd) and that f Ck(ˆRd) for any k N if the restriction of f to Rd is in Ck(Rd) and f C(ˆRd). A.1.1 Transdimensional infinitesimal generator To introduce rigorously the transdimensional diffusion model defined in Section 3.1, we will introduce its infinitesimal generator. The infinitesimal generator of a stochastic process can be roughly defined as its probabilistic derivative . More precisely, assume that a stochastic process (Xt)t 0 admits a transition semigroup (Pt)t 0, i.e. for any t 0, A X and X X we have P(Xt A | X0 = x) = Pt(x, A), then the infinitesimal generator is defined as A(f) = limt 0(Pt(f) f)/t, for every f for which this quantity is well-defined. Here, we start by introducing the infinitesimal generator of interest and give some intuition about its form. 
Then, we prove a time-reversal formula for this infinitesimal generator. We consider b : Rd Rd, α : {0, 1}NM R+. For any f C2(X) and X X we define A(f)(X) = PN i=1{ b(Xi), xif(X) + 1 2 xif(X)}δRd(Xi) (5) DM D M 1 M 1 α(D0, . . . , DM) PM 1 i=0 (f(X) f(XDi+1))δDi(dim(X)), where M N, D0 = {1}N, { j}M 1 j=0 NM such that PM 1 j=0 j < N and for any j {0, . . . , M 1}, D j j is the subset of {0, 1}{1,...,N} such that Dj+1 D j j if and only if Dj Dj+1 = Dj+1, where is the pointwise multiplication operator, and |Dj| = |Dj+1| + j. The condition Dj Dj+1 = Dj+1 means that the non-masked dimensions in Dj+1 are also non-masked dimensions in Dj. The condition |Dj| = |Dj+1| + j means that in order to go from Dj to Dj+1, one needs to mask exactly j dimensions. Therefore, a sequence { j}M 1 j=0 NM such that PM 1 j=0 j < N can be interpreted as a sequence of drops in dimension. At the core level, we have that |DM| = N PM 1 j=0 j. For instance if |DM| = 1, we have that at the end of the process, only one dimension is considered. We choose α such that P DM D M 1 M 1 α(D0, . . . , DM) = 1. Therefore, α(D0, . . . , DM) corresponds to the probability to choose the dimension path D0 DM. The part X 7 b(Xi), xif(X) + 1 2 xif(X) is more classical and corresponds to the continuous part of the diffusion process. We refer to [44] for a thorough introduction on infinitesimal generators. For simplicity, we omit the schedule coefficients in (5). A.1.2 Justification of the form of the infinitesimal generator For any dimension path P = D0 DM (recall that D0 = {1}N), we define the jump kernel JP as follows. For any x X, we have JP(X, d Y ) = PM 1 i=0 δDi(dim(X))δXDi+1(d Y ). This operator corresponds to the deletion operator introduced in Section 3.1 . Hence, for any dimension path P = D0 DM, we can define the associated infinitesimal generator: for any f C2(X) and X X we define AP(f)(X) = PN i=1{ b(xi), xif(X) + 1 2 xif(X)}δRd(Xi) + R X(f(Y ) f(X))JP(X, d Y ). We can define the following jump kernel DM D M 1 M 1 α(D0, . . . , DM)JP. This corresponds to averaging the jump kernel over the different possible dimension paths. We have that for any f C2(X) and X X A(f)(X) = PN i=1{ b(xi), xif(X) + 1 2 xif(X)}δRd(Xi) + R X(f(Y ) f(X))J(X, d Y ). (6) In other words, A = P DM D M 1 M 1 α(D0, . . . , DM)AP. In what follows, we assume that there exists a Markov process (Xt)t 0 with infinitesimal generator A. In order to sample from (Xt)t 0, one choice is to first sample the dimension path P according to the probability α. Second sample from the Markov process associated with the infinitesimal generator AP. We can approximately sample from this process using the Lie-Trotter-Kato formula [44, Corollary 6.7, p.33]. Denote (Pt)t 0 the semigroup associated with AP, (Qt)t 0 the semigroup associated with the continuous part of AP and (Jt)t 0 the semigroup associated with the jump part of AP. More precisely, we have that, (Qt)t 0 is associated with Acont such that for any f C2(X) and X X Acont(f)(X) = PN i=1{ b(Xi), xif(X) + 1 In addition, we have that, (Qt)t 0 is associated with AP jump such that for any f C2(X) and X X AP jump(f)(X) = R X(f(Y ) f(X))JP(X, d Y ). First, note that Acont corresponds to the infinitesimal generator of a classical diffusion on the components which are not set to . Hence, we can approximately sample from (Qt)t 0 by sampling according to the Euler-Maruyama discretization of the associated diffusion, i.e. by setting Xt X0 + tb(X0) + where Z is a Gaussian random variable. 
Similarly, in order to sample from (Jt)t 0, one should sample from the jump process defined as follows. On the interval [0, τ), we have Xt = X0. At time τ, we define X1 J(X0, ) and repeat the procedure. In this case τ is defined as an exponential random variable with parameter 1. For t > 0 small enough the probability that t > τ is of order t. Therefore, we sample from J, i.e. the deletion kernel, with probability t. Combining this approximation and (7), we get approximate samplers for (Qt)t 0 and (Jt)t 0. Under mild assumptions, the Lie-Trotter-Kato formula ensures that for any t 0 Pt = lim n + (Qt/n Jt/n)n. This justifies sampling according to Algorithm 1 (in the case of the forward process). A.2 Proof of Proposition 1 For the proof of Proposition 1, we first provide a rigorous proof using the notation introduced in A.1. We then follow this with a second proof that aims to be more intuitive using the notation used in the main paper. A.2.1 Time-reversal for the transdimensional infinitesimal generator and Proof of Proposition 1 We are now going to derive the formula for the time-reversal of the transdimensional infinitesimal generator A, see (5). This corresponds to a rigorous proof of Proposition 1. We refer to Section A.2.2 for a more intuitive, albeit less-rigorous, proof. We start by introducing the kernel KP given for any dimension path D0 DM, for any i {0, . . . , M 1}, Y Di+1 and A X by KP(Y, A) = PM 1 i=0 δDi+1(dim(Y )) R pt((XDi\Di+1,YDi+1)|dim(Xt)=Di)P(dim(Xt)=Di) pt(YDi+1|dim(Xt)=Di+1)P(dim(Xt)=Di+1) d XDi\Di+1. Note that this kernel is the same as the one considered in Proposition 1. It is well-defined under the following assumption. Assumption 1. For any t > 0 and D {0, 1}N, we have that Xt conditioned to dim(Xt) = D admits a density w.r.t. the |D|d-dimensional Lebesgue measure, denoted pt( |dim(Xt) = D). The following result will be key to establish the time-reversal formula. Lemma 1. Assume A1. Let A, B X. Let P be a dimension path D0 DM with M N. Then, we have E[1A(Xt)JP(Xt, B)] = E[1B(Xt)KP(Xt, A)]. Proof. Let A, B X. We have E[1A(Xt)JP(Xt, B)] = PM 1 i=0 E[1A(Xt)δDi(dim(Xt))1B((Xt)Di+1)] = PM 1 i=0 R A Di pt(XDi|dim(Xt) = Di)P(dim(Xt) = Di)1B(XDi+1)d XDi = PM 1 i=0 R A Di pt(XDi|dim(Xt) = Di)P(dim(Xt) = Di)1B(XDi+1)d XDi+1d XDi\Di+1 = PM 1 i=0 R B Di+1 1B(XDi+1) A Di 1A(XDi)pt(XDi|dim(Xt) = Di)P(dim(Xt) = Di)d XDi\Di+1)d XDi+1 = PM 1 i=0 R B Di+1 1B(XDi+1) KP(XDi+1, A)pt(XDi+1|dim(Xt) = Di+1)P(dim(Xt) = Di+1)d XDi+1 = PM 1 i=0 E[δDi+1(dim(Xt))KP(Xt, A)1B(Xt)], which concludes the proof. Lemma 1 shows that KP verifies the flux equation associated with JP. The flux equation is the discrete state-space equivalent of the classical time-reversal formula for continuous state-space. We refer to [45] for a rigorous treatment of time-reversal with jumps under entropic conditions. We are also going to consider the following assumption which ensures that the integration by part formula is valid. Assumption 2. For any t > 0 and i {1, . . . , N}, Xt admits a smooth density w.r.t. the Nddimensional Lebesgue measure denoted pt and we have that for any f, h C2 b((Rd)N) for any u [0, t] and i {1, . . . , N} E[δRd((Xu)i) xif(Xu), xih(Xu) ] = E[δRd((Xu)i)h(Xu)( xif(Xu) + xi log pu(Xu), xif(Xu) )]. The second assumption ensures that we can apply the backward Kolmogorov evolution equation. Assumption 3. For any g C2(X) and t > 0, we have that for any u [0, t] and X X, ug(u, X) + A(g)(u, X) = 0, where for any u [0, t] and X X, g(u, X) = E[g(Xt) | Xu = X]. 
We refer to [19] for conditions under A2 and A3 are valid in the setting of diffusion processes. Proposition 4. Assume A1, A2 and A3. Assume that there exists a Markov process (Xt)t 0 solution of the martingale problem associated with (6). Let T > 0 and consider (Yt)t [0,T ] = (XT t)t [0,T ]. Then (Yt)t [0,T ] is solution to the martingale problem associated with R, where for any f C2(X), t (0, T) and x X we have R(f)(t, X) = PN i=1{ b(Xi) + xi log pt(X), xif(X) + 1 2 xif(X)}δRd(Xi) X(f(Y ) f(X))K(X, d Y ). Proof. Let f, g C2(X). In what follows, we show that for any s, t [0, T] with t s E[(f(Yt) f(Ys))g(Ys)] = E[g(Ys) R t s R(f)(u, Yu)du]. More precisely, we show that for any s, t [0, T] with t s E[(f(Xt) f(Xs))g(Xt)] = E[ g(Xt) R t s R(f)(u, Xu)du]. Let s, t [0, T], with t s. Next, we denote for any u [0, t] and X X, g(u, X) = E[g(Xt) | Xu = X]. Using A3, we have that for any u [0, t] and X X, ug(u, X) + A(g)(u, X) = 0, i.e. g satisfies the backward Kolmogorov equation. For any u [0, t] and X X, we have A(fg)(u, X) = ug(u, X)f(X) + PN i=1( b(Xi), xig(u, X) + 1 2 xig(u, Xi))f(X)δRd(Xi) + PN i=1( b(Xi), xif(X) + 1 2 xif(X))g(u, X)δRd(Xi) + PN i=1 δRd(Xi) xif(X), xig(u, X) + J(X, fg) = ug(u, X)f(X) + A(g)(u, X)f(X) + J(X, fg) J(X, g)f(X) + PN i=1( b(Xi), xif(X) + 1 2 xif(X))g(u, X)δRd(Xi) + PN i=1 δRd(Xi) xif(X), xig(u, X) = PN i=1( b(Xi), xif(X) + 1 2 xif(X))g(u, X)δRd(Xi) + PN i=1 δRd(Xi) xif(X), xig(u, X) + J(X, fg) J(X, g)f(X). (8) Using A2, we have that for any u [0, t] and i {1, . . . , N} E[δRd((Xu)i) xif(Xu), xig(u, Xu) ] = E[δRd((Xu)i)g(u, Xu)( xif(Xu) + xi log pu(Xu), xif(Xu) )]. (9) In addition, we have that for any X X and u [0, t], J(X, fg) J(X, g)f(X) = R X g(u, Y )(f(Y ) f(X))J(X, d Y ). Using Lemma 1, we get E[J(Xu, fg) J(Xu, f)g(u, Xu)] = E[g(u, Xu)K(Xu, f)]. (10) Therefore, using (8), (9) and (10), we have E[A(fg)(u, Xu)] = E[ R(f)(u, Xu)g(u, Xu)]. Finally, we have E[(f(Xt) f(Xs))g(Xt)] = E[g(t, Xt)f(Xt) f(Xs)g(s, Xs)] = E[ R t s A(fg)(u, Xu)du] = E[ R t s R(f)(u, Xu)g(u, Xu)du] = E[g(Xt) R t s R(f)(u, Xu)du], which concludes the proof. A.2.2 Intuitive Proof of Proposition 1 We recall Proposition 1. Proposition 1. The time reversal of a forward jump diffusion process given by drift b t, diffusion coefficient g t, rate λ t(n) and transition kernel Pn i=1 Kdel(i|n)δdel(X,i)(Y) is given by a jump diffusion process with drift b t (X), diffusion coefficient g t , rate λ t (X) and transition kernel R yadd Pn+1 i=1 A t (yadd, i|X)δins(X,yadd,i)(Y)dyadd as defined below b t (X) = b t(X) g 2 t x log pt(X), g t = g t, λ t (X) = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R yadd pt(ins(X, yadd, i))dyadd A t (yadd, i|X) pt(ins(X, yadd, i))Kdel(i|n + 1). Diffusion part. Using standard diffusion models arguments such as [18] [45], we get b t (X) = b t(X) g 2 t x log pt(X|n). Jump part. We use the flux equation from [45] which intuitively relates the probability flow going in the forward direction with the probability flow going the backward direction with equality being achieved at the time reversal. pt(X) λ t (X) K t (Y|X) = pt(Y) λ t(Y) K t(X|Y) pt(X) λ t (X) R yadd Pn+1 i=1 A t (yadd, i|X)δins(X,yadd,i)(Y)dyadd = pt(Y) λ t(Y) Pn+1 i=1 Kdel(i|n + 1)δdel(Y,i)(X). (11) To find λ t (X), we sum and integrate both sides over m and y, with Y = (m, y), PN m=1 R y Rmd pt(X) λ t (X) R yadd Pn+1 i=1 A t (yadd, i|X)δins(X,yadd,i)(Y)dyadddy y Rmd pt(Y) λ t(Y) Pn+1 i=1 Kdel(i|n + 1)δdel(Y,i)(X)dy. 
Now we use the fact that δdel(Y,i)(X) is 0 for any m = n + 1, pt(X) λ t (X) = λ t(n + 1) R y R(n+1)d pt(Y) Pn+1 i=1 Kdel(i|n + 1)δdel(Y,i)(X)dy = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R y R(n+1)d pt(Y)δdel(Y,i)(X)dy. Now letting Y = ins(X, yadd, i), pt(X) λ t (X) = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R yadd pt(ins(X, yadd, i))dyadd λ t (X)x = λ t(n + 1) Pn+1 i=1 Kdel(i|n+1) R yadd pt(ins(X,yadd,i))dyadd To find A t (yadd, i|X), we start from (11) and set Y = ins(X, zadd, j) to get pt(X) λ t (X)A t (zadd, j|X) = pt(Y) λ t(n + 1)Kdel(j|n + 1). By inspection, we see immediately that A t (zadd, j|X) pt(ins(X, zadd, j)Kdel(j|n + 1). With a re-labeling of zadd and j we achieve the desired form A t (yadd, i|X) pt(ins(X, yadd, i))Kdel(i|n + 1). A.3 Proof of Proposition 2 In this section we prove Proposition 2 using the notation from the main paper by following the framework of [17]. We operate on a state space X = SN n=1{n} Rnd. On this space the gradient operator : C(X, R) C(X, X) is defined as f(X) = (nd) x f(X) where (nd) x is the standard gradient operator defined as C(Rnd, R) C(Rnd, Rnd) with respect to x Rnd. We will write integration with respect to a probability measure defined on X as an explicit sum over the number of components and integral over Rnd with respect to a probability density defined on Rnd i.e. R X f(X)µ(d X) = PN n=1 R x Rnd f(X)p(n)p(x|n)dx where, for A Rnd, R (n,A) µ(d X) = R x A p(n)p(x|n)dx. We will write p(X) as shorthand for p(n)p(x|n). Following, [17], we start by augmenting our space with a time variable so that operators become time inhomegeneous on the extended space. We write this as X = (X, t) where X lives in the extended space S = X R 0. In the proof, we use the infinitesimal generators for the the forward and backward processes. An infinitesimal generator is defined as A(f)( X) = lim t 0 Ept|0( Y| X)[f( Y)] f( X) t and can be understood as a probabilistic version of a derivative. For our process on the augmented space S, our generators decompose as A = t+ ˆ At where ˆ At operates only on the spatial components of X i.e. X [17]. We now define the spatial infinitesimal generators for our forward and backward process. We will change our treatment of the time variable compared to the main text. Both our forward and backward processes will run from t = 0 to t = T, with the true time reversal of X following the forward process satisfying (Yt)t [0,T ] = (XT t)t [0,T ]. Further, we will write g t as gt and g t = g T t as we do not learn g and this is the optimal relation from the time reversal. We define ˆLt(f)(X) = b t(X) f(X)+ 1 2g2 t f(X)+ λ t(X) PN m=1 R y Rmd f(Y)( K t(Y|X) δX(Y))dy, as well as ˆKt(f)(X) = b θ t (X) f(X)+ 1 2g2 T t f(X)+ λ θ t (X) PN m=1 R y Rmd f(Y)( K θ t (Y|X) δX(Y))dy where = ( ) is the Laplace operator and δ is a dirac delta on X i.e. PN m=1 R y Rmd δX(Y)dy = 1 and PN m=1 R y Rmd f(Y)δX(Y)dy = f(X). Verifying Assumption 1. The first step in the proof is to verify Assumption 1 in [17]. Letting νt(X) = p T t(X), we assume we can write tpt(X) = ˆK t pt(X) in the form Mν + cν = 0 for some function c : S R, where M is the generator of another auxiliary process on S and ˆK t is the adjoint operator which satisfies ˆK t f, h = f, ˆKth i.e. PN n=1 R x Rnd h(X) ˆK t (f)(X)dx = PN n=1 R x Rnd f(X) ˆKt(h)(X)dx We now find ˆK t . 
We start by substituting in the form for ˆKt, PN n=1 R x Rnd f(X) ˆKt(h)(X)dx = PN n=1 R x Rnd f(X){( b θ t (x) h)(X) + 1 2g2 T t h(X)+ λ θ t (X) PN m=1 R y Rmd h(Y)( K θ t (Y|X) δX(Y))dy}dx We first focus on the RHS terms corresponding to the diffusion part of the process PN n=1 R x Rnd f(X){( b θ t h)(X) + 1 2g2 T t h(X)}dx x Rnd f(X)( b θ t h)(X) + 1 2g2 T tf(X) h(X)dx x Rnd f(X)( b θ t h)(X) + 1 2g2 T th(X) f(X)dx x Rnd h(X) (f b θ t )(X) + 1 2g2 T th(X) f(X)dx x Rnd h(X){ (f b θ t)(X) + 1 2g2 T t f(X)}dx x Rnd h(X){ f(X) b θ t(X) f(X) b θ t (X) + 1 2g2 T t f(X)}dx. where we apply integration by parts twice to arrive at the third line and once to arrive at the fourth line. We now focus on the RHS term corresponding to the jump part of the process x Rnd f(X){ λ θ t(X) PN m=1 R y Rmd h(Y)( K θ t(Y|X) δX(Y))dy}dx y Rmd h(Y){PN n=1 R x Rnd f(X) λ θ t(X)( K θ t (Y|X) δX(Y))dx}dy x Rnd h(X){PN m=1 R y Rmd f(Y) λ θ t(Y)( K θ t(X|Y) δY(X))dy}dx, where on the last line we have relabelled X to Y and Y to X. Putting both re-arranged forms for the RHS together, we obtain PN n=1 R x Rnd h(X) ˆK t (f)(X)dx = PN n=1 R x Rnd h(X){ f(X) b θ t (X) f(X) b θ t(X) + 1 2g2 T t f(X)+ PN m=1 R y Rmd f(Y) λ θ t(Y)( K θ t(X|Y) δY(X))dy}dx. We therefore have ˆK t (f)(X) = f(X) b θ t(X) f(X) b θ t (X) + 1 2g2 T t f(X)+ PN m=1 R y Rmd f(Y) λ θ t(Y)( K θ t(X|Y) δX(Y))dy. Now we re-write tpt(X) = ˆK t pt(x) in the form Mν + cν = 0. We start by re-arranging tpt(X) = ˆK t pt(X) = 0 = tνt(X) + ˆK T tνt(X). Substituting in our form for ˆK t we obtain 0 = tνt(X) νt(X) b θ T t(X) b θ T t(X) νt(X) + 1 2g2 t νt(X) y Rmd νt(Y) λ θ T t(Y)( K θ T t(X|Y) δX(Y))dy (12) We define our auxiliary process to have generator M = t + ˆ Mt with ˆ Mt(f)(X) = b M t (X) f(X)+ 1 2g2 t f(X)+λM t (X) PN m=1 R y Rmd f(Y)(KM t (Y|X) δX(Y))dy which is a jump diffusion process with drift b M t , diffusion coefficient gt, rate λM t and transition kernel KM t . Then if we have b M t = b θ T t, λM t (X) = PN m=1 R y Rmd λ θ T t(Y) K θ T t(X|Y)dy, (13) and KM t (Y|X) λ θ T t(Y) K θ T t(X|Y). (14) Then we have (12) can be rewritten as 0 = tνt(X) νt(X) b θ T t(X) + b M t (X) νt(X) + 1 2g2 t νt(X) νt(X) λ θ T t(X) + νt(X) PN m=1 R y Rmd λ θ T t(Y) K θ T t(X|Y)dy+ λM t (x) PN m=1 R y Rmd νt(Y)(KM t (Y|X) δX(Y))dy which is in the form M(ν)( X) + c( X)ν( X) = 0 if we let c( X) = b θ T t(X) λ θ T t(X) + PN m=1 R y Rmd λ θ T t(Y) K θ T t(X|Y)dy. Verifying Assumption 2. Now that we have verified Assumption 1, the second step in the proof is Assumption 2 from [17]. We assume there is a bounded measurable function α : S (0, ) such that αMf = L(fα) f Lα for all functions f : X R such that f D(M) and fα D(L). Substituting in M and L we get αt(X)[ tf(X) b θ T t(X) f(X) 2g2 t f(X) + λM t (X) PN m=1 R y Rmd f(Y)(KM t (Y|X) δX(Y))dy] = t(fαt)(X) + b t(X) (fαt)(X) + 1 2g2 t (fαt)(X) + λ t(X) PN m=1 R y Rmd f(Y)αt(Y)( K t(Y|X) δX(Y))dy f(X)[ tαt(X) + b t(X) αt(X) + 1 2g2 t αt(X) + λ t(X) PN m=1 R y Rmd αt(Y)( K t(Y|X) δX(Y))dy] (15) Since f does not depend on time, tf(X) = 0 and t(fαt)(X) = f(X) tαt(X) thus the t terms on the RHS also cancel out. Comparing terms on the LHS and RHS relating to the diffusion part of the process we obtain αt(X)( b θ T t(X) f(X)) + 1 2αt(X)g2 t f(X) = b t(X) (fαt)(X) + 1 2g2 t (fαt)(X) f(X) b t(X) αt(X) 1 2f(X)g2 t αt(X). Therefore, we get αt(X)( b θ T t(X) f(X)) + 1 2αt(X)g2 t f(X) = b t(X) (f(X) αt(X) + αt(X) f(X)) 2g2 t 2 f(X) αt(X) + f(X) αt(X) + αt(X) f(X) f(X) b t((X) αt(X) 1 2f(X)g2 t αt(X). 
Simplifying the above expression, we get αt(X)( b θ T t(X) f(X)) = αt(X) b t(X) f(X) + g2 t f(X) αt(X) ( αt(X) b θ T t(X)) f(X) = (αt(X) b t(X) + g2 t αt(X)) f(X). This is true for any f implying αt(X) b θ T t(X) = αt(X) b t(X) + g2 t αt(X). This implies that αt(X) satisfies log αt(X) = 1 g2 t ( b t(X) + b θ T t(X)) (16) Comparing terms from the LHS and RHS of (15) relating to the jump part of the process we obtain αt(X)λM t (X) PN m=1 R y Rmd f(Y)(KM t (Y|X) δX(Y))dy = λ t(X) PN m=1 R y Rmd f(Y)αt(Y)( K t(Y|X) δX(Y))dy f(X) λ t(X) PN m=1 R y Rmd αt(Y)( K t(Y|X) δX(Y))dy. Hence, we have αt(X) PN m=1 R y Rmd f(Y)λM t (X)KM t (Y|X)dy αt(X)λM t (X)f(X) = λ t(X) PN m=1 R y Rmd f(Y)αt(Y) K t(Y|X)dy f(X) λ t(X) PN m=1 R y Rmd αt(Y) K t(Y|X)dy. Recalling the definitions of λM t (X) and KM(Y|X), (13) and (14), we get αt(X) PN m=1 R y Rmd f(Y) λ θ T t(Y) K θ T t(X|Y)dy αt(X)f(X) PN m=1 R y Rmd λ θ T t(Y) K θ T t(X|Y)dy = λ t(X) PN m=1 R y Rmd f(Y)αt(Y) K t(Y|X)dy f(X) λ t(X) PN m=1 R y Rmd αt(Y) K t(Y|X)dy. This equality is satisfied if αt(X) follows the following relation αt(Y) = αt(X) λ θ T t(Y) K θ T t(X|Y) λ t(X) K t(Y|X) for n = m (17) We only require this relation to be satisfied for n = m because both K t(Y|X) and K θ T t(X|Y) are 0 for n = m. We note at this point, as in [17], that if we have αt(X) = 1/pt(X) and λ θ T t and K θ T t(X|Y) equal to the true time-reversals, then both (16), and (17) are satisfied. However, αt(X) = 1/pt(X) is not the only αt to satisfy these equations. (16) and (17) can be thought of as enforcing a certain parameterization of the generative process in terms of αt [17]. Concluding the proof. Now for the final part of the proof, we substitute our value for α into the IISM loss from [17] which is equal to the negative of the evidence lower bound on Epdata(X0)[log pθ 0(X0)] up to a constant independent of θ. Defining βt(Xt) = 1/αt(Xt), we have IISM(β) = R T 0 Ept(Xt)[ ˆ L t βt(Xt) βt(Xt) + ˆLt log βt(Xt)]dt. We split the spatial infintesimal generator of the forward process into the generator corresponding to the diffusion and the generator corresponding to the jump part, ˆL = ˆLdiff t + ˆLJ t with ˆLdiff t (f)(X) = b t(X) f(X) + 1 2g2 t f(X). and ˆLJ t(f)(X) = λ t(X) PN m=1 R y Rmd f(Y)( K t(Y|X) δX(Y))dy. By comparison with the approach to find the adjoint ˆK t , we also have ˆL t = ˆLdiff t + ˆLJ t with ˆLdiff t (f)(X) = f(X) b t(X) f(X) b t(X) + 1 2g2 t f(X). In addition, we get ˆLJ t (f)(X) = PN m=1 R y Rmd f(Y) λ t(Y)( K t(X|Y) δX(Y))dy. Finally, IISM becomes IISM(β) = R T 0 Ept(Xt)[ ˆ Ldiff t βt(Xt) βt(Xt) + ˆLdiff t log βt(Xt)]dt + R T 0 Ept(Xt)[ ˆ LJ t βt(Xt) βt(Xt) + ˆLJ log βt(Xt)]dt = Idiff ISM(β) + IJ ISM(β), where we have named the two terms corresponding to the diffusion and jump part of the process as Idiff ISM, IJ ISM respectively. For the diffusion part of the loss, we use the denoising form of the objective proven in Appendix E of [17] which is equivalent to Idiff ISM up to a constant independent of θ Idiff ISM(β) = R T 0 Ep0,t(X0,Xt)[ ˆ Ldiff t (pt|0( |X0)αt( ))(Xt) pt|0(Xt|X0)αt(Xt) ˆLdiff t log(pt|0( |X0)αt( ))(Xt)]dt + const. To simplify this expression, we first re-arrange ˆLdiff t (h) for some general function h : S R. ˆLdiff t (h) h ˆLdiff t (log h) = b t h h b t log h 1 2g2 t log h 2g2 t log h 2. 
Setting h = pt|0( |X0)αt( ), our diffusion part of the loss becomes Idiff ISM(β) = 1 2 R T 0 g2 t Ep0,t(X0,Xt)[ log pt|0(Xt|X0) + log αt(Xt) 2]dt + const We then directly parameterize log αt(Xt) as sθ t(Xt) Idiff ISM(β) = 1 2 R T 0 g2 t Ep0,t(X0,Xt)[ log pt|0(Xt|X0) sθ t(Xt) 2]dt + const. We now focus on the expectation within the integral to re-write it in an easy to calculate form Ep0,t(X0,Xt)[ log pt|0(Xt|X0) sθ t(Xt) 2] = Ep0,t(X0,Xt)[ sθ t(Xt) 2 2sθ t(Xt)T log p0,t(X0, Xt)] + const Now we note that we can re-write log p0,t(X0, Xt) using Mt where Mt is a mask variable Mt {0, 1}n0 that is 0 for components of X0 that have been deleted to get to Xt and 1 for components that remain in Xt. log p0,t(X0, Xt) = 1 p0,t(X0,Xt) p0,t(X0, Xt) = 1 p0,t(X0,Xt) P Mt p0,t(X0, Xt, Mt) Mt 1 p0,t(X0,Xt) p0,t(X0, Xt, Mt) Mt p(nt,Mt,X0) p0,t(X0,Xt) pt|0(xt|nt, X0, Mt) Mt p(Mt|X0,Xt) p(xt|nt,X0,Mt) pt|0(xt|nt, X0, Mt) = Ep(Mt|X0,Xt)[ log pt|0(xt|nt, X0, Mt)] Substituting this back in we get Ep0,t(X0,Xt)[ log pt|0(Xt|X0) sθ t(Xt) 2] = Ep0,t(X0,Xt)[ sθ t(Xt) 2 2sθ t(Xt)T Ep(Mt|X0,Xt)[ log pt|0(xt|nt, X0, Mt)]] + const = Ep0,t(X0,Xt,Mt)[ log pt|0(xt|nt, X0, Mt) sθ t(Xt) 2] + const. Therefore, the diffusion part of IISM can be written as Idiff ISM(β) = T 2 EU(t;0,T )p0,t(X0,Xt,Mt)[g2 t log pt|0(xt|nt, X0, Mt) sθ t(Xt) 2] + const. We now focus on the jump part of the loss IJ ISM. We first substitute in ˆLJ t and ˆLJ t IJ ISM = R T 0 Ept(Xt)[ P y Rmd λ t(Y) βt(Y) βt(Xt)( K t(Xt|Y) δY(Xt))dy+ λ t(Xt) PN m=1 R y Rmd K t(Y|Xt) log βt(Y)dy λ t(Xt) log βt(Xt)]dt. (18) Noting that βt(Xt) = 1/αt(Xt), we get λ θ T t(Y) K θ T t(Xt|Y) λ t(Xt) K t(Y|Xt) for nt = m (19) or swapping labels for Xt and Y, βt(Y) βt(Xt) = λ θ T t(Xt) K θ T t(Y|Xt) λ t(Y) K t(Xt|Y) for nt = m (20) Substituting (19) into the second line and (20) into the first line of (18) and using the fact that K t(Xt|Y) = 0 for nt = m, we obtain IJ ISM = R T 0 Ept(Xt)[ PN m=1\nt R y Rmd λ t(Y) λ θ T t(Xt) K θ T t(Y|Xt) λ t(Y) K t(Xt|Y) K t(Xt|Y)dy y Rmd λ t(Y) βt(Y) βt(Xt)δY(Xt)dy + λ t(Xt) PN m=1\nt R y Rmd K t(Y|Xt){log βt(Xt) log λ θ T t(Y) log K θ T t(Xt|Y) + log λ t(Xt) + log K t(Y|Xt)}dy λ t(Xt) log βt(Xt)]dt. Hence, we have IJ ISM = R T 0 Ept(Xt)[ λ θ T t(Xt) PN m=1\nt R y Rmd K θ T t(Y|Xt)dy λ t(Xt) βt(Xt) βt(Xt)+ λ t(Xt) PN m=1\nt R y Rmd K t(Y|Xt){ log λ θ T t(Y) log K θ T t(Xt|Y)}dy]dt + const. This can be rewritten as IJ ISM = R T 0 Ept(Xt)[ λ θ T t(Xt) + λ t(Xt)E K t(Y|Xt)[ log λ θ T t(Y) log K θ T t(Xt|Y)]]dt + const. Therefore, we have IJ ISM = TEU(t;0,T )pt(Xt) K t(Y|Xt)[ λ θ T t(Xt) λ t(Xt) log λ θ T t(Y) λ t(Xt) log K θ T t(Xt|Y)] + const. Finally, using the definition of the forward and backward kernels, i.e. K t(Y|Xt) = Pn i=1 Kdel(i|n)δdel(X,i)(Y) and K θ T t(Xt|Y) = R xadd Pn i=1 Aθ t (xadd, i|Y)δins(Y,xadd,i)(Xt)dxadd, we get IJ ISM = TEU(t;0,T )pt(Xt)[ PN m=1 R y Rmd Pnt i=1 Kdel(i|nt)δdel(Xt,i)(Y)( λ θ T t(Xt) λ t(Xt) log λ θ T t(Y) λ t(Xt) log K θ T t(Xt|Y))dy] + const We get IJ ISM = TEU(t;0,T )pt(Xt)Kdel(i|nt)δdel(Xt,i)(Y) [ λ θ T t(Xt) λ t(Xt) log λ θ T t(Y) λ t(Xt) log K θ T t(Xt|Y)] + const. Therefore, we have IJ ISM = TEU(t;0,T )pt(Xt)Kdel(i|nt)δdel(Xt,i)(Y) [ λ θ T t(Xt) λ t(Xt) log λ θ T t(Y) λ t(Xt) log Aθ T t(xadd, i|Y)] + const. Putting are expressions for Idiff ISM and IJ ISM together we obtain IISM = T 2 E[g2 t log pt|0(xt|nt, X0, Mt) sθ t (Xt) 2]+ TE[ λ θ T t(Xt) λ t(Xt) log λ θ T t(Y) λ t(Xt) log Aθ T t(xadd, i|Y)] + const. 
We get that IISM gives us our evidence lower bound on Epdata(X0)[log pθ 0(X0)] up to a constant that does not depend on θ. In the main text we have used a time notation such that the backward process runs backwards from t = T to t = 0. To align with the notation of time used in the main text we change T t to t on subscripts for λ θ T t and Aθ T t. We also will use the fact that λ t(Xt) depends only on the number of components in Xt, λ t(Xt) = λ t(nt). L(θ) = T 2 E[g2 t sθ t(Xt) xt log pt|0(xt|nt, X0, Mt) 2]+ TE[ λ θ t(Xt) + λ t(nt) log λ θ t(Y) + λ t(nt) log Aθ t (xadd, i|Y)] + const. Tightness of the lower bound Now that we have derived the ELBO as in Proposition 2, we show that the maximizers of the ELBO are tight, i.e. that they close the variational gap. We do this by proving the general ELBO presented in [17] has this property and therefore ours, which is a special case of this general ELBO, also has that the optimum parameters close the variational gap. To state our proposition, we recall the setting of [17]. The forward noising process is denoted (Yt)t 0 and associated with an infinitesimal generator ˆL its extension (t, Yt)t 0 is associated with the infinitesimal generator L, i.e. L = t + ˆL. We also define the score-matching operator Φ given for any f for which it is defined by Φ(f) = L(f)/f L(log(f)). We recall that according to [17, Equation (8)] and under [17, Assumption 1, Assumption2], we have log p T (Y0) E[log p0(YT ) R T 0 L(v/β)/(v/β) + L(log β)dt], with vt = p T t for any t [0, T]. We define the variational gap Gap as follows Gap = E[log p T (Y0) log p0(YT ) + R T 0 L(v/β)/(v/β) + L(log β)dt]. In addition, using Itô Formula, we have that log v T (YT ) log v0(Y0) = R T 0 L(v)dt. Assuming that E[| log v T (YT ) log v0(Y0)|] < + , we get Gap = E[ R T 0 L(log v) + L(v/β)/(v/β) + L(log β)dt] = E[ R T 0 Φ(v/β)dt]. In particular, using [17, Proposition 1], we get that Gap 0 and Gap = 0 if and only if β v. In addition, the ELBO is maximized if and only if β v, see [17, Equation 10] and the remark that follows. Therefore, we have that: if we maximize the ELBO then the ELBO is tight. Combining this with the fact that the ELBO is maximized at the time reversal [17], then we have that when our jump diffusion parameters match the time reversal, our variational gap is 0. Other approaches. Another way to derive the ELBO is to follow the steps of [15] directly, since [17] is a general framework extending this approach. The key formulae to derive the result and the ELBO is 1) a Feynman-Kac formula 2) a Girsanov formula. In the case of jump diffusions (with jump in Rd) a Girsanov formula has been established by [22]. Extending this result to one-point compactification space would allow us to prove directly Proposition 2 without having to rely on the general framework of [17]. A.4 Proof of Proposition 3 We start by recalling the form for the time reversal given in Proposition 1 λ t (X) = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R yadd pt(ins(X, yadd, i))dyadd/pt(X). We then introduce a marginalization over X0 λ t (X) = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R x0 p0,t(X0, ins(X, yadd, i))dx0dyadd/pt(X) = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R yadd P n0 R pt(X) pt|0(ins(X, yadd, i)|X0)dx0dyadd = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R yadd P n0 R x0 p0|t(X0|X) pt|0(X|X0)pt|0(ins(X, yadd, i)|X0)dx0dyadd = λ t(n + 1) Pn+1 i=1 Kdel(i|n + 1) R x0 p0|t(n0|X)p0|t(x0|X,n0) pt|0(n|X0)pt|0(x|X0,n) pt|0(n + 1|X0)pt|0(z(X, yadd, i)|X0, n + 1)dx0dyadd where (n + 1, z(X, yadd, i)) = ins(X, yadd, i). 
Now using the fact the forward component deletion process does not depend on x0, only n0, we have pt|0(n|X0) = pt|0(n|n0) and pt|0(n + 1|X0) = pt|0(n + 1|n0). Using this result, we get λ t (X) = λ t(n + 1) P n0{ pt|0(n+1|n0) pt|0(n|n0) p0|t(n0|X) Pn+1 i=1 Kdel(i|n+1) R yadd pt|0(z(X,yadd,i)|X0,n+1)dyadd pt|0(x|X0,n) p0|t(x0|X, n0)dx0}. (21) We now focus on the probability ratio within the integral over x0. We will show that this ratio is 1. We start with the numerator, introducing a marginalization over possible mask variables between X0 and (n + 1, z), denoted M (n+1) with M (n+1) having n + 1 ones and n0 (n + 1) zeros. Pn+1 i=1 Kdel(i|n + 1) R yadd pt|0(z(X, yadd, i)|X0, n + 1)dyadd = Pn+1 i=1 Kdel(i|n + 1) P yadd pt|0(M (n+1), z(X, yadd, i)|X0, n + 1)dyadd M (n+1) Pn+1 i=1 Kdel(i|n + 1)pt|0(M (n+1)|X0, n + 1) R yadd pt|0(z(X, yadd, i)|X0, n + 1, M (n+1))dyadd Now, for our forward process we have pt|0(z(X, yadd, i)|X0, n + 1, M (n+1)) = Qn+1 j=1 N(z(j); αt M (n+1)(X0)j, (1 αt)Id) where z is shorthand for z(X, yadd, i), z(j) is the vector in Rd for the jth component of z and M (n+1)(X0)j is the vector in Rd corresponding to the component in X0 corresponding to the jth one in the M (n+1) mask. Integrating out yadd we have R yadd pt|0(z(X, yadd, i)|X0, n+1, M (n+1))dyadd = Qn j=1 N(x(j); αt M (n+1)\i(X0)j, (1 αt)Id), where M (n+1)\i denotes a mask variable obtained by setting the ith one of M (n+1) to zero. Hence, we have Pn+1 i=1 Kdel(i|n + 1) R yadd pt|0(z(X, yadd, i)|X0, n + 1)dyadd M (n+1) Pn+1 i=1 Kdel(i|n + 1)pt|0(M (n+1)|X0, n + 1) Qn j=1 N(x(j); αt M (n+1)\i(X0)j, (1 αt)Id). (22) We now re-write the denominator from (21) introducing a marginalization over mask variables, M (n) pt|0(x|X0, n) = P M (n) pt|0(M (n)|X0, n)pt|0(x|M (n), X0, n). (23) We use the following recursion for the probabilities assigned to mask variables pt|0(M (n)|X0, n) = P M (n+1) Pn+1 i=1 I{M (n+1)\i = M (n)}Kdel(i|n + 1)pt|0(M (n+1)|X0, n + 1). Substituting this into (23) gives pt|0(x|X0, n) = P M (n+1) Pn+1 i=1 I{M (n+1)\i = M (n)}Kdel(i|n + 1) pt|0(M (n+1)|X0, n + 1)pt|0(x|M (n), X0, n) M (n+1) Pn+1 i=1 I{M (n+1)\i = M (n)}Kdel(i|n + 1) pt|0(M (n+1)|X0, n + 1) Qn j=1 N(x(j); αt M (n)(X0)j, (1 αt)Id) = P M (n+1) Pn+1 i=1 Kdel(i|n + 1)pt|0(M (n+1)|X0, n + 1) Qn j=1 N(x(j); αt M (n+1)\i(X0)j, (1 αt)Id). By comparing with (22), we can see that pt|0(x|X0, n) = Pn+1 i=1 Kdel(i|n + 1) R yadd pt|0(z(X, yadd, i)|X0, n + 1)dyadd. This shows that the probability ratio in (21) is 1. Therefore, we have λ t (X) = λ t(n + 1) P n0{ pt|0(n+1|n0) pt|0(n|n0) p0|t(n0|X) R x0 p0|t(x0|X, n0)dx0} = λ t(n + 1) P n0 pt|0(n+1|n0) pt|0(n|n0) p0|t(n0|X), which concludes the proof. pt|0(n|n0) can be analytically calculated when λ t(n) is of a simple enough form. When λ t(n) does not depend on n then the dimension deletion process simply becomes a time inhomogeneous Poisson process. Therefore, we would have pt|0(n|n0) = ( R t 0 λ sds)n0 n (n0 n)! exp( R t 0 λ sds). In our experiments we set λ t(n = 1) = 0 to stop the dimension deletion process when we reach a single component. If we have λ t(n) = λ t(m) for all n, m > 1 then we can still use the time inhomogeneous Poisson process formula for n > 1 and find the probability for n = 1, pt|0(n = 1|n0) by requiring pt|0(n|n0) to be a valid normalized distribution. Therefore, for the case that λ t(n) = λ t(m) for all n, m > 1 and λ t(n = 1) = 0, we have pt|0(n|n0) = ( R t 0 λ sds)n0 n (n0 n)! exp( R t 0 λ sds) 1 < n n0 1 Pn0 m=2 ( R t 0 λ sds)n0 m (n0 m)! 
exp( R t 0 λ sds) n = 1 In cases where λ t(n) depends on n not just for n = 1, pt|0(n|n0) can become more difficult to calculate analytically. However, since the probability distributions are all 1-dimensional over n, it is very cheap to simply simulate the forward dimension deletion process many times and empirically estimate pt|0(n|n0) although we do not need to do this for our experiments. A.5 The Objective is Maximized at the Time Reversal In this section, we analyze the objective L(θ) as a standalone object and determine the optimum values for sθ t , λ θ t and Aθ t directly. This is in order to gain intuition directly into the learning signal of L(θ) without needing to refer to stochastic process theory. The definition of L(θ) as in the main text is 2 E[g2 t sθ t(Xt) xt log pt|0(xt|X0, nt, Mt) 2]+ TE[ λ θ t (Xt) + λ t(nt) log λ θ t (Y) + λ t(nt) log Aθ t(xadd t , i|Y)] + C. with the expectations taken over U(t; 0, T)p0,t(X0, Xt, Mt)Kdel(i|nt)δdel(Xt,i)(Y). Continuous optimum. We start by analysing the objective for sθ t . This part of L(θ) can be written as 1 2 R T 0 g2 t Ep0,t(X0,Xt,Mt)[ sθ t(Xt) xt log pt|0(xt|X0, nt, Mt) 2]dt We now use the fact that the function that minimizes an L2 regression problem min f Ep(x,y)[ f(x) y 2] is the conditional expectation of the target f (x) = Ep(y|x)[y]. Therefore the optimum value for sθ t(Xt) is s t (Xt) = Ep(Mt,X0|Xt)[ xt log pt|0(xt|X0, nt, Mt)] Mt PN n0=1 R x0 Rn0d p(Mt, n0, x0|Xt) xt log pt|0(xt|x0, n0, nt, Mt)dx0 Mt PN n0=1 R x0 Rn0d p(Mt,n0,x0|Xt) pt|0(xt|x0,n0,nt,Mt) xtpt|0(xt|x0, n0, nt, Mt)dx0 = P Mt PN n0=1 R x0 Rn0d p(x0,n0,nt,Mt) p(nt,xt) xtpt|0(xt|x0, n0, nt, Mt)dx0 = 1 p(nt,xt) P Mt PN n0=1 R x0 Rn0d xtp(xt, x0, n0, nt, Mt)dx0 = 1 p(nt,xt) xt P Mt PN n0=1 R x0 Rn0d p(xt, x0, n0, nt, Mt)dx0 = 1 p(nt,xt) xtp(xt, nt) = xt log p(Xt). Therefore, the optimum value for sθ t(Xt) is xt log p(Xt) which is the value that gives b t to be the time reversal of b t as stated in Proposition 1. Jump rate optimum. The learning signal for λ θ t comes from these two terms in L(θ) TE[ λ θ t(Xt) + λ t(nt) log λ θ t(Y)] (24) This expectation is maximized when for each test input Z and test time t, we have the following expression maximized pt(Z) λ θ t (Z) + Pnz+1 i= R yadd pt(ins(Z, yadd, i))Kdel(i|nz + 1)dyadd λ t(nz + 1) log λ θ t (Z), because pt(Z) is the probability Z gets drawn as a full sample from the forward process and Pnz+1 i= R yadd pt(ins(Z, yadd, i))Kdel(i|nz + 1)dyadd is the probability that a sample one component bigger than Z gets drawn from the forward process and then a component is deleted to get to Z. Therefore the first probability is the probability that test input Z and test time t appear as the first term in (24) whereas the second probability is the probability that test input Z and test time t appear as the second term in (24). We now use the fact that, for constants b and c, argmaxa ba + c log a = c We therefore have the optimum λ θ t (Z) as λ t (Z) = λ t(nz + 1) yadd pt(ins(Z,yadd,i))Kdel(i|nz+1)dyadd which is the form for the time-reversal given in Proposition (1). Jump kernel optimum. Finally, we analyse the part of L(θ) for learning Aθ t(xadd t , i|Y), TE[ λ t(nt) log Aθ t(xadd t , i|Y)] = R T 0 Ept(Xt)Kdel(i|nt)δdel(Xt,i)(Y)[ λ t(nt) log Aθ t(xadd t , i|Y)]dt = R T 0 Ept(nt)[ λ t(nt)Ept(xt|nt)Kdel(i|nt)δdel(Xt,i)(Y)[log Aθ t (xadd t , i|Y)]]dt. We now re-write the joint probability distribution that the inner expectation is taken with respect to, pt(xt|nt)Kdel(i|nt)δdel(Xt,i)(Y) = p(Y|nt)p(xadd t , i|Y)δy(xbase t ). 
with p(Y|n_t) = ∑_{i=1}^{n_t} ∫_{x_t} p_t(x_t|n_t) K_del(i|n_t) δ_{del(x_t,i)}(Y) dx_t, and p(x^add_t, i|Y) ∝ p_t(x_t|n_t) K_del(i|n_t), and x^base_t ∈ R^{(n_t−1)d} referring to the n_t − 1 components of x_t that are not x^add_t, i.e. X_t = ins((n_t − 1, x^base_t), x^add_t, i). We then have
T E[λ_t(n_t) log A^θ_t(x^add_t, i|Y)]
= ∫_0^T E_{p_t(n_t)}[λ_t(n_t) E_{p(Y|n_t) p(x^add_t,i|Y) δ_y(x^base_t)}[log A^θ_t(x^add_t, i|Y)]] dt
= ∫_0^T E_{p_t(n_t)}[λ_t(n_t) E_{p(Y|n_t) p(x^add_t,i|Y) δ_y(x^base_t)}[log A^θ_t(x^add_t, i|Y) − log p(x^add_t, i|Y)]] dt + const
= −∫_0^T E_{p_t(n_t)}[λ_t(n_t) E_{p(Y|n_t) δ_y(x^base_t)}[KL(p(x^add_t, i|Y) ‖ A^θ_t(x^add_t, i|Y))]] dt + const,
where the constant collects the term E[log p(x^add_t, i|Y)], which does not depend on θ. Therefore, the optimum A^θ_t(x^add_t, i|Y), which maximizes this part of L(θ), is A^*_t(x^add_t, i|Y) = p(x^add_t, i|Y) ∝ p_t(X_t) K_del(i|n_t), which is the same form as given in Proposition 1.
B Training Objective
We estimate our objective L(θ) by taking minibatches from the expectation U(t; 0, T) p_{0,t}(X_0, X_t, M_t) K_del(i|n_t) δ_{del(X_t,i)}(Y). We first sample t ∼ U(t; 0, T) and then take samples from our dataset, X_0 ∼ p_data(X_0). In order to sample p_{t|0}(X_t, M_t|X_0) we need to add noise, delete dimensions, and sample a mask variable. Since the Gaussian noising process is isotropic, we can add a suitable amount of noise to all dimensions of X_0 and then delete dimensions of that noised full-dimensional value. More specifically, we first sample X̃_t = (n_0, x̃_t) with x̃_t ∼ N(x̃_t; √ᾱ_t x_0, (1 − ᾱ_t) I_{n_0 d}) for ᾱ_t = exp(−∫_0^t β(s) ds), using the analytic forward equations for the VP-SDE derived in [3]. Then we sample the number of dimensions to delete. This is simple to do when our rate function is independent of n, except for the case n = 1 at which it is zero. We simply sample a Poisson random variable with mean parameter ∫_0^t λ_s ds and then clamp its value such that the maximum number of components that can be deleted is n_0 − 1. This gives the appropriate distribution over n, p_{t|0}(n|n_0), as given in Section A.4. To sample which dimensions are deleted, we can sample K_del(i_1|n_0) K_del(i_2|n_0 − 1) . . . K_del(i_{n_0−n_t}|n_t + 1), from which we can create the mask M_t and apply it to X̃_t to obtain X_t, X_t = M_t(X̃_t). When K_del(i|n) = 1/n this is especially simple to do by randomly permuting the components of X̃_t and then removing the final n_0 − n_t components.
As is typically done in standard diffusion models, we parameterize s^θ_t in terms of a noise prediction network that predicts ϵ, where x_t = √ᾱ_t M_t(x_0) + √(1 − ᾱ_t) ϵ, ϵ ∼ N(0, I_{n_t d}). We then re-weight the score loss in time such that we have a uniform weighting in time rather than the likelihood weighting with g_t^2 [3, 21]. Our objective to learn s^θ_t then becomes
E_{U(t;0,T) p_data(X_0) p(M_t, n_t|X_0) N(ϵ; 0, I_{n_t d})} ‖ϵ^θ_t(X_t) − ϵ‖^2
with x_t = √ᾱ_t M_t(x_0) + √(1 − ᾱ_t) ϵ and s^θ_t(X_t) = −ϵ^θ_t(X_t)/√(1 − ᾱ_t). Further, by using the parameterization given in Proposition 3, we can directly supervise the value of p^θ_{0|t}(n_0|X_t) by adding an extra term to our objective. We can treat the learning of p^θ_{0|t}(n_0|X_t) as a standard prediction task where we aim to predict n_0 given access to X_t. A standard objective for learning p^θ_{0|t}(n_0|X_t) is then the cross entropy
max_θ E_{p_{0,t}(X_0, X_t)}[log p^θ_{0|t}(n_0|X_t)].
Our augmented objective then becomes
L(θ) = T E[−½ ‖ϵ^θ_t(X_t) − ϵ‖^2 − λ^θ_t(X_t) + λ_t(n_t) log λ^θ_t(Y) + λ_t(n_t) log A^θ_t(x^add_t, i|Y) + γ log p^θ_{0|t}(n_0|X_t)]   (25)
where the expectation is taken with respect to U(t; 0, T) p_data(X_0) p(M_t, n_t|X_0) N(ϵ; 0, I_{n_t d}) K_del(i|n_t) δ_{del(X_t,i)}(Y), where x_t = √ᾱ_t M_t(x_0) + √(1 − ᾱ_t) ϵ and γ is a loss weighting term for the cross-entropy loss.
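To make the estimator above concrete, the following is a minimal PyTorch-style sketch of the minibatch construction just described for a single datapoint, assuming a rate that is constant in n (except n = 1) and a uniform deletion kernel; the schedule arguments int_beta_t and int_lambda_t (standing in for ∫_0^t β(s) ds and ∫_0^t λ_s ds) and all function names are illustrative placeholders rather than the released implementation.

```python
import torch

def make_training_example(x0, int_beta_t, int_lambda_t):
    # x0: (n0, d) clean components of one datapoint; int_beta_t and
    # int_lambda_t are the schedule integrals up to the sampled time t.
    n0 = x0.shape[0]
    alpha_bar = torch.exp(torch.tensor(-float(int_beta_t)))       # VP-SDE coefficient
    # Noise every component first (the Gaussian noising is isotropic) ...
    eps_full = torch.randn_like(x0)
    xt_full = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * eps_full
    # ... then delete: clamped Poisson number of deletions, uniform K_del
    n_del = min(int(torch.poisson(torch.tensor(float(int_lambda_t))).item()), n0 - 1)
    perm = torch.randperm(n0)
    keep = perm[: n0 - n_del]                                      # mask M_t as kept indices
    xt, eps = xt_full[keep], eps_full[keep]
    # Targets: eps for the noise-prediction (score) loss and n0 for the
    # cross-entropy term on p_theta(n0 | X_t) in the augmented objective (25).
    return xt, keep, eps, n0
```

The rate and kernel terms of the objective are then evaluated on (X_t, Y) pairs obtained by deleting one further component of X_t, mirroring the K_del(i|n_t) δ_{del(X_t,i)}(Y) factor in the expectation above.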
C Trans-Dimensional Diffusion Guidance
To guide an unconditionally trained model such that it generates datapoints consistent with conditioning information, we use the reconstruction guided sampling approach introduced in [9]. Our conditioning information will be the values of some of the components of X_0, and thus the guidance should steer the generative process such that the rest of the components of the generated datapoint are consistent with those observed components. Following the notation of [9], we denote the observed components as x^a ∈ R^{n_a d} and the components to be generated as x^b ∈ R^{n_b d}. Our trained score function s^θ_t(X_t) approximates ∇_{x_t} log p_t(X_t), whereas we would like the score to approximate ∇_{x_t} log p_t(X_t|x^a_0). In order to do this, we will need to augment our unconditional score s^θ_t(X_t) such that it incorporates the conditioning information. We first focus on the dimensions of the score vector corresponding to x^a. These can be calculated analytically from the forward process, ∇_{x^a_t} log p(X_t|x^a_0) = ∇_{x^a_t} log p_{t|0}(x^a_t|x^a_0, n_t) with p_{t|0}(x^a_t|x^a_0, n_t) = N(x^a_t; √ᾱ_t x^a_0, (1 − ᾱ_t) I_{n_a d}). Note that we assume a correspondence between x^a_t and x^a_0. For example, in video, if we condition on the first and last frame, we assume that the first and last frame of the current noisy x_t correspond to x^a_0 and guide them towards their observed values. For molecules, the point cloud is permutation invariant and so we can simply assume the first n_a components of x_t correspond to x^a_0 and guide them to their observed values. Now we analyse the dimensions of the score vector corresponding to x^b. We split the score as ∇_{x^b_t} log p(X_t|x^a_0) = ∇_{x^b_t} log p(x^a_0|X_t) + ∇_{x^b_t} log p_t(X_t). The term p(x^a_0|X_t) is intractable to calculate directly and so, following [9], we approximate it with N(x^a_0; x̂^{θa}_0(X_t), ((1 − ᾱ_t)/ᾱ_t) I_{n_a d}), where x̂^{θa}_0(X_t) is a point estimate of x^a_0 obtained from s^θ_t(X_t), calculated as x̂^{θa}_0(X_t) = (x^a_t + (1 − ᾱ_t) s^θ_t(X_t)^a)/√ᾱ_t, where again we have assumed a correspondence between x^a_t and x^a_0. Our approximation for ∇_{x^b_t} log p(x^a_0|X_t) is then
∇_{x^b_t} log p(x^a_0|X_t) ≈ −∇_{x^b_t} [ ᾱ_t/(2(1 − ᾱ_t)) ‖x^a_0 − x̂^{θa}_0(X_t)‖^2 ],
which can be calculated by differentiating through the score network s^θ_t. We approximate λ_t(X_t|x^a_0) and A_t(y^add, i|X_t, x^a_0) with their unconditional forms λ^θ_t(X_t) and A^θ_t(y^add, i|X_t). We find this approximation still leads to valid generations because the guidance of the score network s^θ_t results in X_t containing the conditioning information, which in turn leads to λ^θ_t(X_t) guiding the number of components in X_t to be consistent with the conditioning information too, as verified in our experiments. Further, any errors in the approximation for A^θ_t(y^add, i|X_t) are fixed by further applications of the guided score function, highlighting the benefits of our combined autoregressive and diffusion based approach. A short code sketch of this guided-score computation is given below.
D Experiment Details
Our code is available at https://github.com/andrew-cr/jump-diffusion
D.1 Molecules
D.1.1 Network Architecture
Backbone For our backbone network architecture, we used the EGNN used in [8]. This is a specially designed graph neural network applied to the point cloud, treating it as a fully connected graph. A special equivariant update is used, operating only on distances between atoms. We refer to [8] for the specific details of the architecture. We used the same size network as used in [8]'s QM9 experiments; specifically, there are 9 layers with a hidden node feature size of 256.
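As referenced at the end of Appendix C above, here is a minimal PyTorch-style sketch of the reconstruction-guided score computation, written for a flat tensor of components with an index set a_idx for the observed part; the score_net interface, the ᾱ_t argument, and the index handling are illustrative assumptions rather than the released implementation.

```python
import torch

def guided_score(score_net, x_t, t, x0_a, a_idx, alpha_bar_t):
    # x_t: (n, d) current noisy components; x0_a: (n_a, d) observed clean values;
    # a_idx: indices of the components of x_t assumed to correspond to x0_a.
    x = x_t.detach().requires_grad_(True)
    s = score_net(x, t)                                  # unconditional score s_theta(X_t)
    # Point estimate of the observed components (Appendix C):
    # x0_a_hat = (x_t^a + (1 - abar_t) s^a) / sqrt(abar_t)
    x0_a_hat = (x[a_idx] + (1.0 - alpha_bar_t) * s[a_idx]) / alpha_bar_t ** 0.5
    # Gaussian surrogate for p(x_0^a | X_t); the gradient of its log-density
    # with respect to x supplies the guidance on the generated components.
    log_guidance = -alpha_bar_t / (2.0 * (1.0 - alpha_bar_t)) * ((x0_a - x0_a_hat) ** 2).sum()
    grad_guidance = torch.autograd.grad(log_guidance, x)[0]
    guided = s.detach() + grad_guidance
    # Observed components take the analytic conditional score of the forward kernel.
    guided[a_idx] = -(x_t[a_idx] - alpha_bar_t ** 0.5 * x0_a) / (1.0 - alpha_bar_t)
    return guided
```

The backward rate and insertion kernel are then used in their unconditional forms, exactly as described in Appendix C.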
The output of the EGNN is fed into a final output projection layer to give the score network output sθ t(Xt). Component number prediction To obtain pθ 0|t(n0|Xt), we take the embedding produced by the EGNN before the final output embedding layer and pass it through 8 transformer layers each consisting of a self-attention block and an MLP block applied channel wise. Our transformer model dimension is 128 and so we project the EGNN embedding output down to 128 before entering into the transformer layers. We then take the output of the transformer and take the average embedding over all nodes. This embedding is then passed through a final projection layer to give softmax logits over the pθ 0|t(n0|Xt) distribution. Autoregressive Distribution Our Aθ t(yadd, i|Xt) network has to predict the position and features for a new atom when it is added to the molecule. Since the point cloud is permutation invariant, we do not need to predict i and so we just need to parameterize Aθ t(yadd|Xt). We found the network to perform the best if the network first predicts the nearest atom to the new atom and then a vector from that atom to the location of the new atom. To achieve this, we first predict softmax logits for a distribution over the nearest atom by applying a projection to the embedding output from the previously described transformer block. During training, the output of this distribution can be directly supervised by a cross entropy loss. Given the nearest atom, we then need to predict the position and features of the new atom to add. We do this by passing in the embedding generated by the EGNN and original point cloud features into a new transformer block of the same size as that used for pθ 0|t(n0|Xt). We also input the distances from the nearest atom to all other atoms in the molecule currently as an additional feature. To obtain the position of the new atom, we will take a weighted sum of all the vectors between the nearest atom and other atoms in the molecule. This is to make it easy for the network to create new atoms in plane with existing atoms which is useful for e.g. completing rings that have to remain in the same plane. To calculate the weights for the vectors, we apply an output projection to the output of the transformer block. The new atom features (atom type and charge) are generated by a separate output projection from the transformer block. For the position and features, Aθ t(yadd|Xt) outputs both a mean and a standard deviation for a Gaussian distribution. For the position distribution, we set the standard deviation to be isotropic to remain equivariant to rotations. In total our model has around 7.3 million parameters. D.1.2 Training We train our model for 1.3 million iterations at a batch size of 64. We use the Adam optimizer with learning rate 0.00003. We also keep a running exponential moving average of the network weights that is used during sampling as is standard for training diffusion models [2, 3, 16] with a decay parameter of 0.9999. We train on the 100K molecules contained in the QM9 training split. We model hydrogens explicitly. Training a model requires approximately 7 days on a single GPU which was done on an Academic cluster. In [8] the atom type is encoded as a one-hot vector and diffused as a continuous variable along with the positions and charge values for all atoms. They found that multiplying the one-hot vectors by 0.25 to boost performance by allowing the atom-type to be decided later on in the diffusion process. 
We instead multiply the one-hot vectors by 4 so that the atom type is decided early on in the diffusion process, which improves our guided performance when conditioning on certain atom types being present. We found our model is robust to this change and achieves similar sample quality to [8], as shown in Table 2. When deleting dimensions, we first shuffle the ordering of the nodes and then delete the final n_0 − n_t nodes. The cross entropy loss weighting in (25) is set to 1. Following [8], we train our model to operate within the center of mass (CoM) zero subspace of possible molecule positions. This means that, throughout the forward and backward process, the average position of an atom is 0. In our trans-dimensional framework, this is achieved by first deleting any atoms required under the forward component deletion process. We then move the molecule such that its CoM is 0. We then add CoM-free noise such that the noisy molecule also has CoM = 0. Our score model s^θ_t is parameterized through a noise prediction model ϵ^θ_t which is trained to predict the CoM-free noise that was added. Therefore, our score network learns suitable directions to maintain the process on the CoM = 0 subspace. For the position prediction from A^θ_t(y^add|X_t), we train it to predict the new atom position in the current molecule's reference frame. When the new atom is added, we then update all atom positions such that CoM = 0 is maintained.
D.1.3 Sampling
During sampling we found that adding corrector steps [3] improved sample quality. Intuitively, corrector steps form a process that has p_t(X) as its stationary distribution rather than the process progressing toward p_0(X). We use the same method to determine the corrector step size ζ as in [3]. For the conditional generation tasks, we also found it useful to include corrector steps for the component generation process. As shown in [29], corrector steps in discrete spaces can be achieved by simulating with a rate that is the sum of the forward and backward rates. We achieve this in the context of trans-dimensional modeling by first simulating a possible insertion using λ^θ_t and then simulating a possible deletion using λ_t. We describe our overall sampling algorithm in Algorithm 2.
Algorithm 2: Sampling the Generative Process with Corrector Steps
Input: number of corrector steps C
t ← T, X ∼ p_ref(X) = I{n = 1} N(x; 0, I_d)
while t > 0 do
    if u < λ^θ_t(X) δt with u ∼ U(0, 1) then
        Sample x^add, i ∼ A^θ_t(x^add, i|X); X ← ins(X, x^add, i)
    end
    x ← x − b^θ_t(X) δt + g_t √δt ϵ with ϵ ∼ N(0, I_{nd})
    for c = 1, . . . , C do
        x ← x + ζ s^θ_{t−δt}(X) + √(2ζ) ϵ with ϵ ∼ N(0, I_{nd})
        if u < λ^θ_{t−δt}(X) δt with u ∼ U(0, 1) then
            Sample x^add, i ∼ A^θ_{t−δt}(x^add, i|X); X ← ins(X, x^add, i)
        end
        if u < λ_{t−δt}(n) δt with u ∼ U(0, 1) then
            X ← del(X, i) with i ∼ K_del(i|n)
        end
    end
    X ← (n, x), t ← t − δt
end
A short code sketch of this sampling loop is given below.
D.1.4 Evaluation
Unconditional For our unconditional sampling evaluation, we start adding corrector steps when t < 0.1T in the backward process and use 5 corrector steps, without the corrector steps on the number of components. We set δ = 0.05 for t > 0.5T and δ = 0.001 for t < 0.5T such that the total number of network evaluations is 1000.
Figure 6: Distribution of the size of molecules in the QM9 dataset, as measured through the number of atoms, versus the distribution of the size of molecules generated by our unconditional model.
We show the distribution of sizes of molecules generated by our model in Figure 6 and show more unconditional samples in Figure 7.
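As referenced after Algorithm 2 above, the following is a minimal PyTorch-style sketch of that sampling loop; the network callables (backward drift, score, backward rate, insertion kernel), the forward-rate and noise-scale schedules, the dimensionality d, and the fixed step size are illustrative placeholders, and the insertion/deletion helpers mirror ins(·) and del(·) from the text.

```python
import torch

def insert_row(x, y_add, i):
    # ins(X, y_add, i): insert component y_add at position i
    return torch.cat([x[:i], y_add[None, :], x[i:]], dim=0)

def delete_row(x, i):
    # del(X, i): remove component i
    return torch.cat([x[:i], x[i + 1:]], dim=0)

@torch.no_grad()
def sample(b_rev, score, rate_rev, kernel_rev, rate_fwd, g, d,
           T=1.0, dt=1e-3, n_corrector=0, zeta=1e-4):
    t, x = T, torch.randn(1, d)            # reference: a single Gaussian component
    while t > 0:
        # backward jump: insert a new component with probability rate * dt
        if torch.rand(()) < rate_rev(x, t) * dt:
            y_add, i = kernel_rev(x, t)    # sample from A_theta(y_add, i | X)
            x = insert_row(x, y_add, i)
        # backward diffusion step (Euler-Maruyama with the learned drift)
        x = x - b_rev(x, t) * dt + g(t) * dt ** 0.5 * torch.randn_like(x)
        for _ in range(n_corrector):
            # Langevin corrector on the state values
            x = x + zeta * score(x, t - dt) + (2 * zeta) ** 0.5 * torch.randn_like(x)
            # dimension corrector: a possible insertion, then a possible deletion
            if torch.rand(()) < rate_rev(x, t - dt) * dt:
                y_add, i = kernel_rev(x, t - dt)
                x = insert_row(x, y_add, i)
            # rate_fwd is assumed to be zero at n = 1, as in the text
            if torch.rand(()) < rate_fwd(x.shape[0], t - dt) * dt:
                i = int(torch.randint(x.shape[0], ()))   # uniform K_del(i | n)
                x = delete_row(x, i)
        t -= dt
    return x
```

A non-uniform step size (larger early in the reverse process, smaller late, as in the evaluation settings of Section D.1.4) and the step-size rule for ζ from [3] slot into the same loop.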
We find our model consistently generates realistic molecules and achieves a size distribution similar to the training dataset even though this is not explicitly trained and arises from sampling our backward rate λ θ t. Since we are numerically integrating a continuous time process and approximating the true time reversal rate λ t , some approximation error is expected. For this experiment, sampling all of our models and ablations takes approximately 2 GPU days on Nvidia 1080Ti GPUs. Conditional For evaluating applying conditional diffusion guidance to our model, we choose 10 conditioning tasks that each result in a different distribution of target dimensions. The task is to produce molecules that include at least a certain number of target atom types. We then guide the first set of atoms generated by the model to have these desired atom types. The tasks chosen are given in Table 5. Molecules in the training dataset that meet the conditions in each task have a different distribution of sizes. The tasks were chosen so that we have an approximately linearly increasing mean number of atoms for molecules that meet the condition. We also require that there are at least 100 examples of molecules that meet the condition within the training dataset. For sampling when using conditional diffusion guidance, we use 3 corrector steps throughout the backward process with δt = 0.001. For these conditional tasks, we include the corrector steps on the number of components. We show the distribution of dimensions for each task from the training dataset and from our generated samples in Figure 8. Our metrics are calculated by first drawing 1000 samples for each conditioning task and then finding the Hellinger distance between the size distribution generated by our method (orange diagonal hashing in Figure 8) and the size distribution for molecules in the training dataset that match the conditions of the task (green no hashing in Figure 8). We find that indeed our model when guided by diffusion guidance can automatically produce a size distribution close to the ground truth size distribution found in the dataset for that conditioning value. We show samples generated by our conditionally guided model in Figure 9. We can see that our model can generate realistic molecules that include the required atom types and are of a suitable size. For this experiment, sampling all of our models and ablations takes approximately 13 GPU days on Nvidia 1080Ti GPUs. Interpolations For our interpolations experiments, we follow the set up of [8] who train a new model conditioned on the polarizability of molecules in the dataset. We train a conditional version of our model which can be achieved by simply adding in the polarizability as an additional feature input to our backbone network and re-using all the same hyperparameters. We show more examples of interpolations in Figure 10. Figure 7: Unconditional samples from our model. D.1.5 Ablations For our main model, we set λ t<0.1T = 0 to ensure that all dimensions are added with enough generation time remaining for the diffusion process to finalize all state values. To verify this setting, we compare its performance with λ t<0.03T = 0 and λ t<0.3T = 0. We show our results in Table 6. We find that the λ t<0.03T = 0 setting to generate reasonable sample quality but incur some extra dimension error due to the generative process sometimes observing a lack of dimensions near t = 0 and adding too many dimensions. 
We observed the same effect in the main paper when setting λ_t to be constant for all t (Table 3). Further, the setting λ_{t<0.3T} = 0 also results in increased dimension error due to there being less opportunity for the guidance model to supervise the number of dimensions. We find λ_{t<0.1T} = 0 to be a reasonable trade-off between these effects.
D.1.6 Uniqueness and Novelty Metrics
Here we investigate the sample diversity and novelty of our unconditional generative models. We measure uniqueness by computing the chemical graph corresponding to each generated sample and measuring what proportion of the 10000 produced samples have a unique chemical graph amongst this set of 10000, as is done in [8]. We show our results in Table 7 and find our TDDM method to have slightly lower levels of uniqueness when compared to the fixed dimension diffusion model baseline.
Table 5: The 10 conditioning tasks used for evaluation. The number of each atom type required for the task is given in columns 2–5, whilst the average number of atoms in molecules that meet this condition in the training dataset is given in the 6th column.
Task  Carbon  Nitrogen  Oxygen  Fluorine  Mean Number of Atoms
1     4       1         2       1         11.9
2     4       3         1       1         13.0
3     5       2         1       1         13.9
4     6       0         1       1         14.6
5     5       3         1       0         16.0
6     6       3         0       0         17.2
7     6       1         2       0         17.7
8     7       1         1       0         19.1
9     8       1         0       0         19.9
10    8       0         1       0         21.0
Figure 8: Distribution of molecule sizes (x-axis: number of atoms) for each conditioning task. Tasks 1–5 are shown left to right in the top row and tasks 6–10 are shown left to right in the bottom row. We show the unconditional size distribution from the dataset in blue vertical/horizontal hashing, the size distribution of our conditionally generated samples in orange diagonal hashing, and finally the size distribution for molecules in the training dataset that match the conditions of each task (the ground truth size distribution) in green with no hashing.
Measuring novelty on generative models trained on the QM9 dataset is challenging because the QM9 dataset contains an exhaustive enumeration of all molecules that satisfy certain predefined constraints [46, 8]. Therefore, if a novel molecule is produced it means the generative model has failed to capture some of the physical properties of the dataset, and indeed it is found in [8] that during training, as the model improved, novelty decreased. Novelty is therefore not typically included in evaluating molecular diffusion models. For completeness, we include the novelty scores in Table 7 as a comparison to the results presented in Appendix C of [8]. We find that our samples are closer to the statistics of the training dataset whilst still producing novel samples at a consistent rate.
Figure 9: Samples generated by our model when conditional diffusion guidance is applied. Each row represents one task, with task 1 at the top down to task 10 at the bottom. For each task, 10 samples are shown in each row.
D.2.1 Dataset
We used the VP2 benchmark, which consists of 35 000 videos, each 35 frames long. The videos are evenly divided among seven tasks, namely: push {red, green, blue} button, open {slide, drawer}, push {upright block, flat block} off table. The 5000 videos for each task were collected using a scripted task-specific policy operating in the RoboDesk environment [40]. They sample an action vector at every step during data generation by adding i.i.d. Gaussian
noise to each dimension of the action vector output by the scripted policy. For each task, they sample 2500 videos with noise standard deviation 0.1 and 2500 videos with standard deviation 0.2. We filter out the lower-quality trajectories sampled with noise standard deviation 0.2, and so use only the 17 500 videos (2500 per task) with noise standard deviation 0.1. We convert these videos to 32×32 resolution and then, so that the data we train on has varying lengths, we create each training example by sampling a length l from a uniform distribution over {2, . . . , 35} and then taking a random l-frame subset of the video.
D.2.2 Forward Process
The video domain differs from molecules in two important ways. The first is that videos cannot reasonably be treated as a permutation-invariant set, because the order of the frames matters. Secondly, generating a full new component for molecules with a single-pass autoregressive network is feasible; however, a component for videos is a full frame, which is challenging for a single-pass autoregressive network to generate. We design our forward process to overcome these challenges.
Figure 10: Interpolations showing a sequence of generations for linearly increasing polarizability from 39 Bohr³ to 66 Bohr³ with fixed random noise. Each row shows an individual interpolation, with polarizability increasing from left to right.
Table 6: Ablation of when to set the forward rate to 0 on the conditional molecule generation task. We report dimension error as the average Hellinger distance between the generated and ground truth conditional dimension distributions, as well as average sample quality metrics. Metrics are reported after 620k training iterations.
Method             Dimension Error  % Atom Stable  % Molecule Stable  % Valid
λ_{t<0.03T} = 0    0.227 ± 0.16     91.5 ± 3.7     56.5 ± 9.8         72.0 ± 11
λ_{t<0.1T} = 0     0.162 ± 0.071    92.4 ± 2.8     53.9 ± 12          72.7 ± 9.6
λ_{t<0.3T} = 0     0.266 ± 0.11     92.0 ± 3.2     53.5 ± 13          66.6 ± 12
Table 7: Uniqueness and novelty metrics on unconditional molecule generation. We produce 10000 samples for each method and measure validity using RDKit. Uniqueness is judged by whether the chemical graph is unique amongst the 10000 produced samples. Amongst the valid and unique molecules, we then find the percentage that have a chemical graph not present in the training dataset.
Method                 % Valid  % Valid and Unique  % of Valid and Unique Molecules that are Novel
FDDM [8]               91.9     90.7                65.7
TDDM (ours)            92.3     89.9                53.6
TDDM, const λ_t        86.7     84.4                56.9
TDDM, λ_{t<0.9T} = 0   89.4     86.1                51.3
TDDM w/o Prop. 3       87.1     85.9                63.3
We define our forward process to delete frames in a random order. This means that during generation, frames can be generated in any order in the reverse process, enabling more conditioning tasks since we can always ensure that whichever frames we want to condition on are added first. Further, we use a non-isotropic noise schedule by adding noise just to the frame that is about to be deleted. Once it is deleted, we then start noising the next randomly chosen frame. This is so that, in the backward direction, when a new frame is added, it is simply Gaussian noise. The score network will then fully denoise that new frame before the next new frame is added. We now specify exactly how our forward process is constructed. We enable random-order deletion by applying an initial shuffling operation occurring at time t = 0. Before this operation, we represent the video x as an ordered sequence of frames, x0 = [x1, x2, . . . , xn0].
During shuffling, we sample a random permutation π of the integers 1, . . . , n0. Then the frames are kept in the same order, but annotated with an index variable so that we have x0+ = [(x(1) 0+ , π(1)), (x(2) 0+ , π(2)), . . . , (x(n0) 0+ , π(n0))]. We will run the forward process from t = 0 to t = 100N. We will set the forward rate such we delete down from nt to nt 1 at time (N nt + 1)100. This is achieved heuristically by setting λ t(nt) = 0 for t < (N nt + 1)100, for t (N nt + 1)100. We can see that at time t = (N nt + 1)100 we will quickly delete down from nt to nt 1 at which point λ t(nt) will become 0 thus stopping deletion until the process arrives at the next multiple of 100 in time. When we hit a deletion event, we delete the frame from Xt that has the current highest index variable π(n). In other words Kdel(i|Xt) = ( 1 for nt = x(i) t [2], 0 otherwise where we use x(i) t [2] to refer to the shuffle index variable for the ith current frame in xt. We now provide an example progression of the forward deletion process. Assume we have n0 = 4, N = 5 and sample a permutation such that π(1) = 3, π(2) = 2, π(3) = 4, and π(4) = 1. Initially the state is augmented to include the shuffle index. Then the forward process progresses from t = 0 to t = 500 with components being deleted in descending order of the shuffle index x0+ = [(x(1) t , 3), (x(2) t , 2), (x(3) t , 4), (x(4) t , 1)] x100+ = [(x(1) t , 3), (x(2) t , 2), (x(3) t , 4), (x(4) t , 1)] x200+ = [(x(1) t , 3), (x(2) t , 2), (x(4) t , 1)] x300+ = [(x(2) t , 2), (x(4) t , 1)] x400+ = [(x(4) t , 1)] In this example, due to the random permutation sampled, the final video frame remained after all others had been deleted. Note that the order of frames is preserved as we delete frames in the forward process although the spacing between them can change as we delete frames in the middle. Between jumps, we use a noising process to add noise to frames. The noising process is non-isotropic in that it adds noise to different frames at different rates such that the a frame is noised only in the time window immediately preceding its deletion. For component i [1, . . . , nt], we set the forward noising process such that pt|0(x(i) t |x(i) 0 , Mt) = N(x(i) t ; x(i) 0 , σt(x(i) t )2) where x(i) 0 is the clean frame corresponding to x(i) t as given by the mask Mt and σt(x(i) t ) follows σt(x(i) t ) = 0 for t < (N x(i) t [2])100, 100 for t > (N x(i) t [2])100, t (N x(i) t [2])100 for (N x(i) t [2])100 t (N x(i) t [2] + 1)100 where we again use x(i) t [2] for the shuffle index of component i. This is the VE-SDE from [3] applied to each frame in turn. We note that we only add noise to the state values on not the shuffle index itself. The SDE parameters that result in the VE-SDE are b t = 0 and g t = q 2t 2(N x(i) t [2])100. D.2.3 Sampling the Backward Process When t is not at a multiple of 100, the forward process is purely adding Gaussian noise, and so the reverse process is also purely operating on the continuous dimensions. We use the Heun sampler proposed by [16] to update the continuous dimensions in this case, and also a variation of their discretisation of t - specifically to update from e.g. t = 600 to t = 500, we use their discretization of t as if the maximum value was 100 and then offset all values by 500. To invert the dimension deletion process, we can use Proposition 3 to derive our reverse dimension generation process. 
We re-write our parameterized λ θ t using Proposition 3 as λ θ t(Xt) = λ t(nt + 1)Epθ 0|t(n0|Xt) pt|0(nt + 1|n0) pt|0(nt|n0) At each time multiple of 100 in the backward process, we will have an opportunity to add a component. At this time point, we estimate the expectation with a single sample n0 pθ 0|t(n0|Xt). If n0 > nt then λ θ t (Xt) = . The new component will then be added at which point λ θ t (Xt) becomes 0 for the remainder of this block of time due to nt becoming nt + 1. If n0 = nt then λ θ t (Xt) = 0 and no new component is added. λ θ t(Xt) will continue to be 0 for the remainder of the backward process once an opportunity to add a component is not used. When a new frame is added, we use Aθ t(yadd, i|Xt) to decide where the frame is added and its initial value. Since when we delete a frame it is fully noised, Aθ t(yadd, i|Xt) can simply predict Gaussian noise for the new frame yadd. However, Aθ t(yadd, i|Xt) will still learn to predict a suitable location i to place the new frame such that backward process is the reversal of the forward. We give an example simulation from the backward generative process in Figure 11. Figure 11: An example simulation of the backward generative process conditioned on the first and last frame. Note how the process first adds a new frame and then fully denoises it before adding the next frame. Since the first and last frame are very similar, the process produces a short video. D.2.4 Network Architecture Our video diffusion network architecture is based on the U-net used by [42], which takes as input the index of each frame within the video, and uses the differences between these indices to control the interactions between frames via an attention mechanism. Since, during generation, we do not know the final position of each frame within the x0, we instead pass in its position within the ordered sequence xt. One further difference is that, since we are perform non-isotropic diffusion, the standard deviation of the added noise will differ between frames. We adapt to this by performing preconditioning, and inputting the timestep embedding, separately for each frame x(i) t based on σt(x(i) t ) instead of basing them on the global diffusion timestep t. Our timestep embedding and preand post-conditioning of network inputs/outputs are as suggested by [16], other than being done on a per-frame basis. The architecture from [42] with these changes applied then gives us our score network sθ t. While it would be possible to train a single network that estimates the score and all quantities needed for modelling jumps, we chose to train two separate networks in order to factorize our exploration of the design space. These were the score network sθ t, and the rate and index prediction network modeling pθ 0|t(n0|Xt) and Aθ t(i|Xt). The rate and index prediction network is similar to the first half of the score network, in that it uses all U-net blocks up to and including the middle one. We then flatten the 512 4 4 hidden state for each frame after this block such that, for an nt frame input, we obtain a nt 8192 hidden state. These are fed through a 1D convolution with kernel size 2 and zero-padding of size 1 on each end, reducing the hidden state to (nt + 1) 128, which is in turn fed through a Re LU activation function. This hidden state is then fed into three separate heads. One head maps it to the parameters of Aθ t(i|Xt) via a 1D convolution of kernel size 3. The output of size (nt + 1) is fed through a softmax to provide the categorical distribution Aθ t(i|Xt). 
The second head averages the hidden state over the frame dimension, producing a 128-dimensional vector. This is fed through a single linear layer and a softmax to parameterize pθ 0|t(n0|Xt). Finally, the third head consists of a 1D convolution of kernel size 3 with 35 output channels. The (nt + 1) 35 output is fed through a softmax to parameterize distributions over the number of frames that were deleted from X0 which came before the first in xt, the number of frames from X0 which were deleted between each pair of frames in xt, and the number deleted after the last frame in xt. We do not use this head at inference-time but found that including it improved the performance of the other heads by helping the network learn better representations. For a final performance improvement, we note that under our forward process there is only ever one noised frame in xt, while there are sometimes many clean frames. Since the cost of running our architecture scales with the number of frames, running it on many clean frames may significantly increase the cost while providing little improvement to performance. We therefore only feed into the architecture the noised frame, the two closest clean frames before it, and the two closest clean frames after it. See our released source code for the full implementation of this architecture. D.2.5 Training To sample t during training, we adapt the log-normal distribution suggested by [16] in the context of isotropic diffusion over a single image. To apply it to our non-isotropic video diffusion, we first sample which frames have been deleted, which exist with no noise, and which have had noise added, by sampling the timestep from a uniform distribution and simulating our proposed forward process. We then simply change the noise standard deviation for the noisy frame, replacing it with a sample from the log-normal distribution. The normal distribution underlying our log-normal has mean 0.6 and standard deviation 1.8. This can be interpreted as sampling the timestep from a mixture of log-normal distributions, 1 N PN 1 i=0 LN(t 100i; 0.6, 1.82). Here, the mixture index i can be interpreted as controlling the number of deleted frames. We use the same loss weighting as [16] but, similarly to our use of preconditioning, compute the weighting separately for each frame x(i) t as a function of σt(x(i) t ) to account for the non-isotropic noise. D.2.6 Perceptual Quality Metrics We now verify that our reverse process does not have any degradation in quality during the generation as more dimensions are added. We generate 10000 videos and throw away the 278 that were sampled to have only two frames. We then compute the FID score for individual frames in each of the remaining 9722 videos. We group together the scores for all the first frames to be generated in the reverse process and then for the second frame to be generated and so on. We show our results in Table 8. We find that when a frame is inserted has no apparent effect on perceptual quality and conclude that there is no overall degradation in quality as our sampling process progresses. We note that the absolute value of these FID scores may not be meaningful due to the Robo Desk dataset being far out of distribution for the Inception network used to calculate FID scores. We can visually confirm good sample quality from Figure 5. Table 8: FID for video frames grouped by when they were inserted during sampling. 
Frame (by insertion order)  1st   2nd   3rd   3rd last  2nd last  last
FID                         34.2  34.9  34.7  34.2      34.1      34.4
E Broader Impacts
In this work, we presented a general method for performing generative modeling on datasets of varying dimensionality. We have not focused on applications and instead present a generic method. Along with other generic methods for generative modeling, we must consider the potential negative social impacts that these models can cause when inappropriately used. As generative modeling capabilities increase, it becomes simpler to generate fake content which can be used to spread misinformation. In addition to this, generative models are becoming embedded into larger systems that then have real effects on society. There will be biases present within the generations created by the model, which in turn can reinforce these biases when the model's outputs are used within wider systems. In order to mitigate these harms, applications of generative models to real-world problems must be accompanied by studies into their biases and the potential ways they can be misused. Further, public releases of models must be accompanied by model cards [47] explaining the biases, limitations and intended uses of the model.