# symdiff_equivariant_diffusion_via_stochastic_symmetrisation__15e7746b.pdf

Published as a conference paper at ICLR 2025

SYMDIFF: EQUIVARIANT DIFFUSION VIA STOCHASTIC SYMMETRISATION

Leo Zhang, Kianoosh Ashouritaklimi, Yee Whye Teh, Rob Cornish
Department of Statistics, University of Oxford

We propose SYMDIFF, a method for constructing equivariant diffusion models using the framework of stochastic symmetrisation. SYMDIFF resembles a learned data augmentation that is deployed at sampling time, and is lightweight, computationally efficient, and easy to implement on top of arbitrary off-the-shelf models. In contrast to previous work, SYMDIFF typically does not require any neural network components that are intrinsically equivariant, avoiding the need for complex parameterisations or the use of higher-order geometric features. Instead, our method can leverage highly scalable modern architectures as drop-in replacements for these more constrained alternatives. We show that this additional flexibility yields significant empirical benefit for E(3)-equivariant molecular generation. To the best of our knowledge, this is the first application of symmetrisation to generative modelling, suggesting its potential in this domain more generally.

1 INTRODUCTION

For geometrically structured data such as N-body systems of molecules or proteins, it is often of interest to obtain a diffusion model that is equivariant with respect to some group actions (Xu et al., 2022; Hoogeboom et al., 2022; Watson et al., 2023; Yim et al., 2023). By accounting for the large number of permutation and rotational symmetries within these systems, equivariant diffusion models aim to improve data efficiency (Batzner et al., 2022) and generalisation (Elesedy & Zaidi, 2021). In previous work, equivariant diffusion models have been implemented using intrinsically equivariant neural networks, each of whose linear and nonlinear layers are constrained individually to be equivariant, so that the overall network is also equivariant as a result (Bronstein et al., 2021).

There is a growing literature on the practical limitations of intrinsically equivariant neural networks (Wang et al., 2024; Canez et al., 2024; Abramson et al., 2024; Pertigkiozoglou et al., 2024). It has been observed that these models can suffer from degraded training dynamics due to their imposition of architectural constraints, as well as increased computational cost and implementation complexity (Duval et al., 2023a). To address these issues, there has been recent interest in symmetrisation techniques for obtaining equivariance instead (Murphy et al., 2018; Puny et al., 2021; Duval et al., 2023b; Kim et al., 2023; Kaba et al., 2023; Mondal et al., 2023; Dym et al., 2024; Gelberg et al., 2024). These approaches offer a mechanism for constructing equivariant neural networks using subcomponents that are not equivariant, which previous work has shown leads to better performing models (Yarotsky, 2022; Kim et al., 2023). However, the advantages of symmetrisation-based equivariance have not yet been explored for generative modelling. This is possibly in part because previous work has solely focused on deterministic equivariance, rather than the more complex condition of stochastic equivariance that is required in the context of generative models.

In this paper, we introduce SYMDIFF, a novel methodology for obtaining equivariant diffusion models through symmetrisation, rather than intrinsic equivariance. We build on the recent framework of stochastic symmetrisation developed by Cornish (2024) using the theory of Markov categories (Fritz, 2020). Unlike previous work on symmetrisation, which operates on deterministic functions, stochastic symmetrisation can be applied to Markov kernels directly in distribution space. We show that this leads naturally to a more flexible approach for constructing equivariant diffusion models than is possible using intrinsic architectures. We apply this concretely to obtain an E(3)-equivariant diffusion architecture for modelling N-body systems, where E(3) denotes the Euclidean group. For this task, we formulate a custom reverse process that is allowed to be non-Gaussian, for which we derive a tractable optimisation objective. Our model is stochastically E(3)-equivariant overall without needing any intrinsically E(3)-equivariant neural networks as subcomponents. We also sketch how to extend SYMDIFF to score and flow-based generative models (Song et al., 2020; Lipman et al., 2022).

To validate our framework, we implemented SYMDIFF for de novo molecular generation, and evaluated it as a drop-in replacement for the E(3)-equivariant diffusion of Hoogeboom et al. (2022), which relies on intrinsically equivariant neural networks. In contrast, our model is able to leverage highly scalable off-the-shelf architectures such as Diffusion Transformers (Peebles & Xie, 2023) for all of its subcomponents. We demonstrate this leads to significantly improved empirical performance for both the QM9 and GEOM-Drugs datasets.

2 BACKGROUND

We provide here an overview of the underlying theory behind equivariant diffusion modelling. This theory is most conveniently developed in terms of Markov kernels, whose definition we recall first.

2.1 EQUIVARIANT MARKOV KERNELS

Markov kernels. At a high level, a Markov kernel $k : X \to Y$ may be thought of as a conditional distribution or stochastic map that, when given an input $x \in X$, produces a random output in $Y$ with distribution $k(dy|x)$. For example, given a function $f : X \times E \to Y$ and a random element $\eta$ of $E$, there is a Markov kernel $k : X \to Y$ for which $k(dy|x)$ is the distribution of $f(x, \eta)$.¹ As a special case, every deterministic function $f : X \to Y$ may be thought of as a Markov kernel $X \to Y$ also. When $k(dy|x)$ has a density (or likelihood), we will denote this by $k(y|x)$, although we note that we can still reason about Markov kernels even when they do not admit a likelihood in this sense.

Stochastic equivariance. Let $G$ be a group acting on spaces $X$ and $Y$. We will denote this using dot notation, so that the action on $X$ is a function $(g, x) \mapsto g \cdot x$. Recall that a function $f : X \to Y$ is then equivariant if

$$f(g \cdot x) = g \cdot f(x) \quad \text{for all } x \in X \text{ and } g \in G. \tag{1}$$

Here $f$ is purely deterministic, and so this concept must be generalised in order to encompass models whose outputs are stochastic (Bloem-Reddy & Teh, 2020). To this end, Cornish (2024) uses a notion defined for Markov kernels: a Markov kernel $k : X \to Y$ is stochastically equivariant (or simply equivariant) if

$$k(dy | g \cdot x) = g \cdot k(dy | x) \quad \text{for all } x \in X \text{ and } g \in G, \tag{2}$$

where the right-hand side denotes the distribution of $g \cdot y$ when $y \sim k(dy|x)$, or in other words the pushforward of $k(dy|x)$ under $g$. When $k$ is obtained from $f$ and $\eta$ as above, equation 2 holds iff

$$f(g \cdot x, \eta) \overset{d}{=} g \cdot f(x, \eta) \quad \text{for all } x \in X \text{ and } g \in G,$$

where $\overset{d}{=}$ denotes equality in distribution.² If $\eta$ is constant, this says $f$ is deterministically equivariant in the usual sense. Likewise, when $k$ has a conditional density, say $k(y|x)$, equation 2 holds if $k(g \cdot y | g \cdot x) = k(y|x)$ for all $x \in X$, $y \in Y$, and $g \in G$, provided the action of $G$ on $Y$ has unit Jacobian (Cornish, 2024, Proposition 3.18), as will be the case for all the actions we consider. This latter condition recovers the usual formulation of stochastic equivariance considered in the diffusion literature by e.g. Xu et al. (2022); Hoogeboom et al. (2022).
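The density form of the condition can be checked numerically on a toy example. The sketch below is illustrative only: it assumes $X = Y = \mathbb{R}^3$ with $G = O(3)$ acting by matrix multiplication, and compares the kernel $k(y|x) = \mathcal{N}(y; x, I)$, which satisfies $k(g \cdot y | g \cdot x) = k(y|x)$, against a hypothetical kernel with mean $Wx$ for an arbitrary matrix `W`, which generally does not.

```python
import numpy as np

def gaussian_logpdf(y, mean, sigma=1.0):
    """Log-density of N(y; mean, sigma^2 I)."""
    d = y.size
    return -0.5 * np.sum((y - mean) ** 2) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def random_orthogonal(rng, n=3):
    """Some orthogonal matrix, obtained from the QR factorisation of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q

rng = np.random.default_rng(0)
x, y = rng.standard_normal(3), rng.standard_normal(3)
R = random_orthogonal(rng)
W = rng.standard_normal((3, 3))  # arbitrary linear map (hypothetical, not equivariant)

# Kernel 1: k(y|x) = N(y; x, I), i.e. f(x, eta) = x + eta with eta ~ N(0, I).
lhs_equiv = gaussian_logpdf(R @ y, R @ x)
rhs_equiv = gaussian_logpdf(y, x)

# Kernel 2: k(y|x) = N(y; W x, I), which is generally not equivariant.
lhs_non = gaussian_logpdf(R @ y, W @ (R @ x))
rhs_non = gaussian_logpdf(y, W @ x)

equivariant_gap = abs(lhs_equiv - rhs_equiv)      # zero up to rounding
non_equivariant_gap = abs(lhs_non - rhs_non)      # generically non-zero
```

The first gap vanishes because orthogonal maps preserve Euclidean norms; the second does not because $W$ does not commute with $R$ in general.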

Stochastic invariance. When the action of $G$ on $Y$ is trivial, the condition in equation 1 is referred to as invariance. This same idea carries over to Markov kernels also: we say that $k$ above is stochastically invariant (or simply invariant) if $k(dy | g \cdot x) = k(dy|x)$ for all $g \in G$ and $x \in X$. Importantly, this differs from another natural notion of invariance that also arises in the stochastic context: we will say that a distribution $p(dy)$ on $Y$ is distributionally invariant if it holds that $g \cdot p(dy) = p(dy)$ for all $g \in G$. Given a density $p(y)$, this holds equivalently if $p(g \cdot y) = p(y)$ for all $g \in G$, again assuming the action on $Y$ has unit Jacobian.

¹ For nice choices of $Y$, the converse also holds by noise outsourcing (Kallenberg, 2002, Lemma 3.22).

² Note this is more general than the almost sure condition from equation (17) of Bloem-Reddy & Teh (2020).

2.2 EQUIVARIANT DIFFUSION MODELS

Denoising diffusion models. Diffusion models construct a generative model $p_\theta(z_0)$ of an unknown data distribution $p_{\mathrm{data}}(z_0)$ on a space $Z$ by learning to reverse an iterative forward noising process $z_t$. Following the notation of Kingma et al. (2021), the distribution of $z_t$ is defined by $q(z_t|z_0) = \mathcal{N}(z_t; \alpha_t z_0, \sigma_t^2 I)$ for some noise schedule $\alpha_t, \sigma_t > 0$ such that the signal-to-noise ratio $\mathrm{SNR}(t) := \alpha_t^2 / \sigma_t^2$ is strictly monotonically decreasing. The joint distributions of the forward and reverse processes then respectively have the forms

$$q(z_{0:T}) = q(z_0) \prod_{t=1}^{T} q(z_t | z_{t-1}), \qquad p_\theta(z_{0:T}) = p(z_T) \prod_{t=1}^{T} p_\theta(z_{t-1} | z_t), \tag{3}$$

with $q(z_0) := p_{\mathrm{data}}(z_0)$, and $q(z_t | z_{t-1}) = \mathcal{N}(z_t; \alpha_{t|t-1} z_{t-1}, \sigma_{t|t-1}^2 I)$, with constants defined as $\alpha_{t|t-1} := \alpha_t / \alpha_{t-1}$ and $\sigma_{t|t-1}^2 := \sigma_t^2 - \alpha_{t|t-1}^2 \sigma_{t-1}^2$. We take $p(z_T)$ to also be Gaussian. The reverse process is then trained to maximise the ELBO of $\log p_\theta(z_0)$, which can be obtained as follows (Sohl-Dickstein et al., 2015):
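The transition constants above are chosen so that composing one-step transitions reproduces the marginals $q(z_t|z_0)$. The following sketch checks this numerically. The variance-preserving schedule is a hypothetical choice made only for illustration; any schedule with strictly decreasing SNR would serve.

```python
import numpy as np

T = 10
# Hypothetical variance-preserving schedule (an assumption for illustration):
# sigma_t^2 increases with t and alpha_t^2 = 1 - sigma_t^2, so SNR(t) decreases.
sigma2 = np.linspace(1e-3, 0.99, T)   # entries correspond to t = 1, ..., T
alpha = np.sqrt(1.0 - sigma2)
snr = alpha**2 / sigma2

def transition_constants(i):
    """alpha_{t|t-1} and sigma^2_{t|t-1} for array index i >= 1 (timestep t = i + 1)."""
    a = alpha[i] / alpha[i - 1]
    s2 = sigma2[i] - a**2 * sigma2[i - 1]
    return a, s2

# Start from the marginal q(z_1 | z_0) and push its mean coefficient and variance
# through the one-step transitions q(z_t | z_{t-1}).
mean_coef, var = alpha[0], sigma2[0]
max_err = 0.0
for i in range(1, T):
    a, s2 = transition_constants(i)
    assert s2 > 0  # positive because the SNR is strictly decreasing
    mean_coef, var = a * mean_coef, a**2 * var + s2
    max_err = max(max_err, abs(mean_coef - alpha[i]), abs(var - sigma2[i]))
```

Here `max_err` measures the largest deviation between the propagated moments and $(\alpha_t, \sigma_t^2)$, and should be zero up to floating-point rounding.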

$$\log p_\theta(z_0) \ge E_{q(z_1|z_0)}[\log p_\theta(z_0|z_1)] - D_{\mathrm{KL}}(q(z_T|z_0) \,\|\, p(z_T)) - \sum_{t=2}^{T} E_{q(z_t|z_0)}\left[D_{\mathrm{KL}}(q(z_{t-1}|z_t, z_0) \,\|\, p_\theta(z_{t-1}|z_t))\right]. \tag{4}$$

This objective can be efficiently optimised when the reverse process is Gaussian. This is due to the fact that the posterior distributions $q(z_{t-1}|z_t, z_0) = \mathcal{N}(z_{t-1}; \mu_q(z_t, z_0), \sigma_q^2(t) I)$ are Gaussian, and that the KL divergence between Gaussians can be expressed in closed form. To match these posteriors, the reverse process is then typically defined in terms of a neural network $\mu_\theta : Z \to Z$ as

$$p_\theta(z_{t-1}|z_t) := \mathcal{N}(z_{t-1}; \mu_\theta(z_t), \sigma_q^2(t) I). \tag{5}$$
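For two Gaussians sharing the covariance $\sigma^2 I$, the closed-form KL divergence reduces to $\|\mu_q - \mu_p\|^2 / (2\sigma^2)$. The sketch below, which uses arbitrary illustrative means and a made-up $\sigma$, compares this expression against a Monte Carlo estimate of $E_q[\log q - \log p]$.

```python
import numpy as np

def kl_isotropic_gaussians(mu_q, mu_p, sigma):
    """KL(N(mu_q, sigma^2 I) || N(mu_p, sigma^2 I)) in closed form."""
    return np.sum((mu_q - mu_p) ** 2) / (2.0 * sigma**2)

def gaussian_logpdf(z, mean, sigma):
    """Log-density of N(z; mean, sigma^2 I), evaluated row-wise."""
    d = z.shape[-1]
    sq = np.sum((z - mean) ** 2, axis=-1)
    return -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

rng = np.random.default_rng(0)
mu_q, mu_p, sigma = rng.standard_normal(4), rng.standard_normal(4), 0.5  # illustrative values
closed_form = kl_isotropic_gaussians(mu_q, mu_p, sigma)

# Monte Carlo estimate of E_q[log q(z) - log p(z)] with samples z ~ q.
z = mu_q + sigma * rng.standard_normal((200_000, 4))
monte_carlo = np.mean(gaussian_logpdf(z, mu_q, sigma) - gaussian_logpdf(z, mu_p, sigma))
```

The two quantities agree up to Monte Carlo error, which is why each KL term in equation 4 becomes a squared distance between means when the reverse process has the form in equation 5.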

Invariant and equivariant diffusion. It is often desirable for $p_\theta(z_0)$ to be distributionally invariant with respect to the action of a group $G$. Intuitively, this says that the density $p_\theta(z_0)$ is constant on the orbits of this action. For the model in equation 3, distributional invariance follows if $p_\theta(z_T)$ is itself distributionally invariant, and if each $p_\theta(z_{t-1}|z_t)$ is stochastically $G$-equivariant (Xu et al., 2022). All previous work we are aware of has approached this by obtaining a deterministically equivariant $\mu_\theta$, which then implies that the reverse process is stochastically equivariant (Le et al., 2023).

2.3 N-BODY SYSTEMS AND E(3)-EQUIVARIANT DIFFUSION

Equivariant diffusion models are often applied to model N-body systems such as molecules and proteins (Xu et al., 2022; Hoogeboom et al., 2022; Yim et al., 2023). This is motivated by the large number of symmetries present in these systems. For example, intuitively speaking, neither the coordinate system nor the ordering of bodies in the system should matter for sampling. We describe the standard components of such models below.

N-body data. In three dimensions, the state of an N-body system can be encoded as a pair $z = [x, h] \in \mathbb{R}^{N \times (3+d)}$, where $x \equiv (x^{(1)}, \ldots, x^{(N)}) \in \mathbb{R}^{N \times 3}$ describes a set of $N$ points in 3D space, and $h \equiv (h^{(1)}, \ldots, h^{(N)}) \in \mathbb{R}^{N \times d}$ describes a set of $N$ feature vectors of dimension $d$. Each feature vector $h^{(i)}$ is associated with the point $x^{(i)}$. For example, Hoogeboom et al. (2022) encodes molecules in this way, where each $x^{(i)}$ denotes the location of an atom, and $h^{(i)}$ some corresponding properties such as atom type (represented as continuous quantities).

Center of mass free space. Intuitively speaking, for many applications, the location of an N-body system in space should not matter. For this reason, instead of defining a diffusion on the full space of N-body systems directly, previous work (Garcia Satorras et al., 2021a; Xu et al., 2022; Hoogeboom et al., 2022) has set $Z := U$, where $U$ is the center of mass (CoM) free linear subspace of $\mathbb{R}^{N \times (3+d)}$ consisting of $[x, h]$ such that $\bar{x} := \frac{1}{N} \sum_{i=1}^{N} x^{(i)} = 0$. In this way, samples from their model are always guaranteed to be centered at the origin.

CoM-free diffusions. To construct their forward and reverse processes to now live entirely on $U$ instead of $\mathbb{R}^{N \times (3+d)}$, Xu et al. (2022) defines the projected Gaussian distribution $\mathcal{N}_U(\mu, \sigma^2 I)$, for $\mu \in U$ and $\sigma^2 > 0$, as the distribution of $z$ obtained via the following process:

$$\epsilon \sim \mathcal{N}(0, I), \qquad z := \mu + \sigma\, \mathrm{proj}_U(\epsilon),$$

where $\mathrm{proj}_U$ centers its input in $\mathbb{R}^{N \times (3+d)}$ at the origin, i.e. $\mathrm{proj}_U([x, h]) := [x - (\bar{x}, \ldots, \bar{x}), h]$. By construction, the projected Gaussian distribution is then supported on the linear subspace $U$. Xu et al. (2022) shows that this distribution has a density with Gaussian form $\mathcal{N}_U(z; \mu, \sigma^2 I) \propto \mathcal{N}(z; \mu, \sigma^2 I)$ defined for all $z$ living in the subspace $U$. This allows defining forward and reverse processes $q(z_{t-1}|z_t)$ and $p_\theta(z_{t-1}|z_t)$ with exactly the same form as in Section 2.2 before, but with $\mathcal{N}_U$ used everywhere in place of $\mathcal{N}$. Since these processes are still Gaussian (albeit on a linear subspace), the KL terms in equation 4 remain tractable, which allows optimising the ELBO in the usual way. This approach does require $\mu_\theta : U \to U$ now to be constrained to produce outputs in the subspace $U$. Prior work has achieved this simply by taking $\mu_\theta$ to be a neural network with $\mathrm{proj}_U$ as its final layer.
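A minimal sketch of the projection and of sampling from a projected Gaussian is given below. It assumes coordinates are stored as an array of shape `(N, 3)` and features as an array of shape `(N, d)`; the function names are chosen for illustration only.

```python
import numpy as np

def proj_U(x, h):
    """Centre the coordinates x (shape N x 3) at the origin; features h are left unchanged."""
    return x - x.mean(axis=0, keepdims=True), h

def sample_projected_gaussian(mu_x, mu_h, sigma, rng):
    """Sample z = mu + sigma * proj_U(eps) with eps ~ N(0, I), for mu in U."""
    eps_x, eps_h = proj_U(rng.standard_normal(mu_x.shape), rng.standard_normal(mu_h.shape))
    return mu_x + sigma * eps_x, mu_h + sigma * eps_h

rng = np.random.default_rng(0)
N, d = 5, 2
# Build a mean that lies in U by projecting an arbitrary point.
mu_x, mu_h = proj_U(rng.standard_normal((N, 3)), rng.standard_normal((N, d)))
x, h = sample_projected_gaussian(mu_x, mu_h, 0.3, rng)
centre_of_mass = x.mean(axis=0)   # zero up to rounding, so the sample lies in U
```

Because both the mean and the projected noise have zero centre of mass, every sample stays in the subspace $U$.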

Invariance and equivariance. Intuitively, the ordering of the $N$ points and the orientation of the overall system in 3D space should not matter. To formalise this, let $S_N$ denote the symmetric group of permutations of the integers $\{1, \ldots, N\}$, and $O(3)$ denote the group of orthogonal $3 \times 3$ matrices. Their product $S_N \times O(3)$ acts on N-body systems by reordering and orthogonally transforming points as follows:

$$(\sigma, R) \cdot [x, h] := [R x^{(\sigma(1))}, \ldots, R x^{(\sigma(N))}, h^{(\sigma(1))}, \ldots, h^{(\sigma(N))}], \tag{6}$$

where $\sigma \in S_N$ and $R \in O(3)$. Previous work (Xu et al., 2022; Hoogeboom et al., 2022) has then chosen the model $p_\theta(z_0)$ to be distributionally invariant to this action. They enforce this via the approach described in Section 2.2, by ensuring that $\mu_\theta$ is deterministically $(S_N \times O(3))$-equivariant, which implies that the reverse process is stochastically equivariant also.
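The action in equation 6 can be written compactly with array indexing. The sketch below assumes coordinates are stored as rows of an `(N, 3)` array and permutations as 0-indexed integer arrays (both are conventions chosen here, not taken from the paper). It also checks two consequences: the action maps $U$ to itself, and it permutes the pairwise distance matrix without changing its entries.

```python
import numpy as np

def act(perm, R, x, h):
    """Apply (sigma, R) to [x, h] as in equation 6.

    perm[i] plays the role of sigma(i) (0-indexed), so row i of the output
    coordinates is R x^{(sigma(i))} and row i of the features is h^{(sigma(i))}.
    """
    return x[perm] @ R.T, h[perm]

def pairwise_distances(x):
    """Matrix of Euclidean distances between all pairs of points."""
    return np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

rng = np.random.default_rng(0)
N, d = 6, 3
x = rng.standard_normal((N, 3))
x -= x.mean(axis=0, keepdims=True)          # place the system in the CoM-free space U
h = rng.standard_normal((N, d))

perm = rng.permutation(N)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # some orthogonal matrix

x2, h2 = act(perm, R, x, h)
```

After the call, `x2` still has zero centre of mass and `pairwise_distances(x2)` equals `pairwise_distances(x)` with rows and columns reordered by `perm`.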

Following standard terminology in the literature (Garcia Satorras et al., 2021b; Xu et al., 2022; Hoogeboom et al., 2022), we refer to a $(S_N \times O(3))$-equivariant diffusion defined on the CoM-free space $U$ as an E(3)-equivariant diffusion.

3 EQUIVARIANT DIFFUSION VIA STOCHASTIC SYMMETRISATION

In this section, we introduce SYMDIFF and apply it to the problem of obtaining E(3)-equivariant diffusion models for N-body systems. We also discuss extensions to score and flow-based generative models (Song et al., 2020; Lipman et al., 2022) in Appendix E.

3.1 STOCHASTIC SYMMETRISATION

Recently, Cornish (2024) gave a general theory of neural network symmetrisation in the framework of Markov categories (Fritz, 2020), encompassing earlier approaches to symmetrisation based on averaging or canonicalisation (Murphy et al., 2018; Puny et al., 2021; Kaba et al., 2023; Kim et al., 2023). This theory applies flexibly and compositionally to general groups and actions, including in the non-compact case, and extends to provide a methodology for symmetrising Markov kernels, which had not previously been considered.

In this work, we will make use of a special case of Example 6.3 of Cornish (2024), which we state now before providing intuition. We denote by $H \times G$ the direct product of groups $G$ and $H$. Recall that an action of $H \times G$ on a space $X$ induces actions of both $H$ and $G$ on $X$ also. For example, $H$ acts via $h \cdot x := (h, e_G) \cdot x$, where $e_G$ is the identity element of $G$. We will also say that a Markov kernel $\gamma : X \to G$ is an equivariant base case if it is both $H$-invariant and $G$-equivariant, where $G$ acts on the output space of $\gamma$ by left multiplication, i.e. $g \cdot g' := g g'$. We then have the following result (see Appendix A.1 for a proof).

Theorem 1. Suppose $H \times G$ acts on $X$ and $Y$, and $\gamma : X \to G$ is an equivariant base case. Then every Markov kernel $k : X \to Y$ that is equivariant with respect to the induced action of $H$ gives rise to a Markov kernel $\mathrm{sym}_\gamma(k) : X \to Y$ that is equivariant with respect to $H \times G$, where $\mathrm{sym}_\gamma(k)(dy|x)$ may be sampled from as follows:

$$g \sim \gamma(dg|x), \qquad y \sim k(dy | g^{-1} \cdot x), \qquad \text{return } g \cdot y.$$
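The sampling procedure of Theorem 1 can be sketched in a few lines. The example below is a toy instance under stated assumptions: $H$ is trivial, $G = O(3)$ acts on $X = Y = \mathbb{R}^3$ by matrix multiplication, $\gamma$ is the Haar measure (sampled with a QR-based routine), and $k(dy|x) = \mathcal{N}(Wx + b, \sigma^2 I)$ is a hypothetical kernel that is not equivariant. For this choice, the mean of the symmetrised kernel is $\frac{\mathrm{tr}(W)}{3} x$, which is an equivariant function of $x$ even though $x \mapsto Wx + b$ is not.

```python
import numpy as np

def sample_haar(x, rng):
    """gamma(dg|x): the Haar measure on O(3), ignoring x (QR-based sampler with sign fix)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.diag(r))

def make_kernel(W, b, sigma):
    """A hypothetical non-equivariant kernel k(dy|x) = N(W x + b, sigma^2 I)."""
    def sample_k(x, rng):
        return W @ x + b + sigma * rng.standard_normal(3)
    return sample_k

def sample_sym(x, sample_gamma, sample_k, rng):
    """Draw one sample from sym_gamma(k)(dy|x) following Theorem 1."""
    g = sample_gamma(x, rng)        # g ~ gamma(dg | x)
    y = sample_k(g.T @ x, rng)      # y ~ k(dy | g^{-1} . x), using g^{-1} = g^T
    return g @ y                    # return g . y

rng = np.random.default_rng(0)
W, b = rng.standard_normal((3, 3)), rng.standard_normal(3)
k = make_kernel(W, b, sigma=0.1)
x = np.array([1.0, 0.0, 0.0])

samples = np.stack([sample_sym(x, sample_haar, k, rng) for _ in range(20000)])
empirical_mean = samples.mean(axis=0)
expected_mean = np.trace(W) / 3.0 * x   # mean of the symmetrised kernel under Haar gamma
```

Note that each draw uses a single evaluation of the unsymmetrised kernel; the averaging over $g$ is only used here to inspect the resulting distribution.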

Intuitively, this result allows us to start with a Markov kernel that is equivariant only with respect to $H$, and then upgrade it to become equivariant with respect to both $H$ and $G$. As a special case, if $H$ is the trivial group, then every Markov kernel $k : X \to Y$ is $H$-equivariant. Moreover, in this case $H \times G = G$, and so Theorem 1 gives a procedure for obtaining $G$-equivariant Markov kernels from ones that are completely unconstrained. However, as our N-body example will illustrate, it can often be useful to symmetrise a Markov kernel that is already partially equivariant, which motivates keeping $H$ general here.

Beyond the existence of an equivariant base case, Theorem 1 is completely generic and requires no assumptions on the groups and actions involved. As explained in Section 5.1 of Cornish (2024), this is also the only natural procedure that can be defined in this way without further assumptions.

Recursive symmetrisation. The symmetrisation procedure defined by Theorem 1 requires $\gamma$ already to satisfy two equivariance constraints. In effect, this pushes back the problem of obtaining $(H \times G)$-equivariant Markov kernels to the choice of $\gamma$, which mirrors the situation in the deterministic setting also (Puny et al., 2021; Kim et al., 2023). Whenever $G$ is compact, Example 6.3 of Cornish (2024) gives a suitable choice as $\gamma(dg|x) := \lambda(dg)$, where $\lambda$ denotes the Haar measure on $G$ (Kallenberg, 1997). Other choices could also be made here on a case-by-case basis, such as using intrinsically equivariant neural networks if desired. To obtain greater modelling flexibility, Cornish (2024) also proposes a recursive approach to obtaining $\gamma$. Specifically, the idea is to set

$$\gamma := \mathrm{sym}_{\gamma_0}(\gamma_1), \tag{7}$$

where $\gamma_0, \gamma_1 : X \to G$ are Markov kernels, and $\gamma_0$ is an equivariant base case (e.g. the Haar measure), but where now $\gamma_1$ is only required to be $H$-invariant, and may behave arbitrarily with respect to $G$. We note this recursive approach exploits the stochastic nature of the procedure in Theorem 1 and would not be possible using deterministic symmetrisation methods here instead.
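The recursive construction in equation 7 can be illustrated with a small sketch for $G = O(3)$ and $H = S_N$ acting on point clouds. Everything named here is an assumption made for illustration: `f_theta` is a toy permutation-invariant map into $O(3)$ standing in for $\gamma_1$, and the matrix `A` plays the role of its parameters. The check at the end uses a coupling argument: replacing the Haar sample $R_0$ by $Q R_0$ (which is again Haar distributed) and the input $x$ by $Q \cdot x$ multiplies the output by $Q$, which is the $O(3)$-equivariance required of an equivariant base case.

```python
import numpy as np

def to_orthogonal(m):
    """Orthogonal factor of the QR decomposition, with signs fixed via the diagonal of R."""
    q, r = np.linalg.qr(m)
    return q * np.sign(np.diag(r))

def sample_haar(rng):
    """Haar sample on O(3) (gamma_0)."""
    return to_orthogonal(rng.standard_normal((3, 3)))

def f_theta(x, eta, A):
    """Toy gamma_1: a permutation-invariant map from points x (N x 3) and noise eta to O(3)."""
    pooled = x.T @ np.tanh(x @ A) + eta     # sums over points, so row order does not matter
    return to_orthogonal(pooled)

def gamma(x, R0, eta, A):
    """gamma = sym_{gamma_0}(gamma_1): returns R0 f(R0^T . x, eta) for given R0 and eta."""
    return R0 @ f_theta(x @ R0, eta, A)     # rows of x @ R0 are R0^T x^{(i)}

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))
A, eta = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
R0, Q = sample_haar(rng), sample_haar(rng)

out = gamma(x, R0, eta, A)
out_rotated = gamma(x @ Q.T, Q @ R0, eta, A)       # rotated input, coupled Haar sample
out_permuted = gamma(x[rng.permutation(5)], R0, eta, A)
```

In this sketch `out_rotated` equals `Q @ out` and `out_permuted` equals `out` up to rounding, although `f_theta` itself satisfies no $O(3)$ constraint.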

3.2 SYMDIFF: SYMMETRISED DIFFUSION

We propose to use stochastic symmetrisation to obtain a diffusion process as in Section 2.2 whose reverse kernels are stochastically equivariant. Specifically, suppose some product group $H \times G$ acts on our state space $Z$. For each timestep $t \in \{1, \ldots, T\}$, we will choose some $H$-equivariant Markov kernel $k_\theta : Z \to Z$ that admits a conditional density $k_\theta(z_{t-1}|z_t)$. Similarly, we will choose some Markov kernel $\gamma_\theta : Z \to G$ that satisfies the conditions of Theorem 1 when $X = Z$. With these components, we will define our equivariant reverse process to be

$$p_\theta(z_{t-1}|z_t) := \mathrm{sym}_{\gamma_\theta}(k_\theta)(z_{t-1}|z_t), \tag{8}$$

which is guaranteed to be $(H \times G)$-equivariant by Theorem 1. This defines a conditional density, not just a Markov kernel, as a consequence of the next result. For the proof, see Appendix A.2.

Proposition 1. Assume the same setup as Theorem 1, and for each fixed $g \in G$, let $k(dy|g, x)$ be the distribution of the following generative process:

$$y \sim k(dy | g^{-1} \cdot x), \qquad \text{return } g \cdot y.$$

If $k(dy|x)$ has a density $k(y|x)$, then $k(dy|g, x)$ has a density $k(y|g, x)$, and $\mathrm{sym}_\gamma(k)(dy|x)$ has

$$\mathrm{sym}_\gamma(k)(y|x) = E_{\gamma(dg|x)}[k(y|g, x)]$$

as a density. If the action on $Y$ has unit Jacobian, then we may write $k(y|g, x) = k(g^{-1} \cdot y | g^{-1} \cdot x)$.

Training objective. We would like to learn the parameters $\theta$ using the ELBO from equation 4. However, in general, we do not have access to the densities $p_\theta(z_{t-1}|z_t)$ from equation 8 in closed form, since this requires computing the expectation as in Proposition 1. As such, we cannot compute the ELBO directly. However, since $\log$ is concave, Jensen's inequality allows us to bound $\log p_\theta(z_{t-1}|z_t) \ge E_{\gamma_\theta(dg|z_t)}[\log k_\theta(z_{t-1}|g, z_t)]$. Since the ELBO in equation 4 depends linearly on $\log p_\theta(z_{t-1}|z_t)$, this allows us to also bound

$$\log p_\theta(z_0) \ge \overbrace{E_{q(z_1|z_0), \gamma_\theta(dg|z_1)}[\log k_\theta(z_0|g, z_1)]}^{L_1} - D_{\mathrm{KL}}(q(z_T|z_0) \,\|\, p(z_T)) - \sum_{t=2}^{T} \overbrace{E_{q(z_t|z_0), \gamma_\theta(dg|z_t)}\left[D_{\mathrm{KL}}(q(z_{t-1}|z_t, z_0) \,\|\, k_\theta(z_{t-1}|g, z_t))\right]}^{L_t}, \tag{9}$$
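The Jensen step can be seen numerically on a toy kernel. The sketch below assumes $G = O(3)$ acting on $\mathbb{R}^3$, takes $\gamma$ to be the Haar measure, and uses a hypothetical unsymmetrised kernel $k(y|x) = \mathcal{N}(y; Wx, \sigma^2 I)$; it compares a Monte Carlo estimate of $\log E_\gamma[k(y|g, x)]$ with the corresponding estimate of $E_\gamma[\log k(y|g, x)]$.

```python
import numpy as np

def sample_haar(rng):
    """Haar sample on O(3) via QR with sign correction."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.diag(r))

def log_k(y, x, W, sigma):
    """Log-density of a hypothetical kernel k(y|x) = N(y; W x, sigma^2 I) on R^3."""
    return -0.5 * np.sum((y - W @ x) ** 2) / sigma**2 - 1.5 * np.log(2 * np.pi * sigma**2)

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
x, y, sigma = rng.standard_normal(3), rng.standard_normal(3), 1.0

# log k(y | g, x) = log k(g^{-1} . y | g^{-1} . x), the unit-Jacobian form from Proposition 1.
rotations = [sample_haar(rng) for _ in range(2000)]
log_terms = np.array([log_k(g.T @ y, g.T @ x, W, sigma) for g in rotations])

# Estimate of log sym(k)(y|x) = log E_gamma[k(y|g, x)], computed in a numerically stable way.
m = log_terms.max()
log_density_estimate = m + np.log(np.mean(np.exp(log_terms - m)))

# Estimate of the Jensen lower bound E_gamma[log k(y|g, x)].
jensen_bound = log_terms.mean()
```

The bound is never larger than the log-density estimate, and the gap closes when $k(y|g, x)$ does not vary with $g$, which is the situation an expressive model can approach.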

where the right-hand side is a tractable lower bound to the original ELBO in equation 4. We take this new bound as our objective used to train our SYMDIFF model. By a similar argument as in Remark 7.1 of Cornish (2024), if the model is sufficiently expressive, then optimising this new bound is equivalent to optimising the original ELBO (see Appendix D).

Comparison with deterministic symmetrisation An alternative approach to equation 8 would be to obtain µθ in equation 5 via deterministic symmetrisation. However, many deterministic techniques (Puny et al., 2021; Kim et al., 2023) involve a Monte Carlo averaging step that requires multiple passes through the model instead, and then are only approximate, which would introduce sampling bias here. In contrast, to sample from our method requires only a single pass through γθ and kθ, and involves no bias. The canonicalisation method of Kaba et al. (2023) also avoids this averaging step, but instead suffers from pathologies associated with its analogue of the equivariant base case γθ, which Dym et al. (2024) show must become discontinuous at certain inputs. Additionally, canonicalisation requires an intrinsically G-equivariant architecture for its analogue of γθ. In contrast, whenever G is compact, our stochastic approach allows γθ to be obtained using the Haar measure in a way that does not suffer these pathologies, and may be implemented without any intrinsically G-equivariant neural network components at all, as we show concretely next.

3.3 SYMDIFF FOR N-BODY SYSTEMS

We now apply SYMDIFF in the setting of N-body systems considered in Section 2.3. Specifically, we take $Z := U$, $H := S_N$, and $G := O(3)$, and consider the action on $Z$ defined in equation 6. This means that we start with $k_\theta(z_{t-1}|z_t)$ in equation 8 that is already equivariant with respect to reorderings of the $N$ bodies, and then symmetrise this to obtain an $(S_N \times O(3))$-equivariant reverse kernel overall. We choose to symmetrise in this way because highly scalable $S_N$-equivariant kernels based on Transformer architectures can be readily constructed for this purpose (Vaswani et al., 2017; Lee et al., 2019; Peebles & Xie, 2023), whereas intrinsically $O(3)$-equivariant neural networks have not shown the same degree of scalability to date (Abramson et al., 2024).

Choice of unsymmetrised kernels. It remains now to choose $k_\theta(z_{t-1}|z_t)$. We do so in a way that makes equation 9 resemble the standard diffusion objective in Ho et al. (2020), allowing for the scalable training of SYMDIFF. Specifically, we take

$$k_\theta(z_{t-1}|z_t) := \mathcal{N}_U(z_{t-1}; \mu_\theta(z_t), \sigma_q^2(t) I), \tag{10}$$

where $\mu_\theta : Z \to Z$³ is an arbitrary $S_N$-equivariant neural network, which in turn means $k_\theta$ is stochastically $S_N$-equivariant. We highlight that $\mu_\theta$ is otherwise unconstrained and can process $[x, h]$ jointly. In contrast, previous work using intrinsically $(S_N \times O(3))$-equivariant components (Satorras et al., 2021; Thölke & De Fabritiis, 2022; Hua et al., 2024) has required complex parameterisations for $\mu_\theta$ that handle the $x$ and $h$ inputs separately.

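Form of KL terms. We now show how our model yields a closed-form expression for the $L_t$ terms in equation 9. Standard arguments show that each $q(z_{t-1}|z_t, z_0) = \mathcal{N}_U(z_{t-1}; \mu_q(z_t, z_0), \sigma_q^2(t) I)$ is a (projected) Gaussian. We claim that our model also gives a (projected) Gaussian

$$k_\theta(z_{t-1}|R, z_t) = \mathcal{N}_U(z_{t-1}; R \cdot \mu_\theta(R^T \cdot z_t), \sigma_q^2(t) I). \tag{11}$$

Indeed, by the definition of this kernel in Proposition 1 and the definition of $\mathcal{N}_U$ in Section 2.3, and since $R^{-1} = R^T$ for $R \in O(3)$, for $z_{t-1} \sim k_\theta(z_{t-1}|R, z_t)$ we have

$$z_{t-1} \overset{d}{=} R \cdot \left(\mu_\theta(R^T \cdot z_t) + \sigma_q(t)\, \epsilon\right) = R \cdot \mu_\theta(R^T \cdot z_t) + \sigma_q(t)\, R \cdot \epsilon,$$

where $\epsilon \sim \mathcal{N}_U(0, I)$. Since $R \cdot \epsilon \overset{d}{=} \epsilon$ for $R \in O(3)$, equation 11 now follows. The same argument as Hoogeboom et al. (2022) now yields the closed-form expression

$$L_t = E_{q(z_t|z_0), \gamma_\theta(dR|z_t)}\left[\frac{1}{2\sigma_q^2(t)} \left\| \mu_q(z_t, z_0) - R \cdot \mu_\theta(R^T \cdot z_t) \right\|^2\right]. \tag{12}$$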

³ We leave the dependence on $t$ implicit in our notation throughout.

We can obtain unbiased gradients of this quantity whenever $\gamma_\theta(dR|z_t)$ is reparametrisable (Kingma, 2013). In other words, we should define $\gamma_\theta(dR|z_t)$ to be the distribution of $\varphi_\theta(z_t, \xi)$, where $\varphi_\theta$ is a deterministic neural network, and $\xi$ is some noise variable whose distribution does not depend on $\theta$. For a discussion of how we can handle the $L_1$ term in equation 9 in our framework, we refer to Appendix C.1.

ϵ-parameterisation. When $\mu_\theta$ is taken to have the ϵ-form (Ho et al., 2020; Kingma et al., 2021)

$$\mu_\theta(z_t) := \frac{1}{\alpha_{t|t-1}} z_t - \frac{\sigma_{t|t-1}^2}{\alpha_{t|t-1} \sigma_t} \epsilon_\theta(z_t), \tag{13}$$

for some neural network $\epsilon_\theta : Z \to Z$, the same argument given by Ho et al. (2020) now allows us to rewrite equation 12 in a way that resembles the standard diffusion objective:

$$L_t = E_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \gamma_\theta(dR|z_t)}\left[\frac{1}{2} w(t) \left\| \epsilon - R \cdot \epsilon_\theta(R^T \cdot z_t) \right\|^2\right], \tag{14}$$

where $z_t = \alpha_t z_0 + \sigma_t \epsilon$, and $w(t) = (1 - \mathrm{SNR}(t-1)/\mathrm{SNR}(t))$. Recall we require $\mu_\theta$ to be $S_N$-equivariant, which straightforwardly follows here whenever $\epsilon_\theta$ is. In practice, we set $w(t) = 1$ during training as is commonly done in the diffusion literature (Kingma & Gao, 2024).

Recursive choice of $\gamma_\theta$. To apply Theorem 1, we require an equivariant base case $\gamma_\theta$ that is $S_N$-invariant and $O(3)$-equivariant. To obtain this, we apply the recursive procedure from equation 7, where $\gamma_0 : Z \to O(3)$ is obtained using the Haar measure on $O(3)$ as described there, and $\gamma_1(dR|z_t)$ is defined as the distribution of $f_\theta(z_t, \eta)$, where $\eta$ is sampled from some noise distribution $\nu(d\eta)$ and $f_\theta(\cdot, \eta) : Z \to O(3)$ is an $S_N$-invariant neural network for each fixed value of $\eta$. Neither the Haar measure nor $\nu$ depends on $\theta$, and so the overall $\gamma_\theta$ obtained in this way is always reparametrisable by construction. We emphasise that $f_\theta$ is not required to be $O(3)$-equivariant in any sense, thus allowing for highly flexible choices such as Set Transformers (Lee et al., 2019). At sampling time, we use the procedure from Section 5 of Mezzadri (2006) to sample from the Haar measure on $O(3)$, which is a negligible overhead compared with the cost of evaluating $f_\theta$.
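A common way to realise a QR-based Haar sampler of the kind described by Mezzadri (2006) is sketched below: factorise a matrix of independent standard Gaussians and correct the signs of the columns using the diagonal of the triangular factor. The function name is chosen for illustration, and the sketch is not taken from the authors' code.

```python
import numpy as np

def sample_haar_orthogonal(rng, n=3):
    """Draw an n x n orthogonal matrix from the Haar measure on O(n)."""
    z = rng.standard_normal((n, n))
    q, r = np.linalg.qr(z)
    # Without this sign correction the output depends on the QR convention
    # of the underlying routine and is not Haar distributed in general.
    return q * np.sign(np.diag(r))   # multiplies column j of q by sign(r[j, j])

rng = np.random.default_rng(0)
samples = [sample_haar_orthogonal(rng) for _ in range(200)]
dets = np.array([np.linalg.det(R) for R in samples])
```

Since the samples cover all of $O(3)$, both rotations (determinant $+1$) and reflections (determinant $-1$) appear among them.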

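Algorithm 1 SYMDIFF training step

1: Sample $z_0 \sim p_{\mathrm{data}}(z_0)$, $t \sim \mathrm{Unif}(\{1, \ldots, T\})$ and $\epsilon \sim \mathcal{N}_U(0, I)$
2: $z_t \leftarrow \alpha_t z_0 + \sigma_t \epsilon$
3: Sample $R_0$ from the Haar measure on $O(3)$ and $\eta \sim \nu(d\eta)$
4: $R \leftarrow R_0 \, f_\theta(R_0^T \cdot z_t, \eta)$
5: Take gradient descent step with $\nabla_\theta \frac{1}{2} w(t) \left\| \epsilon - R \cdot \epsilon_\theta(R^T \cdot z_t) \right\|^2$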
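The forward computation of one training step can be sketched as follows. This is an illustration under several assumptions rather than the authors' implementation: `f_theta` and `eps_theta` are toy stand-ins for the neural networks (built to be $S_N$-invariant and $S_N$-equivariant respectively), the parameter dictionary and the schedule values are made up, $w(t) = 1$ is used, and only the loss is computed, since taking gradients would require an automatic differentiation framework.

```python
import numpy as np

def proj_U(x):
    """Remove the centre of mass from coordinates x of shape (N, 3)."""
    return x - x.mean(axis=0, keepdims=True)

def to_orthogonal(m):
    """Orthogonal QR factor with sign correction (Haar distributed for Gaussian input)."""
    q, r = np.linalg.qr(m)
    return q * np.sign(np.diag(r))

def f_theta(x, h, eta, params):
    """Toy S_N-invariant map into O(3); hypothetical stand-in for the network f_theta."""
    pooled = x.T @ np.tanh(x @ params["A"]) + h.mean() * np.eye(3) + eta
    return to_orthogonal(pooled)

def eps_theta(x, h, params):
    """Toy S_N-equivariant noise predictor; hypothetical stand-in for eps_theta."""
    out_x = proj_U(np.tanh(x @ params["Wxx"]) + h @ params["Whx"])
    out_h = np.tanh(h @ params["Whh"] + x @ params["Wxh"])
    return out_x, out_h

def training_loss(x0, h0, alpha_t, sigma_t, params, rng):
    """Steps 1-5 of Algorithm 1 without the gradient update, with w(t) = 1."""
    eps_x = proj_U(rng.standard_normal(x0.shape))      # eps ~ N_U(0, I)
    eps_h = rng.standard_normal(h0.shape)
    xt, ht = alpha_t * x0 + sigma_t * eps_x, alpha_t * h0 + sigma_t * eps_h
    R0 = to_orthogonal(rng.standard_normal((3, 3)))    # Haar sample on O(3)
    eta = rng.standard_normal((3, 3))                  # eta ~ nu (assumed Gaussian here)
    R = R0 @ f_theta(xt @ R0, ht, eta, params)         # R = R0 f_theta(R0^T . z_t, eta)
    pred_x, pred_h = eps_theta(xt @ R, ht, params)     # eps_theta(R^T . z_t)
    res_x = eps_x - pred_x @ R.T                       # eps - R . eps_theta(R^T . z_t)
    res_h = eps_h - pred_h                             # R acts trivially on the features
    return 0.5 * (np.sum(res_x**2) + np.sum(res_h**2)), R

rng = np.random.default_rng(0)
N, d = 5, 2
params = {"A": rng.standard_normal((3, 3)), "Wxx": rng.standard_normal((3, 3)),
          "Whx": rng.standard_normal((d, 3)), "Whh": rng.standard_normal((d, d)),
          "Wxh": rng.standard_normal((3, d))}
x0, h0 = proj_U(rng.standard_normal((N, 3))), rng.standard_normal((N, d))
loss, R = training_loss(x0, h0, alpha_t=0.8, sigma_t=0.6, params=params, rng=rng)
```

Rows of `x` store the points, so the action $R^T \cdot z_t$ on coordinates is written `xt @ R` and the action of $R$ on the prediction is `pred_x @ R.T`.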

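Algorithm 2 SYMDIFF sampling process

1: Sample $z_T \sim \mathcal{N}_U(0, I)$
2: for $t = T, \ldots, 2$ do
3: Sample $R_0$ from the Haar measure on $O(3)$, $\eta \sim \nu(d\eta)$, and $\epsilon \sim \mathcal{N}_U(0, I)$
4: $R \leftarrow R_0 \, f_\theta(R_0^T \cdot z_t, \eta)$
5: $z_{t-1} \leftarrow \frac{1}{\alpha_{t|t-1}} z_t - \frac{\sigma_{t|t-1}^2}{\alpha_{t|t-1} \sigma_t} R \cdot \epsilon_\theta(R^T \cdot z_t) + \sigma_q(t)\, \epsilon$
6: return $z_0 \sim p_\theta(z_0|z_1)$ (see Appendix C.1 for an example of this output kernel)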
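The loop in Algorithm 2 can be sketched as below. As before, the networks are toy stand-ins and the schedule is a made-up variance-preserving one. Two further assumptions are labelled in the code: the posterior variance is taken to be $\sigma_q^2(t) = \sigma_{t|t-1}^2 \sigma_{t-1}^2 / \sigma_t^2$, the standard expression for Gaussian diffusions, and the final output kernel $p_\theta(z_0|z_1)$ is omitted because it is specified in Appendix C.1, so the sketch returns $z_1$.

```python
import numpy as np

def proj_U(x):
    """Remove the centre of mass from coordinates x of shape (N, 3)."""
    return x - x.mean(axis=0, keepdims=True)

def to_orthogonal(m):
    """Orthogonal QR factor with sign correction (Haar distributed for Gaussian input)."""
    q, r = np.linalg.qr(m)
    return q * np.sign(np.diag(r))

def f_theta(x, h, eta, params):
    """Toy S_N-invariant map into O(3); hypothetical stand-in for f_theta."""
    return to_orthogonal(x.T @ np.tanh(x @ params["A"]) + h.mean() * np.eye(3) + eta)

def eps_theta(x, h, params):
    """Toy S_N-equivariant noise predictor; hypothetical stand-in for eps_theta."""
    out_x = proj_U(np.tanh(x @ params["Wxx"]) + h @ params["Whx"])
    out_h = np.tanh(h @ params["Whh"] + x @ params["Wxh"])
    return out_x, out_h

def sample_z1(N, d, alpha, sigma2, params, rng):
    """Steps 1-5 of Algorithm 2; returns z_1 (the output kernel for z_0 is not sketched)."""
    T = len(alpha) - 1                                   # arrays are indexed by t = 0, ..., T
    x, h = proj_U(rng.standard_normal((N, 3))), rng.standard_normal((N, d))   # z_T ~ N_U(0, I)
    for t in range(T, 1, -1):
        a = alpha[t] / alpha[t - 1]                      # alpha_{t|t-1}
        s2 = sigma2[t] - a**2 * sigma2[t - 1]            # sigma^2_{t|t-1}
        sigma_q = np.sqrt(s2 * sigma2[t - 1] / sigma2[t])  # assumed standard posterior std
        R0 = to_orthogonal(rng.standard_normal((3, 3)))  # Haar sample
        eta = rng.standard_normal((3, 3))                # eta ~ nu (assumed Gaussian here)
        R = R0 @ f_theta(x @ R0, h, eta, params)         # R = R0 f_theta(R0^T . z_t, eta)
        pred_x, pred_h = eps_theta(x @ R, h, params)     # eps_theta(R^T . z_t)
        c = s2 / (a * np.sqrt(sigma2[t]))
        x = x / a - c * (pred_x @ R.T) + sigma_q * proj_U(rng.standard_normal((N, 3)))
        h = h / a - c * pred_h + sigma_q * rng.standard_normal((N, d))
    return x, h

rng = np.random.default_rng(0)
N, d, T = 5, 2, 20
sigma2 = np.linspace(1e-3, 0.99, T + 1)      # hypothetical schedule, SNR decreasing in t
alpha = np.sqrt(1.0 - sigma2)
params = {"A": rng.standard_normal((3, 3)), "Wxx": rng.standard_normal((3, 3)),
          "Whx": rng.standard_normal((d, 3)), "Whh": rng.standard_normal((d, d)),
          "Wxh": rng.standard_normal((3, d))}
x1, h1 = sample_z1(N, d, alpha, sigma2, params, rng)
```

Each reverse step costs a single evaluation of `f_theta` and `eps_theta`, and every iterate keeps a zero centre of mass because all three terms of the update lie in $U$.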

3.4 DATA AUGMENTATION IS A SPECIAL CASE OF SYMDIFF

Data augmentation is a popular method for incorporating soft inductive biases within neural networks. Consider again our setup from the previous section, but now using the unsymmetrised kernels $k_\theta(z_{t-1}|z_t)$ from equation 10 in place of $p_\theta(z_{t-1}|z_t)$ in our backwards process. Suppose this model is trained using the standard diffusion objective, applying a uniform random orthogonal transformation to the input of $\epsilon_\theta$ before each forward pass. This is equivalent to optimising the following objective:

$$L_t^{\mathrm{aug}} = E_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \lambda(dR)}\left[\frac{1}{2} w(t) \left\| \epsilon - \epsilon_\theta(\alpha_t R \cdot z_0 + \sigma_t \epsilon) \right\|^2\right], \tag{15}$$

where λ is the Haar measure on O(3). (More general choices of λ could also be considered.) We then have the following result, proven in Appendix A.3.

Proposition 2. When $\gamma_\theta(dR|z_t) = \lambda(dR)$ for all $z_t \in Z$, our SYMDIFF objective in equation 14 recovers the data augmentation objective exactly, so that $L_t = L_t^{\mathrm{aug}}$.
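One way to see why this holds is the following short sketch, which is not a substitute for the proof in Appendix A.3. It relies on three facts: the action of $R \in O(3)$ is linear and preserves the norm, $\mathcal{N}_U(0, I)$ is invariant under this action, and the Haar measure $\lambda$ is invariant under inversion. The symbols $\epsilon' := R^T \cdot \epsilon$ and $R' := R^T$ are introduced here only for the change of variables.

```latex
\begin{aligned}
L_t
&= E_{q(z_0),\,\epsilon,\,\lambda(dR)}\left[\tfrac{1}{2} w(t)
   \left\| \epsilon - R \cdot \epsilon_\theta\!\left(R^T \cdot (\alpha_t z_0 + \sigma_t \epsilon)\right) \right\|^2\right] \\
&= E_{q(z_0),\,\epsilon,\,\lambda(dR)}\left[\tfrac{1}{2} w(t)
   \left\| R^T \cdot \epsilon - \epsilon_\theta\!\left(\alpha_t R^T \cdot z_0 + \sigma_t R^T \cdot \epsilon\right) \right\|^2\right] \\
&= E_{q(z_0),\,\epsilon' \sim \mathcal{N}_U(0, I),\,\lambda(dR')}\left[\tfrac{1}{2} w(t)
   \left\| \epsilon' - \epsilon_\theta\!\left(\alpha_t R' \cdot z_0 + \sigma_t \epsilon'\right) \right\|^2\right]
 = L_t^{\mathrm{aug}}.
\end{aligned}
```

The second line applies $R^T$ inside the norm, and the third uses that, for each fixed $R$, $\epsilon'$ is again distributed as $\mathcal{N}_U(0, I)$ (and hence independent of $R'$), while $R'$ is again Haar distributed.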

As a result, SYMDIFF may be understood as a strict generalisation of data augmentation in which a more flexible augmentation process γθ is learned during training. Moreover, our γθ is then also deployed at sampling time in a way that guarantees stochastic equivariance. In contrast, data augmentation is usually only applied during training, and the model deployed at sampling time then becomes only approximately equivariant.

4 EXPERIMENTS

We evaluated our E(3)-equivariant SYMDIFF model as a drop-in replacement for the E(3)-equivariant diffusion (EDM) of Hoogeboom et al. (2022) on both the QM9 and GEOM-Drugs datasets for molecular generation. We implemented this within the official codebase of Hoogeboom et al. (2022),⁴ substituting our symmetrised reverse process for theirs. We otherwise made minimal changes to their code and experimental setup, and performed minimal tuning of our architecture. We found SYMDIFF led to significantly improved performance on both tasks. Our results were also on par with or better than GeoLDM (Xu et al., 2023), END (Cornet et al., 2024), and MUDiff (Hua et al., 2024), which bake in more sophisticated inductive biases related to molecular generation than EDM does. Our models were also more computationally efficient than these baselines, which all rely on intrinsically equivariant subcomponents. We give an overview of our setup now and provide full details in Appendix C. Our code is available at: https://github.com/leozhangML/SymDiff.

Dataset. QM9 (Ramakrishnan et al., 2014) is a common benchmark dataset used for evaluating molecular generation. It consists of molecular properties and atom coordinates for 130k small molecules with up to 9 heavy atoms and a total of 29 atoms including hydrogen. For our experiments, we trained our SYMDIFF method to generate molecules with 3D coordinates, atom types, and atom charges, where we explicitly modeled hydrogen atoms. We used the same train-val-test split of 100K-8K-13K as in Anderson et al. (2019).

Our model. We took our core model (SymDiff) to be $\mathrm{sym}_{\gamma_\theta}(k_\theta)(z_{t-1}|z_t)$ from equation 8 above, with $k_\theta$ and $\gamma_\theta$ given as specified in Section 3.3. We chose the backbone neural network $\epsilon_\theta$ to be a Diffusion Transformer (DiT) (Peebles & Xie, 2023), which is $S_N$-equivariant by construction. Likewise, we chose the backbone neural network $f_\theta$ of the component $\gamma_\theta$ to be a DiT. We made this component $S_N$-invariant using a Set Transformer (Lee et al., 2019) approach, thereby achieving the requirements described in Section 3.3. Our $\epsilon_\theta$ had 29M parameters, matching the smallest model considered by Peebles & Xie (2023), while our $f_\theta$ had 2.2M parameters. In this way, our $\gamma_\theta$ was much smaller than our $k_\theta$, following a similar approach taken by earlier work on deterministic symmetrisation (Kim et al., 2023; Kaba et al., 2023). Overall, our model had 31.2M parameters in total. To test its scalability, we also trained a larger version of our method with a backbone $k_\theta$ having 115.6M parameters (SymDiff*), which matched the DiT-B model from Peebles & Xie (2023). We trained all our models for 4350 epochs to match the same number of gradient steps as Hoogeboom et al. (2022). For further details about our architecture, see Appendix B.

Metrics. To measure the quality of generated molecules, we follow standard practice (Hoogeboom et al., 2022; Garcia Satorras et al., 2021a) and report atom stability, molecular stability, validity and uniqueness. We exclude results for the novelty metric for the same reasons as discussed in Vignac & Frossard (2021) and refer the reader to these works for a more extensive discussion of these metrics. For all of our metrics, we used 10,000 samples and report the mean and standard deviation over three evaluation runs. To demonstrate the efficiency of our approach, we also report the number of seconds per epoch, time taken to generate one sample and vRAM.

⁴ https://github.com/ehoogeboom/e3_diffusion_for_molecules

Baselines. As a baseline, we trained the 5.3M parameter EDM model using the original experimental setup of Hoogeboom et al. (2022). We also trained an unsymmetrised reverse process with a 29M parameter DiT backbone (DiT), as well as the same model using data augmentation as described in Section 3.4 (DiT-Aug). Additionally, we trained a SYMDIFF model with the same backbone $k_\theta$ as our SymDiff model above, but whose $\gamma_\theta$ was obtained using the Haar measure on $O(3)$, rather than learning this component (SymDiff-H).

Results From Table 1, we see that our SYMDIFF models comfortably outperformed EDM on all metrics, bar uniqueness. Additionally, our model was also competitive with the more recent, sophisticated baselines from the literature, outperforming all of them on validity. We attribute the improved performance of our method to the extra architectural flexibility provided by our approach to symmetrisation. Our largest model, the 117.8M-parameter SymDiff, outperformed all our baselines on atom stability and validity, and is within variance for molecular stability. We conjecture that similar performance improvements could be achieved by using our SYMDIFF approach as a drop-in replacement for the reverse kernels in more sophisticated methods. Table 1 also shows that DiT-Aug performed notably better than the DiT model on all metrics, highlighting its strength as a baseline. Despite this, our SymDiff model outperformed both SymDiff-H and DiT-Aug on all metrics apart from uniqueness. This shows that our approach has benefits that extend beyond merely performing data augmentation.

Computational efficiency Importantly, as Table 2 shows, our method was also more computationally efficient than the alternative methods we considered in terms of seconds/epoch, sampling time and vRAM. This is not surprising since these alternative models rely on intrinsically equivariant graph neural networks that use message passing during training and inference, which is computationally very costly. In contrast, our symmetrisation approach allows us to use computationally efficient DiT components that parallelise and scale much more effectively (Fei et al., 2024).

Table 1: Test NLL, atom stability, molecular stability, validity and uniqueness on QM9 for 10,000 samples and 3 evaluation runs. We omit the results for NLL where not available.

| Method | NLL | Atm. stability (%) | Mol. stability (%) | Val. (%) | Uniq. (%) |
|---|---|---|---|---|---|
| GeoLDM | | 98.90 ± 0.10 | 89.40 ± 0.50 | 93.80 ± 0.40 | 92.70 ± 0.50 |
| MUDiff | -135.50 ± 2.10 | 98.80 ± 0.20 | 89.90 ± 1.10 | 95.30 ± 1.50 | 99.10 ± 0.50 |
| END | | 98.90 ± 0.00 | 89.10 ± 0.10 | 94.80 ± 0.10 | 92.60 ± 0.20 |
| EDM | -110.70 ± 1.50 | 98.70 ± 0.10 | 82.00 ± 0.40 | 91.90 ± 0.50 | 90.70 ± 0.60 |
| SymDiff (117.8M) | -133.79 ± 1.33 | 98.92 ± 0.03 | 89.65 ± 0.10 | 96.36 ± 0.27 | 97.66 ± 0.22 |
| SymDiff (31.2M) | -129.35 ± 1.07 | 98.74 ± 0.03 | 87.49 ± 0.23 | 95.75 ± 0.10 | 97.89 ± 0.26 |
| SymDiff-H | -126.53 ± 0.90 | 98.57 ± 0.07 | 85.51 ± 0.18 | 95.22 ± 0.18 | 97.98 ± 0.09 |
| DiT-Aug | -126.81 ± 1.69 | 98.64 ± 0.03 | 85.85 ± 0.24 | 95.10 ± 0.17 | 97.98 ± 0.08 |
| DiT | -127.78 ± 2.49 | 98.23 ± 0.04 | 81.03 ± 0.25 | 94.71 ± 0.31 | 97.98 ± 0.12 |
| Data | | 99.00 | 95.20 | 97.8 | 100 |

Table 2: Seconds per epoch, sampling time and vRAM for SymDiff and our baselines on QM9. Results for END are omitted as their code was not publicly available.

| Method | # Parameters | Sec./epoch (s) | Sampling time (s) | vRAM (GB) |
|---|---|---|---|---|
| GeoLDM | 11.4M | 210.93 | 0.26 | 27 |
| MUDiff | 9.7M | 230.87 | 0.89 | 36 |
| END | 9.4M | | | |
| EDM | 5.4M | 88.80 | 0.27 | 14 |
| SymDiff | 117.8M | 53.40 | 0.21 | 16 |
| SymDiff | 31.2M | 27.20 | 0.09 | 7 |

Ablations As an ablation study, we also tested the effect of making SYMDIFF smaller, and EDM larger. For SymDiff, we trained two smaller models with 23.5M and 13.5M parameters respectively. For EDM, we trained two additional models with 9.5M (EDM+) and 12.4M (EDM++) parameters respectively. For full details see Appendix C.2.1. From Table 3, we see that even our smaller SymDiff models remained competitive. In particular, our 23.5M-parameter SymDiff model gave molecular stability comparable to that of the second largest EDM model, EDM+, while being more than 5 times faster in terms of seconds/epoch.

Table 3: NLL, molecular stability, seconds per epoch, sampling time and vRAM for different sizes of SymDiff and EDM on QM9. For additional performance metrics see Appendix C.2.

| Method | NLL | Mol. stability (%) | Sec./epoch (s) | Sampling time (s) | vRAM (GB) |
|---|---|---|---|---|---|
| EDM++ | -119.12 ± 1.41 | 85.68 ± 0.83 | 160.60 | 0.56 | 23 |
| EDM+ | -110.97 ± 1.42 | 84.63 ± 0.16 | 192.60 | 0.46 | 23 |
| EDM | -110.70 ± 1.50 | 82.00 ± 0.40 | 88.80 | 0.27 | 14 |
| SymDiff (31.2M) | -129.35 ± 1.07 | 87.49 ± 0.23 | 27.20 | 0.09 | 7 |
| SymDiff (23.5M) | -125.40 ± 0.63 | 83.51 ± 0.24 | 24.87 | 0.08 | 6 |
| SymDiff (13.5M) | -110.68 ± 2.55 | 71.25 ± 0.50 | 20.60 | 0.07 | 5 |

4.2 GEOM-DRUGS

Dataset, model and training GEOM-Drugs (Axelrod & Gomez-Bombarelli, 2022) is a larger and more complicated dataset than QM9, containing 430,000 molecules with up to 181 atoms. We processed the dataset in the same way as Hoogeboom et al. (2022), where we again model hydrogen explicitly. We used the SymDiff model from earlier, which we trained for 55 epochs to match the number of gradient steps used by Hoogeboom et al. (2022).

Metrics and baselines We report the same metrics as for QM9 but exclude molecular stability and uniqueness for the same reasons discussed in Hoogeboom et al. (2022). We compared our method to the EDM model used by Hoogeboom et al. (2022) for GEOM-Drugs, as well as the other baseline architectures reported for QM9.

Results From Table 4 we again see that our approach comfortably outperformed its EDM counterpart. It is also again competitive with the more sophisticated baselines, whose reported results we restate here. As with QM9, our SymDiff models were significantly less costly in terms of compute time and memory usage compared with EDM (see Appendix C.3). In fact, when we tried to run the EDM model, it resulted in out-of-memory errors on our NVIDIA H100 80GB GPU. (Hoogeboom et al. (2022) avoid this by training EDM on 3 NVIDIA RTX A6000 48GB GPUs.)

Table 4: Test NLL, atom stability and validity on GEOM-Drugs for 10,000 samples and 3 evaluation runs. GeoLDM and EDM report results from a single evaluation run. We omit the results for NLL and validity where not available.

| Method | NLL | Atm. stability (%) | Val. (%) |
|---|---|---|---|
| GeoLDM | | 84.4 | 99.3 |
| END | | 87.8 ± 0.99 | 92.9 ± 0.3 |
| EDM | -137.1 | 81.3 | |
| SymDiff | -301.21 ± 0.53 | 86.16 ± 0.05 | 99.27 ± 0.1 |
| Data | | 86.50 | 99.9 |

5 CONCLUSION

We have introduced SYMDIFF: a lightweight and scalable framework for constructing equivariant diffusion models based on stochastic symmetrisation. We applied this approach to E(3)-equivariance for N-body data, obtaining an overall model that is stochastically equivariant but that does not rely on any intrinsically equivariant neural network subcomponents. Our approach leads to significantly greater modelling flexibility, which allows leveraging powerful off-the-shelf architectures such as Transformers (Vaswani et al., 2017). We showed empirically that this leads overall to improved performance on several relevant benchmarks.


ACKNOWLEDGMENTS

The authors are grateful to Tom Rainforth, Emile Mathieu, Saifuddin Syed and Ahmed Elhag for helpful discussions. LZ and KA are supported by the EPSRC CDT in Modern Statistics and Statistical Machine Learning (EP/S023151/1).

REFERENCES

Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J Ballard, Joshua Bambrick, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature, pp. 1–3, 2024.

Brandon Anderson, Truong Son Hy, and Risi Kondor. Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.

Simon Axelrod and Rafael Gomez-Bombarelli. Geom, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):185, 2022.

Simon Batzner, Albert Musaelian, Lixin Sun, Mario Geiger, Jonathan P Mailoa, Mordechai Kornbluth, Nicola Molinari, Tess E Smidt, and Boris Kozinsky. E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials. Nature Communications, 13(1):2453, 2022.

Benjamin Bloem-Reddy and Yee Whye Teh. Probabilistic symmetries and invariant neural networks. Journal of Machine Learning Research, 21(90):1–61, 2020.

Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021.

Diego Canez, Nesta Midavaine, and Thijs Stessen. Effect of equivariance on training dynamics. https://gram-blogposts.github.io/blog/2024/relaxed-equivariance/, 2024. Accessed: 2024-09-30.

Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, and David K Duvenaud. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.

François RJ Cornet, Grigory Bartosh, Mikkel N Schmidt, and Christian A Naesseth. Equivariant neural diffusion for molecule generation. In ICML 2024 AI for Science Workshop, 2024.

Rob Cornish. Stochastic Neural Network Symmetrisation in Markov Categories. arXiv preprint arXiv:2406.11814, 2024.

Alexandre Duval, Simon V Mathis, Chaitanya K Joshi, Victor Schmidt, Santiago Miret, Fragkiskos D Malliaros, Taco Cohen, Pietro Lio, Yoshua Bengio, and Michael Bronstein. A hitchhiker's guide to geometric GNNs for 3D atomic systems. arXiv preprint arXiv:2312.07511, 2023a.

Alexandre Agm Duval, Victor Schmidt, Alex Hernández-García, Santiago Miret, Fragkiskos D Malliaros, Yoshua Bengio, and David Rolnick. Faenet: Frame averaging equivariant GNN for materials modeling. In International Conference on Machine Learning, pp. 9013–9033. PMLR, 2023b.

Nadav Dym, Hannah Lawrence, and Jonathan W Siegel. Equivariant frames and the impossibility of continuous canonicalization. arXiv preprint arXiv:2402.16077, 2024.

Bryn Elesedy and Sheheryar Zaidi. Provably strict generalisation benefit for equivariant models. In International Conference on Machine Learning, pp. 2959–2969. PMLR, 2021.

Zhengcong Fei, Mingyuan Fan, Changqian Yu, Debang Li, and Junshi Huang. Scaling diffusion transformers to 16 billion parameters. arXiv preprint arXiv:2407.11633, 2024.


Tobias Fritz. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics. Advances in Mathematics, 370:107239, 2020.

Victor Garcia Satorras, Emiel Hoogeboom, Fabian Fuchs, Ingmar Posner, and Max Welling. E(n) equivariant normalizing flows. Advances in Neural Information Processing Systems, 34:4181–4192, 2021a.

Victor Garcia Satorras, Emiel Hoogeboom, Fabian Fuchs, Ingmar Posner, and Max Welling. E(n) equivariant normalizing flows. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 4181–4192. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/21b5680d80f75a616096f2e791affac6-Paper.pdf.

Yoav Gelberg, Tycho FA van der Ouderaa, Mark van der Wilk, and Yarin Gal. Variational inference failures under model symmetries: Permutation invariant posteriors for Bayesian neural networks. In ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling, 2024.

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.

Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, pp. 8867–8887. PMLR, 2022.

Chenqing Hua, Sitao Luan, Minkai Xu, Zhitao Ying, Jie Fu, Stefano Ermon, and Doina Precup. Mudiff: Unified diffusion for complete molecule generation. In Learning on Graphs Conference, pp. 33–1. PMLR, 2024.

Sékou-Oumar Kaba, Arnab Kumar Mondal, Yan Zhang, Yoshua Bengio, and Siamak Ravanbakhsh. Equivariance with learned canonicalization functions. In International Conference on Machine Learning, pp. 15546–15566. PMLR, 2023.

Olav Kallenberg. Foundations of modern probability, volume 2. Springer, 1997.

Olav Kallenberg. Foundations of Modern Probability. Springer, 2 edition, 2002.

Jinwoo Kim, Dat Nguyen, Ayhan Suleymanzade, Hyeokjun An, and Seunghoon Hong. Learning probabilistic symmetrization for architecture agnostic equivariance. Advances in Neural Information Processing Systems, 36:18582–18612, 2023.

Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. Advances in Neural Information Processing Systems, 36, 2024.

Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in Neural Information Processing Systems, 34:21696–21707, 2021.

Diederik P Kingma. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Leon Klein, Andreas Krämer, and Frank Noé. Equivariant flow matching. Advances in Neural Information Processing Systems, 36, 2024.

Tuan Le, Julian Cremer, Frank Noé, Djork-Arné Clevert, and Kristof Schütt. Navigating the design space of equivariant diffusion-based generative models for de novo 3D molecule generation. arXiv preprint arXiv:2309.17296, 2023.

Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In International Conference on Machine Learning, pp. 3744–3753. PMLR, 2019.

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.


Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations.

Shengjie Luo, Tianlang Chen, Yixian Xu, Shuxin Zheng, Tie-Yan Liu, Liwei Wang, and Di He. One transformer can understand both 2D & 3D molecular data. In The Eleventh International Conference on Learning Representations, 2022.

Francesco Mezzadri. How to generate random matrices from the classical compact groups. arXiv preprint math-ph/0609050, 2006.

Arnab Kumar Mondal, Siba Smarak Panigrahi, Oumar Kaba, Sai Rajeswar Mudumba, and Siamak Ravanbakhsh. Equivariant adaptation of large pretrained models. Advances in Neural Information Processing Systems, 36:50293–50309, 2023.

Ryan L Murphy, Balasubramaniam Srinivasan, Vinayak Rao, and Bruno Ribeiro. Janossy pooling: Learning deep permutation-invariant functions for variable-size inputs. arXiv preprint arXiv:1811.01900, 2018.

William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.

Stefanos Pertigkiozoglou, Evangelos Chatzipantazis, Shubhendu Trivedi, and Kostas Daniilidis. Improving equivariant model training via constraint relaxation. arXiv preprint arXiv:2408.13242, 2024.

Gabriel Peyré, Marco Cuturi, et al. Computational optimal transport: With applications to data science. Foundations and Trends in Machine Learning, 11(5-6):355–607, 2019.

Omri Puny, Matan Atzmon, Heli Ben-Hamu, Ishan Misra, Aditya Grover, Edward J Smith, and Yaron Lipman. Frame averaging for invariant and equivariant network design. arXiv preprint arXiv:2110.03336, 2021.

Raghunathan Ramakrishnan, Pavlo O Dral, Matthias Rupp, and O Anatole Von Lilienfeld. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.

Víctor Garcia Satorras, Emiel Hoogeboom, and Max Welling. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pp. 9323–9332. PMLR, 2021.

P. Selinger. A Survey of Graphical Languages for Monoidal Categories, pp. 289–355. Springer Berlin Heidelberg, 2010. ISBN 9783642128219. doi: 10.1007/978-3-642-12821-9_4. URL http://dx.doi.org/10.1007/978-3-642-12821-9_4.

Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp. 2256–2265. PMLR, 2015.

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

Philipp Thölke and Gianni De Fabritiis. Torchmd-net: equivariant transformers for neural network based molecular potentials. arXiv preprint arXiv:2202.02541, 2022.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 2017.

Clement Vignac and Pascal Frossard. Top-n: Equivariant set and graph generation without exchangeability. arXiv preprint arXiv:2110.02096, 2021.

Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.


Yuyang Wang, Ahmed AA Elhag, Navdeep Jaitly, Joshua M Susskind, and Miguel Angel Bautista. Swallowing the bitter pill: Simplified scalable conformer generation. In Forty-first International Conference on Machine Learning, 2024.

Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with RFdiffusion. Nature, 620(7976):1089–1100, 2023.

Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, and Jian Tang. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022.

Minkai Xu, Alexander S Powers, Ron O Dror, Stefano Ermon, and Jure Leskovec. Geometric latent diffusion models for 3D molecule generation. In International Conference on Machine Learning, pp. 38592–38610. PMLR, 2023.

Dmitry Yarotsky. Universal approximations of invariant maps by neural networks. Constructive Approximation, 55(1):407–474, 2022.

Jason Yim, Brian L Trippe, Valentin De Bortoli, Emile Mathieu, Arnaud Doucet, Regina Barzilay, and Tommi Jaakkola. SE(3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277, 2023.


A.1 PROOF OF THEOREM 1

Proof. Theorem 1 is a special case of Example 6.3 of Cornish (2024), whose notation and setup we will import freely here. Note that Example 6.3 makes use of string diagrams (Selinger, 2010), an introduction to which can be found in Section 2 of Cornish (2024). Intuitively, a string diagram represents a (possibly stochastic) computational process that should be read up the page, with the inputs applied at the bottom, and outputs produced at the top.

To proceed, we first let the action ρ be trivial, so that the semidirect product $N \rtimes_\rho H$ becomes simply the direct product $N \times H$ by Remark 3.29 of Cornish (2024). Now consider the following string diagram:

The upshot of Example 6.3 of Cornish (2024) is that equation 16 is always equivariant with respect to the $(N \times H)$-actions $\alpha_X$ and $\alpha_Y$ whenever the following conditions all hold:

$\gamma : X \to H$ is equivariant with respect to the $H$-actions $\alpha_{X,H}$ and $*_H$;

$\gamma : X \to H$ is invariant with respect to the $N$-action $\alpha_{X,N}$;

$k : X \to Y$ is equivariant with respect to the $N$-actions $\alpha_{X,N}$ and $\alpha_{Y,N}$.

Here, as in Example 6.3 of Cornish (2024):

We have decomposed the $(N \times H)$-action $\alpha_X$ as an $H$-action $\alpha_{X,H}$ followed by an $N$-action $\alpha_{X,N}$ using Remark 3.31 of Cornish (2024). In other words:

We have decomposed $\alpha_Y$ similarly;

We denote by $*_H$ and $(-)^{-1}_H$ the multiplication and inversion operations of the group $H$.


To obtain our Theorem 1, we now simply instantiate things appropriately in the Markov category C := Stoch, so that all the boxes appearing in equation 16 become Markov kernels. Concretely, we take X := X, Y := Y, N := H, and H := G. A procedure for sampling from the symmetrised Markov kernel in equation 16 may then be read off as follows:

$g \sim \gamma(dg \mid x)$
$y \sim k(dy \mid g^{-1} \cdot x)$
return $g \cdot y$,

exactly as in the statement of this result.
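
The sampling procedure above can be sketched in a few lines. The following is a minimal illustration rather than the implementation used in the experiments: it assumes the toy group SO(2) acting on $\mathbb{R}^2$ by rotation, and the names `rotate`, `sample_symmetrised`, `gamma_canonical` and `k_unconstrained` are hypothetical. The choice of γ here is a deterministic canonicalisation (the angle of the input), which is equivariant by construction, while the base kernel is deliberately not equivariant.

```python
import math
import random


def rotate(angle, v):
    """Action of the rotation with the given angle on a point of R^2."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def sample_symmetrised(x, sample_gamma, sample_k, rng):
    """Draw one sample from the symmetrised kernel at input x."""
    g = sample_gamma(x, rng)          # g ~ gamma(dg | x)
    y = sample_k(rotate(-g, x), rng)  # y ~ k(dy | g^{-1} . x)
    return rotate(g, y)               # return g . y


def gamma_canonical(x, rng):
    """Toy equivariant gamma (assumption): the polar angle of x, for x != 0."""
    return math.atan2(x[1], x[0])


def k_unconstrained(x, rng, noise=0.1):
    """Toy base kernel (assumption): Gaussian around a non-equivariant mean."""
    mean = (x[0] ** 2 + 1.0, 0.5 * x[0] - x[1])
    return (mean[0] + noise * rng.gauss(0.0, 1.0),
            mean[1] + noise * rng.gauss(0.0, 1.0))


sample = sample_symmetrised((0.3, -1.2), gamma_canonical, k_unconstrained,
                            random.Random(0))
```

With the same random seed, rotating the input and then sampling gives the same point as sampling and then rotating the output, which is the sample-level counterpart of the equivariance guaranteed in distribution.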

A.2 PROOF OF PROPOSITION 1

Proof. For simplicity, we assume $k(dy \mid x)$ has a density $k(y \mid x)$ with respect to the Lebesgue measure $\mu$ on $Y = \mathbb{R}^n$. To derive the density of $k(dy \mid g, x)$, where $g \in G$ is some fixed group element, we note that for $y \sim k(dy \mid g, x)$ we have $y = g \cdot y_0$ where $y_0 \sim k(dy_0 \mid g^{-1} \cdot x)$. Hence, by the change-of-variables formula, we can conclude that the density of $k(dy \mid g, x)$ exists and has the form:

$$k(y \mid g, x) = k(g^{-1} \cdot y \mid g^{-1} \cdot x) \left| \det \frac{\partial (g^{-1} \cdot y)}{\partial y} \right|.$$

Hence, when the action of $G$ has unit Jacobian, the density is $k(y \mid g, x) = k(g^{-1} \cdot y \mid g^{-1} \cdot x)$.

Further, suppose we have $g \sim \gamma(dg \mid x)$ and $y \sim k(dy \mid g, x)$, i.e. $y \sim \mathrm{sym}_\gamma(k)(dy \mid x)$. Then, for an arbitrary (Borel) measurable set $A$ in $Y$, we have

$$\mathrm{sym}_\gamma(k)(y \in A \mid x) = \int_G k(y \in A \mid g, x) \, \gamma(dg \mid x).$$

Since we have shown above that $k(dy \mid g, x)$ has a density, we can express this as

$$\mathrm{sym}_\gamma(k)(y \in A \mid x) = \int_G \int_A k(y \mid g, x) \, \mu(dy) \, \gamma(dg \mid x) = \int_A \mathbb{E}_{\gamma(dg \mid x)}\left[k(y \mid g, x)\right] \mu(dy),$$

where we use Fubini's theorem for the second equality, as all quantities are non-negative. Hence, we can conclude that the density of $\mathrm{sym}_\gamma(k)$ exists and has the form $\mathrm{sym}_\gamma(k)(y \mid x) = \mathbb{E}_{\gamma(dg \mid x)}[k(y \mid g, x)]$.
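
As an illustration of this density formula, the sketch below averages $k(g^{-1} \cdot y \mid g^{-1} \cdot x)$ over group elements in a toy setting. Everything specific here is an assumption made for the example: the group is SO(2) acting on $\mathbb{R}^2$ (which has unit Jacobian), γ is taken to be the uniform (Haar) distribution, and `k_density` is a hypothetical non-equivariant Gaussian kernel. Passing random angles gives a Monte Carlo estimate of the expectation; passing a regular grid gives a quadrature approximation.

```python
import math
import random


def rotate(angle, v):
    """Action of the rotation with the given angle on a point of R^2."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def k_density(y, x, scale=0.5):
    """Toy base density (assumption): N(y; mean(x), scale^2 I), not equivariant."""
    mean = (x[0] ** 2 + 1.0, 0.5 * x[0] - x[1])
    sq = (y[0] - mean[0]) ** 2 + (y[1] - mean[1]) ** 2
    return math.exp(-0.5 * sq / scale ** 2) / (2.0 * math.pi * scale ** 2)


def sym_density(y, x, angles):
    """Average of k(g^{-1} . y | g^{-1} . x) over the supplied rotation angles."""
    total = sum(k_density(rotate(-a, y), rotate(-a, x)) for a in angles)
    return total / len(angles)


# Monte Carlo estimate with g drawn uniformly from SO(2).
rng = random.Random(0)
mc_angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(2000)]
x, y = (0.3, -1.2), (1.0, 0.4)
estimate = sym_density(y, x, mc_angles)
```

On a regular grid of angles, rotating both `x` and `y` by a multiple of the grid spacing only permutes the terms of the average, so the symmetrised density is unchanged, as expected for an equivariant kernel.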

A.3 PROOF OF PROPOSITION 2

Proof. The standard diffusion objective with data augmentation distributed according to the Haar measure λ is given by

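$$
\begin{aligned}
\mathcal{L}^{\text{aug}}_t &= \mathbb{E}_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \lambda(dR)} \left[ \tfrac{1}{2} w(t) \left\| \epsilon - \epsilon_\theta(\alpha_t R \cdot z_0 + \sigma_t \epsilon) \right\|^2 \right] && (17) \\
&= \mathbb{E}_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \lambda(dR)} \left[ \tfrac{1}{2} w(t) \left\| \epsilon - \epsilon_\theta(R \cdot (\alpha_t z_0 + \sigma_t R^T \cdot \epsilon)) \right\|^2 \right] && (18) \\
&= \mathbb{E}_{q(z_0),\, \epsilon' \sim \mathcal{N}_U(0, I),\, \lambda(dR)} \left[ \tfrac{1}{2} w(t) \left\| R \cdot \epsilon' - \epsilon_\theta(R \cdot (\alpha_t z_0 + \sigma_t \epsilon')) \right\|^2 \right] && (19) \\
&= \mathbb{E}_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \lambda(dR)} \left[ \tfrac{1}{2} w(t) \left\| \epsilon - R^T \cdot \epsilon_\theta(R \cdot (\alpha_t z_0 + \sigma_t \epsilon)) \right\|^2 \right], && (20)
\end{aligned}
$$

where we use the fact that $\epsilon' = R^T \cdot \epsilon$ is distributed according to $\mathcal{N}_U(0, I)$ as $R, \epsilon$ are independent in the expectation, and that the action of $R \in O(3)$ preserves the $L^2$ norm. To conclude, we note that it is a standard result that if $R \sim \lambda$, the inverse $R^T$ is also distributed according to the Haar measure $\lambda$. Hence, we see that $\mathcal{L}^{\text{aug}}_t$ coincides with

$$\mathcal{L}_t = \mathbb{E}_{q(z_0),\, \epsilon \sim \mathcal{N}_U(0, I),\, \lambda(dR)} \left[ \tfrac{1}{2} w(t) \left\| \epsilon - R \cdot \epsilon_\theta(R^T \cdot z_t) \right\|^2 \right], \qquad (21)$$

where $z_t = \alpha_t z_0 + \sigma_t \epsilon$.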
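
The key step of this argument is a pointwise identity: after substituting ϵ = R·ϵ′, the integrand of equation 17 equals the integrand of equation 20 evaluated at ϵ′. The sketch below checks this numerically under simplifying assumptions: the weighting ½w(t) is dropped, the projection onto the subspace U is ignored, `eps_theta` is a hypothetical stand-in for an unconstrained network, and `haar_orthogonal` draws an orthogonal matrix by Gram-Schmidt on Gaussian vectors (one standard construction, in the spirit of Mezzadri (2006)).

```python
import math
import random


def haar_orthogonal(rng):
    """Draw a 3x3 orthogonal matrix via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < 3:
        v = [rng.gauss(0.0, 1.0) for _ in range(3)]
        for u in rows:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip the (measure-zero) degenerate case
            rows.append([a / norm for a in v])
    return rows


def act(r, pts):
    """Apply the matrix r to every 3D point in pts."""
    return [[sum(r[a][b] * p[b] for b in range(3)) for a in range(3)] for p in pts]


def transpose(r):
    return [list(col) for col in zip(*r)]


def eps_theta(pts):
    """Stand-in for an unconstrained network: deliberately not equivariant."""
    return [[math.sin(p[0]) + p[1], p[1] * p[2], p[0] - 0.5] for p in pts]


def sq_dist(a, b):
    return sum((u - v) ** 2 for p, q in zip(a, b) for u, v in zip(p, q))


def mix(alpha, z0, sigma, eps):
    return [[alpha * u + sigma * v for u, v in zip(p, q)] for p, q in zip(z0, eps)]


def augmentation_term(z0, eps, r, alpha, sigma):
    """|| eps - eps_theta(alpha R.z0 + sigma eps) ||^2, as in equation 17."""
    return sq_dist(eps, eps_theta(mix(alpha, act(r, z0), sigma, eps)))


def symmetrised_term(z0, eps, r, alpha, sigma):
    """|| eps - R^T.eps_theta(R.(alpha z0 + sigma eps)) ||^2, as in equation 20."""
    out = act(transpose(r), eps_theta(act(r, mix(alpha, z0, sigma, eps))))
    return sq_dist(eps, out)


rng = random.Random(1)
z0 = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
eps_prime = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(5)]
r = haar_orthogonal(rng)
lhs = augmentation_term(z0, act(r, eps_prime), r, 0.8, 0.6)
rhs = symmetrised_term(z0, eps_prime, r, 0.8, 0.6)
```

Here `lhs` and `rhs` agree up to floating-point error for any orthogonal `r`; the equality of the two expectations then follows because ϵ′ has the same distribution as ϵ.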


B MODEL ARCHITECTURE

Below, we outline the architectures used for ϵθ and γθ. Both components rely on Diffusion Transformers (DiTs) (Peebles & Xie, 2023) using the official PyTorch implementation at https://github.com/facebookresearch/DiT. We also state the hyperparameters that we kept fixed for both our QM9 and GEOM-Drugs experiments. Any hyperparameters that differed between the datasets are discussed in their respective sections later in the Appendix.

We emphasise that our architecture choices were not extensively tuned, as the main purpose of our experiments was to show that we can use generic architectures for equivariant diffusion models. We arrived at the architecture below through small adjustments from experimenting with DiT models in the context of molecular generation, which we stuck with for our final experiments.

B.1 ARCHITECTURE OF ϵθ

As we need ϵθ to be an $S_N$-equivariant architecture, we parametrise it as a DiT model consisting of $n_{\text{layers}}$ (intermediate) layers, $n_{\text{head}}$ attention heads, hidden size $n_{\text{size}}$ and a final output layer. We then project the outputs via $\mathrm{proj}_U$ to ensure the outputs lie in $U$. In addition, for the MLP layers, we use SwiGLU activations (Shazeer, 2020) instead of the standard GELU, where the ratio of the hidden size of the SwiGLU to $n_{\text{size}}$ is 2. We do not use the default Fourier embeddings for the inputs and instead pass our inputs directly into the model. We also use the default time embeddings. We refer to this model setup as DiT.

We also use the Gaussian positional embeddings from Luo et al. (2022) as additional features that we concatenate to the inputs of DiT. To compute this from $x$, we let

$$\psi^k_{(i,j)} = \frac{1}{\sqrt{2\pi}\,|\sigma^k|} \exp\left( -\frac{1}{2} \left( \frac{\| x^{(i)} - x^{(j)} \| - \mu^k}{|\sigma^k|} \right)^2 \right),$$

where $k = 1, \ldots, K$ indexes the $K$ basis kernels we use and $\mu^k, \sigma^k \in \mathbb{R}$ are learnable parameters. We define $\psi_{(i,j)} = (\psi^1_{(i,j)}, \ldots, \psi^K_{(i,j)})^T \in \mathbb{R}^{K \times 1}$. We then compute our positional embeddings by

$\Psi_i = \frac{1}{N} \sum_{j=1}^N \psi_{(i,j)}^T W_D$, where $W_D \in \mathbb{R}^{K \times n_{\text{emb}}}$ is a learnable matrix, and we concatenate these to form our embeddings $\Psi = [\Psi_1, \ldots, \Psi_N] \in \mathbb{R}^{N \times n_{\text{emb}}}$. We note that $\Psi$ is O(3)-invariant, since it depends on $x$ only through the pairwise distances $\| x^{(i)} - x^{(j)} \|$.
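
The invariance of Ψ can be checked numerically. The sketch below assumes that each kernel is a normalised Gaussian in the pairwise distance, with learnable centre and width (the form used by Luo et al. (2022) up to sign and scaling conventions); the function names and the toy parameter values are hypothetical and only serve the illustration.

```python
import math


def gaussian_basis(dist, mus, sigmas):
    """Evaluate the K Gaussian basis kernels at one pairwise distance."""
    return [
        math.exp(-0.5 * ((dist - mu) / abs(s)) ** 2) / (math.sqrt(2.0 * math.pi) * abs(s))
        for mu, s in zip(mus, sigmas)
    ]


def positional_embeddings(x, mus, sigmas, w_d):
    """Psi_i = (1/N) sum_j psi_(i,j)^T W_D for each of the N points in x.

    x is a list of 3D points and w_d is a K x n_emb matrix (list of rows).
    """
    n, k_dim, n_emb = len(x), len(mus), len(w_d[0])
    out = []
    for xi in x:
        psi_mean = [0.0] * k_dim
        for xj in x:
            feats = gaussian_basis(math.dist(xi, xj), mus, sigmas)
            psi_mean = [a + b / n for a, b in zip(psi_mean, feats)]
        out.append([sum(psi_mean[k] * w_d[k][e] for k in range(k_dim))
                    for e in range(n_emb)])
    return out


def apply_orthogonal(r, x):
    """Apply a 3x3 orthogonal matrix r to every point in x."""
    return [tuple(sum(r[a][b] * p[b] for b in range(3)) for a in range(3)) for p in x]


# Toy inputs (assumptions): three points, K = 2 kernels, n_emb = 3.
x = [(0.0, 0.0, 0.0), (1.0, 0.2, -0.3), (-0.5, 0.8, 1.1)]
mus, sigmas = [0.5, 1.5], [0.7, -1.2]
w_d = [[0.3, -0.1, 0.8], [0.5, 0.4, -0.2]]

# A rotation about the z-axis composed with a reflection: an element of O(3).
c, s = math.cos(0.9), math.sin(0.9)
reflect_rotate = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, -1.0]]

psi = positional_embeddings(x, mus, sigmas, w_d)
psi_moved = positional_embeddings(apply_orthogonal(reflect_rotate, x), mus, sigmas, w_d)
```

Since orthogonal maps preserve pairwise distances, `psi` and `psi_moved` agree up to floating-point error.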

Finally, we provide pseudo-code for a single pass through ϵθ in Algorithm 3 where we note that WI is a learnable linear layer.

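Algorithm 3 Computation of ϵθ
Inputs: $z = [x, h]$ where $x \in \mathbb{R}^{N \times 3}$, $h \in \mathbb{R}^{N \times d}$; $t \in \mathbb{R}$

1: Compute $\Psi \in \mathbb{R}^{N \times n_{\text{emb}}}$ from $x$
2: $z \leftarrow [x, h] W_I$ where $W_I \in \mathbb{R}^{(3+d) \times n_z}$
3: $z \leftarrow \mathrm{DiT}(t, [z, \Psi])$
4: return $z$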

B.2 ARCHITECTURE OF γθ

We construct γθ following the recursive setup in Section 3.3, where we take the noise distribution ν to be $\mathcal{N}_U(0, I)$ on $\mathbb{R}^{N \times m_{\text{noise}}}$. We provide pseudo-code for a single pass through our fθ in Algorithm 4, where $W_G, W_1, W_2$ are learnable linear layers, we use the same embedding parameters for Ψ as before, and DiTWithoutFinalLayer is the same as a DiT with $m_{\text{layers}}$ (intermediate) layers, $m_{\text{head}}$ attention heads and hidden size $m_{\text{size}}$, but without the final layer applied.

B.3 HYPERPARAMETERS FOR γθ AND ϵθ

For both QM9 and GEOM-Drugs, we fixed the following hyperparameters. For ϵθ, we set K 1 2nsize and nemb = nsize nz. For γθ, we set mnoise = 3 and m size = 1


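Algorithm 4 Computation of fθ
Inputs: $z = [x, h]$ where $x \in \mathbb{R}^{N \times 3}$, $h \in \mathbb{R}^{N \times d}$; $\eta \in \mathbb{R}^{N \times m_{\text{noise}}}$; $t \in \mathbb{R}$

1: Compute $\Psi \in \mathbb{R}^{N \times n_{\text{emb}}}$ from $x$
2: $z \leftarrow [x, \eta, \Psi] W_G$ where $W_G \in \mathbb{R}^{(3 + m_{\text{noise}} + n_{\text{emb}}) \times m_{\text{size}}}$
3: $z \leftarrow \mathrm{DiTWithoutFinalLayer}(t, z)$
4: $z \leftarrow \frac{1}{N} \mathbf{1}^T z$ where $\mathbf{1} \in \mathbb{R}^N$ is a vector of ones (ensures $S_N$-invariance)
5: $z \leftarrow \mathrm{GELU}(z W_1) W_2$ where $W_1 \in \mathbb{R}^{m_{\text{size}} \times m'_{\text{size}}}$, $W_2 \in \mathbb{R}^{m'_{\text{size}} \times (3 \cdot 3)}$
6: $R \leftarrow \mathrm{QRDecomposition}(z)$
7: return $R$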
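
The last steps of Algorithm 4 (pooling over nodes, a small MLP, and a QR decomposition) can be illustrated in isolation. The sketch below is not the authors' implementation: the DiT trunk is replaced by given per-node features, the weights are random, the Q factor is computed with a hand-written modified Gram-Schmidt routine (a library QR may differ in sign conventions), and all function names and sizes are hypothetical.

```python
import math
import random


def gelu(v):
    """Exact GELU activation."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def row_times_matrix(v, w):
    """Row vector v (length a) times matrix w (a x b, list of rows)."""
    return [sum(v[i] * w[i][j] for i in range(len(v))) for j in range(len(w[0]))]


def qr_orthogonal_factor(m):
    """Q factor of a full-rank 3x3 matrix via modified Gram-Schmidt on columns."""
    cols = [[m[r][c] for r in range(3)] for c in range(3)]
    q_cols = []
    for v in cols:
        for u in q_cols:
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        q_cols.append([a / norm for a in v])
    return [[q_cols[c][r] for c in range(3)] for r in range(3)]


def pooled_head(tokens, w1, w2):
    """Mean-pool per-node features, apply the MLP, and orthogonalise."""
    n = len(tokens)
    # Averaging over nodes removes any dependence on their ordering.
    pooled = [sum(tok[d] for tok in tokens) / n for d in range(len(tokens[0]))]
    hidden = [gelu(h) for h in row_times_matrix(pooled, w1)]
    flat = row_times_matrix(hidden, w2)  # nine numbers
    return qr_orthogonal_factor([flat[0:3], flat[3:6], flat[6:9]])


# Toy sizes and random weights (assumptions for the illustration only).
rng = random.Random(0)
m_size, m_hidden, n_nodes = 4, 5, 6
w1 = [[rng.gauss(0.0, 1.0) for _ in range(m_hidden)] for _ in range(m_size)]
w2 = [[rng.gauss(0.0, 1.0) for _ in range(9)] for _ in range(m_hidden)]
tokens = [[rng.gauss(0.0, 1.0) for _ in range(m_size)] for _ in range(n_nodes)]

R = pooled_head(tokens, w1, w2)
R_permuted = pooled_head(tokens[::-1], w1, w2)
```

The output `R` is an orthogonal matrix, and reordering the nodes leaves it unchanged up to floating-point error, which is the invariance that the pooling step is there to provide.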

C EXPERIMENTAL DETAILS

C.1 FIRST LIKELIHOOD TERM L1

We have presented a framework for parametrising and optimising $p_\theta(z_{t-1} \mid z_t)$ for $t > 1$ in Section 3.3, where pθ is obtained via stochastic symmetrisation. This corresponds to the $\mathcal{L}_t$ terms with $t > 1$ in our objective in equation 9. However, we note that standard diffusion models usually choose a different parametrisation for $p_\theta(z_0 \mid z_1)$, as this corresponds to the final generation step. Depending on the modelling task, this requires a different approach compared to the other reverse kernels.

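For example, in Hoogeboom et al. (2022), $p_\theta(z_0 \mid z_1)$ is defined as the product of densities $p^{\text{cont}}_\theta(x_0 \mid z_1)\, p^{\text{disc}}_\theta(h_0 \mid z_1)$, where $p^{\text{disc}}_\theta$ implements a quantisation step converting the continuous latent $z_1$ to discrete values $h_0$, while $p^{\text{cont}}_\theta$ is still a Gaussian distribution generating continuous geometric features $x_0$ from $z_1$. In particular, we have that

$$p^{\text{cont}}_\theta(x_0 \mid z_1) = \mathcal{N}_U\left(x_0;\; x_1/\alpha_1 - \sigma_1/\alpha_1\, \epsilon^{(x)}_\theta(z_1),\; \sigma_1^2/\alpha_1^2\, I\right),$$

where $\epsilon_\theta : Z \to Z$ is some $S_N$-invariant neural network and $\epsilon^{(x)}_\theta$ denotes the $x$ component of the output of ϵθ.

We note that our proposed methodology can still account for this case by defining the symmetrised kernel by

$$p_\theta(z_0 \mid z_1) = \mathrm{sym}_{\gamma_\theta}(k_\theta)(z_0 \mid z_1), \qquad k_\theta(z_0 \mid z_1) = p^{\text{cont}}_\theta(x_0 \mid z_1)\, p^{\text{disc}}_\theta(h_0 \mid z_1).$$

We can follow the same discussion in Section 3.3 to conclude that

$$p_\theta(z_0 \mid z_1) = \mathbb{E}_{\gamma_\theta(dR \mid z_1)}\left[k_\theta(z_0 \mid R, z_1)\right], \qquad k_\theta(z_0 \mid R, z_1) = k^{\text{cont}}_\theta(x_0 \mid R, z_1)\, k^{\text{disc}}_\theta(h_0 \mid R, z_1),$$

where $k^{\text{cont}}_\theta(x_0 \mid R, z_1) = \mathcal{N}(x_0;\; x_1/\alpha_1 - \sigma_1/\alpha_1\, R \cdot \epsilon^{(x)}_\theta(R^T \cdot z_1),\; \sigma_1^2/\alpha_1^2\, I)$ and $k^{\text{disc}}_\theta(h_0 \mid R, z_1) = p^{\text{disc}}_\theta(h_0 \mid R^T \cdot z_1)$. This allows us to decompose $\mathcal{L}_1$ into the form in equation 9 and to tractably optimise this objective, since we have access to the density $k_\theta(z_0 \mid R, z_1)$.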

C.2 QM9 DETAILS

C.2.1 MODEL HYPERPARAMETERS

For all of our experiments, we retain the same diffusion hyperparameters as EDM (Hoogeboom et al., 2022), i.e. we use the same noise schedule, discretisation steps, etc.

SymDiff Table 5 shows the hyperparameters for the ϵθ backbone of the kθ component used in the SYMDIFF models for QM9. The remaining hyperparameters were kept the same as in Appendix B.3.

For the fθ backbone of the γθ component, we set msize = 128, mlayers = 8, mheads = 4 for all models bar Sym Diff . For Sym Diff , we set msize = 216, mlayers = 10, mheads = 8.

EDM Table 6 shows the hyperparameters for the EDM models that we used for our QM9 experiments. The remaining model hyperparameters were kept the same as those in Hoogeboom et al. (2022).


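Table 5: Choice of $n_{\text{size}}$, $n_{\text{layers}}$, $n_{\text{heads}}$ for the ϵθ of the SYMDIFF models used for QM9.

| Model | # Parameters | $n_{\text{size}}$ | $n_{\text{layers}}$ | $n_{\text{heads}}$ |
|---|---|---|---|---|
| SymDiff | 115.6M | 768 | 12 | 12 |
| SymDiff | 29M | 384 | 12 | 6 |
| SymDiff | 21.3M | 360 | 10 | 6 |
| SymDiff | 11.3M | 294 | 8 | 6 |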

Table 6: Choices of the hyperparameters $n_f$ (number of features per layer) and $n_l$ (number of layers) for the EDM models used for QM9.

| Model | # Parameters | $n_f$ | $n_l$ |
|---|---|---|---|
| EDM++ | 12.4M | 332 | 12 |
| EDM+ | 9.5M | 256 | 16 |
| EDM | 5.3M | 256 | 9 |

C.2.2 OPTIMISATION

For the optimisation of SYMDIFF models, we followed Peebles & Xie (2023) and used AdamW (Loshchilov & Hutter) with a batch size of 256. We chose a learning rate of $2 \times 10^{-4}$ and weight decay of $10^{-12}$ for our 31.2M parameter model by searching over a small grid of 3 values for each. To match the number of steps in Hoogeboom et al. (2022), we trained our model for 4350 epochs.

We applied the same optimisation hyperparameters from our 31.2M model to all other SYMDIFF models bar Sym Diff , where we used a learning rate of $10^{-4}$. For the EDM models, we followed the default hyperparameters from Hoogeboom et al. (2022). In our augmentation experiments, we first tuned the learning rate and weight decay for the DiT model, keeping all other optimisation hyperparameters unchanged. These tuned values were then applied to DiT-Aug.

C.3 GEOM-DRUGS DETAILS

For all of our experiments, we retain the same diffusion hyperparameters as EDM (Hoogeboom et al., 2022), i.e. we use the same noise schedule, discretisation steps, etc.

As with QM9, we report the seconds per epoch, sampling time (s) and vRAM for the models used in Table 4. We exclude END as their code is not publicly available. We omit the results for EDM and GeoLDM as we were unable to run their code on our NVIDIA H100 80GB GPU.

Table 7: Seconds per epoch, sampling time and vRAM for different models on GEOM-Drugs.

| Method | # Parameters | Sec./epoch | Sampling time (s) | vRAM (GB) |
|---|---|---|---|---|
| GeoLDM | 5.5M | | | |
| EDM | 2.4M | | | |
| SymDiff | 31.2M | 4336.82 | 0.39 | 63 |

For the SymDiff model, we used the same hyperparameters as for QM9 except for the learning rate, for which we used $10^{-4}$, as we found this to result in a lower validation loss.

C.4 PRETRAIN-FINETUNING

To further explore the flexibility of our approach, we experimented with using it in the pretrain-finetune framework, similar to Mondal et al. (2023). Using QM9, we took the trained DiT model from Table 1 and substituted it as the ϵθ for our SymDiff model, while keeping the same architecture and hyperparameters for fθ. We tested two setups: finetuning both ϵθ and fθ (DiT-FT), and freezing ϵθ while tuning only fθ (DiT-FT-Freeze). The same training procedure and optimisation hyperparameters were used, except that we now trained our models for only 800 epochs and used a larger grid for learning rate and weight decay tuning. Specifically, we searched first for the optimal learning rate in $[10^{-3}, 8 \times 10^{-4}, 2 \times 10^{-4}, 10^{-4}]$ and then for the optimal weight decay in $[0, 10^{-12}, 2 \times 10^{-12}]$. We found the optimal learning rate and weight decay to be $10^{-3}$ and $2 \times 10^{-12}$ respectively.

Table 8: Test NLL, atom stability, molecular stability, validity and uniqueness on QM9 for 10,000 samples and 3 evaluation runs.

| Method | NLL | Atm. stability (%) | Mol. stability (%) | Val. (%) | Uniq. (%) |
|---|---|---|---|---|---|
| SymDiff | -129.35 ± 1.07 | 98.74 ± 0.03 | 87.49 ± 0.23 | 95.75 ± 0.10 | 97.89 ± 0.26 |
| DiT-FT | -111.66 ± 1.22 | 98.43 ± 0.03 | 83.27 ± 0.39 | 94.19 ± 0.16 | 98.17 ± 0.26 |
| DiT-FT-Freeze | -43.29 ± 3.73 | 95.68 ± 0.02 | 55.02 ± 0.38 | 90.48 ± 0.24 | 99.06 ± 0.13 |
| DiT | -127.78 ± 2.49 | 98.23 ± 0.04 | 81.03 ± 0.25 | 94.71 ± 0.31 | 97.98 ± 0.12 |

From Table 8, we observe that finetuning both ϵθ and fθ improves atom and molecular stability over the DiT model, even with our minimal tuning. However, finetuning only fθ leads to worse results, indicating that end-to-end training or finetuning the whole model is necessary. This underscores the flexibility of our approach and its potential for easy and efficient symmetrisation of pretrained DiT models with an unconstrained fθ.

D DISCUSSION ABOUT THE SYMDIFF OBJECTIVE

We explain here why the SYMDIFF objective in equation 9 is reasonable to use as a surrogate for the true ELBO in equation 4. The underlying idea is analogous to Remark 7.1 of Cornish (2024). First, it is straightforward to check that our SYMDIFF objective recovers the ELBO exactly if either of the following two conditions are met:

- $\gamma_\theta$ is deterministic, i.e. $\gamma_\theta(dg \mid z_1)$ is a Dirac distribution for every $z_1 \in \mathcal{Z}$; or

- $k_\theta$ is $G$-equivariant.

(For our model in Section 3.3, the latter holds if the function $\epsilon_\theta : \mathcal{Z} \to \mathcal{Z}$ is deterministically $O(3)$-equivariant.) It follows that the result of optimising our SYMDIFF objective will achieve at least as high an ELBO as the best performing $\theta$ for which either of these two conditions is met. Accordingly, if our model is powerful enough to express (or approximate) a rich family of deterministic $\gamma_\theta$ and $G$-equivariant $k_\theta$, then it is reasonable to expect good performance from our surrogate objective. More generally, our model also has the ability to interpolate between these two conditions, allowing for potentially better overall optima than could be achieved in either case individually.
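The two conditions can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not part of the paper: the group is SO(2) acting on $\mathbb{R}^2$, the kernel $\gamma_\theta$ is replaced by a uniform distribution over a fixed finite set of rotations, and `mu_free` and `mu_equiv` are hypothetical mean functions for a Gaussian $k_\theta$. It compares the exact negative log-density of the symmetrised kernel with its Jensen upper bound.

```python
import numpy as np

def log_gauss(y, mean, sigma):
    """Log-density of N(y; mean, sigma^2 I)."""
    d = y.size
    sq = np.sum((y - mean) ** 2)
    return -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

def rot(angle):
    """2x2 rotation matrix (toy stand-in for an element of O(3))."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def jensen_gap(y, x, mu, rotations, sigma=1.0):
    """Surrogate minus exact negative log-likelihood, uniform weights over `rotations`."""
    logs = np.array([log_gauss(y, R @ mu(R.T @ x), sigma) for R in rotations])
    m = logs.max()
    exact = -(m + np.log(np.mean(np.exp(logs - m))))  # -log E_R[k(y | R, x)]
    surrogate = -np.mean(logs)                         # E_R[-log k(y | R, x)]
    return surrogate - exact

rng = np.random.default_rng(0)
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
random_rots = [rot(a) for a in rng.uniform(0.0, 2.0 * np.pi, size=64)]

mu_free = lambda z: z + np.array([1.0, 0.0])  # hypothetical, not equivariant
mu_equiv = lambda z: 0.5 * z                  # hypothetical, rotation-equivariant

gap_general = jensen_gap(y, x, mu_free, random_rots)        # strictly positive
gap_equivariant = jensen_gap(y, x, mu_equiv, random_rots)   # zero up to rounding
gap_deterministic = jensen_gap(y, x, mu_free, [rot(0.3)])   # zero: Dirac gamma
```

In this toy setting the gap is strictly positive only when the mean function is unconstrained and the rotation is genuinely random, which matches the two conditions listed above.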

E EXTENSION TO SCORE MATCHING AND FLOW MATCHING

In this section, we discuss how to extend stochastic symmetrisation to score-based and flow-based generative models to give an analogue of SYMDIFF for these paradigms. For clarity of presentation, we consider all models to be defined for N-body systems living in the full space $\mathcal{Z} = \mathbb{R}^{N \times 3}$, where we wish to obtain an $(S_N \times O(3))$-equivariant model (i.e. we do not consider non-geometric features or translation invariance). We note, however, that the discussion below extends to such settings in the natural way, as presented for diffusion models above.

E.1 SCORE MATCHING

Score-based generative models (SGMs) (Song et al., 2020) are the continuous-time analogue of diffusion models. SGMs consider the forward noising process $x_t \sim p_t$ for $t \in [0, T]$ defined by the following stochastic differential equation (SDE) with the initial condition $x_0 \sim p_{\text{data}}$:

$$dx_t = f(x_t, t)\,dt + g(t)\,dw, \tag{22}$$

for some choice of functions $f : \mathcal{Z} \times [0, T] \to \mathcal{Z}$ and $g : [0, T] \to \mathbb{R}$, and where $w$ is a standard Wiener process.
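As a concrete illustration of equation 22, the sketch below simulates the forward SDE with an Euler-Maruyama scheme. The specific choice $f(x, t) = -\tfrac{1}{2}\beta x$ and $g(t) = \sqrt{\beta}$ with constant $\beta$ is an assumption made here for illustration (a variance-preserving process); the function name `simulate_forward` is likewise hypothetical.

```python
import numpy as np

def simulate_forward(x0, f, g, T, n, rng):
    """Euler-Maruyama simulation of dx = f(x, t) dt + g(t) dw from t = 0 to t = T."""
    dt = T / n
    x = np.array(x0, dtype=float)
    for i in range(n):
        t = i * dt
        noise = rng.normal(size=x.shape)
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * noise
    return x

beta = 1.0                                  # assumed constant noise rate
f = lambda x, t: -0.5 * beta * x            # linear drift
g = lambda t: np.sqrt(beta)                 # diffusion coefficient

rng = np.random.default_rng(0)
x0 = np.ones((5000, 2, 3))                  # 5000 copies of a 2-body system in R^3
xT = simulate_forward(x0, f, g, T=1.0, n=200, rng=rng)

# For this SDE, p(x_T | x_0) is Gaussian with the following moments.
expected_mean = np.exp(-0.5 * beta)         # x_0 = 1 scaled by exp(-beta T / 2)
expected_var = 1.0 - np.exp(-beta)
```

The empirical mean and variance of `xT` should be close to `expected_mean` and `expected_var`, up to discretisation and Monte Carlo error.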

The corresponding backward process is shown in Anderson (1982) to take the form:

$$dx_t = \left[f(x_t, t) - g(t)^2 \nabla_x \log p_t(x_t)\right] dt + g(t)\,d\bar{w}, \tag{23}$$

where $\bar{w}$ is a standard Wiener process and time runs backwards from $T$ to $0$. Hence, given samples $x_T \sim p_T$ and access to the score of the marginal distributions $\nabla_x \log p_t(x_t)$, we can obtain samples from $p_{\text{data}}$ by simulating the backward process in equation 23.

By considering the Euler-Maruyama discretisation of equation 23, we can represent the sampling scheme of an SGM in terms of the Markov chain $p_T(x_T) \prod_{i=1}^{n} p(x_{t_{i-1}} \mid x_{t_i})$, where the time-points $t_i$ are uniformly spaced in $[0, T]$ (i.e. $t_i = i \Delta t$ where $\Delta t = T/n$) and the reverse transition kernels are given by:

$$p(x_{t_{i-1}} \mid x_{t_i}) = \mathcal{N}\left(x_{t_{i-1}};\ x_{t_i} + \Delta t \left[f(x_{t_i}, t_i) - g(t_i)^2 \nabla_x \log p_{t_i}(x_{t_i})\right],\ g(t_i)^2 \Delta t\, I\right).$$

In what follows, we additionally assume that $f(\cdot, t)$ is $(S_N \times O(3))$-equivariant for all $t \in [0, T]$. This is true for common choices of $f$, which take $f$ to be linear in $x_t$.
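The equivariance of a linear drift can be checked directly. The sketch below is illustrative only: the linear schedule inside `drift` and the helper names `haar_orthogonal` and `act` are assumptions introduced here, with configurations stored as arrays of shape `(N, 3)`.

```python
import numpy as np

def haar_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix from the Haar measure on O(3) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))  # fix column signs so the law is Haar

def act(perm, R, x):
    """Action of (perm, R) on x of shape (N, 3): permute bodies, rotate each one."""
    return x[perm] @ R.T

def drift(x, t):
    """Linear drift f(x, t) = -0.5 * beta(t) * x with an assumed schedule beta(t)."""
    return -0.5 * (0.1 + 19.9 * t) * x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
perm, R = rng.permutation(5), haar_orthogonal(rng)

lhs = drift(act(perm, R, x), 0.3)   # f(g . x, t)
rhs = act(perm, R, drift(x, 0.3))   # g . f(x, t)
```

Here `lhs` and `rhs` agree because scalar multiplication commutes with both row permutations and orthogonal transformations.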

Stochastic symmetrisation. To learn an approximation to the transition kernels via stochastic symmetrisation, we can parametrise the reverse transition kernels, in a similar fashion as for diffusion models, by

$$p_\theta(x_{t_{i-1}} \mid x_{t_i}) = \mathrm{sym}_{\gamma_\theta}(k_\theta)(x_{t_{i-1}} \mid x_{t_i}),$$

where we take $k_\theta(x_{t_{i-1}} \mid x_{t_i}) = \mathcal{N}(x_{t_{i-1}};\ \mu_\theta(x_{t_i}),\ g(t_i)^2 \Delta t\, I)$. We define $\mu_\theta$ by the following parametrisation$^5$:

$$\mu_\theta(x_{t_i}) = x_{t_i} + \Delta t \left[f(x_{t_i}, t_i) - g(t_i)^2 s_\theta(x_{t_i})\right],$$

where we take $s_\theta : \mathcal{Z} \to \mathcal{Z}$ to be an $S_N$-equivariant neural network which aims to learn an approximation to the true score $\nabla_x \log p_t(x_t)$. This ensures that $k_\theta$ is $S_N$-invariant. Additionally, we assume that $\gamma_\theta : \mathcal{Z} \to O(3)$ is some choice of an $S_N$-invariant and $O(3)$-equivariant Markov kernel. Hence, we can conclude that $p_\theta : \mathcal{Z} \to \mathcal{Z}$ is an $(S_N \times O(3))$-equivariant Markov kernel by Theorem 1. We can also guarantee that $p_\theta$ admits a density by Proposition 1.
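The mechanics of the symmetrised kernel can be sketched in a setting where equivariance is exactly checkable. The assumptions here are ours, not the paper's: $\gamma_\theta$ is replaced by the uniform distribution over the 48 signed permutation matrices (a finite subgroup of $O(3)$), so equivariance is only obtained with respect to that subgroup, and `mu_unconstrained` is a hypothetical stand-in for $\mu_\theta$ that acts row-wise and is therefore $S_N$-equivariant but not rotation-equivariant.

```python
import itertools
import numpy as np

def signed_permutation_matrices():
    """All 48 signed 3x3 permutation matrices, a finite subgroup of O(3)."""
    mats = []
    for perm in itertools.permutations(range(3)):
        for signs in itertools.product([1.0, -1.0], repeat=3):
            m = np.zeros((3, 3))
            for row, (col, s) in enumerate(zip(perm, signs)):
                m[row, col] = s
            mats.append(m)
    return mats

def mu_unconstrained(x):
    """Hypothetical mean function: row-wise, so S_N-equivariant, but not O(3)-equivariant."""
    return np.tanh(x) + np.array([0.3, -0.2, 0.1]) * x**2

def symmetrised_mean(x, group):
    """E_R[R . mu(R^T . x)] with R uniform over `group`; x has shape (N, 3)."""
    return np.mean([mu_unconstrained(x @ R) @ R.T for R in group], axis=0)

def sample_symmetrised(x, group, sigma, rng):
    """One draw from sym_gamma(k)(. | x): sample R, then sample from k(. | R, x)."""
    R = group[rng.integers(len(group))]
    mean = mu_unconstrained(x @ R) @ R.T
    return mean + sigma * rng.normal(size=x.shape)

rng = np.random.default_rng(0)
group = signed_permutation_matrices()
x = rng.normal(size=(4, 3))
Q = group[10]                      # a non-identity group element
perm = rng.permutation(4)
y = sample_symmetrised(x, group, sigma=0.1, rng=rng)
```

The mean of the symmetrised kernel commutes with the action of `Q` and with `perm`, even though `mu_unconstrained` alone does not commute with `Q`.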

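Training. To learn $\theta$ for $p_\theta(x_{t_{i-1}} \mid x_{t_i})$, a natural objective is to minimise the KL divergence between the true reverse kernels and our parametrised reverse kernels

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \lambda_0(t_i)\, \mathcal{L}_i(\theta), \qquad \mathcal{L}_i(\theta) = \mathbb{E}_{p_{t_i}(x_{t_i})}\left[ D_{\mathrm{KL}}\left(p(x_{t_{i-1}} \mid x_{t_i}) \,\|\, p_\theta(x_{t_{i-1}} \mid x_{t_i})\right) \right],$$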

where $\lambda_0$ is some time weighting function. We run into the same issue as with SYMDIFF: we do not have access to $p_\theta(x_{t_{i-1}} \mid x_{t_i})$ in closed form, since it is expressed in terms of an expectation. However, as $\mathcal{L}_i(\theta)$ depends on $p_\theta$ only through the term $-\log p_\theta(x_{t_{i-1}} \mid x_{t_i})$, and $-\log$ is a convex function, we can apply Jensen's inequality again to obtain the following upper bound to our original objective

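$$\mathcal{L}'(\theta) = \sum_{i=1}^{n} \lambda_0(t_i)\, \mathcal{L}'_i(\theta), \qquad \mathcal{L}'_i(\theta) = \mathbb{E}_{p_{t_i}(x_{t_i}),\, \gamma_\theta(dR \mid x_{t_i})}\left[ D_{\mathrm{KL}}\left(p(x_{t_{i-1}} \mid x_{t_i}) \,\|\, k_\theta(x_{t_{i-1}} \mid R, x_{t_i})\right) \right],$$

which we can now use to train $\theta$. To further simplify $\mathcal{L}'_i(\theta)$, we note that $k_\theta(x_{t_{i-1}} \mid R, x_{t_i}) = \mathcal{N}(x_{t_{i-1}};\ R \cdot \mu_\theta(R^T \cdot x_{t_i}),\ g(t_i)^2 \Delta t\, I)$ by a similar derivation as before. This allows us to evaluate the KL divergences in closed form, since $p$ and $k_\theta$ are both defined in terms of Gaussians. We can show that this gives

$$\mathcal{L}'_i(\theta) = \mathbb{E}_{p_{t_i}(x_{t_i}),\, \gamma_\theta(dR \mid x_{t_i})}\left[ \frac{1}{2} g(t_i)^2 \Delta t \left\| R \cdot s_\theta(R^T \cdot x_{t_i}) - \nabla_x \log p_{t_i}(x_{t_i}) \right\|^2 \right], \tag{24}$$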

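where we use the fact that $f(\cdot, t)$ is $(S_N \times O(3))$-equivariant. To express equation 24 in a tractable form (as we do not have access to the true score), we can apply the standard technique of employing the score matching identity (Vincent, 2011) to give

$$\mathcal{L}'_i(\theta) = \mathbb{E}_{p(x_0),\, p(x_{t_i} \mid x_0),\, \gamma_\theta(dR \mid x_{t_i})}\left[ \frac{1}{2} g(t_i)^2 \Delta t \left\| R \cdot s_\theta(R^T \cdot x_{t_i}) - \nabla_x \log p_{t_i}(x_{t_i} \mid x_0) \right\|^2 \right] + C_i,$$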

$^5$ Similar to our discussion on diffusion models, we leave the time dependency implicit here.

where $p(x_{t_i} \mid x_0)$ denotes the conditional distribution of $x_{t_i}$ given $x_0$ under the forward noising process $p$, and $C_i$ is some constant. In practice, the choice of forward noising SDE in equation 22 is made to ensure that we have access to $p(x_{t_i} \mid x_0)$ in closed form and that the distribution is easy to sample from.
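The closed form in equation 24 follows from the KL divergence between two Gaussians with a shared isotropic covariance, and it can be checked numerically. The sketch below is a sanity check under stated assumptions: `f`, `s_theta`, the value of `g`, and the random array standing in for the true score are all placeholders chosen here, and a single rotation `R` is used in place of the expectation over $\gamma_\theta$.

```python
import numpy as np

def haar_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix from the Haar measure on O(3) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
N, dt, g = 4, 0.01, 1.3                      # g plays the role of g(t_i)
f = lambda z: -0.5 * z                       # linear, hence O(3)-equivariant drift
s_theta = lambda z: np.tanh(z) - z           # placeholder for the score network
x = rng.normal(size=(N, 3))
true_score = rng.normal(size=(N, 3))         # placeholder for grad log p_t(x)
R = haar_orthogonal(rng)

# Means of the true kernel and of k_theta(. | R, x); both have covariance g^2 dt I.
mu_true = x + dt * (f(x) - g**2 * true_score)
mu_theta = lambda z: z + dt * (f(z) - g**2 * s_theta(z))
mu_model = mu_theta(x @ R) @ R.T             # R . mu_theta(R^T . x)

# KL between Gaussians with equal covariance sigma^2 I is |m1 - m2|^2 / (2 sigma^2).
kl = np.sum((mu_true - mu_model) ** 2) / (2 * g**2 * dt)

# Integrand of equation 24.
closed_form = 0.5 * g**2 * dt * np.sum((s_theta(x @ R) @ R.T - true_score) ** 2)
```

The two quantities `kl` and `closed_form` coincide because the terms in $x_{t_i}$ and $f$ cancel once the equivariance of $f$ is used.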

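By making the choice that $\lambda_0(t) = 2\lambda(t) / (g(t)^2 T)$ for some suitable time weighting function $\lambda$, we can show that, as $\Delta t \to 0$, our objective $\mathcal{L}'(\theta)$ will converge to (modulo some constant)

$$\mathbb{E}_{p(x_0),\, t \sim U(0,T),\, p(x_t \mid x_0),\, \gamma_\theta(dR \mid x_t)}\left[ \lambda(t) \left\| R \cdot s_\theta(R^T \cdot x_t) - \nabla_x \log p(x_t \mid x_0) \right\|^2 \right], \tag{25}$$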

where U(0, T) denotes the uniform distribution on [0, T]. We see that our final objective in equation 25 now resembles the standard score matching objective.
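A Monte Carlo estimate of the objective in equation 25 might be sketched as follows. Several choices here are assumptions for illustration rather than the paper's setup: the forward process is a variance-preserving SDE with constant `beta`, the weighting is $\lambda(t) = 1$, times are drawn from $[t_{\min}, T]$ to avoid a vanishing variance, and Haar-distributed rotations stand in for the learned $\gamma_\theta$ (the Haar measure is one valid $S_N$-invariant and $O(3)$-equivariant choice). All function names are hypothetical.

```python
import numpy as np

def haar_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix from the Haar measure on O(3) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

def forward_sample(x0, t, rng, beta=1.0):
    """Draw x_t ~ p(x_t | x_0) and return it with its conditional score."""
    alpha = np.exp(-0.5 * beta * t)          # mean scaling
    var = 1.0 - np.exp(-beta * t)            # isotropic variance
    xt = alpha * x0 + np.sqrt(var) * rng.normal(size=x0.shape)
    score = -(xt - alpha * x0) / var         # grad_x log p(x_t | x_0)
    return xt, score

def symmetrised_score_error(s_theta, xt, score, R, lam=1.0):
    """lam * || R . s_theta(R^T . x_t) - score ||^2 for x_t of shape (N, 3)."""
    pred = s_theta(xt @ R) @ R.T
    return lam * np.sum((pred - score) ** 2)

def loss_estimate(s_theta, data, rng, T=1.0, t_min=1e-3):
    """Average of the integrand of equation 25 over one draw per data point."""
    total = 0.0
    for x0 in data:
        t = rng.uniform(t_min, T)
        xt, score = forward_sample(x0, t, rng)
        total += symmetrised_score_error(s_theta, xt, score, haar_orthogonal(rng))
    return total / len(data)

rng = np.random.default_rng(0)
data = rng.normal(size=(8, 5, 3))            # 8 toy systems of 5 bodies
s_free = lambda z: np.tanh(z) + 1.0          # placeholder, not O(3)-equivariant
s_equiv = lambda z: -z                       # placeholder, O(3)-equivariant
loss = loss_estimate(s_free, data, rng)

xt, score = forward_sample(data[0], 0.5, rng)
R1, R2 = haar_orthogonal(rng), haar_orthogonal(rng)
```

For an equivariant `s_equiv`, the error does not depend on the sampled rotation, so the symmetrisation has no effect; for `s_free` it generally does.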

E.2 FLOW MATCHING

Continuous normalising flows (CNFs) (Chen et al., 2018) construct a generative model of data $x_1 \sim q = p_{\text{data}}$ by the pushforward of an ordinary differential equation (ODE) taking the form

$$\frac{d}{dt}\phi_t(x) = u_t(\phi_t(x)), \qquad \phi_0(x) = x, \tag{26}$$

where $u_t : \mathcal{Z} \times [0, T] \to \mathcal{Z}$ is the vector field defining the ODE, and $\phi_t : \mathcal{Z} \times [0, T] \to \mathcal{Z}$ denotes the flow implicitly defined by solutions to the above ODE. By letting $p_0$ be some simple prior distribution, the above ODE defines a generative model $x_t \sim p_t$ by the pushforward of $p_0$ through the flow $\phi_t$

$$p_t = [\phi_t]_{\#}\, p_0. \tag{27}$$

If $u_t$ is chosen in such a way that $p_1 \approx q = p_{\text{data}}$, we can then generate samples from $p_{\text{data}}$ by sampling some $x_0 \sim p_0$, then solving the ODE in equation 26 with this initial condition$^6$. Furthermore, as in previous work (Klein et al., 2024), we assume that $u_t$ is $(S_N \times O(3))$-equivariant.

By considering the Euler discretisation of equation 26, we can represent the generation process by the Markov chain $p_0(x_0) \prod_{i=1}^{n} p(dx_{t_i} \mid x_{t_{i-1}})$, where the time-points $t_i$ are uniformly spaced in $[0, T]$ (i.e. $t_i = i \Delta t$ where $\Delta t = T/n$) and the transition kernels are given by the Markov kernels

$$p(dx_{t_i} \mid x_{t_{i-1}}) = \delta\left(x_{t_{i-1}} + u_{t_{i-1}}(x_{t_{i-1}})\, \Delta t\right), \tag{28}$$

where $\delta(\cdot)$ denotes the Dirac measure at some point.
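The deterministic chain in equation 28 amounts to a forward Euler integrator. The sketch below is a minimal illustration with $T = 1$; the vector field `u_to_target`, which transports any starting point to a fixed target `x1`, is an assumption chosen so that the result is easy to verify, and `euler_flow` is a hypothetical name.

```python
import numpy as np

def euler_flow(x0, u, n, T=1.0):
    """Apply n Euler steps x_{t_i} = x_{t_{i-1}} + u(x_{t_{i-1}}, t_{i-1}) * dt."""
    dt = T / n
    x = np.array(x0, dtype=float)
    for i in range(1, n + 1):
        t_prev = (i - 1) * dt
        x = x + u(x, t_prev) * dt
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))                 # sample from a Gaussian prior p_0
x1 = rng.normal(size=(4, 3))                 # fixed target configuration

# Assumed vector field pointing at x1; it is only evaluated for t < 1.
u_to_target = lambda x, t: (x1 - x) / (1.0 - t)

x_final = euler_flow(x0, u_to_target, n=50)
```

For this particular field the Euler chain lands on `x1` at $t = 1$, since the last step moves the remaining distance exactly.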

Stochastic symmetrisation. To learn the transition kernels induced by the vector field $u_t$ via stochastic symmetrisation, we parametrise our transition kernels by

$$p_\theta(dx_{t_i} \mid x_{t_{i-1}}) = \mathrm{sym}_{\gamma_\theta}(k_\theta)(dx_{t_i} \mid x_{t_{i-1}}), \tag{29}$$

where we take $k_\theta(dx_{t_i} \mid x_{t_{i-1}}) = \delta\left(x_{t_{i-1}} + v^\theta_{t_{i-1}}(x_{t_{i-1}})\, \Delta t\right)$, in which $v^\theta_t : \mathcal{Z} \to \mathcal{Z}$ is some $S_N$-equivariant neural network which aims to learn an approximation to the true vector field $u_t$. We further assume $\gamma_\theta : \mathcal{Z} \to O(3)$ is some $S_N$-invariant and $O(3)$-equivariant Markov kernel. We can again conclude that $p_\theta : \mathcal{Z} \to \mathcal{Z}$ is an $(S_N \times O(3))$-equivariant Markov kernel by Theorem 1.

Training. A natural objective to learn $p_\theta$ is to minimise the 2-Wasserstein distance $W_2$ (Peyré et al., 2019) between $p(dx_{t_i} \mid x_{t_{i-1}})$ and $p_\theta(dx_{t_i} \mid x_{t_{i-1}})$, since these are defined in terms of Dirac measures. We can write our objective as

$$\mathcal{L}(\theta) = \sum_{i=1}^{n} \lambda_0(t_{i-1})\, \mathcal{L}_i(\theta), \qquad \mathcal{L}_i(\theta) = \mathbb{E}_{p_{t_{i-1}}(x_{t_{i-1}})}\left[ W_2^2\left(p(dx_{t_i} \mid x_{t_{i-1}}),\ p_\theta(dx_{t_i} \mid x_{t_{i-1}})\right) \right],$$

where $\lambda_0$ is some time weighting function, and the 2-Wasserstein distance $W_2$ is defined as $W_2^2(\pi_1, \pi_2) = \inf_\pi \int \|x - y\|^2 \, d\pi(x, y)$, where the infimum over $\pi$ is taken over the space of possible couplings between the measures $\pi_1, \pi_2$. We note that as $p(dx_{t_i} \mid x_{t_{i-1}})$ is Dirac, there exists only a single coupling between the two kernels, given by the product of the Markov kernels. This allows us to evaluate $\mathcal{L}_{i+1}$ as

$^6$ The use of time here reverses the convention used in the diffusion literature.

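$$\mathcal{L}_{i+1}(\theta) = \mathbb{E}_{p_{t_i}(x_{t_i}),\, \gamma_\theta(dR \mid x_{t_i})}\left[ \Delta t^2 \left\| R \cdot v^\theta_{t_i}(R^T \cdot x_{t_i}) - u_{t_i}(x_{t_i}) \right\|^2 \right]. \tag{30}$$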
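The step from the Wasserstein objective to equation 30 can be checked numerically. In the sketch below, which rests on assumptions made here for illustration, the distribution over rotations is uniform over a small finite set (the identity being checked holds for any distribution over $R$), and `u` and `v_theta` are placeholder vector fields with the time argument dropped. Because one of the two measures is a Dirac, the squared 2-Wasserstein distance reduces to the mean squared distance from its atom.

```python
import numpy as np

def haar_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix from the Haar measure on O(3) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
N, dt = 4, 0.05
x = rng.normal(size=(N, 3))
u = lambda z: -z                             # placeholder for the true field
v_theta = lambda z: np.tanh(z) + 0.5         # placeholder for the network
rotations = [haar_orthogonal(rng) for _ in range(16)]

# Atom of the true kernel, and atoms of the symmetrised kernel (one per rotation).
target = x + u(x) * dt
atoms = [(x @ R + v_theta(x @ R) * dt) @ R.T for R in rotations]

# W_2^2 between a Dirac and a uniform mixture of Diracs: only one coupling exists.
w2_squared = np.mean([np.sum((y - target) ** 2) for y in atoms])

# Integrand of equation 30, averaged over the same rotations.
closed_form = dt**2 * np.mean(
    [np.sum((v_theta(x @ R) @ R.T - u(x)) ** 2) for R in rotations]
)
```

The two values agree because $R \cdot (R^T \cdot x) = x$, so the current state cancels and only the difference of vector fields remains.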

To express equation 30 in a tractable form (as we do not have access to ut), we can take ut to be constructed by the same setup used in Flow Matching (Lipman et al., 2022). This framework allows us to express equation 30 in the now tractable form

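$$\mathcal{L}_{i+1}(\theta) = \mathbb{E}_{q(x_1),\, p_{t_i}(x_{t_i} \mid x_1),\, \gamma_\theta(dR \mid x_{t_i})}\left[ \Delta t^2 \left\| R \cdot v^\theta_{t_i}(R^T \cdot x_{t_i}) - u_{t_i}(x_{t_i} \mid x_1) \right\|^2 \right] + C_{i+1}, \tag{31}$$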

by the use of the Conditional Flow Matching objective, where $C_{i+1}$ is some constant. Here $p_t(x_t \mid x_1)$ is a family of conditional distributions where $p_0(x_0 \mid x_1) = p_0(x_0)$ equals our prior distribution and $p_1(x_1 \mid x_1) \approx \delta(x_1)$, and for which $u_t(x_t \mid x_1)$ is a vector field generating $p_t(x_t \mid x_1)$ by an ODE of the form in equation 26. These are constructed to be easy to sample from and evaluate. The true vector field $u_t$, which provides a generative model of $q = p_{\text{data}}$, is then defined by an expectation of the conditional vector fields $u_t(x_t \mid x_1)$ over $p_t(x_t \mid x_1)$ and $q(x_1)$.

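Hence, by taking $\lambda_0(t) = \frac{\lambda(t)}{T \Delta t}$ for some suitable time weighting function $\lambda$, we can show that, as $\Delta t \to 0$, our objective $\mathcal{L}(\theta)$ will converge to (modulo some constant)

$$\mathbb{E}_{q(x_1),\, t \sim U(0,T),\, p_t(x_t \mid x_1),\, \gamma_\theta(dR \mid x_t)}\left[ \lambda(t) \left\| R \cdot v^\theta_t(R^T \cdot x_t) - u_t(x_t \mid x_1) \right\|^2 \right]. \tag{32}$$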

We see that our final objective in equation 32 now resembles the standard flow matching objective.
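A single-sample estimate of the integrand in equation 32 might look as follows. This is a sketch under assumptions that the text does not fix: $T = 1$, a straight-line conditional path $x_t = (1 - t) x_0 + t x_1$ with conditional vector field $u_t(x_t \mid x_1) = (x_1 - x_t)/(1 - t) = x_1 - x_0$, a standard Gaussian prior, $\lambda(t) = 1$, and Haar-distributed rotations in place of the learned $\gamma_\theta$. The function names are hypothetical.

```python
import numpy as np

def haar_orthogonal(rng):
    """Sample a 3x3 orthogonal matrix from the Haar measure on O(3) via QR."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    return q * np.sign(np.diag(r))

def conditional_path(x0, x1, t):
    """Point on the straight-line path and its conditional vector field."""
    xt = (1.0 - t) * x0 + t * x1
    target = x1 - x0                         # equals (x1 - xt) / (1 - t) for t < 1
    return xt, target

def symmetrised_flow_error(v_theta, xt, t, target, R, lam=1.0):
    """lam * || R . v_theta(R^T . x_t, t) - target ||^2 for x_t of shape (N, 3)."""
    pred = v_theta(xt @ R, t) @ R.T
    return lam * np.sum((pred - target) ** 2)

rng = np.random.default_rng(0)
x1 = rng.normal(size=(5, 3))                 # toy data sample
x0 = rng.normal(size=(5, 3))                 # prior sample
t = 0.4
xt, target = conditional_path(x0, x1, t)

v_free = lambda z, t: np.tanh(z) + t         # placeholder, not O(3)-equivariant
v_equiv = lambda z, t: -(1.0 - t) * z        # placeholder, O(3)-equivariant
R1, R2 = haar_orthogonal(rng), haar_orthogonal(rng)
error = symmetrised_flow_error(v_free, xt, t, target, R1)
```

As in the score matching case, an equivariant `v_equiv` gives an error that is independent of the sampled rotation, while an unconstrained `v_free` relies on the rotation being averaged over.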