# Geometric Representation Condition Improves Equivariant Molecule Generation

Zian Li*1,2, Cai Zhou*3,4, Xiyuan Wang1,2, Xingang Peng1,2, Muhan Zhang1

Abstract

Recent advances in molecular generative models have demonstrated great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to improve molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared with single-stage generation, the easy-to-generate representation in the first stage guides the second-stage generation toward a high-quality molecule in a goal-oriented way. Leveraging EDM and SemlaFlow as base generators, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 50% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations. Furthermore, with such representation guidance, the number of diffusion steps can be reduced to as few as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly reducing the number of generation iterations needed. Code is available at https://github.com/GraphPKU/GeoRCG.

*Equal contribution. 1Institute for Artificial Intelligence, Peking University, Beijing, China. 2School of Intelligence Science and Technology, Peking University, Beijing, China. 3Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. 4Department of Automation, Tsinghua University, Beijing, China. Correspondence to: Muhan Zhang.

Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267, 2025. Copyright 2025 by the author(s).

1. Introduction

Recent years have seen rapid development in generative modeling techniques for molecule generation (Garcia Satorras et al., 2021; Hoogeboom et al., 2022; Luo & Ji, 2022; Wu et al., 2022; Xu et al., 2023; Le et al., 2023; Morehead & Cheng, 2024), which have demonstrated great promise in accelerating scientific discoveries such as drug design (Graves et al., 2020). By representing molecules as point clouds of chemical elements embedded in Euclidean space (potentially with edges (Vignac et al., 2023; Irwin et al., 2024)) and employing equivariant models such as EGNN (Satorras et al., 2021) as backbone architectures, these approaches ensure the O(3)- (or SO(3)-) invariance of the modeled molecule probability and have shown significant progress in both unconditional and conditional molecule generation tasks. Despite these advances, precisely modeling the molecular distribution $q(\mathcal{M})$ remains a challenge, with current models often falling short of satisfactory results.
This is especially true in more practical scenarios where the goal is to capture the conditional distribution $q(\mathcal{M}|c)$ for conditional generation, with $c$ representing a desired property such as the HOMO-LUMO gap. In such cases, recent models still produce molecules with property errors significantly larger than the data lower bound (Hoogeboom et al., 2022; Xu et al., 2023). This challenge arises in part because molecules are naturally supported on a lower-dimensional manifold (Mislow, 2012; De Bortoli, 2022; You et al., 2023), yet they are embedded in a 3D space with a much higher ambient dimension ($N(3+d)$, where $N$ is the number of atoms and $d$ the atom feature dimension). Consequently, directly learning these distributions without additional guidance, or conditioning solely on a single property, can result in substantial errors (Song et al., 2021), often leading to unstable or undesirable molecular samples.

In this work, we propose GeoRCG (Geometric Representation-Conditioned Molecule Generation), a general framework for improving the generation quality of molecular generative models by leveraging geometric representation conditions for both unconditional and conditional generation; see Figure 1 for an overview of the framework.

Figure 1: Training and sampling procedure of GeoRCG for unconditional molecule generation. a) During training (parallel), each molecule $\mathcal{M}$ is mapped into a representation $r$ by a pre-trained, frozen geometric encoder $\mathcal{E}$. The representation distribution is then learned by a lightweight representation generator. The molecule generator is trained in a self-conditioned manner, generating a molecule $\mathcal{M}$ conditioned on its own representation $\mathcal{E}(\mathcal{M})$. b) During sampling (sequential), an informative representation is first generated, which subsequently guides the molecule generator to produce high-quality molecules.

At a high level, rather than directly learning the extrinsic molecular distribution, we aim to first transform it into a more compact and semantically meaningful representation distribution, with the help of a well-pretrained geometric encoder $\mathcal{E}$ such as Unimol (Zhou et al., 2023) or Frad (Feng et al., 2023). This distribution is much simpler because it does not exhibit any group symmetries, such as the O(3)/SO(3) and S(N) groups present in extrinsic molecular distributions. As a result, a lightweight representation generator (Li et al., 2023) can effectively capture this simple distribution. In the second stage, we employ a conditional molecular generator to achieve the ultimate objective: molecule generation. Unlike conventional approaches, our molecular generator is directly informed by the first-stage geometric representation, which encapsulates crucial molecular structure and property information. This guidance enables the generation of high-quality molecular structures with improved fidelity. Our approach is directly inspired by RCG (Li et al., 2023), which, however, focuses on image data with fixed sizes and positions and does not need to handle Euclidean and permutation symmetries, factors that are markedly different in molecular data.
Compared to the recent work GraphRCG (Wang et al., 2024), which applies the RCG framework to 2D graph data, we explicitly handle 3D geometry, which is more complex due to the additional Euclidean symmetry. Moreover, we avoid the complicated stepwise bootstrapped training and sampling process proposed in Wang et al. (2024), which requires noise alignment, sequential training, and simultaneous encoder training. Instead, we adopt a simple and intuitive framework that enables parallel training and leverages advanced pre-trained geometric encoders containing valuable external knowledge (Zaidi et al., 2022; Feng et al., 2023), thus achieving competitive results without complex training procedures. Notably, while Li et al. (2023) primarily focuses on empirical evaluation, we also provide generic theoretical characterizations of the representation-conditioned diffusion model class for both unconditional and conditional generation, offering a rigorous understanding of the improved performance.

To illustrate the effectiveness of our approach, we select one of the simplest and most classical equivariant generative models, EDM (Hoogeboom et al., 2022), as the base molecular generator of GeoRCG. For better performance on the more challenging GEOM-DRUG dataset, we also apply GeoRCG to the recent state-of-the-art (SOTA) model SemlaFlow (Irwin et al., 2024). Experimentally, our method achieves the following significant improvements:

- Substantially enhancing the quality (e.g., molecule stability) of the generated molecules on the widely used QM9 and GEOM-DRUG datasets. On QM9, GeoRCG not only improves the performance of EDM by a large margin, but also significantly surpasses several recent baselines with advanced performance (Wu et al., 2022; Xu et al., 2023; Morehead & Cheng, 2024; Song et al., 2024a). On GEOM-DRUG, GeoRCG also significantly improves EDM's performance, and consistently enhances SemlaFlow's already-SOTA results.
- More remarkably, in conditional molecule generation tasks, GeoRCG yields an average 50% improvement in performance (i.e., the difference between generated molecules' properties and the conditions), while many contemporary models struggle to achieve even marginal gains.
- By incorporating classifier-free guidance into the molecule generator (Li et al., 2023) and employing low-temperature sampling for representation generation (Ingraham et al., 2023), GeoRCG demonstrates a flexible trade-off between molecular quality and diversity on the QM9 dataset without additional training, which is especially advantageous in molecular generation tasks that prioritize quality over diversity.
- With the assistance of representation guidance, GeoRCG significantly reduces the number of diffusion steps required, by approximately 10x, while preserving the quality of molecular generation.

2. Related Works

Molecular Generative Models. Early work has primarily focused on modeling molecules as 2D graphs (composed of atom types, connections, and edge types), utilizing 2D graph generative models to learn the graph distribution (Vignac et al., 2022; Jang et al., 2023; Le et al., 2023; Jo et al., 2023; Luo et al., 2023; Zhou et al., 2024).
However, since molecules inherently exist in 3D space, where physical laws govern their behavior and spatial geometry provides critical information related to key properties, recent research has increasingly focused on leveraging 3D generative models to directly learn the geometric distribution by modeling molecules as point clouds of chemical elements. Notable early autoregressive models include G-SchNet (Gebauer et al., 2019) and G-SphereNet (Luo & Ji, 2022). More recently, diffusion models have demonstrated effectiveness in this domain, as evidenced by models like EDM (Hoogeboom et al., 2022) and subsequent advancements that enhance EDM with a latent space (Xu et al., 2023), prior information (Wu et al., 2022), and more powerful backbones (Morehead & Cheng, 2024). Furthermore, recent advances in flow methods (Lipman et al., 2022; Liu et al., 2022b) have inspired the development of geometric, equivariant flow methods including EquiFM (Song et al., 2024b) and GOAT (Hong et al., 2024), which enable much faster molecule generation. Beyond these, there are also methods that jointly model 2D and 3D information (Vignac et al., 2023; You et al., 2023; Huang et al., 2024; Irwin et al., 2024) (also called 3D graphs (You et al., 2023)); representative methods include MiDi (Vignac et al., 2023) and SemlaFlow (Irwin et al., 2024), which jointly learn atom types, bond types, formal charges, and coordinates.

Pre-training for Molecular Encoders. Learning meaningful molecular representations is crucial for downstream tasks like molecular property prediction (Fang et al., 2022). The strategy of pre-training on large-scale datasets followed by fine-tuning on smaller, task-specific datasets has been proven to significantly improve model performance in the vision and language domains (Kenton & Toutanova, 2019; Brown, 2020; Dosovitskiy, 2020). Building on this success, recent studies have explored pre-training methods for molecular data, aiming to achieve similar performance improvements (Zhou et al., 2023; Feng et al., 2023; Liu et al., 2022a; Fang et al., 2022; Jiao et al., 2024; Ni et al., 2024). Common pretext tasks involve masking and recovering atom types, bond lengths, or bond angles (Fang et al., 2022; Zhou et al., 2023). However, since molecules exist in continuous 3D space, a more effective approach adds carefully crafted noise to the molecular coordinates and trains the model to denoise it. Examples of such noise types include isotropic Gaussian noise (Zaidi et al., 2022; Zhou et al., 2023), Riemann-Gaussian noise (Jiao et al., 2023), and complex hybrid noise (Ni et al., 2024; Feng et al., 2023; Jiao et al., 2024). Notably, Zaidi et al. (2022) showed that denoising equilibrium structures effectively corresponds to learning the underlying force field, thereby producing molecular representations that are physically and chemically informative.
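As a rough illustration of this denoising pretext task, the following sketch (ours, not taken from the cited works) perturbs equilibrium coordinates with isotropic Gaussian noise and regresses the injected noise; `encoder` and `noise_head` are hypothetical stand-ins for a geometric backbone and its prediction head.

```python
import torch

def denoising_pretrain_loss(encoder, noise_head, coords, feats, sigma=0.04):
    # Perturb equilibrium coordinates with isotropic Gaussian noise
    # (in the spirit of Zaidi et al., 2022); predicting the noise then
    # corresponds to learning a scaled version of the underlying force field.
    noise = torch.randn_like(coords) * sigma
    h = encoder(coords + noise, feats)   # per-atom embeddings, shape (N, hidden)
    pred = noise_head(h)                 # predicted coordinate noise, shape (N, 3)
    return ((pred - noise) ** 2).mean()  # simple MSE denoising objective
```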
Latent Generative Models. At a high level, our framework can also be viewed as a latent generative model, where data distributions are learned in a latent space (our stage 1) and decoded back through some decoder (our stage 2). Most prior work in this domain either focuses on regular data forms (e.g., images) with fixed positions and sizes (Van Den Oord et al., 2017; Razavi et al., 2019; Dai & Wipf, 2019; Aneja et al., 2021; Rombach et al., 2022; Li et al., 2023), or on graph data, which lacks Euclidean symmetry but requires explicit modeling (Wang et al., 2024). Molecular data, however, presents unique challenges in both aspects. One of the key issues in this context is how to define the latent space: defining it as latent coordinates and features, as in GeoLDM (Xu et al., 2023), still results in a geometrically structured and thus complex space, while defining it on representations, as we do, introduces the challenge of effectively decoding a global, non-symmetric embedding back into geometric objects. LGD (Zhou et al., 2024) trains a diffusion model on a unified Euclidean latent space obtained by jointly training a powerful encoder and a simple decoder, and performs both generation and prediction tasks, focusing on 2D graphs. LDM-3DG (You et al., 2023) adopts a representation latent space but employs a cascaded (2D+3D) autoencoder (AE) framework, where the decoder is designed (or trained) to be deterministic, rendering poor performance on the 3D part, as evidenced in our experiments. In contrast, we model the decoder as a powerful generative model, focusing solely on geometric learning while demonstrating superior effectiveness.

3.1. Preliminaries

In this work, we represent molecules as point clouds of chemical elements in 3D space, denoted by $\mathcal{M} = (x, h)$, where $x = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times 3}$ represents the atomic coordinates of $N$ atoms, and $h = (h_1, \ldots, h_N) \in \mathbb{R}^{N \times d}$ captures the node features of dimension $d$, such as atomic numbers and charges. This formulation follows the approach of Hoogeboom et al. (2022); Xu et al. (2023); Morehead & Cheng (2024) and is widely adopted in molecular representation learning (Thomas et al., 2018; Li et al., 2024a; Zaidi et al., 2022), facilitating the integration of pre-trained molecular encoders (Zaidi et al., 2022; Feng et al., 2023). After generating point clouds of chemical elements, these methods infer bond types using lookup tables based on atom types and pairwise distances, or by relying on advanced packages like OpenBabel (O'Boyle et al., 2011). Notably, approaches like MiDi (Vignac et al., 2023) and SemlaFlow (Irwin et al., 2024) additionally represent molecules with explicit bond types, enabling joint learning and generation of 2D and 3D information, which typically results in improved performance.

We use $q$ to denote underlying data distributions, such as the molecule distribution $q(\mathcal{M})$, and $p$ to denote the approximate distributions captured by parametric models. We denote the pre-trained geometric encoder as $\mathcal{E}: \bigcup_{N=1}^{+\infty} (\mathbb{R}^{N \times 3} \times \mathbb{R}^{N \times d}) \to \mathbb{R}^{d_r}$, which embeds a molecule $\mathcal{M}$ with an arbitrary number of nodes $N$ into a representation vector $r$ of fixed dimension $d_r$. The geometric encoder exhibits E(3)- (or SE(3)-) invariance, meaning that $\mathcal{E}(\mathcal{M}) = \mathcal{E}(x, h) = \mathcal{E}(xR^\top + t, h)$ for any $t \in \mathbb{R}^3$ and $R \in O(3)$ (or $SO(3)$), where $O(3)$ is the set of orthogonal matrices (and $SO(3)$ the set of special orthogonal matrices).
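This invariance property can be sanity-checked numerically. The sketch below is ours and assumes a hypothetical `encoder(coords, feats)` callable returning the representation vector $r$.

```python
import torch

def check_e3_invariance(encoder, coords, feats, atol=1e-4):
    r = encoder(coords, feats)
    Q, _ = torch.linalg.qr(torch.randn(3, 3))  # random orthogonal matrix in O(3)
    t = torch.randn(3)
    # Rotate/reflect and translate the coordinates; an E(3)-invariant encoder
    # must return (numerically) the same representation.
    r_transformed = encoder(coords @ Q.T + t, feats)
    return torch.allclose(r, r_transformed, atol=atol)
```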
3.2. GeoRCG: Geometric-Representation-Conditioned Molecular Generation

Geometric Representation Generator. To improve the quality of the generated molecules, we propose to first transform the geometrically structured molecular distribution $q(\mathcal{M})$ into a non-geometric representation distribution $q(r)$ using a well-pretrained geometric encoder $\mathcal{E}$ that maps each molecule $\mathcal{M}$ to its representation $r$. Learning the representation distribution $q(r)$ is considerably easier, since representations do not exhibit any symmetry, unlike in explicit molecular generative models (Hoogeboom et al., 2022). We thus leverage a simple yet effective MLP-based diffusion architecture, as proposed in Li et al. (2023), for the representation generator $p_\phi(r)$, which follows DDIM schemes (Song et al., 2020a) for training and adopts predictor-corrector frameworks for sampling (Song et al., 2020b).

Figure 2: t-SNE visualizations of the representations produced by Frad (Feng et al., 2023) for the QM9 dataset (left) and by Unimol (Zhou et al., 2023) for the GEOM-DRUG dataset (right). The representations exhibit clear clustering based on node count.

One additional design choice, compared to previous practices (Li et al., 2023; Wang et al., 2024), is that we condition the representation generator on the molecule's node number $N$ by default. (We omit the condition $N$ in our probability decompositions and mathematical derivations for statement simplicity, as its inclusion does not affect the overall framework and conclusions.) This is crucial to ensuring consistency between the size of the representation's underlying molecule and the size of the molecule it guides to generate. Moreover, molecules with different sizes often have distinct modes in structures and properties (Hoogeboom et al., 2022), which is reflected in their geometric representations learned by modern pre-trained geometric encoders (Zhou et al., 2023; Feng et al., 2023), as shown in Figure 2. From the figures, it is evident that by conditioning on $N$, the learning process for the representation generator becomes simpler and more effective, leading to the following loss function for our representation generator:

$$\mathcal{L}_{\text{rep}} = \mathbb{E}_{(r,N) \sim \mathcal{D}^{\text{rep}}_{\text{train}},\, \epsilon \sim \mathcal{N}(0,I),\, t \sim \mathcal{U}(0,T)} \left[ \| r - f_\phi(r_t;\, t, N) \|^2 \right], \quad (1)$$

where $\mathcal{D}^{\text{rep}}_{\text{train}} = \{ (\mathcal{E}(\mathcal{M}), N(\mathcal{M})) \mid \mathcal{M} \in \mathcal{D}^{\text{mol}}_{\text{train}} \}$, with $N(\mathcal{M})$ denoting the atom number of $\mathcal{M}$ and $\mathcal{D}^{\text{mol}}_{\text{train}}$ the molecule dataset. Here, $f_\phi$ is the MLP backbone (Li et al., 2023), and $r_t = \sqrt{\alpha_t}\, r + \sqrt{1 - \alpha_t}\, \epsilon$ is the noisy representation computed with the predefined schedule $\alpha_t \in (0, 1]$.

Molecule Generator. Since the ultimate goal of our framework is to generate molecules from $q(\mathcal{M})$, we decompose the molecular distribution as $q(\mathcal{M}) = \int q(\mathcal{M}|r)\, q(r)\, dr$ to explicitly enable geometric representation conditions. Consequently, a geometric-representation-conditioned molecular generator $p_\theta(\mathcal{M}|r)$ is required. In principle, we could use many modern molecule generators (Hoogeboom et al., 2022; Xu et al., 2023; Morehead & Cheng, 2024; Irwin et al., 2024), as these models can all take additional conditions. To illustrate the effectiveness of our approach, we choose a relatively simple model, EDM (Hoogeboom et al., 2022), as the base generator and primarily demonstrate our method with it. Furthermore, we showcase the generality of our approach by adapting it to a recent flow-matching-based SOTA model, SemlaFlow (Irwin et al., 2024), emphasizing its ability to consistently improve SOTA models' performance.

EDM is designed to ensure O(3)-invariance, i.e., for any $R \in O(3)$, $p_\theta(\mathcal{M}) = p_\theta(x, h) = p_\theta(xR^\top, h)$. To accommodate EDM to representation conditions, we use the following training objective:

$$\mathcal{L}_{\text{mol}} = \mathbb{E}_{(\mathcal{M},r) \sim \mathcal{D}^{\text{mol-rep}}_{\text{train}},\, t \sim \mathcal{U}(0,T),\, \epsilon \sim \hat{\mathcal{N}}(0,I)} \left[ \| \epsilon - f_\theta(\mathcal{M}_t;\, t, r) \|^2 \right], \quad (2)$$

where $\mathcal{D}^{\text{mol-rep}}_{\text{train}} = \{ (\mathcal{M}, \mathcal{E}(\mathcal{M})) \mid \mathcal{M} \in \mathcal{D}^{\text{mol}}_{\text{train}} \}$, and sampling from $\hat{\mathcal{N}}(0, I)$ entails drawing $\epsilon_0 = [\epsilon^{(x)}_0, \epsilon^{(h)}_0]$ from $\mathcal{N}(0, I)$, adjusting $\epsilon^{(x)}_0$ by subtracting its geometric center to obtain $\epsilon^{(x)}$, and setting $\epsilon = [\epsilon^{(x)}, \epsilon^{(h)}_0]$. This ensures the zero center-of-mass property, as the distribution is defined on this subspace to ensure translation invariance (Hoogeboom et al., 2022).
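Sampling from $\hat{\mathcal{N}}(0, I)$ thus amounts to projecting the coordinate noise onto the zero center-of-mass subspace; a minimal sketch of the procedure just described:

```python
import torch

def sample_com_free_noise(n_atoms, feat_dim):
    eps_x = torch.randn(n_atoms, 3)
    eps_x = eps_x - eps_x.mean(dim=0, keepdim=True)  # subtract geometric center
    eps_h = torch.randn(n_atoms, feat_dim)           # feature noise is unconstrained
    return eps_x, eps_h                              # together, one draw from N̂(0, I)
```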
The noisy molecule is given by $\mathcal{M}_t = \alpha^{(\mathcal{M})}_t [x, h] + \sigma^{(\mathcal{M})}_t \epsilon$, with time-dependent schedules $\alpha^{(\mathcal{M})}_t$ and $\sigma^{(\mathcal{M})}_t$, while the diffusion backbone $f_\theta$, which is instantiated with EGNN (Satorras et al., 2021), is conditioned on $r$.

Combining the Two Generators Together. The representation generator $p_\phi(r)$ and the molecule generator $p_\theta(\mathcal{M}|r)$ together model the molecular distribution $p_{\phi,\theta}(\mathcal{M}) := \int p_\theta(\mathcal{M}|r)\, p_\phi(r)\, dr$, which approximates the data distribution $q(\mathcal{M}) = \int q(\mathcal{M}|r)\, q(r)\, dr$ that we aim to capture. One notable advantage of the framework is that the decomposition enables parallel training of the two generators. The entire training and sampling procedure is summarized in Algorithm 1; a minimal sketch of the resulting two-stage sampling loop is also given at the end of this subsection.

Theoretical Analysis of GeoRCG. Several key properties of GeoRCG facilitate high-quality molecule generation. First, GeoRCG preserves the symmetry properties of the base molecule generator $p_\theta(\mathcal{M})$:

Proposition 3.1. (Symmetry Preservation) Assume the original molecular generator $p_\theta(\mathcal{M})$ is O(3)- or SO(3)-invariant. Then, the two-stage generator $p_{\phi,\theta}(\mathcal{M})$ is also O(3)- or SO(3)-invariant.

Proof. This result follows directly from the definition. Specifically, $p_{\phi,\theta}(\mathcal{M}) = \int p_\theta(\mathcal{M}|r)\, p_\phi(r)\, dr = \int p_\theta(xR^\top, h|r)\, p_\phi(r)\, dr = p_{\phi,\theta}(xR^\top, h)$ for any $R \in O(3)$ (or $SO(3)$). The second equality holds due to the symmetry property of $p_\theta(\mathcal{M})$, which remains valid when additional non-symmetric conditions $r$ are applied.

Moreover, representation-conditioned diffusion models can achieve no higher overall total variation distance than traditional diffusion models, and can arguably yield better results, as the representation encodes key data information that may further reduce estimation error. We present the rigorous bound in Theorem 3.2, and provide the corresponding proof and detailed discussions in Appendix D.1. Remarkably, this is a generic theoretical characterization that applies to prior experimental work (Li et al., 2023). For models that account for equivariant symmetries, such as EDM, we build upon results from Feng et al. (2024); You et al. (2023) to establish finer-grained bounds, as detailed in Theorem D.14.

Theorem 3.2. Consider the random variable $x \in \mathbb{R}^{N(d+3)}$, $x \sim q(x)$, and assume that the second moment $m_x$ of $x$ is bounded as $m_x^2 := \mathbb{E}_{q(x)}[\|x - \bar{x}\|^2] < \infty$, where $\bar{x} := \mathbb{E}_{q(x)}[x]$. Further, assume that the score $\nabla \ln q_t(x_t)$ is $L_x$-Lipschitz for all $t$, and that the score estimation error in the second-stage diffusion is bounded by $\epsilon_{\phi,\theta,\text{cond}}$, such that $\mathbb{E}_{r \sim p_\phi(r),\, x_t \sim q_t(x_t|r)}\left[ \| s_\theta(x_t, t, r) - \nabla \ln q_t(x_t|r) \|^2 \right] \le \epsilon^2_{\phi,\theta,\text{cond}}$. Denote the step size as $h := T / N_d$, where $T$ is the total diffusion time and $N_d$ is the number of discretization steps, and assume that $h \lesssim 1/L_x$. Suppose that we sample $x \sim p_\theta(x|r)$ from Gaussian noise, where $r \sim p_\phi(r)$, and denote the final distribution of $x$ as $p_{\theta,\phi}(x)$. Define $\hat{p}$ as the ending point of the reverse process starting from $q_{T|\phi}$ instead of Gaussian noise. Here, $q_{T|\phi}$ is the $T$-th step in the forward process starting from $q_{0|\phi} := \frac{1}{A} \int_r q(x_0|r)\, p_\phi(r)\, dr$, where $A$ is the normalization factor. Denote the $k$-dimensional isotropic Gaussian distribution as $\gamma_k$. Then the following holds:

$$\mathrm{TV}(p_{\theta,\phi}(x),\, q(x)) \le \underbrace{\sqrt{\mathrm{KL}(q_{0|\phi} \,\|\, \gamma_{N(d+3)})}\, \exp(-T)}_{\text{convergence of forward process}} + \underbrace{\left( L_x \sqrt{N(d+3)h} + L_x m_x h \right) \sqrt{T}}_{\text{discretization error}} + \underbrace{\epsilon_{\phi,\theta,\text{cond}} \sqrt{T}}_{\text{conditional score estimation error}} + \underbrace{\mathrm{TV}(q_{0|\phi},\, q_0)}_{\text{representation generation error}}. \quad (3)$$
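The sketch below illustrates the two-stage sequential sampling loop referenced above; `rep_generator` and `mol_generator` are hypothetical wrappers around the two trained diffusion models, and the keyword arguments correspond to the sampling-stage controls described next and detailed in Appendix A.

```python
def georcg_sample(rep_generator, mol_generator, n_atoms,
                  temperature=1.0, w=1.0, cond=None):
    # Stage 1: generate an informative representation, conditioned on the node
    # number N (and, for conditional generation, a target property c).
    r = rep_generator.sample(n_atoms=n_atoms, temperature=temperature, cond=cond)
    # Stage 2: generate a molecule conditioned on r, with classifier-free
    # guidance strength w.
    coords, feats = mol_generator.sample(n_atoms=n_atoms, rep=r, cfg_weight=w)
    return coords, feats
```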
Balancing Quality and Diversity of Molecule Generation. In many scientific applications, researchers prioritize generating higher-quality molecules over more diverse ones. To facilitate this, we introduce a feature that allows fine-grained control over the trade-off between diversity and quality at the sampling stage (thus without retraining). This is achieved by integrating two key techniques: low-temperature sampling (Ingraham et al., 2023) (controlled via the temperature $T$) for the representation generator, and classifier-free guidance (Ho & Salimans, 2022; Zheng et al., 2023) (controlled via the coefficient $w$) for the molecule generator. We provide more details about the two techniques in Appendix A. The combination of the two techniques enables flexible and explicit control, which we refer to as Balancing Controllability and whose effectiveness we demonstrate in Section 4.2.

Handling Conditional Molecule Generation. The framework discussed thus far focuses on unconditional molecule generation, where no specific property $c$ (e.g., HOMO energy) is prespecified. However, for molecule generation, a more practical and desired scenario is conditional (also called controllable) generation, where additional conditions $c$, such as the HOMO-LUMO gap energy, are introduced, and our objective shifts to generating molecules from the distribution $q(\mathcal{M}|c)$. In GeoRCG, this conditional generation is naturally decomposed as $p_{\theta,\phi}(\mathcal{M}|c) := \int p_\theta(\mathcal{M}|r)\, p_\phi(r|c)\, dr$, suggesting that we first generate a property-meaningful molecular representation $r$, which is then independently used to condition the second-stage molecule generation; see Figure 3 for an illustration. A key advantage of this modeling approach is that, when different properties (e.g., HOMO, LUMO, GAP energy) need to be captured, only the representation generator requires retraining under the new conditions. This retraining is highly efficient due to the lightweight nature of the representation generator. Notably, GeoRCG demonstrates outstanding conditional generation performance, as shown in Section 4.3. Moreover, we theoretically demonstrate that, under mild assumptions, the representation generator can provably estimate the conditional distribution and generate representations that lead to provable reward improvements toward the target, which subsequently benefits the second-stage generation. Further theoretical details are provided in Appendix D.2.

Figure 3: A single molecule generator can be employed for both unconditional and conditional molecule generation with respect to various properties. For conditional generation, only the representation generator is re-trained on (molecule, property) pairs, allowing it to conditionally sample property-meaningful representations during the sampling stage.
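To make this decomposition concrete, a property-conditioned representation denoiser can simply receive an embedding of $c$ (together with $t$ and $N$) as extra input. The module below is an illustrative sketch in the spirit of the paper's MLP-based backbone, not the actual implementation; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CondRepDenoiser(nn.Module):
    """Sketch of f_phi(r_t; t, N, c) for conditional representation diffusion."""
    def __init__(self, rep_dim=256, hidden=512):
        super().__init__()
        self.cond_embed = nn.Linear(2, hidden)  # embeds (c, N) jointly
        self.time_embed = nn.Linear(1, hidden)
        self.net = nn.Sequential(
            nn.Linear(rep_dim + 2 * hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, rep_dim),
        )

    def forward(self, r_t, t, n_atoms, c):
        cond = self.cond_embed(torch.stack([c, n_atoms.float()], dim=-1))
        temb = self.time_embed(t.unsqueeze(-1))
        return self.net(torch.cat([r_t, cond, temb], dim=-1))
```

Because only this lightweight module is retrained per property, switching the target condition (e.g., from HOMO to LUMO) leaves the molecule generator untouched.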
4. Experiments

4.1. Experiment Setup

Datasets and Tasks. As a method for 3D molecule generation, we evaluate GeoRCG on the widely used QM9 (Ramakrishnan et al., 2014) and GEOM-DRUG (Gebauer et al., 2019; 2022; Axelrod & Gómez-Bombarelli, 2022) datasets. We focus on two tasks: unconditional molecule generation, where the goal is to sample from $q(\mathcal{M})$, and conditional (or controllable) molecule generation, where a property $c$ is given and we aim to sample from $q(\mathcal{M}|c)$. We use GeoRCG (EDM) to denote the variant of GeoRCG that employs EDM as the base molecule generator, GeoRCG (Semla) to refer to its application built upon SemlaFlow (Irwin et al., 2024), and GeoRCG when the context is clear or for general purposes. To ensure fair comparisons, we follow the dataset splits and configurations exactly as in Anderson et al. (2019); Hoogeboom et al. (2022); Xu et al. (2023). Unless otherwise clarified, we bold the highest scores and underline the second-highest. Additionally, to highlight the direct improvement over the base model, we display a signed percentage next to the score indicating the relative improvement (or decrease) over the base model. Unless otherwise clarified, results are calculated based on 10k randomly sampled molecules, averaged over three runs, with standard errors reported in parentheses.

Instantiation of the Pre-trained Encoder. We employ Frad (Feng et al., 2023), which was pre-trained on the PCQM4Mv2 dataset (Nakata & Shimazaki, 2017) using a hybrid noise denoising objective, as the geometric encoder for the QM9 dataset. For GEOM-DRUG, we adopt the Unimol (Zhou et al., 2023) architecture but perform our own pretraining using the dataset from Zhou et al. (2023), with GEOM-DRUG included as an additional pretraining dataset. This is because GEOM-DRUG contains unique chemical elements not found in PCQM4Mv2 or other commonly used pretraining datasets such as ZINC or ChEMBL (Li et al., 2021). We note that, when using Frad (Feng et al., 2023) as the encoder, GeoRCG also leads to significant improvements on the GEOM-DRUG dataset, although with slightly lower performance compared to Unimol; see Appendix C.

Baselines. A direct comparison is made with our base molecule generators, EDM and SemlaFlow. For GeoRCG (EDM), we compare against generative models that, like EDM, do not explicitly generate bonds but instead infer them based on bond lengths. Although this approach may be less effective at generating valid molecules, it is widely adopted and presents a greater challenge for generative models in precisely learning 3D geometric distributions, which is exactly where our geometric representation guidance offers the most significant improvement. These models include: (1) the non-equivariant counterparts of EDM and GeoLDM (Xu et al., 2023), specifically GDM(-AUG) (Hoogeboom et al., 2022) and GraphLDM(-AUG) (Xu et al., 2023); (2) the autoregressive method G-SchNet (Gebauer et al., 2019); (3) advanced equivariant diffusion models such as GeoLDM (Xu et al., 2023), EDM-Bridge (Wu et al., 2022), and GCDM (Morehead & Cheng, 2024); (4) fast equivariant flow-based methods like E-NF (Garcia Satorras et al., 2021), EquiFM (Song et al., 2024b), and GOAT (Hong et al., 2024); and (5) the recently introduced Bayesian-based method GeoBFN (Song et al., 2024a). For GeoRCG (Semla), we compare it with recent advanced 2D&3D methods that directly generate bonds to produce higher-quality samples, similar to SemlaFlow, including MiDi (Vignac et al., 2023) and EQGAT-diff (Le et al., 2023). Note that we intentionally separate the comparison between EDM-like 3D-only models and SemlaFlow-like 2D&3D models, focusing on the improvements brought by GeoRCG to the base model. This is because combining the comparisons would be unfair, as 2D&3D models additionally learn bond information, which reduces the complexity of generating valid molecules (Morehead & Cheng, 2024). We provide further experiments, including ablation studies on the pre-trained encoder, in Appendix C.
4.2. Unconditional Molecule Generation

We first evaluate the quality of unconditionally generated molecules from GeoRCG, using the commonly adopted validity and stability metrics for assessing molecule quality (Hoogeboom et al., 2022); see Appendix B for detailed descriptions of these metrics. We present the main results of GeoRCG on the QM9 and GEOM-DRUG datasets in Table 1 and Table 2.

Table 1: Unconditional molecule generation on QM9 and GEOM-DRUG. The gray cells (in the original table) denote the base molecule generator employed in GeoRCG; signed percentages denote changes relative to the base model.

| Methods | QM9 Atom Sta (%) | QM9 Mol Sta (%) | QM9 Valid (%) | QM9 Valid & Unique (%) | DRUG Atom Sta (%) | DRUG Valid (%) |
|---|---|---|---|---|---|---|
| Data | 99 | 95.2 | 97.7 | 97.7 | 86.5 | 99.9 |
| G-SchNet | 95.7 | 68.1 | 85.5 | 80.3 | - | - |
| GDM | 97 | 63.2 | - | - | 75 | 90.8 |
| GDM-AUG | 97.6 | 71.6 | 90.4 | 89.5 | 77.7 | 91.8 |
| GraphLDM | 97.2 | 70.5 | 83.6 | 82.7 | 76.2 | 97.2 |
| GraphLDM-AUG | 97.9 | 78.7 | 90.5 | 89.5 | 79.6 | 98 |
| EDM | 98.7 | 82 | 91.9 | 90.7 | 81.3 | 92.6 |
| EDM-Bridge | 98.8 | 84.6 | 92 | 90.7 | 82.4 | 92.8 |
| GeoLDM | 98.9(0.1) | 89.4(0.5) | 93.8(0.4) | 92.7(0.5) | 84.4 | 99.3 |
| GCDM | 98.7(0.0) | 85.7(0.4) | 94.8(0.2) | 93.3(0.0) | 89 | 95.5 |
| E-NF | 85 | 4.9 | 40.2 | 39.4 | - | - |
| EquiFM | 98.9(0.1) | 88.3(0.3) | 94.7(0.4) | 93.5(0.3) | 84.1 | 98.9 |
| GOAT | 98.4 | 84.1 | 90.9 | 89.99 | 81.8 | 96.0 |
| GeoBFN | 99.08(0.03) | 90.87(0.1) | 95.31(0.1) | 92.96(0.1) | 85.6 | 92.08 |
| GeoRCG (EDM) | 99.12(0.03) +0.43% | 92.32(0.06) +12.59% | 96.52(0.2) +5.03% | 92.45(0.2) +1.93% | 84.3(0.12) +3.69% | 98.5(0.12) +6.37% |

Below, we highlight the key findings:

(i) Improvement over the base model: By leveraging geometric representations, GeoRCG significantly outperforms the base model on both the QM9 and GEOM-DRUG datasets. Notably, on QM9, GeoRCG (EDM) increases the fraction of stable molecules from 82% to 93.9% and validity from 91.9% to 97.4% (with the balancing configuration described below), while also improving molecule uniqueness.

(ii) Superior performance compared to advanced methods: GeoRCG (EDM) also surpasses the included advanced models on the QM9 dataset. On the GEOM-DRUG dataset, GeoRCG (EDM) outperforms models such as EDM-Bridge and GOAT, and achieves a high validity score. Although GeoRCG (EDM) falls short of achieving the best overall performance there, we attribute this to the relatively limited capabilities of EDM. To address this, we replace EDM with the recent SOTA flow-matching-based model, SemlaFlow (Irwin et al., 2024), as the base model on the GEOM-DRUG dataset, as shown in Table 2. As demonstrated, GeoRCG (Semla) consistently enhances SemlaFlow's SOTA performance across all metrics on the GEOM-DRUG dataset.

Table 2: Unconditional molecule generation on GEOM-DRUG for 2D&3D methods. Molecule stability and validity are reported as percentages, while energy and strain energy are expressed in kcal·mol⁻¹. Results marked with † were reproduced in our own experiments.

| Methods | Atom Stab | Mol Stab | Valid | Energy | Strain |
|---|---|---|---|---|---|
| MiDi | 99.8 | 91.6 | 77.8 | - | - |
| EQGAT-diff | 99.8(0.01) | 93.4(0.21) | 94.6(0.24) | 148.8(0.9) | 140.2(0.7) |
| SemlaFlow | 99.8(0.00) | 97.4(0.07) | 94.4(0.17) | 95.72(1.24) | 56.42(1.07) |
| GeoRCG (Semla) | 99.8(0.00) | 97.6(0.00) | 95.3(0.13) | 88.6(1.03) | 47.64(1.10) |

We proceed to investigate the Balancing Controllability feature of GeoRCG introduced in Section 3.2. To this end, we conduct a grid search varying both $w$ and $T$ on the QM9 dataset for GeoRCG (EDM), as depicted in Figure 4 (see Appendix C for the extended figure that includes validity). The results indicate a clear trend: increasing $w$ and decreasing $T$ improve validity and stability at the expense of uniqueness, allowing fine-grained, flexible control over molecule generation. At its best, this approach achieves a molecule stability of 93.9% and a validity of 97.42%, approaching the dataset's upper bound, with a trade-off of a lower validity & uniqueness of 86.82%.

Figure 4: Balancing controllable generation on QM9 with GeoRCG (EDM). Increasing $w$ and decreasing $T$ enhances stability, at the cost of a reduction in uniqueness. The two heatmaps are reproduced as tables below (rows: classifier-free guidance coefficient $w$; columns: inverse temperature $1/T$).

Molecule Stability (%):

| $w$ \ $1/T$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 90.02 | 90.65 | 91.06 | 91.08 |
| 1 | 92.32 | 93.16 | 93.30 | 93.38 |
| 2 | 92.40 | 93.43 | 93.72 | 93.90 |
| 3 | 92.30 | 93.66 | 93.56 | 93.54 |

Validity & Uniqueness (%):

| $w$ \ $1/T$ | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 0 | 92.61 | 92.23 | 91.73 | 91.40 |
| 1 | 92.45 | 91.36 | 89.53 | 88.62 |
| 2 | 92.05 | 90.37 | 88.37 | 86.82 |
| 3 | 91.38 | 89.70 | 86.70 | 85.64 |

In Appendix C, we present additional experiments on QM9 demonstrating that GeoRCG enhances distribution-level geometric metrics, such as Bond Angle W1, which underscores GeoRCG's improved geometric learning capabilities.

4.3. Conditional Molecule Generation

We now turn to a more challenging task: generating molecules with a specific property value $c$ from $q(\mathcal{M}|c)$. We strictly follow the evaluation protocol outlined in Hoogeboom et al. (2022). Specifically, QM9 is split into two halves; an EGNN classifier (Satorras et al., 2021) is trained on the first half to evaluate the generated molecules' properties, while the generator is trained on the second half. We focus on six properties: polarizability ($\alpha$), orbital energies ($\varepsilon_{\text{HOMO}}$, $\varepsilon_{\text{LUMO}}$) and their gap ($\Delta\varepsilon$), dipole moment ($\mu$), and heat capacity ($C_v$). The results are presented in Table 3. The first three baselines, as introduced by EDM (Hoogeboom et al., 2022), represent the classifier's inherent bias as the lower bound on performance, the random evaluation result as the upper bound, and the dependency of the properties on $N$. For more details, please refer to Appendix B.

Table 3: Conditional molecule generation on QM9. The metric is the MSE between the target property value and the classifier-predicted value; signed percentages denote the relative error reduction with respect to the base model (EDM). The gray cells (in the original table) denote the baseline molecule generator employed in our proposed approach. Models marked with † indicate results obtained from our own experiments; these are provided only as a coarse reference due to potentially differing evaluation criteria, see Appendix B for details.

| Methods | $\alpha$ | $\Delta\varepsilon$ | $\varepsilon_{\text{HOMO}}$ | $\varepsilon_{\text{LUMO}}$ | $\mu$ | $C_v$ |
|---|---|---|---|---|---|---|
| QM9 (lower bound) | 0.1 | 64 | 39 | 36 | 0.043 | 0.04 |
| Random | 9.01 | 1470 | 646 | 1457 | 1.616 | 6.857 |
| N_atoms | 3.86 | 866 | 426 | 813 | 1.053 | 1.971 |
| EDM | 2.76 | 655 | 356 | 584 | 1.111 | 1.101 |
| GeoLDM | 2.37 | 587 | 340 | 522 | 1.108 | 1.025 |
| GCDM | 1.97 | 602 | 344 | 479 | 0.844 | 0.689 |
| EquiFM | 2.41 | 591 | 337 | 530 | 1.106 | 1.033 |
| GOAT | 2.74 | 605 | 350 | 534 | 1.01 | 0.883 |
| LDM-3DG | 12.29 | 1160 | 583 | 1093 | 1.42 | 5.74 |
| GeoBFN | 2.34 | 577 | 328 | 516 | 0.998 | 0.949 |
| GeoRCG (EDM) | 0.86(0.01) −68.84% | 325.2(3.4) −50.35% | 202.2(1.2) −43.20% | 257.9(5.5) −55.84% | 0.805(0.006) −27.54% | 0.475(0.005) −56.86% |

As shown, GeoRCG (EDM) roughly halves the error of the best existing models for most properties, an average 50% improvement over the best ones. This is a task where many recent models struggle to make even modest improvements, as evidenced in the table. Notably, for different properties we only re-train the representation generator, as described in Section 3.2, significantly saving training time. In Figure 5, we visualize the generated samples, which exhibit minimal property errors and display a clear trend as the target values increase. Additional randomly generated molecules are provided in Appendix E.2.
A potential concern is that, for a given property value $c$, $p_\phi(r|c)$ may produce a representation corresponding to a molecule from the training dataset, allowing the molecule generator to simply recover its full conformation based on that representation. This could lead to small property errors but a lack of novelty. To address this, we conducted a thorough evaluation of the generated molecules for each property, finding that the novelty (the proportion of new molecules not present in the training dataset) remains comparable to other methods. Additionally, the conditionally generated molecules demonstrate much higher molecule stability than EDM (Hoogeboom et al., 2022). Further details can be found in Appendix C.

Figure 5: Conditionally generated molecules on property $\alpha$ using GeoRCG (EDM). For each molecule, the first number indicates the condition value and the second the oracle property value of the generated conformer: 64.20/64.45, 69.99/70.10, 75.77/75.60, 81.56/81.00, 87.34/87.41, 93.13/93.24, 98.91/97.44, 104.70/104.73.

4.4. Fewer-Step Generation

With the geometric representation condition, it is reasonable to expect that fewer discretization steps of the reverse diffusion SDE (Song et al., 2021) would still yield competitive results. Therefore, we reduce the number of diffusion steps and evaluate the model's performance. The results are presented in Table 4 and Table 5. We provide the fewer-step performance of GeoRCG (Semla) on the GEOM-DRUG dataset in Appendix C.

Table 4: Unconditional molecule generation on QM9 with fewer diffusion steps. The blue cells (in the original table) indicate the highest value among methods with the same number of diffusion steps, while bold font emphasizes values that outperform all other methods across all diffusion steps.

| Methods | # Steps | Atom Sta (%) | Mol Sta (%) | Valid (%) |
|---|---|---|---|---|
| Data | - | 99 | 95.2 | 97.7 |
| EquiFM | 200 | 98.9(0.1) | 88.3(0.3) | 94.7(0.4) |
| GOAT | 90 | 98.4 | 84.1 | 90.9 |
| EDM | 50 | 97.0(0.1) | 66.4(0.2) | - |
| EDM-Bridge | 50 | 97.3(0.1) | 69.2(0.2) | - |
| GeoBFN | 50 | 98.28(0.1) | 85.11(0.5) | 92.27(0.4) |
| GeoRCG (EDM) | 50 | 98.75(0.05) +1.80% | 89.08(0.52) +34.16% | 95.05(0.33) |
| EDM | 100 | 97.3(0.1) | 69.8(0.2) | - |
| EDM-Bridge | 100 | 97.9(0.1) | 72.3(0.2) | - |
| GeoBFN | 100 | 98.64(0.1) | 87.21(0.3) | 93.03(0.3) |
| GeoRCG (EDM) | 100 | 99.08(0.03) +1.83% | 91.85(0.34) +31.59% | 96.49(0.27) |
| EDM | 500 | 98.5(0.1) | 81.2(0.1) | - |
| EDM-Bridge | 500 | 98.7(0.1) | 83.7(0.1) | - |
| GeoBFN | 500 | 98.78(0.8) | 88.42(0.2) | 93.35(0.2) |
| GeoRCG (EDM) | 500 | 99.09(0.01) +0.60% | 91.89(0.24) +13.17% | 96.57(0.12) |
| EDM | 1000 | 98.7 | 82 | 91.9 |
| EDM-Bridge | 1000 | 98.8 | 84.6 | 92 |
| GeoBFN | 1000 | 99.08(0.06) | 90.87(0.2) | 95.31(0.1) |
| GeoRCG (EDM) | 1000 | 99.12(0.03) +0.43% | 92.32(0.06) +12.59% | 96.52(0.2) +5.03% |

Table 5: Unconditional molecule generation on GEOM-DRUG with fewer diffusion steps.

| # Steps | GeoBFN Atom Sta (%) | GeoRCG (EDM) Atom Sta (%) | GeoBFN Valid (%) | GeoRCG (EDM) Valid (%) |
|---|---|---|---|---|
| 50 | 75.11 | 81.44(0.10) | 91.66 | 95.70(0.70) |
| 100 | 78.89 | 83.02(0.06) | 93.05 | 96.30(0.70) |
| 500 | 81.39 | 84.03(0.37) | 93.47 | 97.57(0.90) |
| 1000 | 85.6 | 84.3(0.12) | 92.08 | 98.5(0.12) |

As demonstrated, with the geometric representation condition, GeoRCG consistently outperforms other approaches at almost all step counts. Notably, in Table 4, with approximately 100 steps, the performance of our method nearly converges to the optimal performance observed with 1,000 steps, which already surpasses all other methods across all step counts. This demonstrates the strong potential of GeoRCG to reduce the number of iterations required by sequential generative methods.
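The fewer-step results above correspond to running the trained reverse process on a strided subset of timesteps. A minimal sketch under that assumption (the `sample_prior` and `reverse_step` interfaces are hypothetical; the exact solver follows the base generator):

```python
import torch

def sample_fewer_steps(mol_generator, rep, n_atoms, n_steps=100, train_steps=1000):
    # Evenly strided subset of the original training timesteps.
    timesteps = torch.linspace(train_steps - 1, 0, n_steps + 1).long()
    mol = mol_generator.sample_prior(n_atoms)  # CoM-free Gaussian initialization
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):
        mol = mol_generator.reverse_step(mol, t_cur, t_next, rep=rep)
    return mol
```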
5. Conclusions and Limitations

Conclusions. In this work, we present GeoRCG, a simple yet effective framework for improving the generation quality of arbitrary molecule generators by incorporating geometric representation conditions. We use EDM (Hoogeboom et al., 2022) and SemlaFlow (Irwin et al., 2024) as base generators and demonstrate the effectiveness of our framework through extensive molecular generation experiments. In conditional generation tasks, GeoRCG achieves a remarkable 50% performance boost compared to recent SOTA models. Additionally, the representation guidance enables sampling with 10x fewer diffusion steps while maintaining near-optimal performance. Beyond these empirical improvements, we provide theoretical characterizations of representation-conditioned generative models, which address a key gap in the existing empirical literature (Li et al., 2023).

Table 6: Sampling time (in seconds) for 5k samples using SemlaFlow and GeoRCG (Semla) across different numbers of sampling steps, measured on a single NVIDIA RTX 4090.

| # Steps | Method | Rep. Time | Mol. Time | Mol. Time w/o CFG |
|---|---|---|---|---|
| 100 | SemlaFlow | - | 610 | 610 |
| 100 | GeoRCG (Semla) | 97 | 1481 | 770 |
| 50 | SemlaFlow | - | 310 | 310 |
| 50 | GeoRCG (Semla) | 97 | 690 | 380 |
| 20 | SemlaFlow | - | 152 | 152 |
| 20 | GeoRCG (Semla) | 97 | 315 | 189 |

Limitations. We discuss two limitations of GeoRCG. First, as a representation-guided generative method, its generation quality may depend heavily on the quality of the representations. For instance, on the GEOM-DRUG dataset, an insufficiently pre-trained encoder may produce less meaningful representations. As a result, the benefits of low-temperature sampling and classifier-free guidance in enhancing generation quality and controllability may be less pronounced. Future work could investigate more effective pre-training strategies beyond standard denoising, or enhanced representation regularization techniques, to mitigate this issue. Second, although the additional conditioning module introduces small overhead, the use of classifier-free guidance requires doubling the batch size, resulting in roughly twice the resource consumption (memory and computation); see Table 6. Nonetheless, in many cases, such as GeoRCG (EDM) on QM9, performance gains are substantial even without employing classifier-free guidance. With ongoing advancements in hardware and infrastructure, we expect the overhead introduced by parallelism to be further minimized.

Acknowledgement

This work is supported by the National Key R&D Program of China (2022ZD0160300) and the National Natural Science Foundation of China (62276003).

Impact Statement

This paper presents work whose goal is to advance the field of molecule generation. There are many potential societal consequences of improved molecule generation, such as accelerating drug discovery and developing new materials. None of them, we feel, need to be specifically highlighted here for potential risk.

References

Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc., 2011. URL https://proceedings.neurips.cc/paper_files/paper/2011/file/e1d5be1c7f2f456670de3d53c7b54f4a-Paper.pdf.
Anderson, B., Hy, T. S., and Kondor, R. Cormorant: Covariant molecular neural networks. Advances in Neural Information Processing Systems, 32, 2019.
Aneja, J., Schwing, A., Kautz, J., and Vahdat, A. A contrastive learning approach for training variational autoencoder priors. Advances in Neural Information Processing Systems, 34:480–493, 2021.
Axelrod, S. and Gómez-Bombarelli, R. GEOM, energy-annotated molecular conformations for property prediction and molecular generation. Scientific Data, 9(1):185, 2022.
Beaudry, N. J. and Renner, R. An intuitive proof of the data processing inequality. arXiv preprint arXiv:1107.0740, 2011.
Brown, T. B. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Chen, S., Chewi, S., Li, J., Li, Y., Salim, A., and Zhang, A. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zyLVMgsZ0U_.
Chen, X., Xie, S., and He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9640–9649, 2021.
Dai, B. and Wipf, D. Diagnosing and enhancing VAE models. arXiv preprint arXiv:1903.05789, 2019.
De Bortoli, V. Convergence of denoising diffusion models under the manifold hypothesis. arXiv preprint arXiv:2208.05314, 2022.
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Fang, X., Liu, L., Lei, J., He, D., Zhang, S., Zhou, J., Wang, F., Wu, H., and Wang, H. Geometry-enhanced molecular representation learning for property prediction. Nature Machine Intelligence, 4(2):127–134, 2022.
Feng, S., Ni, Y., Lan, Y., Ma, Z.-M., and Ma, W.-Y. Fractional denoising for 3D molecular pre-training. In International Conference on Machine Learning, pp. 9938–9961. PMLR, 2023.
Feng, S., Ni, Y., Lu, Y., Ma, Z.-M., Ma, W.-Y., and Lan, Y. UniGEM: A unified approach to generation and property prediction for molecules. arXiv preprint arXiv:2410.10516, 2024.
Garcia Satorras, V., Hoogeboom, E., Fuchs, F., Posner, I., and Welling, M. E(n) equivariant normalizing flows. Advances in Neural Information Processing Systems, 34:4181–4192, 2021.
Gebauer, N., Gastegger, M., and Schütt, K. Symmetry-adapted generation of 3D point sets for the targeted discovery of molecules. Advances in Neural Information Processing Systems, 32, 2019.
Gebauer, N. W., Gastegger, M., Hessmann, S. S., Müller, K.-R., and Schütt, K. T. Inverse design of 3D molecular structures with conditional generative neural networks. Nature Communications, 13(1):973, 2022.
Graves, J., Byerly, J., Priego, E., Makkapati, N., Parish, S. V., Medellin, B., and Berrondo, M. A review of deep learning methods for antibodies. Antibodies, 9(2):12, 2020.
Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
Hong, H., Lin, W., and Tan, K. C. Fast 3D molecule generation via unified geometric optimal transport. arXiv preprint arXiv:2405.15252, 2024.
Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. Equivariant diffusion for molecule generation in 3D. In International Conference on Machine Learning, pp. 8867–8887. PMLR, 2022.
Huang, H., Sun, L., Du, B., and Lv, W. Learning joint 2-D and 3-D graph diffusion models for complete molecule generation. IEEE Transactions on Neural Networks and Learning Systems, 2024.
Ingraham, J. B., Baranov, M., Costello, Z., Barber, K. W., Wang, W., Ismail, A., Frappier, V., Lord, D. M., Ng-Thow-Hing, C., Van Vlack, E. R., et al. Illuminating protein space with a programmable generative model. Nature, 623(7989):1070–1078, 2023.
Irwin, R., Tibo, A., Janet, J.-P., and Olsson, S. Efficient 3D molecular generation with flow matching and scale optimal transport. arXiv preprint arXiv:2406.07266, 2024.
Jang, Y., Kim, D., and Ahn, S. Hierarchical graph generation with K2-trees. In ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023.
Jiao, R., Han, J., Huang, W., Rong, Y., and Liu, Y. Energy-motivated equivariant pretraining for 3D molecular graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 8096–8104, 2023.
Jiao, R., Kong, X., Yu, Z., Huang, W., and Liu, Y. Equivariant pretrained transformer for unified geometric learning on multi-domain 3D molecules. arXiv preprint arXiv:2402.12714, 2024.
Jo, J., Kim, D., and Hwang, S. J. Graph generation with diffusion mixture. arXiv preprint arXiv:2302.03596, 2023.
Kenton, J. D. M.-W. C. and Toutanova, L. K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, volume 1, pp. 2, 2019.
Le, T., Cremer, J., Noé, F., Clevert, D.-A., and Schütt, K. Navigating the design space of equivariant diffusion-based generative models for de novo 3D molecule generation. arXiv preprint arXiv:2309.17296, 2023.
Li, P., Wang, J., Qiao, Y., Chen, H., Yu, Y., Yao, X., Gao, P., Xie, G., and Song, S. An effective self-supervised framework for learning expressive molecular global representations to drug discovery. Briefings in Bioinformatics, 22(6):bbab109, 2021.
Li, T., Katabi, D., and He, K. Self-conditioned image generation via generating representations. arXiv preprint arXiv:2312.03701, 2023.
Li, Z., Wang, X., Huang, Y., and Zhang, M. Is distance matrix enough for geometric deep learning? Advances in Neural Information Processing Systems, 36, 2024a.
Li, Z., Wang, X., Kang, S., and Zhang, M. On the completeness of invariant geometric deep learning models. arXiv preprint arXiv:2402.04836, 2024b.
Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
Liu, S., Guo, H., and Tang, J. Molecular geometry pretraining with SE(3)-invariant denoising distance matching. arXiv preprint arXiv:2206.13602, 2022a.
Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022b.
Luo, T., Mo, Z., and Pan, S. J. Fast graph generation via spectral diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
Luo, Y. and Ji, S. An autoregressive flow model for 3D molecular geometry generation from scratch. In International Conference on Learning Representations (ICLR), 2022.
Mislow, K. Introduction to Stereochemistry. Courier Corporation, 2012.
Morehead, A. and Cheng, J. Geometry-complete diffusion for 3D molecule generation and optimization. Communications Chemistry, 7(1):150, 2024.
Nakata, M. and Shimazaki, T. PubChemQC project: a large-scale first-principles electronic structure database for data-driven chemistry. Journal of Chemical Information and Modeling, 57(6):1300–1308, 2017.
Ni, Y., Feng, S., Hong, X., Sun, Y., Ma, W.-Y., Ma, Z.-M., Ye, Q., and Lan, Y. Pre-training with fractional denoising to enhance molecular property prediction. arXiv preprint arXiv:2407.11086, 2024.
O'Boyle, N. M., Banck, M., James, C. A., Morley, C., Vandermeersch, T., and Hutchison, G. R. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3:1–14, 2011.
Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195–4205, 2023.
Ramakrishnan, R., Dral, P. O., Rupp, M., and Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data, 1(1):1–7, 2014.
Razavi, A., Van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10684–10695, 2022.
Satorras, V. G., Hoogeboom, E., and Welling, M. E(n) equivariant graph neural networks. In International Conference on Machine Learning, pp. 9323–9332. PMLR, 2021.
Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.
Song, Y., Gong, J., Qu, Y., Zhou, H., Zheng, M., Liu, J., and Ma, W.-Y. Unified generative modeling of 3D molecules via Bayesian flow networks. arXiv preprint arXiv:2403.15441, 2024a.
Song, Y., Gong, J., Xu, M., Cao, Z., Lan, Y., Ermon, S., Zhou, H., and Ma, W.-Y. Equivariant flow matching with hybrid probability transport for 3D molecule generation. Advances in Neural Information Processing Systems, 36, 2024b.
Thölke, P. and De Fabritiis, G. TorchMD-NET: equivariant transformers for neural network based molecular potentials. arXiv preprint arXiv:2202.02541, 2022.
Thomas, N., Smidt, T., Kearnes, S., Yang, L., Li, L., Kohlhoff, K., and Riley, P. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv preprint arXiv:1802.08219, 2018.
Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
Vignac, C. and Frossard, P. Top-N: Equivariant set and graph generation without exchangeability. arXiv preprint arXiv:2110.02096, 2021.
Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., and Frossard, P. DiGress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.
Vignac, C., Osman, N., Toni, L., and Frossard, P. MiDi: Mixed graph and 3D denoising diffusion for molecule generation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 560–576. Springer, 2023.
Wang, S., Tan, Z., Zhao, X., Chen, T., Liu, H., and Li, J. GraphRCG: Self-conditioned graph generation via bootstrapped representations. arXiv preprint arXiv:2403.01071, 2024.
Wu, L., Gong, C., Liu, X., Ye, M., and Liu, Q. Diffusion-based molecule generation with informative prior bridges. Advances in Neural Information Processing Systems, 35:36533–36545, 2022.
Xu, M., Powers, A. S., Dror, R. O., Ermon, S., and Leskovec, J. Geometric latent diffusion models for 3D molecule generation. In International Conference on Machine Learning, pp. 38592–38610. PMLR, 2023.
You, Y., Zhou, R., Park, J., Xu, H., Tian, C., Wang, Z., and Shen, Y. Latent 3D graph diffusion. In The Twelfth International Conference on Learning Representations, 2023.
Yuan, H., Huang, K., Ni, C., Chen, M., and Wang, M. Reward-directed conditional diffusion: Provable distribution estimation and reward improvement. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
Zaidi, S., Schaarschmidt, M., Martens, J., Kim, H., Teh, Y. W., Sanchez-Gonzalez, A., Battaglia, P., Pascanu, R., and Godwin, J. Pre-training via denoising for molecular property prediction. arXiv preprint arXiv:2206.00133, 2022.
Zheng, Q., Le, M., Shaul, N., Lipman, Y., Grover, A., and Chen, R. T. Guided flows for generative modeling and decision making. arXiv preprint arXiv:2311.13443, 2023.
Zhou, C., Wang, X., and Zhang, M. Unifying generation and prediction on graphs with latent graph diffusion. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
Zhou, G., Gao, Z., Ding, Q., Zheng, H., Xu, H., Wei, Z., Zhang, L., and Ke, G. Uni-Mol: A universal 3D molecular representation learning framework. 2023.

A. Algorithms

High-level Algorithm for Parallel Training and Sequential Sampling. We provide the high-level training and sampling algorithm for GeoRCG in Algorithm 1.

Algorithm 1: Parallel Training and Sequential Sampling for GeoRCG

Input: Molecule dataset $\mathcal{D}^{\text{mol}}_{\text{train}} \subseteq \bigcup_{N=1}^{+\infty} (\mathbb{R}^{N \times 3} \times \mathbb{R}^{N \times d})$, pre-trained geometric encoder $\mathcal{E}$, initial representation generator $p_{\phi_0}(r)$, molecule generator $p_{\theta_0}(\mathcal{M}|r)$.
Output: Trained representation generator $p_\phi(r)$, molecule generator $p_\theta(\mathcal{M}|r)$, and molecule samples from $p_{\phi,\theta}(\mathcal{M})$.

Parallel Training:
- Pre-process to obtain the representation dataset $\mathcal{D}^{\text{rep}}_{\text{train}} = \{(\mathcal{E}(\mathcal{M}), N(\mathcal{M})) \mid \mathcal{M} \in \mathcal{D}^{\text{mol}}_{\text{train}}\}$ and the mol-rep dataset $\mathcal{D}^{\text{mol-rep}}_{\text{train}} = \{(\mathcal{E}(\mathcal{M}), \mathcal{M}) \mid \mathcal{M} \in \mathcal{D}^{\text{mol}}_{\text{train}}\}$.
- Train the representation generator $p_{\phi_0}(r)$ on $\mathcal{D}^{\text{rep}}_{\text{train}}$ using the loss $\mathcal{L}_{\text{rep}}$ in Equation (1).
- Train the molecule generator $p_{\theta_0}(\mathcal{M}|r)$ on $\mathcal{D}^{\text{mol-rep}}_{\text{train}}$ using the loss $\mathcal{L}_{\text{mol}}$ in Equation (2), while applying the training techniques outlined below, including representation perturbation and the representation loss.

Sequential Sampling:
- Sample a representation $r^* \sim p_\phi(r)$ with the low-temperature sampling technique outlined below.
- Sample a molecule $\mathcal{M} \sim p_\theta(\mathcal{M}|r^*)$ conditionally, with the classifier-free guidance technique outlined below.

Return: Trained representation generator $p_\phi(r)$, molecule generator $p_\theta(\mathcal{M}|r)$, and generated molecule samples.

Training: Representation Perturbation. Unlike typical conditional training scenarios, GeoRCG faces a unique challenge: the representations that condition the molecule generator during training may not always coincide with those generated by the representation generator during the sampling stage. This issue is more pronounced in molecular generation than in the image case (Li et al., 2023), since molecular pre-trained encoders are typically not trained on datasets as large, or with advanced regularization techniques like MoCo v3 (Chen et al., 2021). Consequently, the molecule generator is susceptible to overfitting to the training representations, as evidenced by our preliminary experiments on QM9 molecule generation shown in Table 7.
Training: Representation Perturbation. Unlike typical conditional training scenarios, GeoRCG faces a unique challenge: the representations that condition the molecule generator during training may not always coincide with those produced by the representation generator at sampling time. This issue is more pronounced in molecular generation than in the image case (Li et al., 2023), where pre-trained encoders are trained on much larger datasets with advanced regularization techniques such as MoCo v3 (Chen et al., 2021). Consequently, the molecule generator is susceptible to overfitting to the training representations, as evidenced by our preliminary experiments on QM9 molecule generation shown in Table 7.

Table 7: Quality of molecules generated by GeoRCG trained on the QM9 dataset without the representation perturbation technique, comparing different representation sources. "Training Dataset" refers to representations sampled from $D^{rep}_{train}$, while "Rep. Sampler" refers to representations generated by the trained representation generator $p_\varphi(r)$.

Rep. source        Mol Sta (%) (up)   Valid (%) (up)
Training Dataset   93.20 (0.50)       97.07 (0.32)
Rep. Sampler       86.93 (0.50)       89.12 (0.21)

We find that a simple technique, perturbing the geometric representation with Gaussian noise $\sigma_{rep}\epsilon$ while training the molecule generator, where $\epsilon \sim N(0, I)$ and $\sigma_{rep}$ is a relatively small noise scale, is particularly effective at solving this problem. Formally, after sampling a data point $(E(M), M)$ from $D^{mol\text{-}rep}_{train}$, we use $(E(M) + \sigma_{rep}\epsilon, M)$ for training. The ablation study in Appendix C shows that this simple method effectively prevents overfitting and ensures that performance on novel representations matches that on representations from the training dataset. In practice, we set $\sigma_{rep} = 0.3$ for the QM9 dataset and $\sigma_{rep} \in [0.3, 0.5]$ for the GEOM-DRUG dataset.

Training: Representation Loss. Training molecular generative models typically involves predicting a clean molecule from a noisy input (in noise parameterization or vector-field parameterization, an equivalent formulation exists for constructing clean-molecule predictions). The loss minimizes the distance (e.g., MSE for coordinates) between the predicted clean molecule and the true clean molecule. To further strengthen supervision from the representation, we introduce an additional representation loss during training, defined as the MSE between the representation of the predicted clean molecule and that of the actual clean molecule. In practice, we do not apply this technique for GeoRCG (EDM), whereas for GeoRCG (Semla) we incorporate it with a relatively small coefficient.
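The following sketch combines both techniques in one training step. It reuses the toy `encoder` / `mol_denoiser` setup from the previous sketch; the coefficient `w_rep` is an illustrative placeholder for the "relatively small coefficient" mentioned above, not a value from the paper.

```python
# A minimal sketch of representation perturbation and the representation loss
# (hypothetical names; toy alpha_t schedule as before).
import torch

def molecule_training_step(mol_denoiser, encoder, mols, sigma_rep=0.3, w_rep=0.1):
    reps = encoder(mols)
    # Representation perturbation: condition on a noised E(M) so the generator
    # tolerates imperfect representations produced at sampling time.
    reps_noisy = reps + sigma_rep * torch.randn_like(reps)

    t = torch.rand(mols.shape[0], 1)
    a = torch.cos(0.5 * torch.pi * t)
    eps = torch.randn_like(mols)
    xt = a * mols + (1 - a**2).sqrt() * eps

    eps_hat = mol_denoiser(torch.cat([xt, reps_noisy, t], dim=-1))
    loss_denoise = ((eps_hat - eps) ** 2).mean()

    # Representation loss: reconstruct the clean molecule implied by the noise
    # prediction and match its representation to the clean molecule's.
    x0_hat = (xt - (1 - a**2).sqrt() * eps_hat) / a
    loss_rep = ((encoder(x0_hat) - reps) ** 2).mean()

    return loss_denoise + w_rep * loss_rep   # w_rep kept small, per the text
```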
Sampling: Low-Temperature Sampling. We adapt the low-temperature sampling algorithm introduced by Chroma (Ingraham et al., 2023) to the representation generator. However, we apply it to an MLP-based diffusion model rather than to an equivariant diffusion model that processes geometric objects, as Chroma does. The objective of low-temperature sampling is to perturb the learned representation distribution $p_\varphi(r)$ by rescaling it with an inverse temperature factor $1/T$, where $T$ is a tunable temperature parameter at sampling time, finally enabling sampling from $p_\varphi^{1/T}(r)/Z_T$, where $Z_T$ is a normalization constant. The method proposed in Chroma scales the score estimated at each diffusion time step by a time-dependent factor $\lambda_t$. The approach is derived from, and has theoretical guarantees for, simplified toy distributions; its performance on complex distributions, though lacking strict guarantees, has shown consistent results when combined with annealed Langevin sampling (Song et al., 2021). Here we briefly introduce it for self-containedness and refer readers to Ingraham et al. (2023) for the detailed derivation and illustration. Consider the vanilla reverse-time SDE used in DDPM sampling (VP formulation) (Song et al., 2021):

$dr_t = \big[-\tfrac{1}{2}\beta_t r_t - \beta_t \nabla_r \log q_t(r_t)\big]\,dt + \sqrt{\beta_t}\,d\bar{w}$,

where $\bar{w}$ is a reverse-time Wiener process, $q_t(r)$ denotes the ground-truth representation distribution at time $t$, and $\beta_t$ is the time-dependent diffusion schedule. To incorporate low-temperature sampling, we utilize the following Hybrid Langevin Reverse-time SDE:

$dr_t = \big[-\tfrac{1}{2}\beta_t r_t - \big(\lambda_t + \tfrac{\lambda_0\psi}{2}\big)\beta_t \nabla_r \log q_t(r_t)\big]\,dt + \sqrt{\beta_t(1+\psi)}\,d\bar{w}$, (8)

where $\lambda_t$ is a time-dependent temperature parameter defined as $\lambda_t = \frac{\lambda_0}{\alpha_t^2 + (1-\alpha_t^2)\lambda_0}$, with $\lambda_0 = 1/T$, and $\alpha_t$ satisfies $-\frac{1}{2}\beta_t = \frac{d\log\alpha_t}{dt}$. The parameter $\psi$ controls the rate of Langevin equilibration per unit time and, as shown in Ingraham et al. (2023), helps align more effectively with the reweighting objective on complex distributions. In our implementation, we employ the explicit annealed Langevin process (the corrector step from Song et al. (2021)) to achieve similar results. In practice, for the unconditional QM9 and GEOM-DRUG generation results of GeoRCG (EDM) shown in Table 1, we set $T = 1.0$ for QM9 and $T = 0.5$ for GEOM-DRUG. For conditional generation in Table 3, we set $T = 1.0$. The effect of varying $T$ on QM9 is detailed in Table 10. For the GeoRCG (SemlaFlow) results in Table 2, we use $T = 1.0$.

Sampling: Classifier-Free Guidance. We employ the classifier-free guidance algorithm introduced in Ho & Salimans (2022) for our molecule generator. Specifically, we introduce a trainable fake representation, denoted $l$, which serves as the unconditional signal. During training, $l$ is initialized as learnable parameters, and with probability $p_{fake}$ the true representation $r$ is replaced by $l$. This ensures that the model is capable of generating molecules unconditionally, i.e., $p_\theta(M|l)$ approximates $q(M)$. During sampling, the final score estimate produced by the molecule generator is adjusted using the formula $(1 + w) f_\theta(M_t, t, r) - w f_\theta(M_t, t, l)$, allowing flexible control over the strength of the representation guidance. In practice, for unconditional generation in Table 1, we set $w = 1.0$ for QM9 and $w = 0.0$ for GEOM-DRUG. The impact of varying $w$ on QM9 is shown in Table 10. For conditional generation in Table 3, we set $w = 2.0$. For GeoRCG (SemlaFlow), we use $w = -0.9$ (note that $-1.0 < w < 0.0$ indicates subtle representation guidance that exerts a small but still meaningful influence on the model's behavior). Further tuning of these hyperparameters may yield improved performance.
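Both sampling-time adjustments amount to rescaling or recombining score estimates; a minimal sketch follows. The function and argument names (`score_rep`, `score_mol`) are placeholders standing in for the trained score networks.

```python
# A minimal sketch of the two sampling-time score adjustments above
# (hypothetical callables; not the actual GeoRCG sampler).
def tempered_rep_score(score_rep, r_t, t, alpha_t, T=0.5, psi=2.0):
    """Scale the representation score by the time-dependent factor lambda_t
    plus the Langevin term lambda_0 * psi / 2, as in the drift of Eq. (8)."""
    lam0 = 1.0 / T
    lam_t = lam0 / (alpha_t**2 + (1 - alpha_t**2) * lam0)
    return (lam_t + lam0 * psi / 2) * score_rep(r_t, t)

def guided_mol_score(score_mol, M_t, t, r, l, w=1.0):
    """Classifier-free guidance: combine the conditional and 'fake'-conditioned
    score estimates as (1 + w) * f(M_t, t, r) - w * f(M_t, t, l)."""
    return (1 + w) * score_mol(M_t, t, r) - w * score_mol(M_t, t, l)
```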
B. Experiment Details

Metrics and Baseline Descriptions. We adopt the evaluation metrics, guidelines, and baselines commonly used in prior 3D molecular generative models to ensure a fair comparison (Hoogeboom et al., 2022). In the unconditional setting, we assess the generated molecules using several key metrics:
- Atom Stability: the proportion of atoms with correct valency.
- Molecule Stability: the proportion of molecules in which all atoms are stable.
- Validity: the proportion of molecules that can be converted into valid SMILES using RDKit.
- Validity & Uniqueness: the proportion of unique molecules among the valid molecules.
- Energy: the energy $U(x)$ of a conformation $x$. Lower energies typically correspond to more stable and physically plausible conformations, closer to what would be observed in nature. Following Irwin et al. (2024), the energy is calculated using the MMFF94 force field within RDKit, a commonly used molecular modeling framework.
- Strain: a measure of how distorted a generated conformation $x$ is compared to its relaxed (optimized) state $\tilde{x}$. The relaxed conformation $\tilde{x}$ is obtained by energy minimization with the MMFF94 force field, where the molecular structure is iteratively adjusted to reduce its energy until reaching a local minimum. Mathematically, the strain energy is defined as $U(x) - U(\tilde{x})$, where $U(x)$ is the energy of the generated conformation and $U(\tilde{x})$ is the energy of the relaxed conformation. Lower strain energies imply that the generated conformations are closer to being physically accurate and require minimal correction during optimization.

Following the approach of Hoogeboom et al. (2022) and Vignac & Frossard (2021), we do not report Novelty scores in the main text: QM9 is an exhaustive enumeration of molecules satisfying a predefined set of constraints, so a novel molecule would often violate at least one of these constraints, which indicates that a model fails to fully capture the properties of the dataset. For reference, we observe that the novelty of GeoRCG (EDM) on QM9 is approximately 42% when $w = 0.0$, $T = 1.0$, compared to about 65% for EDM.

When comparing with 2D & 3D models in Table 8, we evaluate two 3D metrics introduced by MiDi (Vignac et al., 2023), which directly assess geometry-learning ability (a code sketch of this computation is given at the end of this subsection):
- Bond Length W1: the weighted 1-Wasserstein distance between the bond-length distributions of the generated molecules and the training dataset, with weights corresponding to different bond types. Formally,

$\text{BondLengthsW1} = \sum_{y \in \text{bond types}} q_Y(y)\, W_1\big(\hat{D}_{dist}(y), D_{dist}(y)\big)$, (9)

where $q_Y(y)$ is the proportion of bonds of type $y$ in the training set, $\hat{D}_{dist}(y)$ is the generated distribution of bond lengths for bond type $y$, and $D_{dist}(y)$ is the corresponding distribution from the test set.
- Bond Angle W1: the weighted 1-Wasserstein distance between the atom-centered angle distributions of the generated molecules and the training dataset, with weights based on atom types. Formally,

$\text{BondAnglesW1} = \sum_{x \in \text{atom types}} q_X(x)\, W_1\big(\hat{D}_{angles}(x), D_{angles}(x)\big)$, (10)

where $q_X(x)$ denotes the proportion of atoms of type $x$ in the training set, restricted to atoms with two or more neighbors, and $D_{angles}(x)$ represents the distribution of geometric angles $\angle(r_k - r_i, r_j - r_i)$, where $i$ is an atom of type $x$ and $k$ and $j$ are neighbors of $i$.

In the conditional generation setting, as described in Hoogeboom et al. (2022), we evaluate our approach on the QM9 dataset across six properties: polarizability α, orbital energies εHOMO and εLUMO and their gap Δε, dipole moment µ, and heat capacity Cv. The generative model is trained conditionally on the second half of the QM9 dataset, and an EGNN (Satorras et al., 2021) classifier, trained on the first half, is employed to evaluate the MAE property error of the generated samples. Three baselines are adopted in Table 3:
- QM9 (lower bound): the mean error of a classifier trained on the first half of QM9 and evaluated on the second half. This baseline represents the inherent bias/error of the classifier, setting a lower bound on model performance and reflecting the best possible performance a model can achieve.
- Random: the classifier's performance when evaluated on the second half of QM9 with randomly shuffled property labels. This baseline provides an upper bound, representing the worst achievable performance.
- N atoms: the performance of a classifier trained exclusively on the number of atoms $N$ and evaluated using only $N$ as input. This baseline captures the intrinsic relationship between molecular properties and the number of atoms, which a generative model must surpass to demonstrate effectiveness.
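As promised above, here is a minimal sketch of the weighted 1-Wasserstein metric of Eq. (9); it assumes bond lengths have already been extracted and grouped by bond type, and uses scipy's one-dimensional Wasserstein distance between empirical samples. The helper name and example data are ours.

```python
# A minimal sketch of the Bond Length W1 metric (Eq. 9); Bond Angle W1 (Eq. 10)
# is analogous with angles grouped by central-atom type.
from scipy.stats import wasserstein_distance

def bond_length_w1(generated: dict, reference: dict) -> float:
    """generated/reference: bond type -> list of bond lengths (angstrom)."""
    n_ref = sum(len(v) for v in reference.values())
    total = 0.0
    for bond_type, ref_lengths in reference.items():
        q_y = len(ref_lengths) / n_ref               # bond-type proportion q_Y(y)
        gen_lengths = generated.get(bond_type, [])
        if gen_lengths:
            total += q_y * wasserstein_distance(gen_lengths, ref_lengths)
    return total

# Toy usage: C-C bonds dominate the reference set, so they carry most weight.
print(bond_length_w1({"C-C": [1.54, 1.52]},
                     {"C-C": [1.53, 1.54, 1.55], "C=O": [1.22]}))
```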
Model Architectures, Hyperparameters and Training Details

Representation Generator. We use the same architecture for the representation generator as the MLP-based diffusion model proposed in Li et al. (2023): 18 blocks of residual MLP layers with 1536 hidden dimensions, 1000 diffusion steps, and a linear noise schedule for $\beta_t$. The representation generator is trained for 2000 epochs with a batch size of 128 for both the QM9 and GEOM-DRUG datasets. Training on QM9 takes approximately 2.5 days on a single Nvidia 4090, while training on GEOM-DRUG takes around 4 days on a single Nvidia A800. Training time can be further reduced, as the model shows minimal progress after approximately half of the reported time.

Molecule Generator. We adopt EDM (Hoogeboom et al., 2022) as the base molecule generator, using the same EGNN (Satorras et al., 2021) architecture, with the exception of the conditioning module. Specifically, we introduce a simple gated feedforward layer to incorporate the representation condition, inserting it between consecutive EGNN blocks to enhance regularization and improve model expressiveness; a sketch of one possible form of such a layer is given at the end of this subsection. For the EGNN hyperparameters, we use 9 layers with 256 hidden dimensions for QM9 and 4 layers with 256 hidden dimensions for GEOM-DRUG. The number of diffusion steps is set to 1000 (except for the cases in Table 4 that generate molecules with fewer steps), and we employ the polynomial scheduler for $\alpha_t^{(M)}$. Notably, all model hyperparameters are identical to those of EDM for a fair comparison. During training, we use a batch size of 128 and 3000 epochs on QM9, and a batch size of 64 and 20 epochs on GEOM-DRUG. Training takes approximately 6 days on QM9 using a single Nvidia 4090, and around 10 days on GEOM-DRUG using two Nvidia A800 GPUs. For GeoRCG (SemlaFlow), we select the better-performing architecture between a gated feedforward layer and an AdaLN-Zero-like module (Peebles & Xie, 2023) for conditioning on representations and place it between every pair of Semla blocks. During training, we use a batch cost of 2048 and train for 300 epochs.
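As referenced above, the following is one plausible form of a gated feedforward conditioning layer. The paper does not spell out its exact design, so this is an illustrative guess (in the spirit of AdaLN-Zero's zero-initialized gating), not the actual GeoRCG module.

```python
# A minimal, hypothetical gated feedforward conditioning layer that could sit
# between EGNN blocks to inject the representation condition r.
import torch
import torch.nn as nn

class GatedFFCondition(nn.Module):
    def __init__(self, hidden_dim: int, rep_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(hidden_dim + rep_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim))
        # Zero-initialized gate: the layer starts as an identity mapping and
        # learns how strongly to inject the condition (AdaLN-Zero spirit).
        self.gate = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # h: per-atom hidden states (N, hidden_dim); r: representation (rep_dim,)
        r_exp = r.expand(h.shape[0], -1)
        return h + self.gate * self.ff(torch.cat([h, r_exp], dim=-1))

# Usage between blocks (sketch): h = egnn_block(h, x); h = cond_layer(h, r)
```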
Evaluation of LDM-3DG (You et al., 2023). We evaluate the performance of LDM-3DG (You et al., 2023), an Auto-Encoder-based method that also leverages the compactness of the representation space to achieve good performance, in Table 8 and Table 3. For the unconditional results in Table 8, we utilize the 3D conformations unconditionally generated by LDM-3DG and compute the bond information using the look-up-table method from EDM (Hoogeboom et al., 2022). Notably, although LDM-3DG predicts both the 2D molecular graph and the 3D conformation, we do not use the bond information it predicts, for the following reasons:
1. For the calculation of 3D geometry statistics, we observe significant inconsistencies between the generated 2D graphs and 3D geometries (e.g., valid molecules with bond lengths exceeding 100 m), leading to unreliable statistics (e.g., Bond Length W1 exploding to 3900).
2. For stability and validity metrics, which are fundamentally 2D and computed on molecular graphs (atoms and bond types), using the generated 2D graph would ignore the contribution of the 3D module, preventing an evaluation of its 3D learning performance.
3. Most critically, their 2D module is explicitly designed to filter out invalid (sub-)molecules during generation using RDKit: if invalid molecules or sub-molecules are generated, they are regenerated. This explicit filtering deviates from our standard evaluation criteria and is unsuitable for a fair comparison.

For the conditional results in Table 3, we first note a potential issue with LDM-3DG: the model cannot explicitly specify the node number $N$ during molecule generation, as it uses an auto-regressive 2D generator that automatically stops adding atoms/motifs when deemed sufficient. However, the evaluation in Table 3 requires specifying both $N$ and property $c$, following the ground-truth distribution $q(N, c)$ of the training dataset. To ensure fair evaluation, the conditions fed to LDM-3DG must also satisfy this distribution. As the authors claim the model can implicitly learn $q(N)$ and thus $q(N|c)$, we sample 10,000 values from $q(c)$ and feed them to LDM-3DG, expecting it to infer $N$ from $c$ implicitly as argued, and thus to match the $q(N, c)$ conditions.

C. Additional Experiment Results

Comparison of GeoRCG (EDM) with 2D & 3D Methods. We compare GeoRCG (EDM) with recent 2D & 3D methods such as MiDi (Vignac et al., 2023) and LDM-3DG (You et al., 2023). As discussed in Section 4.1, such a comparison is not entirely fair, since these models learn and generate both 2D bond structures and 3D geometries, which benefits metrics like validity and stability. Considering that they rely on external chemistry toolkits like RDKit or OpenBabel (O'Boyle et al., 2011) for bond determination at the input stage and continue to leverage this bond information throughout training and generation, we also report GeoRCG (EDM) using the same external tools for accurate bond computation on the generated 3D conformations, rather than relying on simple look-up tables (a sketch of the look-up-table approach is given after Table 8), to narrow the comparison gap, though the gap is not completely closed, since GeoRCG (EDM) still does not explicitly learn to generate bonds. Furthermore, as GeoRCG (EDM) essentially captures 3D geometric distributions, we place more emphasis on the 3D metrics that directly evaluate 3D learning capability, namely Bond Length W1 and Bond Angle W1, proposed by Vignac et al. (2023) and detailed in Appendix B. The results in Table 8 demonstrate that GeoRCG not only significantly outperforms MiDi and LDM-3DG on 3D metrics, highlighting the advantage of a pure 3D model for learning 3D structures, but also further enhances EDM's performance, which has already shown considerable promise in 3D learning.

Table 8: 3D geometry statistics and generated molecule quality on QM9 across different methods. Marked models indicate results obtained from our own experiments; see Appendix B for the evaluation guidelines. The stability metrics for EDM are higher than in Table 1 because the MiDi codebase used here for evaluation permits more valencies for atoms. Bracketed percentages for GeoRCG rows are relative changes with respect to the corresponding EDM baseline.

Method                  Angles (deg)           Bond Length (1e-2 A)   Mol Sta (%)           Atom Sta (%)           Validity (%)          Uniqueness (%)
Data                    0.1                    0                      98.7                  99.8                   98.9                  99.9
MiDi (uniform)          0.67(0.02)             1.6(0.7)               96.1(0.2)             99.7(0.0)              96.6(0.2)             97.6(0.1)
MiDi (adaptive)         0.62(0.02)             0.3(0.1)               97.5(0.1)             99.8(0.0)              97.9(0.1)             97.6(0.1)
LDM-3DG                 3.56                   0.2                    94.03                 99.38                  94.89                 97.03
EDM                     0.44                   0.1                    90.7                  99.2                   91.7                  98.5
EDM + OBabel            0.44                   0.1                    97.9                  99.8                   99.0                  98.5
GeoRCG (EDM)            0.21(0.04) [-52.27%]   0.04(0.0) [-60%]       95.82(0.16) [+5.6%]   99.59(0.02) [+0.39%]   96.54(0.27) [+5.28%]  95.74(0.18) [-2.8%]
GeoRCG (EDM) + OBabel   0.20(0.04) [-54.55%]   0.07(0.06) [-30%]      98.21(0.09) [+0.32%]  99.88(0.00) [+0.08%]   99.0(0.04) [+0.0%]    95.74(0.16) [-2.8%]
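For illustration, a minimal sketch of distance-based look-up-table bond assignment follows. The bond-length table and tolerance are illustrative values we supply here, not EDM's actual tables or margins.

```python
# A minimal, illustrative sketch of look-up-table bond assignment from 3D
# coordinates (toy reference lengths and tolerance; not EDM's actual values).
import numpy as np

# Typical single-bond lengths in angstrom (illustrative subset; keys sorted).
BOND_TABLE = {("C", "C"): 1.54, ("C", "H"): 1.09, ("C", "N"): 1.47,
              ("C", "O"): 1.43, ("H", "N"): 1.01, ("H", "O"): 0.96}
MARGIN = 0.1  # tolerance around the tabulated length (illustrative)

def assign_bonds(symbols, coords):
    """Return (i, j) atom index pairs whose distance matches a tabulated bond."""
    bonds = []
    for i in range(len(symbols)):
        for j in range(i + 1, len(symbols)):
            key = tuple(sorted((symbols[i], symbols[j])))
            ref = BOND_TABLE.get(key)
            d = np.linalg.norm(np.asarray(coords[i]) - np.asarray(coords[j]))
            if ref is not None and abs(d - ref) < MARGIN:
                bonds.append((i, j))
    return bonds

print(assign_bonds(["C", "H"], [[0.0, 0.0, 0.0], [1.09, 0.0, 0.0]]))  # [(0, 1)]
```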
Balancing Controllability. We present a more comprehensive figure that includes molecule stability, atom stability, validity, and validity & uniqueness in Figure 6.

Figure 6: Balancing controllability of (unconditional) generation on the QM9 dataset with GeoRCG (EDM). Heatmaps show Molecule Stability, Atom Stability, Validity, and Validity & Uniqueness over the inverse temperature 1/Temp in {1, 2, 3, 4} and the classifier guidance coefficient w in {0, 1, 2, 3}. Increasing w and decreasing T enhances stability, at the cost of reduced uniqueness.

Fewer-step Sampling of GeoRCG (Semla). We present the performance of GeoRCG (Semla) across varying numbers of sampling steps in Table 9. The results demonstrate consistent improvements over SemlaFlow, highlighting the effectiveness of GeoRCG on this advanced method with reduced sampling steps.

Table 9: Comparison between SemlaFlow and GeoRCG (Semla) across varying numbers of sampling steps. Results for SemlaFlow were obtained from our own experiments.

# Steps   Method           Energy (kcal/mol)   Strain (kcal/mol)   Atom-Stab. (%)   Mol.-Stab. (%)   Validity (%)
100       SemlaFlow        95.72(1.24)         56.42(1.07)         99.8(0.00)       97.4(0.07)       94.4(0.17)
100       GeoRCG (Semla)   88.6(1.03)          47.64(1.10)         99.8(0.00)       97.6(0.00)       95.3(0.13)
50        SemlaFlow        100.60(0.55)        60.31(0.13)         99.8(0.00)       96.9(0.11)       94.6(0.17)
50        GeoRCG (Semla)   91.60(1.03)         50.50(0.65)         99.8(0.00)       97.2(0.20)       95.3(0.22)
20        SemlaFlow        117.03(1.37)        76.99(1.18)         99.7(0.01)       95.4(0.17)       93.2(0.30)
20        GeoRCG (Semla)   99.83(0.91)         62.88(0.56)         99.7(0.00)       95.5(0.13)       94.2(0.01)

Ablation Study: Representation Encoders. Geometric representations play a pivotal role in GeoRCG. To evaluate the importance of representation quality, we conduct an ablation study comparing the quality of molecule samples generated by GeoRCG trained with different geometric encoder configurations. We first assess the benefits provided by the pre-training stage. Specifically, we utilize the pre-trained encoder Frad (Feng et al., 2023), trained on the PCQM4Mv2 dataset (Nakata & Shimazaki, 2017) with a hybrid coordinate-denoising task (Feng et al., 2023). This approach has been shown to equivalently learn force fields (Feng et al., 2023; Zaidi et al., 2022), and is therefore expected to produce informative representations that capture high-level molecular information.
We train GeoRCG (EDM) using representations from a well-pretrained Frad and from a Frad with randomly initialized weights. The molecule generation quality on QM9, shown in Table 10, clearly underscores the critical role of pre-training on large datasets with advanced techniques in improving representation quality, ultimately enhancing GeoRCG's performance.

Table 10: Quality of molecules generated by GeoRCG (EDM) with different encoders, trained on the QM9 dataset. "Random" indicates that the encoder weights were randomly initialized without any pre-training.

Encoder          Atom Sta (%)   Mol Sta (%)   Valid (%)     Valid & Unique (%)
Random Enc       98.55(0.01)    78.66(0.07)   94.68(0.09)   55.99(0.83)
Pretrained Enc   99.10(0.02)    92.15(0.23)   96.48(0.08)   92.45(0.21)

Next, we investigate the impact of different pre-trained encoders, which may vary in model structure and in the proxy tasks used for pre-training. Specifically, we compare Unimol (Zhou et al., 2023), which employs a message-passing neural network incorporating distance features (i.e., DisGNN in Li et al. (2023; 2024b)) and primarily uses naive coordinate denoising, with Frad (Feng et al., 2023), which adopts TorchMD-NET (Thölke & De Fabritiis, 2022) as the backbone and utilizes carefully designed hybrid denoising tasks. Both Unimol and Frad are pre-trained on the GEOM-DRUG dataset until convergence. We visualize the t-SNE of the representations generated for GEOM-DRUG. As shown in Figure 7, the t-SNE of the Unimol representations exhibits a clearer clustering pattern based on node numbers than that of the Frad representations, which may suggest better representation learning. To investigate further, we use both encoders to train GeoRCG and evaluate the quality of molecule generation. The Frad-based GeoRCG achieves a Validity of 96.9(0.44) and Atom Stability of 84.4(0.27), while the Unimol-based GeoRCG achieves a Validity of 98.5(0.12) and Atom Stability of 84.3(0.12): although the Frad-based GeoRCG produces slightly higher atom stability, its high variance and significantly lower validity suggest inferior performance. These findings, along with our main results, offer insight into which kinds of representations are more effective for guiding molecule generation, suggesting that sensitivity to molecule size may be a critical factor.

Table 11: Quality of molecules generated by GeoRCG (EDM) trained on the QM9 dataset, with and without representation perturbation ("rep. noise") and representation-condition dropout ("cond. dropout").

Configuration              Atom Sta (%)   Mol Sta (%)   Valid (%)     Valid & Unique (%)
Neither                    98.53(0.08)    86.93(0.5)    93.69(0.09)   89.12(0.21)
Cond. dropout only         98.62(0.08)    87.9(0.35)    94.64(0.18)   90.15(0.02)
Rep. noise only            99.05(0.01)    91.69(0.08)   96.48(0.11)   92.38(0.12)
Rep. noise + cond. dropout 99.10(0.02)    92.15(0.23)   96.48(0.08)   92.45(0.21)
Ablation Study: Representation Perturbation. As discussed in Appendix A, we investigate the effectiveness of the straightforward representation-perturbation technique by adding random noise to the representations during training. Additionally, we apply extra dropout in the conditioning module of our molecule generator to mitigate overfitting on the representation conditions. The ablation experiments presented in Table 11 demonstrate the efficacy of these simple yet impactful methods.

Figure 7: t-SNE visualization of representations produced by the pre-trained encoders for the GEOM-DRUG dataset, colored by node number. The left plot corresponds to Unimol (Zhou et al., 2023) and the right plot to Frad (Feng et al., 2023).

Table 12: Supplementary evaluation of conditionally generated molecules from GeoRCG (EDM). The right side reports metrics for unconditionally generated molecules from other methods for reference. Note that the conditional models (left) were trained on half of the QM9 dataset, while the unconditional models (right) were trained on the full dataset, which may account for slight decreases in stability and validity metrics.

Metric                       α             Δε            εHOMO         εLUMO         µ             Cv            |  EDM    GeoLDM     EquiFM
Atom Sta (%)                 98.93(0.04)   98.95(0.03)   98.93(0.02)   98.99(0.01)   98.90(0.02)   98.98(0.02)   |  98.7   98.9(0.1)  98.9(0.1)
Mol Sta (%)                  90.64(0.36)   90.46(0.24)   90.38(0.19)   90.94(0.23)   90.02(0.22)   90.87(0.12)   |  82     89.4(0.5)  88.3(0.3)
Valid (%)                    95.40(0.04)   95.44(0.11)   95.46(0.19)   95.74(0.16)   95.26(0.04)   95.62(0.08)   |  91.9   93.8(0.4)  94.7(0.4)
Valid & Unique (%)           90.47(0.44)   90.09(0.13)   90.06(0.11)   90.32(0.24)   89.93(0.24)   90.20(0.16)   |  90.7   92.7(0.5)  93.5(0.3)
Valid & Unique & Novel (%)   50.03(0.26)   50.81(0.15)   50.90(0.33)   50.59(0.08)   51.70(0.79)   51.10(0.29)   |  65.7   58.1       57.4

Quality of Conditionally Generated Molecules. Detailed molecular metrics for conditionally generated molecules are provided in Table 12. For comparison, we also include the stability metrics of molecules conditionally generated by EDM, which highlight a notable improvement in stability with GeoRCG. Specifically, EDM's stability scores are: α (80.4%), Δε (81.73%), εHOMO (82.81%), εLUMO (83.6%), µ (83.3%), and Cv (81.03%).

D. Theoretical Analysis

In this section, we provide a rigorous theoretical analysis of representation-conditioned diffusion models. Our theory is not limited to molecule generation and provides, to our knowledge, the first theoretical guarantees for the RCG framework (Li et al., 2023). Our analysis is organized as follows. In Appendix D.1, we analyze the generation bound of representation-conditioned diffusion models in unconditional generation tasks by showing: (i) the representation can be well generated by the first-stage diffusion model under mild assumptions (Appendix D.1.1); (ii) the second-stage representation-conditioned diffusion model exhibits no higher generalization error than a traditional one-stage diffusion model, and can arguably achieve lower error by leveraging the informative representations (Appendix D.1.2).
Then, in Appendix D.2, we analyze conditional generation tasks as follows: (iii) under mild assumptions on representations and targets, we provide a novel bound for the score estimation error (Appendix D.2.1); (iv) the generated representations provably improve the reward toward the target, with the suboptimality composed of an offline regression error and a diffusion distribution shift (Appendix D.2.2), and thus improve the second stage of conditional generation (Appendix D.2.3).

Notations. In this section, we use the SDE and score-matching formulations of diffusion models to present our theoretical results, given their equivalence with the DDPM family (Song et al., 2021). We consider the random variable $x \in \mathbb{R}^{N(d+3)}$, and use $q(\cdot)$ to denote ground-truth distributions and $p(\cdot)$ to denote the posterior distributions predicted by diffusion models. For instance, $q(x)$ is the ground-truth distribution of the underlying data $x$, while $p_\varphi(r)$ is the predicted distribution of latent representations. We use $T$ to denote the total time of diffusion models and $N_d$ to denote the number of discretization steps. We consider an SDE with continuous time $[0, T]$, as well as its discretized DDPM, which has $N_d$ diffusion steps with step size $h := T/N_d$. The forward process is denoted $(x_t)_{t\in[0,T]} \sim q_t$, and the reverse process $(\tilde{x}_t)_{t\in[0,T]} \sim p_t$. If the reverse process is predicted by the score-matching network, we use its parameters as the subscript. Note that there are two different initializations of the reverse process: the end of the forward process $q_T$, and standard Gaussian noise $\gamma^d$; we use the superscript $q_T$ to distinguish the former from the latter.

D.1. Unconditional Generation

D.1.1. Provable Generation of Representations

Recall the two-stage generation process of representation-conditioned generation: $p(x, r) = p_\theta(x|r)\,p_\varphi(r)$. To quantitatively evaluate the generation process, we consider the two stages separately. In this subsection, we first provide a theoretical analysis of the provable generation of representations $p_\varphi(r)$.

Assumption D.1 (Second moment bound of representations). $m_r^2 := E_{q(r)}[\|r - \bar{r}\|^2] < \infty$, (11) where $q(r)$ is the ground-truth distribution of the representations and $\bar{r} := E_{q(r)}[r]$.

Assumption D.2 (Lipschitz score of representations). For all $t \ge 0$, the score $\nabla \ln q(r_t)$ is $L_r$-Lipschitz, where $q(r_t)$ is the distribution of the noisy latent $r_t$ at diffusion step $t$ in the forward process.

Finally, the quality of diffusion models obviously depends on the expressivity of the score network $\varphi$ with prediction $s_\varphi^{(t)}$.

Assumption D.3 (Score estimation error of representations). For all $t \in [0, T]$, $E_{q(r_t)}[\|s_\varphi^{(t)} - \nabla \ln q(r_t)\|^2] \le \epsilon_{\varphi,score}^2$. (12)

These assumptions are similar to those in Chen et al. (2023).

Proposition D.4. Suppose Assumptions D.1, D.2, and D.3 hold, and the step size $h := T/N_d$ satisfies $h \lesssim 1/L_r$. Then the following holds:

$TV(p_\varphi(r_0), q(r)) \lesssim \underbrace{\sqrt{KL(q(r)\,\|\,\gamma^{d_r})}\,\exp(-T)}_{\text{convergence of forward process}} + \underbrace{(L_r\sqrt{d_r h} + L_r m_r h)\sqrt{T}}_{\text{discretization error}} + \underbrace{\epsilon_{\varphi,score}\sqrt{T}}_{\text{score estimation error}}$ (13)

Here $a(\cdot) \lesssim b(\cdot)$ means there exists a constant $C$ such that $a(\cdot) \le C\,b(\cdot)$ always holds. This is a direct conclusion from Chen et al. (2023). In a typical DDPM implementation, we choose $h = 1$ and thus $T = N_d$.
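For concreteness, here is a worked instance of the bound under the typical discretization just mentioned (our illustration; constants are absorbed into $\lesssim$):

```latex
% Proposition D.4 with h = 1 and T = N_d: the bound becomes
\mathrm{TV}\big(p_\varphi(r_0), q(r)\big)
  \;\lesssim\; \sqrt{\mathrm{KL}\!\left(q(r)\,\|\,\gamma^{d_r}\right)}\, e^{-N_d}
  \;+\; L_r\big(\sqrt{d_r} + m_r\big)\sqrt{N_d}
  \;+\; \epsilon_{\varphi,\mathrm{score}}\sqrt{N_d}.
% Since the representation dimension d_r is far smaller than the ambient
% molecular dimension N(3+d), the discretization and score-estimation terms
% are easier to control than in one-stage molecule diffusion.
```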
Remarkably, Proposition D.4 indicates the benefit of generating the representation first: since $d_r \ll N(3+d)$, the generation quality (measured by the TV distance in Proposition D.4) of the low-dimensional representation can easily surpass that of directly generating the high-dimensional data points $x$. The proposition also justifies applying a lightweight MLP as the denoising network in the representation-generation stage.

D.1.2. Provable Second-Stage Generation

Tractable Training Loss. We now analyze the generation quality of the second-stage diffusion model. Since we sample from $p_\theta(x, r)$, we have representations as conditions even in unconditional generation tasks. To learn the score function conditioned on the representations, consider the following score-matching loss:

$\int_0^T \lambda(t)\, E_{x_t, r}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|r}(x_t|r)\|^2\big]\, dt$ (14)

However, since $q_{t|r}(x_t|r)$ is intractable, we use the following equivalent loss:

$\int_0^T \lambda(t)\, E_{x_0, r}\Big[E_{x_t|x_0}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|0}(x_t|x_0)\|^2 \,\big|\, x_0\big]\Big]\, dt + C$ (15)

Proposition D.5 (Tractable representation-conditioned score-matching loss).

$L(\theta) := \int_0^T \lambda(t)\, E_{x_t, r}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|r}(x_t|r)\|^2\big]\, dt$ (16)
$= \int_0^T \lambda(t)\, E_{x_0, r}\Big[E_{x_t|x_0}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|0}(x_t|x_0)\|^2 \,\big|\, x_0\big]\Big]\, dt + C$ (17)

Proof. The key is that the following property holds, since the gradient is taken w.r.t. $x_t$ only:

$\nabla_{x_t}\log q_{t|r}(x_t|r) = \nabla_{x_t}\log q_{t,r}(x_t, r)$ (18)

The remainder of the derivation parallels traditional DDPM. We can replace $\nabla_{x_t}\log q_{t,r}(x_t, r)$ with $\nabla_{x_t}\log q_{t,r|0}(x_t, r|x_0)$:

$\nabla_{x_t}\log q_{t,r}(x_t, r) = E_{x_0, r|x_t}\big[\nabla_{x_t}\log q_{t,r|0}(x_t, r|x_0) \,\big|\, x_t\big]$ (19)

$E_r\, E_{x_t \sim q(x_t|r)}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|r}(x_t|r)\|^2\big]$ (20)
$= E_r\, E_{x_0 \sim q(x_0|r)}\, E_{x_t \sim q(x_t|r, x_0)}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|r}(x_t|x_0, r)\|^2\big]$ (21)
$= E_r\, E_{x_0 \sim q(x_0|r)}\, E_{x_t \sim q(x_t|x_0)}\big[\|s_\theta(x_t, r, t) - \nabla_{x_t}\log q_{t|0}(x_t|x_0)\|^2\big]$ (22)

which is equivalent to our tractable score-matching loss.

Rigorous Error Bound for Second-Stage Generation. Utilizing Proposition D.5, the analysis of the second-stage diffusion parallels that of the first stage, except that the score network takes the additional input $r$.

Assumption D.6 (Second moment bound of molecule features). $m_x^2 := E_{q(x)}[\|x - \bar{x}\|^2] < \infty$, (23) where $q(x)$ is the ground-truth distribution of the molecule features and $\bar{x} := E_{q(x)}[x]$.

Assumption D.7 (Lipschitz score of second stage). For all $t \ge 0$, the score $\nabla \ln q(x_t)$ is $L_x$-Lipschitz, where $q(x_t)$ is the distribution of the noisy latent $x_t$ at diffusion step $t$ in the forward process.

Finally, we make an assumption on the score-network estimation error.

Assumption D.8 (Score estimation error of second-stage diffusion). For all $t \in [0, T]$, $E_{r\sim p_\varphi(r),\, x_t\sim q_t(x_t)}[\|s_\theta(x_t, t, r) - \nabla\ln q_t(x_t)\|^2] \le \epsilon_{\theta,score}^2$. (24)

This assumption contains the error introduced by generating representations, i.e., the TV distance shown in Proposition D.4. Later, in Theorem D.12, we deal explicitly with the error introduced by representation generation, which yields a more fine-grained error bound. We now present a key lemma that facilitates the analysis and the proof of the central Theorem D.12.

Lemma D.9. Suppose Assumptions D.6, D.7, and D.8 hold, and the step size $h := T/N_d$ satisfies $h \lesssim 1/L_x$. Suppose we sample $x \sim p_\theta(x|r)$ from Gaussian noise, where $r \sim p_\varphi(r)$, and denote the final distribution of $x$ by $p_{\theta,\varphi}(x)$.
Then the following holds:

$TV(p_{\theta,\varphi}(x), q(x)) \lesssim \underbrace{\sqrt{KL(q(x)\,\|\,\gamma^{N(d+3)})}\,\exp(-T)}_{\text{convergence of forward process}} + \underbrace{(L_x\sqrt{N(d+3)h} + L_x m_x h)\sqrt{T}}_{\text{discretization error}} + \underbrace{\epsilon_{\theta,score}\sqrt{T}}_{\text{score estimation error}}$ (25)

Proof. Recall the notation $p_{\theta,\varphi}(x) := \int_r p_{0|r}(x_0|r)\, p_r(r)\, dr = p_0$, predicted by the denoising networks $\theta, \varphi$ starting from Gaussian noise $\gamma^{N(d+3)}$. Consider the reverse process $p_0^{q_T}(x_0)$ starting from $q_T$ instead of $\gamma^{N(d+3)}$:

$TV(p_0, q(x)) \le TV(p_0, p_0^{q_T}) + TV(p_0^{q_T}, q_0)$ (26)

Using the convergence of the OU process in KL divergence (see Chen et al. (2023)), the following holds for the first term:

$TV(p_0, p_0^{q_T}) \le TV(\gamma^{N(d+3)}, q_T) \lesssim \sqrt{KL(q(x)\,\|\,\gamma^{N(d+3)})}\,\exp(-T)$ (27)

The second term is caused by the score estimation error and the discretization error, and can be bounded by

$TV(p_0^{q_T}, q_0)^2 \lesssim KL(q_0\,\|\,p_0^{q_T}) \lesssim (\epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 m_x^2 h^2)\,T$ (28)

We start proving Equation (28) by proving

$\sum_{k=1}^{N_d} E_{q_0,\, r\sim p_\varphi} \int_{(k-1)h}^{kh} \|s_\theta^{(kh)}(x_{kh}, kh, r) - \nabla\ln q_t(x_t)\|^2\, dt \lesssim (\epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 m_x^2 h^2)\,T$ (29)

For $t \in [(k-1)h, kh]$, we decompose

$E_{q_0,\, r\sim p_\varphi}\big[\|s_\theta^{(kh)}(x_{kh}, kh, r) - \nabla\ln q_t(x_t)\|^2\big]$ (30)
$\lesssim E_{q_0,\, r\sim p_\varphi}\big[\|s_\theta^{(kh)}(x_{kh}, kh, r) - \nabla\ln q_{kh}(x_{kh})\|^2\big] + E_{q_0}\big[\|\nabla\ln q_{kh}(x_{kh}) - \nabla\ln q_t(x_{kh})\|^2\big]$ (31)
$\quad + E_{q_0}\big[\|\nabla\ln q_t(x_{kh}) - \nabla\ln q_t(x_t)\|^2\big]$ (32)
$\lesssim \epsilon_{\theta,score}^2 + E_{q_0}\Big[\big\|\nabla\ln\tfrac{q_{kh}}{q_t}(x_{kh})\big\|^2\Big] + L_x^2\, E_{q_0}\big[\|x_{kh} - x_t\|^2\big]$ (33)

Note that we omit the variable $r$ in the expectations of the last two terms because they are independent of $r$. Utilizing Lemma 16 from Chen et al. (2023), we bound

$E_{q_0}\Big[\big\|\nabla\ln\tfrac{q_{kh}}{q_t}(x_{kh})\big\|^2\Big] \lesssim L_x^2 N(d+3)h + L_x^2 h^2 \|x_{kh}\|^2 + (L_x^2 + 1)h^2 \|\nabla\ln q_t(x_{kh})\|^2$ (34)

For the last term,

$\|\nabla\ln q_t(x_{kh})\|^2 \lesssim \|\nabla\ln q_t(x_t)\|^2 + \|\nabla\ln q_t(x_{kh}) - \nabla\ln q_t(x_t)\|^2$ (35)
$\lesssim \|\nabla\ln q_t(x_t)\|^2 + L_x^2\|x_{kh} - x_t\|^2$ (36)

where the second term is absorbed into the third term of the decomposition in Equation (30). Thus,

$E_{q_0,\, r\sim p_\varphi}\big[\|s_\theta^{(kh)}(x_{kh}, kh, r) - \nabla\ln q_t(x_t)\|^2\big]$ (37)
$\lesssim \epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 h^2 E_{q_0}[\|x_{kh}\|^2] + L_x^2 h^2 E_{q_0}[\|\nabla\ln q_t(x_t)\|^2] + L_x^2 E_{q_0}[\|x_{kh} - x_t\|^2]$ (38)
$\lesssim \epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 h^2 (N(d+3) + m_x^2) + L_x^3 N(d+3)h^2 + L_x^2 (m_x^2 h^2 + N(d+3)h)$ (39)
$\lesssim \epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 h^2 m_x^2$ (40)

Analogously to Chen et al. (2023), using properties of Brownian motions and local martingales, we can apply Girsanov's theorem and complete the stochastic integration. Since $q_0$ is the end of the reverse SDE, by the lower semicontinuity of the KL divergence and the data-processing inequality (Beaudry & Renner, 2011), we take the limit and obtain

$KL(q_0\,\|\,p_0^{q_T}) \lesssim (\epsilon_{\theta,score}^2 + L_x^2 N(d+3)h + L_x^2 h^2 m_x^2)\,T$ (41)

We finally conclude with Pinsker's inequality ($TV^2 \lesssim KL$).

This result holds for general representation-conditioned diffusion models and, to the best of our knowledge, we are the first to provide such theory for representation-conditioned generation, a general framework suitable for various domains such as images (Li et al., 2023) and graphs. Lemma D.9 quantitatively characterizes the bound on the generalization error of representation-conditioned diffusion. It directly suggests that the error of representation-conditioned diffusion will be no higher than that of its one-stage counterpart: the first two components of the generalization error (the convergence of the forward process and the discretization error) of the representation-conditioned diffusion model match those of a traditional DDPM, provided that both are parameterized using the same diffusion processes.
Furthermore, the third component (the score estimation error) can be made identical by simply setting all representation-relevant parameters in $s_\theta$ to zero and disregarding the representation's impact. We therefore have the following conclusion.

Corollary D.10. A self-representation-conditioned diffusion model can achieve the same or a lower generation-distribution error than a one-stage diffusion model.

We now give a more fine-grained error-bound analysis of representation-conditioned diffusion, given the relationship between $r$ and $x$, which enables our further qualitative analysis of the arguably better performance.

Assumption D.11 (Representation-conditioned score estimation error of second-stage diffusion). For all $t \in [0, T]$, $E_{r\sim p_\varphi(r),\, x_t\sim q_t(x_t|r)}[\|s_\theta(x_t, t, r) - \nabla\ln q_t(x_t|r)\|^2] \le \epsilon_{\varphi,\theta,cond}^2$. (42)

The following main theorem is novel and precise in that it (i) deals explicitly with the generation error of the first-stage representations and (ii) takes advantage of the conditional distribution $q(x|r)$ in the denoising network.

Theorem D.12 (Theorem 3.2 in the main text). Suppose Assumptions D.6, D.7, and D.11 hold, and the step size $h := T/N_d$ satisfies $h \lesssim 1/L_x$. Suppose we sample $x \sim p_\theta(x|r)$ from Gaussian noise, where $r \sim p_\varphi(r)$, and denote the final distribution of $x$ by $p_{\theta,\varphi}(x)$. Define $p_0^{q_{T|\varphi}}$, the end point of the reverse process starting from $q_{T|\varphi}$ instead of Gaussian noise, where $q_{T|\varphi}$ is the $T$-th step of the forward process starting from $q_{0|\varphi} := \frac{1}{A}\int_r q(x_0|r)\, p_\varphi(r)\, dr$ with normalization factor $A$. Then the following holds:

$TV(p_{\theta,\varphi}(x), q(x)) \lesssim \underbrace{\sqrt{KL(q_{0|\varphi}\,\|\,\gamma^{N(d+3)})}\,\exp(-T)}_{\text{convergence of forward process}} + \underbrace{(L_x\sqrt{N(d+3)h} + L_x m_x h)\sqrt{T}}_{\text{discretization error}}$ (43)
$\quad + \underbrace{\epsilon_{\varphi,\theta,cond}\sqrt{T}}_{\text{conditional score estimation error}} + \underbrace{TV(q_{0|\varphi}, q_0)}_{\text{representation generation error}}$ (44)

Proof. The proof sketch parallels that of Lemma D.9, except that in the first step we decompose the TV distance as follows:

$TV(p_{\theta,\varphi}, q(x)) \le TV(p_0, p_0^{q_{T|\varphi}}) + TV(p_0^{q_{T|\varphi}}, q_{0|\varphi}) + TV(q_{0|\varphi}, q_0)$ (45)

We complete the proof analogously to the proof of Lemma D.9.

Remarkably, when $q_{0|\varphi} = q_0$, i.e., when $p_\varphi$ fully recovers the ground-truth marginal distribution of representations $q(r)$, Theorem D.12 has the same form as Lemma D.9 but with $\epsilon_{\varphi,\theta,cond} < \epsilon_{\theta,score}$: the former is the score estimation error based on the explicit relationship between $x$ and $r$, while the latter learns this relationship only implicitly. Thus, Theorem D.12 is a much tighter bound for representation-conditioned generation. To the best of our knowledge, this is the first rigorous theoretical analysis of RCG (Li et al., 2023).

We now provide some qualitative discussion of why representations can arguably lead to better generalization error. Typically, representations are powerful (and sometimes even complete), as they encode key information about $x$ along with potential additional knowledge from pretraining tasks (for example, coordinate denoising for molecules (Zaidi et al., 2022; Feng et al., 2023)). Therefore, it is reasonable to expect that score estimation conditioned on representations can be more accurate (i.e., $\epsilon_{\varphi,\theta,cond}$ could be significantly smaller than the error of estimating the score without representation conditioning). If the representations are complete, a special case being $r = x$, this would greatly assist in predicting the noise. The same applies when $r$ can be properly transformed back to $x$ by a neural network. More generally, there are intermediate cases where $r$ reflects partial information about $x$ (e.g., a multiset of atoms and bonds), which would still aid prediction.
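To make the completeness intuition concrete, here is a short worked special case (our illustration, assuming the standard VP forward kernel): when $r = x_0$, the conditional score is available in closed form, so the conditional score estimation error can in principle vanish.

```latex
% With the VP forward kernel q_{t|0}(x_t | x_0) = N(alpha_t x_0, (1 - alpha_t^2) I),
% conditioning on a complete representation r = x_0 gives
\nabla_{x_t} \log q_t(x_t \mid r = x_0)
  = \nabla_{x_t} \log q_{t|0}(x_t \mid x_0)
  = -\frac{x_t - \alpha_t x_0}{1 - \alpha_t^2},
% which a denoiser can output exactly (so epsilon_{phi,theta,cond} -> 0), whereas
% the unconditional score requires integrating over all x_0 consistent with x_t.
```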
Extension to Equivariant Diffusion Models. The previous conclusions are generic and apply to general representation-conditioned generation. However, so far we have considered traditional diffusion models without taking into account the permutation ($\Pi$) and SE(3) transformation ($\Omega$) invariance/equivariance of the diffusion model. We thus extend our theory specifically to equivariant diffusion models that operate on symmetric structures, which is the case in our experiments. Moreover, the previous results assume that both the diffusion process and the denoising model treat atom coordinates $x$ and atom-type features $h$ identically, while in fact: (1) the noise on atom coordinates always has zero center of mass (CoM) and thus lies in a subspace with $3(N-1)$ degrees of freedom, as opposed to $3N$; and (2) in the forward process, $x_t$ and $h_t$ are conditionally independent given $x_0$ and $h_0$, and since the denoising network processes coordinates $x_t$ and features $h_t$ differently, the score estimation error term can be further decomposed as in Assumption D.13.

Assumption D.13 (Fine-grained representation-conditioned score estimation error of second-stage diffusion). For all $t \in [0, T]$,

$E_{r\sim p_\varphi(r),\, x_t\sim q_t(x_t|r)}\big[\|s_\theta(x_t, t, r) - \nabla\ln q_t(x_t|r)\|^2\big]$ (46)
$= E_{r\sim p_\varphi(r),\, x_t\sim q_t(x_t|r)}\big[\|s_\theta^{(x)}(x_t, t, r) - \nabla_{x_t}\ln q_t(x_t|r)\|^2 + \|s_\theta^{(h)}(x_t, t, r) - \nabla_{h_t}\ln q_t(h_t|r)\|^2\big]$ (47)
$\le (\epsilon_{\varphi,\theta,cond}^x)^2 + (\epsilon_{\varphi,\theta,cond}^h)^2$ (48)

where $s_\theta^{(x)}(\cdot)$ and $s_\theta^{(h)}(\cdot)$ denote the scores of the coordinates and atom-type features predicted by the score network, respectively.

Combining all the pieces, we conclude with the following result, Theorem D.14, for equivariant diffusion models that generate 3D coordinates as described above.

Theorem D.14. Suppose Assumptions D.6, D.7, and D.13 hold, and the step size $h := T/N_d$ satisfies $h \lesssim 1/L_x$. Suppose we sample $x \sim p_\theta(x|r)$ from Gaussian noise with zero CoM in the coordinate subspace, where $r \sim p_\varphi(r)$, and denote the final distribution of $x$ by $p_{\theta,\varphi}(x)$. Define $p_0^{q_{T|\varphi}}$, the end point of the reverse process starting from $q_{T|\varphi}$ instead of Gaussian noise, where $q_{T|\varphi}$ is the $T$-th step of the forward process starting from $q_{0|\varphi} := \frac{1}{A}\int_r q(x_0|r)\, p_\varphi(r)\, dr$ with normalization factor $A$. Denote by $\gamma_{0\text{-CoM}}^{N(d+3)}$ the $N(d+3)$-dimensional Gaussian with zero center of mass in the $N \times 3$-dimensional coordinate subspace. Denote by $\tilde{p}(\cdot)$ the distribution obtained by acting with the permutation group $\Pi$ and SE(3) transformations $\Omega$ on data from $p(\cdot)$. Then the following holds:

$TV(\tilde{p}_{\theta,\varphi}(x), q(x)) := \alpha(p_{\theta,\varphi}, \Pi, \Omega)\, TV(p_{\theta,\varphi}(x), q(x))$ (49)
$\lesssim \alpha(p_{\theta,\varphi}, \Pi, \Omega)\Big[\underbrace{\sqrt{KL(q_{0|\varphi}\,\|\,\gamma_{0\text{-CoM}}^{N(d+3)})}\,\exp(-T)}_{\text{convergence of forward process}} + \underbrace{(L_x\sqrt{N(d+3)h} + L_x m_x h)\sqrt{T}}_{\text{discretization error}}$ (50)
$\quad + \underbrace{(\epsilon_{\varphi,\theta,cond}^x + \epsilon_{\varphi,\theta,cond}^h)\sqrt{T}}_{\text{conditional score estimation error}} + \underbrace{TV(q_{0|\varphi}, q_0)}_{\text{representation generation error}}\Big]$ (51)

where $\alpha(\cdot) \in [0, 1]$.

Proof. The proof parallels that of Theorem D.12, with three distinct parts.
1. Due to the presence of the permutation group $\Pi$ and the SE(3) group $\Omega$, we need to consider the distribution $\tilde{p}_{\theta,\varphi}(x)$. Note the definition $\alpha(p_{\theta,\varphi}, \Pi, \Omega) := \frac{TV(\tilde{p}_{\theta,\varphi}(x), q(x))}{TV(p_{\theta,\varphi}(x), q(x))}$; by the data-processing inequality (Beaudry & Renner, 2011), $\alpha(p_{\theta,\varphi}, \Pi, \Omega) \in [0, 1]$. Specifically, when the denoising model $p_{\theta,\varphi}$ is constructed to be invariant/equivariant to permutations $\Pi$ and SE(3) transformations $\Omega$ (meaning the model treats all elements of an equivalence class the same), $\alpha(p_{\theta,\varphi}, \Pi, \Omega)$ reaches its minimum; see You et al. (2023) for further explanation.
2. Since the Gaussian noise for coordinates is sampled from a subspace with zero center of mass, the prior distribution $\gamma^{N(d+3)}$ in Theorem D.12 should be replaced with $\gamma_{0\text{-CoM}}^{N(d+3)}$, the $N(d+3)$-dimensional Gaussian with zero center of mass in the $N \times 3$-dimensional coordinate subspace. Note that the number of degrees of freedom of $\gamma_{0\text{-CoM}}^{N(d+3)}$ is actually $N(d+3) - 3$, and the rest of the proof still holds (Feng et al., 2024).
3. As explained before, in a properly designed forward process, $x_t$ and $h_t$ can be conditionally independent given $x_0$ and $h_0$. Meanwhile, since the denoising network processes coordinates $x_t$ and features $h_t$ differently, the score estimation error term can be further decomposed as in Assumption D.13; see (Feng et al., 2024) for more details. Therefore, the term $\epsilon_{\varphi,\theta,cond}$ in Theorem D.12 can be replaced by $\sqrt{(\epsilon_{\varphi,\theta,cond}^x)^2 + (\epsilon_{\varphi,\theta,cond}^h)^2}$, which is further bounded by $\epsilon_{\varphi,\theta,cond}^x + \epsilon_{\varphi,\theta,cond}^h$.

In conclusion, this subsection provides a detailed characterization of the generalization error of representation-conditioned diffusion models. It is important to note that some parameters in our assumptions, such as the Lipschitz constants and estimation errors, are not constants; they are functions of the SDE total time $T$ and the number of diffusion steps $N_d$. Our conclusions also explain why representation-conditioned generation methods remain competitive even when the number of second-stage diffusion steps $N_d$ is decreased for faster generation: given the guidance from representations, the score estimation error can remain small even when the number of diffusion steps is reduced. As a result, reducing $N_d$ causes a slower increase in $\epsilon_{\varphi,\theta,cond}(N_d)$ than in the score estimation error without representation conditioning, which explains the strong performance of representation-conditioned generative models with fewer steps.

D.2. Conditional Generation

In this subsection, we aim to prove that conditional generation using our representation-conditioned framework achieves provable reward improvement. While we used $c$ to denote conditions in the main text, here we use $y$ for the targets (rewards) to stay consistent with the existing literature. Denote by $q_a := q(\cdot|y = a)$ the ground-truth conditional distribution and by $\hat{p}_a := p(\cdot|y = a)$ the estimated distribution. Suppose the ground truth satisfies $y := f^*(x, r)$, which can be decomposed as

$f^*(x, r) = g^*(x_\parallel, r_\parallel) + h^*(x_\perp, r_\perp)$ (52)

where we denote $x_\parallel = x$ when $x \sim q(x)$ and $r_\parallel = r$ when $r \sim q(r)$, so that $f^*(x, r) = g^*(x, r)$ when $x \sim q(x)$ and $r \sim q(r)$. To start, we assume a linear relationship between $r$ and $y$, which is reasonable thanks to the powerful pretrained encoder (which makes the representations helpful in predicting molecular properties, and even complete), provided some noise is allowed. In detail, the reward is $f^*(x, r) = w^{*\top} r + \xi$ with $\xi \sim N(0, \nu^2)$. In some cases, we may further make a Gaussian assumption on $r$ (WLOG, $r \sim N(\mu, \Sigma)$), but this is not necessary.

D.2.1. Parametric Conditional Score Matching Error

First, we give a detailed form of the representation score estimation error of Assumption D.3 under the assumptions above.

Lemma D.15.
For $\delta \ge 0$, with probability $1 - \delta$, the score estimation error $\epsilon_r$ (playing the role of $\epsilon_{\varphi,score}$) is bounded by

$\frac{1}{T - t_0}\int_{t_0}^{T} E_{(r_t, y)\sim q_t}\big[\|\nabla\log q_t(r_t|y) - \hat{s}_\varphi(r_t, y, t)\|_2^2\big]\, dt \le \epsilon_r^2 = O\Big(\frac{1}{n}\,\mathcal{N}(S, 1/n)\, d_r^2 \log\frac{1}{\delta}\Big)$ (53)

where $t_0$ is the early stopping time of the SDE, $n$ is the number of samples, $S$ is the parametric function class of the denoising network, and $\mathcal{N}(S, 1/n)$ is the log covering number of $S$. When $S$ is linearly parameterized, $\mathcal{N}(S, 1/n) = O\big(d_r^2 \log\big(\frac{d_r n}{t_0\lambda_{min}\nu^2}\big)\big)$.

Proof. This is a direct extension of Lemma C.1 from Yuan et al. (2023). Note that we consider the special case where the low-dimensional subspace is the original space (i.e., $A = I_{d_r}$ and $d = D = d_r$ in their paper), and our noisy linear assumption between $r$ and $y$ is identical to their pseudo-labeling setting (i.e., $\hat{y} = \hat{w}^\top r + \xi$ with $\xi \sim N(0, \nu^2)$). We only provide the proof sketch here. When $r$ follows the Gaussian design, some algebra gives

$\nabla_r \log q_t(r, y) = \frac{\alpha(t)}{h(t)}\, B_t\Big(\alpha(t)\, r + \frac{h(t)}{\nu^2}\, y\, w\Big) - \frac{1}{h(t)}\, r$ (54)

where $\alpha(t) = \exp(-t/2)$, $h(t) = 1 - \exp(-t)$, and $B_t = \big(\alpha^2(t) I_{d_r} + \frac{h(t)}{\nu^2}\, w w^\top + h(t)\Sigma^{-1}\big)^{-1}$. We then bound the estimation error with a PAC-learning concentration argument, using Dudley's entropy integral to bound the Rademacher complexity, and obtain

$\epsilon_r^2 \lesssim \frac{1}{n}\,\mathcal{N}(S, 1/n)\, d_r^2 \log\frac{1}{\delta}$ (55)

Further, the log covering number of $S$ under the Gaussian design satisfies

$\mathcal{N}(S, 1/n) \lesssim d_r^2 \log\Big(1 + \frac{d_r n}{t_0 \lambda_{min} \nu^2}\Big)$ (56)

where $0 < \lambda_{min} < 1$ is the smallest eigenvalue of $\Sigma$, and typically the early stopping time is $t_0 = O(1)$. In Yuan et al. (2023), the authors assume $\nu^2 = 1/d_r$, which states that the variance $\nu^2$ of the regression residual $\xi$ decreases as the representation dimension increases, which is reasonable.

Lemma D.15 provides a detailed score estimation error under the linear assumption between $r$ and $y$, serving as a special case of $\epsilon_{\varphi,score}^2$. Substituting it into Proposition D.4, we obtain a quantitative result on the representation generation error.

D.2.2. Reward Improvement via Conditional Generation

Next, we want to obtain reward guarantees for the generated samples given the condition $y$. Define the suboptimality of a distribution $P$ as

$\text{SubOpt}(P; y^*) = y^* - E_{(x,r)\sim P}[f^*(x, r)]$ (57)

where $y^*$ is the target reward value (condition) and $f^*$ is the ground-truth reward function. Using the notation $\hat{p}_a := p_\varphi(r|y = a)$, we have the following result for $\text{SubOpt}(\hat{p}_a; y^* = a)$, which can also be viewed as a form of off-policy regret.

Lemma D.16 (Theorem 4.6 in Yuan et al. (2023)).

$\text{SubOpt}(\hat{p}_a; y^* = a) \le \underbrace{E_{r\sim q_a}\big[(\hat{w} - w^*)^\top r_\parallel\big]}_{E_1} + \underbrace{E_{r\sim q_a}[g^*(r_\parallel)] - E_{r\sim\hat{p}_a}[g^*(r_\parallel)]}_{E_2} + \underbrace{E_{r\sim\hat{p}_a}[h^*(r_\perp)]}_{E_3}$ (58)

Proof. Recall the notation $q_a := q(\cdot|y = a)$. We have

$a - E_{r\sim\hat{p}_a}[f^*(r)]$ (59)
$= \big(a - E_{r\sim q_a}[f^*(r)]\big) + \big(E_{r\sim q_a}[f^*(r)] - E_{r\sim\hat{p}_a}[f^*(r)]\big)$ (60)
$= \big(a - E_{r\sim q_a}[\hat{f}(r)]\big) + E_{r\sim q_a}\big[\hat{f}(r) - f^*(r)\big] + \big(E_{r\sim q_a}[f^*(r)] - E_{r\sim\hat{p}_a}[f^*(r)]\big)$ (61)
$= E_{r\sim q_a}\big[\hat{f}(r) - g^*(r)\big] + \big(E_{r\sim q_a}[g^*(r_\parallel)] - E_{r\sim\hat{p}_a}[g^*(r_\parallel)]\big) - E_{r\sim\hat{p}_a}[h^*(r_\perp)]$ (62)

where $\hat{w}$ is the estimate of $w^*$ obtained by ridge regression, $E_{r\sim q_a}[\hat{f}(r)] = a$, and $r = r_\parallel$, $f^*(r) = g^*(r)$ when $r \sim q_a$. Here we give a brief interpretation of the decomposition: $E_1$ is the prediction and generalization error coming from the regression, which is independent of the diffusion error; $E_2$ and $E_3$ both come from the diffusion process, where $E_2$ reflects the disparity between $\hat{p}_a$ and $q_a$ on the subspace support, while $E_3$ measures the off-subspace error in $\hat{p}_a$. The following results give concrete bounds for the terms in Lemma D.16.

Bounding the Regression Error with Offline Bandits. Estimating $w^*$ with ridge regression gives

$\hat{w} = (R^\top R + \lambda I)^{-1} R^\top (R w^* + \eta)$ (63)

where $R = (r_1, \ldots, r_n)^\top$ and $\eta = (\xi_1, \ldots, \xi_n)^\top$ with $\xi_i \sim N(0, \nu^2)$. Define $V_\lambda := R^\top R + \lambda I$, $\hat{\Sigma}_\lambda := \frac{1}{n} V_\lambda$, and $\Sigma_{q_a} := E_{r\sim q_a}[r r^\top]$, and take $\lambda = 1$. We have the following.
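For readers less familiar with the ridge estimator in Eq. (63), here is a minimal numerical sketch (toy data; the linear reward model $y = w^{*\top} r + \xi$ is the assumption made in the text):

```python
# A minimal numpy sketch of the ridge estimator hat{w} = (R^T R + lambda I)^{-1} R^T y.
import numpy as np

rng = np.random.default_rng(0)
n, d_r, nu = 1000, 8, 0.1
w_star = rng.normal(size=d_r)
R = rng.normal(size=(n, d_r))                 # rows are representations r_i
y = R @ w_star + nu * rng.normal(size=n)      # noisy linear rewards

lam = 1.0
V_lam = R.T @ R + lam * np.eye(d_r)           # V_lambda = R^T R + lambda I
w_hat = np.linalg.solve(V_lam, R.T @ y)       # ridge estimate of w*

print(np.linalg.norm(w_hat - w_star))         # small for large n, as the bounds predict
```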
Proposition D.17. With high probability,

$E_1 \le \sqrt{\mathrm{Tr}(\hat{\Sigma}_\lambda^{-1}\Sigma_{q_a})}\cdot\frac{O\big(\|w^*\| + \nu^2\sqrt{d_r \log n}\big)}{\sqrt{n}}$ (64)

Further, when $r$ has Gaussian design $r \sim N(\mu, \Sigma)$,

$\mathrm{Tr}(\hat{\Sigma}_\lambda^{-1}\Sigma_{q_a}) \le O\Big(\frac{a^2}{\|w^*\|_\Sigma^2} + d_r\Big)$ (65)

when $n = \Omega\big(\max\{\frac{1}{\lambda_{min}}, \frac{d_r}{\|w^*\|_\Sigma^2}\}\big)$.

Proof. First, we have

$E_1 = E_{\hat{p}_a}\big[r^\top(w^* - \hat{w})\big] \le E_{\hat{p}_a}\|r\|_{V_\lambda^{-1}}\, \|w^* - \hat{w}\|_{V_\lambda}$ (66)
$E_{\hat{p}_a}\|r\|_{V_\lambda^{-1}} \le \sqrt{E_{\hat{p}_a}\big[r^\top V_\lambda^{-1} r\big]} = \sqrt{\mathrm{Tr}\big(V_\lambda^{-1} E_{\hat{p}_a}[r r^\top]\big)} \le \sqrt{\mathrm{Tr}(V_\lambda^{-1}\Sigma_{q_a})}$ (67)

Hence we only need to prove

$\|w^* - \hat{w}\|_{V_\lambda} \le O\big(\|w^*\| + \nu^2\sqrt{d_r \log n}\big)$ (68)

Using the closed-form expression, we have

$\hat{w} - w^* = V_\lambda^{-1} R^\top \eta - \lambda V_\lambda^{-1} w^*$ (69)

Thus,

$\|w^* - \hat{w}\|_{V_\lambda} \le \|R^\top\eta\|_{V_\lambda^{-1}} + \lambda\|w^*\|_{V_\lambda^{-1}}$ (70)

where $\lambda\|w^*\|_{V_\lambda^{-1}} \le \sqrt{\lambda}\,\|w^*\|$, and according to Abbasi-Yadkori et al. (2011),

$\|R^\top\eta\|_{V_\lambda^{-1}} = O\big(\nu^2\sqrt{d_r\log n}\big)$ (71)

with high probability. This concludes the first part of the proof. Further, when $r$ has Gaussian design $r \sim N(\mu, \Sigma)$, we can prove the result following Lemma C.6 of Yuan et al. (2023). The key is that the conditional distribution is the Gaussian

$\Pr\big(r \,\big|\, \hat{f}(r) = a\big) = N\Big(\mu + \Sigma\hat{w}\,(\hat{w}^\top\Sigma\hat{w} + \nu^2)^{-1}(a - \hat{w}^\top\mu),\; \Sigma - \Sigma\hat{w}\,(\hat{w}^\top\Sigma\hat{w} + \nu^2)^{-1}\hat{w}^\top\Sigma\Big)$ (72)

$\mathrm{Tr}(\hat{\Sigma}_\lambda^{-1}\Sigma_{q_a}) = \mathrm{Tr}\Big(\frac{\Sigma^{1/2}\hat{w}\hat{w}^\top\Sigma\hat{\Sigma}_\lambda^{-1}\Sigma^{1/2} a^2}{(\|\hat{w}\|_\Sigma^2 + \nu^2)^2}\Big)\Big(1 + \frac{1}{\lambda_{min} n}\Big) \le O\Big(\frac{a^2}{\|\hat{w}\|_\Sigma^2} + d_r\Big)$ (73)

Notice that $\|\hat{w}\|_\Sigma \ge \|w^*\|_\Sigma - \|\hat{w} - w^*\|_\Sigma$. We have

$\|\hat{w} - w^*\|_\Sigma = O\Big(\frac{\|w^*\| + \nu^2\sqrt{d_r\log n}}{\sqrt{n}}\Big)$ (74)

from which we can prove $\|\hat{w}\|_\Sigma \ge \frac{1}{2}\|w^*\|_\Sigma$ when $n = \Omega\big(\frac{d_r}{\|w^*\|_\Sigma^2}\big)$.

Remarkably, this is a more precise bound improving the results (Lemmas C.5 and C.6) in Yuan et al. (2023): we make fewer assumptions on the relationship between $y$ and $r$, explicitly taking $\|w^*\|$ and $\nu^2$ into account.

Bounding the Distribution Shift in Diffusion. We define the distribution shift between two arbitrary distributions $p_1$ and $p_2$, restricted to a function class $\mathcal{L}$, as

$\mathcal{T}(p_1, p_2; \mathcal{L}) := \sup_{l\in\mathcal{L}} \big(E_{x\sim p_1}[l(x)] - E_{x\sim p_2}[l(x)]\big)$ (75)

We have the following lemma.

Lemma D.18. Under the assumption that $r$ follows the Gaussian design,

$TV(\hat{p}_a, q_a) = O\Big(\sqrt{\mathcal{T}(q(r, y = a), q_{ry}; S)}\;\epsilon_r\Big)$ (76)

where $\epsilon_r$ is defined in Lemma D.15. We can bound $E_2$ by

$E_2 = O\big((TV(\hat{p}_a, q_a) + t_0)\sqrt{M(a)}\big)$ (77)

where $M(a) = O\big(\frac{a^2}{\|w^*\|_\Sigma^2} + d_r\big)$. By plugging in $\epsilon_r^2 = O\big(\frac{d_r^2}{t_0 n}\big)$, when $t_0 = (d_r^4/n)^{1/6}$, $E_2$ admits the best trade-off with bound

$E_2 \lesssim \sqrt{\frac{\mathcal{T}(q(r, y = a), q_{ry}; S)}{\lambda_{min}}}\,(d_r^4/n)^{1/6}\, a$ (78)

Proof. The proof directly follows Lemmas C.4 and C.7 in Yuan et al. (2023), though the conclusion differs slightly because we do not assume a low-dimensional subspace for $r$. One can also verify that when $r$ follows the Gaussian design, $\mathcal{T}(q(r, y = a), q_{ry}; S) = O(a^2 d_r)$.

D.2.3. Second Stage of Conditional Generation

So far we have proved that (i) the first-stage diffusion model can estimate the score function with a provable error bound (Appendix D.2.1), and (ii) the generated representations have provable reward improvement (Appendix D.2.2). We now show that the ultimately generated samples also shift their distribution toward the desired target. In particular, we want to answer the question: why is conditioning on the conditionally generated representations sufficient for the second-stage generation? First, when we use the generated representations as the only condition of the second-stage diffusion model, the generation process is identical to the second stage of unconditional generation.
Therefore, the results in Appendix D.1.2 can be directly applied to analyze the second stage of conditional generation: generation conditioned on representations has a small TV-distance error with respect to the ground-truth conditional distribution. Thus, when the first-stage generation is of high quality, the corresponding second-stage generation introduces almost no additional error, which implies provable reward improvement toward the desired target. In addition, the well-pretrained encoder ensures good correspondence between representations and data points, making it possible to rigorously reconstruct the data points given the representations (a special case being when $r$ is a complete representation of $x$).

We then partially answer this question from an information-theoretic perspective. We use $H(\cdot)$ to denote information entropy and $I(\cdot;\cdot)$ to denote the mutual information between two variables.

$I(x; r) \ge I(x; y)$. On the one hand, $r$ contains enough information to recover the target $y$, thanks to the results in Appendix D.2.2, so we do not explicitly need $y$ in the second stage. On the other hand, benefiting from the pretraining task, the representation obviously contains more information than $y$ alone. This assumption is valid if $w^*$ in the previous analysis is sparse (there are components of $r$ independent of $y$), i.e., $H(r) > H(y)$. Therefore, generating $x$ conditioned on $r$ is much easier than generating $x$ conditioned on $y$ (traditional one-stage conditional generation), as the score estimation error of the former is expected to be much smaller than that of the latter.

$I(x, r; y) \ge I(x; y)$. Recall Equation (52), which states that the target property $y$ depends on both $x$ and $r$. Hence, $r$ may contain additional information about $y$, obtained from pretraining tasks, that is hard (or impossible) to extract directly from $x$: the complex pretrained model assists in extracting relevant information in our two-stage generation, while one-stage generation relies solely on the single denoising model to do so. Therefore, by leveraging representations with provable error bounds, we can better shift the distribution toward the target.

In summary, $r$ is an ideal intermediate state connecting $x$ and $y$: it is easy to recover $r$ from $y$ (Appendix D.2.1) and to recover $x$ from $r$ (Appendix D.1.2), and vice versa. In comparison, it is somewhat more difficult to directly recover $x$ from $y$ or to predict $y$ from $x$. Consequently, $r$ may be a better indicator of $y$ than $x$ itself, for the reasons above. Indeed, one-stage diffusion models that generate $x$ directly from conditions $y$ must estimate a highly complex score $\nabla_x \log p(x|y)$, where $x$ and $y$ are related by a highly nonlinear mapping. As Theorem E.4 in Yuan et al. (2023) points out, the nonparametric SubOpt of $x$ generated by deep neural networks is larger than our results in Appendix D.2.2, which further validates the advantage of first generating $r$, which can be well mapped from $y$.

E. Visualization

E.1. Representation Visualization

To illustrate how well $p_\varphi(r)$ fits $q(r)$, we sample from both $q(r)$ (i.e., the representations produced by the pre-trained encoder on the QM9 and GEOM-DRUG datasets) and the trained representation generator $p_\varphi(r)$, and visualize the samples in Figure 8, with colors indicating whether they come from $q(r)$ or $p_\varphi(r)$. We compute the Silhouette Score of the clustering results, scaled by $10^2$ for clarity. A score close to zero suggests that the two clusters are difficult to distinguish, indicating a good fit between $p_\varphi(r)$ and $q(r)$. Similarly, we provide the visualization of conditionally generated representations in Figure 9. A sketch of this evaluation protocol is given below.
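The following is a minimal sketch of the protocol just described, using scikit-learn's t-SNE and Silhouette Score. The arrays `reps_encoder` and `reps_generated` are hypothetical placeholders for the two representation sets.

```python
# A minimal sketch of the t-SNE + Silhouette Score evaluation (placeholder data).
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

reps_encoder = np.random.randn(500, 256)       # stand-in for encoder samples q(r)
reps_generated = np.random.randn(500, 256)     # stand-in for generated samples p_phi(r)

X = np.vstack([reps_encoder, reps_generated])
labels = np.array([0] * len(reps_encoder) + [1] * len(reps_generated))

# Silhouette Score of the two "clusters" (scaled by 1e2 as in the figures);
# values near zero mean the generated set is hard to tell apart from q(r).
print(1e2 * silhouette_score(X, labels))

emb = TSNE(n_components=2).fit_transform(X)    # joint 2D embedding of both sets
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=4, cmap="coolwarm")
plt.show()
```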
Figure 8: t-SNE visualizations of representations unconditionally generated by the representation generator ($T = 1.0$) vs. those produced by the pre-trained encoder on the QM9 and GEOM-DRUG datasets. The Silhouette Score is scaled by $10^2$ for clarity.

E.2. Visualization of Molecule Samples

In this section, we provide additional random molecule samples to offer deeper insight into the performance of GeoRCG. Figures 10 and 11 show unconditional random samples generated by GeoRCG trained on the QM9 and GEOM-DRUG datasets, respectively. Figure 12 presents random samples conditioned on the α property, along with their corresponding errors.

Figure 9: t-SNE visualization of representations conditionally generated by the representation generator vs. those produced by the pre-trained encoder on the QM9 dataset: (a) α, (b) Δε, (c) εHOMO, (d) εLUMO, (e) µ, and (f) Cv. The Silhouette Score is scaled by $10^2$ for clarity.

Figure 10: Unconditional random samples from GeoRCG trained on QM9. The number of nodes is randomly sampled from the node distribution $q(N)$.

Figure 11: Unconditional random samples from GeoRCG trained on GEOM-DRUG. The number of nodes is randomly sampled from the node distribution $q(N)$.

Figure 12: Conditional random samples from GeoRCG trained on the QM9 dataset with the α property. Black numbers indicate the specified property-value condition, while green numbers represent the evaluated property value of the generated samples. The number of nodes and property-value conditions are randomly sampled from the joint distribution $q(N, c)$.