# A3D: Does Diffusion Dream about 3D Alignment?

Published as a conference paper at ICLR 2025

Savva Ignatyev¹* Nina Konovalova²* Daniil Selikhanovych¹ Oleg Voynov¹,² Nikolay Patakin² Ilya Olkov¹ Dmitry Senushkin² Alexey Artemov³ Anton Konushin² Alexander Filippov⁴ Peter Wonka⁵ Evgeny Burnaev¹,²

¹Skoltech, Russia ²AIRI, Russia ³Medida AI, Israel ⁴AI Foundation and Algorithm Lab, Russia ⁵KAUST, Saudi Arabia

*Savva Ignatyev and Nina Konovalova contributed equally.
Corresponding author: Savva Ignatyev (e-mail: savva.ignatyev@skoltech.ru)

Figure 1: Our method A3D enables conditioning the text-to-3D generation process on a set of text prompts to jointly generate a set of 3D objects with a shared structure (top). This enables a user to compose hybrids from parts of multiple aligned objects (middle), or to perform a text-driven, structure-preserving transformation of an input 3D model (bottom).

We tackle the problem of text-driven 3D generation from a geometry alignment perspective. Given a set of text prompts, we aim to generate a collection of objects with semantically corresponding parts aligned across them. Recent methods based on Score Distillation have succeeded in distilling the knowledge of 2D diffusion models into high-quality representations of 3D objects. These methods handle multiple text queries separately, so the resulting objects vary widely in pose and structure. However, in some applications, such as 3D asset design, it may be desirable to obtain a set of objects aligned with each other. To achieve the alignment of the corresponding parts of the generated objects, we propose to embed these objects into a common latent space and optimize the continuous transitions between them. We enforce two properties of these transitions: smoothness of the transition and plausibility of the intermediate objects along it.
We demonstrate that both of these properties are essential for good alignment. We provide several practical scenarios that benefit from alignment between the objects, including 3D editing and object hybridization, and experimentally demonstrate the effectiveness of our method.

Project page: voyleg.github.io/a3d

Figure 2: Collections of objects generated with existing text-to-3D methods lack structural consistency (left, Shi et al., 2024). Shapes obtained with existing text-driven 3D editing methods lack text-to-asset alignment and visual quality (middle, Chen et al., 2024a). In contrast, our method enables the generation of structurally coherent, text-aligned assets with high visual quality (right).

## 1 INTRODUCTION

Creating high-quality 3D assets is a time- and labor-intensive process, so even experienced 3D artists commonly break it down into manageable steps. Prior to shaping an asset, an artist might conceptualize its design structure to capture the geometric proportions and spatial relationships of its semantically meaningful parts. Then, a series of detailed 3D design instances (e.g., high-resolution geometry and textures) consistent with the established structure can be produced.

Recent 3D generation research (Poole et al., 2023; Qiu et al., 2024; Shi et al., 2024) promises to significantly reduce the effort of manually producing such high-resolution textured shapes, replacing it with an automated AI-based step controlled with natural language. A text-to-3D generation pipeline could potentially be used to produce a collection of structurally aligned 3D objects, consisting of a common set of semantic parts and sharing their structure, e.g., the pose or the arrangement of semantic parts. However, existing 3D generation approaches synthesize objects independently and fail to maintain structural alignment across them (Figure 2, left).
One may attempt to enforce alignment in a series of 3D objects by generating an initial one with a text-to-3D pipeline and obtaining the others with text-driven 3D editing methods (Haque et al., 2023; Chen et al., 2024a). Unfortunately, the latter struggle with visual quality and sometimes fail to perform the necessary edits, resulting in a low degree of alignment with the text prompt (Figure 2, middle).

To address the limitations of existing approaches, we propose A3D, a method for jointly generating collections of structurally aligned objects from a collection of respective text prompts. The idea of our method is to embed a set of 3D objects and the transitions between them into a shared latent space and to enforce smoothness and plausibility of these transitions. We take inspiration from the transition trajectory regularization proposed for 2D GANs (Karras et al., 2020, Sec. 3.2). We represent each set of 3D objects and the transitions between them with a single Neural Radiance Field (NeRF) (Mildenhall et al., 2020) and train it with a text-to-image denoising diffusion model via Score Distillation Sampling (SDS) (Poole et al., 2023) to simultaneously match the set of text prompts and enforce the plausibility of the transitions between the objects.

Our method is naturally suited for several scenarios that require control over the structure of the generated objects. (1) Generation of multiple structurally aligned 3D objects (Figure 1, top) enables artists to choose an appropriate 3D asset among a variety of generations, replace 3D objects within existing scenes, or transfer animations across distinct objects. (2) Combining parts of different objects into a hybrid (Figure 1, middle) allows adjusting constituent elements without affecting the overall structure of the asset.
(3) Structure-preserving transformations of a 3D object (Figure 1, bottom) let a user design a simplified 3D shape with a particular pose and let the automatic generation process fill in complex geometric details and texture while preserving the structure.

Sets of objects generated with our method exhibit a high degree of structural alignment and high visual quality, outperforming those obtained with state-of-the-art alternatives. Our method is easily adapted for structure-preserving transformation, performing on par with specialized text-driven 3D editing methods. Further, it is effective in combination with different text-to-3D generation frameworks. Overall, our work advances the state of the art in text-driven 3D generation and opens up new possibilities for applications requiring the generation of structurally aligned objects.

## 2 RELATED WORK

### 2.1 TEXT-DRIVEN 3D ASSET GENERATION

Collecting large, high-quality, diverse 3D datasets poses significant challenges, so 3D generation approaches predominantly leverage 2D priors for training. DreamFusion (Poole et al., 2023) introduced Score Distillation Sampling (SDS), which enables training Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020) with the guidance of pre-trained 2D diffusion models. Subsequent research has refined this methodology to improve the quality and speed of 3D generation. Magic3D (Lin et al., 2023) uses a coarse-to-fine optimization strategy to increase speed and resolution. Fantasia3D (Chen et al., 2023a) disentangles the geometry and texture training. Several works enhance realism, detail, and optimization speed by utilizing adversarial training (Chen et al., 2024d), 3D-view-conditioned diffusion models (Liu et al., 2023; Shi et al., 2023a; 2024; Liu et al., 2024; Ye et al., 2024; Seo et al., 2024), and Gaussian-splatting-based models (Tang et al., 2024b; Yi et al., 2024).
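The SDS mechanism introduced by DreamFusion can be sketched in a few lines: noise a rendered view, query the diffusion model's noise prediction, and use the weighted noise residual as a gradient for the underlying 3D representation, omitting the model's Jacobian. The toy denoiser, the weighting choice w(t) = 1 − ᾱ_t, and all array shapes below are illustrative stand-ins, not the actual models or hyperparameters of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_noisy, t):
    # Stand-in for a pretrained text-conditioned diffusion model's noise
    # prediction eps_hat(x_t; y, t); a real pipeline would call a U-Net here.
    return 0.1 * x_noisy

def sds_gradient(rendered, t, alpha_bar):
    """Score Distillation Sampling gradient w.r.t. one rendered image.

    Following the DreamFusion formulation, the gradient is
    w(t) * (eps_hat - eps), with the diffusion model's Jacobian skipped;
    it would then be backpropagated into the 3D representation's parameters.
    """
    eps = rng.standard_normal(rendered.shape)                      # sampled noise
    x_noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = toy_denoiser(x_noisy, t)                             # predicted noise
    w = 1.0 - alpha_bar                                            # one common weighting
    return w * (eps_hat - eps)

# One illustrative update on a fake 8x8 single-channel "render":
render = rng.standard_normal((8, 8))
grad = sds_gradient(render, t=500, alpha_bar=0.5)
render = render - 0.01 * grad
```

In a real pipeline `rendered` would be a differentiable NeRF render and the update would flow through it into the network weights; the explicit gradient form is what makes SDS cheap, since the diffusion model is never differentiated through.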
All these works focus on independent optimization for distinct prompts, resulting in collections of objects that lack structural alignment, as we show in Figure 2 and in our ablation study. This misalignment issue persists even in amortized frameworks, where a single generative model is trained to handle multiple prompts (Tang et al., 2024a; Jun & Nichol, 2023; Hong et al., 2024; Siddiqui et al., 2024; Ma et al., 2024). Due to the mode-seeking nature of SDS (Poole et al., 2023), these frameworks often produce misaligned objects whose structure is sensitive to subtle variations in the prompt, increasing inconsistency across generated outputs. Unlike the described methods, which optimize 3D objects independently or amortize over large-scale datasets, our method optimizes a small set of objects jointly, allowing us to achieve structural consistency between them.

### 2.2 TEXT-DRIVEN 3D ASSET EDITING

One straightforward way to produce a collection of aligned 3D objects is to generate an initial object using a text-to-3D pipeline and subsequently modify this object via text-driven 3D editing. Several methods have been proposed to manipulate NeRF-based scene representations using text as guidance (Haque et al., 2023; Park et al., 2024; Bao et al., 2023; Zhuang et al., 2023). DreamBooth3D (Raj et al., 2023) and Magic3D (Lin et al., 2023) provide the capability to edit personalized objects while leveraging the underlying 3D structure. FocalDreamer (Li et al., 2024), Progressive3D (Cheng et al., 2024), and Vox-E (Sella et al., 2023) confine the effect of modifications to specific parts of the object, thus enhancing control over the editing process. Fantasia3D (Chen et al., 2023a) and DreamMesh (Yang et al., 2024) focus on global transformations of one object into another, iteratively optimizing a 3D model to align with a text prompt via SDS.
Iterative optimization with SDS does not guarantee preservation of the structure of the transformed object, so several techniques have been proposed to improve it. Coin3D (Dong et al., 2024) refines geometric primitives into high-quality assets by imposing deformation constraints through input masks. GaussianDreamer (Yi et al., 2024) and LucidDreamer (Liang et al., 2024) show text-driven editing capabilities for Gaussian splats, which they initialize using a separate pipeline and fine-tune with the help of a diffusion model. Haque et al. (2023) and Palandra et al. (2024) use the SDS loss in combination with the pre-trained 2D image editing network InstructPix2Pix (Brooks et al., 2023). MVEdit (Chen et al., 2024a) goes one step further by avoiding SDS and proposes a special mechanism that coordinates 2D edits from different viewpoints.

Although some of these methods allow obtaining sets of aligned objects sequentially, the editing process is constrained by the configuration of the initially generated object. This limits the visual quality of the generated sets of objects, as we show in Figure 2 and in our experiments. In contrast, our method optimizes the whole transition trajectory between the objects and produces results that are both structurally consistent and of high quality. Additionally, our method is easily adapted for the task of structure-preserving 3D editing, performing on par with specialized methods.

### 2.3 LATENT SPACE REGULARIZATION

To achieve structural alignment between the generated objects, we embed these objects into a common latent space together with the transition trajectories between them. We draw inspiration from works on generative modeling of 2D images showing that alignment, disentanglement, and quality of the generated samples can be improved with regularization of the trajectories between them. For example, Berthelot* et al. (2019) and Sainburg et al.
(2018) directly optimize the quality of interpolated samples with adversarial training, and StyleGAN (Karras et al., 2020) explicitly regularizes the smoothness of the trajectories by penalizing the perceptual path distance in the VGG feature space. Similarly, we employ a diffusion model as a critic that encourages plausibility of the samples on the trajectories via SDS, leading to smooth transitions and aligned objects.

## 3 PRELIMINARIES

### 3.1 NEURAL RADIANCE FIELDS

A neural radiance field (NeRF) (Mildenhall et al., 2020) is a differentiable volume rendering approach that represents the scene as a radiance function parameterized with a neural network. This network maps a 3D point $\mu \in \mathbb{R}^3$ and a view direction $d \in S^2$ to a volumetric density $\tau \in \mathbb{R}_+$ and a view-dependent emitted radiance $c \in \mathbb{R}^3$ at that spatial location. To render an image, NeRF queries 5D coordinates $(\mu, d)$ along camera rays and aggregates the output colors and densities using volumetric rendering. The ray color $C$ is calculated numerically through a quadrature approximation:

$$C = \sum_i \alpha_i T_i c_i, \qquad T_i = \prod_{j<i} (1 - \alpha_j),$$

where $\alpha_i = 1 - \exp(-\tau_i \delta_i)$ and $\delta_i$ is the distance between adjacent samples along the ray.
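This quadrature can be made concrete with a short sketch. The density-to-opacity conversion αᵢ = 1 − exp(−τᵢδᵢ) follows the standard NeRF formulation; the sample values below are arbitrary illustrative inputs.

```python
import numpy as np

def render_ray(tau, c, delta):
    """Quadrature for the NeRF rendering integral along one camera ray.

    tau:   densities tau_i at the N samples, shape (N,)
    c:     emitted RGB radiances c_i, shape (N, 3)
    delta: distances delta_i between adjacent samples, shape (N,)
    """
    alpha = 1.0 - np.exp(-tau * delta)               # alpha_i = 1 - exp(-tau_i * delta_i)
    # T_i = prod_{j<i} (1 - alpha_j): transmittance accumulated before sample i
    T = np.concatenate(([1.0], np.cumprod(1.0 - alpha)[:-1]))
    return np.sum((alpha * T)[:, None] * c, axis=0)  # C = sum_i alpha_i * T_i * c_i

# A ray whose first sample is nearly opaque and red returns almost pure red,
# since the red sample occludes everything behind it:
color = render_ray(np.array([50.0, 1.0]),
                   np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
                   np.array([0.1, 0.1]))
```

The cumulative product makes each sample's contribution depend on the opacity accumulated in front of it, which is what lets gradients from a 2D image loss reach densities and colors throughout the volume.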