# DeBaRA: Denoising-Based 3D Room Arrangement Generation

Léopold Maillard¹,² Nicolas Sereyjol-Garros Tom Durand² Maks Ovsjanikov¹

¹LIX, École Polytechnique, IP Paris ²Dassault Systèmes

{maillard,maks}@lix.polytechnique.fr {firstname.lastname}@3ds.com

Generating realistic and diverse layouts of furnished indoor 3D scenes unlocks multiple interactive applications impacting a wide range of industries. The inherent complexity of object interactions, the limited amount of available data and the requirement to fulfill spatial constraints all make generative modeling for 3D scene synthesis and arrangement challenging. Current methods address these challenges either autoregressively or with off-the-shelf diffusion objectives that predict all attributes simultaneously, without 3D reasoning considerations. In this paper, we introduce DeBaRA, a score-based model specifically tailored for precise, controllable and flexible arrangement generation in a bounded environment. We argue that the most critical component of a scene synthesis system is to accurately establish the size and position of various objects within a restricted area. Based on this insight, we propose a lightweight conditional score-based model designed with 3D spatial awareness at its core. We demonstrate that by focusing on the spatial attributes of objects, a single trained DeBaRA model can be leveraged at test time to perform several downstream applications such as scene synthesis, completion and re-arrangement. Further, we introduce a novel Self Score Evaluation procedure so it can be optimally employed alongside external LLM models. We evaluate our approach through extensive experiments and demonstrate significant improvement upon state-of-the-art approaches in a range of scenarios.

## 1 Introduction

Systems capable of generating realistic environments comprising multiple interacting objects would impact several industries, including video games, robotics, augmented and virtual reality (AR/VR) and computer-aided interior design. As a result, and in tandem with the growing availability of synthetic datasets of indoor layouts [10, 44, 42, 63, 7], which can be populated with high-quality 3D assets [11, 63, 1], data-driven approaches for automatically generating and arranging 3D scenes have been actively investigated by the computer vision community. Notably, the ongoing success of deep generative models for controllable content creation in the text and image domains has recently been extended to scene synthesis, allowing users to craft realistic indoor environments from a set of multimodal constraints [38, 37, 55, 54, 27, 35].

Challenges associated with 3D indoor scene generation are numerous, as the intricate nature of multi-object interactions is difficult to capture and model precisely. Items should be placed, potentially resized and oriented relative to one another, in a way that is both plausible and aligned with subjective and context-dependent priors such as style, as well as ergonomic and functional preferences. Additionally, objects should fit within a bounded, restricted area, and a subtle mismatch can break the perceived validity of the synthesized environment (e.g., overlapping, floating or out-of-bounds objects, inaccessible areas).

*Work done during internship at Dassault Systèmes.*

38th Conference on Neural Information Processing Systems (NeurIPS 2024).
[Figure 1: overview diagram; panels: Objects Semantic Generation (LLMs / Templates), 3D Layout Generation, Scene Re-Arrangement, Optimal Retrieval, 3D Scene Synthesis, Scene Completion.]

Figure 1: Application scenarios overview. Besides generating diverse and realistic 3D indoor layouts, a single trained DeBaRA model can be employed to execute several related tasks by tweaking the initial sampling noise level $\sigma_{max}$ and/or performing object- or attribute-level layout inpainting. Our novel SSE procedure enables 3D Scene Synthesis capabilities by efficiently selecting conditioning semantics from external sources, using density estimates provided by the pretrained model.

Finally, the limited availability of high-quality data [10, 42] requires learning-based approaches to make careful design choices and trade-offs.

Early data-driven approaches often rely on intermediate hand-crafted representations [43, 54, 36, 62] that are closely related to the considered dataset, which introduces significant biases. Concurrently, popular methods have adopted autoregressive architectures that treat scene synthesis as a set generation task [55, 38, 27, 21, 37] by sequentially adding individual objects. More recently, score-based generative models (also known as denoising diffusion models) have shown promising capabilities in various 3D scene understanding applications [18, 58], including controllable scene synthesis [51, 62, 60] and re-arrangement [57]. In contrast to previous methods, denoising-based approaches enable a stable and scalable training phase and can output all scene attributes simultaneously. The iterative sampling framework brings an improved consideration of the conditioning information and an attractive balance between generation quality and variety. However, current methods leveraging score-based generative models try to model all attributes (both categorical and spatial) within a single framework, which, as we demonstrate below, is less data-efficient and leads to suboptimal solutions.

In this context, our work aims to establish principled and robust capabilities for generating accurate and diverse 3D layouts. Specifically, our key contributions are threefold:

1. We propose a score-based conditional objective and architecture designed to effectively learn spatial attributes of interacting 3D objects in a constrained indoor environment. In contrast to previous approaches [51, 38], we disentangle the design space and reduce the model's prediction to a minimal representation consisting solely of oriented 3D bounding boxes, taking as conditioning input the room's floor plan and the list of object semantic categories.

2. We propose a set of approaches which allows a model trained following our method to be flexibly employed at test time to perform several user-driven tasks enabling object- or attribute-level control. In particular, we demonstrate strong capabilities on controllable scenarios such as scene re-arrangement or room completion, from a single trained network.

3. Finally, we introduce a novel Self Score Evaluation (SSE) procedure, which enables 3D scene synthesis by selecting the set of inputs, provided by external sources such as an LLM, that leads to the most realistic layouts.
We exhibit our model's capabilities across a wide range of experimental scenarios and report state-of-the-art 3D layout generation and scene synthesis performance.

## 2 Related Work

**Score-based Generative Models** By smoothly perturbing training examples with noise, diffusion models map a complex data distribution to a known Gaussian prior, from which they sample back via iterative denoising using a neural network trained over multiple noise levels. This family of generative models has been motivated by several theoretical foundations over the past years: DDPMs [16, 33] parameterize the diffusion process as a discrete-time Markov chain, as opposed to continuous-time approaches [49, 48]. The seminal EDM [19, 20] training and sampling settings later unified previous methods into an improved framework defined by a set of interpretable parameters. Originally motivated by image generation, diffusion models have demonstrated impressive capabilities on various conditional tasks such as text-to-image synthesis [34, 45], image-to-image generation from various 2D input modalities [45, 64, 56], text-to-3D asset creation [40, 23] and environment-aware human motion synthesis [18, 24]. Relevant to our work, diffusion models have been applied to the generation of point clouds [31, 52] and other geometric representations involving 3D coordinates [59].

**Lifting Pretrained Diffusion Models** The knowledge of trained diffusion models can be leveraged in various settings, including content inpainting [30, 18], score distillation [40], exact likelihood computation [49, 19] and teacher-student distillation [47, 32]. More relevant to our work, image-domain diffusion priors have demonstrated compelling performance in discriminative tasks, including zero-shot image classification [22, 6, 5] and segmentation [4]. More precisely, Diffusion Classifiers assign a label, from a finite set of possible classes $\{c_i\}_{i=1}^N$, to an observed sample $x_0$ by computing class-conditional density estimates from a pretrained diffusion model under the assumption of a uniform prior $p(c_i) = 1/N$. In practice, this is done by, for each class, iteratively adding noise to the observed sample $x_0$ and computing a Monte Carlo estimate of the expected reconstruction loss using the class-conditioned model (see the sketch at the end of this section).

**Controllable 3D Scene Synthesis** Synthesizing indoor 3D layouts from a partial set of information or constraints has come in various settings, depending on the provided vs. predicted entities and the enabled control granularity. A prolific line of research has adopted intermediate 3D scene representations such as graphs [25, 43, 54, 36, 62, 13, 26], furniture matrices [65] or multi-view images [35]. Autoregressive furnishing approaches [55, 38] have been supplemented by object attribute-level conditioning [37, 27] and additional ergonomic constraints [21]. However, their one-object-at-a-time strategy does not comprehensively capture complex relationships between all the interacting elements and is known to easily fall into local minima in which new items fail to be accurately inserted into the current configuration. Lately, methods have unfolded the double-edged capabilities of LLMs in this area [9, 61], as they excel at generating sensible furniture descriptions while struggling to accurately arrange them in 3D space, which [12] addresses by introducing a costly refinement stage. In light of this, LLMs appear to be ideal candidates to supplement a specialized 3D layout generation model.
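For concreteness, the following is a minimal sketch of the Diffusion Classifier scoring loop described above, under the assumption of a generic class-conditioned denoiser interface; the `denoiser` callable, its signature and the noise-level range are illustrative assumptions, not the API of any specific implementation.

```python
import torch

def diffusion_classifier(x0, class_ids, denoiser, n_trials=16):
    """Sketch of a Diffusion Classifier [22]: score each class by the expected
    class-conditional reconstruction error of a pretrained denoiser.
    `denoiser(x_noisy, c, sigma)` is an assumed interface (hypothetical)."""
    errors = {c: [] for c in class_ids}
    for _ in range(n_trials):
        sigma = torch.rand(()) + 0.01            # noise level draw (assumed schedule)
        eps = torch.randn_like(x0)
        x_noisy = x0 + sigma * eps               # perturb the observed sample
        for c in class_ids:
            x_hat = denoiser(x_noisy, c, sigma)  # class-conditioned reconstruction
            errors[c].append(((x_hat - x0) ** 2).mean().item())
    # Lowest expected reconstruction loss <=> highest estimated p(x0 | c);
    # with a uniform prior p(c) = 1/N this is also the maximum-posterior class.
    return min(errors, key=lambda c: sum(errors[c]) / len(errors[c]))
```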
**Denoising Indoor Scenes** Previous methods have explored diffusion-based approaches in the context of 3D scene synthesis. Pioneering their usage, LEGO-Net [57] performs scene re-arrangement (i.e., recovering a clean object layout from a noisy one) in the 2D space using a transformer backbone that is not noise-conditioned, which we argue is the root cause of its main limitations. PhyScene [60] augments diffusion-based 3D scene synthesis with additional physics-based guidance to enable practical embodied agent applications. Most relevant to our work, DiffuScene [51] achieves 3D scene synthesis by fitting a DDPM [16] on stacked 3D object features, resulting in a high-dimensional composite distribution that is hard to learn and interpret. It does not enforce spatial configurations over other predicted features. More importantly, its generative process is not conditioned on the room's floor plan (i.e., bounds), which constrains objects to be placed within a restricted area.

## 3 Method

### 3.1 3D Scene Representation

Our method is based on encoding the state of a 3D indoor scene $S$ that is defined by a floor plan (i.e., bounds) $\mathcal{F}$ and an unordered set of $N$ objects $\mathcal{O} = \{o_1, \ldots, o_N\}$, each being modeled by its typed 3D bounding box $o_i = \{x_i, c_i\}$, where $c_i \in \{0, 1\}^k$ is the one-hot encoding of the semantic category among $k$ classes and $x_i = (p_i, r_i, d_i) \in \mathbb{R}^8$ comprises the 3D spatial attributes. More specifically, $p_i \in \mathbb{R}^3$ denotes the object's center position, $r_i = (\cos\theta_i, \sin\theta_i) \in \mathbb{R}^2$ is a continuous encoding of the rotation of angle $\theta_i$ around the scene's vertical axis [66] and $d_i \in \mathbb{R}^3$ is its dimensions.

[Figure 2: architecture diagram showing the ground-truth scene, the perturbed scene, the floor plan encoder, the shared object encoder, the noise encoder, positional encoding, concatenation and the transformer scene encoder with conditioning dropout.]

Figure 2: DeBaRA architecture and training overview. At each iteration, the 3D bounding box parameters $(p, r, d)$ of the indoor scene's objects $\mathcal{O}$ are perturbed with Gaussian noise $\sigma\epsilon$. The floor plan $\mathcal{F}$, noise level $\sigma$ and resulting objects $\mathcal{O}_\sigma$ are processed by respective encoders to form an unordered set of representations $\mathcal{T}$ fed as input to a transformer encoder. Novel object embeddings $\hat{\mathcal{T}}_o$ are finally decoded back to their predicted clean spatial configuration $(\hat{p}, \hat{r}, \hat{d})$. Trainable modules are optimized by minimizing a semantic-aware Chamfer loss. Input object categories $c$ are randomly dropped to model both the class-conditional and unconditional 3D layout distributions.

### 3.2 Diffusion Framework and Architecture

We describe in this section our score-based layout generation framework, the relevant design choices and the network architecture, summarized in Figure 2. Notably, unlike previous approaches [51, 38, 37] that output a range of attributes lying in different spaces, we focus on accurately modeling 3D spatial layouts of bounded indoor scenes from a set of input object categories.

**Learning 3D spatial configurations from object semantics** We adopt a diffusion-based approach to yield a conditional generation model that outputs 3D object spatial features $\{x_i\}_{i=1}^N$ from an input floor plan and set of semantic categories $y = (\mathcal{F}, c)$ with $c = \{c_i\}_{i=1}^N$. During training, the 3D spatial attributes are perturbed with Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ at various noise levels (i.e., magnitudes) $\sigma$. A trainable noise-conditioned denoiser model $D_\theta(x_\sigma; y, \sigma)$ maps noisy spatial attributes $x_\sigma = x + \sigma\epsilon$ back to their clean counterparts $\hat{x} \approx x \in \mathbb{R}^{N \times 8}$.
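The following is a minimal sketch of this representation and of the training-time perturbation, assuming the tensor shapes of Section 3.1; the helper names are illustrative, not DeBaRA's actual code.

```python
import torch

def encode_rotation(theta):
    """Continuous 2D encoding (cos, sin) of the yaw angle around the vertical axis."""
    return torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)

def make_layout(positions, thetas, dims):
    """Stack per-object spatial attributes into x in R^{N x 8}:
    3 (center p) + 2 (rotation r) + 3 (dimensions d)."""
    return torch.cat([positions, encode_rotation(thetas), dims], dim=-1)

def perturb(x, sigma):
    """Forward perturbation x_sigma = x + sigma * eps used during training;
    sigma is a continuous noise level, as in the EDM formulation [19]."""
    return x + sigma * torch.randn_like(x)

# Example: a 3-object scene with random attributes (values in scene units, e.g., meters).
N = 3
x = make_layout(torch.randn(N, 3), torch.rand(N) * 6.28, torch.rand(N, 3))
x_sigma = perturb(x, sigma=0.5)  # a mid-range noise magnitude
```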
We note that each object spatial attribute has an individual real-world interpretation (e.g., $p$ and $d$ can be expressed in meters, $r$ in degrees). To preserve their measurable nature at intermediate perturbed configurations $x_\sigma$, we want our diffusion parameterization to support a continuous range of noise levels, correlated with the scale of the input signal. This is particularly convenient at test time (see Section 3.5). To guarantee both properties, we adapt the score-based EDM [19] framework. In our context, this formulation is more natural than the DDPM framework employed by previous work [51]: the latter is based on discretized noise levels and does not offer a straightforward interpretation of the scene's state at arbitrary timesteps. Our parameterization is further detailed in Appendix A.

**Estimating the unconditional layout density** Inspired by classifier-free guidance [17] in the image domain, we model both the class-conditional density $p_\theta(x \mid \mathcal{F}, c)$ and the unconditional density $p_\theta(x \mid \mathcal{F}, \varnothing)$ with a single network of parameters $\theta$. At each training iteration, we perform conditioning dropout on the set of semantic categories, setting $c = \varnothing$ with probability $p_{drop}$ and $c = \{c_i\}_{i=1}^N$ otherwise. We found that this mechanism helps to reduce overfitting of the training layouts $p_{data}(x)$ and enables the novel capabilities that we introduce in Section 3.4.

**Denoiser Network Architecture** Our lightweight architecture is inspired by previous work [57], to which we make key changes. Similar to [38] and [51], we use a shared object encoder to obtain a per-object token $\mathcal{T}_{o_i}$ as the concatenation of the object $o_i$'s attributes embedded by sinusoidal positional encoding and linear layers. Following [57], we uniformly sample $P$ points on the edges of the floor plan and feed them into a PointNet [41] model, resulting in a floor token $\mathcal{T}_{\mathcal{F}}$. This choice of feature extractor backbone is natural, as it maintains all of the input scene's spatial features in a common 3D space. Importantly, a noise token $\mathcal{T}_\sigma$ is computed from the current noise level $\sigma$, making our architecture noise-aware, i.e., able to denoise layouts $x_\sigma$ at any perturbation magnitude. All the previously encoded tokens form a sequence $\mathcal{T} = \{\mathcal{T}_{\mathcal{F}}, \mathcal{T}_\sigma, \mathcal{T}_{o_1}, \ldots, \mathcal{T}_{o_N}\}$ from which a global scene encoder $\mathcal{T}_\theta$ computes rich representations $\hat{\mathcal{T}}$. We design the method without any token ordering and use a padding mask for scenes with fewer objects than the transformer's capacity. A final shared decoder MLP takes as input the object tokens $\{\hat{\mathcal{T}}_{o_i}\}_{i=1}^N$ and returns the denoised spatial attribute values $\hat{x} = \{(\hat{p}_i, \hat{r}_i, \hat{d}_i)\}_{i=1}^N$. We provide implementation details on the denoiser in Appendix B.1.

### 3.3 3D Spatial Objective

Our noise-conditioned model $D_\theta$ is optimized towards a novel semantic-aware Chamfer distance objective that does not penalize permutations of 3D bounding boxes sharing the same semantic category between the predicted scene object layout $\hat{\mathcal{O}}$ and the ground-truth one $\mathcal{O}$:

$$\mathcal{L}_{CD}(\hat{\mathcal{O}}, \mathcal{O}) = \frac{1}{2N} \left( \sum_{\hat{o} \in \hat{\mathcal{O}}} \min_{o \in \mathcal{O}} l(\hat{o}, o) + \sum_{o \in \mathcal{O}} \min_{\hat{o} \in \hat{\mathcal{O}}} l(\hat{o}, o) \right) \quad \text{where} \quad l(\hat{o}, o) = \|\hat{x} - x\|_2^2 + \kappa \left(1 - \delta_c(\hat{o}, o)\right). \tag{2}$$

Here, $\kappa$ is a large value, so that a significant penalty is applied to objects that do not share the same semantic category $c$, preventing them from being returned by the $\min$ operator. We can finally rewrite the usual score-based training objective [49, 19] as:

$$\mathbb{E}_{p_{data}(x),\, \epsilon,\, \sigma} \left[ \lambda(\sigma)\, \mathcal{L}_{CD}\big(D_\theta(x + \sigma\epsilon;\, y, \sigma),\, x\big) \right] \tag{3}$$

where $\lambda(\sigma)$ is a noise-dependent loss weighting function.
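Below is a minimal sketch of the semantic-aware Chamfer distance of Eq. (2). It assumes integer class ids rather than the one-hot vectors of Section 3.1 (an equivalent simplification), and the default value of the mismatch penalty `kappa` is an assumption.

```python
import torch

def semantic_chamfer(x_pred, x_gt, c_pred, c_gt, kappa=1e4):
    """Semantic-aware Chamfer distance, Eq. (2) sketch.
    x_* : (N, 8) spatial attributes; c_* : (N,) integer class ids."""
    N = x_pred.shape[0]
    # Pairwise squared L2 distances between predicted and ground-truth boxes.
    d = torch.cdist(x_pred, x_gt) ** 2                     # (N, N)
    # Add kappa where semantic categories differ, so min() avoids cross-class matches.
    mismatch = (c_pred[:, None] != c_gt[None, :]).float()  # (N, N)
    d = d + kappa * mismatch
    # Symmetric Chamfer: nearest ground-truth box per prediction, and vice versa.
    return (d.min(dim=1).values.sum() + d.min(dim=0).values.sum()) / (2 * N)
```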
### 3.4 Self Score Evaluation

While specifying complete conditioning information such as the set of object semantics $c$ can be tedious, it can be provided by either an LLM or a separately trained sequence generation model. However, using independent models is inherently suboptimal, since it does not guarantee that the generated conditioning input will be aligned with the score model's knowledge. As a result, we propose a novel method to select conditioning inputs that are attuned to the model's capabilities. More specifically, we evaluate a finite set of $C$ candidate object semantic categories, where each candidate is associated with a 3D spatial layout sampled from the learned conditional density, i.e.,

$$\text{candidates} = \left\{ \big(c_j,\; x_j \sim p_\theta(x \mid \mathcal{F}, c_j)\big) \right\}_{j=1}^{C} \tag{4}$$

Then, the optimal conditioning candidate $c^\star$ is derived from a density estimate of its corresponding 3D spatial layout $x^\star$ provided by the unconditional network:

$$x^\star = \arg\min_{x_j} \mathbb{E}_{\epsilon, \sigma} \left[ \mathcal{L}_{CD}\big(D_\theta(x_j + \sigma\epsilon;\, \mathcal{F}, \varnothing, \sigma),\, x_j\big) \right] \tag{5}$$

**Algorithm 1: Self Score Evaluation**

Require: a diffusion prior $D_\theta$ trained with conditioning dropout and by optimizing $\mathcal{L}_{CD}$
Input: conditioning candidates $\{c_j\}_{j=1}^C$, number of score evaluation trials $T_{sse}$
1: sample $x_j \sim p_\theta(x \mid \mathcal{F}, c_j)$ for each candidate $c_j$ using iterative sampling
2: initialize scores[$c_j$] = list() for each $c_j$
3: for trial $t = 1, \ldots, T_{sse}$ do
4: &nbsp;&nbsp;&nbsp;sample $\sigma \sim \mathcal{N}(0, \sigma_s)$; $\epsilon \sim \mathcal{N}(0, I)$
5: &nbsp;&nbsp;&nbsp;for each candidate $c_k$ with sample $x_k$ do
6: &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;scores[$c_k$].append($\mathcal{L}_{CD}\big[D_\theta(x_k + \sigma\epsilon;\, \mathcal{F}, \varnothing, \sigma),\, x_k\big]$)
7: &nbsp;&nbsp;&nbsp;end for
8: end for
9: return $\arg\min_{c_j}$ mean(scores[$c_j$])

[Figure 3 panels: (a) ATISS [38], (b) DiffuScene [51], (c) DeBaRA.]

Figure 3: We compare our method with established baselines for generating a 3D layout from a floor plan and a set of object categories. DeBaRA produces fewer failure cases while consistently generating regular arrangements within the room's bounds.

In practice, we compute an unbiased Monte Carlo estimate of each candidate's expectation using $T_{sse}$ fixed $(\sigma, \epsilon)$ pairs. Although similar in some aspects, SSE fundamentally differs from diffusion classifiers [22], as in our case the uniform assumption over conditioning probabilities does not hold. Indeed, in our setting some input signals cannot lead to a plausible arrangement at all. As a result, density estimates of observed samples generated by the class-conditioned model are computed using the unconditional one, while diffusion classifiers compute density estimates of a single observed sample using the class-conditioned model. The SSE procedure is detailed in Algorithm 1. It is further illustrated and discussed in Appendix C.

### 3.5 Application Scenarios

As shown in Figure 1, a single trained DeBaRA model can be used at test time to perform multiple downstream interactive applications. Usual generation procedures, such as the EDM 2nd-order stochastic sampler [19], can be applied using our trained denoiser to generate novel 3D layouts via $T$-step iterative denoising at discretized noise levels $\sigma_0 = \sigma_{max} > \ldots > \sigma_T = 0$. In particular, several applications can be performed by inpainting [30], i.e., predicting missing spatial features from those specified (i.e., fixed) in the input layout $x \in \mathbb{R}^{N \times 8}$. To do so, we introduce a binary mask $m \in \{0, 1\}^{N \times 8}$ specifying the values to retain from the input. The predicted layout at any sampling iteration $t$ can be expressed as:

$$x_{\sigma_t} = \hat{x}_{\sigma_t} \odot (1 - m) + x_{\sigma_t} \odot m \tag{6}$$

**3D Layout Generation** Novel and diverse 3D layouts can be generated from an input set of semantic categories $c$ and a floor plan $\mathcal{F}$ by sampling from a high initial noise level $\sigma_{max} \gg \sigma_{data}$, arbitrarily initialized 3D spatial features $x_{\sigma_0}$ and with $m = 0^{N \times 8}$.
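The following is a minimal sketch of Algorithm 1. The interfaces are assumptions: `candidates` is a list of $(c_j, x_j)$ pairs where each $x_j$ was already sampled from $p_\theta(x \mid \mathcal{F}, c_j)$ (line 1 of the algorithm), `denoiser_uncond(x_noisy, sigma)` is the unconditional denoiser (semantics dropped), `loss_fn` is the semantic-aware Chamfer loss, and the default hyperparameters are illustrative.

```python
import torch

def self_score_evaluation(candidates, denoiser_uncond, loss_fn, sigma_s=0.5, T_sse=8):
    """Sketch of Self Score Evaluation (Algorithm 1): return the candidate whose
    sampled layout has the lowest expected unconditional reconstruction loss."""
    scores = [0.0] * len(candidates)
    for t in range(T_sse):
        # One (sigma, eps) draw per trial, shared across candidates for a fair
        # comparison; folding sigma to a positive magnitude is an assumption.
        sigma = (torch.randn(()) * sigma_s).abs()
        for j, (_, x_j) in enumerate(candidates):
            # Re-seeding per trial gives every candidate the same eps draw.
            gen = torch.Generator().manual_seed(t)
            eps = torch.randn(x_j.shape, generator=gen)
            x_hat = denoiser_uncond(x_j + sigma * eps, sigma)
            scores[j] += loss_fn(x_hat, x_j).item() / T_sse  # running Monte Carlo mean
    # Minimizing the expected loss corresponds to maximizing the estimated
    # unconditional layout density (Eq. 5).
    best = min(range(len(candidates)), key=lambda j: scores[j])
    return candidates[best]
```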
**3D Scene Synthesis** DeBaRA can perform 3D scene synthesis via 3D layout generation from semantic categories provided by external sources such as an LLM [9]. Input conditioning candidates can further be optimally selected using the Self Score Evaluation procedure.

**Scene Completion** Additional objects $o_a$ can be inserted into an existing scene partially furnished with $k$ objects $o_e$. To do so, their 3D spatial attributes $x_a$ are inpainted from the existing ones $x_e$ with $D_\theta$ conditioned on the updated set of semantic categories $c = c_e \cup c_a$, using $m(i, j) = \mathbb{1}\{i \leq k\}$.

**Re-arrangement** In the context of scene synthesis, re-arrangement [57] consists in recovering the closest clean spatial configuration of existing objects from a messy one, which has practical applications in robotics [2]. DeBaRA can perform re-arrangement by sampling from an initial noise level $\sigma_{max}$ that depends on the scene perturbation magnitude. During denoising, the object positions and rotations $(p, r) \in \mathbb{R}^{N \times 5}$ are inpainted from the known object dimensions, using $m(i, j) = \mathbb{1}\{j > 5\}$.

[Figure 4 panels: (a) Ground Truth, (c) Re-arranged, (d) Partial, (e) Completed.]

Figure 4: Qualitative results on scene re-arrangement (left) and completion (right). DeBaRA is able to recover a plausible layout from a messy one, and finely takes the initial configurations into account.

**Optimal Object Retrieval** 3D scene synthesis systems depend on external 3D asset databases for furnishing rooms. For each object of semantic class $c$, a textured furniture model is retrieved by minimizing the mismatch with the generated dimensions $d_{\sigma_T}$. This is inherently suboptimal, as the resulting scene quality is limited by the size of the external database. To overcome this issue, we introduce a post-retrieval refinement stage that performs additional re-arrangement steps starting from a noise level $\sigma_{max}$ derived from the mismatch between the generated and retrieved object dimensions.

**Generation from Coarse Specifications** We propose a time-dependent masking approach to synthesize layouts from rough input spatial features (i.e., instead of exact ones), which are adjusted in the late denoising iterations. To indicate approximate, e.g., object dimensions, we set, at any sampling step $t$, $m(i, j) = \mathbb{1}\{j > 5 \text{ and } t$
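To make the masking conventions of the scene completion and re-arrangement scenarios above concrete, here is a minimal sketch of the binary masks of Eq. (6). The zero-indexed column convention (columns 0-2 = position $p$, 3-4 = rotation $r$, 5-7 = dimensions $d$, matching the 1-indexed $j > 5$ of the text) and the assumed ordering of existing objects first are illustrative assumptions.

```python
import torch

def completion_mask(N, k):
    """Scene completion: keep all attributes of the k existing objects
    (assumed to come first), denoise everything for the newly added ones."""
    m = torch.zeros(N, 8)
    m[:k, :] = 1.0
    return m

def rearrangement_mask(N):
    """Re-arrangement: keep object dimensions d (columns 5-7),
    denoise positions and rotations (p, r)."""
    m = torch.zeros(N, 8)
    m[:, 5:] = 1.0
    return m

def inpaint_step(x_hat, x_known, m):
    """One application of Eq. (6): blend the denoiser prediction with the
    retained input features (mask entries equal to 1 are kept from the input)."""
    return x_hat * (1 - m) + x_known * m
```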