# DiffGS: Functional Gaussian Splatting Diffusion

Junsheng Zhou  Weiqi Zhang  Yu-Shen Liu
School of Software, Tsinghua University, Beijing, China
{zhou-js24,zwq23}@mails.tsinghua.edu.cn  liuyushen@tsinghua.edu.cn

3D Gaussian Splatting (3DGS) has shown convincing performance in rendering speed and fidelity, yet the generation of Gaussian Splatting remains a challenge due to its discreteness and unstructured nature. In this work, we propose DiffGS, a general Gaussian generator based on latent diffusion models. DiffGS is a powerful and efficient 3D generative model which is capable of generating Gaussian primitives at arbitrary numbers for high-fidelity rendering with rasterization. The key insight is to represent Gaussian Splatting in a disentangled manner via three novel functions to model Gaussian probabilities, colors and transforms. Through this novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with continuous Gaussian Splatting functions, and then train a latent diffusion model with the target of generating these Gaussian Splatting functions both unconditionally and conditionally. Meanwhile, we introduce a discretization algorithm to extract Gaussians at arbitrary numbers from the generated functions via octree-guided sampling and optimization. We explore DiffGS for various tasks, including unconditional generation, conditional generation from text, image, and partial 3DGS, as well as Point-to-Gaussian generation. We believe that DiffGS provides a new direction for flexibly modeling and generating Gaussian Splatting. Project page: https://junshengzhou.github.io/DiffGS.

1 Introduction

3D content creation is a vital task in computer graphics and 3D computer vision, which shows great potential in real-world applications such as virtual reality, game design, film production, and robotics. Previous 3D generative models usually take Neural Radiance Fields (NeRF) [41, 2, 62] as the representation. However, the volumetric rendering for NeRF requires considerable computational cost, leading to sluggish rendering speeds and a significant memory burden. Recent advances in 3D Gaussian Splatting (3DGS) [28, 68, 23] have demonstrated its potential to serve as the next-generation 3D representation by enabling both real-time rendering and high-fidelity appearance modeling. Designing 3D generative models for 3DGS provides a scheme for real-time interaction with 3D creations.

The core challenge in generative 3DGS modeling lies in its discreteness and unstructured nature, which prevents the well-studied frameworks in structural image/voxel/video generation from transferring to directly generate 3DGS. Concurrent works [72, 19] alternatively transport Gaussians into structural voxel grids with volume generation models [11] for generating Gaussians. However, these methods lead to 1) substantial computational cost for high-resolution voxels, and 2) a limited number of generated Gaussians constrained by the voxel resolutions. Certain voxelization schemes [19] also introduce information loss, making it challenging to maintain high-quality Gaussian reconstructions.

Equal contribution. Yu-Shen Liu is the corresponding author.
38th Conference on Neural Information Processing Systems (NeurIPS 2024).
Figure 1: The illustration of DiffGS. We fit 3DGS from multi-view images and then disentangle it into three Gaussian Splatting Functions. We train a Gaussian VAE with a latent diffusion model for generating these functions, followed by a Gaussian extraction algorithm to obtain the final generated Gaussians.

To address these challenges, we present DiffGS, a novel diffusion-based generative model for general 3D Gaussian Splatting, which is capable of efficiently generating high-quality Gaussian primitives at arbitrary numbers. The key insight of DiffGS is to represent Gaussian Splatting in a disentangled manner via three novel functions: the Gaussian Probability Function (GauPF), the Gaussian Color Function (GauCF) and the Gaussian Transform Function (GauTF). Specifically, GauPF indicates the geometry of 3DGS by modeling the probability of each sampled 3D location to be a Gaussian location. GauCF and GauTF predict the Gaussian attributes of appearance and transformation given a 3D location as input, respectively. Through this novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with three continuous Gaussian Splatting Functions.

With the disentangled and powerful representation, the next step is to design a generative model with the target of generating these Gaussian Splatting Functions. We propose a Gaussian VAE model for creating a compressed representation of the Gaussian Splatting Functions. The Gaussian VAE learns a regularized latent space which maps the Gaussian Splatting Functions of each shape into one latent vector. A latent diffusion model (LDM) is simultaneously trained in the latent space for generating novel 3DGS shapes. With the powerful LDM, we explore DiffGS to generate diverse 3DGS both conditionally and unconditionally. Finally, we introduce a discretization algorithm to extract Gaussians at arbitrary numbers from the generated functions via octree-guided sampling and optimization. The key idea is to first extract the 3D Gaussian geometry from GauPF by sampling 3D locations in the regions with the highest Gaussian probabilities, and then predict the Gaussian attributes with GauCF and GauTF. We illustrate the overview of DiffGS in Fig. 1.

We systematically summarize the superiority of DiffGS in terms of: 1) Efficiency: we design DiffGS based on Gaussian Splatting and latent diffusion models, which shows significant efficiency in model training, inference and shape rendering. 2) and 3) Generality and quality: we generate native 3DGS without processes like voxelization, leading to unimpaired quality and generality in applying to downstream 3DGS applications. 4) Scalability: we scalably generate Gaussian primitives at arbitrary numbers. We conduct comprehensive experiments on both the synthetic ShapeNet dataset and the real-world DeepFashion3D dataset, which demonstrate our non-trivial improvements over the state-of-the-art methods.

In summary, our contributions are given as follows.

- We propose DiffGS, a novel diffusion-based generative model for general 3D Gaussian Splatting, which is capable of efficiently generating high-quality Gaussian primitives at arbitrary numbers.
- We introduce a novel schema to represent Gaussian Splatting in a disentangled manner via three functions to model Gaussian probabilities, Gaussian colors and Gaussian transforms, respectively. We simultaneously propose a discretization algorithm to extract Gaussians from these functions via octree-guided sampling and optimization.

- DiffGS achieves remarkable performances under various tasks including unconditional generation, conditional generation from text, image, and partial 3DGS, as well as Point-to-Gaussian generation.

2 Related Work

2.1 Rendering-Guided 3D Representation

Neural implicit representations which learn signed [46, 79, 37] (unsigned [77, 78, 75]) distance functions or occupancy functions [39] have largely advanced the fields of 3D generation [84, 76, 66, 36], reconstruction [80, 26, 24, 25, 45] and perception [82, 83, 81, 32, 31]. Remarkable progress has been achieved in the field of novel view synthesis (NVS) [41, 47, 62, 2, 43] with the proposal of Neural Radiance Fields (NeRF) [41]. NeRF implicitly represents scene appearance and geometry using MLP-based neural networks, optimized through volume rendering to achieve outstanding NVS quality. Subsequent variants [1, 15, 49] have shown promising performance by advancing NeRF in terms of rendering quality, scalability and view-consistency. Additionally, more recent methods [43, 7, 14, 64] explore the training and rendering efficiency of NeRF by introducing feature-grid based 3D representations. Instant-NGP [43] greatly accelerates NeRF learning by introducing multi-resolution feature grids based on a hash table with fully-fused CUDA kernel implementations. However, the NeRF representations, which require expensive neural network inferences during volume rendering, still struggle in applications where real-time rendering is required.

Recently, the emergence of 3D Gaussian Splatting (3DGS) [28, 59, 30, 68, 71, 18, 74] has showcased impressive real-time results in novel view synthesis (NVS). 3DGS [28] has led to revolutions in the NVS field by demonstrating superior performances in multiple domains. However, the generation of Gaussian Splatting remains a challenge due to its discreteness and unstructured nature. In this paper, we introduce a novel schema to represent the discrete and unstructured 3DGS with three continuous Gaussian Splatting Functions, thus ingeniously tackling the challenge by designing generative models for the functions.

2.2 3D Generative Models

The field of creating 3D content with generative models has emerged as a particularly captivating research direction. A series of studies [48, 33, 65, 52, 40, 36, 9, 56, 69, 60] focus on optimization-based frameworks built on Score Distillation Sampling (SDS), which achieve convincing generation performance by distilling the 3D geometry and appearance of radiance fields with pretrained 2D diffusion models [44, 21] as the prior. However, these studies entail significant computational costs due to time-consuming per-scene optimization. Going beyond optimization-based 3D generation, recent methods [42, 61, 63, 27] explore 3D generative methods based on diffusion models to directly learn priors from 3D datasets for generative radiance field modeling, which typically represent radiance fields as structural triplanes [63, 55, 17] or voxels [61, 42, 11]. DiffRF [42] leverages a voxel-based NeRF representation with 3D U-Nets as the backbone to train a diffusion model.
With the recent advances in 3DGS [28], designing a powerful 3D generative model for generating 3DGS is expected to be a popular research topic. This also brings significant challenges due to the discreteness and unstructured nature of 3DGS, which prevents the well-studied frameworks in structural image/voxel/video generation from transferring to directly generate 3DGS. A series of studies [70, 86, 22, 73] follow the schema of image-based reconstruction without generative modeling, and thus lack the ability to generate diverse shapes. The concurrent studies GaussianCube [72] and GVGEN [19] follow voxel-based representations to transport Gaussians into structural voxel grids with volume generation models for generating Gaussians. However, these methods come with several drawbacks, including substantial computational costs for high-resolution voxels and a restricted number of generated Gaussians constrained by voxel resolutions. Some voxelization strategies [19] may introduce information loss, leading to difficulties in preserving high-quality Gaussian reconstructions. In contrast, our proposed DiffGS explores a new perspective to directly represent the discrete and unstructured 3DGS with three continuous Gaussian Splatting Functions. Through this insight, we design a latent diffusion model for efficiently generating high-quality Gaussian primitives by learning to generate the Gaussian Splatting Functions. DiffGS generates general Gaussians at arbitrary numbers with a specially designed octree-based extraction algorithm.

We introduce DiffGS, a novel diffusion-based generative model for general 3D Gaussian Splatting, which is capable of efficiently generating high-quality Gaussian primitives at arbitrary numbers.

Figure 2: The overview of DiffGS. (a) We disentangle the fitted 3DGS into three Gaussian Splatting Functions to model the Gaussian probabilities, colors and transforms, respectively. We then train a Gaussian VAE with a conditional latent diffusion model for generating these functions. (b) During generation, we first extract the Gaussian geometry from the generated GauPF, followed by GauCF and GauTF to obtain the Gaussian attributes.

The overview of DiffGS is shown in Fig. 2. We first review Gaussian Splatting in Sec. 3.1 and present the novel functional schema for representing Gaussian Splatting with three disentangled Gaussian Splatting Functions in Sec. 3.2. We then introduce the Gaussian Variational Auto-encoder and the Latent Diffusion Model for compressing and generative modeling of Gaussian Splatting Functions in Sec. 3.3. A novel discretization algorithm is further developed in Sec. 3.4 to extract Gaussians at arbitrary numbers from the generated functions via octree-guided sampling and optimization.

3.1 Review of Gaussian Splatting

3D Gaussian Splatting (3DGS) [28] represents a 3D shape or scene as a set of Gaussians with attributes that model the geometry and view-dependent appearance. For a 3DGS $G = \{g_i\}_{i=1}^{N}$ containing $N$ Gaussians, the geometry of the $i$-th Gaussian is explicitly parameterized via a 3D covariance matrix $\Sigma_i$ and its center $\sigma_i \in \mathbb{R}^3$, formulated as:

$$g_i(x) = \exp\Big(-\tfrac{1}{2}(x - \sigma_i)^T \Sigma_i^{-1} (x - \sigma_i)\Big), \quad (1)$$

where the covariance matrix $\Sigma_i = r_i s_i s_i^T r_i^T$ is factorized into a rotation $r_i \in \mathbb{R}^4$ (a quaternion) and a scale $s_i \in \mathbb{R}^3$.
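To make the factorization in Eq. (1) concrete, the following PyTorch sketch builds the covariance $\Sigma_i$ from a quaternion rotation and a per-axis scale and evaluates $g_i(x)$. This is an illustrative sketch rather than the paper's code; the $(w, x, y, z)$ quaternion ordering and the function names are assumptions.

```python
import torch

def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """Convert quaternions (N, 4), ordered (w, x, y, z), to rotation matrices (N, 3, 3)."""
    q = q / q.norm(dim=-1, keepdim=True)          # normalize for safety
    w, x, y, z = q.unbind(-1)
    R = torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)
    return R

def build_covariance(rotation_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Sigma_i = r_i s_i s_i^T r_i^T with S = diag(scale), as in the factorization above."""
    R = quaternion_to_rotation_matrix(rotation_q)        # (N, 3, 3)
    S = torch.diag_embed(scale)                          # (N, 3, 3)
    M = R @ S
    return M @ M.transpose(-1, -2)                       # (N, 3, 3), symmetric PSD

def gaussian_value(x: torch.Tensor, center: torch.Tensor, cov: torch.Tensor) -> torch.Tensor:
    """Evaluate Eq. (1): g_i(x) = exp(-0.5 (x - sigma_i)^T Sigma_i^{-1} (x - sigma_i))."""
    d = (x - center).unsqueeze(-1)                       # (N, 3, 1)
    quad = d.transpose(-1, -2) @ torch.linalg.inv(cov) @ d
    return torch.exp(-0.5 * quad).squeeze(-1).squeeze(-1)
```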
The appearance of the Gaussian $g_i$ is controlled by an opacity value $o_i \in \mathbb{R}$ and a color value $c_i \in \mathbb{R}^3$. Note that the color is represented as a series of spherical harmonics coefficients in the practice of 3DGS, yet we keep its definition as a three-dimensional color $c_i$ in our paper for a clear understanding of our method. To this end, the 3DGS $G$ is defined as $\{g_i = \{\sigma_i, r_i, s_i, o_i, c_i\} \in \mathbb{R}^K\}_{i=1}^{N}$, where $K$ is the dimension of the combined attributes in each Gaussian.

3.2 Functional Gaussian Splatting Representation

The key challenge in generative 3DGS modeling lies in its discreteness and unstructured nature, which prevents the well-studied generative frameworks from transferring to directly generate 3DGS. We address this challenge by representing Gaussian Splatting in a disentangled manner via three novel functions: the Gaussian Probability Function (GauPF), the Gaussian Color Function (GauCF) and the Gaussian Transform Function (GauTF). Through this novel disentangling of 3DGS, we represent the discrete and unstructured 3DGS with three continuous Gaussian Splatting Functions.

Gaussian Probability Function. The Gaussian Probability Function (GauPF) indicates the geometry of 3DGS by modeling the probability of each sampled 3D location to be a Gaussian location. Given a set of 3D query locations $Q = \{q_j \in \mathbb{R}^3\}_{j=1}^{M}$ sampled in the 3D space around a fitted 3DGS $G = \{g_i\}_{i=1}^{N}$, the GauPF of $G$ predicts the probability $p_j$ of each query $q_j$ to be a Gaussian location in $G$, formulated as:

$$p_j = \mathrm{GauPF}(q_j) \in [0, 1]. \quad (2)$$

The idea of Gaussian probability modeling comes from the observation that the further a 3D location $q_j$ is from all Gaussians, the lower the probability that any Gaussian occupies the space at $q_j$. Therefore, the ground-truth Gaussian probability of $q_j$ is defined as:

$$\mathrm{GauPF}(q_j) = \tau\Big(\lambda\big(\min_{i \in [1, N]} \|q_j - \sigma_i\|_2\big)\Big), \quad (3)$$

where $\min_{i \in [1, N]} \|q_j - \sigma_i\|_2$ indicates the distance from $q_j$ to the nearest Gaussian center in $\{\sigma_i\}_{i=1}^{N}$, $\lambda$ is a truncation function which filters out extremely large values and $\tau$ is a continuous function which maps the query-to-Gaussian distances to probabilities in the range of $[0, 1]$. A learned GauPF implicitly models the locations of 3D Gaussian centers, which is the key factor for generating high-quality 3DGS. The extraction of 3DGS centers from GauPF is then achieved with our designed Gaussian extraction algorithm, which will be introduced in Sec. 3.4.

Gaussian Color and Transform Modeling. The Gaussian Color Function (GauCF) and the Gaussian Transform Function (GauTF) predict the Gaussian attributes of appearances and transformations from Gaussian geometries. Specifically, given the center $\sigma_i$ of a Gaussian $g_i$ in $G$ as input, GauCF predicts the color attribute $c_i$ and GauTF predicts the rotation $r_i$, scale $s_i$ and opacity $o_i$, formulated as:

$$\{c_i\} = \mathrm{GauCF}(\sigma_i); \quad \{r_i, s_i, o_i\} = \mathrm{GauTF}(\sigma_i). \quad (4)$$

Note that GauCF and GauTF focus on predicting the Gaussian colors and transforms from 3D Gaussian centers. This is different from GauPF, which models the probabilities of query samples in the 3D space. The reason is that GauPF focuses on exploring the geometry of 3DGS from the 3D space, while GauCF and GauTF learn to predict the Gaussian attributes from the known geometries. Through the novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with three continuous Gaussian Splatting Functions. The functional representation is a general and flexible form for 3DGS which has no restrictions on the Gaussian numbers, densities, geometries, etc.
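As a concrete reference for Eq. (3), the sketch below computes ground-truth GauPF values for a batch of queries. Only the nearest-center distance term comes directly from Eq. (3); the truncation threshold value and the linear map used for $\tau$ are assumptions for illustration (the paper ablates a linear versus an exponential mapping in Sec. 4.4).

```python
import torch

def ground_truth_gaupf(queries: torch.Tensor, centers: torch.Tensor,
                       trunc: float = 0.05) -> torch.Tensor:
    """Ground-truth Gaussian Probability Function (Eq. 3), as a sketch.

    queries: (M, 3) sampled 3D locations; centers: (N, 3) fitted Gaussian centers.
    `trunc` plays the role of the truncation lambda (its value is an assumption).
    """
    d = torch.cdist(queries, centers)        # (M, N) pairwise distances
    d_min = d.min(dim=1).values              # distance to the nearest Gaussian center
    d_trunc = d_min.clamp(max=trunc)         # lambda: filter extremely large distances
    return 1.0 - d_trunc / trunc             # tau: linear map to [0, 1]; 1 at a center, 0 beyond trunc
```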
3.3 Gaussian Variational Auto-encoder and Latent Diffusion

With the disentangled and powerful representation, the next step is to design a generative model with the target of generating these Gaussian Splatting Functions. We follow the common schema to design a Gaussian Variational Auto-encoder (VAE) [29] with a Latent Diffusion Model (LDM) [44] as the generative model. The detailed framework and the training pipeline of DiffGS are illustrated in Fig. 2(a).

Gaussian VAE. The Gaussian VAE compresses the Gaussian Splatting Functions into a regularized latent space by mapping the Gaussian Splatting Functions of each 3DGS shape into a latent vector, from which we can also recover the Gaussian Splatting Functions. Specifically, the Gaussian VAE consists of 1) a GS encoder $\phi_{en}$ which learns representations from 3DGS and encodes each 3DGS into a latent vector $z$, 2) a triplane decoder $\phi_{de}$ which decodes the latent $z$ into a feature triplane, and 3) three neural predictors $\psi_{pf}$, $\psi_{cf}$ and $\psi_{tf}$ which serve as the implementations of GauPF, GauCF and GauTF to predict Gaussian probabilities, colors and transforms, respectively.

Given a fitted 3DGS $G = \{g_i\}_{i=1}^{N}$ as input, the GS encoder $\phi_{en}$ extracts a global latent feature $z$ from $G$, which is then decoded into a feature triplane $t \in \mathbb{R}^{H \times W \times C \times 3}$ with the decoder $\phi_{de}$, formulated as:

$$z = \phi_{en}(G); \quad t = \phi_{de}(z). \quad (5)$$

The triplane $t$ consists of three orthogonal feature planes $\{t_{XY}, t_{XZ}, t_{YZ}\}$ which are aligned to the axes. For a 3D location $q_j$, we obtain its corresponding feature $f_j = \mathrm{interp}(t, q_j)$ from the triplane $t$ by projecting $q_j$ onto the orthogonal feature planes and concatenating the bilinearly interpolated features from the three planes. We then predict the Gaussian probability, color and transform of $q_j$ with the neural Gaussian Splatting Function predictors as:

$$\{\hat{p}_j\} = \psi_{pf}(f_j); \quad \{\hat{c}_j\} = \psi_{cf}(f_j); \quad \{\hat{r}_j, \hat{s}_j, \hat{o}_j\} = \psi_{tf}(f_j). \quad (6)$$

The Gaussian VAE is trained with the target of accurately predicting Gaussian attributes and robustly regularizing the latent space. In practice, the training objective is formulated as:

$$\mathcal{L}_{\mathrm{VAE}} = \big\|\{\hat{p}, \hat{c}, \hat{r}, \hat{s}, \hat{o}\} - \{p, c, r, s, o\}\big\|_1 + \beta \, D_{\mathrm{KL}}\big(Q_\phi(z \mid G) \,\|\, P(z)\big). \quad (7)$$

The first loss term is the L1 loss between the predicted Gaussian attributes in Eq. (6) and the target ones defined by the ground-truth Gaussian Splatting Functions in Eq. (2) and Eq. (4). The second term in Eq. (7) is the KL-divergence loss with a weighting factor $\beta$, which constrains the regularization of the learned latent space of $z$. Specifically, we define the inferred posterior of $z$ as the distribution $Q_\phi(z \mid G)$, which is regularized to align with the Gaussian prior $P(z) = \mathcal{N}(0, I)$, where $I$ is the identity matrix.
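A minimal sketch of the $\mathrm{interp}(t, q_j)$ step used in Eq. (6): each query is projected onto the three axis-aligned planes, interpolated with `grid_sample`, and the per-plane features are concatenated. The plane ordering, the $[-1, 1]^3$ normalization of coordinates and the function name are assumptions; the predictors $\psi_{pf}$, $\psi_{cf}$, $\psi_{tf}$ would then be small networks applied to the returned feature.

```python
import torch
import torch.nn.functional as F

def interp_triplane(triplane: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
    """Sample per-query features from an axis-aligned feature triplane (a sketch).

    triplane: (3, C, H, W) feature planes ordered (XY, XZ, YZ);
    queries:  (M, 3) locations assumed to lie in [-1, 1]^3.
    Returns:  (M, 3 * C) concatenated interpolated features.
    """
    coords = [queries[:, [0, 1]], queries[:, [0, 2]], queries[:, [1, 2]]]   # projections per plane
    feats = []
    for plane, uv in zip(triplane, coords):
        grid = uv.view(1, -1, 1, 2)                                         # (1, M, 1, 2) for grid_sample
        f = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)     # (1, C, M, 1), bilinear
        feats.append(f.squeeze(0).squeeze(-1).transpose(0, 1))              # (M, C)
    return torch.cat(feats, dim=-1)
```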
Gaussian LDM. With the trained Gaussian VAE in place, we are now able to encode any 3DGS into a compact 1D latent vector $z$. We then train a latent diffusion model (LDM) [44] efficiently on the latent space. A diffusion model is trained to generate samples from a target distribution by reversing a process that incrementally introduces noise. We define $\{z_0, z_1, \ldots, z_K\}$ as the forward process $\gamma(z_{0:K})$, which gradually transforms a real data sample $z_0$ into Gaussian noise $z_K$ by adding noise. The backward process $\mu(z_{0:K})$ leverages a neural generator $\mu$ to denoise $z_K$ into a real data sample. To achieve controllable generation of 3DGS, we introduce a conditioning mechanism [54] into the diffusion process with cross-attention. Given an input condition $y$ (e.g., text, image, partial 3DGS), we leverage a custom encoder $\delta$ to project $y$ into the condition embedding $\delta(y)$. The embedding is then fused into the generator $\mu$ with cross-attention modules. Following DDPM, we simply adopt its optimization objective to train the generator for predicting the noise $\epsilon_\sigma$, formulated as:

$$\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z_0, t, \epsilon \sim \mathcal{N}(0, I)}\Big[\big\|\epsilon - \epsilon_\sigma(z_t, \delta(y), t)\big\|^2\Big], \quad (8)$$

where $t$ is a time step and $\epsilon$ is a noise sample drawn from the Gaussian distribution $\mathcal{N}(0, I)$. We adopt the well-studied architecture of DALL-E 2 [53] as the LDM implementation.

3.4 Gaussian Extraction Algorithm

The final step of the generation process of DiffGS is to extract 3DGS from the generated Gaussian Splatting Functions, similar in spirit to the Marching Cubes algorithm [34] which extracts meshes from Signed Distance Functions. The key factors are to extract the geometry of the 3DGS, i.e., the Gaussian locations, and the appearance of the 3DGS, i.e., the colors and transforms. The full generation pipeline is shown in Fig. 2(b).

Octree-Guided Geometry Sampling. The locations of the 3D Gaussian centers indicate the geometry of the represented 3DGS. We aim to design a discretization algorithm to obtain the discrete 3D locations from the learned continuous Gaussian Probability Function parameterized with the neural network $\psi_{pf}$, which models the probability of each query sampled in the 3D space to be a 3D Gaussian location. To achieve this, we design an octree-based sampling and optimization algorithm which generates accurate center locations of 3D Gaussians at arbitrary numbers. We show a 2D illustration of the algorithm in Fig. 3.

Figure 3: Gaussian geometry extraction from the generated GauPF. The yellow and green regions indicate the high-probability area and the low-probability area judged by GauPF. (a), (b) and (c) show the progressive octree building process at depths 1, 2 and L. (d) We sample proxy Gaussian centers from the octree at the final depth L. (e) We optimize the proxy centers to the exact geometry indicated by GauPF.

Assume that the 3D space is divided into a high-probability area (the yellow region) and a low-probability area (the green region) by the generated GauPF. We aim to extract the geometry as the locations with high probabilities. A naive implementation is to densely sample queries in the 3D space and keep the ones with large probabilities as outputs. However, this leads to high computational cost for inference, and the discrete sampling also struggles to accurately reach the locations with the largest probabilities in the continuous GauPF. We take inspiration from octrees [38, 67] to design a progressive strategy which only explores the 3D regions with large probabilities at the current octree depth for further subdivision at the next octree depth. After L layers of octree subdivision, we reach the local regions with the largest probabilities, from which we uniformly sample N 3D points as the proxy points $\{\rho_i\}_{i=1}^{N}$ representing coarse locations of Gaussian centers.

Optimizing Geometry with GauPF. To further refine the proxy points to the exact locations of Gaussian centers with the largest probabilities in GauPF, we propose to further optimize the proxy points with supervision from the learned GauPF $\psi_{pf}$. Specifically, we set the positions of the proxy points $\{\rho_i = \{\rho_{x_i}, \rho_{y_i}, \rho_{z_i}\}\}_{i=1}^{N}$ to be learnable and optimize them to reach the positions $\{\hat{\sigma}_i\}_{i=1}^{N}$ with the largest probabilities of $\psi_{pf}$. The optimization target is formulated as:

$$\max_{\{\rho_i\}} \; \sum_{i=1}^{N} \psi_{pf}(\rho_i). \quad (9)$$

Note that we can set N to an arbitrary number, enabling DiffGS to generate 3DGS with no limits on density or resolution.

Extracting Gaussian Attributes. We now obtain the estimated geometry indicating the predicted Gaussian centers $\{\hat{\sigma}_i\}_{i=1}^{N}$. We then extract the appearances and transforms from the generated triplane $t$, the Gaussian Color Function $\psi_{cf}$ and the Gaussian Transform Function $\psi_{tf}$ as $\{\hat{c}_i\} = \psi_{cf}(\mathrm{interp}(t, \hat{\sigma}_i))$ and $\{\hat{r}_i, \hat{s}_i, \hat{o}_i\} = \psi_{tf}(\mathrm{interp}(t, \hat{\sigma}_i))$. Finally, the general 3DGS is generated as $\hat{G} = \{\hat{\sigma}_i, \hat{c}_i, \hat{r}_i, \hat{s}_i, \hat{o}_i\}_{i=1}^{N}$.
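The two steps of Sec. 3.4 can be sketched as follows: progressive octree subdivision keeps only high-probability cells before sampling proxy centers, and a short gradient-based refinement pushes the proxies toward the maxima of GauPF (Eq. 9) before reading out attributes. This is a hedged sketch, not the paper's implementation: the probability threshold, step count and learning rate are assumptions; `gaupf`, `psi_cf` and `psi_tf` stand for the learned predictors composed with triplane sampling; and `interp_triplane` is the feature-sampling routine sketched in Sec. 3.3.

```python
import torch

def octree_sample(gaupf, depth: int = 8, n_points: int = 350_000,
                  keep_thresh: float = 0.5, bound: float = 1.0) -> torch.Tensor:
    """Progressively subdivide only high-probability octree cells, then sample proxy centers."""
    centers = torch.zeros(1, 3)                       # root cell centered at the origin
    half = torch.full((1,), bound)                    # half edge length of each cell
    offsets = torch.tensor([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)],
                           dtype=torch.float32)       # directions of the 8 children
    for _ in range(depth):
        children = (centers.unsqueeze(1) + 0.5 * half.view(-1, 1, 1) * offsets).reshape(-1, 3)
        half = half.repeat_interleave(8) * 0.5
        with torch.no_grad():
            keep = gaupf(children) > keep_thresh      # explore only high-probability regions
        centers, half = children[keep], half[keep]
    idx = torch.randint(len(centers), (n_points,))    # uniformly sample proxies in surviving cells
    jitter = (torch.rand(n_points, 3) * 2 - 1) * half[idx].unsqueeze(-1)
    return centers[idx] + jitter

def refine_and_extract(proxy, gaupf, psi_cf, psi_tf, triplane, interp_triplane,
                       steps: int = 100, lr: float = 1e-3):
    """Refine proxy centers toward the GauPF maxima (Eq. 9), then read out attributes."""
    centers = proxy.clone().requires_grad_(True)
    optim = torch.optim.Adam([centers], lr=lr)
    for _ in range(steps):
        optim.zero_grad()
        loss = -gaupf(centers).sum()                  # gradient ascent on the summed probability
        loss.backward()
        optim.step()
    centers = centers.detach()
    feats = interp_triplane(triplane, centers)        # triplane features at the refined centers
    colors = psi_cf(feats)                            # GauCF: colors
    rotation, scale, opacity = psi_tf(feats)          # GauTF: rotation, scale, opacity
    return centers, colors, rotation, scale, opacity
```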
4 Experiment

4.1 Unconditional Generation

Dataset and Metrics. For unconditional generation of 3D Gaussian Splatting, we conduct experiments under the airplane and chair classes of the ShapeNet [6] dataset. Following previous works [42, 4], we report two widely-used image generation metrics, Fréchet Inception Distance (FID) [20] and Kernel Inception Distance (KID) [3], for evaluating the rendering quality of our proposed DiffGS and previous state-of-the-art works. The metrics are evaluated between 50K renderings of the generated shapes and 50K renderings of the ground-truth ones, both at a resolution of 1024 × 1024.

Comparisons. We compare DiffGS with the state-of-the-art methods in terms of the rendering quality of generated shapes, including the GAN-based method GET3D [16] and the diffusion-based method DiffTF [4]. The quantitative comparison is shown in Tab. 1, where DiffGS achieves the best performance over all the baselines.

Table 1: Comparisons of unconditional generation under the ShapeNet [6] dataset.

| Method | Airplane FID-50K | Airplane KID-50K (×10³) | Chair FID-50K | Chair KID-50K (×10³) |
| --- | --- | --- | --- | --- |
| GET3D [16] | – | – | 59.51 | 2.414 |
| DiffTF [4] | 110.8 | 9.173 | 93.02 | 6.708 |
| Ours | 47.03 | 3.436 | 35.28 | 2.148 |

We further show the visual comparison on the renderings of some generated shapes in Fig. 4, where the GAN-based GET3D struggles in generating complex shapes and the generations of DiffTF are blurry with poor textures. In contrast, DiffGS produces significantly more visually appealing and high-fidelity generations in terms of rendering and geometry quality.

Figure 4: Visual comparisons with state-of-the-arts on unconditional generation of ShapeNet chairs.

4.2 Conditional Generation

We explore the conditional generation ability of DiffGS given texts, images and partial 3DGS as the input conditions. All the experiments are conducted under the chair class of the ShapeNet [6] dataset with the data splits commonly used in previous methods [10, 35].

Text/Image-conditional Gaussian Splatting Generation. To introduce texts/images as the conditions for controllable Gaussian Splatting generation, we leverage the frozen text and image encoders from the pretrained CLIP [51] model as the implementations of the custom text and image encoders $\gamma_{text}$ and $\gamma_{image}$ for obtaining text/image embeddings. We then train DiffGS with the conditional optimization objective in Eq. (8). We show the visualization of some text/image-conditional generations produced by DiffGS in Fig. 5(a) and Fig. 5(b). The results show that DiffGS accurately recovers the semantics and geometries described in the text prompts and the images, demonstrating the powerful capability of DiffGS in generating high-fidelity 3DGS from text descriptions or vision signals.

Figure 5: Visualization of conditional 3DGS generation results on ShapeNet. (a) Text-conditional generation. (b) Image-conditional generation. (c) Gaussian Splatting completion.
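For the text/image conditions, the frozen CLIP encoders can be queried as below to obtain the embeddings that are fused into the generator via cross-attention. This is a hedged sketch using the Hugging Face CLIP interface; the specific CLIP variant and any projection layer on top of the embeddings are assumptions, as the paper does not specify them here.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()   # backbone choice is an assumption
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def text_condition(prompts: list) -> torch.Tensor:
    """Frozen CLIP text embeddings used as the text condition, shape (B, 512)."""
    tokens = processor(text=prompts, return_tensors="pt", padding=True)
    return clip.get_text_features(**tokens)

@torch.no_grad()
def image_condition(paths: list) -> torch.Tensor:
    """Frozen CLIP image embeddings used as the image condition, shape (B, 512)."""
    images = [Image.open(p).convert("RGB") for p in paths]
    pixels = processor(images=images, return_tensors="pt")
    return clip.get_image_features(**pixels)
```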
Gaussian Splatting Completion. Additionally, we explore an interesting task of Gaussian Splatting completion. To the best of our knowledge, we are the first to focus on and introduce solutions for this task. Specifically, the Gaussian Splatting completion task is to recover the complete 3DGS from a partial 3DGS which contains large occlusions. In real-world applications, having only sparse views with limited viewpoint movement available for optimizing 3DGS often results in a partial 3DGS. Solving the Gaussian Splatting completion task enables us to infer the complete and dense 3DGS from the partial ones to improve the rendering quality at invisible viewpoints. We introduce DiffGS with partial 3DGS as the condition for solving this task. Specifically, we simply leverage a modified PointNet [50] as the custom encoder $\gamma_{partial}$ for partial 3DGS. Fig. 5(c) presents the visualization of Gaussian Splatting completion results produced by DiffGS. The results show that DiffGS is capable of recovering complex geometries and detailed appearances from highly occluded 3DGS. Please refer to the appendix for implementation details on Gaussian Splatting completion.

4.3 Point-to-Gaussian Generation

We further introduce DiffGS for another challenging and vital task of Point-to-Gaussian generation. This task aims to generate the Gaussian attributes given a 3D point cloud as input. The task serves as a bridge between easily accessible point clouds and the powerful 3DGS representation which efficiently models high-quality 3D appearances.

Dataset and Implementation. We conduct experiments under the chair and airplane classes of ShapeNet and also the widely-used garment dataset DeepFashion3D [85]. The DeepFashion3D dataset is a real-captured 3D dataset containing complex textures. For implementing Point-to-Gaussian, we simply train the Gaussian VAE with three-dimensional point clouds as inputs, instead of the 3DGS with attributes. Please refer to the Appendix for more details on data preparation and implementation.

Performances. We provide the visualization of some Point-to-Gaussian fitting and generation results in Fig. 6. We show the fitting results for the DeepFashion3D [85] dataset and the generation results for the test set of the airplane and chair classes in ShapeNet [6]. DiffGS produces visually appealing 3DGS generations given only 3D point cloud geometries as inputs. The results demonstrate that DiffGS can accurately predict Gaussian attributes for 3D point clouds. We believe DiffGS provides a new direction for 3DGS content generation by connecting 3DGS with point clouds.

Figure 6: Visualization of Point-to-Gaussian fittings on DeepFashion3D and generations on ShapeNet.
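The partial-3DGS and point-cloud conditions both rely on a PointNet-style set encoder that produces one global embedding. The sketch below shows the general pattern (a shared per-point MLP followed by max-pooling); the layer widths, embedding size and the K = 14 attribute layout (center, color, rotation, scale, opacity) are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PartialGSEncoder(nn.Module):
    """A minimal PointNet-style encoder for partial 3DGS or point clouds (a sketch).

    Each input element is a K-dim vector; a shared MLP plus max-pooling yields a
    single global condition embedding for cross-attention in the Gaussian LDM.
    """
    def __init__(self, k_in: int = 14, d_embed: int = 512):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(k_in, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, d_embed),
        )

    def forward(self, gaussians: torch.Tensor) -> torch.Tensor:
        # gaussians: (B, N, K) partial 3DGS; returns a (B, d_embed) global embedding
        return self.point_mlp(gaussians).max(dim=1).values
```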
4.4 Ablation Study

To evaluate the major designs and important hyper-parameters of DiffGS, we conduct ablation studies under the chair class of the ShapeNet dataset. We report the performance in terms of PSNR, SSIM and LPIPS of the 3DGS reconstructed with the Gaussian VAE.

Framework Designs. We first evaluate some major designs of our framework in Tab. 2. We justify the effectiveness of introducing the truncation function $\lambda$ when modeling GauPF and report the results without $\lambda$ as "w/o Truncation". We then explore implementing the projection function $\tau$ either as $\tau(x) = e^{-x}$ (shown as "Exponent") or as a linear projection (shown as "Ours"). We also show the results without the optimization process during Gaussian extraction as "w/o Optimization", which demonstrates the effectiveness of optimizing Gaussians to the exact locations.

Table 2: Ablations on framework designs.

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| w/o Truncation | 29.39 | 0.9792 | 0.0173 |
| Exponent | 29.74 | 0.9765 | 0.0188 |
| w/o Optimization | 30.34 | 0.9875 | 0.0152 |
| Ours | 34.01 | 0.9879 | 0.0149 |

Gaussian Numbers. One significant advantage of DiffGS lies in its ability to generate high-quality Gaussians at arbitrary numbers. To explore how the number of Gaussians affects the rendering quality, we conduct ablations on the Gaussian numbers as shown in Tab. 3. The results demonstrate that denser Gaussians lead to better quality.

Table 3: Ablations on Gaussian number.

| Num | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| 50K | 28.61 | 0.9787 | 0.0251 |
| 100K | 30.35 | 0.9838 | 0.0151 |
| 350K | 34.01 | 0.9879 | 0.0149 |

5 Conclusion

In this paper, we introduce DiffGS for generative modeling of 3DGS. DiffGS represents 3DGS in a disentangled manner via three novel functions to model Gaussian probabilities, colors and transforms. We then train a latent diffusion model with the target of generating these functions both conditionally and unconditionally. DiffGS generates 3DGS at arbitrary numbers with an octree-guided extraction algorithm. The experimental results on various tasks demonstrate the superiority of DiffGS.

6 Acknowledgement

This work was supported by the National Key R&D Program of China (2022YFC3800600), the National Natural Science Foundation of China (62272263, 62072268), and in part by the Tsinghua-Kuaishou Institute of Future Media Data.

References

[1] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.

[2] Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.

[3] Mikołaj Bińkowski, Danica J Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. arXiv preprint arXiv:1801.01401, 2018.

[4] Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, and Ziwei Liu. Large-vocabulary 3D diffusion model with transformer. arXiv preprint arXiv:2309.07920, 2023.

[5] Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. Efficient geometry-aware 3D generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16123–16133, 2022.

[6] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

[7] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.

[8] Hansheng Chen, Jiatao Gu, Anpei Chen, Wei Tian, Zhuowen Tu, Lingjie Liu, and Hao Su. Single-stage diffusion NeRF: A unified approach to 3D generation and reconstruction.
In Proceedings of the IEEE/CVF international conference on computer vision, pages 2416 2425, 2023. [9] Zilong Chen, Feng Wang, and Huaping Liu. Text-to-3d using gaussian splatting. ar Xiv preprint ar Xiv:2309.16585, 2023. [10] Zhiqin Chen and Hao Zhang. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5939 5948, 2019. [11] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, Alexander G Schwing, and Liang-Yan Gui. SDFusion: Multimodal 3D shape completion, reconstruction, and generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4456 4465, 2023. [12] Gene Chou, Yuval Bahat, and Felix Heide. Diffusion-SDF: Conditional generative modeling of signed distance functions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2262 2272, 2023. [13] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli Vander Bilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142 13153, 2023. [14] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501 5510, 2022. [15] Qiancheng Fu, Qingshan Xu, Yew-Soon Ong, and Wenbing Tao. Geo-Neus: Geometryconsistent neural implicit surfaces learning for multi-view reconstruction. Advances in Neural Information Processing Systems (Neur IPS), 2022. [16] Jun Gao, Tianchang Shen, Zian Wang, Wenzheng Chen, Kangxue Yin, Daiqing Li, Or Litany, Zan Gojcic, and Sanja Fidler. Get3d: A generative model of high quality 3d textured shapes learned from images. Advances In Neural Information Processing Systems, 35:31841 31854, 2022. [17] Anchit Gupta, Wenhan Xiong, Yixin Nie, Ian Jones, and Barlas O guz. 3DGen: Triplane latent diffusion for textured mesh generation. ar Xiv preprint ar Xiv:2303.05371, 2023. [18] Liang Han, Junsheng Zhou, Yu-Shen Liu, and Zhizhong Han. Binocular-guided 3d gaussian splatting with view consistency for sparse view synthesis. In Advances in Neural Information Processing Systems (Neur IPS), 2024. [19] Xianglong He, Junyi Chen, Sida Peng, Di Huang, Yangguang Li, Xiaoshui Huang, Chun Yuan, Wanli Ouyang, and Tong He. GVGEN: Text-to-3D generation with volumetric representation. ar Xiv preprint ar Xiv:2403.12957, 2024. [20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017. [21] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840 6851, 2020. [22] Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3D. ar Xiv preprint ar Xiv:2311.04400, 2023. [23] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. ar Xiv preprint ar Xiv:2403.17888, 2024. [24] Han Huang, Yulun Wu, Junsheng Zhou, Ge Gao, Ming Gu, and Yu-Shen Liu. 
Neu Surf: Onsurface priors for neural surface reconstruction from sparse input views. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [25] Chuan Jin, Tieru Wu, Yu-Shen Liu, and Junsheng Zhou. Music-udf: Learning multi-scale dynamic grid representation for high-fidelity surface reconstruction from point clouds. Computers & Graphics, page 104081, 2024. [26] Chuan Jin, Tieru Wu, and Junsheng Zhou. Multi-grid representation with field regularization for self-supervised surface reconstruction from point clouds. Computers & Graphics, 2023. [27] Heewoo Jun and Alex Nichol. Shap-e: Generating conditional 3d implicit functions. ar Xiv preprint ar Xiv:2305.02463, 2023. [28] Bernhard Kerbl, Georgios Kopanas, Thomas Leimk眉hler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):1 14, 2023. [29] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. ar Xiv preprint ar Xiv:1312.6114, 2013. [30] Jiahe Li, Jiawei Zhang, Xiao Bai, Jin Zheng, Xin Ning, Jun Zhou, and Lin Gu. Dngaussian: Optimizing sparse-view 3d gaussian radiance fields with global-local depth normalization. ar Xiv preprint ar Xiv:2403.06912, 2024. [31] Shujuan Li, Junsheng Zhou, Baorui Ma, Yu-Shen Liu, and Zhizhong Han. Ne AF: Learning neural angle fields for point normal estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023. [32] Shujuan Li, Junsheng Zhou, Baorui Ma, Yu-Shen Liu, and Zhizhong Han. Learning continuous implicit field with local distance indicator for arbitrary-scale point cloud upsampling. In Proceedings of the AAAI Conference on Artificial Intelligence, 2024. [33] Chen-Hsuan Lin, Jun Gao, Luming Tang, Towaki Takikawa, Xiaohui Zeng, Xun Huang, Karsten Kreis, Sanja Fidler, Ming-Yu Liu, and Tsung-Yi Lin. Magic3d: High-resolution text-to-3d content creation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 300 309, 2023. [34] William E Lorensen and Harvey E Cline. Marching cubes: A high resolution 3D surface construction algorithm. ACM Siggraph Computer Graphics, 21(4):163 169, 1987. [35] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2837 2845, 2021. [36] Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Geo Dream: Disentangling 2D and geometric priors for high-fidelity and consistent 3D generation. ar Xiv preprint ar Xiv:2311.17971, 2023. [37] Baorui Ma, Junsheng Zhou, Yu-Shen Liu, and Zhizhong Han. Towards better gradient consistency for neural signed distance functions via level set alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17724 17734, 2023. [38] Donald Meagher. Geometric modeling using octree encoding. Computer graphics and image processing, 19(2):129 147, 1982. [39] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4460 4470, 2019. [40] Gal Metzer, Elad Richardson, Or Patashnik, Raja Giryes, and Daniel Cohen-Or. Latent-nerf for shape-guided generation of 3d shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12663 12673, 2023. 
[41] B Mildenhall, PP Srinivasan, M Tancik, JT Barron, R Ramamoorthi, and R Ng. Ne RF: Representing scenes as neural radiance fields for view synthesis. In European Conference on Computer Vision, 2020. [42] Norman M眉ller, Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulo, Peter Kontschieder, and Matthias Nie脽ner. Diffrf: Rendering-guided 3d radiance field diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4328 4338, 2023. [43] Thomas M眉ller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (To G), 41(4):1 15, 2022. [44] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162 8171. PMLR, 2021. [45] Takeshi Noda, Chao Chen, Xinhai Liu Weiqi Zhang and, Yu-Shen Liu, and Zhizhong Han. Multipull: Detailing signed distance functions by pulling multi-level queries at multi-step. In Advances in Neural Information Processing Systems, 2024. [46] Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. Deep SDF: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 165 174, 2019. [47] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, Sofien Bouaziz, Dan B Goldman, Steven M Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865 5874, 2021. [48] Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. ar Xiv preprint ar Xiv:2209.14988, 2022. [49] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-nerf: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318 10327, 2021. [50] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652 660, 2017. [51] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748 8763. PMLR, 2021. [52] Amit Raj, Srinivas Kaza, Ben Poole, Michael Niemeyer, Nataniel Ruiz, Ben Mildenhall, Shiran Zada, Kfir Aberman, Michael Rubinstein, Jonathan Barron, et al. Dreambooth3d: Subjectdriven text-to-3d generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2349 2359, 2023. [53] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022. [54] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj枚rn Ommer. Highresolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684 10695, 2022. [55] J Ryan Shue, Eric Ryan Chan, Ryan Po, Zachary Ankner, Jiajun Wu, and Gordon Wetzstein. 3d neural field generation using triplane diffusion. 
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20875 20886, 2023. [56] Jingxiang Sun, Bo Zhang, Ruizhi Shao, Lizhen Wang, Wen Liu, Zhenda Xie, and Yebin Liu. Dreamcraft3d: Hierarchical 3d generation with bootstrapped diffusion prior. ar Xiv preprint ar Xiv:2310.16818, 2023. [57] Stanislaw Szymanowicz, Chrisitian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10208 10217, 2024. [58] Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, and Ziwei Liu. Lgm: Large multi-view gaussian model for high-resolution 3d content creation. In European Conference on Computer Vision, pages 1 18. Springer, 2025. [59] Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, and Gang Zeng. Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. ar Xiv preprint ar Xiv:2309.16653, 2023. [60] Junshu Tang, Tengfei Wang, Bo Zhang, Ting Zhang, Ran Yi, Lizhuang Ma, and Dong Chen. Make-it-3d: High-fidelity 3d creation from a single image with diffusion prior. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22819 22829, 2023. [61] Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, and Baining Guo. Volumediffusion: Flexible text-to-3d generation with efficient volumetric encoder. ar Xiv preprint ar Xiv:2312.11459, 2023. [62] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. Neu S: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. Advances in Neural Information Processing Systems, 34:27171 27183, 2021. [63] Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, et al. Rodin: A generative model for sculpting 3d digital avatars using diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4563 4573, 2023. [64] Yiming Wang, Qin Han, Marc Habermann, Kostas Daniilidis, Christian Theobalt, and Lingjie Liu. Neus2: Fast learning of neural implicit surfaces for multi-view reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3295 3306, 2023. [65] Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. Prolificdreamer: High-fidelity and diverse text-to-3d generation with variational score distillation. Advances in Neural Information Processing Systems, 36, 2024. [66] Xin Wen, Junsheng Zhou, Yu-Shen Liu, Hua Su, Zhen Dong, and Zhizhong Han. 3D shape reconstruction from 2D images with disentangled attribute flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3803 3813, 2022. [67] Jane Wilhelms and Allen Van Gelder. Octrees for faster isosurface generation. ACM Transactions on Graphics (TOG), 11(3):201 227, 1992. [68] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. ar Xiv preprint ar Xiv:2310.08528, 2023. [69] Jiale Xu, Xintao Wang, Weihao Cheng, Yan-Pei Cao, Ying Shan, Xiaohu Qie, and Shenghua Gao. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20908 20918, 2023. 
[70] Yinghao Xu, Zifan Shi, Wang Yifan, Hansheng Chen, Ceyuan Yang, Sida Peng, Yujun Shen, and Gordon Wetzstein. Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation. ar Xiv preprint ar Xiv:2403.14621, 2024. [71] Taoran Yi, Jiemin Fang, Guanjun Wu, Lingxi Xie, Xiaopeng Zhang, Wenyu Liu, Qi Tian, and Xinggang Wang. Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. ar Xiv preprint ar Xiv:2310.08529, 2023. [72] Bowen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, and Baining Guo. Gaussiancube: Structuring gaussian splatting using optimal transport for 3d generative modeling. ar Xiv preprint ar Xiv:2403.19655, 2024. [73] Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. ar Xiv preprint ar Xiv:2404.19702, 2024. [74] Wenyuan Zhang, Yu-Shen Liu, and Zhizhong Han. Neural signed distance function inference through splatting 3d gaussians pulled on zero-level set. In Advances in Neural Information Processing Systems, 2024. [75] Wenyuan Zhang, Kanle Shi, Yu-Shen Liu, and Zhizhong Han. Learning unsigned distance functions from multi-view images with volume rendering priors. European Conference on Computer Vision, 2024. [76] Junsheng Zhou, Yu-Shen Liu, and Zhizhong Han. Zero-shot scene reconstruction from single images with deep prior assembly. In Advances in Neural Information Processing Systems (Neur IPS), 2024. [77] Junsheng Zhou, Baorui Ma, Shujuan Li, Yu-Shen Liu, Yi Fang, and Zhizhong Han. Cap-udf: Learning unsigned distance functions progressively from raw point clouds with consistencyaware field optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. [78] Junsheng Zhou, Baorui Ma, Shujuan Li, Yu-Shen Liu, and Zhizhong Han. Learning a more continuous zero level set in unsigned distance fields through level set projection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023. [79] Junsheng Zhou, Baorui Ma, and Yu-Shen Liu. Fast learning of signed distance functions from noisy point clouds via noise to noise mapping. IEEE transactions on pattern analysis and machine intelligence, 2024. [80] Junsheng Zhou, Baorui Ma, Liu Yu-Shen, Fang Yi, and Han Zhizhong. Learning consistencyaware unsigned distance functions progressively from raw point clouds. In Advances in Neural Information Processing Systems (Neur IPS), 2022. [81] Junsheng Zhou, Baorui Ma, Wenyuan Zhang, Yi Fang, Yu-Shen Liu, and Zhizhong Han. Differentiable registration of images and lidar point clouds with voxelpoint-to-pixel matching. In Advances in Neural Information Processing Systems (Neur IPS), 2023. [82] Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, and Xinlong Wang. Uni3D: Exploring Unified 3D Representation at Scale. In International Conference on Learning Representations (ICLR), 2024. [83] Junsheng Zhou, Xin Wen, Baorui Ma, Yu-Shen Liu, Yue Gao, Yi Fang, and Zhizhong Han. 3doae: Occlusion auto-encoders for self-supervised learning on point clouds. IEEE International Conference on Robotics and Automation (ICRA), 2024. [84] Junsheng Zhou, Weiqi Zhang, Baorui Ma, Kanle Shi, Yu-Shen Liu, and Zhizhong Han. Udiff: Generating conditional unsigned distance fields with optimal wavelet diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 
[85] Heming Zhu, Yu Cao, Hang Jin, Weikai Chen, Dong Du, Zhangye Wang, Shuguang Cui, and Xiaoguang Han. Deep Fashion3D: A dataset and benchmark for 3D garment reconstruction from single images. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16, pages 512–530. Springer, 2020.

[86] Zi-Xin Zou, Zhipeng Yu, Yuan-Chen Guo, Yangguang Li, Ding Liang, Yan-Pei Cao, and Song-Hai Zhang. Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. arXiv preprint arXiv:2312.09147, 2023.

A More Experimental Details

A.1 Gaussian Splatting Data Preparation

DiffGS takes fitted 3DGS as input for learning generative modeling. To prepare the 3DGS dataset of ShapeNet, we uniformly render 100 views from the ground-truth meshes with Blender to first obtain dense multi-view images for each 3D shape in the chair and airplane classes of the ShapeNet dataset. After that, we leverage the vanilla 3D Gaussian Splatting [28] method to fit the 3DGS for each shape with the rendered multi-view images. To achieve more stable and regularized 3DGS data for better generative modeling, we design some strategies for better initialization and optimization of 3DGS. (1) Since we fit 3DGS from existing 3D datasets with known geometries, we can simply sample dense point clouds uniformly from the surfaces as a perfect initialization for 3DGS optimization, instead of initializing with COLMAP points. The sampled point number is set to 100K. (2) We observe that optimizing 3DGS freely may often lead to some extremely large Gaussians. This leads to unstable training of the Gaussian VAE and latent diffusion models, further affecting the generative modeling results. Therefore, we clip the scales at a maximum size of 0.01 to avoid abnormal Gaussians.

A.2 Gaussian Splatting Completion

We explore the task of Gaussian Splatting completion. Specifically, the Gaussian Splatting completion task is to recover the complete 3DGS from a partial 3DGS which contains large occlusions. In real-world applications, having only sparse views with limited viewpoint movement available for optimizing 3DGS often results in a partial 3DGS. Solving the Gaussian Splatting completion task enables us to infer the complete and dense 3DGS from the partial ones to improve the rendering quality at invisible viewpoints.

Data preparation. We generate partial 3D Gaussian Splatting data from the complete datasets in a straightforward manner. First, we randomly divide each 3DGS into 8 chunks. Then, we occlude 7 chunks, leaving the remaining chunk as the partial 3DGS. This allows us to prepare partial-complete 3DGS pairs for training and testing.

Implementation. To leverage DiffGS with partial 3DGS as the condition for Gaussian Splatting completion, we use a modified PointNet [50] as the custom encoder $\gamma_{partial}$ for partial 3DGS, which projects the partial 3DGS with K channels into a global partial 3DGS embedding. DiffGS for Gaussian Splatting completion is trained with the target of Eq. (8) by introducing the partial 3DGS embeddings through the cross-attention module.
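The two data-preparation steps above can be sketched as follows: clipping abnormally large Gaussian scales (Sec. A.1) and carving a partial 3DGS out of a complete one (Sec. A.2). The octant-based chunking around a random pivot is an assumption; the paper only states that each 3DGS is randomly divided into 8 chunks.

```python
import torch

def clip_gaussian_scales(scales: torch.Tensor, max_scale: float = 0.01) -> torch.Tensor:
    """Clamp per-axis Gaussian scales to avoid abnormally large primitives (Sec. A.1)."""
    return scales.clamp(max=max_scale)

def make_partial_gs(centers: torch.Tensor, attrs: torch.Tensor):
    """Build a partial 3DGS by splitting into 8 chunks and keeping one (Sec. A.2).

    The split below uses axis-aligned octants around a random pivot Gaussian; the
    exact chunking scheme is an assumption.
    """
    pivot = centers[torch.randint(len(centers), (1,))].squeeze(0)                 # random pivot point
    octant = ((centers > pivot).long() * torch.tensor([1, 2, 4])).sum(dim=1)      # chunk id in [0, 7]
    keep = octant == torch.randint(8, (1,)).item()                                # keep a single chunk
    return centers[keep], attrs[keep]
```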
A.3 Point-to-Gaussian Generation

Data preparation. For the task of Point-to-Gaussian generation, we first prepare the training/testing data as point cloud-3DGS pairs obtained through the Gaussian Splatting fitting process described in Sec. A.1. Specifically, the point clouds are generated by densely sampling 100K points on the ground-truth meshes, while the paired 3DGS are obtained by optimizing with multi-view images rendered around the ground-truth meshes. The data preparation for the DeepFashion3D [85] dataset follows the same process as for the ShapeNet [6] dataset. Note that we train the Point-to-Gaussian DiffGS models to fit the DeepFashion3D dataset, which contains only 563 garment instances. In contrast, we split the airplane and chair classes of the ShapeNet dataset into train/test sets to learn generalizable representations that enable DiffGS to predict novel appearances for unknown point cloud geometries. The results shown in Fig. 6 of the main paper include both the fitting results on the DeepFashion3D dataset and the generation results from the point clouds in the test set of the airplane and chair classes of the ShapeNet dataset.

Implementation. For implementing Point-to-Gaussian, we train the Gaussian VAE using three-dimensional point clouds as inputs instead of 3DGS with attributes. Specifically, we replace the GS encoder in the Gaussian VAE with a PointNet-based network to learn representations from three-dimensional point clouds and recover Gaussian attributes from them. All architectures, optimization targets, and Gaussian extraction processes remain the same as in the Gaussian VAE, except for the encoder. Note that for the Point-to-Gaussian task, the Gaussian LDM is not trained, as we focus solely on decoding and extracting Gaussians from the point cloud inputs.

B More Comparisons and Ablations

We further conducted comprehensive experiments focusing on two critical tasks: text-to-3D generation and single-view 3D generation. These tasks are central to demonstrating the flexibility and robustness of our approach in different application scenarios.

B.1 Image-to-3D Generation

We compare DiffGS with various SOTA methods on implicit generation or Gaussian Splatting generation. The visual comparison of image-to-3D generation is presented in Fig. 7. These results illustrate the superior visual quality and fidelity achieved by our method compared to the SOTA baseline methods including Splatter Image [57], Triplane Gaussian [86], LGM [58] and DreamGaussian [59]. Our approach consistently produces more detailed and accurate generations, effectively capturing intricate textures and geometries that are often challenging for the compared optimization-based and multi-view based methods.

Figure 7: Qualitative comparison of image-to-3D generation.

B.2 Text-to-3D Generation

We conduct evaluations under the difficult task of text-to-3D generation. We compare DiffGS with the SOTA data-driven and optimization-based methods Shap-E [27], LGM [58] and DreamGaussian [59]. We present the visual comparisons in Fig. 8, where DiffGS achieves more visually appealing results compared to the baselines.

Figure 8: Qualitative comparison of text-to-3D generation.

We also follow the common setting to conduct quantitative comparisons using the CLIP score, a metric that measures the semantic alignment between the generated 3D models and the input conditions. The results are presented in Tab. 4, where our method achieves higher CLIP scores than both DreamGaussian and LGM. This indicates that our approach more faithfully adheres to the condition inputs.

Table 4: Comparisons of text consistency.

| Method | DreamGaussian | Shap-E | LGM | Ours |
| --- | --- | --- | --- | --- |
| CLIP Score | 29.08 | 32.76 | 32.52 | 33.42 |

However, it is important to note that DiffGS is trained on the ShapeNet dataset, while some methods (e.g., LGM, Triplane Gaussian) are trained on the Objaverse [13] dataset. As a result, the comparisons may not fully reflect the generation capabilities due to different data sources. The above comparisons are provided for reference, and we plan to conduct further training using the same dataset for a more accurate comparison.
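For reference, CLIP scores like those in Tab. 4 are commonly computed as the mean cosine similarity (scaled by 100) between rendered views and their prompts. The sketch below follows this common protocol with the Hugging Face CLIP interface; the exact CLIP backbone and rendering protocol used for Tab. 4 are not stated here, so they are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_score(image_paths, prompts, model_name: str = "openai/clip-vit-base-patch32") -> float:
    """Mean cosine similarity (x100) between renderings and their text prompts (a sketch)."""
    model = CLIPModel.from_pretrained(model_name).eval()
    processor = CLIPProcessor.from_pretrained(model_name)
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)   # normalize embeddings
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (100.0 * (img * txt).sum(dim=-1)).mean().item()
```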
However, it is important to note that Diff GS is trained on the Shape Net dataset, while some methods (e.g., LGM, Triplane Gaussian) are trained on the Objaverse [13] dataset. As a result, the comparisons may not fully reflect the generation capabilities due to the different data sources. The above comparisons are provided for reference, and we plan to conduct further training using the same dataset for a more accurate comparison.

Figure 8: Qualitative comparison of text-to-3D generation (Shap E, LGM, Dream Gaussian, Ours) for the prompts "A chair with a cloth seat and white legs" and "Wooden chair with gray seat, small armrests and 4 legs".

Table 4: Comparisons of text consistency (CLIP scores).
             Dream Gaussian  Shap E  LGM    Ours
CLIP Scores  29.08           32.76   32.52  33.42

B.3 Unconditional Generation

We compare Diff GS with Diff TF [4] on the airplane class of the Shape Net dataset. The visual comparison is shown in Fig. 9, where our proposed Diff GS achieves more visually appealing results than Diff TF. Diff TF often produces shape generations with blurry textures, resulting in poor rendering quality. In contrast, our method produces high-fidelity renderings with the generated high-quality 3DGS, accurately capturing both geometry and appearance.

We further compare with the GAN-based 3D generative model EG3D [5] and the Ne RF-based SSD-Ne RF [8] on unconditional generation of car models on the Shape Net dataset. The visual comparison is shown in Fig. 10, where Diff GS significantly outperforms EG3D and SSD-Ne RF in geometry and appearance details.

Figure 9: Visual comparisons with the state-of-the-art method Diff TF [4] on unconditional generation of Shape Net airplanes.

Figure 10: Visual comparisons with state-of-the-art methods (EG3D, SSD-Ne RF, Ours) on unconditional generation of Shape Net cars.

B.4 Ablation Study on Octree Depth

The depth of the octree in our Gaussian extraction algorithm is an important hyper-parameter of the framework. To explore how the octree depth affects the rendering quality, we conduct ablations on the octree depth as shown in Tab. 5. The results demonstrate that a larger octree depth leads to better quality by capturing more geometric details, while deeper octrees increase the complexity of Gaussian extraction. To address this trade-off between efficiency and quality, the ablation in Tab. 5 also identifies the balance between computational cost and reconstruction quality: a moderate increase in octree depth significantly improves the Gaussian quality at very little additional cost.

Table 5: Ablations on sample time and Gaussian quality with different octree depths.
Depth  PSNR   SSIM    LPIPS   Sample Time (s)
7      29.77  0.9772  0.0254  0.15
8      31.70  0.9824  0.0156  0.16
9      32.70  0.9842  0.0162  0.21
10     34.01  0.9879  0.0149  0.58

B.5 Ablation Study on Framework Designs

We also conduct ablations to demonstrate the effectiveness of introducing triplanes as the decoder implementation of our Gaussian VAE. We replace the Triplane decoder with a simple Multi-Layer Perceptron (MLP) and show the performance in Fig. 11. The results indicate that the MLP fails to adequately capture the intricacies of Gaussian modeling, whereas the Triplane model demonstrates superior performance in terms of detail accuracy and geometric fidelity.
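To make the Triplane-versus-MLP comparison concrete, below is a minimal sketch of a generic triplane decoder that maps a query 3D location to per-point features and then to output attributes. It is a simplified illustration of the design family compared in Fig. 11, not the exact Diff GS architecture; the feature dimension, plane resolution, and output dimension are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneDecoder(nn.Module):
    """Generic triplane decoder sketch: features are bilinearly sampled from three
    axis-aligned planes (XY, XZ, YZ), summed, and decoded by a small MLP."""

    def __init__(self, feat_dim=32, plane_res=128, out_dim=14):
        super().__init__()
        # Three learned feature planes of shape (feat_dim, plane_res, plane_res).
        self.planes = nn.Parameter(torch.randn(3, feat_dim, plane_res, plane_res) * 0.01)
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim),  # e.g., attribute values at the queried point
        )

    def forward(self, xyz):
        # xyz: (N, 3) query points in [-1, 1]^3.
        coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]  # XY, XZ, YZ projections
        feats = 0
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)  # (1, N, 1, 2) sampling grid
            sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                    mode="bilinear", align_corners=True)
            feats = feats + sampled.view(plane.shape[0], -1).t()  # (N, feat_dim)
        return self.mlp(feats)
```

Replacing the learned planes and grid sampling with a plain coordinate MLP gives the baseline in panel (d) of Fig. 11.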
C Efficiency Analysis and Comparison

C.1 Parameters and Inference Time

We compare the model sizes and generation times of Diff GS and other SOTA generative models. The results are shown in Tab. 6, where we highlight that Diff GS demonstrates significant efficiency compared to the SOTA baselines Diff TF, Shap E, SSD-Ne RF and Dream Gaussian, and offers competitive generation time with GET3D and LGM. Diff GS also has a significantly smaller number of parameters than most baseline methods. The smaller parameter count translates into reduced memory usage and potentially faster inference, which is advantageous in environments where computational resources are limited.

Figure 11: Ablations on 3D representation implementations. (a) The input shapes from Shape Net. (b) Gaussians trained on the ground-truth shapes (PSNR: 34.47 / 37.01). (c) Gaussian primitives reconstructed with our Triplane-based Gaussian Variational Auto-encoder using the proposed Gaussian Splatting functions (PSNR: 33.71 / 36.62). (d) Replacing the Triplane architecture with simple MLPs (PSNR: 21.50 / 20.04).

Table 6: Comparison of model parameters and inference time. Inference time is measured on a single NVIDIA RTX 3090 GPU.
                    GET3D  Diff TF  SSD-Ne RF  Shap E  Dream Gaussian  LGM    Ours
Parameters (M)      34.3   929.9    244.9      759.5   258.7           429.8  127.4
Inference Time (s)  5.0    99.7     27.4       27.1    197             10.5   9.5

C.2 Gaussian Numbers during Extraction

We conduct supplementary experiments to analyze the impact of the number of extracted Gaussian points on optimization time. The results of this ablation study are presented in Tab. 7. The results show that for generations with a smaller number of Gaussians, e.g., 50K, the optimization is extremely fast, taking only 0.64 seconds to converge. For high-quality Gaussian generations with 350K primitives, the optimization time increases to 2.5 seconds, which is still efficient. In our experiments, we select a configuration of 350K Gaussian points. This choice balances quality and computational efficiency, providing a robust representation without excessively increasing processing time.

Table 7: Ablations on the Gaussian number. Optimization time is measured on a single NVIDIA RTX 3090 GPU.
Num   PSNR   SSIM    LPIPS   Opt. Time (s)
50K   28.61  0.9787  0.0251  0.64
100K  30.35  0.9838  0.0151  1.26
350K  34.01  0.9879  0.0149  2.5

D Implementation Details

We implement Diff GS with PyTorch Lightning. We leverage the Adam optimizer with a learning rate of 0.0001. We train Diff GS on eight RTX 3090 GPUs, and convergence on each class of the Shape Net dataset takes around 5 days. The Gaussian encoder and Triplane decoder are implemented based on the SDF-VAE encoder and decoder of Diffusion SDF [12], where we modify the Point Net [50] in the SDF-VAE encoder to receive K-dimensional 3DGS as inputs.
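As a minimal sketch of the training setup described above, the skeleton below wires a placeholder Gaussian VAE into PyTorch Lightning with the stated Adam optimizer and learning rate. The module structure, loss terms, and batch layout are hypothetical placeholders standing in for the actual implementation.

```python
import torch
import pytorch_lightning as pl

class GaussianVAETrainer(pl.LightningModule):
    """Hypothetical training wrapper; `encoder`, `decoder`, and the losses are placeholders."""

    def __init__(self, encoder: torch.nn.Module, decoder: torch.nn.Module, kl_weight: float = 1e-4):
        super().__init__()
        self.encoder, self.decoder, self.kl_weight = encoder, decoder, kl_weight

    def training_step(self, batch, batch_idx):
        gaussians, queries, targets = batch          # K-channel 3DGS, query points, target values
        mu, logvar = self.encoder(gaussians)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        pred = self.decoder(z, queries)                            # predicted function values
        recon = torch.nn.functional.l1_loss(pred, targets)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = recon + self.kl_weight * kl
        self.log_dict({"recon": recon, "kl": kl})
        return loss

    def configure_optimizers(self):
        # Adam with a learning rate of 1e-4, as stated in Sec. D.
        return torch.optim.Adam(self.parameters(), lr=1e-4)

# Usage sketch for multi-GPU training on eight GPUs:
# trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp")
# trainer.fit(GaussianVAETrainer(encoder, decoder), train_dataloaders=loader)
```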
E Limitation

One limitation of our method is that it sometimes produces overly creative color schemes. We illustrate this issue with two examples in Fig. 12. As shown, the failure cases of Diff GS result in excessively colorful appearances. For instance, the chair shown exhibits a color transition from yellow to blue to green and finally to orange. These overly creative generations may not accurately reflect real-world shapes.

Figure 12: Failure cases of our method. Diff GS sometimes generates overly creative shapes with colorful appearances.

Neur IPS Paper Checklist

1. Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes]
Justification: Our main claims made in the abstract and introduction accurately reflect the paper's contributions and scope.
Guidelines: The answer NA means that the abstract and introduction do not include the claims made in the paper. The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers. The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings. It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.

2. Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes]
Justification: We analyze the limitations of our method in Sec. E of the appendix.
Guidelines: The answer NA means that the paper has no limitations, while the answer No means that the paper has limitations, but those are not discussed in the paper. The authors are encouraged to create a separate "Limitations" section in their paper. The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be. The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated. The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon. The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size. If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness. While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren't acknowledged in the paper. The authors should use their best judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.

3. Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
Answer: [NA]
Justification: The paper does not include theoretical results.
Guidelines: The answer NA means that the paper does not include theoretical results. All the theorems, formulas, and proofs in the paper should be numbered and cross-referenced. All assumptions should be clearly stated or referenced in the statement of any theorems. The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in the appendix or supplemental material. Theorems and Lemmas that the proof relies upon should be properly referenced.

4. Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We provide the detailed information for reproducing our method in Sec. 3 and Sec. 4 of the main paper and the appendix. We also provide demonstration code of our method in the supplementary materials.
Guidelines: The answer NA means that the paper does not include experiments. If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: making the paper reproducible is important, regardless of whether the code and data are provided or not. If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable. Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general, releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed. While Neur IPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example: (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm. (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully. (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset). (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.

5. Open access to data and code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes]
Justification: We provide our demonstration code as a part of our supplementary materials. We will release the source code, data and instructions upon acceptance.
Guidelines: The answer NA means that the paper does not include experiments requiring code.
Please see the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. While we encourage the release of code and data, we understand that this might not be possible, so No is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark). The instructions should contain the exact command and environment needed to run to reproduce the results. See the Neur IPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details. The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc. The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why. At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable). Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.

6. Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes]
Justification: We provide the training and testing details in the experiment section (Sec. 4) and the appendix.
Guidelines: The answer NA means that the paper does not include experiments. The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them. The full details can be provided either with the code, in the appendix, or as supplemental material.

7. Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [No]
Justification: We report the average performance as the experimental results.
Guidelines: The answer NA means that the paper does not include experiments. The authors should answer "Yes" if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper. The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions). The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.). The assumptions made should be given (e.g., Normally distributed errors). It should be clear whether the error bar is the standard deviation or the standard error of the mean. It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar rather than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g., negative error rates).
If error bars are reported in tables or plots, the authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.

8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: The computer resources needed to reproduce the experiments are provided in Sec. C of the appendix.
Guidelines: The answer NA means that the paper does not include experiments. The paper should indicate the type of compute workers (CPU or GPU, internal cluster, or cloud provider), including relevant memory and storage. The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).

9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the Neur IPS Code of Ethics (https://neurips.cc/public/EthicsGuidelines)?
Answer: [Yes]
Justification: The research conducted in the paper conforms, in every respect, with the Neur IPS Code of Ethics.
Guidelines: The answer NA means that the authors have not reviewed the Neur IPS Code of Ethics. If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).

10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]
Justification: We discuss the applications and potential impacts of our method in the introduction.
Guidelines: The answer NA means that there is no societal impact of the work performed. If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations. The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).

11. Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: The paper poses no such risks.
Guidelines: The answer NA means that the paper poses no such risks. Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.

12. Licenses for existing assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We use the open-sourced datasets under their licenses.
Guidelines: The answer NA means that the paper does not use existing assets. The authors should cite the original paper that produced the code package or dataset. The authors should state which version of the asset is used and, if possible, include a URL. The name of the license (e.g., CC-BY 4.0) should be included for each asset. For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided. If this information is not available online, the authors are encouraged to reach out to the asset's creators.

13. New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [NA]
Justification: The paper does not release new assets.
Guidelines: The answer NA means that the paper does not release new assets. Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc. The paper should discuss whether and how consent was obtained from people whose asset is used. At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
14. Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper. According to the Neur IPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA]
Justification: The paper does not involve crowdsourcing nor research with human subjects.
Guidelines: The answer NA means that the paper does not involve crowdsourcing nor research with human subjects. Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper. We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the Neur IPS Code of Ethics and the guidelines for their institution. For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.