Published as a conference paper at ICLR 2022

PROBABILISTIC IMPLICIT SCENE COMPLETION

Dongsu Zhang, Changwoon Choi, Inbum Park & Young Min Kim
Department of Electrical and Computer Engineering, Seoul National University
96lives@snu.ac.kr, changwoon.choi00@gmail.com, {inbum0215, youngmin.kim}@snu.ac.kr

ABSTRACT

We propose a probabilistic shape completion method extended to the continuous geometry of large-scale 3D scenes. Real-world scans of 3D scenes suffer from a considerable amount of missing data cluttered with unsegmented objects. The problem of shape completion is inherently ill-posed, and high-quality results require scalable solutions that consider multiple possible outcomes. We employ Generative Cellular Automata, which learns the multi-modal distribution, and transform the formulation to process large-scale continuous geometry. The local continuous shape is incrementally generated as a sparse voxel embedding, which contains a latent code for each occupied cell. We formally derive that our training objective for the sparse voxel embedding maximizes the variational lower bound of the complete shape distribution, and therefore our progressive generation constitutes a valid generative model. Experiments show that our model successfully generates diverse plausible scenes faithful to the input, especially when the input suffers from a significant amount of missing data. We also demonstrate that our approach outperforms deterministic models even in less ambiguous cases with a small amount of missing data, which implies that the probabilistic formulation is crucial for high-quality geometry completion on input scans at any level of completeness.

1 INTRODUCTION

High-quality 3D data can create realistic virtual 3D environments or provide crucial information for robots or human users to interact with the environment (Varley et al. (2017)).
However, 3D data acquired from a real-world scan is often noisy and incomplete with irregular samples. The task of 3D shape completion aims to recover the complete surface geometry from raw 3D scans. Shape completion is often formulated in a data-driven way using a prior distribution of 3D geometry, which often admits multiple plausible outcomes given an incomplete and noisy observation. If one learns to regress a single shape out of a multi-modal shape distribution, one is bound to lose fine details of the geometry and produce blurry outputs, as noticed with general generative models (Goodfellow (2017)). If we extend the range of completion to the scale of scenes with multiple objects, the task becomes even more challenging due to the memory and computation requirements for representing large-scale, high-resolution 3D shapes.

In this work, we present continuous Generative Cellular Automata (cGCA), which generates multiple continuous surfaces for 3D reconstruction. Our work builds on Generative Cellular Automata (GCA) (Zhang et al. (2021)), which produces diverse shapes by progressively growing the object surface from the immediate neighbors of the input shape. cGCA inherits the multi-modal and scalable generation of GCA, but overcomes the limitation of discrete voxel resolution, producing high-quality continuous surfaces. Specifically, our model learns to generate diverse sparse voxels associated with their local latent codes, namely a sparse voxel embedding, where each latent code encodes the deep implicit fields of the continuous geometry near each occupied voxel (Chabra et al. (2020); Jiang et al. (2020)). Our training objective maximizes the variational lower bound for the log-likelihood of the surface distribution represented with sparse voxel embeddings. The stochastic formulation is modified from the original GCA, and theoretically justified as a sound generative model.
We demonstrate that cGCA can faithfully generate multiple plausible completions even for large-scale scenes with a significant amount of missing data, as shown in Figure 1. To the best of our knowledge, we are the first to tackle the challenging task of probabilistic scene completion, which requires the model not only to generate multiple plausible outcomes but also to be scalable enough to capture the wide-range context of multiple objects.

Figure 1: Three examples of complete shapes using cGCA given noisy partial input observation. Even when the raw input is severely damaged (left), cGCA can generate plausible yet diverse complete continuous shapes.

We summarize the key contributions as follows: (1) We are the first to tackle the problem of probabilistic scene completion with partial scans, and provide a scalable model that can capture the large-scale context of scenes. (2) We present continuous Generative Cellular Automata, a generative model that produces diverse continuous surfaces from a partial observation. (3) We modify infusion training (Bordes et al. (2017)) and prove that the formulation indeed increases the variational lower bound of the data distribution, which verifies that the proposed progressive generation is a valid generative model.

2 PRELIMINARIES: GENERATIVE CELLULAR AUTOMATA

Continuous Generative Cellular Automata (cGCA) extends the idea of Generative Cellular Automata (GCA) by Zhang et al. (2021), but generates continuous surfaces with an implicit representation instead of a discrete voxel grid. For completeness of the discussion, we briefly review the formulation of GCA. Starting from an incomplete voxelized shape, GCA progressively updates the local neighborhood of the currently occupied voxels to eventually generate a complete shape.
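This progressive local growth can be illustrated with a minimal 2D sketch. The hand-written `toy_prob` below is a hypothetical stand-in for the learned occupancy head (the actual model uses a sparse CNN over 3D voxels); the loop structure mirrors how each transition samples occupancy only within the neighborhood of the current state:

```python
import random

def neighborhood(state, r=1):
    """All cells within Chebyshev distance r of any occupied cell."""
    cells = set()
    for (x, y) in state:
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                cells.add((x + dx, y + dy))
    return cells

def transition(state, prob_fn, rng):
    """One GCA step: sample Bernoulli occupancy for every neighborhood cell."""
    return {c for c in neighborhood(state) if rng.random() < prob_fn(c, state)}

def toy_prob(cell, state):
    # Hypothetical probabilities standing in for the learned sparse-CNN head:
    # tend to keep occupied cells, grow into empty neighbors half the time.
    return 0.9 if cell in state else 0.5

rng = random.Random(0)
state = {(0, 0)}              # s^0: a single occupied cell
for _ in range(5):            # T = 5 transitions s^{t+1} ~ p(. | s^t)
    state = transition(state, toy_prob, rng)
print(len(state))
```

Because each step only touches the immediate neighborhood, after `T` steps the shape can have grown at most `T` cells away from the input, which is what keeps the per-step computation sparse.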
In GCA, a shape is represented as a state $s = \{(c, o_c) \mid c \in \mathbb{Z}^3,\ o_c \in \{0, 1\}\}$, a set of binary occupancies $o_c$ for every cell $c \in \mathbb{Z}^3$, where the occupancy grid stores only the sparse cells on the surface. Given the state of an incomplete shape $s^0$, GCA evolves to the state of a complete shape $s^T$ by sampling $s^{1:T}$ from the Markov chain

$$s^{t+1} \sim p_\theta(\,\cdot \mid s^t), \quad (1)$$

where $T$ is a fixed number of transitions and $p_\theta$ is a homogeneous transition kernel parameterized by neural network parameters $\theta$. The transition kernel $p_\theta$ is implemented with a sparse CNN (Graham et al. (2018); Choy et al. (2019)), a highly efficient neural network architecture that computes the convolution operation only on the occupied voxels. The progressive generation of GCA confines the search space of each transition kernel to the immediate neighborhood of the current state. The occupancy probability within the neighborhood is regressed following a Bernoulli distribution, and the subsequent state is then sampled independently for individual cells. With the restricted domain for probability estimation, the model is scalable to high-resolution 3D voxel space. GCA shows that the series of local growths near sparse occupied cells can eventually complete the shape as a unified structure, since shapes are connected. While GCA is a scalable solution for generating diverse shapes, the grid representation of the 3D geometry inherently limits the resolution of the final shape.

3 CONTINUOUS GENERATIVE CELLULAR AUTOMATA

In Sec. 3.1, we formally introduce an extension of sparse occupancy voxels that represents continuous geometry, named sparse voxel embedding, where each occupied voxel contains a latent code representing local implicit fields. We train an autoencoder that can compress the implicit fields into the embeddings and vice versa. We then present the sampling procedure of cGCA that generates a 3D shape in Sec. 3.2, which is the inference step for shape completion. Sec.
3.3 shows the training objective of cGCA, which approximately maximizes the variational lower bound for the distribution of the complete continuous geometry.

Figure 2: Overview of our method. The implicit function of a continuous shape can be encoded as a sparse voxel embedding $s$ and decoded back (left). The colors in the sparse voxel embedding represent the clustered labels of the latent code $z_c$ for each cell $c$. The sampling procedure of cGCA (right) involves $T$ steps of sampling from the stochastic transition kernel $p_\theta$, followed by $T'$ mode seeking steps which remove cells with low probability. From the final sparse voxel embedding $s^{T+T'}$, the decoder can recover the implicit representation of the complete continuous shape.

3.1 SPARSE VOXEL EMBEDDING

In addition to the sparse occupied voxels of GCA, the state of cGCA, named sparse voxel embedding, contains an associated latent code, which can be decoded into a continuous surface. Formally, the state $s$ of cGCA is defined as a set of pairs of binary occupancy $o_c$ and latent code $z_c$ for the cells $c$ in the three-dimensional grid $\mathbb{Z}^3$:

$$s = \{(c, o_c, z_c) \mid c \in \mathbb{Z}^3,\ o_c \in \{0, 1\},\ z_c \in \mathbb{R}^K\}. \quad (2)$$

Similar to GCA, cGCA keeps the representation sparse by storing only the set of occupied voxels and their latent codes, and sets $z_c = 0$ if $o_c = 0$. The sparse voxel embedding $s$ can be converted to and from the implicit representation of local geometry with neural networks, inspired by the work of Peng et al. (2020), Chabra et al. (2020), and Chibane et al. (2020a). We utilize the (signed) distance to the surface as the implicit representation, and use an autoencoder for the conversion.
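As a data-structure sketch, a sparse voxel embedding can be held as a map from occupied cell coordinates to $K$-dimensional latent codes; absent keys encode $o_c = 0$ and $z_c = 0$ without storing anything. The helper names and the latent dimension below are illustrative, not from the paper:

```python
K = 4  # latent dimension (illustrative; the paper's K is not fixed here)

# Sparse voxel embedding: only occupied cells are stored.
embedding = {
    (0, 0, 0): [0.1] * K,
    (0, 0, 1): [0.2] * K,
}

def occupancy(s, c):
    """o_c = 1 iff cell c is stored in the sparse state."""
    return 1 if c in s else 0

def latent(s, c):
    """z_c for occupied cells; the zero vector for unoccupied cells."""
    return s.get(c, [0.0] * K)
```

This mirrors why the representation scales: memory grows with the number of surface voxels, not with the full $\mathbb{Z}^3$ grid resolution.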
The encoder $g_\phi$ produces the sparse voxel embedding $s$ from the coordinate-distance pairs $P = \{(p, d_p) \mid p \in \mathbb{R}^3,\ d_p \in \mathbb{R}\}$, where $p$ is a 3D coordinate and $d_p$ is the distance to the surface: $s = g_\phi(P)$. The decoder $f_\omega$, on the other hand, regresses the local implicit field value $\hat{d}_q = f_\omega(s, q)$ at a 3D position $q \in \mathbb{R}^3$ given the sparse voxel embedding $s$, supervised with coordinate-distance pairs $Q = \{(q, d_q) \mid q \in \mathbb{R}^3,\ d_q \in \mathbb{R}\}$. The detailed architecture of the autoencoder is described in Appendix A, where the decoder $f_\omega$ generates continuous geometry by interpolating hierarchical features extracted from the sparse voxel embedding. An example of the conversion is presented on the left side of Fig. 2, where the color of the sparse voxel embedding represents clustered labels of the latent codes obtained with k-means clustering (Hartigan & Wong (1979)). Note that the embeddings of similar local geometry, such as the seat of the chair, exhibit similar values of latent codes. The parameters $\phi, \omega$ of the autoencoder are jointly optimized by minimizing the following loss function:

$$L(\phi, \omega) = \frac{1}{|Q|} \sum_{(q, d_q) \in Q} \left| f_\omega(s, q) - \max\!\left(\min\!\left(\frac{d_q}{\epsilon}, 1\right), -1\right) \right| + \beta \frac{1}{|s|} \sum_{c \in s} \lVert z_c \rVert, \quad (3)$$

where $s = g_\phi(P)$ and $\epsilon$ is the size of a single voxel. The first term in Eq. (3) minimizes the error against the clamped normalized distance, and the second is a regularization term for the latent codes weighted by the hyperparameter $\beta$. Clamping the maximum distance makes the network focus on predicting accurate values in the vicinity of the surface (Park et al. (2019); Chibane et al. (2020b)).

3.2 SAMPLING FROM CONTINUOUS GENERATIVE CELLULAR AUTOMATA

The generation process of cGCA echoes the formulation of GCA (Zhang et al. (2021)), and repeats $T$ steps of sampling from the transition kernel to progressively grow the shape.
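Per cell, one such transition draws a Bernoulli occupancy and then, for occupied cells, a Gaussian latent code. A minimal per-cell sketch, with illustrative parameters standing in for the network outputs $\lambda_{\theta,c}$ and $\mu_{\theta,c}$:

```python
import random

def sample_cell(lam, mu, sigma, rng):
    """One per-cell draw from the transition kernel: occupancy o_c ~ Ber(lam),
    then z_c ~ N(mu, sigma * I) if o_c = 1, and z_c = 0 otherwise."""
    o = 1 if rng.random() < lam else 0
    if o == 0:
        return 0, [0.0] * len(mu)
    return 1, [rng.gauss(m, sigma) for m in mu]

rng = random.Random(42)
o, z = sample_cell(lam=0.8, mu=[0.0, 1.0], sigma=0.1, rng=rng)
```

In the full model these per-cell draws are made independently for every cell in the neighborhood of the current state, exactly as the factorization in the next paragraph prescribes.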
Each transition kernel $p(s^{t+1} \mid s^t)$ is factorized over the cells within the local neighborhood of the occupied cells of the current state, $\mathcal{N}(s^t) = \{c \in \mathbb{Z}^3 \mid d(c, c') \le r,\ c' \in s^t\}$,¹ given a distance metric $d$ and radius $r$:

$$p(s^{t+1} \mid s^t) = \prod_{c \in \mathcal{N}(s^t)} p_\theta(o_c, z_c \mid s^t) = \prod_{c \in \mathcal{N}(s^t)} p_\theta(o_c \mid s^t)\, p_\theta(z_c \mid s^t, o_c). \quad (4)$$

Note that the distribution is further decomposed into the occupancy $o_c$ and the latent code $z_c$, where we denote $o_c$ and $z_c$ as the random variables of occupancy and latent code for cell $c$ in state $s^{t+1}$. The shape is therefore generated by progressively sampling the occupancy and the latent codes for the occupied voxels, which are decoded and fused into a continuous geometry. The binary occupancy follows a Bernoulli distribution

$$p_\theta(o_c \mid s^t) = \mathrm{Ber}(\lambda_{\theta,c}), \quad (5)$$

where $\lambda_{\theta,c} \in [0, 1]$ is the estimated occupancy probability at the corresponding cell $c$. With our sparse representation, the distribution of the latent codes is

$$p_\theta(z_c \mid s^t, o_c) = \begin{cases} \delta_0 & \text{if } o_c = 0 \\ \mathcal{N}(\mu_{\theta,c}, \sigma_t I) & \text{if } o_c = 1. \end{cases} \quad (6)$$

$\delta_0$ is a Dirac delta distribution at 0, indicating that $z_c = 0$ when $o_c = 0$. For the occupied voxels ($o_c = 1$), $z_c$ follows a normal distribution with the estimated mean latent code $\mu_{\theta,c} \in \mathbb{R}^K$ and the predefined standard deviation $\sigma_t I$, where $\sigma_t$ decreases with respect to $t$.

Initial State. Given an incomplete point cloud, we set the initial state $s^0$ of the sampling chain by setting the occupancy $o_c$ to 1 for the cells that contain points, and associating the occupied cells with a latent code sampled from an isotropic normal distribution. However, the input can better describe the provided partial geometry if we encode the latent code $z_c$ of the occupied cells with the encoder $g_\phi$. The final completion is more precise when all the transitions $p_\theta$ are conditioned on the initial state containing the encoded latent code. Further details are described in Appendix A.

Mode Seeking.
While we effectively model the probabilistic distribution of multi-modal shapes, the final reconstruction needs to converge to a single coherent shape. Naïve sampling from the stochastic transition kernel in Eq. (4) can include noisy voxels with low occupancy probability. As a simple trick, we augment the chain with mode seeking steps that select the most probable mode of the current result instead of sampling probabilistically. Specifically, we run $T'$ additional steps of the transition kernel, but select the cells with occupancy probability higher than 0.5 and set the latent code to the mean of the distribution $\mu_{\theta,c}$. The mode seeking steps ensure that the final shape discovers the dominant mode closest to $s^T$, as depicted in Fig. 2, where it can be transformed into an implicit function with the pretrained decoder $f_\omega$.

3.3 TRAINING CONTINUOUS GENERATIVE CELLULAR AUTOMATA

We train a homogeneous transition kernel $p_\theta(s^{t+1} \mid s^t)$ whose repeated application eventually yields samples that follow the learned distribution. However, the data contain only the initial state $s^0$ and the ground truth state $x$, so we need to emulate the intermediate sequence for training. We adapt infusion training (Bordes et al. (2017)), which induces the intermediate transitions to converge to the desired complete state. To this end, we define a function $G_x(s)$ that finds the valid cells closest to the complete shape $x$ within the neighborhood of the current state $\mathcal{N}(s)$:

$$G_x(s) = \{\operatorname{argmin}_{c' \in \mathcal{N}(s)} d(c', c) \mid c \in x\}. \quad (7)$$

Then, we define the infusion kernel $q^t_\theta$, factorized similarly to the sampling kernel in Eq. (4):

$$q^t_\theta(s^{t+1} \mid s^t, x) = \prod_{c \in \mathcal{N}(s^t)} q^t_\theta(o_c, z_c \mid s^t, x) = \prod_{c \in \mathcal{N}(s^t)} q^t_\theta(o_c \mid s^t, x)\, q^t_\theta(z_c \mid s^t, o_c, x). \quad (8)$$
The distributions of both $o_c$ and $z_c$ are gradually biased towards the ground truth final shape $x$ with the infusion rate $\alpha_t$, which increases linearly with respect to the time step, i.e., $\alpha_t = \min(\alpha_1 t + \alpha_0, 1)$ with $\alpha_1 > 0$:

$$q^t_\theta(o_c \mid s^t, x) = \mathrm{Ber}\big((1 - \alpha_t)\lambda_{\theta,c} + \alpha_t \mathbb{1}[c \in G_x(s^t)]\big), \quad (9)$$

$$q^t_\theta(z_c \mid s^t, o_c, x) = \begin{cases} \delta_0 & \text{if } o_c = 0 \\ \mathcal{N}\big((1 - \alpha_t)\mu_{\theta,c} + \alpha_t z^x_c,\ \sigma_t I\big) & \text{if } o_c = 1. \end{cases} \quad (10)$$

Here $\mathbb{1}$ is an indicator function, and we denote $o^x_c, z^x_c$ as the occupancy and latent code of the ground truth complete shape $x$ at coordinate $c$.

cGCA aims to optimize the log-likelihood of the ground truth sparse voxel embedding, $\log p_\theta(x)$. However, since direct optimization of the exact log-likelihood is intractable, we work with the variational lower bound, following the derivation of diffusion-based models (Sohl-Dickstein et al. (2015)):

$$\log p_\theta(x) \ge \sum_{s^{0:T-1}} q_\theta(s^{0:T-1} \mid x) \log \frac{p_\theta(s^{0:T-1}, x)}{q_\theta(s^{0:T-1} \mid x)} \quad (11)$$

$$= \underbrace{\log \frac{p(s^0)}{q(s^0)}}_{L_{\text{init}}} + \sum_{0 \le t \,\ldots}$$

¹We use the notation $c \in s$ if $o_c = 1$ for $c \in \mathbb{Z}^3$ to denote occupied cells.
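The linear infusion schedule and the occupancy infusion of Eq. (9) can be sketched directly. The values of $\alpha_0$ and $\alpha_1$ below are illustrative (the actual hyperparameters are not specified in this excerpt), and `lam` stands in for the network's occupancy probability $\lambda_{\theta,c}$:

```python
import random

def infusion_rate(t, a0=0.1, a1=0.05):
    """alpha_t = min(a1 * t + a0, 1): bias toward x grows linearly, capped at 1."""
    return min(a1 * t + a0, 1.0)

def infuse_occupancy(lam, in_gx, t, rng):
    """Sample o_c ~ Ber((1 - alpha_t) * lam + alpha_t * 1[c in G_x(s^t)])."""
    a = infusion_rate(t)
    p = (1.0 - a) * lam + a * (1.0 if in_gx else 0.0)
    return 1 if rng.random() < p else 0
```

Once $\alpha_t$ reaches 1, the kernel deterministically infuses the ground truth: cells in $G_x(s^t)$ become occupied and all others are dropped, which is what drives the training chain toward the complete state $x$.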