Published as a conference paper at ICLR 2022

PROBABILISTIC IMPLICIT SCENE COMPLETION

Dongsu Zhang, Changwoon Choi, Inbum Park & Young Min Kim
Department of Electrical and Computer Engineering, Seoul National University
96lives@snu.ac.kr, changwoon.choi00@gmail.com, {inbum0215, youngmin.kim}@snu.ac.kr

ABSTRACT

We propose a probabilistic shape completion method extended to the continuous geometry of large-scale 3D scenes. Real-world scans of 3D scenes suffer from a considerable amount of missing data cluttered with unsegmented objects. The problem of shape completion is inherently ill-posed, and high-quality results require scalable solutions that consider multiple possible outcomes. We employ Generative Cellular Automata, which learns the multi-modal distribution, and transform the formulation to process large-scale continuous geometry. The local continuous shape is incrementally generated as a sparse voxel embedding, which contains a latent code for each occupied cell. We formally derive that our training objective for the sparse voxel embedding maximizes the variational lower bound of the complete shape distribution, and therefore our progressive generation constitutes a valid generative model. Experiments show that our model successfully generates diverse plausible scenes faithful to the input, especially when the input suffers from a significant amount of missing data. We also demonstrate that our approach outperforms deterministic models even in less ambiguous cases with a small amount of missing data, which implies that the probabilistic formulation is crucial for high-quality geometry completion on input scans at any level of completeness.

1 INTRODUCTION

High-quality 3D data can create realistic virtual 3D environments or provide crucial information for robots or human users to interact with the environment (Varley et al. (2017)).
However, 3D data acquired from a real-world scan is often noisy and incomplete with irregular samples. The task of 3D shape completion aims to recover the complete surface geometry from raw 3D scans. Shape completion is often formulated in a data-driven way using a prior distribution of 3D geometry, which often admits multiple plausible outcomes given an incomplete and noisy observation. If one learns to regress a single shape out of a multi-modal shape distribution, one is bound to lose fine details of the geometry and produce blurry outputs, as noticed with general generative models (Goodfellow (2017)). If we extend the range of completion to the scale of scenes with multiple objects, the task becomes even more challenging due to the memory and computation requirements for representing large-scale, high-resolution 3D shapes.

In this work, we present continuous Generative Cellular Automata (cGCA), which generates multiple continuous surfaces for 3D reconstruction. Our work builds on Generative Cellular Automata (GCA) (Zhang et al. (2021)), which produces diverse shapes by progressively growing the object surface from the immediate neighbors of the input shape. cGCA inherits the multi-modal and scalable generation of GCA, but overcomes the limitation of discrete voxel resolution, producing high-quality continuous surfaces. Specifically, our model learns to generate diverse sparse voxels associated with their local latent codes, namely a sparse voxel embedding, where each latent code encodes the deep implicit fields of the continuous geometry near each occupied voxel (Chabra et al. (2020); Jiang et al. (2020)). Our training objective maximizes the variational lower bound for the log-likelihood of the surface distribution represented with sparse voxel embeddings. The stochastic formulation is modified from the original GCA, and theoretically justified as a sound generative model.
We demonstrate that cGCA can faithfully generate multiple plausible completions even for large-scale scenes with a significant amount of missing data, as shown in Figure 1. To the best of our knowledge, we are the first to tackle the challenging task of probabilistic scene completion, which requires the model not only to generate multiple plausible outcomes but also to be scalable enough to capture the wide-range context of multiple objects.

Figure 1: Three examples of complete shapes using cGCA given noisy partial input observation. Even when the raw input is severely damaged (left), cGCA can generate plausible yet diverse complete continuous shapes.

We summarize the key contributions as follows: (1) We are the first to tackle the problem of probabilistic scene completion with partial scans, and provide a scalable model that can capture the large-scale context of scenes. (2) We present continuous Generative Cellular Automata, a generative model that produces diverse continuous surfaces from a partial observation. (3) We modify infusion training (Bordes et al. (2017)) and prove that the formulation indeed increases the variational lower bound of the data distribution, which verifies that the proposed progressive generation is a valid generative model.

2 PRELIMINARIES: GENERATIVE CELLULAR AUTOMATA

Continuous Generative Cellular Automata (cGCA) extends the idea of Generative Cellular Automata (GCA) by Zhang et al. (2021), but generates continuous surfaces with an implicit representation instead of a discrete voxel grid. For completeness of the discussion, we briefly review the formulation of GCA. Starting from an incomplete voxelized shape, GCA progressively updates the local neighborhood of the currently occupied voxels to eventually generate a complete shape.
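This progressive local growth can be illustrated with a minimal 2D sketch. The hand-written `toy_prob` below is a hypothetical stand-in for the learned occupancy head (the actual model uses a sparse CNN over 3D voxels); the loop structure mirrors how each transition samples occupancy only within the neighborhood of the current state:

```python
import random

def neighborhood(state, r=1):
    """All cells within Chebyshev distance r of any occupied cell."""
    cells = set()
    for (x, y) in state:
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                cells.add((x + dx, y + dy))
    return cells

def transition(state, prob_fn, rng):
    """One GCA step: sample Bernoulli occupancy for every neighborhood cell."""
    return {c for c in neighborhood(state) if rng.random() < prob_fn(c, state)}

def toy_prob(cell, state):
    # Hypothetical probabilities standing in for the learned sparse-CNN head:
    # tend to keep occupied cells, grow into empty neighbors half the time.
    return 0.9 if cell in state else 0.5

rng = random.Random(0)
state = {(0, 0)}              # s^0: a single occupied cell
for _ in range(5):            # T = 5 transitions s^{t+1} ~ p(. | s^t)
    state = transition(state, toy_prob, rng)
print(len(state))
```

Because each step only touches the immediate neighborhood, after `T` steps the shape can have grown at most `T` cells away from the input, which is what keeps the per-step computation sparse.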
In GCA, a shape is represented as a state $s = \{(c, o_c) \mid c \in \mathbb{Z}^3,\ o_c \in \{0, 1\}\}$, a set of binary occupancies $o_c$ for every cell $c \in \mathbb{Z}^3$, where the occupancy grid stores only the sparse cells on the surface. Given the state of an incomplete shape $s^0$, GCA evolves to the state of a complete shape $s^T$ by sampling $s^{1:T}$ from the Markov chain

$$s^{t+1} \sim p_\theta(\,\cdot \mid s^t), \quad (1)$$

where $T$ is a fixed number of transitions and $p_\theta$ is a homogeneous transition kernel parameterized by neural network parameters $\theta$. The transition kernel $p_\theta$ is implemented with a sparse CNN (Graham et al. (2018); Choy et al. (2019)), a highly efficient neural network architecture that computes the convolution operation only on the occupied voxels. The progressive generation of GCA confines the search space of each transition kernel to the immediate neighborhood of the current state. The occupancy probability within the neighborhood is regressed following a Bernoulli distribution, and the subsequent state is then sampled independently for individual cells. With the restricted domain for probability estimation, the model is scalable to high-resolution 3D voxel space. GCA shows that the series of local growths near sparse occupied cells can eventually complete the shape as a unified structure, since shapes are connected. While GCA is a scalable solution for generating diverse shapes, the grid representation of the 3D geometry inherently limits the resolution of the final shape.

3 CONTINUOUS GENERATIVE CELLULAR AUTOMATA

In Sec. 3.1, we formally introduce an extension of sparse occupancy voxels that represents continuous geometry, named sparse voxel embedding, where each occupied voxel contains a latent code representing local implicit fields. We train an autoencoder that can compress the implicit fields into the embeddings and vice versa. We then present the sampling procedure of cGCA that generates a 3D shape in Sec. 3.2, which is the inference step for shape completion. Sec.
3.3 shows the training objective of cGCA, which approximately maximizes the variational lower bound for the distribution of the complete continuous geometry.

Figure 2: Overview of our method. The implicit function of a continuous shape can be encoded as a sparse voxel embedding $s$ and decoded back (left). The colors in the sparse voxel embedding represent the clustered labels of the latent code $z_c$ for each cell $c$. The sampling procedure of cGCA (right) involves $T$ steps of sampling from the stochastic transition kernel $p_\theta$, followed by $T'$ mode seeking steps which remove cells with low probability. From the final sparse voxel embedding $s^{T+T'}$, the decoder can recover the implicit representation of the complete continuous shape.

3.1 SPARSE VOXEL EMBEDDING

In addition to the sparse occupied voxels of GCA, the state of cGCA, named sparse voxel embedding, contains an associated latent code, which can be decoded into a continuous surface. Formally, the state $s$ of cGCA is defined as a set of pairs of binary occupancy $o_c$ and latent code $z_c$ for the cells $c$ in the three-dimensional grid $\mathbb{Z}^3$:

$$s = \{(c, o_c, z_c) \mid c \in \mathbb{Z}^3,\ o_c \in \{0, 1\},\ z_c \in \mathbb{R}^K\}. \quad (2)$$

Similar to GCA, cGCA keeps the representation sparse by storing only the set of occupied voxels and their latent codes, and sets $z_c = 0$ if $o_c = 0$. The sparse voxel embedding $s$ can be converted to and from the implicit representation of local geometry with neural networks, inspired by the work of Peng et al. (2020), Chabra et al. (2020), and Chibane et al. (2020a). We utilize the (signed) distance to the surface as the implicit representation, and use an autoencoder for the conversion.
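As a data-structure sketch, a sparse voxel embedding can be held as a map from occupied cell coordinates to $K$-dimensional latent codes; absent keys encode $o_c = 0$ and $z_c = 0$ without storing anything. The helper names and the latent dimension below are illustrative, not from the paper:

```python
K = 4  # latent dimension (illustrative; the paper's K is not fixed here)

# Sparse voxel embedding: only occupied cells are stored.
embedding = {
    (0, 0, 0): [0.1] * K,
    (0, 0, 1): [0.2] * K,
}

def occupancy(s, c):
    """o_c = 1 iff cell c is stored in the sparse state."""
    return 1 if c in s else 0

def latent(s, c):
    """z_c for occupied cells; the zero vector for unoccupied cells."""
    return s.get(c, [0.0] * K)
```

This mirrors why the representation scales: memory grows with the number of surface voxels, not with the full $\mathbb{Z}^3$ grid resolution.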
The encoder $g_\phi$ produces the sparse voxel embedding $s$ from the coordinate-distance pairs $P = \{(p, d_p) \mid p \in \mathbb{R}^3,\ d_p \in \mathbb{R}\}$, where $p$ is a 3D coordinate and $d_p$ is the distance to the surface: $s = g_\phi(P)$. The decoder $f_\omega$, on the other hand, regresses the local implicit field value $\hat{d}_q = f_\omega(s, q)$ at a 3D position $q \in \mathbb{R}^3$ given the sparse voxel embedding $s$, supervised with coordinate-distance pairs $Q = \{(q, d_q) \mid q \in \mathbb{R}^3,\ d_q \in \mathbb{R}\}$. The detailed architecture of the autoencoder is described in Appendix A, where the decoder $f_\omega$ generates continuous geometry by interpolating hierarchical features extracted from the sparse voxel embedding. An example of the conversion is presented on the left side of Fig. 2, where the color of the sparse voxel embedding represents clustered labels of the latent codes obtained with k-means clustering (Hartigan & Wong (1979)). Note that the embeddings of similar local geometry, such as the seat of the chair, exhibit similar values of latent codes. The parameters $\phi, \omega$ of the autoencoder are jointly optimized by minimizing the following loss function:

$$L(\phi, \omega) = \frac{1}{|Q|} \sum_{(q, d_q) \in Q} \left| f_\omega(s, q) - \max\!\left(\min\!\left(\frac{d_q}{\epsilon}, 1\right), -1\right) \right| + \beta \frac{1}{|s|} \sum_{c \in s} \lVert z_c \rVert, \quad (3)$$

where $s = g_\phi(P)$ and $\epsilon$ is the size of a single voxel. The first term in Eq. (3) minimizes the error against the clamped normalized distance, and the second is a regularization term for the latent codes weighted by the hyperparameter $\beta$. Clamping the maximum distance makes the network focus on predicting accurate values in the vicinity of the surface (Park et al. (2019); Chibane et al. (2020b)).

3.2 SAMPLING FROM CONTINUOUS GENERATIVE CELLULAR AUTOMATA

The generation process of cGCA echoes the formulation of GCA (Zhang et al. (2021)), and repeats $T$ steps of sampling from the transition kernel to progressively grow the shape.
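Per cell, one such transition draws a Bernoulli occupancy and then, for occupied cells, a Gaussian latent code. A minimal per-cell sketch, with illustrative parameters standing in for the network outputs $\lambda_{\theta,c}$ and $\mu_{\theta,c}$:

```python
import random

def sample_cell(lam, mu, sigma, rng):
    """One per-cell draw from the transition kernel: occupancy o_c ~ Ber(lam),
    then z_c ~ N(mu, sigma * I) if o_c = 1, and z_c = 0 otherwise."""
    o = 1 if rng.random() < lam else 0
    if o == 0:
        return 0, [0.0] * len(mu)
    return 1, [rng.gauss(m, sigma) for m in mu]

rng = random.Random(42)
o, z = sample_cell(lam=0.8, mu=[0.0, 1.0], sigma=0.1, rng=rng)
```

In the full model these per-cell draws are made independently for every cell in the neighborhood of the current state, exactly as the factorization in the next paragraph prescribes.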
Each transition kernel $p(s^{t+1} \mid s^t)$ is factorized over the cells within the local neighborhood of the occupied cells of the current state, $\mathcal{N}(s^t) = \{c \in \mathbb{Z}^3 \mid d(c, c') \le r,\ c' \in s^t\}$,¹ given a distance metric $d$ and radius $r$:

$$p(s^{t+1} \mid s^t) = \prod_{c \in \mathcal{N}(s^t)} p_\theta(o_c, z_c \mid s^t) = \prod_{c \in \mathcal{N}(s^t)} p_\theta(o_c \mid s^t)\, p_\theta(z_c \mid s^t, o_c). \quad (4)$$

Note that the distribution is further decomposed into the occupancy $o_c$ and the latent code $z_c$, where we denote $o_c$ and $z_c$ as the random variables of occupancy and latent code for cell $c$ in state $s^{t+1}$. The shape is therefore generated by progressively sampling the occupancy and the latent codes for the occupied voxels, which are decoded and fused into a continuous geometry. The binary occupancy follows a Bernoulli distribution

$$p_\theta(o_c \mid s^t) = \mathrm{Ber}(\lambda_{\theta,c}), \quad (5)$$

where $\lambda_{\theta,c} \in [0, 1]$ is the estimated occupancy probability at the corresponding cell $c$. With our sparse representation, the distribution of the latent codes is

$$p_\theta(z_c \mid s^t, o_c) = \begin{cases} \delta_0 & \text{if } o_c = 0 \\ \mathcal{N}(\mu_{\theta,c}, \sigma_t I) & \text{if } o_c = 1. \end{cases} \quad (6)$$

$\delta_0$ is a Dirac delta distribution at 0, indicating that $z_c = 0$ when $o_c = 0$. For the occupied voxels ($o_c = 1$), $z_c$ follows a normal distribution with the estimated mean latent code $\mu_{\theta,c} \in \mathbb{R}^K$ and the predefined standard deviation $\sigma_t I$, where $\sigma_t$ decreases with respect to $t$.

Initial State. Given an incomplete point cloud, we set the initial state $s^0$ of the sampling chain by setting the occupancy $o_c$ to 1 for the cells that contain points, and associating the occupied cells with a latent code sampled from an isotropic normal distribution. However, the input can better describe the provided partial geometry if we encode the latent code $z_c$ of the occupied cells with the encoder $g_\phi$. The final completion is more precise when all the transitions $p_\theta$ are conditioned on the initial state containing the encoded latent code. Further details are described in Appendix A.

Mode Seeking.
While we effectively model the probabilistic distribution of multi-modal shapes, the final reconstruction needs to converge to a single coherent shape. Naïve sampling from the stochastic transition kernel in Eq. (4) can include noisy voxels with low occupancy probability. As a simple trick, we augment the chain with mode seeking steps that select the most probable mode of the current result instead of sampling probabilistically. Specifically, we run $T'$ additional steps of the transition kernel, but select the cells with occupancy probability higher than 0.5 and set the latent code to the mean of the distribution $\mu_{\theta,c}$. The mode seeking steps ensure that the final shape discovers the dominant mode closest to $s^T$, as depicted in Fig. 2, where it can be transformed into an implicit function with the pretrained decoder $f_\omega$.

3.3 TRAINING CONTINUOUS GENERATIVE CELLULAR AUTOMATA

We train a homogeneous transition kernel $p_\theta(s^{t+1} \mid s^t)$ whose repeated application eventually yields samples that follow the learned distribution. However, the data contain only the initial state $s^0$ and the ground truth state $x$, so we need to emulate the intermediate sequence for training. We adapt infusion training (Bordes et al. (2017)), which induces the intermediate transitions to converge to the desired complete state. To this end, we define a function $G_x(s)$ that finds the valid cells closest to the complete shape $x$ within the neighborhood of the current state $\mathcal{N}(s)$:

$$G_x(s) = \{\operatorname{argmin}_{c' \in \mathcal{N}(s)} d(c', c) \mid c \in x\}. \quad (7)$$

Then, we define the infusion kernel $q^t_\theta$, factorized similarly to the sampling kernel in Eq. (4):

$$q^t_\theta(s^{t+1} \mid s^t, x) = \prod_{c \in \mathcal{N}(s^t)} q^t_\theta(o_c, z_c \mid s^t, x) = \prod_{c \in \mathcal{N}(s^t)} q^t_\theta(o_c \mid s^t, x)\, q^t_\theta(z_c \mid s^t, o_c, x). \quad (8)$$
The distributions of both $o_c$ and $z_c$ are gradually biased towards the ground truth final shape $x$ with the infusion rate $\alpha_t$, which increases linearly with respect to the time step, i.e., $\alpha_t = \min(\alpha_1 t + \alpha_0, 1)$ with $\alpha_1 > 0$:

$$q^t_\theta(o_c \mid s^t, x) = \mathrm{Ber}\big((1 - \alpha_t)\lambda_{\theta,c} + \alpha_t \mathbb{1}[c \in G_x(s^t)]\big), \quad (9)$$

$$q^t_\theta(z_c \mid s^t, o_c, x) = \begin{cases} \delta_0 & \text{if } o_c = 0 \\ \mathcal{N}\big((1 - \alpha_t)\mu_{\theta,c} + \alpha_t z^x_c,\ \sigma_t I\big) & \text{if } o_c = 1. \end{cases} \quad (10)$$

Here $\mathbb{1}$ is an indicator function, and we denote $o^x_c, z^x_c$ as the occupancy and latent code of the ground truth complete shape $x$ at coordinate $c$.

cGCA aims to optimize the log-likelihood of the ground truth sparse voxel embedding, $\log p_\theta(x)$. However, since direct optimization of the exact log-likelihood is intractable, we work with the variational lower bound, following the derivation of diffusion-based models (Sohl-Dickstein et al. (2015)):

$$\log p_\theta(x) \ge \sum_{s^{0:T-1}} q_\theta(s^{0:T-1} \mid x) \log \frac{p_\theta(s^{0:T-1}, x)}{q_\theta(s^{0:T-1} \mid x)} \quad (11)$$

$$= \underbrace{\log \frac{p(s^0)}{q(s^0)}}_{L_{\text{init}}} + \sum_{0 \le t \,\ldots}$$

¹We use the notation $c \in s$ if $o_c = 1$ for $c \in \mathbb{Z}^3$ to denote occupied cells.
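The linear infusion schedule and the occupancy infusion of Eq. (9) can be sketched directly. The values of $\alpha_0$ and $\alpha_1$ below are illustrative (the actual hyperparameters are not specified in this excerpt), and `lam` stands in for the network's occupancy probability $\lambda_{\theta,c}$:

```python
import random

def infusion_rate(t, a0=0.1, a1=0.05):
    """alpha_t = min(a1 * t + a0, 1): bias toward x grows linearly, capped at 1."""
    return min(a1 * t + a0, 1.0)

def infuse_occupancy(lam, in_gx, t, rng):
    """Sample o_c ~ Ber((1 - alpha_t) * lam + alpha_t * 1[c in G_x(s^t)])."""
    a = infusion_rate(t)
    p = (1.0 - a) * lam + a * (1.0 if in_gx else 0.0)
    return 1 if rng.random() < p else 0
```

Once $\alpha_t$ reaches 1, the kernel deterministically infuses the ground truth: cells in $G_x(s^t)$ become occupied and all others are dropped, which is what drives the training chain toward the complete state $x$.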