# UCLID-Net: Single View Reconstruction in Object Space

Benoit Guillard, Edoardo Remelli
CVLab, EPFL, Switzerland
{firstname.lastname}@epfl.ch

Abstract

Most state-of-the-art deep geometric learning single-view reconstruction approaches rely on encoder-decoder architectures that output either shape parametrizations [7, 8, 23] or implicit representations [14, 26, 4]. However, these representations rarely preserve the Euclidean structure of the 3D space objects exist in. In this paper, we show that building a geometry-preserving 3-dimensional latent space helps the network concurrently learn global shape regularities and local reasoning in the object coordinate space and, as a result, boosts performance. We demonstrate, both on ShapeNet synthetic images, which are often used for benchmarking purposes, and on real-world images, that our approach outperforms state-of-the-art ones. Furthermore, the single-view pipeline naturally extends to multi-view reconstruction, which we also show.

1 Introduction

Most state-of-the-art deep geometric learning Single-View Reconstruction (SVR) approaches rely on encoder-decoder architectures that output either explicit shape parametrizations [7, 8, 23] or implicit representations [14, 26, 4]. However, the representations they learn rarely preserve the Euclidean structure of the 3D space objects exist in, and instead rely on a global vector embedding of the input image at a semantic level. In this paper, we show that building a geometry-preserving 3-dimensional representation helps the network concurrently learn global shape regularities and local reasoning in the object coordinate space and, as a result, boosts performance. This corroborates the observation that choosing the right coordinate frame for the output of a deep network matters a great deal [21].

In our work, we use camera projection matrices to explicitly link camera- and object-centric coordinate frames. This allows us to reason about geometry and learn object priors in a common 3D coordinate system. More specifically, we use regressed camera pose information to back-project 2D feature maps to 3D feature grids at several scales. This is achieved within our novel architecture that comprises a 2D image encoder and a 3D shape decoder. They feature symmetrical downsampling and upsampling parts and communicate through multi-scale skip connections, as in the U-Net architecture [16]. However, unlike in other approaches, the bottleneck is made of 3D feature grids and we use back-projection layers [12, 11, 17] to lift 2D feature maps to 3D grids. As a result, feature localization from the input view is preserved. In other words, our feature embedding has a Euclidean structure and is aligned with the object coordinate frame. Fig. 1 depicts this process. In reference to its characteristics, we dub our architecture UCLID-Net.

Earlier attempts at passing 2D features to a shape decoder via local feature extraction [24, 26] enabled spatial information to flow to the decoder in a non-semantic manner, often with limited impact on the final result. In these approaches, the same local feature is attributed to all points lying along a camera ray. By contrast, UCLID-Net uses 3D convolutions to volumetrically process local features before passing them to the local shape decoders. This allows them to make different contributions at different places along camera rays.
Figure 1: UCLID-Net. Given input image $I$, a CNN encoder estimates 2D feature maps $F_s$ for scales $s$ from 1 to $S$, while pre-trained CNNs regress a depth map $D$ and a camera pose $P$. $P$ is used to back-project the feature maps $F_s$ to object-aligned 3D feature grids $G^F_s$ for $1 \leq s \leq S$ without using depth information. In parallel, $S$ corresponding voxelized depth grids $G^D_s$ are built from $D$ and $P$ without using feature information. A 3D CNN then aggregates feature and depth grids from the lowest to the highest resolution into outputs $H_S, \dots, H_0$ of increasing resolutions. From $H_0$, fully connected layers regress a coarse voxel shape, which is then refined into a point cloud using local patch foldings. Supervision comes in the form of binary cross-entropy on the coarse output and Chamfer distance on the final 3D point cloud.

To further promote geometrical reasoning, UCLID-Net never computes a global vector encoding of the input image. Instead, it relies on localized feature grids, either 2D in the image plane or 3D in object space. Finally, the geometric nature of the 3D feature grids enables us to exploit estimated depth maps and further boost reconstruction performance.

We demonstrate, both on ShapeNet synthetic images, which are often used for benchmarking purposes, and on real-world images, that our approach outperforms state-of-the-art ones. Our contribution is therefore a demonstration that creating a latent space that preserves Euclidean structure provides a clear benefit for single-image reconstruction, along with a practical approach to taking advantage of it. Finally, the single-view pipeline naturally extends to multi-view reconstruction, for which we also provide an example.

2 Related work

Most recent SVR methods rely on a 2D CNN to create an image description that is then passed to a 3D shape decoder that generates a 3D output. What differentiates them is the nature of their output, which is strongly related to the structure of their shape decoder, and their approach to local feature extraction. We briefly describe these below.

Shape Decoders. The first successful deep SVR models relied on 3D convolutions to regress voxelized shapes [5]. This restricts them to coarse resolutions because of their cubic computational and memory cost. This drawback can be mitigated using local subdivision schemes [9, 20]. MarrNet [25] and Pix3D [19] regress voxelized shapes as well but also incorporate depth, normal, and silhouette predictions as intermediate representations. These help disentangle shape from appearance and are used to compute a re-projection consistency loss. Depth, normals, and silhouettes are, however, not exploited in a geometric manner at inference time because they are encoded as flat vectors. PSGN [6] regresses sparse scalar values, directly interpreted as the 3D coordinates of a point cloud with fixed size and mild continuity. AtlasNet [8] introduces a per-patch surface parametrization and samples a point cloud from a set of learned parametric surfaces. One limitation, however, is that the patches it produces sometimes overlap each other or collapse during training [1]. To combine the strengths of voxel and mesh representations, Mesh R-CNN [7] uses a hybrid shape decoder that first regresses coarse voxels, which are then refined into mesh vertices using graph convolutions. Our approach is in the same spirit, with two key differences.
First, our coarse occupancy grid is used to instantiate folding patches and to sample 3D surface points in the AtlasNet [8] manner. However, unlike in AtlasNet, the locations of the sampled 3D points and the folding creating them are tightly coupled. Second, we regress shapes in object space, thus leveraging stronger object priors.

A competing approach is to rely on implicit shape representations. For example, the network of [14] computes occupancy maps that represent smooth watertight shapes at arbitrary resolutions. DISN [26] uses instead a Signed Distance Field (SDF). Shapes are encoded as the zero-crossing of the field and explicit 3D meshes can be recovered using the Marching Cubes [13] algorithm.

Local Feature Extraction. Most SVR methods discussed above rely on a vectorized embedding passed from image encoder to shape decoder. This embedding typically ignores image feature localization and produces a global image descriptor. As shown in [21], such approaches are therefore prone to behaving like classifiers that simply retrieve shapes from a learned catalog. Hence, no true geometric reasoning takes place, and recognition occurs at the scale of whole objects while ignoring fine details.

Figure 2: (a) Input photograph from Pix3D [19]. (b) Ground truth shape seen from a different viewpoint. (c,d) DISN [26] reconstruction seen from the viewpoints of (a) and (b), respectively. (e,f) Our reconstruction seen from the viewpoints of (a) and (b), respectively. For DISN, local feature extraction makes it easy to recover the silhouette in (c) but fails to deliver the required depth information. Our approach avoids this pitfall.

There have been several attempts at preserving feature localization from the input image by passing local vectors from 2D feature maps of the image encoder to the shape decoder. In [24, 7], features from the 2D plane are propagated to the mesh convolution network that operates in camera space. In DISN [26], features from the 2D plane are extracted and serve as local inputs to an SDF regressor, directly in object space. Unfortunately, features extracted in this manner do not incorporate any notion of depth, and local shape regressors receive the same input all along a camera ray. As a result, and as shown in Fig. 2, DISN can reconstruct shapes that have the correct outline when projected in the original viewpoint but that are nevertheless incorrect. In practice, this occurs when the network relies on both global and local features, but not when it relies on global features only. In other words, it seems that local features allow the network to take an undesirable shortcut by making silhouette recovery excessively easy, especially when the background is uniform. The depth constraint is only weakly enforced by the latent space and must instead be carried by the fully connected network that regresses signed distance values. By contrast, our approach avoids this pitfall, as shown in Fig. 2(f). This is allowed by two key differences: (i) the shape decoder relies on 3D convolutions to handle global spatial arrangement before fully connected networks locally regress shape parts, and (ii) predicted depth maps are made available as inputs to the shape decoder.

3 Method

At the heart of UCLID-Net is a representation that preserves the Euclidean structure of the 3D world in which the shape we want to reconstruct lives. To encode the input image into it and then decode it into a 3D shape, we use the architecture depicted by Fig. 1.
A CNN image encoder computes feature maps at $S$ different scales while auxiliary networks produce a depth map estimate $D$ and a camera projection model $P : \mathbb{R}^3 \to \mathbb{R}^2$. $P$ allows us to back-project image features into 3D space along camera rays, and $D$ to localize the features at the probable location of the surface on each of these rays.

Figure 3: (a) Back-projecting 2D feature maps to 3D grids. Rays are cast from camera $P$ through 2D feature map $F$ to fill 3D grid $G^F$. This is applied to the 2D feature maps from the image encoder to provide object-space-aligned 3D feature grids as inputs to the shape decoder. (b) Schematic view of a 1D-to-2D back-projection: all grid cells along a ray are given the same corresponding feature value.

The 2D feature maps and depth maps are back-projected to 3D grids that serve as input to the shape decoder, as shown by Fig. 3. This yields a coarse voxelized shape that is then refined into a point cloud. If estimates of either the pose $P$ or the depth map $D$ happen to be available a priori, we can use them instead of regressing them. We will show in the results section that this provides a small performance boost when they are accurate, but not a very large one, because our predictions tend to be good enough for our purposes, that is, lifting the features to the 3D grids.

The back-projection mechanism we use is depicted by Fig. 3. It is similar to the one of [12, 11, 17] and has a major weakness when used for single-view reconstruction: all voxels along a camera ray receive the same feature information, which can result in failures such as the one depicted by Fig. 2 if passed as-is to local shape decoders. To remedy this, we concatenate feature grids with voxelized depth maps. The result is then processed as a whole using 3D convolutions before being passed to local decoders. In the remainder of this section, we first introduce the basic back-projection mechanism, and then describe how our shape decoder fuses feature grids with depth information using a 3D CNN before locally regressing shapes.

3.1 Back-Projecting Feature and Depth Maps

We align all objects in the dataset to be canonically oriented within each class, centered at the origin, and scaled to fill the bounding box $[-1, 1]^3$. Given such a 3D object, a CNN produces a 2D feature map $F \in \mathbb{R}^{f \times H \times W}$ for input image $I$. Using $P$, the camera projection used to render the object into image $I \in \mathbb{R}^{3 \times H \times W}$, we back-project $F$ into object space as follows. As in [12, 11], we subdivide the bounding box $[-1, 1]^3$ into $G^F \in \mathbb{R}^{f \times N \times N \times N}$, a regular 3D grid. Each voxel $(x, y, z)$ contains the $f$-dimensional feature vector

$$G^F_{xyz} = F\{P(x, y, z)\} , \quad (1)$$

where $\{\cdot\}$ denotes bilinear interpolation on the 2D feature map. As illustrated by Fig. 3, back-projecting can be understood as illuminating a grid of voxels with light rays that are cast by the camera and pass through the 2D feature map. This preserves the geometric structure of the surface, and 2D features are positioned consistently in 3D space. In practice, we back-project 2D feature maps $(F_1, \dots, F_S)$ of decreasing spatial resolutions, which yields 3D feature grids $(G^F_1, \dots, G^F_S)$ of decreasing sizes $(N_1, \dots, N_S)$. We linearly scale the projected coordinates to account for the decreasing resolution.
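To make Eq. (1) concrete, the following PyTorch-style sketch lifts a single 2D feature map to a 3D grid over $[-1, 1]^3$ with a pinhole camera. The helper names (`project_points`, `backproject_features`) and the explicit intrinsics/extrinsics interface are our own illustrative assumptions, not the released implementation, which may handle cameras and batching differently.

```python
import torch
import torch.nn.functional as F


def project_points(points, K, R, t):
    """Project (M, 3) object-space points to pixel coordinates with a pinhole camera.
    K: (3, 3) intrinsics, R: (3, 3) rotation, t: (3,) translation (object to camera)."""
    cam = points @ R.T + t                           # points expressed in the camera frame
    uv = cam[:, :2] / cam[:, 2:3].clamp(min=1e-6)    # perspective division (assumes z > 0)
    return uv @ K[:2, :2].T + K[:2, 2]               # apply focal lengths and principal point


def backproject_features(feat2d, K, R, t, grid_size):
    """Lift a 2D feature map (f, H, W) to a 3D grid (f, N, N, N) over [-1, 1]^3, as in Eq. (1).

    Every voxel center is projected into the image and receives the bilinearly
    interpolated feature at that pixel, so all voxels along a camera ray share
    the same feature vector.
    """
    f, H, W = feat2d.shape
    lin = torch.linspace(-1.0, 1.0, grid_size)
    zz, yy, xx = torch.meshgrid(lin, lin, lin, indexing="ij")
    centers = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3)    # (N^3, 3) voxel centers

    uv = project_points(centers, K, R, t)                         # (N^3, 2) pixel coordinates
    # Normalize pixel coordinates to [-1, 1] as expected by grid_sample.
    uv_norm = torch.stack([2.0 * uv[:, 0] / (W - 1) - 1.0,
                           2.0 * uv[:, 1] / (H - 1) - 1.0], dim=-1)

    sampled = F.grid_sample(feat2d[None],             # (1, f, H, W)
                            uv_norm[None, None],      # (1, 1, N^3, 2)
                            mode="bilinear", align_corners=True)  # (1, f, 1, N^3)
    return sampled.reshape(f, grid_size, grid_size, grid_size)    # G^F, dims (f, z, y, x)
```

As Fig. 3(b) illustrates, the resulting grid is constant along each viewing ray, which is precisely why the depth grids described next are also needed.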
We process depth maps in a different manner, to exploit the available depth value at each pixel. Given a 2D depth map $D \in \mathbb{R}_+^{H \times W}$ of an object seen from a camera with projection matrix $P$, we first back-project the depth map to the corresponding 3D point cloud in object space. This point cloud is used to populate binary occupancy grids such as the one depicted by Fig. 4(a). As for feature maps, we use this mechanism to produce a set of binary depth grids $(G^D_1, \dots, G^D_S)$ of decreasing sizes $(N_1, \dots, N_S)$.

Figure 4: (a) Back-projecting depth maps. Input depth map and back-projected depth grid seen from two different viewpoints. (b) Outputs of the occ and fold MLPs introduced in Section 3.2. One is an occupancy grid and the other a cloud of 3D points generated by individual folding patches. The points are colored according to which patch generated them.

3.2 Hybrid Shape Decoder

The feature grids discussed above contain learned features but lack an explicit notion of depth: the values in their voxels are the same along a camera ray. By contrast, the depth grids structurally carry depth information in a binary occupancy grid, but without any explicit feature information. One approach to merging these two kinds of information would be to clamp projected features using depth. However, this is not optimal for two reasons. First, the depth maps can be imprecise and the decoder should learn to correct for that. Second, it can be advantageous to push feature information not only to the visible parts of the surfaces but also to their occluded ones.

Instead, we devised a shape decoder that takes as input the pairs of feature and depth grids at different scales $\{(G^F_1, G^D_1), \dots, (G^F_S, G^D_S)\}$ introduced in Section 3.1 and outputs a point cloud. Our decoder uses residual layers that rely on regular and transposed 3D convolutions to aggregate the input pairs in a bottom-up manner. We denote by $\mathrm{layer}_s$ the layer at scale $s$, and by $\mathrm{concat}$ the concatenation along the feature dimension of 3D grids of the same size. $\mathrm{layer}_s$ takes as input a feature grid of size $N_s$ and outputs a grid $H_{s-1}$ of size $N_{s-1}$. If $N_{s-1} > N_s$, $\mathrm{layer}_s$ performs upsampling; otherwise, if $N_{s-1} = N_s$, the resolution remains unchanged. At the lowest scale, $\mathrm{layer}_S$ constructs its output from feature grid $G^F_S$ and depth grid $G^D_S$ as

$$H_{S-1} = \mathrm{layer}_S(\mathrm{concat}(G^F_S, G^D_S)) . \quad (2)$$

At subsequent scales $1 \leq s < S$, the output of the previous layer is also used and we write

$$H_{s-1} = \mathrm{layer}_s(\mathrm{concat}(G^F_s, G^D_s, H_s)) . \quad (3)$$

The 3D convolutions ensure that voxels in the final feature grid $H_0$ can receive information emanating from different lines of sight and are therefore key to addressing the limitations of methods that only rely on local feature extraction [26].

$H_0$ is passed to two downstream Multi-Layer Perceptrons (MLPs) that we will refer to as occ and fold. occ returns a coarse surface occupancy grid. Within each voxel predicted to be occupied, fold creates one local patch that refines the prediction of occ and recovers high-frequency details in the manner of AtlasNet [8]. Both MLPs process each voxel of $H_0$ independently. Fig. 4(b) depicts their output in a specific case. We describe them in more detail in the supplementary material.

Let $\tilde{O} = \mathrm{occ}(H_0)$ be the occupancy grid generated by occ, and let

$$\tilde{X} = \bigcup_{xyz \,:\, \tilde{O}_{xyz} > \tau} \left\{ xyz + \mathrm{fold}(u, v \mid (H_0)_{xyz}) \;\middle|\; (u, v) \in \Lambda \right\}$$

be the union of the point clouds generated by fold in each individual $H_0$ voxel in which the occupancy is above a threshold $\tau$. As in [8, 27], fold continuously maps a discrete set of 2D parameters $\Lambda \subset [0, 1]^2$ to 3D points in space, which makes it possible to sample it at any resolution. During training, we minimize a weighted sum of the cross-entropy between $\tilde{O}$ and the ground-truth surface occupancy and of the Chamfer-L2 distance between $\tilde{X}$ and a point cloud sampling of the ground-truth 3D model.
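The bottom-up aggregation of Eqs. (2)-(3) and the occ/fold heads can be summarized with the PyTorch-style sketch below. The class name, layer composition, and helper conventions (e.g. `HybridDecoder`, plain convolutional blocks instead of the residual blocks described above) are simplifying assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn


class HybridDecoder(nn.Module):
    """Bottom-up fusion of (feature, depth) grid pairs (Eqs. 2-3), then occ/fold heads."""

    def __init__(self, feat_channels, grid_sizes, hidden=40, occ_feats=8, fold_feats=32):
        # feat_channels[s-1] / grid_sizes[s-1] describe G^F_s and G^D_s, s = 1..S,
        # ordered from finest to coarsest; the output grid H_0 keeps the finest size.
        super().__init__()
        assert hidden == occ_feats + fold_feats          # H_0 channels split between the heads
        sizes = [grid_sizes[0]] + list(grid_sizes)       # prepend N_0 (= N_1 here)
        n_scales = len(grid_sizes)
        self.blocks = nn.ModuleList()
        for s in range(1, n_scales + 1):                 # build layer_1 .. layer_S
            in_ch = feat_channels[s - 1] + 1 + (hidden if s < n_scales else 0)
            if sizes[s - 1] > sizes[s]:                  # N_{s-1} > N_s: upsample by 2
                conv = nn.ConvTranspose3d(in_ch, hidden, 4, stride=2, padding=1)
            else:                                        # same resolution
                conv = nn.Conv3d(in_ch, hidden, 3, padding=1)
            self.blocks.append(nn.Sequential(conv, nn.ReLU(inplace=True)))
        self.occ_mlp = nn.Linear(occ_feats, 1)           # coarse occupancy per voxel
        self.fold_mlp = nn.Sequential(                   # (u, v) + local code -> 3D offset
            nn.Linear(fold_feats + 2, 128), nn.ReLU(inplace=True), nn.Linear(128, 3))
        self.occ_feats, self.fold_feats = occ_feats, fold_feats

    def forward(self, feat_grids, depth_grids, uv_samples, tau=0.5):
        # feat_grids[s-1]: (B, f_s, N_s, N_s, N_s); depth_grids[s-1]: (B, 1, N_s, N_s, N_s).
        h = None
        for i in reversed(range(len(feat_grids))):       # coarsest to finest, Eqs. (2), (3)
            x = torch.cat([feat_grids[i], depth_grids[i]], dim=1)
            if h is not None:
                x = torch.cat([x, h], dim=1)
            h = self.blocks[i](x)                        # H_{s-1}
        b, c, n = h.shape[0], h.shape[1], h.shape[2]
        voxels = h.permute(0, 2, 3, 4, 1).reshape(-1, c) # one row per H_0 voxel
        occ = torch.sigmoid(self.occ_mlp(voxels[:, :self.occ_feats])).squeeze(-1)
        keep = occ > tau                                 # voxels to refine with fold
        codes = voxels[keep][:, self.occ_feats:]         # (K, fold_feats) local codes
        k, m = codes.shape[0], uv_samples.shape[0]
        inp = torch.cat([codes[:, None].expand(k, m, self.fold_feats),
                         uv_samples[None].expand(k, m, 2)], dim=-1)
        # Voxel centers of the kept voxels; folded patches are offsets around them.
        lin = torch.linspace(-1.0, 1.0, n, device=h.device)
        zz, yy, xx = torch.meshgrid(lin, lin, lin, indexing="ij")
        centers = torch.stack([xx, yy, zz], dim=-1).reshape(-1, 3).repeat(b, 1)
        points = centers[keep][:, None] + self.fold_mlp(inp)   # (K, M, 3) refined points
        return occ.reshape(b, n, n, n), points
```

At inference, the returned points can be sampled at arbitrary density by enlarging `uv_samples`, since fold maps the continuous $[0, 1]^2$ domain to 3D points.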
3.3 Implementation Details

In practice, our UCLID-Net architecture has $S = 4$ scales with grid sizes $N_1 = N_2 = 28$, $N_3 = 14$, $N_4 = 7$. The image encoder is a ResNet18 [10], in which we replaced the batch normalization layers by instance normalization ones [22]. Feature map $F_s$ is the output of the $s$-th residual layer. The shape decoder mirrors the encoder, but in the 3D domain. It uses residual blocks, with transposed convolutions to increase resolution when required. The last feature grid $H_0$ of the decoder has spatial resolution $N_0 = 28$, with 40 feature channels. The first 8 features serve as input to occ, and the last 32 to fold. occ is made of a single fully connected layer, while fold comprises 7 and performs two successive folds as in [28]. The network is implemented in PyTorch and trained for 150 epochs using the Adam optimizer, with an initial learning rate of $10^{-3}$, decreased to $10^{-4}$ after 100 epochs.

We take the camera to be a simple pinhole one with fixed intrinsic parameters and train a CNN to regress rotation and translation from RGB images. Its architecture and training are similar to what is described in [26], except that we replaced its VGG-16 backbone by a ResNet18. To regress depth maps from images, we train another off-the-shelf CNN with a feature pyramid architecture [3]. These auxiliary networks are trained independently from the main UCLID-Net, but using the same training samples. Code is available here: https://github.com/cvlab-epfl/UCLID-Net.

4 Experiments

4.1 Experimental Setup

Datasets. Given the difficulty of annotation, there are relatively few 3D datasets for geometric deep learning. We use the following two. ShapeNetCore [2] features 38,000 shapes belonging to 13 object categories. Within each category, objects are aligned with each other and we rescale them to fit into a $[-1, 1]^3$ bounding box. For training and validation purposes, we use the RGB renderings from 36 viewpoints provided in DISN [26], which have more variation and higher resolution than those of [5]. We use the same testing and training splits but re-generated the depth maps because the provided ones are clipped along the z-axis. Pix3D [19] is a collection of pairs of real images of furniture with ground truth 3D models and pose annotations. With 395 3D shapes and 10,069 images, it contains far fewer samples than ShapeNet. We therefore use it for validation only, on approximately 2.5k images of chairs.

Baselines and Metrics. We test our UCLID-Net against several state-of-the-art approaches: AtlasNet [8] outputs a set of 25 patches sampled as a point cloud, Pixel2Mesh [24] regresses a mesh with fixed topology, Mesh R-CNN [7] a mesh with varying topological structure, and DISN [26] uses an implicit shape representation in the form of a signed distance function. For Pixel2Mesh, we use the improved reimplementation from [7] with a deeper backbone, which we refer to as Pixel2Mesh+. All methods are retrained on the dataset described above, each according to their original training procedure. We report our results and those of the baselines in terms of five separate metrics: Chamfer L1 and L2 Distances (CD-L1, CD-L2), Earth Mover's Distance (EMD), shell-IoU (sIoU), and average F-Score for a distance threshold of 5% (F@5%), which we describe in more detail in the supplementary material.

4.2 Comparative Results

ShapeNet. In Fig. 5, we provide qualitative UCLID-Net reconstruction results. In Tab. 6(a), we compare it quantitatively against our baselines. UCLID-Net outperforms all other methods.
We provide the results in aggregate and refer the interested reader to the supplementary material for per-category results. As in [26], all metrics are computed on shapes scaled to fit a unit-radius sphere, and CD-L2 and EMD values are scaled by $10^3$ and $10^2$, respectively. Note that these results were obtained using the depth maps and camera poses regressed by our auxiliary regressors. In other words, the input was only the image. We will see in the ablation study below that they can be further improved by supplying the ground-truth depth maps, which points towards a potential for further performance gains by using a more sophisticated depth regressor than the one we currently use.

Figure 5: ShapeNet objects reconstructed by UCLID-Net. Top row: input view. Bottom row: final point cloud. The points are colored according to the patch that generated them.

| Method      | CD-L2 (↓) | EMD (↓) | sIoU (↑) | F@5% (↑) |
|-------------|-----------|---------|----------|----------|
| AtlasNet    | 13.0      | 8.0     | 15       | 89.3     |
| Pixel2Mesh+ | 7.0       | 3.8     | 30       | 95.0     |
| Mesh R-CNN  | 9.0       | 4.7     | 24       | 92.5     |
| DISN        | 9.7       | 2.6     | 30       | 90.7     |
| Ours        | 6.3       | 2.5     | 37       | 96.2     |

(ShapeNet)

| Method      | CD-L1 (↓) | EMD (↓) |
|-------------|-----------|---------|
| Pix3D       | 11.9      | 11.8    |
| AtlasNet    | 12.5      | 12.8    |
| Pixel2Mesh+ | 10.0      | 12.3    |
| Mesh R-CNN  | 10.8      | 13.7    |
| DISN        | 10.4      | 11.7    |
| Ours        | 7.5       | 8.7     |

(Pix3D)

Figure 6: Comparative results. For ShapeNet, we re-train and re-evaluate all methods. For Pix3D, lines 1-2 are duplicated from [19], while lines 3-6 depict our own evaluation using the same protocol. The up and down arrows next to the metrics indicate whether a higher or lower value is better.

Pix3D. In Fig. 7, we provide qualitative UCLID-Net reconstruction results. In Tab. 6(b), we compare it quantitatively against our baselines. We conform to the evaluation protocol of [19] and report the Chamfer-L1 distance (CD-L1) and EMD on point clouds of size 1024. The CD-L1 and EMD values are scaled by $10^2$. UCLID-Net again outperforms all other methods. The only difference with the ShapeNet case is that both DISN and UCLID-Net used the available camera models, whereas none of the other methods leverages camera information.

Figure 7: Reconstructions on Pix3D photographs. From left to right, twice: input, DISN, ours.

4.3 From Single- to Multi-View Reconstruction

A further strength of UCLID-Net is that its internal feature representations make it suitable for multi-view reconstruction. Given depth and feature grids produced by the image encoder from multiple views of the same object, their simple point-wise addition at each scale enables us to combine them in a spatially relevant manner. For input views $a$ and $b$, the encoder produces the feature/depth grid collections $\{(G^F_{1,a}, G^D_{1,a}), \dots, (G^F_{S,a}, G^D_{S,a})\}$ and $\{(G^F_{1,b}, G^D_{1,b}), \dots, (G^F_{S,b}, G^D_{S,b})\}$. In this setting, we feed $\{(G^F_{1,a} + G^F_{1,b}, G^D_{1,a} + G^D_{1,b}), \dots, (G^F_{S,a} + G^F_{S,b}, G^D_{S,a} + G^D_{S,b})\}$ to the shape decoder and let it merge details from both views. For best results, the decoder is fine-tuned to account for the change in magnitude of its inputs. As can be seen in Fig. 8, this delivers better reconstructions than those obtained from each view independently.

Figure 8: Two-view reconstruction. (a,b) Two input images of the same chair from ShapeNet. (c) Reconstruction using only the first one. (d) Reconstruction using only the second one. (e) Improved reconstruction using both images.
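Concretely, merging two views amounts to a few point-wise tensor additions before a single decoder call. The sketch below assumes per-view grid lists like those produced by the hypothetical `backproject_features` helper and `HybridDecoder` module sketched earlier; the fine-tuning step is not shown.

```python
def reconstruct_two_views(decoder, feats_a, depths_a, feats_b, depths_b, uv_samples):
    """Merge per-scale grids from views a and b by point-wise addition, then decode once.

    feats_x[s-1] and depths_x[s-1] hold G^F_s and G^D_s for view x, s = 1..S.
    """
    merged_feats = [fa + fb for fa, fb in zip(feats_a, feats_b)]
    merged_depths = [da + db for da, db in zip(depths_a, depths_b)]
    # The decoder is assumed to have been fine-tuned to the larger input magnitudes.
    return decoder(merged_feats, merged_depths, uv_samples)
```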
| Method | depth | camera | CD-L2 (↓) | EMD (↓) |
|--------|-------|--------|-----------|---------|
| CAR    | inf.  | inf.   | 4.08      | 2.23    |
| NOD    | -     | GT     | 3.97      | 2.20    |
| CAM    | inf.  | GT     | 3.83      | 2.16    |
| CAD    | GT    | GT     | 3.80      | 2.14    |
| ALL    | inf.  | inf.   | 4.03      | 2.23    |

Figure 9: (a) Ablation study: comparative results on a single object category (cars) with inferred (inf.), ground truth (GT), or removed (-) auxiliary information. (b) Failure mode. From left to right: input view, reconstruction seen from the back-right, reconstruction seen from the back-left. The visible armrest is correctly carved. The other one (occluded in the input) is mistakenly reconstructed as solid.

4.4 Ablation Study

To quantify the impact of regressing camera poses and depth maps, we conducted an ablation study on the ShapeNet car category. In Fig. 9(a), we report CD-L2 and EMD for different network configurations. Here, CAR is trained and evaluated on the cars subset, with inferred depth maps and camera poses. NOD is trained and evaluated with ground truth camera poses, but without depth information (neither ground truth nor regressed; we simply remove the depth branch). CAM is trained and evaluated with inferred depth maps, but ground truth camera poses. CAD is trained and evaluated with ground truth camera poses and depth maps. Finally, ALL is trained on 13 object categories with inferred depth maps and cameras, as in all the experiments above, but evaluated on cars only.

Using ground truth annotations for depth and pose improves reconstruction quality. The margin is not significant, which indicates that the regressed poses and depth maps are mostly good enough. Nevertheless, our pipeline is versatile enough to take advantage of additional information, such as a depth map from a laser scanner or an accurate camera model obtained using classic photogrammetry techniques, when it is available. Fully removing the depth branch degrades accuracy. Note also that ALL performs marginally better than CAR: training the network on multiple classes does not degrade performance when evaluated on a single class. In fact, having other categories in the training set increases the overall data volume, which seems to be beneficial.

In Fig. 9(b), we present an interesting failure case. The visible armrest is correctly carved out while the occluded one is reconstructed as being solid. While incorrect, this result indicates that UCLID-Net has the ability to reason locally and does not simply retrieve a shape from the training database, as described in [21].

5 Conclusion

We have shown that building intermediate representations that preserve the Euclidean structure of the 3D objects we try to model is beneficial. It enables us to outperform state-of-the-art approaches to single-view reconstruction. We have also investigated the use of multiple views, for which our representations are also well suited. In future work, we will extend our approach to handle video sequences, for which camera poses can be regressed using either SLAM-type methods or learning-based ones. We expect that the benefits we have observed in the single-view case will carry over and allow full scene reconstruction.

Broader impact

Our work is relevant to a variety of applications. In robotics, autonomous camera-equipped agents, for which a volumetric estimate of the environment can be useful, would benefit from this. In medical applications, it would allow aggregating 2D scans to form 3D models of organs. It could also prove useful in industrial applications, such as in creating 3D designs from 2D sketches. More generally, constructing easily handled differentiable representations of surfaces such as the ones we propose opens the way to assisted design and shape optimization.
As for any method enabling information extraction from images in an automated manner, malicious use is possible, especially raising privacy concerns. Accidents or malevolent use of autonomous agents is also a risk. To reduce accident threats, we encourage the research community to propose explainable models that perform more reconstruction than recognition, the latter regime arguably being more prone to adversarial attacks.

Acknowledgment

This project was supported in part by the Swiss National Science Foundation.

References

[1] J. Bednarík, S. Parashar, E. Gundogdu, M. Salzmann, and P. Fua. Shape reconstruction by learning differentiable surface representations. arXiv preprint, abs/1911.11227, 2019.
[2] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint, 2015.
[3] H. Chen. Single image depth estimation with feature pyramid network. https://github.com/haofengac/MonoDepth-FPN-PyTorch, 2018.
[4] Z. Chen and H. Zhang. Learning implicit fields for generative shape modeling. In Conference on Computer Vision and Pattern Recognition, pages 5939-5948, 2019.
[5] C. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European Conference on Computer Vision, pages 628-644, 2016.
[6] H. Fan, H. Su, and L. Guibas. A point set generation network for 3D object reconstruction from a single image. In Conference on Computer Vision and Pattern Recognition, 2017.
[7] G. Gkioxari, J. Malik, and J. Johnson. Mesh R-CNN. In International Conference on Computer Vision, pages 9785-9795, 2019.
[8] T. Groueix, M. Fisher, V. Kim, B. Russell, and M. Aubry. AtlasNet: A Papier-Mâché approach to learning 3D surface generation. In Conference on Computer Vision and Pattern Recognition, 2018.
[9] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3D object reconstruction. In International Conference on 3D Vision, pages 412-420. IEEE, 2017.
[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[11] K. Iskakov, E. Burkov, V. Lempitsky, and Y. Malkov. Learnable triangulation of human pose. In International Conference on Computer Vision, pages 7718-7727, 2019.
[12] M. Ji, J. Gall, H. Zheng, Y. Liu, and L. Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In International Conference on Computer Vision, pages 2307-2315, 2017.
[13] W. Lorensen and H. Cline. Marching Cubes: A high resolution 3D surface construction algorithm. In ACM SIGGRAPH, 1987.
[14] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger. Occupancy networks: Learning 3D reconstruction in function space. In Conference on Computer Vision and Pattern Recognition, pages 4460-4470, 2019.
[15] C. R. Qi. Autoencoder for point clouds. https://github.com/charlesq34/pointnet-autoencoder, 2018.
[16] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Conference on Medical Image Computing and Computer Assisted Intervention, pages 234-241, 2015.
[17] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetzstein, and M. Zollhofer. DeepVoxels: Learning persistent 3D feature embeddings. In Conference on Computer Vision and Pattern Recognition, pages 2437-2446, 2019.
[18] X. Sun. Pix3D: Dataset and methods for single-image 3D shape modeling. https://github.com/xingyuansun/pix3d, 2018.
[19] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3D: Dataset and methods for single-image 3D shape modeling. In Conference on Computer Vision and Pattern Recognition, pages 2974-2983, 2018.
[20] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3D outputs. In International Conference on Computer Vision, 2017.
[21] M. Tatarchenko, S. Richter, R. Ranftl, Z. Li, V. Koltun, and T. Brox. What do single-view 3D reconstruction networks learn? In Conference on Computer Vision and Pattern Recognition, pages 3405-3414, 2019.
[22] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint, 2016.
[23] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision, 2018.
[24] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In European Conference on Computer Vision, 2018.
[25] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. MarrNet: 3D shape reconstruction via 2.5D sketches. In Advances in Neural Information Processing Systems, pages 540-550, 2017.
[26] Q. Xu, W. Wang, D. Ceylan, R. Mech, and U. Neumann. DISN: Deep implicit surface network for high-quality single-view 3D reconstruction. In Advances in Neural Information Processing Systems, pages 490-500, 2019.
[27] Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Conference on Computer Vision and Pattern Recognition, pages 206-215, 2018.
[28] Y. Yang, C. Feng, Y. Shen, and D. Tian. FoldingNet: Point cloud auto-encoder via deep grid deformation. In Conference on Computer Vision and Pattern Recognition, June 2018.