
Neural Feature Matching in Implicit 3D Representations

Yunlu Chen¹, Basura Fernando², Hakan Bilen³, Thomas Mensink⁴,¹, Efstratios Gavves¹

Recently, neural implicit functions have achieved impressive results for encoding 3D shapes. Conditioning on low-dimensional latent codes generalises a single implicit function to learn a shared representation space for a variety of shapes, with the advantage of smooth interpolation. Although the benefits of the global latent space do not correspond to explicit points at the local level, we propose to track the continuous point trajectory by matching implicit features as the latent code interpolates between shapes. From this we corroborate the hierarchical functionality of deep implicit functions, where early layers map the latent code to fit the coarse shape structure and deeper layers further refine the shape details. Furthermore, the structured representation space of implicit functions enables feature matching to be applied to shape deformation, with the benefit of handling topological and semantic inconsistencies, such as from an armchair to a chair with no arms, without explicit flow functions or manual annotations.

1. Introduction

In recent years, neural implicit functions for 3D representations (Park et al., 2019; Mescheder et al., 2019; Chen & Zhang, 2019; Michalkiewicz et al., 2019) have gained popularity, with benefits including continuous, resolution-free representation and the ability to handle complicated topologies. Moreover, a single implicit function can be generalised to encode a variety of shapes by representing each shape with a low-dimensional latent code that conditions the function. This is advantageous in terms of better reconstruction performance as well as smoother and more reasonable interpolations between shapes (Park et al., 2019; Chen & Zhang, 2019; Chen et al., 2019b) compared to explicit representations (Yang et al., 2018; Fan et al., 2017; Groueix et al., 2018b; Chen et al., 2020).

¹Informatics Institute, University of Amsterdam, the Netherlands ²AI3, IHPC, A*STAR, Singapore ³School of Informatics, University of Edinburgh, Scotland ⁴Google Research, Amsterdam, the Netherlands. Correspondence to: Yunlu Chen <y.chen3@uva.nl>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Figure 1. Feature matching applied to mesh deformation. Appearance-fitting methods overfit to the raw geometry of the target shape and easily fail with inconsistent semantics or topologies; MeshODE (Huang et al., 2020) is shown as an example when deforming an armchair to a chair with no arms. By contrast, feature matching helps to resolve the semantic inconsistency and generates meaningful shapes with no extra annotations.

Intuitively, implicit functions enjoy smooth interpolations because they rely on continuous coordinates as query point inputs. However, due to the nature of the encoding, the learned representations are global and do not correspond to any explicit local points. Thus, the advantage of smooth interpolations cannot be applied to shape manipulation of explicit 3D representations like meshes or CAD models.

We propose to extract point-level transformations from the deep features of implicit functions and their gradients with respect to the coordinate input. We track the continuous point trajectory along which the pointwise features of the implicit function change the least as the latent code is interpolated. These can be considered generalised correspondences, in the sense that the point does not necessarily lie on the interpolated shape surface but can be at any spatial location accepted as input by the implicit function.

The resulting transformed shapes reflect the characteristics of the layer from which we collect features. This provides insights into what each layer learns, which was not possible before. By analysing the hierarchy in standard deep implicit functions, we find that early layers gradually map the latent code to coarse shapes, while deeper layers refine finer details. Mid-layer features are semantically distinctive and encode high-level information. We postulate that the hierarchical nature of implicit functions with latent codes is what facilitates generalisation over various styles of geometry.

The inherent structure in the implicit functions allows us to apply our method to mesh deformation, which requires fitting a source shape to a target shape while preserving the local structure (the edge connectivity) of the source shape. Existing learning-based deformation methods (Jiang et al., 2020a; Huang et al., 2020; Wang et al., 2019) follow the appearance-fitting paradigm, where fitting the deformed source shape to the target shape is enforced as the training objective. Instead, we rely on a pre-trained shape-autoencoding implicit function, which learns to fit single shapes with a standard implicit function. The point transformation is then extracted via feature matching at inference time. As illustrated in Figure 1, feature matching can be favourable even though appearance fitting looks like a more natural choice for such a task. The reason is that, by having to optimally fit different target shapes featuring their own fine details, appearance fitting can be harmful and lead to inconsistent semantics or topologies. By contrast, feature matching does not enforce fitting the target geometry too strictly. We instead match the implicit features carrying high-level information, which helps to resolve the semantic inconsistency and generates meaningful shapes without any external part segmentation annotations.

In this work we make the following contributions.

- We propose a way to extend latent-coded implicit functions so that they can be used for matching boundary points between a pair of examples with minimum feature difference at different scales.
- We find that features at different scales capture hierarchically different characteristics, with earlier layers capturing the coarser shape outlines and later layers encoding finer shape details.
- We propose a novel shape deformation method that matches point features. The proposed method handles challenging inconsistencies in topology and semantics, as our approach benefits from the structured feature space of implicit functions.

2.1. Preliminaries on implicit functions

A neural implicit representation for a single 3D shape is a function $f_\theta : \mathbb{R}^3 \to \mathbb{R}$ which takes as input the 3D coordinate of any query point in Euclidean space, $x \in \mathbb{R}^3$, and predicts a scalar value indicating whether $x$ is inside or outside the shape. A latent-coded implicit function $F_\theta : \mathbb{R}^3 \times \mathbb{R}^k \to \mathbb{R}$ further generalises the function to represent a variety of shapes by conditioning the network on a $k$-dimensional latent code $z \in \mathbb{R}^k$ as shape identity, which is either regressed by an encoder $E_\psi$ or jointly optimised in the auto-decoder framework (Park et al., 2019).

Figure 2. Illustration of the method. Left: for the source shape surface $\mathcal{M}_A$ (blue) and the target shape surface $\mathcal{M}_B$ (purple), we sample a point $x_{t=0} \in \mathcal{M}_A$ and solve the trajectory with equations (3) and (4). Note that $x_{t=1}$ does not necessarily lie on $\mathcal{M}_B$. Right: sampling dense points from $\mathcal{M}_A$ for feature matching returns the transformed shape $T(\mathcal{M}_A)$ (blue dashed curve).

For a shape $\mathcal{M}$ with the latent code $z$ given, $F_\theta(\cdot\,; z)$ is a scalar field, either a signed distance field (SDF) (Park et al., 2019) or a $[0, 1]$ occupancy probability field (Mescheder et al., 2019; Chen & Zhang, 2019), that represents the shape. The explicit shape surface $\mathcal{M} := \{x \in \mathbb{R}^3 \mid F_\theta(x; z) = \tau\}$ is then extracted by marching cubes (Lorensen & Cline, 1987), where $\tau$ is the decision boundary determining whether the query point is inside or outside the shape. Typically $\tau = 0.5$ for occupancy fields and $\tau = 0$ for SDFs.

Architecture. In a simple form, $F_\theta$ is an $L$-layer multilayer perceptron (MLP) network

$$F_\theta(x; z) = W_L \circ \sigma \circ \cdots \circ \sigma \circ W_1 (x \oplus z), \tag{1}$$

where $\sigma$ is a piecewise linear activation function (Arora et al., 2018) (e.g. ReLU), $\oplus$ denotes concatenation, and for $l = 1, \dots, L$, $W_l : \mathbb{R}^{w_{l-1}} \to \mathbb{R}^{w_l}$ is the affine mapping corresponding to the $l$-th layer weights, with $w_l$ the width of the $l$-th layer. Note that $w_0 = k + 3$ for the input layer with the concatenation of $x$ and $z$, and $w_L = 1$ for the output layer. Moreover, for $l = 1, \dots, L-1$, we denote by $\Phi^{(l)}_\theta : \mathbb{R}^3 \times \mathbb{R}^k \to \mathbb{R}^{w_l}$ the first $l$ layers of $F_\theta$,

$$\Phi^{(l)}_\theta(x; z) := \sigma \circ W_l \circ \sigma \circ \cdots \circ \sigma \circ W_1 (x \oplus z). \tag{2}$$

We refer to the output of $\Phi^{(l)}_\theta(x; z)$ (or simply $\Phi_\theta(x; z)$) as the implicit features at $x$ with latent code $z$.
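As a minimal sketch of equations (1) and (2), the following NumPy code builds a small latent-coded MLP and exposes the output of its first $l$ layers as implicit features. The layer widths, the random initialisation and the helper names (`init_mlp`, `implicit_features`, `implicit_function`) are illustrative assumptions rather than the authors' implementation, and a Leaky-ReLU stands in for the piecewise linear activation $\sigma$.

```python
import numpy as np

def leaky_relu(a, slope=0.01):
    # Piecewise linear activation (stand-in for sigma).
    return np.where(a > 0, a, slope * a)

def init_mlp(widths, seed=0):
    # widths = [k + 3, w_1, ..., w_L]; each layer is an affine map (W, b).
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def implicit_features(params, x, z, layer):
    # Phi^(l)(x; z): activations after the first `layer` layers, as in equation (2).
    h = np.concatenate([x, z])
    for W, b in params[:layer]:
        h = leaky_relu(W @ h + b)
    return h

def implicit_function(params, x, z):
    # F(x; z): hidden layers followed by the final affine map, as in equation (1).
    h = implicit_features(params, x, z, len(params) - 1)
    W, b = params[-1]
    return float((W @ h + b)[0])

k = 4                                    # hypothetical latent dimension
params = init_mlp([k + 3, 16, 8, 1])     # hypothetical widths, L = 3
x = np.array([0.1, -0.2, 0.3])
z = np.zeros(k)
features_l2 = implicit_features(params, x, z, layer=2)
value = implicit_function(params, x, z)
```

Here `features_l2` is the $w_2$-dimensional implicit feature at `x`, and `value` is the scalar field value that would be compared against $\tau$.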

Training. Given a set of training examples $\{\mathcal{M}\}$, the network parameters are usually optimised with a supervised regression loss, or with a classification loss as in (Mescheder et al., 2019). With an encoder-decoder design, the loss is $\mathcal{L}(\theta, \psi) = \sum_{\mathcal{M}, x} |F_\theta(x; E_\psi(\mathcal{M})) - s_{\mathcal{M},x}|$, where $z = E_\psi(\mathcal{M})$ and $s_{\mathcal{M},x}$ is the ground-truth signed distance or occupancy probability value. In the remainder of this section we omit $\theta$ for simplicity, as $\theta$ is fixed after training.
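The regression objective can be illustrated on a toy example. The sketch below assumes a hypothetical predictor `predict(x, z)` and uses the signed distance to a sphere as ground truth; it only shows how the summed absolute error is formed and is not the training code used in the paper.

```python
import numpy as np

def l1_implicit_loss(predict, points, latents, targets):
    # Sum over samples of |F(x; z) - s|.
    preds = np.array([predict(x, z) for x, z in zip(points, latents)])
    return float(np.abs(preds - targets).sum())

def sphere_sdf(x, z):
    # Hypothetical stand-in for F: signed distance to a sphere of radius z[0].
    return np.linalg.norm(x) - z[0]

points = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
latents = np.ones((3, 1))                 # same shape code (radius 1) for every sample
targets = np.array([-1.0, 1.0, 0.0])      # ground-truth signed distances
loss_perfect = l1_implicit_loss(sphere_sdf, points, latents, targets)
loss_biased = l1_implicit_loss(lambda x, z: sphere_sdf(x, z) + 0.5,
                               points, latents, targets)
```

A predictor that reproduces the ground truth gives zero loss, while a constant offset of 0.5 on three samples gives a loss of 1.5.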


2.2. Implicit Feature Matching

The literature on unsupervised correspondence in the image domain (Aberman et al., 2018; Choy et al., 2016) considers pixels with similar deep features to be corresponding points. Inspired by this, we track the continuous point trajectory that minimises the change in the pointwise implicit features as the latent code is interpolated, matching features iteratively over small steps of the interpolated latent code.

Given a source shape $A$, a target shape $B$ and their latent codes $z_A$ and $z_B$ associated with a trained implicit function $F$, $\mathcal{M}_A = \{x \in \mathbb{R}^3 \mid F(x; z_A) = \tau\}$ is the source shape surface represented by $F$. Our objective is to find a shape transformation $T$ which can be decomposed into point transformations $T_X$, such that the collection of local transformations $T(\mathcal{M}_A) = \{T_X(x) \mid x \in \mathcal{M}_A\}$ yields a reasonable shape in accordance with shape $B$.

Inspired by the smooth shape interpolations obtained with implicit functions, we linearly interpolate the latent code: $z_t := (1 - t)\, z_A + t\, z_B$ yields the interpolation path of $z$, with the interpolation rate $t \in [0, 1]$. Note that $z_0 = z_A$ and $z_1 = z_B$. At $t = 0$, the initial point coordinate is sampled from the source shape surface, $x_0 \in \mathcal{M}_A$. For a continuous point trajectory starting at $x_0$, and for any infinitesimal time step between $t$ and $t + dt$, we define the displacement $\delta_t = x_{t+dt} - x_t$ required to achieve the minimum feature difference:

$$\delta_t = \operatorname*{argmin}_{\|\delta'_t\| < \sigma} \left\| \Phi(x_t + \delta'_t, z_{t+dt}) - \Phi(x_t, z_t) \right\|. \tag{3}$$

Here $\sigma$ is a small positive value defining a ball-shaped search region for $x_{t+dt}$ around $x_t$. Our assumption is that the point trajectory is smooth and continuous, because $F$ and $\Phi$, modelled as piecewise linear functions, are continuous in the inputs $x$ and $z$ (Arora et al., 2018; Atzmon et al., 2019). Since an analytical solution for the path $x_t$ is intractable, we resort to numerical integration

$$x_t = x_0 + \int_{t'=0}^{t'=t} \delta_{t'} \tag{4}$$

over small displacements $\delta_t$, obtained iteratively from equation (3) with a small time step $dt$, until reaching $t = 1$, which gives the desired point transformation $T_X(x_0) = x_0 + \int_0^1 \delta_t$.

We illustrate the main idea of our method in Figure 2.
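The outer loop of equation (4) can be sketched as a simple forward integration over the latent interpolation path. In the sketch below, `solve_displacement` is a hypothetical callback standing in for the per-step minimisation of equation (3); the toy solver assumes a feature field that depends only on $x - z$, for which the matching displacement is exactly the change in latent code. Both are assumptions made for illustration.

```python
import numpy as np

def track_point(x0, z_a, z_b, solve_displacement, num_steps=50):
    # Accumulate per-step displacements along z_t = (1 - t) z_A + t z_B.
    x = np.asarray(x0, dtype=float).copy()
    ts = np.linspace(0.0, 1.0, num_steps + 1)
    for t_now, t_next in zip(ts[:-1], ts[1:]):
        z_now = (1.0 - t_now) * z_a + t_now * z_b
        z_next = (1.0 - t_next) * z_a + t_next * z_b
        x = x + solve_displacement(x, z_now, z_next)   # one step of equation (4)
    return x

def toy_solver(x, z_now, z_next):
    # Toy assumption: features depend only on (x - z), so keeping the
    # features constant means moving x by the change in z.
    return z_next - z_now

x0 = np.array([0.1, 0.2, 0.3])
z_a = np.zeros(3)
z_b = np.array([1.0, -1.0, 0.5])
x1 = track_point(x0, z_a, z_b, toy_solver)
```

With this toy field the tracked point ends at `x0 + (z_b - z_a)`; with a learned network, `solve_displacement` would instead run a local optimiser at every step.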

Gauss-Newton solution. The numerical integration in equation (4) rests upon an efficient and robust optimisation of equation (3). Viewing equation (3) as an over-constrained ($w_l > 3$) nonlinear equation system $\Phi(x_t + \delta'_t, z_{t+dt}) - \Phi(x_t, z_t) = 0$, we resort to the Gauss-Newton algorithm for a least-squares solution, which takes into account all first-order partial derivatives when computing local updates and effectively performs an approximation of the second-order derivatives. Newton's method has quadratic convergence and empirically requires less hyperparameter tuning than gradient descent.

Figure 3. Regularisation. Left (no regularisation): nearby points on a flat surface have similar features, which can cause mismatches. Right (with regularisation): adding a small penalty on the displacement resolves the issue.

The numerical solution of equation (3) is obtained by performing $N$ Gauss-Newton iterations. We denote the displacement at the $n$-th iteration as $[\delta_t]_n$, initialised at $[\delta_t]_0 = [0, 0, 0]^\top$,

$$[\delta_t]_n = [\delta_t]_{n-1} - (J^\top J)^{-1} J^\top [d\Phi]_{n-1}, \tag{5}$$

where $J = \nabla_{\delta_t} \Phi(x_t + \delta_t, z_{t+dt}) \approx \nabla_x \Phi(x_t; z_t) \in \mathbb{R}^{w_l \times 3}$ is the partial derivative of $\Phi$ with respect to the input coordinates, and $[d\Phi]_{n-1}$ is the feature difference under a small change in the input, evaluated at the current estimate $[\delta_t]_{n-1}$,

$$[d\Phi]_{n-1} = \Phi(x_t + [\delta_t]_{n-1}, z_{t+dt}) - \Phi(x_t, z_t). \tag{6}$$

The optimisation is run for $N$ iterations such that $\delta_t \approx [\delta_t]_N$. We emphasise again that all the above optimisations in equations (3), (4) and (5) are performed per point coordinate $x$.
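The iteration in equations (5) and (6) can be sketched for a single point as follows. This is a minimal illustration under stated assumptions: the Jacobian is estimated by central finite differences at the current estimate (the paper instead uses the network gradient), and `toy_phi` is a hypothetical feature field that depends only on $x - z$, so the exact displacement is known to be the change in latent code.

```python
import numpy as np

def numerical_jacobian(f, x, eps=1e-6):
    # Central-difference Jacobian of f at x, shape (w, 3).
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def gauss_newton_displacement(phi, x_t, z_now, z_next, num_iters=3):
    target = phi(x_t, z_now)                       # features to be matched
    delta = np.zeros(3)                            # [delta_t]_0 = 0
    for _ in range(num_iters):
        d_phi = phi(x_t + delta, z_next) - target  # equation (6)
        J = numerical_jacobian(lambda p: phi(p, z_next), x_t + delta)
        delta = delta - np.linalg.solve(J.T @ J, J.T @ d_phi)  # equation (5)
    return delta

# Hypothetical 6-dimensional feature field with a full-rank linear part.
M = 0.5 * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                    [1, 1, 0], [0, 1, 1], [1, 0, 1]], dtype=float)

def toy_phi(x, z):
    return np.tanh(M @ (x - z))

x_t = np.array([0.1, 0.2, 0.3])
z_now = np.zeros(3)
z_next = np.array([0.02, -0.01, 0.03])
delta = gauss_newton_displacement(toy_phi, x_t, z_now, z_next)
```

For this toy field the recovered `delta` agrees with `z_next - z_now` after three iterations, which matches the small iteration count reported in the implementation details.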

Regularisation on displacement. Because the implicit feature field $\Phi(\cdot\,; z)$ is highly nonconvex, especially in deeper layers, the gradient fields $\nabla_x \Phi(\cdot\,; z)$ change erratically under small disturbances in the input $x$. This happens more often on flat surfaces, where adjacent points share similar features and are therefore more sensitive to noise and drift easily during feature matching, as illustrated in Figure 3.

To address this, we propose to add a regularisation penalty on the norm of the displacement $\delta$. In our iterative optimiser, this boils down to adding a regularisation term that minimises $\|[\delta_t]_n\|$, given $[\delta_t]_0 = 0$. Note the difference from the Levenberg-Marquardt algorithm, which minimises $\|[\delta_t]_n - [\delta_t]_{n-1}\|$.

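In the iterative scheme, equation (5) is in the form of the generalised Newton's iteration (Ben-Israel, 1966), where we understand $([\delta_t]_n - [\delta_t]_{n-1})$ as the actual target variable, with equation (5) rewritten as $[\delta_t]_n - [\delta_t]_{n-1} = -(J^\top J)^{-1} J^\top [d\Phi]_{n-1}$. Therefore, minimising $\|[\delta_t]_n\|$ is equivalent to optimising for $([\delta_t]_n - [\delta_t]_{n-1}) \to (-[\delta_t]_{n-1})$. We then solve both objectives within the same iteration by modifying equation (5) with

$$J \leftarrow J \oplus \lambda\, \mathrm{diag}(1, 1, 1), \quad \text{and} \tag{7}$$

$$[d\Phi]_{n-1} \leftarrow [d\Phi]_{n-1} \oplus \lambda\, [\delta_t]_{n-1}, \tag{8}$$

where $\oplus$ stacks the extra rows below the original ones, $\lambda$ is the weighting factor for the regularisation, and $\mathrm{diag}(1, 1, 1)$ is the Jacobian of the regularisation term with respect to the displacement. Because the update in equation (5) subtracts the least-squares solution, appending $\lambda [\delta_t]_{n-1}$ to the residual pulls $[\delta_t]_n - [\delta_t]_{n-1}$ towards $-[\delta_t]_{n-1}$, i.e. $[\delta_t]_n$ towards zero.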
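The augmented least-squares system can be sketched as below. The sign convention is an assumption: the update is taken as the least-squares solution of $J \Delta = -[d\Phi]$, so the stacked rows $\lambda I \Delta = -\lambda [\delta_t]_{n-1}$ penalise the norm of the new displacement. The function name and the degenerate "flat surface" Jacobian are hypothetical and chosen only to show the effect.

```python
import numpy as np

def regularised_gauss_newton_step(J, d_phi, delta_prev, lam=0.01):
    # Minimises ||J u + d_phi||^2 + lam^2 ||u + delta_prev||^2 over the update u,
    # i.e. the feature residual plus a penalty on the new displacement norm.
    J_aug = np.vstack([J, lam * np.eye(3)])
    r_aug = np.concatenate([d_phi, lam * delta_prev])
    update, *_ = np.linalg.lstsq(J_aug, -r_aug, rcond=None)
    return delta_prev + update

# Toy flat-surface case: features do not change along the third axis,
# so J^T J is singular and the unregularised update is not unique.
J = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
d_phi = np.zeros(4)                       # features already match
delta_prev = np.array([0.0, 0.0, 0.5])    # displacement has drifted along the flat direction
delta_new = regularised_gauss_newton_step(J, d_phi, delta_prev)
```

In this example the feature residual gives no information along the flat direction, and the penalty removes the drift, so `delta_new` returns to zero.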

2.3. Application to mesh deformation

Implicit feature matching can be applied to deforming a source mesh to a target shape. With the source mesh given as a set $(V, E, F)$ of vertices $V$ and edges $E$ or faces $F$, we transform the vertices $V$ with feature matching but not the edges or faces, yielding the deformed mesh $(T(V), E, F)$.
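In code, this amounts to mapping every vertex through the point transformation while reusing the connectivity arrays untouched. The sketch below uses a hypothetical `point_transform` callback (a simple axis stretch here) in place of the feature-matching trajectory.

```python
import numpy as np

def deform_mesh(vertices, faces, point_transform):
    # Apply a per-point transform to the vertices; the faces are returned unchanged.
    new_vertices = np.stack([point_transform(v) for v in vertices])
    return new_vertices, faces

# Toy tetrahedron.
vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

def stretch(v):
    # Hypothetical stand-in for T_X: stretch along the y-axis.
    return v * np.array([1.0, 2.0, 1.0])

new_vertices, new_faces = deform_mesh(vertices, faces, stretch)
```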

Freedom from self-intersections. An important advantage for mesh deformation is that our method naturally prevents self-intersections. A self-intersection would imply that at least two point trajectories $u_t$ and $v_t$ intersect at some intermediate time $t^*$, at which $u_{t^*} = v_{t^*}$ occupy the same spatial location. Since the displacement in equation (3) depends only on the current position and the latent code, the two trajectories coincide from then on, i.e. $u_t = v_t$ holds for any $t > t^*$. So in the worst case some mesh vertices may collapse to the same position, but self-intersections are naturally avoided.

We clarify that self-intersections may still occur in practice due to discrete processing. However, when the hyperparameters are properly chosen, as in our experiments, self-intersections hardly happen with discrete tracking even in the presence of noise, probably due to the implicit regularisation (e.g. spectral bias (Rahaman et al., 2019)) of the ReLU-MLP network with coordinate inputs.

3. Related Work

Neural implicit 3D representations. Deep implicit functions for 3D shapes have been shown to be highly effective for reconstruction (Park et al., 2019; Mescheder et al., 2019; Chen & Zhang, 2019; Michalkiewicz et al., 2019). Compared to other methods working on explicit representations (Wu et al., 2016; Fan et al., 2017; Groueix et al., 2018b; Yang et al., 2019; 2018), they take 3D coordinates as input to the network, which enables resolution-free representation for all topologies. The applications include reconstructing shapes from single images (Saito et al., 2019; Xu et al., 2019; 2020), raw point clouds (Atzmon et al., 2019; Atzmon & Lipman, 2020a;b), 4D reconstruction (Niemeyer et al., 2019), and view synthesis (Sitzmann et al., 2019; Mildenhall et al., 2020). Some recent advances incorporate structural and hierarchical designs (Genova et al., 2019; Jiang et al., 2020b; Chibane et al., 2020; Peng et al., 2020) into implicit function models in order to be aware of information from the local neighbourhood (Chen et al., 2019a) of the query point. We show that an inherent hierarchy emerges in a simple latent-coded implicit function.

Learning-based shape deformation. Deformation between shapes with varying topology is a challenging problem. Recent learning-based solutions include learning the offset of each mesh vertex (Wang et al., 2019), free-form deformation grids (Kurenkov et al., 2018) and deformation cages (Yifan et al., 2020). Current state-of-the-art methods use invertible flows (Rezende & Mohamed, 2015; Kingma & Dhariwal, 2018; Dinh et al., 2014; 2016; Chen et al., 2018) to model the shape deformation field (Huang et al., 2020; Jiang et al., 2020a), which enables bijective transformation and is free of intersections by nature.

All of the above methods follow the appearance-fitting paradigm, which can be harmful with inconsistent topology or semantics unless the segmentation ground truth is available, as in (Huang et al., 2020; Gao et al., 2019). By contrast, we rely on the generalisable latent space and implicit features to resolve the problem with no extra annotations.

4. Experiments and Evaluations

Implementation details. For the implicit function architecture, we follow IM-Net (Chen & Zhang, 2019) and use a 7-layer MLP with Leaky-ReLU activations, which takes as input a 3D coordinate and a 256-dimensional latent code and outputs an occupancy probability. The latent code is obtained from a voxel encoder. We train one network per shape category with the same architecture, following the coarse-to-fine progressive training scheme of (Chen & Zhang, 2019).¹

We use objects from the ShapeNet dataset (Chang et al., 2015). The optimisation uses the following settings: a time step of $dt = 0.02$, for a total of 50 intermediate steps of latent code interpolation; $N = 3$ Newton iterations at each time step; and a regularisation factor $\lambda = 0.01$. See the supplementary material for more details.

4.1. Analysis and ablation

We first study the effect of matching features from different layers. This gives us insights into, and a better understanding of, the per-layer features in the implicit function.

We densely sample points from the reconstructed shape surface represented by implicit functions and solve the point trajectory of the set of surface points. Then we observe what the set of transformed points forms.

Matching features in different layers. We test the matching performance of implicit features for each of the six intermediate layers in the network, as in the example in Figure 4. For deeper layers, the transformation gets closer to the target. For early to mid layers, the fine geometric details of the source shape are preserved. In contrast, when deeper-layer features are used, the model fits the target shape in more detail. Here some points fail to reach the target surface, because the geometry varies so much that the model perceives that the corresponding point does not exist on the target at the detail level.

¹We use the improved implementation from the authors at https://github.com/czq142857/IM-NET-pytorch, which has some subtle differences from the original paper.

Figure 4. Implicit feature matching applied to different layers. The source shape is transformed to match the target shape, shown from two viewpoints, one per row. This is not interpolation over time. We observe that matching with implicit features in earlier layers focuses on fitting the outline, while matching with implicit features in later layers focuses on fine-grained details. See for example how from layer 1 to layer 4 the model deforms to the coarse geometry of the target in a rigid way, while in layer 6 the model matches almost the exact surface of the target. Empirically, matching features from early and mid-level layers does not break the topology or local details, which makes it applicable to mesh deformation.

Figure 5. t-SNE visualisation of corresponding point features at different levels, from level 1 (left) to level 6 (right). Colours represent different groups of corresponding points across 100 chairs. We observe that, from early to mid layers, implicit features become more distinctive. In the last layers, however, which are close to the final output that maps all surface points to the same value ($\tau$), surface point features become more uniformly distributed. We conclude that the mid-level layers encode more high-level semantics, while the last layers, focusing on fitting local shape details, lose global information.

[Figure 6 panels: (a) Distance to source, (b) Distance to target; x-axis: interpolation rate (t); one curve per layer, layers 1 to 6.]

Figure 6. Chamfer distance to the source and target shapes when transforming shapes using different layers for matching features, averaged over transformations between 100 random pairs of chairs. We observe that feature matching in the earlier layers stays close to the source, while in deeper layers it matches the target better. This is consistent with the observations from Figure 4.

[Figure 7 plot: x-axis: interpolation rate (t); y-axis: preserved edge rate (%); one curve per layer, layers 1 to 6.]

Figure 7. Edge preserveness. A higher value means that fewer edges of the source mesh are broken, which implies that the finer detail is better preserved. Averaged over transformations between 100 random pairs of chairs. Earlier layers (1-3) preserve the detail of the source shape, with 100% of edges preserved, while the details change considerably in deeper layers. This is consistent with the observations from Figure 4.

[Figure 8 panels: direct matching vs. matching with interpolation, each shown for layer 3 and layer 6.]

Figure 8. Necessity of matching with interpolation. Direct matching between source and target implicit fields, with no intermediate steps from latent code interpolation, fails.

[Figure 9 panels: no regularisation vs. with regularisation, each shown for layer 3 and layer 6.]

Figure 9. Effect of regularisation. Without regularisation, some points (especially those on planar regions) do not move to the correct position. Since adjacent points on a planar region are expected to have similar features, $\delta_t$ can easily drift until the points reach a non-flat part of the shape, where the gradients of the point features are stronger and therefore less influenced by noise. Layer 6 features suffer more from the absence of regularisation because the gradient field is more complex and non-smooth.

We further corroborate the results quantitatively using a large number of pairs. Figure 6 shows the Chamfer distance from the transformed point set to the source and the target shapes. The transformed shapes from deeper layers are closer to the target and farther from the source. In Figure 7, we evaluate how well the finer detail is preserved by measuring the edge preserveness, using mesh data with known edges $\epsilon \in E$ as the source shape (see Section 2.3). $T_E(\epsilon)$ is the transformed edge of source edge $\epsilon$. Edge preserveness is defined as the percentage (%) of $\epsilon \in E$ subject to $\frac{1}{5}|\epsilon| \le |T_E(\epsilon)| \le 5|\epsilon|$, i.e., edges whose length becomes neither more than five times longer nor more than five times shorter. From Figure 7 we conclude that the transformed shape from early and mid layers preserves the finer details of the source shape well, while in deeper layers the local connectivity details change significantly. These conclusions are all consistent with the observations from Figure 4.
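The edge preserveness measure described above is straightforward to compute once the source and transformed vertex positions are available. The following sketch assumes edges are given as pairs of vertex indices; the function name and the toy data are illustrative.

```python
import numpy as np

def edge_preserveness(vertices_src, vertices_def, edges, ratio=5.0):
    # Percentage of edges whose transformed length stays within [1/ratio, ratio]
    # times the source length.
    i, j = edges[:, 0], edges[:, 1]
    len_src = np.linalg.norm(vertices_src[i] - vertices_src[j], axis=1)
    len_def = np.linalg.norm(vertices_def[i] - vertices_def[j], axis=1)
    preserved = (len_def >= len_src / ratio) & (len_def <= ratio * len_src)
    return 100.0 * preserved.mean()

vertices_src = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
vertices_def = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [11.0, 0.0, 0.0]])
edges = np.array([[0, 1], [1, 2]])
rate = edge_preserveness(vertices_src, vertices_def, edges)
```

In the toy data the first edge keeps its length and the second becomes ten times longer, so one of the two edges counts as preserved.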

Hierarchy in implicit function. We interpret Figure 4 as reflecting the hierarchy of the layers in the network. The early layers gradually learn the outline of the rough shape geometry, while the final layers learn the finer details. The observation is made in the context of shape transformations, but we expect it to hold for the reconstruction of single shapes as well.

We claim that this observation is significant for the MLP architecture used for implicit functions, compared with the literature that shows emerging hierarchies in ConvNets for the image domain (Zeiler & Fergus, 2014; Bau et al., 2017). ConvNets are inherently more structured owing to local convolution operations, while similar structures are less expected in MLPs, which are more general universal function approximators.

Figure 10. Effectiveness of transformation with features from combined layers. Deformations are shown from matching features from layer 3 ((c), uni-layer) and from layer 3 and layer 6 ((d), multi-layer). Using features from multiple levels refines the local geometry towards the target (e.g. the legs become thinner) without breaking the topology of the source shape.

Mid-level features are the most distinctive. To better understand the difference between the features from different layers in the network, we check whether the network can distinguish the features from a set of different corresponding points obtained by feature matching. We take eight points from one chair with farthest point sampling (FPS). These points are matched by our method to 100 randomly selected chairs, taking the closest point on the target surface, and we mark the points matched from each of the eight source points as a group of correspondents. We visually evaluate whether clusters are formed using t-SNE (Maaten & Hinton, 2008), showing the points from each group of correspondents in a different colour in Figure 5. We observe that the first-layer feature, as a linear projection of the input, can hardly learn the discrimination, while the mid-level features show good clustering. In the last hidden layer the corresponding points mix, which implies that the features from this layer cannot differentiate between the groups. The reason is that the last-layer features are to be mapped to the output implicit field, which takes the same value $\tau$ at all surface points. Thus, the final layers focus on the local detail and lose global information about the shape.
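The selection of well-spread source points by farthest point sampling can be sketched as follows. This is a generic greedy FPS over a point array, with the starting index as an assumption; it is not taken from the authors' code.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    # Greedily pick k indices, each maximising the distance to the points chosen so far.
    chosen = [start]
    dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                   [2.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
picked = farthest_point_sampling(points, k=3)
```

Starting from index 0, the sampler first jumps to the far end of the toy point set and then fills in the point farthest from both selected ends. Calling it with `k=8` on densely sampled surface points would give eight spread-out source points as in the experiment.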

Our observations explain why some implicit methods require a shallow network design with restricted expressivity in order to obtain either part-level (Chen et al., 2019b) or point-level (Liu & Liu, 2020) correspondence among shapes with varying topology, while their achievements do not generalise to more standard deep architectures. With a shallow architecture that limits the expressive power, the output-layer feature is less likely to lose too much of the global information.

Direct matching vs. matching with interpolation. As an alternative to matching features iteratively with an interpolated latent code, one can also attempt direct matching between the source and target latent codes without iterations. As shown in Figure 8, points transformed by direct matching fail to form a reasonable shape. Using layer 3, all points move to shape edges, which are expected to have more significant feature gradients. Using layer 6, the points become random, which is consistent with the above analysis that deep-layer features lose global information. We hypothesise that the reason for this poor behaviour could be the existence of a smooth non-Euclidean data manifold in the shared latent-interpolated feature space encoded by implicit features. Direct matching would fail in the case of such a non-Euclidean manifold, since the feature difference is measured using distance in Euclidean space. However, more investigation would be needed to support this claim.

[Figure 11 columns: Ours (feature matching); ShapeFlow, MeshODE and Neural Cage (appearance-fitting). Rows annotated by fitting paradigm and rigidity constraint.]

Figure 11. Deforming from chairs with arms (source, blue) to chairs with no arms (target, green). Appearance-fitting methods fail by overfitting to the target shapes with no arms. Adding rigidity constraints eases the problem but is not a fundamental solution. By contrast, our method resolves the problem without enforcing rigidity, benefiting from implicit features.

In Figure 9 we show qualitatively the ablation experiment of transformation via feature matching without the regularisation. We show layer 3 and layer 6 as examples, with the same source and target shapes as in Figure 4. We see that without regularisation, some points do not end up at the ideal position to compose the shape, especially on planar regions such as the seat and back. Layer 6 features suffer more from the absence of regularisation because the gradient field is more complex and non-smooth. Those points drift more easily towards the edges of the shape, where the feature gradients are expected to be more significant. This is probably because the change of latent code is not ideally infinitesimal, so the resulting $\delta_t$ is not accurate enough. Since adjacent points on planar regions are expected to have similar features, $\delta_t$ can easily drift.

4.2. Mesh Deformation

Based on the above observations and analysis, matching mid-level features is able to fit the target shape outline without breaking the local detail of the source. In addition, mid-level layers encode the most high-level global information. We therefore apply feature matching to mesh deformation, which requires preserving the local edge connectivity.

Choice of layers to match. We mostly rely on layer 3 features for matching each vertex of the source mesh. However, we are not restricted to using only single-level features. Empirically we find that using mid-layer features jointly with finer-level features, downweighted by a small factor $0 < \eta < 1$, helps to fit the local geometry of the target shape better, as illustrated in Figure 10. We take a combination of layer 3 and layer 6 features where the layer 6 feature is weighted by $\eta = 0.1$.
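One plausible way to realise this weighting, assumed here for illustration since the text does not spell out the mechanism, is to concatenate the mid-layer features with the deeper-layer features scaled by $\eta$ and match the combined vector. The two feature callbacks below are hypothetical stand-ins for $\Phi^{(3)}$ and $\Phi^{(6)}$.

```python
import numpy as np

def combined_features(phi_mid, phi_deep, eta=0.1):
    # Returns a feature function that stacks mid-layer features with
    # down-weighted deeper-layer features (assumed combination rule).
    def phi(x, z):
        return np.concatenate([phi_mid(x, z), eta * phi_deep(x, z)])
    return phi

def toy_mid(x, z):
    # Hypothetical mid-layer features.
    return x - z

def toy_deep(x, z):
    # Hypothetical deeper-layer features with a larger magnitude.
    return 10.0 * (x - z)

phi = combined_features(toy_mid, toy_deep, eta=0.1)
feat = phi(np.array([1.0, 2.0, 3.0]), np.zeros(3))
```

Under a least-squares matching objective, scaling a feature block by $\eta$ scales its contribution to the squared residual by $\eta^2$, so the deeper layer acts as a mild refinement term.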

Qualitative results and analysis. We compare the quality of mesh deformation to ShapeFlow (Jiang et al., 2020a), MeshODE (Huang et al., 2020) and Neural Cage (Yifan et al., 2020). We focus on the transition from an armchair to a chair with no arms as a challenging scenario, as in Figure 11. We conclude that our method produces much more plausible results, while the other methods, all following the appearance-fitting paradigm, suffer from some degree of overfitting to the target shape.

Table 1. Evaluation between the deformed shape and the target. Each cell reports CD (×0.001) / EMD (×0.01). Note that the whole-shape measurements are biased towards overfitting to the target, e.g. distorting the chair arms in Figure 11 is considered better performance; they are hence tailored for appearance-fitting methods and not always a good measure. Our method is consistently better on part-level metrics, indicating better handling of semantic inconsistency.

| Method | chair (whole) | chair (part) | airplane (whole) | airplane (part) | table (whole) | table (part) |
|---|---|---|---|---|---|---|
| Shape Flow (Jiang et al., 2020a) | 1.365 / 6.750 | 4.285 / 5.794 | 0.378 / 5.194 | 5.551 / 5.229 | - / - | - / - |
| Mesh ODE (Huang et al., 2020) | 1.187 / 7.281 | 4.148 / 5.315 | - / - | - / - | 2.564 / 8.298 | 14.859 / 7.578 |
| Neural Cage (Yifan et al., 2020) | 4.372 / 8.563 | 6.477 / 6.319 | - / - | - / - | 11.367 / 11.116 | 21.676 / 9.378 |
| This paper | 1.744 / 7.143 | 3.772 / 3.256 | 0.935 / 5.601 | 5.458 / 4.193 | 4.998 / 8.387 | 14.748 / 4.174 |

Figure 12. Examples of transforming chairs in a variety of styles from source (blue) to target (green). Our method is able to handle all variations meaningfully.

Though not a fundamental solution, appearance-fitting methods mitigate the problem of overfitting to the target by imposing explicit constraints on rigidity (Sorkine & Alexa), measuring how much the source edges are preserved by the deformed mesh. Both Shape Flow and Mesh ODE learn flexible deformable flow fields, yet introduce unnatural distortions. The difference is that Mesh ODE constrains deformation with a rigidity loss, such that the local connectivity is preserved better than in Shape Flow and the distortion is less heavy. In Neural Cage the rigidity constraint is even stronger by design, where the cage structure is used to bound an area of vertices, and deformation is solved per cage rather than per vertex by morphing the vertices accordingly. However, Neural Cage still suffers occasionally. As seen in Figure 11, the method bends down the arms and even the seats as a whole, which indicates that the rigidity constraints do not suffice. By contrast, our method proposes a more fundamental solution, which relies on the hierarchy and the high-level information encoded by mid-level implicit features. We resolve the overfitting to such semantic or topological inconsistencies without explicit rigidity constraints.
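To make the notion of a rigidity constraint that measures edge preservation more concrete, the sketch below computes a simple edge-length distortion between a source mesh and its deformed version. This is only an illustrative proxy: it is not the as-rigid-as-possible energy of Sorkine & Alexa, nor the loss used in Mesh ODE or Neural Cage, and the function name `edge_length_distortion` is hypothetical.

```python
import math

def edge_length_distortion(src_vertices, def_vertices, edges):
    """Mean squared change in edge length between source and deformed mesh.

    src_vertices, def_vertices: lists of 3D points with matching indices.
    edges: list of (i, j) vertex index pairs.
    """
    total = 0.0
    for i, j in edges:
        len_src = math.dist(src_vertices[i], src_vertices[j])
        len_def = math.dist(def_vertices[i], def_vertices[j])
        total += (len_def - len_src) ** 2
    return total / len(edges)

# Toy triangle: a rigid translation leaves all edge lengths unchanged,
# whereas a uniform scaling by 2 changes every edge length.
triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
edges = [(0, 1), (1, 2), (2, 0)]
translated = [(x + 5.0, y + 5.0, z + 5.0) for x, y, z in triangle]
scaled = [(2.0 * x, 2.0 * y, 2.0 * z) for x, y, z in triangle]

rigid_cost = edge_length_distortion(triangle, translated, edges)
scaled_cost = edge_length_distortion(triangle, scaled, edges)
```

A penalty of this kind discourages stretching but carries no semantic information, which is consistent with the observation above that rigidity constraints alone do not prevent an arm from being bent towards the target.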

More results are available in Figure 12, where we see that our method handles different styles of shapes. Figure 13 shows a few examples on three other categories: airplanes, tables and cars. In Figure 14, we show some examples of failure cases. We also include some results on the continuous interpolation of the deformation process in the supplementary material.

Figure 13. Deformations of car, table and airplane. From left to right: source shape, target to source, source to target, and target shape. Our method generalises to shapes from different categories.

Figure 14. Failure cases. From source (blue) to target (green). Our method is able to handle all variations meaningfully. Left: two back cylinder legs are deformed to flakes to match the thin-plate table stand on the target; Right: the seating part is not well-recognized and stays at arms height on the target.

Quantitative evaluations. We evaluate the bidirectional Chamfer distance (CD) and Earth Mover's distance (EMD) between the transformed shape T(A) and the target B to quantify the matching quality. Both metrics measure the matching quality of the shapes as a whole and are therefore biased towards the aforementioned overfitting issue, e.g. the unnatural distortions in Figure 11 are regarded as better performance. See the supplementary material for more discussion on the limitations of the global metrics. For this reason, we also evaluate the part-averaged distances, which better reflect the ability to handle semantic inconsistency. Formally,

$$\mathrm{CD}_{\mathrm{part}}(T(A), B) = \frac{1}{N_c} \sum_{c=1}^{N_c} \mathrm{CD}\big(T(A_c), B_c\big),$$

where $A_c \subset A$ is the part segment of shape $A$ that belongs to the $c$-th of all $N_c$ part categories, and likewise for $B_c$. $\mathrm{EMD}_{\mathrm{part}}$ is defined similarly. We evaluate on three representative shape categories: chair, table and airplane. For each category, we randomly select 500 pairs of source and target shapes from the test split of Shape Net Part (Yi et al., 2016) with the data preprocessed by Chen et al. (2019b). The results are in Table 1.
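The part-averaged Chamfer distance defined above can be sketched as follows for small labelled point sets. Several details are assumptions of this sketch rather than statements from the paper: the Chamfer distance is taken as the sum of the two directional means of squared nearest-neighbour distances, part categories missing from either shape are skipped, and the function names are hypothetical.

```python
def _sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def chamfer_distance(P, Q):
    """Bidirectional Chamfer distance (assumed convention: sum of the two
    directional means of squared nearest-neighbour distances)."""
    p_to_q = sum(min(_sq_dist(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(_sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p

def chamfer_part(points_a, labels_a, points_b, labels_b):
    """Average the Chamfer distance over part categories.

    points_a are the transformed source points T(A), labelled with the part
    labels of the source; points_b / labels_b describe the target B.
    Categories absent from either shape are skipped (an assumption).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if not shared:
        raise ValueError("no part category is shared by the two shapes")
    total = 0.0
    for c in shared:
        part_a = [p for p, l in zip(points_a, labels_a) if l == c]
        part_b = [p for p, l in zip(points_b, labels_b) if l == c]
        total += chamfer_distance(part_a, part_b)
    return total / len(shared)

# Toy example: identical geometry, but the part labels are swapped.
pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
whole = chamfer_distance(pts, pts)               # blind to the label swap
part = chamfer_part(pts, [0, 1], pts, [1, 0])    # penalises the label swap
```

The toy example mirrors the argument in the text: the whole-shape distance is zero because the two point sets coincide, while the part-averaged distance is positive because each part lands on the wrong region of the target.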

Our method is consistently better at handling semantic part consistency. Shape Flow and Mesh ODE are better at matching the global shapes, although, as motivated above, they often overfit. Neural Cage is not competitive on either metric, as over-constraining trades shape flexibility for higher rigidity.

4.3. Evaluation of point correspondence

We further evaluate the correspondences obtained by feature matching. This is achieved by matching a source point to the target and finding the nearest point on the target shape surface. The goal is not to achieve state-of-the-art performance, since correspondence is not our main focus. Rather, we want to show that implicit features inherently encode the correspondences across shapes to some extent, without explicit training or specialized network designs.

Setup and baseline method. We compare with Occupancy Flow (Occ Flow) (Niemeyer et al., 2019). Occ Flow reconstructs 4D human motion from the D-FAUST dataset (Bogo et al., 2017), which contains 3D human motions such as punching and jumping jacks, preprocessed into sequences of 17 (3D) frames. Occ Flow consists of two networks: an implicit 3D function, the occupancy network (Occ Net) (Mescheder et al., 2019), for encoding the shape at the initial frame, and a velocity network to predict the flow, i.e. the correspondences over time, similar to that in Shape Flow or Mesh ODE. By contrast, we extract correspondences only from the 3D implicit function of Occ Net without a specific flow network, see Figure 15.

We use the official code release of Occ Flow (Niemeyer et al., 2019) for evaluating the ℓ2 loss of the correspondence as well as for the implementation of Occ Net. We use no correspondence ground truth, nor do we need the velocity network, which means we need only half of the components of Occ Flow. Following Niemeyer et al. (2019), a dense point cloud is used as input for the source shape (the first frame) and the target shape (the last frame) to discover correspondences between them. We match the implicit features from Occ Net and find the nearest point on the shape for correspondence. See the supplementary material for more implementation details.
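The last two steps of this protocol, snapping a matched location to the nearest point of the target shape and scoring the result with an ℓ2 error, can be sketched as below. The feature matching itself is not reproduced here; the matched locations are taken as given. Treating the ℓ2 loss as the mean Euclidean distance to the ground-truth corresponding points is an assumption of this sketch, and the function names are hypothetical rather than taken from the Occ Flow code.

```python
import math

def nearest_point(query, cloud):
    """Return the point of `cloud` closest to `query` in Euclidean distance."""
    return min(cloud, key=lambda p: math.dist(query, p))

def snap_to_surface(matched, target_cloud):
    """Project each matched location onto the sampled target surface."""
    return [nearest_point(m, target_cloud) for m in matched]

def correspondence_l2(predicted, ground_truth):
    """Mean Euclidean distance between predicted and true correspondences
    (assumed form of the l2 correspondence error)."""
    errors = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(errors) / len(errors)

# Toy example with three target surface samples and two source points.
target_cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
matched = [(0.1, 0.0, 0.0), (0.9, 0.1, 0.0)]        # output of feature matching
ground_truth = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0)]   # used for evaluation only

predicted = snap_to_surface(matched, target_cloud)
error = correspondence_l2(predicted, ground_truth)
```

Note that the ground-truth correspondences enter only in the final scoring step, in line with the statement above that no correspondence ground truth is used to obtain the matches.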

Results. We show the evaluation results in Table 2. When no supervision is available, the proposed method performs favourably, with an error of 0.169 against 0.374 for the nearest-neighbour baseline. We are competitive with Occ Flow (0.167), which is trained with supervision, although correspondence is not our focus. Neither our method nor Occ Flow can match the performance of 3D-Coded (Groueix et al., 2018a), a state-of-the-art method designed specifically for the task and trained with a large amount of data and heavy augmentation. We conclude that features in a standard implicit function encode correspondence, which can be extracted with feature matching even without a dedicated architecture design.

Table 2. Correspondences. The proposed method recovers inherent correspondence from the implicit features without explicit flow or correspondence functions or supervision.

| Method | Supervised | Cor. ℓ2 |
|---|---|---|
| Nearest Neighbour | no | 0.374 |
| Ours | no | 0.169 |
| Occ Flow (Niemeyer et al., 2019) | yes | 0.167 |
| 3D-Coded (Groueix et al., 2018a) | yes | 0.096 |

Figure 15. Joint reconstruction and correspondence of human motion sequence. Correspondence (in colour) is usually considered unavailable with a single implicit network Occ Net for 3D shape, as noted by Niemeyer et al. (2019). Our method extracts correspondence from matching features.

5. Conclusion

In this work, we propose to extend deep implicit functions, which normally give global representations, so that they are amenable to local feature matching. To do so, we start from a self-ﬁtting learning paradigm for learning good shared representation space, upon which we can condition implicit functions. Then, to achieve local feature matching, we propose generalized correspondences by casting them as the trajectories from one shape to another where we have minimum change in the feature with interpolated latent code. By introducing locality to implicit functions, we can analyze what each layer in the implicit function learns, with earlier layers encoding coarse shapes and higher layers encoding ﬁner details. What is more, locality enables shape deformations, where the resulting shapes can handle complex topologies and semantics inconsistencies.

Acknowledgement This research was supported in part by the SAVI/MediFor project and the EPSRC programme grant Visual AI EP/T028572/1. We thank the anonymous reviewers for helpful comments and suggestions.

References

Aberman, K., Liao, J., Shi, M., Lischinski, D., Chen, B., and Cohen-Or, D. Neural best-buddies: Sparse cross-domain correspondence. ACM Transactions on Graphics (TOG), 37(4):1-14, 2018.

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.

Atzmon, M. and Lipman, Y. Sal: Sign agnostic learning of shapes from raw data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2565-2574, 2020a.

Atzmon, M. and Lipman, Y. Sal++: Sign agnostic learning with derivatives. arXiv preprint arXiv:2006.05400, 2020b.

Atzmon, M., Haim, N., Yariv, L., Israelov, O., Maron, H., and Lipman, Y. Controlling neural level sets. arXiv preprint arXiv:1905.11911, 2019.

Bau, D., Zhou, B., Khosla, A., Oliva, A., and Torralba, A. Network dissection: Quantifying interpretability of deep visual representations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541-6549, 2017.

Ben-Israel, A. A newton-raphson method for the solution of systems of equations. Journal of Mathematical analysis and applications, 15(2):243-252, 1966.

Bogo, F., Romero, J., Pons-Moll, G., and Black, M. J. Dynamic faust: Registering human bodies in motion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6233-6242, 2017.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.

Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571-6583, 2018.

Chen, Y., Mensink, T. E. J., and Gavves, E. 3d neighborhood convolution: Learning depth-aware features for rgb-d and rgb semantic segmentation. In International Conference on 3D Vision, 2019a.

Chen, Y., Hu, V. T., Gavves, E., Mensink, T., Mettes, P., Yang, P., and Snoek, C. G. Pointmixup: Augmentation for point clouds. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

Chen, Z. and Zhang, H. Learning implicit fields for generative shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5939-5948, 2019.

Chen, Z., Yin, K., Fisher, M., Chaudhuri, S., and Zhang, H. Bae-net: Branched autoencoder for shape cosegmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8490-8499, 2019b.

Chibane, J., Alldieck, T., and Pons-Moll, G. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6970-6981, 2020.

Choy, C. B., Gwak, J., Savarese, S., and Chandraker, M. Universal correspondence network. arXiv preprint arXiv:1606.03558, 2016.

Dinh, L., Krueger, D., and Bengio, Y. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Fan, H., Su, H., and Guibas, L. J. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 605-613, 2017.

Gao, L., Yang, J., Wu, T., Yuan, Y.-J., Fu, H., Lai, Y.-K., and Zhang, H. Sdm-net: Deep generative network for structured deformable mesh. ACM Transactions on Graphics (TOG), 38(6):1-15, 2019.

Genova, K., Cole, F., Sud, A., Sarna, A., and Funkhouser, T. Deep structured implicit functions. arXiv preprint arXiv:1912.06126, 2019.

Groueix, T., Fisher, M., Kim, V. G., Russell, B. C., and Aubry, M. 3d-coded: 3d correspondences by deep deformation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 230-246, 2018a.

Groueix, T., Fisher, M., Kim, V. G., Russell, B. C., and Aubry, M. A papier-mâché approach to learning 3d surface generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 216-224, 2018b.

Huang, J., Jiang, C. M., Leng, B., Wang, B., and Guibas, L. Meshode: A robust and scalable framework for mesh deformation. arXiv preprint arXiv:2005.11617, 2020.


Jiang, C., Huang, J., Tagliasacchi, A., Guibas, L., et al. Shapeflow: Learnable deformations among 3d shapes. In Advances in Neural Information Processing Systems, 2020a.

Jiang, C., Sud, A., Makadia, A., Huang, J., Nießner, M., and Funkhouser, T. Local implicit grid representations for 3d scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020b.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pp. 10215-10224, 2018.

Kurenkov, A., Ji, J., Garg, A., Mehta, V., Gwak, J., Choy, C., and Savarese, S. Deformnet: Free-form deformation network for 3d shape reconstruction from a single image. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 858-866. IEEE, 2018.

Liu, F. and Liu, X. Learning implicit functions for topology-varying dense 3d shape correspondence. arXiv preprint arXiv:2010.12320, 2020.

Lorensen, W. E. and Cline, H. E. Marching cubes: A high resolution 3d surface construction algorithm. ACM siggraph computer graphics, 21(4):163-169, 1987.

Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579-2605, 2008.

Mescheder, L., Oechsle, M., Niemeyer, M., Nowozin, S., and Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4460-4470, 2019.

Michalkiewicz, M., Pontes, J. K., Jack, D., Baktashmotlagh, M., and Eriksson, A. Deep level sets: Implicit surface representations for 3d shape inference. arXiv preprint arXiv:1901.06802, 2019.

Mildenhall, B., Srinivasan, P. P., Tancik, M., Barron, J. T., Ramamoorthi, R., and Ng, R. Nerf: Representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934, 2020.

Niemeyer, M., Mescheder, L., Oechsle, M., and Geiger, A. Occupancy flow: 4d reconstruction by learning particle dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5379-5389, 2019.

Park, J. J., Florence, P., Straub, J., Newcombe, R., and Lovegrove, S. Deepsdf: Learning continuous signed distance functions for shape representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 165-174, 2019.

Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., and Geiger, A. Convolutional occupancy networks. arXiv preprint arXiv:2003.04618, 2020.

Rahaman, N., Baratin, A., Arpit, D., Draxler, F., Lin, M., Hamprecht, F., Bengio, Y., and Courville, A. On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301-5310. PMLR, 2019.

Rezende, D. J. and Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.

Saito, S., Huang, Z., Natsume, R., Morishima, S., Kanazawa, A., and Li, H. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2304-2314, 2019.

Sitzmann, V., Zollhöfer, M., and Wetzstein, G. Scene representation networks: Continuous 3d-structure-aware neural scene representations. In Advances in Neural Information Processing Systems, pp. 1121-1132, 2019.

Sorkine, O. and Alexa, M. As-rigid-as-possible surface modeling.

Wang, W., Ceylan, D., Mech, R., and Neumann, U. 3dn: 3d deformation network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1038-1046, 2019.

Wu, J., Zhang, C., Xue, T., Freeman, B., and Tenenbaum, J. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in neural information processing systems, pp. 82-90, 2016.

Xu, Q., Wang, W., Ceylan, D., Mech, R., and Neumann, U. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In Advances in Neural Information Processing Systems, pp. 492-502, 2019.

Xu, Y., Fan, T., Yuan, Y., and Singh, G. Ladybird: Quasi-Monte Carlo sampling for deep implicit field based 3d reconstruction with symmetry. In European Conference on Computer Vision, pp. 248-263. Springer, 2020.

Yang, G., Huang, X., Hao, Z., Liu, M.-Y., Belongie, S., and Hariharan, B. Pointflow: 3d point cloud generation with continuous normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4541-4550, 2019.

Yang, Y., Feng, C., Shen, Y., and Tian, D. Foldingnet: Point cloud auto-encoder via deep grid deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 206-215, 2018.


Yi, L., Kim, V. G., Ceylan, D., Shen, I.-C., Yan, M., Su, H., Lu, C., Huang, Q., Sheffer, A., and Guibas, L. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG), 35(6):1-12, 2016.

Yifan, W., Aigerman, N., Kim, V. G., Chaudhuri, S., and Sorkine-Hornung, O. Neural cages for detail-preserving 3d deformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 75-83, 2020.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818-833. Springer, 2014.