# Continuous Surface Embeddings

Natalia Neverova, David Novotny, Vasil Khalidov, Marc Szafraniec, Patrick Labatut, Andrea Vedaldi
{nneverova, dnovotny, vkhalidov, mszafraniec, plabatut, vedaldi}@fb.com
Facebook AI Research

In this work, we focus on the task of learning and representing dense correspondences in deformable object categories. While this problem has been considered before, solutions so far have been rather ad-hoc for specific object types (i.e., humans), often with significant manual work involved. However, scaling geometry understanding to all objects in nature requires more automated approaches that can also express correspondences between related, but geometrically different, objects. To this end, we propose a new, learnable image-based representation of dense correspondences. Our model predicts, for each pixel in a 2D image, an embedding vector of the corresponding vertex in the object mesh, therefore establishing dense correspondences between image pixels and 3D object geometry. We demonstrate that the proposed approach performs on par with or better than the state-of-the-art methods for dense pose estimation for humans, while being conceptually simpler. We also collect a new in-the-wild dataset of dense correspondences for animal classes and demonstrate that our framework scales naturally to the new deformable object categories.

1 Introduction

Understanding the geometry of natural objects, such as humans and other animals, must start from the notion of correspondence. Correspondences tell us which parts of different objects are geometrically equivalent, and thus form the basis on which an understanding of geometry can be developed. In this paper, we are interested in particular in learning and computing correspondences starting from 2D images of the objects, a preliminary step for 3D reconstruction and other applications.

While the correspondence problem has been considered many times before, most solutions still involve a significant amount of manual work. Consider for example a state-of-the-art method such as DensePose [17]. Given a new object category to model with DensePose, one must start by defining a canonical shape $S$, a sort of average 3D shape used as a reference to express correspondences. Then, a dataset of images of the object must be collected and annotated with millions of manual point correspondences between the images and the canonical 3D model. Finally, the model must be manually partitioned into a number of parts, or charts, and a deep neural network must be trained to segment the image and regress the uv coordinates for each chart, guided by the manual annotations, yielding a DensePose predictor. Given a new category, this process must be repeated from scratch.

There are some obvious scalability issues with this approach. The most significant one is that the entire process must be repeated for each new object category one wishes to model. This includes the laborious step of collecting annotations for the new class. However, categories such as animals share significant similarities between them; for instance, [44] has recently shown that DensePose trained on humans transfers well to chimpanzees. Thus, a much more scalable solution can be obtained by sharing training data and models between classes. This brings us to the second shortcoming of DensePose: the nature of the model makes it difficult to realize this information sharing.
In particular, the need for breaking the canonical 3D models into different charts makes relating different models cumbersome, particularly in a learning setup.

One important contribution of this paper is to introduce a better and more flexible representation of correspondences that can be used as a drop-in replacement in architectures such as DensePose. The idea is to introduce a learnable positional embedding. Namely, we associate each point $X$ of the canonical model $S$ with a compact embedding vector $e_X$, which provides a deformation-invariant representation of the point identity. We also note that the embedding can be interpreted as a smoothly-varying function defined over the 3D model $S$, interpreted as a manifold. As such, this allows us to use the machinery of functional maps [39] to work with the embeddings, with two important advantages: (1) being able to significantly reduce the dimensionality of the representation and (2) being able to efficiently relate representations between models of different object categories.

Empirically, we show that we can learn a deep neural network that predicts, for each pixel in a 2D image, the embedding vector of the corresponding object point, therefore establishing dense correspondences between the image pixels and the object geometry. For humans, we show that the resulting correspondences are as accurate as or more accurate than the reference state-of-the-art DensePose implementation, while achieving a significant simplification of the DensePose framework by removing the need for charting the model. As an additional bonus, this removes the seams between the parts that affect DensePose. Then, we use the ability of functional maps to relate different 3D shapes to help transfer information between different object categories. With this, and a very small amount of manual training data, we demonstrate for the first time that a single (universal) DensePose network can be extended to capture multiple animal classes with a high degree of sharing in compute and statistics. The overview of our method with learning continuous surface embeddings (CSE) is shown in Fig. 1b (for comparison, the DensePose [17] setup (IUV) is shown in Fig. 1a).

Figure 1: Overview. (a) DensePose (IUV): a predictor $\Phi$ outputs segmentation and UV maps ($D = 75$). (b) This work (CSE): a predictor $\Phi$ outputs a universal positional embedding $E \in \mathbb{R}^{H \times W \times D}$ ($D = 16$), which is matched against per-vertex mesh embeddings expressed in the LBO bases of the canonical surfaces. Compared to DensePose [17], our CSE framework is conceptually simpler and is directly extendable to multi-class problems through learning a universal dense pose network. (Color coding for each class mesh in the figure is chosen arbitrarily and independently from the others.)

2 Related work

Human pose recognition. With deep learning, image-based human pose estimation has made substantial progress [52, 37, 12], also due to the availability of large datasets such as COCO [31], MPII [3], the Leeds Sports Pose Dataset (LSP) [23, 24], Penn Action [58], or PoseTrack [2]. Our work is most related to DensePose [17], which introduced a method to establish dense correspondences between image pixels and points on the surface of the average SMPL human mesh model [32].

Unsupervised pose recognition.
Most pose estimators [5, 49, 7, 50, 43, 48, 33, 59, 22] require full supervision, which is expensive to collect, especially for a model such as DensePose. A handful of works have tackled this issue by seeking unsupervised and weakly-supervised objectives, using cues such as equivariance to synthetic image transformations. The most relevant to us is Slim DensePose [36], which showed that DensePose annotations can be significantly reduced without incurring a large performance penalty, but did not address the issue of scaling to multiple classes.

Animal pose recognition. Compared to humans, animal pose estimation is significantly less explored. Some works specialise on certain animals (tigers [30], cheetahs [35] or Drosophila melanogaster flies [19]). Tulsiani et al. [51] transfer pose between annotated animals and unannotated ones that are visually similar. Several works have focused on animal landmark detection. Rashid et al. [41] and Yang et al. [55] studied animal facial keypoints. A well-explored class are birds, thanks to the CUB dataset [53]. Some works [57, 45] proposed various types of detectors for birds, while others explored reconstructing sparse [38] and dense [26, 25, 13] 3D shapes. Beyond birds, Zuffi et al. [61, 62, 60] have systematically explored the problem of reconstructing 3D deformable animal models from image data. They utilise the SMAL animal shape model, which is an analogue for animals of the more popular SMPL model for humans [32]. While [61, 62] are based on fitting 2D keypoints at test time, Biggs et al. [6] directly regress the 3D shape parameters instead. Recently, Kulkarni et al. [28, 29] leveraged canonical maps to perform 3D shape fitting as well as establishing dense correspondences across different instances of an animal species.

3D shape analysis. Our work is also related to the literature that studies the intrinsic geometry of 3D shapes. Early approaches [15, 9] analysed shapes by performing multi-dimensional scaling of the geodesic distances over the shape surface, as these are invariant to isometric deformations. Later, Coifman and Lafon [14] popularized diffusion geometry due to its increased robustness to small perturbations of the shape topology. The seminal work of Rustamov [42] proposed to use the eigenfunctions of the Laplace-Beltrami operator (LBO) on a mesh to define a basis of functions that vary smoothly along the mesh surface. The LBO basis was later leveraged in other diffusion descriptors such as the heat kernel signature (HKS) [47], the wave kernel signature (WKS) [4] or Gromov-Hausdorff descriptors [10]. A scale-invariant version of HKS was introduced in [11], while [8] proposed an HKS-based equivalent of the image BoW descriptor [46]. While HKS/WKS establish hard correspondences between individual points on shapes, Ovsjanikov et al. [39] introduced functional maps (FM) that align shapes in a soft manner by finding a linear map between spaces of functions on meshes. Interestingly, [39] has revealed an intriguing connection between FMs and their efficient representation using the LBO basis. The FM framework became popular and was later extended in [40, 27, 1]. Most relevant to us, [34] proposed ZoomOut, a method that estimates FMs in a multi-scale fashion, which we improve for our species-to-species mesh correspondences.

3 Method

We propose a new approach for representing continuous correspondences (or surface embeddings, CSE) between an image and points on a 3D object. To this end, let $S \subset \mathbb{R}^3$ be a canonical surface.
Each point $X \in S$ should be thought of as a landmark, i.e. an object point that can be identified consistently despite changes in viewpoint, object deformations (e.g. a human moving), or even swapping an instance of the object with another (e.g. two different humans). In order to express these correspondences, we consider an embedding function $e : S \to \mathbb{R}^D$ associating each 3D point $X \in S$ with the corresponding $D$-dimensional vector $e_X$. Then, for each pixel $x$ of image $I$, we task a deep network $\Phi$ with computing the corresponding embedding vector $\Phi_x(I) \in \mathbb{R}^D$. From this, we recover the corresponding canonical 3D point $X \in S$ probabilistically, via a softmax-like function:

$$p(X \mid x, I, e, \Phi) = \frac{\exp(\langle e_X, \Phi_x(I)\rangle)}{\int_S \exp(\langle e_Y, \Phi_x(I)\rangle)\, dY}. \quad (1)$$

In this formulation, the embedding function $e_X$ is learnable, just like the network $\Phi$. The simplest way of implementing this idea is to approximate the surface $S$ with a mesh with vertices $X_1, \dots, X_K \in S$, obtaining a discrete variant of this model:

$$p(k \mid x, I, E, \Phi) = \frac{\exp(\langle e_k, \Phi_x(I)\rangle)}{\sum_{q=1}^{K} \exp(\langle e_q, \Phi_x(I)\rangle)}, \quad (2)$$

where $e_k = e_{X_k}$ is a shorthand notation for the embedding vector associated to the $k$-th vertex of the mesh and $E$ is the $K \times D$ matrix with all the embedding vectors $\{e_k\}_{k=1}^{K}$ as rows.

Figure 2: Comparison of IUV (DensePose [17]) and CSE-learned mappings for different sizes $M$ of the LBO basis ($M = 32, 256, 1024$); vertices predicted in the image are shown in color. IUV training exhibits discontinuities between segments, whereas CSE training produces spatially smooth, seamless predictions with no shrinking effect (the remaining discontinuities are due to biases in annotations; see the suppl. mat. for details).

Given a training set $\mathcal{T}$ of triplets $(I, x, k)$, where $I$ is an image, $x$ a pixel, and $k$ the index of the corresponding mesh vertex, we can learn this model by minimizing the cross-entropy loss:

$$L(E, \Phi) = \underset{(I, x, k) \in \mathcal{T}}{\operatorname{avg}}\; -\log p(k \mid x, I, E, \Phi). \quad (3)$$

We found it beneficial to modify this loss to account for the geometry of the problem, minimizing the cross entropy between a Gaussian-like distribution centered on the ground-truth point $k$ and the predicted posterior:

$$L_\sigma(E, \Phi) = \underset{(I, x, k) \in \mathcal{T}}{\operatorname{avg}}\; -\sum_{q=1}^{K} g_S(q; k) \log p(q \mid x, I, E, \Phi), \qquad g_S(q; k) \propto \exp\left(-\frac{d_S(X_q, X_k)^2}{2\sigma^2}\right), \quad (4)$$

where $d_S : S \times S \to \mathbb{R}_+$ is the geodesic distance between points on the surface.

Comparison with DensePose. DensePose [17], which is the current state-of-the-art approach for recovering dense correspondences, decomposes the canonical model $S = \bigcup_{i=1}^{J} S_i$ into $J$ non-overlapping patches, each parametrized via a chart $f_i : S_i \to [0,1]^2$ mapping the patch $S_i$ to the square $[0,1]^2$ of local uv coordinates. DensePose then learns a network that maps each pixel $x$ in an image $I$ of the object to the corresponding chart index and uv coordinates as $\Phi_x(I) = (i, u, v) \in \{1, \dots, J\} \times [0,1]^2$. An issue with this formulation is that the charting $(S_i, f_i)$ of the canonical model $S$ is arbitrary and needs to be defined manually. The DensePose network $\Phi$ then needs to output, for each pixel $x$, a probability value that the pixel belongs to each of the $J$ charts, plus possible chart coordinates $(u, v) \in [0,1]^2$ for each patch. This requires at minimum $J + 2$ values for each pixel, but in practice it is implemented as a set of $3J$ channels, as different patches use different coordinate predictors. Our model in eq. (2) is a substantial simplification because it only requires the initial mesh, whereas the embedding matrix $E$ is learned automatically.
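To make the discrete model of eq. (2) and the smoothed loss of eq. (4) concrete, the following is a minimal PyTorch-style sketch of the per-pixel loss computation, assuming per-pixel embeddings, per-vertex embeddings and precomputed geodesic distances are given; the function and tensor names are ours and do not reflect the released implementation.

```python
# Minimal sketch of the discrete CSE model (eq. 2) and the geodesic-smoothed
# cross-entropy loss (eq. 4). All names are illustrative.
import torch
import torch.nn.functional as F

def cse_loss(pixel_embed, E, gt_vertex, geo_dist, sigma=0.1):
    """
    pixel_embed: (N, D) embeddings Phi_x(I) for N annotated pixels
    E:           (K, D) learnable per-vertex embeddings of the canonical mesh
    gt_vertex:   (N,)   index k of the ground-truth vertex for each pixel
    geo_dist:    (K, K) precomputed geodesic distances d_S(X_q, X_k)
    """
    # eq. (2): inner products <e_k, Phi_x(I)> for every vertex, log-softmax over K
    logits = pixel_embed @ E.t()                       # (N, K)
    log_p = F.log_softmax(logits, dim=1)               # log p(k | x, I, E, Phi)

    # eq. (4): Gaussian-like target distribution centred on the ground-truth vertex
    d = geo_dist[gt_vertex]                            # (N, K)
    g = torch.softmax(-d.pow(2) / (2 * sigma ** 2), dim=1)

    # cross entropy between g_S(.; k) and the predicted posterior
    return -(g * log_p).sum(dim=1).mean()
```

If $g$ is replaced by a one-hot vector at the ground-truth vertex, this reduces to the plain cross-entropy loss $L$ of eq. (3).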
Our formulation reduces the amount of manual work required to instantiate DensePose, removes the seams between the different charts $S_i$, which are arbitrary, and, perhaps most importantly, allows us to share a single embedding space, and thus a single network, between several classes, as discussed below.

3.1 Injecting geometric knowledge via spectral analysis

There are three issues with the formulation we have given so far. First, the embedding matrix $E$ is large, as it contains $D$ parameters for each of the $K$ mesh vertices. Second, the representation depends on the discretization of the mesh, so for example it is not clear what to do if we resample the mesh to increase its resolution. Third, it is not obvious, given two surfaces $S$ and $S'$ for two different object categories (e.g. humans and chimps), how their embeddings $e$ and $e'$ can be related.

Figure 3: Initializing an animal predictor from a human-trained network and mesh mappings (e.g. SMPL → giraffe) by setting $\hat{E}' = C\hat{E}$ for each animal class (see Section 3.3 for details). For each image-mesh pair, the color coding corresponds to the new target animal category, as displayed on the corresponding 3D model.

We can elegantly solve these issues by interpreting $e_X$ as a smooth function $S \to \mathbb{R}^D$ defined on a manifold $S$, of which the mesh is a discretization. Then, we can use the machinery of functional maps [39] to manipulate the encodings, with several advantages. For this, we begin by discretizing the representation; namely, given a real function $r : S \to \mathbb{R}$ defined on the surface $S$, we represent it as a $K$-dimensional vector $r \in \mathbb{R}^K$ of samples $r_k = r(X_k)$ taken at the vertices. Then, we define the discrete Laplace-Beltrami operator (LBO) $L = A^{-1}W$, where $A \in \mathbb{R}^{K \times K}_{+}$ is diagonal and $W \in \mathbb{R}^{K \times K}$ is positive semi-definite (see below for details). The interest in the LBO is that it provides a basis for analyzing functions defined on the surface. Just like the Fourier basis in $\mathbb{R}^n$ is given by the eigenfunctions of the Laplacian, we can define a Fourier basis on the surface $S$ as the matrix $U \in \mathbb{R}^{K \times K}$ of generalized eigenvectors $WU = AU\Lambda$, where $\Lambda$ is a diagonal matrix of eigenvalues. Due to $A$, the matrix $U$ is orthonormal w.r.t. the metric $A$, in the sense that $U^\top A U = I$. Each column $u_k$ of the matrix $U$ is a $K$-dimensional vector defining an eigenfunction on the surface $S$, corresponding to eigenvalue $\lambda_k \geq 0$. The Fourier transform of a function $r$ is then the vector of coefficients $\hat{r}$ such that $r = U\hat{r}$. Furthermore, if we retain in $U$ only the columns corresponding to the $M$ smallest eigenvalues, so that $U \in \mathbb{R}^{K \times M}$, setting $r = U\hat{r}$ defines a smooth (low-pass) function. Appendix A.1 explains how we construct $W$ and $A$ in detail.

This machinery is of interest to us because we can regard the $i$-th component $(e_k)_i$ of the embedding vectors as a scalar function defined on the mesh. Furthermore, we expect this function to vary smoothly, meaning that we can express the overall code matrix as $E = U\hat{E}$, where $\hat{E} \in \mathbb{R}^{M \times D}$ is a matrix of coefficients much smaller than $E \in \mathbb{R}^{K \times D}$. In other words, we have used the geometry of the mesh to dramatically reduce the number of parameters to learn (a small sketch of this spectral parametrization is given below). LBO bases can also be used to relate different meshes, as explained next.
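As an illustration of the spectral parametrization $E = U\hat{E}$, the sketch below computes a truncated basis for a mesh and uses it to expand learnable spectral coefficients into per-vertex embeddings. For brevity it uses a uniform graph Laplacian and a uniform mass matrix as stand-ins for the $W$ and $A$ of Appendix A.1, so it is only a simplified approximation; all names are ours.

```python
# Sketch of the truncated spectral basis and the parametrization E = U @ E_hat.
# A uniform graph Laplacian / mass matrix stand in for the W and A of Appendix A.1.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def lbo_basis(num_verts, edges, n_basis=256):
    """edges: (E, 2) array of mesh edges; returns U (K x M) and the eigenvalues."""
    i, j = edges[:, 0], edges[:, 1]
    ones = np.ones(len(edges))
    adj = sp.coo_matrix((ones, (i, j)), shape=(num_verts, num_verts))
    adj = adj + adj.T                                        # symmetric adjacency
    W = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj  # graph Laplacian (PSD)
    A = sp.diags(np.full(num_verts, 1.0 / num_verts))        # uniform "mass" matrix
    # M smallest generalized eigenpairs of  W u = lambda A u  (shift-invert trick)
    lam, U = eigsh(W.tocsc(), k=n_basis, M=A.tocsc(), sigma=-1e-8, which="LM")
    return U, lam                                            # columns of U are A-orthonormal

# Only the spectral coefficients E_hat (M x D) are learned; per-vertex embeddings
# are recovered as E = U @ E_hat, which is smooth (low-pass) by construction.
```

A cotangent-style LBO, as constructed in Appendix A.1, would be a closer match to the operator actually used; the uniform Laplacian here only serves to keep the sketch short.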
3.2 Relating different categories

Let $S$ and $S'$ be the canonical shapes of two objects, e.g. human and chimp. The shapes are analogous, but also sufficiently different to warrant modelling them with related but distinct canonical shapes. In the discrete setting, a functional map (FM) [39] is just a linear map $T \in \mathbb{R}^{K' \times K}$ sending functions $r$ defined on $S$ to functions $r' = Tr$ defined on $S'$. The space of such maps is very large, but we are interested particularly in two kinds. The first are point-to-point maps, which are analogous to permutation matrices, and thus express correspondences between the two shapes. The second kind are maps restricted to low-pass functions on the two surfaces. Assuming that functions can be written as $r = U\hat{r}$ and $r' = U'\hat{r}'$ for LBO bases $(U, A)$ and $(U', A')$, the functional map $r' = Tr$ can be written in an equivalent manner as $\hat{r}' = C\hat{r}$, where $C = (U')^\top A' T U$ acts on the spectra of the functions. The advantage of the spectral representation is that the $M' \times M$ matrix $C$ is generally much smaller than the $K' \times K$ matrix $T$. Then, we can relate the positional embeddings $E$ and $E'$ for the two shapes, which are smooth, as $\hat{E}' \approx C\hat{E}$.

When shapes $S$ and $S'$ are approximately isometric, we can resort to automatic methods to establish correspondences (or $C$) between them. When $S$ and $S'$ are not, this is much harder. Instead, we start from a small number of manual correspondences $(k_j, k'_j)$, $j = 1, \dots, Q$, between the surfaces and interpolate these using functional maps. To do this, we use a variant of the ZoomOut method [34] due to its simplicity. This amounts to alternating two steps: given a matrix $C$ of order $M_1 \times M_1$, we decode it as a set of point-to-point correspondences $(k_j, k'_j)$; then, given $(k_j, k'_j)$, we estimate a matrix $C$ of order $M_2 > M_1$, increasing the resolution of the match. This is done for a sequence $M_t = 12, 16, \dots, 256$ until the desired resolution is achieved (see Appendix A.2 for details).

| category | LVIS # inst. | train # inst. | train # corresp. | train coverage | test # inst. | test # corresp. | test coverage |
|---|---|---|---|---|---|---|---|
| dog | 2317 | 483 | 1424 | 21.5% | 200 | 596 | 10.1% |
| cat | 2294 | 586 | 1720 | 23.9% | 200 | 591 | 10.0% |
| bear | 776 | 98 | 289 | 4.8% | 200 | 589 | 9.0% |
| sheep | 1142 | 257 | 765 | 13.3% | 200 | 511 | 10.2% |
| cow | 1686 | 426 | 1267 | 19.7% | 200 | 593 | 9.9% |
| horse | 2299 | 605 | 1783 | 25.8% | 200 | 587 | 10.0% |
| zebra | 1999 | 665 | 1968 | 28.8% | 200 | 592 | 10.6% |
| giraffe | 2235 | 651 | 1936 | 27.4% | 200 | 594 | 10.2% |
| elephant | 2242 | 670 | 1992 | 28.1% | 200 | 592 | 10.0% |

Table 1: Annotation statistics of the DensePose-LVIS dataset. Coverage is expressed as the fraction of vertices in a given class mesh with at least one corresponding ground truth annotation. (The corresponding animal meshes are shown in the original figure; source: hum3d.com.)

3.3 Cross-species DensePose with functional maps

We are now ready to describe how functional maps can be used to facilitate transferring and sharing a single DensePose predictor $\Phi$ between categories with different canonical shapes $S$, $S'$ and corresponding per-vertex embeddings $E$, $E'$. To this end, assume that we have learned the pose regressor $\Phi$ on a source category $(S, E)$ (e.g. humans). We are free to apply $\Phi$ to an image $I'$ of a different category $(S', E')$ (e.g. chimps), but, while we might know $S'$, we do not know $E'$. However, if we assume that the regressor $\Phi$ can be shared among categories, then it is natural to share their positional embeddings as well. Namely, we assume that $E'$ is approximately the same as $E$ up to a remapping of the embedding vectors from shape $S$ to shape $S'$. Based on the last section, we can thus write $\hat{E}' = C\hat{E}$, or, equivalently, $E' = TE$ where $T = U' C U^\top A$. With this, we can simply replace $E$ with $E'$ in eq. (2) to regress the pose of the new category using the same regressor network $\Phi$. For training, we optimise the same cross-entropy loss $L(E, \Phi)$ of eq. (3), just combining images and annotations from the two object categories and swapping $E$ and $E'$ depending on the class of the input image.
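The following sketch summarises, under the same notation, how the sparse manual correspondences could be upsampled with a ZoomOut-style alternation (Section 3.2) and how the resulting spectral map transfers the embeddings to a new category (Section 3.3). It assumes the truncated bases and the mass matrix of the source mesh are available; it is a simplified reading of the procedure, not the authors' code, and the helper names are ours.

```python
# Sketch of ZoomOut-style upsampling of sparse correspondences [34] and of the
# embedding transfer E' = T E = U' C U^T A E from Section 3.3. Names are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def fit_C(corr, U, U_p, m):
    """Least-squares spectral map C (m x m) from correspondences corr = [(k, k_p), ...]."""
    src = U[[k for k, _ in corr], :m]                 # spectral coordinates on S
    tgt = U_p[[k_p for _, k_p in corr], :m]           # spectral coordinates on S'
    X, *_ = np.linalg.lstsq(src, tgt, rcond=None)     # src @ X ~ tgt
    return X.T                                        # so that  r_hat' ~ C @ r_hat

def zoomout(corr, U, U_p, sizes=(12, 16, 32, 64, 128, 256)):
    """Alternate between fitting C at resolution m and decoding point-to-point matches."""
    for m in sizes:
        C = fit_C(corr, U, U_p, m)
        mapped = U[:, :m] @ C.T                       # every source vertex mapped to S' spectra
        _, nearest = cKDTree(U_p[:, :m]).query(mapped)
        corr = list(enumerate(nearest))               # dense (k, k_p) correspondences
    return C

# Transfer of per-vertex embeddings to the new category (E' = T E):
# E_hat   = U[:, :M].T @ (A @ E)                      # spectral coefficients on S
# E_prime = U_p[:, :M] @ (C @ E_hat)                  # per-vertex embeddings on S'
```

The schedule of resolutions mirrors the sequence $M_t = 12, 16, \dots, 256$ described above.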
The procedure above can be easily generalised to any number of categories $S_1, \dots, S_N$ with reference to the same source category $S$ and functional maps $C_1, \dots, C_N$. In our case, we select humans as the source category, as they come with the largest number of annotations.

4 Datasets

We rely on the DensePose-COCO dataset [17] for evaluation of the proposed method on the human category and comparison with the DensePose (IUV) training. For the multi-class setting, we make use of the recent DensePose-Chimps [44] test benchmark containing a small number of annotated correspondences for chimpanzees. We split the set of annotated instances of [44] into 500 training and 430 test samples containing 1354 and 1151 annotated correspondences respectively. Additionally, we collect correspondence annotations on a set of 9 animal categories of the LVIS dataset [21]. Based on images from the COCO dataset [31], LVIS features significantly more accurate object masks. We refer to this data as DensePose-LVIS. The annotation statistics for the collected animal correspondences are given in Table 1. Note that compared to the original DensePose-COCO labelling effort, which produced 5 million annotated points for the human category (96% coverage of the SMPL mesh), our annotations are three orders of magnitude smaller, and only 18% of the vertices of the animal meshes, on average, have at least one ground truth annotation.

| architecture | AP | AP50 | AP75 | APM | APL | AR | AR50 | AR75 | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|
| *IUV training (published baselines)* | | | | | | | | | | |
| DP-RCNN (R50) [17] | 54.9 | 89.8 | 62.2 | 47.8 | 56.3 | 61.9 | 93.9 | 70.8 | 49.1 | 62.8 |
| DP-RCNN (R101) [17] | 56.1 | 90.4 | 64.4 | 49.2 | 57.4 | 62.8 | 93.7 | 72.4 | 50.1 | 63.6 |
| Parsing-RCNN [56] | 65 | 93 | 78 | 56 | 67 | – | – | – | – | – |
| AMA-net [20] | 64.1 | 91.4 | 72.9 | 59.3 | 65.3 | 71.6 | 94.7 | 79.8 | 61.3 | 72.3 |
| *IUV training (our optimized architectures)* | | | | | | | | | | |
| DP-RCNN* (R50) | 65.3 | 92.5 | 77.1 | 58.6 | 66.6 | 71.1 | 95.3 | 82.0 | 60.1 | 71.9 |
| DP-RCNN* (R101) | 66.4 | 92.9 | 77.9 | 60.6 | 67.5 | 71.9 | 95.5 | 82.6 | 62.1 | 72.6 |
| DP-RCNN-DeepLab* (R50) | 66.8 | 92.8 | 79.7 | 60.7 | 68.0 | 72.1 | 95.8 | 82.9 | 62.2 | 72.4 |
| DP-RCNN-DeepLab* (R101) | 67.7 | 93.5 | 79.7 | 62.6 | 69.1 | 73.6 | 96.5 | 84.7 | 64.2 | 74.2 |
| *CSE training (our optimized architectures)* | | | | | | | | | | |
| DP-RCNN* (R50) | 66.1 | 92.5 | 78.2 | 58.7 | 67.4 | 71.7 | 95.5 | 82.4 | 60.3 | 72.5 |
| DP-RCNN* (R101) | 67.0 | 93.8 | 78.6 | 60.1 | 68.3 | 72.8 | 96.4 | 83.7 | 61.5 | 73.6 |
| DP-RCNN-DeepLab* (R50) | 66.6 | 93.8 | 77.6 | 60.8 | 67.7 | 72.8 | 96.5 | 83.1 | 62.1 | 73.5 |
| DP-RCNN-DeepLab* (R101) | 68.0 | 94.1 | 80.0 | 61.9 | 69.4 | 74.3 | 97.1 | 85.5 | 63.8 | 75.0 |

Table 2: Performance on DensePose-COCO with IUV and CSE training (GPSm scores, minival). First block: published state-of-the-art DensePose methods; second block: our optimized architectures with IUV training; third block: our optimized architectures with CSE training. All CSE models are trained with loss $L_\sigma$ (eq. 4), LBO basis size $M = 256$, embedding size $D = 16$.

(a)

| LBO basis size, M | loss L | loss Lσ |
|---|---|---|
| 32 | 62.9 | 63.2 |
| 64 | 64.1 | 63.9 |
| 128 | 65.2 | 65.4 |
| 256 | 65.6 | 66.1 |
| 512 | 65.6 | 65.9 |
| 1024 | 65.7 | 65.9 |

(b)

| embedding size, D | loss L | loss Lσ |
|---|---|---|
| 2 | 38.3 | 46.4 |
| 4 | 60.0 | 64.7 |
| 8 | 60.2 | 65.6 |
| 16 | 65.4 | 66.1 |
| 32 | 65.6 | 66.0 |
| 64 | 65.1 | 66.1 |

(c)

| training data, % | IUV | loss L | loss Lσ |
|---|---|---|---|
| 1 | 18.9 | 7.2 | 17.8 |
| 5 | 36.6 | 26.5 | 31.4 |
| 10 | 42.2 | 36.3 | 39.3 |
| 50 | 58.0 | 56.3 | 58.5 |
| 100 | 65.3 | 65.4 | 66.1 |

Table 3: Hyperparameter search and performance in low data regimes (AP, DensePose-COCO, minival): (a) LBO basis size $M$ ($D = 16$); (b) embedding size $D$ ($M = 256$); (c) comparison of IUV and CSE training in small data regimes. DP-RCNN* (R50) predictor.

5 Experiments

Architectures. Our networks are implemented in PyTorch within the Detectron2 [54] framework. The training is performed on 8 GPUs for 130k iterations on DensePose-COCO (standard s1x schedule [54]) and 5k iterations on DensePose-Chimps and DensePose-LVIS.
The code, trained models and the dataset will be made publicly available to ensure reproducibility.

Prior to benchmarking the CSE setup, we carefully optimized all core architectures for dense pose estimation and introduced the following changes (similarly to [56, 20]): (1) single-channel instance mask prediction as a replacement for the coarse segmentation of [17]; (2) optimized weights for the (i, u, v) components; (3) a DeepLab head and Panoptic FPN (similarly to [43]). The experimental results are reported following the updated protocol based on GPSm scores [54]. More details on the network architectures and training hyperparameters are given in the supplementary material.

Comparison of CSE vs IUV training. The comparison of the state-of-the-art methods for dense pose estimation and of our optimized architectures for both IUV and CSE training is provided in Table 2. The CSE-trained models perform better than or on par with their IUV-trained counterparts, while producing a more compact representation ($D = 16$ vs $D = 75$) and requiring only simplified supervision (single vertex indices vs (i, u, v) annotations).

Influence of hyperparameters. In Table 3 we investigate the sensitivity of the CSE network to the size of the LBO basis, $M$ (Table 3a), and of the output embedding, $D$ (Table 3b), given the training losses $L$ or $L_\sigma$. The value $M = 256$ represents a good tradeoff between the mapping's smoothness and its fidelity. It does not seem beneficial to increase the embedding size beyond $D = 16$, so we adopt this value for the rest of the experiments. The smoothed loss $L_\sigma$ yields better performance in the low-dimensional setting.

Low data regime. Prior to proceeding with the multi-class experiments on animal classes, we investigate changes in model performance as a function of the amount of ground truth annotations by training on subsets of the DensePose-COCO dataset. As shown in Table 3c, $L_\sigma$-based training scales down more gracefully and is significantly more robust than $L$.

| model | $S_\mathrm{chimp}$ | $S_\mathrm{smpl}$ |
|---|---|---|
| human model | – | 2.0 |
| loss L | 8.5 | 3.3 |
| loss Lσ | 21.1 | 3.2 |
| + pretrain Φ | 36.7 | 34.5 |
| + align E | 37.2 | 35.7 |

Table 4: Performance on the DensePose-Chimps dataset with CSE training (AP, GPSm scores, measured on both the chimp and SMPL meshes w.r.t. the GT mapping $S_\mathrm{chimp} \to S_\mathrm{smpl}$ from [43]).

| model | cat | dog | bear | sheep | cow | horse | zebra | giraffe | elephant | mean |
|---|---|---|---|---|---|---|---|---|---|---|
| single class + L | 5.5 | 4.7 | 1.8 | 0.9 | 2.8 | 4.5 | 12.5 | 11.4 | 19.4 | 7.05 |
| single class + Lσ | 20.2 | 16.4 | 10.8 | 14.2 | 22.5 | 24.3 | 23.9 | 27.1 | 26.4 | 20.6 |
| *joint training:* | | | | | | | | | | |
| multiclass + Lσ | 20.2 | 18.3 | 19.3 | 25.4 | 22.4 | 26.3 | 33.2 | 30.9 | 29.9 | 25.0 |
| class agnostic + L | 8.7 | 6.6 | 4.1 | 8.2 | 6.4 | 7.3 | 19.2 | 15.5 | 9.2 | 9.5 |
| class agnostic + Lσ | 20.5 | 18.3 | 20.1 | 25.9 | 24.5 | 25.7 | 34.5 | 30.5 | 27.1 | 25.2 |
| + pretrain Φ | 28.0 | 28.3 | 22.0 | 31.8 | 36.5 | 32.7 | 43.1 | 41.2 | 34.9 | 33.1 |
| + align E | 30.9 | 29.4 | 25.1 | 35.3 | 36.5 | 34.3 | 46.0 | 38.3 | 39.6 | 35.0 |

Table 5: Performance on the DensePose-LVIS dataset with CSE training (AP, GPSm scores).

Figure 4: Qualitative results on the DensePose-LVIS dataset (single predictor for all classes).

Multi-surface training. The results on the DensePose-Chimps and DensePose-LVIS datasets are reported in Tables 4 and 5 (DP-RCNN* (R50), $M = 256$, $D = 16$). In both cases, training from scratch results in poor performance, especially in the single-class setting. Initializing the predictor $\Phi$ with the weights trained on the human class, together with the alignment of the mesh-specific vertex embeddings (as described in Section 3.3), gives a significant boost.
Interestingly, class agnostic training, i.e. mapping all class embeddings to the shared space, turns out to be more effective than having a separate set of output planes for each category (the latter is denoted as multiclass in Table 5). Qualitative results produced by the best predictor are shown in Figure 4.

Conclusion. In this work, we have made an important step towards designing universal networks for learning dense correspondences within and across different object categories (animals). We have demonstrated that training joint predictors in the image space with simultaneous alignment of canonical surfaces in 3D results in an efficient transfer of knowledge between different classes, even when the amount of ground truth annotations is severely limited.

Broader impact

In our paper, we help improve the ability of machines to understand the pose of articulated objects such as humans in images. In particular, we make the process of learning new object categories much more efficient. An application of our method is the observation of the human body. This may come with some concerns about possible negative uses of the technology. However, we should note that our approach cannot be considered biometrics, because from pose alone, even if dense, it is not possible to ascertain the identity of an individual (in particular, we do not perform 3D reconstruction, nor do we reconstruct facial features). This mitigates the potential risk when our method is applied to humans.

We believe that our work has significant opportunities for a positive impact by opening up the possibility that machines could ultimately understand the pose of thousands of animal classes. In addition to numerous applications in VR, AR, marketing and the like, such a technology can benefit animal-human-machine interaction (e.g. in aid of the visually impaired), can be used to better safeguard animals on the Internet (e.g. by detecting animal abuse), and, perhaps most importantly, can allow conservationists and other researchers to observe animals in the wild at an unprecedented scale, automatically analysing their motion and activities, and thus collecting information on their numbers, state of health, and other statistics. Thus, while we acknowledge that this technology may find negative uses (as almost any technology does), we believe that the positives far outweigh them.

References

[1] Yonathan Aflalo, Anastasia Dubrovina, and Ron Kimmel. Spectral Generalized Multidimensional Scaling. International Journal of Computer Vision, 118(3):380–392, 2016.

[2] Mykhaylo Andriluka, Umar Iqbal, Anton Milan, Eldar Insafutdinov, Leonid Pishchulin, Juergen Gall, and Bernt Schiele. PoseTrack: A Benchmark for Human Pose Estimation and Tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5167–5176, 2018.

[3] Mykhaylo Andriluka, Leonid Pishchulin, Peter V. Gehler, and Bernt Schiele. 2D Human Pose Estimation: New Benchmark and State of the Art Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3686–3693, 2014.

[4] Mathieu Aubry, Ulrich Schlickewei, and Daniel Cremers. The wave kernel signature: A quantum mechanical approach to shape analysis. In IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 1626–1633, 2011.

[5] Miguel Ángel Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Björn Ommer. CliqueCNN: Deep Unsupervised Exemplar Learning. In Advances in Neural Information Processing Systems (NIPS), pages 3846–3854, 2016.
[6] Benjamin Biggs, Thomas Roddick, Andrew Fitzgibbon, and Roberto Cipolla. Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video. In Asian Conference on Computer Vision (ACCV), pages 3–19, 2018.

[7] Biagio Brattoli, Uta Büchler, Anna-Sophia Wahl, Martin E. Schwab, and Björn Ommer. LSTM Self-Supervision for Detailed Behavior Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3747–3756, 2017.

[8] Alexander M. Bronstein, Michael M. Bronstein, Leonidas J. Guibas, and Maks Ovsjanikov. Shape Google: Geometric words and expressions for invariant shape retrieval. ACM Transactions on Graphics (TOG), 30(1):1–20, 2011.

[9] Alexander M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Generalized multidimensional scaling: a framework for isometry-invariant partial surface matching. Proceedings of the National Academy of Sciences (PNAS), 103(5):1168–1172, 2006.

[10] Alexander M. Bronstein, Michael M. Bronstein, Ron Kimmel, Mona Mahmoudi, and Guillermo Sapiro. A Gromov-Hausdorff framework with Diffusion Geometry for Topologically-Robust Non-rigid Shape Matching. International Journal of Computer Vision, 89(2–3):266–286, 2010.

[11] Michael M. Bronstein and Iasonas Kokkinos. Scale-invariant heat kernel signatures for non-rigid shape recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1704–1711, 2010.

[12] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1302–1310, 2017.

[13] Wenzheng Chen, Huan Ling, Jun Gao, Edward J. Smith, Jaakko Lehtinen, Alec Jacobson, and Sanja Fidler. Learning to Predict 3D Objects with an Interpolation-based Differentiable Renderer. In Advances in Neural Information Processing Systems (NeurIPS), pages 9605–9616, 2019.

[14] Ronald R. Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[15] Asi Elad (Elbaz) and Ron Kimmel. On Bending Invariant Signatures for Surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10):1285–1295, 2003.

[16] Danielle Ezuz and Mirela Ben-Chen. Deblurring and Denoising of Maps between Shapes. Computer Graphics Forum, 36(5):165–174, 2017.

[17] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense Human Pose Estimation in the Wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7297–7306, 2018.

[18] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[19] Semih Günel, Helge Rhodin, Daniel Morales, João Campagnolo, Pavan Ramdya, and Pascal Fua. DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila. eLife, 2019.

[20] Yuyu Guo, Lianli Gao, Jingkuan Song, Peng Wang, Wuyuan Xie, and Heng Tao Shen. Adaptive Multi-Path Aggregation for Human Dense Pose Estimation in the Wild. In ACM International Conference on Multimedia, pages 356–364, 2019.

[21] Agrim Gupta, Piotr Dollár, and Ross Girshick. LVIS: A Dataset for Large Vocabulary Instance Segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5356–5364, 2019.
[22] Tomas Jakab, Ankush Gupta, Hakan Bilen, and Andrea Vedaldi. Unsupervised Learning of Object Landmarks through Conditional Image Generation. In Advances in Neural Information Processing Systems (NeurIPS), pages 4020–4031, 2018.

[23] Sam Johnson and Mark Everingham. Clustered Pose and Nonlinear Appearance Models for Human Pose Estimation. In British Machine Vision Conference (BMVC), pages 1–11, 2010.

[24] Sam Johnson and Mark Everingham. Learning effective human pose estimation from inaccurate annotation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1465–1472, 2011.

[25] Angjoo Kanazawa, David W. Jacobs, and Manmohan Chandraker. WarpNet: Weakly Supervised Matching for Single-View Reconstruction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3253–3261, 2016.

[26] Angjoo Kanazawa, Shubham Tulsiani, Alexei A. Efros, and Jitendra Malik. Learning category-specific mesh reconstruction from image collections. In European Conference on Computer Vision (ECCV), pages 386–402, 2018.

[27] Artiom Kovnatsky, Michael M. Bronstein, Xavier Bresson, and Pierre Vandergheynst. Functional correspondence by matrix completion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 905–914, 2015.

[28] Nilesh Kulkarni, Abhinav Gupta, David Fouhey, and Shubham Tulsiani. Articulation-aware Canonical Surface Mapping. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[29] Nilesh Kulkarni, Shubham Tulsiani, and Abhinav Gupta. Canonical Surface Mapping via Geometric Cycle Consistency. In International Conference on Computer Vision (ICCV), pages 2202–2211, 2019.

[30] Shuyuan Li, Jianguo Li, Weiyao Lin, and Hanlin Tang. Amur tiger re-identification in the wild. arXiv e-prints arXiv:1906.05586, 2019.

[31] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014.

[32] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.

[33] Dominik Lorenz, Leonard Bereska, Timo Milbich, and Björn Ommer. Unsupervised part-based disentangling of object shape and appearance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10955–10964, 2019.

[34] Simone Melzi, Jing Ren, Emanuele Rodolà, Abhishek Sharma, Peter Wonka, and Maks Ovsjanikov. ZoomOut: spectral upsampling for efficient shape correspondence. ACM Transactions on Graphics, 38(6):155:1–155:14, 2019.

[35] Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Matthias Bethge, and Mackenzie Weygandt Mathis. Using DeepLabCut for 3D markerless pose estimation across species and behaviors. Nature Protocols, 2019.

[36] Natalia Neverova, James Thewlis, Rıza Alp Güler, Iasonas Kokkinos, and Andrea Vedaldi. Slim DensePose: Thrifty Learning from Sparse Annotations and Motion Cues. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10915–10923, 2019.

[37] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked Hourglass Networks for Human Pose Estimation. In European Conference on Computer Vision (ECCV), pages 483–499, 2016.

[38] David Novotny, Nikhila Ravi, Benjamin Graham, Natalia Neverova, and Andrea Vedaldi. C3DPO: Canonical 3D Pose Networks for Non-Rigid Structure From Motion. In International Conference on Computer Vision (ICCV), pages 7687–7696, 2019.
[39] Maks Ovsjanikov, Mirela Ben-Chen, Justin Solomon, Adrian Butscher, and Leonidas J. Guibas. Functional maps: a flexible representation of maps between shapes. ACM Transactions on Graphics (TOG), 31(4):1–11, 2012.

[40] Jonathan Pokrass, Alexander M. Bronstein, Michael M. Bronstein, Pablo Sprechmann, and Guillermo Sapiro. Sparse modeling of intrinsic correspondences. Computer Graphics Forum, 32(2):459–468, 2013.

[41] Maheen Rashid, Xiuye Gu, and Yong Jae Lee. Interspecies knowledge transfer for facial keypoint detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6894–6903, 2017.

[42] Raif M. Rustamov. Laplace-Beltrami eigenfunctions for deformation invariant shape representation. In Symposium on Geometry Processing, pages 225–233, 2007.

[43] Artsiom Sanakoyeu, Miguel Ángel Bautista, and Björn Ommer. Deep unsupervised learning of visual similarities. Pattern Recognition, 78:331–343, 2018.

[44] Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, and Natalia Neverova. Transferring Dense Pose to Proximal Animal Classes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[45] Saurabh Singh, Derek Hoiem, and David A. Forsyth. Learning to Localize Little Landmarks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 260–269, 2016.

[46] Josef Sivic and Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. In International Conference on Computer Vision (ICCV), pages 1470–1477, 2003.

[47] Jian Sun, Maks Ovsjanikov, and Leonidas J. Guibas. A Concise and Provably Informative Multi-Scale Signature Based on Heat Diffusion. Computer Graphics Forum, 28(5):1383–1392, 2009.

[48] James Thewlis, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of landmarks by descriptor vector exchange. In International Conference on Computer Vision (ICCV), 2019.

[49] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised Learning of Object Landmarks by Factorized Spatial Embeddings. In International Conference on Computer Vision (ICCV), pages 3229–3238, 2017.

[50] James Thewlis, Hakan Bilen, and Andrea Vedaldi. Unsupervised object learning from dense invariant image labelling. In Advances in Neural Information Processing Systems (NIPS), pages 844–855, 2017.

[51] Shubham Tulsiani, João Carreira, and Jitendra Malik. Pose Induction for Novel Object Categories. In IEEE International Conference on Computer Vision (ICCV), pages 64–72, 2015.

[52] Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. Convolutional Pose Machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.

[53] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

[54] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[55] Heng Yang, Renqiao Zhang, and Peter Robinson. Human and sheep facial landmarks localisation by triplet interpolated features. In IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8, 2015.

[56] Lu Yang, Qing Song, Zhihui Wang, and Ming Jiang. Parsing R-CNN for Instance-Level Human Analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 364–373, 2019.

[57] Ning Zhang, Jeff Donahue, Ross B. Girshick, and Trevor Darrell. Part-Based R-CNNs for Fine-Grained Category Detection. In European Conference on Computer Vision (ECCV), pages 834–849, 2014.
[58] Weiyu Zhang, Menglong Zhu, and Konstantinos G. Derpanis. From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding. In International Conference on Computer Vision (ICCV), pages 2248–2255, 2013.

[59] Yuting Zhang, Yijie Guo, Yixin Jin, Yijun Luo, Zhiyuan He, and Honglak Lee. Unsupervised Discovery of Object Landmarks as Structural Representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2694–2703, 2018.

[60] Silvia Zuffi, Angjoo Kanazawa, Tanya Y. Berger-Wolf, and Michael J. Black. Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images "In the Wild". In International Conference on Computer Vision (ICCV), pages 5358–5367, 2019.

[61] Silvia Zuffi, Angjoo Kanazawa, David W. Jacobs, and Michael J. Black. 3D Menagerie: Modeling the 3D Shape and Pose of Animals. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5524–5532, 2017.

[62] Silvia Zuffi, Angjoo Kanazawa, David W. Jacobs, and Michael J. Black. Lions and Tigers and Bears: Capturing Non-Rigid, 3D, Articulated Shape from Images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3955–3963, 2018.