# ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Cédric Rommel¹ Victor Letzelter¹,³ Nermin Samet¹ Renaud Marlet¹,⁵ Matthieu Cord¹,² Patrick Pérez¹ Eduardo Valle¹,⁴

¹Valeo.ai, Paris, France ²Sorbonne Université, Paris, France ³LTCI, Télécom Paris, Institut Polytechnique de Paris, France ⁴Recod.ai Lab, School of Electrical and Computing Engineering, University of Campinas, Brazil ⁵LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France

We propose ManiPose, a manifold-constrained multi-hypothesis model for human-pose 2D-to-3D lifting. We provide theoretical and empirical evidence that, due to the depth ambiguity inherent to monocular 3D human pose estimation, traditional regression models suffer from pose-topology consistency issues, which standard evaluation metrics (MPJPE, P-MPJPE and PCK) fail to assess. ManiPose addresses depth ambiguity by proposing multiple candidate 3D poses for each 2D input, each with its estimated plausibility. Unlike previous multi-hypothesis approaches, ManiPose forgoes generative models, greatly facilitating its training and usage. By constraining the outputs to lie on the human pose manifold, ManiPose guarantees the consistency of all hypothetical poses, in contrast to previous works. We showcase the performance of ManiPose on real-world datasets, where it outperforms state-of-the-art models in pose consistency by a large margin while being very competitive on the MPJPE metric.

1 Introduction

We propose ManiPose, a novel approach for human-pose 2D-to-3D lifting. ManiPose directly addresses the depth ambiguity inherent to monocular 3D human pose estimation by being both multi-hypothesis and manifold-constrained, thus avoiding the pose consistency issues that plague traditional regression-based methods.
Unlike previous multi-hypothesis approaches, ManiPose forgoes the use of costly generative models, while still estimating the plausibility of each hypothesis.

Monocular 3D human pose estimation (HPE) is a challenging learning problem that aims to predict 3D human poses given an image or a video from a single camera. Often, the problem is split into two successive steps: first 2D human pose estimation, then 2D-to-3D lifting. Such separation is favorable because 2D-HPE is much more mature, leading to better overall results. Due to depth ambiguity and occlusions, 2D-to-3D lifting is intrinsically ill-posed: multiple 3D poses correspond to the same projection observed in 2D. Despite that, the field has experienced fast developments, with substantial improvements in terms of mean per-joint position error (MPJPE) and derived metrics (e.g., P-MPJPE, PCK) [52, 53, 42, 47]. However, recent studies [49, 12, 40] noted that poses predicted by state-of-the-art models fail to respect basic invariances of human morphology, such as bilateral sagittal symmetry, or the constant length across time of the rigid body segments connecting the joints. Not only do we address those concerns with ManiPose (see Fig. 1), but we also provide theoretical elements clarifying the cause of those issues.

Figure 1: Optimizing both 3D position and pose consistency requires combining constraints and multiple hypotheses. Results from Tables 2 and 4. Previous unconstrained methods provide inconsistent poses (top). Regularization (MR) and disentanglement constraints improve consistency, but degrade joint position error (bottom-right). Ours is the only method that achieves both good joint error and consistency, thanks to a combination of disentanglement and a few hypotheses (see circle sizes).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We show in particular that pose consistency and traditional performance metrics (such as MPJPE)
cannot be optimized simultaneously by a standard regression model, because MPJPE ignores the topology of the space of human poses, and traditional regression models imply unimodality, thus overlooking the inherently ambiguous nature of 3D-HPE.

Our contributions include:
1. ManiPose, a novel, multi-hypothesis, manifold-constrained model for human-pose 2D-to-3D lifting, which is able to estimate the plausibility of each hypothesis without resorting to costly generative models.
2. Theoretical insights that elucidate why traditional regression models associated with standard metrics such as MPJPE fail to enforce pose consistency.
3. Extensive empirical results, including comparisons to strong baselines, evaluation on two challenging datasets (Human3.6M and MPI-INF-3DHP), and ablations. ManiPose outperforms state-of-the-art methods by a substantial margin in terms of pose consistency, while still beating them on the MPJPE metric. The ablations confirm the importance both of multiple hypotheses and of constraining the poses to their manifold.

The PyTorch [37] implementation of ManiPose and the code used for all our experiments can be found at https://github.com/cedricrommel/manipose.

2 Related work

Regression-based 2D-to-3D pose lifting. While 2D-to-3D human pose lifting was initially restricted to static frames [31, 3], the field embraced recurrent [13], convolutional [38] and graph neural networks [2, 55, 14, 51] to handle motion. Spatio-temporal transformers appeared more recently [42, 53], including MixSTE [52], arguably becoming the state of the art. We adopt them in our work. A few previous works constrain predicted poses to respect human symmetries [50, 4], an idea we advance with a novel constraint implementation, in a multi-hypothesis setting.

SMPL-based methods.
While 3D human pose lifting's objective is to predict 3D joint positions based on 2D keypoints, the neighboring field of human pose and shape reconstruction (HPSR) aims at estimating whole 3D body meshes from images. HPSR is hence more challenging than 3D-HPE, which explains why models are often larger, frame-based and more reliant on optimization-based post-processing [16, 39, 46, 9]. Nonetheless, our work shares some ideas with this field. Indeed, modern HPSR methods often predict joint angles (and body shape parameters), which are fed to the pre-trained parametric model SMPL [29] to produce human body meshes, thus ensuring that limb sizes remain constant throughout a movement. Note, however, that these are also single-hypothesis regression methods and hence share the same caveats as most 3D-HPE approaches.

Multi-hypothesis 3D-HPE. The intrinsic depth ambiguity of 3D-HPE led the community to investigate multi-hypothesis approaches, including mixture density networks [25, 36, 1], variational autoencoders [44], normalizing flows [18, 49] and diffusion models [12, 6, 10]. Contrary to ours, those methods rely on a generative model to sample 3D pose hypotheses conditioned on the 2D input. A notable exception is MHFormer [27], which, like ManiPose, is deterministic, but treats the hypotheses as intermediate representations to be aggregated at the final network layers, thus concluding with a one-to-one 2D-to-3D mapping. We strive to avoid such injectivity and to preserve the multiple hypotheses, for reasons we justify both empirically and theoretically in the next sections. Moreover, none of the previous multi-hypothesis approaches constrain hypotheses to lie on the human pose manifold, thus failing to guarantee good pose consistency.

Multiple choice learning (MCL) [11] is a simple approach for estimating multimodal distributions, suited for ambiguous tasks, using the winner-takes-all loss. Adapted for deep learning by Lee et al.
[20, 21], it produces diverse predictors, each specialized in a particular subset of the data distribution. MCL has proved its effectiveness in several computer vision tasks [41, 19, 33, 8, 30, 45], and was first applied to 2D-HPE in [41]. Our work is the first to employ MCL for the 3D-HPE task, by leveraging recent innovations of Letzelter et al. [22].

3 ManiPose

Figure 2: Overview of ManiPose. The rotations module predicts K possible sequences of segment rotations with their corresponding likelihoods (scores), while the segments module estimates the shared segment lengths. Hence, predicted poses are constrained to a manifold defined by the estimated lengths, guaranteeing their consistency.

Following the previous state of the art, we split 3D-HPE into two steps, first estimating $J$ human 2D keypoints in the pixel space from a sequence of $T$ video frames $[x_1, \dots, x_T] \in \mathbb{R}^{2 \times J \times T}$, and then lifting them to 3D joint positions $[\hat{p}_1, \dots, \hat{p}_T] \in \mathbb{R}^{3 \times J \times T}$. We focus on the second step (i.e., lifting) in the rest of the paper, assuming the availability of 2D keypoints $x_i$. Our method aims to both ensure pose consistency and resolve depth ambiguity, as we discuss in the next section.

3.1 Constraining predictions to the pose manifold

Rationale. Human morphology prevents the joints from arbitrarily occupying the whole space. Instead, the poses within a movement are restricted to a manifold, reflecting the human skeleton's rigidity. If we knew the length of each segment connecting pairs of joints for a given subject, we could guarantee that the predicted poses lie on the correct pose manifold by only predicting the body parts' rotations with respect to a reference skeleton. Since we do not have access to ground-truth segment lengths in real use cases, we propose to predict them, thus disentangling the estimation of the reference lengths (fixed across time) from the estimation of the joint rotations (variable across time).
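To make this disentanglement concrete, here is a minimal NumPy sketch of a manifold-constrained decoder in the spirit of Section 3.1: a 6D-to-rotation conversion in the style of [54], followed by forward kinematics along a toy 4-joint chain. Function names, shapes and the `PARENTS` table are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Convert a 6D rotation representation to a 3x3 rotation matrix
    via Gram-Schmidt orthonormalization of its two 3D halves."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2p = a2 - np.dot(b1, a2) * b1          # remove component along b1
    b2 = a2p / np.linalg.norm(a2p)
    b3 = np.cross(b1, b2)                   # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)

# Hypothetical 4-joint chain: joint 0 is the root, each joint's parent
# is the previous one (a real skeleton is a tree, e.g. 17 joints).
PARENTS = [-1, 0, 1, 2]

def forward_kinematics(u, s, r6d, parents=PARENTS):
    """Decode unit reference offsets u (J, 3), segment lengths s (J,)
    and per-joint 6D rotations r6d (J, 6) into joint positions (J, 3).
    Rotations compose along the kinematic chain, so every joint lies at
    exactly distance s[j] from its parent: the manifold constraint
    holds by construction, whatever the network outputs."""
    J = len(parents)
    p = np.zeros((J, 3))
    R_glob = [np.eye(3)] * J
    for j in range(J):
        if parents[j] < 0:                  # root joint stays at origin
            continue
        R_glob[j] = R_glob[parents[j]] @ rot6d_to_matrix(r6d[j])
        p[j] = p[parents[j]] + s[j] * (R_glob[j] @ u[j])
    return p
```

Because the decoder only ever places a joint at distance $s_j$ from its parent, any rotation output yields a pose on the manifold defined by the predicted lengths, regardless of how wrong the rotations are.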
Disentangled representations. We constrain model predictions to lie on an estimated manifold by predicting parametrized disentangled transformations of a reference pose $u \in (\mathbb{R}^3)^J$, for which all segments have unit length. Namely, we propose to split the network into two parts (cf. Fig. 2):
1. the segments module, which predicts segment lengths $s \in \mathbb{R}^{J-1}$, shared by the $T$ frames (time steps) of the input sequence;
2. the rotations module, which predicts the rotation $r = [r_{1,0}, \dots, r_{T,J-1}] \in (\mathbb{R}^d)^{J \times T}$ of each joint relative to its parent joint at each time step.

Rotations representation. We represent rotations using 6D continuous embeddings (i.e., $d = 6$). Compared to quaternions or axis-angles, those representations are continuous and, hence, better learned by neural networks, as demonstrated by their proposers [54].

Pose decoding. To deliver pose predictions in $(\mathbb{R}^3)^{J \times T}$, the intermediate representations $(s, r)$ must be decoded. We achieve that in three steps (cf. Fig. 3):
1. We scale the unit segments of the reference pose $u \in (\mathbb{R}^3)^J$ using $s$, forming a scaled reference pose $\bar{u}$: $\bar{u}_j = \bar{u}_{\tau(j)} + s_j (u_j - u_{\tau(j)})$ for $0 < j \leq J-1$, where $\tau$ maps the index of a joint to its parent's, if any.
2. For each time step $1 \leq t \leq T$ and joint $0 \leq j < J$, we convert the predicted rotation representations $r_{t,j}$ into rotation matrices $R_{t,j} \in SO(3)$ (Algorithm 1).
3. We apply those rotation matrices $R_{t,j}$ at each time step $t$ to the scaled reference pose $\bar{u}$ using forward kinematics (Algorithm 2).

3.2 Multiple choice learning

ManiPose architecture. As explained in the introduction, the inherent depth ambiguity of pose lifting requires multiple hypotheses to conciliate pose consistency and MPJPE performance. To address this, we adopt the multiple choice learning (MCL) [21] framework, more precisely leveraging the resilient MCL approach proposed by Letzelter et al. [22].
This methodology allows the estimation of conditional distributions for regression tasks, enabling our model to predict multiple plausible 3D poses for each 2D input. Specifically, instead of a single rotation $r_t \in (\mathbb{R}^d)^J$ per time step, ManiPose's rotations module predicts an intermediate representation $e_t \in (\mathbb{R}^{d'})^J$ that feeds $K$ linear heads (with weights $W^k_r$ and $W^k_\gamma$), each predicting its own rotation hypothesis $r^k_t \in (\mathbb{R}^d)^J$ with a corresponding likelihood $\gamma^k_t \in [0, 1]$. That is, for all $1 \leq t \leq T$, $r^k_t = W^k_r e_t$ and $\gamma^k_t = \sigma[\tilde{\gamma}_t]_k$, where the softmax function $\sigma$ is applied to the vector $\tilde{\gamma}_t = [\tilde{\gamma}^1_t, \dots, \tilde{\gamma}^K_t] \in \mathbb{R}^K$ of intermediate values $\tilde{\gamma}^k_t = W^k_\gamma e_t$. All rotation hypotheses are decoded together with the shared segment-length predictions $s$, resulting in $K$ hypothetical pose sequences $\hat{p}^k = (\hat{p}^k_t)_{t=1}^T$, with corresponding likelihood sequences $\gamma^k = (\gamma^k_t)_{t=1}^T$, called scores hereafter (Fig. 2).

Loss function. As in [22], ManiPose is trained with a composite loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{wta}} + \beta \mathcal{L}_{\mathrm{score}} . \quad (1)$$

The first term, $\mathcal{L}_{\mathrm{wta}}$, is the winner-takes-all loss [21]

$$\mathcal{L}_{\mathrm{wta}}(\hat{p}(x), p) = \frac{1}{T} \sum_{t=1}^{T} \min_{k \in [\![1,K]\!]} \ell\big(\hat{p}^k_t(x), p_t\big) , \quad (2)$$

where $\ell(\hat{p}^k_t(x), p_t) \triangleq \frac{1}{J} \sum_{j=0}^{J-1} \| p_{t,j} - \hat{p}^k_{t,j}(x) \|_2$, and $\hat{p}^k_t(x)$ denotes the pose prediction at time $t$ using the $k$th head. The second term, $\mathcal{L}_{\mathrm{score}}$, is the scoring loss

$$\mathcal{L}_{\mathrm{score}}(\hat{p}(x), \gamma(x), p) = \frac{1}{T} \sum_{t=1}^{T} H\big(\delta(\hat{p}_t, p_t), \gamma_t(x)\big) , \quad (3)$$

where $H(\cdot, \cdot)$ is the cross-entropy, $\hat{p}_t = (\hat{p}^k_t)_{k=1}^K$, and

$$[\delta(\hat{p}_t, p_t)]_k \triangleq \mathbb{1}\Big[ k \in \operatorname*{arg\,min}_{k' \in [\![1,K]\!]} \ell\big(\hat{p}^{k'}_t, p_t\big) \Big] \quad (4)$$

is the indicator of the winner pose hypothesis, i.e., the one closest to the ground truth. Eq. (3) is the average cross-entropy between target and predicted scores $\gamma_t(x) \in [0, 1]^K$ at each time $t$. Those losses are complementary. The winner-takes-all loss updates only the best predicted hypothesis, specializing each head on part of the data distribution [21]. The scoring loss allows the model to learn how likely each head is to win, thus avoiding overconfidence of non-winner heads (cf. [19, 45]).
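The composite loss of Eqs. (1)-(4) can be sketched in NumPy for a single sequence as follows; the array names and shapes (`hyps`, `scores`, `target`) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def composite_loss(hyps, scores, target, beta=1.0):
    """hyps: (K, T, J, 3) pose hypotheses; scores: (K, T) softmax-ed
    likelihoods; target: (T, J, 3) ground-truth poses.
    Returns L = L_wta + beta * L_score, in the spirit of Eqs. (1)-(4)."""
    # Per-hypothesis, per-frame error l(p^k_t, p_t): mean joint distance.
    errs = np.linalg.norm(hyps - target[None], axis=-1).mean(axis=-1)  # (K, T)
    winners = errs.argmin(axis=0)                                      # (T,)
    # Winner-takes-all loss: only the best hypothesis per frame counts.
    l_wta = errs.min(axis=0).mean()
    # Scoring loss: cross-entropy between the one-hot winner indicator
    # delta and the predicted scores, averaged over frames.
    T = target.shape[0]
    l_score = -np.log(scores[winners, np.arange(T)] + 1e-12).mean()
    return l_wta + beta * l_score
```

During training, only the winning head receives gradients from the first term, while the second pushes the scores toward the empirical winning frequencies of each head.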
Conditional distribution estimation. As detailed in [22], the model may be interpreted probabilistically as a multimodal conditional density estimator. More precisely, it models the distribution $P(p \mid x)$ of 3D poses conditioned on 2D poses as a mixture of Dirac distributions:

$$\hat{P}(p \mid x) = \sum_{k=1}^{K} \gamma^k(x) \, \delta_{\hat{p}^k(x)}(p) . \quad (5)$$

Hence, the predicted conditional distribution has, at each predicted hypothesis $\hat{p}^k$, a peak whose likelihood is given by the predicted score $\gamma^k$. As described in Section 4, interpreting hypotheses and scores probabilistically enables us to handle depth ambiguity.

4 Formal analysis

Figure 3: Pose decoder overview.

ManiPose, as outlined in Section 3, is crafted to address the flaws inherent in unconstrained, single-hypothesis lifting-based 3D-HPE methods (see Fig. 1). This section shows that without ManiPose's critical components (multiple hypotheses and manifold constraint), it is impossible to simultaneously minimize joint error and ensure pose consistency (Section 4.1). To illustrate this, a toy example within a simplified 1D-to-2D framework is provided in Section 4.2.

4.1 Single-hypothesis position-error minimization leads to inconsistent skeleton lengths

We formally highlight the limitations of unconstrained single-hypothesis 3D-HPE, justifying our approach, which combines consistency constraints and multiple hypotheses to resolve depth ambiguity. Let $p = [p_1, \dots, p_J] \in \mathbb{R}^{3 \times J}$ be a human pose, defined by the Cartesian 3D coordinates of each of the $J$ joints of a predefined skeleton. Then, a sequence of $T$ poses of the same subject at increasing time steps $t_1 \leq \dots \leq t_T \in \mathbb{R}$ forms a movement $m = [p_1, \dots, p_T] \in \mathbb{R}^{3 \times J \times T}$. Assuming bone length is fixed during a movement (which is empirically verifiable in human pose datasets), the poses $p_t$ of $m$ must all lie on the same smooth manifold.

Proposition 4.1 (Human pose manifold).
Assuming a rigid skeleton, all poses of a movement $m = [p_t]_{t=1}^T$ lie on a manifold $\mathcal{M}$ of dimension $2(J-1)$:

$$\forall t \in \{1, \dots, T\}, \quad p_t \in \mathcal{M} . \quad (6)$$

Proof sketch. (Detailed in Appendix B.) Skeleton rigidity implies that, if $i$ is a joint connected to the root, then it lies on a 2D sphere $S^2(0, s_{i,0})$ centered at the origin with fixed radius $s_{i,0}$. Another joint $j$ linked to $i$ has a position expressible in spherical coordinates relative to $i$ with fixed radius $s_{j,i}$. That implies a homeomorphism between the position $p_{t,j}$ of joint $j$ and the direct product of spheres centered at the origin $S^2(0, s_{i,0}) \times S^2(0, s_{j,i})$. By induction, one can show that $p_t$ lies on a subspace of $(\mathbb{R}^3)^J$ which is homeomorphic to a product of spheres centered at the origin.

Proposition 4.1 implies that all poses predicted for a video sequence should ideally lie on the same manifold $\mathcal{M}$ as the ground-truth data, which is homeomorphic to the direct product of 2D unit spheres $(S^2)^{J-1}$ (cf. Appendix B). Crucially, we can further show that minimizing joint position error with a single-hypothesis model necessarily leads to predicted poses lying outside the true manifold:

Proposition 4.2 (Inconsistency of the MSE minimizer). With a rigid skeleton and mild assumptions on the training distribution, predicted 3D poses minimizing the traditional mean squared error (MSE) loss lie outside the pose manifold $\mathcal{M}$.

Proof sketch. (See Appendix B.) Consider a skeleton with $J$ joints, with $(x, p)$ denoting pairs of corresponding 2D inputs and 3D poses. Let the function $\ell = (\ell_j)_{j=1}^{J-1}$ compute the lengths of the segments in a pose, which must remain constant. On a dataset $\{(x_i, p_i)\}_{i=1}^N$ drawn from the joint distribution of 2D and 3D poses, let the expected MSE of a traditional predictive model $f$ be $\mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big]$. Let the ideal model $f^*$ be the one minimizing that expected MSE, which is the conditional expectation $f^*(x) = \mathbb{E}[p \mid x]$.
Jensen's inequality and the rigidity assumption imply that, for any joint $j$, $\ell_j^2(f^*(x)) < s_j^2$, where $s_j$ is the true length of the segment associated with joint $j$. This shows that the poses predicted by $f^*$ violate the original segment-length constraints, and thus the original rigidity assumption.

Proposition 4.2 has the following implications:
1. Traditional unconstrained single-hypothesis approaches are bound to predict inconsistent movements, where bone lengths may vary.
2. With a single hypothesis, models constrained to the manifold will always lose to unconstrained models in terms of MPJPE performance (formalized in Corollary B.1).
3. The only way of reaching both optimal MPJPE and consistency is through multiple hypotheses (formalized in Corollary B.3).

Therefore, the MPJPE metric (and its traditional extensions) is insufficient to assess 3D-HPE, as it completely ignores pose consistency. Furthermore, we prove in Appendix B.2 that multiple hypotheses (constrained or not) can always reach better joint position errors than single-hypothesis models.

4.2 Insights into the formal argument in a simplified setting

Table 1: 1D-to-2D performance. Fig. 4-D setting, results averaged over five random seeds.

| Model | MPJPE | Distance to circle |
|---|---|---|
| Unconst. MLP | 0.753 ± 0.008 | 0.42 ± 0.01 |
| Constrained MLP | 0.777 ± 0.027 | 0.00 ± 0.00 |
| ManiPose | 0.752 ± 0.012 | 0.00 ± 0.00 |

We illustrate the argument of Section 4.1 in a simplified 1D-to-2D setup. We further generalize this intuitive illustration to the 2D-to-3D setting in Appendix C of the supplementary material. As in human pose lifting, we take a root joint $J_0$ as reference, fixed at $(0, 0)$. For a joint $J_1$, the problem amounts to predicting the 2D position $(x, y)$, given its 1D projection $u = x$, assuming a constant distance $s = 1$ between them. This simplification ignores the camera perspective and considers the joints to be connected by a rigid segment, as in the case of human poses.
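Proposition 4.2 can be checked numerically in this toy setting (with an arbitrary bimodal example, chosen for this sketch rather than taken from Fig. 4): the MSE-optimal single prediction is the conditional mean of points on the unit circle, whose norm is strictly below 1, i.e., it falls off the manifold, while two hypotheses cover both modes exactly.

```python
import numpy as np

# Two equally likely 2D positions on the unit circle sharing the same
# 1D projection u = x (depth ambiguity): (x, +y) and (x, -y).
x = 0.6
y = np.sqrt(1.0 - x**2)
modes = np.array([[x, y], [x, -y]])

# MSE-optimal single-hypothesis prediction: the conditional mean E[p | u].
f_star = modes.mean(axis=0)

# Its implied segment length is 0.6 < 1: strictly inside the circle,
# so the MSE minimizer violates the rigidity constraint.
radius = np.linalg.norm(f_star)

# A two-hypothesis predictor can place one hypothesis on each mode,
# reaching zero oracle error while staying exactly on the circle.
oracle_err = min(np.linalg.norm(m - modes[0]) for m in modes)
```

This is the dilemma depicted in Fig. 4-B: no one-to-one mapping can be both MSE-optimal and consistent, whereas a constrained multi-hypothesis predictor escapes it.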
We train three different models with comparable architectures on two datasets $\{(x_i, (x_i, y_i))\}_{i=1}^N$ sampled from the angular distributions represented in blue in Fig. 4:
1. a 2-layer MLP trained to minimize the mean squared error between the true joint position $(x, y)$ and the predicted one $(\hat{x}, \hat{y})$;
2. a constrained MLP of the same size, predicting the angle $\hat{\theta}$ instead of the joint position;
3. ManiPose: our constrained multi-hypothesis model, predicting $K = 2$ possible angles $(\hat{\theta}_k)_{k=1}^K$ with their corresponding likelihoods.

Fig. 4 shows that the traditional unconstrained single-hypothesis approach leads to good results in an easy unimodal scenario (C), but fails when facing a more challenging bimodal distribution (D), producing predictions outside the circle manifold, as depth ambiguity makes the lifting problem ill-posed. The single-hypothesis constrained model delivers predictions on the circle, at the cost of worse MPJPE performance than the unconstrained MLP. Such performance decrease is due to the Euclidean topology of the MPJPE metric having its minimum outside the manifold (Fig. 4-B). Crucially, this implies that unconstrained single-hypothesis models are bound to make inconsistent predictions, with varying bone lengths (the circle radius). It also shows that models constrained to the manifold (circle) will always be outcompeted by unconstrained models on MPJPE performance. Predicting multiple hypotheses constrained to the circle, with their respective likelihoods (Fig. 4-B), allows escaping this dilemma, which is exactly what ManiPose does (Fig. 4-D).

Figure 4: (A) 1D-to-2D articulated pose lifting problem, with ground-truth mode probabilities $p(y|x) = 0.67$ and $p(y|x) = 0.33$. (B) True MSE minimizers under a multimodal distribution. One-to-one mappings cannot both reach optimal performance and stay on the pose manifold (dashed circle). (C) Without depth ambiguity, unconstrained models are effective. (D) Ambiguity from multimodal distributions challenges both constrained and unconstrained models. Multi-hypothesis approaches can deliver an acceptable solution to the problem.

The predicted hypotheses are all on the circle, contrary to those of the unconstrained MLP, and spread between the two distribution modes, unlike those of the constrained single-hypothesis method. Moreover, the predicted scores (length of the green lines) match the 2/3 and 1/3 ground-truth likelihoods of the two modes. Those advantages translate into perfect pose consistency and MPJPE performance comparable to the unconstrained MLP (Table 1).

5 Experiments

5.1 Experimental setup

Datasets. We evaluate our model on two 3D-HPE datasets. Human3.6M [15] contains 3.6 million images of 7 actors performing 15 different indoor actions. It is the most widely used dataset for 3D-HPE. Following previous works [52, 27, 53, 38], we train on subjects S1, S5, S6, S7, S8, and test on subjects S9 and S11, adopting a 17-joint skeleton (cf. Fig. 5). We employ a pre-trained CPN [5] to compute the input 2D keypoints, as in [38, 52]. MPI-INF-3DHP [32] also adopts a 17-joint skeleton, but, with fewer samples and both indoor and outdoor scenes, it is more challenging than Human3.6M. We use ground-truth 2D keypoints for this dataset, as usually done [53, 4, 52].

Traditional evaluation metrics. The mean per-joint position error (MPJPE) is the usual performance metric for the datasets above, under different protocols, both reported in mm. In protocol #1, the root joint position is set as a reference, and the predicted root position is translated to 0. In protocol #2 (P-MPJPE), predictions are additionally Procrustes-corrected.
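The two protocols can be sketched as follows; this is an illustrative NumPy version (assuming the root joint at index 0 and a similarity-Procrustes alignment with rotation, translation and scale), not the official evaluation code.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Protocol #1: root-relative mean per-joint position error.
    pred, gt: (J, 3) arrays of joint positions, in mm."""
    pred = pred - pred[root]          # translate predicted root to 0
    gt = gt - gt[root]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol #2: MPJPE after Procrustes alignment (rotation,
    translation and scale) of the prediction onto the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```

By construction, P-MPJPE is invariant to any similarity transform of the prediction, which is why it is always reported alongside, not instead of, protocol #1.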
For MPI-INF-3DHP, additional thresholded metrics derived from MPJPE are often reported, such as the AUC (area under the curve) and the PCK (percentage of correct keypoints) with a threshold at 150 mm, as explained in [32].

Pose consistency metrics. MPJPE being insufficient to assess pose consistency (Section 4), we further assess to which extent predicted skeletons are rigid by measuring the average standard deviation of segment lengths across time in predicted action sequences:

$$\mathrm{MPSCE} \triangleq \frac{1}{J-1} \sum_{j=1}^{J-1} \sqrt{\frac{1}{T} \sum_{t=1}^{T} \big(s_{t,j,\tau(j)} - \bar{s}_{j,\tau(j)}\big)^2} , \quad (7)$$

with $s_{t,j,i} = \|\hat{p}_{t,j} - \hat{p}_{t,i}\|_2$ and $\bar{s}_{j,i} = \frac{1}{T} \sum_{t=1}^{T} s_{t,j,i}$, where $\tau$ was defined in Section 3.1. We call this metric, reported in mm, the mean per-segment consistency error (MPSCE). Following [12, 40], we also assess the bilateral symmetry of predicted skeletons through the mean per-segment symmetry error (MPSSE), in mm:

$$\mathrm{MPSSE} \triangleq \frac{1}{T \, |J_{\mathrm{left}}|} \sum_{t=1}^{T} \sum_{j \in J_{\mathrm{left}}} \big| s_{t,j,\tau(j)} - s_{t,j',\tau(j')} \big| , \quad \text{with } j' = \zeta(j) , \quad (8)$$

where $J_{\mathrm{left}}$ denotes the set of indices of left-side joints and $\zeta$ maps left-side joint indices to their right-side counterparts.

Multi-hypothesis evaluation protocol. One must decide how to use multiple hypotheses to compute the metrics. The dominant approach [24, 25, 36, 44, 49, 12] is the oracle evaluation, i.e., using the predicted hypothesis closest to the ground truth (i.e., Eq. (2) for MPJPE). That makes sense for multi-hypothesis methods, as the oracle metric measures the distance between the target and the discrete set of predicted hypotheses. It aligns with the idea of many possible outputs for a given input. Hypotheses can also be aggregated into a final pose, e.g., through unweighted or weighted averaging (using predicted scores). The latter has the disadvantage of falling back to a one-to-one mapping scheme, which is precisely what we want to avoid in a multi-hypothesis setting. We report both oracle and aggregated metrics in our experiments, favoring oracle results.

Implementation details. ManiPose, as presented in Section 3, is compatible with any backbone.
Here, we chose to build on the MixSTE [52] network for both the rotations and the segments modules (the latter at a reduced scale). Details about our architecture and training appear in Appendix D.

5.2 Comparison with the state of the art

Table 2: Pose consistency evaluation of state-of-the-art methods on Human3.6M. MPJPE performance and pose consistency are not correlated; only ManiPose excels in both. T: sequence length. K: number of hypotheses. Orac.: metric computed using the oracle hypothesis. Grey lines (in the original layout): methods where the oracle MPJPE is computed with a non-comparable number of hypotheses with respect to the other baselines. Bold: best; underlined: second best. *: method with unavailable code; MPSSE values reported in [12]. †: results with a comparable number of hypotheses. ‡: results computed with the official checkpoint and code.

| Method | T | K | Orac. MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|
| Single-hypothesis methods: | | | | | |
| ST-GCN [2] | 7 | 1 | 48.8 | 8.9 | 10.8 |
| VideoPose3D [38] | 243 | 1 | 46.8 | 6.5 | 7.8 |
| PoseFormer [53] | 81 | 1 | 44.3 | 4.3 | 7.2 |
| Anatomy3D [4] | 243 | 1 | 44.1 | 1.4 | 2.0 |
| MixSTE [52] | 243 | 1 | 40.9 | 8.8 | 9.9 |
| Multi-hypothesis methods: | | | | | |
| Wehrbein et al. [49] | 1 | 200 | 44.3 | 12.2 | 14.8 |
| DiffPose (Holmquist et al.) [12]* | 1 | 200 | 43.3 | 14.9 | - |
| GFPose [6] | 1 | 200 | 35.6 | 13.1 | 16.5 |
| D3DP (P-Best) [43] | 243 | 20 | 39.5 | 6.9 | 9.0 |
| GFPose [6]† | 1 | 10 | 45.1 | 13.1 | 16.5 |
| Sharma et al. [44] | 1 | 10 | 46.8 | 13.0 | 9.9 |
| DiffPose (Gong et al.) [10] | 243 | 5 | 39.3 | 5.2 | 6.1 |
| MHFormer [27] | 351 | 3 | 43.0 | 5.7 | 8.0 |
| ManiPose (Ours) | 243 | 5 | 42.1 | 0.4 | 0.8 |
| ManiPose (Ours) | 243 | 5 | 39.1 | 0.3 | 0.5 |

Human3.6M. Comparisons with state-of-the-art single- and multi-hypothesis methods are presented in Table 2 and illustrated in Fig. 1. ManiPose outperforms previous methods in terms of oracle MPJPE in comparable scenarios, while reaching nearly perfect consistency. Moreover, note that MPJPE and consistency metrics are not positively correlated for single-hypothesis methods.
As predicted in Section 4.1, our empirical results show that the MPJPE improvements achieved by MixSTE come at the cost of poorer consistency compared to previous models. In contrast, the only single-hypothesis constrained model, Anatomy3D [4], achieves good consistency at the expense of inferior MPJPE. Those results empirically validate the theoretical predictions of Sections 4.1 and B, further confirming what we showed, intuitively, in the simplified 1D-to-2D setting (Section 4.2).

Note that while ManiPose is deterministic, previous multi-hypothesis methods are generative, except for MHFormer. Table 2 shows that they require up to two orders of magnitude more hypotheses than ManiPose to reach competitive performance (see, e.g., the performance of GFPose). This property is expected. Indeed, optimization based on winner-takes-all theoretically leads to an optimal coverage of the modes of the conditional distribution with a fixed number of samples [23], in contrast to generative approaches. This is reflected in the oracle metric, which approximates the so-called quantization (or distortion) error, as defined in (27), when the number of data points is large. More detailed per-action MPJPE results appear in Tables 8 and 9 of the supplemental material. We also complement our analysis of ManiPose's diversity in Fig. 11 of the appendix. Fig. 6 showcases qualitative results, where multiple hypotheses help in depth-ambiguous situations.

Table 3: Comparison with the state of the art on MPI-INF-3DHP using ground-truth 2D poses. T: sequence length.

| Method | T | PCK | AUC | MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|---|
| VideoPose3D [38] | 81 | 85.5 | 51.5 | 84.8 | 10.4 | 27.5 |
| PoseFormer [53] | 9 | 86.6 | 56.4 | 77.1 | 10.8 | 14.2 |
| MixSTE [52] | 27 | 94.4 | 66.5 | 54.9 | 17.3 | 21.6 |
| P-STMO [42] | 81 | 97.9 | 75.8 | 32.2 | 8.5 | 11.3 |
| ManiPose (Ours) Aggr. | 27 | 98.0 | 75.3 | 37.7 | 0.6 | 1.3 |
| ManiPose (Ours) Orac. | 27 | 98.4 | 77.0 | 34.6 | 0.6 | 1.3 |

MPI-INF-3DHP. Similar results hold for this dataset (cf. Table 3).
Not only does ManiPose reach consistency errors close to 0, but it also achieves the best PCK and AUC performance. As for MPJPE, only [42] achieves slightly better performance, at the cost of large pose consistency errors.

5.3 Ablation study

Table 4: Ablation study: a single hypothesis cannot optimize both MPJPE and consistency. ManiPose uses the same backbone as MixSTE. MR: with manifold regularization. MC: manifold-constrained. Bold: best. Underlined: second best.

| Method | MR | MC | K | # Params | MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|---|---|
| ManiPose (Ours) | | ✓ | 5 | 34.44 M | 39.1 | 0.3 | 0.5 |
| w/o MH | | ✓ | 1 | 34.42 M | 44.6 | 0.3 | 0.5 |
| w/o MC, w/ MR | ✓ | | 1 | 33.78 M | 42.3 | 5.7 | 7.3 |
| w/o MR (MixSTE) | | | 1 | 33.78 M | 40.9 | 8.8 | 9.9 |

Figure 5: MPSCE, MPSSE and MPJPE per segment/coordinate (lower is better). ManiPose mostly helps to deal with the depth ambiguity (z coordinate). Ground-truth poses are represented but not visible because they have perfect consistency.

Impact of components. We evaluate the impact of removing each component of ManiPose on the Human3.6M performance (Table 4). The components tested are the multiple hypotheses (MH) and the manifold constraint (MC). We also compare MC to a more standard manifold regularization (MR), i.e., adding Eq. (7) to the loss. Note that without all these components, we fall back to MixSTE [52], and that the performances reported in Table 4 also appear in Fig. 1. We see that MR helps to improve pose consistency, but not as much as MC. However, without multiple hypotheses, the consistency improvements of MC come at the cost of degraded MPJPE performance, as foreseen by our formal analysis (Section 4). Only the combination of MC and MH allows us to optimize both consistency and MPJPE.

Fine error analysis. We can see in Fig. 5 that, compared to MixSTE, ManiPose reaches substantially better MPSSE and MPSCE, with consistency improvements across all skeleton segments. Furthermore, note that the largest MixSTE errors occur for the KNEE-FOOT and ELBOW-WRIST segments, which are the most prone to depth ambiguity.
That agrees with the coordinate-wise errors depicted in Fig. 5, showing that ManiPose's improvements mostly translate into a reduction of MixSTE's depth errors, which are twice as large as for the other coordinates. Further ablations, including the effect of the number of hypotheses $K$, the score loss weight $\beta$ and the choice of rotations representation, appear in the supplemental material.

Figure 6: Qualitative comparison between ManiPose and the state-of-the-art regression method MixSTE. Two pairs of hypotheses predicted by ManiPose are illustrated in green-pink (left) and green-purple (right), where opacity represents the predicted scores. Multiple hypotheses and constraints help to deal with depth ambiguities and avoid predicting shorter limbs (red circles).

6 Conclusion

We presented a new manifold-constrained multi-hypothesis human pose lifting method (ManiPose) and demonstrated its empirical superiority over the existing state of the art on two challenging datasets. Further, we provided theoretical evidence supporting the tenets of our method, by showing the inherent limitations of unconstrained single-hypothesis approaches to 3D-HPE. We established that unconstrained single-hypothesis methods cannot deliver consistent poses and that constraining or regularizing single-hypothesis models leads to worse position errors. We also showed that traditional MPJPE-like metrics are insufficient to assess consistency.

Limitations. To guarantee consistency, ManiPose relies on the forward kinematics algorithm, which is inherently sequential across joints. Removing that dependence is an interesting avenue for accelerating the method. On another note, while ManiPose ensures the rigidity of the predicted poses, imposing constraints reflecting human body articulation limits presents another area for enhancement.
Acknowledgments and Disclosure of Funding

This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011014073 made by GENCI. It was also partly funded by the French Association for Technological Research (ANRT CIFRE contract 2022-1854). We are grateful to the reviewers for their insightful comments.

References

[1] Bishop, C.M.: Mixture density networks. Working paper, Aston University (1994)
[2] Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2272–2281 (2019)
[3] Chen, C.H., Ramanan, D.: 3D human pose estimation = 2D pose estimation + matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7035–7043 (2017)
[4] Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., Luo, J.: Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology 32(1), 198–209 (2021)
[5] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7103–7112 (2018)
[6] Ci, H., Wu, M., Zhu, W., Ma, X., Dong, H., Zhong, F., Wang, Y.: GFPose: Learning 3D human pose prior with gradient fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4800–4810 (2023)
[7] Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: Applications and algorithms. SIAM Review 41(4), 637–676 (1999)
[8] Firman, M., Campbell, N.D., Agapito, L., Brostow, G.J.: DiverseNet: When one right answer is not enough. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp.
5598–5607 (2018)
[9] Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4D: Reconstructing and tracking humans with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14783–14794 (2023)
[10] Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: DiffPose: Toward more reliable 3D pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13041–13051 (2023)
[11] Guzman-Rivera, A., Batra, D., Kohli, P.: Multiple choice learning: Learning to produce multiple structured outputs. Advances in Neural Information Processing Systems 25 (2012)
[12] Holmquist, K., Wandt, B.: DiffPose: Multi-hypothesis human pose estimation using diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15977–15987 (2023)
[13] Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 68–84 (2018)
[14] Hu, W., Zhang, C., Zhan, F., Zhang, L., Wong, T.T.: Conditional directed graph convolution for 3D human pose estimation. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 602–611 (2021)
[15] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), 1325–1339 (Jul 2014)
[16] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7122–7131 (2018)
[17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980 (2014)
[18] Kolotouros, N., Pavlakos, G., Jayaraman, D., Daniilidis, K.: Probabilistic modeling for human mesh recovery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11605–11614 (2021)
[19] Lee, K., Hwang, C., Park, K., Shin, J.: Confident multiple choice learning. In: International Conference on Machine Learning. pp. 2014–2023. PMLR (2017)
[20] Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D.: Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 (2015)
[21] Lee, S., Purushwalkam Shiva Prakash, S., Cogswell, M., Ranjan, V., Crandall, D., Batra, D.: Stochastic multiple choice learning for training diverse deep ensembles. Advances in Neural Information Processing Systems 29 (2016)
[22] Letzelter, V., Fontaine, M., Chen, M., Pérez, P., Essid, S., Richard, G.: Resilient multiple choice learning: A learned scoring scheme with application to audio scene analysis. Advances in Neural Information Processing Systems 36 (2024)
[23] Letzelter, V., Perera, D., Rommel, C., Fontaine, M., Essid, S., Richard, G., Pérez, P.: Winner-takes-all learners are geometry-aware conditional density estimators. In: Proceedings of the 41st International Conference on Machine Learning (2024)
[24] Li, C., Lee, G.H.: Generating multiple hypotheses for 3D human pose estimation with mixture density network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9887–9895 (2019)
[25] Li, C., Lee, G.H.: Weakly supervised generative network for multiple 3D human pose hypotheses. In: British Machine Vision Conference (BMVC) (2020)
[26] Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., Lu, C.: HybrIK: A hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3383–3393 (2021)
[27] Li, W., Liu, H., Tang, H., Wang, P., Van Gool, L.: MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13147–13156 (2022)
[28] Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.c., Asari, V.: Attention mechanism exploits temporal contexts: Real-time 3D human pose reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5064–5073 (2020)
[29] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
[30] Makansi, O., Ilg, E., Cicek, O., Brox, T.: Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7144–7153 (2019)
[31] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2640–2649 (2017)
[32] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 International Conference on 3D Vision (3DV). pp. 506–516. IEEE (2017)
[33] Mun, J., Lee, K., Shin, J., Han, B.: Learning to specialize with knowledge distillation for visual question answering. Advances in Neural Information Processing Systems 31 (2018)
[34] Murray, R.M., Li, Z., Sastry, S.S.: A Mathematical Introduction to Robotic Manipulation. CRC Press (2017)
[35] Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y., Yoo, J.: Reliable fidelity and diversity metrics for generative models.
In: International Conference on Machine Learning. pp. 7176–7185. PMLR (2020)
[36] Oikarinen, T., Hannah, D., Kazerounian, S.: GraphMDN: Leveraging graph structure and deep learning to solve inverse problems. In: 2021 International Joint Conference on Neural Networks (IJCNN). pp. 1–9. IEEE (2021)
[37] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)
[38] Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7753–7762 (2019)
[39] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11488–11499 (2021)
[40] Rommel, C., Valle, E., Chen, M., Khalfaoui, S., Marlet, R., Cord, M., Pérez, P.: DiffHPE: Robust, coherent 3D human pose lifting with diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3220–3229 (2023)
[41] Rupprecht, C., Laina, I., Di Pietro, R., Baust, M., Tombari, F., Navab, N., Hager, G.D.: Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3591–3600 (2017)
[42] Shan, W., Liu, Z., Zhang, X., Wang, S., Ma, S., Gao, W.: P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation (Jul 2022)
[43] Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, K., Wang, S., Ma, S., Gao, W.: Diffusion-based 3D human pose estimation with multi-hypothesis aggregation.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14761–14771 (2023)
[44] Sharma, S., Varigonda, P.T., Bindal, P., Sharma, A., Jain, A.: Monocular 3D human pose estimation by generation and ordinal ranking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2325–2334 (2019)
[45] Tian, K., Xu, Y., Zhou, S., Guan, J.: Versatile multiple choice learning and its application to vision computing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6349–6357 (2019)
[46] Tiwari, G., Antić, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-NDF: Modeling human pose manifolds with neural distance fields. In: European Conference on Computer Vision. pp. 572–589. Springer (2022)
[47] Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. In: European Conference on Computer Vision. pp. 764–780. Springer (2020)
[48] Waskom, M.L.: seaborn: statistical data visualization. Journal of Open Source Software 6(60), 3021 (2021). https://doi.org/10.21105/joss.03021
[49] Wehrbein, T., Rudolph, M., Rosenhahn, B., Wandt, B.: Probabilistic monocular 3D human pose estimation with normalizing flows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11199–11208 (2021)
[50] Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., Zhang, W.: Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 899–908 (2020)
[51] Xu, T., Takano, W.: Graph stacked hourglass networks for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16105–16114 (2021)
[52] Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.: MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13232–13242 (2022)
[53] Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11636–11645. IEEE, Montreal, QC, Canada (Oct 2021). https://doi.org/10.1109/ICCV48922.2021.01145
[54] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2019)
[55] Zou, Z., Tang, W.: Modulated graph convolutional network for 3D human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11477–11487 (2021)

Appendix / supplemental material

This supplemental material is organized as follows: Appendix A contains empirical verification of our assumptions; Appendix B presents the proofs of our theoretical results, together with a few corollaries; Appendix C provides further implementation details concerning the 1D-to-2D experiment, as well as an extension to the 2D-to-3D setting; Appendix D contains implementation and training details concerning ManiPose, as well as the compared baselines; Appendix E presents further results of the Human3.6M experiment; and, finally, Appendix F explains the provided experiment code.

A Assumption verifications

Let us first define a few elements that we will need for our derivations.

Definition A.1 (Human skeleton). We define a human skeleton as an undirected connected graph $G = (V, E)$ with $J = |V|$ nodes, called joints, associated with different human body articulation points. We assume a predefined order of joints and denote $A = [A_{ij}]_{0 \le i,j < J}$ the corresponding adjacency matrix.
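The graph of Definition A.1 can be encoded compactly with a parent array implementing the parent function τ used throughout the appendix. The 17-joint topology below is the commonly used Human3.6M joint ordering, assumed here for illustration and not taken from the paper.

```python
import numpy as np

# A commonly used 17-joint Human3.6M-style topology (an assumption of this sketch):
# PARENT[j] = tau(j), the parent of joint j, with -1 marking the root joint 0.
PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def adjacency(parent):
    """Build the symmetric adjacency matrix A = [A_ij] of Definition A.1
    from a parent array describing the skeleton tree."""
    J = len(parent)
    A = np.zeros((J, J), dtype=int)
    for j, p in enumerate(parent):
        if p >= 0:
            A[j, p] = A[p, j] = 1
    return A

A = adjacency(PARENT)
```

The matrix is symmetric (the graph is undirected), has J − 1 edges (a tree), and is connected, matching the requirements of Definition A.1.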
Finally, we assume that the conditional distribution of poses does not collapse to a single point, i.e., that we have a one-to-many problem:

Assumption A.5 (Non-degenerate conditional distribution). Given a joint distribution $P(x^G, p^G)$ of 3D poses $p^G \in (\mathbb{R}^3)^J$ and corresponding 2D inputs $x^G \in (\mathbb{R}^2)^J$, we assume that the conditional distribution $P(p^G \mid x^G)$ is non-degenerate, i.e., it is not a single Dirac distribution. Note that this can be true even when $P(x^G, p^G)$ is unimodal (e.g., Fig. 4).

We verified on Human3.6M [15] ground-truth data that assumptions A.4 and A.5 hold for actual poses in both training and test splits.

Segments rigidity. As shown in Figs. 5 and 9, ground-truth 3D poses have perfect MPSSE (8) and MPSCE (7) metrics, meaning that ground-truth skeletons are perfectly symmetric, with rigid segments. Assumption A.4 is thus verified in actual training and test data.

Non-degenerate distributions. As shown in Fig. 7, the conditional distribution of ground-truth 3D poses given 2D keypoint positions is clearly multimodal and, thus, non-degenerate (not reduced to a single Dirac distribution). That validates assumption A.5 and explains why multi-hypothesis techniques are necessary.

Figure 7 [panels: (a) S9, Walking; (b) S1, Greeting; (c) S11, Directions; (d) S1, Sitting Down]: Estimated joint distributions of ground-truth 2D inputs (u, v pixel coordinates) together with 3D z-coordinates (depth) for different subjects and actions. The depth density conditional on inputs is clearly multimodal. Vertical red lines are examples of depth-ambiguous inputs. Distributions are estimated with a kernel density estimator from the Seaborn plotting library [48].

B Proofs and additional corollaries

B.1 Properties of manifold constraint and multi-hypotheses models

This section contains the proofs of the theoretical results presented in Section 4.1, together with a few corollaries.

PROOF.
[Proposition 4.1] Let $i$ be a joint connected to the root $p_0$ (i.e., $A_{i0} = 1$). From assumptions A.3 and A.4, we know that at any instant $t$, $p^G_{t,i}$ lies on the sphere $S^2(0, s_{i,0})$ centered at $0$ with radius $s_{i,0}$ independent of time. Therefore, its position can be fully parameterized in spherical coordinates by two angles $(\theta_{t,i}, \phi_{t,i})$. Let $j$ be a joint connected to $i$. As before, assumption A.4 implies that at any instant $t$, $p^G_{t,j}$ lies on the moving sphere $S^2(p^G_{t,i}, s_{j,i})$ centered at $p^G_{t,i}$ with radius $s_{j,i}$ independent of time. Thus, we can fully describe $p^G_{t,j}$ with the position of its center $p^G_{t,i}$ and the spherical coordinates $(\theta_{t,j}, \phi_{t,j})$ of joint $j$ relative to the center of the sphere, i.e., joint $i$. That means that there is a bijection between the positions attainable by $p^G_{t,j}$ at any instant and the direct product of spheres $S^2(0, s_{i,0}) \times S^2(0, s_{j,i})$. That bijection is a homeomorphism, since it is a composition of homeomorphisms: we can compute $p^G_{t,j}$ from $(\theta_{t,i}, \phi_{t,i}, \theta_{t,j}, \phi_{t,j})$ following the forward kinematics algorithm [34] (cf. Algo. 2), i.e., using a composition of rotations and translations.

Now let us assume for some arbitrary joint $k$ that $p^G_{t,k}$ lies at all times on a space $M_{2d}$ homeomorphic to a product of spheres of dimension $2d$. That means that $p^G_{t,k}$ can be fully parametrized using $2d$ spherical angles $(\theta_1, \phi_1, \dots, \theta_d, \phi_d)$. Let $l$ be a joint connected to $k$ (typically one step further away from the root joint $p_0$ and not already represented in $M_{2d}$). As before, at any instant $t$, $p^G_{t,l}$ needs to lie on the sphere centered at $p^G_{t,k}$ of constant radius $s_{k,l}$. Thus, we can fully describe $p^G_{t,l}$ using the $2(d+1)$-tuple of angles obtained by concatenating its spherical coordinates relative to joint $k$ with the $2d$-tuple describing $p^G_{t,k}$, i.e., the center of the sphere. So $p^G_{t,l}$ lies on a space $M_{2(d+1)}$ homeomorphic to a product of spheres of dimension $2(d+1)$.
We can conclude by induction that at any instant $t$, $p_t = [p^G_{t,1}, \dots, p^G_{t,J}]$ lies on the same subspace of $(\mathbb{R}^3)^J$, which is homeomorphic to a product of spheres centered at the origin: $\prod_{i > 0} S^2(0, s_{i,\tau(i)})$.

PROOF. [Proposition 4.2] [...] and $s_{j,\tau(j)} > 0$, we can say that $\ell_j(f^\star(x)) < s_{j,\tau(j)}$ for all joints $j$. We conclude that the model $f^\star$ minimizing the MSE predicts poses that violate assumption A.4 and are inconsistent.

As an immediate corollary of proposition 4.2, we may state the following result, which was empirically illustrated in many parts of our paper:

Corollary B.1. Given a fixed training distribution $P(x, p)$ respecting assumptions A.3–A.5, for every 3D-HPE model $f$ predicting consistent poses, i.e., respecting assumption A.4, there is an inconsistent model with strictly lower mean-squared error.

PROOF. Let $f^\star \in \arg\min_{\tilde f} \mathrm{MSE}(\tilde f)$. According to proposition 4.2, $f^\star$ is inconsistent. Suppose that the consistent model $f$ is such that

$$\mathrm{MSE}(f) \le \mathrm{MSE}(f^\star). \tag{19}$$

Since MSE reaches its minimum at $f^\star$, we have $\mathrm{MSE}(f) = \mathrm{MSE}(f^\star)$. Thus $f \in \arg\min_{\tilde f} \mathrm{MSE}(\tilde f)$, which means that $f$ is also inconsistent according to proposition 4.2. That is impossible given that we assumed $f$ to be consistent. We conclude that Eq. (19) is wrong and that

$$\mathrm{MSE}(f) > \mathrm{MSE}(f^\star). \tag{20}$$

Note that proposition 4.2 and corollary B.1 assume the use of the MSE loss, which is the most widely used loss in 3D-HPE. We can however extend them to the case where the MPJPE serves as optimization criterion, under an additional technical assumption:

Corollary B.2. The predicted poses minimizing the mean-per-joint-position-error loss are inconsistent if the training pose distribution $P(x, p)$ verifies Asm. A.3–A.5 and if the standard deviation of the joint-wise residual norms is small compared to the joint-wise loss:

$$\frac{\mathbb{V}_{x,p}\big[\|p_j - f_j(x)\|_2\big]}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \approx 0\,. \tag{21}$$

PROOF. From proposition 4.2 we know that the poses predicted by the minimizer $f^\star$ of

$$\mathrm{MSE}(f) = \mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big] \tag{22}$$

are inconsistent. Let $f_j$ be the component of $f$ corresponding to the $j$th joint.
We define the $j$th mean-per-joint-position-error component as:

$$\mathrm{MPJPE}_j(f) \triangleq \mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]. \tag{23}$$

Under the small-variance assumption, we have:

$$\frac{\mathbb{V}_{x,p}\big[\|p_j - f_j(x)\|_2\big]}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \tag{24}$$

$$= \frac{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2^2\big] - \mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \tag{25}$$

$$= \frac{\mathrm{MSE}_j(f) - \mathrm{MPJPE}_j(f)^2}{\mathrm{MPJPE}_j(f)^2} \approx 0\,, \tag{26}$$

so both criteria, MSE and MPJPE, are asymptotically equivalent and have the same minimizer $f^\star$, which is inconsistent according to proposition 4.2.

Corollary B.3. Under Asm. A.4–A.5 and under (21), the only way to get both optimal MPJPE and consistency is to use multiple hypotheses.

PROOF. Corollary B.1 and Proposition 4.2 imply that single-hypothesis models (constrained or not) deliver either suboptimal MPJPE or inconsistent pose predictions. Hence, by negation, we get our result.

In the next section, we further show that multi-hypothesis models, constrained or not, can theoretically achieve a better L2-risk (or quantization) performance than single-hypothesis models.

B.2 Multiple hypotheses (constrained or not) can improve L2-risk over single-hypothesis models

Let $\mathcal{X} = (\mathbb{R}^2)^J$ denote the space of input 2D poses and $\mathcal{P} = (\mathbb{R}^3)^J$ the space of 3D poses. Also, let $\mathcal{R}(f) = \mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big]$ be the L2-risk of some pose estimator $f$ under some underlying continuous joint distribution of 2D-3D pose pairs $P(x, p)$, with density $\rho$ (when it exists). Before stating the proposition, we need to define an adapted notion of risk for multi-hypothesis models under the oracle aggregation scheme:

Definition B.4 (Winner-takes-all risk, [41]). As in [41] (Section 3.2) and in [23] (Section 2.2), we define the L2-risk for $K$-head models $f_{\mathrm{WTA}} = (f^1_{\mathrm{WTA}}, \dots, f^K_{\mathrm{WTA}})$ as:

$$\mathcal{R}^K_{\mathrm{WTA}}(f_{\mathrm{WTA}}) \triangleq \int_{\mathcal{X}} \sum_{k=1}^K \int_{V_k(f_{\mathrm{WTA}}(x))} \|f^k_{\mathrm{WTA}}(x) - p\|_2^2\, \rho(x, p)\, \mathrm{d}p\, \mathrm{d}x\,, \tag{27}$$

where $V_k(g)$ denotes the $k$th cell of the Voronoi tessellation of the output space $\mathcal{P}$ defined by the generators $g = (g^1, \dots, g^K) \in \mathcal{P}^K$:

$$V_k(g) \triangleq \big\{p \in \mathcal{P} \;\big|\; \|g^k - p\|_2^2 < \|g^r - p\|_2^2,\ \forall r \neq k\big\}. \tag{28}$$
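The winner-takes-all risk of Definition B.4 can be estimated empirically: each ground-truth sample is charged only to its closest hypothesis (the generator of the Voronoi cell it falls into). The minimal Monte-Carlo sketch below (function name and toy data are assumptions) also illustrates numerically why multiple hypotheses can beat the MSE-optimal single prediction on a bimodal target distribution.

```python
import numpy as np

def empirical_wta_risk(hypotheses, targets):
    """Monte-Carlo estimate of the WTA risk in Eq. (27) for fixed hypotheses:
    each target is charged the squared distance to its *closest* hypothesis.
    hypotheses: (K, D) generators; targets: (N, D) samples of p."""
    d2 = ((targets[:, None, :] - hypotheses[None, :, :]) ** 2).sum(-1)  # (N, K)
    return float(d2.min(axis=1).mean())

# Bimodal 1D targets: half the mass at -1, half at +1.
targets = np.concatenate([np.full((500, 1), -1.0), np.full((500, 1), 1.0)])
single = np.array([[0.0]])           # the MSE-optimal single hypothesis (the mean)
double = np.array([[-1.0], [1.0]])   # two WTA hypotheses, one per mode
```

Here the single mean-predictor incurs a risk of 1.0, while the two-hypothesis model reaches 0, consistent with the inequality the section goes on to prove.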
The risk above translates the notion of oracle pose, since it partitions the space of ground-truth poses $\mathcal{P}$ into regions where some hypothesis is the closest, and uses only that hypothesis to compute the risk in that region. Note that $\mathcal{R}^1_{\mathrm{WTA}}(f) = \mathcal{R}(f)$ for any function $f$, since a single-cell tessellation of $\mathcal{P}$ is $\mathcal{P}$ itself. In the following, we assume that $f$ is expressive enough, so that minimizing the risk (27) comes down to minimizing

$$\sum_{k=1}^K \int_{V_k(f_{\mathrm{WTA}}(x))} \|f^k_{\mathrm{WTA}}(x) - p\|_2^2\, \rho(x, p)\, \mathrm{d}p$$

for each $x \in \mathcal{X}$.

Proposition B.5 (Optimality of manifold-constrained multi-hypothesis models). A $K$-hypotheses model $f^\star_{\mathrm{WTA}} = (f^{1,\star}_{\mathrm{WTA}}, \dots, f^{K,\star}_{\mathrm{WTA}})$ minimizing (27) always has a risk lower than or equal to that of a single-hypothesis model $f^\star_{\mathrm{MSE}}$ minimizing $\mathcal{R}$:

$$\mathcal{R}^K_{\mathrm{WTA}}(f^\star_{\mathrm{WTA}}) \le \mathcal{R}^1_{\mathrm{WTA}}(f^\star_{\mathrm{MSE}}) = \mathcal{R}(f^\star_{\mathrm{MSE}})\,. \tag{29}$$

PROOF. Following [23] (Section 2.2), we decouple the cell generators from the risk arguments in (27):

$$\mathcal{K}(g, z) \triangleq \sum_{k=1}^K \int_{V_k(g)} \|z^k - p\|_2^2\, \rho(p|x)\, \mathrm{d}p\,, \tag{30}$$

for any generators $g = (g^1, \dots, g^K) \in \mathcal{P}^K$ and arguments $z = (z^1, \dots, z^K) \in \mathcal{P}^K$. Note that $\mathcal{R}^K_{\mathrm{WTA}}(f) = \int_{\mathcal{X}} \mathcal{K}(f(x), f(x))\, \rho(x)\, \mathrm{d}x$. According to Proposition 3.1 of [7] (or Proposition 2.1 in [23]), if $f^\star_{\mathrm{WTA}}$ minimizes $\mathcal{R}^K_{\mathrm{WTA}}$, then $(f^\star_{\mathrm{WTA}}(x), f^\star_{\mathrm{WTA}}(x))$ has to minimize $\mathcal{K}$ for all $x \in \mathcal{X}$:

$$\mathcal{K}(f^\star_{\mathrm{WTA}}(x), f^\star_{\mathrm{WTA}}(x)) \le \mathcal{K}(g, z)\,, \quad \forall (g, z) \in \mathcal{P}^K \times \mathcal{P}^K. \tag{31}$$

Let us choose $g$ such that $g^k = f^{k,\star}_{\mathrm{WTA}}(x)$ and $z$ such that $z^k = f^\star_{\mathrm{MSE}}(x)$ for all $1 \le k \le K$. Then

$$\mathcal{R}^K_{\mathrm{WTA}}(f^\star_{\mathrm{WTA}}) \le \int_{\mathcal{X}} \sum_{k=1}^K \int_{V_k(f^\star_{\mathrm{WTA}}(x))} \|f^\star_{\mathrm{MSE}}(x) - p\|_2^2\, \rho(p|x)\, \rho(x)\, \mathrm{d}p\, \mathrm{d}x = \mathcal{R}(f^\star_{\mathrm{MSE}})\,, \tag{32}$$

where the last equality comes from the fact that $\{V_k(f^\star_{\mathrm{WTA}}(x))\}_{1 \le k \le K}$ defines a partition of $\mathcal{P}$.

C Further details of 1D-to-2D case study

C.1 Implementation details

Datasets. We created a dataset of input-output pairs $\{(x_i, (x_i, y_i))\}_{i=1}^N$, divided into 1 000 training examples, 1 000 validation examples and 1 000 test examples.
Since the 2D position of $J_1$ is fully determined by the angle $\theta$ between the segment $(J_0, J_1)$ and the x-axis, the dataset is generated by first sampling $\theta$ from a von Mises mixture distribution, then converting it into Cartesian coordinates $(x_i, y_i)$ to form the outputs, and finally projecting them onto the x-axis to obtain the inputs.

Distribution scenarios. We considered three distribution scenarios with different levels of difficulty:
1. Easy scenario: a unimodal distribution centered at $\theta = 2\pi/5$, where the axis of maximum 2D variance is approximately parallel to the x-axis (Fig. 4-A).
2. Difficult unimodal scenario: a unimodal distribution centered at $\theta = 0$, where the axis of maximum 2D variance is perpendicular to the x-axis (Fig. 4-B).
3. Difficult multimodal scenario: a bimodal distribution, with modes at $\theta_1 = \pi/3$ and $\theta_2 = -\pi/3$ and mixture weights $w_1 = 2/3$ and $w_2 = 1/3$, i.e., where the projections of the modes onto the x-axis are close to each other (Fig. 4-C).

All von Mises components in all scenarios had concentrations equal to 20.

Architectures and training. All three models were based on a multi-layer perceptron (MLP) with 2 hidden layers of 32 neurons each, using tanh activations. The constrained and unconstrained MLPs were trained using the mean-squared loss $\frac{1}{N}\sum_{i=1}^N \big((\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2\big)$. ManiPose was trained with the loss in Eq. (1) and had $K = 2$ heads. We trained all models with batches of 100 examples for a maximum of 50 epochs. We used the Adam optimizer [17], with default hyperparameters and no weight decay. Learning rates were searched for each model and distribution independently over a small grid, $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ (cf. selected values in Table 5). They were scheduled during training using a plateau strategy with factor 0.5, patience of 10 epochs and threshold of $10^{-4}$.

Table 5: Selected learning rates for the 1D-to-2D synthetic experiment.

Distribution        A          B          C
Unconstr. MLP       $10^{-3}$  $10^{-3}$  $10^{-2}$
Constrained MLP     $10^{-2}$  $10^{-4}$  $10^{-2}$
ManiPose            $10^{-2}$  $10^{-3}$  $10^{-2}$

C.2 Extension to 2D-to-3D setup with more joints

We further extend the two-joint 1D-to-2D lifting experiment of Section 4.2 to 2D-to-3D with three joints, aiming at providing a scenario that is closer to real-world 3D-HPE, but that can still be fully dissected and visualized. As in Section 4.2, we suppose that joint $J_0$ is at the origin at all times, that $J_1$ is connected to $J_0$ through a rigid segment of length $s_0$, and that $J_2$ is connected to $J_1$ through a second rigid segment of length $s_1 < s_0$. We further assume that both $J_1$ and $J_2$ are allowed to rotate around two axes orthogonal to each other. Thus, $J_1$ is constrained to lie on a circle $S^1(0, s_0)$, while $J_2$ lies on a torus $\mathcal{T}$ homeomorphic to $S^1(0, s_0) \times S^1(0, s_1)$. Without loss of generality, we set the radii $s_0 = 2$ and $s_1 = 1$ and assume them to be known.

Given that setup, we are interested in learning to predict the 3D pose $(J_1, J_2) = (x_1, y_1, z_1, x_2, y_2, z_2) \in \mathbb{R}^6$, given its 2D projection $(K_1, K_2) = (x_1, z_1, x_2, z_2) \in \mathbb{R}^4$. We create a dataset comprising 20 000 training, 2 000 validation, and 2 000 test examples, sampled using an arbitrary von Mises mixture over the poloidal and toroidal angles $(\theta, \phi)$ of $\mathcal{T}$. We set the modes of the mixture at $[(-\pi, 0), (0, \pi/4), (-1/2, \pi/4), (2\pi/3, \pi/2)]$, with concentrations $[2, 4, 3, 10]$ and weights $[0.3, 0.4, 0.2, 0.1]$. Similarly to Fig. 4-C, that creates a difficult multimodal distribution, depicted in Fig. 8.

Figure 8: Visualisation of the von Mises mixture distribution on the torus $\mathcal{T}$. The different colors (blue, green, red, purple) represent the modes of the sampled points. Only joint $J_2$ is represented here, for clarity.

We train and evaluate the same baselines as in Section 4.2 in that new scenario, using a similar setup (cf. Appendix C.1, Architectures and training). Note that for these experiments, we used an initial learning rate of $10^{-3}$ for each baseline, and a batch size of 1 000 examples.
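The toy 3-joint geometry of Appendix C.2 can be generated as follows. The axis conventions and the standard torus embedding are assumptions of this sketch (the paper only fixes s0 = 2 and s1 = 1), and a uniform angle distribution is used here for simplicity in place of the von Mises mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n, s0=2.0, s1=1.0):
    """Sample toy 3-joint chains: J1 lies on the circle S1(0, s0) and J2 on a
    torus homeomorphic to S1(0, s0) x S1(0, s1). Axis conventions assumed."""
    theta = rng.uniform(-np.pi, np.pi, n)  # toroidal angle (rotation of J1)
    phi = rng.uniform(-np.pi, np.pi, n)    # poloidal angle (rotation of J2 about J1)
    j1 = np.stack([s0 * np.cos(theta), s0 * np.sin(theta), np.zeros(n)], axis=1)
    # J2 offsets J1 by s1 along the local radial and vertical directions:
    r = s0 + s1 * np.cos(phi)
    j2 = np.stack([r * np.cos(theta), r * np.sin(theta), s1 * np.sin(phi)], axis=1)
    return j1, j2

j1, j2 = sample_chain(1000)
```

By construction, every sample satisfies the rigidity constraints exactly: ‖J1‖ = s0 and ‖J2 − J1‖ = s1, which is what gives ground-truth data its perfect MPSCE in Table 6.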
The corresponding mean per-segment consistency error (MPSCE) and mean per-joint position error (MPJPE) results are reported in Table 6.

Table 6: Mean per-joint prediction error (MPJPE) and mean per-segment consistency error (MPSCE) in a 2D-to-3D scenario. Results are averaged over five random seeds. ManiPose reaches perfect MPSCE consistency without degrading MPJPE performance.

Method             MPJPE            MPSCE
Unconstr. MLP      1.152 ± 0.021    0.269 ± 0.018
Constrained MLP    1.166 ± 0.028    0.000 ± 0.000
ManiPose           1.149 ± 0.036    0.000 ± 0.000

We see that the same observations as in Section 4.2 also apply here: although the unconstrained MLP yields competitive MPJPE results, its predictions are not consistently aligned with the manifold, as indicated by its poor MPSCE performance. Again, we show that ManiPose offers an effective balance between maintaining manifold consistency and achieving high joint-position accuracy.

D Further ManiPose implementation details

D.1 Architectural details

Our architecture is backbone-agnostic, as shown in Fig. 2. Thus, in order to have a fair comparison, we decided to implement it using the most powerful architecture available, i.e., MixSTE [52]. In practice, the rotations module follows the MixSTE architecture, with $d_l = 8$ spatio-temporal transformer blocks of dimension $d_m = 512$ and a time receptive field of $T = 243$ frames for Human3.6M experiments and $T = 43$ frames for MPI-INF-3DHP experiments. Unlike MixSTE, that network outputs rotation embeddings of dimension 6 for each joint and frame, instead of Cartesian coordinates of dimension 3. The segment module was also implemented with a smaller MixSTE backbone, of depth $d_l = 2$ and dimension $d_m = 128$. The ablation study presented in Table 4 shows that the increase in the number of parameters from MixSTE to ManiPose is negligible.

D.2 Pose decoding details

The pose decoding block from Fig. 2 is described in Section 3.1 and is based on Algorithms 1 and 2.
The whole procedure is illustrated in Fig. 3.

Table 7: Joint-wise weights used in the winner-takes-all loss of Eq. (2) (as in [52]).

Joint    0  1  2    3    4  5    6    7  8  9  10   11   12  13  14   15  16
Weight   1  1  2.5  2.5  1  2.5  2.5  1  1  1  1.5  1.5  4   4   1.5  4   4

Algorithm 1: 6D rotation representation conversion [54]
Require: predicted 6D rotation representation $r \in \mathbb{R}^6$.
1: $x \leftarrow [r_0, r_1, r_2]^\top$
2: $y \leftarrow [r_3, r_4, r_5]^\top$
3: $x \leftarrow x / \|x\|_2$
4: $z \leftarrow x \times y$
5: $z \leftarrow z / \|z\|_2$
6: $y \leftarrow z \times x$
7: return $R = [x\,|\,y\,|\,z] \in \mathbb{R}^{3 \times 3}$

Algorithm 2: Forward kinematics [34, 26]
Require: scaled reference pose $u \in (\mathbb{R}^3)^J$, predicted rotation matrices $R_{t,j}$, $0 \le j < J$.
1: $R'_{t,0} \leftarrow R_{t,0}$
2: $p_{t,0} \leftarrow u_0$
3: for $j = 1, \dots, J-1$ do
4:   $R'_{t,j} \leftarrow R_{t,j} R'_{t,\tau(j)}$    ▷ compose relative rotations
5:   $p_{t,j} \leftarrow R'_{t,j}(u_j - u_{\tau(j)}) + p_{t,\tau(j)}$
6: end for
7: return $p_t = [p_{t,j}]_{0 \le j < J}$
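Algorithms 1 and 2 can be transcribed directly in NumPy. Array layout and the parent-array encoding of τ are implementation choices of this sketch; the per-joint composition order follows Algorithm 2 as stated.

```python
import numpy as np

def rot6d_to_matrix(r):
    """Algorithm 1: Gram-Schmidt conversion of a 6D rotation representation [54]
    into a rotation matrix with columns [x | y | z]."""
    x = np.array(r[:3], dtype=float)
    y = np.array(r[3:6], dtype=float)
    x /= np.linalg.norm(x)
    z = np.cross(x, y)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)

def forward_kinematics(u, R, parent):
    """Algorithm 2: place joints from a scaled reference pose u (J, 3) and
    per-joint rotations R (J, 3, 3); parent[j] = tau(j), with parent[0] = -1."""
    J = u.shape[0]
    R_acc = np.empty_like(R)
    p = np.empty_like(u)
    R_acc[0], p[0] = R[0], u[0]
    for j in range(1, J):
        R_acc[j] = R[j] @ R_acc[parent[j]]              # compose relative rotations
        p[j] = R_acc[j] @ (u[j] - u[parent[j]]) + p[parent[j]]
    return p
```

Since each step only rotates the fixed reference offset $u_j - u_{\tau(j)}$, every predicted segment keeps the reference length exactly; this is how decoding through forward kinematics guarantees pose consistency by construction.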