# ManiPose: Manifold-Constrained Multi-Hypothesis 3D Human Pose Estimation

Cédric Rommel¹ Victor Letzelter¹,³ Nermin Samet¹ Renaud Marlet¹,⁵ Matthieu Cord¹,² Patrick Pérez¹ Eduardo Valle¹,⁴

¹Valeo.ai, Paris, France ²Sorbonne Université, Paris, France ³LTCI, Télécom Paris, Institut Polytechnique de Paris, France ⁴Recod.ai Lab, School of Electrical and Computing Engineering, University of Campinas, Brazil ⁵LIGM, École des Ponts, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France

We propose ManiPose, a manifold-constrained multi-hypothesis model for human-pose 2D-to-3D lifting. We provide theoretical and empirical evidence that, due to the depth ambiguity inherent to monocular 3D human pose estimation, traditional regression models suffer from pose-topology consistency issues, which standard evaluation metrics (MPJPE, P-MPJPE and PCK) fail to assess. ManiPose addresses depth ambiguity by proposing multiple candidate 3D poses for each 2D input, each with its estimated plausibility. Unlike previous multi-hypothesis approaches, ManiPose forgoes generative models, greatly facilitating its training and usage. By constraining the outputs to lie on the human pose manifold, ManiPose guarantees the consistency of all hypothetical poses, in contrast to previous works. We showcase the performance of ManiPose on real-world datasets, where it outperforms state-of-the-art models in pose consistency by a large margin while being very competitive on the MPJPE metric.

1 Introduction

We propose ManiPose, a novel approach for human-pose 2D-to-3D lifting. ManiPose directly addresses the depth ambiguity inherent to monocular 3D human pose estimation by being both multi-hypothesis and manifold-constrained, thus avoiding the pose consistency issues that plague traditional regression-based methods.
Unlike previous multi-hypothesis approaches, ManiPose forgoes the use of costly generative models, while still estimating the plausibility of each hypothesis.

Monocular 3D human pose estimation (HPE) is a challenging learning problem that aims to predict 3D human poses given an image or a video from a single camera. Often, the problem is split into two successive steps: first 2D human pose estimation, then 2D-to-3D lifting. Such separation is favorable because 2D-HPE is much more mature, leading to better overall results. Due to depth ambiguity and occlusions, 2D-to-3D lifting is intrinsically ill-posed: multiple 3D poses correspond to the same projection observed in 2D. Despite that, the field has experienced fast developments, with substantial improvements in terms of mean per-joint position error (MPJPE) and derived metrics (e.g., P-MPJPE, PCK) [52, 53, 42, 47]. However, recent studies [49, 12, 40] noted that poses predicted by state-of-the-art models fail to respect basic invariances of human morphology, such as bilateral sagittal symmetry, or the constant length across time of the rigid body segments connecting the joints. Not only do we address those concerns with ManiPose (see Fig. 1), but we also provide theoretical elements clarifying the cause of those issues.

Figure 1: Optimizing both 3D position and pose consistency requires combining constraints and multiple hypotheses. Results from Tables 2 and 4. Previous unconstrained methods provide inconsistent poses (top). Regularization (MR) and disentanglement constraints improve consistency, but degrade joint position error (bottom-right). Ours is the only method that achieves both good joint error and consistency, thanks to a combination of disentanglement and a few hypotheses (see circle sizes).

38th Conference on Neural Information Processing Systems (NeurIPS 2024).

We show in particular that pose consistency and traditional performance metrics (such as MPJPE)
cannot be optimized simultaneously by a standard regression model, because MPJPE ignores the topology of the space of human poses, and traditional regression models imply unimodality, thus overlooking the inherently ambiguous nature of 3D-HPE.

Our contributions include:
1. ManiPose, a novel, multi-hypothesis, manifold-constrained model for human-pose 2D-to-3D lifting, which is able to estimate the plausibility of each hypothesis without resorting to costly generative models.
2. Theoretical insights that elucidate why traditional regression models associated with standard metrics such as MPJPE fail to enforce pose consistency.
3. Extensive empirical results, including comparisons to strong baselines, evaluation on two challenging datasets (Human3.6M and MPI-INF-3DHP), and ablations. ManiPose outperforms state-of-the-art methods by a substantial margin in terms of pose consistency, while still beating them on the MPJPE metric. The ablations confirm the importance both of multiple hypotheses and of constraining the poses to their manifold.

The PyTorch [37] implementation of ManiPose and the code used for all our experiments can be found at https://github.com/cedricrommel/manipose.

2 Related work

Regression-based 2D-to-3D pose lifting. While 2D-to-3D human pose lifting was initially restricted to static frames [31, 3], the field embraced recurrent [13], convolutional [38] and graph neural networks [2, 55, 14, 51] to handle motion. Spatio-temporal transformers appeared more recently [42, 53], including MixSTE [52], arguably becoming the state of the art. We adopt them in our work. A few previous works constrain predicted poses to respect human symmetries [50, 4], an idea we advance with a novel constraint implementation, in a multi-hypothesis setting.

SMPL-based methods.
While 3D human pose lifting's objective is to predict 3D joint positions based on 2D keypoints, the neighboring field of human pose and shape reconstruction (HPSR) aims at estimating whole 3D body meshes from images. HPSR is hence more challenging than 3D-HPE, which explains why models are often larger, frame-based and more reliant on optimization-based post-processing [16, 39, 46, 9]. Nonetheless, our work shares some ideas with this field. Indeed, modern HPSR methods often predict joint angles (and body shape parameters), which are fed to the pre-trained parametric model SMPL [29] to produce human body meshes, thus ensuring that limb sizes remain constant throughout a movement. Note, however, that these are also single-hypothesis regression methods and hence share the same caveats as most 3D-HPE approaches.

Multi-hypothesis 3D-HPE. The intrinsic depth ambiguity of 3D-HPE led the community to investigate multi-hypothesis approaches, including mixture density networks [25, 36, 1], variational autoencoders [44], normalizing flows [18, 49] and diffusion models [12, 6, 10]. Contrary to ours, those methods rely on a generative model to sample 3D pose hypotheses conditioned on the 2D input. A notable exception is MHFormer [27], which, like ManiPose, is deterministic, but treats the hypotheses as intermediate representations to be aggregated at the final network layers, thus concluding with a one-to-one 2D-to-3D mapping. We strive to avoid such injectivity and to preserve the multiple hypotheses, for reasons we justify both empirically and theoretically in the next sections. Moreover, none of the previous multi-hypothesis approaches constrain hypotheses to lie on the human pose manifold, thus failing to guarantee good pose consistency.

Multiple choice learning (MCL) [11] is a simple approach for estimating multimodal distributions, suited for ambiguous tasks, using the winner-takes-all loss. Adapted for deep learning by Lee et al.
[20, 21], it produces diverse predictors, each specialized in a particular subset of the data distribution. MCL has proved its effectiveness in several computer vision tasks [41, 19, 33, 8, 30, 45], and was first applied to 2D-HPE in [41]. Our work is the first to employ MCL for the 3D-HPE task, by leveraging recent innovations of Letzelter et al. [22].

3 ManiPose

Figure 2: Overview of ManiPose. The rotations module predicts K possible sequences of segment rotations with their corresponding likelihoods (scores), while the segments module estimates the shared segment lengths. Hence, predicted poses are constrained to a manifold defined by the estimated lengths, guaranteeing their consistency.

Following the previous state of the art, we split 3D-HPE into two steps, first estimating $J$ human 2D keypoints in the pixel space from a sequence of $T$ video frames $[x_1, \dots, x_T] \in \mathbb{R}^{2 \times J \times T}$, and then lifting them to 3D joint positions $[\hat{p}_1, \dots, \hat{p}_T] \in \mathbb{R}^{3 \times J \times T}$. We focus on the second step (i.e., lifting) in the rest of the paper, assuming the availability of 2D keypoints $x_i$. Our method aims to both ensure pose consistency and resolve depth ambiguity, as we discuss in the next section.

3.1 Constraining predictions to the pose manifold

Rationale. Human morphology prevents the joints from arbitrarily occupying the whole space. Instead, the poses within a movement are restricted to a manifold, reflecting the human skeleton's rigidity. If we knew the length of each segment connecting pairs of joints for a given subject, we could guarantee that the predicted poses lie on the correct pose manifold by only predicting the body parts' rotations with respect to a reference skeleton. Since we do not have access to ground-truth segment lengths in real use cases, we propose to predict them, thus disentangling the estimation of the reference lengths (fixed across time) from the estimation of the joint rotations (variable across time).
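To make this disentanglement concrete, here is a minimal NumPy sketch of a manifold-constrained decoder in the spirit of Section 3.1: a 6D-to-rotation conversion in the style of [54], followed by forward kinematics along a toy 4-joint chain. Function names, shapes and the `PARENTS` table are illustrative assumptions for this sketch, not the paper's implementation.

```python
import numpy as np

def rot6d_to_matrix(r6):
    """Convert a 6D rotation representation to a 3x3 rotation matrix
    via Gram-Schmidt orthonormalization of its two 3D halves."""
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2p = a2 - np.dot(b1, a2) * b1          # remove component along b1
    b2 = a2p / np.linalg.norm(a2p)
    b3 = np.cross(b1, b2)                   # completes a right-handed basis
    return np.stack([b1, b2, b3], axis=1)

# Hypothetical 4-joint chain: joint 0 is the root, each joint's parent
# is the previous one (a real skeleton is a tree, e.g. 17 joints).
PARENTS = [-1, 0, 1, 2]

def forward_kinematics(u, s, r6d, parents=PARENTS):
    """Decode unit reference offsets u (J, 3), segment lengths s (J,)
    and per-joint 6D rotations r6d (J, 6) into joint positions (J, 3).
    Rotations compose along the kinematic chain, so every joint lies at
    exactly distance s[j] from its parent: the manifold constraint
    holds by construction, whatever the network outputs."""
    J = len(parents)
    p = np.zeros((J, 3))
    R_glob = [np.eye(3)] * J
    for j in range(J):
        if parents[j] < 0:                  # root joint stays at origin
            continue
        R_glob[j] = R_glob[parents[j]] @ rot6d_to_matrix(r6d[j])
        p[j] = p[parents[j]] + s[j] * (R_glob[j] @ u[j])
    return p
```

Because the decoder only ever places a joint at distance $s_j$ from its parent, any rotation output yields a pose on the manifold defined by the predicted lengths, regardless of how wrong the rotations are.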
Disentangled representations. We constrain model predictions to lie on an estimated manifold by predicting parametrized disentangled transformations of a reference pose $u \in (\mathbb{R}^3)^J$, for which all segments have unit length. Namely, we propose to split the network into two parts (cf. Fig. 2):
1. the segments module, which predicts segment lengths $s \in \mathbb{R}^{J-1}$, shared by the $T$ frames (time steps) of the input sequence;
2. the rotations module, which predicts the rotation $r = [r_{1,0}, \dots, r_{T,J-1}] \in (\mathbb{R}^d)^{J \times T}$ of each joint relative to its parent joint at each time step.

Rotations representation. We represent rotations using 6D continuous embeddings (i.e., $d = 6$). Compared to quaternions or axis-angles, those representations are continuous and, hence, better learned by neural networks, as demonstrated by their proposers [54].

Pose decoding. To deliver pose predictions in $(\mathbb{R}^3)^{J \times T}$, the intermediate representations $(s, r)$ must be decoded. We achieve that in three steps (cf. Fig. 3):
1. We scale the unit segments of the reference pose $u \in (\mathbb{R}^3)^J$ using $s$, forming a scaled reference pose $\bar{u}$: $\bar{u}_j = \bar{u}_{\tau(j)} + s_j (u_j - u_{\tau(j)})$ for $0 < j \leq J-1$, where $\tau$ maps the index of a joint to its parent's, if any.
2. For each time step $1 \leq t \leq T$ and joint $0 \leq j < J$, we convert the predicted rotation representations $r_{t,j}$ into rotation matrices $R_{t,j} \in SO(3)$ (Algorithm 1).
3. We apply those rotation matrices $R_{t,j}$ at each time step $t$ to the scaled reference pose $\bar{u}$ using forward kinematics (Algorithm 2).

3.2 Multiple choice learning

ManiPose architecture. As explained in the introduction, the inherent depth ambiguity of pose lifting requires multiple hypotheses to conciliate pose consistency and MPJPE performance. To address this, we adopt the multiple choice learning (MCL) [21] framework, more precisely leveraging the resilient MCL approach proposed by Letzelter et al. [22].
This methodology allows the estimation of conditional distributions for regression tasks, enabling our model to predict multiple plausible 3D poses for each 2D input. Specifically, instead of a single rotation $r_t \in (\mathbb{R}^d)^J$ per time step, ManiPose's rotations module predicts an intermediate representation $e_t \in (\mathbb{R}^{d'})^J$ that feeds $K$ linear heads (with weights $W^k_r$ and $W^k_\gamma$), each predicting its own rotation hypothesis $r^k_t \in (\mathbb{R}^d)^J$ with a corresponding likelihood $\gamma^k_t \in [0, 1]$. That is, for all $1 \leq t \leq T$, $r^k_t = W^k_r e_t$ and $\gamma^k_t = \sigma[\tilde{\gamma}_t]_k$, where the softmax function $\sigma$ is applied to the vector $\tilde{\gamma}_t = [\tilde{\gamma}^1_t, \dots, \tilde{\gamma}^K_t] \in \mathbb{R}^K$ of intermediate values $\tilde{\gamma}^k_t = W^k_\gamma e_t$. All rotation hypotheses are decoded together with the shared segment-length predictions $s$, resulting in $K$ hypothetical pose sequences $\hat{p}^k = (\hat{p}^k_t)_{t=1}^T$, with corresponding likelihood sequences $\gamma^k = (\gamma^k_t)_{t=1}^T$, called scores hereafter (Fig. 2).

Loss function. As in [22], ManiPose is trained with a composite loss

$$\mathcal{L} = \mathcal{L}_{\mathrm{wta}} + \beta \mathcal{L}_{\mathrm{score}} . \quad (1)$$

The first term, $\mathcal{L}_{\mathrm{wta}}$, is the winner-takes-all loss [21]

$$\mathcal{L}_{\mathrm{wta}}(\hat{p}(x), p) = \frac{1}{T} \sum_{t=1}^{T} \min_{k \in [\![1,K]\!]} \ell\big(\hat{p}^k_t(x), p_t\big) , \quad (2)$$

where $\ell(\hat{p}^k_t(x), p_t) \triangleq \frac{1}{J} \sum_{j=0}^{J-1} \| p_{t,j} - \hat{p}^k_{t,j}(x) \|_2$, and $\hat{p}^k_t(x)$ denotes the pose prediction at time $t$ using the $k$th head. The second term, $\mathcal{L}_{\mathrm{score}}$, is the scoring loss

$$\mathcal{L}_{\mathrm{score}}(\hat{p}(x), \gamma(x), p) = \frac{1}{T} \sum_{t=1}^{T} H\big(\delta(\hat{p}_t, p_t), \gamma_t(x)\big) , \quad (3)$$

where $H(\cdot, \cdot)$ is the cross-entropy, $\hat{p}_t = (\hat{p}^k_t)_{k=1}^K$, and

$$[\delta(\hat{p}_t, p_t)]_k \triangleq \mathbb{1}\Big[ k \in \operatorname*{arg\,min}_{k' \in [\![1,K]\!]} \ell\big(\hat{p}^{k'}_t, p_t\big) \Big] \quad (4)$$

is the indicator of the winner pose hypothesis, i.e., the one closest to the ground truth. Eq. (3) is the average cross-entropy between target and predicted scores $\gamma_t(x) \in [0, 1]^K$ at each time $t$. Those losses are complementary. The winner-takes-all loss updates only the best predicted hypothesis, specializing each head on part of the data distribution [21]. The scoring loss allows the model to learn how likely each head is to win, thus avoiding overconfidence of non-winner heads (cf. [19, 45]).
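The composite loss of Eqs. (1)-(4) can be sketched in NumPy for a single sequence as follows; the array names and shapes (`hyps`, `scores`, `target`) are illustrative assumptions, not the paper's code.

```python
import numpy as np

def composite_loss(hyps, scores, target, beta=1.0):
    """hyps: (K, T, J, 3) pose hypotheses; scores: (K, T) softmax-ed
    likelihoods; target: (T, J, 3) ground-truth poses.
    Returns L = L_wta + beta * L_score, in the spirit of Eqs. (1)-(4)."""
    # Per-hypothesis, per-frame error l(p^k_t, p_t): mean joint distance.
    errs = np.linalg.norm(hyps - target[None], axis=-1).mean(axis=-1)  # (K, T)
    winners = errs.argmin(axis=0)                                      # (T,)
    # Winner-takes-all loss: only the best hypothesis per frame counts.
    l_wta = errs.min(axis=0).mean()
    # Scoring loss: cross-entropy between the one-hot winner indicator
    # delta and the predicted scores, averaged over frames.
    T = target.shape[0]
    l_score = -np.log(scores[winners, np.arange(T)] + 1e-12).mean()
    return l_wta + beta * l_score
```

During training, only the winning head receives gradients from the first term, while the second pushes the scores toward the empirical winning frequencies of each head.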
Conditional distribution estimation. As detailed in [22], the model may be interpreted probabilistically as a multimodal conditional density estimator. More precisely, it models the distribution $P(p \mid x)$ of 3D poses conditioned on 2D poses as a mixture of Dirac distributions:

$$\hat{P}(p \mid x) = \sum_{k=1}^{K} \gamma^k(x) \, \delta_{\hat{p}^k(x)}(p) . \quad (5)$$

Hence, the predicted conditional distribution has, at each predicted hypothesis $\hat{p}^k$, a peak whose likelihood is given by the predicted score $\gamma^k$. As described in Section 4, interpreting hypotheses and scores probabilistically enables us to handle depth ambiguity.

4 Formal analysis

Figure 3: Pose decoder overview.

ManiPose, as outlined in Section 3, is crafted to address the flaws inherent in unconstrained, single-hypothesis lifting-based 3D-HPE methods (see Fig. 1). This section shows that without ManiPose's critical components (multiple hypotheses and manifold constraint), it is impossible to simultaneously minimize joint error and ensure pose consistency (Section 4.1). To illustrate this, a toy example within a simplified 1D-to-2D framework is provided in Section 4.2.

4.1 Single-hypothesis position-error minimization leads to inconsistent skeleton lengths

We formally highlight the limitations of unconstrained single-hypothesis 3D-HPE, justifying our approach, which combines consistency constraints and multiple hypotheses to resolve depth ambiguity. Let $p = [p_1, \dots, p_J] \in \mathbb{R}^{3 \times J}$ be a human pose, defined by the Cartesian 3D coordinates of each of the $J$ joints of a predefined skeleton. Then, a sequence of $T$ poses of the same subject at increasing time steps $t_1 \leq \dots \leq t_T \in \mathbb{R}$ forms a movement $m = [p_1, \dots, p_T] \in \mathbb{R}^{3 \times J \times T}$. Assuming bone length is fixed during a movement (which is empirically verifiable in human pose datasets), the poses $p_t$ of $m$ must all lie on the same smooth manifold.

Proposition 4.1 (Human pose manifold).
Assuming a rigid skeleton, all poses of a movement $m = [p_t]_{t=1}^T$ lie on a manifold $\mathcal{M}$ of dimension $2(J-1)$:

$$\forall t \in \{1, \dots, T\}, \quad p_t \in \mathcal{M} . \quad (6)$$

Proof sketch. (Detailed in Appendix B.) Skeleton rigidity implies that, if $i$ is a joint connected to the root, then it lies on a 2D sphere $S^2(0, s_{i,0})$ centered at the origin with fixed radius $s_{i,0}$. Another joint $j$ linked to $i$ has a position expressible in spherical coordinates relative to $i$ with fixed radius $s_{j,i}$. That implies a homeomorphism between the position $p_{t,j}$ of joint $j$ and the direct product of spheres centered at the origin $S^2(0, s_{i,0}) \times S^2(0, s_{j,i})$. By induction, one can show that $p_t$ lies on a subspace of $(\mathbb{R}^3)^J$ which is homeomorphic to a product of spheres centered at the origin.

Proposition 4.1 implies that all poses predicted for a video sequence should ideally lie on the same manifold $\mathcal{M}$ as the ground-truth data, which is homeomorphic to the direct product of 2D unit spheres $(S^2)^{J-1}$ (cf. Appendix B). Crucially, we can further show that minimizing joint position error with a single-hypothesis model necessarily leads to predicted poses lying outside the true manifold:

Proposition 4.2 (Inconsistency of the MSE minimizer). With a rigid skeleton and mild assumptions on the training distribution, predicted 3D poses minimizing the traditional mean squared error (MSE) loss lie outside the pose manifold $\mathcal{M}$.

Proof sketch. (See Appendix B.) Consider a skeleton with $J$ joints, with $(x, p)$ denoting pairs of corresponding 2D inputs and 3D poses. Let the function $\ell = (\ell_j)_{j=1}^{J-1}$ compute the lengths of the segments in a pose, which must remain constant. On a dataset $\{(x_i, p_i)\}_{i=1}^N$ drawn from the joint distribution of 2D and 3D poses, let the expected MSE of a traditional predictive model $f$ be $\mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big]$. Let the ideal model $f^*$ be the one minimizing that expected MSE, which is the conditional expectation $f^*(x) = \mathbb{E}[p \mid x]$.
Jensen's inequality and the rigidity assumption imply that, for any joint $j$, $\ell_j^2(f^*(x)) < s_j^2$, where $s_j$ is the true length of the segment associated with joint $j$. This shows that the poses predicted by $f^*$ violate the original segment-length constraints, and thus the original rigidity assumption.

Proposition 4.2 has the following implications:
1. Traditional unconstrained single-hypothesis approaches are bound to predict inconsistent movements, where bone lengths may vary.
2. With a single hypothesis, models constrained to the manifold will always lose to unconstrained models in terms of MPJPE performance (formalized in Corollary B.1).
3. The only way of reaching both optimal MPJPE and consistency is through multiple hypotheses (formalized in Corollary B.3).

Therefore, the MPJPE metric (and its traditional extensions) is insufficient to assess 3D-HPE, as it completely ignores pose consistency. Furthermore, we prove in Appendix B.2 that multiple hypotheses (constrained or not) can always reach better joint position errors than single-hypothesis models.

4.2 Insights into the formal argument in a simplified setting

Table 1: 1D-to-2D performance. Fig. 4-D setting, results averaged over five random seeds.

| Model | MPJPE | Distance to circle |
|---|---|---|
| Unconst. MLP | 0.753 ± 0.008 | 0.42 ± 0.01 |
| Constrained MLP | 0.777 ± 0.027 | 0.00 ± 0.00 |
| ManiPose | 0.752 ± 0.012 | 0.00 ± 0.00 |

We illustrate the argument of Section 4.1 in a simplified 1D-to-2D setup. We further generalize this intuitive illustration to the 2D-to-3D setting in Appendix C of the supplementary material. As in human pose lifting, we take a root joint $J_0$ as reference, fixed at $(0, 0)$. For a joint $J_1$, the problem amounts to predicting the 2D position $(x, y)$, given its 1D projection $u = x$, assuming a constant distance $s = 1$ between them. This simplification ignores the camera perspective and considers the joints to be connected by a rigid segment, as in the case of human poses.
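Proposition 4.2 can be checked numerically in this toy setting (with an arbitrary bimodal example, chosen for this sketch rather than taken from Fig. 4): the MSE-optimal single prediction is the conditional mean of points on the unit circle, whose norm is strictly below 1, i.e., it falls off the manifold, while two hypotheses cover both modes exactly.

```python
import numpy as np

# Two equally likely 2D positions on the unit circle sharing the same
# 1D projection u = x (depth ambiguity): (x, +y) and (x, -y).
x = 0.6
y = np.sqrt(1.0 - x**2)
modes = np.array([[x, y], [x, -y]])

# MSE-optimal single-hypothesis prediction: the conditional mean E[p | u].
f_star = modes.mean(axis=0)

# Its implied segment length is 0.6 < 1: strictly inside the circle,
# so the MSE minimizer violates the rigidity constraint.
radius = np.linalg.norm(f_star)

# A two-hypothesis predictor can place one hypothesis on each mode,
# reaching zero oracle error while staying exactly on the circle.
oracle_err = min(np.linalg.norm(m - modes[0]) for m in modes)
```

This is the dilemma depicted in Fig. 4-B: no one-to-one mapping can be both MSE-optimal and consistent, whereas a constrained multi-hypothesis predictor escapes it.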
We train three different models with comparable architectures on two datasets $\{(x_i, (x_i, y_i))\}_{i=1}^N$ sampled from the angular distributions represented in blue in Fig. 4:
1. a 2-layer MLP trained to minimize the mean squared error between the true joint position $(x, y)$ and the predicted one $(\hat{x}, \hat{y})$;
2. a constrained MLP of the same size, predicting the angle $\hat{\theta}$ instead of the joint position;
3. ManiPose: our constrained multi-hypothesis model, predicting $K = 2$ possible angles $(\hat{\theta}_k)_{k=1}^K$ with their corresponding likelihoods.

Fig. 4 shows that the traditional unconstrained single-hypothesis approach leads to good results in an easy unimodal scenario (C), but fails when facing a more challenging bimodal distribution (D), producing predictions outside the circle manifold, as depth ambiguity makes the lifting problem ill-posed. The single-hypothesis constrained model delivers predictions on the circle, at the cost of worse MPJPE performance than the unconstrained MLP. Such performance decrease is due to the Euclidean topology of the MPJPE metric having its minimum outside the manifold (Fig. 4-B). Crucially, this implies that unconstrained single-hypothesis models are bound to make inconsistent predictions, with varying bone lengths (the circle radius). It also shows that models constrained to the manifold (circle) will always be outcompeted by unconstrained models on MPJPE performance. Predicting multiple hypotheses constrained to the circle, with their respective likelihoods (Fig. 4-B), allows escaping this dilemma, which is exactly what ManiPose does (Fig. 4-D).

Figure 4: (A) 1D-to-2D articulated pose lifting problem, with ground-truth mode probabilities $p(y|x) = 0.67$ and $p(y|x) = 0.33$. (B) True MSE minimizers under a multimodal distribution. One-to-one mappings cannot both reach optimal performance and stay on the pose manifold (dashed circle). (C) Without depth ambiguity, unconstrained models are effective. (D) Ambiguity from multimodal distributions challenges both constrained and unconstrained models. Multi-hypothesis approaches can deliver an acceptable solution to the problem.

The predicted hypotheses are all on the circle, contrary to those of the unconstrained MLP, and spread between the two distribution modes, unlike those of the constrained single-hypothesis method. Moreover, the predicted scores (length of the green lines) match the 2/3 and 1/3 ground-truth likelihoods of the two modes. Those advantages translate into perfect pose consistency and MPJPE performance comparable to the unconstrained MLP (Table 1).

5 Experiments

5.1 Experimental setup

Datasets. We evaluate our model on two 3D-HPE datasets. Human3.6M [15] contains 3.6 million images of 7 actors performing 15 different indoor actions. It is the most widely used dataset for 3D-HPE. Following previous works [52, 27, 53, 38], we train on subjects S1, S5, S6, S7, S8, and test on subjects S9 and S11, adopting a 17-joint skeleton (cf. Fig. 5). We employ a pre-trained CPN [5] to compute the input 2D keypoints, as in [38, 52]. MPI-INF-3DHP [32] also adopts a 17-joint skeleton, but, with fewer samples and both indoor and outdoor scenes, it is more challenging than Human3.6M. We use ground-truth 2D keypoints for this dataset, as usually done [53, 4, 52].

Traditional evaluation metrics. The mean per-joint position error (MPJPE) is the usual performance metric for the datasets above, under different protocols, both reported in mm. In protocol #1, the root joint position is set as a reference, and the predicted root position is translated to 0. In protocol #2 (P-MPJPE), predictions are additionally Procrustes-corrected.
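The two protocols can be sketched as follows; this is an illustrative NumPy version (assuming the root joint at index 0 and a similarity-Procrustes alignment with rotation, translation and scale), not the official evaluation code.

```python
import numpy as np

def mpjpe(pred, gt, root=0):
    """Protocol #1: root-relative mean per-joint position error.
    pred, gt: (J, 3) arrays of joint positions, in mm."""
    pred = pred - pred[root]          # translate predicted root to 0
    gt = gt - gt[root]
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Protocol #2: MPJPE after Procrustes alignment (rotation,
    translation and scale) of the prediction onto the ground truth."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(P.T @ G)
    R = (U @ Vt).T
    if np.linalg.det(R) < 0:          # avoid reflections
        Vt[-1] *= -1
        S[-1] *= -1
        R = (U @ Vt).T
    scale = S.sum() / (P ** 2).sum()
    aligned = scale * P @ R.T + mu_g
    return np.linalg.norm(aligned - gt, axis=-1).mean()
```

By construction, P-MPJPE is invariant to any similarity transform of the prediction, which is why it is always reported alongside, not instead of, protocol #1.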
For MPI-INF-3DHP, additional thresholded metrics derived from MPJPE are often reported, such as the AUC (area under the curve) and the PCK (percentage of correct keypoints) with a threshold at 150 mm, as explained in [32].

Pose consistency metrics. MPJPE being insufficient to assess pose consistency (Section 4), we further assess to which extent predicted skeletons are rigid by measuring the average standard deviation of segment lengths across time in predicted action sequences:

$$\mathrm{MPSCE} \triangleq \frac{1}{J-1} \sum_{j=1}^{J-1} \sqrt{\frac{1}{T} \sum_{t=1}^{T} \big(s_{t,j,\tau(j)} - \bar{s}_{j,\tau(j)}\big)^2} , \quad (7)$$

with $s_{t,j,i} = \|\hat{p}_{t,j} - \hat{p}_{t,i}\|_2$ and $\bar{s}_{j,i} = \frac{1}{T} \sum_{t=1}^{T} s_{t,j,i}$, where $\tau$ was defined in Section 3.1. We call this metric, reported in mm, the mean per-segment consistency error (MPSCE). Following [12, 40], we also assess the bilateral symmetry of predicted skeletons through the mean per-segment symmetry error (MPSSE), in mm:

$$\mathrm{MPSSE} \triangleq \frac{1}{T \, |J_{\mathrm{left}}|} \sum_{t=1}^{T} \sum_{j \in J_{\mathrm{left}}} \big| s_{t,j,\tau(j)} - s_{t,j',\tau(j')} \big| , \quad \text{with } j' = \zeta(j) , \quad (8)$$

where $J_{\mathrm{left}}$ denotes the set of indices of left-side joints and $\zeta$ maps left-side joint indices to their right-side counterparts.

Multi-hypothesis evaluation protocol. One must decide how to use multiple hypotheses to compute the metrics. The dominant approach [24, 25, 36, 44, 49, 12] is the oracle evaluation, i.e., using the predicted hypothesis closest to the ground truth (i.e., Eq. (2) for MPJPE). That makes sense for multi-hypothesis methods, as the oracle metric measures the distance between the target and the discrete set of predicted hypotheses. It aligns with the idea of many possible outputs for a given input. Hypotheses can also be aggregated into a final pose, e.g., through unweighted or weighted averaging (using predicted scores). The latter has the disadvantage of falling back to a one-to-one mapping scheme, which is precisely what we want to avoid in a multi-hypothesis setting. We report both oracle and aggregated metrics in our experiments, favoring oracle results.

Implementation details. ManiPose, as presented in Section 3, is compatible with any backbone.
Here, we chose to build on the MixSTE [52] network for both the rotations and the segments modules (the latter at a reduced scale). Details about our architecture and training appear in Appendix D.

5.2 Comparison with the state of the art

Table 2: Pose consistency evaluation of state-of-the-art methods on Human3.6M. MPJPE performance and pose consistency are not correlated; only ManiPose excels in both. T: sequence length. K: number of hypotheses. Orac.: metric computed using the oracle hypothesis. Grey lines (in the original layout): methods where the oracle MPJPE is computed with a non-comparable number of hypotheses with respect to the other baselines. Bold: best; underlined: second best. *: method with unavailable code; MPSSE values reported in [12]. †: results with a comparable number of hypotheses. ‡: results computed with the official checkpoint and code.

| Method | T | K | Orac. MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|
| Single-hypothesis methods: | | | | | |
| ST-GCN [2] | 7 | 1 | 48.8 | 8.9 | 10.8 |
| VideoPose3D [38] | 243 | 1 | 46.8 | 6.5 | 7.8 |
| PoseFormer [53] | 81 | 1 | 44.3 | 4.3 | 7.2 |
| Anatomy3D [4] | 243 | 1 | 44.1 | 1.4 | 2.0 |
| MixSTE [52] | 243 | 1 | 40.9 | 8.8 | 9.9 |
| Multi-hypothesis methods: | | | | | |
| Wehrbein et al. [49] | 1 | 200 | 44.3 | 12.2 | 14.8 |
| DiffPose (Holmquist et al.) [12]* | 1 | 200 | 43.3 | 14.9 | - |
| GFPose [6] | 1 | 200 | 35.6 | 13.1 | 16.5 |
| D3DP (P-Best) [43] | 243 | 20 | 39.5 | 6.9 | 9.0 |
| GFPose [6]† | 1 | 10 | 45.1 | 13.1 | 16.5 |
| Sharma et al. [44] | 1 | 10 | 46.8 | 13.0 | 9.9 |
| DiffPose (Gong et al.) [10] | 243 | 5 | 39.3 | 5.2 | 6.1 |
| MHFormer [27] | 351 | 3 | 43.0 | 5.7 | 8.0 |
| ManiPose (Ours) | 243 | 5 | 42.1 | 0.4 | 0.8 |
| ManiPose (Ours) | 243 | 5 | 39.1 | 0.3 | 0.5 |

Human3.6M. Comparisons with state-of-the-art single- and multi-hypothesis methods are presented in Table 2 and illustrated in Fig. 1. ManiPose outperforms previous methods in terms of oracle MPJPE in comparable scenarios, while reaching nearly perfect consistency. Moreover, note that MPJPE and consistency metrics are not positively correlated for single-hypothesis methods.
As predicted in Section 4.1, our empirical results show that the MPJPE improvements achieved by MixSTE come at the cost of poorer consistency compared to previous models. In contrast, the only single-hypothesis constrained model, Anatomy3D [4], achieves good consistency at the expense of inferior MPJPE. Those results empirically validate the theoretical predictions of Sections 4.1 and B, further confirming what we showed, intuitively, in the simplified 1D-to-2D setting (Section 4.2).

Note that while ManiPose is deterministic, previous multi-hypothesis methods are generative, except for MHFormer. Table 2 shows that they require up to two orders of magnitude more hypotheses than ManiPose to reach competitive performance (see, e.g., the performance of GFPose). This property is expected. Indeed, optimization based on winner-takes-all theoretically leads to an optimal coverage of the modes of the conditional distribution with a fixed number of samples [23], in contrast to generative approaches. This is reflected in the oracle metric, which approximates the so-called quantization (or distortion) error, as defined in (27), when the number of data points is large. More detailed per-action MPJPE results appear in Tables 8 and 9 of the supplemental material. We also complement our analysis of ManiPose's diversity in Fig. 11 of the appendix. Fig. 6 showcases qualitative results, where multiple hypotheses help in depth-ambiguous situations.

Table 3: Comparison with the state of the art on MPI-INF-3DHP using ground-truth 2D poses. T: sequence length.

| Method | T | PCK | AUC | MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|---|
| VideoPose3D [38] | 81 | 85.5 | 51.5 | 84.8 | 10.4 | 27.5 |
| PoseFormer [53] | 9 | 86.6 | 56.4 | 77.1 | 10.8 | 14.2 |
| MixSTE [52] | 27 | 94.4 | 66.5 | 54.9 | 17.3 | 21.6 |
| P-STMO [42] | 81 | 97.9 | 75.8 | 32.2 | 8.5 | 11.3 |
| ManiPose (Ours) Aggr. | 27 | 98.0 | 75.3 | 37.7 | 0.6 | 1.3 |
| ManiPose (Ours) Orac. | 27 | 98.4 | 77.0 | 34.6 | 0.6 | 1.3 |

MPI-INF-3DHP. Similar results hold for this dataset (cf. Table 3).
Not only does ManiPose reach consistency errors close to 0, but it also achieves the best PCK and AUC performance. As for MPJPE, only [42] achieves slightly better performance, at the cost of large pose consistency errors.

5.3 Ablation study

Table 4: Ablation study: a single hypothesis cannot optimize both MPJPE and consistency. ManiPose uses the same backbone as MixSTE. MR: with manifold regularization. MC: manifold-constrained. Bold: best. Underlined: second best.

| Method | MR | MC | K | # Params | MPJPE | MPSSE | MPSCE |
|---|---|---|---|---|---|---|---|
| ManiPose (Ours) | | ✓ | 5 | 34.44 M | 39.1 | 0.3 | 0.5 |
| w/o MH | | ✓ | 1 | 34.42 M | 44.6 | 0.3 | 0.5 |
| w/o MC, w/ MR | ✓ | | 1 | 33.78 M | 42.3 | 5.7 | 7.3 |
| w/o MR (MixSTE) | | | 1 | 33.78 M | 40.9 | 8.8 | 9.9 |

Figure 5: MPSCE, MPSSE and MPJPE per segment/coordinate (lower is better). ManiPose mostly helps to deal with the depth ambiguity (z coordinate). Ground-truth poses are represented but not visible because they have perfect consistency.

Impact of components. We evaluate the impact of removing each component of ManiPose on the Human3.6M performance (Table 4). The components tested are the multiple hypotheses (MH) and the manifold constraint (MC). We also compare MC to a more standard manifold regularization (MR), i.e., adding Eq. (7) to the loss. Note that without all these components, we fall back to MixSTE [52], and that the performances reported in Table 4 also appear in Fig. 1. We see that MR helps to improve pose consistency, but not as much as MC. However, without multiple hypotheses, the consistency improvements of MC come at the cost of degraded MPJPE performance, as foreseen by our formal analysis (Section 4). Only the combination of MC and MH allows us to optimize both consistency and MPJPE.

Fine error analysis. We can see in Fig. 5 that, compared to MixSTE, ManiPose reaches substantially better MPSSE and MPSCE, with consistency improvements across all skeleton segments. Furthermore, note that the largest MixSTE errors occur for the KNEE-FOOT and ELBOW-WRIST segments, which are the most prone to depth ambiguity.
That agrees with the coordinate-wise errors depicted in Fig. 5, showing that ManiPose's improvements mostly translate into a reduction of MixSTE's depth errors, which are twice as large as for the other coordinates. Further ablations, including the effect of the number of hypotheses $K$, the score loss weight $\beta$ and the choice of rotations representation, appear in the supplemental material.

Figure 6: Qualitative comparison between ManiPose and the state-of-the-art regression method MixSTE. Two pairs of hypotheses predicted by ManiPose are illustrated in green-pink (left) and green-purple (right), where opacity represents the predicted scores. Multiple hypotheses and constraints help to deal with depth ambiguities and avoid predicting shorter limbs (red circles).

6 Conclusion

We presented a new manifold-constrained multi-hypothesis human pose lifting method (ManiPose) and demonstrated its empirical superiority over the existing state of the art on two challenging datasets. Further, we provided theoretical evidence supporting the tenets of our method, by showing the inherent limitations of unconstrained single-hypothesis approaches to 3D-HPE. We established that unconstrained single-hypothesis methods cannot deliver consistent poses and that constraining or regularizing single-hypothesis models leads to worse position errors. We also showed that traditional MPJPE-like metrics are insufficient to assess consistency.

Limitations. To guarantee consistency, ManiPose relies on the forward kinematics algorithm, which is inherently sequential across joints. Removing that dependence is an interesting avenue for accelerating the method. On another note, while ManiPose ensures the rigidity of the predicted poses, imposing constraints reflecting human body articulation limits presents another area for enhancement.
Acknowledgments and Disclosure of Funding

This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011014073 made by GENCI. It was also partly funded by the French Association for Technological Research (ANRT CIFRE contract 2022-1854). We are grateful to the reviewers for their insightful comments.

References

[1] Bishop, C.M.: Mixture density networks. Working paper, Aston University (1994)
[2] Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2272–2281 (2019)
[3] Chen, C.H., Ramanan, D.: 3D human pose estimation = 2D pose estimation + matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7035–7043 (2017)
[4] Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., Luo, J.: Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Transactions on Circuits and Systems for Video Technology 32(1), 198–209 (2021)
[5] Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7103–7112 (2018)
[6] Ci, H., Wu, M., Zhu, W., Ma, X., Dong, H., Zhong, F., Wang, Y.: GFPose: Learning 3D human pose prior with gradient fields. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4800–4810 (2023)
[7] Du, Q., Faber, V., Gunzburger, M.: Centroidal Voronoi tessellations: Applications and algorithms. SIAM Review 41(4), 637–676 (1999)
[8] Firman, M., Campbell, N.D., Agapito, L., Brostow, G.J.: DiverseNet: When one right answer is not enough. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp.
5598–5607 (2018)
[9] Goel, S., Pavlakos, G., Rajasegaran, J., Kanazawa, A., Malik, J.: Humans in 4D: Reconstructing and tracking humans with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14783–14794 (2023)
[10] Gong, J., Foo, L.G., Fan, Z., Ke, Q., Rahmani, H., Liu, J.: DiffPose: Toward more reliable 3D pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13041–13051 (2023)
[11] Guzman-Rivera, A., Batra, D., Kohli, P.: Multiple choice learning: Learning to produce multiple structured outputs. Advances in Neural Information Processing Systems 25 (2012)
[12] Holmquist, K., Wandt, B.: DiffPose: Multi-hypothesis human pose estimation using diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 15977–15987 (2023)
[13] Hossain, M.R.I., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 68–84 (2018)
[14] Hu, W., Zhang, C., Zhan, F., Zhang, L., Wong, T.T.: Conditional directed graph convolution for 3D human pose estimation. In: Proceedings of the 29th ACM International Conference on Multimedia. pp. 602–611 (2021)
[15] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36(7), 1325–1339 (Jul 2014)
[16] Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7122–7131 (2018)
[17] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980 (2014)
[18] Kolotouros, N., Pavlakos, G., Jayaraman, D., Daniilidis, K.: Probabilistic modeling for human mesh recovery. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11605–11614 (2021)
[19] Lee, K., Hwang, C., Park, K., Shin, J.: Confident multiple choice learning. In: International Conference on Machine Learning. pp. 2014–2023. PMLR (2017)
[20] Lee, S., Purushwalkam, S., Cogswell, M., Crandall, D., Batra, D.: Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv preprint arXiv:1511.06314 (2015)
[21] Lee, S., Purushwalkam Shiva Prakash, S., Cogswell, M., Ranjan, V., Crandall, D., Batra, D.: Stochastic multiple choice learning for training diverse deep ensembles. Advances in Neural Information Processing Systems 29 (2016)
[22] Letzelter, V., Fontaine, M., Chen, M., Pérez, P., Essid, S., Richard, G.: Resilient multiple choice learning: A learned scoring scheme with application to audio scene analysis. Advances in Neural Information Processing Systems 36 (2024)
[23] Letzelter, V., Perera, D., Rommel, C., Fontaine, M., Essid, S., Richard, G., Pérez, P.: Winner-takes-all learners are geometry-aware conditional density estimators. In: Proceedings of the 41st International Conference on Machine Learning (2024)
[24] Li, C., Lee, G.H.: Generating multiple hypotheses for 3D human pose estimation with mixture density network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9887–9895 (2019)
[25] Li, C., Lee, G.H.: Weakly supervised generative network for multiple 3D human pose hypotheses. In: British Machine Vision Conference (BMVC) (2020)
[26] Li, J., Xu, C., Chen, Z., Bian, S., Yang, L., Lu, C.: HybrIK: A hybrid analytical-neural inverse kinematics solution for 3D human pose and shape estimation.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3383–3393 (2021)
[27] Li, W., Liu, H., Tang, H., Wang, P., Van Gool, L.: MHFormer: Multi-hypothesis transformer for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13147–13156 (2022)
[28] Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.c., Asari, V.: Attention mechanism exploits temporal contexts: Real-time 3D human pose reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5064–5073 (2020)
[29] Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34(6), 248:1–248:16 (Oct 2015)
[30] Makansi, O., Ilg, E., Cicek, O., Brox, T.: Overcoming limitations of mixture density networks: A sampling and fitting framework for multimodal future prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7144–7153 (2019)
[31] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2640–2649 (2017)
[32] Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 International Conference on 3D Vision (3DV). pp. 506–516. IEEE (2017)
[33] Mun, J., Lee, K., Shin, J., Han, B.: Learning to specialize with knowledge distillation for visual question answering. Advances in Neural Information Processing Systems 31 (2018)
[34] Murray, R.M., Li, Z., Sastry, S.S.: A Mathematical Introduction to Robotic Manipulation. CRC Press (2017)
[35] Naeem, M.F., Oh, S.J., Uh, Y., Choi, Y., Yoo, J.: Reliable fidelity and diversity metrics for generative models.
In: International Conference on Machine Learning. pp. 7176–7185. PMLR (2020)
[36] Oikarinen, T., Hannah, D., Kazerounian, S.: GraphMDN: Leveraging graph structure and deep learning to solve inverse problems. In: 2021 International Joint Conference on Neural Networks (IJCNN). pp. 1–9. IEEE (2021)
[37] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al.: PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32 (2019)
[38] Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3D human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7753–7762 (2019)
[39] Rempe, D., Birdal, T., Hertzmann, A., Yang, J., Sridhar, S., Guibas, L.J.: HuMoR: 3D human motion model for robust pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11488–11499 (2021)
[40] Rommel, C., Valle, E., Chen, M., Khalfaoui, S., Marlet, R., Cord, M., Pérez, P.: DiffHPE: Robust, coherent 3D human pose lifting with diffusion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3220–3229 (2023)
[41] Rupprecht, C., Laina, I., Di Pietro, R., Baust, M., Tombari, F., Navab, N., Hager, G.D.: Learning in an uncertain world: Representing ambiguity through multiple hypotheses. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3591–3600 (2017)
[42] Shan, W., Liu, Z., Zhang, X., Wang, S., Ma, S., Gao, W.: P-STMO: Pre-trained spatial temporal many-to-one model for 3D human pose estimation (Jul 2022)
[43] Shan, W., Liu, Z., Zhang, X., Wang, Z., Han, K., Wang, S., Ma, S., Gao, W.: Diffusion-based 3D human pose estimation with multi-hypothesis aggregation.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14761–14771 (2023)
[44] Sharma, S., Varigonda, P.T., Bindal, P., Sharma, A., Jain, A.: Monocular 3D human pose estimation by generation and ordinal ranking. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2325–2334 (2019)
[45] Tian, K., Xu, Y., Zhou, S., Guan, J.: Versatile multiple choice learning and its application to vision computing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6349–6357 (2019)
[46] Tiwari, G., Antić, D., Lenssen, J.E., Sarafianos, N., Tung, T., Pons-Moll, G.: Pose-NDF: Modeling human pose manifolds with neural distance fields. In: European Conference on Computer Vision. pp. 572–589. Springer (2022)
[47] Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3D pose estimation from videos. In: European Conference on Computer Vision. pp. 764–780. Springer (2020)
[48] Waskom, M.L.: seaborn: statistical data visualization. Journal of Open Source Software 6(60), 3021 (2021). https://doi.org/10.21105/joss.03021
[49] Wehrbein, T., Rudolph, M., Rosenhahn, B., Wandt, B.: Probabilistic monocular 3D human pose estimation with normalizing flows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11199–11208 (2021)
[50] Xu, J., Yu, Z., Ni, B., Yang, J., Yang, X., Zhang, W.: Deep kinematics analysis for monocular 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 899–908 (2020)
[51] Xu, T., Takano, W.: Graph stacked hourglass networks for 3D human pose estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 16105–16114 (2021)
[52] Zhang, J., Tu, Z., Yang, J., Chen, Y., Yuan, J.: MixSTE: Seq2seq mixed spatio-temporal encoder for 3D human pose estimation in video.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13232–13242 (2022)
[53] Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11636–11645. IEEE, Montreal, QC, Canada (Oct 2021). https://doi.org/10.1109/ICCV48922.2021.01145
[54] Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5745–5753 (2019)
[55] Zou, Z., Tang, W.: Modulated graph convolutional network for 3D human pose estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11477–11487 (2021)

Appendix / supplemental material

This supplemental material is organized as follows: Appendix A contains empirical verification of our assumptions; Appendix B presents the proofs of our theoretical results, together with a few corollaries; Appendix C provides further implementation details concerning the 1D-to-2D experiment, as well as an extension to the 2D-to-3D setting; Appendix D contains implementation and training details concerning ManiPose, as well as the compared baselines; Appendix E presents further results of the Human3.6M experiment; and, finally, Appendix F explains the provided experiment code.

A Assumption verifications

Let us first define a few elements that we will need for our derivations.

Definition A.1 (Human skeleton). We define a human skeleton as an undirected connected graph $G = (V, E)$ with $J = |V|$ nodes, called joints, associated with different human body articulation points. We assume a predefined order of joints and denote $A = [A_{ij}]_{0 \le i,j < J}$ the corresponding adjacency matrix.
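The graph of Definition A.1 can be encoded compactly with a parent array implementing the parent function τ used throughout the appendix. The 17-joint topology below is the commonly used Human3.6M joint ordering, assumed here for illustration and not taken from the paper.

```python
import numpy as np

# A commonly used 17-joint Human3.6M-style topology (an assumption of this sketch):
# PARENT[j] = tau(j), the parent of joint j, with -1 marking the root joint 0.
PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def adjacency(parent):
    """Build the symmetric adjacency matrix A = [A_ij] of Definition A.1
    from a parent array describing the skeleton tree."""
    J = len(parent)
    A = np.zeros((J, J), dtype=int)
    for j, p in enumerate(parent):
        if p >= 0:
            A[j, p] = A[p, j] = 1
    return A

A = adjacency(PARENT)
```

The matrix is symmetric (the graph is undirected), has J − 1 edges (a tree), and is connected, matching the requirements of Definition A.1.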
Finally, we assume that the conditional distribution of poses does not collapse to a single point, i.e., that we have a one-to-many problem:

Assumption A.5 (Non-degenerate conditional distribution). Given a joint distribution $P(x^G, p^G)$ of 3D poses $p^G \in (\mathbb{R}^3)^J$ and corresponding 2D inputs $x^G \in (\mathbb{R}^2)^J$, we assume that the conditional distribution $P(p^G \mid x^G)$ is non-degenerate, i.e., it is not a single Dirac distribution. Note that this can be true even when $P(x^G, p^G)$ is unimodal (e.g., Fig. 4).

We verified on Human3.6M [15] ground-truth data that assumptions A.4 and A.5 hold for actual poses in both training and test splits.

Segments rigidity. As shown in Figs. 5 and 9, ground-truth 3D poses have perfect MPSSE (8) and MPSCE (7) metrics, meaning that ground-truth skeletons are perfectly symmetric, with rigid segments. Assumption A.4 is thus verified in actual training and test data.

Non-degenerate distributions. As shown in Fig. 7, the conditional distribution of ground-truth 3D poses given 2D keypoint positions is clearly multimodal and, thus, non-degenerate (not reduced to a single Dirac distribution). That validates assumption A.5 and explains why multi-hypothesis techniques are necessary.

Figure 7 [panels: (a) S9, Walking; (b) S1, Greeting; (c) S11, Directions; (d) S1, Sitting Down]: Estimated joint distributions of ground-truth 2D inputs (u, v pixel coordinates) together with 3D z-coordinates (depth) for different subjects and actions. The depth density conditional on inputs is clearly multimodal. Vertical red lines are examples of depth-ambiguous inputs. Distributions are estimated with a kernel density estimator from the Seaborn plotting library [48].

B Proofs and additional corollaries

B.1 Properties of manifold constraint and multi-hypotheses models

This section contains the proofs of the theoretical results presented in Section 4.1, together with a few corollaries.

PROOF.
[Proposition 4.1] Let $i$ be a joint connected to the root $p_0$ (i.e., $A_{i0} = 1$). From assumptions A.3 and A.4, we know that at any instant $t$, $p^G_{t,i}$ lies on the sphere $S^2(0, s_{i,0})$ centered at $0$ with radius $s_{i,0}$ independent of time. Therefore, its position can be fully parameterized in spherical coordinates by two angles $(\theta_{t,i}, \phi_{t,i})$. Let $j$ be a joint connected to $i$. As before, assumption A.4 implies that at any instant $t$, $p^G_{t,j}$ lies on the moving sphere $S^2(p^G_{t,i}, s_{j,i})$ centered at $p^G_{t,i}$ with radius $s_{j,i}$ independent of time. Thus, we can fully describe $p^G_{t,j}$ with the position of its center $p^G_{t,i}$ and the spherical coordinates $(\theta_{t,j}, \phi_{t,j})$ of joint $j$ relative to the center of the sphere, i.e., joint $i$. That means that there is a bijection between the positions attainable by $p^G_{t,j}$ at any instant and the direct product of spheres $S^2(0, s_{i,0}) \times S^2(0, s_{j,i})$. That bijection is a homeomorphism, since it is a composition of homeomorphisms: we can compute $p^G_{t,j}$ from $(\theta_{t,i}, \phi_{t,i}, \theta_{t,j}, \phi_{t,j})$ following the forward kinematics algorithm [34] (cf. Algo. 2), i.e., using a composition of rotations and translations.

Now let us assume for some arbitrary joint $k$ that $p^G_{t,k}$ lies at all times on a space $M_{2d}$ homeomorphic to a product of spheres of dimension $2d$. That means that $p^G_{t,k}$ can be fully parametrized using $2d$ spherical angles $(\theta_1, \phi_1, \dots, \theta_d, \phi_d)$. Let $l$ be a joint connected to $k$ (typically one step further away from the root joint $p_0$ and not already represented in $M_{2d}$). As before, at any instant $t$, $p^G_{t,l}$ needs to lie on the sphere centered at $p^G_{t,k}$ of constant radius $s_{k,l}$. Thus, we can fully describe $p^G_{t,l}$ using the $2(d+1)$-tuple of angles obtained by concatenating its spherical coordinates relative to joint $k$ with the $2d$-tuple describing $p^G_{t,k}$, i.e., the center of the sphere. So $p^G_{t,l}$ lies on a space $M_{2(d+1)}$ homeomorphic to a product of spheres of dimension $2(d+1)$.
We can conclude by induction that at any instant $t$, $p_t = [p^G_{t,1}, \dots, p^G_{t,J}]$ lies on the same subspace of $(\mathbb{R}^3)^J$, which is homeomorphic to a product of spheres centered at the origin: $\prod_{i > 0} S^2(0, s_{i,\tau(i)})$.

PROOF. [Proposition 4.2] [...] and $s_{j,\tau(j)} > 0$, we can say that $\ell_j(f^\star(x)) < s_{j,\tau(j)}$ for all joints $j$. We conclude that the model $f^\star$ minimizing the MSE predicts poses that violate assumption A.4 and are inconsistent.

As an immediate corollary of proposition 4.2, we may state the following result, which was empirically illustrated in many parts of our paper:

Corollary B.1. Given a fixed training distribution $P(x, p)$ respecting assumptions A.3–A.5, for every 3D-HPE model $f$ predicting consistent poses, i.e., respecting assumption A.4, there is an inconsistent model with strictly lower mean-squared error.

PROOF. Let $f^\star \in \arg\min_{\tilde f} \mathrm{MSE}(\tilde f)$. According to proposition 4.2, $f^\star$ is inconsistent. Suppose that the consistent model $f$ is such that

$$\mathrm{MSE}(f) \le \mathrm{MSE}(f^\star). \tag{19}$$

Since MSE reaches its minimum at $f^\star$, we have $\mathrm{MSE}(f) = \mathrm{MSE}(f^\star)$. Thus $f \in \arg\min_{\tilde f} \mathrm{MSE}(\tilde f)$, which means that $f$ is also inconsistent according to proposition 4.2. That is impossible given that we assumed $f$ to be consistent. We conclude that Eq. (19) is wrong and that

$$\mathrm{MSE}(f) > \mathrm{MSE}(f^\star). \tag{20}$$

Note that proposition 4.2 and corollary B.1 assume the use of the MSE loss, which is the most widely used loss in 3D-HPE. We can however extend them to the case where the MPJPE serves as optimization criterion, under an additional technical assumption:

Corollary B.2. The predicted poses minimizing the mean-per-joint-position-error loss are inconsistent if the training pose distribution $P(x, p)$ verifies Asm. A.3–A.5 and if the standard deviation of the joint-wise residual norms is small compared to the joint-wise loss:

$$\frac{\mathbb{V}_{x,p}\big[\|p_j - f_j(x)\|_2\big]}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \approx 0\,. \tag{21}$$

PROOF. From proposition 4.2 we know that the poses predicted by the minimizer $f^\star$ of

$$\mathrm{MSE}(f) = \mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big] \tag{22}$$

are inconsistent. Let $f_j$ be the component of $f$ corresponding to the $j$th joint.
We define the $j$th mean-per-joint-position-error component as:

$$\mathrm{MPJPE}_j(f) \triangleq \mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]. \tag{23}$$

Under the small-variance assumption, we have:

$$\frac{\mathbb{V}_{x,p}\big[\|p_j - f_j(x)\|_2\big]}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \tag{24}$$

$$= \frac{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2^2\big] - \mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2}{\mathbb{E}_{x,p}\big[\|p_j - f_j(x)\|_2\big]^2} \tag{25}$$

$$= \frac{\mathrm{MSE}_j(f) - \mathrm{MPJPE}_j(f)^2}{\mathrm{MPJPE}_j(f)^2} \approx 0\,, \tag{26}$$

so both criteria, MSE and MPJPE, are asymptotically equivalent and have the same minimizer $f^\star$, which is inconsistent according to proposition 4.2.

Corollary B.3. Under Asm. A.4–A.5 and under (21), the only way to get both optimal MPJPE and consistency is to use multiple hypotheses.

PROOF. Corollary B.1 and Proposition 4.2 imply that single-hypothesis models (constrained or not) deliver either suboptimal MPJPE or inconsistent pose predictions. Hence, by negation, we get our result.

In the next section, we further show that multi-hypothesis models, constrained or not, can theoretically achieve a better L2-risk (or quantization) performance than single-hypothesis models.

B.2 Multiple hypotheses (constrained or not) can improve L2-risk over single-hypothesis models

Let $\mathcal{X} = (\mathbb{R}^2)^J$ denote the space of input 2D poses and $\mathcal{P} = (\mathbb{R}^3)^J$ the space of 3D poses. Also, let $\mathcal{R}(f) = \mathbb{E}_{x,p}\big[\|p - f(x)\|_2^2\big]$ be the L2-risk of some pose estimator $f$ under some underlying continuous joint distribution of 2D-3D pose pairs $P(x, p)$, with density $\rho$ (when it exists). Before stating the proposition, we need to define an adapted notion of risk for multi-hypothesis models under the oracle aggregation scheme:

Definition B.4 (Winner-takes-all risk, [41]). As in [41] (Section 3.2) and in [23] (Section 2.2), we define the L2-risk for $K$-head models $f_{\mathrm{WTA}} = (f^1_{\mathrm{WTA}}, \dots, f^K_{\mathrm{WTA}})$ as:

$$\mathcal{R}^K_{\mathrm{WTA}}(f_{\mathrm{WTA}}) \triangleq \int_{\mathcal{X}} \sum_{k=1}^K \int_{V_k(f_{\mathrm{WTA}}(x))} \|f^k_{\mathrm{WTA}}(x) - p\|_2^2\, \rho(x, p)\, \mathrm{d}p\, \mathrm{d}x\,, \tag{27}$$

where $V_k(g)$ denotes the $k$th cell of the Voronoi tessellation of the output space $\mathcal{P}$ defined by the generators $g = (g^1, \dots, g^K) \in \mathcal{P}^K$:

$$V_k(g) \triangleq \big\{p \in \mathcal{P} \;\big|\; \|g^k - p\|_2^2 < \|g^r - p\|_2^2,\ \forall r \neq k\big\}. \tag{28}$$
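The winner-takes-all risk of Definition B.4 can be estimated empirically: each ground-truth sample is charged only to its closest hypothesis (the generator of the Voronoi cell it falls into). The minimal Monte-Carlo sketch below (function name and toy data are assumptions) also illustrates numerically why multiple hypotheses can beat the MSE-optimal single prediction on a bimodal target distribution.

```python
import numpy as np

def empirical_wta_risk(hypotheses, targets):
    """Monte-Carlo estimate of the WTA risk in Eq. (27) for fixed hypotheses:
    each target is charged the squared distance to its *closest* hypothesis.
    hypotheses: (K, D) generators; targets: (N, D) samples of p."""
    d2 = ((targets[:, None, :] - hypotheses[None, :, :]) ** 2).sum(-1)  # (N, K)
    return float(d2.min(axis=1).mean())

# Bimodal 1D targets: half the mass at -1, half at +1.
targets = np.concatenate([np.full((500, 1), -1.0), np.full((500, 1), 1.0)])
single = np.array([[0.0]])           # the MSE-optimal single hypothesis (the mean)
double = np.array([[-1.0], [1.0]])   # two WTA hypotheses, one per mode
```

Here the single mean-predictor incurs a risk of 1.0, while the two-hypothesis model reaches 0, consistent with the inequality the section goes on to prove.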
The risk above translates the notion of oracle pose, since it partitions the space of ground-truth poses $\mathcal{P}$ into regions where some hypothesis is the closest, and uses only that hypothesis to compute the risk in that region. Note that $\mathcal{R}^1_{\mathrm{WTA}}(f) = \mathcal{R}(f)$ for any function $f$, since a single-cell tessellation of $\mathcal{P}$ is $\mathcal{P}$ itself. In the following, we assume that $f$ is expressive enough, so that minimizing the risk (27) comes down to minimizing

$$\sum_{k=1}^K \int_{V_k(f_{\mathrm{WTA}}(x))} \|f^k_{\mathrm{WTA}}(x) - p\|_2^2\, \rho(x, p)\, \mathrm{d}p$$

for each $x \in \mathcal{X}$.

Proposition B.5 (Optimality of manifold-constrained multi-hypothesis models). A $K$-hypotheses model $f^\star_{\mathrm{WTA}} = (f^{1,\star}_{\mathrm{WTA}}, \dots, f^{K,\star}_{\mathrm{WTA}})$ minimizing (27) always has a risk lower than or equal to that of a single-hypothesis model $f^\star_{\mathrm{MSE}}$ minimizing $\mathcal{R}$:

$$\mathcal{R}^K_{\mathrm{WTA}}(f^\star_{\mathrm{WTA}}) \le \mathcal{R}^1_{\mathrm{WTA}}(f^\star_{\mathrm{MSE}}) = \mathcal{R}(f^\star_{\mathrm{MSE}})\,. \tag{29}$$

PROOF. Following [23] (Section 2.2), we decouple the cell generators from the risk arguments in (27):

$$\mathcal{K}(g, z) \triangleq \sum_{k=1}^K \int_{V_k(g)} \|z^k - p\|_2^2\, \rho(p|x)\, \mathrm{d}p\,, \tag{30}$$

for any generators $g = (g^1, \dots, g^K) \in \mathcal{P}^K$ and arguments $z = (z^1, \dots, z^K) \in \mathcal{P}^K$. Note that $\mathcal{R}^K_{\mathrm{WTA}}(f) = \int_{\mathcal{X}} \mathcal{K}(f(x), f(x))\, \rho(x)\, \mathrm{d}x$. According to Proposition 3.1 of [7] (or Proposition 2.1 in [23]), if $f^\star_{\mathrm{WTA}}$ minimizes $\mathcal{R}^K_{\mathrm{WTA}}$, then $(f^\star_{\mathrm{WTA}}(x), f^\star_{\mathrm{WTA}}(x))$ has to minimize $\mathcal{K}$ for all $x \in \mathcal{X}$:

$$\mathcal{K}(f^\star_{\mathrm{WTA}}(x), f^\star_{\mathrm{WTA}}(x)) \le \mathcal{K}(g, z)\,, \quad \forall (g, z) \in \mathcal{P}^K \times \mathcal{P}^K. \tag{31}$$

Let us choose $g$ such that $g^k = f^{k,\star}_{\mathrm{WTA}}(x)$ and $z$ such that $z^k = f^\star_{\mathrm{MSE}}(x)$ for all $1 \le k \le K$. Then

$$\mathcal{R}^K_{\mathrm{WTA}}(f^\star_{\mathrm{WTA}}) \le \int_{\mathcal{X}} \sum_{k=1}^K \int_{V_k(f^\star_{\mathrm{WTA}}(x))} \|f^\star_{\mathrm{MSE}}(x) - p\|_2^2\, \rho(p|x)\, \rho(x)\, \mathrm{d}p\, \mathrm{d}x = \mathcal{R}(f^\star_{\mathrm{MSE}})\,, \tag{32}$$

where the last equality comes from the fact that $\{V_k(f^\star_{\mathrm{WTA}}(x))\}_{1 \le k \le K}$ defines a partition of $\mathcal{P}$.

C Further details of 1D-to-2D case study

C.1 Implementation details

Datasets. We created a dataset of input-output pairs $\{(x_i, (x_i, y_i))\}_{i=1}^N$, divided into 1 000 training examples, 1 000 validation examples and 1 000 test examples.
Since the 2D position of $J_1$ is fully determined by the angle $\theta$ between the segment $(J_0, J_1)$ and the x-axis, the dataset is generated by first sampling $\theta$ from a von Mises mixture distribution, then converting it into Cartesian coordinates $(x_i, y_i)$ to form the outputs, and finally projecting them onto the x-axis to obtain the inputs.

Distribution scenarios. We considered three distribution scenarios with different levels of difficulty:
1. Easy scenario: a unimodal distribution centered at $\theta = 2\pi/5$, where the axis of maximum 2D variance is approximately parallel to the x-axis (Fig. 4-A).
2. Difficult unimodal scenario: a unimodal distribution centered at $\theta = 0$, where the axis of maximum 2D variance is perpendicular to the x-axis (Fig. 4-B).
3. Difficult multimodal scenario: a bimodal distribution, with modes at $\theta_1 = \pi/3$ and $\theta_2 = -\pi/3$ and mixture weights $w_1 = 2/3$ and $w_2 = 1/3$, i.e., where the projections of the modes onto the x-axis are close to each other (Fig. 4-C).

All von Mises components in all scenarios had concentrations equal to 20.

Architectures and training. All three models were based on a multi-layer perceptron (MLP) with 2 hidden layers of 32 neurons each, using tanh activations. The constrained and unconstrained MLPs were trained using the mean-squared loss $\frac{1}{N}\sum_{i=1}^N \big((\hat{x}_i - x_i)^2 + (\hat{y}_i - y_i)^2\big)$. ManiPose was trained with the loss in Eq. (1) and had $K = 2$ heads. We trained all models with batches of 100 examples for a maximum of 50 epochs. We used the Adam optimizer [17], with default hyperparameters and no weight decay. Learning rates were searched for each model and distribution independently over a small grid, $\{10^{-5}, 10^{-4}, 10^{-3}, 10^{-2}\}$ (cf. selected values in Table 5). They were scheduled during training using a plateau strategy with factor 0.5, patience of 10 epochs and threshold of $10^{-4}$.

Table 5: Selected learning rates for the 1D-to-2D synthetic experiment.

Distribution        A          B          C
Unconstr. MLP       $10^{-3}$  $10^{-3}$  $10^{-2}$
Constrained MLP     $10^{-2}$  $10^{-4}$  $10^{-2}$
ManiPose            $10^{-2}$  $10^{-3}$  $10^{-2}$

C.2 Extension to 2D-to-3D setup with more joints

We further extend the two-joint 1D-to-2D lifting experiment of Section 4.2 to 2D-to-3D with three joints, aiming at providing a scenario that is closer to real-world 3D-HPE, but that can still be fully dissected and visualized. As in Section 4.2, we suppose that joint $J_0$ is at the origin at all times, that $J_1$ is connected to $J_0$ through a rigid segment of length $s_0$, and that $J_2$ is connected to $J_1$ through a second rigid segment of length $s_1 < s_0$. We further assume that both $J_1$ and $J_2$ are allowed to rotate around two axes orthogonal to each other. Thus, $J_1$ is constrained to lie on a circle $S^1(0, s_0)$, while $J_2$ lies on a torus $\mathcal{T}$ homeomorphic to $S^1(0, s_0) \times S^1(0, s_1)$. Without loss of generality, we set the radii $s_0 = 2$ and $s_1 = 1$ and assume them to be known.

Given that setup, we are interested in learning to predict the 3D pose $(J_1, J_2) = (x_1, y_1, z_1, x_2, y_2, z_2) \in \mathbb{R}^6$, given its 2D projection $(K_1, K_2) = (x_1, z_1, x_2, z_2) \in \mathbb{R}^4$. We create a dataset comprising 20 000 training, 2 000 validation, and 2 000 test examples, sampled using an arbitrary von Mises mixture over the poloidal and toroidal angles $(\theta, \phi)$ of $\mathcal{T}$. We set the modes of the mixture at $[(-\pi, 0), (0, \pi/4), (-1/2, \pi/4), (2\pi/3, \pi/2)]$, with concentrations $[2, 4, 3, 10]$ and weights $[0.3, 0.4, 0.2, 0.1]$. Similarly to Fig. 4-C, that creates a difficult multimodal distribution, depicted in Fig. 8.

Figure 8: Visualisation of the von Mises mixture distribution on the torus $\mathcal{T}$. The different colors (blue, green, red, purple) represent the modes of the sampled points. Only joint $J_2$ is represented here, for clarity.

We train and evaluate the same baselines as in Section 4.2 in that new scenario, using a similar setup (cf. Appendix C.1, Architectures and training). Note that for these experiments, we used an initial learning rate of $10^{-3}$ for each baseline, and a batch size of 1 000 examples.
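The toy 3-joint geometry of Appendix C.2 can be generated as follows. The axis conventions and the standard torus embedding are assumptions of this sketch (the paper only fixes s0 = 2 and s1 = 1), and a uniform angle distribution is used here for simplicity in place of the von Mises mixture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(n, s0=2.0, s1=1.0):
    """Sample toy 3-joint chains: J1 lies on the circle S1(0, s0) and J2 on a
    torus homeomorphic to S1(0, s0) x S1(0, s1). Axis conventions assumed."""
    theta = rng.uniform(-np.pi, np.pi, n)  # toroidal angle (rotation of J1)
    phi = rng.uniform(-np.pi, np.pi, n)    # poloidal angle (rotation of J2 about J1)
    j1 = np.stack([s0 * np.cos(theta), s0 * np.sin(theta), np.zeros(n)], axis=1)
    # J2 offsets J1 by s1 along the local radial and vertical directions:
    r = s0 + s1 * np.cos(phi)
    j2 = np.stack([r * np.cos(theta), r * np.sin(theta), s1 * np.sin(phi)], axis=1)
    return j1, j2

j1, j2 = sample_chain(1000)
```

By construction, every sample satisfies the rigidity constraints exactly: ‖J1‖ = s0 and ‖J2 − J1‖ = s1, which is what gives ground-truth data its perfect MPSCE in Table 6.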
The corresponding mean per-segment consistency error (MPSCE) and mean per-joint position error (MPJPE) results are reported in Table 6.

Table 6: Mean per-joint prediction error (MPJPE) and mean per-segment consistency error (MPSCE) in a 2D-to-3D scenario. Results are averaged over five random seeds. ManiPose reaches perfect MPSCE consistency without degrading MPJPE performance.

Method             MPJPE            MPSCE
Unconstr. MLP      1.152 ± 0.021    0.269 ± 0.018
Constrained MLP    1.166 ± 0.028    0.000 ± 0.000
ManiPose           1.149 ± 0.036    0.000 ± 0.000

We see that the same observations as in Section 4.2 also apply here: although the unconstrained MLP yields competitive MPJPE results, its predictions are not consistently aligned with the manifold, as indicated by its poor MPSCE performance. Again, we show that ManiPose offers an effective balance between maintaining manifold consistency and achieving high joint-position accuracy.

D Further ManiPose implementation details

D.1 Architectural details

Our architecture is backbone-agnostic, as shown in Fig. 2. Thus, in order to have a fair comparison, we decided to implement it using the most powerful architecture available, i.e., MixSTE [52]. In practice, the rotations module follows the MixSTE architecture, with $d_l = 8$ spatio-temporal transformer blocks of dimension $d_m = 512$ and a time receptive field of $T = 243$ frames for Human3.6M experiments and $T = 43$ frames for MPI-INF-3DHP experiments. Unlike MixSTE, that network outputs rotation embeddings of dimension 6 for each joint and frame, instead of Cartesian coordinates of dimension 3. The segment module was also implemented with a smaller MixSTE backbone, of depth $d_l = 2$ and dimension $d_m = 128$. The ablation study presented in Table 4 shows that the increase in the number of parameters from MixSTE to ManiPose is negligible.

D.2 Pose decoding details

The pose decoding block from Fig. 2 is described in Section 3.1 and is based on Algorithms 1 and 2.
The whole procedure is illustrated in Fig. 3.

Table 7: Joint-wise weights used in the winner-takes-all loss of Eq. (2) (as in [52]).

Joint    0  1  2    3    4  5    6    7  8  9  10   11   12  13  14   15  16
Weight   1  1  2.5  2.5  1  2.5  2.5  1  1  1  1.5  1.5  4   4   1.5  4   4

Algorithm 1: 6D rotation representation conversion [54]
Require: predicted 6D rotation representation $r \in \mathbb{R}^6$.
1: $x \leftarrow [r_0, r_1, r_2]^\top$
2: $y \leftarrow [r_3, r_4, r_5]^\top$
3: $x \leftarrow x / \|x\|_2$
4: $z \leftarrow x \times y$
5: $z \leftarrow z / \|z\|_2$
6: $y \leftarrow z \times x$
7: return $R = [x\,|\,y\,|\,z] \in \mathbb{R}^{3 \times 3}$

Algorithm 2: Forward kinematics [34, 26]
Require: scaled reference pose $u \in (\mathbb{R}^3)^J$, predicted rotation matrices $R_{t,j}$, $0 \le j < J$.
1: $R'_{t,0} \leftarrow R_{t,0}$
2: $p_{t,0} \leftarrow u_0$
3: for $j = 1, \dots, J-1$ do
4:   $R'_{t,j} \leftarrow R_{t,j} R'_{t,\tau(j)}$    ▷ compose relative rotations
5:   $p_{t,j} \leftarrow R'_{t,j}(u_j - u_{\tau(j)}) + p_{t,\tau(j)}$
6: end for
7: return $p_t = [p_{t,j}]_{0 \le j < J}$
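Algorithms 1 and 2 can be transcribed directly in NumPy. Array layout and the parent-array encoding of τ are implementation choices of this sketch; the per-joint composition order follows Algorithm 2 as stated.

```python
import numpy as np

def rot6d_to_matrix(r):
    """Algorithm 1: Gram-Schmidt conversion of a 6D rotation representation [54]
    into a rotation matrix with columns [x | y | z]."""
    x = np.array(r[:3], dtype=float)
    y = np.array(r[3:6], dtype=float)
    x /= np.linalg.norm(x)
    z = np.cross(x, y)
    z /= np.linalg.norm(z)
    y = np.cross(z, x)
    return np.stack([x, y, z], axis=1)

def forward_kinematics(u, R, parent):
    """Algorithm 2: place joints from a scaled reference pose u (J, 3) and
    per-joint rotations R (J, 3, 3); parent[j] = tau(j), with parent[0] = -1."""
    J = u.shape[0]
    R_acc = np.empty_like(R)
    p = np.empty_like(u)
    R_acc[0], p[0] = R[0], u[0]
    for j in range(1, J):
        R_acc[j] = R[j] @ R_acc[parent[j]]              # compose relative rotations
        p[j] = R_acc[j] @ (u[j] - u[parent[j]]) + p[parent[j]]
    return p
```

Since each step only rotates the fixed reference offset $u_j - u_{\tau(j)}$, every predicted segment keeps the reference length exactly; this is how decoding through forward kinematics guarantees pose consistency by construction.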