# On the Calibration of Human Pose Estimation

Kerui Gu*¹, Rongyu Chen*¹, Xuanlong Yu², Angela Yao¹

## Abstract

2D human pose estimation predicts keypoint locations and the corresponding confidence. Calibration-wise, the confidence should be aligned with the pose accuracy. Yet existing pose estimation methods tend to estimate confidence with heuristics such as the maximum value of heatmaps. This work shows, through theoretical analysis and empirical verification, a calibration gap in current pose estimation frameworks. Our derivations directly lead to closed-form adjustments of the confidence based on additionally inferred instance size and visibility. Given the black-box nature of deep neural networks, however, it is not possible to close the gap with closed-form adjustments alone. We go one step further and propose a Calibrated Confidence Net (CCNet) to explicitly learn network-specific adjustments with a confidence prediction branch. The proposed CCNet, as a lightweight post-hoc addition, improves the calibration of standard off-the-shelf pose estimation frameworks. The project page is at https://comp.nus.edu.sg/~keruigu/calibrate_pose/project.html.

*Equal contribution. ¹School of Computing, National University of Singapore. ²U2IS, ENSTA Paris, IP Paris. Correspondence to: Kerui Gu.

Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria. PMLR 235, 2024. Copyright 2024 by the author(s).

## 1. Introduction

2D human pose estimation (HPE) methods typically predict keypoint locations and corresponding confidences. The progress in developing such methods is primarily centred on improving keypoint location accuracy (Xu et al., 2022; Mao et al., 2022). The confidence, on the other hand, is estimated in an ad-hoc manner, based on heuristics such as taking the maximum value of the keypoint heatmap (Xiao et al., 2018; Sun et al., 2019) or the variance of the predictive distribution (Li et al., 2021a).

Figure 1. Adding our CCNet to detection-based (Det) and regression-based (Reg) pose estimation improves confidence estimation. The area under the Precision-Recall curve measures the quality of the confidence estimate. The orange striped and blue shaded areas denote the improvement. Random assignment of confidence (Rand Conf) has terrible calibration, while mAR serves as the upper bound of confidence estimation.

Having well-calibrated confidences that are aligned with the pose accuracy is important for applications that rely on pose estimation. In these applications, the pose can be used either as an input to a downstream task, such as 3D mesh recovery (Kolotouros et al., 2019; Li et al., 2022), or directly for reasoning and decision-making, such as robotics or autonomous driving (Abdar et al., 2021). Being able to rely on the confidence is not only useful, e.g. for discarding low-confidence outputs, but also safety-critical for human-machine interactions.

How well-calibrated are current pose estimation systems, and do their confidences align with the actual pose accuracy? One line of work rooted in uncertainty (Bramlage et al., 2023; Pierzchlewicz et al., 2022) introduces distribution modelling and retrains the pose-estimation network. The resulting network has more reliable confidence, but it comes at the expense of pose accuracy. Furthermore, confidence is evaluated from the perspective of distribution calibration.
Such an approach ignores the alignment of confidence to the accuracy and thus may not serve as a helpful indicator (Kuleshov & Deshpande, 2022).

To answer the above question, we first analyze the expectation of the predicted confidence versus the ideal confidence based on the pose accuracy metric. For example, in the popular benchmark MSCOCO (Lin et al., 2014), accuracy is measured by object keypoint similarity (OKS). Yet the confidence, being heuristic, is wrongly formulated and is therefore systematically miscalibrated. Our analysis bridges some of the calibration gap simply by changing the closed-form expression for confidence, e.g. by accounting for the instance's scale and keypoint visibilities.

However, only correcting the formulation of the confidence term is insufficient. In practice, network predictions vary depending on different backbones and datasets. In this case, we can make further network-specific adjustments to better calibrate the confidence. To that end, we propose a simple yet effective Calibrated Confidence Net (CCNet) to complement pose-estimation frameworks. CCNet, as a post-hoc add-on, is framework-agnostic and applicable to any existing pose estimation method. Using the penultimate features of the original pose model, it explicitly estimates a score and a visibility measure. The outputs are supervised with the ground truth visibility and OKS to directly link the predicted confidence to the accuracy and address the miscalibration of that network. With only a few epochs of training and minimal additional parameters, the pose estimation framework improves in calibration and mAP.

Summarizing our contributions:

- We are the first to provide a principled understanding of the calibration of 2D pose estimation. Pose calibration has been overlooked in the literature but is important for downstream applications and safety-critical decision making.
- We mathematically formulate the ideal form of pose confidence and reveal its mismatch to the practical confidence form of current pose estimation methods. A simple solution is provided to verify and correct the misalignment.
- We propose a simple but effective method to explicitly model the calibration with a minimal addition of parameters and training time. Experiments show that adding the calibration branch gives a significant improvement on the primary metric mAP and also benefits downstream tasks.

## 2. Related Work

Pose Estimation. Past works in 2D top-down pose estimation have mainly focused on improving accuracy. Few works give a heuristic or empirical understanding of the pose confidence. Papandreou et al. (2017) proposed an effective re-scoring strategy based on the detected bounding box, which is applied in several top-down methods (Xiao et al., 2018; Li et al., 2021a). PETR (Shi et al., 2022) empirically found that changing the matching objective to be OKS-based improves the average precision under the same average recall, indicating a better ranking over the samples. Poseur (Mao et al., 2022), which follows a regression paradigm, noted that previous regression scoring is heuristic; they rescore the confidence into a likelihood based on the detection scores. Although several works try to change the form of the confidence, they remain heuristic and purely empirical. Our paper gives a theoretical understanding of the confidence for both heatmap- and regression-based methods; it also analyzes and corrects the confidence to be better calibrated with the pose accuracy.
Confidence Estimation. Confidence estimates are essential in real-world applications (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Amini et al., 2020). Guo et al. (2017) reveal that the softmax output of modern neural networks, which is typically interpreted as a categorical distribution in classification, is poorly calibrated. The outputs do not faithfully reflect the actual accuracy and tend to be overconfident. Guo et al. (2017) study post-hoc confidence calibration, which can be plugged into any trained model. Similarly, recent work (Pathiraja et al., 2023) introduces train-time calibration to object detection. For regression tasks, there are no agreed-upon conventions. The quantile-based definition is common (Song et al., 2019), but the evaluation is nontrivial in high dimensions. Other methods directly improve and evaluate probability distribution models (Kendall & Gal, 2017; Amini et al., 2020). Finally, (Xiao et al., 2018; Yu et al., 2021; Mukhoti et al., 2023) estimate prediction errors or related metrics instead of probability values and thus adopt ranking-based evaluations such as Areas Under Curves (Ilg et al., 2018; Franchi et al., 2022). However, few works study calibration in pose estimation (Bramlage et al., 2023; Pierzchlewicz et al., 2022). We argue that pose confidence is useful and informative only when it aligns well with the actual accuracy; otherwise, it is not beneficial (Kuleshov & Deshpande, 2022). To this end, our work mainly studies the more efficient Auxiliary Confidence Regression (Corbière et al., 2019; Yu et al., 2021; Shen et al., 2023) and evaluates calibration with comprehensive metrics including AP (AUPRC).

## 3. Preliminaries

### 3.1. Human Pose Estimation

We consider top-down 2D human pose estimation, where people are already localized and cropped from the scene.

Figure 2. We introduce CCNet, a lightweight post-hoc addition to off-the-shelf pose estimation methods. CCNet directly estimates better-calibrated confidences from latent pose representations without modifying backbone parameters. The green arrows depict the accuracy (e.g., OKS and PCK) calculation flow used during training. Using COCO OKS as an example, CCNet's predicted keypoint confidence and visibility are supervised by the confidence calibration loss and visibility loss (indicated by red arrows), respectively. The final instance confidence is obtained through a weighted aggregation.

Given a single-person image $x$, pose estimation methods estimate $K$ keypoint coordinates $\hat{p} \in \mathbb{R}^{K \times 2}$ and confidence scores $\hat{s} \in [0, 1]^K$. The keypoint scores are aggregated into a person- or instance-wise confidence $\hat{c} \in [0, 1]$, where higher values indicate higher confidence.

Heatmap methods (Xiao et al., 2018; Sun et al., 2019; Xu et al., 2022) estimate $K$ heatmaps $\hat{H} \in \mathbb{R}^{H \times W}$ to represent pseudo-likelihoods, i.e. unnormalized probabilities of each pixel being the $k$-th keypoint (see the example in Fig. 2). The heatmap $\hat{H}_k$ can be decoded into the joint coordinate $\hat{p}_k$ and joint confidence $\hat{s}_k$ with a simple argmax:

$$\hat{p}_k = \arg\max(\hat{H}_k), \qquad \hat{s}_k = \max(\hat{H}_k), \tag{1}$$

although more complex forms of decoding have been proposed (Zhang et al., 2020; Gu et al., 2021b) in place of the argmax. Methods which estimate heatmaps are learned with an MSE loss with respect to a ground truth heatmap $H_k$:

$$\mathcal{L} = \sum_{k=1}^{K} \mathrm{MSE}(\hat{H}_k, H_k). \tag{2}$$
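As a concrete illustration of the decoding in Eq. 1, here is a minimal PyTorch-style sketch (the function and variable names are ours, not from the paper's code):

```python
import torch

def decode_heatmaps(heatmaps: torch.Tensor):
    """Argmax decoding of Eq. 1. heatmaps: (K, H, W) stack of per-keypoint maps."""
    K, H, W = heatmaps.shape
    flat = heatmaps.reshape(K, -1)
    conf, idx = flat.max(dim=1)                    # s_k = max(H_k)
    xy = torch.stack((idx % W, idx // W), dim=1)   # p_k = argmax(H_k), as (x, y)
    return xy.float(), conf
```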
Typically, the ground truth heatmap $H_k$ is constructed as a 2D Gaussian, with the mean at the ground truth keypoint location and a fixed standard deviation $\tilde{l}$.

Regression methods directly regress either deterministic coordinates of the keypoints or likelihood distributions of the coordinates. We focus on the state-of-the-art RLE regression (Li et al., 2021a; Mao et al., 2022), which models the likelihood as a distribution parameterized by mean and standard deviation parameters $\hat{\mu}$ and $\hat{\sigma}$. The keypoint prediction and its confidence are given as

$$\hat{p}_k = \hat{\mu}_k, \qquad \hat{s}_k = \frac{1}{\hat{\sigma}_k}. \tag{3}$$

The loss is formulated as a negative log-likelihood:

$$\mathcal{L} = -\sum_{k=1}^{K} \log \hat{p}(p_k \mid x; \hat{\mu}_k, \hat{\sigma}_k), \tag{4}$$

which can be further expanded as an adaptively weighted loss between $\hat{p}_k$ and $p_k$, along with a regularization term such as $\log \hat{\sigma}_k^2$. Other regression-based methods use the heatmap maximum (Wei et al., 2020), keypoint classification confidence from another head (Li et al., 2021b), or simply fill in the confidence as 1 (Sun et al., 2018).

Instance-wise confidence scores are derived by aggregating the keypoint confidences with a weighted summation:

$$\hat{c} = \mathrm{agg}(\hat{s}) = \sum_{k=1}^{K} \hat{w}_k \hat{s}_k, \quad \text{where } \hat{w}_k = \frac{\mathbb{I}(\hat{s}_k > \tau_{\hat{s}})}{\sum_{k'=1}^{K} \mathbb{I}(\hat{s}_{k'} > \tau_{\hat{s}})}, \tag{5}$$

where $\mathbb{I}$ is an indicator function and $\tau_{\hat{s}}$ is a manually defined threshold. The keypoint-to-instance aggregation $\mathrm{agg}(\cdot)$ is an averaging function that selects only keypoints above the threshold, i.e. with $\hat{s}_k > \tau_{\hat{s}}$.

### 3.2. Evaluating Pose Models

Several metrics are used for evaluating keypoint accuracy. One example is the End-Point Error (EPE), defined as the mean Euclidean distance between the estimated and ground truth keypoints. EPE, measured in pixels, cannot account for a person's scale. Another metric is the Percentage of Correct Keypoints (PCK), which tallies the fraction of keypoints within varying thresholds. PCK is normalized with respect to head size and factors in scale, but does not distinguish between different types of keypoints.

A more sophisticated evaluation measure for keypoint accuracy is Object Keypoint Similarity (OKS) (Lin et al., 2014). OKS factors in both instance size and keypoint variation as an instance measure. It is defined as a weighted sum of the exponential envelope of a scaled end-point error:

$$\mathrm{OKS} = \sum_{k=1}^{K} w_k \exp\!\left(-\frac{\|\hat{p}_k - p_k\|^2}{2 l_k^2}\right), \tag{6}$$

$$\text{where } w_k = \frac{v_k}{\sum_{k'=1}^{K} v_{k'}}, \quad \text{and } l_k^2 = \mathrm{var}_k \, a. \tag{7}$$

Above, $a$ is the body area, $\mathrm{var}_k$ is a per-keypoint annotation falloff constant, and $v_k$ is a visibility indicator equal to 1 only if keypoint $k$ is present in the scene.¹ The scaling $l_k$ in the exponential envelope accounts for differences in scale across the different body joints and the overall pose area. A person instance estimate is regarded as correct (positive) if its OKS exceeds some threshold.

¹Accounts for both occluded and unoccluded keypoints.
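A minimal OKS sketch following Eqs. 6–7 and the COCO convention $l_k^2 = \mathrm{var}_k\,a$; the per-keypoint constants below are the sigmas distributed with pycocotools (with $\mathrm{var}_k = (2\sigma_k)^2$), while the helper name is ours:

```python
import numpy as np

# COCO per-keypoint sigmas (nose, eyes, ears, shoulders, elbows, wrists,
# hips, knees, ankles), as shipped with pycocotools.
KPT_SIGMAS = np.array([.026, .025, .025, .035, .035, .079, .079, .072, .072,
                       .062, .062, .107, .107, .087, .087, .089, .089])

def oks(pred, gt, vis, area):
    """Eqs. 6-7: visibility-weighted similarity. pred, gt: (17, 2); vis: (17,)."""
    l2 = area * (2 * KPT_SIGMAS) ** 2          # l_k^2 = var_k * a
    d2 = np.sum((pred - gt) ** 2, axis=1)
    sim = np.exp(-d2 / (2 * l2))               # exponential envelope per keypoint
    w = vis / max(vis.sum(), 1)                # w_k = v_k / sum(v)
    return float((w * sim).sum())
```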
### 3.3. mAP & mAR

Based on OKS, a ranking-independent metric, mean Average Recall (mAR), and a ranking-dependent metric, mean Average Precision (mAP), can be established to evaluate a given pose model. The mAR purely evaluates the pose accuracy of the model, while the mAP considers the confidence as well. For the same pose accuracy, a higher similarity between the rankings of the confidence and the OKS brings a higher mAP. Note that the formulations of mAP and mAR are analogous to the Area Under the (maximum) Precision-Recall Curve (AUPRC) and the Area Under the Receiver Operating Characteristic (AUROC) used in conventional classification (Qi et al., 2021). We give mathematical formulations of mAR and mAP as follows.

Over a dataset with $N$ samples, we can tabulate the mean Average Recall (mAR) over $T$ OKS thresholds $\{\tau_t\}$ as

$$\mathrm{mAR} = \frac{1}{T}\sum_{t=1}^{T} \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(\mathrm{OKS}_i > \tau_t).$$

This equation clearly states that the mAR is ranking-independent of the predicted confidence and purely evaluates the accuracy of the poses. However, the primary metric used for evaluating 2D pose estimation is mean Average Precision (mAP), defined as

$$\mathrm{mAP} = \frac{1}{T}\sum_{t=1}^{T} \frac{1}{N}\sum_{i=1}^{N} \frac{1}{i}\sum_{j=1}^{i} \mathbb{I}(\mathrm{OKS}_j > \tau_t),$$

where $i$ denotes an index based on the instances sorted according to their estimated confidences, i.e. $\hat{c}_1 \geq \ldots \geq \hat{c}_i \geq \ldots \geq \hat{c}_N$. The mAP therefore relies on the estimated confidences $\hat{c}$ being consistent with the OKS in relative ordering and is dependent on the ranking of the predicted confidence.

## 4. An Analysis on Pose Calibration

### 4.1. Problem Formulation & Assumptions

For a well-calibrated pose model, the predicted pose confidence should follow the same ranking as the accuracy. While there are several accuracy measures for the pose, as outlined in Sec. 3.2, we center our analysis on OKS, as it is the most comprehensive, and on its corresponding mAP metric.

For the analysis, we formulate the expected OKS and predicted confidence of both heatmap- and RLE-based methods from a statistical perspective, following two standard assumptions (Xiao et al., 2018; Li et al., 2021a). First, the $K$ keypoints of a person are conditionally independent given the image. For clarity, we drop the $k$ subscript in this section. Secondly, we assume that the ground truth location of each keypoint in an image follows a Gaussian distribution $p \sim N(\mu, \sigma^2 I)$ (Li et al., 2021a; Chen et al., 2023), where $\mu$ specifies the true underlying location. For simplicity, we consider a 2D isotropic Gaussian in our exposition and develop our analysis only in terms of the variance $\sigma^2$, although the analysis can easily be extended to non-isotropic cases.

### 4.2. Expected OKS

Assuming a Gaussian distribution parameterized by $(\mu, \sigma^2)$ for the ground truth pose $p$, the expected value of the OKS for an estimated pose $\hat{p}$ is given by

$$\mathbb{E}_p[\mathrm{OKS}] = \mathbb{E}_p\!\left[\exp\!\left(-\frac{\|\hat{p} - p\|^2}{2l^2}\right)\right] = \frac{l^2}{\sigma^2 + l^2} \exp\!\left(-\frac{\|\hat{p} - \mu\|^2}{2(\sigma^2 + l^2)}\right). \tag{11}$$

Note the above equation is a function of $\{\mu, \sigma, l, \hat{p}\}$, depending on the ground truth Gaussian and $l$, the exponential envelope fall-off rate given in Eq. 7. When a network is perfectly trained, $\hat{p}$ will approach $\mu$ and the exponential term simplifies to 1, which results in the following confidence:

$$\mathbb{E}_p[\mathrm{OKS}] = \frac{l^2}{\sigma^2 + l^2} = 1 - \frac{\sigma^2}{\sigma^2 + l^2}. \tag{12}$$
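Eq. 12 is easy to sanity-check numerically; a small Monte Carlo sketch with assumed values for $\sigma$ and $l$:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, l = 3.0, 6.0                                # assumed annotation std / falloff
mu = p_hat = np.zeros(2)                           # perfect prediction: p_hat = mu
p = rng.normal(mu, sigma, size=(200_000, 2))       # ground truth ~ N(mu, sigma^2 I)
mc = np.exp(-np.sum((p_hat - p) ** 2, axis=1) / (2 * l ** 2)).mean()
print(mc, l ** 2 / (sigma ** 2 + l ** 2))          # both ~= 0.8, matching Eq. 12
```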
### 4.3. Ad-Hoc Confidence

Heatmap methods synthesize a ground truth heatmap $H$ by constructing an isotropic Gaussian centered at the ground truth $p$ with a standard deviation $\tilde{l}$ set heuristically, e.g., $\tilde{l} = 2$. Given our previous assumption on the distribution of $p$, the effective ground truth can be expressed as $\tilde{p} \sim N(\mu, (\sigma^2 + \tilde{l}^2)I)$, or in heatmap form as

$$h_{\tilde{p}} = 2\pi \tilde{l}^2 \, p(\tilde{p} \mid p) = \exp\!\left(-\frac{\|\tilde{p} - p\|^2}{2\tilde{l}^2}\right). \tag{13}$$

If we consider the predicted heatmap $\hat{h}$ which minimizes the MSE loss in Eq. 2, we arrive at the following:

$$\hat{h} = \arg\min_{\hat{h}} \mathbb{E}_p[(\hat{h} - h)^2] = \mathbb{E}_p[h] \tag{14}$$

$$= \int_p p(p \mid x) \, 2\pi \tilde{l}^2 \, p(\tilde{p} \mid p) \, dp = 2\pi \tilde{l}^2 \, p(\tilde{p} \mid x). \tag{15}$$

The resulting optimal spatial heatmap is $\hat{H} = \{\hat{h}\} \propto N(\mu, \hat{\sigma}^2 I)$, which approximates the synthesized ground truth heatmap, with $\hat{\sigma}^2 = \sigma^2 + \tilde{l}^2$ (see the Appendix for a similar derivation for the case when $\hat{p}$ is not centered at $\mu$). This derivation highlights that predicted heatmaps learned with a pixel-wise MSE loss exhibit a standard deviation slightly larger than $\tilde{l} = 2$ even if the coordinate prediction is accurate (Gu et al., 2021a). See Appendix Sec. A for the proof.

It follows from Eq. 15 that the predicted confidence, defined as the max from Eq. 1 and located at $\hat{p} \approx \mu$, is given by

$$\hat{s}_{\text{det}} = \hat{h}_{\mu} = 2\pi \tilde{l}^2 \, p(\tilde{p} = \mu \mid x) \tag{16}$$

$$= \frac{2\pi \tilde{l}^2}{2\pi(\sigma^2 + \tilde{l}^2)} \exp\!\left(-\frac{\|\mu - \mu\|^2}{2(\sigma^2 + \tilde{l}^2)}\right) = \frac{\tilde{l}^2}{\sigma^2 + \tilde{l}^2}. \tag{17}$$

The two expected values from Eq. 12 and Eq. 17 are different for the same input, i.e. the same location $\mu$. This difference arises because $\tilde{l}$ is constant, while $l$ changes depending on the (person) instance size and keypoint. For example, a larger person leads to an underestimation of the OKS.

RLE-based regression methods are learned by minimizing a negative log-likelihood over the predicted distribution, as shown in Eq. 4. For simplicity, we consider a normal distribution $p \sim N(\hat{p}, \hat{\sigma}^2 I)$, though alternative distributions such as a Laplace or a Normalizing Flow lead to the same conclusions. After training, we show that the predicted distribution approximates the optimum, with $\hat{p} \to \mu$ and $\hat{\sigma} \to \sigma$. Detailed derivations are given in Appendix Sec. A. Substituting this $\hat{\sigma}$ into the heuristic score for RLE given in Eq. 3, we arrive at

$$\hat{s}_{\text{reg}} = \frac{1}{\hat{\sigma}} = \frac{1}{\sigma}. \tag{18}$$

Comparing Eq. 12 with Eq. 18, the predicted confidence of RLE-based methods depends only on $\hat{\sigma}$ and thus models only the annotation variation while ignoring the instance size. It also averages across all the keypoints, without excluding the occluded ones, leading to further inconsistencies with the expected value of the OKS (Eq. 7).

Table 1. mAP (first four columns) and mAR (last column). "Orig" means applying the original way of estimating confidence. "Mean", "Pred", and "GT" correspond to adjusting the confidence prediction using the mean, predicted, and ground truth area as the variables in Eq. 12, respectively. Results show that our closed-form adjustment improves the calibration and thus increases the mAP.

| Method | Type | Orig | Mean | Pred | GT | mAR |
|---|---|---|---|---|---|---|
| SBL | heatmap | 72.4 | 72.2 | 73.0 | 73.6 | 75.6 |
| RLE | regression | 72.2 | 71.8 | 73.2 | 73.3 | 75.4 |

### 4.4. Confidence Correction

Although the confidence values in these three forms (Eqs. 12, 17, and 18) all decrease as $\sigma$ increases, the rankings of the estimated confidence still differ from those of the actual OKS accuracy. One explanation is that for a specific sample the actual OKS will vary, but the predicted confidences of both heatmap- and RLE-based methods remain unchanged, since they do not consider the instance size and keypoint falloff constants. For two similar $\sigma$'s, the OKS will likely have different rankings depending on $l$, which becomes inconsistent with the ranking of the confidences.

Motivated by the above analysis, we provide a simple confidence correction to make the pose network better calibrated to the OKS. This serves as an empirical verification of our theoretical derivations. From Eq. 12, we can see that to match the format of the expected OKS, we need knowledge of $\sigma$ and $l$. For $\sigma$, we obtain it from the heatmaps by Pearson's chi-squared test for heatmap-based methods, and directly from the predicted $\hat{\sigma}$ for RLE-based methods. For $l$, we either estimate it or use the ground truth. Table 1 demonstrates that this adjustment from the theoretical analysis of the expected OKS improves mAP. However, this rescoring is based on a dismantling of the metric under ideal assumptions. We further address non-idealities in the next section, using the proposed Confidence Net.
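A sketch of the closed-form rescoring behind Table 1. The $\sigma$-extraction step is method-specific as described above; the helper below (our own name) only applies Eq. 12 given $\hat{\sigma}$ and an area estimate:

```python
import numpy as np

def rescore(sigma_hat, area, kpt_sigma):
    """Eq. 12 rescoring: s = l^2 / (sigma^2 + l^2), with l^2 = var_k * a (Eq. 7).

    sigma_hat: per-keypoint std, fit from the heatmap (heatmap methods)
               or taken from the predicted sigma (RLE);
    area:      instance area -- mean, predicted, or ground truth (Table 1 columns);
    kpt_sigma: COCO per-keypoint falloff constant.
    """
    l2 = area * (2.0 * kpt_sigma) ** 2
    return l2 / (np.asarray(sigma_hat) ** 2 + l2)
```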
Table 2. Comparisons with state-of-the-art methods on the COCO validation set. The blue color depicts the improved value after applying the proposed CCNet. "Hm", "Reg", and "const." represent confidence functions originating from the heatmap maximum, direct regression, and a constant value, respectively. This table demonstrates that our CCNet considerably improves the AP of all methods.

| Method | Confidence | Backbone | Input Size | #Params (M) | #GFLOPs | mAP | AP.5 | AP.75 | AP (M) | AP (L) | mAR |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Detection** | | | | | | | | | | | |
| SBL | Hm | ResNet-50 | 256×192 | 34.00 | 5.46 | 72.4 | 91.5 | 80.4 | 69.8 | 76.6 | 75.6 |
| +CCNet | | ResNet-50 | 256×192 | 34.08 | 5.52 | 73.3 (+0.9) | 92.6 | 80.9 | 70.4 | 77.5 | 75.6 |
| SBL | Hm | ResNet-152 | 384×288 | 68.64 | 12.77 | 76.5 | 92.5 | 83.6 | 73.6 | 81.2 | 79.3 |
| +CCNet | | ResNet-152 | 384×288 | 68.71 | 12.83 | 77.3 (+0.8) | 93.5 | 84.1 | 74.0 | 81.6 | 79.3 |
| HRNet | Hm | HRNet-W32 | 256×192 | 28.54 | 7.70 | 76.0 | 93.5 | 83.4 | 73.7 | 80.0 | 79.3 |
| +CCNet | | HRNet-W32 | 256×192 | 28.62 | 7.76 | 77.0 (+1.0) | 93.7 | 84.0 | 74.0 | 81.0 | 79.3 |
| HRNet | Hm | HRNet-W48 | 384×288 | 63.62 | 15.31 | 77.4 | 93.4 | 84.4 | 74.8 | 82.1 | 80.9 |
| +CCNet | | HRNet-W48 | 384×288 | 73.69 | 15.36 | 78.3 (+0.9) | 93.6 | 85.1 | 75.5 | 83.4 | 80.9 |
| ViTPose | Hm | ViT-Base | 256×192 | 89.99 | 17.85 | 77.3 | 93.5 | 84.5 | 75.0 | 81.6 | 80.4 |
| +CCNet | | ViT-Base | 256×192 | 90.07 | 17.91 | 78.1 (+0.8) | 93.7 | 85.0 | 75.4 | 83.3 | 80.4 |
| **Regression** | | | | | | | | | | | |
| RLE | Reg | ResNet-50 | 256×192 | 23.6 | 4.0 | 72.2 | 90.5 | 79.2 | 71.8 | 75.3 | 75.4 |
| +CCNet | | ResNet-50 | 256×192 | 23.6 | 4.0 | 73.6 (+1.4) | 91.6 | 80.2 | 72.0 | 77.6 | 75.4 |
| RLE | Reg | ResNet-152 | 384×288 | 58.3 | 11.3 | 76.3 | 92.4 | 82.6 | 75.6 | 79.7 | 79.2 |
| +CCNet | | ResNet-152 | 384×288 | 58.3 | 11.3 | 77.1 (+0.8) | 92.6 | 83.2 | 75.6 | 81.3 | 79.2 |
| RLE | Reg | HRNet-W32 | 256×192 | 39.3 | 7.1 | 76.7 | 92.4 | 83.5 | 76.0 | 79.3 | 79.4 |
| +CCNet | | HRNet-W32 | 256×192 | 39.3 | 7.1 | 77.5 (+0.8) | 92.6 | 84.2 | 75.9 | 81.3 | 79.4 |
| RLE | Reg | HRNet-W48 | 384×288 | 75.6 | 33.3 | 77.9 | 92.4 | 84.5 | 77.1 | 81.4 | 80.6 |
| +CCNet | | HRNet-W48 | 384×288 | 75.6 | 33.3 | 78.8 (+0.9) | 92.6 | 85.1 | 77.0 | 82.9 | 80.6 |
| Poseur | Reg | ResNet-50 | 256×192 | 33.1 | 4.6 | 76.8 | 92.6 | 83.7 | 74.2 | 81.4 | 79.7 |
| +CCNet | | ResNet-50 | 256×192 | 33.1 | 4.6 | 77.7 (+0.9) | 92.7 | 84.2 | 74.9 | 82.3 | 79.7 |
| IPR | const. | ResNet-50 | 256×192 | 34.0 | 5.5 | 65.6 | 88.1 | 71.8 | 61.3 | 70.2 | 74.9 |
| IPR | Hm | ResNet-50 | 256×192 | 34.0 | 5.5 | 69.5 | 88.9 | 74.6 | 67.2 | 74.7 | 74.9 |
| +CCNet | | ResNet-50 | 256×192 | 34.1 | 5.5 | 70.8 (+1.3) | 90.5 | 78.1 | 68.1 | 75.8 | 74.9 |

## 5. Calibrated Confidence Net (CCNet)

The correction in Sec. 4.4 is insufficient to fully close the calibration gap because it assumes that the network predicts the keypoint location perfectly. In practice, different models have different correlations between the prediction and $\hat{\sigma}$. As such, we propose the Calibrated Confidence Net (CCNet) (see Fig. 2) as an efficient and effective calibration add-on to existing pose estimation methods.

Denoting the original pose network, which estimates the keypoint locations, as PredNet, we add the lightweight CCNet to predict confidence based on the features of PredNet. For instance, for heatmap-based methods, we detach and utilize the penultimate features after the deconvolution layers. For RLE-based methods, we similarly use the features after the Global Average Pooling layer. In this way, CCNet does not require re-training of the pose model while still accessing PredNet's rich features. Furthermore, as PredNet is fixed, the mAR remains unaffected.

Formally, CCNet outputs a calibrated confidence $\hat{s}_k \in [0, 1]$ for each keypoint given the input $x$. It additionally predicts a visibility $\hat{v}_k \in [0, 1]$ to correct the bias caused by the thresholding operation in existing practice (Eq. 5), as accuracy may not be well aligned with visibility (Sec. 6.4). For the confidence, a simple yet effective MSE loss is applied to calibrate the predictions with the ground truth keypoints:

$$\mathcal{L}_{\text{conf}} = \sum_{k=1}^{K} (\hat{s}_k - s_k)^2, \tag{19}$$

where $s_k$ is the OKS for this keypoint. For the visibility, we treat it as a binary classification, as is common, and use a Binary Cross-Entropy loss:

$$\mathcal{L}_{\text{vis}} = -\sum_{k=1}^{K} \big(v_k \log \hat{v}_k + (1 - v_k)\log(1 - \hat{v}_k)\big). \tag{20}$$

The total loss, which updates only CCNet, is the weighted sum

$$\mathcal{L} = \mathcal{L}_{\text{conf}} + \lambda \mathcal{L}_{\text{vis}}, \tag{21}$$

where $\lambda$ serves as a weighting hyperparameter. Following the OKS form (Eq. 7), we similarly obtain the instance-level confidence by aggregating the predicted visibility and confidence.
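A compact sketch of the regression variant (layer sizes follow Appendix B; class and function names are ours). The backbone stays frozen and only this head is trained with Eqs. 19–21; the aggregation below is one OKS-style reading of the visibility-weighted instance confidence:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCNet(nn.Module):
    """One FC layer over frozen penultimate features (regression variant)."""
    def __init__(self, feat_dim=2048, num_kpts=17):
        super().__init__()
        self.head = nn.Linear(feat_dim, 2 * num_kpts)

    def forward(self, feats):                      # feats: (B, feat_dim), detached
        s, v = self.head(feats).chunk(2, dim=-1)
        return torch.sigmoid(s), torch.sigmoid(v)  # confidence, visibility in [0, 1]

def ccnet_loss(s_hat, v_hat, oks_k, vis, lam=2e-1):
    l_conf = ((s_hat - oks_k) ** 2).mean()         # Eq. 19, target = per-keypoint OKS
    l_vis = F.binary_cross_entropy(v_hat, vis)     # Eq. 20
    return l_conf + lam * l_vis                    # Eq. 21

def instance_confidence(s_hat, v_hat):
    """Visibility-weighted mean, replacing the hard threshold of Eq. 5."""
    w = v_hat / v_hat.sum(dim=-1, keepdim=True)
    return (w * s_hat).sum(dim=-1)
```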
## 6. Experiments

### 6.1. Datasets

Datasets & Evaluation Metrics. We evaluate pose estimation on three benchmarks: MSCOCO (Lin et al., 2014), MPII (Andriluka et al., 2014), and MSCOCO-WholeBody (Jin et al., 2020). For the downstream task, we evaluate 3D fitting on 3DPW (Von Marcard et al., 2018). MSCOCO consists of 250k person instances annotated with 17 keypoints; we evaluate the model with mAP over the standard 10 OKS thresholds. We also evaluate on MPII with the Percentage of Correct Keypoints (PCK) and on MSCOCO-WholeBody, which includes face and hand keypoints; we test our method with the common mAP metric to show its capability on face and hand keypoint detection apart from the body. For the downstream task, 3DPW is a more challenging outdoor benchmark with around 3k SMPL annotations for testing. We follow the convention (Kolotouros et al., 2019) and use MPJPE, PA-MPJPE, and MVE as the evaluation metrics. Additional implementation details and pseudo-code are provided in Appendix Sec. B.

Table 3. mAP evaluation on the COCO-WholeBody validation set based on Poseur (Mao et al., 2022). We base our CCNet on the part confidence and improve the whole-body AP by 2.3.

| | Body | Foot | Face | Hand | Whole |
|---|---|---|---|---|---|
| mAP (Whole) | 67.2 | 63.6 | 84.6 | 58.3 | 61.0 |
| mAP (Part) | 68.5 | 68.9 | 85.9 | 62.5 | 61.0 |
| mAP (+CCNet) | 69.9 | 69.2 | 86.4 | 62.7 | 63.3 (+2.3) |
| mAR | 72.3 | 72.9 | 88.1 | 65.4 | 67.2 |

### 6.2. Comparisons with SOTA

MSCOCO is the most challenging dataset on which to evaluate pose models. Since our method is a plug-and-play module applied after the training of pose models, we evaluate it on several baselines, including SBL (Xiao et al., 2018), HRNet (Sun et al., 2019), and ViTPose (Xu et al., 2022) for heatmap-based pipelines, and RLE (Li et al., 2021a), IPR (Sun et al., 2018), and Poseur (Mao et al., 2022) for regression-based pipelines, using their officially released checkpoints. Results in Tab. 2 show that our simple yet effective method gives improvements across varying backbones, learning pipelines, and scoring functions. It is model-agnostic and is applicable even when the uncertainty estimation capabilities of different networks vary. We further posit that pose estimation methods should be aware of confidence estimation and report the improved mAP together with the corresponding mAR, even though many methods do not compare their mARs: the gap between the two reflects how well- (or rather, poorly-) calibrated a pose model is. Qualitative visualizations of the calibrated confidence are provided in Appendix Sec. C.
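For reference, the mAP/mAR numbers in this section follow the standard COCO keypoint protocol, which can be reproduced with pycocotools (the file names below are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO('person_keypoints_val2017.json')
coco_dt = coco_gt.loadRes('predictions.json')  # [{image_id, category_id, keypoints, score}, ...]
ev = COCOeval(coco_gt, coco_dt, iouType='keypoints')
ev.evaluate()
ev.accumulate()
ev.summarize()  # AP over the 10 OKS thresholds .50:.05:.95, plus AR
```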
Table 4. The proposed CCNet improves all mAP and AUSE-PCK evaluations on the MPII validation set.

| Method | PCK.5 | PCK.1 | mAP | mAR | AUSE PCK.5 | AUSE PCK.1 |
|---|---|---|---|---|---|---|
| RLE | 86.2 | 32.9 | 75.4 | 78.8 | 3.35 | 1.76 |
| +CCNet | | | 76.6 (+1.2) | | 2.98 | 1.49 |
| SBL | 88.5 | 33.9 | 77.3 | 80.5 | 3.90 | 2.36 |
| +CCNet | | | 77.7 (+0.4) | | 3.52 | 1.95 |

Table 5. Other confidence quantification evaluations besides mAP on the COCO validation set, where "Ins" and "KP" are abbreviations of instance and keypoint, respectively.

| Method | mAP | mAR | Pearson Corr (Ins) | Pearson Corr (KP) | AUSE (Ins) | AUSE (KP) |
|---|---|---|---|---|---|---|
| RLE | 77.9 | 82.7 | 0.700 | 0.637 | 2.72 | 5.03 |
| +CCNet | 78.7 (+0.8) | | 0.782 | 0.636 | 1.72 | 4.22 |
| SBL | 76.7 | 82.8 | 0.643 | 0.543 | 2.77 | 6.47 |
| +CCNet | 78.9 (+2.2) | | 0.718 | 0.628 | 2.13 | 4.11 |

COCO-WholeBody evaluates the task of whole-body pose estimation, which includes body, face, and hand keypoints. The convention is to assign the whole-instance confidence to each part, which is unreasonable for evaluating the AP of the corresponding part. By simply changing the confidence of each part to the aggregation over the predicted part (Tab. 3, third row) instead of over all the keypoints (Tab. 3, second row), the AP is significantly improved. After applying the proposed CCNet, we further improve the AP on every part and on the whole body.

MPII is a single-person dataset and another commonly used benchmark. We use both OKS and PCK as the accuracy metric, reflected in the mAP and the AUSE (Ilg et al., 2018), respectively, as final evaluations that consider both accuracy and calibration. Table 4 demonstrates that the proposed CCNet is better on all metrics and is therefore metric- and benchmark-agnostic.

### 6.3. Confidence Evaluation

We are among the first to systematically explore confidence estimation for human pose estimation. The following additional studies verify how CCNet benefits confidence estimation beyond the AP measure.

Pearson Correlation between instance/keypoint accuracy and its confidence estimate is another measure of the quality of a confidence forecaster (Li et al., 2021a; Gu et al., 2021a; Bramlage et al., 2023). A well-estimated confidence is proportional to the expected accuracy given the input condition. Our model gives stronger correlations between confidence and accuracy (4th-5th columns in Tab. 5).

Figure 3. The (average) OKS and the decrease in estimated confidence correlate well with the circular occluder size (as a proportion of the input size). The right panel illustrates a Gaussian blur placed on the right wrist.

Table 6. 3D errors on the 3DPW test set. Results show that a better-calibrated 2D pose network further improves the 3D results.

| Method | PA-MPJPE | MPJPE | MVE |
|---|---|---|---|
| SPIN | 60.2 | 102.1 | 130.6 |
| +SBL | 58.8 | 100.5 | 128.7 |
| +CCNet | 57.8 | 99.7 | 127.5 |

Area Under Sparsification Error (AUSE) (Ilg et al., 2018; Franchi et al., 2022) is computed by gradually removing the most uncertain samples and tracking the remaining error. Such a metric reveals how closely the estimated confidence matches the factual accuracy; the best confidence ranking is the one based on the actual agreement between the prediction and the ground truth. The results in the last two columns of Tab. 5 show that our method can pick out more accurately predicted poses and filter out predictions with larger errors, which is helpful for real-world deployment.

Occlusion Robustness (Bramlage et al., 2023) tests the confidence estimate by simulating object occlusions with added synthetic patches. As the size of the added occlusion patch increases (such as a blur at the wrist), the annotation ambiguity caused by the occlusion also increases. Predicted coordinates at multiple positions behind the occluder are considered feasible (Chen et al., 2023), so the distance (error) between the (mean) prediction and a single annotated coordinate increases with the size of the occluder. Figure 3 shows that the confidence shrinks along with the accuracy (the green curve) as the occlusion patch expands, but it matches the accuracy better after calibration. Once the occlusion exceeds a certain level, humans can no longer estimate the keypoint and simply label it as invisible.
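Both ranking-based measures reported in Tab. 5 are straightforward to compute from paired (confidence, accuracy) arrays; a minimal sketch with our own helper names (the sparsification curve here is the variant described next, removing the most-uncertain samples first):

```python
import numpy as np

def pearson(conf, acc):
    """Correlation between confidence and accuracy (e.g., per-keypoint OKS)."""
    return float(np.corrcoef(conf, acc)[0, 1])

def sparsification(err, uncertainty):
    """Mean remaining error after removing the k most-uncertain samples."""
    order = np.argsort(-uncertainty)                   # most uncertain first
    return np.array([err[order[k:]].mean() for k in range(len(err))])

def ause(err, conf):
    """Gap between the predicted and oracle sparsification curves."""
    pred = sparsification(err, 1.0 - conf)             # low confidence = high uncertainty
    oracle = sparsification(err, err)                  # best possible ranking
    return float(np.mean(pred - oracle))
```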
3D Mesh Recovery is a challenging task, especially on in-the-wild data such as the 3DPW test set (Von Marcard et al., 2018). A common way to improve the 3D predictions is to align the projected 3D poses with the 2D poses predicted by off-the-shelf 2D pose estimators. Confidence, therefore, is a critical indicator of whether the predicted 2D poses are trustworthy: mathematically, it can be treated as the weight on the distance between the projected 2D location and its predicted 2D location. In this way, a better-calibrated model will better distinguish the quality of the predicted 2D keypoints and help reduce the downstream error for mesh recovery. Empirically, the 2D detection results are given by an off-the-shelf pose network (Xiao et al., 2018), and we update the 3D mesh with a 2D reprojection loss. The initial 3D predictions are from SPIN (Kolotouros et al., 2019). Table 6 shows that the calibrated 2D pose network better refines the 3D predictions.

Figure 4. Visibility aggregation ablation. (a) Confidence distribution of visible and invisible keypoints. (b) The effectiveness of additional visibility prediction in keypoint-to-instance confidence aggregation.

### 6.4. Design Choices & Discussions

Surrogate Losses (Kendall & Gal, 2017; Lakshminarayanan et al., 2017; Corbière et al., 2019; Amini et al., 2020; Qi et al., 2021; Yu et al., 2021) have been proposed in various forms for estimating confidence. In our empirical explorations, we surprisingly find that these sophisticated methods capture uncertainty no better than MSE (Appendix Tab. C). This might be bounded by the post-hoc confidence estimation of a frozen PredNet and by estimatability (Yu et al., 2024). Note that training-time adaptation of the confidence (Bramlage et al., 2023; Pathiraja et al., 2023), which requires adjusting the model architecture and training from scratch with the proposed losses, generally hurts prediction accuracy despite being well-calibrated (Oh & Shin, 2022) and leads to a less satisfactory mAP. How to better hybridize the advantages of these two approaches is a promising direction for future work.

Input Features are the basis of our confidence estimation, and we treat them as frozen penultimate features to preserve the lightweight nature of CCNet. To verify that they contain sufficiently rich information, we compared them with input features from shallower layers, the prediction itself, and the original keypoint confidence estimates, which roughly indicate the ground-truth range. A strategy of copying the backbone and fine-tuning, similar to Corbière et al. (2019); Yu et al. (2021); Zhang et al. (2023), is also considered. As the results in Appendix Tab. g show, the penultimate feature input is sufficient (Yu et al., 2024). In particular, keeping spatial information and predicting confidence using a 1×1 channel-wise convolution is crucial for detection-based methods with spatial penultimate features.

Confidence Aggregation studies how to convert the keypoint confidences into the corresponding instance confidence. Existing works (Xiao et al., 2018; Gu et al., 2021a; 2023) empirically set a visibility threshold based on the confidence estimate ($\tau_{\hat{s}}$ in Eq. 5). They found that the AP is sensitive to the choice of this thresholding hyperparameter (0.2 by default). Furthermore, the model has a different inductive bias from humans; keypoints with high confidence are not necessarily visible (Fig. 4(a)).
This indicates that human annotators would mark these keypoints as invisible, while the model remains confident in guessing the occluded keypoints' positions. The calculation of instance confidence then includes unnecessary keypoints that are not considered in the accuracy evaluation, which covers only visible keypoints, leading to further misalignment. Thus, different common aggregations are studied in Fig. 4(b). The visibility classification strategy of CCNet (the green bars) shows more consistency with human-perceived visibility without much computational burden. The additional confidence calibration (Eq. 19, shown with the red bars) further increases performance.

Limitations. While post-hoc methods share the merit of less training time and computation, they are also limited by the penultimate features given by the frozen pose estimator. Additionally, extending our work to 3D pose estimation may need nontrivial changes, since it has a different learning paradigm and output representation (3D coordinates or pose and shape parameters). The bottom-up paradigm for multi-person pose estimation additionally needs to consider the association of keypoints with each individual. Their calibration problems are important and challenging directions for future work.

## 7. Conclusion

This work is the first to study the pose calibration problem of aligning the predicted confidences with the OKS accuracy metric. We show theoretically how current methods are miscalibrated and empirically verify the derivation with a closed-form solution to close the gap. We further propose a Calibrated Confidence Net (CCNet) to learn a network-aware branch that aligns with the OKS. Our experiments demonstrate that CCNet applies to various pose methods on various datasets. The improved confidence is thoroughly evaluated and also shows promise in helping downstream tasks.

## Acknowledgments

This research/project is supported by A*STAR under its National Robotics Programme (NRP) (Award M23NBK0053). We would also like to thank the ACs and reviewers for their valuable suggestions.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 2021.

Amini, A., Schwarting, W., Soleimany, A., and Rus, D. Deep evidential regression. In NeurIPS, 2020.

Andriluka, M., Pishchulin, L., Gehler, P., and Schiele, B. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.

Bramlage, L., Karg, M., and Curio, C. Plausible uncertainties for human pose regression. In ICCV, 2023.

Chen, R., Yang, L., and Yao, A. MHEntropy: Entropy meets multiple hypotheses for pose and shape recovery. In ICCV, 2023.

Corbière, C., Thome, N., Bar-Hen, A., Cord, M., and Pérez, P. Addressing failure prediction by learning model confidence. In NeurIPS, 2019.

Franchi, G., Yu, X., Bursuc, A., Tena, A., Kazmierczak, R., Dubuisson, S., Aldea, E., and Filliat, D. MUAD: Multiple uncertainties for autonomous driving, a benchmark for multiple uncertainty types and tasks. In BMVC, 2022.

Gu, K., Yang, L., and Yao, A. Dive deeper into integral pose regression. In ICLR, 2021a.

Gu, K., Yang, L., and Yao, A. Removing the bias of integral pose regression. In ICCV, 2021b.
Gu, K., Yang, L., Mi, M. B., and Yao, A. Bias-compensated integral regression for human pose estimation. TPAMI, 2023.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML, 2017.

Ilg, E., Cicek, O., Galesso, S., Klein, A., Makansi, O., Hutter, F., and Brox, T. Uncertainty estimates and multi-hypotheses networks for optical flow. In ECCV, 2018.

Jin, S., Xu, L., Xu, J., Wang, C., Liu, W., Qian, C., Ouyang, W., and Luo, P. Whole-body human pose estimation in the wild. In ECCV, 2020.

Kendall, A. and Gal, Y. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2014.

Kolotouros, N., Pavlakos, G., Black, M. J., and Daniilidis, K. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, 2019.

Kuleshov, V. and Deshpande, S. Calibrated and sharp uncertainties in deep learning via density estimation. In ICML, 2022.

Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.

Li, J., Bian, S., Zeng, A., Wang, C., Pang, B., Liu, W., and Lu, C. Human pose regression with residual log-likelihood estimation. In ICCV, 2021a.

Li, K., Wang, S., Zhang, X., Xu, Y., Xu, W., and Tu, Z. Pose recognition with cascade transformers. In CVPR, 2021b.

Li, Z., Liu, J., Zhang, Z., Xu, S., and Yan, Y. CLIFF: Carrying location information in full frames into human pose and shape estimation. In ECCV, 2022.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In ECCV, 2014.

Mao, W., Ge, Y., Shen, C., Tian, Z., Wang, X., Wang, Z., and van den Hengel, A. Poseur: Direct human pose regression with transformers. In ECCV, 2022.

Mukhoti, J., Kirsch, A., van Amersfoort, J., Torr, P. H., and Gal, Y. Deep deterministic uncertainty: A simple baseline. In CVPR, 2023.

Oh, D. and Shin, B. Improving evidential deep learning via multi-task learning. In AAAI, 2022.

Papandreou, G., Zhu, T., Kanazawa, N., Toshev, A., Tompson, J., Bregler, C., and Murphy, K. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.

Pathiraja, B., Gunawardhana, M., and Khan, M. H. Multiclass confidence and localization calibration for object detection. In CVPR, 2023.

Pierzchlewicz, P. A., Cotton, R. J., Bashiri, M., and Sinz, F. H. Multi-hypothesis 3D human pose estimation metrics favor miscalibrated distributions. arXiv, 2022.

Qi, Q., Luo, Y., Xu, Z., Ji, S., and Yang, T. Stochastic optimization of areas under precision-recall curves with provable convergence. In NeurIPS, 2021.

Shen, M., Bu, Y., Sattigeri, P., Ghosh, S., Das, S., and Wornell, G. Post-hoc uncertainty learning using a Dirichlet meta-model. In AAAI, 2023.

Shi, D., Wei, X., Li, L., Ren, Y., and Tan, W. End-to-end multi-person pose estimation with transformers. In CVPR, 2022.

Song, H., Diethe, T., Kull, M., and Flach, P. Distribution calibration for regression. In ICML, 2019.

Sun, K., Xiao, B., Liu, D., and Wang, J. Deep high-resolution representation learning for human pose estimation. In CVPR, 2019.

Sun, X., Xiao, B., Wei, F., Liang, S., and Wei, Y. Integral human pose regression. In ECCV, 2018.

Von Marcard, T., Henschel, R., Black, M. J., Rosenhahn, B., and Pons-Moll, G. Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV, 2018.
Wehrbein, T., Rudolph, M., Rosenhahn, B., and Wandt, B. Probabilistic monocular 3D human pose estimation with normalizing flows. In ICCV, 2021.

Wei, F., Sun, X., Li, H., Wang, J., and Lin, S. Point-set anchors for object detection, instance segmentation and pose estimation. In ECCV, 2020.

Xiao, B., Wu, H., and Wei, Y. Simple baselines for human pose estimation and tracking. In ECCV, 2018.

Xu, Y., Zhang, J., Zhang, Q., and Tao, D. ViTPose: Simple vision transformer baselines for human pose estimation. In NeurIPS, 2022.

Yu, X., Franchi, G., and Aldea, E. SLURP: Side learning uncertainty for regression problems. In BMVC, 2021.

Yu, X., Franchi, G., Gu, J., and Aldea, E. Discretization-induced Dirichlet posterior for robust uncertainty quantification on regression. In AAAI, 2024.

Zhang, F., Zhu, X., Dai, H., Ye, M., and Zhu, C. Distribution-aware coordinate representation for human pose estimation. In CVPR, 2020.

Zhang, L., Rao, A., and Agrawala, M. Adding conditional control to text-to-image diffusion models. In ICCV, 2023.

Figure e. An illustration of visual cues that have similar (underlying/predictive) uncertainty (red circle: mean; green dashed circle: range) but different per-sample annotations (blue crosses), each treated as a sample from the distribution. For instance, the left annotation is further from the mean than the right one.

This appendix includes A. Theoretical Understanding, B. Implementation Details, and C. More Experimental Results, referred to in the manuscript.

## A. Theoretical Understanding

### A.1. Illustration of the Setting

Instead of one image $x$ corresponding to only one pose keypoint $p$, we consider the stochasticity caused by annotation error, occlusion ambiguity, etc., through a one-to-many distribution $p(p \mid x)$ (L275-276). Specifically, for two inputs $x_1, x_2$ with similar ambiguity $\sigma_1 \approx \sigma_2$, the accuracy (e.g., OKS) may differ sample-wise (Fig. e), but they are supposed to have similar rankings regardless of uncontrollable and irreducible uncertainty (Kendall & Gal, 2017). From another perspective: if a person were asked to re-annotate the two images, analogous to re-sampling from the distribution, the accuracy of the first image has a chance of being higher than that of the second. The goal is to achieve the highest mAP in the expected sense over the distributions.

### A.2. Expected OKS (Eq. 11)

Proof. The integrand is a product of Gaussian forms (L321-323), so the integral is tractable:

$$\mathbb{E}_{p \sim N(\mu, \sigma^2 I)}[\mathrm{OKS}] = \int_p \frac{1}{2\pi\sigma^2} \exp\!\left(-\frac{\|p - \mu\|^2}{2\sigma^2}\right) \exp\!\left(-\frac{\|\hat{p} - p\|^2}{2l^2}\right) dp. \tag{22}$$

Lemma A.1. As in L300 of the manuscript, the form can be regarded as marginalizing the random variable $\hat{p} \sim N(\mu, (\sigma^2 + l^2)I)$, i.e.

$$\int_p N(p \mid \mu, \sigma^2 I)\, N(\hat{p} \mid p, l^2 I)\, dp = N(\hat{p} \mid \mu, (\sigma^2 + l^2) I),$$

which gives

$$\int_p \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{\|p-\mu\|^2}{2\sigma^2}\right)\exp\!\left(-\frac{\|\hat{p}-p\|^2}{2l^2}\right) dp = \frac{l^2}{\sigma^2 + l^2}\exp\!\left(-\frac{\|\hat{p}-\mu\|^2}{2(\sigma^2+l^2)}\right). \tag{24}$$

Lemma A.2. From another perspective, the term within the exponent of Eq. 24 can also be arranged w.r.t. $p$ as

$$-\|Dp - E\|^2 - F, \tag{28}$$

$$D = \sqrt{\frac{\sigma^2 + l^2}{2l^2\sigma^2}}, \quad E = \frac{l^2\mu + \sigma^2\hat{p}}{\sqrt{2(l^2+\sigma^2)\,l^2\sigma^2}}, \quad F = \frac{\|\hat{p}-\mu\|^2}{2(\sigma^2+l^2)}.$$

Substituting back into Eq. 24 obtains

$$\int_p \exp(-\|Dp - E\|^2 - F)\, dp = \frac{\exp(-F)}{D^2}\int \exp(-\|u\|^2)\, du = \frac{\pi}{D^2}\exp(-F) = \frac{2\pi l^2\sigma^2}{\sigma^2+l^2}\exp\!\left(-\frac{\|\hat{p}-\mu\|^2}{2(\sigma^2+l^2)}\right),$$

where $D$ and $F$ are independent of $p$ conditional on the image, the substitution $u = Dp - E$ is used, and the last step is based on $\int_{\mathbb{R}} \exp(-x^2)\, dx = \sqrt{\pi}$. Multiplying by the prefactor $\frac{1}{2\pi\sigma^2}$ recovers Eq. 24.

### A.3. Verification of the Detection Variance $\hat{\sigma}^2 = \sigma^2 + \tilde{l}^2$ (L308)

Figure f verifies Eq. 17: the model distribution (or heatmap) approximates the noisy ground truth distribution instead of the pure one. Following Wehrbein et al. (2021), the sigmas are estimated by fitting the heatmap with a Gaussian.
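The $\hat{\sigma}^2 = \sigma^2 + \tilde{l}^2$ claim can also be reproduced in a few lines: average the synthesized ground-truth heatmaps over annotation noise (the MSE-optimal prediction, Eqs. 14–15) and read off the variance of the averaged map. A minimal sketch, with values assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, l_tilde, size = 2.0, 2.0, 64
xs = np.arange(size)
X, Y = np.meshgrid(xs, xs)
mu = np.array([32.0, 32.0])

# MSE-optimal heatmap = expectation of the rendered Gaussian over p ~ N(mu, sigma^2 I)
H = np.zeros((size, size))
for _ in range(4000):
    p = rng.normal(mu, sigma)
    H += np.exp(-((X - p[0]) ** 2 + (Y - p[1]) ** 2) / (2 * l_tilde ** 2))
H /= 4000

w = H / H.sum()
var_x = (w * (X - mu[0]) ** 2).sum()      # marginal variance of the averaged map
print(var_x, sigma ** 2 + l_tilde ** 2)   # both ~= 8
```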
Figure f. The maximum values of the heatmap almost coincide with our estimated scoring (the peak density of Eq. 17), which verifies the derivation.

### A.4. Optima of the Eq. 4 NLL

Proof. The result is well-established, but we include it here for the convenience of readers. Formally,

$$\hat{p}^*, \hat{\sigma}^* = \arg\min_{\hat{p}, \hat{\sigma}} \mathcal{L}_{\text{nll}} = \arg\max_{\hat{p}, \hat{\sigma}} \mathcal{L}_{\text{ll}}, \tag{37}$$

where, in the more general 2D case (1D in the manuscript for illustration), the log-likelihood is

$$\mathcal{L}_{\text{ll}} = \mathbb{E}_{p \sim N(\mu, \sigma^2 I)}\!\left[\log \frac{1}{2\pi\hat{\sigma}^2}\exp\!\left(-\frac{\|p-\hat{p}\|^2}{2\hat{\sigma}^2}\right)\right] = \mathbb{E}_p\!\left[-\log 2\pi - \log\hat{\sigma}^2 - \frac{\|p-\hat{p}\|^2}{2\hat{\sigma}^2}\right].$$

Dropping the constant, write $A = B + C$ with

$$B = \log\hat{\sigma}^2, \qquad C = \frac{\|p-\hat{p}\|^2}{2\hat{\sigma}^2}. \tag{41}$$

The following derivatives are computed:

$$\frac{\partial B}{\partial \hat{p}} = 0, \qquad \frac{\partial C}{\partial \hat{p}} = \frac{1}{2\hat{\sigma}^2}\frac{\partial \|p-\hat{p}\|^2}{\partial \hat{p}} = \frac{1}{2\hat{\sigma}^2}\cdot 2(p-\hat{p})^\top(-I) = -\frac{p-\hat{p}}{\hat{\sigma}^2},$$

$$\frac{\partial B}{\partial \hat{\sigma}^2} = \frac{1}{\hat{\sigma}^2}, \qquad \frac{\partial C}{\partial \hat{\sigma}^2} = -\frac{\|p-\hat{p}\|^2}{2\hat{\sigma}^4}. \tag{46}$$

Optimal $\hat{p}$. Taking the derivative of $\mathcal{L}_{\text{ll}} = \mathbb{E}_p[-A]$ (up to constants) w.r.t. $\hat{p}$ and setting it to 0 gives

$$\frac{\partial \mathbb{E}_p[-A]}{\partial \hat{p}} = \frac{1}{\hat{\sigma}^2}\big(\mathbb{E}_p[p] - \hat{p}\big) = 0,$$

using the facts that, given an image, $\mathbb{E}_p[p]$ is constant w.r.t. $\hat{p}$, and that $\hat{p}, \hat{\sigma}^2$ are constant w.r.t. $p$. Rearranging gives the optimum

$$\hat{p}^* = \mu. \tag{51}$$

Optimal $\hat{\sigma}$. Similarly, the derivative of $\mathcal{L}_{\text{ll}}$ w.r.t. $\hat{\sigma}^2$ is

$$\frac{\partial \mathbb{E}_p[-A]}{\partial \hat{\sigma}^2} = -\frac{1}{\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\mathbb{E}_p[\|p-\hat{p}\|^2]. \tag{54}$$

The optimal $\hat{p}^* = \mu$ of Eq. 51 simplifies this to

$$-\frac{1}{\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\mathbb{E}_p[\|p-\mu\|^2] = -\frac{1}{\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4}\cdot 2\sigma^2, \tag{55}$$

where the variance of the Normal distribution is used. Setting it to 0 arrives at

$$\hat{\sigma}^* = \sigma. \tag{57}$$

### A.5. The Case of Imperfect Prediction $\hat{p} \neq \mu$

Proof. TL;DR: when the prediction is imperfect, the confidence will decrease correspondingly. This is the more general case and leads to further misalignment with the ideal score (Eq. 11); for instance, the prediction deviation of easy samples is likely to be less than that of hard samples. We derive the optimal $\hat{\sigma}$ in this case. This makes sense to some extent, since the confidence is usually easier to estimate than the mean: it only requires predicting a range instead of an exact value. Denote the prediction deviation as

$$\hat{\delta} = \hat{p} - \mu, \quad \hat{\Delta}^2 = \|\hat{\delta}\|^2 \neq 0; \qquad \delta = p - \mu. \tag{58}$$

For Regression, setting Eq. 54 to 0 tells us

$$\hat{\sigma}^2 = \frac{1}{2}\mathbb{E}_p[\|p - \hat{p}\|^2] = \frac{1}{2}\mathbb{E}_p[\|p - \mu + \mu - \hat{p}\|^2] = \frac{1}{2}\mathbb{E}_p\big[\delta^\top\delta - 2\delta^\top\hat{\delta} + \hat{\delta}^\top\hat{\delta}\big]$$

$$= \frac{1}{2}\big(\mathbb{E}_p[\|\delta\|^2] - 2\,\mathbb{E}_p[\delta]^\top\hat{\delta} + \hat{\Delta}^2\big) = \frac{1}{2}\big(2\sigma^2 - 2\cdot 0^\top\hat{\delta} + \hat{\Delta}^2\big) = \sigma^2 + \frac{\hat{\Delta}^2}{2}, \tag{62}$$

where we use the fact that $\hat{\delta}$ is constant w.r.t. $p$. The score of Eq. 18 becomes

$$\hat{s}_{\text{reg}} = \frac{1}{\hat{\sigma}} = \frac{1}{\sqrt{\sigma^2 + \hat{\Delta}^2/2}} < \frac{1}{\sigma}. \tag{63}$$
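The regression result above can be checked numerically: for a 2D isotropic Gaussian NLL, the optimal variance is half the expected squared error. A minimal sketch with assumed toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, delta_hat = 2.0, 1.5                        # annotation std, prediction deviation
p = rng.normal(0.0, sigma, size=(500_000, 2))      # p ~ N(mu, sigma^2 I), with mu = 0
p_hat = np.array([delta_hat, 0.0])                 # imperfect prediction, ||p_hat - mu|| = delta_hat
sig2_opt = 0.5 * np.mean(np.sum((p - p_hat) ** 2, axis=1))  # argmin of the 2D Gaussian NLL
print(sig2_opt, sigma ** 2 + delta_hat ** 2 / 2)   # both ~= 5.125, matching Eq. 62
```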
For Detection, the derivation assumes:

Proposition A.3. An imperfect (but not bad) heatmap follows (Gu et al., 2021a)

$$\hat{h}_m = \hat{o}\exp\!\left(-\frac{\|m - \hat{p}\|^2}{2\hat{\sigma}^2}\right), \tag{64}$$

where $\hat{o}$ is a scaling factor. For the MSE (Eq. 14), the relevant derivatives at each location $m$ are

$$\frac{\partial(\hat{h}-h)^2}{\partial\hat{h}} = 2(\hat{h}-h); \qquad \frac{\partial\hat{h}}{\partial\hat{\sigma}^2} = \hat{h}\,\frac{\|m-\hat{p}\|^2}{2\hat{\sigma}^4}; \qquad \frac{\partial\hat{h}}{\partial\hat{o}} = \exp\!\left(-\frac{\|m-\hat{p}\|^2}{2\hat{\sigma}^2}\right) = \frac{\hat{h}}{\hat{o}}, \tag{66-68}$$

where the derivation of the $\hat{\sigma}^2$ derivative uses Eq. 46.

Detection's optimal $\hat{o}$ (entangled with $\hat{\sigma}^2$). Further,

$$\frac{\partial}{\partial\hat{o}}\,\mathbb{E}_p\Big[\textstyle\sum_m(\hat{h}_m - h_m)^2\Big] \propto \sum_m\big(\hat{h}_m^2 - \hat{h}_m\,\mathbb{E}_p[h_m]\big) = 0, \tag{72}$$

where the last step uses the fact that, given the image, only $h_m$ depends on $p$. Write

$$\hat{h}_m^2 = \hat{o}^2 G_m, \quad G_m = \exp\!\left(-\frac{\|m-\hat{p}\|^2}{\hat{\sigma}^2}\right); \qquad \hat{h}_m\,\mathbb{E}_p[h_m] = \hat{o}\bar{H}_m, \quad \bar{H}_m = \exp\!\left(-\frac{\|m-\hat{p}\|^2}{2\hat{\sigma}^2}\right)\frac{\tilde{l}^2}{\bar{\sigma}^2}\exp\!\left(-\frac{\|m-\mu\|^2}{2\bar{\sigma}^2}\right), \tag{73-75}$$

where $\mathbb{E}_p[h_m]$ comes from Eq. 17, and

$$\bar{\sigma}^2 \triangleq \sigma^2 + \tilde{l}^2. \tag{76}$$

Substituting back into Eq. 72, we obtain

$$\sum_m(\hat{o}^2 G_m - \hat{o}\bar{H}_m) = 0 \quad\Rightarrow\quad \hat{o} = \frac{\sum_m \bar{H}_m}{\sum_m G_m}. \tag{77-79}$$

Considering the limit $\Delta m \to 0$ with almost full support of the non-negligible $\bar{H}_m, G_m$ within the heatmap, the sums become integrals:

$$\sum_m G_m \approx \int G_m\, dm = \pi\hat{\sigma}^2, \qquad \sum_m \bar{H}_m \approx \int \bar{H}_m\, dm = \frac{2\pi\tilde{l}^2\hat{\sigma}^2}{\bar{\sigma}^2 + \hat{\sigma}^2}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right),$$

where the latter is computed via Lemma A.1. Hence

$$\hat{o} = \frac{2\tilde{l}^2}{\bar{\sigma}^2 + \hat{\sigma}^2}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right). \tag{87}$$

Remarks on $\hat{o}, \hat{\sigma}^2$. Setting $\partial\hat{o}/\partial\hat{\sigma}^2 = 0$ yields the root $\hat{\sigma}^2 = \hat{\Delta}^2/2 - \bar{\sigma}^2$. Since the derivative is monotonic w.r.t. $\hat{\sigma}^2$, it is concluded that when $\hat{\sigma}^2 > \hat{\Delta}^2/2 - \bar{\sigma}^2$, the scale factor $\hat{o}$ decreases with $\hat{\sigma}^2$ (and increases otherwise).

For Detection's optimal $\hat{\sigma}$,

$$\frac{\partial}{\partial\hat{\sigma}^2}\,\mathbb{E}_p\Big[\textstyle\sum_m(\hat{h}_m - h_m)^2\Big] \propto \sum_m\big(\hat{h}_m^2\,\|m-\hat{p}\|^2 - \hat{h}_m\,\mathbb{E}_p[h_m]\,\|m-\hat{p}\|^2\big) = 0. \tag{93}$$

Introducing $\Delta m \to 0$ as in Eq. 80, the first term becomes

$$\sum_m \hat{h}_m^2\,\|m-\hat{p}\|^2 \approx \hat{o}^2\pi\hat{\sigma}^2\,\mathbb{E}_{m\sim g}[\|m-\hat{p}\|^2] = \hat{o}^2\pi\hat{\sigma}^2\cdot\hat{\sigma}^2 = \pi\hat{o}^2\hat{\sigma}^4, \tag{96}$$

as $g(m) = N(\hat{p}, \tfrac{\hat{\sigma}^2}{2}I)$. Following Lemma A.2, $\bar{H}$ can also be expressed as a Normal w.r.t. $m$:

$$\bar{H}_m = K\,2\pi J\, N(m \mid I', J I), \quad I' = \frac{\bar{\sigma}^2\hat{p} + \hat{\sigma}^2\mu}{\bar{\sigma}^2 + \hat{\sigma}^2}, \quad J = \frac{\bar{\sigma}^2\hat{\sigma}^2}{\bar{\sigma}^2 + \hat{\sigma}^2}, \quad K = \frac{\tilde{l}^2}{\bar{\sigma}^2}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right).$$

The second term then becomes

$$\sum_m \hat{h}_m\,\mathbb{E}_p[h_m]\,\|m-\hat{p}\|^2 \approx \hat{o}\,2\pi J K\,\mathbb{E}_{m\sim h}[\|m-\hat{p}\|^2], \tag{99-100}$$

$$\mathbb{E}_{m\sim h}[\|m-\hat{p}\|^2] = \mathbb{E}[\|m - I'\|^2] + \|\hat{p} - I'\|^2 = 2J + \frac{\hat{\sigma}^4\hat{\Delta}^2}{(\bar{\sigma}^2+\hat{\sigma}^2)^2}, \tag{101-102}$$

where $h(m) = N(I', J I)$. Therefore, substituting Eqs. 96 and 99-102 back into Eq. 93 gives

$$\pi\hat{o}^2\hat{\sigma}^4 = 2\pi\hat{o}\,\tilde{l}^2\hat{\sigma}^4\,\frac{2\bar{\sigma}^2(\bar{\sigma}^2+\hat{\sigma}^2) + \hat{\sigma}^2\hat{\Delta}^2}{(\bar{\sigma}^2+\hat{\sigma}^2)^3}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right) \tag{104}$$

$$\Rightarrow\quad \hat{o} = 2\tilde{l}^2\,\frac{2\bar{\sigma}^2(\bar{\sigma}^2+\hat{\sigma}^2) + \hat{\sigma}^2\hat{\Delta}^2}{(\bar{\sigma}^2+\hat{\sigma}^2)^3}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right). \tag{105}$$

Combining Eqs. 87 and 105 gives

$$(\bar{\sigma}^2+\hat{\sigma}^2)^2 - 2\bar{\sigma}^2(\bar{\sigma}^2+\hat{\sigma}^2) - \hat{\Delta}^2\hat{\sigma}^2 = 0 \tag{107}$$

$$\Rightarrow\quad \hat{\sigma}^4 - \hat{\Delta}^2\hat{\sigma}^2 - \bar{\sigma}^4 = 0. \tag{108}$$

The score is the unnormalized density at $\hat{p}$ (Eq. 64):

$$\hat{s}_{\text{det}} = \hat{o} = \frac{2\tilde{l}^2}{\bar{\sigma}^2+\hat{\sigma}^2}\exp\!\left(-\frac{\hat{\Delta}^2}{2(\bar{\sigma}^2+\hat{\sigma}^2)}\right).$$

## B. Implementation Details

Pseudocode is attached in Alg. 1, facilitating reproducibility for the community.

Algorithm 1: CCNet pseudocode, PyTorch-like.

```python
enc, pred_head = freeze(prednet)  # the locked modules in Fig. 2

def forward(x):
    f = enc(x)                    # penultimate features
    phat = pred_head(f)
    shat, vhat = ccnet(f)
    return phat, (shat, vhat)

def train_step(data, kp_metric=kp_oks, cal_loss=mse, w_vis=2e-2):
    x, p, l, v = data
    phat, (shat, vhat) = forward(x)
    s = kp_metric(phat, p, l)
    loss_kp_conf = cal_loss(shat, s, weight=v)  # calibration loss, Eq. 19
    loss_vis = bce(vhat, v)                     # Eq. 20
    loss = loss_kp_conf + w_vis * loss_vis      # Eq. 21
    ...
```

Architectures. This part supplements Para. 1 of Sec. 5. For regression-based methods, the architecture is one fully connected layer with a 2048-D flattened feature input and 17-D keypoint confidence plus 17-D visibility outputs; for heatmap-based methods, we apply a 2D 1×1 convolution with a 256-channel input and a 34-channel output before a spatial global average pooling. Different numbers of layers and widths of the network were experimented with, and a single FC/Conv layer is found to work very well with the rich penultimate features. Negligible additional inference latency is introduced by this lightweight head. A sigmoid is used for normalized confidence.

Training. The parameters are initialized with the default Kaiming initialization. The Adam (Kingma & Ba, 2014) optimizer is used for training without weight decay. The initial learning rate is 1e-3, multiplied by 0.1 at the 9K-th step, and results are reported at 12K steps. Ground truth bounding boxes are provided as input to the top-down pose estimation methods.

OKS on MPII (Andriluka et al., 2014). Since the annotated keypoint sets differ between COCO (Lin et al., 2014) and MPII (Andriluka et al., 2014) but the images are similar, the per-keypoint falloff coefficients of the neighboring hip, shoulder, and nose are applied to those of the pelvis, thorax, upper neck, and head top, respectively.

Figure g. Calibration plot. The estimated confidence well reflects the expected OKS value after pose calibration in our context.

Table C. mAP with different surrogate losses on the COCO validation set.

| | RLE | Alea | CE | Deep Ens | SOAP |
|---|---|---|---|---|---|
| mAP | 72.2 | 73.4 | 73.5 | 73.5 | 73.4 |

Table g. mAP results of the input feature studies on COCO: (Add1) using information from the keypoints and their original under-calibrated confidence, where for detection-based methods the penultimate 2D feature map is concatenated with the interpolated final keypoint heatmap; (Add2) lower-level feature input (ResNet's layer3.5); (3) keeping spatial information is important for detection-based methods.

| Method | Orig | +CCNet | +CCNet+Add1 | +CCNet+Add2 |
|---|---|---|---|---|
| RLE | 72.2 | 73.6 | 73.6 | 73.5 |
| SBL | 72.4 | 73.3 | 73.2 | 73.3 |

| Method | Orig | Pool+FC | Conv+Pool |
|---|---|---|---|
| SBL | 72.4 | 72.6 | 73.3 |
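For concreteness, a sketch of the heatmap-variant head described in the Architectures paragraph and in the "Conv+Pool" row of Tab. g (1×1 convolution over the 256-channel penultimate map, then spatial global average pooling; the class and layer names are ours):

```python
import torch
import torch.nn as nn

class CCNetDet(nn.Module):
    """Detection (heatmap) variant: Conv 1x1 -> spatial GAP -> sigmoid."""
    def __init__(self, in_ch=256, num_kpts=17):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, 2 * num_kpts, kernel_size=1)

    def forward(self, feat):                     # feat: (B, 256, H, W), detached
        out = self.conv(feat).mean(dim=(2, 3))   # 1x1 channel-wise conv, then GAP
        s, v = out.chunk(2, dim=-1)              # 34 channels -> (17 conf, 17 vis)
        return torch.sigmoid(s), torch.sigmoid(v)
```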
Pose Calibration variants in Pierzchlewicz et al. (2022) and Bramlage et al. (2023) are mainly based on keypoint EPE instead of instance OKS and also do not focus on mAP. In our context, instead, a calibrated pose confidence is expected to predict the pose accuracy well (Fig. g).

## C. More Experimental Results

Surrogate Losses. Different losses, including a Bayesian weight posterior (Lakshminarayanan et al., 2017) and surrogate optimization (Qi et al., 2021), perform similarly well (Tab. C).

Input Features. The additional inputs of the original prediction and confidence estimate, along with lower-level features, do not result in an mAP improvement (Tab. g). Our design exploration also found that maintaining the 2D spatial layout and applying a 1×1 channel-wise convolution is crucial for detection-based methods.

Table h. The loss weighting hyperparameter λ tolerates multiple magnitudes and is not sensitive to selection. The numbers are mAP.

| Method | Orig | λ = 2e-3 | λ = 2e-1 | λ = 2e1 |
|---|---|---|---|---|
| RLE | 72.2 | 73.4 | 73.6 | 73.4 |
| SBL | 72.4 | 73.2 | 73.3 | 73.1 |

Loss Weighting Hyperparameter λ balances the OKS loss and the visibility loss; λ = 2e-1 is used in our paper. Varying λ over several magnitudes has a limited impact on the mAP (Tab. h). We speculate that the confidence and visibility prediction tasks are related and not contradictory; thus, the losses do not conflict with each other.

Visualizations. Figure h shows the effects of area, per-keypoint falloff, and visibility, respectively. Our CCNet aligns better with OKS and human perception. For the 2D pose models themselves, CCNet can calibrate both underconfident and overconfident samples (Figs. i & j). From the visualizations of the downstream task in Fig. k, we can see that the better-calibrated 2D confidences can better guide the optimization of the 3D mesh.

Figure h. Visualizations of the effects on OKS of (a) area, (b) per-keypoint falloff (e.g., the hip's is larger than the wrist's), and (c) visibility. The red dot and circle represent the predicted keypoint and its sigma confidence; the green dot indicates the ground truth keypoint location; the blue circle depicts the OKS l range.

Figure i. Our CCNet also helps calibrate underconfident pose estimation: the keypoint detection is not far from the ground truth, but the confidence does not reflect the accuracy well.

Figure j. Qualitative visualizations of the better-calibrated confidence estimation achieved by RLE incorporated with our CCNet. Green dots are ground truth, while blue dots represent predictions.

Figure k. Illustration of the benefits of pose confidence calibration in downstream 3D model fitting (panels (a)-(h): input image; SBL and SBL+CCNet predicted locations and confidences for the right elbow and wrist; ground truth 2D; and SPIN with 2D fitting against SBL, SBL+CCNet, and ground truth targets).
We demonstrate a case where the right arm (highlighted in a red circle) is heavily occluded. While the uncalibrated SBL and the calibrated SBL+CCNet predict the same (wrong) 2D keypoint location (b & c), the poorly calibrated SBL estimates a high confidence, which misleads the mesh fitting to a wrong position for the right arm (from e to f). In contrast, after adding CCNet, the confidences for the occluded right elbow and right wrist are lower, which allows the mesh to maintain its original prediction for the right arm and spend more effort on optimizing other keypoints with correctly estimated higher confidences, e.g., the left arm and legs (g).