# UAST: Uncertainty-Aware Siamese Tracking

Dawei Zhang¹, Yanwei Fu², Zhonglong Zheng¹ ³

¹College of Mathematics and Computer Science, Zhejiang Normal University, Jinhua, China. ²School of Data Science, Fudan University, Shanghai, China. ³Key Laboratory of Intelligent Education Technology and Application of Zhejiang Province, Zhejiang Normal University, Jinhua, China. Correspondence to: Zhonglong Zheng.

*Proceedings of the 39th International Conference on Machine Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copyright 2022 by the author(s).*

## Abstract

Visual object tracking is basically formulated as target classification and bounding box estimation. Recent anchor-free Siamese trackers rely on predicting the distances to the four sides of the box for efficient regression, but fail to estimate accurate bounding boxes in complex scenes. We argue that these approaches lack a clear probabilistic explanation, so it is desirable to model the uncertainty and ambiguity of target estimation. To address this issue, this paper presents an Uncertainty-Aware Siamese Tracker (UAST) built on a novel distribution-based regression formulation with localization uncertainty. We exploit regression vectors to directly represent the discretized probability distribution over the four offsets of boxes, which is general, flexible, and informative. Based on the resulting distributed representation, our method provides a probabilistic value of uncertainty. Furthermore, considering the high correlation between uncertainty and regression accuracy, we propose to learn a joint representation head of classification and localization quality for reliable tracking, which also avoids the inconsistency of classification and quality estimation between training and inference. Extensive experiments on several challenging tracking benchmarks demonstrate the effectiveness of UAST and its superiority over other Siamese trackers.

## 1. Introduction

Visual tracking is a fundamental yet challenging research topic in computer vision. It has a wide range of applications, such as surveillance systems, UAV-based monitoring, human-computer interaction, and so on. Given only an arbitrary target annotation in the initial frame, object trackers aim at predicting its location and scale in subsequent frames of the video sequence.

Figure 1. (a) Comparison of different regression methods in visual tracking: anchor-based (e.g., SiamRPN, SiamRPN++), anchor-free (e.g., SiamFC++, Ocean), and our distribution-based UAST. (b) A representative example of the proposed uncertainty-aware tracking (overlay values in the shown frame: jcr score 89%, certainty 86%, L_C 96%, T_C 94%, R_C 61%, B_C 93%). Due to occlusion and similar objects, the ground truth (green) may not be explainable enough, and many trackers are limited by such issues. Instead, distribution-based regression (yellow) can reflect the uncertainty of the localization prediction: a flat distribution depicts an uncertain, ambiguous boundary, and vice versa. Notably, UAST further provides the estimated certainty with respect to the four directions (L_C, T_C, R_C, and B_C) and the overall certainty value of the predicted box, while the jcr score denotes our joint confidence representation score.
In the most general form, there is no prior knowledge of the object category or its surrounding environment (Huang et al., 2019). Although much progress has been achieved in recent years, accurate tracking remains challenging due to occlusion, motion blur, geometric deformation, and scale and appearance variations.

In general, visual tracking can be formulated as a combination of classification and localization sub-tasks. The former aims to robustly predict the coarse location of the target, while the latter is designed to estimate precise bounding boxes. To enable accurate tracking, the regression branch is of great importance, as it is responsible for target box estimation. From this perspective, previous anchor-based Siamese trackers (Li et al., 2018; 2019) introduce region proposal networks (Ren et al., 2016) to perform bounding box regression. Recent anchor-free Siamese trackers (Xu et al., 2020; Zhang et al., 2020), which have become more popular owing to their concise, elegant design and freedom from anchor priors, directly regress the distances to the four sides of the box using fully convolutional networks.

From a distributional perspective, the aforementioned regression methods can be regarded as modeling a simple Dirac delta distribution, since the goal of regression is to fit a single value for each output of the target box. Although significant progress has been achieved, existing trackers do not consider estimating the uncertainty of box coordinates. In other words, a single prediction of a target boundary has no clear probabilistic interpretation due to the lack of extra localization representation information. Therefore, the resulting boxes are prone to inaccuracy or failure in some complex scenes. It is essential to model and estimate the uncertainty of the bounding box representation.

Our main motivation is to explore the uncertainty of tracking. Although recent work (Danelljan et al., 2020) exploits a Gaussian distribution to model a probabilistic representation of bounding boxes, it cannot completely and flexibly reflect the underlying distribution of object bounds. In fact, the real distribution is not necessarily symmetric like a Gaussian, and can even be more arbitrary (Jiang et al., 2018). So can we model a general distribution of bounding boxes to estimate the uncertainty for accurate object tracking?

Following the above analysis, we propose a novel general distribution-based regression formulation to learn a localization uncertainty representation of bounding boxes for accurate tracking, inspired by the success of GFL (Li et al., 2020) in object detection. To be consistent with existing anchor-free Siamese trackers (Xu et al., 2020; Zhang et al., 2020; Guo et al., 2021), the goal of our regression branch is also to predict the relative offsets of a spatial position to the four sides of the bounding box. Differently, the proposed tracking framework can additionally model the uncertainty and ambiguity via learning a discretized probability distribution along each of the four directions over its continuous domain, without any extra prior knowledge. As shown in Figure 1, the learned distributions clearly reflect the underlying information through their shape: the predicted distributions are usually sharp when boundaries are clear and certain, and flat when a border (here, the right one) is ambiguous. More than that, our tracker can indicate which direction of the box boundary is uncertain using a quantitative value.
Benefiting from this elegant solution to localization uncertainty reasoning, more accurate bounding boxes can be obtained owing to awareness of the potential distributions of target boundaries.

Another limitation of most existing tracking methods is the misalignment between classification and regression: a position with a high classification score may not correspond to high regression accuracy, and vice versa, leading to poor tracking performance. Recent anchor-free trackers (Xu et al., 2020) apply a quality estimation branch to assist the classification branch for final predictions. Nevertheless, their independent optimization also brings inconsistency between training and test. To this end, we present a simple yet effective joint representation head of classification and localization quality, which can be trained end-to-end and used directly during tracking. Furthermore, considering the strong correlation between the estimated uncertainty and regression accuracy, we exploit the learned distributions to design a task alignment sub-network that facilitates the learning of our joint representation head. In this way, it eliminates the misalignment and the otherwise unsolved training-test inconsistency. Notice that our method barely affects the training/inference time of the base trackers, as the additional computation cost is negligible.

We integrate our general distribution and joint representation into recent state-of-the-art anchor-free trackers, termed Uncertainty-Aware Siamese Tracking (UAST). In summary, our main contributions are as follows:

- We propose a novel distributional regression paradigm that learns a general representation of bounding boxes for single object tracking, which is capable of flexibly capturing more informative target boundaries for accurate localization, and of explicitly estimating the certainty value of each direction in a probabilistic way.
- Based on the learned distributions of the bounding box, we propose a simple yet effective joint representation head of classification and localization quality by leveraging the estimated uncertainty and a lightweight task alignment sub-network, which bridges the gap between training and inference. Notably, it is almost cost-free.
- The proposed UAST achieves state-of-the-art performance on five public tracking benchmarks, including GOT-10k, LaSOT, OTB-100, VOT-2019, and UAV-123, demonstrating its effectiveness and tracking efficiency.

## 2. Related Work

In this section, we briefly review recent single object trackers from the aspect of target state estimation, introduce uncertainty estimation in computer vision, and discuss localization quality estimation in anchor-free methods.

### 2.1. Visual Object Tracking

Compared with early correlation-filter-based trackers (Bolme et al., 2010; Henriques et al., 2014), Siamese network based methods have achieved great progress in the tracking community owing to their good balance of performance and speed. As a pioneering work, SiamFC (Bertinetto et al., 2016) applied multi-scale testing to obtain the target box, which is inefficient and inaccurate; this strategy is severely limited since no specific scale estimation is designed. In another popular category, ATOM (Danelljan et al., 2019) presents a customized IoU prediction network (Jiang et al., 2018) for target estimation. Nevertheless, it aggravates the computational burden and introduces many hyper-parameters, since multiple initial boxes need to be iteratively refined.
More recent Siamese trackers perform classification and regression simultaneously, leading to superior tracking performance. The SiamRPN tracker family (Li et al., 2018; 2019) introduces region proposal networks to regress the shift in position and scale between pre-defined anchor boxes and the ground truth. Inspired by FCOS (Tian et al., 2019) in object detection, numerous anchor-free trackers (Guo et al., 2020; Chen et al., 2020; Zhang et al., 2020; Peng et al., 2021) have emerged to avoid relying on candidate-box priors, and have become more popular due to their simplicity in design. Specifically, SiamFC++ (Xu et al., 2020), SiamCAR (Guo et al., 2020), and SiamBAN (Chen et al., 2020) directly regress the offsets to the box borders in a per-pixel-prediction manner. To alleviate the misalignment of classification and regression, Ocean (Zhang et al., 2020) uses a feature alignment module to obtain object-aware predictions for penalizing the classification branch; however, the two branches are trained separately but combined during tracking. Furthermore, SiamRCR (Peng et al., 2021) presents reciprocal links to make training and inference more consistent. Different from them, we devise a joint confidence representation head to tackle this issue.

### 2.2. Uncertainty Estimation in Computer Vision

Existing object detectors (Ren et al., 2016; Tian et al., 2019) and trackers (Li et al., 2019; Xu et al., 2020) apply a Dirac delta distribution to govern the bounding box representation, learning a single prediction for each side of the target box. Recently, in object detection, Gaussian YOLOv3 (Choi et al., 2019) and KL-Loss (He et al., 2019) model localization uncertainty by adopting a Gaussian assumption and predicting the variance of the four edges: the larger the variance, the flatter the distribution, indicating that the prediction is uncertain; the smaller the variance, the sharper the distribution, indicating that the predicted box is confident at the mean position. Nevertheless, these representations are either too simplified or too rigid to reflect the underlying distribution in practice. Furthermore, GFL (Li et al., 2020) relaxes the assumption and directly learns a more flexible general distribution of boxes. Naturally, uncertainty estimation of targets can also be applied to visual tracking. PrDiMP (Danelljan et al., 2020) learns to predict the conditional probability density using a probabilistic regression model, trained by minimizing the KL divergence between the prediction and the label distribution. (Zhong et al., 2021) uses KL divergence to learn a policy from a teacher for distraction-robust active object tracking. UATracker (Zhou et al., 2021) estimates the uncertainty of the IoU prediction and exploits it to filter out unreliable samples for the online-learned discriminative classifier in DiMP (Bhat et al., 2019). In contrast to them, we learn the discrete probability distribution of each side of the bounding box for localization uncertainty estimation, which is more flexible and informative. Meanwhile, our approach still benefits from advanced IoU-based losses owing to its compatibility with anchor-free trackers. In addition, the certainty can be expressed by an explicit probability value.
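For contrast with the general distribution learned by UAST, the Gaussian-assumption formulation above (Choi et al., 2019; He et al., 2019) can be sketched as a per-edge negative log-likelihood. This is a minimal illustration under our own assumptions; the function name and the plain L2-based NLL (without KL-Loss's smooth-L1 robustification) are ours, not the cited authors' code.

```python
import torch

def gaussian_edge_nll(mu, log_var, target):
    # Per-edge negative log-likelihood under a Gaussian assumption
    # (cf. Choi et al., 2019; He et al., 2019), up to additive constants.
    # Large predicted variance -> flat density (uncertain edge);
    # small variance -> sharp density around the mean (confident edge).
    var = log_var.exp()  # predicting log-variance keeps var positive and stable
    return 0.5 * ((target - mu) ** 2 / var + log_var)

# toy usage: the same error, once with a sharp and once with a flat prediction
mu = torch.tensor([4.0, 4.0])
log_var = torch.tensor([-2.0, 2.0])
print(gaussian_edge_nll(mu, log_var, torch.tensor([4.5, 4.5])))
```

The single (mean, variance) pair per edge is exactly the rigidity the paper argues against: it can express how uncertain an edge is, but not asymmetric or multi-modal ambiguity.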
### 2.3. Localization Quality Estimation

SiamFC++ estimates localization quality based on the centerness proposed in FCOS (Tian et al., 2019). However, centerness cannot fully account for localization quality. The Intersection-over-Union (IoU) between predicted boxes and the ground truth has also been explored and proven effective in IoU-Net (Jiang et al., 2018). After that, Ocean introduces an object-aware branch with predicted boxes, while SiamRCR (Peng et al., 2021) assigns dynamic weights in the classification loss based on the predicted IoU score. Differently, we exploit the Distance-IoU score (Zheng et al., 2020) as the label of our joint head, which is more comprehensive and suitable for object tracking. A recent advance (Li et al., 2021) suggests that a bounding box distribution with a sharp peak usually corresponds to accurate localization, and vice versa. Benefiting from the proposed distributed regression, we further utilize the estimated uncertainty representation of localization to weight the classification branch toward high-quality examples.

## 3. Uncertainty-Aware Siamese Tracking

In this section, we describe the proposed UAST in detail. As shown in Figure 2, UAST has a structure similar to existing anchor-free trackers. Nevertheless, our approach not only learns a discrete probability distribution over the four directions to describe the uncertainty of bounding boxes, but also models a joint representation head of classification and localization quality by leveraging the uncertainty estimated from the box distributions. To the best of our knowledge, UAST is the first attempt to explore the power of uncertainty estimation for anchor-free tracking.

Figure 2. The main structure of the proposed Uncertainty-Aware Siamese Tracking framework. It consists of a backbone network for feature extraction, a feature matching module, an anchor-free head with distributional regression and joint representation, and a task alignment sub-network. The two operators in the figure denote depth-wise cross-correlation and element-wise multiplication, respectively.

### 3.1. Anchor-Free Tracking

Different from RPN-based trackers, anchor-free tracking methods directly classify and regress the target bounding box at each spatial location. Following the paradigm of FCN (Long et al., 2015), each position $(i, j)$ in the feature map can be mapped to the corresponding coordinates $(s/2 + is, s/2 + js)$ in the search region of the original image, where $s$ denotes the total stride of the network. Specifically, the output of the classification head represents the foreground and background scores of the corresponding locations in the input, while the regression head predicts a 4D vector $T = (l, t, r, b)$ of distances from the corresponding location to the four sides of the ground-truth box. Let $(x_0, y_0)$ and $(x_1, y_1)$ denote the top-left and bottom-right corners of the ground truth; the regression targets (left, top, right, bottom) of location $(i, j)$ are calculated as:

$$l = i - x_0, \quad t = j - y_0, \quad r = x_1 - i, \quad b = y_1 - j \tag{1}$$

Consequently, this predicts the distances from location $(i, j)$ to the four sides of the box. However, it has no clear probabilistic explanation of bounding boxes, owing to the lack of an uncertainty representation of the target coordinates. It carries insufficient information for accurate tracking and is inflexible in dealing with object variations in complex scenes.
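As a concrete illustration of the location mapping and Eq. (1), the following PyTorch sketch computes the per-location regression targets; the function name, tensor layout, and the example stride are our own assumptions, not the authors' released code.

```python
import torch

def anchor_free_targets(fmap_size: int, stride: int, gt_box: tuple):
    # Per-location offsets (l, t, r, b) to the ground-truth box, per Eq. (1).
    x0, y0, x1, y1 = gt_box
    # map feature-map indices to search-image coordinates: s/2 + i*s (Sec. 3.1)
    coords = stride / 2 + torch.arange(fmap_size, dtype=torch.float32) * stride
    xs = coords.unsqueeze(0).expand(fmap_size, -1)  # x varies along columns
    ys = coords.unsqueeze(1).expand(-1, fmap_size)  # y varies along rows
    # all four offsets are positive only for locations inside the box
    return torch.stack([xs - x0, ys - y0, x1 - xs, y1 - ys], dim=-1)

# e.g., a 25x25 score map with an assumed total stride of 8
targets = anchor_free_targets(25, 8, (60.0, 70.0, 180.0, 200.0))  # (25, 25, 4)
```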
### 3.2. Distributional Regression Representation

From a distributional perspective, existing anchor-based and anchor-free trackers can be considered as modeling a simple Dirac delta distribution $\delta(x - \xi)$, since the regression target is to fit a single label value $\xi$ for each output of the box. It satisfies $\int_{-\infty}^{+\infty} \delta(x - \xi)\,dx = 1$, and $\xi$ can be recovered in integral form as:

$$\xi = \int_{-\infty}^{+\infty} \delta(x - \xi)\,x\,dx \tag{2}$$

To address the limitation of the Dirac delta, we propose to directly model a general distribution $P(x)$ without other priors. Given a range of the label $\xi$ ($\xi_0 \le \xi \le \xi_n$, $n \in \mathbb{N}^+$) with minimum $\xi_0$ and maximum $\xi_n$, we obtain the prediction $\hat{\xi}$ of each side by calculating the integral:

$$\hat{\xi} = \int_{-\infty}^{+\infty} P(x)\,x\,dx = \int_{\xi_0}^{\xi_n} P(x)\,x\,dx \tag{3}$$

For this general distribution, one problem must be solved: it is difficult to model an arbitrary, continuous probability distribution with a small number of parameters in a neural network. We therefore adopt a discrete representation: the range $[\xi_0, \xi_n]$ is divided into a set $\{\xi_0, \xi_1, \xi_2, \ldots, \xi_{n-1}, \xi_n\}$ with even intervals. Hence, our regression branch outputs $n + 1$ predicted values for each edge of the bounding box, which are turned into probabilities through a softmax layer. Based on the discrete distribution property $\sum_{i=0}^{n} P(\xi_i) = 1$, the estimated regression value is $\hat{\xi} = \sum_{i=0}^{n} P(\xi_i)\,\xi_i$. The proposed distributional regression formulation can therefore still be trained with previous loss objectives, such as the IoU loss used in anchor-free trackers.

Although the regression target can be obtained according to Equation (3), we expect the learned distributions to be as certain and compact as possible for interpretability, since the same integral result may correspond to different arbitrary distributions. To explicitly focus on the values $\xi_i$ and $\xi_{i+1}$ that are closest to the label $\xi$, we further optimize the shape of the distributions using the Distribution Focal Loss (DFL) proposed in (Li et al., 2020):

$$L_{dfl} = -\big((\xi_{i+1} - \xi)\log(P_i) + (\xi - \xi_i)\log(P_{i+1})\big) \tag{4}$$

where $P_i$ and $P_{i+1}$ denote $P(\xi_i)$ and $P(\xi_{i+1})$, respectively. Intuitively, DFL enlarges the probabilities of $\xi_i$ and $\xi_{i+1}$.
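To make Eqs. (3)-(4) concrete, here is a minimal sketch under our own assumptions about tensor shapes and an evenly spaced support $[\xi_0, \ldots, \xi_n]$ (the paper's formulation in Eq. (4) corresponds to a unit bin width):

```python
import torch
import torch.nn.functional as F

def integral_offset(logits, support):
    # Eq. (3): expected offset under the discrete distribution,
    # xi_hat = sum_i P(xi_i) * xi_i, with P given by a softmax.
    probs = F.softmax(logits, dim=-1)      # (n+1,) probabilities, sum to 1
    return (probs * support).sum(dim=-1)

def distribution_focal_loss(logits, target, support):
    # Eq. (4): push probability mass onto the two bins bracketing the label.
    delta = support[1] - support[0]        # even bin width
    i = int((target - support[0]) / delta)  # xi_i <= target < xi_{i+1}
    log_p = F.log_softmax(logits, dim=-1)
    w_left = (support[i + 1] - target) / delta
    w_right = (target - support[i]) / delta
    return -(w_left * log_p[i] + w_right * log_p[i + 1])

support = torch.arange(17.0)               # n = 16, as chosen in Sec. 4.2.2
logits = torch.zeros(17)                   # a maximally flat (uncertain) edge
offset = integral_offset(logits, support)  # -> 8.0, the mean of the support
loss = distribution_focal_loss(logits, torch.tensor(5.3), support)
```

Training with the IoU loss on the integrated offsets and DFL on the bin probabilities is what the paper refers to as optimizing at the box level and the edge level, respectively.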
### 3.3. Joint Confidence Representation

Recent research suggests that localization quality also needs to be considered together with the classification score for final predictions during online tracking, but existing trackers (Guo et al., 2020; Zhang et al., 2020) suffer from an inconsistency between the training and inference phases. To this end, we present a simple yet effective joint confidence representation head that leverages information from both the classification and regression branches. Specifically, given the classification vector $V_{cls}$ and the localization quality vector $V_{lq}$, our joint confidence representation $V_{jcr}$ is formulated as:

$$V_{jcr} = V_{cls} \times V_{lq} \tag{5}$$

which can be trained end-to-end and utilized directly during tracking, because we explicitly optimize the final joint formulation (i.e., $V_{jcr}$). In contrast to the standard binary classification label in Siamese tracking, we redefine the supervision for our joint representation head: negative samples are still supervised by 0, while the supervision of positives is determined by the localization quality label. As shown in Figure 3, the one-hot label is replaced by our soft label for joint confidence representation; that is, $V_{jcr}$ at the center range of the ground-truth box directly learns the corresponding localization quality.

Figure 3. An illustration of our joint confidence representation. Instead of the fore/background label (a), we merge the target confidence and localization quality as the supervision of our jcr (b). (c) Comparison of different localization quality scores (centerness, IoU, and Distance-IoU) applied in UAST; the three depicted example boxes score (centerness, IoU, DIoU) = (0.28, 0.63, 0.54), (0.42, 0.37, 0.26), and (0.89, 0.37, 0.35), respectively.

#### 3.3.1. Localization Quality Label

Current trackers utilize centerness (Xu et al., 2020) or the standard IoU score (Zhang et al., 2020) to supervise localization quality. Unfortunately, centerness mainly emphasizes the center of the target box, while IoU may lead to slow convergence and inaccuracy. Differently, the Distance-IoU (Zheng et al., 2020) between a predicted bounding box and its ground truth, a dynamic value in $[0, 1]$, is applied as the label of positive samples in our joint head:

$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} \tag{6}$$

where $\rho(b, b^{gt})$ denotes the Euclidean distance between the center points of the predicted box and the target box, and $c$ is the diagonal length of the smallest enclosing box covering the two boxes. Notice that DIoU incorporates both the normalized center distance and the IoU score, which makes it more suitable for the visual tracking task (see examples in Figure 3), because most evaluation metrics are in fact the center distance error (precision) and the average overlap rate (AUC score).

#### 3.3.2. Task Alignment Sub-Network

Benefiting from distributional regression, we can exploit the uncertainty information in the box distributions, instead of convolutional features, to perform task alignment and facilitate the learning of our joint confidence representation. Specifically, considering that the learned distributions are highly related to the quality of the regressed boxes, we construct a lightweight task alignment sub-network (TASN) on top of the regression branch to generate high-quality estimation. As shown in Figure 2, we first select the two near-neighbor values of the prediction in each distribution $P(x)$ and concatenate them as the initial localization quality features $F \in \mathbb{R}^{4 \times 2}$:

$$F = \mathrm{Concat}\big(\{\mathrm{Neighbor}(P(x)) \mid x \in \{l, r, t, b\}\}\big) \tag{7}$$

where the Neighbor(·) feature basically reflects the flatness of each distribution and is robust to object scale. Based on $F$ from the regression branch, the localization quality vector $V_{lq}$ is obtained by the task alignment sub-network with two fully-connected (FC) layers, followed by ReLU and Sigmoid, respectively:

$$V_{lq} = \mathrm{Sigmoid}\big(W_2(\mathrm{ReLU}(W_1 F))\big) \tag{8}$$

where $W_1 \in \mathbb{R}^{32 \times 8}$ and $W_2 \in \mathbb{R}^{1 \times 32}$ are the two FC layers. It is worth noting that our TASN is very lightweight and adds little computational overhead.

### 3.4. Training Objective

We optimize the overall training objective:

$$L = L_{jcr} + \lambda_1 L_{reg} + \lambda_2 L_{dfl} \tag{9}$$

where $L_{jcr}$ is the binary cross-entropy loss that trains the joint representation head. We consider only positive samples for the regression objectives: $L_{reg}$ is the IoU loss for bounding box regression, while $L_{dfl}$ forces the model to focus on learning the probabilities of values neighboring the target box, leading to a reasonable distribution. In our experiments, $\lambda_1$ (2 by default) and $\lambda_2$ (1/4, averaged over the four directions) are hyper-parameters balancing the three losses.
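The quality label of Eq. (6) and the task alignment sub-network of Eqs. (7)-(8) can be sketched as follows. The class and function names are ours, and the Neighbor(·) selection assumes the two bins bracketing each predicted offset, consistent with the description above; this is an illustrative sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def diou(box, gt):
    # Distance-IoU (Eq. 6) between boxes given as (x0, y0, x1, y1).
    ix0, iy0 = max(box[0], gt[0]), max(box[1], gt[1])
    ix1, iy1 = min(box[2], gt[2]), min(box[3], gt[3])
    inter = max(ix1 - ix0, 0.0) * max(iy1 - iy0, 0.0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area(box) + area(gt) - inter)
    # squared center distance over squared diagonal of the enclosing box
    rho2 = ((box[0] + box[2] - gt[0] - gt[2]) / 2) ** 2 \
         + ((box[1] + box[3] - gt[1] - gt[3]) / 2) ** 2
    c2 = (max(box[2], gt[2]) - min(box[0], gt[0])) ** 2 \
       + (max(box[3], gt[3]) - min(box[1], gt[1])) ** 2
    return iou - rho2 / c2

def neighbor_features(probs, support, offsets):
    # Eq. (7): for each side, the two bin probabilities around the prediction.
    delta = support[1] - support[0]
    i = ((offsets - support[0]) / delta).long().clamp(0, probs.shape[-1] - 2)
    idx = torch.stack([i, i + 1], dim=-1)     # (4, 2) neighbor indices
    return probs.gather(-1, idx).reshape(-1)  # F in R^{4x2}, flattened to 8

class TaskAlignmentSubNet(nn.Module):
    # Eq. (8): two FC layers with W1 in R^{32x8} and W2 in R^{1x32}.
    def __init__(self):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(8, 32), nn.Linear(32, 1)
    def forward(self, feat):                  # feat: (..., 8)
        return torch.sigmoid(self.fc2(F.relu(self.fc1(feat))))

# toy usage for one location: probs (4, n+1) from the regression head
probs = F.softmax(torch.randn(4, 17), dim=-1)
support = torch.arange(17.0)
offsets = (probs * support).sum(-1)           # Eq. (3)
v_lq = TaskAlignmentSubNet()(neighbor_features(probs, support, offsets))
# V_jcr = V_cls * V_lq (Eq. 5), supervised by the DIoU soft label in L_jcr
```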
## 4. Experiments

### 4.1. Implementation Details

#### 4.1.1. Framework

Like Ocean (Zhang et al., 2020), we employ a modified ResNet-50 (He et al., 2016) that contains only the first four stages as our backbone. For a fair comparison, depth-wise correlation is utilized to generate the fused features for the subsequent anchor-free head. Differently, we remove the separate quality assessment branch, owing to our joint confidence representation of classification and quality. The last layer of our regression head has $n + 1$ outputs per side instead of 1, incurring negligible computational cost.

#### 4.1.2. Training Phase

The backbone is pre-trained on ImageNet (Russakovsky et al., 2015). Training image pairs are sampled from ImageNet VID and DET (Russakovsky et al., 2015), COCO (Lin et al., 2014), Youtube-BB (Real et al., 2017), GOT-10k, and LaSOT (Fan et al., 2019). The template image is 127×127 pixels, while the search region is 255×255 pixels. We train the network using synchronized stochastic gradient descent (SGD) with a batch size of 128 on 4 GPUs for 20 epochs, with warm-up in the first 5 epochs and a learning rate exponentially decayed from 5e-3 to 1e-6 over the last 15 epochs. We freeze the backbone for the first 10 epochs and fine-tune it in the remaining epochs. The weight decay and momentum are set to 1e-5 and 0.9, respectively.

#### 4.1.3. Tracking Phase

The outputs of UAST are a set of distance probabilities and the jcr score. We predict bounding boxes by calculating the integral of each distribution. Following (Li et al., 2018), the score map is penalized by a cosine window and a scale-change penalty for motion smoothness. The box at the location with the best jcr score is selected and updates the target state by linear interpolation. Meanwhile, UAST takes the sum of the two adjacent probabilities on each border as the certainty value of each of the four directions, and their mean as the overall reliability of tracking. Algorithm 1 shows the procedure in detail.

Algorithm 1: Uncertainty-Aware Siamese Tracking

1: Input: frames $\{I_k\}_{k=1}^{K}$, initial target box $B_1$
2: Output: target boxes $\{B_k\}_{k=2}^{K}$, certainty values $\{C_k\}_{k=2}^{K}$
3: for $k = 2$ to $K$ do
4:   Perform feature extraction and matching;
5:   Model the distributed representation $\{D_k^l, D_k^t, D_k^r, D_k^b\}$;
6:   Obtain the 4 offsets $\{L_k, T_k, R_k, B_k\}$ by Eq. (3);
7:   Extract the feature $V_{lq}$ according to Eq. (7) and Eq. (8);
8:   Calculate the joint confidence score $V_{jcr}$;
9:   Select the highest jcr score and the corresponding box $B_k$;
10:  Compute $\{C_k^l, C_k^t, C_k^r, C_k^b\}$ for the 4 sides of box $B_k$;
11:  Average them to obtain the overall certainty $C_k$;
12:  if $C_k < 0.5$ then
13:    Warning: uncertain tracking result!
14:  end if
15: end for
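The certainty computation in steps 10-11 of Algorithm 1 can be sketched as below. We assume, as with the near-neighbor features above, that each side's certainty is the total probability of the two bins bracketing the predicted offset; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def box_certainty(logits, support):
    # Steps 10-11 of Algorithm 1: per-side certainties (L_C, T_C, R_C, B_C)
    # and their mean as the overall certainty C_k of the selected box.
    probs = F.softmax(logits, dim=-1)            # (4, n+1)
    offsets = (probs * support).sum(-1)          # Eq. (3)
    delta = support[1] - support[0]
    i = ((offsets - support[0]) / delta).long().clamp(0, probs.shape[-1] - 2)
    two_bins = probs.gather(-1, torch.stack([i, i + 1], -1))
    per_side = two_bins.sum(-1)                  # sum of two adjacent probs
    return per_side, per_side.mean()

per_side, c_k = box_certainty(torch.randn(4, 17), torch.arange(17.0))
if c_k < 0.5:                                    # step 12 of Algorithm 1
    print("Warning: uncertain tracking result!")
```

A sharp distribution concentrates nearly all its mass on the two bins around the prediction, so the certainty approaches 1; a flat distribution spreads the mass and the certainty drops toward the uniform baseline.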
UAST is implemented in PyTorch 1.1 and runs at 65 fps. Our experiments are conducted on a server with an Intel Xeon(R) Gold 5118 CPU and a Tesla V100 16 GB GPU.

### 4.2. Ablation Study

#### 4.2.1. Component-wise Analysis

To verify the influence of the proposed approach, we perform a component-wise study on GOT-10k, as presented in Table 1. The offline version of Ocean (O) achieves an AO score of 0.592. The baseline (I) denotes Ocean (Zhang et al., 2020) with a classification head (without the localization quality branch) and an anchor-free regression head, and thus obtains an AO score of only 0.572. Replacing the regression module with our general distributions yields an AO gain of 1.2 points, confirming that the proposed distribution-based method (II) performs better than the single prediction of a simple Dirac delta distribution. In line III, DFL brings a further improvement of 1.2% in AO by focusing on the values near the ground truth, which helps accurate target estimation. Furthermore, adding the joint representation head of classification and localization quality (IV) improves the AO by 1.8% and SR0.75 by 2.3%, since it benefits from our Distance-IoU-guided predictions. Finally, the uncertainty-aware quality feature generated by the proposed task alignment sub-network (V) brings a significant improvement of 2.1 points, showing the effectiveness of our uncertainty representation. All of these components therefore contribute to accurate tracking.

Table 1. Ablation experiments of different variants of UAST on the GOT-10k test set; the baseline is Ocean without the object-aware branch.

| # | Components | AO | SR0.5 | SR0.75 |
|---|---|---|---|---|
| O | Ocean | 0.592 | 0.695 | 0.465 |
| I | Baseline | 0.572 | 0.674 | 0.435 |
| II | + General Dist. | 0.584 | 0.687 | 0.446 |
| III | + Dist. FL | 0.596 | 0.705 | 0.462 |
| IV | + Joint Rep. | 0.614 | 0.723 | 0.485 |
| V | + Task Align. | 0.635 | 0.741 | 0.514 |

#### 4.2.2. Discussion on Distributed Regression

To determine a reasonable range of $n$, we illustrate the distribution of bounding box regression targets in Figure 4. According to the statistical histogram over a large number of training samples, the recommended value is at least 14, and we set it to 16. In Table 1, we find that the general distribution achieves better results, and DFL further boosts its performance. A representative case with its distributions and uncertainty over the four directions is depicted in Figure 1, showing that the proposed distributed regression method can effectively represent the prediction confidence for the four sides of the target bounding box through the distribution shapes and the estimated certainty values. Notably, the right distance of the zebra is ambiguous due to partial occlusion.

Figure 4. The histogram of regression targets in anchor-free tracking over 120,000 training samples on the GOT-10k train set; the scatter diagram represents the correlation between IoU and the joint confidence scores for some randomly sampled instances.

#### 4.2.3. Discussion on Joint Representation

In addition to classification, the measurement of localization quality is also important but has been neglected in the field of tracking. Centerness is a pre-defined label indicating the distance between a location and the target center, while IoU scores reflect localization accuracy. We find that both improve the AO to some degree in Table 2. Nevertheless, DIoU performs better than both, with an AO of 0.596, owing to its comprehensiveness. Figure 3 also shows that DIoU depicts localization quality more accurately. We therefore apply DIoU as the label of our joint head, yielding an obvious gain of 3.3% and demonstrating its effectiveness. More importantly, it can be trained end-to-end and utilized directly during tracking.

Table 2. Comparison of different localization quality estimation (LQE) choices.

| LQE | None | Center | IoU | D-IoU | JCR-DIoU |
|---|---|---|---|---|---|
| AO | 0.572 | 0.587 | 0.591 | 0.596 | 0.605 |
Furthermore, we plot the scatter diagram between IoU scores and the predicted joint scores in Figure 4, showing a more consistent correlation.

#### 4.2.4. Compatibility with Anchor-Free Trackers

We integrate the distributed regression and joint confidence representation of UAST into a series of recent anchor-free trackers, making the minimal necessary modifications to perform uncertainty-aware tracking. Based on the results in Table 3, UAST consistently improves the success score by 3 points or more on LaSOT, without loss of inference speed.

Table 3. Performance of popular anchor-free Siamese trackers augmented with the proposed UAST components (distributed representation and JCR) on the LaSOT test set.

| Tracker | Success | FPS |
|---|---|---|
| SiamCAR | 0.507 | 52 |
| SiamCAR + UAST | 0.543 | 52 |
| SiamBAN | 0.514 | 40 |
| SiamBAN + UAST | 0.548 | 40 |
| SiamGAT | 0.539 | 70 |
| SiamGAT + UAST | 0.567 | 70 |
| Ocean | 0.526 | 68 |
| Ocean + UAST | 0.571 | 68 |

### 4.3. Comparison with State-of-the-art Methods

We evaluate UAST against state-of-the-art methods on five tracking benchmarks: GOT-10k (Huang et al., 2019), VOT-2019 (Kristan et al., 2019), OTB-100 (Wu et al., 2015), UAV-123 (Mueller et al., 2016), and LaSOT (Fan et al., 2019). Without bells and whistles, UAST achieves state-of-the-art performance; the experimental results are presented in detail in the following subsections.

#### 4.3.1. GOT-10k Benchmark

GOT-10k (Huang et al., 2019) is a large-scale generic object tracking benchmark with 10,000 video sequences, of which 180 videos are used for testing. Note that there is zero class overlap between the train and test subsets. Following the official protocol, we train UAST only on its training set and evaluate it against 14 state-of-the-art tracking methods on the test set. As shown in Table 4, our UAST achieves an AO of 0.635, superior to the other anchor-free trackers SiamGAT (Guo et al., 2021), RPT (Ma et al., 2020), Ocean (Zhang et al., 2020), and D3S (Lukezic et al., 2020). These results show the effectiveness of our localization uncertainty estimation. Moreover, UAST performs slightly better than the recent online-learning-based tracker PrDiMP (Danelljan et al., 2020), which further proves the generalization ability of the proposed tracker on unseen target classes.

Table 4. State-of-the-art comparison on the GOT-10k test set in terms of average overlap (AO) and success rate (SR).

| Tracker | AO | SR0.5 | SR0.75 |
|---|---|---|---|
| MDNet | 0.299 | 0.303 | 0.099 |
| ECO | 0.316 | 0.309 | 0.111 |
| SiamFC | 0.374 | 0.404 | 0.144 |
| SiamRPN++ | 0.517 | 0.616 | 0.325 |
| ATOM | 0.556 | 0.634 | 0.402 |
| SiamCAR | 0.569 | 0.670 | 0.415 |
| SiamFC++ | 0.595 | 0.695 | 0.479 |
| Ocean | 0.592 | 0.695 | 0.473 |
| D3S | 0.597 | 0.676 | 0.462 |
| DiMP-50 | 0.611 | 0.717 | 0.492 |
| LightTrack | 0.623 | 0.726 | - |
| RPT | 0.624 | 0.730 | 0.504 |
| SiamGAT | 0.627 | 0.743 | 0.488 |
| PrDiMP | 0.634 | 0.738 | 0.543 |
| UAST | 0.635 | 0.741 | 0.514 |

#### 4.3.2. LaSOT Benchmark

LaSOT (Fan et al., 2019) is a high-quality large-scale tracking benchmark with 280 long-term test videos. We evaluate our tracker against DiMP-50 (Bhat et al., 2019), Ocean (Zhang et al., 2020), PACNet (Zhang et al., 2021), DROL-RPN (Zhou et al., 2020), and 12 other methods. Figure 5 shows that the proposed UAST achieves state-of-the-art performance with an AUC score of 0.571 and a precision of 0.587, performing better than the other SOTA Siamese trackers. Impressively, our method obtains the best metrics among all trackers in the comparison, surpassing Ocean-online and DiMP-50 by a visible margin.

Figure 5. Precision and success plots of OPE on the LaSOT test set.
This proves that UAST is also effective for reliably and accurately tracking long-term targets.

#### 4.3.3. VOT-2019 Benchmark

We evaluate UAST on the Visual Object Tracking real-time challenge 2019 (Kristan et al., 2019). As shown in Table 5, our UAST achieves an EAO of 0.334, a robustness of 0.386, and an accuracy of 0.608, better than recent state-of-the-art trackers such as Ocean, SiamBAN, and DiMP. Note that UAST has a clear advantage in accuracy over all comparison trackers, which suggests that our tracker can accurately estimate the target box owing to the proposed distributional regression formulation. We further report the EAO results in Figure 6.

Table 5. Comparison of tracking results on the VOT-2019 benchmark.

| Tracker | EAO | Accuracy | Robustness |
|---|---|---|---|
| SiamFCOS | 0.223 | 0.561 | 0.788 |
| SPM-Tracker | 0.275 | 0.577 | 0.507 |
| SiamMask | 0.287 | 0.594 | 0.461 |
| SiamRPN++ | 0.292 | 0.580 | 0.446 |
| SiamDW | 0.299 | 0.600 | 0.467 |
| PACNet | 0.300 | 0.573 | 0.401 |
| ATOM | 0.301 | 0.603 | 0.411 |
| DiMP-50 | 0.321 | 0.582 | 0.371 |
| SiamBAN | 0.327 | 0.602 | 0.396 |
| Ocean | 0.327 | 0.590 | 0.376 |
| UAST | 0.334 | 0.608 | 0.386 |

Figure 6. Expected average overlap results on VOT-2019.

#### 4.3.4. OTB-100 Benchmark

OTB-100 (Wu et al., 2015) is a classical benchmark in visual tracking, containing 100 short-term videos. We report results on OTB-100 against SiamRPN++ (Li et al., 2019), DiMP-50 (Bhat et al., 2019), Ocean (Zhang et al., 2020), ATOM (Danelljan et al., 2019), SiamFC++ (Xu et al., 2020), and others. Figure 7 shows that UAST achieves a comparable performance with an AUC of 0.689 and a precision of 0.911, improving on Ocean by 2.3% and 0.9%, respectively.

Figure 7. Precision and success plots of OPE on OTB-100.

#### 4.3.5. UAV-123 Benchmark

UAV-123 (Mueller et al., 2016) consists of 123 sequences captured by low-altitude UAVs. It can be used to evaluate whether a tracker is suitable for deployment in aerial scenarios. We compare the proposed method with 9 state-of-the-art trackers; Figure 8 shows the results in detail. UAST outperforms most previous Siamese trackers and obtains an AUC score close to that of SiamGAT (Guo et al., 2021). For precision, our tracker obtains the top rank of 0.860, superior to SiamGAT, DiMP-50, and ATOM. This demonstrates the effectiveness of our uncertainty-aware tracker.

Figure 8. Precision and success plots of OPE on UAV-123.

### 4.4. Discussion

Both GFL (Li et al., 2020) and our method directly learn the joint representation. However, UAST is designed especially for visual tracking, where only one object is tracked. Moreover, GFL mainly develops the focal loss (Lin et al., 2017) for the data imbalance problem in object detection, while UAST aims at exploring uncertainty for tracking. In addition to the shape of the distributions, UAST further estimates the uncertainty with a quantitative value, which is instructive and potentially influential in the field of tracking (see more related discussions in the appendix). We expect the estimated uncertainty to be usable as crucial information in safety-critical vision systems.

## 5. Conclusion

In this paper, we propose a distribution-based regression formulation for accurate visual tracking that models a localization uncertainty representation. It offers an entirely new perspective in the tracking community, since our method has an explicit probabilistic interpretation with highly flexible discretized distributions. Furthermore, we address the task misalignment of anchor-free trackers by learning a joint representation of classification and quality estimation. Experiments show that UAST outperforms previous state-of-the-art trackers on several tracking benchmarks. We hope our work inspires further research on uncertainty in object tracking.

## Acknowledgements

This work was supported by the Natural Science Foundation of China under Grant No.
11871438, and the Key Projects of the Natural Science Foundation of Zhejiang Province under Grant No. LZ22F020010.

## References

Bertinetto, L., Valmadre, J., Henriques, J. F., Vedaldi, A., and Torr, P. H. S. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pp. 850-865, 2016.

Bhat, G., Danelljan, M., Gool, L. V., and Timofte, R. Learning discriminative model prediction for tracking. In IEEE International Conference on Computer Vision, pp. 6182-6191, 2019.

Bolme, D. S., Beveridge, J. R., Draper, B. A., and Lui, Y. M. Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2544-2550, 2010.

Chen, Z., Zhong, B., Li, G., Zhang, S., and Ji, R. Siamese box adaptive network for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6668-6677, 2020.

Choi, J., Chun, D., Kim, H., and Lee, H.-J. Gaussian yolov3: An accurate and fast object detector using localization uncertainty for autonomous driving. In IEEE International Conference on Computer Vision, pp. 502-511, 2019.

Danelljan, M., Bhat, G., Khan, F. S., and Felsberg, M. Atom: Accurate tracking by overlap maximization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4660-4669, June 2019.

Danelljan, M., Gool, L. V., and Timofte, R. Probabilistic regression for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7183-7192, 2020.

Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., and Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5374-5383, June 2019.

Guo, D., Wang, J., Cui, Y., Wang, Z., and Chen, S. Siamcar: Siamese fully convolutional classification and regression for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6269-6277, 2020.

Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., and Shen, C. Graph attention tracking. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 9543-9552, 2021.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

He, Y., Zhu, C., Wang, J., Savvides, M., and Zhang, X. Bounding box regression with uncertainty for accurate object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888-2897, 2019.

Henriques, J. F., Caseiro, R., Martins, P., and Batista, J. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583-596, 2014.

Huang, L., Zhao, X., and Huang, K. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1562-1577, 2019.

Jiang, B., Luo, R., Mao, J., Xiao, T., and Jiang, Y. Acquisition of localization confidence for accurate object detection. In European Conference on Computer Vision, pp. 784-799, 2018.

Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Pflugfelder, R., Kamarainen, J.-K., Cehovin Zajc, L., et al. The seventh visual object tracking vot2019 challenge results. In IEEE International Conference on Computer Vision Workshops, pp. 2206-2241, 2019.

Li, B., Yan, J., Wu, W., Zhu, Z., and Hu, X. High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971-8980, June 2018.
Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., and Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282-4291, June 2019.

Li, X., Wang, W., Wu, L., Chen, S., Hu, X., Li, J., Tang, J., and Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems, 2020.

Li, X., Wang, W., Hu, X., Li, J., Tang, J., and Yang, J. Generalized focal loss v2: Learning reliable localization quality estimation for dense object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 11632-11641, 2021.

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., and Zitnick, L. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pp. 740-755, 2014.

Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, pp. 2999-3007, Oct 2017.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, 2015.

Lukezic, A., Matas, J., and Kristan, M. D3s - a discriminative single shot segmentation tracker. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7133-7142, 2020.

Ma, Z., Wang, L., Zhang, H., Lu, W., and Yin, J. Rpt: Learning point set representation for siamese visual tracking. In European Conference on Computer Vision, pp. 653-665, 2020.

Mueller, M., Smith, N., and Ghanem, B. A benchmark and simulator for uav tracking. In European Conference on Computer Vision, pp. 445-461, 2016.

Peng, J., Jiang, Z., Gu, Y., Wu, Y., Wang, Y., Tai, Y., Wang, C., and Lin, W. Siamrcr: Reciprocal classification and regression for visual object tracking. In International Joint Conference on Artificial Intelligence, pp. 952-958, 2021.

Real, E., Shlens, J., Mazzocchi, S., Pan, X., and Vanhoucke, V. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 7464-7473, July 2017.

Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137-1149, 2016.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

Tian, Z., Shen, C., Chen, H., and He, T. Fcos: Fully convolutional one-stage object detection. In IEEE International Conference on Computer Vision, pp. 9627-9636, 2019.

Wu, Y., Lim, J., and Yang, M.-H. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834-1848, 2015.

Xu, Y., Wang, Z., Li, Z., Yuan, Y., and Yu, G. Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines. In AAAI Conference on Artificial Intelligence, pp. 12549-12556, 2020.

Zhang, D., Zheng, Z., Jia, R., and Li, M. Visual tracking via hierarchical deep reinforcement learning. In AAAI Conference on Artificial Intelligence, pp. 3315-3323, 2021.

Zhang, Z., Peng, H., Fu, J., Li, B., and Hu, W. Ocean: Object-aware anchor-free tracking. In European Conference on Computer Vision, pp. 771-787, 2020.
Zheng, Z., Wang, P., Liu, W., Li, J., Ye, R., and Ren, D. Distance-iou loss: Faster and better learning for bounding box regression. In AAAI Conference on Artificial Intelligence, pp. 12993-13000, 2020.

Zhong, F., Sun, P., Luo, W., Yan, T., and Wang, Y. Towards distraction-robust active visual tracking. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pp. 12782-12792, 2021.

Zhou, J., Wang, P., and Sun, H. Discriminative and robust online learning for siamese visual tracking. In AAAI Conference on Artificial Intelligence, pp. 13017-13024, 2020.

Zhou, L., Ledent, A., Hu, Q., Liu, T., Zhang, J., and Kloft, M. Model uncertainty guides visual object tracking. In AAAI Conference on Artificial Intelligence, pp. 3581-3589, 2021.

## A. More Discussions about Distributed Regression

Figure 9 shows some examples of noisy, incorrect, or ambiguous ground-truth bounding box annotations from GOT-10k (Huang et al., 2019). Previous bounding box regression methods (e.g., SiamRPN (Li et al., 2018), SiamFC++ (Xu et al., 2020)) do not take such ambiguities of the ground-truth bounding boxes into account. As a result, learning is unstable and the loss is relatively large in these cases. To address this issue, we propose a novel bounding box regression formulation with a general distribution. The learned probability distribution is interpretable, since it reflects the level of uncertainty of the bounding box predictions.

Figure 9. In visual object tracking, ground-truth bounding boxes have inherent ambiguities in some cases. The first row shows objects whose boundary is unclear and ambiguous due to shadow or the object itself; the ambiguities in the second row are introduced by similar objects or background noise; and the last row shows examples of occlusion. These aspects are modeled by our distributed representation.

As illustrated in Figure 10, the assumed representation goes from rigid (Dirac delta) to flexible (our general distribution). A significant advantage of our work is that the learned probability distributions can reflect the uncertainty of bounding box predictions. We also list several key comparisons of these distributions in Table 6. The proposed distribution decouples the representation and the loss objective of bounding box regression, making it compatible with existing anchor-free tracking methods at both the edge level, for learning the probability representation, and the box level, for learning bounding box regression.

Figure 10. Illustrations of distributions for bounding box regression, from rigid (Dirac delta) to flexible (general). Existing trackers, rooted at a fixed point via the Dirac delta, have limitations in modeling the real data distribution. In contrast, our distribution is more flexible, as its shape can reflect the uncertainty of bounding box predictions.

Table 6. Comparison of different distribution representations for bounding box regression. Edge level denotes optimization over the four directions, while box level denotes IoU-based losses that treat the bounding box as a whole objective.

| | Dirac delta distribution | General distribution (ours) |
|---|---|---|
| Probability density | $\delta(x - \xi)$ | $P(x)$ |
| Inference target | $x$ | $\int P(x)\,x\,dx$ |
| Loss objective | $(x - \xi)^2/2$ or $L_{IoU}$ | $(\int P(x)\,x\,dx - \xi)^2/2$ or $L_{IoU}$ |
| Optimization level | edge / box | edge / box |
## B. Analysis of Different Localization Quality Labels

Centerness (Xu et al., 2020) mainly considers the center location of the target box, while the IoU score (Zhang et al., 2020) may yield an inaccurate quality label since it does not model the center distance. Our localization quality label therefore applies the Distance-IoU score (Zheng et al., 2020) between the predicted bounding box and its ground truth, which incorporates both the normalized center distance and the IoU score and is thus more suitable and effective for visual tracking. As shown in Figure 3, in the left case, the red point is somewhat far from the target center, so the centerness is small. The major problem of centerness is that its definition leads to unexpectedly small label values, which causes unstable training. In practice, the receptive field of that point corresponds to the head of the cat, so the predicted box is not bad, with an IoU score of 0.63. In contrast, the Distance-IoU provides a suitable and reliable label value. The other case in the right figure also shows that DIoU performs better than IoU and centerness for boxes with the same IoU score but different spatial locations.

## C. Discussion of Differences from Anchor-Free Siamese Trackers

Existing trackers do not consider estimating the uncertainty of box coordinates, so they have no clear probabilistic interpretation. In contrast, we propose a novel formulation that learns a general distribution of the bounding box representation with uncertainty for visual object tracking, termed the distribution-based regression method.

There is also a usage inconsistency between classification and quality estimation, since the classification and localization quality branches are trained separately but combined during online tracking. Different from prior work, we devise a joint representation head to tackle this issue.

Finally, most existing tracking methods suffer from task misalignment between classification and regression: a position with a high classification score may not achieve high regression accuracy, and vice versa. Differently, benefiting from the proposed distributed regression framework, we utilize the uncertainty information from the box distributions to guide the learning of our joint representation head.

## D. Discussion of Differences from GFL in Object Detection

Both GFL (Li et al., 2020) and our method directly learn a joint representation of classification and localization quality. However, GFL is tailored to object detection, whereas in visual tracking one and only one object is tracked. Our work differs from GFL in four fundamental ways.

1) In GFL, the training samples of the classification and regression heads are identical: both are sampled from positions within the ground-truth boxes. The ambiguous matching between anchors and the object severely hinders the robustness of a tracker. Differently, our method is asymmetric, which is tailored to the visual tracking task. Specifically, the joint representation head considers only the pixels close to the target center as positive samples, while the regression head considers all pixels in the ground-truth box as training samples. This fine-grained sampling strategy guarantees that the joint head learns a robust similarity metric for localization, which is important for tracking.

2) GFL uses the IoU score and the Quality Focal Loss (QFL) to supervise the joint head.
However, the IoU score may not be credible in some cases (see Figure 3), and QFL is not well suited to single object tracking, which is a binary classification problem. Our supervision instead applies the Distance-IoU (Zheng et al., 2020) between the predicted box and its ground truth, which incorporates both the normalized center distance and the IoU score. This measurement is more suitable and effective for visual tracking. We compare our loss function with others in Table 7.

3) GFL qualitatively interprets the uncertainty through the shape of the distributions (e.g., sharp or flat). Beyond that, our UAST estimates the uncertainty with quantitative values in [0, 1]. Specifically, as shown in Figure 1, UAST provides the estimated certainty for the four directions (L_C, T_C, R_C, and B_C) as well as the overall certainty of the predicted box. For example, UAST estimates a lower right-directional certainty (R_C: 61%) of the target due to the ambiguity caused by partial occlusion, confirming its effectiveness.

4) GFLv2 utilizes the statistics of the bounding box distributions to perform localization quality estimation. Specifically, GFLv2 chooses the Top-k values, along with the mean value of each distribution vector, as the basic statistical feature, which is not representative and can be unstable in bad cases. Differently, we select the two near-neighbor values of the prediction in each distribution as our initial localization quality features, providing a simpler, more efficient yet effective method.

Table 7. Comparison of different loss functions used in the classification or joint representation branch. $y$ is the target IoU between the predicted box and the ground truth, $p$ denotes the predicted classification score, $\alpha$ is a weighting factor, and $L_{BCE}$ denotes the binary cross-entropy loss. $f(\cdot)$ denotes the different localization quality estimation functions.

| Loss type | $y > 0$ (P) | $y = 0$ (N) |
|---|---|---|
| FL | $-\alpha\,\lvert y-p\rvert^{\gamma}\log(p)$ | $-(1-\alpha)\,p^{\gamma}\log(1-p)$ |
| QFL | $\lvert y-p\rvert^{\gamma}\,L_{BCE}$ | $-p^{\gamma}\log(1-p)$ |
| QFLv2 | $f(k_m)\,\lvert y-p\rvert^{\gamma}\,L_{BCE}$ | $-p^{\gamma}\log(1-p)$ |
| VFL | $y\,L_{BCE}$ | $-\alpha\,p^{\gamma}\log(1-p)$ |
| wBCE | $w^{+}\,L_{BCE}$ | $w^{-}\,L_{BCE}$ |
| Ours | $w^{+}\,f(nn)\,L_{BCE}$ | $w^{-}\,f(nn)\,L_{BCE}$ |

## E. More Experimental Results on LaSOT

In addition to the success and precision plots shown in the main body, we here provide the normalized precision plot over the LaSOT test set (Fan et al., 2019), which contains 280 video sequences. The normalized precision score is computed as the percentage of frames in which the normalized distance (relative to the target size) between the predicted and ground-truth target center locations is less than a threshold $D$. It is plotted over a range of thresholds $D \in [0, 0.5]$, and trackers are ranked by the area under this curve, shown in the legend of Figure 11. We compare with the state-of-the-art trackers Ocean (Zhang et al., 2020), DiMP (Bhat et al., 2019), DROL (Zhou et al., 2020), SiamFC++ (Xu et al., 2020), SiamGAT (Guo et al., 2021), etc. Our UAST outperforms previous state-of-the-art Siamese trackers; compared to the ResNet-50-based Ocean-online, DiMP-50, and SiamGAT, our approach achieves gains of 0.7%, 1.2%, and 2.5%, respectively.

Figure 11. Normalized precision plot on the LaSOT test set. The average normalized precision is shown in the legend.
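A small sketch of the normalized precision metric described above follows; the per-axis scaling by the target width and height is our own reading of "relative to the target size", and the function name is hypothetical.

```python
import numpy as np

def normalized_precision_auc(pred_centers, gt_centers, gt_sizes):
    # Fraction of frames whose size-normalized center error is below D,
    # integrated over D in [0, 0.5]; trackers are ranked by this area.
    err = np.linalg.norm((pred_centers - gt_centers) / gt_sizes, axis=1)
    thresholds = np.linspace(0.0, 0.5, 51)
    curve = np.array([(err < d).mean() for d in thresholds])
    return np.trapz(curve, thresholds) / 0.5  # normalize the area to [0, 1]

# toy usage over 100 frames with (x, y) centers and (w, h) target sizes
rng = np.random.default_rng(0)
gt_c = rng.uniform(100, 200, size=(100, 2))
score = normalized_precision_auc(gt_c + rng.normal(0, 5, (100, 2)),
                                 gt_c, np.full((100, 2), 80.0))
```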
## F. More Examples of Distributed Regression Representation

We demonstrate more examples of the distributed bounding boxes predicted by UAST. As shown in Figure 12, several cases exhibit boundary ambiguities, and our model can produce more reasonable bounding box coordinates. The predicted distributions are informative, since their shape reflects the level of certainty of the bounding boxes. The first three rows contain unclear boundaries, where the distributions are flat; the last row shows clear boundaries with sharp distributions, where the model is very confident in generating accurate bounding boxes. The overlays in Figure 12 report the per-example scores, ranging from jcr 91% with certainty 85% (L_C 87%, T_C 68%, R_C 91%, B_C 92%) for an ambiguous case up to jcr 95% with certainty 91% (L_C 90%, T_C 88%, R_C 91%, B_C 94%) for the clear-boundary case.

Figure 12. Examples of the distributed bounding box representation. The first three rows contain boundary ambiguities and uncertainties, where the learned distributions tend to be flat. In some cases, we even observe a distribution with two peaks; interestingly, these do correspond to ambiguous boundaries in the input image, for example the top boundary of the airplane, the left boundary of the cat, and the left boundary of the deer. The last row has extremely clear boundaries, so the learned distributions are relatively sharp, resulting in more reliable and accurate bounding box estimates. Predictions are marked yellow in the images, while ground-truth boxes are green.

## G. Qualitative Results

Representative qualitative results of the proposed UAST on the GOT-10k test set are shown in Figure 13, together with the results of two representative state-of-the-art trackers: the online-learning-based DiMP-50 and the anchor-free SiamFC++. Specifically, Figure 13 demonstrates that DiMP-50 and SiamFC++ may fail to track the targets under partial occlusion, fast motion, scale variation, and occlusion. In the third-row sequence, DiMP-50 and SiamFC++ drift from the moving animal at frame 38, while our UAST locates the target accurately with more reasonable localization confidence, thanks to our joint representation, which integrates classification and localization quality. In the fifth sequence, UAST quickly adapts to the large scale variations of the flying person.

Figure 13. Qualitative comparison of our UAST (red) with the representative trackers DiMP-50 (green) and SiamFC++ (blue). As observed from the visualization results, UAST estimates more precise bounding boxes under partial occlusion, deformation, scale changes, and fast movement. This comparison demonstrates that the proposed distributed regression formulation is more effective, as our method provides a clear interpretation of the boxes.