# Deep Attentive Tracking via Reciprocative Learning

Shi Pu¹, Yibing Song², Chao Ma³, Honggang Zhang¹, Ming-Hsuan Yang⁴

¹Beijing University of Posts and Telecommunications, Beijing, China — {pushi_519200, zhhg}@bupt.edu.cn
²Tencent AI Lab, Shenzhen, China — dynamicstevenson@gmail.com
³Shanghai Jiao Tong University, Shanghai, China — chaoma@sjtu.edu.cn
⁴University of California at Merced, Merced, U.S.A. — mhyang@ucmerced.edu

https://ybsong00.github.io/nips18_tracking/index

Honggang Zhang is the corresponding author. 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

## Abstract

Visual attention, a concept from cognitive neuroscience, allows human perception to focus on the most pertinent subset of sensory data. Recently, significant efforts have been made to exploit attention schemes to advance computer vision systems. In visual tracking, it is often challenging to track target objects undergoing large appearance changes. Attention maps facilitate visual tracking by selectively attending to temporally robust features. Existing tracking-by-detection approaches mainly use additional attention modules to generate feature weights, as their classifiers are not equipped with such mechanisms. In this paper, we propose a reciprocative learning algorithm that exploits visual attention for training deep classifiers. The proposed algorithm consists of feed-forward and backward operations that generate attention maps, which serve as regularization terms coupled with the original classification loss function for training. The deep classifier thereby learns to attend to the regions of target objects that are robust to appearance changes. Extensive experiments on large-scale benchmark datasets show that the proposed attentive tracking method performs favorably against state-of-the-art approaches.

## 1 Introduction

Recent years have witnessed growing interest in developing visual tracking methods for various vision applications. Visual attention plays an important role in tracking target objects in videos. For example, state-of-the-art trackers based on discriminative correlation filters (DCFs) [21, 11] regress input features into a Gaussian response map for target localization. They often apply empirical spatial weights to the input features to suppress the boundary effect caused by the Fourier transform. The spatial weights are generated by either a cosine [34] or a Gaussian [11] function. From the perspective of visual attention, we interpret these spatial weights as a specific type of attention map. To improve localization accuracy, the weights closer to the center of the input features are set larger. However, using such empirical attention maps limits tracking performance when target objects undergo large movements. Meanwhile, de-emphasizing non-central regions tends to suppress the target response, leading to inaccurate localization.
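As an illustration of these empirical spatial weights, the following minimal sketch generates a cosine (Hann) window and a Gaussian window and applies one of them to a feature map. The window size and the Gaussian bandwidth are illustrative choices, not values taken from any particular tracker.

```python
import numpy as np

def cosine_window(h, w):
    # Outer product of two 1-D Hann windows; weights peak at the patch center.
    return np.outer(np.hanning(h), np.hanning(w))

def gaussian_window(h, w, sigma=0.25):
    # Isotropic Gaussian centered on the patch; sigma is relative to the patch size.
    ys = np.linspace(-0.5, 0.5, h)[:, None]
    xs = np.linspace(-0.5, 0.5, w)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

# Either map is multiplied element-wise with the input feature map, so that
# features near the assumed target center receive larger weights.
feature_map = np.random.rand(64, 64)          # stand-in for one feature channel
weighted = feature_map * cosine_window(64, 64)
```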
On the other hand, two-stage tracking-by-detection approaches first draw a set of proposals and then classify each proposal as either the target or the background. Visual attention has great potential for facilitating the learning of discriminative classifiers. Existing deep attentive trackers [8, 27] mainly use additional attention modules to generate feature weights; in other words, attention is implemented as feature selection. Benefiting from end-to-end training, such attention schemes improve tracking accuracy by strengthening the discriminative power of the features. However, feature weights learned on single frames are unlikely to enable classifiers to concentrate on robust features over a long temporal span. Moreover, even slight inaccuracies in the feature weights exacerbate misclassification. This calls for an in-depth investigation of how to best exploit the visual attention of deep classifiers so that they can attend to target objects over time.

In this paper, we propose a reciprocative learning algorithm that exploits visual attention to advance the tracking-by-detection framework. Different from existing trackers that use additional attention modules to weight features, we directly train an attentive classifier. The training process consists of a forward and a backward step. In the forward step, we feed an input sample into a deep tracking-by-detection network and compute the corresponding classification score. In the backward step, we take the partial derivative of this classification score with respect to the input sample, propagating from the last fully-connected layer back to the first convolutional layer. Note that in this backward step we do not update any network parameters. Instead, we take the partial derivative at the input of the first layer as the attention map. Each pixel value of this attention map indicates how strongly the corresponding pixel of the input sample affects the classification score. We use this map as a regularization term added to the loss function during training, and the network parameters are updated following the conventional backpropagation scheme. As a result, the deep classifier learns to attend to the target regions and effectively suppresses background interference. In the test stage, the learned classifier directly predicts the classification score of each input sample for target localization. We validate the effectiveness of the proposed method on large-scale benchmark datasets, where it performs favorably against state-of-the-art approaches.

We summarize the main contributions of our work as follows:

- We propose a reciprocative learning algorithm to exploit visual attention within the tracking-by-detection framework.
- We use attention maps as regularization terms coupled with the classification loss to train deep classifiers, which in themselves learn to attend to temporally robust features.
- We conduct extensive experiments on benchmark datasets where the proposed tracker performs favorably against state-of-the-art approaches.

## 2 Related Work

Visual tracking has been widely surveyed in the literature [42]. In this section, we mainly discuss representative trackers and the related topic of visual attention.

**Visual tracking.** State-of-the-art visual tracking methods typically adopt either a one-stage regression framework or a two-stage classification framework. The representative one-stage regression framework is based on discriminative correlation filters (DCFs), which regress all circularly shifted versions of the input features onto soft labels generated by a Gaussian function. By computing the spatial correlation in the Fourier domain as an element-wise product, DCF trackers achieve fast tracking speeds and have received considerable attention.
Starting from the MOSSE tracker [4], extensions include kernelized correlation filters [20, 21], spatial regularization and multi-scale fusion [11, 13], CNN feature integration [34, 38, 57, 26], and end-to-end prediction [48, 3, 46, 43]. Different from the one-stage regression framework, the two-stage tracking-by-detection framework mainly consists of two steps: the first draws a sparse set of samples around the previously predicted location, and the second classifies each sample as the target or the background. Considerable efforts have been made to improve the tracking-by-detection framework, including online boosting [17, 1], P-N learning [25], structured SVMs [18, 36], CNN-SVM [22], random forests [56], multi-domain learning [35], adversarial learning [44], active tracking [32], and multiple object tracking [33]. In this work, we extend the two-stage tracking-by-detection framework by exploiting visual attention. Our deep classifier learns to attend to every discriminative region that separates target objects from the background.

**Visual attention.** Visual attention originates from cognitive neuroscience, where human perception focuses on the most pertinent subset of the sensory data. The visual attention scheme has been widely exploited in many computer vision applications, including image classification [40, 47, 23, 24], image captioning [52], and pose estimation [14]. In visual tracking, the spatial weights widely used by DCF trackers to suppress the boundary effect can be interpreted as one type of visual attention; examples include the cosine window map [4] and the Gaussian window map [11, 13]. Recently, a number of efforts [7, 8, 6, 27] have been made to exploit visual attention within deep models. These approaches emphasize attentive features and resort to additional attention modules to generate feature weights; the classifiers themselves are not assumed to be attentive. In this work, we propose a reciprocative learning process to learn an attentive classifier in the tracking-by-detection framework. We exploit attention maps as regularization terms coupled with the original classification loss for training classifiers. With reciprocative learning, our deep classifier in itself attends to temporally robust features, which improves tracking accuracy.

Figure 1: Overview of the proposed reciprocative learning algorithm. Given a training sample, we first compute its classification score in a forward operation. We then obtain attention maps in a backward operation by taking the partial derivative of the classification score with respect to this sample. We use these maps as a regularization term coupled with the classification loss to train the classifier. In the test stage, no attention maps are generated; the classifier directly predicts the target location.

## 3 Proposed Method

We propose a reciprocative learning scheme to activate the attentive ability of the classifier in the tracking-by-detection framework. Figure 1 shows an overview. Different from existing attention models [47, 23, 7, 8, 6, 27] that add modules to produce attention maps, we use the partial derivatives of the network outputs with respect to the input images as attention maps.
While some visualization methods [40, 39] use such attention maps to understand deep models, our method uses the attention maps as regularization terms in the training stage to help the classifier attend to target regions that are robust to appearance changes. During testing, we directly use the classification scores to locate target objects. In the following, we first introduce how to incorporate attention maps into the original classification loss function, and then illustrate how the attention map gradually regularizes the classifier through reciprocative learning.

### 3.1 Attention Exploitation

We first present how we exploit visual attention within the tracking-by-detection network. We denote by $I$ the input to the CNN tracking-by-detection network. The network outputs a vector of scores, where each element indicates how likely $I$ belongs to a predefined class $c$. Given a specific input sample $I_0$, we use the first-order Taylor expansion [40] at a point $z_0$ to approximate the score function $f_c(I)$ as follows:

$$f_c(I) \approx A_c^\top I + B. \tag{1}$$

The point $z_0$ belongs to the deleted $\epsilon$-neighborhood of $I_0$ ($\epsilon \to 0$). The approximation in Eq. 1 holds for any point in the $\epsilon$-neighborhood of $I_0$. Therefore, the derivatives of $f_c(I)$ at the points $z_0$ and $I_0$ are equal ($f'_c(z_0) = f'_c(I_0)$), as these two points are infinitely close. In Eq. 1, $A_c$ is the derivative of $f_c(I)$ with respect to the input $I$ evaluated at the sample $I_0$:

$$A_c = \left.\frac{\partial f_c(I)}{\partial I}\right|_{I=I_0}. \tag{2}$$

Eq. 1 indicates that the output score of class $c$ is affected by the element values of $A_c$. In other words, the values of $A_c$ indicate the importance of the corresponding pixels of $I_0$ in generating the class score. As such, we can interpret $A_c$ as an attention map. For another specific input sample $I_1$, we again use a Taylor expansion at a point $z_1$ in the deleted $\epsilon$-neighborhood of $I_1$, and the new approximation holds for any point in the $\epsilon$-neighborhood of $I_1$. Thus, the attention map $A_c$ is specific to each input image sample.

According to Eq. 2, we compute the partial derivative of the network output $f_c(I)$ with respect to the input $I$ at one specific sample $I_0$. This is achieved in two steps. First, we feed the input sample $I_0$ into the network and obtain the predicted score $f_c(I_0)$ in a forward propagation. Then, we take the partial derivative of $f_c(I)$ with respect to $I$ at $I = I_0$; by the chain rule, this derivative is computed through backward propagation. We take the output of the first layer during backward propagation as the attention map $A_c$. We only retain gradients with positive values, as they make clear positive contributions to the class score. Thus, the attention map $A_c$ is non-negative and reflects how the network attends to the input sample $I_0$. Note that in this backward propagation, the network parameters are fixed and not updated.
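A minimal PyTorch sketch of this attention-map computation, assuming a generic tracking-by-detection `model` that maps an image batch to class scores; the channel-summing at the end is our own convention for obtaining a single spatial map, not a detail specified above.

```python
import torch

def attention_map(model, sample, class_idx):
    # Forward: scalar score f_c(I0) of the chosen class for this sample.
    x = sample.clone().detach().requires_grad_(True)   # e.g. shape (1, 3, H, W)
    score = model(x)[0, class_idx]
    # Backward: partial derivative of the score w.r.t. the input (Eq. 2);
    # the network parameters are left untouched.
    grad = torch.autograd.grad(score, x)[0].detach()
    attn = grad.clamp(min=0)                           # keep only positive gradients
    # Collapse the color channels into a single map (one possible convention).
    return attn.sum(dim=1, keepdim=True)
```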
### 3.2 Attention Regularization

The tracking-by-detection framework defines the target object as the positive class and the background as the negative class to train a binary classifier. For each input sample $I_0$, we therefore obtain two attention maps: a positive attention map (denoted by $A_p$) and a negative attention map (denoted by $A_n$). For a positive training sample (labeled $y = 1$), we expect the values of $A_p$ on the target object to be large, whereas the values of $A_n$ on the target object should be small. The attention regularization term for a positive sample can be formulated as:

$$\mathcal{R}_{y=1} = \frac{\sigma_{A_p}}{\mu_{A_p}} + \frac{\mu_{A_n}}{\sigma_{A_n}}, \tag{3}$$

where $\mu$ and $\sigma$ denote the mean and standard deviation of an attention map. On the other hand, for a negative training sample (labeled $y = 0$), we formulate the corresponding regularization term as:

$$\mathcal{R}_{y=0} = \frac{\mu_{A_p}}{\sigma_{A_p}} + \frac{\sigma_{A_n}}{\mu_{A_n}}. \tag{4}$$

Using Eq. 3 and Eq. 4, we add the attention regularization terms to the original classification loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda \left[\, y \, \mathcal{R}_{y=1} + (1-y) \, \mathcal{R}_{y=0} \,\right], \tag{5}$$

where $\mathcal{R}_{y=1}$ and $\mathcal{R}_{y=0}$ denote the regularization terms of the positive and negative training examples, respectively. The scalar parameter λ balances the attention regularization terms against the cross-entropy loss $\mathcal{L}_{\mathrm{CE}}$.

Eq. 5 shows how attention maps contribute to training the deep classifier: in addition to the classification loss, we incorporate constraints from the attention maps. For positive samples, we aim to increase the attention around the target object in two respects. The first is to increase the mean but decrease the standard deviation of $A_p$, so that its values are large with small variance. The second is to decrease the mean but increase the standard deviation of $A_n$, so that its values are small with large variance. These two aspects reflect that the classifier learns to increase the true positive rate while decreasing the false negative rate. A similar intuition underlies Eq. 4, which decreases the false positive rate and increases the true negative rate. As a result, the regularization terms help increase classification accuracy by constraining the attention maps, which heavily influence the output class scores as shown in Eq. 1.

### 3.3 Reciprocative Learning

With the regularization terms incorporated into the loss function, the reciprocative learning algorithm is easy to implement using standard backward propagation and the chain rule. In each iteration of classifier training, we compute the attention maps of every input training example. These attention maps reflect the attention of the classifier at its current state. Ideally, the classifier should selectively pay more attention to target objects than to the background. As shown in Figure 2(b), the classifier trained without attention regularization tends to focus on a limited number of discriminative regions. When target objects undergo large appearance changes, those limited regions are unlikely to represent the target robustly throughout the video. With attention regularization, the classifier iteratively learns to attend to every region that differentiates the target from the background, and its attention gradually spreads over the whole target area. Figure 2(c) shows that the attention map covers only a subpart of the target region at the beginning of training (Iteration #50); by means of reciprocative learning, it gradually covers the whole target region.

Figure 2: Visualization of attention maps on the Car4 sequence [50] at training iterations #50, #100, #150, and #200: (a) training input, (b) without reciprocative learning, (c) with reciprocative learning. With the proposed reciprocative learning algorithm, the attention maps gradually cover the whole area of the target, so the classifier attends to every region that differentiates the target from the background.
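Putting Sections 3.1 and 3.2 together, the following is a minimal sketch of one training iteration under Eq. 5. The `model`/`optimizer` names, the batch averaging of the regularization terms, and the use of `create_graph=True` (so that the input-gradient attention maps remain differentiable with respect to the network parameters) are implementation assumptions, not details prescribed above.

```python
import torch
import torch.nn.functional as F

def reciprocative_step(model, optimizer, images, labels, lam=5.0):
    # labels: 1 for target (positive) samples, 0 for background (negative) samples.
    x = images.clone().detach().requires_grad_(True)
    scores = model(x)                                   # assumed shape (B, 2): [background, target]
    ce = F.cross_entropy(scores, labels)

    # Attention maps A_p and A_n: input gradients of the positive and negative
    # class scores; create_graph=True keeps them differentiable w.r.t. the weights.
    a_p = torch.autograd.grad(scores[:, 1].sum(), x, create_graph=True)[0].clamp(min=0)
    a_n = torch.autograd.grad(scores[:, 0].sum(), x, create_graph=True)[0].clamp(min=0)

    eps = 1e-8
    mu_p, sd_p = a_p.flatten(1).mean(dim=1), a_p.flatten(1).std(dim=1)
    mu_n, sd_n = a_n.flatten(1).mean(dim=1), a_n.flatten(1).std(dim=1)
    r_pos = sd_p / (mu_p + eps) + mu_n / (sd_n + eps)   # R_{y=1}, Eq. 3
    r_neg = mu_p / (sd_p + eps) + sd_n / (mu_n + eps)   # R_{y=0}, Eq. 4

    y = labels.float()
    loss = ce + lam * (y * r_pos + (1.0 - y) * r_neg).mean()  # Eq. 5, averaged over the batch

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```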
In the test stage, the attention regularization terms are not used; the classifier itself attends to the relevant regions of the input samples.

## 4 Tracking Process

In this section, we describe how the proposed method tracks a target through a video. The algorithm does not require offline training. We discuss three components: model initialization, online detection, and model update.

**Model initialization.** In the first frame, we randomly draw N1 samples around the initial target location. These samples are labeled as positive or negative according to whether their intersection-over-union (IoU) scores with the ground-truth annotation exceed 0.5. We use H1 iterations in the initialization step. For each sample in every iteration, we compute its loss using Eq. 5 and update the fully-connected layers accordingly.

**Online detection.** Given a frame of the video sequence, we first draw N2 samples around the target location predicted in the previous frame. We then feed each sample into our network, select the candidate with the highest classification score, and refine the target location using bounding box regression as in [16].

**Model update.** In each frame, we draw N2 samples around the predicted target location. These samples are labeled as positive or negative according to their IoU scores with the predicted bounding box. We then use these samples to update the fully-connected layers for H2 iterations every T frames.

We analyze how reciprocative learning contributes to the tracking-by-detection framework by visualizing the attention map, the confidence map, and the tracking results. The attention map shows how the network attends to the input image, and the scores in the confidence map indicate the probability of being the target object. We plot the predicted bounding boxes in red and the ground-truth annotations in green. Figure 3 compares the visualization results with and without the reciprocative learning scheme, using the state-of-the-art tracking-by-detection framework [35] as a baseline.

Figure 3: Visualization of the prediction process (a) without and (b) with reciprocative learning on the Girl sequence [50]. From the first to the third row: attention maps, confidence scores, and tracking results (in red); ground-truth annotations are in green. With the proposed reciprocative learning algorithm, the classifier pays more attention to temporally robust features and correctly differentiates the target from an occluding object with similar appearance in Frame #432.

We observe that at the beginning of the video sequence (Frame #2), the attention map, confidence map, and tracking results are almost identical with and without reciprocative learning. At this point the attention maps have not yet regularized the classifier well, as they have not fully identified the target regions. As tracking proceeds, however, the reciprocative learning scheme helps the attention map cover the whole target region and thus strengthens instance awareness. As a result, the network produces high confidence on the target region, which helps the classifier differentiate the target from the background during occlusion, even when the occluding objects are highly similar (both are human faces) in Frame #432 and Frame #467. In comparison, the baseline without reciprocative learning drifts in the presence of occlusion because its classifier does not attend to temporally robust features.
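For concreteness, below is a rough sketch of the online detection step described at the beginning of this section. The Gaussian sampling of candidate boxes, the crop-and-resize routine, and the 107×107 input size are illustrative assumptions, and the bounding box regression refinement of [16] is omitted.

```python
import numpy as np
import torch
import torch.nn.functional as F

def crop_and_resize(frame, box, out_size=107):
    # Crop a (cx, cy, w, h) box from an HxWx3 uint8 frame and resize it.
    # A simple placeholder; the actual tracker may pad and preprocess differently.
    H, W, _ = frame.shape
    cx, cy, w, h = box
    x1, y1 = int(max(0, cx - w / 2)), int(max(0, cy - h / 2))
    x2, y2 = int(min(W, cx + w / 2)), int(min(H, cy + h / 2))
    patch = torch.from_numpy(frame[y1:y2, x1:x2].copy()).float().permute(2, 0, 1)
    return F.interpolate(patch[None], size=(out_size, out_size),
                         mode='bilinear', align_corners=False)[0]

def detect(model, frame, prev_box, n_samples=256, pos_scale=0.1):
    # Draw candidates around the previous box, score each with the classifier,
    # and return the candidate with the highest target-class score.
    cx, cy, w, h = prev_box
    dx = np.random.randn(n_samples) * pos_scale * w
    dy = np.random.randn(n_samples) * pos_scale * h
    ds = 1.05 ** np.random.randn(n_samples)             # mild scale perturbation
    boxes = np.stack([cx + dx, cy + dy, w * ds, h * ds], axis=1)

    patches = torch.stack([crop_and_resize(frame, b) for b in boxes])
    with torch.no_grad():
        scores = model(patches)[:, 1]                   # target-class scores
    best = int(scores.argmax())
    # The selected box would then be refined by bounding box regression as in [16].
    return boxes[best], float(scores[best])
```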
## 5 Experiments

In this section, we first present the implementation details. We then conduct ablation studies from two perspectives: 1) we investigate how the regularization terms contribute to learning discriminative classifiers; 2) to demonstrate the effectiveness of reciprocative learning, we compare with an alternative implementation that uses an additional attention module to generate feature weights. Finally, we evaluate our method on the standard benchmarks, i.e., OTB-2013 [50], OTB-2015 [51], and VOT-2016 [28]. We present more experimental results in the supplementary material and will make the source code publicly available.

**Implementation details.** We use the same network architecture as in [35] to develop our baseline tracker: three fixed convolutional layers as the feature extractor and three learnable fully-connected layers as the classifier. The convolutional layers share the weights of the VGG-M model [41]. The fully-connected layers are randomly initialized and incrementally updated during tracking; our goal is to update their parameters to learn an attentive classifier. In the first frame, the number of samples N1 is set to 5500. We train the randomly initialized classifier for H1 = 50 iterations with a learning rate of 2e-4. In each iteration, we feed one mini-batch containing 32 positive and 32 negative samples into the network. In the online model update step, we fine-tune the classifier for H2 = 15 iterations every T = 10 frames with a learning rate of 3e-4. The network solver is stochastic gradient descent (SGD). During online detection, the number of proposals N2 is set to 256. Our implementation is based on PyTorch [37] and runs on a PC with a 3.4 GHz Intel i7 CPU and a GeForce GTX 1080 GPU. The average tracking speed is 1 FPS.

Table 1: Parameter sensitivity analysis of λ on the OTB-2013 dataset. Red: best. Blue: second best.

| λ  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|----|---|---|---|---|---|---|---|---|---|
| DP | 0.911 | 0.917 | 0.936 | 0.941 | 0.940 | 0.944 | 0.938 | 0.929 | 0.911 |
| OS | 0.671 | 0.688 | 0.694 | 0.704 | 0.704 | 0.704 | 0.701 | 0.695 | 0.684 |

Table 2: Evaluation of baseline improvements using attentive features and an attentive classifier on the OTB-2013 dataset. Red: best. Blue: second best.

|    | Baseline | Baseline + Attentive Features | Baseline + Attentive Classifier (Ours) |
|----|----------|-------------------------------|----------------------------------------|
| DP | 0.911 | 0.932 | 0.944 |
| OS | 0.671 | 0.686 | 0.704 |

**Evaluation metrics.** We follow the standard benchmark protocols on the OTB-2013 and OTB-2015 datasets and report the distance precision (DP) and overlap success (OS) rates under the one-pass evaluation (OPE). The distance precision rate measures the center-pixel distance between the predicted locations and the ground-truth annotations; we report the DP rate at a threshold of 20 pixels. The overlap success rate measures the overlap between the predicted bounding boxes and the ground-truth annotations; we report the area under the curve of the OS rate plots. In addition, we compute the average center distance error in pixels, i.e., the center location error (CLE), and the overlap success rate at an IoU threshold of 0.5 (OS0.5). On the VOT-2016 dataset, we use three evaluation metrics: expected average overlap (EAO), accuracy rank (Ar), and robustness rank (Rr).
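For reference, a small sketch of how the DP and OS scores can be computed from per-frame center errors and IoU overlaps; the 21-point threshold grid is an assumption, and the exact grid used by the benchmark toolkit may differ.

```python
import numpy as np

def distance_precision(center_errors, threshold=20.0):
    # Fraction of frames whose predicted center lies within `threshold` pixels
    # of the ground-truth center (the DP rate reported at 20 pixels).
    errors = np.asarray(center_errors, dtype=float)
    return float((errors <= threshold).mean())

def overlap_success_auc(ious, thresholds=np.linspace(0, 1, 21)):
    # For each IoU threshold, compute the fraction of frames whose overlap
    # exceeds it; the reported OS score is the area under this success curve.
    ious = np.asarray(ious, dtype=float)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```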
### 5.1 Ablation Studies

We propose the reciprocative learning algorithm to extend the tracking-by-detection framework. In this section, we investigate how reciprocative learning contributes to learning discriminative classifiers. Recall that we exploit attention maps as a regularization term in the loss function. The scalar parameter λ in Eq. 5 balances the classification loss and the attention regularization term. We vary λ from 0 to 8 in steps of 1 and evaluate the tracking performance on the OTB-2013 dataset. Table 1 shows that reciprocative learning consistently improves the baseline tracker (i.e., λ = 0) over a wide range of λ values, which affirms the effectiveness of the proposed algorithm. Our method generally achieves its best results when λ is between 3 and 5; in the following experiments, we fix λ = 5.

Note that using attention maps as feature weights can also improve the tracking-by-detection framework, as in [7, 8, 6, 27]. For a fair comparison, we implement this strategy on top of our baseline by adding an attention module that generates feature weights; the classifier is then learned on the attentive features with the original classification loss. Table 2 shows that using attentive features does improve the baseline tracking performance. However, the baseline with reciprocative learning achieves much larger performance gains than the attentive-feature scheme: 3.2% vs. 2.1% in DP, and 3.3% vs. 1.5% in OS. This is because our deep classifier itself learns to attend to the target object through attention regularization, whereas attentive features learned on single frames are unlikely to be invariant to large appearance changes over a long temporal span.

### 5.2 Overall Performance

**OTB-2013 dataset.** We compare our method with 57 trackers: 29 trackers from the OTB benchmarks [50, 51] and 28 state-of-the-art trackers, including ACFN [6], BACF [26], ADNet [53], CCOT [13], CREST [43], MCPF [57], CSR-DCF [31], SRDCFD [12], SINT [45], MDNet [35], HDT [38], Staple [2], GOTURN [19], KCF [21], TGPR [15], CNT [55], DSST [9], MEEM [54], RPT [30], SAMF [29], DLSSVM [36], BIT [5], SO-DLT [49], SCT [7], CNN-SVM [22], FCNT [48], HCF [34], and HART [27]. For clarity of presentation, we only display the top 10 trackers. Figure 4 shows that our method generally performs well against state-of-the-art approaches on the OTB-2013 dataset under both the distance precision and overlap success metrics. In particular, our method improves by a large margin over the baseline MDNet, which, for a fair comparison, is not offline-trained on auxiliary sequences.

Figure 4: Distance precision and overlap success plots using the one-pass evaluation on the OTB-2013 and OTB-2015 datasets. In the legends, our method ranks first on OTB-2013 in both distance precision (0.944) and overlap success (0.704), and second on OTB-2015 (0.895 and 0.668) behind CCOT (0.903 and 0.673).
Table 3: Comparisons with state-of-the-art trackers on the OTB-2013 and OTB-2015 datasets in terms of center location error (CLE, in pixels; lower is better) and overlap success rate at a threshold of 0.5 IoU (OS0.5; higher is better). Our tracker performs favorably against existing trackers. Red: best. Blue: second best.

| Metric | Dataset | Ours | CCOT | CREST | MDNet | MCPF | ADNet | BACF | SINT | SRDCFD | ACFN |
|--------|----------|------|------|-------|-------|------|-------|------|------|--------|------|
| CLE | OTB-2013 | 7.85 | 15.58 | 10.22 | 13.05 | 11.19 | 13.79 | 26.20 | 11.85 | 29.51 | 18.69 |
| CLE | OTB-2015 | 12.42 | 13.99 | 21.19 | 16.87 | 20.86 | 14.65 | 28.10 | 25.75 | 31.52 | 25.14 |
| OS0.5 | OTB-2013 | 0.913 | 0.837 | 0.860 | 0.849 | 0.858 | 0.836 | 0.840 | 0.816 | 0.814 | 0.750 |
| OS0.5 | OTB-2015 | 0.848 | 0.823 | 0.776 | 0.806 | 0.780 | 0.802 | 0.776 | 0.719 | 0.766 | 0.686 |

Table 3 reports the center location error and OS0.5 results. The favorable results against state-of-the-art approaches demonstrate that our method significantly reduces the average center distance error and increases the tracking success rate.

**OTB-2015 dataset.** We compare our method with the aforementioned trackers on the OTB-2015 dataset. Figure 4 shows that our tracker overall performs well against the state of the art. While the top-performing tracker CCOT achieves higher distance precision, Table 3 shows that our tracker has a smaller center location error than CCOT over all benchmark sequences. The overall favorable performance of our tracker can be attributed to the proposed reciprocative learning algorithm, which strengthens the discriminative power of the classifier; the compared methods do not explicitly exploit visual attention. Our tracker does not perform as well as CCOT in overlap success rate because it randomly draws a sparse set of samples for scale estimation, whereas CCOT crops samples in a continuous space. We will explore this idea in future work.

**VOT-2016 dataset.** We evaluate our method on the VOT-2016 dataset in comparison with state-of-the-art trackers including Staple [2], MDNet [35], CCOT [13], EBT [58], DeepSRDCF [10], and SiamFC [3]. Table 4 shows that CCOT performs best under the EAO metric; our tracker performs comparably with CCOT and much better than the other trackers. The VOT-2016 report considers trackers whose EAO exceeds 0.251 to be state-of-the-art, so all the compared trackers, including ours, qualify as state-of-the-art.

Table 4: Comparisons with state-of-the-art trackers on the VOT-2016 dataset in terms of expected average overlap (EAO), accuracy rank (Ar), and robustness rank (Rr). Red: best. Blue: second best.

| Trackers | Ours | CCOT | Staple | MDNet | EBT | DeepSRDCF | SiamFC |
|----------|------|------|--------|-------|-----|-----------|--------|
| EAO | 0.320 | 0.331 | 0.295 | 0.257 | 0.291 | 0.276 | 0.277 |
| Ar | 1.47 | 1.98 | 1.87 | 1.72 | 3.62 | 2.12 | 1.30 |
| Rr | 2.10 | 1.95 | 3.23 | 2.80 | 2.13 | 2.82 | 3.17 |

## 6 Concluding Remarks

In this paper, we propose a reciprocative learning scheme to exploit visual attention within the tracking-by-detection framework. For each input sample, we first compute the classification score in a forward propagation and take the partial derivatives with respect to the sample in a backward propagation as attention maps. We then use the attention maps as a regularization term coupled with the original classification loss function for training discriminative classifiers. Compared with existing attention models that add modules to generate feature weights, the proposed reciprocative learning algorithm uses attention maps to regularize classifier learning, so the classifier learns to attend to features that remain robust over a long temporal span. In the test stage, no attention maps are generated; the classifier directly classifies each input sample.
Extensive evaluations on benchmark datasets demonstrate that our method performs favorably against state-of-the-art approaches.

## Acknowledgements

This work is supported in part by the Beijing Municipal Science and Technology Commission project under Grant No. Z181100001918005, the Fundamental Research Funds for the Central Universities (2017RC08), NSF CAREER Grant No. 1149783, and gifts from NVIDIA. Shi Pu is supported by a scholarship from the China Scholarship Council.

## References

[1] B. Babenko, M.-H. Yang, and S. Belongie. Robust object tracking with online multiple instance learning. IEEE PAMI, 2011.
[2] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
[3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional Siamese networks for object tracking. In ECCVW, 2016.
[4] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, 2010.
[5] B. Cai, X. Xu, X. Xing, K. Jia, J. Miao, and D. Tao. BIT: Biologically inspired tracker. IEEE TIP, 2016.
[6] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi. Attentional correlation filter network for adaptive visual tracking. In CVPR, 2017.
[7] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and J. Y. Choi. Visual tracking using attention-modulated disintegration and integration. In CVPR, 2016.
[8] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu. Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism. In CVPR, 2017.
[9] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[10] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In ICCVW, 2015.
[11] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, 2015.
[12] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. In CVPR, 2016.
[13] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, 2016.
[14] W. Du, Y. Wang, and Y. Qiao. RPAN: An end-to-end recurrent pose-attention network for action recognition in videos. In CVPR, 2017.
[15] J. Gao, H. Ling, W. Hu, and J. Xing. Transfer learning based visual tracking with Gaussian processes regression. In ECCV, 2014.
[16] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[17] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In BMVC, 2006.
[18] S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr. Struck: Structured output tracking with kernels. IEEE PAMI, 2016.
[19] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
[20] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In ECCV, 2012.
[21] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE PAMI, 2015.
[22] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In ICML, 2015.
[23] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
[24] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
[25] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE PAMI, 2012.
[26] H. Kiani Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. In CVPR, 2017.
[27] A. Kosiorek, A. Bewley, and I. Posner. Hierarchical attentive recurrent tracking. In NIPS, 2017.
[28] M. Kristan et al. The visual object tracking VOT2016 challenge results. In ECCVW, 2016.
[29] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCVW, 2014.
[30] Y. Li, J. Zhu, and S. C. Hoi. Reliable patch trackers: Robust visual tracking by exploiting reliable patches. In CVPR, 2015.
[31] A. Lukezic, T. Vojír, L. C. Zajc, J. Matas, and M. Kristan. Discriminative correlation filter with channel and spatial reliability. In CVPR, 2017.
[32] W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y. Wang. End-to-end active object tracking via reinforcement learning. In ICML, 2018.
[33] W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, X. Zhao, and T.-K. Kim. Multiple object tracking: A literature review. arXiv preprint arXiv:1409.7618, 2014.
[34] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In ICCV, 2015.
[35] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In CVPR, 2016.
[36] J. Ning, J. Yang, S. Jiang, L. Zhang, and M.-H. Yang. Object tracking via dual linear structured SVM and explicit feature map. In CVPR, 2016.
[37] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[38] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In CVPR, 2016.
[39] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
[40] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
[41] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[42] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE PAMI, 2014.
[43] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. Lau, and M.-H. Yang. CREST: Convolutional residual learning for visual tracking. In ICCV, 2017.
[44] Y. Song, C. Ma, X. Wu, L. Gong, L. Bao, W. Zuo, C. Shen, R. W. Lau, and M.-H. Yang. VITAL: Visual tracking via adversarial learning. In CVPR, 2018.
[45] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. In CVPR, 2016.
[46] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr. End-to-end representation learning for correlation filter based tracking. In CVPR, 2017.
[47] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.
[48] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, 2015.
[49] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung. Transferring rich feature hierarchies for robust visual tracking. arXiv preprint arXiv:1501.04587, 2015.
[50] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, 2013.
[51] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. IEEE PAMI, 2015.
[52] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[53] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi. Action-decision networks for visual tracking with deep reinforcement learning. In CVPR, 2017.
[54] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.
[55] K. Zhang, Q. Liu, Y. Wu, and M.-H. Yang. Robust visual tracking via convolutional networks without training. IEEE TIP, 2016.
[56] L. Zhang, J. Varadarajan, P. N. Suganthan, N. Ahuja, and P. Moulin. Robust visual tracking using oblique random forests. In CVPR, 2017.
[57] T. Zhang, C. Xu, and M.-H. Yang. Multi-task correlation particle filter for robust object tracking. In CVPR, 2017.
[58] G. Zhu, F. Porikli, and H. Li. Beyond local search: Tracking objects everywhere with instance-specific proposals. In CVPR, 2016.