# Visual Tracking with Reliable Memories

Shu Wang1, Shaoting Zhang2, Wei Liu3 and Dimitris N. Metaxas1
1CBIM Center, Rutgers University, Piscataway, NJ, USA
2Department of Computer Science, UNC Charlotte, Charlotte, NC, USA
3Didi Research, Beijing, China
1{sw498, dnm}@cs.rutgers.edu, 2szhang16@uncc.edu, 3wliu@ee.columbia.edu

In this paper, we propose a novel visual tracking framework that intelligently discovers reliable patterns from a wide range of video frames to resist drift error in long-term tracking tasks. First, we design a Discrete Fourier Transform (DFT) based tracker that is able to exploit a large number of tracked samples while still ensuring real-time performance. Second, we propose a clustering method with temporal constraints to explore and memorize consistent patterns from previous frames, named *reliable memories*. By virtue of this method, our tracker can utilize uncontaminated information to alleviate drifting issues. Experimental results show that our tracker performs favorably against other state-of-the-art methods on benchmark datasets. Furthermore, it is significantly more competent in handling drift, and is able to robustly track challenging long videos of over 4000 frames, while most other trackers lose track in early frames.

## 1 Introduction

Visual tracking is one of the fundamental and challenging problems in computer vision and artificial intelligence. Though much progress has been achieved in recent years [Yilmaz et al., 2006; Wu et al., 2013], unsolved issues remain due to the complexity of various factors, such as illumination and angle changes, cluttered backgrounds, shape deformation, and occlusion.

Extensive studies on visual tracking employ a tracking-by-detection framework and achieve promising results by extending existing machine learning methods (usually discriminative) in an online learning manner [Avidan, 2007; Grabner et al., 2008]. To adaptively model various appearance changes, these methods deal with a large number of samples1 at both the detection and updating stages. However, all of them face the same dilemma: while more samples grant better accuracy and adaptiveness, they also come with higher computational cost and a greater risk of drifting. In addition to discriminative methods, [Ross et al., 2008; Mei and Ling, 2011; Wang and Lu, 2014] utilize generative models with a fixed learning rate to account for target appearance changes. The learning rate is essentially a trade-off between adaptiveness and stability. However, even with a very small rate, the influence of former samples on these models still drops exponentially across frames, and drift error may still accumulate.

In order to alleviate drift error, [Babenko et al., 2011; Hare et al., 2011; Zhang et al., 2014b] are designed to exploit hidden structured information around the target region. Other methods [Collins and Liu, 2003; Avidan, 2007; Kwon and Lee, 2010] try to avoid drifting by making the current model a combination of the labeled samples in the first frame and the samples learned during tracking. However, only a limited number of samples (e.g., those from the first frame) can be regarded as highly confident, which in turn restricts the robustness of these methods in long-term challenging tasks.

1Here *samples* refers to positive (and negative) target patches for trackers based on generative (or discriminative) models.
Recently, several methods [Bolme et al., 2010; Danelljan et al., 2014b; Henriques et al., 2015] employ the Discrete Fourier Transform (DFT) to perform extremely fast detection, achieving high accuracy at very low computational cost. However, as with other generative methods, the memory length of their models is limited by a fixed forgetting rate, and therefore they still suffer from accumulated drift error in long-term tasks.

A very important observation is that, when the tracked target moves smoothly, e.g., without severe occlusion or out-of-plane rotations, its appearances across frames share high similarity in the feature space (e.g., edge features). Contrarily, when it undergoes drastic movements such as in/out-of-plane rotations or occlusions, its appearances may not be that similar to previous ones. Therefore, if we impose a temporal constraint on clustering these samples, such that only temporally adjacent ones can be grouped together, the big clusters with large intra-cluster correlation indicate the periods when the target experiences small appearance changes. We take human memory as an analogy for these clusters, using *reliable memories* to denote large clusters that have been consistently perceived for a long time. In this context, earlier memories supported by more samples have a higher probability of being reliable than more recent ones with less support, especially as drift error accumulates across frames. Thus, a tracker may recover from drift error by preferring candidates that share high correlation with earlier memories.

Based on these motivations, we propose a novel tracking framework, which efficiently explores self-correlated appearance clusters across frames, and then preserves reliable memories for long-term robust visual tracking. First, we design a DFT-based visual tracker that is capable of retrieving good memories from a vast number of tracked samples for accurate detection, while still ensuring real-time performance.

Figure 1: This figure illustrates the basic philosophy of our method. Here we use Snake (the video game) as an analogy for learning-rate based visual trackers (tracker1 and tracker2): in order to track the target on the ideal path, they continuously take in new samples and forget old ones due to limited memory length. Contrarily, our tracker discovers and preserves multiple temporally constrained clusters as memories, covering a much wider range of the whole sequence. As shown above, tracker1, tracker2 and our tracker depart from the ideal path at times t1 and t2 owing to drastic target appearance changes. After that, all three trackers absorb a certain amount of drifted samples. With only a limited length of memory, tracker2 can hardly recover from drift error even if a familiar target appearance shows up at t3. Similarly, tracker1 deviates from the ideal path for a long time and is degraded by drifted samples from t1 to t3. Even if it happens to be close to the ideal path at t3 by chance, without keeping memory of similar samples from long before, it still drifts from the ideal path with high probability. On the contrary, when a similar target appearance occurs at t3, our tracker corrects its tracking result with consistent and reliable memories, and recovers from drift error.
Second, we propose a novel clustering method with temporal constraints to discover distinct and reliable memories from previous frames that help our tracker resist drift error. This method harvests the inherent correlation of the streaming data, and is guaranteed to converge quickly2 thanks to a careful design built upon the Integral Image. To the best of our knowledge, our temporally constrained clustering is novel in vision streaming-data analysis, and its fast convergence and promising performance show great potential for online streaming problems. In particular, it is very competent at discovering clusters (i.e., reliable memories) consisting of uncontaminated sequential samples tracked before, and it grants our tracker a remarkable ability to resist drift error. Experimental results show that our tracker is considerably competent in handling drift error and performs favorably against other state-of-the-art methods on benchmark datasets. Further, it can robustly track challenging long videos with over 4000 frames, while most of the others lose track in early frames.

2Its computational complexity is $O(n \log n)$, which costs less than 30 ms for $n = 1000$ frames.

## 2 Circulant Structure based Visual Tracking

Recent works [Bolme et al., 2010; Henriques et al., 2012; Danelljan et al., 2014b; Henriques et al., 2015] achieve state-of-the-art tracking accuracy at very low computational cost by exploiting the inherent relationship between the DFT and the circulant structure of dense sampling on the target region. In this section, we briefly introduce these methods, which are highly related to our work.

Suppose $x \in \mathbb{R}^L$ is the vector of an image patch of size $M \times N$, centered at the target ($L = MN$), and $x_l$ denotes a 2D circular shift of $x$ by $m \times n$ ($l$ is an index over all $MN$ possible shifts, $1 \le l \le L$). Similarly, $y \in \mathbb{R}^L$ is the vector of a designed response map of size $M \times N$, a Gaussian pulse centered at the target. $\kappa(x, x') = \langle \varphi(x), \varphi(x') \rangle$ is a positive definite kernel function defined by the mapping $\varphi(x): \mathbb{R}^L \to \mathbb{R}^D$. We aim to find a linear classifier $f(x_l) = \omega^T \varphi(x_l) + b$ that minimizes the Regularized Least Squares (RLS) cost function:

$$\min_{\omega} \varepsilon(\omega) = \min_{\omega} \sum_{l=1}^{L} \|y_l - f(x_l)\|^2 + \lambda \|f\|_{\kappa}^2. \qquad (1)$$

The first term is an empirical risk that minimizes the difference between the designed Gaussian response $y$ and the mapping $x \to f_L(x) \in \mathbb{R}^L$, where $f_l(x) = f(x_l)$. The second term $\|f\|_{\kappa}$ is a regularization term; it is denoted $\|f\|_{\kappa}$ since $f$ lies in the Reproducing Kernel Hilbert Space induced by $\kappa$. By the Representer Theorem [Schölkopf et al., 2001], the cost $\varepsilon(\omega)$ can be minimized by a linear combination of the inputs: $\hat{\omega} = \sum_l \alpha_l \varphi(x_l)$. By defining the kernel matrix $K \in \mathbb{R}^{L \times L}$ with $K(l, l') = \kappa(x_l, x_{l'})$, a much simpler form of Eq. 1 can be derived:

$$\min_{\alpha} F(\alpha) = \min_{\alpha} (y - K\alpha)^T (y - K\alpha) + \lambda \alpha^T K \alpha. \qquad (2)$$

This function is convex and differentiable, and has the closed-form minimizer $\hat{\alpha} = (K + \lambda I)^{-1} y$. As proved in [Henriques et al., 2012], if the kernel $\kappa$ is unitarily invariant, its kernel matrix $K$ is a circulant matrix, i.e., $K = C(k)$, where the vector $k \in \mathbb{R}^L$ has entries $k_i = \kappa(x, P^i x)$. Here $P^i$ is a permutation matrix that shifts vectors by $i$ element(s), and $C(k)$ is the circulant matrix formed from $k$ by concatenating all $L$ possible cyclic shifts of $k$. Then $\hat{\alpha}$ can be obtained without inverting $(K + \lambda I)$ by:

$$\hat{\alpha} = \mathcal{F}^{-1}\!\left( \frac{\mathcal{F}(y)}{\mathcal{F}(k) + \lambda \mathbf{1}} \right), \qquad (3)$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the DFT and its inverse, and $\mathbf{1}$ is an $L \times 1$ vector with all entries equal to 1. The division in Eq. 3 is in the Fourier domain, and is thus performed element-wise.
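As a concrete illustration of Eq. 3, the following is a minimal NumPy sketch (not the authors' code) of the closed-form training step with an RBF kernel evaluated over all cyclic shifts at once; the function names, the bandwidth `sigma`, and the regularizer `lam` are illustrative choices, not values from the paper.

```python
import numpy as np

def gaussian_shift_kernel(x, xp, sigma=0.2):
    """Kernel correlation: entry i approximates kappa(x, P^i xp) for every
    2D cyclic shift i, computed at once via the convolution theorem."""
    # Cross-correlation of x with all cyclic shifts of xp, via the DFT.
    c = np.real(np.fft.ifft2(np.fft.fft2(x) * np.conj(np.fft.fft2(xp))))
    # Squared distances ||x - P^i xp||^2 = ||x||^2 + ||xp||^2 - 2 c_i,
    # mapped through an RBF kernel (one common choice in this family).
    d = np.maximum(np.sum(x**2) + np.sum(xp**2) - 2.0 * c, 0.0)
    return np.exp(-d / (sigma**2 * x.size))

def train_alpha_f(x, y, lam=1e-4):
    """Closed-form RLS minimizer of Eq. 3, kept in the Fourier domain:
    F(alpha_hat) = F(y) / (F(k) + lambda)."""
    k = gaussian_shift_kernel(x, x)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)
```

Keeping the coefficients in the Fourier domain avoids one inverse transform per frame, which is exactly why the detection rule in the next section operates on $\hat{A} = \mathcal{F}(\hat{\alpha})$ directly.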
In practice, there is no need to recover $\hat{\alpha}$ from $\hat{A} = \mathcal{F}(\hat{\alpha})$, since fast detection can be performed on a given image patch $z$ by $\hat{y} = \mathcal{F}^{-1}(\mathcal{F}(\bar{k}) \odot \hat{A})$, where $\bar{k} \in \mathbb{R}^L$ with $\bar{k}_l = \kappa(z, \hat{x}_l)$, and $\hat{x}$ is the learned target appearance. The pulse peak in $\hat{y}$ indicates the target translation in the input image $z$. Detailed derivations can be found in [Gray, 2005; Rifkin et al., 2003; Henriques et al., 2012].

Though the recent methods MOSSE [Bolme et al., 2010], CSK [Henriques et al., 2012] and ACT [Danelljan et al., 2014b] use different configurations of kernel functions and features (e.g., a dot-product kernel leads to MOSSE, and an RBF kernel leads to the latter two), all of them employ a simple linear combination to learn the target appearance model $\{\hat{x}^p, \hat{A}^p\}$ at the current frame $p$:

$$\hat{Q}^p = (1 - \gamma) \hat{Q}^{p-1} + \gamma Q^p, \quad Q = \{x, A, A_{N,D}\}. \qquad (4)$$

While CSK updates its classifier coefficients $\hat{A}^p$ by Eq. 4 directly, MOSSE and ACT update the numerator $\hat{A}^p_N$ and denominator $\hat{A}^p_D$ of the coefficients $\hat{A}^p$ separately for stability. The learning rate $\gamma$ is a trade-off parameter between long memory and model adaptiveness. Expanding Eq. 4, we obtain:

$$\hat{Q}^p = \sum_{j=1}^{p} \gamma (1 - \gamma)^{p-j} Q^j, \quad Q = \{x, A, A_{N,D}\}. \qquad (5)$$

This shows that all three methods have an exponentially decaying pattern of memory: though the learning rate $\gamma$ is usually small, e.g., $\gamma = 0.1$, the impact of a sample $\{x^j, A^j\}$ at a certain frame $j$ is negligible after 100 frames ($\gamma(1-\gamma)^{100} \approx 3 \times 10^{-6}$). In other words, these learning-rate based trackers are unable to have recourse to samples accurately tracked long before to help resist accumulated drift error.

## 3 Proposed Method

Aside from the convolution-based visual trackers mentioned above, many other trackers [Jepson et al., 2003; Nummiaro et al., 2003; Ross et al., 2008; Babenko et al., 2011] also update their models $\hat{Q}$ in a similar form, $\hat{Q}^p = (1 - \gamma)\hat{Q}^{p-1} + \gamma Q^p$ with a learning-rate parameter $\gamma \in (0, 1]$, and suffer from the same drifting problem. We observe that smooth movements usually offer consistent appearance cues, which can be modeled as reliable memories that recover the tracker from drifting issues caused by drastic appearance changes (illustrated in Fig. 1). In this section, we first introduce our novel framework, which is capable of handling a vast number of samples while still ensuring fast detection. We then elaborate the details of intelligently arranging past samples into distinct and reliable clusters that grant our tracker resistance to drift error.

### 3.1 The Circulant Tracker over Vast Samples

Given a new positive sample $x^p$ at frame $p$, we aim to build an adaptive model $\{\hat{x}^p, \hat{A}^p\}$ for fast detection in the coming frame $p + 1$ on a sample image $z$ by

$$y^{p+1} = \mathcal{F}^{-1}(\hat{A}^p \odot \mathcal{F}(\bar{k}^p)), \qquad (6)$$

where $y^{p+1}$ is the response map that shows the estimated translation of the target position, and $\bar{k}^p \in \mathbb{R}^L$ has entries $\bar{k}^p_l := \kappa(z, \hat{x}^p_l)$. As we have advocated, this model $\{\hat{x}^p, \hat{A}^p\}$ should be built upon vast samples for robustness and adaptiveness. Thus, $\hat{x}^p$ should have the form:

$$\hat{x}^p = (1 - \gamma) \sum_{j=1}^{p-1} \beta_j x^j + \gamma x^p, \quad \gamma \in (0, 1]. \qquad (7)$$

As shown, the adaptively learned appearance $\hat{x}^p$ is a combination of the past $p$ samples, with a certain proportion $\gamma$ concentrated on $x^p$. The coefficients $\{\beta_j\}_{j=1}^{p-1}$ represent the correlation between the current estimated appearance $\hat{x}^p$ and the past appearances $\{x^j\}_{j=1}^{p-1}$. A proper choice of $\{\beta_j\}_{j=1}^{p-1}$ should make the model: 1) adaptive to new appearance changes, and 2) consistent with past appearances to avoid the risk of drifting. In this paper, we argue that a set $\{\beta_j\}_{j=1}^{p-1}$ with preference for previously acquired reliable memories provides our tracker with considerable robustness against drift error.
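To make the detection rule of Eq. 6 and the memory-decay argument of Eq. 5 concrete, here is a short sketch continuing the hypothetical NumPy example above (reusing `gaussian_shift_kernel`); the names and the sample values are illustrative, not from the paper.

```python
import numpy as np

def detect(z, x_hat, alpha_f, kernel):
    """Fast detection in the spirit of Eq. 6: response map over all
    translations of patch z, given the learned appearance x_hat and the
    Fourier-domain coefficients alpha_f. The peak gives the translation."""
    k_bar = kernel(z, x_hat)
    return np.real(np.fft.ifft2(alpha_f * np.fft.fft2(k_bar)))

def sample_weight(p, j, gamma=0.1):
    """Effective weight gamma*(1-gamma)^(p-j) that frame j retains in the
    model at frame p under the linear update of Eq. 4 (expanded in Eq. 5)."""
    return gamma * (1.0 - gamma) ** (p - j)

print(sample_weight(200, 200))  # current frame: 0.1
print(sample_weight(200, 100))  # 100 frames ago: ~2.7e-06, forgotten
```

The two printed weights illustrate the dilemma the paper targets: with a fixed learning rate, a frame tracked perfectly 100 frames ago contributes essentially nothing to the current model.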
We discuss how to find these reliable memories in Sec. 3.2, and their connection to $\{\beta_j\}_{j=1}^{p-1}$ is introduced in Sec. 3.3. For now, we focus on finding a set of classifier coefficients that fit both the learned appearance $\hat{x}^p$ (for consistency) and the current appearance $x^p$ (for adaptiveness). Based on Eq. 1 and Eq. 2, we derive the following cost function to minimize:

$$F^p(\alpha) = (1 - \gamma)\left[(y - \hat{K}^p \alpha)^T (y - \hat{K}^p \alpha) + \lambda \alpha^T \hat{K}^p \alpha\right] + \gamma\left[(y - K^p \alpha)^T (y - K^p \alpha) + \lambda \alpha^T K^p \alpha\right], \qquad (8)$$

where the kernel matrix $\hat{K}^p = C(\hat{k}^p)$, with vector entries $\hat{k}^p_l = \kappa(\hat{x}^p, \hat{x}^p_l)$ (and similarly for $K^p$ and $k^p$). Here $\gamma$ is a balance factor between memory consistency and model adaptiveness. If we set the derivative $\nabla_{\alpha} F^p = 0$ directly, the exact solution $\hat{\alpha}^p$ takes a very complicated form. We observe, however, that the adaptively learned appearance $\hat{x}^p$ should be very close to the current one $x^p$, since it is a linear combination of close past appearances $\{x^j\}_{j=1}^{p-1}$ and the current appearance $x^p$, as shown in Eq. 7. Notice that both kernel matrices $K^p$ and $\hat{K}^p$ (and their linear combinations with $\lambda I$) are positive semidefinite. By relaxing Eq. 8 with $\hat{K}^p \simeq (1 - \gamma)\hat{K}^p + \gamma K^p \simeq K^p$, we obtain an approximate minimizer $\hat{\alpha}^p$ in a very simple form:

$$\hat{\alpha}^p = C\big((1 - \gamma)\hat{k}^p + \gamma k^p + \lambda \delta\big)^{-1} y = \mathcal{F}^{-1}\!\left( \frac{\mathcal{F}(y)}{(1 - \gamma)\mathcal{F}(\hat{k}^p) + \gamma \mathcal{F}(k^p) + \lambda \mathbf{1}} \right), \qquad (9)$$

where $\delta$ is an $L$-dimensional vector of the form $\delta = [1, 0, \ldots, 0]^T$, with the properties $C(\delta) = I$ and $\mathcal{F}(\delta) = \mathbf{1}$ ($\mathbf{1}$ is an $L$-dimensional vector of ones). Note that inside $\mathcal{F}^{-1}(\cdot)$, the division is performed element-wise. As long as we find a proper set of coefficients $\{\beta_j\}_{j=1}^{p-1}$, we can build our detection model $\{\hat{x}^p, \hat{A}^p\}$ by Eq. 7 and Eq. 9. In the next frame $p + 1$, fast detection is performed by Eq. 6 with this learned model.

Figure 2: Left (*Distance Matrix and Clustering Result*): the distance matrix $D$ as described in Alg. 1. Right (*Six temporally constrained clusters with distinct appearances*): six representative clusters with corresponding colored bounding boxes, shown for intuitive understanding; the labeled frame ranges (0001-0416, 0417-0480, 0529-0592, 0625-0768, 0913-0928 and 1281-1345) mark where the memories were collected. The image patch in each big bounding box is the average appearance of a certain cluster (memory), while the small patches are samples chosen evenly over the temporal domain from each cluster.

**Algorithm 1** Temporally Constrained Clustering

Input: Integral image $J$ of the distance matrix $D \in \mathbb{R}^{p \times p}$, s.t. $D_{ij} = \|\phi(x_i) - \phi(x_j)\|^2$, $\forall i, j = 1, \ldots, p$; $M = \{m_i\}_{i=1}^{p}$ with $m_i = P^i \delta$, $\forall i = 1, \ldots, p$, where $P^i$ is a shifting matrix and $\delta = [1, 0, \ldots, 0]^T$; stopping factor $\eta$; and $N = |M| + 1$.
Output: $\hat{M} = \{m_i\}_{i=1}^{H}$.

while $|M| < N$ do
&nbsp;&nbsp;$N = |M|$;
&nbsp;&nbsp;for $h = 1 : 2 : |M|$ do
&nbsp;&nbsp;&nbsp;&nbsp;Evaluate $\Delta C(s_h, s_{h+1}) = C(s_h \cup s_{h+1}) - \big(C(s_h) + C(s_{h+1})\big)$ using $J$;
&nbsp;&nbsp;&nbsp;&nbsp;if $\Delta C(s_h, s_{h+1}) \le \eta \big(C(s_h) + C(s_{h+1})\big)$ then
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$m_h = m_h + m_{h+1}$, remove $m_{h+1}$ from $M$;
&nbsp;&nbsp;&nbsp;&nbsp;end if
&nbsp;&nbsp;end for
end while

### 3.2 Clustering with Temporal Constraints

In this subsection, we introduce our temporally constrained clustering, which learns distinct and reliable memories from the incoming samples in a very fast manner. Together with the ranked memories (Sec. 3.3), our tracker is robust to inaccurate tracking results, and is able to recover from drift error.

Suppose a set of positive samples is given at frame $p$: $S = \{x_i\}_{i=1}^{p}$, and we would like to divide them into $H$ subsets $\{s_h\}_{h=1}^{H}$ with an indexing vector set $M = \{m_1, \ldots, m_H\} \subset \{0, 1\}^p$, such that $s_h := \{x_i : m_h^i = 1, \forall i = 1, \ldots, p\}$.
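The sketch below illustrates the mechanics behind Alg. 1 under stated assumptions: it builds the integral image $J$ of the distance matrix and greedily merges temporally adjacent segments. It is not the authors' implementation; in particular, the merge test here uses an absolute threshold `tau` as a simplification, whereas Alg. 1 compares the merged cost against the separate costs via the stopping factor $\eta$.

```python
import numpy as np

def integral_image(D):
    """Summed-area table J of the distance matrix D: J[i, j] is the sum of
    D[:i+1, :j+1], so any block sum over D can be read off in O(1)."""
    return D.cumsum(axis=0).cumsum(axis=1)

def block_sum(J, r0, r1, c0, c1):
    """Sum of D[r0:r1, c0:c1] (half-open ranges) from the integral image."""
    s = J[r1 - 1, c1 - 1]
    if r0 > 0: s -= J[r0 - 1, c1 - 1]
    if c0 > 0: s -= J[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: s += J[r0 - 1, c0 - 1]
    return s

def tc_cluster(D, tau):
    """Temporally constrained clustering in the spirit of Alg. 1: start from
    one singleton segment per frame and repeatedly merge *adjacent* segments
    while the merged average pairwise distance stays below tau."""
    J = integral_image(D)

    def cost(a, b):
        # Average pairwise distance inside segment [a, b); the block sum
        # counts each unordered pair twice, and the diagonal of D is zero.
        n = b - a
        return 0.0 if n < 2 else block_sum(J, a, b, a, b) / (n * (n - 1))

    segs = [(i, i + 1) for i in range(D.shape[0])]
    merged = True
    while merged:                     # sweep until a full pass merges nothing
        merged, out, h = False, [], 0
        while h < len(segs):
            if h + 1 < len(segs) and cost(segs[h][0], segs[h + 1][1]) <= tau:
                out.append((segs[h][0], segs[h + 1][1]))
                h, merged = h + 2, True
            else:
                out.append(segs[h]); h += 1
        segs = out
    return segs
```

Because only temporally adjacent segments are ever compared, each sweep halves the number of candidate merges at best, which is the intuition behind the $O(n \log n)$ convergence claimed in footnote 2.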
Our objectives are as follows: 1) samples within each subset $s_h$ are highly correlated; 2) samples from different subsets have relatively large appearance differences, so that a linear combination of them is vague or even ambiguous as a description of the tracked target (e.g., samples taken from different viewpoints of the target). Thus, the task can be modeled as a very general clustering problem:

$$\min_{M} \sum_{h=1}^{H} C(s_h) + r(|M|), \quad \text{s.t. } \langle m_i, m_j \rangle = 0, \ \forall m_i, m_j \in M, \ i \neq j.$$

The function $C(s_h)$ measures the average sample distance in the feature space $\phi(\cdot)$ within subset $s_h$, in the form:

$$C(s_h) = \binom{\mathbf{1}^T m_h}{2}^{-1} \sum_{\forall x_i, x_j \in s_h,\, i < j} D_{ij}.$$
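Continuing the hypothetical sketch above, the following check confirms that the block-sum form used in `tc_cluster` agrees with the pairwise definition of $C(s_h)$, and that the greedy merging recovers temporally coherent clusters; the toy features and thresholds are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "features": two smooth appearance regimes of 30 frames each.
phi = np.concatenate([rng.normal(0.0, 0.1, (30, 8)),
                      rng.normal(1.0, 0.1, (30, 8))])
# Pairwise squared feature distances D_ij = ||phi_i - phi_j||^2.
D = ((phi[:, None, :] - phi[None, :, :]) ** 2).sum(-1)

def cost_direct(D, a, b):
    """C(s_h) evaluated literally: mean of D_ij over pairs i < j in [a, b)."""
    n = b - a
    idx_i, idx_j = np.triu_indices(n, k=1)
    return D[a:b, a:b][idx_i, idx_j].mean()

# The block sum counts each unordered pair twice (diagonal is zero), so
# dividing by n(n-1) matches the mean over the n(n-1)/2 pairs with i < j.
print(np.isclose(cost_direct(D, 0, 30),
                 block_sum(integral_image(D), 0, 30, 0, 30) / (30 * 29)))

# Expected to recover the two appearance regimes as two temporal clusters.
print(tc_cluster(D, tau=1.0))  # -> [(0, 30), (30, 60)]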