# Spatial-Temporal Person Re-Identification

The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Guangcong Wang,1 Jianhuang Lai,1,2,3 Peigen Huang,1 Xiaohua Xie1,2,3
1School of Data and Computer Science, Sun Yat-sen University, China
2Guangdong Key Laboratory of Information Security Technology
3Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education
{wanggc3, huangpg}@mail2.sysu.edu.cn, {stsljh, xiexiaoh6}@mail.sysu.edu.cn

## Abstract

Most current person re-identification (ReID) methods neglect the spatial-temporal constraint. Given a query image, conventional methods compute the feature distances between the query image and all the gallery images and return a similarity-ranked table. When the gallery database is very large in practice, these approaches fail to obtain good performance due to appearance ambiguity across different camera views. In this paper, we propose a novel two-stream spatial-temporal person ReID (st-ReID) framework that mines both visual semantic information and spatial-temporal information. To this end, a joint similarity metric with Logistic Smoothing (LS) is introduced to integrate two kinds of heterogeneous information into a unified framework. To approximate a complex spatial-temporal probability distribution, we develop a fast Histogram-Parzen (HP) method. With the help of the spatial-temporal constraint, the st-ReID model eliminates lots of irrelevant images and thus narrows the gallery database. Without bells and whistles, our st-ReID method achieves rank-1 accuracy of 98.1% on Market-1501 and 94.4% on DukeMTMC-reID, improving on the baselines of 91.2% and 83.8%, respectively, and outperforming all previous state-of-the-art methods by a large margin.

## Introduction

Person ReID aims to re-target pedestrian images across non-overlapping camera views given a query image. Recently, state-of-the-art person ReID methods (Wang et al. 2016; Zhong et al. 2017a; Bai, Bai, and Tian 2017; Tang et al. 2017; Zhuo et al. 2018; Lin et al. 2017a) gained a significant improvement (e.g., rank-1 accuracy of 80–90% on Market-1501) by using deep learning for feature representation. However, these methods are still far from applicable in real-world scenarios that may contain a large number of gallery images. It is hard to further improve the performance using only general visual features due to appearance ambiguity. For example, different persons may share a similar appearance, lighting condition, or human pose. How to exploit extra information to get around this bottleneck has become a hot topic in the person ReID community.

Figure 1: Conventional person ReID vs. our st-ReID. (a) Retrieval results of the conventional person ReID. Without the help of spatial-temporal information, it is difficult for the conventional ReID to deal with appearance ambiguity (red boxes denote false alarms). (b) Retrieval results of our st-ReID. With spatial-temporal information, st-ReID can eliminate irrelevant images. Besides, the spatial-temporal information (camera ID and timestamp) used by st-ReID widely exists in video surveillance and can be easily collected without any manual annotation. (Best viewed in color)

Recent studies attempt to exploit person structure information to improve the performance of ReID methods.
They believe that person structure information, such as body parts, human poses, person attributes, and background context information, can help ReID methods capture discriminative local visual features. For example, part-based methods (Li et al. 2017; Zhao et al. 2017b) make the assumption that a person image consists of head, upper body, lower body, and feet from top to bottom. Considering this person structure information, they can jointly learn both global full-body and local body-part features for person ReID. Pose-based methods (Su et al. 2017; Zhao et al. 2017a) aim to extract pose-invariant features by exploiting keypoint annotations to localize and align the poses. Other methods mine attribute, semantic segmentation, or background context cues (Kalayeh et al. 2018; Song et al. 2018) for person ReID. However, these models obtain only a limited improvement in addressing the appearance ambiguity problem.

Instead of using person structure information, a wide variety of approaches also attempt to exploit spatial-temporal information. The straightforward way is to exploit both spatial and temporal information from videos. Image-to-video and video-based person ReID methods (Wang, Lai, and Xie 2017; Li et al. 2018) aim to learn spatial- and temporal-invariant visual features. However, these approaches still focus on visual feature representations rather than a spatial-temporal constraint across different cameras. For example, a person captured by Camera 1 at time $t$ should not be captured by Camera 2, which is far away from Camera 1, at $t + \Delta t$ ($\Delta t$ is a small value). Such a spatial-temporal constraint eliminates lots of irrelevant target images in the gallery, and thus significantly alleviates the appearance ambiguity problem. To distinguish it from the spatial-temporal concept of video-based methods, we call this spatial-temporal person ReID (st-ReID), as shown in Figure 1.

Figure 2: (a) Camera topology of DukeMTMC-reID. (b) Spatial-temporal distribution, i.e., frequency of positive image pairs (an image pair with the same person identity denotes a positive pair) with respect to time interval. (Best viewed in color)

St-ReID is explicitly or implicitly investigated in distributed camera network topology inference (Huang et al. 2016; Cho et al. 2017) and cross-camera multiple object tracking. However, these approaches either make some strong assumptions for model simplification or do not focus on how to build an effective joint metric for the visual similarity and the spatial-temporal distribution.

Formally, st-ReID learns a mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$ from a training set $\{(x^v_i, x^s_i, x^t_i, y_i)\}$, where $x^v_i$, $x^s_i$, $x^t_i$, and $y_i$ represent a visual feature vector, a camera ID (spatial information), a timestamp, and a person ID, respectively. St-ReID has three properties: 1) the extra information of st-ReID (i.e., $x^s_i$, $x^t_i$) widely exists in video surveillance and can be easily collected without any manual annotation (see Figure 1); 2) with this cheap spatial-temporal information, the performance of ReID can be significantly improved (6.9% and 10.6% improvement on Market-1501 and DukeMTMC-reID, respectively); 3) st-ReID can be thought of as a more difficult version of cross-camera multiple object tracking, in which lots of in-between frames are missing. St-ReID bridges the gap between conventional person ReID and cross-camera multiple object tracking.
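As a concrete illustration of the tuple $(x^v_i, x^s_i, x^t_i, y_i)$ defined above, the following minimal sketch shows how one st-ReID training sample might be represented; the class and field names are hypothetical and not taken from the paper's code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class StReIDSample:
    """One training sample (x^v_i, x^s_i, x^t_i, y_i) for st-ReID.

    Field names are illustrative; the paper defines only the tuple, not an API.
    """
    visual_feature: np.ndarray  # x^v_i: CNN feature vector of the cropped person image
    camera_id: int              # x^s_i: spatial information, i.e., which camera captured it
    timestamp: int              # x^t_i: frame number (timestamp) of the detection
    person_id: int              # y_i: identity label (available only for training data)

# Example: a detection from camera 3 at frame 15210 with a PCB-style feature vector.
sample = StReIDSample(np.random.randn(1536), camera_id=3, timestamp=15210, person_id=42)
```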
There are three key challenges in modeling the spatial-temporal pattern in person ReID. First, it is extremely difficult to estimate the spatial-temporal pattern of person ReID, which follows a complex distribution. Take the DukeMTMC-reID dataset as an example (Figure 2(a)): there are several paths between Camera 1 and Camera 6, and therefore several peaks exist in the spatial-temporal distribution from Camera 1 to Camera 6 (Figure 2(b)). Second, even if we can find a good formulation to describe the complex spatial-temporal distribution based on a finite dataset, it is still unreliable due to uncertain walking trajectories and velocities. That is, a person may appear at any time and from anywhere. Third, given a reliable visual appearance similarity and an unreliable spatial-temporal distribution, it is difficult to build a reliable joint metric, because it is hard to assign appropriate weighting factors to these two types of metrics.

Considering these intractable problems, a novel joint similarity metric with Logistic Smoothing (LS) is proposed to integrate both the visual feature similarity and the spatial-temporal pattern into a unified metric function. Specifically, we first train a deep convolutional neural network for visual feature representation based on the PCB model (Sun et al. 2017b). A fast Histogram-Parzen (HP) method is then introduced to describe the probability of positive image pairs with respect to the time difference for each camera pair. To avoid missing low-probability positive gallery images, we propose to use logistic smoothing (LS) to alleviate the problem of uncertain walking trajectories and velocities in person ReID.

Overall, this paper makes three main contributions. First, we propose a novel two-stream spatial-temporal person ReID (st-ReID) framework that takes both visual semantic information and spatial-temporal information into consideration. With the help of cheap spatial-temporal information that can be easily collected without any manual annotation, the st-ReID model eliminates lots of irrelevant images and thus alleviates the appearance ambiguity problem in person ReID. Second, we propose a joint similarity metric with Logistic Smoothing (LS) to integrate two kinds of heterogeneous information into a unified framework. Furthermore, we develop a fast Histogram-Parzen (HP) method to approximate the spatial-temporal probability distribution. Third, without bells and whistles, our st-ReID method achieves rank-1 accuracy of 98.1% on Market-1501 and 94.4% on DukeMTMC-reID, improving on the baselines of 91.2% and 83.8%, respectively, and outperforming all previous state-of-the-art methods by a large margin.

## Related Work

Recent person ReID methods concentrate on deep learning for visual feature representation. Basically, these deep models either attempt to design effective convolutional neural networks or adopt different kinds of loss functions, e.g., classification loss (Zheng et al. 2016; Feng, Lai, and Xie 2018; Liang et al. 2018), verification loss (Li et al. 2014; Chen, Guo, and Lai 2015), and triplet loss (Ding et al. 2015; Wang, Lai, and Xie 2017; Hermans, Beyer, and Leibe 2017; Wang et al. 2016). Due to the remarkable representational ability of CNNs, state-of-the-art approaches achieve good performance, e.g., rank-1 accuracy of 80–90% on Market-1501. However, these methods can hardly address the appearance ambiguity problem. To achieve this goal, many studies try to exploit person structure information (Li et al. 2017; Zhao et al. 2017b; Su et al. 2017; Zhao et al. 2017a; Kalayeh et al. 2018; Song et al. 2018).
For example, a multi-scale context-aware network (Li et al. 2017) is used to learn powerful features over the full body and body parts to capture local context information. A pose-driven deep convolutional model (Su et al. 2017) is introduced to alleviate pose variations and learn robust feature representations from both the global image and different local parts. A human parsing method (Song et al. 2018) is adopted to improve the performance of person ReID with the help of pixel-level parsing accuracy.

Rather than using person structure information, another group of researchers pays attention to spatial-temporal information. According to the different kinds of annotations, spatial-temporal methods can be categorized into two sub-groups. In the first sub-group, spatial-temporal information is implicitly hidden in videos, e.g., image-to-video (Wang, Lai, and Xie 2017) and video-based person ReID (Li et al. 2018; Zheng et al. 2016). For example, Wang et al. (Wang, Lai, and Xie 2017) proposed a point-to-set network for image-to-video person ReID. Li et al. (Li et al. 2018) introduced a spatial-temporal attention model to discover a diverse set of distinctive body parts for video-based person ReID. In the second sub-group, spatial-temporal information is explicitly used as a constraint that eliminates irrelevant gallery images (Cho et al. 2017; Huang et al. 2016; Lv et al. 2018). For example, camera network topology inference methods (Cho et al. 2017; Lv et al. 2018) aim to perform person ReID and camera network topology inference alternately in an online or unsupervised learning manner. Given a person image with a timestamp $t$, they make the strong assumption that this person should appear within $(t - \Delta t, t + \Delta t)$. Different from these methods, our st-ReID approach seeks an effective joint metric that naturally integrates spatial-temporal information into the visual feature representation for supervised person ReID. Besides, a Camera Network based Person ReID (CNPR) method (Huang et al. 2016) is introduced to consider both the visual feature representation and the spatial-temporal constraint. However, the CNPR model makes the strong assumption that the time difference for the transition between cameras follows a Weibull distribution with a single peak, and it thus breaks down in complex scenarios, e.g., DukeMTMC-reID. Besides, CNPR fails to address the problem of uncertain walking trajectories and velocities. Different from the CNPR model, we propose to use a Histogram-Parzen window method for the Probability Density Function (PDF) approximation and introduce a logistic smoothing approach to solve the uncertainty problem.

## Proposed Method

St-ReID aims to exploit both the visual feature similarity and the spatial-temporal constraint in a unified framework. To this end, we propose a two-stream architecture which consists of three sub-modules, i.e., a visual feature stream, a spatial-temporal stream, and a joint metric sub-module. Figure 3 shows the two-stream architecture of st-ReID.

Figure 3: The proposed two-stream architecture. It consists of three sub-modules, i.e., a visual feature stream, a spatial-temporal stream, and a joint metric sub-module. (Best viewed in color)

### Visual Feature Stream

Visual feature representation approaches are investigated in lots of studies. We do not focus on how to extract a discriminative and robust feature representation in this paper. Therefore, we use a clear Part-based Convolutional Baseline (PCB) (Sun et al. 2017b) as the visual feature stream, without considering the refined part pooling. This stream contains a ResNet backbone network, a stripe-based average pooling layer, six 1×1 kernel-sized convolutional layers, six fully-connected layers, and six classifiers (cross-entropy loss). During the training phase, each classifier is used to predict the class (person identity) of a given image. With this part-level feature representation learning scheme, PCB can learn local discriminative features and thus achieve competitive accuracy. During the test phase, the six stripe-based features are concatenated into a column vector for the visual feature representation. In Figure 3, we only show the test phase of the visual feature stream. Given two images $I_i$ and $I_j$ ($i$ and $j$ denote image indexes in a dataset), we extract visual features using the PCB model and obtain two feature vectors, denoted $x_i$ and $x_j$, respectively. We compute a similarity score according to the cosine distance

$$s(x_i, x_j) = \frac{x_i \cdot x_j}{\|x_i\|\,\|x_j\|} \tag{1}$$
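The visual feature stream described above can be sketched roughly as follows. This is not the authors' released code: the backbone choice, the omission of the 1×1 convolution and classifier heads, and all function names are assumptions, but it illustrates the stripe-based pooling at test time and the cosine similarity of Eqn. (1).

```python
import torch
import torch.nn.functional as F
import torchvision

class PCBLikeExtractor(torch.nn.Module):
    """Simplified PCB-style extractor: conv backbone + 6 horizontal stripes.

    A sketch of the described architecture, not the official PCB code; the
    per-stripe 1x1 conv / classifier heads used during training are omitted.
    """
    def __init__(self, num_stripes: int = 6):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        # Keep only the convolutional layers (drop global pooling and the FC head).
        self.backbone = torch.nn.Sequential(*list(resnet.children())[:-2])
        self.num_stripes = num_stripes

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 3, 384, 192), matching the input size in the implementation details
        feat_map = self.backbone(images)                                   # (N, 2048, H, W)
        stripes = F.adaptive_avg_pool2d(feat_map, (self.num_stripes, 1))   # one pooled vector per stripe
        return stripes.flatten(start_dim=1)                                # concatenate stripes (test phase)

def cosine_similarity_score(x_i: torch.Tensor, x_j: torch.Tensor) -> torch.Tensor:
    """Eqn. (1): s(x_i, x_j) = (x_i . x_j) / (||x_i|| ||x_j||), in (-1, 1)."""
    return F.cosine_similarity(x_i, x_j, dim=-1)
```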
### Spatial-temporal Stream

The spatial-temporal stream captures complementary spatial-temporal information to assist the visual feature stream. Instead of using a closed-form probability distribution function (Huang et al. 2016) that rests on a strong assumption, we estimate the spatial-temporal distribution with a non-parametric estimation approach, i.e., the Parzen window approach. However, it is time-consuming to directly estimate a PDF this way because there are too many spatial-temporal data points. To alleviate this computational burden, we develop a Histogram-Parzen approach: we first estimate spatial-temporal histograms and then use the Parzen window method to smooth them.

Let $(ID_i, c_i, t_i)$ and $(ID_j, c_j, t_j)$ (with $t_i < t_j$) denote the identity labels, camera IDs, and timestamps of two images $I_i$ and $I_j$, respectively. We create coarse spatial-temporal histograms to describe the probability of a positive image pair by

$$\hat{p}(y = 1 \mid k, c_i, c_j) = \frac{n^k_{c_i c_j}}{\sum_l n^l_{c_i c_j}} \tag{2}$$

where $k$ indicates the $k$-th bin of a histogram, i.e., the time interval $t_j - t_i \in ((k-1)\Delta t, k\Delta t]$, and $n^k_{c_i c_j}$ represents the number of positive image pairs whose time differences fall into the $k$-th bin from camera $c_i$ to camera $c_j$. Here $y = 1$ denotes that $I_i$ and $I_j$ share the same person identity (i.e., $ID_i = ID_j$), while $y = 0$ denotes different person identities (i.e., $ID_i \neq ID_j$). With the Parzen window method, we smooth the histogram by

$$p(y = 1 \mid k, c_i, c_j) = \frac{1}{Z} \sum_l \hat{p}(y = 1 \mid l, c_i, c_j)\, K(l - k) \tag{3}$$

where $K(\cdot)$ is a kernel function and $Z$ is a normalization factor such that $\sum_k p(y = 1 \mid k, c_i, c_j) = 1$. In this work, we use a Gaussian function as the kernel $K$, namely

$$K(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2}{2\sigma^2}} \tag{4}$$
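The Histogram-Parzen estimation of Eqns. (2)–(4) can be sketched in a few lines of NumPy. This is a minimal illustration under the settings reported later (bin width Δt = 100 frames, Gaussian σ = 50), computed independently for each camera pair; the function and variable names are assumptions, and the three-sigma truncation mentioned in the implementation details is omitted for brevity.

```python
import numpy as np

def histogram_parzen(time_diffs, num_bins, bin_width=100, sigma=50.0):
    """Smoothed estimate of p(y=1 | k, c_i, c_j) for ONE camera pair (c_i, c_j).

    time_diffs : time gaps t_j - t_i (in frames) of all positive pairs observed
                 from camera c_i to camera c_j in the training set.
    Returns a probability over the histogram bins that sums to 1.
    """
    # Eqn. (2): coarse histogram of positive pairs, normalized per camera pair.
    counts, _ = np.histogram(time_diffs, bins=num_bins, range=(0, num_bins * bin_width))
    p_hat = counts / max(counts.sum(), 1)

    # Eqn. (3)-(4): Parzen-window smoothing with a Gaussian kernel over bin offsets (l - k).
    # NOTE: sigma is applied in bin units here; the paper does not state the unit explicitly.
    k = np.arange(num_bins)
    kernel = np.exp(-((k[:, None] - k[None, :]) ** 2) / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    p_smooth = kernel @ p_hat
    return p_smooth / max(p_smooth.sum(), 1e-12)   # the normalization factor Z in Eqn. (3)

# Usage: estimate the table for one camera pair, then look up a query/gallery time gap.
p_st_table = histogram_parzen(np.array([120, 450, 460, 900, 3000]), num_bins=200)
gap = abs(15210 - 14800)                 # time difference in frames
p_st = p_st_table[min(gap // 100, 199)]  # spatial-temporal probability p_st for this pair
```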
### Joint Metric

After obtaining the two kinds of heterogeneous patterns, it is intuitive to assume that the visual similarity probability is independent of the spatial-temporal probability. The joint probability can then be simply formulated as

$$p(y = 1 \mid x_i, x_j, k, c_i, c_j) = s(x_i, x_j)\, p(y = 1 \mid k, c_i, c_j) \tag{5}$$

However, Eqn. (5) neglects two points. First, it is unreasonable to directly use the similarity score as the visual similarity probability, i.e., to assume $p(y = 1 \mid x_i, x_j) = s(x_i, x_j)$. Second, the spatial-temporal probability $p(y = 1 \mid k, c_i, c_j)$ is unreliable and uncontrollable because the walking trajectory and velocity of a person are uncertain, i.e., a person may appear at any time and from anywhere. Directly using $p(y = 1 \mid k, c_i, c_j)$ as the spatial-temporal probability leads to a lower recall rate while keeping the same precision. As an example, given a query image, suppose one gallery image has a similarity score of 0.9 and a spatial-temporal probability of 0.01, while another has a similarity score of 0.3 and a spatial-temporal probability of 0.1. Eqn. (5) tends to return the second gallery image (0.03 vs. 0.009). Candidates with low spatial-temporal probabilities may thus be regarded as irrelevant images. However, this is impractical in real-world scenarios, especially video surveillance systems. For example, when retrieving the images of a thief, (s)he may not be retrieved because (s)he may walk faster than a common person and thus has a low spatial-temporal probability. So, can we transform the similarity score into a visual similarity probability, or can we build a robust spatial-temporal probability? Our observation is two-fold:

**Observation 1: Laplace smoothing.** Laplace smoothing is a technique widely used to estimate a prior probability in Naive Bayes:

$$p_\lambda(Y = d_k) = \frac{m_k + \lambda}{M + D\lambda} \tag{6}$$

where $d_k$ indicates the label of the $k$-th category, $m_k$ indicates the number of examples of the $k$-th category, $M$ is the total number of examples, $D$ is the total number of categories, and $\lambda$ is the smoothing parameter. As a special case, when the number of categories $D$ is 2 and $\lambda = 1$, we obtain

$$p_\lambda(Y = d_k) = \frac{m_k + 1}{M + 2} \tag{7}$$

We can see that Laplace smoothing is used to adjust the probability of rare (but not impossible) events so that those probabilities are not exactly zero and zero-frequency problems are avoided. It serves as a type of shrinkage estimator, as the smoothed result lies between the empirical estimate $\frac{m_k}{M}$ and the uniform probability $\frac{1}{2}$.

**Observation 2: Logistic function.** The logistic model is widely applied to the binary classification problem. Specifically, it is defined as

$$f(x; \lambda, \gamma) = \frac{1}{1 + \lambda e^{-\gamma x}} \tag{8}$$

where $\lambda$ and $\gamma$ are constant coefficients: $\lambda$ is a smoothing factor and $\gamma$ is a shrinking factor.

Observation 1 shows the basic idea of a smoothing operator to alleviate unreliable probability estimation. Observation 2 shows that the logistic function can be used for the binary classification problem. Based on these two observations, we propose a logistic smoothing approach that both adjusts the probability of rare events and computes the probability of two images belonging to the same identity given the observed information. We modify Eqn. (5) as

$$p_{joint} = f(s; \lambda_0, \gamma_0)\, f(p_{st}; \lambda_1, \gamma_1) \tag{9}$$

For notational simplicity, we use $p_{joint}$, $s$, and $p_{st}$ to denote $p(y = 1 \mid x_i, x_j, k, c_i, c_j)$, $s(x_i, x_j)$, and $p(y = 1 \mid k, c_i, c_j)$, respectively. According to Eqns. (1) and (3), we can see that $s \in (-1, 1)$ is shrunk by the logistic function, much like in Laplace smoothing, but not by much. In contrast, $p_{st} \in (0, 1)$ is truncated and lifted up substantially: even when the spatial-temporal probability $p_{st}$ is close to zero, $f(p_{st}; \lambda_1, \gamma_1) \geq f(0) = \frac{1}{1 + \lambda_1}$. With the logistic smoothing, Eqn. (9) is robust to rare events. This is reasonable because the spatial-temporal probability is unreliable, as discussed above, while the visual similarity is relatively reliable. Besides, using the logistic function to transform the similarity score (spatial-temporal probability) into a binary classification probability (positive pair or negative pair) is intuitive and self-evident, as described in Observation 2.
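A minimal sketch of the logistic-smoothed joint metric of Eqns. (8)–(9), using the hyper-parameter values reported in the implementation details below (λ0 = 1, γ0 = 5 for the visual score and λ1 = 2, γ1 = 5 for the spatial-temporal probability); the function names are ours, not the authors'.

```python
import numpy as np

def logistic(x, lam, gamma):
    """Eqn. (8): f(x; lambda, gamma) = 1 / (1 + lambda * exp(-gamma * x))."""
    return 1.0 / (1.0 + lam * np.exp(-gamma * x))

def joint_score(s, p_st, lam0=1.0, gamma0=5.0, lam1=2.0, gamma1=5.0):
    """Eqn. (9): p_joint = f(s; lam0, gamma0) * f(p_st; lam1, gamma1).

    s    : cosine similarity from the visual stream, in (-1, 1).
    p_st : smoothed spatial-temporal probability, in (0, 1).
    """
    return logistic(s, lam0, gamma0) * logistic(p_st, lam1, gamma1)

# Even a near-zero spatial-temporal probability cannot zero out the joint score,
# since f(0; lam1=2) = 1/3; the naive product of Eqn. (5) would discard such a candidate.
print(joint_score(s=0.9, p_st=0.0))   # ~0.33
print(0.9 * 0.0)                      # 0.0 under Eqn. (5)
```

Gallery candidates are then ranked by this joint score at test time.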
### Implementation Details

For the visual feature stream, we set the hyper-parameters following the PCB method (Sun et al. 2017b), without considering the refined pooling scheme. The training images are augmented with horizontal flipping and normalization and are resized to 384×192. We use SGD with a mini-batch size of 32 and train the visual feature stream for 60 epochs. The learning rate starts from 0.1 and is decayed to 0.01 after 40 epochs. The backbone model is pre-trained on ImageNet, and the learning rate for all pre-trained layers is set to 0.1 times the base learning rate. For the spatial-temporal stream, we set the time interval $\Delta t$ to 100 frames. We set the Gaussian kernel parameter $\sigma$ to 50 and use the three-sigma rule to further reduce the computation. For the joint metric, we set $\lambda_0$, $\lambda_1$, $\gamma_0$, and $\gamma_1$ to 1, 2, 5, and 5, respectively.

## Experiments

In this section, we evaluate our st-ReID method on two large-scale person ReID benchmark datasets, i.e., Market-1501 and DukeMTMC-reID, and show the superiority of the st-ReID model compared with other state-of-the-art methods. We then present ablation studies to reveal the benefits of each main component/factor of our method.

**Datasets.** The Market-1501 dataset is collected in front of a supermarket at Tsinghua University. A total of six cameras are used, including five high-resolution cameras and one low-resolution camera. Overlap exists among different cameras. Overall, this dataset contains 32,668 annotated bounding boxes of 1,501 identities. Among them, 12,936 images from 751 identities are used for training, and 19,732 images from 750 identities plus distractors are used for the gallery. As for the query set, 3,368 hand-drawn bounding boxes from 750 identities are adopted. In this open system, images of each identity are captured by at most six cameras. Each annotated identity is present in at least two cameras. Each image carries its camera ID and frame number (timestamp).

DukeMTMC-reID is a subset of the DukeMTMC dataset for image-based re-identification. There are 1,404 identities appearing in more than two cameras and 408 identities (distractor IDs) who appear in only one camera. Specifically, 702 IDs are selected as the training set and the remaining 702 IDs are used as the testing set. In the testing set, one query image is picked for each ID in each camera and the remaining images are put in the gallery. In this way, there are 16,522 training images of 702 identities, 2,228 query images of the other 702 identities, and 17,661 gallery images (702 IDs + 408 distractor IDs). Each image carries its camera ID and frame number (timestamp).

Table 1: Comparison of the proposed method with the state-of-the-art methods on Market-1501. The compared methods are categorized into seven groups. Group 1: handcrafted feature methods. Group 2: clear deep learning based methods. Group 3: attribute-based methods. Group 4: mask-guided methods. Group 5: part-based methods. Group 6: pose-based methods. Group 7: spatial-temporal methods. * denotes methods reproduced by ourselves.

| Methods | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| BoW+kissme | 44.4 | 63.9 | 72.2 | 20.8 |
| KLFDA | 46.5 | 71.1 | 79.9 | - |
| Null Space | 55.4 | - | - | 29.9 |
| WARCA | 45.2 | 68.1 | 76.0 | - |
| PAN | 82.8 | - | - | 63.4 |
| SVDNet | 82.3 | 92.3 | 95.2 | 62.1 |
| HA-CNN | 91.2 | - | - | 75.7 |
| SSDAL | 39.4 | - | - | 19.6 |
| APR | 84.3 | 93.2 | 95.2 | 64.7 |
| Human Parsing | 93.9 | 98.8 | 99.5 | - |
| Mask-guided | 83.79 | - | - | 74.3 |
| Background | 81.2 | 94.6 | 97.0 | - |
| PDC | 84.1 | 92.7 | 94.9 | 63.4 |
| PSE+ECN | 90.3 | - | - | 84.0 |
| MultiScale | 88.9 | - | - | 73.1 |
| SpindleNet | 76.9 | 91.5 | 94.6 | - |
| Latent Parts | 80.3 | - | - | 57.5 |
| Part-Aligned | 81.0 | 92.0 | 94.7 | 63.4 |
| PCB(*) | 91.2 | 97.0 | 98.2 | 75.8 |
| TFusion-sup | 73.1 | 86.4 | 90.5 | - |
| st-ReID | 97.2 | 99.3 | 99.5 | 86.7 |
| st-ReID+RE | 98.1 | 99.3 | 99.6 | 87.6 |
| st-ReID+RE+re-rank | 98.0 | 98.9 | 99.1 | 95.5 |
Table 2: Comparison of the proposed method with the state-of-the-art methods on DukeMTMC-reID. The compared methods are categorized into seven groups. Group 1: handcrafted feature methods. Group 2: clear deep learning based methods. Group 3: attribute-based methods. Group 4: mask-guided methods. Group 5: part-based methods. Group 6: pose-based methods. Group 7: spatial-temporal methods. * denotes methods reproduced by ourselves.

| Methods | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| BoW+kissme | 25.1 | - | - | 12.2 |
| LOMO+XQDA | 30.8 | - | - | 17.0 |
| PAN | 71.6 | - | - | 51.5 |
| SVDNet | 76.7 | - | - | 56.8 |
| HA-CNN | 80.5 | - | - | 63.8 |
| APR | 70.7 | - | - | 51.9 |
| Human Parsing | 84.4 | 91.9 | 93.7 | 71.0 |
| PSE+ECN | 85.2 | - | - | 79.8 |
| MultiScale | 79.2 | - | - | 60.6 |
| PCB(*) | 83.8 | 91.7 | 94.4 | 69.4 |
| st-ReID | 94.0 | 97.0 | 97.8 | 82.8 |
| st-ReID+RE | 94.4 | 97.4 | 98.2 | 83.9 |
| st-ReID+RE+re-rank | 94.5 | 96.8 | 97.1 | 92.7 |

**Evaluation Protocol.** For each query, an algorithm computes the distances between the query image and all the gallery images and returns a list ranked from small to large. Top-k accuracy is computed by checking whether the top-k gallery images contain the query identity. For each individual query identity, his/her gallery samples from the same camera are excluded, due to the cross-view matching setting in person ReID. Mean average precision (mAP) is used to evaluate the overall performance. For each query, we calculate the area under the precision-recall curve, i.e., the average precision (AP). Then, the mean value of the APs of all queries, i.e., mAP, is calculated. mAP considers both the precision and the recall of an algorithm, thus providing a more comprehensive evaluation.
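As an illustration of this protocol (not the official evaluation code), the sketch below computes the rank-k hits and the average precision for a single query, excluding same-camera gallery samples of the same identity; the function and variable names are assumptions.

```python
import numpy as np

def evaluate_single_query(dist, q_pid, q_cam, g_pids, g_cams, topk=(1, 5, 10)):
    """Rank-k hits and average precision (AP) for one query.

    dist   : distances from the query to all gallery images (smaller = more similar)
    q_pid  : query person ID;   q_cam  : query camera ID
    g_pids : gallery person IDs; g_cams : gallery camera IDs
    """
    # Exclude gallery samples of the same identity captured by the same camera (cross-view setting).
    keep = ~((g_pids == q_pid) & (g_cams == q_cam))
    order = np.argsort(dist[keep])                         # ranking list, small to large
    matches = (g_pids[keep][order] == q_pid).astype(np.float64)

    hits = {k: float(matches[:k].any()) for k in topk}     # top-k hit for this query

    # AP: area under the precision-recall curve of the ranking list.
    if matches.sum() == 0:
        return hits, 0.0
    precision_at_rank = np.cumsum(matches) / (np.arange(len(matches)) + 1)
    ap = float((precision_at_rank * matches).sum() / matches.sum())
    return hits, ap

# mAP is the mean of the per-query APs; rank-k accuracy averages the per-query hits.
```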
### Comparisons to the State-of-the-Art

In this sub-section, we compare our st-ReID approach with a large number of existing state-of-the-art methods on two large-scale person ReID benchmark datasets to show its superiority.

**Evaluations on Market-1501.** We evaluated the proposed st-ReID model against twenty existing state-of-the-art methods, which can be grouped into seven categories: 1) handcrafted feature methods, including BoW+kissme (Zheng et al. 2015), KLFDA (Karanam et al. 2016), Null Space (Zhang, Xiang, and Gong 2016), and WARCA (Jose and Fleuret 2016); 2) clear deep learning based methods, including PAN (Zheng, Zheng, and Yang 2016), SVDNet (Sun et al. 2017a), and HA-CNN; 3) attribute-based methods, including SSDAL (Su et al. 2016) and APR (Lin et al. 2017b); 4) mask-guided methods, including Human Parsing (Kalayeh et al. 2018), Mask-guided (Song et al. 2018), and Background (Tian et al. 2018); 5) part-based methods, including MultiScale (Chen, Zhu, and Gong 2017), PDC (Su et al. 2017), and PSE+ECN (Saquib Sarfraz et al. 2018); 6) pose-based methods, including SpindleNet (Zhao et al. 2017a), Latent Parts (Li et al. 2017), Part-Aligned (Zhao et al. 2017b), and PCB; and 7) spatial-temporal methods, including TFusion-sup (Lv et al. 2018). Among them, attribute-based methods use person attribute annotations, mask-guided methods use person masks or human body parsing annotations, part-based methods make the person-body assumption or use body-part detectors, and pose-based methods use keypoint annotations. These methods obtain good performance compared with handcrafted feature methods and clear deep learning based methods, but they need expensive annotations and are quite time-consuming, e.g., pixel-level human parsing annotations, eighteen keypoints, and body part annotations.

Our st-ReID method uses only cheap spatial-temporal information (i.e., camera ID and timestamp) and achieves a rank-1 accuracy of 97.2% and an mAP of 86.7%, outperforming all existing state-of-the-art methods by a large margin (Table 1). With random erasing (RE) (Zhong et al. 2017c), our st-ReID achieves a rank-1 accuracy of 98.1% and an mAP of 87.6%. With the re-ranking scheme (Zhong et al. 2017b), our st-ReID obtains an mAP of 95.5%. TFusion-sup also uses a spatial-temporal constraint, but it makes the strong assumption that a gallery person always appears within $(t - \Delta t, t + \Delta t)$ given a query image with timestamp $t$. Such a method may not be effective in complex scenarios, especially DukeMTMC-reID. Besides, TFusion-sup focuses on cross-dataset unsupervised learning for person ReID by alternately iterating between learning visual feature representations and estimating spatial-temporal patterns. Therefore, TFusion-sup does not actually investigate how to estimate the spatial-temporal probability distribution or how to model the joint probability of the visual similarity and the spatial-temporal distribution.

**Evaluations on DukeMTMC-reID.** DukeMTMC-reID is a newer dataset and manifests itself as one of the most challenging ReID datasets to date. We compare our st-ReID method with ten state-of-the-art methods on the DukeMTMC-reID dataset. All of the competing methods are also evaluated on the Market-1501 dataset except LOMO+XQDA (Liao et al. 2015). As shown in Table 2, it is encouraging to see that our approach (without any re-ranking scheme) significantly outperforms the competing methods by a large margin, e.g., improving the state-of-the-art rank-1 accuracy from 85.2% to 94.0% and the mAP from 79.8% to 82.8% compared with PSE+ECN (which uses a re-ranking scheme). With random erasing (RE), our st-ReID achieves a rank-1 accuracy of 94.4% and an mAP of 83.9%. With the re-ranking scheme, our st-ReID obtains an mAP of 92.7%.

**Remarks.** Without bells and whistles, our st-ReID model outperforms all previous state-of-the-art person ReID methods, e.g., with rank-1 accuracies of 98.1% and 94.4% on the Market-1501 and DukeMTMC-reID datasets, respectively. While outside the scope of this work, we expect many complementary techniques (e.g., refined pooling) to be applicable to our method.

Figure 4: Effectiveness of our method on DukeMTMC-reID. (a) Effect of the ST and VIS streams (only ST stream: 5.5%; only VIS stream: 83.8%; both: 94.0%). (b) Effectiveness of the joint metric (VIS stream: 83.8%; baseline: 86.9%; our joint metric: 94.0%). (c) Influence of $\lambda$. (d) Influence of $\gamma$.

Table 3: Generalization of st-ReID on DukeMTMC-reID.

| Methods | R-1 | R-5 | R-10 | mAP |
|---|---|---|---|---|
| ResNet50 baseline | 76.9 | 87.8 | 91.0 | 58.7 |
| ResNet50+ST | 87.7 | 94.1 | 95.8 | 72.2 |
| DenseNet121 baseline | 79.3 | 89.9 | 92.6 | 63.3 |
| DenseNet121+ST | 90.8 | 95.2 | 96.5 | 76.9 |
| PCB(*) | 83.8 | 91.7 | 94.4 | 69.4 |
| PCB+ST | 94.0 | 97.0 | 97.8 | 82.8 |

### Ablation Studies and Model Analysis

To provide more insight into the performance of our approach, we conduct a number of ablation studies on the most challenging DukeMTMC-reID dataset by isolating each key component, i.e., the visual feature representation stream, the spatial-temporal probability estimation stream, and the joint metric sub-module.

**Effect of the visual feature stream.** To show the benefit of the visual feature stream (VIS stream), we conduct an ablation study by isolating this sub-module.
To achieve this, we remove the visual feature stream and thus use only the spatial-temporal stream. As shown in Figure 4(a), the rank-1 accuracy drops by 88.5% (from 94.0% to 5.5%) when the VIS stream is removed. This experiment confirms that the VIS stream plays a key role in the st-ReID approach.

**Effect of the spatial-temporal stream.** To show the benefit of the spatial-temporal stream (ST stream), we remove this sub-module to see what effect the spatial-temporal stream has on st-ReID. In this case, the st-ReID model degrades to the PCB model. As shown in Figure 4(a), without the spatial-temporal probability estimation stream, the rank-1 accuracy drops by 10.2% (from 94.0% to 83.8%).

**Effectiveness of the joint metric.** To show the effectiveness of the joint metric, we set a baseline using Eqn. (5). In the baseline, both the VIS stream and the ST stream are normalized. For a fair comparison, we use the same VIS and ST streams. As shown in Figure 4(b), our joint metric improves the performance from 86.9% to 94.0%. Compared with the VIS stream (the PCB model), the baseline also obtains a 3.1% improvement because it integrates the spatial-temporal information.

**Influence of parameters.** To investigate the impact of two important parameters in our st-ReID, i.e., the smoothing factor $\lambda$ and the shrinking factor $\gamma$, we conduct two sensitivity experiments. As shown in Figure 4(c) and (d), when $\lambda$ is in the range of 0.4–2.8 or $\gamma$ is in the range of 1–7, our model nearly maintains its best performance.

**Generalization of st-ReID.** To show the good generalization of st-ReID, we further use different deep models as the VIS stream. The deep models are ResNet-50 (a clear model with the cross-entropy loss), DenseNet-121 (a clear model with the cross-entropy loss), and PCB. As shown in Table 3, when the ST stream is added to these VIS streams and our joint metric is used, we achieve more than a 10% improvement.

## Conclusion

In this paper, we propose a novel two-stream spatial-temporal person ReID (st-ReID) framework that mines both the visual semantic similarity and the spatial-temporal information. Without bells and whistles, our st-ReID method achieves rank-1 accuracy of 98.1% on Market-1501 and 94.4% on DukeMTMC-reID, improving on the baselines of 91.2% and 83.8%, respectively, and outperforming all previous state-of-the-art methods by a large margin. We intend to extend this work in two directions. First, st-ReID builds a bridge between conventional ReID and cross-camera multiple object tracking and thus can be easily generalized to cross-camera multiple object tracking. Second, we intend to further improve the performance of the st-ReID method with an end-to-end training scheme.

## Acknowledgments

This project was supported by the National Natural Science Foundation of China (U1611461, 61573387, 61672544).

## References

Bai, S.; Bai, X.; and Tian, Q. 2017. Scalable person re-identification on supervised smoothed manifold. In CVPR, 2530–2539.
Chen, S.-Z.; Guo, C.-C.; and Lai, J.-H. 2015. Deep ranking for person re-identification via joint representation learning. arXiv:1505.06821.
Chen, Y.; Zhu, X.; and Gong, S. 2017. Person re-identification by deep learning multi-scale representations. In CVPRW.
Cho, Y.-J.; Kim, S.-A.; Park, J.-H.; Lee, K.; and Yoon, K.-J. 2017. Joint person re-identification and camera network topology inference in multiple cameras. arXiv:1710.00983.
Ding, S.; Lin, L.; Wang, G.; and Chao, H. 2015. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48(10):2993–3003.
Feng, Z.; Lai, J.; and Xie, X. 2018. Learning view-specific deep networks for person re-identification. TIP 27(7):3472–3483.
Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737.
Huang, W.; Hu, R.; Liang, C.; Yu, Y.; Wang, Z.; Zhong, X.; and Zhang, C. 2016. Camera network based person re-identification by leveraging spatial-temporal constraint and multiple cameras relations. In MMM.
Jose, C., and Fleuret, F. 2016. Scalable metric learning via weighted approximate rank component analysis. In ECCV.
Kalayeh, M. M.; Basaran, E.; Gökmen, M.; Kamasak, M. E.; and Shah, M. 2018. Human semantic parsing for person re-identification. In CVPR, 1062–1071.
Karanam, S.; Gou, M.; Wu, Z.; Rates-Borras, A.; Camps, O.; and Radke, R. J. 2016. A comprehensive evaluation and benchmark for person re-identification. arXiv:1605.09653.
Li, W.; Zhao, R.; Xiao, T.; and Wang, X. 2014. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 152–159.
Li, D.; Chen, X.; Zhang, Z.; and Huang, K. 2017. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, 384–393.
Li, S.; Bak, S.; Carr, P.; and Wang, X. 2018. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR.
Liang, W.; Wang, G.; Lai, J.; and Zhu, J. 2018. M2M-GAN: Many-to-many generative adversarial transfer learning for person re-identification. arXiv:1811.03768.
Liao, S.; Hu, Y.; Zhu, X.; and Li, S. Z. 2015. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2197–2206.
Lin, L.; Wang, G.; Zuo, W.; Xiangchu, F.; and Zhang, L. 2017a. Cross-domain visual matching via generalized similarity measure and feature learning. TPAMI 39:1089–1102.
Lin, Y.; Zheng, L.; Zheng, Z.; Wu, Y.; and Yang, Y. 2017b. Improving person re-identification by attribute and identity learning. arXiv:1703.07220.
Lv, J.; Chen, W.; Li, Q.; and Yang, C. 2018. Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In CVPR, 7948–7956.
Saquib Sarfraz, M.; Schumann, A.; Eberle, A.; and Stiefelhagen, R. 2018. A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, 420–429.
Song, C.; Huang, Y.; Ouyang, W.; and Wang, L. 2018. Mask-guided contrastive attention model for person re-identification. In CVPR, 1179–1188.
Su, C.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2016. Deep attributes driven multi-camera person re-identification. arXiv:1605.03259.
Su, C.; Li, J.; Zhang, S.; Xing, J.; Gao, W.; and Tian, Q. 2017. Pose-driven deep convolutional model for person re-identification. In ICCV, 3960–3969.
Sun, Y.; Zheng, L.; Deng, W.; and Wang, S. 2017a. SVDNet for pedestrian retrieval. In ICCV, 3800–3808.
Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; and Wang, S. 2017b. Beyond part models: Person retrieval with refined part pooling. arXiv:1711.09349.
Tang, S.; Andriluka, M.; Andres, B.; and Schiele, B. 2017. Multiple people tracking by lifted multicut and person re-identification. In CVPR, 3539–3548.
Tian, M.; Yi, S.; Li, H.; Li, S.; Zhang, X.; Shi, J.; Yan, J.; and Wang, X. 2018. Eliminating background-bias for robust person re-identification. In CVPR, 5794–5803.
Wang, G.; Lin, L.; Ding, S.; Li, Y.; and Wang, Q. 2016. DARI: Distance metric and representation integration for person verification. In AAAI.
Wang, G.; Lai, J.; and Xie, X. 2017. P2SNet: Can an image match a video for person re-identification in an end-to-end way? TCSVT 28:2777–2787.
Zhang, L.; Xiang, T.; and Gong, S. 2016. Learning a discriminative null space for person re-identification. In CVPR, 1239–1248.
Zhao, H.; Tian, M.; Sun, S.; Shao, J.; Yan, J.; Yi, S.; Wang, X.; and Tang, X. 2017a. Spindle Net: Person re-identification with human body region guided feature decomposition and fusion. In CVPR, 1077–1085.
Zhao, L.; Li, X.; Zhuang, Y.; and Wang, J. 2017b. Deeply-learned part-aligned representations for person re-identification. In ICCV, 3219–3228.
Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In ICCV, 1116–1124.
Zheng, L.; Bie, Z.; Sun, Y.; Wang, J.; Su, C.; Wang, S.; and Tian, Q. 2016. MARS: A video benchmark for large-scale person re-identification. In ECCV.
Zheng, Z.; Zheng, L.; and Yang, Y. 2016. Pedestrian alignment network for large-scale person re-identification. arXiv:1707.00408.
Zhong, Z.; Zheng, L.; Cao, D.; and Li, S. 2017a. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 1318–1327.
Zhong, Z.; Zheng, L.; Cao, D.; and Li, S. 2017b. Re-ranking person re-identification with k-reciprocal encoding. In CVPR.
Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2017c. Random erasing data augmentation. arXiv:1708.04896.
Zhuo, J.; Chen, Z.; Lai, J.; and Wang, G. 2018. Occluded person re-identification. arXiv:1804.02792.