# Resolution-invariant Person Re-Identification

Shunan Mao^1, Shiliang Zhang^1 and Ming Yang^2
^1 Peking University  ^2 Horizon Robotics, Inc.
{snmao, slzhang.jdl}@pku.edu.cn, ming.yang@horizon-robotics.com

Abstract

Exploiting a resolution-invariant representation is critical for person Re-Identification (ReID) in real applications, where the resolutions of captured person images may vary dramatically. This paper learns person representations robust to resolution variance by jointly training a Foreground-Focus Super-Resolution (FFSR) module and a Resolution-Invariant Feature Extractor (RIFE) in an end-to-end CNN. FFSR upscales the person foreground using a fully convolutional auto-encoder with skip connections learned with a foreground-focus training loss. RIFE adopts two feature extraction streams weighted by a dual-attention block to learn features for low- and high-resolution images, respectively. These two complementary modules are jointly trained, leading to a strong resolution-invariant representation. We evaluate our methods on five datasets containing person images over a large range of resolutions, where our methods show substantial superiority over existing solutions. For instance, we achieve Rank-1 accuracies of 36.4% and 73.3% on CAVIAR and MLR-CUHK03, outperforming the state-of-the-art by 2.9% and 2.6%, respectively.

1 Introduction

Person Re-Identification (ReID) aims to find a probe person in a large-scale person image gallery collected by a camera network [Li et al., 2019]. Person ReID is challenging since it is confronted by many appearance variations due to camera viewpoint, person pose, illumination, background, etc. Thanks to the introduction of benchmark datasets like VIPeR [Gray and Tao, 2008], CUHK03 [Li et al., 2014], Market1501 [Zheng et al., 2015] and MSMT17 [Wei et al., 2018], most of these challenges are covered, leading to significant progress in person ReID performance.

Among the above challenges, the varying resolution of person images is probably the most common one, caused by the distance to a camera, camera focus, and camera resolution. Matching persons at different resolutions requires a ReID algorithm to attend to distinct visual cues. For example, Fig. 1 illustrates two instances of three persons sampled from CAVIAR [Cheng et al., 2011]. With high-resolution image samples, those three persons can be distinguished by their hair styles or the stripes on their pants. As these details are not available in low-resolution images, a ReID method needs to resort to silhouettes or global textures for reliable matching. Moreover, the high- and low-resolution samples of the same person may even present a larger discrepancy than samples of different persons at a similar resolution. Therefore, dedicated treatments are desired for ReID methods to cope with large resolution variations of person images.

Figure 1: Illustration of 6 images of 3 persons in CAVIAR [Cheng et al., 2011]. Person ReID needs to match the same person and discern different persons across different resolutions.

Matching persons at dramatically different resolutions has not been extensively studied, partly because of the limitations of current ReID benchmark datasets. Most widely used benchmark datasets consist of person images with limited resolution variations. CAVIAR [Cheng et al., 2011] is specifically collected to consider two levels of resolution.
MLR-VIPeR and MLR-CUHK03 [Jiao et al., 2018] are adapted from VIPeR [Gray and Tao, 2008] and CUHK03 [Li et al., 2014], respectively, by including three levels of resolution. These datasets have inspired many works on low-resolution person ReID [Li et al., 2015; Jing et al., 2015; Wang et al., 2016; Jiao et al., 2018; Wang et al., 2018b], yet few efforts address person images with a large range of resolution variance.

Traditional methods [Li et al., 2015; Jing et al., 2015; Wang et al., 2016] address person ReID with varying person resolutions mainly by learning a shared feature space between low and high resolutions. Recent approaches focus on deep learning based Super-Resolution (SR) [Jiao et al., 2018; Wang et al., 2018b]. Although SR methods can recover some visual details, they do not differentiate person foregrounds from backgrounds and are not optimized for person ReID, i.e., their goal is to minimize a pixel-level L2 loss rather than the person ReID error. In practice, SR methods are not capable of fully recovering the missing details in low-resolution images. We argue that the person feature extractor should be explicitly designed and optimized to combat the challenging resolution variance in real-world scenarios.

This paper proposes to jointly optimize person image resolution and feature extraction for person ReID. Specifically, we propose a deep network consisting of two modules. The Foreground-Focus Super-Resolution (FFSR) module upscales the resolution of an input image using a fully convolutional auto-encoder with skip connections. Different from general SR modules, FFSR is jointly trained with the person ReID loss and a foreground-focus loss, which recovers the details on the person body and suppresses the cluttered background. The subsequent Resolution-Invariant Feature Extractor (RIFE) extracts person representations for person ReID. RIFE consists of several feature learning blocks, each of which adopts two CNN branches to learn features from low- and high-resolution images, respectively. This design learns more dedicated feature extractors for low-resolution inputs. In other words, RIFE explicitly differentiates high- and low-resolution inputs during feature learning to ensure its robustness to resolution variance. Features from the two branches are fused with weights predicted by a Dual-Stream Block (DSB) to form the resolution-invariant feature.

By jointly training FFSR and RIFE, our approach achieves consistent improvements on three multi-resolution ReID datasets, i.e., CAVIAR [Cheng et al., 2011], MLR-VIPeR, and MLR-CUHK03 [Jiao et al., 2018]. Besides those three datasets, we also construct two large datasets with large variations of person resolution, i.e., VR-Market1501 and VR-MSMT17, by modifying Market1501 [Zheng et al., 2015] and MSMT17 [Wei et al., 2018], respectively. On these two datasets, our method also achieves promising performance. To the best of our knowledge, this is an original work that jointly considers foreground-focus super-resolution and multiple CNN branches for resolution-invariant representations in person ReID. Extensive ablation studies as well as comparisons on five datasets show the competitive performance of the proposed approach.

2 Related Work

This section briefly reviews low-resolution person ReID and image super-resolution, which are closely related to our work.

Low-Resolution Person Re-ID.
Some works use metric learning to address low-resolution person ReID, mainly by learning a shared feature space between low and high resolutions. For example, JUDEA [Li et al., 2015] optimizes the distance between images of different resolutions by requiring features of the same person to be close to each other. SLD2L [Jing et al., 2015] uses semi-coupled low-rank dictionary learning to build the mapping between features from low- and high-resolution images. SDF [Wang et al., 2016] learns a discriminating surface to separate feasible and infeasible functions in the scale distance function space. [Chen et al., 2019] uses an adversarial loss and a reconstruction loss to decrease the distance between deep features extracted from different resolutions. Other works use deep learning based Super-Resolution (SR). CSR-GAN [Wang et al., 2018b] focuses on the super-resolution part and uses a deep cascaded SR-GAN as well as several hand-crafted constraints to enhance the image resolution. SING [Jiao et al., 2018] adds a super-resolution network before feature extraction and trains the two networks jointly.

Image Super-Resolution. Super-resolution benefits from the advances of deep models. SRCNN [Dong et al., 2014] first introduces a fully convolutional network for image super-resolution. Many works [Kim et al., 2016; Tai et al., 2017] have been proposed by designing deeper, wider, and denser network architectures. SRGAN [Ledig et al., 2017] designs additional loss functions to recover more semantic cues. These are general SR models that do not take image content into account. SFTGAN [Wang et al., 2018a] uses image segmentation to help texture super-resolution. [Yu et al., 2018] use an encoder-decoder structure to leverage attributes, and use a GAN [Goodfellow et al., 2014] and an STN [Jaderberg et al., 2015] to make the generated faces appear realistic. Different from SING and CSR-GAN, our FFSR focuses on the person foreground, and RIFE learns different feature extractors for high- and low-resolution images. This further enhances the robustness to resolution variance.

3 Problem Formulation

In surveillance videos, a person image can be regarded as a sample of one person captured by a camera, where the resolution is determined by shooting parameters such as sensor resolution, shooting distance, camera focus, imaging processor, etc., i.e.,

$$I^r_i = \text{camera-sample}(P_i, \theta), \tag{1}$$

where $I^r_i$ is a person image with resolution $r$ and image index $i$, $P_i$ denotes the person ID label of $I^r_i$, and $\theta$ denotes the shooting parameters. It is hard to precisely define the resolution $r$, because the parameters $\theta$ can be complicated. For simplicity, for an image $I^r_i$ in a dataset $D$, we use the scalar $r \in [0, 1]$, computed as $\text{width}(I^r_i) / \text{width}_{max}$, as its resolution, where $\text{width}_{max}$ is the maximum image width in $D$. For example, with $\text{width}_{max} = 96$, resizing an original $128 \times 48$ image to $64 \times 24$ degrades its resolution from 0.5 to 0.25. To keep this definition simple, we note that enlarging an image with interpolation does not enhance its resolution.
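As a concrete reading of this definition, the following is a minimal sketch of how the resolution scalar can be computed and how a lower-resolution sample can be simulated by down-sampling; the function names are our own illustration and are not part of the paper.

```python
from PIL import Image

def resolution_scalar(img: Image.Image, width_max: int) -> float:
    """r = width(I) / width_max, clipped to [0, 1]."""
    return min(img.width / width_max, 1.0)

def degrade_resolution(img: Image.Image, target_r: float, width_max: int) -> Image.Image:
    """Simulate a lower-resolution sample by down-sampling to width = target_r * width_max.
    Enlarging the result back with interpolation would not restore the lost details."""
    new_w = max(1, int(round(target_r * width_max)))
    new_h = max(1, int(round(img.height * new_w / img.width)))
    return img.resize((new_w, new_h))

# Example: with width_max = 96, a 48-pixel-wide image has r = 0.5;
# resizing it to 24 pixels wide degrades r to 0.25.
```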
The task of person ReID can be described as matching a query person against a gallery of collected person images using a feature representation $f$, with the goal of minimizing the distance between images of the same person while maintaining larger distances between images of different persons. Considering the variance of image resolution, we denote the objective function $O$ of person ReID as

$$\min_f O(r_1, r_2) = D_{sim}(r_1, r_2) / D_{dif}(r_1, r_2),$$
$$D_{sim}(r_1, r_2) = \sum_{P_i = P_j} \| f^{r_1}_i - f^{r_2}_j \|^2_2, \qquad D_{dif}(r_1, r_2) = \sum_{P_i \neq P_j} \| f^{r_1}_i - f^{r_2}_j \|^2_2, \tag{2}$$

where $\|\cdot\|^2_2$ computes the distance between feature vectors, and $D_{sim}(\cdot)$ and $D_{dif}(\cdot)$ accumulate the distances between image pairs of the same person and of different persons, respectively. We use superscripts $r_1$ and $r_2$ to denote the resolutions of the two images considered in the distance computation.

Before proceeding to the formulation of our algorithm, we first illustrate the effect of resolution variance on person ReID performance on two large ReID datasets, Market1501 [Zheng et al., 2015] and MSMT17 [Wei et al., 2018]. We train a ResNet50 baseline [He et al., 2016] as the feature extractor, then compute $O(r_1, r_2)$ on the two datasets with different combinations of $r_1$ and $r_2$. Fig. 2 (a) fixes $r_1 = r_2$ and increases their values from 0.125 to 1. We observe that lower resolution leads to larger $O$, resulting in lower person ReID accuracy. Fig. 2 (b) fixes $r_2 = 1$ and increases $r_1$ from 0.125 to 1. It is clear that a larger variance of resolution corresponds to increased person ReID difficulty. We also observe that the curves in Fig. 2 (b) are steeper than the ones in Fig. 2 (a), indicating that varied-resolution ReID could be more challenging than the low-resolution case.

Figure 2: Values of the objective function O in Eq. (2) computed with varying resolutions on MSMT17 and Market1501. (a) fixes r1 = r2 and increases them from 0.125 to 1. (b) fixes r2 = 1 and increases r1 from 0.125 to 1. It verifies that both low resolution and varied resolution increase the difficulty of person ReID.

Our solution is inspired by the above observations: to improve person ReID accuracy, two compared images should 1) be of high resolution and 2) have similar resolutions. The person image resolution should hence be enhanced to recover visual details. To facilitate feature extraction, the SR model is expected to focus on the person foreground and suppress the cluttered background. Meanwhile, the feature extractor should be able to alleviate the resolution variance. These two intuitions correspond to the two modules in our network, i.e., the Foreground-Focus Super-Resolution (FFSR) and the Resolution-Invariant Feature Extractor (RIFE), respectively.

For an input person image $I^r_i$, FFSR first enhances its resolution to $r'$, $r' \geq r$; the enhanced image is then processed by RIFE for resolution-invariant feature extraction. The forward computation of our network can be denoted as

$$I^{r'}_i = M_{FFSR}(I^r_i), \qquad f_i = M_{RIFE}(I^{r'}_i), \tag{3}$$

where $f_i$ is the final feature and $M_{FFSR}$ and $M_{RIFE}$ denote the two modules, respectively. With a training set $T = \{(I^r_i, I^h_i, P_i)\}, i = 1, \dots, N$, where $I^h_i$ is the ground-truth high-resolution image and $P_i$ is the person ID label, the network is optimized with two losses computed on the two modules, i.e.,

$$\min \sum_{i=1:N} L_{FFSR}(I^r_i) + \alpha L_{RIFE}(I^r_i), \tag{4}$$

where $\alpha$ balances the two losses.
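To make Eq. (3) and Eq. (4) concrete, here is a minimal PyTorch sketch of the two-module forward pass and the combined training loss; the module internals and all names are placeholders rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """Eq. (3): FFSR enhances the input image, RIFE extracts the final feature."""
    def __init__(self, ffsr: nn.Module, rife: nn.Module):
        super().__init__()
        self.ffsr = ffsr    # M_FFSR: auto-encoder with skip connections
        self.rife = rife    # M_RIFE: stack of Dual-Stream Blocks

    def forward(self, img):
        img_sr = self.ffsr(img)     # I^{r'}_i = M_FFSR(I^r_i)
        feat = self.rife(img_sr)    # f_i = M_RIFE(I^{r'}_i)
        return img_sr, feat

def total_loss(loss_ffsr: torch.Tensor, loss_rife: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Eq. (4): the FFSR loss (Eq. (5)) and the RIFE loss (Eq. (9)) are summed with weight alpha."""
    return loss_ffsr + alpha * loss_rife
```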
The following section introduces our network architecture and the implementations of these two loss functions.

4 Proposed Methods

Our network architecture is illustrated in Fig. 3. This section introduces the FFSR and RIFE modules, respectively.

Figure 3: The architecture of our network, which consists of two modules: Foreground-Focus Super-Resolution (FFSR) and Resolution-Invariant Feature Extractor (RIFE). FFSR is an auto-encoder with skip connections trained with both the person ReID loss and a foreground-focus super-resolution loss $L_{FFSR}$. RIFE consists of several Dual-Stream Blocks (DSB), each progressively learning resolution-invariant features through two CNN streams. Features from the two streams are fused with weights learned by a resolution weighting loss $L_R$. RIFE finally outputs a 256-d feature, which is then used to compute the cross-entropy loss $L_X$.

4.1 Foreground-Focus Super-Resolution

As the initial stage before feature extraction, the FFSR module should be compact and efficient to compute. Additionally, FFSR is expected to work with varied resolutions, e.g., perform super-resolution on low-resolution inputs while preserving the original details of high-resolution inputs. Instead of following existing SR models [Kim et al., 2016; Tai et al., 2017], we use the light-weight FFSR module illustrated in Fig. 3. FFSR is implemented with an auto-encoder architecture. The first several convolutional layers down-sample the input with a stride of 2. Then, small convolutional kernels with a stride of 1 are applied for feature extraction. Following RED-Net [Mao et al., 2016] and U-Net [Ronneberger et al., 2015], we add symmetric skip connections between low and high layers. Skip connections preserve the visual cues of the original image and hence help enhance the quality of the reconstructed image.

The pixel-wise Mean Square Error (MSE) is commonly applied for SR model training. Simply minimizing the MSE may not be optimal for the person ReID task, because it does not differentiate person foregrounds from backgrounds, while person foregrounds generally provide more valuable cues for person ReID. To recover more visual cues on person foregrounds and suppress cluttered backgrounds, we propose the foreground-focus SR loss $L_{FFSR}$, i.e.,

$$L_{FFSR}(I^r_i) = \| M \odot (I^{r'}_i - I^h_i) \|^2_2, \tag{5}$$

where $\odot$ denotes element-wise multiplication, $I^{r'}_i$ is the FFSR output, and $M$ is a mask of the same size as $I^{r'}_i$. Our method is compatible with different mask generation strategies. Image segmentation algorithms such as [Insafutdinov et al., 2016] can be applied to generate binary foreground masks. Moreover, with a well-trained person bounding box detector, person foregrounds are more likely to appear in the center of bounding boxes, so Gaussian kernels can simply be applied as foreground masks, as illustrated in Fig. 3.
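To make Eq. (5) concrete, below is a minimal sketch of the foreground-focus loss with a centered Gaussian mask; the mask parameterization, the use of a mean instead of a sum, and all names are our own illustration rather than the released code.

```python
import torch

def gaussian_mask(height: int, width: int, sigma: float = 0.5) -> torch.Tensor:
    """Centered 2-D Gaussian used as a soft foreground mask (the parameterization is an assumption)."""
    ys = torch.linspace(-1.0, 1.0, height).view(height, 1)
    xs = torch.linspace(-1.0, 1.0, width).view(1, width)
    return torch.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))

def ffsr_loss(img_sr: torch.Tensor, img_hr: torch.Tensor) -> torch.Tensor:
    """Eq. (5): masked error between the FFSR output and the HR ground truth.
    img_sr and img_hr have shape (B, C, H, W); the mask broadcasts over batch and channels."""
    mask = gaussian_mask(img_sr.size(2), img_sr.size(3)).to(img_sr)
    return ((mask * (img_sr - img_hr)) ** 2).mean()
```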
4.2 Resolution-Invariant Feature Extractor

Since super-resolution is an ill-posed problem, solely applying FFSR is not strong enough to achieve resolution invariance. We further design RIFE to generate resolution-invariant features. As illustrated in Fig. 1, high- and low-resolution images convey substantially different amounts of visual cues, so they should be treated with different feature extractors. RIFE hence explicitly separates high- and low-resolution images into two feature extraction streams. As shown in Fig. 3, RIFE consists of several Dual-Stream Blocks (DSB), each of which introduces two feature extraction streams with an identical architecture but different training objectives. The following first introduces the forward procedure of RIFE, then discusses its training objectives.

In RIFE, each DSB applies two streams of convolutional layers to extract feature maps for high- and low-resolution inputs, respectively. For the t-th DSB, we denote its two streams as $DSB^L_t$ and $DSB^H_t$, and their generated feature maps as $m^L_t$ and $m^H_t$, where the superscripts $L$ and $H$ denote the low- and high-resolution streams, respectively. $m^L_t$ and $m^H_t$ are adaptively fused as the output of the DSB to achieve better robustness to resolution variance. For example, $m^L_t$ is fused with a larger weight for low-resolution images, because $DSB^L_t$ is better suited for feature extraction on low-resolution images. We denote the computation of the output feature $m_t$ of the t-th DSB as

$$m_t = w^L_t m^L_t + w^H_t m^H_t, \tag{6}$$

where $w^L_t$ and $w^H_t$ are related to the resolution of the input image. For high-resolution images, $w^L_t$ should be smaller than $w^H_t$, and vice versa. As shown in Fig. 3, these two weights are predicted with two FC layers based on $m^L_t$ and $m^H_t$, respectively. In order to learn $w^L_t$ and $w^H_t$, we introduce the resolution weighting loss $L^R$ into each DSB. With a training image $I^r_i$, the loss $L^R_t$ for the t-th DSB is defined as

$$L^R_t(I^r_i) = \| w^L_t - (1 - r) \|^2_2 + \| w^H_t - r \|^2_2, \tag{7}$$

where $r$ denotes the resolution of $I^r_i$.

The fused feature map $m_t$ is propagated to the next DSB. Stacking multiple DSBs leads to a deep neural network with strong feature learning capability. The output of the final DSB is processed with a Global Average Pooling (GAP) layer and a Fully Connected (FC) layer to produce the final feature $f$. Another FC layer is trained on $f$ to predict person ID labels. A cross-entropy loss is computed as the person ReID loss, i.e.,

$$L_X(I^r_i) = \text{CrossEntropy}(\text{FC}(f_i), P_i), \tag{8}$$

where $P_i$ denotes the person ID label of the training image $I^r_i$. With $T$ DSBs in total, RIFE is trained with one cross-entropy loss and $T$ resolution weighting losses. The RIFE loss on a training image $I^r_i$ can be represented as

$$L_{RIFE}(I^r_i) = L_X(I^r_i) + \beta \sum_{t=1:T} L^R_t(I^r_i), \tag{9}$$

where the parameter $\beta$ weights the two losses.

Fusing features with Eq. (6) enforces $DSB^L$ and $DSB^H$ to focus on low- and high-resolution images during training. For low-resolution images, the back-propagated person ReID loss makes larger modifications to $DSB^L$ than to $DSB^H$ because of the larger $w^L$. This mechanism learns different parameters for $DSB^L$ and $DSB^H$, and leads to a strong resolution-invariant representation. Implementation details of RIFE are presented in Sec. 5.2.
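The sketch below illustrates one Dual-Stream Block in the spirit of Eq. (6) and Eq. (7); the wrapped backbone block, the pooling before the weight heads, the activations, and all names are our own assumptions, not the released implementation.

```python
import copy
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Sketch of one DSB: two streams with identical architecture, fused by predicted weights."""
    def __init__(self, backbone_block: nn.Module, out_channels: int):
        super().__init__()
        self.stream_low = backbone_block                  # DSB^L_t
        self.stream_high = copy.deepcopy(backbone_block)  # DSB^H_t (duplicated layers)
        # Small heads predict the fusion weights (output channels 64 then 1, as in Sec. 5.2).
        self.weight_low = nn.Sequential(nn.Linear(out_channels, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
        self.weight_high = nn.Sequential(nn.Linear(out_channels, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x, resolution=None):
        m_low, m_high = self.stream_low(x), self.stream_high(x)
        pooled_low = m_low.mean(dim=(2, 3))      # pool feature maps for weight prediction
        pooled_high = m_high.mean(dim=(2, 3))
        w_low = self.weight_low(pooled_low).view(-1, 1, 1, 1)
        w_high = self.weight_high(pooled_high).view(-1, 1, 1, 1)
        out = w_low * m_low + w_high * m_high    # Eq. (6)
        loss_r = None
        if resolution is not None:               # Eq. (7): resolution weighting loss
            r = resolution.view(-1, 1, 1, 1)
            loss_r = ((w_low - (1.0 - r)) ** 2 + (w_high - r) ** 2).mean()
        return out, loss_r
```

Stacking four such blocks, followed by global average pooling and an FC layer, would mirror the RIFE structure described in Sec. 5.2.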
5 Experiments

5.1 Datasets

We evaluate our methods on five datasets, including three existing datasets and two datasets we constructed.

CAVIAR [Cheng et al., 2011] contains 1,220 images of 72 identities, captured by one High-Resolution (HR) camera and one Low-Resolution (LR) camera. Among the 72 identities, 50 have images from both cameras. Those 50 identities are divided into an LR query set and an HR gallery set.

MLR-VIPeR and MLR-CUHK03 are constructed from VIPeR [Gray and Tao, 2008] and CUHK03 [Li et al., 2014], both of which were captured by two cameras. Following SING [Jiao et al., 2018], every image from one camera is down-sampled with a ratio evenly selected from {1/2, 1/3, 1/4} to form the query set, while original images from the other camera are used as the test set. MLR-VIPeR has 316 identities for training and testing. MLR-CUHK03 has 1,367 identities for training and 100 for testing.

VR-Market1501 and VR-MSMT17 are constructed by us based on Market1501 [Zheng et al., 2015] and MSMT17 [Wei et al., 2018], respectively. VR-Market1501 contains 32,217 images of 1,501 people captured by 6 cameras. VR-MSMT17 consists of 126,441 images of 4,101 persons from 15 cameras. All images are down-sampled so that the image width falls within the range [8, 32) in VR-Market1501 and [32, 128) in VR-MSMT17. Hence these two datasets present 24 and 96 different resolutions, respectively. We keep the original training/testing divisions, i.e., 751 and 710 identities for training and testing on VR-Market1501, and 1,041 and 3,060 identities for training and testing on VR-MSMT17. Compared with existing datasets, VR-Market1501 and VR-MSMT17 are substantially larger and more challenging, because both query and gallery images show a large range of resolution variance.

5.2 Implementation Details

Our FFSR module is a 12-layer fully convolutional network. Two convolutional layers with a stride of 2 and two transposed convolutional layers are applied to down-sample and up-sample the feature maps, respectively. We use ResNet50 [He et al., 2016] as the backbone of the RIFE module. Each main block of ResNet50 is modified into a DSB by duplicating its convolutional layers as $DSB^L$ and $DSB^H$, respectively. Following ResNet50, our RIFE has 4 DSBs. Two FC layers with output channels of 64 and 1 are used for $w^L$ and $w^H$ prediction.

Our network is trained in PyTorch with Stochastic Gradient Descent (SGD). Training proceeds in three steps. 1) We initialize and pre-train the FFSR model on ImageNet [Russakovsky et al., 2015] with the MSE loss, then fine-tune it on the person ReID training datasets with $L_{FFSR}$. 2) We initialize and fine-tune the RIFE module on the target dataset with $L_{RIFE}$. 3) The FFSR and RIFE modules are jointly trained with the loss function in Eq. (4). We fix the hyperparameters as $\alpha = 1$ and $\beta = 0.1$ for all datasets. Each step runs for 60 epochs with a batch size of 32. The initial learning rate is 0.01 in the first two steps and 0.001 in the final step, and is reduced by a factor of ten after 30 epochs. Input images are resized to 256×128 on VR-Market1501 and 384×128 on the other datasets. The final 256-D feature is used for ReID with the Euclidean distance. All experiments are conducted on a GTX 1080Ti GPU with an Intel i7 CPU and 128 GB memory.
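A minimal sketch of the optimization setup described above is given below; the helper name and the momentum value are our own assumptions.

```python
import torch

def make_optimizer_and_scheduler(params, base_lr: float):
    """Sketch of the per-step schedule in Sec. 5.2 (momentum value is an assumption)."""
    optimizer = torch.optim.SGD(params, lr=base_lr, momentum=0.9)
    # The learning rate is divided by 10 after 30 of the 60 epochs in each step.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
    return optimizer, scheduler

ALPHA, BETA = 1.0, 0.1            # loss weights fixed for all datasets
EPOCHS_PER_STEP, BATCH_SIZE = 60, 32
STEP_LRS = [0.01, 0.01, 0.001]    # step 1: FFSR, step 2: RIFE, step 3: joint training
```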
5.3 Ablation Study

Validity of FFSR: To show the validity of our FFSR module, we fix the feature extraction module as ResNet50 and test different super-resolution methods, including bilinear interpolation, SRCNN [Dong et al., 2014], and VDSR [Kim et al., 2016], as well as variants of our module, i.e., a baseline Auto-Encoder (AE), FFSR trained with the Gaussian mask, FFSR trained with the segmentation mask produced by DeeperCut [Insafutdinov et al., 2016], and FFSR trained with or without the person ReID loss. The experiments are conducted on the large VR-MSMT17. We illustrate the experimental results in Table 1.

| SR model | Mask | JL | Rank-1 | Rank-5 | FLOPs (G) |
|---|---|---|---|---|---|
| Bilinear | - | - | 47.6 | 66.4 | - |
| SRCNN | - | - | 46.7 | 65.0 | 2.53 |
| VDSR | - | - | 48.5 | 67.7 | 32.79 |
| AE | - | - | 48.1 | 67.1 | 3.71 |
| FFSR | Gaussian | - | 49.4 | 67.8 | 3.71 |
| FFSR | Gaussian | ✓ | 52.8 | 69.0 | 3.71 |
| FFSR | DeeperCut | ✓ | 52.9 | 69.2 | 3.71 |

Table 1: ReID performance of different SR methods on VR-MSMT17. FLOPs shows the SR complexity. JL denotes joint learning with the ReID loss.

In Table 1, with the Gaussian mask our method outperforms the baseline AE, indicating the validity of emphasizing the person foreground in SR for person ReID. It is also clear that jointly training with the person ReID loss substantially boosts the ReID accuracy. Conveying more accurate foreground locations, the segmentation mask further outperforms the Gaussian mask. We also compare the computational complexity of FFSR with other super-resolution methods. FFSR introduces only marginal computational overhead over the compact SRCNN [Dong et al., 2014], but shows substantially better performance, e.g., outperforming SRCNN by 6.2% in Rank-1 accuracy. FFSR also substantially outperforms VDSR in terms of both accuracy and complexity.

Validity of RIFE: To show the validity of the RIFE module, we fix the super-resolution module as bilinear interpolation and compare RIFE with three feature extractors: a) the ResNet50 baseline, b) two ResNet50s with their features fused with equal weights, and c) two ResNet50s with their features fused with weights learned with Eq. (6). We summarize the experimental results in Table 2.

| Structure | Weight learning | Rank-1 | Rank-5 |
|---|---|---|---|
| ResNet50 | - | 47.6 | 66.4 |
| Two ResNet50 | - | 49.1 | 68.2 |
| Two ResNet50 | ✓ | 50.4 | 67.4 |
| RIFE | ✓ | 53.3 | 70.1 |

Table 2: Performance of different feature extractors on VR-MSMT17.

Table 2 shows that increasing the number of network parameters by fusing two ResNet50s brings only marginal improvement over the single-branch ResNet50 baseline. Fusing two ResNet50s with learned weights brings a 1.3% improvement in Rank-1 accuracy. Among the compared methods, the RIFE module achieves the best performance, substantially outperforming the baseline by 5.7% in Rank-1 accuracy.

Effect on the Objective Function: We further show the effect of FFSR and RIFE on the person ReID objective function defined in Eq. (2). Referring to Fig. 2, we illustrate the effect of FFSR and RIFE in Fig. 4.

Figure 4: Effects of FFSR and RIFE on the objective function O in Eq. (2); the curves compare ResNet50, FFSR, RIFE, and FFSR+RIFE. This figure follows the configurations of Fig. 2. It is clear that FFSR and RIFE boost the robustness to resolution variance.

It is clear that both FFSR and RIFE decrease the objective O, indicating improved ReID accuracy. It is also clear that both FFSR and RIFE decrease the slope of the original curves, implying that they improve the robustness of a ReID system to resolution variance. Combining FFSR and RIFE brings the best performance.
The following part compares our methods with recent works.

5.4 Comparison with Recent Works

We compare our method with five recent low-resolution ReID methods, including three traditional methods, i.e., JUDEA [Li et al., 2015], SLD2L [Jing et al., 2015], and SDF [Wang et al., 2016], and two deep learning based methods, i.e., SING [Jiao et al., 2018] and CSR-GAN [Wang et al., 2018b]. Three deep neural networks, i.e., ResNet50, DenseNet121 [Huang et al., 2017], and SE-ResNet50 [Hu et al., 2018], are also implemented and compared. We summarize the experimental results on the five datasets in Table 3, which reports the published performance of JUDEA, SLD2L, SDF, SING, and CSR-GAN on CAVIAR, MLR-VIPeR, and MLR-CUHK03. The performance of compared methods on VR-Market1501 and VR-MSMT17 is obtained with the code provided by their authors.

| Method | CAVIAR R-1 | R-5 | MLR-VIPeR R-1 | R-5 | MLR-CUHK03 R-1 | R-5 | VR-Market1501 R-1 | R-5 | VR-MSMT17 R-1 | R-5 |
|---|---|---|---|---|---|---|---|---|---|---|
| DenseNet121 [Huang et al., 2017] | 31.1 | 65.5 | 31.4 | 63.1 | 70.8 | 91.3 | 60.0 | 78.8 | 51.2 | 67.4 |
| SE-ResNet50 [Hu et al., 2018] | 30.8 | 65.1 | 33.5 | 63.6 | 70.8 | 92.3 | 58.2 | 78.6 | 52.3 | 68.9 |
| JUDEA [Li et al., 2015] | 20.0 | 60.1 | 26.0 | 55.1 | 26.2 | 58.0 | - | - | - | - |
| SLD2L [Jing et al., 2015] | 18.4 | 44.8 | 20.3 | 44.0 | - | - | - | - | - | - |
| SDF [Wang et al., 2016] | 14.3 | 37.5 | 9.52 | 38.1 | 22.2 | 48.0 | - | - | - | - |
| SING [Jiao et al., 2018] | 33.5 | 72.7 | 33.5 | 57.0 | 67.7 | 90.7 | 60.5 | 81.8 | 52.1 | 68.3 |
| CSR-GAN [Wang et al., 2018b] | 32.3 | 70.9 | 37.2 | 62.3 | 70.7 | 92.1 | 59.8 | 81.3 | 51.9 | 67.5 |
| ResNet50 | 29.6 | 64.0 | 29.9 | 62.2 | 67.4 | 91.7 | 57.0 | 78.7 | 47.6 | 66.4 |
| FFSR | 31.1 | 68.7 | 40.3 | 65.3 | 70.5 | 92.3 | 59.2 | 80.1 | 52.8 | 69.0 |
| RIFE | 35.7 | 74.9 | 33.9 | 63.6 | 69.7 | 91.5 | 62.6 | 82.4 | 53.3 | 70.1 |
| FFSR+RIFE | 36.4 | 72.0 | 41.6 | 64.9 | 73.3 | 92.6 | 66.9 | 84.7 | 55.5 | 72.4 |

Table 3: Comparison with recent works on five datasets (Rank-1 and Rank-5 accuracy, %).

From the comparison we observe that deep learning based methods substantially outperform the traditional ones. Our method shows promising performance on the first three datasets. On CAVIAR, our RIFE module outperforms the recent CSR-GAN by 3.4% in Rank-1 accuracy. Combining FFSR and RIFE further boosts the performance, outperforming CSR-GAN by 4.1%. On MLR-VIPeR and MLR-CUHK03, our method outperforms CSR-GAN by 4.4% and 2.6% in Rank-1 accuracy, respectively.

Our method also shows promising performance on VR-Market1501 and VR-MSMT17. Among the compared existing methods, SING shows the best performance on VR-Market1501, and FFSR+RIFE outperforms SING by 6.4%. FFSR+RIFE also achieves the best performance on VR-MSMT17, outperforming SE-ResNet50 by 3.5%. It can be observed that combining FFSR and RIFE consistently leads to the best performance on the five datasets. We show some results of image super-resolution and person ReID in Fig. 5.

Figure 5: Sample results of person ReID and super-resolution on VR-MSMT17. Our method (second row) outperforms ResNet50 (first row). It is interesting to observe that, without joint learning with the person ReID loss, the FFSR module yields lower ReID accuracy but produces images with better visual quality.

6 Conclusion

This paper proposes a deep neural network composed of FFSR and RIFE modules for resolution-invariant person re-identification. FFSR upscales the person foreground using a fully convolutional auto-encoder with skip connections learned with a foreground-focus loss. RIFE adopts two feature extraction streams weighted by a dual-attention block to learn features for low- and high-resolution images, respectively. These two complementary modules are jointly trained to optimize the person ReID objective, leading to a strong resolution-invariant representation. Extensive experiments on five datasets have shown the validity of the introduced components and the promising performance of our methods.

Acknowledgments

This work is supported in part by Beijing Natural Science Foundation under Grant No. JQ18012, in part by the Natural Science Foundation of China under Grants No. 61620106009, 61572050, and 91538111, and in part by Peng Cheng Laboratory.

References

[Chen et al., 2019] Yun-Chun Chen, Yu-Jhe Li, Xiaofei Du, and Yu-Chiang Frank Wang. Learning resolution-invariant deep representations for person re-identification. In AAAI, 2019.
[Cheng et al., 2011] Dong Seon Cheng, Marco Cristani, Michele Stoppa, Loris Bazzani, and Vittorio Murino. Custom pictorial structures for re-identification. In BMVC. Citeseer, 2011.
[Dong et al., 2014] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184-199. Springer, 2014.
[Goodfellow et al., 2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672-2680, 2014.
[Gray and Tao, 2008] Douglas Gray and Hai Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, pages 262-275. Springer, 2008.
[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770-778, 2016.
[Hu et al., 2018] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In CVPR, pages 7132-7141, 2018.
[Huang et al., 2017] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 2261-2269, 2017.
[Insafutdinov et al., 2016] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, pages 34-50. Springer, 2016.
[Jaderberg et al., 2015] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In NIPS, pages 2017-2025, 2015.
[Jiao et al., 2018] Jiening Jiao, Wei-Shi Zheng, Ancong Wu, Xiatian Zhu, and Shaogang Gong. Deep low-resolution person re-identification. In AAAI, 2018.
[Jing et al., 2015] Xiao-Yuan Jing, Xiaoke Zhu, Fei Wu, Xinge You, Qinglong Liu, Dong Yue, Ruimin Hu, and Baowen Xu. Super-resolution person re-identification with semi-coupled low-rank discriminant dictionary learning. In CVPR, pages 695-704, 2015.
[Kim et al., 2016] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In CVPR, pages 1646-1654, 2016.
[Ledig et al., 2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P. Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, volume 2, page 4, 2017.
[Li et al., 2014] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, pages 152-159, 2014.
[Li et al., 2015] Xiang Li, Wei-Shi Zheng, Xiaojuan Wang, Tao Xiang, and Shaogang Gong. Multi-scale learning for low-resolution person re-identification. In ICCV, pages 3765-3773, 2015.
[Li et al., 2019] Jianing Li, Shiliang Zhang, and Tiejun Huang. Multi-scale 3D convolution network for video based person re-identification. In AAAI, 2019.
[Mao et al., 2016] Xiaojiao Mao, Chunhua Shen, and Yu-Bin Yang. Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In NIPS, pages 2802-2810, 2016.
[Ronneberger et al., 2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234-241. Springer, 2015.
[Russakovsky et al., 2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211-252, 2015.
[Tai et al., 2017] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In CVPR, volume 1, page 5, 2017.
[Wang et al., 2016] Zheng Wang, Ruimin Hu, Yi Yu, Junjun Jiang, Chao Liang, and Jinqiao Wang. Scale-adaptive low-resolution person re-identification via learning a discriminating surface. In IJCAI, pages 2669-2675, 2016.
[Wang et al., 2018a] Xintao Wang, Ke Yu, Chao Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. In CVPR, pages 606-615, 2018.
[Wang et al., 2018b] Zheng Wang, Mang Ye, Fan Yang, Xiang Bai, and Shin'ichi Satoh. Cascaded SR-GAN for scale-adaptive low resolution person re-identification. In IJCAI, pages 3891-3897, 2018.
[Wei et al., 2018] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer GAN to bridge domain gap for person re-identification. In CVPR, pages 79-88, 2018.
[Yu et al., 2018] Xin Yu, Basura Fernando, Richard Hartley, and Fatih Porikli. Super-resolving very low-resolution face images with supplementary attributes. In CVPR, pages 908-917, 2018.
[Zheng et al., 2015] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In ICCV, pages 1116-1124, 2015.