# Human-in-the-Loop Vehicle Re-ID

Zepeng Li¹, Dongxiang Zhang¹\*, Yanyan Shen², Gang Chen¹
¹ Key Lab of Intelligent Computing Based Big Data of Zhejiang Province, Zhejiang University
² Department of Computer Science and Engineering, Shanghai Jiao Tong University
{lizepeng,zhangdongxiang,cg}@zju.edu.cn, shenyy@sjtu.edu.cn
\* Corresponding author

Copyright © 2023, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Vehicle Re-ID has been an active topic in computer vision, with a substantial number of deep neural models proposed as end-to-end solutions. In this paper, we approach the problem from a new perspective and present a variant called human-in-the-loop vehicle Re-ID, which leverages interactive (and possibly erroneous) human feedback signals for performance enhancement. This human-machine cooperation mode is orthogonal to existing Re-ID models. To avoid incremental training overhead, we propose an Interaction Re-ID Network (IRIN) that can directly accept the feedback signal as an input and adjust the embedding of the query image in an online fashion. IRIN is trained offline by simulating the human interaction process, with multiple optimization strategies to fully exploit the feedback signal. Experimental results show that even when interacting with flawed feedback generated by non-experts, IRIN still outperforms state-of-the-art Re-ID models by a considerable margin. If the feedback contains no false positives, IRIN boosts the mAP on VeRi-776 from 81.6% to 95.2% with only 5 rounds of interaction per query image.

## Introduction

Given a query image and an image gallery collected across multiple surveillance cameras, vehicle Re-ID retrieves the images that refer to the same real-world vehicle. The problem is challenging due to the presence of different viewpoints (Lou et al. 2019), low image resolution (Zhao et al. 2021), illumination changes (Liu et al. 2016a) and partial occlusions (Rao et al. 2021; Zhang et al. 2022). To overcome these challenges, state-of-the-art methods typically resort to devising advanced neural networks (Rao et al. 2021; Zhao et al. 2021) or effective loss functions (Yan et al. 2020; Quispe et al. 2021) to extract discriminative visual features, and they establish remarkable performance on the benchmark datasets. Nonetheless, achieving new breakthroughs via model improvement alone has become more and more challenging.

In this paper, we step outside this paradigm and propose a new one called human-in-the-loop vehicle Re-ID. It is orthogonal to existing Re-ID models and works in a human-machine cooperative mode that leverages iterative feedback for further performance improvement. Note that the quality of the feedback can be unreliable, because identifying two matching vehicles through their appearance is challenging even for humans; the annotation of vehicle Re-ID benchmark datasets often has to rely on additional clues such as license plates or travel routes.

Figure 1: An example of human-in-the-loop vehicle Re-ID.

Figure 1 depicts a toy example that conveys the overall idea. In the initial step, a list of images is returned, ranked by similarity to the query image using any existing Re-ID model. Afterwards, the user applies an operation to provide a feedback signal on the results. To avoid incurring cumbersome human effort, we define the operation as picking a positive match from a small set of uncertain candidates. Our goal is to develop a mechanism that effectively exploits the feedback signal and updates the order of images in the ranked list.
In the next round of interaction, a new subset of uncertain images is selected for human verification. This process repeats until the results are satisfactory or a maximum number of iterations is reached.

A straightforward approach to leveraging human feedback is to treat the human-picked positive sample as a new observation and apply incremental learning to update the Re-ID results. For instance, we can adopt the online incremental learning framework proposed in (He et al. 2020) to maintain an exemplar set for each vehicle class. In each iteration, this set is updated to incorporate new training samples derived from human feedback. The network retains previous knowledge as part of the final classifier and is re-trained using the exemplar set. (Wu and Gong 2021) designs a more comprehensive learning objective that incorporates the coherence of classification, distribution and representation in a unified framework; the underlying motivation is to support lifelong Re-ID without forgetting. However, these incremental learning methods suffer from two major drawbacks when applied to the online and interactive scenario of Figure 1. First, the human feedback in each iteration is lightweight, as it only contains a positive sample selected from an uncertain candidate set; its effect is limited even with the assistance of data augmentation. Second, it incurs additional online training cost, which is unfriendly to real-time human-machine interaction.

To resolve these two issues, we propose a novel mechanism that better exploits the human feedback signal without incurring additional training cost. Our idea is to devise a network that accepts human feedback as part of its input and dynamically adjusts the embedding of the query image to reduce its distance to the positive samples. To achieve this goal, we propose an Interaction Re-ID Network (IRIN), which augments an existing Re-ID model with a Transformer (Wu et al. 2021) encoder equipped with a gating mechanism. In the offline training of IRIN, we simulate the human interaction process to generate the feedback signal, by constructing a selective uncertain candidate set for each vehicle in the training set and picking a positive sample using the ground-truth labels. In addition, we adopt supervised contrastive learning (Khosla et al. 2020) to pull the query image and its positive samples closer in the embedding space. Finally, IRIN is jointly trained with the backbone Re-ID model to minimize the identification loss and the contrastive loss.

We invited 20 postgraduates (10 female and 10 male students) for performance evaluation. It is possible for these students to pick false positives from the uncertain candidate sets and generate misleading feedback signals. Experimental results show that the mAP of IRIN increases steadily with more iterations, implying that it can effectively leverage the feedback signal. Even when IRIN is provided with flawed feedback, the new human-machine cooperation mode still surpasses pure machine models, student-based annotations and existing online learning frameworks. In the ideal case with perfect feedback, with only 5 rounds of interaction per query image, IRIN boosts the mAP on VeRi-776 (Liu et al. 2016b) from 81.6% to 95.2%.
At the end of the experiments, we also apply IRIN to the task of person Re-ID to validate its generality.

## Related Work

**Vehicle Re-ID:** The mainstream strategy for vehicle Re-ID is to learn robust and discriminative vehicle representations by devising advanced neural networks or effective loss functions. In the former category, (Zhao et al. 2021) proposes a heterogeneous relational complement network that combines region-specific features and cross-level features as a supplement to the original high-level output. (Khorramshahi et al. 2019) employs a dual adaptive attention mechanism to focus on the most informative key-points of a vehicle image. (He et al. 2021) proposes a Transformer-based Re-ID model, with a jigsaw patch module and side information embeddings to enhance robust feature learning; the model achieves state-of-the-art performance on vehicle Re-ID benchmark datasets. As to loss function improvement, Circle loss (Sun et al. 2020) is developed to achieve a more flexible and targeted pair-similarity optimization. To stabilize the triplet loss, (Ghosh, Shanmugalingam, and Lin 2021) proposes a grid-based motion-statistical feature matcher for relation-preserving triplet mining. (Quispe et al. 2021) uses an attribute-related cross-entropy loss and a triplet loss to distill crucial attribute information. Readers can refer to (Zakria et al. 2021) for a comprehensive survey.

**Human-in-the-Loop Visual Tasks:** We review human-in-the-loop visual tasks along the machine learning pipeline, covering the stages of data annotation, model training and online inference. 1) Data annotation: to improve data quality, (Berti-Equille 2019; Muthuraman et al. 2021) utilize model sensitivity to identify potentially incorrect labels for human verification. In (Liu et al. 2019), reinforcement learning is adopted in the task of person Re-ID to iteratively prioritize a set of data samples for human annotation. 2) Model training: this stage focuses on how to iteratively leverage human feedback to improve model performance. As mentioned, (He et al. 2020) and (Wu and Gong 2021) are two representative incremental learning strategies that work in the online scenario. 3) Online inference: in this stage, humans assist models to accomplish a task together and achieve better performance. For instance, (Brenner, Priyadarshi, and Itti 2016) leverages human feedback on online viewpoint adjustment to improve object detection confidence without additional training cost. (Stonebraker et al. 2020) presents a search-and-mark framework to facilitate surveillance tasks. Our work belongs to the online inference stage, i.e., human users work with Re-ID models in an online fashion, without the need to re-train the underlying model. The research challenges come from the requirements of real-time interaction and robustness to flawed feedback.

## Proposed Model

### Framework Overview

In this paper, we study iterative vehicle Re-ID with the assistance of human feedback for continuous performance improvement. Given a query image $I_q$, we define the human operation as picking the most promising image $I_p$ from a set of uncertain candidates $U$, and we represent the feedback as a positive matching pair $(I_q, I_p)$. Other types of human operations could also be explored; we leave this direction to future work. Given the feedback signal, our idea is to devise a network that accepts the human feedback as part of its input and dynamically adjusts the embedding of the query image to reduce its distance to the positive samples.
As shown in Figure 2, we propose an Interaction Re-ID Network (IRIN) that augments a backbone Re-ID model with a Transformer encoder equipped with a gating mechanism. In the offline training stage, we simulate the human interaction process to generate the feedback signal, by constructing a selective uncertain candidate set for each vehicle in the training set and picking a positive sample using the ground-truth labels. In addition, we treat the query image as an anchor and adopt supervised contrastive learning (Khosla et al. 2020) to pull together the anchor and its positive samples in the embedding space. Finally, IRIN is jointly trained to minimize the identification loss and the contrastive loss.

Figure 2: The network structure and training pipeline of IRIN.

### Backbone Re-ID Network

We adopt ResNeXt101 (Xie et al. 2017) appended with an Instance Batch Normalization (IBN) (Pan et al. 2018) layer as our backbone network, mainly due to its simplicity and popularity: it is easy to implement and has been widely applied in previous Re-ID models (Zhao et al. 2021; Ghosh, Shanmugalingam, and Lin 2021). IRIN is built on top of the backbone network and augments it with a module that fuses the human feedback and dynamically adjusts the feature of the query image.

### Human Feedback Simulation

Since IRIN treats the human feedback signal as part of its input, we need to simulate the procedure of human interaction in the offline training stage. In our simulation process, we randomly pick an image from each vehicle in the training set to constitute a query set $Q$. For each $I_q \in Q$, we determine a group of uncertain samples $U$ that require human assistance. More specifically, we adopt the popular sampling strategy in active learning and select uncertain samples based on the classification entropy of the model output:

$$x_U = \arg\max_{x \in X} \left( -\sum_{y} P_\theta(y \mid x) \log P_\theta(y \mid x) \right) \tag{1}$$

where $x$ refers to a vehicle image, $y$ is its class id, and $P_\theta$ is the vehicle classification probability under our model parameters $\theta$.

Since the human operation is defined as picking a matching image from a set of uncertain candidates, we need to select an instance $I_p$ from $U$ to generate the human feedback; the instances in $U$ that belong to the same vehicle as the query image are the ones that can improve the performance of the model on that query. Intuitively, the $I_p$ with the maximum distance to $I_q$ would be preferred, because it is the hardest sample that cannot be well resolved by the current model. However, we observe that users are inclined to select the sample that they find the most promising, i.e., the one they perceive as most likely to belong to the same vehicle as the query. This observation motivates us to simulate human behavior by selecting the $I_p \in U$ with the minimum distance to $I_q$.

Our offline training process also simulates another feature of human-in-the-loop Re-ID, namely that the iterative feedback is applied to the same query image over a number of consecutive iterations. Let $B$ be the batch size in model training. In each batch, we select $m$ vehicles, and each vehicle is assigned $B/m$ training samples. In the subsequent iterations, we train the model with only the samples from these $m$ vehicles until all of their samples have been accessed. Afterwards, we proceed to the next group of $m$ vehicles.
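As an illustration, the following is a minimal PyTorch sketch of the entropy-based candidate selection of Eq. (1) and the simulated minimum-distance pick. The function names, tensor shapes and the small numerical guard are our own assumptions for the sketch, not code from the paper.

```python
import torch

def select_uncertain_candidates(logits: torch.Tensor, u_size: int = 10) -> torch.Tensor:
    """Pick the u_size gallery images whose class posteriors have the
    highest entropy (Eq. 1), i.e. the samples the model is least sure about."""
    probs = torch.softmax(logits, dim=1)                      # (N, num_classes)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)  # (N,)
    return torch.topk(entropy, k=u_size).indices              # indices into the gallery

def simulate_human_pick(query_feat, cand_feats, cand_ids, query_id):
    """Offline simulation of the human operation: among the uncertain
    candidates that share the query's ground-truth id, return the one
    closest to the query (users tend to pick the most promising match)."""
    positives = (cand_ids == query_id).nonzero(as_tuple=True)[0]
    if positives.numel() == 0:
        return None                      # no true match in U -> no useful feedback
    dists = torch.cdist(query_feat.unsqueeze(0), cand_feats[positives]).squeeze(0)
    return positives[dists.argmin()]     # index of the minimum-distance positive
```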
### Feedback Fusion Module

We propose a Transformer-based encoder with a gating mechanism to fuse the query feature with the feedback signals. In the $k$-th iteration, the feedback signal $f^{(k)}$ (i.e., the visual feature of the picked image) is concatenated with the previous signals to form a matrix $F^{(k)} = [f^{(1)}, f^{(2)}, \ldots, f^{(k)}]$. Let $q^{(k-1)}$ denote the fused query feature from the $(k-1)$-th iteration. Our goal is to fuse $q^{(k-1)}$ and $F^{(k)}$ to derive a new query-specific feature $q^{(k)}$.

In our implementation, $q^{(k-1)}$ is first concatenated with $F^{(k)}$ and used as the input to the Transformer-based encoder. We use $[q^{(k-1)}_1, \hat f^{(1)}_1, \ldots, \hat f^{(k)}_1]$ to denote the output of the first encoder layer $E_1$. Since a feedback signal could be a false positive, we also devise a gating mechanism to determine the contribution of each encoded feature $\hat f^{(i)}_1$. In particular, $q^{(k-1)}_1$ and $\hat f^{(i)}_1$ ($i = 1, \ldots, k$) are concatenated to calculate a gate weight through a multi-layer perceptron (MLP) followed by a Sigmoid activation function. If the feedback refers to a positive match, we expect the weight output by the Sigmoid to be close to 1; otherwise, a small gate weight is preferred. As a trade-off between efficiency and model performance, we use a 2-layer encoder for the feature fusion between $q^{(k-1)}$ and $F^{(k)}$. Finally, $q^{(k)}$ is obtained by pooling the output of the two-layer encoder:

$$[q^{(k-1)}_1, \hat f^{(1)}_1, \ldots, \hat f^{(k)}_1] = E_1([q^{(k-1)}, f^{(1)}, \ldots, f^{(k)}]) \tag{2}$$

$$f^{(i)}_1 = \hat f^{(i)}_1 \cdot \mathrm{Sigmoid}(\mathrm{MLP}_1(q^{(k-1)}_1 \oplus \hat f^{(i)}_1)) \tag{3}$$

$$[q^{(k-1)}_2, \hat f^{(1)}_2, \ldots, \hat f^{(k)}_2] = E_2([q^{(k-1)}_1, f^{(1)}_1, \ldots, f^{(k)}_1]) \tag{4}$$

$$f^{(i)}_2 = \hat f^{(i)}_2 \cdot \mathrm{Sigmoid}(\mathrm{MLP}_2(q^{(k-1)}_2 \oplus \hat f^{(i)}_2)) \tag{5}$$

$$q^{(k)} = \mathrm{Pooling}([q^{(k-1)}_2, f^{(1)}_2, \ldots, f^{(k)}_2]) \tag{6}$$

where $\oplus$ denotes concatenation.

### Supervised Contrastive Learning

Contrastive learning is a self-supervised approach that learns an embedding space in which similar sample pairs are pulled together while dissimilar ones stay far apart. Supervised contrastive learning extends the self-supervised mode to the fully-supervised setting to effectively leverage label information: positive pairs are chosen from the same class and negative samples from different classes, and the learning objective is to pull together samples belonging to the same class and push apart those from different classes. To utilize supervised contrastive learning, we maintain the query feature as the anchor, which is iteratively updated. Images referring to the same vehicle as the query are regarded as positive samples, while the others serve as negative samples. Considering that the batch size limits the number of available negative samples, we propose a memory bank called the Hard Negative Memory Bank (HNMB) for low-cost negative-sample expansion, inspired by (Wu et al. 2018). More specifically, we maintain a fixed-size queue, denoted by $N^+(i)$, to store negative samples for each query vehicle. Hard negatives in the batch samples are continuously stored in $N^+(i)$ and evicted in first-in-first-out order, i.e., the earliest samples are popped whenever the size of $N^+(i)$ exceeds the memory size.
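To make the module concrete, below is a minimal PyTorch sketch of the gated two-layer fusion of Eqs. (2)-(6), together with a FIFO hard-negative memory bank. It assumes a standard `nn.TransformerEncoderLayer` for each encoder layer $E_j$, mean pooling for Eq. (6), and illustrative MLP widths and head counts; the paper does not specify these details.

```python
import collections
import torch
import torch.nn as nn

class GatedFusionLayer(nn.Module):
    """One encoder layer plus the gating of Eqs. (3)/(5): each feedback token
    is re-weighted by a Sigmoid gate so suspected false positives are damped."""
    def __init__(self, dim: int = 2048, heads: int = 8):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                  batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, 1 + k, dim) -- query token followed by k feedback tokens
        enc = self.encoder(tokens)
        q, fb = enc[:, :1], enc[:, 1:]
        w = self.gate(torch.cat([q.expand_as(fb), fb], dim=-1))  # (B, k, 1)
        return torch.cat([q, fb * w], dim=1)

class FeedbackFusion(nn.Module):
    """Two stacked gated layers and mean pooling (Eq. 6) to produce q(k)."""
    def __init__(self, dim: int = 2048):
        super().__init__()
        self.layers = nn.ModuleList(GatedFusionLayer(dim) for _ in range(2))

    def forward(self, query: torch.Tensor, feedback: torch.Tensor) -> torch.Tensor:
        # query: (B, dim); feedback: (B, k, dim) features of the picked images
        tokens = torch.cat([query.unsqueeze(1), feedback], dim=1)
        for layer in self.layers:
            tokens = layer(tokens)
        return tokens.mean(dim=1)

class HardNegativeMemoryBank:
    """FIFO queue of hard-negative features per query vehicle (the HNMB)."""
    def __init__(self, size: int = 512):
        self.queues = collections.defaultdict(lambda: collections.deque(maxlen=size))

    def push(self, vehicle_id, feats):
        self.queues[vehicle_id].extend(feats)   # oldest entries drop automatically

    def negatives(self, vehicle_id):
        q = self.queues[vehicle_id]
        return torch.stack(list(q)) if q else torch.empty(0)
```

Gate weights near 0 effectively mute a feedback token, which is how the module can tolerate false-positive picks without re-training.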
### Joint Training

As shown in Figure 2, the feedback fusion module is jointly trained with the backbone Re-ID network, with two optimization objectives. First, for the backbone network, we can simply apply the loss functions commonly used in previous Re-ID models to learn discriminative feature representations. In our implementation, we follow recent works (Ghosh, Shanmugalingam, and Lin 2021; Quispe et al. 2021) and adopt the combination of an ID Loss and a Metric Loss. As shown in the following equations, we choose the Cross Entropy Loss as the ID Loss and the soft-margin Triplet Loss as the Metric Loss:

$$L_{ID} = -\sum_{i=1}^{C} \hat p_i \log(p_i), \qquad \hat p_i = \begin{cases} 1, & i = y \\ 0, & i \neq y \end{cases} \tag{7}$$

$$L_{Metric} = \log\left[1 + \exp\left(\lVert v_a - v_p \rVert - \lVert v_a - v_n \rVert\right)\right] \tag{8}$$

where $y$ is the ground-truth ID label, $C$ is the number of identity classes, $p_i$ is the predicted probability of class $i$, $v_a$ is an anchor feature, $v_p$ is a positive feature and $v_n$ is a negative feature.

Second, for the feedback-fusion component, we apply the Supervised Contrastive Loss (SCL) $L_{SCL}$ to adjust the query embedding using the feedback and make it as close as possible to those of the matching images. We apply SCL to this task because it encourages the model to pay more attention to the hard samples (both positives and negatives), so as to generate a more discriminative query embedding by leveraging the feedback signals. $L_{SCL}$ can be formally expressed as:

$$L_{SCL} = \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(q^{(k)} \cdot v_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(q^{(k)} \cdot v_a / \tau\right)} \tag{9}$$

where the iteratively updated query feature $q^{(k)}$ serves as the anchor, and $\tau$ is a temperature hyper-parameter. $A(i) = P(i) \cup N(i) \cup N^+(i)$ is the set of vehicle image indices in the batch for vehicle $i$, in which $P(i)$ is the set of indices of all positives, $N(i)$ is the set of indices of all negatives, and $N^+(i)$ is the HNMB for vehicle $i$ after $k-1$ iterations of simulated human feedback.

With the two types of loss functions, the final objective of joint learning is to minimize

$$L = \lambda_{id} L_{ID} + \lambda_m L_{Metric} + \lambda_h L_{SCL} \tag{10}$$

where $\lambda_{id}$, $\lambda_m$ and $\lambda_h$ are weight parameters.
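The following is a minimal sketch of the three losses in Eqs. (7)-(10). It assumes the dot products in Eq. (9) are taken between L2-normalized features, a common choice for contrastive losses that the paper does not state explicitly; function names and argument shapes are our own.

```python
import torch
import torch.nn.functional as F

def supcon_loss(q_k, feats, labels, query_label, tau: float = 0.07):
    """Supervised contrastive loss of Eq. (9). q_k is the fused query
    feature (the anchor); feats/labels hold the batch samples plus HNMB
    negatives, i.e. A(i). Assumes at least one positive is present."""
    q = F.normalize(q_k, dim=0)
    v = F.normalize(feats, dim=1)
    sim = v @ q / tau                              # similarities to the anchor
    log_prob = sim - torch.logsumexp(sim, dim=0)   # log softmax over A(i)
    pos = labels == query_label                    # mask selecting P(i)
    return -log_prob[pos].mean()                   # average over positives

def joint_loss(logits, target, va, vp, vn, q_k, feats, labels, query_label,
               lambdas=(1.0, 1.0, 1.0)):
    """Eq. (10): weighted sum of the ID loss (7), the soft-margin triplet
    loss (8) and the supervised contrastive loss (9)."""
    l_id = F.cross_entropy(logits, target)                       # Eq. (7)
    l_metric = F.softplus(torch.norm(va - vp, dim=1)             # Eq. (8):
                          - torch.norm(va - vn, dim=1)).mean()   # log(1 + exp(.))
    l_scl = supcon_loss(q_k, feats, labels, query_label)         # Eq. (9)
    return lambdas[0] * l_id + lambdas[1] * l_metric + lambdas[2] * l_scl
```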
### Online Inference

With the trained IRIN model, the online inference procedure with iterative feedback signals works as shown in Algorithm 1. Given a query image, we extract its features using the backbone network (ResNeXt101 appended with an IBN layer in our implementation). The images in the gallery are sorted according to their similarity to the query image, and the top-$n$ candidates are returned. Among the returned images, a set of uncertain images $U$ is identified according to the classification entropy. The user picks from $U$ the image $I_p$ that is considered positive with the highest confidence. With this feedback signal, the query feature is adjusted by the IRIN network. In the next iteration, the top-$n$ candidates and the uncertain image set $U$ are updated, and the user picks another positive image from $U$ to update the query feature. The procedure repeats until the maximum number of iterations is reached or the user is satisfied with the top-$n$ results.

**Algorithm 1: Vehicle Re-ID with iterative feedback**

1. $q^{(0)} \leftarrow$ extract the visual feature of the query image
2. $\mathit{CAND}_n \leftarrow$ the top-$n$ images most similar to $q^{(0)}$
3. $U \leftarrow$ images in $\mathit{CAND}_n$ with top-$|U|$ entropy
4. **for** $k \leftarrow 1$ **to** $I_{max}$ **do**
   1. **if** $\mathit{CAND}_n$ is satisfactory **then break**
   2. $I_p \leftarrow$ the image picked by the user from $U$
   3. $q^{(k)} \leftarrow \mathrm{IRIN}(q^{(k-1)}, I_p)$
   4. update $\mathit{CAND}_n$ and $U$ according to $q^{(k)}$
5. **return** the images sorted by similarity to $q^{(k)}$
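The loop below is a hedged Python rendering of Algorithm 1. The `backbone`, `irin` and `classifier` callables, the `ask_user` interaction stub and the use of cosine similarity for ranking are all illustrative assumptions rather than details fixed by the paper.

```python
import torch

def rank(gallery_feats, q):
    """Cosine similarity of every gallery feature to the query feature."""
    return gallery_feats @ q / (gallery_feats.norm(dim=1) * q.norm() + 1e-12)

@torch.no_grad()
def interactive_reid(query_img, gallery_feats, backbone, irin, classifier,
                     ask_user, n=50, u_size=10, max_iters=5):
    """Online inference loop of Algorithm 1: alternately show the user a
    ranked list and fuse the picked positive back into the query feature."""
    q = backbone(query_img.unsqueeze(0)).squeeze(0)          # q(0)
    feedback = []                                            # picked-image features
    for _ in range(max_iters):
        sims = rank(gallery_feats, q)
        cand = sims.topk(n).indices                          # CAND_n
        probs = torch.softmax(classifier(gallery_feats[cand]), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        uncertain = cand[entropy.topk(u_size).indices]       # uncertain set U
        picked = ask_user(uncertain)                         # gallery index, or None if satisfied
        if picked is None:
            break
        feedback.append(gallery_feats[picked])
        q = irin(q.unsqueeze(0),                             # q(k) = IRIN(q(k-1), feedback)
                 torch.stack(feedback).unsqueeze(0)).squeeze(0)
    return rank(gallery_feats, q).argsort(descending=True)   # final ranked gallery
```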
## Experiment

### Experimental Setup

**Datasets.** We use two popular vehicle Re-ID benchmarks. VeRi-776 (Liu et al. 2016b) contains 49,357 images of 776 different vehicles, captured by 20 cameras from multiple viewpoints. VehicleID (Liu et al. 2016a) is a larger-scale dataset, with 221,567 images of 26,328 vehicles. For evaluation, it provides three test sets of different scales (small, medium and large).

**Implementation Details.** Following previous Re-ID models, the input images are resized to 240 × 240 and augmented by random flipping, random padding and random erasing. The feature dimension is set to 2,048. The model is trained for 120 epochs with a batch size of 128. The SGD optimizer is employed with a momentum of 0.9 and a weight decay of 5e-4. Each batch contains 8 images per vehicle. The initial learning rate is set to 0.01 and linearly decayed to 0.0001. Our backbone network is created by appending an Instance Batch Normalization (IBN) (Pan et al. 2018) layer to a ResNeXt101 (Xie et al. 2017) model, and we select 10 uncertain samples for the human interaction simulation. The size of the hard negative memory bank is set to 512. In the loss function of joint training, the weight parameters $\lambda_{id}$, $\lambda_m$ and $\lambda_h$ are all set to 1.0. In the online inference stage, the top-50 candidates are returned in each iteration and the size of the uncertain image set $U$ is set to 10. The model is implemented with PyTorch and trained on a Tesla V100 GPU.

**Performance Metrics.** The performance of human-in-the-loop vehicle Re-ID can still be evaluated by conventional Re-ID metrics. We report mean average precision (mAP) and Cumulative Matching Characteristics at top-1 (rank-1) and top-5 (rank-5).

**Comparison Methods.** We consider TransReID (He et al. 2021) as the state-of-the-art vehicle Re-ID model; in our evaluation, it is treated as a machine baseline without human intervention. For incremental learning in the online scenario, we select ILOS (He et al. 2020) and GwFReID (Wu and Gong 2021) as two representative approaches. They maintain an exemplar set for each vehicle, which incorporates the candidates picked by the user with high confidence, and online training is conducted on the exemplar set with their respective learning objectives.

### Sensitivity to Human Feedback Quality

Since users may provide flawed feedback containing false positives, we simulate the human interaction in the evaluation stage and control the probability that a correct positive pair is picked as the feedback. As shown in Figure 3, we vary the probability of positive feedback from 1.0 to 0.8 and compare IRIN with the online incremental learning methods (ILOS and GwFReID). The mAP of TransReID is also plotted as the machine baseline without human intervention. Interestingly, the backbone model of IRIN achieves a higher mAP than TransReID, probably due to the joint training framework.

Figure 3: Sensitivity to the quality of human feedback (panels: VeRi-776, VehicleID-Small, VehicleID-Medium, VehicleID-Large).

When the feedback signal is relatively reliable, the mAPs of IRIN, ILOS and GwFReID increase with more rounds of human interaction, indicating that both IRIN and online incremental learning can benefit from iterative human feedback; their mAPs also increase with the quality of the feedback. In the oracle scenario with perfect feedback, with only 5 iterations per query image, IRIN boosts the accuracy on VeRi-776 from 81.6% to 95.2%, which is remarkably higher than the 80.8% achieved by TransReID. IRIN also leverages the feedback signal significantly better than the incremental learning frameworks: with the same feedback quality, the mAP of IRIN surpasses ILOS and GwFReID by a wide margin. Between the two competitors, GwFReID outperforms ILOS because it optimizes a more comprehensive learning objective.

### Sensitivity to Candidate Set Size

We also investigate how the size of the uncertain candidate set impacts human-in-the-loop vehicle Re-ID. Figure 4 shows the mAP results of IRIN-1.0, IRIN-0.9 and IRIN-0.8 under different candidate set sizes. Intuitively, IRIN-1.0, which always receives a true positive feedback, benefits from expanding the candidate set, whereas IRIN-0.9 and IRIN-0.8 with noisy feedback do not improve steadily as the candidate set grows. Since real human feedback is not perfect, we set the default size of the candidate set to 10 in our experiments with real human interaction.

Figure 4: Sensitivity to the size of the candidate set (panels: VeRi-776, VehicleID-Small).

### Efficiency Study

In this experiment, we evaluate the time spent on processing the human feedback signal. IRIN directly accepts the feedback as part of its input and adjusts the feature embedding of the query image; its running time thus comprises the cost of updating the query feature and producing a new ranking list. For incremental learning, online training overhead is incurred to update the underlying model with the positive samples augmented by random flipping, padding and erasing.

| Method | VeRi-776 mAP | time | VehicleID-Small mAP | time | VehicleID-Medium mAP | time | VehicleID-Large mAP | time |
|---|---|---|---|---|---|---|---|---|
| IRIN | 95.2 | 0.031s | 97.9 | 0.029s | 96.4 | 0.033s | 96.0 | 0.036s |
| ILOS-0.1 | 85.7 | 10.8s | 92.4 | 21.91s | 90.2 | 23.14s | 89.1 | 24.16s |
| ILOS-0.2 | 86.4 | 13.85s | 93.3 | 28.43s | 90.9 | 29.89s | 90.1 | 30.95s |
| ILOS-0.3 | 86.3 | 17.5s | 92.8 | 35.94s | 90.6 | 36.88s | 89.9 | 39.00s |
| GwFReID | 87.7 | 41.3s | 94.0 | 63.7s | 92.5 | 65.3s | 91.8 | 69.9s |

Table 1: Comparison of mAP and feedback processing time. ILOS-0.1/0.2/0.3 sets the weight of the distillation loss to 0.1/0.2/0.3, respectively; 0.2 is the default setting in this paper.

With the number of augmented samples fixed at 24, we vary the weight of the distillation loss (denoted by $\lambda_d$) in the loss function of ILOS from 0.1 to 0.3. As shown in Table 1, the running time grows as $\lambda_d$ increases, because the model training becomes more query-specific and the online training requires more epochs to converge. Thus, ILOS-0.2 is more accurate than ILOS-0.1 but incurs a higher processing time. However, as $\lambda_d$ continues to rise, the mAP does not increase monotonically, probably because the information in the feedback signal is not sufficient to provide reliable clues for the optimization of the model parameters. In contrast, IRIN is much faster and more accurate than incremental learning. It takes less than 0.04s to handle a feedback signal and can easily support real-time user interaction, whereas ILOS requires users to wait more than 10 seconds per interaction. The running time of GwFReID is even worse, around 3-4 times higher than that of ILOS-0.1, because GwFReID maintains a larger exemplar set than ILOS and requires more re-training overhead.

### Experiments with Real Human Interaction

We invite 20 postgraduates (10 female and 10 male students) to participate in the real human interaction experiment. Each student is required to provide iterative feedback for each query image over 5 iterations. We observe that the quality of feedback differs between male and female students; we therefore report their results separately as IRIN-Male and IRIN-Female. The comparison approaches cover three modes. 1) For the machine-only mode, we still use TransReID as the baseline without human interaction. 2) For the human-machine cooperation mode, we compare IRIN with the incremental learning methods for Re-ID. Furthermore, we also report IRIN-Oracle to show the upper-bound performance with a perfect feedback signal.
3) For the human-only baselines, we ask the students to annotate the matching vehicles for the query images. To reduce their annotation cost, we assume that the positive samples are contained in the top-200 most similar images obtained in the initial round of IRIN. This procedure yields two baselines: Male-Annotation and Female-Annotation.

Figure 5: The mAP results with real human interaction on VeRi-776. Shaded areas represent the variance of the mAP.

In Figure 5, we report the mean mAP over all participants as the number of iterations per query image increases. For the human-machine cooperative methods, since the quality of annotation differs among the students, we also plot the variance as a shaded area. The results lead to the following key observations. 1) The mAPs of the human-only baselines are significantly higher than the machine-only baseline, indicating that human users are more capable of distinguishing true positives from a set of similar images. 2) The results of Male-Annotation and Female-Annotation show that female students provide more accurate feedback. Consequently, when this feedback is supplied to IRIN, IRIN-Female outperforms IRIN-Male, a finding consistent with the experiment on feedback quality in Figure 3. 3) IRIN, ILOS and GwFReID all benefit from human feedback: their mAPs increase steadily with the number of interactions per query. 4) Compared with ILOS and GwFReID, IRIN is more effective in leveraging the human feedback and achieves higher accuracy. 5) With a sufficient number of iterations, IRIN, as a human-machine cooperation method, eventually outperforms both the machine-only and human-only baselines. Besides, its variance shrinks with more iterations of human interaction.

### Ablation Study

We examine four variants of IRIN for the ablation study. IRIN-random-sample replaces the active-learning-based sampling strategy with random sampling when constructing the set of uncertain candidates for human verification. IRIN-max-dist simulates the human interaction by selecting the positive sample with the maximum distance to the query image, whereas the original IRIN uses the minimum distance. IRIN-triplet-loss replaces the contrastive loss with the traditional triplet loss. IRIN-w/o-HNMB removes the hard negative memory bank.

| Method | IRIN-Oracle mAP | rank-1 | rank-5 | IRIN-Female mAP | rank-1 | rank-5 | IRIN-Male mAP | rank-1 | rank-5 |
|---|---|---|---|---|---|---|---|---|---|
| IRIN | 95.2 | 99.5 | 100 | 93.2 | 98 | 99.5 | 89.9 | 96.5 | 99 |
| IRIN-max-dist | 94.5 | 99 | 99 | 92.1 | 97.5 | 99 | 88 | 96 | 98 |
| IRIN-random-sample | 93.8 | 99.5 | 99.5 | 91.8 | 97.5 | 98.5 | 87.7 | 95.5 | 98.5 |
| IRIN-triplet-loss | 92.3 | 99 | 99.5 | 90.6 | 95.5 | 98 | 85.9 | 94.5 | 97.5 |
| IRIN-w/o-HNMB | 92.7 | 99 | 99.5 | 91.3 | 97 | 98.5 | 86.8 | 96.5 | 98 |

Table 2: Ablation study of the sampling and optimization strategies.

Table 2 reports the results of these four IRIN variants under different feedback-quality settings. In the offline training stage, the active-learning-based sampling strategy for uncertain candidate selection is more effective than random sampling. Given these uncertain candidates, simulating human users by picking the sample with the minimum distance to the query image is indeed superior to picking the most dissimilar candidate. Overall, the supervised contrastive loss and the hard negative memory bank play the most important roles in this ablation study: when either component is removed, we observe considerable performance degradation.
### Generality to Person Re-ID

In the final experiment, we evaluate the generality of our framework by applying it to person Re-ID and conducting experiments with human interaction on MSMT17 (Wei et al. 2018). The dataset contains 126,441 pictures of 4,101 pedestrians, captured by 15 cameras. We use the same comparison methods as in Figure 5. Since TransReID claims state-of-the-art performance in both vehicle and person Re-ID, we choose it as the machine baseline without human intervention. The results in Figure 6 show that the mAP derived from the backbone network of IRIN is inferior to TransReID (74.6% vs. 79.1%). Nevertheless, with only one iteration of feedback, both IRIN-Male and IRIN-Female already outperform TransReID, and the advantage widens with more iterations. After 5 iterations, their mAPs are substantially higher than that of TransReID, reaching 89.3% and 91.5%, respectively. It is also interesting to observe that IRIN-Female achieves almost the same accuracy as IRIN-Oracle when the number of iterations is set to 5, implying that IRIN is effective in leveraging imperfect feedback.

Figure 6: Experiment on person Re-ID.

## Conclusion

In this paper, we study vehicle Re-ID in a new scenario with a human in the loop who provides iterative feedback for continuous performance enhancement. We propose an Interaction Re-ID Network (IRIN) to effectively fuse the feedback signal and dynamically adjust the feature embedding of the query image. It supports real-time interaction and is suitable for applications with high accuracy requirements. Experimental results validate the superiority of this human-machine cooperation mode over machine-only and human-only baselines. Even with flawed feedback, the proposed IRIN outperforms the state-of-the-art vehicle Re-ID model by a wide margin. In future work, we will explore a more diversified set of human operations as feedback signals.

## Acknowledgments

This work is sponsored by the CCF-Huawei Populus Grove Fund.

## References

Berti-Equille, L. 2019. Reinforcement Learning for Data Preparation with Active Reward Learning. In Yacoubi, S. E.; Bagnoli, F.; and Pacini, G., eds., Internet Science - 6th International Conference, INSCI 2019, Perpignan, France, December 2-5, 2019, Proceedings, volume 11938 of Lecture Notes in Computer Science, 121-132. Springer.

Brenner, R.; Priyadarshi, J.; and Itti, L. 2016. Perfect Accuracy with Human-in-the-Loop Object Detection. In ECCV 2016, volume 9914 of Lecture Notes in Computer Science, 360-374.

Ghosh, A.; Shanmugalingam, K.; and Lin, W. 2021. Relation Preserving Triplet Mining for Stabilizing the Triplet Loss in Vehicle Re-identification. CoRR, abs/2110.07933.

He, J.; Mao, R.; Shao, Z.; and Zhu, F. 2020. Incremental Learning in Online Scenario. In CVPR 2020, 13923-13932. Computer Vision Foundation / IEEE.

He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; and Jiang, W. 2021. TransReID: Transformer-based Object Re-Identification. In ICCV 2021, 14993-15002. IEEE.

Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S. S.; Chen, J.; and Chellappa, R. 2019. A Dual-Path Model With Adaptive Attention for Vehicle Re-Identification. In ICCV 2019, 6131-6140. IEEE.

Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; and Krishnan, D. 2020. Supervised Contrastive Learning. In NeurIPS 2020.

Liu, H.; Tian, Y.; Wang, Y.; Pang, L.; and Huang, T. 2016a. Deep Relative Distance Learning: Tell the Difference between Similar Vehicles. In CVPR 2016, 2167-2175. IEEE Computer Society.
Liu, X.; Liu, W.; Mei, T.; and Ma, H. 2016b. A Deep Learning-Based Approach to Progressive Vehicle Re-identification for Urban Surveillance. In Leibe, B.; Matas, J.; Sebe, N.; and Welling, M., eds., ECCV 2016, volume 9906 of Lecture Notes in Computer Science, 869-884. Springer.

Liu, Z.; Wang, J.; Gong, S.; Tao, D.; and Lu, H. 2019. Deep Reinforcement Active Learning for Human-in-the-Loop Person Re-Identification. In ICCV 2019, 6121-6130. IEEE.

Lou, Y.; Bai, Y.; Liu, J.; Wang, S.; and Duan, L. 2019. Embedding Adversarial Learning for Vehicle Re-Identification. IEEE Trans. Image Process., 28(8): 3794-3807.

Muthuraman, K.; Reiss, F.; Xu, H.; Cutler, B.; and Eichenberger, Z. 2021. Data Cleaning Tools for Token Classification Tasks. In Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances, 59-61.

Pan, X.; Luo, P.; Shi, J.; and Tang, X. 2018. Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net. In Ferrari, V.; Hebert, M.; Sminchisescu, C.; and Weiss, Y., eds., Computer Vision - ECCV 2018 - 15th European Conference, Part IV, volume 11208 of Lecture Notes in Computer Science, 484-500. Springer.

Quispe, R.; Lan, C.; Zeng, W.; and Pedrini, H. 2021. AttributeNet: Attribute Enhanced Vehicle Re-identification. Neurocomputing, 465: 84-92.

Rao, Y.; Chen, G.; Lu, J.; and Zhou, J. 2021. Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification. CoRR, abs/2108.08728.

Stonebraker, M.; Bhargava, B.; Cafarella, M.; Collins, Z.; McClellan, J.; Sipser, A.; Sun, T.; Nesen, A.; Solaiman, K.; Mani, G.; et al. 2020. Surveillance Video Querying with a Human-in-the-Loop. In Workshop on Human-In-the-Loop Data Analytics.

Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; and Wei, Y. 2020. Circle Loss: A Unified Perspective of Pair Similarity Optimization. In CVPR 2020, 6397-6406. Computer Vision Foundation / IEEE.

Wei, L.; Zhang, S.; Gao, W.; and Tian, Q. 2018. Person Transfer GAN to Bridge Domain Gap for Person Re-Identification. In CVPR 2018, 79-88. Computer Vision Foundation / IEEE Computer Society.

Wu, C.; Wu, F.; Qi, T.; Huang, Y.; and Xie, X. 2021. Fastformer: Additive Attention Can Be All You Need. CoRR, abs/2108.09084.

Wu, G.; and Gong, S. 2021. Generalising without Forgetting for Lifelong Person Re-Identification. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Virtual Event, February 2-9, 2021, 2889-2897. AAAI Press.

Wu, Z.; Xiong, Y.; Yu, S. X.; and Lin, D. 2018. Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In CVPR 2018, 3733-3742. Computer Vision Foundation / IEEE Computer Society.

Xie, S.; Girshick, R. B.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated Residual Transformations for Deep Neural Networks. In CVPR 2017, 5987-5995. IEEE Computer Society.

Yan, C.; Pang, G.; Bai, X.; Zhou, J.; and Gu, L. 2020. Beyond Triplet Loss: Person Re-identification with Fine-grained Difference-aware Pairwise Loss. CoRR, abs/2009.10295.

Zakria; Deng, J.; Khokhar, M. S.; Aftab, M. U.; Cai, J.; Kumar, R.; and Kumar, J. 2021. Trends in Vehicle Re-identification Past, Present, and Future: A Comprehensive Review. CoRR, abs/2102.09744.
Zhang, D.; Li, Z.; Wang, X.; Tan, K.; and Chen, G. 2022. Towards One-Size-Fits-Many: Multi-Context Attention Network for Diversity of Entity Resolution Tasks. IEEE Trans. Knowl. Data Eng., 34(12): 6018-6032.

Zhao, J.; Zhao, Y.; Li, J.; Yan, K.; and Tian, Y. 2021. Heterogeneous Relational Complement for Vehicle Re-identification. CoRR, abs/2109.07894.