# TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation

Yuhao Wang1, Xuehu Liu2, Pingping Zhang1*, Hu Lu3, Zhengzheng Tu4, Huchuan Lu1

1School of Future Technology, School of Artificial Intelligence, Dalian University of Technology
2School of Computer Science and Artificial Intelligence, Wuhan University of Technology
3School of Computer Science and Communication Engineering, Jiangsu University
4School of Computer Science and Technology, Anhui University

924973292@mail.dlut.edu.cn, liuxuehu@whut.edu.cn, {zhpp, lhchuan}@dlut.edu.cn, luhu@ujs.edu.cn, zhengzhengahu@163.com

*Corresponding author

## Abstract

Multi-spectral object Re-identification (ReID) aims to retrieve specific objects by leveraging complementary information from different image spectra. It delivers great advantages over traditional single-spectral ReID in complex visual environments. However, the significant distribution gap among different image spectra poses great challenges for effective multi-spectral feature representations. In addition, most current Transformer-based ReID methods only utilize the global feature of class tokens to achieve holistic retrieval, ignoring the local discriminative ones. To address the above issues, we step further to utilize all the tokens of Transformers and propose a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID. More specifically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, we propose a Token Permutation Module (TPM) for cyclic multi-spectral feature aggregation. It not only facilitates the spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, we propose a Complementary Reconstruction Module (CRM), which introduces dense token-level reconstruction constraints to reduce the distribution gap across different image spectra. With the above modules, our proposed framework can generate more discriminative multi-spectral features for robust object ReID. Extensive experiments on three ReID benchmarks (i.e., RGBNT201, RGBNT100 and MSVR310) verify the effectiveness of our methods. The code is available at https://github.com/924973292/TOP-ReID.

## Introduction

Object Re-identification (ReID) aims to retrieve specific objects from images or videos across non-overlapping cameras, and has advanced significantly over the past decades. In traditional object ReID, researchers primarily utilize single-spectral images (such as RGB or depth) to extract visual information of the targets. However, single-spectral images provide very limited representation abilities in scenarios characterized by low resolution, darkness, glare, etc. As illustrated in the top row of Fig. 1, the outlines of persons are notably blurred, leading to an evident confusion between persons and the background in the RGB image spectrum. Hence, relying only on RGB images poses great challenges for robust object ReID.

Figure 1: The top displays instances from RGBNT201 under various challenges (e.g., low resolution), while the bottom presents the object ReID settings for the multi-spectral and missing-spectral tests.

Fortunately, other image spectra are very useful for addressing the above problems. In fact, Near Infrared (NIR) imaging is unaffected by darkness and adverse weather conditions (Li et al. 2020b). Thus, there have been some efforts (Li et al. 2020a; Liu et al. 2021a; Zhang and Wang 2023) to incorporate NIR images to enhance the performance of object ReID. Nonetheless, NIR images retain some limitations (Zheng et al. 2021), as depicted in Fig. 1. For example, the details of persons in NIR images tend to be substantially obscured in the presence of glare. Meanwhile, Thermal Infrared (TIR) imaging is more robust to these scenarios (Zheng et al. 2021). As illustrated in Fig. 1, TIR images can highlight persons from the background and preserve crucial details, such as glasses and backpacks. These facts clearly show the information complementarity of different image spectra for object ReID.

Based on the above facts, multi-spectral object ReID aims to retrieve specific objects by leveraging complementary information from different image spectra, e.g., RGB, NIR, TIR, etc. It delivers great advantages over single-spectral ReID in complex visual environments. In fact, some methods (Zheng et al. 2021; Wang et al. 2022b) have already tried to integrate multi-spectral features with simple fusion schemes. However, there are significant distribution gaps among different image spectra, and simple fusions cannot well address the heterogeneous challenges for effective feature representations. In addition, the absence of image spectra is often encountered in the real world, as shown in the missing-spectral test of Fig. 1. Thus, there is much room for improving multi-spectral feature fusion. Meanwhile, with the great advance of vision Transformers (Dosovitskiy et al. 2020), some works (He et al. 2021; Pan et al. 2022) have introduced Transformers for object ReID. However, most current Transformer-based ReID methods only utilize the global feature of class tokens to achieve holistic retrieval, ignoring the local discriminative ones.

To address the above issues, we step further to utilize all the tokens of Transformers and propose a cyclic token permutation framework for multi-spectral object ReID, dubbed TOP-ReID. Specifically, it consists of two key modules: the Token Permutation Module (TPM) and the Complementary Reconstruction Module (CRM). Technically, we first deploy a multi-stream deep network based on vision Transformers to preserve distinct information from different image spectra. Then, TPM takes all the tokens from the multi-stream deep network as inputs, and cyclically permutes the specific class tokens and the corresponding patch tokens from other spectra. In this way, it not only facilitates the spatial feature alignment across different image spectra, but also allows the class token of each spectrum to perceive the local details of other spectra. Meanwhile, CRM is proposed to facilitate local information interaction and reconstruction across different image spectra. By introducing token-level reconstruction constraints, it can reduce the distribution gap across different image spectra. As a result, CRM can further handle the missing-spectral problem. With the proposed modules, our framework can extract more discriminative features from multi-spectral images for robust object ReID. Comprehensive experiments are conducted on three multi-spectral object ReID benchmarks, i.e., RGBNT201, RGBNT100 and MSVR310.
Experimental results clearly show the effectiveness of our proposed methods. In summary, our contributions can be stated as follows:

- We propose a novel feature learning framework named TOP-ReID for multi-spectral object ReID. To the best of our knowledge, TOP-ReID is the first work to utilize all the tokens of vision Transformers to improve multi-spectral object ReID.
- We propose a Token Permutation Module (TPM) and a Complementary Reconstruction Module (CRM) to facilitate multi-spectral feature alignment and handle spectral-missing problems effectively.
- We perform comprehensive experiments on three multi-spectral object ReID benchmarks, i.e., RGBNT201, RGBNT100 and MSVR310. The results fully verify the effectiveness of our proposed methods.

## Related Work

### Single-spectral Object ReID

Single-spectral object ReID focuses on extracting discriminative features from single-spectral images. Typical single-spectral forms include RGB, NIR, TIR and depth. Due to their easy acquisition, RGB images play a fundamental role in single-spectral object ReID. As for the techniques, most existing object ReID methods are based on Convolutional Neural Networks (CNNs). For example, Luo et al. (Luo et al. 2019) utilize a deep residual network and introduce the BNNeck technique for object ReID. Furthermore, PCB (Sun et al. 2018) and MGN (Wang et al. 2018) adopt a stripe-based image division strategy to obtain multi-grained representations. OSNet (Zhou et al. 2019) employs a unified aggregation gate for fusing omni-scale features. AGW (Ye et al. 2021) incorporates non-local attention mechanisms for fine-grained feature extraction. Nevertheless, due to the limited receptive field, CNN-based methods (Qian et al. 2017; Li, Zhu, and Gong 2018; Chang, Hospedales, and Xiang 2018; Chen et al. 2019; Sun et al. 2020; Rao et al. 2021; Zhao et al. 2021; Liu et al. 2021b) are not robust to complex scenarios. Inspired by the success of vision Transformers (ViT) (Dosovitskiy et al. 2020), He et al. (He et al. 2021) propose the first pure Transformer-based method, named TransReID, for object ReID, yielding competitive results through the adaptive modeling of image patches. Afterwards, numerous Transformer-based methods (Zhu et al. 2021; Zhang et al. 2021; Chen et al. 2022; Wang et al. 2022a; Liu et al. 2023) demonstrate their advantages in object ReID. However, all these methods take single-spectral images as inputs, providing limited representation abilities. Thus, they cannot handle the all-day object ReID problem.

### Multi-spectral Object ReID

The robustness of multi-spectral data draws the attention of numerous researchers. For multi-spectral person ReID, Zheng et al. (Zheng et al. 2021) advance the field and design PFNet to learn robust RGB-NIR-TIR features. Then, Wang et al. (Wang et al. 2022b) boost modality-specific representations with three learning strategies, named IEEE. Furthermore, Zheng et al. (Zheng et al. 2023) design DENet to address the spectral-missing problem. For multi-spectral vehicle ReID, Li et al. (Li et al. 2020b) propose HAMNet to fuse different spectral features. Considering the relationship between different image spectra, Guo et al. (Guo et al. 2022) propose GAFNet to fuse multiple data sources. He et al. (He et al. 2023) propose GPFNet to adaptively fuse multi-spectral features. Zheng et al. (Zheng et al. 2022) propose CCNet to simultaneously overcome the discrepancies from both the modality and sample aspects.
Pan et al. (Pan et al. 2022) propose HViT to balance modal-specific and modal-shared information. Furthermore, they employ a random hybrid augmentation and a feature hybrid mechanism to improve the performance (Pan et al. 2023). Although effective, previous methods mainly treat the NIR and TIR spectra as assistants to RGB, rather than adaptively fusing them with multi-level spatial correspondences. In contrast, we facilitate the spatial feature alignment across different image spectra.

Figure 2: An illustration of the proposed TOP-ReID. First, deep features from RGB, NIR and TIR images are extracted by three independent ViT-B/16 backbones. Then, a Token Permutation Module (TPM) is proposed for cyclic multi-spectral feature aggregation through three consecutive token permutations. Meanwhile, a Complementary Reconstruction Module (CRM) is used to impose token-level reconstruction constraints. During inference, we utilize the permutated features for ranking the person candidates.

## Proposed Method

As illustrated in Fig. 2, our proposed TOP-ReID consists of three main components: Multi-stream Feature Extraction, the Token Permutation Module (TPM) and the Complementary Reconstruction Module (CRM).

### Multi-stream Feature Extraction

In this work, we take images of three spectra for object ReID, i.e., RGB, NIR and TIR. To capture the distinctive characteristics of each spectrum, we follow previous works (Li et al. 2020b; Zheng et al. 2021) and adopt three independent backbones. More specifically, a vision Transformer (ViT) is deployed as the backbone in each stream. Formally, the multi-stream features can be represented as

$$F_R = \mathrm{ViT}_R(I_R), \tag{1}$$
$$F_N = \mathrm{ViT}_N(I_N), \tag{2}$$
$$F_T = \mathrm{ViT}_T(I_T), \tag{3}$$

where $I_R, I_N, I_T \in \mathbb{R}^{H \times W \times 3}$ denote the input RGB, NIR and TIR images, respectively. Here, ViT can be any vision Transformer (e.g., ViT-B/16 (Dosovitskiy et al. 2020), DeiT-S/16 (Touvron et al. 2021), T2T-ViT-24 (Yuan et al. 2021)). The token features $F_R, F_N, F_T \in \mathbb{R}^{D \times (M+1)}$ are extracted from the final layer of each ViT and include an additional learnable class token. $D$ denotes the embedding dimension, while $M$ is the number of patch tokens. These independent streams enable the extraction of spectral-specific features, capturing rich information from different image spectra.

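To make the three-stream extraction concrete, below is a minimal PyTorch-style sketch (not the authors' released code). The `TinyViT` stand-in, its depth and the `make_vit` hook are illustrative assumptions; any ViT that returns the full $(M+1)$-token sequence could be plugged in. Note the token layout here is $(B, M+1, D)$, i.e., the transpose of the $D \times (M+1)$ convention used in Eqs. (1)-(3).

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """A toy stand-in for ViT-B/16, only to make the sketch runnable.
    It patch-embeds the image, prepends a class token, adds positional
    embeddings, and returns all (M+1) tokens."""
    def __init__(self, img_size=(256, 128), patch=16, dim=768, depth=2, heads=8):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        num_patches = (img_size[0] // patch) * (img_size[1] // patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, x):
        tok = self.proj(x).flatten(2).transpose(1, 2)                   # (B, M, D)
        tok = torch.cat([self.cls.expand(len(x), -1, -1), tok], dim=1)  # (B, M+1, D)
        return self.encoder(tok + self.pos)

class MultiStreamViT(nn.Module):
    """Three independent ViT streams (RGB / NIR / TIR), mirroring Eqs. (1)-(3):
    no parameters are shared across spectra."""
    def __init__(self, make_vit=TinyViT):
        super().__init__()
        self.vit_rgb, self.vit_nir, self.vit_tir = make_vit(), make_vit(), make_vit()

    def forward(self, x_rgb, x_nir, x_tir):
        # Each F_* is (B, M+1, D): one class token followed by M patch tokens.
        return self.vit_rgb(x_rgb), self.vit_nir(x_nir), self.vit_tir(x_tir)

if __name__ == "__main__":
    model = MultiStreamViT()
    imgs = [torch.randn(2, 3, 256, 128) for _ in range(3)]  # RGB, NIR, TIR batches
    F_R, F_N, F_T = model(*imgs)
    print(F_R.shape)  # torch.Size([2, 129, 768])
```
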
### Token Permutation Module

To achieve the spatial feature alignment among different image spectra and the effective aggregation of heterogeneous features, we introduce the Token Permutation Module (TPM) with a cyclic token permutation mechanism, as illustrated at the top right corner of Fig. 2. Technically, TPM takes the token features $F_R$, $F_N$ and $F_T$ as inputs, and generates the fused feature $f_{tp}$ with three consecutive token permutations. Without loss of generality, we take the RGB stream as a starting example. As shown in Fig. 3 (a), we utilize Multi-Head Cross-Attention (MHCA) (Dosovitskiy et al. 2020) with $N_h$ heads to achieve the token permutation. More specifically, the class token $f_{(R,0)} \in \mathbb{R}^{D}$ from $F_R$ is passed into a linear transformation to generate a query matrix $Q \in \mathbb{R}^{D}$. The patch tokens $F^{patch}_{N} \in \mathbb{R}^{D \times M}$ from $F_N$ are passed into two linear transformations to generate a key matrix $K$ and a value matrix $V$, respectively. Thus, the interaction of $F_R$ and $F_N$ in the $h$-th head is represented as

$$\hat{f}^{\,h}_{(R,1)} = \sigma\!\left(\frac{(Q^{h})^{\top} K^{h}}{\sqrt{d}}\right)(V^{h})^{\top}, \tag{4}$$

where $\sigma$ is the softmax function and $(\cdot)^{\top}$ denotes matrix transposition. Here, $Q^{h} \in \mathbb{R}^{d}$, $K^{h}, V^{h} \in \mathbb{R}^{d \times M}$ and $d = D / N_h$.

Figure 3: Our token permutation and TransRe blocks with the RGB stream. Other streams share a similar structure. (a) Token Permutation; (b) TransRe Block.

The outputs of the $N_h$ heads $(\hat{f}^{\,1}_{(R,1)}, \dots, \hat{f}^{\,h}_{(R,1)}, \dots, \hat{f}^{\,N_h}_{(R,1)})$ are concatenated to form $\hat{f}_{(R,1)} \in \mathbb{R}^{D}$. Then, $\hat{f}_{(R,1)}$ is passed through a Feed-Forward Network (FFN) to generate a new class token $f_{(R,1)}$,

$$f_{(R,1)} = \mathrm{FFN}(\hat{f}_{(R,1)}) + \hat{f}_{(R,1)}. \tag{5}$$

It serves as the initial spatial alignment of the RGB and NIR image features. Similar operations can be performed for the other spectra,

$$f_{(R,1)} = \mathrm{FFN}\big(\mathrm{MHCA}\big(\mathrm{LN}(f_{(R,0)}),\ \mathrm{LN}(F^{patch}_{N})\big)\big), \tag{6}$$
$$f_{(N,1)} = \mathrm{FFN}\big(\mathrm{MHCA}\big(\mathrm{LN}(f_{(N,0)}),\ \mathrm{LN}(F^{patch}_{T})\big)\big), \tag{7}$$
$$f_{(T,1)} = \mathrm{FFN}\big(\mathrm{MHCA}\big(\mathrm{LN}(f_{(T,0)}),\ \mathrm{LN}(F^{patch}_{R})\big)\big). \tag{8}$$

As shown in Fig. 3 (a) and the above equations, we additionally apply Layer Normalization (LN) (Ba, Kiros, and Hinton 2016) to $Q$ and $K$ to ensure numerical stability. Thus, the first token permutation can be formulated as

$$f_{(R,1)}, f_{(N,1)}, f_{(T,1)} = \mathrm{TPM}_1(F_R, F_N, F_T). \tag{9}$$

From the above equations, it can be observed that the token permutation enables the global class token of each spectrum to interact with the local patch tokens of the next spectrum, achieving the initial feature fusion and alignment. Furthermore, the permutated class tokens $f_{(R,1)}$, $f_{(N,1)}$ and $f_{(T,1)}$ are paired with their initial patch tokens to form $F_{R \to N}$, $F_{N \to T}$ and $F_{T \to R}$, respectively. The class tokens keep shifting to the next spectrum,

$$f_{(R,2)}, f_{(N,2)}, f_{(T,2)} = \mathrm{TPM}_2(F_{R \to N}, F_{N \to T}, F_{T \to R}). \tag{10}$$

At this stage, each spectrum has already incorporated detail information from the other spectra. Similar to the previous step, the permutated class tokens $f_{(R,2)}$, $f_{(N,2)}$ and $f_{(T,2)}$ are paired with the permutated patch tokens to form $F_{RN \to T}$, $F_{NT \to R}$ and $F_{TR \to N}$, respectively. Finally, the token permutation process ends with each class token interacting with its own patch tokens,

$$f_{(R,3)}, f_{(N,3)}, f_{(T,3)} = \mathrm{TPM}_3(F_{RN \to T}, F_{NT \to R}, F_{TR \to N}). \tag{11}$$

Through the above token permutations, the information from all other spectra is conveyed to the patch tokens through the class token, enabling robust feature alignment. Finally, we concatenate the permutated class tokens to obtain the permutated representation $f_{tp} \in \mathbb{R}^{3D}$,

$$f_{tp} = \mathrm{Concat}\big(f_{(R,3)}, f_{(N,3)}, f_{(T,3)}\big). \tag{12}$$

This cyclic token permutation enhances the spatial fusion and implicit alignment of deep features across spectra, improving the modeling of inter-spectral dependencies.

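The sketch below is one reading of Eqs. (4)-(12), assuming `nn.MultiheadAttention` as the MHCA operator. The head count, FFN width, and the use of separate (non-shared) weights per permutation step and per stream are assumptions not specified above.

```python
import torch
import torch.nn as nn

class TokenPermutationStep(nn.Module):
    """One permutation of Eqs. (6)-(8): a class token (query) cross-attends to
    the patch tokens of another spectrum (keys/values), then passes an FFN with
    a residual connection (Eq. (5))."""
    def __init__(self, dim=768, num_heads=8, ffn_ratio=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.mhca = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_ratio * dim), nn.GELU(),
                                 nn.Linear(ffn_ratio * dim, dim))

    def forward(self, cls_tok, patch_tok):
        # cls_tok: (B, 1, D) class token; patch_tok: (B, M, D) patch tokens.
        q, kv = self.norm_q(cls_tok), self.norm_kv(patch_tok)
        fused, _ = self.mhca(q, kv, kv)       # softmax(QK^T / sqrt(d)) V, Eq. (4)
        return self.ffn(fused) + fused        # Eq. (5)

class TPM(nn.Module):
    """Cyclic token permutation over three spectra (RGB, NIR, TIR), Eqs. (9)-(12):
    each class token visits the next spectrum's patch tokens, then the next,
    and finally its own, before the three class tokens are concatenated."""
    def __init__(self, dim=768):
        super().__init__()
        self.steps = nn.ModuleList(
            [nn.ModuleList([TokenPermutationStep(dim) for _ in range(3)])
             for _ in range(3)])

    def forward(self, F_R, F_N, F_T):
        cls = [F[:, :1] for F in (F_R, F_N, F_T)]     # class tokens (B, 1, D)
        patch = [F[:, 1:] for F in (F_R, F_N, F_T)]   # patch tokens (B, M, D)
        owners = [(1, 2, 0), (2, 0, 1), (0, 1, 2)]    # which spectrum's patches each step
        for step, owner in zip(self.steps, owners):
            cls = [blk(c, patch[o]) for blk, c, o in zip(step, cls, owner)]
        # f_tp in R^{3D}, Eq. (12)
        return torch.cat([c.squeeze(1) for c in cls], dim=-1)
```

Given the $(B, M+1, D)$ token sequences of the three streams, this module returns the $3D$-dimensional permutated representation $f_{tp}$ used for retrieval.
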
### Complementary Reconstruction Module

There are significant distribution gaps among different image spectra. In addition, the absence of certain image spectra is often encountered in the real world. Inspired by image generation (Zhu et al. 2017), we propose a Complementary Reconstruction Module (CRM) to reduce the distribution gap across different image spectra. The key is to incorporate dense token-level reconstruction constraints. Without loss of generality, we take the RGB stream as an example and consider the NIR and TIR spectra missing. To reconstruct the missing tokens, we pass $F_R$ through a Transformer-based Reconstruction (TransRe) block (see Fig. 3 (b)) and generate the corresponding tokens by

$$F_{R2N} = \mathrm{TransRe}(F_R), \tag{13}$$
$$F_{R2T} = \mathrm{TransRe}(F_R), \tag{14}$$

where $F_{R2N}, F_{R2T} \in \mathbb{R}^{D \times (M+1)}$ are the reconstructed tokens. The reconstructed tokens $F_{R2N}$ and $F_{R2T}$ are constrained by the real token features $F_N$ and $F_T$ using the Mean Squared Error (MSE) loss:

$$\mathcal{L}_{R2N} = \frac{1}{M+1}\sum_{i=1}^{M+1}\left\| F^{i}_{R2N} - F^{i}_{N} \right\|^{2}_{2}, \tag{15}$$
$$\mathcal{L}_{R2T} = \frac{1}{M+1}\sum_{i=1}^{M+1}\left\| F^{i}_{R2T} - F^{i}_{T} \right\|^{2}_{2}, \tag{16}$$
$$\mathcal{L}_{R} = \mathcal{L}_{R2N} + \mathcal{L}_{R2T}. \tag{17}$$

Through the above token-level reconstruction constraints, the distribution gap between RGB and the other spectra is reduced. To improve the reconstruction ability, we introduce similar constraints for all image spectra and achieve a multi-spectral complementary reconstruction. The complementary reconstruction loss $\mathcal{L}_{cr}$ can be expressed as the sum of the individual losses for each spectrum:

$$\mathcal{L}_{cr} = \mathcal{L}_{R} + \mathcal{L}_{N} + \mathcal{L}_{T}. \tag{18}$$

By introducing token-level constraints, our CRM effectively reduces the distribution gap among different image spectra. Moreover, it can generate the corresponding tokens of missing spectra, ensuring a unified learning framework even in scenarios where one or more spectra are absent.

### Dynamic Cooperation Between CRM and TPM

In this work, we further introduce a dynamic cooperation between the CRM and TPM to handle the absence of any image spectrum. For example, when the RGB image spectrum is missing, the token features $F_N$ and $F_T$ activate their reconstruction blocks to generate the corresponding RGB token features $F_{N2R}$ and $F_{T2R}$, respectively. Then, the reconstructed RGB token features can be represented as

$$F_{R} = \frac{1}{2}\left(F_{N2R} + F_{T2R}\right). \tag{19}$$

Then, $F_R$, $F_N$ and $F_T$ can be fed into the TPM to perform the token permutation as normal. Hence, our CRM can dynamically cooperate with TPM, ensuring that the missing spectrum can still participate in the permutation process.

### Objective Function

As illustrated in Fig. 2, our objective function comprises three components: the loss for the ViT backbones, the loss for the token permutation and the loss for the CRM. The ViT backbones and the token permutation are both supervised by the label-smoothing cross-entropy loss (Szegedy et al. 2016) and the triplet loss (Hermans, Beyer, and Leibe 2017). Finally, the total loss of our framework can be defined by

$$\mathcal{L}_{total} = \mathcal{L}^{ViT}_{tri} + \mathcal{L}^{ViT}_{ce} + \mathcal{L}^{TP}_{tri} + \mathcal{L}^{TP}_{ce} + \mathcal{L}_{cr}. \tag{20}$$

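As a companion sketch, the CRM and its dynamic cooperation with the TPM can be summarized as below. The single-layer `TransReBlock` follows the ablation on block depth reported later; the exact TransRe architecture, the gradient treatment of the target tokens, and the averaging rule in `fill_missing` are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransReBlock(nn.Module):
    """Transformer-based reconstruction block: maps the (M+1) tokens of one
    spectrum to the corresponding tokens of another spectrum. Approximated here
    with a standard encoder layer; the depth-1 choice follows the ablation."""
    def __init__(self, dim=768, num_heads=8, depth=1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, tokens):            # (B, M+1, D) -> (B, M+1, D)
        return self.blocks(tokens)

class CRM(nn.Module):
    """Complementary Reconstruction Module: one TransRe block per ordered
    spectrum pair, trained with dense token-level MSE constraints (Eqs. 13-18)."""
    def __init__(self, dim=768, spectra=("R", "N", "T")):
        super().__init__()
        self.spectra = spectra
        self.rec = nn.ModuleDict({f"{s}2{t}": TransReBlock(dim)
                                  for s in spectra for t in spectra if s != t})

    def loss(self, feats):
        # feats: dict like {"R": F_R, "N": F_N, "T": F_T}, each (B, M+1, D).
        # F.mse_loss averages over all elements, matching Eqs. (15)-(16) up to a
        # constant normalization factor.
        l_cr = 0.0
        for s in self.spectra:
            for t in self.spectra:
                if s != t:
                    l_cr = l_cr + F.mse_loss(self.rec[f"{s}2{t}"](feats[s]), feats[t])
        return l_cr                        # L_cr = L_R + L_N + L_T, Eq. (18)

    def fill_missing(self, feats, missing):
        # Dynamic cooperation with TPM: reconstruct a missing spectrum from the
        # available ones and average the reconstructions (Eq. 19).
        avail = [s for s in self.spectra if s != missing]
        recs = [self.rec[f"{s}2{missing}"](feats[s]) for s in avail]
        feats = dict(feats)
        feats[missing] = torch.stack(recs).mean(dim=0)
        return feats
```

During training, the returned $\mathcal{L}_{cr}$ would simply be added to the backbone and token-permutation losses as in Eq. (20); at test time, `fill_missing` lets the TPM run unchanged when a spectrum is absent.
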
## Experiments

### Dataset and Evaluation Protocols

To evaluate the performance, we adopt three multi-spectral object ReID datasets. RGBNT201 (Zheng et al. 2021) is the first multi-spectral person ReID dataset with RGB, NIR and TIR spectra. RGBNT100 (Li et al. 2020b) is a large-scale multi-spectral vehicle ReID dataset. MSVR310 (Zheng et al. 2022) is a small-scale multi-spectral vehicle ReID dataset with more complex scenarios. Following previous works, we adopt the mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) at Rank-K (K = 1, 5, 10) as our evaluation metrics.

### Implementation Details

Our model is implemented with the PyTorch toolbox, and experiments are conducted on one NVIDIA A800 GPU. We use Transformers pre-trained on the ImageNet classification dataset (Deng et al. 2009) as our backbones. All images are resized to 256 × 128 × 3 pixels. During training, random horizontal flipping, cropping and erasing (Zhong et al. 2020) are used as data augmentation. We set the mini-batch size to 128; each mini-batch consists of 8 randomly selected object identities, with 16 images sampled for each identity. We use the Stochastic Gradient Descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 0.0001. The learning rate is initialized to 0.009, and a warmup strategy with cosine decay is used during training.

### Comparison with State-of-the-Art Methods

| Methods | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|
| HACNN | 21.3 | 19.0 | 34.1 | 42.8 |
| MUDeep | 23.8 | 19.7 | 33.1 | 44.3 |
| OSNet | 25.4 | 22.3 | 35.1 | 44.7 |
| MLFN | 26.1 | 24.2 | 35.9 | 44.1 |
| CAL | 27.6 | 24.3 | 36.5 | 45.7 |
| PCB | 32.8 | 28.1 | 37.4 | 46.9 |
| HAMNet | 27.7 | 26.3 | 41.5 | 51.7 |
| PFNet | 38.5 | 38.9 | 52.0 | 58.4 |
| DENet | 42.4 | 42.2 | 55.3 | 64.5 |
| IEEE | 47.5 | 44.4 | 57.1 | 63.6 |
| TOP-ReID* | 72.3 | 76.6 | 84.7 | 89.4 |

Table 1: Performance comparison on RGBNT201. The best and second-best results are in bold and underlined, respectively. * signifies Transformer-based approaches, while the others are CNN-based.

| Methods | RGBNT100 mAP | RGBNT100 R-1 | MSVR310 mAP | MSVR310 R-1 |
|---|---|---|---|---|
| DMML | 58.5 | 82.0 | 19.1 | 31.1 |
| Circle Loss | 59.4 | 81.7 | 22.7 | 34.2 |
| PCB | 57.2 | 83.5 | 23.2 | 42.9 |
| MGN | 58.1 | 83.1 | 26.2 | 44.3 |
| BoT | 78.0 | 95.1 | 23.5 | 38.4 |
| HRCN | 67.1 | 91.8 | 23.4 | 44.2 |
| OSNet | 75.0 | 95.6 | 28.7 | 44.8 |
| AGW | 73.1 | 92.7 | 28.9 | 46.9 |
| TransReID* | 75.6 | 92.9 | 18.4 | 29.6 |
| GAFNet | 74.4 | 93.4 | - | - |
| GPFNet | 75.0 | 94.5 | - | - |
| PHT* | 79.9 | 92.7 | - | - |
| PFNet | 68.1 | 94.1 | 23.5 | 37.4 |
| HAMNet | 74.5 | 93.3 | 27.1 | 42.3 |
| CCNet | 77.2 | 96.3 | 36.4 | 55.2 |
| TOP-ReID* | 81.2 | 96.4 | 35.9 | 44.6 |

Table 2: Performance on RGBNT100 and MSVR310.

**Multi-spectral Person ReID.** In Tab. 1, we compare our TOP-ReID with both single-spectral and multi-spectral methods on RGBNT201. The results indicate that single-spectral methods generally achieve lower performance than multi-spectral methods, which demonstrates the effectiveness of utilizing complementary information from different image spectra. Among the single-spectral methods, PCB achieves the highest performance, attaining an mAP of 32.8% and a Rank-1 accuracy of 28.1%. As for the multi-spectral methods, our TOP-ReID achieves remarkable performance. Specifically, it achieves an mAP that is 24.8% higher than IEEE and a Rank-1 accuracy that surpasses IEEE by 32.2%. These performance gains provide strong evidence for TOP-ReID in tackling the challenges of multi-spectral person ReID.

**Multi-spectral Vehicle ReID.** As shown in Tab. 2, single-spectral methods such as OSNet (Zhou et al. 2019), AGW (Ye et al. 2021) and TransReID (He et al. 2021) stand out for their competitive performance. Among multi-spectral methods, CCNet achieves remarkable results across both datasets. On the RGBNT100 dataset, our TOP-ReID outperforms CCNet with a 4.0% higher mAP. On the small-scale MSVR310 dataset, our TOP-ReID maintains competitive performance, showing its versatility and robustness.

| Methods (mAP / R-1) | M (RGB) | M (NIR) | M (TIR) | M (RGB+NIR) | M (RGB+TIR) | M (NIR+TIR) |
|---|---|---|---|---|---|---|
| HACNN | 12.5 / 11.1 | 20.5 / 19.4 | 16.7 / 13.3 | 9.2 / 6.2 | 6.3 / 2.2 | 14.8 / 12.0 |
| MUDeep | 19.2 / 16.4 | 20.0 / 17.2 | 18.4 / 14.2 | 13.7 / 11.8 | 11.5 / 6.5 | 12.7 / 8.5 |
| OSNet | 19.8 / 17.3 | 21.0 / 19.0 | 18.7 / 14.6 | 12.3 / 10.9 | 9.4 / 5.4 | 13.0 / 10.2 |
| MLFN | 20.2 / 18.9 | 21.1 / 19.7 | 17.6 / 11.1 | 13.2 / 12.1 | 8.3 / 3.5 | 13.1 / 9.1 |
| CAL | 21.4 / 22.1 | 24.2 / 23.6 | 18.0 / 12.4 | 18.6 / 20.1 | 10.0 / 5.9 | 17.2 / 13.2 |
| PCB | 23.6 / 24.2 | 24.4 / 25.1 | 19.9 / 14.7 | 20.6 / 23.6 | 11.0 / 6.8 | 18.6 / 14.4 |
| PFNet | - / - | 31.9 / 29.8 | 25.5 / 25.8 | - / - | - / - | 26.4 / 23.4 |
| DENet | - / - | 35.4 / 36.8 | 33.0 / 35.4 | - / - | - / - | 32.4 / 29.2 |
| TOP-ReID | 54.4 / 57.5 | 64.3 / 67.6 | 51.9 / 54.5 | 35.3 / 35.4 | 26.2 / 26.0 | 34.1 / 31.7 |

Table 3: Experimental results of missing-spectral tasks on RGBNT201. M (X) stands for missing the X image spectra.

| Model | BL | AL | TPM | CRM | mAP | R-1 | R-5 | R-10 |
|---|---|---|---|---|---|---|---|---|
| A | ✓ | | | | 55.9 | 54.9 | 70.8 | 77.6 |
| B | | ✓ | | | 62.9 | 64.5 | 77.4 | 82.7 |
| C | | ✓ | ✓ | | 67.8 | 69.4 | 83.3 | 88.8 |
| D | | ✓ | ✓ | ✓ | 72.3 | 76.6 | 84.7 | 89.4 |

Table 4: Performance comparison with different modules on RGBNT201.

**Evaluation on Missing-spectral Scenarios.** As shown in Tab. 3, all single-spectral methods suffer from performance degradation when image spectra are missing. Multi-spectral methods demonstrate better robustness than single-spectral methods. Our proposed TOP-ReID achieves remarkable performance even in the presence of missing spectra. It consistently outperforms both single-spectral and multi-spectral methods in all missing-spectral scenarios, indicating its effectiveness in handling spectral incompleteness. In addition, compared with PFNet and DENet, our TOP-ReID is a more flexible framework that can address the absence of any spectra.

### Ablation Studies

To investigate the effect of different components, we further perform a series of ablation studies on RGBNT201.

**Effects of Key Modules.** Tab. 4 illustrates the performance comparison with different modules. Model A is the baseline, which utilizes the multi-stream ViT-B/16 backbones. BL means the triplet loss and cross-entropy loss are added before the concatenation of multi-spectral features, while AL means these losses are employed after the feature concatenation. It can be observed that the AL setting shows better results. The main reason is that the fused multi-spectral feature is more powerful than the simple feature concatenation. Furthermore, by integrating our TPM, Model C yields higher performance, with an mAP of 67.8% and a Rank-1 of 69.4%. By introducing CRM, the final model achieves the best performance, with an mAP of 72.3% and a Rank-1 of 76.6%. These improvements validate the effectiveness of our key modules in handling complex ReID scenarios.

Figure 4: Performance of deploying TPM at different layers (mAP and Rank-1 of $f_{tp}$ when TPM is plugged into layers 3, 6, 9 and 12).

**TPM at Different Layers.** In fact, our TPM is a plug-and-play module. We explore the effect of TPM at different layers of the ViT backbone. Fig. 4 shows the performance of TPM at different layers. We observe that as the plugged depth of TPM increases, the performance greatly improves. When deployed in the last layer, it achieves the best performance. This indicates that the effect of our TPM is more pronounced in deep layers, capturing more discriminative representations.

Figure 5: Effects of different depths of TransRe blocks (mAP for the full-spectral and missing-spectral settings at depths 1 to 4).

**Effects of TransRe Blocks in CRM.** The depth of TransRe blocks may impact the reconstruction ability. As illustrated in Fig. 5, the ReID performance is relatively consistent when using different depths of TransRe blocks.
In Fig. 5, we also provide the comparison results for the missing-spectral cases. It can be observed that the overall performance is acceptable when only using one block. Thus, we utilize one TransRe block to reduce the computation.

| Methods | ViT-B/16 (mAP / R-1 / R-5 / R-10) | DeiT-S/16 (mAP / R-1 / R-5 / R-10) | T2T-ViT-24 (mAP / R-1 / R-5 / R-10) |
|---|---|---|---|
| RGB | 29.0 / 26.2 / 44.5 / 56.1 | 33.3 / 30.6 / 49.5 / 58.0 | 30.1 / 30.3 / 47.2 / 56.6 |
| NIR | 18.7 / 14.0 / 31.1 / 44.9 | 22.7 / 21.4 / 39.4 / 47.6 | 15.8 / 15.3 / 27.0 / 36.8 |
| TIR | 33.4 / 32.3 / 52.4 / 63.3 | 27.1 / 26.3 / 41.3 / 51.4 | 34.0 / 36.2 / 52.0 / 62.0 |
| NIR-TIR | 45.9 / 43.4 / 59.9 / 69.4 | 40.6 / 40.9 / 54.3 / 61.0 | 40.9 / 40.7 / 56.3 / 64.2 |
| RGB-NIR | 39.0 / 40.2 / 56.6 / 65.7 | 46.7 / 45.0 / 62.6 / 70.0 | 36.3 / 35.2 / 53.8 / 66.3 |
| RGB-TIR | 52.6 / 53.8 / 69.0 / 78.2 | 49.3 / 47.8 / 64.1 / 72.8 | 49.9 / 51.7 / 66.7 / 73.8 |
| RGB-TIR-NIR | 55.9 / 54.9 / 70.8 / 77.6 | 55.1 / 53.3 / 67.3 / 76.2 | 52.2 / 51.3 / 64.1 / 74.3 |
| Baseline (AL) | 62.9 / 64.5 / 77.4 / 82.7 | 59.9 / 61.1 / 73.9 / 80.9 | 56.2 / 60.4 / 73.0 / 78.6 |
| Baseline (AL) + TPM | 67.8 / 69.4 / 83.3 / 88.8 | 63.0 / 63.9 / 78.1 / 83.9 | 58.2 / 60.8 / 74.9 / 81.4 |
| Baseline (AL) + CRM + TPM | 72.3 / 76.6 / 84.7 / 89.4 | 69.0 / 73.6 / 81.8 / 84.7 | 60.0 / 61.6 / 76.2 / 82.3 |

Table 5: Performance comparison of different backbones with different spectra and modules on RGBNT201.

**Effects of Different Transformer-based Backbones.** To verify the generalization of our TOP-ReID, we adopt three different Transformer-based backbones, i.e., ViT-B/16, DeiT-S/16 and T2T-ViT-24. Tab. 5 illustrates the performance comparison. As can be observed, ViT-B/16 delivers the best results. With more image spectra, different backbones can consistently improve the performance. Our proposed TPM and CRM improve the performance with all of these backbones. We believe that the performance can be further improved by using more powerful backbones.

### Visualization Analysis

To clarify the learning ability, we present visual results on the feature distributions and discriminative attention maps.

Figure 6: Comparison of feature distributions by using t-SNE. Different colors represent different identities. (a) RGB-TIR-NIR; (b) Baseline (AL); (c) Baseline (AL) + TPM; (d) Baseline (AL) + CRM + TPM.

**Multi-spectral Feature Distributions.** Fig. 6 illustrates the feature distributions of different models by using t-SNE (Van der Maaten and Hinton 2008). Fig. 6 (a) represents the direct concatenation of single-spectral features, where each stream is individually trained. It can be observed that the AL setting can effectively align the features of different spectra with better ID consistency. With our TPM, the features of the same ID across different spectra are more concentrated, and the gaps between different IDs are more distinct. Furthermore, with CRM, the feature distribution becomes more compact, and the number of outliers for each ID is reduced. This visualization provides strong evidence for the effectiveness of our proposed methods.

Figure 7: Discriminative attention maps. (a) Input images; (b) Full; (c) M (RGB); (d) M (NIR); (e) M (TIR); (f) M (NIR+TIR); (g) M (RGB+TIR); (h) M (RGB+NIR).

**Discriminative Attention Maps.** As shown in Fig. 7, we utilize Grad-CAM (Selvaraju et al. 2017) to visualize the discriminative attention maps with different image spectra. Obviously, there are discriminative differences between different image spectra. Our model is powerful and can highlight discriminative regions even when image spectra are missing.

## Conclusion

In this work, we propose a novel feature learning framework based on token permutations for multi-spectral object ReID.
Our approach incorporates a Token Permutation Module (TPM) for spatial feature alignment and a Complementary Reconstruction Module (CRM) for reducing the distribution gap across different image spectra. Through the dynamic cooperation between TPM and CRM, it can handle the missing-spectral problem, which is more flexible than previous methods. Extensive experiments on three benchmarks clearly demonstrate the effectiveness of our methods.

## Acknowledgments

This work was supported in part by the National Key Research and Development Program of China (No. 2018AAA0102001), National Natural Science Foundation of China (No. 62101092), Open Project of Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, Anhui University (No. MMC202102) and Fundamental Research Funds for the Central Universities (No. DUT22QN228).

## References

Ba, J. L.; Kiros, J. R.; and Hinton, G. E. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.
Chang, X.; Hospedales, T. M.; and Xiang, T. 2018. Multi-level factorisation net for person re-identification. In CVPR, 2109–2118.
Chen, G.; Zhang, T.; Lu, J.; and Zhou, J. 2019. Deep meta metric learning. In ICCV, 9547–9556.
Chen, Y.; Xia, S.; Zhao, J.; Zhou, Y.; Niu, Q.; Yao, R.; Zhu, D.; and Liu, D. 2022. ResT-ReID: Transformer block-based residual learning for person re-identification. PRL, 157: 90–96.
Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255.
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Guo, J.; Zhang, X.; Liu, Z.; and Wang, Y. 2022. Generative and attentive fusion for multi-spectral vehicle re-identification. In ICSP, 1565–1572.
He, Q.; Lu, Z.; Wang, Z.; and Hu, H. 2023. Graph-Based Progressive Fusion Network for Multi-Modality Vehicle Re-Identification. TITS, 1–17.
He, S.; Luo, H.; Wang, P.; Wang, F.; Li, H.; and Jiang, W. 2021. TransReID: Transformer-based object re-identification. In ICCV, 15013–15022.
Hermans, A.; Beyer, L.; and Leibe, B. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
Li, D.; Wei, X.; Hong, X.; and Gong, Y. 2020a. Infrared-visible cross-modal person re-identification with an X modality. In AAAI, volume 34, 4610–4617.
Li, H.; Li, C.; Zhu, X.; Zheng, A.; and Luo, B. 2020b. Multi-spectral vehicle re-identification: A challenge. In AAAI, volume 34, 11345–11353.
Li, W.; Zhu, X.; and Gong, S. 2018. Harmonious attention network for person re-identification. In CVPR, 2285–2294.
Liu, H.; Chai, Y.; Tan, X.; Li, D.; and Zhou, X. 2021a. Strong but simple baseline with dual-granularity triplet loss for visible-thermal person re-identification. SPL, 28: 653–657.
Liu, X.; Yu, C.; Zhang, P.; and Lu, H. 2023. Deeply Coupled Convolution Transformer With Spatial-Temporal Complementary Learning for Video-Based Person Re-Identification. TNNLS.
Liu, X.; Zhang, P.; Yu, C.; Lu, H.; and Yang, X. 2021b. Watching You: Global-Guided Reciprocal Learning for Video-Based Person Re-Identification. In CVPR, 13334–13343.
Luo, H.; Gu, Y.; Liao, X.; Lai, S.; and Jiang, W. 2019. Bag of tricks and a strong baseline for deep person re-identification. In CVPRW, 1487–1495.
Pan, W.; Huang, L.; Liang, J.; Hong, L.; and Zhu, J. 2023. Progressively Hybrid Transformer for Multi-Modal Vehicle Re-Identification. Sensors, 23(9): 4206.
Pan, W.; Wu, H.; Zhu, J.; Zeng, H.; and Zhu, X. 2022. HViT: Hybrid vision transformer for multi-modal vehicle re-identification. In CICAI, 255–267. Springer.
Qian, X.; Fu, Y.; Jiang, Y.-G.; Xiang, T.; and Xue, X. 2017. Multi-scale deep learning architectures for person re-identification. In ICCV, 5399–5408.
Rao, Y.; Chen, G.; Lu, J.; and Zhou, J. 2021. Counterfactual attention learning for fine-grained visual categorization and re-identification. In ICCV, 1025–1034.
Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 618–626.
Sun, Y.; Cheng, C.; Zhang, Y.; Zhang, C.; Zheng, L.; Wang, Z.; and Wei, Y. 2020. Circle loss: A unified perspective of pair similarity optimization. In CVPR, 6398–6407.
Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; and Wang, S. 2018. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 480–496.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; and Jégou, H. 2021. Training data-efficient image transformers & distillation through attention. In ICML, 10347–10357.
Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. JMLR, 9(11).
Wang, G.; Yuan, Y.; Chen, X.; Li, J.; and Zhou, X. 2018. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, 274–282.
Wang, H.; Shen, J.; Liu, Y.; Gao, Y.; and Gavves, E. 2022a. NFormer: Robust person re-identification with neighbor transformer. In CVPR, 7297–7307.
Wang, Z.; Li, C.; Zheng, A.; He, R.; and Tang, J. 2022b. Interact, embed, and enlarge: Boosting modality-specific representations for multi-modal person re-identification. In AAAI, volume 36, 2633–2641.
Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; and Hoi, S. C. 2021. Deep learning for person re-identification: A survey and outlook. TPAMI, 44(6): 2872–2893.
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F. E.; Feng, J.; and Yan, S. 2021. Tokens-to-token ViT: Training vision transformers from scratch on ImageNet. In ICCV, 558–567.
Zhang, G.; Zhang, P.; Qi, J.; and Lu, H. 2021. HAT: Hierarchical aggregation transformers for person re-identification. In ACM MM, 516–525.
Zhang, Y.; and Wang, H. 2023. Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-identification. In CVPR, 2153–2162.
Zhao, J.; Zhao, Y.; Li, J.; Yan, K.; and Tian, Y. 2021. Heterogeneous relational complement for vehicle re-identification. In ICCV, 205–214.
Zheng, A.; He, Z.; Wang, Z.; Li, C.; and Tang, J. 2023. Dynamic Enhancement Network for Partial Multi-modality Person Re-identification. arXiv preprint arXiv:2305.15762.
Zheng, A.; Wang, Z.; Chen, Z.; Li, C.; and Tang, J. 2021. Robust multi-modality person re-identification. In AAAI, volume 35, 3529–3537.
Zheng, A.; Zhu, X.; Ma, Z.; Li, C.; Tang, J.; and Ma, J. 2022. Multi-spectral vehicle re-identification with cross-directional consistency network and a high-quality benchmark. arXiv preprint arXiv:2208.00632.
Zhong, Z.; Zheng, L.; Kang, G.; Li, S.; and Yang, Y. 2020. Random erasing data augmentation. In AAAI, volume 34, 13001–13008.
Zhou, K.; Yang, Y.; Cavallaro, A.; and Xiang, T. 2019. Omni-scale feature learning for person re-identification. In ICCV, 3702–3712.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2223–2232.
Zhu, K.; Guo, H.; Zhang, S.; Wang, Y.; Huang, G.; Qiao, H.; Liu, J.; Wang, J.; and Tang, M. 2021. AAformer: Auto-aligned transformer for person re-identification. arXiv preprint arXiv:2104.00921.