# Bi-directional Adapter for Multimodal Tracking

Bing Cao, Junliang Guo, Pengfei Zhu*, Qinghua Hu
Tianjin Key Lab of Machine Learning, College of Intelligence and Computing, Tianjin University, China
{caobing,guojunliang,zhupengfei,huqinghua}@tju.edu.cn
*Corresponding author

Abstract
Due to the rapid development of computer vision, single-modal (RGB) object tracking has made significant progress in recent years. Considering the limitations of a single imaging sensor, multi-modal images (RGB, infrared, etc.) are introduced to compensate for this deficiency and enable all-weather object tracking in complex environments. However, since acquiring sufficient multi-modal tracking data is hard while the dominant modality changes with the open environment, most existing techniques fail to extract multi-modal complementary information dynamically, yielding unsatisfactory tracking performance. To handle this problem, we propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter that cross-prompts multiple modalities mutually. Our model consists of a universal bi-directional adapter and multiple modality-specific transformer encoder branches with shared parameters. The encoders extract the features of each modality separately using a frozen, pre-trained foundation model. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another, performing visual feature prompt fusion in an adaptive manner. By adding only a few (0.32M) trainable parameters, our model achieves superior tracking performance compared with both full fine-tuning methods and prompt learning-based methods. Our code is available at: https://github.com/SparkTempest/BAT.

Introduction
Object tracking, a foundational visual task of computer vision, has achieved significant progress over the past decades. Many excellent approaches (Zhang et al. 2021b; Yang et al. 2022; Zhang et al. 2021a; Lu et al. 2022; Zhu et al. 2023) and benchmarks (Li et al. 2016, 2019, 2021; Zhang et al. 2022) have emerged and achieved promising performance on RGB-based object tracking. However, due to the imaging mechanism of visible light, some complex scenarios in open environments, such as illumination variation, limit the practical effectiveness of solely RGB-based object tracking, leading to missed targets or tracking errors. Different from RGB cameras that capture the light reflected by objects, thermal infrared (TIR) imaging sensors capture the heat emitted by the object itself.

Figure 1: Different dominant modalities in complex scenarios (panels: RGB > TIR, RGB = TIR, RGB < TIR; legend: ViPT, Ours, Ground Truth). The image with the green box represents the dominant modality, and the red box represents the auxiliary modality.

Compared to RGB images, which contain rich color and texture in good lighting conditions but fail in dark conditions, TIR images provide significant contrast for heat-emitting objects while presenting low resolution and poor texture. Consequently, to overcome the inherent shortcomings of single-modality methods, multi-modal object tracking has emerged, fully leveraging RGB and thermal images to perform more robust all-weather tracking.
However, existing multi-modal tracking methods face two main issues: i) due to the high data-labeling cost of multi-modal object tracking, most existing datasets are limited in scale, which is insufficient to support building an effective multi-modal tracker from scratch; ii) the dominant correlation among multi-modal data is not fixed, as shown in Fig. 1, because different imaging modalities have varying sensitivities to objects in changing environments. To address the first limitation, since pure RGB sequences are much easier to acquire than RGB-T sequence pairs, some multi-modal tracking works (Zhu et al. 2019; Gao et al. 2019) are first pre-trained on RGB sequences and then transferred to multi-modal scenarios in a full fine-tuning manner. For example, mfDiMP (Zhang et al. 2019) takes the pre-trained DiMP as its foundation model and fine-tunes it on generated RGB-T images. Some researchers develop attribute-based multi-modal fusion models (Li et al. 2020; Zhang et al. 2021a; Xiao et al. 2022) to reduce the reliance on large-scale training data while improving fusion capabilities with a small number of parameters. Although these methods achieve considerable progress, they suffer from expensive training time and inefficiency while showing limited performance. In addition to full fine-tuning approaches, some recent methods (Yang et al. 2022; Zhu et al. 2023) introduce parameter-efficient prompt tuning to multi-modal tracking by freezing the backbone parameters and attaching a small set of learnable parameters. These methods commonly take one modality (usually RGB) as the dominant modality and the other as the auxiliary modality. However, they ignore the dynamic dominant correlation of multi-modal data, making it difficult to fully exploit the complementary multi-modal information in complex scenarios, as shown in Fig. 1, thus limiting tracking performance.

To this end, we propose a Bi-directional Adapter for Multi-modal Tracking (BAT). Different from methods that add the auxiliary modality as a prompt to the dominant modality (which is usually RGB) to enhance the representation ability of the foundation model on downstream tasks, we do not preset a fixed dominant-auxiliary assignment; instead, BAT dynamically transfers effective information from the changing auxiliary modality to the dominant modality. BAT consists of two modality-specific branches and a universal bi-directional adapter. Each modality-specific branch is initialized by the foundation model, whose parameters are fixed during training. Each modality branch learns prompt information from the other modality and integrates it with the feature information of the current modality, enhancing the representation ability. The two modality-specific branches interact through the universal bi-directional adapter to dynamically fuse dominant-auxiliary information mutually in a non-fixed multi-modal association paradigm. The universal bi-directional adapter has a lightweight hourglass structure and can be embedded in each transformer layer of the foundation model without introducing a large number of learnable parameters. Experiments on the RGBT234 (Li et al. 2019) and LasHeR (Li et al. 2021) datasets validate the effectiveness of our BAT framework. By training only a few parameters, BAT achieves significant advantages over the competing methods.
Our main contributions are summarized as follows:
- We propose an adapter-based visual prompt framework for multi-modal tracking. Our model perceives the dynamic changes of the dominant modality in open scenarios and effectively fuses multi-modal information in an adaptive manner.
- To the best of our knowledge, we propose, for the first time, a universal bi-directional adapter for the foundation model. It effectively cross-prompts multi-modal tracking with a simple and efficient structure. By adding only 0.32M learnable parameters, our model achieves robust multi-modal tracking in open scenarios.
- We analyze in depth the effect of embedding the universal adapter at different layer depths and explore an even more efficient adapter architecture in experiments, validating our superiority over the state-of-the-art on multiple RGBT tracking datasets.

Related Works

Multi-modal Tracking
Object tracking is designed to track an object assigned in the initial frame and predict its position and scale in subsequent frames. Although numerous excellent studies (Ye et al. 2022; Cui et al. 2022; Lan et al. 2023) have been proposed and achieve impressive tracking performance, single-modal object tracking is not adequate for certain situations, such as low illumination, occlusion, or thermal crossover. Accounting for this, multi-modal tracking has gained increasing attention because different modalities can offer complementary information mutually, boosting tracking performance in challenging scenarios that are difficult to handle with single-modal images alone. For example, FANet (Zhu et al. 2020) designs a feature aggregation module to fuse hierarchical features within each modality and an adaptive aggregation module to fuse features across modalities. HMFT (Zhang et al. 2022) designs a hierarchical fusion framework to integrate multi-modal features. APFNet (Xiao et al. 2022) uses an attribute-based fusion framework to aggregate attribute-specific fusion features, with a transformer structure to strengthen the multi-modality features.

Parameter-Efficient Tuning
Fine-tuning is a widely studied technique that transfers large pre-trained models to downstream tasks by updating all the parameters on task-oriented data. Such full fine-tuning methods are parameter-inefficient and also require sufficient data to optimize all the parameters. Recently, prefix-tuning, a new paradigm for parameter-efficient tuning, has become widely employed in natural language processing (NLP) and has demonstrated its efficiency in a variety of computer vision tasks (Khattak et al. 2023). VPT (Jia et al. 2022) introduces prompt tuning to vision tasks, adding learnable tokens at the input layer and freezing the backbone so that only the classification head and the newly added prompt tokens are trained, achieving better results than full fine-tuning. ProTrack (Yang et al. 2022) provides a new perspective for multi-modal tracking by transforming multi-modal inputs into a single modality through a prompt paradigm. It exploits the tracking ability of pre-trained RGB trackers rather than building complex multi-modal fusion modules. Inspired by this, ViPT (Zhu et al. 2023) designs a learnable prompt-generation module that generates prompts for the RGB modality from the thermal infrared modality on downstream tasks.
Unlike visual-language models, RGB-T tracking employs two comparable modalities, both of which can be served by the pre-trained visual foundation model. However, previous methods mainly take RGB as the dominant modality, ignoring dynamically changing environments in which TIR has stronger representation ability than RGB and acts as the dominant modality. This motivates us to break away from the fixed multi-modal correlation paradigm and design a universal bi-directional adapter that does not predefine the dominant modality and can adaptively extract features from both RGB and TIR.

Figure 2: The overall architecture of our proposed BAT. We first transform the template frame and search frame of each modality into tokens, then concatenate them and pass them through the N-layer dual-stream transformer encoder, respectively. The bi-directional adapter is paralleled with the dual-stream encoder layers and learns feature prompts from one modality to the other. Finally, the output features of the two branches are added and fed into the prediction head for the final tracking result.

Methodology
In this paper, we propose a novel universal bi-directional adapter for multi-modal tracking (BAT), which cross-prompts multi-modal data mutually. Instead of fully fine-tuning the foundation model, BAT transfers the pre-trained tracker to multi-modal scenarios effectively and efficiently by learning only the lightweight adapter, achieving excellent multi-modal complementarity and superior tracking accuracy. We present the overall architecture of BAT in Fig. 2.

Multi-modal Tracking
Given a video $V$ with an initial box position $B_0$ of the target object $Z_0$ in the first frame $I_{template}$, single-modal object tracking learns to search for this object in the subsequent frames $I_{search}$. Typically, the object tracker $T$ consists of a feature extraction network $F$ and a box prediction head $H$. For a transformer-based foundation model, the template frame $I_{template}$ and the search frame $I_{search}$ are transformed into tokens by patch embedding and position embedding, and then concatenated together to pass through the $N$-layer transformer encoder for joint feature extraction. Finally, the output tokens of the encoder corresponding to the search image are fed into the prediction head to obtain the target tracking result. Thus, the position of the box $B$ in subsequent frames is predicted by
$$B = H(F(I_{template}, I_{search}, B_0)), \quad (1)$$
where $F$ is a pre-trained transformer backbone with powerful representation ability. Multi-modal tracking (MMT) extends this setting to multiple videos in different modalities by introducing another modal stream, which jointly makes the final decision for the tracked objects. Taking RGB-T as an example, the RGB and thermal modalities are temporally synchronized and spatially aligned. MMT tracks $Z_0$ in the subsequent frames of both the RGB modality $I^{RGB}_{search}$ and the TIR modality $I^{TIR}_{search}$ as
$$B = H(F(I^{RGB}_{template}, I^{TIR}_{template}, I^{RGB}_{search}, I^{TIR}_{search}, B_0)). \quad (2)$$
As shown in Fig. 2, our BAT has a dual-stream encoder structure for the RGB modality and the thermal infrared modality, respectively, and the two streams share the same parameters.
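To make the dual-stream pipeline above concrete, the following is a minimal PyTorch-style sketch of the forward pass in Eq. (2); the layer-by-layer cross-modal prompting is detailed in the next subsection. All module names (`embed`, `dual_layers`, `head`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class BATSketch(nn.Module):
    """Conceptual sketch of BAT's dual-stream forward pass (not the official code).

    `embed` is the frozen patch + position embedding, `dual_layers` is a list of
    dual-stream encoder layers (frozen transformer blocks shared by both branches,
    paralleled with trainable bi-directional adapters), and `head` is the frozen
    prediction head of the foundation tracker.
    """

    def __init__(self, embed: nn.Module, dual_layers: nn.ModuleList, head: nn.Module):
        super().__init__()
        self.embed = embed
        self.dual_layers = dual_layers
        self.head = head

    def forward(self, rgb_template, rgb_search, tir_template, tir_search):
        # Tokenize template and search frames of each modality and concatenate them.
        x_rgb = torch.cat([self.embed(rgb_template), self.embed(rgb_search)], dim=1)
        x_tir = torch.cat([self.embed(tir_template), self.embed(tir_search)], dim=1)

        # Joint feature extraction with cross-modal feature prompts at every layer.
        for dual_layer in self.dual_layers:
            x_rgb, x_tir = dual_layer(x_rgb, x_tir)

        # Add the two modality branches and predict the target box.
        return self.head(x_rgb + x_tir)
```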
BAT first feeds the two modalities $I^{RGB}_{template}, I^{RGB}_{search}$ and $I^{TIR}_{template}, I^{TIR}_{search}$ to a patch and position embedding layer to obtain the RGB tokens $x^{RGB}_0$ and the TIR tokens $x^{TIR}_0$. Then, our universal bi-directional adapter is embedded in each layer of the transformer encoder, bridging the encoders of the two modalities. In the $(i+1)$-th layer of each encoder, the modality-specific feature is integrated with complementary information of the other modality from the previous layer. In this way, each encoder learns feature prompts from the other modality in a layer-by-layer manner,
$$(x^{RGB}_{i+1}, x^{TIR}_{i+1}) = F^A_i(x^{RGB}_i, x^{TIR}_i), \quad i = 1, 2, \ldots, N, \quad (3)$$
where $F^A_i$ refers to the dual-stream encoder layer paralleled with our bi-directional adapter. Thereby, the multi-modal features of the tracked object are progressively and dynamically extracted through the $N$ layers of the transformer encoder in the foundation model. Finally, the features of the two modality branches are added and fed into the prediction head to obtain the final tracking result,
$$B = H(x^{RGB}_N + x^{TIR}_N). \quad (4)$$

Bi-directional Adapter
Our bi-directional adapter is designed to transfer complementary features from one modality to the other in a universal manner. The input modality is self-adaptive: the adapter dynamically extracts features from the auxiliary modality and transfers them to the dominant modality as the environment changes. As shown in Fig. 2, the bi-directional adapter adopts a modular design and is embedded in the multi-head self-attention stage and the MLP stage separately. Here, we take the processing of $x^{RGB}_i$ and $x^{TIR}_i$ as an example to detail our bi-directional adapter. The $i$-th layer of the RGB branch integrates the auxiliary information from the TIR branch through the adapter as
$$\hat{x}^{RGB}_i = x^{RGB}_i + F_{Att}(x^{RGB}_i) + P^{TIR}_i, \quad P^{TIR}_i = F_{Ada}(x^{TIR}_i), \quad i = 1, 2, \ldots, N, \quad (5)$$
where $F_{Att}$ and $F_{Ada}$ represent the multi-head self-attention block and our bi-directional adapter network, respectively, and $P^{TIR}_i = F_{Ada}(x^{TIR}_i)$ is the feature prompt extracted from the TIR modality. Concretely, $x^{RGB}_i$ is fed to the multi-head self-attention block after a layer-norm operator, and the output is then added to $x^{RGB}_i$ and $P^{TIR}_i$ to obtain $\hat{x}^{RGB}_i$. In the next stage, $\hat{x}^{RGB}_i$ is fed into the multi-layer perceptron $F_{MLP}$ and added together with the feature prompt $\hat{P}^{TIR}_i$ and $\hat{x}^{RGB}_i$ to obtain the output $x^{RGB}_{i+1}$ of the $(i+1)$-th layer of the RGB encoder,
$$x^{RGB}_{i+1} = \hat{x}^{RGB}_i + F_{MLP}(\hat{x}^{RGB}_i) + \hat{P}^{TIR}_i, \quad (6)$$
$$\hat{P}^{TIR}_i = F_{Ada}(\hat{x}^{TIR}_i), \quad i = 1, 2, \ldots, N. \quad (7)$$

Figure 3: The detailed architecture of the bi-directional adapter. It consists of three linear projection layers; $t_n$ denotes the number of tokens of each modality. The input tokens are first reduced to dimension $d_e$ by the down-projection layer and passed through a linear projection layer, then up-projected to the original dimension $d_t$ and fed into the other modality as feature prompts.

The detailed architecture of our bi-directional adapter is depicted in Fig. 3; it is designed to transfer feature prompts from one modality to the other. The input tokens of the bi-directional adapter block are first reduced to dimension $d_e$ by a down-projection layer and then passed through a linear projection layer. Next, they are up-projected to the original dimension $d_t$ and fed back into the transformer encoder layer of the other modality as the feature prompt. Through this simple structure, the bi-directional adapter effectively performs feature prompting between the RGB and TIR branches for multi-modal tracking. Since the transformer encoder and the prediction head are frozen, we only need to optimize the few parameters of the newly added adapters. It is worth noting that, different from most conventional adapters, our bi-directional adapter serves as a cross-modal feature prompt for the dynamically changing dominant modality, ensuring promising tracking performance in the open world.
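As a concrete reading of Fig. 3 and Eqs. (5)-(7), here is a minimal PyTorch sketch of the hourglass adapter and of one dual-stream encoder layer built around frozen transformer blocks. The bottleneck width `d_e`, the function signatures, and the choice of whether the attention-stage and MLP-stage prompts come from the same adapter instance are illustrative assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class BiDirectionalAdapter(nn.Module):
    """Hourglass adapter of Fig. 3: down-projection, linear projection, up-projection.

    A single instance is shared by both modality branches, so the same weights
    transfer feature prompts RGB -> TIR and TIR -> RGB.
    """

    def __init__(self, d_t: int = 768, d_e: int = 16):  # d_e is an illustrative guess
        super().__init__()
        self.down = nn.Linear(d_t, d_e)  # reduce token dimension d_t -> d_e
        self.mid = nn.Linear(d_e, d_e)   # lightweight linear projection
        self.up = nn.Linear(d_e, d_t)    # restore the original dimension d_e -> d_t

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.mid(self.down(x)))


def dual_stream_layer(x_rgb, x_tir, attn, mlp, norm1, norm2, adapter):
    """One dual-stream encoder layer with cross-modal feature prompts (Eqs. 5-7).

    `attn`, `mlp`, `norm1`, `norm2` are the frozen blocks of a pre-trained
    transformer layer shared by both branches; only `adapter` is trainable.
    Whether the attention and MLP stages use the same adapter instance or two
    separate ones is an implementation choice left open here.
    """
    # Attention stage: each branch receives a prompt from the other branch (Eq. 5).
    x_rgb_hat = x_rgb + attn(norm1(x_rgb)) + adapter(x_tir)
    x_tir_hat = x_tir + attn(norm1(x_tir)) + adapter(x_rgb)

    # MLP stage: prompts are recomputed from the attention-stage outputs (Eqs. 6-7).
    x_rgb_next = x_rgb_hat + mlp(norm2(x_rgb_hat)) + adapter(x_tir_hat)
    x_tir_next = x_tir_hat + mlp(norm2(x_tir_hat)) + adapter(x_rgb_hat)
    return x_rgb_next, x_tir_next
```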
Objective Loss
The token sequence is first converted into a 2D spatial feature map by a series of fully convolutional networks (FCN), which output the target classification score map (indicating the target location), the offset, and the normalized bounding box. The overall loss function of BAT is formulated as
$$L_{total} = L_{cls} + \lambda_1 L_{iou} + \lambda_2 L_1, \quad (8)$$
where $L_{cls}$ denotes the weighted focal loss for classification, the generalized IoU loss $L_{iou}$ and the $L_1$ loss are adopted for bounding-box regression, and $\lambda_1$ and $\lambda_2$ are trade-off parameters.

Experiments

Experimental Setting
Datasets and Evaluation Metrics. We conduct experiments on two multi-modal tracking datasets: RGBT234 (Li et al. 2019) and LasHeR (Li et al. 2021), and evaluate the tracking performance with four evaluation metrics: Precision Rate (PR), Maximum Precision Rate (MPR), Success Rate (SR), and Maximum Success Rate (MSR).

RGBT234 provides 234 sequences of aligned RGB and infrared videos. It offers 12 attributes, including LI (Low Illumination), Occlusion, DEF (Deformation), Movement, etc. The total number of frames is about 234K, with a maximum of 8K frames per sequence. It provides ground-truth labels for both the RGB and TIR modalities, allowing trackers to perform multi-modal performance evaluations. Because RGBT234 uses a parallel optical-axis visible-infrared imaging system, no pre-processing or post-processing (such as stereo matching and color correction) is required. Its cross-modal alignment is relatively accurate, but the ground truth for RGB and TIR is still not completely consistent. Therefore, for a fair comparison, we use MPR and MSR instead of PR and SR as the evaluation metrics. Specifically, for each frame, the Euclidean distance between the result box and the ground truth is calculated separately in the RGB and TIR modalities, and the smaller distance is used to compute the accuracy.

LasHeR is an RGBT tracking dataset that contains 1224 RGBT sequences with 730K frames, captured on various types of imaging platforms. It includes 19 video attributes, adding 7 new attribute types such as HI (High Illumination), FL (Frame Lost), and AIV (Abrupt Illumination Variation) on the basis of previous ones, making it an even more challenging dataset for RGBT tracking tasks. To address the alignment issue of radially distorted images across RGBT modalities, the LasHeR dataset performs precise alignment only on the local area covering the target object in each frame, since the object tracking task does not emphasize the tracking of the background. Thus, for each frame, a set of matching points is labeled to transform the RGB image into the same coordinate system as the thermal infrared image. This results in consistent ground truth for both modalities. Different from the RGBT234 dataset, PR and SR can therefore be used directly as evaluation metrics.
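As one concrete reading of the MPR protocol described above for RGBT234, the sketch below computes the per-frame center distance against both the RGB and TIR ground truth and keeps the smaller one. The 20-pixel center-distance threshold is the conventional choice for precision rate and is an assumption here, since the text does not state it explicitly.

```python
import numpy as np

def center(box):
    """Center (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def max_precision_rate(pred_boxes, gt_rgb, gt_tir, threshold=20.0):
    """Maximum Precision Rate (MPR) over a sequence.

    For each frame, the center distance between the predicted box and the
    ground truth is computed separately against the RGB and TIR annotations,
    and the smaller distance is kept; MPR is the fraction of frames whose
    kept distance falls below the threshold (assumed 20 px here).
    """
    hits = 0
    for pred, g_rgb, g_tir in zip(pred_boxes, gt_rgb, gt_tir):
        d_rgb = np.linalg.norm(center(pred) - center(g_rgb))
        d_tir = np.linalg.norm(center(pred) - center(g_tir))
        if min(d_rgb, d_tir) <= threshold:
            hits += 1
    return hits / len(pred_boxes)
```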
Implementation Details. We implement BAT in PyTorch and train it on 4 NVIDIA RTX A6000 GPUs with a batch size of 32. We follow the hyper-parameter settings of the foundation model for the loss function. The AdamW optimizer (Loshchilov and Hutter 2019) with a weight decay of $10^{-4}$ is adopted, and the learning rate is set to $4 \times 10^{-4}$. The fixed parameters of the modality-specific branches in BAT are initialized by the pre-trained foundation model (Ye et al. 2022). Fine-tuning BAT on the LasHeR training set takes 60 epochs (8 hours), where each epoch contains $6 \times 10^4$ sample pairs.

Comparisons
We compare our model with 19 competing methods. The quantitative comparisons are reported in Table 1, and the qualitative evaluation results are presented in Fig. 4.

| Method | RGBT234 MPR | RGBT234 MSR | LasHeR PR | LasHeR SR |
|---|---|---|---|---|
| ATOM (2019) | - | - | 40.6 | 30.7 |
| DiMP-50 (2019) | - | - | 44.2 | 33.6 |
| mfDiMP (2019) | 64.6 | 42.8 | 44.8 | 34.3 |
| DAPNet (2019) | 76.6 | 53.7 | 43.1 | 31.4 |
| SiamFC++ (2020) | - | - | 34.8 | 27.4 |
| CAT (2020) | 80.4 | 56.1 | 45.0 | 31.4 |
| CMPP (2020) | 82.3 | 57.5 | - | - |
| STARK-ST50 (2021) | - | - | 44.9 | 36.1 |
| TransT (2021) | - | - | 52.4 | 39.4 |
| JMMAC (2021b) | 79.0 | 57.3 | - | - |
| MANet++ (2021) | 79.5 | 55.9 | 46.7 | 31.4 |
| FANet (2020) | 78.7 | 55.3 | 44.1 | 30.9 |
| ADRNet (2021a) | 80.9 | 57.1 | - | - |
| OSTrack-256 (2022) | 72.9 | 54.9 | 51.5 | 41.2 |
| APFNet (2022) | 82.7 | 57.9 | 50.0 | 36.2 |
| DMCNet (2022) | 83.9 | 59.3 | 49.0 | 35.5 |
| HMFT (2022) | 78.8 | 56.8 | - | - |
| ProTrack (2022) | 79.5 | 59.9 | 53.8 | 42.0 |
| ViPT (2023) | 83.5 | 61.7 | 65.1 | 52.5 |
| BAT (Ours) | 86.8 | 64.1 | 70.2 | 56.3 |

Table 1: Overall performance on the RGBT234 and LasHeR datasets. Results are reported in percentage (%).

Quantitative Evaluation on RGBT234. As shown in Table 1, among the full-tuning competing methods, DMCNet (Lu et al. 2022) achieves considerable performance with the runner-up MPR score of 83.9%, while the state-of-the-art efficient-tuning method ViPT (Zhu et al. 2023) achieves similar performance, with a slightly lower MPR score of 83.5% and a slightly higher MSR score of 61.7%. The existing efficient-tuning methods do not bring significant improvements, possibly because they struggle to dynamically extract compatible information from both the RGB and TIR modalities. In comparison, our BAT achieves 86.8% MPR and 64.1% MSR, outperforming the runner-up MPR and MSR scores by 2.9% and 2.4%, respectively, which is a significant improvement among all the competing methods. The experimental results demonstrate the effectiveness of our BAT model.

Quantitative Evaluation on LasHeR. Compared with the RGBT234 dataset, the LasHeR dataset is more challenging because more extreme attributes are introduced, and the performance gap among most existing methods widens significantly. Previously advanced methods such as DMCNet and APFNet perform unsatisfactorily on this dataset. Even OSTrack, a tracker based only on the RGB modality, reaches stronger performance than many RGBT trackers, which is the opposite of the situation on RGBT234, where multi-modal trackers take the lead. Efficient-tuning methods such as ProTrack and ViPT are significantly superior to the traditional methods, which may benefit from the strong representation ability of the pre-trained foundation model. As shown in Table 1, ViPT achieves 65.1% PR and 52.5% SR, a considerable improvement among the competing methods. However, our BAT further improves over ViPT by 5.1% and 3.8%, reaching 70.2% PR and 56.3% SR, respectively.
This experiment further validates the ability of our universal bi-directional adapter to learn dynamically changing attributes in complex environments.

Qualitative Evaluation. Since our dual-stream encoder does not rely on a single modality as the dominant one, it outperforms single-stream prompt-learning approaches in complex scenarios where the RGB images are distorted or even unavailable. As shown in Fig. 4(a), the tracking information provided by the video sequence in the early stage strongly depends on the TIR images; after a few frames, the RGB images progressively dominate and provide more effective information than TIR. Fixed-correlation methods such as ViPT, which mainly use RGB as the dominant modality, manage to track the object in the subsequent bright scenes but fail to track the accurate position in the dark scenes. Our method effectively tracks the target even when RGB is completely unavailable, and the tracking results are much better when both RGB and TIR provide effective information in subsequent scenes. As shown in Fig. 4(b), for transparent object tracking, the features provided by the RGB modality introduce strong interference in this scenario. Compared with the other methods that fail to track, our bi-directional adapter dynamically extracts effective features of the target from both the RGB and TIR modalities, capturing a more accurate target response position and eliminating the interference of the RGB modality. These experiments demonstrate our effectiveness in dynamically prompting effective information from the changing dominant-auxiliary modalities in complex scenarios.

Figure 4: Visualization of tracking results (compared trackers: mfDiMP, ViPT, APFNet, Ours, and Ground Truth). The green rectangles indicate the target objects in the template frames. Our method shows the best performance in different frame sequences as the dominant modality dynamically changes.

Figure 5: Different variants of the bi-directional adapter for the dual-stream encoder framework.

Discussion
Effect of Different Adapter Variants. In this section, we explore different adapter variants in our BAT framework. As shown in Fig. 5, the adapter can be applied in two single directions: RGB→TIR in Fig. 5(a) and TIR→RGB in Fig. 5(b). A single-directional adapter only extracts feature prompts from one modality to the other, and only the output of one stream's transformer encoder is used to regress the final result. Fig. 5(c) presents the dual-adapter architecture without parameter sharing, in which each adapter only extracts feature prompts from one specific modality to the other. We use the foundation model as our baseline. The dual-stream framework initialized with the parameters of the foundation model is denoted Baseline-Dual, which takes the stream with the maximum score-map value in the prediction head to calculate the final result.
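For illustration only, the sketch below expresses the single-directional variants of Fig. 5 as direction flags on the hypothetical `dual_stream_layer` wiring from the earlier sketch; the flag names are not from the paper. In the single-directional variants, only the prompted stream's encoder output would then be passed to the prediction head, as described above.

```python
def dual_stream_layer_variant(x_rgb, x_tir, attn, mlp, norm1, norm2, adapter,
                              rgb_to_tir=True, tir_to_rgb=True):
    """Adapter-variant wiring of Fig. 5 with illustrative direction flags.

    rgb_to_tir only  -> single-directional RGB->TIR prompts (Fig. 5(a));
    tir_to_rgb only  -> single-directional TIR->RGB prompts (Fig. 5(b));
    both enabled with one shared `adapter` -> the universal bi-directional BAT.
    """
    # Attention stage: each enabled direction injects a prompt from the other branch.
    x_rgb_hat = x_rgb + attn(norm1(x_rgb)) + (adapter(x_tir) if tir_to_rgb else 0.0)
    x_tir_hat = x_tir + attn(norm1(x_tir)) + (adapter(x_rgb) if rgb_to_tir else 0.0)

    # MLP stage: prompts are recomputed from the attention-stage outputs.
    x_rgb_out = x_rgb_hat + mlp(norm2(x_rgb_hat)) + (adapter(x_tir_hat) if tir_to_rgb else 0.0)
    x_tir_out = x_tir_hat + mlp(norm2(x_tir_hat)) + (adapter(x_rgb_hat) if rgb_to_tir else 0.0)

    # The unshared dual-adapter variant of Fig. 5(c) would instead use two separate
    # adapter instances, one per direction.
    return x_rgb_out, x_tir_out
```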
We report the results of the different adapter variants in Table 2.

| Method | PR | SR |
|---|---|---|
| Baseline | 51.5 | 41.2 |
| Baseline-Dual | 52.2 | 42.8 |
| BAT-RGB | 69.0 | 55.4 |
| BAT-TIR | 68.5 | 54.8 |
| BAT-Dual | 69.6 | 56.4 |
| BAT | 70.2 | 56.3 |

Table 2: Quantitative comparison between different variants of BAT on the LasHeR dataset.

The dual-stream baseline (Baseline-Dual) is slightly better than the foundation model (Baseline), which demonstrates that the foundation model has the potential to be applied to both the RGB and TIR modalities. The single-directional adapters, BAT-RGB and BAT-TIR, achieve a significant improvement over the baseline models. This is likely because the effective information of one modality is transferred to the other modality through the adapter, which further validates that multi-modal data provide more complementary information than a single modality. Meanwhile, the difference between BAT-RGB and BAT-TIR is small (less than 3%), indicating that the dominant correlation is not fixed and that dynamically learning from the two modalities has the potential to yield even better tracking results in more complex conditions. The dual adapter (BAT-Dual) requires double the parameters of our universal bi-directional adapter while maintaining similar performance to our universal version. This may be because BAT adopts the same foundation model for the two modality-specific branches, which are fixed during training; the feature distribution should therefore be compatible with both modalities to dynamically balance the changing dominant and auxiliary modalities. Our bi-directional adapter cross-prompts the two branches with a universal adapter, learning the compatibility of multiple modalities and achieving comparable performance with half the learnable parameters. Our universal adapter is also more flexible for handling more modalities in a parameter-efficient manner.

Effect of Adapter in Different Layers. To explore a more efficient adapter architecture, we embed our bi-directional adapter in only part of the transformer encoder layers. We preset 6 embedding types, denoted BAT-n, where n represents the number of layers equipped with the adapter. The results are shown in Table 3, where the "Layers" column indicates the positions of our bi-directional adapter in the foundation model. The performance of BAT-1 is limited, as little multi-modal information is cross-prompted during training. BAT-4 in the middle layers achieves performance comparable to BAT-12 while saving more learnable parameters. This demonstrates that our universal bi-directional adapter has the potential to be further simplified.

| Type | Layers | PR | SR |
|---|---|---|---|
| BAT-1 | 1 | 61.4 | 49.3 |
| BAT-1 | 12 | 61.6 | 49.9 |
| BAT-4 | 1-4 | 65.2 | 52.5 |
| BAT-4 | 5-8 | 68.6 | 55.2 |
| BAT-4 | 9-12 | 66.4 | 53.4 |
| BAT-12 | 1-12 | 70.2 | 56.3 |

Table 3: Results of different bi-directional adapter layers on LasHeR.

More Comparisons under Different Attributes. Since the LasHeR dataset provides 19 attribute annotations in addition to the standard annotations, we further evaluate the proposed BAT against several advanced competing methods on each attribute.

Figure 6: More comparisons of BAT and the competing methods (ViPT, APFNet, DMCNet, MANet++, and mfDiMP) under different attributes in the LasHeR dataset; the two panels show the Precision Rate and Success Rate for each attribute.

As shown in Fig. 6, our BAT outperforms the competing methods in all the attributes.

| Attribute | mfDiMP | APFNet | ViPT | Ours |
|---|---|---|---|---|
| NO | 76.5/57.5 | 66.7/46.7 | 84.1/68.4 | 90.2/73.3 |
| HO | 19.8/23.8 | 27.1/27.7 | 46.8/43.4 | 56.5/51.0 |
| LI | 29.6/23.8 | 41.8/30.8 | 49.9/41.2 | 60.4/48.2 |
| AIV | 16.6/16.4 | 32.1/26.2 | 36.3/34.2 | 51.4/45.3 |
| TC | 38.0/28.8 | 43.1/31.6 | 57.4/46.0 | 62.7/50.1 |

Table 4: More quantitative comparisons of PR/SR scores under five extreme attributes in the LasHeR dataset.
This experiment not only validates the effectiveness of our method, but also demonstrates its superior generalization under different conditions. Moreover, we further explore the performance of BAT on five extreme attributes: NO, HO, LI, AIV, and TC. The PR and SR scores are reported in Table 4. Compared with the results on the complete LasHeR dataset, our model outperforms the competing methods by an even larger margin under extreme attributes. Our PR score on the LI attribute and SR score on the AIV attribute surpass ViPT by over 10%. These attributes are more dynamic than the others, which further verifies the effectiveness of our bi-directional adapter in dynamically learning the changing dominant-auxiliary information for multi-modal tracking in complex environments.

Conclusion
In this work, we present BAT, a new bi-directional adapter that introduces a universal feature prompt-learning paradigm to multi-modal tracking. The core idea of BAT is to dynamically excavate the changing dominant-auxiliary relevance of multiple modalities in complex scenarios and extract complementary information from the pre-trained foundation model. Extensive experiments on multiple RGBT tracking datasets demonstrate the superiority of BAT over the competing methods. With the in-depth study of the adapter structure of BAT, we believe this work has the potential to be applied to broader tasks. We expect it to attract more attention to multi-modal parameter-efficient tuning and to empower more general vision-language tasks. Moreover, our method is currently validated on the RGB and TIR tracking task; in the future, we are interested in exploring a general model for more diverse modalities.

Acknowledgments
This work was sponsored in part by the National Key R&D Program of China (2022ZD0116500), in part by the National Natural Science Foundation of China (62106171, 62222608, U23B2049, 61925602), in part by the CAAI-CANN Open Fund developed on the OpenI Community, in part by the Haihe Lab of ITAI under Grant 22HHXCJC00002, in part by the Tianjin Natural Science Foundation under Grant 21JCYBJC00580, and in part by the Key Laboratory of Big Data Intelligent Computing, Chongqing University of Posts and Telecommunications under Grant BDIC-2023-A-008.

References
Bhat, G.; Danelljan, M.; Gool, L. V.; and Timofte, R. 2019. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6182–6191.
Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; and Lu, H. 2021. Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8126–8135.
Cui, Y.; Jiang, C.; Wang, L.; and Wu, G. 2022. MixFormer: End-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13608–13618.
Danelljan, M.; Bhat, G.; Khan, F. S.; and Felsberg, M. 2019. ATOM: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4660–4669.
Gao, Y.; Li, C.; Zhu, Y.; Tang, J.; He, T.; and Wang, F. 2019. Deep adaptive fusion network for high performance RGBT tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
Jia, M.; Tang, L.; Chen, B.-C.; Cardie, C.; Belongie, S.; Hariharan, B.; and Lim, S.-N. 2022. Visual prompt tuning. In European Conference on Computer Vision, 709–727. Springer.
Khattak, M. U.; Rasheed, H.; Maaz, M.; Khan, S.; and Khan, F. S. 2023. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19113–19122.
Lan, J.-P.; Cheng, Z.-Q.; He, J.-Y.; Li, C.; Luo, B.; Bao, X.; Xiang, W.; Geng, Y.; and Xie, X. 2023. ProContEXT: Exploring progressive context transformer for tracking. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1–5. IEEE.
Li, C.; Cheng, H.; Hu, S.; Liu, X.; Tang, J.; and Lin, L. 2016. Learning collaborative sparse representation for grayscale-thermal tracking. IEEE Transactions on Image Processing, 25(12): 5743–5756.
Li, C.; Liang, X.; Lu, Y.; Zhao, N.; and Tang, J. 2019. RGBT object tracking: Benchmark and baseline. Pattern Recognition, 96: 106977.
Li, C.; Liu, L.; Lu, A.; Ji, Q.; and Tang, J. 2020. Challenge-aware RGBT tracking. In European Conference on Computer Vision, 222–237. Springer.
Li, C.; Xue, W.; Jia, Y.; Qu, Z.; Luo, B.; Tang, J.; and Sun, D. 2021. LasHeR: A large-scale high-diversity benchmark for RGBT tracking. IEEE Transactions on Image Processing, 31: 392–404.
Loshchilov, I.; and Hutter, F. 2019. Decoupled weight decay regularization. arXiv:1711.05101.
Lu, A.; Li, C.; Yan, Y.; Tang, J.; and Luo, B. 2021. RGBT tracking via multi-adapter network with hierarchical divergence loss. IEEE Transactions on Image Processing, 30: 5613–5625.
Lu, A.; Qian, C.; Li, C.; Tang, J.; and Wang, L. 2022. Duality-gated mutual condition network for RGBT tracking. IEEE Transactions on Neural Networks and Learning Systems.
Wang, C.; Xu, C.; Cui, Z.; Zhou, L.; Zhang, T.; Zhang, X.; and Yang, J. 2020. Cross-modal pattern-propagation for RGB-T tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7064–7073.
Xiao, Y.; Yang, M.; Li, C.; Liu, L.; and Tang, J. 2022. Attribute-based progressive fusion network for RGBT tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, 2831–2838.
Xu, Y.; Wang, Z.; Li, Z.; Yuan, Y.; and Yu, G. 2020. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, 12549–12556.
Yan, B.; Peng, H.; Fu, J.; Wang, D.; and Lu, H. 2021. Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10448–10457.
Yang, J.; Li, Z.; Zheng, F.; Leonardis, A.; and Song, J. 2022. Prompting for multi-modal tracking. In Proceedings of the 30th ACM International Conference on Multimedia, 3492–3500.
Ye, B.; Chang, H.; Ma, B.; Shan, S.; and Chen, X. 2022. Joint feature learning and relation modeling for tracking: A one-stream framework. In European Conference on Computer Vision, 341–357. Springer.
Zhang, L.; Danelljan, M.; Gonzalez-Garcia, A.; Van De Weijer, J.; and Shahbaz Khan, F. 2019. Multi-modal fusion for end-to-end RGB-T tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
Zhang, P.; Wang, D.; Lu, H.; and Yang, X. 2021a. Learning adaptive attribute-driven representation for real-time RGBT tracking. International Journal of Computer Vision, 129: 2714–2729.
Zhang, P.; Zhao, J.; Bo, C.; Wang, D.; Lu, H.; and Yang, X. 2021b. Jointly modeling motion and appearance cues for robust RGB-T tracking. IEEE Transactions on Image Processing, 30: 3335–3347.
Zhang, P.; Zhao, J.; Wang, D.; Lu, H.; and Ruan, X. 2022. Visible-thermal UAV tracking: A large-scale benchmark and new baseline. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8886–8895.
Zhu, J.; Lai, S.; Chen, X.; Wang, D.; and Lu, H. 2023. Visual prompt multi-modal tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9516–9526.
Zhu, Y.; Li, C.; Luo, B.; Tang, J.; and Wang, X. 2019. Dense feature aggregation and pruning for RGBT tracking. In Proceedings of the 27th ACM International Conference on Multimedia, 465–472.
Zhu, Y.; Li, C.; Tang, J.; and Luo, B. 2020. Quality-aware feature aggregation network for robust RGBT tracking. IEEE Transactions on Intelligent Vehicles, 6(1): 121–130.