# Continual Vision-Language Retrieval via Dynamic Knowledge Rectification

Zhenyu Cui1, Yuxin Peng1*, Xun Wang2, Manyu Zhu2, Jiahuan Zhou1
1Wangxuan Institute of Computer Technology, Peking University; 2ByteDance Inc
cuizhenyu@stu.pku.edu.cn, {pengyuxin, jiahuanzhou}@pku.edu.cn, {wangxun.2, zhumanyu}@bytedance.com

*Corresponding author. Copyright 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

## Abstract

Recent large-scale pre-trained models like CLIP have attracted great attention in vision-language tasks. However, when required to match image-text data collected in a streaming manner, namely Continual Vision-Language Retrieval (CVLR), their performance is still limited due to the catastrophic forgetting of the learned old knowledge. To handle this issue, advanced methods have been proposed that distil the affinity knowledge between images and texts from the old model to the new one for anti-forgetting. Unfortunately, existing approaches neglect the impact of incorrect affinity, which prevents a balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Therefore, we propose a novel framework called Dynamic Knowledge Rectification (DKR) that simultaneously achieves incorrect knowledge filtering and rectification. Specifically, we first filter the incorrect affinity knowledge calculated by the old model on the new data. Then, a knowledge rectification method is designed to rectify the incorrect affinities while preserving the correct ones. In particular, for the new data that can only be correctly retrieved by the new model, we rectify them with the corresponding new affinity to protect them from negative transfer. Additionally, for those that cannot be retrieved by either the old or the new model, we introduce paired ground-truth labels to promote the acquisition of both old and new knowledge. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our DKR and its superiority against state-of-the-art methods.

## Introduction

In recent years, the pre-trained CLIP model (Radford et al. 2021) has become a milestone for vision-language retrieval, where an image is retrieved via a text description of its contents or vice versa. Owing to its large-scale pre-training data, CLIP has demonstrated a promising ability to tackle new datasets in a zero-shot manner but still suffers from inadequate performance (Gu et al. 2021; Xu et al. 2022; Wu et al. 2023). Hence, various works adopt fine-tuning and adapter strategies to improve the performance of the pre-trained CLIP model on new datasets (Gao et al. 2023a; Luo et al. 2022; Zhang et al. 2022). However, when facing a practical and crucial scenario where a series of datasets come one by one, the above strategies always fail to handle all the datasets well. Their performance can be significantly degraded on the learned old datasets due to the well-known catastrophic forgetting challenge (De Lange et al. 2021).

Figure 1: Challenges in CVLR. In addition to the unseen class label concerned by CIL, CVLR involves the unseen class combination and the unseen class distribution.
Therefore, in this paper, we focus on adapting CLIP to new datasets while maintaining its performance on old datasets, known as the task of Continual Vision-Language Retrieval (CVLR). So far, most continual learning works focus on achieving the anti-forgetting of the class information of images in the learned old datasets (Masana et al. 2022), which is also known as class-incremental learning (CIL). However, existing CIL methods cannot be directly used to handle the CVLR task. As shown in Fig. 1, on the one hand, images and text descriptions in the CVLR task do not have the category label information required by CIL, which prevents correctly matching images with their corresponding descriptions. On the other hand, CVLR exhibits a more complicated scenario where the text description usually involves unseen class labels, unseen class combinations, and unseen class distributions. These challenges leave the CVLR task still an unsolved problem.

To tackle the above challenges, knowledge distillation has become an effective solution to the CVLR task (Wang, Herranz, and van de Weijer 2021; Srinivasan et al. 2022; Dong et al. 2021; Ni et al. 2023), which involves transferring the old knowledge learned by the old model to the new model. Recent works (Dong et al. 2021; Ni et al. 2023) regarded the affinity relationship between image-text pairs calculated by the old model as old knowledge and developed knowledge distillation to alleviate the catastrophic forgetting problem.

Figure 2: Existing CVLR methods fail to deal with the incorrect affinity calculated by the old model, thereby limiting the acquisition of both old and new knowledge. In contrast, we rectify those incorrect affinities and distil them into the new model to solve the above problem.

However, as shown in Fig. 2, these existing methods neglect the impact of incorrect affinity when distilling old knowledge, thereby limiting the acquisition of both old and new knowledge. Although the latest research (Ni et al. 2023) preliminarily alleviated this problem by filtering the incorrect affinity, it inevitably leads to the loss of relevant old knowledge. As a result, existing methods can hardly perform well on both old and new datasets.

Inspired by the above observation, we propose a novel framework for the CVLR task, named Dynamic Knowledge Rectification (DKR). As shown in Fig. 2, the core idea of DKR is to rectify the incorrect affinity calculated by the old model while maintaining the correct one. To this end, a dynamic knowledge filtering and rectification module is designed to strike a balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. Specifically, we first compute an old affinity and a new affinity based on the deep embeddings extracted by the old model and the new model, respectively.
Then, for the new data that can be correctly matched by the old model, DKR directly maintains the old affinity to avoid forgetting the old knowledge. Besides, for those that can only be correctly retrieved by the new model, DKR exploits the new affinity to rectify the incorrect part of the old affinity to avoid the negative transfer of the old knowledge. In addition, for those that cannot be retrieved by either the old or the new model, DKR introduces additional knowledge, i.e., the paired ground-truth label, to promote the acquisition of both old and new knowledge. Finally, we combine the rectified results and exploit knowledge distillation to constrain the learning process of the new model.

In summary, the main contributions of this paper are as follows: 1) To alleviate the catastrophic forgetting problem in the continual vision-language retrieval task, we propose a novel Dynamic Knowledge Rectification (DKR) framework to dynamically filter and rectify incorrect old knowledge, which strikes a balance between the anti-forgetting of old knowledge and the acquisition of new knowledge. 2) A knowledge rectification method is designed to protect the new model from negative transfer and promote the acquisition of both old and new knowledge. 3) Extensive experiments on five real-world vision-language retrieval benchmark datasets demonstrate the superiority of our proposed DKR against state-of-the-art methods.

## Related Work

### Vision-Language Retrieval

Matching the same semantics between images and texts is the key to vision-language retrieval. Most existing works calculate the pairwise affinity by extracting and mapping vision and language features into a common embedding space. According to different interaction patterns, existing image-text retrieval methods can be roughly categorized into two branches. 1) Cross-modal interaction methods. These methods focus on inferring and aligning the pairwise relationships across cross-modal entities (Chen et al. 2020a,b; Li et al. 2020). IMRAM (Chen et al. 2020a) designed an iterative image-text retrieval framework to capture correspondences by stacking cross-attention neural networks. Although achieving some progress, the redundant computational cost of calculating the similarity between images and texts limits their practicality (Chen et al. 2021). 2) Intra-modal representation methods. To tackle the above limitation, methods of this category employ independent representation networks for the vision and language modalities, thus improving inference efficiency (Li et al. 2019; Wang et al. 2020; Chen et al. 2021). Wang (Wang et al. 2020) further exploited a consensus-aware visual-semantic embedding model to incorporate the commonsense knowledge shared between both modalities. However, the above methods are typically designed for retrieval on specific datasets.

Recently, large-scale multi-modal pre-trained models (Li et al. 2021; Radford et al. 2021; Li et al. 2023) have attracted great attention in the community. CLIP (Radford et al. 2021) greatly improves the performance of the vision-language retrieval task through large-scale self-supervised contrastive learning. These models are usually built with extremely large-scale vision-language datasets (Ordonez, Kulkarni, and Berg 2011; Sharma et al. 2018; Schuhmann et al. 2021), and show impressive results in multiple downstream tasks (Wang et al. 2022; Chowdhury, Zhuang, and Wang 2022; Luo et al. 2022).
However, their performance can be greatly degraded when it comes to sequentially given training datasets due to the catastrophic forgetting problem, which limits their applicability in real scenarios.

### Continual Learning

Continual Learning (CL) (De Lange et al. 2021) aims to enable intelligent systems to continuously adapt to new datasets while preserving the old knowledge learned from old datasets. Most CL methods (Liu et al. 2022; Qiu et al. 2023; Gao et al. 2023b) are designed for class-incremental learning (CIL), i.e., categories in the new task never appear in the old tasks. Qiu (Qiu et al. 2023) devised a causally intervened learning strategy to eliminate the causal path that causes the task-induced bias responsible for the catastrophic forgetting problem. Gao (Gao et al. 2023b) transferred diverse knowledge from both task-specific and task-general knowledge to the current task to balance the stability and the plasticity of the old knowledge. However, CIL methods are not suitable for the CVLR task, considering the lack of class label information for training and the more complicated scenario. To address the above challenge, some methods employed knowledge distillation strategies (Wang, Herranz, and van de Weijer 2021; Srinivasan et al. 2022; Dong et al. 2021; Ni et al. 2023) for more general use. Dong (Dong et al. 2021) proposed to distil relation knowledge between paired samples to prevent the forgetting of structural knowledge. Ni (Ni et al. 2023) maintained the multi-modal common representation space by aligning the contrastive matrices to alleviate the spatial disorder caused by learning new knowledge. Different from these methods, we propose a dynamic knowledge rectification strategy for CVLR. It dynamically filters and rectifies the incorrect old knowledge, and distils it to the new model in a unified way to strike a balance between the anti-forgetting of old knowledge and the acquisition of both old and new knowledge.

## Proposed Method

### Problem Definition and Notations

In this paper, we focus on the challenging Continual Vision-Language Retrieval (CVLR) task, which assumes the vision-language data come in a streaming manner. Specifically, we are given sequentially collected vision-language datasets $\mathcal{D} = \{\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \ldots, \mathcal{D}^{(S)}\}$, where $S$ denotes the total length of the sequence. The $s$-th dataset $\mathcal{D}^{(s)} = \{(v_i^{(s)}, t_i^{(s)})\}_{i=1}^{N^{(s)}}$ consists of $N^{(s)}$ input images $v_i^{(s)}$ and corresponding text descriptions $t_i^{(s)}$. Each $\mathcal{D}^{(s)}$ is randomly divided into a training set $\mathcal{D}^{(s)}_{train}$ and a testing set $\mathcal{D}^{(s)}_{test}$. Our goal is to train an encoding model $f(x): x \mapsto e_x$ that maps an input sample $x$ to a deep embedding $e_x$ and maximizes the similarity between the embeddings of $v_i$ and $t_i$. Notably, the previous $s-1$ datasets are completely unavailable when training on the $s$-th dataset, while all $s$ testing sets are jointly used to evaluate the overall retrieval performance of $f(x)$ on all datasets.
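To make this protocol concrete, the following is a minimal sketch of one way the sequential training and joint evaluation loop could be organized. The `train_stage` and `evaluate` callables and the dataset dictionary layout are illustrative assumptions, not part of the paper's released code; only the protocol itself (old-model snapshot per stage, no access to past training data, joint evaluation of all seen test splits) follows the definition above.

```python
import copy
from typing import Callable, Dict, List, Sequence

def run_cvlr(
    model,                                  # a torch.nn.Module-like CLIP model (assumption)
    datasets: Sequence[Dict],               # each item: {"name": str, "train": ..., "test": ...}
    train_stage: Callable,                  # train_stage(model, old_model, train_data, stage)
    evaluate: Callable,                     # evaluate(model, test_data) -> dict of R@K scores
) -> List[Dict]:
    """Sequentially train on each dataset; evaluate on every test split seen so far."""
    history = []
    seen_tests = []
    for s, dataset in enumerate(datasets, start=1):
        # Freeze a copy of the current encoders as the "old model" for stage s.
        old_model = copy.deepcopy(model).eval()
        for p in old_model.parameters():
            p.requires_grad_(False)

        train_stage(model, old_model, dataset["train"], stage=s)

        # Previous training data is unavailable, but all seen test splits are evaluated jointly.
        seen_tests.append((dataset["name"], dataset["test"]))
        stage_scores = {name: evaluate(model, test) for name, test in seen_tests}
        history.append(stage_scores)
    return history
```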
### Overview

As shown in Fig. 3, our proposed DKR method consists of a CLIP-based image encoder, a CLIP-based text encoder, and a Dynamic Knowledge Rectification (DKR) module. The encoders are first initialized with the pre-trained parameters and then sequentially trained on the $S$ datasets. At the beginning of training stage $s$, we preserve a copy of the encoders as the old model to represent the old knowledge. Then, we use the old model and the new model to calculate the affinity between every two image-text samples in the $s$-th dataset, forming an old affinity matrix $M^{(s-1)}$ and a new affinity matrix $M^{(s)}$, respectively. Subsequently, the DKR module filters the incorrect affinities in $M^{(s-1)}$ and dynamically rectifies them with the correct new affinities and the paired Ground-Truth (GT) labels to form a rectified old knowledge matrix $M^{(T)}$. Finally, DKR distils the knowledge in $M^{(T)}$ to $M^{(s)}$. During the training phase, we employ the original contrastive loss $\mathcal{L}_{clip}$ to train our model on $\mathcal{D}^{(1)}$. When training the following steps ($s \geq 2$), we add our customized knowledge distillation loss $\mathcal{L}_{JS}$ to alleviate the catastrophic forgetting problem. During the inference phase, only the trained encoders are retained to validate the CVLR performance on all $S$ datasets.

Figure 3: The framework of our proposed Dynamic Knowledge Rectification (DKR) model. DKR consists of three components: a CLIP-based image encoder, a CLIP-based text encoder, and a Dynamic Knowledge Rectification (DKR) module. The encoders embed the input data (images and texts) and form two affinity matrices for old and new knowledge. The DKR module dynamically rectifies the incorrect old knowledge and distils it into the new model.

### CLIP-based Retrieval Baseline

In this section, we present the CLIP-based retrieval baseline of our DKR. Considering that the vision-language retrieval task is mainly to match paired images and texts, we employ the CLIP-based encoders to learn discriminative deep embeddings. Formally, let $f_v(x)$ and $f_t(x)$ be the encoders for images and texts, respectively. The deep embeddings $z_v^i$ and $z_t^i$ can be formulated as:

$$z_v^i = f_v(v_i), \quad z_t^i = f_t(t_i). \tag{1}$$

To maximize the affinity of the matched image-text pairs, a contrastive loss $\mathcal{L}_{clip}$ is employed to pull paired samples together while pushing unpaired samples apart. $\mathcal{L}_{clip}$ can be expressed as follows:

$$\mathcal{L}_v^i = -\log\frac{e^{\langle z_v^i, z_t^i\rangle/\tau}}{\sum_{k=1}^{N} e^{\langle z_v^i, z_t^k\rangle/\tau}}, \quad \mathcal{L}_t^i = -\log\frac{e^{\langle z_t^i, z_v^i\rangle/\tau}}{\sum_{k=1}^{N} e^{\langle z_t^i, z_v^k\rangle/\tau}}, \quad \mathcal{L}_{clip}^s = \sum_{i=1}^{N}\left(\mathcal{L}_v^i + \mathcal{L}_t^i\right)/2, \tag{2}$$

where $\langle\cdot,\cdot\rangle$ denotes the dot product of two embeddings, $N$ is the batch size, and $\tau \in \mathbb{R}^+$ is a learnable temperature parameter.
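For reference, here is a minimal PyTorch sketch of the symmetric contrastive objective in Eq. 2. It assumes the image and text embeddings are already L2-normalized and, for simplicity, uses a fixed temperature and a batch-mean reduction rather than the learnable $\tau$ and the sum in Eq. 2; it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric image-text contrastive loss over a batch of N matched pairs."""
    # Pairwise dot products <z_v^i, z_t^k>, scaled by the temperature tau.
    logits = img_emb @ txt_emb.t() / temperature            # shape (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    # L_v: image-to-text direction; L_t: text-to-image direction (transposed logits).
    # cross_entropy applies -log softmax, matching Eq. (2) up to the mean/sum reduction.
    loss_v = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.t(), targets)
    return (loss_v + loss_t) / 2
```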
### Dynamic Knowledge Rectification

Although the CLIP-based retrieval baseline shows impressive performance on new datasets, it is still challenging to maintain its performance on old datasets. Therefore, we propose a Dynamic Knowledge Rectification (DKR) module to alleviate the catastrophic forgetting problem. Different from common class-incremental learning, which can use the class prediction results to represent old knowledge, continual retrieval does not have specific class label information. Therefore, we first define the old knowledge space of the last training stage $(s-1)$ by an affinity matrix $M^{(s-1)} \in \mathbb{R}^{N \times N}$. For convenience, we take the text-to-image affinity matrix as an example. Then, $M^{(s-1)}$ can be calculated as follows:

$$M^{(s-1)}_{i,j} = \frac{e^{\langle z_t^i, z_v^j\rangle/\tau}}{\sum_{k=1}^{N} e^{\langle z_t^i, z_v^k\rangle/\tau}}, \quad i, j \in [1, N]. \tag{3}$$

Notably, $z$ in Eq. 3 is calculated using the model trained after the $(s-1)$-th stage. Then, we impose a knowledge distillation loss $\mathcal{L}_{JS}$ based on the Kullback-Leibler (KL) divergence to transfer knowledge from the old model to the new model, which can be calculated as follows:

$$\mathcal{L}_{KL}\left(M^{(x)}, M^{(y)}\right) = \sum M^{(x)} \log\frac{M^{(x)}}{M^{(y)}}, \tag{4}$$

$$\mathcal{L}_{JS}(s-1, s) = \mathcal{L}_{KL}\left(M^{(s-1)}, \frac{M^{(s-1)} + M^{(s)}}{2}\right) + \mathcal{L}_{KL}\left(M^{(s)}, \frac{M^{(s-1)} + M^{(s)}}{2}\right). \tag{5}$$

Note that the weight parameters of the old model are kept frozen during the distillation. However, the distillation in Eq. 5 only holds if the old knowledge is correct. When the old knowledge cannot correlate paired samples in the new dataset, it naturally leads to negative transfer and prevents knowledge acquisition. Formally, we define the diagonal elements of $M$ that do not hold the largest affinity in their row, i.e., $i \neq \arg\max(M_{i,1}, \ldots, M_{i,N})$, as the incorrect knowledge, and the others as the correct knowledge. Next, we elaborate on the two proposed strategies to rectify the incorrect old knowledge.

#### Rectification with New Knowledge

To prevent the aforementioned negative transfer problem, we replace the incorrect affinity in $M^{(s-1)}$ with the corresponding correct affinity calculated by the new model. In order to protect the old feature space as much as possible, we only correct the affinity values on the diagonal as follows:

$$M^{(T)}_{i,i} = M^{(s)}_{i,i}, \quad i \neq \arg\max\left(M^{(s-1)}_{i,1}, \ldots, M^{(s-1)}_{i,N}\right). \tag{6}$$

To promote the transfer of the rectified old knowledge to new knowledge, we use the newly introduced affinity to constrain the off-diagonal old affinities, which eliminates the gap between the old and new knowledge as follows:

$$M^{(T)}_{i,j} = M^{(s-1)}_{i,j}\,\frac{1 - M^{(s)}_{i,i}}{1 - M^{(s-1)}_{i,i}}, \quad i \neq j. \tag{7}$$

Although the rectified knowledge calculated by Eq. 6 and Eq. 7 achieves anti-forgetting by exploiting the regularized old off-diagonal affinities, it meanwhile ignores the acquisition of new knowledge. Therefore, we further enhance the diagonal affinity by:

$$M^{(T)}_{i,i} = 1, \tag{8}$$

which is crucial for balancing the adaptation to the new dataset.

Based on the above strategy, DKR achieves anti-forgetting by maintaining the relative relationships between affinities in the old model. Nevertheless, such a strategy is still undesirable when it comes to incorrect new knowledge. For example, when training on a new dataset that differs significantly from the old datasets, the new model can hardly produce correct affinities immediately. In this case, rectification with new knowledge is ineffective and even harmful, since the new model is prone to accumulating erroneous knowledge.
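To make this rectification concrete, the following is a hedged PyTorch sketch of Eqs. 6-8. It assumes `old_aff` and `new_aff` are the row-wise softmax affinity matrices $M^{(s-1)}$ and $M^{(s)}$ of Eq. 3 for one batch (shape N x N, rows summing to 1), and it only handles rows where the old model is wrong but the new model is right; rows where both models fail are left to the paired-knowledge rectification described next.

```python
import torch

def rectify_with_new_knowledge(old_aff: torch.Tensor, new_aff: torch.Tensor) -> torch.Tensor:
    """Rectify incorrect rows of the old affinity matrix using the new model's affinities."""
    n = old_aff.size(0)
    diag = torch.arange(n, device=old_aff.device)
    rectified = old_aff.clone()                          # rows with correct old knowledge are kept

    old_correct = old_aff.argmax(dim=1) == diag          # old model ranks the paired item first
    new_correct = new_aff.argmax(dim=1) == diag          # new model ranks the paired item first
    rows = ((~old_correct) & new_correct).nonzero(as_tuple=True)[0]

    # Eq. (7): rescale the old off-diagonal affinities so the row mass is consistent
    # with the (correct) new diagonal affinity.
    scale = (1.0 - new_aff[diag, diag]) / (1.0 - old_aff[diag, diag]).clamp_min(1e-8)
    rectified[rows] = old_aff[rows] * scale[rows].unsqueeze(1)
    # Eqs. (6) and (8): the diagonal entry is finally enhanced to 1 to favor the new pairing.
    rectified[rows, rows] = 1.0
    return rectified
```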
#### Rectification with Paired Knowledge

When encountering both erroneous old and new knowledge, the model will significantly deviate from the learned knowledge, which eventually leads to performance degradation. To tackle this issue, we use the known paired GT label to realize the rectification. Derived from Eq. 6, we replace the incorrect affinity in the old knowledge with the paired label, which is correct for both the old and new datasets, as follows:

$$M^{(T)}_{i,i} = 1, \quad i \neq \arg\max\left(M^{(s,s-1)}_{i,1}, \ldots, M^{(s,s-1)}_{i,N}\right). \tag{9}$$

Then, we constrain the learning of the current stage based on the correlation between the off-diagonal affinities and the label information in the old feature space, which can be formulated as follows:

$$M^{(T)}_{i,j} = \begin{cases} \dfrac{1}{1 + \sum_{k \neq i}^{N} M^{(s-1)}_{i,k}}, & i = j, \\[2ex] \dfrac{M^{(s-1)}_{i,j}}{1 + \sum_{k \neq i}^{N} M^{(s-1)}_{i,k}}, & i \neq j. \end{cases} \tag{10}$$

Finally, the complete DKR procedure is shown in Alg. 1. We combine the above strategies to obtain the rectified old knowledge matrix $M^{(T)}$ and apply the distillation in Eq. 5 between $M^{(T)}$ and $M^{(s)}$ to achieve the balance between the anti-forgetting of the old datasets and the adaptation to the new dataset. In particular, $\mathcal{L}_{JS}$ consists of both text-to-image distillation and image-to-text distillation. We sum up the two distillation losses to achieve the balance in the vision-language retrieval task.

Algorithm 1: Dynamic Knowledge Rectification (DKR)
- Input: old affinity matrix $M^{(s-1)}$; new affinity matrix $M^{(s)}$. Parameter: batch size $N$. Output: rectified affinity matrix $M^{(T)}$.
- For each $i = 1, \ldots, N$:
  - If $i = \arg\max(M^{(s-1)}_{i,1}, \ldots, M^{(s-1)}_{i,N})$ (the old knowledge is correct): set $M^{(T)}_{i} = M^{(s-1)}_{i}$.
  - Else if $i = \arg\max(M^{(s)}_{i,1}, \ldots, M^{(s)}_{i,N})$ (only the new knowledge is correct): set $M^{(T)}_{i,j} = M^{(s-1)}_{i,j}\,\frac{1 - M^{(s)}_{i,i}}{1 - M^{(s-1)}_{i,i}}$ for $i \neq j$, and $M^{(T)}_{i,i} = 1$.
  - Else (both are incorrect): set $M^{(s-1)}_{i,i} = 1$ and $M^{(T)}_{i} = M^{(s-1)}_{i} \big/ \left(1 + \sum_{k \neq i}^{N} M^{(s-1)}_{i,k}\right)$.
- Return $M^{(T)}$.

### Objective Function

Finally, the total objective function of our proposed DKR is calculated as follows:

$$\mathcal{L} = \mathcal{L}_{clip} + \lambda\,\mathcal{L}_{JS}(T, s), \tag{11}$$

where $\lambda$ is a hyperparameter for training.

## Experiments

In this section, we conduct extensive experiments to validate the effectiveness of our proposed DKR.

### Datasets and Evaluations

We conduct the validation experiments on five real-world benchmark image-text retrieval datasets. 1) MS-COCO Caption: MS-COCO Caption (MS-COCO) (Lin et al. 2014) is a widely used image captioning dataset. It contains 80K training images and 5K testing images, where each image has five captions. 2) Flickr30K: Flickr30K (Young et al. 2014) contains 31,783 images from the Flickr website, and each image is annotated with 5 sentences. We use 30K images as the training set and the remaining 1K images as the testing set. 3) IAPR TC-12: IAPR TC-12 (Grubinger et al. 2006) consists of 20,000 images with corresponding captions collected around the world. We use 15K images for training and the remaining 5K images for testing. 4) ECommerce-T2I: ECommerce-T2I (EC) (Yang et al. 2021) is a large-scale e-commerce product retrieval dataset. It contains 90K images for training and 5K images for testing, where each image is annotated with one sentence. 5) RSICD: RSICD (Lu et al. 2017) is a remote sensing image retrieval dataset, which contains 10,921 images of 30 scenes. We use 9,828 images for training and the remaining 1,093 images for testing.

Considering that the CVLR data may come from a specific dataset or from different datasets, we verify the effectiveness of our DKR under two settings. Setting-1: To evaluate the effectiveness of our proposed DKR on different datasets, we conduct experiments on five sequentially given datasets (i.e., MS-COCO → Flickr30K → IAPR TC-12 → EC → RSICD). Setting-2: To further evaluate the performance on a specific dataset, we follow the benchmark in (Ni et al. 2023), which randomly and uniformly divides the EC dataset into 5 sub-datasets, and sequentially train on these 5 sub-datasets. The test data include Flickr30K, MS-COCO, and EC. We employ the widely used evaluation metric in cross-modal retrieval, Recall at Top-K (R@K), to evaluate and compare our method with the existing methods under the same settings for fair comparisons.
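For clarity, a generic sketch of how R@K can be computed from embedding similarities is given below. It assumes a single ground-truth gallery item per query and normalized embeddings; it is a simplified illustration rather than the authors' evaluation script (datasets with multiple captions per image typically count a hit on any of them).

```python
import torch

def recall_at_k(query_emb: torch.Tensor, gallery_emb: torch.Tensor,
                gt_index: torch.Tensor, ks=(1, 5, 10)) -> dict:
    """R@K: fraction of queries whose ground-truth item appears in the top-K retrieved items."""
    sim = query_emb @ gallery_emb.t()                 # cosine similarity if inputs are normalized
    ranks = sim.argsort(dim=1, descending=True)       # gallery indices sorted per query
    # Position of the ground-truth gallery item in each query's ranking.
    hit_pos = (ranks == gt_index.unsqueeze(1)).float().argmax(dim=1)
    return {f"R@{k}": (hit_pos < k).float().mean().item() * 100 for k in ks}
```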
Table 1: Comparison with state-of-the-art methods on different datasets (training order: MS-COCO → Flickr30K → IAPR TC-12 → EC → RSICD; each cell reports R@1 / R@5 / R@10 in %).

| Method | MS-COCO | Flickr30K | IAPR TC-12 | EC | RSICD | Average |
|---|---|---|---|---|---|---|
| Joint | 53.7 / 78.4 / 86.4 | 80.9 / 95.6 / 98.1 | 53.6 / 82.6 / 90.5 | 18.6 / 43.7 / 57.0 | 12.0 / 31.3 / 45.6 | 43.7 / 66.3 / 75.5 |
| Zero-Shot | 39.8 / 64.4 / 74.5 | 68.3 / 89.0 / 94.2 | 36.0 / 65.8 / 76.6 | 11.0 / 27.7 / 37.9 | 5.2 / 15.7 / 26.3 | 32.1 / 52.5 / 61.9 |
| SFT | 39.0 / 65.5 / 75.3 | 69.7 / 90.1 / 94.3 | 47.0 / 77.4 / 87.0 | 14.9 / 37.1 / 50.1 | 12.8 / 33.5 / 47.9 | 36.7 / 60.7 / 70.9 |
| EWC | 39.9 / 66.6 / 76.5 | 70.0 / 90.6 / 94.5 | 47.3 / 77.3 / 87.0 | 15.0 / 37.2 / 50.3 | 13.3 / 32.5 / 47.5 | 37.1 / 60.8 / 71.1 |
| LwF | 47.0 / 72.9 / 82.1 | 76.2 / 93.6 / 97.1 | 54.6 / 83.6 / 91.2 | 17.0 / 40.2 / 53.8 | 14.0 / 35.0 / 49.7 | 41.7 / 65.0 / 74.8 |
| ERL | 48.3 / 73.5 / 82.6 | 77.3 / 93.6 / 96.7 | 51.0 / 80.5 / 88.8 | 18.2 / 42.1 / 55.5 | 14.1 / 33.8 / 48.2 | 41.8 / 64.7 / 74.3 |
| AFC | 48.3 / 73.7 / 82.6 | 77.7 / 93.6 / 96.7 | 51.2 / 80.7 / 89.0 | 18.3 / 42.2 / 55.4 | 14.5 / 33.3 / 47.7 | 42.0 / 64.7 / 74.3 |
| Mod-X | 49.7 / 73.8 / 82.6 | 77.8 / 93.7 / 97.1 | 51.1 / 80.7 / 89.0 | 18.2 / 42.3 / 55.5 | 14.5 / 33.5 / 47.8 | 41.9 / 64.8 / 74.3 |
| Ours | 51.4 / 76.0 / 84.4 | 79.3 / 94.6 / 97.4 | 53.9 / 83.1 / 90.8 | 18.0 / 42.0 / 54.8 | 14.1 / 34.3 / 49.2 | 43.3 / 66.0 / 75.3 |

Table 2: Comparison with state-of-the-art methods on the specific dataset (training order: EC-1 → EC-2 → EC-3 → EC-4 → EC-5; I2T: image-to-text retrieval, T2I: text-to-image retrieval; each cell reports R@1 / R@5 in %).

| Method | I2T: Flickr30K | I2T: MS-COCO | I2T: EC | T2I: Flickr30K | T2I: MS-COCO | T2I: EC | Average |
|---|---|---|---|---|---|---|---|
| Joint | 64.5 / 88.6 | 39.8 / 64.8 | 23.5 / 50.8 | 46.9 / 73.1 | 22.2 / 44.5 | 23.5 / 50.6 | 36.7 / 62.1 |
| Zero-Shot | 77.7 / 94.5 | 50.1 / 74.6 | 11.3 / 27.6 | 58.9 / 83.5 | 30.2 / 55.6 | 10.1 / 25.5 | 39.7 / 60.2 |
| SFT | 63.4 / 87.2 | 36.8 / 61.5 | 16.6 / 40.7 | 44.4 / 71.0 | 20.6 / 42.6 | 15.8 / 40.5 | 32.9 / 57.3 |
| EWC | 64.0 / 87.8 | 37.7 / 64.3 | 16.2 / 40.0 | 44.8 / 72.4 | 20.7 / 44.1 | 16.5 / 42.0 | 33.3 / 58.4 |
| Mod-X | 73.1 / 92.1 | 47.1 / 70.5 | 20.1 / 44.8 | 55.6 / 79.9 | 27.9 / 51.0 | 20.0 / 44.8 | 40.6 / 63.9 |
| Ours | 78.5 / 95.7 | 51.7 / 75.4 | 20.4 / 46.2 | 58.7 / 83.6 | 29.7 / 54.2 | 20.2 / 45.3 | 43.2 / 66.7 |

### Implementation Details

The proposed DKR is implemented in PyTorch with NVIDIA V100 GPUs. We use CLIP (ViT-B/32) (Radford et al. 2021) with the weights pre-trained on large-scale open-world data (OpenAI 2021) as our backbone. Input images are resized to 224 × 224. Each task is trained for 35 epochs with a batch size of 280. We use the Adam optimizer with (β1, β2) = (0.9, 0.99) and a weight decay of 0.2 to update the whole CLIP model. The initial learning rate is set to 1e-6 with 20% warm-up iterations, and a cosine-decay learning rate scheduler is used to update the whole framework. The hyperparameter λ is set to 1.0 and 0.1 for Setting-1 and Setting-2, respectively.

### Comparison with State-of-the-art Methods

In this section, we compare our DKR with five continual learning methods that do not rely on class label information: EWC (Elastic Weight Consolidation) (Kirkpatrick et al. 2017), LwF (Learning without Forgetting) (Li and Hoiem 2017), AFC (Adaptive Feature Consolidation) (Kang, Park, and Han 2022), ERL (Exemplars Relation Loss) (Dong et al. 2021), and Mod-X (Maintain Off-diagonal information matriX) (Ni et al. 2023). In addition, we report three basic training strategies: Joint, Zero-Shot, and SFT. Joint means that all data are available at all times, which serves as an upper bound. Zero-Shot denotes directly testing the original pre-trained CLIP. SFT denotes sequentially training and testing the model without any anti-forgetting strategy.
#### Comparison on Different Datasets

Tab. 1 summarizes the results of our DKR on five different datasets. Compared with existing methods, DKR ranks first on all average metrics of I2T and T2I and achieves 43.4%, 66.0%, and 75.3% on R@1, R@5, and R@10, respectively. In particular, on the MS-COCO dataset, DKR outperforms Mod-X by 1.7%, 2.2%, and 1.8% on R@1, R@5, and R@10, which illustrates that our DKR can effectively alleviate the catastrophic forgetting problem. Note that existing methods trade expensive performance degradation on old datasets (e.g., MS-COCO and Flickr30K) for limited improvements on new datasets (e.g., EC and RSICD). Fig. 4 illustrates the R@1 accuracy on MS-COCO after each training step. It can be seen that our DKR achieves the best anti-forgetting performance, significantly outperforming existing methods. The above results illustrate the superiority of our DKR against state-of-the-art methods.

#### Comparison on the EC Dataset

As shown in Tab. 2, our DKR achieves the best retrieval performance on the EC dataset, i.e., 20.4% and 20.2% R@1 on the Image-Text Retrieval (I2T) and Text-Image Retrieval (T2I) tasks, respectively. Meanwhile, our DKR significantly surpasses Mod-X on the old datasets (Flickr30K and MS-COCO) on all criteria. The above results illustrate that DKR can effectively preserve the old knowledge for anti-forgetting. Note that although DKR is not directly trained on Flickr30K and MS-COCO, it can even achieve higher results than CLIP (Zero-Shot) on the Image-Text Retrieval task. This is because our DKR method can achieve positive transfer for both old and new knowledge by properly filtering and rectifying the incorrect knowledge.

Table 3: Ablation studies of each component of DKR under Setting-1, where average R@1 accuracy is reported.

| Base | RNK | RPK | RNK* | MS-COCO | Flickr30K | IAPR TC-12 | EC | RSICD | Average |
|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 49.0 | 77.9 | 51.9 | 15.8 | 11.1 | 41.2 |
| ✓ | ✓ | | | 50.7 | 78.7 | 53.4 | 17.3 | 12.7 | 42.6 |
| ✓ | | ✓ | | 49.4 | 78.1 | 52.8 | 16.7 | 12.7 | 41.9 |
| ✓ | | ✓ | ✓ | 51.1 | 79.0 | 53.8 | 17.2 | 12.2 | 42.9 |
| ✓ | ✓ | ✓ | | 51.4 | 79.3 | 53.9 | 18.0 | 14.1 | 43.3 |

Figure 4: The retrieval performance of different methods at each training stage under Setting-1 on MS-COCO, where Text-Image and Image-Text retrieval results are shown in (a) and (b), respectively.

Figure 5: The effect of the hyperparameter λ under Setting-1, where averaged R@1 and R@5 accuracy (%) are shown in (a) and (b), respectively.

### Ablation Study

In this section, we conduct ablation studies under Setting-1 to evaluate the effectiveness of each component of DKR and the effect of the hyperparameter.

#### Hyperparameter Study

We first evaluate the effect of the hyperparameter λ in Eq. 11 under Setting-1. The average R@1 and R@5 accuracy on all five datasets are shown in Fig. 5. As λ increases, the average R@1 and R@5 keep improving until λ reaches 1.0. This is because a small λ (< 1) cannot alleviate the catastrophic forgetting of old knowledge, while an excessive λ (> 1) limits the acquisition of new knowledge.
It can be seen that our DKR strikes a balance between these two regimes when λ = 1 in general scenarios.

#### Effectiveness of Rectification with New Knowledge

To evaluate the effectiveness of our Rectification with New Knowledge (RNK), we introduce two variants to carefully evaluate the impact of the new knowledge, i.e., RNK (the full version) and RNK* (RNK without Eq. 8). As shown in Tab. 3, compared to the Base, RNK significantly improves the average performance by 1.4%. When combined with RPK, our final model achieves a significant improvement of 2.1%. This shows that by rectifying the incorrect old knowledge with new knowledge, DKR significantly improves the anti-forgetting of old knowledge. In addition, compared to the other variant (RNK*), RNK gains an improvement on the new datasets (+0.8% on EC and +1.9% on RSICD). These results indicate that the enhancement of new knowledge (w.r.t. Eq. 8) improves the performance on new datasets while maintaining the anti-forgetting of old knowledge.

#### Effectiveness of Rectification with Paired Knowledge

We compare the proposed Rectification with Paired Knowledge (RPK) to the basic knowledge distillation strategy without any rectification (Base). As shown in Tab. 3, compared with the Base, RPK brings an improvement of 0.7% in average R@1 accuracy over all datasets. When combined with RNK, the performance on each dataset is further improved by 0.7% in average R@1 accuracy. This indicates that by rectifying with additional knowledge, RPK promotes the acquisition of both old and new knowledge and thus improves the overall performance, verifying its effectiveness. The above results verify that our DKR is essential for striking the balance between the anti-forgetting of old knowledge and the acquisition of new knowledge.

## Conclusion

In this paper, we proposed a novel framework for the Continual Vision-Language Retrieval (CVLR) task, called Dynamic Knowledge Rectification (DKR), which alleviates the catastrophic forgetting problem by dynamically filtering and rectifying incorrect old knowledge. Specifically, we designed a knowledge rectification method to achieve the anti-forgetting of the old knowledge and the acquisition of both old and new knowledge. Extensive experiments on five vision-language retrieval benchmark datasets demonstrate that our DKR achieves state-of-the-art performance.

## Acknowledgements

This work was supported by the National Natural Science Foundation of China (61925201, 62132001, 62376011).

## References

Chen, H.; Ding, G.; Liu, X.; Lin, Z.; Liu, J.; and Han, J. 2020a. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12655-12663.

Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; and Wang, C. 2021. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 15789-15798.

Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; and Liu, J. 2020b. UNITER: Universal image-text representation learning. In European Conference on Computer Vision, 104-120. Springer.

Chowdhury, J. R.; Zhuang, Y.; and Wang, S. 2022. Novelty controlled paraphrase generation with retrieval augmented conditional prompt tuning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 10535-10544.
De Lange, M.; Aljundi, R.; Masana, M.; Parisot, S.; Jia, X.; Leonardis, A.; Slabaugh, G.; and Tuytelaars, T. 2021. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7): 3366-3385.

Dong, S.; Hong, X.; Tao, X.; Chang, X.; Wei, X.; and Gong, Y. 2021. Few-shot class-incremental learning via relation knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 1255-1263.

Gao, P.; Geng, S.; Zhang, R.; Ma, T.; Fang, R.; Zhang, Y.; Li, H.; and Qiao, Y. 2023a. CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision, 1-15.

Gao, X.; He, Y.; Dong, S.; Cheng, J.; Wei, X.; and Gong, Y. 2023b. DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 24236-24245.

Grubinger, M.; Clough, P.; Müller, H.; and Deselaers, T. 2006. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, volume 2.

Gu, X.; Lin, T.-Y.; Kuo, W.; and Cui, Y. 2021. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. In International Conference on Learning Representations.

Kang, M.; Park, J.; and Han, B. 2022. Class-incremental learning by knowledge distillation with adaptive feature consolidation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16071-16080.

Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13): 3521-3526.

Li, J.; Li, D.; Savarese, S.; and Hoi, S. 2023. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv:2301.12597.

Li, K.; Zhang, Y.; Li, K.; Li, Y.; and Fu, Y. 2019. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4654-4662.

Li, W.; Gao, C.; Niu, G.; Xiao, X.; Liu, H.; Liu, J.; Wu, H.; and Wang, H. 2021. UNIMO: Towards Unified-Modal Understanding and Generation via Cross-Modal Contrastive Learning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2592-2607.

Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXX, 121-137. Springer.

Li, Z.; and Hoiem, D. 2017. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12): 2935-2947.

Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, 740-755. Springer.

Liu, X.; Yi, J.; Cheung, Y.-m.; Xu, X.; and Cui, Z. 2022. OMGH: Online manifold-guided hashing for flexible cross-modal retrieval. IEEE Transactions on Multimedia.

Lu, X.; Wang, B.; Zheng, X.; and Li, X. 2017.
Exploring models and data for remote sensing image caption generation. IEEE Transactions on Geoscience and Remote Sensing, 56(4): 2183-2195.

Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; and Li, T. 2022. CLIP4Clip: An empirical study of CLIP for end-to-end video clip retrieval and captioning. Neurocomputing, 508: 293-304.

Masana, M.; Liu, X.; Twardowski, B.; Menta, M.; Bagdanov, A. D.; and Van De Weijer, J. 2022. Class-incremental learning: survey and performance evaluation on image classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5): 5513-5533.

Ni, Z.; Wei, L.; Tang, S.; Zhuang, Y.; and Tian, Q. 2023. Continual Vision-Language Representation Learning with Off-Diagonal Information. arXiv:2305.07437.

OpenAI. 2021. CLIP. https://github.com/openai/CLIP. Accessed: 2023-08-16.

Ordonez, V.; Kulkarni, G.; and Berg, T. 2011. Im2Text: Describing images using 1 million captioned photographs. Advances in Neural Information Processing Systems, 24.

Qiu, B.; Li, H.; Wen, H.; Qiu, H.; Wang, L.; Meng, F.; Wu, Q.; and Pan, L. 2023. CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16016-16025.

Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 8748-8763. PMLR.

Schuhmann, C.; Kaczmarczyk, R.; Komatsuzaki, A.; Katta, A.; Vencu, R.; Beaumont, R.; Jitsev, J.; Coombes, T.; and Mullis, C. 2021. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. In NeurIPS Workshop Datacentric AI, FZJ-2022-00923. Jülich Supercomputing Center.

Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2556-2565.

Srinivasan, T.; Chang, T.-Y.; Pinto Alva, L.; Chochlakis, G.; Rostami, M.; and Thomason, J. 2022. CLiMB: A continual learning benchmark for vision-and-language tasks. Advances in Neural Information Processing Systems, 35: 29440-29453.

Wang, H.; Zhang, Y.; Ji, Z.; Pang, Y.; and Ma, L. 2020. Consensus-aware visual-semantic embedding for image-text matching. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXIV, 18-34. Springer.

Wang, K.; Herranz, L.; and van de Weijer, J. 2021. Continual learning in cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3628-3638.

Wang, Z.; Zhang, Z.; Lee, C.-Y.; Zhang, H.; Sun, R.; Ren, X.; Su, G.; Perot, V.; Dy, J.; and Pfister, T. 2022. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 139-149.

Wu, W.; Luo, H.; Fang, B.; Wang, J.; and Ouyang, W. 2023. Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10704-10713.

Xu, M.; Zhang, Z.; Wei, F.; Lin, Y.; Cao, Y.; Hu, H.; and Bai, X. 2022. A simple baseline for open-vocabulary semantic segmentation with pre-trained vision-language model.
In European Conference on Computer Vision, 736-753. Springer.

Yang, A.; Lin, J.; Men, R.; Zhou, C.; Jiang, L.; Jia, X.; Wang, A.; Zhang, J.; Wang, J.; Li, Y.; et al. 2021. M6-T: Exploring sparse expert models and beyond. arXiv:2105.15082.

Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2: 67-78.

Zhang, R.; Zhang, W.; Fang, R.; Gao, P.; Li, K.; Dai, J.; Qiao, Y.; and Li, H. 2022. Tip-Adapter: Training-free adaption of CLIP for few-shot classification. In European Conference on Computer Vision, 493-510. Springer.