# prompt_gradient_projection_for_continual_learning__103f37a8.pdf Published as a conference paper at ICLR 2024 PROMPT GRADIENT PROJECTION FOR CONTINUAL LEARNING Jingyang Qiao1 , Zhizhong Zhang1 , Xin Tan1, Chengwei Chen2, Yanyun Qu3, Yong Peng4, Yuan Xie1( ) 1East China Normal University, 2The Navy Military Medical University, 3Xiamen University, 4Central South University 52275901010@stu.ecnu.edu.cn, {zzzhang,xtan}@cs.ecnu.edu.cn timchen91@aliyun.com, yyqu@xmu.edu.cn, yong peng@csu.edu.cn yxie@cs.ecnu.edu.cn Prompt-tuning has demonstrated impressive performance in continual learning by querying relevant prompts for each input instance, which can avoid the introduction of task identifier. Its forgetting is therefore reduced as this instance-wise query mechanism enables us to select and update only relevant prompts. In this paper, we further integrate prompt-tuning with gradient projection approach. Our observation is: prompt-tuning releases the necessity of task identifier for gradient projection method; and gradient projection provides theoretical guarantees against forgetting for prompt-tuning. This inspires a new prompt gradient projection approach (PGP) for continual learning. In PGP, we deduce that reaching the orthogonal condition for prompt gradient can effectively prevent forgetting via the selfattention mechanism in vision-transformer. The condition equations are then realized by conducting Singular Value Decomposition (SVD) on an element-wise sum space between input space and prompt space. We validate our method on diverse datasets and experiments demonstrate the efficiency of reducing forgetting both in class incremental, online class incremental, and task incremental settings. The code is available at https://github.com/Jingyang Qiao/prompt-gradient-projection. 1 INTRODUCTION Learning continually while not forgetting is a long-standing pursuit of machine learning systems (Kumaran et al., 2016; Mc Clelland et al., 1995; Arani et al., 2022). Incremental learning, or continual learning is such a fabulous way to train a model with continuously expanded datasets by adding novel classes or domains (Ring, 1997; Hadsell et al., 2020; De Lange et al., 2021). In general, continual learning includes two distinct settings, i.e., classand task-incremental learning (Van de Ven & Tolias, 2019), abbreviated as CIL and TIL respectively. The main difference is whether the task identifier, i.e., the samples belong to which training tasks, is given for inference. Recently, the appearance of the prompt-tuning paradigm provides a new sight for class-incremental learning (Wang et al., 2022a; Li et al., 2023). In this framework, a tiny set of trainable tokens, i.e., prompts, are combined with image features, and forwarded into a fixed Transformer architecture. As the instance-wise query mechanism can select relevant prompts according to the input sample, only these relevant parts are updated during training (Wang et al., 2022c). Since the idea of prompttuning is borrowed from the area of natural language processing (NLP) (Lester et al., 2021; Li & Liang, 2021), its deep mechanism against forgetting has not been revealed yet (Zhou et al., 2023). Fortunately, it is observed that learning would not forget if the gradient is updated in the orthogonal direction to the subspace spanned by the old inputs, i.e., gradient projection approaches (GP) (Saha et al., 2021). 
However, one obvious limitation is that GP is only applicable to task incremental learning, because the gradient constraints would greatly restrict the learning of new tasks compared with normal training (Zhao et al., 2023). Thus, it needs the task identifier to instruct the update.

Figure 1: Radar charts comparing average accuracy (ACC) and forgetting (FOR) between the baselines and our methods on 10/20-Split-CIFAR100 and TinyImageNet: (a) L2P vs. L2P-PGP; (b) DualPrompt vs. DualPrompt-PGP. L2P (Wang et al., 2022c) and DualPrompt (Wang et al., 2022b) are two state-of-the-art prompt-tuning approaches for continual learning. ACC refers to the average accuracy metric (higher is better). FOR refers to the forgetting metric (lower is better). Different scale standards are adopted for the two metrics on the benchmark datasets.

Based on this observation, we propose to combine prompt-tuning and gradient projection for further anti-forgetting. This combination enjoys two benefits: i) prompt-tuning with the instance-wise query mechanism removes the need for a task identifier in gradient projection; ii) gradient projection provides theoretical guarantees against forgetting for prompt-tuning. In this paper, we propose a novel prompt gradient projection (PGP) method for continual learning. We recall the pipeline of prompt-based continual learning (prompt-tuning) and deduce the orthogonality condition of anti-forgetting for the prompt gradient via the self-attention mechanism in the vision transformer. We solve the condition equations by conducting Singular Value Decomposition (SVD) on an element-wise sum space between the input space and the prompt space. This allows us to obtain the gradient projection matrix in an efficient way. We validate our approach on four benchmark datasets: CIFAR-100, ImageNet-R, TinyImageNet, and CUB200, with three baselines, L2P (Wang et al., 2022c), DualPrompt (Wang et al., 2022b), and CLIP (Radford et al., 2021), where an extraordinary anti-forgetting property is observed, as shown in Figure 1. We are the first to explicitly provide an anti-forgetting mechanism for prompt-based continual learning, and we hope our study will inspire follow-up works. Our contributions are:

(1) Prompt gradient projection is the first work to study the anti-forgetting mechanism of prompt-tuning. Our approach obtains the orthogonality condition of anti-forgetting for the prompt gradient, and hence the retention of old knowledge has a rigorous theoretical guarantee.

(2) We provide a new viewpoint on stability and plasticity by investigating the selection of the prompt gradient projection matrix. It appears that the essence of gradient projection is actually a trade-off, where the optimal solution is updating the prompt in the orthogonal space of previous tasks.

(3) We apply our approach to both prompt-tuning and prefix-tuning paradigms and show its effectiveness. Our approach achieves state-of-the-art results in terms of the forgetting metric and the average accuracy metric under the settings of class incremental learning, online class incremental learning, and task incremental learning.
2 RELATED WORKS AND PRELIMINARIES Continual learning is defined as training deep neural networks (DNN) on time-variant data, i.e., a sequence of tasks, marked as D = {D1, ..., DT }, where t-th task Dt = {(Xt i, yt i)nt i=1} contains tuples of input sample Xt i Xt and corresponding label yt i Yt. When a task Xt arrives, a model fθ would be trained for the current task, while the data from previous tasks is unreachable. In this work, we mainly focus on class incremental learning, without knowing the task identifier during inference. Published as a conference paper at ICLR 2024 2.1 PROMPT-BASED CLASS INCREMENTAL LEARNING A simple yet effective prompt-based (prompt-tuning) CIL model: Learning to Prompt (L2P) (Wang et al., 2022c) is first proposed. In it, prompt p, a tiny set of trainable tokens, combined with image features, are sent into vision-transformer, instructing the model to resist forgetting. To pick appropriate prompts for task-specific training, L2P deployed a prompt pool P including plenty of prompt-key pairs, {pj, kj}M j=1, where kj represents the j-th key and M is the total number of prompt-key pairs. Based on L2P, Dual Prompt (Wang et al., 2022b) divided the prompts into two parts: expert prompt and general prompt for distinct features learning. Dual Prompt also replaced prompt-tuning with prefix-tuning, which has been successfully proven in the area of NLP. Dy Tox (Douillard et al., 2022) designed a novel task attention block, which utilized the task tokens to infer task identifier. Coda Prompt (Smith et al., 2023) replaced the prompt pool with a decomposed prompt that consists of a weighted sum of learnable prompt components, allowing itself optimized in an end-to-end fashion with high plasticity. LGCL (Khan et al., 2023) introduced the text information into the learning of prompt pool, improving performance without any additional learnable parameters. Although prompt-based CIL shows state-of-the-art performance, forgetting still exists compared with other incremental approaches (Saha et al., 2021). Since the problem of forgetting is not explicitly modeled in this framework, its mechanism against forgetting has not been revealed yet. 2.2 BACKGROUND OF GRADIENT PROJECTION METHOD Gradient limitation, i.e., restricting the gradient direction, originated from mathematical theory, provides an important explanation of the stability-plasticity dilemma (Kirkpatrick et al., 2017; Serra et al., 2018; Chaudhry et al., 2018; Farajtabar et al., 2020; Zeng et al., 2019; Wang et al., 2021). Recent studies found that learning would not forget if the gradient is updated in the orthogonal direction of the subspace spanned by the old features. Gradient projection method (GPM) (Saha et al., 2021) updated the weights in the direction orthogonal to the subspace spanned by all previously learned inputs. This ensured that new learning processes did not interfere with previously learned tasks. Trust Region Gradient Projection (TRGP) (Lin et al., 2022b) selected old tasks in the trust region to learn new tasks by a layer-wise scaling matrix, together with orthogonal gradient projection. Simple Linear Connector (Connector) (Lin et al., 2022a) merged two models by using a weighted sum function where one model is updated normally and another is updated with gradient projection. To further illustrate the anti-forgetting reason of gradient projection, we denote the inputs of task t for layer l as Sl t, the learned model for task t as {W l t}L l=1, and L is the total number of layers. 
In the subsequent sections, we omit the layer index for simplicity. Let ΔW_t denote the model change after learning task t+1. If the update direction is orthogonal to the old features, it follows that ΔW_t x_{t,i} = 0 for all x_{t,i} ∈ S_t, where the index (t, i) denotes the i-th input image of task t (Saha et al., 2021; Lin et al., 2022b). Therefore, as the model W_{t+1} is updated as W_{t+1} = W_t + ΔW_t, validating the performance of model W_{t+1} on task t, we have:

W_{t+1} x_{t,i} = (W_t + ΔW_t) x_{t,i} = W_t x_{t,i} + ΔW_t x_{t,i} = W_t x_{t,i}, (1)

which indicates that no interference is introduced to old tasks after learning a new concept, thereby addressing the forgetting issue. However, one limitation of gradient projection methods, which makes them fail in class-incremental inference, is that the projected gradient needs the task identifier to find the relevant parameters to update. In this paper, we will illustrate that prompt-tuning can break the constraint of needing a task identifier in gradient projection, and therefore the combination of prompts and gradient projection shows advanced properties in class incremental learning.

The flowchart of our method is shown in Figure 2. In Figure 2(a), prompts are chosen according to the similarity between the key vector¹ and the query feature. The picked prompts are then concatenated with the visual embedding sequences for prediction. During the backward propagation, we modify the prompt gradient to meet the orthogonality condition from gradient projection methods. Figure 2(b) shows the process of prompt gradient projection. We use the element-wise sum to obtain the so-called sum space. With SVD in this space, we obtain the gradient projection matrix and modify the gradient with this projection matrix to finish the PGP process.

¹In L2P, the key vector is initialized randomly in the form of a one-dimensional vector and trained to match the query feature of the corresponding task.

Figure 2: Flowchart of our work. (a) Process of forward/backward propagation (black/red lines). An instance-wise query mechanism is adopted during forward propagation; in the backward propagation, PGP is enabled and utilized to update the chosen prompts. (b) Process of prompt gradient projection. We sum the input space and the prompt space to obtain the sum space. Then, with SVD, we attain the new orthogonal vectors from the sum space and update the projection matrix, as described in Appendix C. Finally, we project the gradient by multiplying it with the projection matrix.

3.1 PROMPT GRADIENT PROJECTION

From the perspective that old inputs from previous tasks should have the same outputs after learning a new task, we have the following proposition:

Proposition 1. To better preserve old knowledge, the update of the network should satisfy the following equation:

f_θ(p_{t+1}, x_t) = f_θ(p_t, x_t), (2)

where x_t denotes the feature embeddings from old task t, and p_t and p_{t+1} denote the prompts trained at task t and t+1, respectively.

Proposition 1 depicts the mathematical, or ideal, condition of anti-forgetting.
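Before turning to the prompt-specific derivation, the orthogonality argument behind Eq. (1), and by extension Proposition 1, can be checked numerically. The following minimal NumPy sketch projects a weight update onto the orthogonal complement of the span of old inputs and verifies that old inputs are mapped identically before and after the update; variable names and sizes are illustrative choices of ours, not from the paper.

```python
# Minimal NumPy sketch of the anti-forgetting argument in Eq. (1):
# if the weight update is orthogonal to the span of old inputs,
# old inputs produce identical outputs before and after the update.
# Names (W, delta_W, X_old) and sizes are illustrative, not from the paper.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_old = 8, 16, 5

W = rng.standard_normal((d_out, d_in))        # weights after task t
X_old = rng.standard_normal((d_in, n_old))    # columns: old inputs x_{t,i}

# Orthonormal basis of the subspace spanned by the old inputs.
U, S, _ = np.linalg.svd(X_old, full_matrices=True)
rank = int(np.sum(S > 1e-10))
U1 = U[:, :rank]                              # span of X_old

# Arbitrary candidate update, projected so that delta_W_proj @ x_old = 0.
delta_W = rng.standard_normal((d_out, d_in))
delta_W_proj = delta_W - delta_W @ U1 @ U1.T

W_new = W + delta_W_proj
# Old inputs are mapped identically by W and W_new, as in Eq. (1).
assert np.allclose(W_new @ X_old, W @ X_old, atol=1e-8)
```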
In previous gradient projection works (Saha et al., 2021; Wang et al., 2021), it could be achieved by (i) limiting the gradient direction to minimize interference with old knowledge, or (ii) projecting the gradient onto the orthogonal space of old inputs. But as a byproduct, both require the task identifier, an additional prerequisite for inference.

In order to realize Proposition 1 without that limitation, we start from the implementation of prompt-based continual learning (PCL). In this framework, after the training of task t+1, we concatenate the prompts p_{t+1} and the embedding sequences x_t, i.e., inputs from the t-th task, along the embedding dimension: Z_t^{t+1} = [p_{t+1}; x_t]. With the weights W_q, W_k, W_v, PCL adopts the transformer architecture, which allows us to obtain the query (Q_t^{t+1} = W_q Z_t^{t+1}) and the key (K_t^{t+1} = W_k Z_t^{t+1}). Thus the attention matrix (Dosovitskiy et al., 2020) is calculated as:

A_t^{t+1} = softmax( Q_t^{t+1} (K_t^{t+1})^T / \sqrt{d} ). (3)

Here, the denominator denotes a normalization factor, and hence our focus turns to the numerator part Q_t^{t+1} (K_t^{t+1})^T. It can be further expanded as W_q Z_t^{t+1} (Z_t^{t+1})^T W_k^T. Notice that W_q and W_k, the weights of the visual encoder, are frozen and unchanged during training. The trainable part can be denoted as:

Z_t^{t+1} (Z_t^{t+1})^T = [p_{t+1}; x_t] [p_{t+1}^T, x_t^T] = [[p_{t+1} p_{t+1}^T, p_{t+1} x_t^T]; [x_t p_{t+1}^T, x_t x_t^T]]. (4)

By contrast, the old embedding Z_t^t is obtained by concatenating the prompts trained at task t with the embedding sequences x_t:

Z_t^t (Z_t^t)^T = [p_t; x_t] [p_t^T, x_t^T] = [[p_t p_t^T, p_t x_t^T]; [x_t p_t^T, x_t x_t^T]]. (5)

To achieve Eq. (2), i.e., the condition of anti-forgetting, the new prompts are required to satisfy:

p_{t+1} p_{t+1}^T = p_t p_t^T,  x_t p_{t+1}^T = x_t p_t^T,  p_{t+1} x_t^T = p_t x_t^T. (6)

In Eq. (6), we divide p_{t+1} into p_t and Δp, where Δp is the gradient of the prompts when training task t+1². Therefore, for the first term, we expand p_{t+1} p_{t+1}^T as:

p_{t+1} p_{t+1}^T = (p_t + Δp)(p_t + Δp)^T = p_t p_t^T + p_t Δp^T + Δp p_t^T + Δp Δp^T. (7)

Here we ignore the high-order infinitesimal term Δp Δp^T. Thus, if p_t Δp^T = 0, the condition p_{t+1} p_{t+1}^T = p_t p_t^T is realized. In the same way, the second condition can be transformed to:

x_t p_{t+1}^T = x_t (p_t^T + Δp^T) = x_t p_t^T + x_t Δp^T = x_t p_t^T. (8)

Eliminating x_t p_t^T on both sides, we have x_t Δp^T = 0. Note that this condition also satisfies the third term in Eq. (6) because x_t p_{t+1}^T is the transpose of p_{t+1} x_t^T. For prefix-tuning, the corresponding condition is deduced in Appendix E. Therefore, our key observation is reached: restricting the gradient of the prompts by the following equations realizes anti-forgetting:

x_t Δp^T = 0,  p_t Δp^T = 0. (9)

To solve these equations, we decompose x_t with SVD: x_t = U_t Σ_t V_t^T. Here, U_t and V_t contain the singular vectors corresponding to the singular values in Σ_t, and the diagonal matrix Σ_t can be further divided as:

Σ_t = [[Σ_{t,1}, O]; [O, Σ_{t,0}]], (10)

where Σ_{t,1} denotes the non-zero elements of Σ_t (non-zero singular values) and Σ_{t,0} denotes the near-zero elements of Σ_t (Deisenroth et al., 2020). Correspondingly, V_t can be divided into two parts along the column dimension: V_t = [V_{t,1}, V_{t,0}]. Thus, we have:

x_t [V_{t,1}, V_{t,0}] = U_t [[Σ_{t,1}, O]; [O, Σ_{t,0}]]. (11)

As a result, since Σ_{t,0} contains only near-zero singular values, we obtain:

x_t V_{t,0} = U_t [O; Σ_{t,0}] ≈ O. (12)

Let Δp' = Δp V_{t,0} V_{t,0}^T; we then get:

x_t Δp'^T = x_t (Δp V_{t,0} V_{t,0}^T)^T = x_t V_{t,0} V_{t,0}^T Δp^T = O. (13)

Eq. (13) allows us to successfully meet the first requirement in Eq. (9) by taking V_{t,0} as the gradient projection matrix. We reach a similar conclusion for the second requirement in Eq. (9).
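The derivation in Eqs. (10)-(13) translates directly into a short computation: take the right singular vectors of the old embeddings whose singular values are (near) zero as V_{t,0}, and project the prompt gradient onto them. Below is a hedged NumPy sketch of this step; the shapes and the tolerance used to detect near-zero singular values are our own illustrative choices, not the paper's.

```python
# Sketch of Eqs. (10)-(13): build the projection matrix V_{t,0} from the
# (near-)zero singular directions of the old embeddings and project the
# prompt gradient onto them so that x_t @ dp_proj.T = 0.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, prompt_len = 32, 64, 5

x_t = rng.standard_normal((n_tokens, d))   # old embedding sequence
dp = rng.standard_normal((prompt_len, d))  # raw prompt gradient at task t+1

# SVD of the old inputs; V_0 collects right singular vectors whose
# singular values are (near) zero, i.e. directions unused by x_t.
U, S, Vt = np.linalg.svd(x_t, full_matrices=True)
V = Vt.T
tol = 1e-10
s_full = np.concatenate([S, np.zeros(d - S.size)])  # pad to match V's columns
V_0 = V[:, s_full <= tol]

# Project the gradient: dp_proj = dp @ V_0 @ V_0.T, as in Eq. (13).
dp_proj = dp @ V_0 @ V_0.T

# Old inputs are untouched by the projected update.
assert np.allclose(x_t @ dp_proj.T, 0.0, atol=1e-8)
```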
In fact, to simplify the implementation of Eq. (9), we combine p_t and x_t with an element-wise sum:

s_t = x_t + p_t. (14)

Thus we conduct SVD on s_t, and the obtained projection matrix V_{t,0} realizes s_t Δp^T = 0, which is equivalent to x_t Δp^T = 0 and p_t Δp^T = 0. In the training stage, we update the gradient with the projection Δp' = Δp V_{t,0} V_{t,0}^T.

²Here we omit the learning-rate factor, since this simplification would not influence our conclusion.

3.2 GRADIENT PROJECTION FOR PROMPT POOL

For the prompt pool, there is another learnable parameter: the key. Prompt-based continual learning often deploys a query-key pair for seeking the matched prompts. In this case, the old knowledge would also be interfered with, as updates to the key would influence this matching process. Fortunately, the gradient projection method generalizes well to this case. First of all, let us recall the pipeline of PCL. To choose the relevant prompts, we first calculate the cosine similarity between the query feature and the key:

φ(q, k) = q^T k / (||q|| ||k||), (15)

where q and k represent the query feature and the key, respectively. To achieve anti-forgetting in the scenario of a prompt pool, we have the following proposition:

Proposition 2. Old knowledge can be preserved if the following equation holds:

q_t^T k_{t+1} = q_t^T k_t. (16)

For further illustration, we expand k_{t+1} with k_t and Δk, where Δk is the gradient change from task t to task t+1, and have:

q_t^T k_{t+1} = q_t^T (k_t + Δk) = q_t^T k_t + q_t^T Δk. (17)

The above formulation suggests q_t^T Δk = 0, which is similar to Eq. (9) but uses the transposed form. Therefore, in our implementation, when sampling the orthogonal space of query features, we first need to transpose the feature matrix, as in q_t^T = V_t Σ_t U_t^T.

3.3 BALANCE BETWEEN STABILITY AND PLASTICITY

We consider the singular value decomposition (SVD) of s_t as s_t = \hat U_t \hat Σ_t \hat V_t^T, where \hat V_t consists of the singular vectors decomposed from s_t. Here, we define a threshold ε ∈ [0, 1] to split \hat V_t into two parts, \hat V_t = [\hat V_t^1, \hat V_t^2], where \hat V_t^1 = \hat V_t[:, 1:εn], \hat V_t^2 = \hat V_t[:, εn:n], and n is the number of columns of \hat V_t. We project the gradient g_{t+1} as:

g'_{t+1} = g_{t+1} \hat V_t^2 (\hat V_t^2)^T. (18)

There are three situations for \hat V_t^2. Firstly, if \hat V_t^2 = 0, we have:

g'_{t+1} = g_{t+1} \hat V_t^2 (\hat V_t^2)^T = 0. (19)

In this situation, all trainable parameters are frozen and the old knowledge is preserved completely. Secondly, if \hat V_t^2 = V_{t,0}, we have³:

g'_{t+1} = g_{t+1} V_{t,0} V_{t,0}^T. (20)

In this situation, we update the prompts by projecting the gradient onto the orthogonal space of the old inputs. Hence, samples from old tasks have the same outputs under the new model. The old knowledge is preserved well, and the model can also learn new knowledge through updating. Thirdly, if \hat V_t^2 = \hat V_t, we have:

g'_{t+1} = g_{t+1} \hat V_t^2 (\hat V_t^2)^T = g_{t+1} \hat V_t \hat V_t^T = g_{t+1}. (21)

In this situation, the parameters are updated normally without projection. In our implementation, we arrange the columns of \hat V_t according to their corresponding singular values and use ε to control this balance. A more detailed discussion can be found in Appendix B.

³Here V_{t,0} is obtained by the decomposition of s_t mentioned above.

Table 1: Main results of class incremental learning in terms of accuracy and forgetting on 10-Split-CIFAR100, 20-Split-CIFAR100, and 10-Split-ImageNet-R. Exemplar means the total buffer size for rehearsal methods. For detailed metric information, please refer to Appendix F.
10-Split-CIFAR100 20-Split-CIFAR100 10-Split-Image Net-R Method Exemplar ACC( ) Forgetting( ) ACC( ) Forgetting( ) ACC( ) Forgetting( ) Bi C 5000 81.42 17.31 73.02 6.23 64.63 22.25 DER++ 5000 83.94 14.55 - - 66.73 20.67 ICa RL 5000 66.00 5.33 78.02 5.80 - - DER+MCG 2000 67.62 14.64 65.84 13.72 - - Bi C 1000 66.11 35.24 63.12 21.89 52.14 36.70 DER++ 1000 61.06 39.87 - - 55.47 34.64 ICa RL 1000 61.25 14.19 71.32 15.98 - - FT 33.61 86.87 33.52 53.69 28.87 63.80 EWC 47.01 33.27 36.73 35.19 35.00 56.16 LWF 60.69 27.77 39.12 57.91 38.54 52.37 L2P 83.77 6.63 81.29 8.96 60.44 9.00 L2P-PGP(Ours) 84.34 5.59 82.00 8.39 61.40 8.03 Dual Prompt 86.50 5.77 82.98 8.20 68.13 4.68 Dual Prompt-PGP(Ours) 86.92 5.35 83.74 7.91 69.34 4.53 Upper-Bound - 90.85 - 90.85 - 79.13 - 4 EXPERIMENTAL SETUP Datasets: We evaluate our method on 1) 10/20-Split-CIFAR100 (Krizhevsky et al., 2009), constructed by splitting the 100 classes into 10 tasks/20 tasks. 2) 10-Split-Tiny Image Net (Abai & Rajmalwar, 2019), constructed by splitting the 200 classes into 10 tasks. 3) 10-Split-Image Net-R (Hendrycks et al., 2021), constructed by splitting the 200 classes into 10 tasks. Implementation: We use L2P (Wang et al., 2022c), Dual Prompt (Wang et al., 2022b), and CLIP (Radford et al., 2021) as our baselines, with prompt gradient projection for updating. We follow their original settings, and the only difference is we train Dual Prompt with extra 15 epochs on CIFAR100 suggested by (Khan et al., 2023). Detailed experiment information could be seen in Appendix G. Competitors: We compare our results with representative SOTA CIL methods including ICa RL (Rebuffi et al., 2017), Bi C (Wu et al., 2019), DER++ (Buzzega et al., 2020), LWF (Li & Hoiem, 2017), EWC (Kirkpatrick et al., 2017), DER+MCG (Cai et al., 2023). We adopt average accuracy (simplified as accuracy/ACC) and forgetting (simplified as FOR) as our validation metrics (Wang et al., 2022b). Results and comparisons of task incremental learning can be found in Appendix J. 5 RESULTS AND DISCUSSION Class Incremental Setting4: We compare our method with state-of-the-art CIL approaches, and the main results are shown in Table 1. We observe that Dual Prompt with PGP obtains the best results and achieves a new SOTA. When comparing Dual Prompt with Dual Prompt-PGP, it appears that PGP brings a decent improvement in anti-forgetting. On 10-Split-CIFAR100, PGP improves Dual Prompt by 0.42% on forgetting and 0.42% on accuracy. Similarly, on 20-Split-CIFAR100, PGP improves Dual Prompt by 0.29% on forgetting and 0.76% on accuracy, and on 10-Split-Image Net-R, PGP improves Dual Prompt by 0.15% on forgetting and 1.21% on accuracy. For L2P, PGP also brings evident performance improvements. On 10-Split-CIFAR100, PGP obtains an improvement of 1.04% on forgetting and 0.43% on accuracy. On 10-Split-Image Net-R, our method also obtains an improvement of 0.97% on forgetting and 0.96% on accuracy. Analysis of Training Time and Memory Space: We present the comparison between L2P-PGP and L2P-R (L2P with rehearsal exemplar) in terms of training time and memory cost in Table 2. For a fair comparison, we maintain complete consistency in experimental settings such as batch size, training epoch, and prompt length et al. 4Experiment results of CLIP model please refer to Appendix K. 
Published as a conference paper at ICLR 2024 It is worth noting that our method doesn t require any exemplar for rehearsal and therefore not only uses less memory space, but avoids the privacy leaking problem (Shokri & Shmatikov, 2015) as well. At the same time, our approach has a lower forgetting and shorter training time. Table 2: Comparison of ACC, forgetting, memory, and training time between L2P-PGP and L2P-R. Method Exemplar ACC( ) Forgetting( ) Memory Training Time L2P-R 1000 84.21 7.72 1.12 GB 0.787h L2P-PGP 84.26 5.64 1 MB 0.756 h Online Class Incremental Setting: online class incremental learning is a challenging class incremental task that only allows training each task for one epoch. We compare PGP and its baseline on this setting shown in Table 3. Table 3: Main results of online class incremental learning in terms of accuracy and forgetting. The comparison is made between our approach and the corresponding baselines. 10-Split-CIFAR100 20-Split-CIFAR100 10-Split-Tiny Image Net Method ACC( ) Forgetting( ) ACC( ) Forgetting( ) ACC( ) Forgetting( ) L2P 79.99 8.19 77.63 11.33 78.69 5.83 L2P-PGP 80.29 7.73 78.34 9.33 79.47 5.19 Dual Prompt 80.93 5.51 79.02 6.89 82.20 3.62 Dual Prompt-PGP 81.02 5.41 79.41 6.75 82.57 3.57 2 4 6 8 10 Task Order Accuracy(%) Accuracy on 10-Split-Tiny Image Net L2P L2P-PGP 2 4 6 8 10 Task Order Forgetting(%) Forgetting on 10-Split-Tiny Image Net L2P L2P-PGP 2 4 6 8 10 Task Order Accuracy(%) Accuracy on 10-Split-Tiny Image Net Dual Prompt Dual Prompt-PGP 2 4 6 8 10 Task Order Forgetting(%) Forgetting on 10-Split-Tiny Image Net Dual Prompt Dual Prompt-PGP Figure 3: Task-by-task performance changing curves in terms of accuracy and forgetting under online class incremental setting. On 10/20-Split-CIFAR100 and 10-Split-Tiny Image Net, it is observed that PGP is able to improve accuracy and reduce forgetting for both L2P and Dual Prompt. On 10-Split-CIFAR100 and 20Split-CIFAR100, we discover that our method can improve L2P by 0.30% and 0.71% on accuracy respectively, while reducing 0.46% and 2.00% on forgetting. On 10-Split-Tiny Image Net, we also find that our method improves L2P by 0.78% on accuracy and 0.64% on forgetting. Similar to L2P, for Dual Prompt, we take 20-Split-CIFAR100 dataset as an example, PGP brings the method improvement of 0.39% on accuracy and 0.14% on forgetting. Figure 3 shows the curves of accuracy and forgetting with the task number increasing on 10-Split Tiny Image Net. We observe that on all tasks, accuracy of our method is always higher than baseline, and forgetting is always lower than baseline. These phenomena demonstrate that our method has advantages over baseline with fewer training epochs. 6 ABLATION STUDY Impact of Projection Settings: As shown in Table 4 and Figure 4, we evaluate the results by performing gradient projection on the gradient of prompt (l2p-p), key (l2p-k) and both of them (l2ppk), respectively. Original L2P is named l2p-o . Table 4 quantitatively indicates that l2p-pk has the best anti-forgetting performance since it has the strictest constraint. Concretely, compared with l2po, l2p-p, l2p-k and l2p-pk decreases the forgetting by 0.99%, 0.76%, and 1.04%, while increasing the accuracy by 0.49%, 0.30% and 0.57% on 10-Split-CIFAR100. On 10-Split-Tiny Image Net, l2pp, l2p-k, and l2p-pk decrease the forgetting by 0.42%, 0.34%, and 0.78%, while increasing the accuracy by 0.65%, 0.37%, and 0.73% in comparison with l2p-o. We have observed that l2p-p performs better than l2p-k in both terms of accuracy and forgetting. 
The reason might be that the prompt directly participates in the image encoding process. Figure 4 also shows the changes of the forgetting values as the task number increases, where the curves of l2p-p, l2p-k, and l2p-pk always lie below l2p-o.

Table 4: Ablation study of various gradient projection manners. l2p-o, l2p-p, l2p-k, and l2p-pk denote the original L2P and gradient projection on the prompt, the key, and both prompt and key, respectively.

Method   10-Split-CIFAR100: Forgetting(↓) / ACC(↑)   10-Split-TinyImageNet: Forgetting(↓) / ACC(↑)
l2p-o    6.63 / 83.77    5.68 / 81.92
l2p-p    5.64 / 84.26    5.26 / 82.57
l2p-k    5.87 / 84.07    5.34 / 82.29
l2p-pk   5.59 / 84.34    4.90 / 82.65

Figure 4: Task-by-task performance changing curves in terms of accuracy and forgetting under various gradient projection manners.

Impact of Distinct Thresholds: We study the hyperparameter sensitivity by setting ε to values in {0.60, 0.70, 0.80} and conducting experiments on the 10-Split-CIFAR100 and 10-Split-TinyImageNet datasets, as shown in Figure 5. In Figure 5, we can clearly find that as ε increases, forgetting clearly decreases, indicating that the ability of anti-forgetting (stability) becomes stronger, while the accuracy (plasticity) shows a declining trend. These two phenomena illustrate that if \hat V_t^1 has fewer columns (low ε), the model has better plasticity but worse stability, and if \hat V_t^2 has fewer columns (high ε), the model has better stability but worse plasticity. Similar conclusions have also been reported in GPM (Saha et al., 2021) and Adam-NSCL (Wang et al., 2021). Thus, gradient projection is also a trade-off between plasticity and stability when applied to prompt-based continual learning.

Figure 5: Performance histograms in terms of forgetting and new-task accuracy by varying ε.

7 CONCLUSION

In this paper, we propose prompt gradient projection, which deduces the gradient condition for prompts to reduce forgetting. The gradient projection matrix is then obtained by conducting SVD on a sum space. Finally, we discuss how to balance plasticity and stability from the perspective of gradient projection. We validate our approach on benchmark datasets under various incremental settings and demonstrate its effectiveness. This paper is an initial attempt at combining prompt-tuning and gradient projection. We hope our work will inspire further focus on the anti-forgetting mechanism of prompt-based continual learning and that it can be extended to more parameter-efficient paradigms, e.g., adapter-tuning and LoRA-tuning, and to large models.
Published as a conference paper at ICLR 2024 8 ACKNOWLEDGMENT This work is supported by the National Key Research and Development Program of China (2021ZD0111000), Science and Technology Commission (No.21511100700), National Natural Science Foundation of China (No.62222602, No.62106075, No.62176092, No.62302167, No.62176224, No.61972157 No.U23A20343, No.72192821), Natural Science Foundation of Shanghai (23ZR1420400), Shanghai Sailing Program (23YF1410500) and CAAI-Huawei Mind Spore Open Fund. Zoheb Abai and Nishad Rajmalwar. Densenet models for tiny imagenet classification. ar Xiv preprint ar Xiv:1904.10429, 2019. Elahe Arani, Fahad Sarfraz, and Bahram Zonooz. Learning fast, learning slow: A general continual learning method based on complementary learning system. ar Xiv preprint ar Xiv:2201.12604, 2022. Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark experience for general continual learning: a strong, simple baseline. Advances in neural information processing systems, 33:15920 15930, 2020. Tenghao Cai, Zhizhong Zhang, Xin Tan, Yanyun Qu, Guannan Jiang, Chengjie Wang, and Yuan Xie. Multi-centroid task descriptor for dynamic class incremental inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7298 7307, 2023. Mathilde Caron, Hugo Touvron, Ishan Misra, Herv e J egou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 9650 9660, 2021. Arslan Chaudhry, Marc Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with a-gem. ar Xiv preprint ar Xiv:1812.00420, 2018. Xiangning Chen, Cho-Jui Hsieh, and Boqing Gong. When vision transformers outperform resnets without pre-training or strong data augmentations. ar Xiv preprint ar Xiv:2106.01548, 2021. Matthias De Lange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Aleˇs Leonardis, Gregory Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE transactions on pattern analysis and machine intelligence, 44(7):3366 3385, 2021. Marc Peter Deisenroth, A Aldo Faisal, and Cheng Soon Ong. Mathematics for machine learning. Cambridge University Press, 2020. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ar Xiv preprint ar Xiv:2010.11929, 2020. Arthur Douillard, Alexandre Ram e, Guillaume Couairon, and Matthieu Cord. Dytox: Transformers for continual learning with dynamic token expansion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9285 9295, 2022. Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pp. 3762 3773. PMLR, 2020. Raia Hadsell, Dushyant Rao, Andrei A Rusu, and Razvan Pascanu. Embracing change: Continual learning in deep neural networks. Trends in cognitive sciences, 24(12):1028 1040, 2020. Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. 
In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8340 8349, 2021. Published as a conference paper at ICLR 2024 Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. ar Xiv preprint ar Xiv:2308.15827, 2023. James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521 3526, 2017. Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009. Dharshan Kumaran, Demis Hassabis, and James L Mc Clelland. What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in cognitive sciences, 20 (7):512 534, 2016. Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. ar Xiv preprint ar Xiv:2104.08691, 2021. Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. ar Xiv preprint ar Xiv:2101.00190, 2021. Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE transactions on pattern analysis and machine intelligence, 40(12):2935 2947, 2017. Zhuowei Li, Long Zhao, Zizhao Zhang, Han Zhang, Di Liu, Ting Liu, and Dimitris N Metaxas. Steering prototype with prompt-tuning for rehearsal-free continual learning. ar Xiv preprint ar Xiv:2303.09447, 2023. Guoliang Lin, Hanlu Chu, and Hanjiang Lai. Towards better plasticity-stability trade-off in incremental learning: A simple linear connector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 89 98, 2022a. Sen Lin, Li Yang, Deliang Fan, and Junshan Zhang. Trgp: Trust region gradient projection for continual learning. ar Xiv preprint ar Xiv:2202.02931, 2022b. James L Mc Clelland, Bruce L Mc Naughton, and Randall C O Reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748 8763. PMLR, 2021. Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 2001 2010, 2017. Mark B Ring. Child: A first step towards continual learning. Machine Learning, 28:77 104, 1997. Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. ar Xiv preprint ar Xiv:2103.09762, 2021. Joan Serra, Didac Suris, Marius Miron, and Alexandros Karatzoglou. Overcoming catastrophic forgetting with hard attention to the task. In International conference on machine learning, pp. 4548 4557. PMLR, 2018. Reza Shokri and Vitaly Shmatikov. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security, pp. 1310 1321, 2015. 
James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. Coda-prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11909 11919, 2023. Published as a conference paper at ICLR 2024 Gido M Van de Ven and Andreas S Tolias. Three scenarios for continual learning. ar Xiv preprint ar Xiv:1904.07734, 2019. Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011. Shipeng Wang, Xiaorong Li, Jian Sun, and Zongben Xu. Training networks in null space of feature covariance for continual learning. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp. 184 193, 2021. Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-prompts learning with pre-trained transformers: An occam s razor for domain incremental learning. Advances in Neural Information Processing Systems, 35:5682 5695, 2022a. Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pp. 631 648. Springer, 2022b. Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 139 149, 2022c. Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, and Yun Fu. Large scale incremental learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 374 382, 2019. Guanxiong Zeng, Yang Chen, Bo Cui, and Shan Yu. Continual learning of context-dependent processing in neural networks. Nature Machine Intelligence, 1(8):364 372, 2019. Zhen Zhao, Zhizhong Zhang, Xin Tan, Jun Liu, Yanyun Qu, Yuan Xie, and Lizhuang Ma. Rethinking gradient projection continual learning: Stability/plasticity feature space decoupling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3718 3727, 2023. Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Learning without forgetting for vision-language models. ar Xiv preprint ar Xiv:2305.19270, 2023. Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for visionlanguage models. International Journal of Computer Vision, 130(9):2337 2348, 2022. Published as a conference paper at ICLR 2024 B PROOF OF INSIGHT WITH BALANCING STABILITY AND PLASTICITY Lemma 1 Singular value decomposition (SVD): For matrix Am,n, we can factorize it into three matrices and obtain matrix Um,l, Σl,l, and Vl,n, where Um,l and Vl,n are orthogonal matrices, Σl,l contains the sorted singular value along its main diagonal: Am,n = Um,lΣl,l Vl,n T . (22) Theorem 1 For any embedding sequences xt, embedded from samples of task t, using singular value decomposition (SVD) in lemma 1, we can obtain matrix Ut, Σt, and Vt. Then, we randomly split Vt along the column dimension into two parts: [V 1 t , V 2 t ]. Because Vt is an orthogonal matrix, we can have: V l t V l t T = [V 1 t V 2 t ] V 1T t V 2T t = V 1 t V 1T t + V 2 t V 2T t = I. 
(23) On Plasticity: For any matrix V and gradient gt+1 when learning task t + 1, we have the gradient after projection on V as g t+1, and calculate the cosine similarity between gt+1 and g t+1: < g t+1, gt+1 > =< gt+1V V T , gt+1 > = vec(gt+1V V T I)T vec(gt+1) = vec(gt+1V )T (I)(I V )vec(gt+1) = vec(gt+1V )T vec(gt+1V ) 0, which means that whatever V is, as if gt+1 and V both are not equal to zero, after projection on V , the new gradient is always positive. Thus, the model is always learning new knowledge. The only distinction is the gradient direction, which can be measured by calculation of cosine similarity between gradient before projection and after projection. We set gradient project matrix as V 2 t and have: < g t+1, gt+1 > =< gt+1V 2 t V 2T t , gt+1 > =< gt+1(I V 1 t V 1T t ), gt+1 > =< gt+1 gt+1V 1 t V 1T t , gt+1 > =< gt+1, gt+1 > < gt+1V 1 t V 1T t , gt+1 > < gt+1, gt+1 > . From the inequality, if projection matrix V 2 t is not an identity matrix, which means that matrix V 1 t is not a zero matrix, the direction of gradient after projection always has an angle with the direction of the original gradient, incurring that decrease of loss function becomes slow. For further research on the relationship between matrix V 1 t and decreased speed of loss function, we mainly focus on two situations for value of V 1 t : (1).V 1 t = 0, we have < g t+1, gt+1 >=< gt+1, gt+1 > . (26) In this situation, it equals that we do not operate on the gradient, and parameters are updated normally. When V 1 t = 0, it has V 2 t = Vt. In fact, the same phenomenon has been shown in the previous situation that V 2 t = Vt. (2).V 1 t = Vt, we have: < g t+1, gt+1 > =< gt+1, gt+1 > < gt+1Vt V T t , gt+1 > =< gt+1, gt+1 > < gt+1, gt+1 > = O. Published as a conference paper at ICLR 2024 In this situation, it equals that we freeze the network update process, and trainable parameters are stable and not changed. Thus, the network will not adopt any other new knowledge. When V 1 t = Vt, it has V 2 t = 0. In fact, the same phenomenon has been shown in the previous situation that V 2 t = 0. In conclusion, with V 1 t changing from 0 to Vt, the decreased speed of the loss function becomes more and more slow, leading to worse plasticity. However, under this trend, V 2 t is changing from Vt to 0, giving the anti-forgetting more and more strength. We can recognize that the essence of the gradient projection method is a kind of trade-off strategy between plasticity and stability. However, different from other dilemmas, it has an optimal solution, which is projecting gradient in the direction orthogonal to the subspace spanned by the old inputs, which can not only own the best ability of anti-forgetting, but also have minimal damage to plasticity. C METHOD OF UPDATING PROJECTION MATRIX We update our projection matrix Vt,0 like GPM (Saha et al., 2021), which is detailed described as follows. Assume that we have sampled embedding sequences from current task samples xt and trained prompts pt. Here, t is the task identifier. We utilize Principal Component Analysis (PCA) to compress and align the dimensions of xt and pt. Then, we element-wise add xt and pt to obtain st. Besides that, we set a threshold ϵ. For task #1 training, we perform SVD on s1 as s1 = U1Σ1V1 T . We collect the minimum former l columns of V1 as matrix L = [v11, v12, ..., v1l] according to the following criteria: ||s1l||2 F ϵ||s1||2 F . (28) Here, ||.||F is the Frobenius norm of the matrix and ϵ (0 < ϵ 1) is the threshold hyperparameter. 
V1,0 can be obtained by V1,0V T 1,0 = I LLT . For task #2 training, before performing SVD and subsequent former-rank approximation, we eliminate the common directions in s2 which are already present in L, so that newly added column vectors are unique and orthogonal to the existing column vectors. Thus, we perform the step ˆs2 = s2 LLT s2. Afterward, SVD is performed on ˆs2(= ˆU2 ˆΣ2 ˆV T 2 ) and former m new columns of ˆV2 are chosen with minimum value of m satisfying the following criteria: ||LLT s2||2 F + ||ˆs2m||2 F ϵ||s2||2 F . (29) Here, L is updated by adding new column vectors as [v11, v12, ..., v1l, ˆv21, ˆv22, ..., ˆv2m]. Then, we can update V1,0 to V2,0 according to V2,0V T 2,0 = I LLT . Once the update is complete we move on to the next task and repeat the same procedure as in task #2. D COMPUTATION COST FOR MAINTAINING THE ORTHOGONALITY OF THE TASK SUBSPACES In this section, we will discuss the added computation cost for maintaining the orthogonality of the task subspaces under the following situations. D.1 LARGER MODELS If we change the backbone from a smaller one to a larger one, it could have different results of added computation cost for distinct tuning paradigms. i) For prompt-tuning, because we only prepend the prompt into the first transformer layer, the added computation could be omitted. ii) For prefixtuning, larger models usually mean more network layers or wider input dimensions, and we need to expand the prefix-inserted layer or prefix width, which is the origination of the added computation cost. For expanding the prefix-inserted layer, each layer can have a nearly similar computation cost if the number of samples is the same. Thus, we can conclude that the added computation cost can be modeled as an approximate linear function with the layer numbers of the backbone. Similarly, the same conclusion can also be drawn from expanding the prefix width. Published as a conference paper at ICLR 2024 D.2 INCREASED NUMBER OF TASKS Observing the training processes of multi-datasets, we can empirically summarize that in each task, the number of newly added column vectors of the projection matrix is constant in a certain range. As the added computation cost is mainly focused on i) calculation of the projection matrix and ii) multiplication between the projection matrix and its transpose, we can see that although it could not appear exponential explosion, it is still a potential risk in our method with the increased number of tasks. E GRADIENT PROJECTION BASED ON PREFIX-TUNING PARADIGM In this section, we prove that the gradient projection method can be utilized in prefix-tuning with mathematical deduction. Distinct from prompt-tuning paradigm, prefix-tuning only prepends prefixes in key vector and value vector, without query vector of prepended transformer layer. Additionally, different from prompt usually only prepended in the first transformer layer, prefix can be prepended in any transformer layers. These advantages help models based on prefix-tuning own a better performance than those based on prompt-tuning both in natural language processing and computer vision. For baseline based on prefix-tuning, if we want to preserve old knowledge, we need to realize: fθ(pt,l, xt,l) = fθ(pt+1,l, xt,l). (30) fθ refers to Vi T model, xt,l denotes inputs at task t in layer l, pt,l and pt+1,l represents the prefixes trained at task t and prepended in layer l and the prefixes trained at task t+1 and prepended in layer l respectively. 
Assuming that a set of prefixes have been trained at task t + 1, and we input samples from task t. Now, we prepend prefix in key vector, and have: Qt,l = Wq,lxt,l, (31) Kt,l = pt+1,l Wk,lxt,l where, Wq,l and Wk,l are weights of Vi T, frozen and unchanged. With Eq.(3), we have the results that t-th task samples on t + 1-th model. We mainly focus on the part: Qt,l KT t,l = Wq,lxt,l p T t+1,l (Wk,lxt,l)T = Wq,lxt,lp T t+1,l Wq,lxt,lx T t,l W T k,l . (33) As stable item Wq,lxt,lx T t,l W T k,l, we only focus on the item Wq,lxt,lp T t+1,l. Changing p T t+1,l with p T t,l, we can obtain the results that t-th task samples on t-th model. Because our aim is making Wq,lxt,lp T t+1,l equal to Wq,lxt,lp T t,l, considering that Wq,l is frozen, our final aim can be simplified as: xt,lp T t+1,l = xt,lp T t,l, (34) which has the same form as Eq.(8), meaning that we can also achieve Eq.(34) by the gradient projection method. Thus, we can draw the conclusion that the gradient projection method could also help models based on prefix-tuning to resist forgetting. Published as a conference paper at ICLR 2024 Two metrics: Average Accuracy (simplified as accuracy/ACC) and Forgetting (simplified as FOR) are used to evaluate the performance. We use average accuracy metric, for averaging the classification accuracy of all classes. We adopt forgetting metric to indicate the average loss of accuracy of past tasks after learning a new task. Formally, average accuracy and forgetting are defined as: Average Accuracy = 1 i=1 AT,i, (35) Forgetting = 1 T 1 i=1 AT,i max(Aj,i)j [i,T 1], (36) where T is the number of tasks, AT,i is the accuracy of i-th task samples on the T-th model, and Aj,i is the accuracy of i-th task samples on the j-th model. G EXPERIMENTAL DETAILS Consistent with previous works (Wang et al., 2022c;b; Smith et al., 2023), we use Vi T B/16 (Dosovitskiy et al., 2020) pre-trained on Image Net-21K as our image encoder, which is kept frozen during training. We train and test on one A6000-48GB GPU for baselines and our method. We set the Adam optimizer with β1 = 0.9 and β2 = 0.999. For hyperparameters, in L2P-PGP, we set ϵ = 0.50 for extraction of prompt gradient projection matrix and ϵ = 0.97 for key gradient projection matrix. While in Dual Prompt-PGP, we set ϵ = 0.50 for extraction of prompt gradient projection matrix. To accelerate the speed of gradient projection matrix extraction and reduce the training space, we add PCA into our process, which can be used to compress the sampled feature space. In comparison with L2P and L2P-PGP, for 10/20-Split-CIFAR100, and 10-Split-Tiny Image Net, we both train the network for 5 epochs with batch size of 16 and prompt length is set at 5, while we both set epochs as 50, batch size as 16, and prompt length as 30 for 10-Split-Image Net-R. In comparison with Dual Prompt and Dual Prompt-PGP, for 10/20-Split-CIFAR100, we train the network for 20 epochs with batch size of 24, and expert prompt length is set at 5. While we both set epochs as 5, batch size as 24, and expert prompt length as 5 for 10-Split-Tiny Image Net, epochs as 50 and batch size as 24 for 10-Split-Image Net-R with expert prompt length at 20. Besides that, in all benchmark datasets, the general prompt length is set at 5 and the prompt-inserted locations are kept the same. For CLIP-PGP, the experimental setting is that, on the vision side, we only set a single trainable image prompt shared by all tasks. 
As for the text side, we follow the operation as (Zhou et al., 2022), we set trainable text prompt for each class, which is only trained at the corresponding task. In comparison with CLIP and CLIP-PGP, we both set the image prompt length as 5, epochs as 5, and batch size as 32 for 10-Split-CIFAR100. Specifically in CLIP-PGP, we set ϵ = 0.90 for extraction of image prompt gradient projection matrix. H RESULT TABLE WITH THE STANDARD DEVIATION VALUES We conduct 3 runs of our method and competitors, additional results with the standard deviation values on different datasets are shown in Table 5 I COMPARISON WITH BASELINES AND UPPER-BOUND We compare the performance of prompt-based methods with and without PGP in Table 6. To be consistent with previous works (Wang et al., 2022c), we report the difference between accuracy performance of the Upper-Bound and the model as a metric. We observe that PGP again sets a new SOTA in this setting. As we compare the Diff performance of Dual Prompt and L2P with and without PGP, we again notice an obvious improvement. Published as a conference paper at ICLR 2024 Table 5: Class incremental learning on different datasets along with the standard deviation values. 10-Split-CIFAR100 20-Split-CIFAR100 10-Split-Image Net-R Method Exemplar ACC( ) Forgetting( ) ACC( ) Forgetting( ) ACC( ) Forgetting( ) Bi C 5000 81.42 0.85 17.31 1.02 73.02 0.93 6.23 1.17 64.63 1.27 22.25 1.73 DER++ 5000 83.94 0.34 14.55 0.73 - - 66.73 0.87 20.67 1.24 ICa RL 5000 66.00 0.66 5.33 0.94 78.02 0.71 5.80 1.02 - - DER+MCG 2000 67.62 0.04 14.64 0.53 65.84 0.18 13.72 1.28 - - Bi C 1000 66.11 1.76 35.24 1.64 63.12 2.35 21.89 1.93 52.14 1.08 36.70 1.05 DER++ 1000 61.06 0.87 39.87 0.99 - - 55.47 1.31 34.64 1.50 ICa RL 1000 61.25 0.63 14.19 1.14 71.32 0.86 15.98 1.35 - - FT 33.61 0.85 86.87 0.20 33.52 0.94 53.69 0.52 28.87 1.36 63.80 1.50 EWC 47.01 0.29 33.27 1.17 36.73 0.57 35.19 1.98 35.00 0.43 56.16 0.88 LWF 60.69 0.63 27.77 2.17 39.12 0.87 57.91 3.06 38.54 1.23 52.37 0.64 L2P 83.77 0.16 6.63 0.05 81.29 0.43 8.96 0.38 60.44 0.41 9.00 0.86 L2P-PGP(Ours) 84.34 0.08 5.59 0.05 82.00 0.56 8.39 0.62 61.40 0.34 8.03 0.03 Dual Prompt 86.50 0.45 5.77 0.02 82.98 0.47 8.20 0.08 68.13 0.10 4.68 0.19 Dual Prompt-PGP(Ours) 86.92 0.05 5.35 0.19 83.74 0.01 7.91 0.15 69.34 0.05 4.53 0.04 Upper-Bound - 90.85 0.12 - 90.85 0.12 - 79.13 0.18 - Table 6: Comparison with baselines in terms of differences between accuracy performance of the Upper-Bound and the model. The Upper-Bound denotes the model performance when trained with access to all tasks at the same time. we use Diff = Upper-Bound ACC - Method ACC. 10-Split-CIFAR100 20-Split-CIFAR100 10-Split-Image Net-R Method ACC( ) Diff( ) ACC( ) Diff( ) ACC( ) Diff( ) Upper-Bound 90.85 - 90.85 - 79.13 - L2P 83.77 7.08 81.29 9.56 60.44 18.69 L2P-PGP 84.34 6.51 82.00 8.85 61.40 17.73 Dual Prompt 86.50 4.35 82.98 7.87 68.13 11.00 Dual Prompt-PGP 86.92 3.93 83.74 7.11 69.34 9.79 J TASK INCREMENTAL SETTING We compare L2P-PGP with L2P and representative SOTA competitors: EWC (Kirkpatrick et al., 2017), LWF (Li & Hoiem, 2017), A-GEM (Chaudhry et al., 2018), OWM (Zeng et al., 2019), Adam-NSCL (Wang et al., 2021), Connector (Lin et al., 2022a), results as shown in Table 7. 
Both on 10-Split-CIFAR100 and 20-Split-CIFAR100 datasets, although L2P has already achieved higher accuracy and lower forgetting compared with other CNN methods, our method further improves its accuracy and reduces its forgetting with the aid of prompt gradient projection and L2PPGP achieves new SOTA performance. On 10-Split-CIFAR100 dataset, PGP improves L2P by 0.10 on accuracy, 0.05 on forgetting, and on 20-Split-CIFAR100, PGP improves L2P by 0.11 on accuracy, 0.11 on forgetting. Table 7: Task incremental learning results on different datasets. 10-Split-CIFAR100 20-Split-CIFAR100 Method ACC( ) Forgetting( ) ACC( ) Forgetting( ) EWC 70.77 2.83 71.66 3.72 LWF 70.70 6.27 74.38 9.11 A-GEM 49.57 1.13 61.91 6.88 OWM 68.89 1.88 68.47 3.37 Adam-NSCL 73.77 1.60 75.95 3.66 Connector 79.79 0.92 80.80 5.00 L2P 97.43 0.22 98.47 0.39 L2P-PGP 97.53 0.17 98.58 0.28 Published as a conference paper at ICLR 2024 K CONTINUAL LEARNING RESULTS ON MULTI-MODEL BACKBONE, COMPARISON BETWEEN CLIP-PGP WITH CLIP We conduct our experiments on 10-Split-CIFAR100 dataset under class incremental setting and task incremental setting respectively, as shown in Table 8. Results show that, our method has improved the performance a lot for both the above settings, proving that our method is also useful in the vision-language models, which further enlarges the scope of our method. Table 8: Comparison to CLIP model with/without gradient projection method on 10-Split CIFAR100 with class/task incremental settings. Settings Class Incremental Task Incremental Models Accuracy Forgetting Accuracy Forgetting CLIP 73.76 5.60 92.69 2.34 CLIP-PGP(Ours) 79.47(+5.71) 4.23(-1.37) 93.00(+0.31) 1.58(-0.76) L CLASS INCREMENTAL LEARNING RESULTS ON DIFFERENT BACKBONES, COMPARISON BETWEEN OURS WITH BASELINES To show the efficacy of proposed method on different pre-trained backbones, we evaluate our method by extending two distinct pre-trained models, namely Vi T-DINO and Vi T-SAM (Caron et al., 2021; Chen et al., 2021). The results are shown in the Table 9. Additionally, we tested our method on 10-Split-CIFAR100 and 5-Split-CUB200 dataset based on three pre-trained Vi Ts: Image Net-21K, DINO, and SAM, further validating the effectiveness of our method on non-Image Net datasets (Wah et al., 2011; Krizhevsky et al., 2009). Table 9: Comparison to distinct pre-trained backbones between baselines and ours. Red parts show significant improvements (>1). 10-Split-CIFAR100 5-Split-CUB200 Method Pretrained-Dataset ACC( ) Forgetting( ) ACC( ) Forgetting( ) L2P Image Net-21K 83.77 6.63 74.88 5.39 L2P-PGP Image Net-21K 84.34(+0.57) 5.59(-1.04) 75.15(+0.27) 4.51(-0.88) Dual Prompt Image Net-21K 86.50 5.77 82.02 4.23 Dual Prompt-PGP Image Net-21K 86.92(+0.42) 5.35(-0.42) 82.46(+0.44) 3.76(-0.47) L2P SAM 83.93 6.68 73.98 6.77 L2P-PGP SAM 84.26(+0.33) 5.64(-1.04) 76.45(+2.47) 5.91(-0.86) Dual Prompt SAM 86.11 6.08 82.02 4.73 Dual Prompt-PGP SAM 86.92(+0.81) 5.04(-1.04) 82.28(+0.26) 4.65(-0.08) L2P DINO 67.35 9.69 44.10 9.77 L2P-PGP DINO 70.60(+3.25) 4.73(-4.96) 44.80(+0.70) 6.06(-3.71) Dual Prompt DINO 64.18 23.87 50.88 10.10 Dual Prompt-PGP DINO 73.33(+9.15) 10.27(-13.60) 51.03(+0.15) 9.06(-1.04) M PGP WITH PROMPT NUMBER AND PROMPT WIDTH In this section, for L2P-PGP model, we set distinct parameters in prompt numbers and prompt widths on 10-Split-CIFAR100 dataset, and further validate the efficiency of prompt gradient projection method. Results are shown in Table 10. In our setting, we set a single prompt mode, that all tasks share a single prompt for training. 
In this way, we can more fully uncover the potential of our method and avoid interference caused by prompt selection. The results show that models with prompt gradient projection consistently achieve higher accuracy and lower forgetting than those without, which demonstrates that our method is effective across distinct prompt numbers and widths, even in the harder single-prompt setting.

Table 10: Comparison of L2P with L2P-PGP on the 10-Split-CIFAR100 dataset. Width and number denote the prompt width and prompt number respectively. Each cell gives ACC (↑) / FOR (↓).

Width | L2P | L2P-PGP
5 | 82.64 / 6.73 | 82.77 / 6.58
10 | 82.09 / 7.07 | 82.16 / 6.74
15 | 83.09 / 6.38 | 84.21 / 5.62
20 | 83.42 / 6.38 | 83.87 / 5.89
25 | 83.69 / 6.49 | 83.85 / 6.39
30 | 83.87 / 6.46 | 84.39 / 6.44

Number | L2P | L2P-PGP
1 | 82.64 / 6.73 | 82.77 / 6.58
3 | 84.17 / 5.92 | 84.19 / 5.60
5 | 83.23 / 6.66 | 83.82 / 6.62
7 | 83.87 / 7.13 | 84.44 / 6.58
9 | 84.11 / 6.60 | 84.15 / 6.52

N PGP WITH PREFIX WIDTH AND PREFIX PREPENDED LAYER

In this section, for the DualPrompt-PGP model, we examine whether prompt gradient projection remains effective under different prefix widths and prepended layers. As in the setting of Appendix M, we choose a single-prefix mode for the same reason. We conduct experiments on 10-Split-CIFAR100 and 10-Split-TinyImageNet; the final results are shown in Table 11 and Table 12. We also show, for several cases, how the accuracy and forgetting metrics change across tasks, as plotted in Figure 6 and Figure 7.

Table 11: Comparison of DualPrompt with DualPrompt-PGP on the 10-Split-CIFAR100 dataset. Width and layer denote the prefix width and the prefix prepended layer indices respectively. Each cell gives ACC (↑) / FOR (↓).

Width | DualPrompt | DualPrompt-PGP
5 | 81.08 / 7.64 | 81.49 / 7.08
6 | 81.32 / 7.12 | 81.70 / 6.89
7 | 81.67 / 7.51 | 81.95 / 6.77
8 | 81.67 / 7.48 | 81.92 / 7.06
9 | 81.74 / 6.49 | 81.88 / 6.21
10 | 81.58 / 6.93 | 81.63 / 6.78

Layer | DualPrompt | DualPrompt-PGP
0 | 81.08 / 7.64 | 81.49 / 7.08
0,1 | 82.22 / 5.78 | 82.75 / 5.67
0,1,2 | 83.85 / 5.62 | 84.69 / 4.38
0,1,2,3 | 84.55 / 5.03 | 84.58 / 4.84
0,1,2,3,4 | 84.59 / 5.60 | 84.74 / 5.04

Figure 6: Changing curves of the accuracy and forgetting metrics with different prepended layers and prefix widths on the 10-Split-CIFAR100 dataset. (Panels plot accuracy and forgetting against task order for prepended layers [0,1,2,3,4] and [0,1,2] and for prefix widths 7 and 9, comparing prefix with prefix-pgp.)

The results are similar to the discussion in Appendix M. Whether on 10-Split-CIFAR100 or 10-Split-TinyImageNet, models with prompt gradient projection always achieve better accuracy and lower forgetting than those without, which shows that our method is effective across distinct prefix widths and prepended layers. Note that in the figures we denote the baseline as prefix and our method as prefix-pgp.

Table 12: Comparison of DualPrompt with DualPrompt-PGP in different settings on the 10-Split-TinyImageNet dataset. Width and layer denote the prefix width and the prefix prepended layer indices respectively. Each cell gives ACC (↑) / FOR (↓).
Width | DualPrompt | DualPrompt-PGP
5 | 81.58 / 4.63 | 81.79 / 4.51
6 | 81.39 / 4.66 | 81.67 / 4.50
7 | 81.60 / 4.93 | 81.78 / 4.43
8 | 81.36 / 4.63 | 81.65 / 4.44
9 | 81.55 / 4.80 | 81.93 / 4.70
10 | 82.20 / 4.34 | 82.22 / 3.96

Layer | DualPrompt | DualPrompt-PGP
0 | 81.58 / 4.63 | 81.79 / 4.51
0,1 | 82.98 / 4.29 | 83.33 / 3.98
0,1,2 | 83.66 / 4.11 | 83.76 / 3.96
0,1,2,3 | 83.64 / 4.62 | 84.51 / 3.72
0,1,2,3,4 | 83.61 / 4.68 | 83.95 / 4.23

Figure 7: Changing curves of the accuracy and forgetting metrics with different prepended layers and prefix widths on the 10-Split-TinyImageNet dataset. (Panels plot accuracy and forgetting against task order for prepended layers [0,1,2,3,4] and [0,1,2] and for prefix widths 7 and 10, comparing prefix with prefix-pgp.)

O T-SNE VISUALIZATION

To better visualize the improvement brought by our method, we choose the L2P and L2P-PGP models. After training on the 10-Split-CIFAR100 dataset, we show the t-SNE results of samples from task 1 across the models obtained after different tasks. We visualize the logits produced by the classifier.

Figure 8: t-SNE results of L2P and L2P-PGP on the 10-Split-CIFAR100 dataset (panels correspond to the models after the 2nd, 4th, 6th, 8th, and 10th tasks). The left column represents L2P and the right column represents L2P-PGP. The red circles mark the drawback existing in L2P, and the blue circles show the improvement of our method.

P ALGORITHM

Algorithm 1: Prompt Gradient Projection for L2P (Training phase)
Input: pre-trained ViT model f_θ, embedding layer ϕ_θ, classifier head f_c, number of tasks T, training set {{X_i^t, y_i^t}_{i=1}^{n_t}}_{t=1}^{T}, sampling set {{X_si^t, y_si^t}_{i=1}^{n_st}}_{t=1}^{T}, prompt pool {p_j}_{j=1}^{M}, projection matrix V_{t,0}, number of training epochs E, learning rate η, loss function L_x
Output: prompt pool {p_j}_{j=1}^{M}, classifier head f_c
Initialize: f_c, {p_j}_{j=1}^{M}.
for t = 1, ..., T do
    for e = 1, ..., E do
        Draw a mini-batch B = {(X_i^t, y_i^t)}_{i=1}^{n_t}.
        for (X, y) in B do
            Embed X into the sequence x_t by x_t = ϕ_θ(X).
            Select the prompt p_x from {p_j}_{j=1}^{M}.
            Prepend x_t with p_x by x_p = [p_x; x_t].
            Obtain the prediction by ŷ = f_c(f_θ(x_p)).
        end
        Calculate the per-batch loss L_B by accumulating L_x(y, ŷ).
        # Gradient projection
        if t = 1 then
            Update p by p ← p − η ∇_p L_B.
        else
            Update p by p ← p − η ∇_p L_B V_{t,0} V_{t,0}^T.
        end
    end
    # Gradient projection matrix update
    Initialize the sets of sampled embedding sequences and prompts: X_t = {}, P_t = {}.
    for (X_si^t, y_si^t) in {(X_si^t, y_si^t)}_{i=1}^{n_st} do
        Build the set of embedding sequences X_t by concatenating X_t with ϕ_θ(X_si^t).
    end
    for p in {p_j}_{j=1}^{M} with p ∈ p_x do
        Build the set of prompts P_t by concatenating P_t with p.
    end
    Update V_{t,0} from X_t and P_t according to Appendix C.
end
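For illustration, the two computational steps of Algorithm 1, building the projection matrix V_{t,0} and applying the projected update p ← p − η ∇_p L_B V_{t,0} V_{t,0}^T, might look as follows in PyTorch. This is a minimal sketch of our own: we assume, in the spirit of GPM-style methods (Saha et al., 2021), that ϵ (0.90 for CLIP-PGP) is an energy threshold deciding how many leading singular directions are treated as the old-task subspace, and that V_{t,0} spans its orthogonal complement so that multiplying the gradient by V V^T removes the component along old-task directions; the exact construction is given in Appendix C of the paper and may differ in its details.

```python
import torch

def projection_basis(features: torch.Tensor, eps: float = 0.90) -> torch.Tensor:
    """Sketch of building V_{t,0} from sampled old-task vectors.

    features: (d, n) matrix whose columns stack the sampled embedding sequences X_t
              and prompts P_t (sum-space construction of Appendix C assumed here).
    eps:      fraction of spectral energy assigned to the old-task subspace.
    """
    u, s, _ = torch.linalg.svd(features, full_matrices=True)   # u: (d, d)
    energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    k = int((energy < eps).sum().item()) + 1                   # rank of the old-task space
    return u[:, k:]                                            # V spans its complement

def projected_update(prompt: torch.nn.Parameter, v: torch.Tensor,
                     lr: float, first_task: bool) -> None:
    """One SGD-style step mirroring the gradient-projection branch of Algorithm 1."""
    with torch.no_grad():
        grad = prompt.grad                       # (length, d), filled by backward() on L_B
        if not first_task:
            grad = grad @ v @ v.t()              # grad V V^T: drop old-task directions
        prompt -= lr * grad
```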
Algorithm 2: Prompt Gradient Projection for L2P (Testing phase)
Input: pre-trained ViT model f_θ, embedding layer ϕ_θ, classifier head f_c, number of tasks T, test set {{X_i^t}_{i=1}^{n_t}}_{t=1}^{T}, prompt pool {p_j}_{j=1}^{M}
Output: prediction ŷ
for t = 1, ..., T do
    for X_i^t in {X_i^t}_{i=1}^{n_t} do
        Embed X_i^t into the sequence x_t by x_t = ϕ_θ(X_i^t).
        Select the prompt p_x from {p_j}_{j=1}^{M}.
        Prepend x_t with p_x by x_p = [p_x; x_t].
        Obtain the prediction by ŷ = f_c(f_θ(x_p)).
    end
end
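As a companion to Algorithm 2, the following is a rough PyTorch-style sketch of the test-time forward pass. The prompt-selection rule shown here (cosine similarity between a query feature and learnable prompt keys, keeping the best match) is the usual L2P-style instance-wise query and is an assumption on our part, since Algorithm 2 only states that p_x is selected from the pool; all function and argument names are illustrative, and f_θ is assumed to accept an already-embedded token sequence.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(images, embed, vit, classifier, prompt_pool, prompt_keys):
    """images:      (B, 3, H, W) input batch
       embed:       patch-embedding layer phi_theta
       vit:         frozen pre-trained ViT f_theta (takes a token sequence here)
       classifier:  classification head f_c
       prompt_pool: (M, length, d) learned prompts {p_j}
       prompt_keys: (M, d) keys used for the instance-wise query (assumed)"""
    x_t = embed(images)                                   # (B, N, d) token sequence
    query = x_t.mean(dim=1)                               # simple query feature (assumption)
    sim = F.cosine_similarity(query.unsqueeze(1),         # (B, M) query-key similarities
                              prompt_keys.unsqueeze(0), dim=-1)
    p_x = prompt_pool[sim.argmax(dim=-1)]                 # best-matching prompt per instance
    x_p = torch.cat([p_x, x_t], dim=1)                    # prepend: x_p = [p_x; x_t]
    return classifier(vit(x_p))                           # y_hat = f_c(f_theta(x_p))
```

In L2P's actual query mechanism the query is typically the frozen backbone's class-token feature and the top-N best-matching prompts are selected; the sketch uses a mean-pooled query and a single prompt only for brevity.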